9/6/2022 kaggle

# Kaggle-OpenProblem



# Dataset


# Dataset explanation

  • evaluation_ids.csv

  • metadata.csv

  • sample_submission.csv

  • test_cite_inputs.h5

  • test_multi_inputs.h5

  • train_cite_inputs.h5

  • train_cite_targets.h5

  • train_multi_inputs.h5

  • train_multi_targets.h5


This competition has two tasks, CITEseq and Multiome, which can be treated as two separate competitions.

The file prefixes cite and multi correspond to CITEseq and Multiome respectively.

For the Multiome samples: given chromatin accessibility, predict gene expression. DNA->RNA


For the CITEseq samples: given gene expression, predict protein levels. RNA->Protein

0.743 for CITE and 0.257 for MULTI

Each (cell_id, gene_id) pair in the rows addresses one cell of the 2-D prediction matrix.

# evaluation_ids.csv


# metadata.csv

Describes the measurement day, donor, cell type (not fully accurate), and assay technology.

MasP = Mast Cell Progenitor

MkP = Megakaryocyte Progenitor

NeuP = Neutrophil Progenitor

MoP = Monocyte Progenitor

EryP = Erythrocyte Progenitor

HSC = Hematopoietic Stem Cell

BP = B-Cell Progenitor


# sample_submission.csv

Corresponds one-to-one with evaluation_ids.csv.

evaluation_id sample_id

# train_cite_inputs.h5

In CITEseq each cell has 22050 features, most of which are zero. The table has no missing values.


# train_cite_targets.h5

70,988 cells × 140 columns

The 140 columns are the surface protein levels of the same cells, normalized with dsb.

With 140 targets, you can either train 140 machine learners (LGBM, XGB), or predict them all jointly with a per-cell MSE loss or Pearson loss.

The gene column names indicate relationships to proteins.


Important_cols is the set of all features whose name matches the name of a target protein. If a gene is named 'ENSG00000114013_CD86', it should be related to a protein named 'CD86'. These features will be used for the model unchanged, that is, they don't undergo dimensionality reduction.

All 140 columns are listed below:

CD86 CD274 CD270 CD155 CD112 CD47 CD48 CD40 CD154 CD52 CD3 CD8 CD56 CD19 CD33 CD11c HLA-A-B-C CD45RA CD123 CD7 CD105 CD49f CD194 CD4 CD44 CD14 CD16 CD25 CD45RO CD279 TIGIT Mouse-IgG1 Mouse-IgG2a Mouse-IgG2b Rat-IgG2b CD20 CD335 CD31 Podoplanin CD146 IgM CD5 CD195 CD32 CD196 CD185 CD103 CD69 CD62L CD161 CD152 CD223 KLRG1 CD27 CD107a CD95 CD134 HLA-DR CD1c CD11b CD64 CD141 CD1d CD314 CD35 CD57 CD272 CD278 CD58 CD39 CX3CR1 CD24 CD21 CD11a CD79b CD244 CD169 integrinB7 CD268 CD42b CD54 CD62P CD119 TCR Rat-IgG1 Rat-IgG2a CD192 CD122 FceRIa CD41 CD137 CD163 CD83 CD124 CD13 CD2 CD226 CD29 CD303 CD49b CD81 IgD CD18 CD28 CD38 CD127 CD45 CD22 CD71 CD26 CD115 CD63 CD304 CD36 CD172a CD72 CD158 CD93 CD49a CD49d CD73 CD9 TCRVa7.2 TCRVd2 LOX-1 CD158b CD158e1 CD142 CD319 CD352 CD94 CD162 CD85j CD23 CD328 HLA-E CD82 CD101 CD88 CD224

CD stands for Cluster of Differentiation.

Cluster map plot


# train_multi_inputs.h5

Each cell has about 220,000 features.


# train_multi_targets.h5

Each cell has 23,418 targets.


# Test sets

There are two test sets; apart from having no labels, they are identical in format to the train sets.

  • test_cite_inputs.h5
  • test_multi_inputs.h5

# Recap

In summary:

  • Multiome task: input dimension ~220,000; 23,418 output targets

  • CITEseq task: input dimension ~20,000; 140 output targets

The data volume is huge.

⭐ Both feature dimensionality reduction and data loading are challenging.


# Models & submission


# CV split


kf.split(X, groups=meta.donor)

Features are also related to the day: they shift in a consistent direction over time.

Note: all ensemble members must share the same CV split.
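A minimal sketch of a donor-grouped split with scikit-learn's GroupKFold (n_splits and the variable names are assumptions):

from sklearn.model_selection import GroupKFold

kf = GroupKFold(n_splits=3)
for train_idx, val_idx in kf.split(X, Y, groups=meta.donor):
    X_tr, X_val = X[train_idx], X[val_idx]   # held-out donor never seen in train
    Y_tr, Y_val = Y[train_idx], Y[val_idx]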


# Model training - PyTorch

import torch.nn as nn

config = dict(
    layers = [128, 128, 128],
    # ... (remaining config entries elided in the original)
)

class MLP(nn.Module):
    def __init__(self, layer_size_lst, add_final_activation=False):
        super().__init__()

        # layer_size_lst = [input_dim, *hidden_dims, output_dim]
        assert len(layer_size_lst) > 2

        layer_lst = []
        for i in range(len(layer_size_lst)-1):
            sz1 = layer_size_lst[i]
            sz2 = layer_size_lst[i+1]
            layer_lst += [nn.Linear(sz1, sz2)]
            # ReLU after every layer except the last, unless explicitly requested
            if i != len(layer_size_lst)-2 or add_final_activation:
                layer_lst += [nn.ReLU()]
        self.mlp = nn.Sequential(*layer_lst)

    def forward(self, x):
        return self.mlp(x)

On Kaggle (16 GB GPU, 13 GB RAM) you can feed in a 220k-dimensional sparse tensor directly: torch.Size([512, 228942]).

Drawback:
only max normalization is possible; subtracting the mean would destroy the sparsity of the CSR tensor, so the available operations are limited.
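A minimal sketch of sparsity-preserving max normalization followed by conversion to a torch sparse CSR tensor (the shape and density here are assumptions for illustration):

import numpy as np
import scipy.sparse as sp
import torch

# hypothetical CSR input of shape (cells, features)
X = sp.random(512, 228942, density=0.01, format="csr", dtype=np.float32)

# per-feature max normalization keeps zeros at zero, so sparsity survives
col_max = np.maximum(X.max(axis=0).toarray().ravel(), 1e-9)
X = X.multiply(1.0 / col_max).tocsr()

# wrap as a torch sparse CSR tensor
X_t = torch.sparse_csr_tensor(
    torch.from_numpy(X.indptr.astype(np.int64)),
    torch.from_numpy(X.indices.astype(np.int64)),
    torch.from_numpy(X.data),
    size=X.shape,
)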


# Model training - dimensionality reduction

  • Hand-crafted features + PCA/SVD reduction + tree models/MLP
graph LR
A[Hand-crafted features] --> B[PCA/SVD/UMAP]
B --> C[Tree model/ML model/MLP]
A --> D[TensorCSR]
D --> C
  • Other dimensionality-reduction methods (see the sketch after these lists)
list_fast_methods = ['PCA','umap','FA', 'NMF','RandProj','RandTrees'] # 'ICA',
list_slow_methods = ['t-SNE','LLE','Modified LLE','Isomap','MDS','SE','LatDirAll','LTSA','Hessian LLE']
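As a minimal example of the fast family, TruncatedSVD works directly on sparse input (n_components=64 and the name X are assumptions):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=64, random_state=1)
X_reduced = svd.fit_transform(X)  # accepts scipy sparse matrices directly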

# Loss

Two candidate losses: Pearson correlation and MSE.

import torch
import torch.nn as nn

class NegativeCorrLoss(nn.Module):
    # negative Pearson correlation, averaged over cells (rows);
    # the denominator implicitly assumes the targets are standardized per row
    def __init__(self):
        super().__init__()

    def forward(self, preds, targets):
        # center predictions per cell
        my = torch.mean(preds, dim=1)
        my = torch.tile(torch.unsqueeze(my, dim=1), (1, targets.shape[1]))
        ym = preds - my
        r_num = torch.sum(torch.multiply(targets, ym), dim=1)
        r_den = torch.sqrt(
            torch.sum(torch.square(ym), dim=1) * float(targets.shape[-1])
        )
        r = torch.mean(r_num / r_den)
        return -r

def criterion(outputs, labels):
    """MSE Loss function"""
    return nn.MSELoss()(outputs, labels)

Both are worth trying.


# Ensembling strategy

  • Statement 1. Correlation loss is insensitive to linear transformations of predictions

  • Statement 2. Per-cell_id standardization helps to rescale base submissions
    Under the assumption that two base submissions are similar and demonstrate similar performance, we can rescale them so that they become comparable, and weighting them in the regular way becomes adequate:

  • Statement 3. Weighting coefficients don't have to add up to 1!
    This is one of the benefits of a loss function that is agnostic to linear transformations. You don't have to weight base submissions with the usual constraint $\sum_i w_i = 1$. Any coefficients will do the job!

The author ran an experiment:

| before per-cell std | after per-cell std |
| --- | --- |
| 0.92417836 | 0.94238122 |
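A minimal sketch of Statement 2's per-cell standardization before blending (sub_a and sub_b are hypothetical base prediction matrices of shape (cells, targets)):

import numpy as np

def std_per_cell(x):
    # standardize each row (cell); the metric is invariant to this
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# per Statement 3, the weights need not sum to 1
blend = 1.0 * std_per_cell(sub_a) + 0.8 * std_per_cell(sub_b)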

# Competition characteristics

This competition only requires submitting a submission.csv, i.e., it is a pure tabular competition, so the field will be very large.

Training time is unlimited, so models can be ensembled without bound.


# Reference information

In my case my out-of-fold CV for cite is 0.8882 and for multi is 0.6601. For a baseline plus improvements, training to roughly these numbers means you are on the right track.


# 2021 winning solutions


# AE-JAE

Each modality is first transformed by SVD and the results are concatenated (denoted x). The main difference from a standard AE is that cell-annotation information (e.g., cell type labels, cell-cycle scores, and cell batch) is incorporated to constrain the structure of the latent features. We want some latent features (c) to predict the cell-type information and some to predict the cell-cycle scores. Notably, for the latent features (b) we want the batch-label prediction to be as random as possible, to potentially remove batch effects.

In the pre-training stage, JAE is trained on the exploration data for which cell annotations (cell type, cell-cycle phase scores) are available. In the test stage, where no cell-annotation information exists, we only minimize the autoencoder reconstruction loss, with a smaller learning rate (fine-tuning).


# AE-CLUE

It uses a variational autoencoder to project cells from different modalities into a unified low-dimensional embedding space where modality matching can be performed. In particular, we model each modality's data as generated from modality-specific subspaces of a complete cell embedding. Through a cross-encoder matrix, CLUE projects the cells of each modality into all modality-specific subspaces and then concatenates these subspaces to build a comprehensive embedding, allowing the model to capture both shared and modality-specific information.

# Novel team

Trained in the same way as the CLIP model. The encoders for all modalities are fully connected. The weights are the cosine similarities between sample embeddings.

# MY EDA


# Table column names

DNA->RNA(1)
RNA(2)->Protein
RNA(1) and RNA(2) have overlapping features, but their cells do not intersect.


https://www.proteinatlas.org/ENSG00000121410


# CITEseq training results

Citeseq best score: 0.893


# Multiome training results

multiome best score: 0.662


# Multiome column name order

graph LR
A(GL000194.1:114519-115365) --> B(GL000194.1)
A --> C(114519-115365)
B --> D("{name:GL000194.1,range:[114519-115365]}")
C --> D

The multiome column names are all in sorted order; can we convolve over them directly? (See the parsing sketch below.)

https://lanceotron.molbiol.ox.ac.uk/projects/peak_search_basic/6243
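A minimal sketch of the column-name parsing shown in the diagram above (the dict layout is an assumption):

def parse_peak(col):
    # "GL000194.1:114519-115365" -> name + genomic range
    name, rng = col.split(":")
    start, end = map(int, rng.split("-"))
    return {"name": name, "range": (start, end)}

print(parse_peak("GL000194.1:114519-115365"))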


v4-cnn:
graph BT
A("8 channel 1*3,MaxPooling1D(3),ReLU")
B("16 channel 1*3,MaxPooling1D(3),ReLU")
C("32 channel 1*3,MaxPooling1D(3),ReLU")
D("64 channel 1*3,MaxPooling1D(3),ReLU")
E("128 channel 1*3,MaxPooling1D(3),ReLU")
F("256 channel 1*3,MaxPooling1D(3),ReLU")
G("512 channel 1*3,MaxPooling1D(3),ReLU")
H("1024 channel 1*3,MaxPooling1D(3),ReLU")
I("2048 channel 1*3,MaxPooling1D(3),ReLU")
J("2048 channel 1*3,MaxPooling1D(3),ReLU")
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
H --> I
I --> J

# CNN Results

| Name | Description | Result |
| --- | --- | --- |
| v4-cnn | CNN, 8~2048 channels | 0.6421 |
| base | baseline, batch 512 | 0.6626 |
| v5-mlp | baseline, batch 16 | 0.666 |
| v6-cnn | 8 kernel channels per layer, 8 layers | 0.647 |

  • Problem

torch sparse tensors have no reshape method (a common workaround is sketched below)
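A workaround sketch (an assumption, not from the original notebook): densify only the current mini-batch, then reshape for nn.Conv1d:

x = x_sparse.to_dense()   # [batch, features]; only this mini-batch is densified
x = x.unsqueeze(1)        # [batch, 1, features], the layout nn.Conv1d expects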


| Name | Description | Result |
| --- | --- | --- |
| base | baseline, batch 512 | 0.6626 |
| v6-cnn | 8 kernel channels per layer, 8 layers | 0.647 |
| v7-cnn | 16~32 channels | 0.65 |
| v8-cnn | 128 channels | 0.6508 |
| v10-cnn | BN, residual connections, 48 channels | 0.6544 |

# Public notebooks


# MSCI CITEseq Keras Quickstart + Dropout - LB 0.810

  • Solution for citeseq

  • Dimensionality reduction: To reduce the size of the 10.6 GByte input data, we project the 22050 features to a space with only 64 dimensions by applying a truncated SVD. To these 64 dimensions, we add 144 features whose names show their importance.

This combines the strengths of PCA and hand-picked features.


  • Hyperparameter tuning with KerasTuner: We tune the hyperparameters with KerasTuner BayesianOptimization.
  • Cross-validation: Submitting unvalidated models and relying only on the public leaderboard is bad practice. The model in this notebook is fully cross-validated with a 3-fold GroupKFold.

For generalization, and to simulate the real private-set scenario.


  • Use Pearson loss directly
  • The model is a sequential dense network with four hidden layers.
  • Define two sets of features:

constant_cols is the set of all features which are constant in the train or test dataset.

important_cols is the set of all features whose name matches the name of a target protein. If a gene is named 'ENSG00000114013_CD86', it should be related to a protein named 'CD86'. They don't undergo dimensionality reduction.

Finally, we get 256 SVD features + 144 important features.


However, according to this discussion thread, the genes selected by important_cols actually have low correlation coefficients.


  • A pitfall for regression tasks: Dropout

[Figure: predictions with dropout vs. without dropout]

Some observations:

  • When dropout is used during training, its outputs are rescaled so that the mean is preserved after the dropout layer; the variance, however, is not preserved. The model therefore captures only the train-set statistics, so predictions on the validation set go wrong when dropout is used.

  • Dropout is only suitable for tasks where just the relative magnitude of the outputs matters, e.g., cat/dog classification logits; when a regression output represents an absolute value, inference performance suffers.
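A quick numerical check of the variance claim (a sketch; exact numbers vary by seed):

import torch
import torch.nn as nn

x = torch.randn(100_000)
drop = nn.Dropout(p=0.5)
drop.train()                              # training-mode dropout
y = drop(x)                               # kept units are scaled by 1/(1-p)
print(x.mean().item(), y.mean().item())   # both close to 0: mean preserved
print(x.var().item(), y.var().item())     # variance roughly doubles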


# [🥈LB_T15| MSCI Multiome] CatBoostRegressor - LB 0.810

  • Solution for multiome

  • CatBoostRegressor

  • 2 PCAs, 1 for the input, 1 for the target

  • Pros

Both the inputs and the outputs are PCA-reduced, which saves memory during training and reduces training difficulty.

  • Cons (see the sketch below)

The PCA inverse transform is lossy, and the results are hard to interpret.
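A minimal sketch of the two-PCA setup (component counts and CatBoost settings are assumptions):

from sklearn.decomposition import PCA
from catboost import CatBoostRegressor

pca_x = PCA(n_components=512).fit(X_train)   # input reduction
pca_y = PCA(n_components=128).fit(Y_train)   # target reduction

model = CatBoostRegressor(loss_function="MultiRMSE", verbose=0)
model.fit(pca_x.transform(X_train), pca_y.transform(Y_train))

# map predictions back to the original target space (lossy)
Y_pred = pca_y.inverse_transform(model.predict(pca_x.transform(X_test)))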


# 🔬[Multi:.67;CITE:.89] PyTorch Swiss Army Knife🔬 - LB 0.809

  • TruncatedSVD is used to project raw features to 512 dimensional space.

  • Raw data is loaded to memory as sparse matrices and is lazily uncompressed and concatenated with cell_id features in the MSCIDatasetSparse class.

  • Optuna Hyperparameter Optimization

  • Random kfold split

  • MLP Model


# MSCI Multiome Torch Quickstart Submission - LB 0.808

  • Solution for multiome/citeseq

  • Uses PyTorch sparse tensors

This greatly reduces memory pressure; no upfront PCA reduction is needed.


  • MLP
graph LR
A["Input(228942)"] --> B["Linear(128)"]
B --> C["ReLU"]
C --> D["Linear(128)"]
D --> E["ReLU"]
E --> F["Linear(128)"]
F --> G["ReLU"]
G --> H["Linear(23418)"]
  • The model is simple and effective

  • Drawback

The sparse tensor can only be 2-D, [batch, feature], which suits only an MLP. To use any other method, it must be converted to a dense tensor.


# Fork of [MSCI Multiome] RandomSampling | Sp 6b182b - LB 0.804

  • Solution for Multiome

  • Pearson loss

  • Random KFold

  • KernelRidge/Tabnet Regression

  • pca inverse transform


# MSCI CITEseq Quickstart - LB 0.803


  • Dimensionality reduction: PCA -> 512 features

  • Domain knowledge: The column names of the data reveal which features are most important.

  • We fit 140 LightGBM models to the data (because there are 140 targets).

140 learners were trained because a single model cannot fit all the targets; for the multiome task, however, training 20,000+ learners is infeasible (see the per-target sketch below).


  • 3-fold GroupKFold
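A minimal sketch of the one-model-per-target approach (the hyperparameters and the X_reduced / X_test_reduced names are assumptions):

import numpy as np
from lightgbm import LGBMRegressor

models = []
for i in range(Y.shape[1]):                # 140 protein targets
    m = LGBMRegressor(n_estimators=300, learning_rate=0.05)
    m.fit(X_reduced, Y[:, i])              # X_reduced: PCA features
    models.append(m)

preds = np.column_stack([m.predict(X_test_reduced) for m in models])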

# CITEseq - RNA to Protein Encoder-Decoder NN - LB 0.798

  • PCA dimensionality reduction

  • Encoder Decoder NN

  • AdamW optimizer with Cosine scheduler

Possible improvements:

Try an RNN or a CNN

Try an attention mechanism

Change the network structure, add new features


# Preprocessing: normalize Y to 1e6 (Multiome)

  1. calculate predictions Y

  2. calculate normalizer Z = sum(exp(Y))

  3. renorm: Y_i -> Y_i + (log((1e6+22050 )/Z))
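A direct transcription of these steps in numpy (a sketch; assumes Y holds the log-space predictions as a 2-D cells × targets array):

import numpy as np

Z = np.exp(Y).sum(axis=1, keepdims=True)   # step 2: per-cell normalizer
Y = Y + np.log((1e6 + 22050) / Z)          # step 3: renormalize in log space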


# Count of nonzero genes - decreases daily

As the day number increases, the number of genes with nonzero expression decreases.

A guess: cell proliferation slows down over time.

The correlation between the sum of gene expression and the number of nonzero genes is 0.996.


# Tips on Dimensionality Reduction


  • Handle Zeros

The dataset contains a huge number of zeros; some columns even consist entirely of zeros.

Here's a tip on how to remove them (a sparse-friendly variant follows the snippet):

all_zero_columns = (X == 0).all(axis=0)
X = X[:,~all_zero_columns]
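The snippet above assumes a dense array. For a scipy CSR matrix, a variant that avoids densifying (an assumption, not from the original) is:

nonzero_per_col = X.getnnz(axis=0)   # nonzero count per column, stays sparse
X = X[:, nonzero_per_col > 0]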

  • ICA

Independent component analysis (ICA) finds the vectors that are independent sub-components of the data. In other words, PCA helps compress the data, while ICA helps separate it.

Example code:

from sklearn.decomposition import FastICA
ica = FastICA(n_components=n)
X = ica.fit_transform(X)

  • PCA

Principal component analysis (PCA) is a linear dimensionality-reduction method that uses the singular value decomposition of the data to project it into a lower-dimensional space.

Example code:

from sklearn.decomposition import PCA
pca = PCA(n_components=n)
X = pca.fit_transform(X)



However, PCA destroys sparsity and does not accept sparse input (sklearn's TruncatedSVD does).


  • t-SNE

t-SNE is an unsupervised, nonlinear technique for dimensionality reduction and data-visualization analysis that can serve as an alternative to principal component analysis.

Example code:

from sklearn.manifold import TSNE
tsne = TSNE(n_components=n)
X = tsne.fit_transform(X)


  • Ivis

Ivis shows promise for biomolecular applications. It uses a Siamese neural network to create embeddings and reduce the number of dimensions. It may well prove useful in this challenge.

Example code:

from ivis import Ivis
model = Ivis(embedding_dims=dims, k=k, batch_size=bs, epochs=ep, n_trees=n_trees)
X = model.fit_transform(X)


# Single-cell data analysis packages

Several Python packages for single-cell data analysis:


# Pearson Correlations EDA


  • Quick view

the individual correlations between a single target and a single input are rather small

10 Multiome inputs constantly equal to zero; 560 targets constantly equal to zeros.

  • Examples (some important genes):

    • For example, chr1:47180897-47181792 seems to be an enhancer for 35% of targets, and an inhibitor for 11% of them.

    • On the contrary, chr2:28557270-28558187 seems to be an inhibitor for 30% of targets and an enhancer for 11% of them.


  • Sub groups
These approximate ratios of 30% / 10% of negative/positive correlations appear surprisingly often.

Possibly there are two subgroups of highly correlated targets representing about 30% and 10% of all targets and that have a very similar response to the same inputs.


# Association rules

For inputs -> inputs / targets -> targets rules,

there is coupling within each modality.

For inputs -> targets / targets -> inputs rules,

they can be extracted as features.

Steps (see the sketch after this list):

  1. df=(df-df.mean())/df.std()
  2. (df>13): select roughly a dozen extreme cells per column, to reduce the time complexity
  3. apriori(data, min_support=sup_thresh, min_confidence=conf_thresh)

where sup_thresh, conf_thresh = 0.0005, 0.5
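A minimal sketch of these steps, assuming the efficient_apriori package (its apriori(transactions, min_support=..., min_confidence=...) returns itemsets and rules):

from efficient_apriori import apriori

df = (df - df.mean()) / df.std()      # step 1: z-score each column
hits = df > 13                        # step 2: keep only extreme cells
transactions = [tuple(hits.columns[row]) for row in hits.values]
itemsets, rules = apriori(transactions, min_support=0.0005, min_confidence=0.5)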


# Summary

  • Features

  • Dimensionality reduction

PCA reduction on the inputs, PCA reduction on the outputs

  • Models

CatBoost, LGBM, TabNet, Ridge, MLP, Encoder-Decoder NN, SVR; plus tricks

  • CV

GroupKFold on donor

  • Tuning

KerasTuner, Optuna


# Ensembling tricks


# 60 kinds of feature engineering


# LR

  • When the training loss is above a threshold, do normal gradient descent; when it drops below the threshold, switch to gradient ascent instead, keeping the training loss near the threshold so that the model keeps doing a "random walk".

  • Restart the learning rate periodically, so that within a given time budget the training converges into several local minima, yielding many models to ensemble (see the scheduler sketch below).
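A minimal sketch of periodic restarts with PyTorch's CosineAnnealingWarmRestarts (T_0 and the helper names are assumptions):

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = CosineAnnealingWarmRestarts(opt, T_0=10)   # restart every 10 epochs

for epoch in range(50):
    train_one_epoch(model, opt)    # hypothetical training step
    sched.step()
    if (epoch + 1) % 10 == 0:
        save_snapshot(model)       # snapshot at each restart; ensemble later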

# Others

  • Ensembles of dozens or even hundreds of models

  • Seed tuning

  • Hyperparameter tuners

  • torch & tf & jax


# Feature Importance


# MLP-based feature importance


# Correlation-based feature importance


# Association-rule-based feature importance


Last updated: 3/7/2024, 1:47:24 AM