Datawhale Notes | Molecular Prediction Large-Model Competition: Baseline Close Reading | Note One

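1. You need to know some basic Python libraries

Examples include os, sys, re, and numpy. I also came across a new library here, rdkit, which is used for handling chemical structures.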

~~This clashed with my final exams; I will finish these notes on the 5th.~~

Library imports

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import f1_score
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.feature_extraction.text import TfidfVectorizer
import tqdm, sys, os, gc, re, argparse, warnings
warnings.filterwarnings('ignore')

2. Data preprocessing

train = pd.read_excel('./dataset-new/traindata-new.xlsx')
test = pd.read_excel('./dataset-new/testdata-new.xlsx')

# The test data has no 'DC50 (nM)' or 'Dmax (%)' columns, so drop them from train
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)

# Collect the names of columns that have fewer than 10 non-null values in the test set
drop_cols = []
for f in test.columns:
    if test[f].notnull().sum() < 10:
        drop_cols.append(f)

# Drop those columns from both train and test so that columns that are
# mostly missing are not used in later analysis or modeling
train = train.drop(drop_cols, axis=1)
test = test.drop(drop_cols, axis=1)

# Concatenate the cleaned train and test sets into a single DataFrame, data,
# so that feature engineering is applied to both in the same way
data = pd.concat([train, test], axis=0, ignore_index=True)
cols = data.columns[2:]

Skipping the drop_cols step (the loop that collects test columns with fewer than 10 non-null values) and keeping those columns can improve accuracy, but it requires a more careful analysis of the data.
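To see what the column filter actually removes, here is a minimal sketch on a made-up DataFrame. The name toy_test and its columns are hypothetical and are not taken from the competition data; only the rule notnull().sum() < 10 comes from the baseline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy test set with 12 rows; all column names are made up.
toy_test = pd.DataFrame({
    "uuid": range(12),
    "Smiles": ["CCO"] * 12,
    "mostly_empty": [1.0, 2.0] + [np.nan] * 10,  # only 2 non-null values
    "well_filled": [0.5] * 11 + [np.nan],        # 11 non-null values
})

# Same rule as the baseline: flag columns with fewer than 10 non-null values.
non_null_counts = toy_test.notnull().sum()
drop_cols = non_null_counts[non_null_counts < 10].index.tolist()

print(non_null_counts.to_dict())
print(drop_cols)
```

Printing the non-null counts before dropping anything is one way to start the finer analysis mentioned above, since it shows how close each column is to the threshold.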

3. Feature engineering

# Parse each SMILES into a molecule object and write it back out as a SMILES string (kept in a one-element list)
data['smiles_list'] = data['Smiles'].apply(lambda x:[Chem.MolToSmiles(mol, isomericSmiles=True) for mol in [Chem.MolFromSmiles(x)]])
# Join each list into a single space-separated string
data['smiles_list'] = data['smiles_list'].map(lambda x: ' '.join(x))

# Compute TF-IDF features with TfidfVectorizer
tfidf = TfidfVectorizer(max_df = 0.9, min_df = 1, sublinear_tf = True)
res = tfidf.fit_transform(data['smiles_list'])

# Convert the result to a DataFrame
tfidf_df = pd.DataFrame(res.toarray())
tfidf_df.columns = [f'smiles_tfidf_{i}' for i in range(tfidf_df.shape[1])]

# Append the TF-IDF columns to data
data = pd.concat([data, tfidf_df], axis=1)

# Integer (label) encoding: map each distinct value to 0, 1, 2, ... in order of first appearance
def label_encode(series):
    unique = list(series.unique())
    return series.map(dict(zip(
        unique, range(series.nunique())
    )))

for col in cols:
    if data[col].dtype == 'object':
        data[col]  = label_encode(data[col])

# Rows with a Label are the training set; rows without one are the test set
train = data[data.Label.notnull()].reset_index(drop=True)
test = data[data.Label.isnull()].reset_index(drop=True)

# Feature selection
features = [f for f in train.columns if f not in ['uuid','Label','smiles_list']]

# Build the training and test feature matrices
x_train = train[features]
x_test = test[features]

# Training labels
y_train = train['Label'].astype(int)

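The code performs the following steps:

SMILES conversion: RDKit parses each SMILES string in the dataset into a molecule object and writes it back out as a SMILES string, stored in a one-element list. This looks redundant because the data already contains SMILES strings, but it ensures that every molecule is written in the same canonical form, with isomericSmiles=True keeping the stereochemistry information.
String handling: each list is joined into a single string with spaces between entries. Since every list holds exactly one SMILES, this simply unwraps the list.
TF-IDF computation: TfidfVectorizer builds a TF-IDF feature matrix from the processed SMILES strings.
Conversion to DataFrame: the TF-IDF matrix is converted to a DataFrame so it can be combined with the original dataset.
Integer encoding: the function label_encode converts categorical (object-typed) features to integer codes.
Feature and label preparation:
1. For every feature column in cols, integer encoding is applied if its dtype is object (which usually means strings).
2. The combined dataset is split back into a training set and a test set; the training rows have a Label and the test rows do not.
Feature and label selection: the feature columns (everything except uuid, Label, and smiles_list) are selected from the training and test sets, and the label column is taken from the training set.
Type conversion: the Label column is cast to integer for model training.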
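It can help to look at what the TF-IDF step sees when it reads a SMILES string. The sketch below is not the baseline code: it only mimics, with the standard re module, the default settings of scikit-learn's TfidfVectorizer (lowercasing, and a token pattern that keeps runs of two or more word characters). The helper name tokenize_like_tfidf is hypothetical, and the aspirin SMILES is used purely as an illustrative input.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer: runs of two or
# more word characters. The vectorizer also lowercases its input by default.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize_like_tfidf(smiles):
    """Approximate the tokens extracted with the vectorizer's default settings."""
    return TOKEN_PATTERN.findall(smiles.lower())

# Aspirin, used only as an example molecule.
tokens = tokenize_like_tfidf("CC(=O)Oc1ccccc1C(=O)O")
print(tokens)  # ['cc', 'oc1ccccc1c']
```

Under these defaults, bond and branch symbols and single-character atoms are discarded, and lowercasing merges aromatic c with aliphatic C. If that loses too much information, the vectorizer's analyzer, lowercase, and token_pattern parameters are the places to experiment, though whether that helps on this dataset would need to be checked.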

4. Model training and prediction

def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2022):

    kf = KFold(n_splits=5, shuffle=True, random_state=seed)

    # Out-of-fold predictions for the training rows and averaged predictions for the test rows
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])

    cv_scores = []
    # Example: the data is cut into 5 folds (1 2 3 4 5). Each round trains on
    # four folds and validates on the remaining one:
    #   train on 1 2 3 4, validate on 5
    #   train on 1 2 3 5, validate on 4
    #   ...
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} {}************************************'.format(str(i+1), str(seed)))
        # Use positional indexing (.iloc) for both features and labels
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]

        params = {'learning_rate': 0.1, 'depth': 6, 'l2_leaf_reg': 10, 'bootstrap_type':'Bernoulli','random_seed':seed,
                  'od_type': 'Iter', 'od_wait': 100, 'allow_writing_files': False, 'task_type':'CPU'}

        model = clf(iterations=20000, **params, eval_metric='AUC')
        model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                  metric_period=100,
                  cat_features=[],
                  use_best_model=True,
                  verbose=1)

        # Probability of the positive class
        val_pred  = model.predict_proba(val_x)[:,1]
        test_pred = model.predict_proba(test_x)[:,1]

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(f1_score(val_y, np.where(val_pred>0.5, 1, 0)))

        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")

pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': np.where(cat_test>0.5, 1, 0)
    }
).to_csv('submit.csv', index=False)

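The code defines a function, cv_model, that trains a classifier with cross-validation and predicts on the test set. The steps in detail:

Function definition: cv_model takes the classifier class clf (it is instantiated inside the function), the training features train_x and labels train_y, the test features test_x, the classifier name clf_name, and an optional random seed.
Cross-validation setup: KFold performs 5-fold cross-validation; shuffle=True shuffles the data before splitting it into folds.
Variable initialization: two zero-filled arrays, train and test, store the out-of-fold predictions for the training rows and the averaged predictions for the test rows.
Cross-validation loop:
1. For each fold, the training indices train_index and validation indices valid_index split the data into a training part and a validation part.
2. The current fold number and random seed are printed.
3. The CatBoost parameters params are set.
Model training: a CatBoost classifier is trained with iterations=20000 as the maximum number of iterations and eval_metric='AUC' as the evaluation metric. With od_type='Iter' and od_wait=100, training stops early once the validation metric has not improved for 100 iterations, and use_best_model=True keeps the best iteration.
Model evaluation:
1. The model predicts on the validation features val_x to get the probabilities val_pred, which are later scored against val_y.
2. The model predicts on test_x to get the test probabilities test_pred.
Saving results: the validation predictions are written into the train array, the test predictions are added to the test array (divided by the number of folds), and the F1 score of the current fold is computed with a 0.5 threshold.
Printing results: the F1 scores of all folds, their mean, and their standard deviation are printed.
Return value: the function returns the out-of-fold training predictions train and the test predictions test.
Applying the model:
1. cv_model is called to train the CatBoost classifier, and the returned test predictions cat_test are used to build the submission file.
2. The probabilities in cat_test are turned into binary labels: 1 if the probability is greater than 0.5, otherwise 0.
Generating the submission file: a DataFrame with uuid and the predicted Label is created and saved as a CSV file.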
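The bookkeeping behind the train and test arrays can be sketched without CatBoost at all. The following is a simplified illustration using only NumPy, not the baseline itself: the data is randomly generated, the fold split is done by hand instead of with KFold, and the "model" is a stand-in that just predicts the positive rate of its training folds. The names oof, test_pred, and filled are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_splits = 10, 4, 5

# Hypothetical binary labels for the training rows.
y_train = rng.integers(0, 2, size=n_train)

# Shuffle the row indices once, then cut them into 5 folds
# (a hand-rolled stand-in for KFold with shuffle=True).
folds = np.array_split(rng.permutation(n_train), n_splits)

oof = np.zeros(n_train)             # plays the role of `train` in cv_model
test_pred = np.zeros(n_test)        # plays the role of `test` in cv_model
filled = np.zeros(n_train, dtype=int)  # how often each row was predicted

for i, valid_index in enumerate(folds):
    train_index = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Stand-in "model": predict the positive rate of the training folds.
    p = y_train[train_index].mean()
    oof[valid_index] = p                         # out-of-fold prediction
    filled[valid_index] += 1
    test_pred += np.full(n_test, p) / n_splits   # average over the folds

# Same thresholding as the submission step.
labels = np.where(test_pred > 0.5, 1, 0)
print(filled, labels)
```

Each training row is predicted exactly once, by a model that never saw it, which is why the per-fold F1 scores give an honest estimate; the test prediction is the average of the five fold models.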

Official learning document: https://datawhaler.feishu.cn/wiki/YgNbwUJHKiMuCekhoZ9cEzgxnZc