Datasets


Elucidation of chemical-protein interactions (CPIs) is the basis of target identification and drug discovery. Determining CPIs experimentally is time-consuming and costly, and computational methods can facilitate the process. In this study, two methods, multitarget quantitative structure-activity relationship (mt-QSAR) and computational chemogenomics, were developed for CPI prediction. Two comprehensive data sets were collected from the ChEMBL database for method assessment. One data set consisted of 81,689 CPI pairs among 50,924 compounds and 136 G-protein coupled receptors (GPCRs), while the other contained 43,965 CPI pairs among 23,376 compounds and 176 kinases (Tables S1-S3). The area under the receiver operating characteristic curve (AUC) for the test sets ranged from 0.95 to 1.0 for the 100 GPCR mt-QSAR models and from 0.82 to 1.0 for the 100 kinase mt-QSAR models. With the chemogenomic method, the AUC values of 5-fold cross-validation were about 0.92 for both the 176 kinases and the 136 GPCRs. However, the chemogenomic method performed worse than mt-QSAR on the external validation set (Table S6). Further analysis revealed that the chemogenomic method produced a high false positive rate on the external validation set. The methods and tools have potential applications in network pharmacology and drug repositioning.
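For readers unfamiliar with the two evaluation measures mentioned above, the following minimal sketch shows how an AUC and a false positive rate could be computed from predicted interaction scores. It is an illustration only, not the authors' evaluation code: the function names, the 0.5 threshold, and the toy labels and scores are all assumptions made for this example.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for a known interacting pair, 0 for a non-interacting pair.
    scores: predicted interaction scores (higher means more likely to interact).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_positive_rate(labels, scores, threshold=0.5):
    """Fraction of non-interacting pairs predicted as interacting."""
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not neg:
        raise ValueError("need at least one negative pair")
    return sum(s >= threshold for s in neg) / len(neg)


# Hypothetical toy data (not taken from Tables S1-S6).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

print("AUC:", auc(labels, scores))
print("FPR at 0.5:", false_positive_rate(labels, scores))
```

A model can reach a high AUC in cross-validation and still show a high false positive rate on new data at a fixed threshold, which is the kind of gap described above for the chemogenomic method on the external validation set.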

Data set download:

Table S1.

Table S2.

Table S3.

Table S6.

Detailed descriptions of the CPI-predictor data sets can be found in the following resources:

ChEMBL

DrugBank

KEGG

Other links

LMMD

admetSAR

Citing Reference

Feixiong Cheng, Yadi Zhou, Jie Li, Weihua Li, Guixia Liu and Yun Tang*. Prediction of Chemical-Protein Interactions: Multitarget-QSAR versus Computational Chemogenomic Methods. Mol. BioSyst., 2012, in press, DOI: 10.1039/C2MB25110H