程序代写代做代考 Bioinformatics Excel database algorithm Using the Ranking-Based KNN Approach for Drug Repositioning Based on Multiple Information

Using the Ranking-Based KNN Approach for Drug Repositioning Based on Multiple Information
Xin Tian1, Mingyuan Xin1, Jian Luo1, and Zhenran Jiang2(&)
1 Shanghai Key Laboratory of Regulatory Biology,
Institute of Biomedical Sciences and School of Life Sciences,
East China Normal University, Shanghai 200241, China
2 Shanghai Key Laboratory of Multidimensional Information Processing, Department of Computer Science and Technology,
East China Normal University, Shanghai 200241, China zrjiang@cs.ecnu.edu.cn
Abstract. Using effective computer methods to infer potential drug-disease relationships can provide clues for the discovery new uses of old drugs. This paper introduced a Ranking-based k-Nearest Neighbor (Re-KNN) method to drug repositioning for cardiovascular diseases. The main characteristic of the Re-KNN lies in combining conventional KNN algorithm with Ranking SVM (Support Vector Machine) algorithm to get neighbors that are more trustable. By integrating the chemical structural similarity, target-based similarity, side-effect similarity and topological similarity information, Re-KNN method can obtain an improved AUC (Area under ROC Curve) and AUPR (Area under Precision-Recall curve) compared with other methods, which prove the validity and efficiency of multiple features integration.
Keywords: Cardiovascular disease 􏶒 Drug reposition 􏶒 Ranking-based KNN 􏶒 Ranking SVM
1 Introduction
Drug discovery is an extremely time-consuming and high-risk procedure. It has been reported that developing a new drug often takes about 15 years and $800 million to $1 billion for pharmaceutical companies [1]. However, FDA approves only a relatively small number of new drugs each year. Therefore, drug repositioning (New uses of old drugs) is widely regarded as an effective solution to solve the dilemma [2].
The existing methods for drug repositioning can be categorized into drug-based method and disease-based method. (1) The drug-based method uses the drug infor- mation about the chemical or pharmaceutical perspective to predict the new function of ‘old’ drugs. The drug information contain chemical similarity, target-based similarity or side effect similarity and so on [3, 4]. These methods are usually applied when the drug profiling are more comprehensive and important for drug repositioning. (2) The disease-based method concerns more about the disease phenotype or pathology. This method can be applied to solve the problem of missing drug knowledge or to focus on a
© Springer International Publishing Switzerland 2016
D.-S. Huang et al. (Eds.): ICIC 2016, Part I, LNCS 9771, pp. 317–327, 2016. DOI: 10.1007/978-3-319-42291-6_31

318 X. Tian et al.
specific disease [5, 6]. These methods provide a variety of similarity calculation methods for reference. However, these methods cannot be used for samples with missing biological information. Recently, network-based methods combined with more biological knowledge get more applications for drug repositioning [7, 8].
Wu et al. combined a network-based precition method with the high-throughput genome-wide data for drug repositioning [9]. However, the noise of the gene expression data can influence prediction accuracy. Gottlieb et al. developed a novel algorithm, PRETICT, to infer novel drug indications [10]. The PRETICT method integrates mul- tiple drug-drug and disease-disease similarity based on their biological nature, and get excellent prediction performance. However, the number of negative drug-disease interactions is far more than the positive ones in the drug-disease networks, which affects the accuracy of the PRETICT algorithm. Chen et al. introduced the ProbS and HeatS to predict direct drug-disease associations based only on the basic network topology [11]. The method implement naïve topology-based inference and do not take into account important features within the drug-disease domain. Zhang and Wang attempted to utilize chemical structure, protein targets, or phenotypic information to construct computational models and achieved promising results [12, 13]. This illustrates the importance of comprehensive information. Therefore, this paper attempted to integrate four types of information, including the chemical structural similarity, the sequence similarity, the side-effect similarity and the topology similarity for drug repositioning.
We introduce a new multi-label classification algorithm called Ranking-based KNN approach (denoted as Re-KNN, [14]) based on four types similarity for drug reposi- tioning. The Re-KNN approach can be divided into two phases. (i) Using traditional KNN algorithm to identify the k-nearest-neighbors of the test drug. (ii) The selected neighbors are re-ranked by the Ranking SVM model [15] trained on their trustiness. The higher weights will be assigned to the trustable neighbors for making the weighted-voting decisions. The Re-KNN methods combined with the similarity matrix of drug-drug was applicated in the drug-disease dataset by 10-fold cross-validation tests. To illustrate the performance of the method, we compared Re-KNN method with WNN-GIP method [16], Super-Target method [17] and KNN method. The increased AUC and AUPR of the Re-KNN method illustrate a higher accuacy prediction than the three methods. Further, some interesting relationships between cardiovascular disease and drugs were found in this paper.
2 Materials
Cardiovascular disease is a type of chronic disease and a serious threat to the health of the elderly. 38 different kinds of cardiovascular diseases treated as the research objective were obtained from Human Diseases database of KEGG [18]. In addition, 104 different types of drugs, which are known as the treatment for the 38 cardiovas- cular diseases were drawn from the KEGG DRUG database. Including 38 cardiovas- cular diseases associated with these 104 kinds of drugs, we also picked out other 55 diseases. Both these diseases and cardiovascular disease at least have one same drug. Finally, we obtained 93 kinds of diseases and 540 kinds of related drugs. There are 1,060 known disease-drug relationships (Table 1).

Using the Ranking-Based KNN Approach for Drug Repositioning 319 Table 1. Statistic of the validated drug-disease association network
Number of diseases
Number of drugs
Number
of known associations
93
540
1060
Attention
Cardiovascular disease
3 Methods
3.1 Overview of the Method
Average degree of diseases
11.4
Tsung-Hsien Chiang first developed the Ranking-based KNN method for multi-label classification. In the paper, the drug-repositioning problem will be relocated as a multi-label classification problem. The ‘old’ drugs are considered as samples to be classified while the names of diseases will be treated as the classification labels. First, all of the drugs were divided into training set and test set, then the similarity matrix of the training set was acquired by combining four categories of similarity calculation. With the given a test drug D, we need to find its k-nearest neighbors using KNN algorithm, denoted as {Nk (D,j)|j = 1,…,k}.Then, we should create new instances from {Nk (D,j)|j = 1,…,k} and send all the new instances to train the Ranking SVM model for re-ranking. The re-ranking k-nearest neighbors of drug D, which denoted as{Nk′ (D, j)|j = 1,…,k}, can be obtained by the Ranking SVM model. Therefore, the re-ranked neighbors can be treated as trustable neighbors that possess the most similarity with the labels of drug D. Finally, a weighted voting strategy was applied to obtain the final
Fig. 1. The flowchart of the ranking-based KNN approach

320 X. Tian et al.
predicted labels for the test drug D. The flowchart of the Re-KNN approach in drug-repositioning process is shown in Fig. 1. In the following section, the whole training process of Ranking SVM will be explained in details.
3.2 Similarity Measures
To take full advantage of biological prior knowledge and known network topology information, the linear combination of four kinds of similarity calculation methods, including the chemical structure similarity, target-based similarity, side-effect similarity and topology similarity were applied.
(1) Chemical similarity: We downloaded the canonical MOL files from KEGG. The chemical fingerprints were computed using PaDel-Descriptor software [19]. A chemical fingerprint contains 881 chemical substructures according to Pub- Chem database. Each drug D is described by a binary fingerprint C(D). The element of C(D) is encoded as 1 or 0 which indicates the presence or absence of corresponding PubChem substructure. The similarity score between two drugs is computed based on their fingerprints C(D) according to the Tanimoto score. Tanimoto score is equivalent to the Jaccard score of the drugs’ fingerprints. We use the Tanimoto score as the chemical similarity of drugs.
􏶓 􏶒 jCðDiÞ \CðDjÞj
Scs Di;Dj 1⁄4jCðDÞjþC􏶓D􏶒􏶓jCðDÞ \CðDÞj ð1Þ
ijij
Where jCðDiÞj and jCðDjÞj means the counts of structure fragments in drug i and drug j, respectively. The jCðDiÞ \ CðDjÞj means the amount of chemical substructures shared by drug i and drug j.
(2) Target-based similarity: The targets of drugs can be obtained from DrugBank or KEGG DRUG. To compute the sequence similarity between two proteins, we use the normalized version of Smith-Waterman scores [20].
(3) Side-effect similarity: SIDER database contains information about 1430 kinds of marked drugs and 5868 kinds of recorded adverse drug reactions. Each drug D was represented by 5868-dimensional binary side-effect profile E (D). The ele- ments of E (D) is encoded as 1 or 0 which indicate the presence or absence of each of the side-effect key words respectively. The pairwise side-effect similarity between drug i and drug j is computed as the Tanimoto score of their chemical structure similarity:
􏶓 􏶒 jEðDiÞ \EðDjÞj
Sse Di;Dj 1⁄4jEðDÞjþE􏶓D􏶒􏶓jEðDÞ \EðDÞj ð2Þ
ijij
(4) Topology similarity: According to the existing drugs-diseases network relation- ship, we constructed a matrix K, where the element kij of the K represents the number of shared diseases by drug i and drug j [21]. Drug i and drug j shared

Using the Ranking-Based KNN Approach for Drug Repositioning 321
more diseases, the more similar the two drugs will be. At the same time, we use Floyd algorithm to find the shortest distance between two nodes in the drug-disease network, and then calculate the Pearson correlation coefficients of the distance vectors between any two drugs to construct a similar matrix. The matrix reflects the topology similarity between the drugs.
Pnk1⁄41ðVik 􏶓 ViÞðVjk 􏶓 VjÞ Pnk1⁄41ðVik 􏶓 ViÞ2 Pnk1⁄41ðVjk 􏶓 VjÞ2
􏶓􏶒
Stp Di; Dj 1⁄4 j qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiqffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi j ð3Þ
3.3
Ranking SVM Model
KNN algorithm is an effective learning method for solving conventional classification problems. The label of a sample is usually based on a majority vote of its k-nearest-neighbors. However, not all of the neighbors are trustable for label predic- tion. Thus, the Ranking SVM model is used to give the trustable ranking of these neighbors. Figure 2 shows the training process of the Ranking-SVM model, where x denotes the drug’s feature vector and the Y(x) denotes the label set associated with x.
First, with KNN algorithm, the k-nearest neighbors for the sample xi in training set will be searched.
Second, by constructing k instances Q1⁄4fq1;q2;…;qkg for the sample xi, the ranking SVM model is built. A new instance qj can be established for x0is j-th nearest neighbors by KNN. The elements of the q0js features are listed as follows:
• The difference among each feature value of j-th nearest neighbors and sample xi (size = dim(S)).
• The Euclidean distance between j-th nearest neighbors and sample xi (size = 1).
• The cosine distance between j-th nearest neighbors and sample xi (size = 1).
• The label set of j-th nearest neighbors (size = n).
Fig. 2. The training process of the ranking SVM model

322 X. Tian et al.
Then, using Hamming loss function [22] to determine the quality of each new instance qj, Hamming loss calculates the percentage of labels that are not consistent between two instances. It is defined as follows:
HammingLossðL1; L2Þ 1⁄4 1 jL1DL2j ð4Þ jLj
Where L1 and L2 represent two label sets respectively. The lower Hamming loss is, the higher rank will be assigned to the new instance qj.
3.4 Weighted Voting
According to the above training process, a Ranking SVM model is built on the training
data set. With sending an un-labeled drug x to the model, we get the re-ranked trustable
neighbors fNk0 ðx; jÞjj 1⁄4 1; . . .; kg of drug x. In this section, we will give the corre-
sponding weight to each trustable neighbors. Let ðw1;w2;…;wkÞ denote the weight
scores of the re-rank neighbors fNk0 ðx; jÞjj 1⁄4 1; . . .; kg. We choose the 􏶔􏶕
wjjj 1⁄4 1; . . .; k with 10-fold cross-validation to maximize AUC, where wj 2 ð0; 1Þ follow the uniform distribution.
The probability PriðxÞ between drug x and disease-label yi is defined as following formula. As the PriðxÞ increases, the probability that the drug x can treat the disease yi will also arise.
4 Results
Pkj1⁄41 wj 􏶒yiðNk0 ðx;jÞÞ
PriðxÞ 1⁄4 Pk w ð5Þ
j1⁄41 j
In order to find the trustable neighbors, the Re-KNN method is applied to the drug-disease network associated with cardiovascular disease. In addition, 10-fold cross validation tests is used to evaluate the performance of the method. The known drugs-diseases dataset are randomly divided into ten subsets of equal size. Each one of the ten parts will be used as test data in turn while the remaining nine parts should be treated as training data. Both AUC and AUPR are used to evaluate the results as the quality measures.
4.1 Comparison with Other Methods
Recently, several similar types of methods have been proposed for drug-disease association prediction. This study selects the WNN-GIP, the Super-Target and KNN in order to compare the performance of Re-KNN method.
The performance comparision for the Re-KNN method, WNN-GIP method, Super-Target method and KNN method on cardiovascular dataset are listed in Table 2. As shown in Table 2, we can see the Re-KNN method performed better with both AUC and AUPR.

Using the Ranking-Based KNN Approach for Drug Repositioning 323
Table 2. Comparison results for the four prediction methods
Method KNN AUC 0.821 AUPR 0.258 SN 0.147 SP 0.982 ACC 0.969 MCC 0.250
SN–Sensitivity; SP–Specificity; ACC–Accuracy; MCC–Matthews Correlation Coefficient.
By comparing with WNN-GIP method, the AUC value and the AUPR value of the Re-KNN method is increased by 0.08 and 0.21 respectively. In addition, the Re-KNN method had a better accuracy by comparing with other three methods. For instance, the accuracy 0.983 of the Re-KNN method is larger than the accuracy of the WNN-GIP. Since the number of negative sample is far more than the number of the positive ones, sensitivity should be given higher priority. The sensitivity (0.388) of Re-KNN method is larger than other three methods, which justified the contribution of Ranking-SVM model. The ROC curves and the P-R curves of four methods are shown in Fig. 3.
4.2 Predicted New Drug-Disease Associations for Cardiovascular Diseases
After the Re-KNN method is proved to perform better than other three methods, the experiment was repeated 10 times for the drug-disease dataset associated with car- diovascular diseases. By ranking the 10 experimental results, the diseases that can be possibly treated are picked out. In particular, we found 300 new associations between 117 drugs and 35 cardiovascular diseases that can be seen on the top of Fig. 4. The red diamonds represent cardiovascular diseases, green triangles represent the drugs, black solid line represents the existing relationship and the purple dashed line represents the
Re-KNN
WNN-GIP
Super-target
0.928
0.845
0.800
0.588
0.375
0.322
0.388
0.134
0.111
0.977
0.985
0.968
0.983
0.974
0.959
0.542
0.351
0.343
Fig. 3. ROC and P-R curve of four methods in cardiovascular diseases related datasets

324 X. Tian et al.
possible existing new associations predicted by our methods. As the figure shows, the new relationship is distributed more concentratedly in the marked Part A and Part B respectively. The gathered drugs in part A that used to be treatment for other diseases may become the therapeutic for cardiovascular. The diseases of the cardiovascular type can possibly treat another disease type in Part B.
The bottom part of Fig. 4 presents the specific examples predicted by Re-KNN method. In the left of botton part shows the drug Adomiparin (D07510) can be used as the treatment for the inherited thrombophilia (H00223) and Gordon’s syndrome (H00243) simultaneously. The drugs, Digiglusin (D03819) and Quelicin (D00766), may become the treatment to H00233, which known as the therapeutic method of Gordon’s syndrome. In the right, primary pulmonary hypertension (H01300) may be treated by the drugs D08953, D03658, D06414, etc. In addition, these drugs can also treat chronic myeloid leukemia (H00004), a cancer of haematopoietic and lymphoid tissues.
In order to verify the predicted new associations, we searched the literatures in the PubMed database with the predicted relationships. Table 3 lists a part of the searched results. It showed some of the drug predicted by this paper can be validated by relevant literatures and show potentiality for further study.
Fig. 4. The predictied associations and the known interactions between drug-diseases (Color figure online)

Using the Ranking-Based KNN Approach for Drug Repositioning 325
Table 3. The confirmed drug-disease associations predicted by Re-KNN in PubMed Diseases(KEGG code) References
Hemophilia (H00219) [23]
Inherited thrombphilia [24] (H00223)
Beta thalassemia [25] (H00228)
Preexcitation [26] syndrome(H01154)
Primary pulmonary [27] hypertension
(H01300)
5 Discussion and Conclusion
This paper proposed a Re-KNN method to the drug repositioning of cardiovascular diseases. One characteristic of the method lies in combining four types of similarity calculation methods including chemical structural similarity, target-based similarity, side-effect similarity and topological similarity information. By comparing with WNN-GIP method and Super-Target method, the increased AUC value and AUPR value proved the better predictive capability of Re-KNN method. In addition, some part of new predicted associations can be supported by the PubMed database. We expect the method can provide valuable clues for drug discovery.
Although the Re-KNN method had a better predictive capability in drug reposi- tioning, the performance can be further improved by integrating more biological information including gene expression experiments and pathway-related information. Further, good prediction models need to integrate more prior information and con- stantly improve its own algorithm. In the Re-KNN method, the ranking-SVM model and the weighted vote strategy still have a room for improvement. In the future, we will devote ourselves to improve the method from the above aspects.
Acknowledgements. This work was partially supported by National Basic Research Program of China (Grants No. 2012CB910400), the Fundamental Research Funds for the Central Univer- sities (78260026), and National Science and Technology Support Plan Project (2015BAH12F01).
References
1. Adams,C.P.,Brantner,V.V.:Estimatingthecostofnewdrugdevelopment:isitreally$802 million? Health Aff. 25(2), 420–428 (2006)
2. Ashburn, T.T., Thor, K.B.: Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov. 3(8), 673–683 (2004)
Known drugs
Predicted drugs
Blood-coagulation factor VIII, etc.
Epsilon-Aminocaproic acid
Warfarin and Heparin
Dalteparin sodium
Deferoxamine, etc.
Vitamin E
Vagal maneuvers, etc.
Disopyramide phosphate
Imatinib
Nilotinib

326 X. Tian et al.
3. Yang, L., Agarwal, P.: Systematic drug repositioning based on clinical side-effects. PLoS ONE 6(12), e28025 (2011)
4. Sanseau, P., Agarwal, P., Barnes, M.R., et al.: Use of genome-wide association studies for drug repositioning. Nat. Biotechnol. 30(4), 317–320 (2012)
5. Clouser,C.L.,Patterson,S.E.,Mansky,L.M.:Exploitingdrugrepositioningfordiscoveryof a novel HIV combination therapy. J. Virol. 84(18), 9301–9309 (2010)
6. Anne, C., James, P., Alistair, B., et al.: Drug repositioning for Alzheimer’s disease. Nat. Rev. Drug Discov. 11(11), 833–846 (2012)
7. Cheng, F., Liu, C., Jiang, J., et al.: Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput. Biol. 8(5), e1002503 (2012)
8. Alaimo, S., Pulvirenti, A., Giugno, R., et al.: Drug-target interaction prediction through
domain-tuned network-based inference. Bioinformatics 29(16), 2004–2008 (2013)
9. Wu, Z., Wang, Y., Chen, L.: Network-based drug repositioning. Mol. BioSyst. 9(6), 1268–
1281 (2013)
10. Gottlieb, A., Stein, G.Y., Ruppin, E., et al.: Predict: a method for inferring novel drug
indications with application to personalized medicine. Mol. Syst. Biol. 7, 496 (2011)
11. Chen, H., Zhang, H., Zhang, Z., et al.: Network-based inference methods for drug
repositioning. Comput. Math. Methods Med. 2015, 130620 (2015)
12. Zhang, P., Agarwal, P., Obradovic, Z.: Computational drug repositioning by ranking and
integrating multiple data sources. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds.) ECML PKDD 2013, Part III. LNCS, vol. 8190, pp. 579–594. Springer, Heidelberg (2013)
13. Wang, Y., Chen, S., Deng, N., et al.: Drug repositioning by kernel-based integration of molecular structure, molecular activity, and phenotype data. PLoS ONE 8(11), e78518 (2013)
14. Chiang, T.H., Lo, H.Y., Lin, S.D.: A ranking-based KNN approach for multi-label classification. JMLR W&CP. 25, 81–96 (2012)
15. Yu, H., Kim, J., Kim, Y., et al.: An efficient method for learning nonlinear ranking SVM functions. Inf. Sci. 209, 37–48 (2012)
16. Laarhoven, T.V., Marchiori, E.: Predicting drug-target interactions for new drug compounds using a weighted nearest neighbor profile. PLoS ONE 8(6), e66952 (2013)
17. Shi, J.Y., Yiu, S.M., Li, Y., et al.: Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering. Methods 83, 98–104 (2015)
18. Kanehisa, M., Goto, S.: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids
Res. 28(1), 29–34 (2000)
19. Yap, C.W.: PaDEL-descriptor: an open source software to calculate molecular descriptors
and fingerprints. J. Comput. Chem. 32(7), 1466–1474 (2011)
20. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol.
Biol. 147(1), 195–197 (1981)
21. Xia,Z.,Wu,L.Y.,Zhou,X.,etal.:Semi-superviseddrug-proteininteractionpredictionfrom
heterogeneous biological spaces. BMC Syst. Biol. 4(Suppl 2), S6 (2010)
22. Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Depatment of
Informatics, Aristotle University of Thessaloniki, Greece (2006)
23. Biggs, R., Hayton-Williams, D.S.: Epsilon-aminocaproic acid in the treatment of
hemophilia. Br. Dent. J. 124(124), 157 (1968)
24. Leduc,L.,Dubois,E.,Takser,L.,etal.:Dalteparinandlow-doseaspirininthepreventionof
adverse obstetric outcomes in women with inherited thrombophilia. J. Obstet. Gynaecol.
Can. 29(10), 787–793 (2007)
25. Giardini, O., Cantani, A., Donfrancesco, A.: Vitamin E therapy in homozygous
beta-thalassemia. N. Engl. J. Med. 305(11), 644 (1981)

Using the Ranking-Based KNN Approach for Drug Repositioning 327
26. Kerr, C.R., Prystowsky, E.N., Smith, W.M., et al.: Disopyramide phosphate in wolff-parkinson-white syndrome. Am. J. Cardiol. 47(2), 495 (1981)
27. Chaumais, M.C., Perros, F., Dorfmuller, P., et al.: Nilotinib and imatinib therapy in experimental pulmonary hypertension. Arterioscler. Thromb. Vasc. Biol. 32(6), 1354–1365 (2012)