This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSC.2017.2686390, IEEE Transactions on Services Computing

to common Web APIs’ invoking or tags’ marking, and form a large service network based on their relationships [13]. The relationships among Mashups can be exploited to mine useful and novel topics for significantly improving the accuracy of both clustering and recommendation.
The second problem is that most existing methods may produce a recommendation result with only a single focus, such as popularity. It is desirable to return a set of diverse Web APIs in order to better cover the search space of Web APIs, including both popular and unpopular ones. According to the statistics of ProgrammableWeb [5], the top 10 (200) most popular Web APIs account for about 30.6% (99%) of all Web API invocations in Mashups. This indicates that only a small portion of Web APIs is frequently used in Mashups, while most of them have low usage or are not used at all. Users usually consider the historical usage of Web APIs and tend to create new Mashups from these popular Web APIs, so recommendation results focus only on the popular Web APIs. Consequently, unpopular Web APIs that are actually suitable for users’ Mashup requirements may not be found [18]. In fact, the implicit co-invocation records between Web APIs in common historical Mashups can be used to predict the usage probability of unpopular Web APIs in Mashups. The prediction results can be used to diversify the Web APIs recommendation.
Inspired by the above approaches, the goal of this paper is to use the relationships among Mashup service documents and the implicit co-invocation between Web APIs to obtain more accurate and diverse Web APIs recommendation results for users’ Mashup development. The contributions are summarized as follows:
• We present a novel two-level topic clustering model for mining more useful topics from the relationships among Mashups, and for modeling Mashup services in terms of both their content and network to represent their functional features effectively. In this model, we employ a random walk process on the Mashup service network to identify novel topics from the linked Mashup documents at the network level and incorporate them into the topic probability distribution of original Mashup service documents at the content level.
• We use Jensen-Shannon (JS) divergence to calculate the similarity between Mashup service documents based on latent functional topics, and combine the K-Means and Agnes algorithms to perform Mashup service clustering. Based on the clustering results, we design a CF-based Web APIs recommendation algorithm that explores the invocation history between Mashup clusters and the corresponding Web APIs, and derives the implicit co-invocation relationship among Web APIs to rank and recommend diverse Web APIs for users’ Mashup requirements.
• We develop a real-world dataset from ProgrammableWeb, which can be accessed and downloaded at http://49.123.0.60:8080/MashupNetwork2.0/dataset.jsp. We also conduct a set of experiments, and the results show that our method achieves a significant improvement in terms of clustering accuracy (precision, recall, purity and entropy), recommendation accuracy (DCG) and diversity (HMD), compared with other existing approaches.
IEEE TRANSACTIONS ON SERVICES COMPUTING
The rest of this paper is organized as follows: Section 2 introduces the background. Section 3 describes the method. Section 4 reports experimental results. Section 5 reviews related work. Section 6 provides discussions. Finally, conclusions and future work are presented in Section 7.
2 BACKGROUND
2.1 Motivation Example
Suppose that a developer, Bob, wants to develop a Mashup application that can track a package with map display and message notification. He first enters his search requirement “Package tracking with map and message” in the search engine of ProgrammableWeb, which returns 0 search results and therefore does not help at all. Then, he searches for Web APIs of the target Mashup using other search requirements such as “package tracking”, “map” and “message”, and obtains 52, 1011 and 1222 search results respectively, as illustrated in Figure 1.
Fig. 1. Motivation Example
Specifically, by examining the 52 search results for the keywords “package tracking”, we have the following observations:
• They belong to 19 service categories with different functionalities.
• Many relevant Web APIs are missing. For example, several related services, such as Canada-Post-Tracking and Australia-Post-Tracking, were not found.
• Many of them are irrelevant. They do not meet Bob's requirement for the Web APIs of his target Mashup; examples are Bower, GAMEhud, and VersaPay.
The same problems exist with the other two search results for the keywords “map” and “message”. As a result, it is very challenging to select proper Web APIs and compose them for Mashup creation based on these unsatisfactory search results with a large number of Web APIs. There are several main causes for the above problems. Firstly, the manual, predefined categories in ProgrammableWeb are too rigid, incomplete, and vague [1]. An effective service clustering method is needed, and relationships between services should be used for service categorization. Secondly, the search engine of ProgrammableWeb neglects the diversity of Web APIs that can be used for novel Mashup development. For example, even if Canada-Post-Tracking may be unpopular in the service repository, we can compose it with Google Maps and MessageBird to build a personalized Mashup application that can track a Canada-Post package with Google Maps display and MessageBird notification in Canada. Finally, the keyword-based searching technique used by ProgrammableWeb leads to many irrelevant search results. The historical combinations of Web APIs in common Mashups can be used to increase the relevance of service search.
1939-1374 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
CAO ET AL.: INTEGRATED CONTENT AND NETWORK-BASED SERVICE CLUSTERING AND WEB APIS RECOMMENDATION FOR MASHUP DEVELOPMENT
2.2 Requirements and Challenges
To resolve the above problems, we identify the following major requirements in order to make effective Web APIs recommendation for Mashup development:
• High clustering accuracy. Mashup service clustering with high accuracy significantly improves the quality of Web APIs recommendation. Mashup services with similar functionality are grouped into clusters so that they can be searched and discovered.
• Diverse recommendation result. A recommendation result should not include only popular Web APIs; unpopular but useful Web APIs should be included too. A diverse Web APIs recommendation result would have a better coverage of Web APIs in Mashup development.
• Good recommendation relevance. A good recommendation method should recommend more relevant Web APIs and fewer irrelevant ones, particularly in situations where the Mashup-Web API matrix is very sparse.
As discussed in Section 1, the existing clustering-based Web APIs recommendation methods for Mashup creation cannot satisfy all three requirements at the same time. It is highly desirable for a Web service search engine to classify Mashup services into clusters accurately and to recommend Web APIs with good relevance and diversity for Mashup development. Our method incorporates relationships among Mashup services and the implicit combinatorial usage of Web APIs to achieve high-quality Web APIs recommendation. In this method, we focus on the following three key challenges:
• Mashup service clustering. How can the relationships among Mashup services be exploited and integrated with service content in order to classify Mashup services into clusters with high accuracy, providing a basis for Web APIs recommendation for Mashup development?
• Modeling and matching between Mashup requirement and Mashup clusters. How can the Mashup requirement and the Mashup clusters be modeled and represented, and how can the Mashup requirement be mapped to the most suitable Mashup cluster for the subsequent Web APIs recommendation?
• Web APIs recommendation. How can the implicit combinatorial usage of Web APIs in common Mashup service clusters be explored from their usage history to rank and recommend relevant and diverse Web APIs?
3 METHOD OVERVIEW
To address the above challenges, we have designed a framework for Mashup service clustering and Web APIs recommendation, as shown in Figure 2.
The framework is composed of two phases: Mashup service clustering (Phase 1) and Web APIs recommendation (Phase 2). In Phase 1, functional profile information of Mashup services is crawled and parsed from an online service repository. The Mashup service-Web APIs usage history is obtained from the historical invocation times between Mashup services and Web APIs. The core process of Phase 1 is two-level topic model based Mashup service clustering in terms of both Mashup service content and Mashup service network. K Mashup service clusters are obtained by applying the two-level topic model to derive topics of Mashup services for more accurate service clustering. In Phase 2, when an active user submits a Mashup development requirement with a textual description, the description is first converted into a corresponding Mashup requirement topic feature vector by using LDA. Meanwhile, the same process is applied to the K Mashup service clusters identified in Phase 1 to obtain the topic feature vectors of the Mashup service clusters. Then, similarity-based matching using JS divergence between the Mashup requirement and the clusters is used to identify the Mashup cluster with the highest similarity. Finally, a CF-based Web APIs recommendation algorithm is designed to recommend the top-R Web APIs to the user. It uses historical invocation records between Mashup clusters and Web APIs to rank and recommend diverse Web APIs, including popular and unpopular ones.
Fig. 2. Framework
3.1 Service Content Extraction and Service Network Construction
We first crawl the functional profile information of Mashup services from the Internet (including category, name, description, Web APIs and tags) to build Mashup service documents. Then, we perform preprocessing to extract feature vectors representing their content. The main steps include [18]: (1) build the initial vector; (2) remove stop words; and (3) extract word stems. A preprocessed Mashup service document is represented as a content feature vector MS(C, N, T, WA, T), where C is category, N is name, T is description text, WA is Web APIs, and T is tags.
A Mashup Service Network (MSN) is built to represent the relationships among Mashup services and to facilitate the mining of latent functionality topics. The relationship among Mashup services can be described by an undirected network graph MSN=(MS, E(MSi, MSj), W(MSi, MSj)), where MS is the set of preprocessed Mashup service nodes; MSi and MSj represent any two Mashup service nodes in MS; E(MSi, MSj) represents an undirected edge that connects the two service nodes; and W(MSi, MSj) is the weight of E(MSi, MSj), which represents the similarity of the relationship between MSi and MSj. Generally speaking, if two Mashup services (i.e., MSi and MSj) have the same marking tags or commonly invoke the same Web APIs, they are considered to be similar in functionality or to belong to the same service category [19]. We use the Jaccard similarity coefficient [19] to measure the similarity weight of an edge in the MSN:
$$W(MS_i, MS_j) = \lambda\,\frac{|API(MS_i) \cap API(MS_j)|}{|API(MS_i) \cup API(MS_j)|} + \mu\,\frac{|TAG(MS_i) \cap TAG(MS_j)|}{|TAG(MS_i) \cup TAG(MS_j)|} \quad (1)$$
Here API(MSi) and API(MSj) represent the Web API sets invoked by MSi and MSj respectively. TAG(MSi) and TAG(MSj) represent the tag sets that mark MSi and MSj respectively. $\lambda$ and $\mu$ are users’ preferences.
An example of the MSN is shown in Figure 3. There are two Mashup services, i.e., 100 Most Powerful Celebrities (MSi) and Mileage Calculator (MSj), four Web APIs, i.e., Yahoo Geocoding, Google Calendar, YouTube and Google Maps, and sixteen tags. Based on the invoking and marking relationships, we construct an MSN and calculate its edge weights. For this example, we set $\lambda = \mu = 1/2$, so $W(MS_i, MS_j) = 0.5 \times 1/4 + 0.5 \times 1/6 = 0.208$. The edge weight determines the scale and density of the MSN, which significantly affect the clustering performance. So, we set a threshold T for $W(MS_i, MS_j)$ to investigate its effect on clustering in the experiment section.
[Figure: Mashups (Mileage Calculator, 100 Most Powerful Celebrities), Web services (Google Calendar, Google Maps, Yahoo Geocoding, YouTube) and tags (utility, travel, money, video, celebrity, events, viewer, display, places, address, location, deadpool, geocoding, media, ...) connected by called and marked relationships; in the derived Mashup-Mashup Service Network (MSN), the two Mashups are linked by an edge with Jaccard similarity coefficient 0.208.]
Fig. 3. An Example of Mashup Service Network
3.2 Two-level Topic Model based Clustering
We develop a two-level topic model to model Mashup services in terms of both their content and network, and derive useful topics. In this model, two random walk processes are employed to incorporate the topics at the network level into the topics at the content level. Based on the topic distribution results of Mashup services, we use JS divergence to calculate the similarity between Mashup services, and combine the K-Means and Agnes algorithms to achieve Mashup service clustering.
3.2.1 Two-level Topic Model
Different from existing topic models [20-21], the topics of a Mashup service document are modeled at two levels: content and network. Our submodel at the content level incorporates original Mashup service documents for clustering. We construct their MSN from the original Mashup service documents, representing relationships of Mashup services based on directly and indirectly linked Mashup services from another submodel at the network level. The topics of linked Mashup services in the MSN at the network level are incorporated into the topics of original Mashup service documents at the content level. The final topics of each original Mashup service document include two parts: topics from itself and other “novel” topics from the linked Mashup services of its MSN. The novel topics are captured and derived by a random walk process on its MSN. The probabilistic model is illustrated in Figure 4.
[Figure: plate diagram with a content level (MS, Ψ, LMS, t, w; plates |MS| and N) and a network level (α, Φ, z, w, φ; plate |LMS|).]
Fig. 4. A Two-level Topic Model
In Figure 4, the generative process of each Mashup service document is as follows:
1) The model performs the generative process at the network level for all linked Mashup services in the MSN of the original Mashup service document $MS$. Here $L_{MS}$ represents the set of linked Mashup services of $MS$; $|L_{MS}| = N$ indicates that a Mashup service may potentially link to all Mashup services (including itself).
• For each directly/indirectly linked Mashup service document $LMS$ of $MS$ in $L_{MS}$, and for the $i$th word in $LMS$:
a. Select a topic $z_i$ from the topic distribution of $LMS$, $p(z \mid \theta_{LMS})$, where the distribution parameter $\theta_{LMS}$ is drawn from a Dirichlet distribution $Dir(\alpha)$.
b. Select a word $w_i$ that follows the multinomial distribution $p(w \mid z_i, \varphi)$ conditioned on the topic $z_i$.
2) The model performs the generative process at the content level for the original Mashup service documents. Here, we set $MS = \{MS_1, MS_2, \ldots, MS_N\}$, i.e., $|MS| = N$, where $N$ is the number of original Mashup service documents.
• For each original Mashup service document $MS_i$, and for the $i$th word in it:
a. Select a linked Mashup service document $LMS_j$ from $p(LMS \mid MS_i, \Psi)$, a multinomial distribution on $L_{MS}$.
b. Select a topic $t_i$ from $p(t \mid LMS_j)$ of $LMS_j$ at the network level, a multinomial distribution on $LMS_j$.
c. Select a word $w_i$ that follows the multinomial distribution $p(w \mid t_i, \varphi)$ conditioned on the topic $t_i$.
As described in the above generative processes, t and z respectively represent the latent topics at the content level and the network level. It is worth noting that Ψ and Φ are two different coefficient matrices; they represent how much of the content and the topics of $MS$ at the content level come from $L_{MS}$ at the network level, respectively. The composition ΨΦ models the topic distribution at the content level. Ψ is an N×N Mashup service selection coefficient matrix, where $\Psi_{ij} = P(LMS_j \mid MS_i)$ represents the probability that $LMS_j$ at the network level is incorporated into $MS_i$ at the content level. Φ is an L×N topic selection coefficient matrix, where $\Phi_{kj} = P(t = z_k \mid LMS_j)$ represents the probability of selecting the topic $z_k$ from the topic distribution of $LMS_j$ at the network level, and L is the number of different topics generated by $LMS_j$. The detailed computation of ΨΦ is given in the next section.
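As a minimal sketch of how the composition of Ψ and Φ could yield content-level topic distributions, the following Python snippet mixes the topic distributions of linked documents using the link probabilities of each original document. The toy matrices, the function name `content_level_topics`, and the reading of the composition as a sum over linked documents (row i of Ψ multiplied by the transpose of Φ) are assumptions made for illustration; they are not taken from the paper.

```python
def content_level_topics(psi, phi):
    """Return theta[i][k] = sum_j psi[i][j] * phi[k][j].

    psi: N x N matrix, psi[i][j] ~ P(LMS_j | MS_i) (each row sums to 1).
    phi: L x N matrix, phi[k][j] ~ P(t = z_k | LMS_j) (each column sums to 1).
    """
    n_linked = len(phi[0])
    for row in psi:
        if len(row) != n_linked:
            raise ValueError("psi rows and phi columns must index the same linked documents")
    return [
        [sum(row[j] * topic_row[j] for j in range(n_linked)) for topic_row in phi]
        for row in psi
    ]


# Hypothetical toy input: N = 3 documents, L = 2 network-level topics.
psi = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
phi = [
    [0.9, 0.5, 0.2],
    [0.1, 0.5, 0.8],
]

theta = content_level_topics(psi, phi)
for i, row in enumerate(theta):
    print(f"MS_{i + 1}:", [round(value, 2) for value in row])
```

Under these assumptions each row of `theta` is again a probability distribution over topics, because every row of `psi` and every column of `phi` sums to 1.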

3.2.2 Random Walk in the Mashup Service Network
We propose two random walk processes to derive the topic distributions of each directly/indirectly linked Mashup service document at the network level.
First of all, we perform a link-level random walk on the MSN of $MS_i$ to derive the matrix Ψ. For each $LMS_j$ in $L_{MS_i}$, we associate it with a link probability score from $MS_i$:
$$P(L_{MS_i} \mid MS_i) = (1-\beta)\,(I - \beta Q)^{-1} M \quad (2)$$
Here, β is the probability that the random walk stops at the current node $LMS_j$; $(1-\beta)$ is the probability that the random walk jumps to any other node in the MSN of $MS_i$. β depends on the scale and density of the MSN, and we will investigate its effect on clustering in the experiment section. M is an initial probability distribution vector of all $LMS_j$ in the MSN, $M = [m_1, m_2, \ldots, m_N]^T$, $m_j = 1/L(LMS_j)$. $L(LMS_j)$ is the degree of $LMS_j$, representing how many nodes directly connect to $LMS_j$ in the MSN. It is worth noting that Q is an adjacency matrix, i.e., $Q = [q_{ij}]_{N \times N}$, where $q_{ij}$ is the transition probability with which the random walk transfers from $LMS_i$ to $LMS_j$. The value of $q_{ij}$ is equal to the edge weight in formula (1), i.e., $q_{ij} = W(MS_i, MS_j)$. Through the above random walk process, the link probability scores of all $LMS_j$ in the MSN of each $MS_i$ are derived. For all original Mashup service documents, we repeat the above process and obtain their final link probability distributions to construct the matrix Ψ, i.e., $\Psi = [\Psi_{ij}] = [P(LMS_j \mid MS_i)]$.
Secondly, we perform a topic-level random walk on the MSN of $MS$ to derive the matrix Φ. For $LMS_j$ in $L_{MS}$, we associate it with a topic probability score vector $r(LMS_j, z)$, each element of which is specific to a topic z. The random walk is performed along the linked nodes in the MSN within all the topics. For $LMS_i$ having a link to $LMS_j$, we define two types of transition probabilities from $LMS_i$ to $LMS_j$: the topic-intra (sharing common topics) transition probability and the topic-inter (across different topics) transition probability:
$$P(LMS_j \mid LMS_i, z_k) = \frac{1}{L(LMS_i)} \quad (3)$$
$$P(LMS_j, z_l \mid LMS_i, z_k) = P(z_l \mid LMS_j)\,P(z_k \mid LMS_i) \quad (4)$$
Here,
• $P(LMS_j \mid LMS_i, z_k)$: the topic-intra transition probability from $LMS_i$ to $LMS_j$ on the common topic $z_k$;
• $L(LMS_i)$: the degree of $LMS_i$, representing how many nodes directly connect to $LMS_i$;
• $P(LMS_j, z_l \mid LMS_i, z_k)$: the topic-inter transition probability from $LMS_i$ to $LMS_j$ across different topics $z_k$ and $z_l$;
• $P(z_l \mid LMS_j)$: the probability of topic $z_l$ generated by $LMS_j$;
• $P(z_k \mid LMS_i)$: the probability of topic $z_k$ generated by $LMS_i$.
Besides, we introduce a parameter γ to represent the preference for the topic-intra or the topic-inter transition probability. Therefore, the random walk starting from $LMS_i$ has a probability γ of accessing the common topics on $LMS_i$ and $LMS_j$, and a probability $(1-\gamma)$ of finding $LMS_j$ across different topics on $LMS_i$. Through the above topic-level random walk process, we obtain a topic probability score for $LMS_j$ on its topic $z_k$:
$$r(LMS_j, z_k) = \beta \sum_{LMS_i \to LMS_j} \left[ \gamma\,P(LMS_j \mid LMS_i, z_k) + (1-\gamma)\,\frac{1}{|T|} \sum_{l \neq k} P(LMS_j, z_k \mid LMS_i, z_l) \right] + (1-\beta)\,\frac{1}{|D|}\,P(z_k \mid LMS_j) \quad (5)$$
Here,
• $r(LMS_j, z_k)$: the topic probability score of $LMS_j$ on the topic $z_k$;
• $P(z_k \mid LMS_j)$: the probability of topic $z_k$ generated by $LMS_j$;
• $P(LMS_j \mid LMS_i, z_k)$: the transition probability from $LMS_i$ to $LMS_j$ on the common topic $z_k$;
• $P(LMS_j, z_k \mid LMS_i, z_l)$: the transition probability from $LMS_i$ to $LMS_j$ across different topics $z_l$ and $z_k$;
• $|D|$: the number of Mashup service documents in the MSN of $MS$;
• $|T|$: the number of topics generated by $LMS_j$.
The parameter β in formula (5) is the same as the one described in formula (2), which specifies the probability of the random walk from $LMS_i$ to $LMS_j$; $(1-\beta)$ is the probability that the random walk jumps to any other indirectly linked node in the MSN of $MS$. Therefore, for all topics generated by $LMS_j$, we obtain their topic probability scores by formula (5) and use them to build a final topic probability score vector $r(LMS_j, z)$, i.e., $r(LMS_j, z) = [r(LMS_j, z_1), r(LMS_j, z_2), \ldots, r(LMS_j, z_L)]^T$, where L is the number of topics generated by $LMS_j$.
Similar to the construction of Ψ, we integrate the topic probability score vectors of all $LMS_j$ in the MSN of each $MS$ to build Φ, i.e., $\Phi = [\Phi_{kj}] = [P(t = z_k \mid LMS_j)] = [r(LMS_1, z), r(LMS_2, z), \ldots, r(LMS_N, z)]$.
3.2.3 Mashup Service Clustering
For the topic distributions of all original Mashup service documents at the content level described in the above section, we use Kullback-Leibler (KL) and JS divergence [22] to calculate the similarity among Mashup services, and then combine the K-Means and Agnes algorithms to perform service clustering. The topics of a Mashup service are a simple mapping of its document vector space and represent its core content, so the similarity among Mashup services can be calculated from their topic probability distributions. Since a topic is a mixture distribution over word vectors, the following KL divergence can be introduced as the similarity measurement:
$$D_{KL}(MS_i \,\|\, MS_j) = \sum_{t=1}^{T} p_t \log \frac{p_t}{q_t} \quad (6)$$
• $D_{KL}(MS_i \,\|\, MS_j)$: the KL divergence between Mashup services $MS_i$ and $MS_j$;
• $t$: a variable that indexes the common topics in $MS_i$ and $MS_j$;
• $T$: the total number of common topics in $MS_i$ and $MS_j$;
• $p_t$: the probability of topic t in $MS_i$;
• $q_t$: the probability of topic t in $MS_j$.
Although the KL divergence is often used as a way of measuring the distance between topic probability distributions, it is not an effective metric in our application. It does not satisfy the triangle inequality and is asymmetric.
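To illustrate why the asymmetry of formula (6) matters, the sketch below evaluates the KL divergence in both directions for two hypothetical topic distributions. The function name `kl_divergence`, the example probabilities, and the choice of the natural logarithm are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = sum_t p_t * ln(p_t / q_t), following formula (6).

    Assumes p and q list probabilities for the same topics and q_t > 0
    wherever p_t > 0; terms with p_t == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same topics")
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0)


# Hypothetical topic distributions of two Mashup services.
ms_i = [0.7, 0.2, 0.1]
ms_j = [0.4, 0.4, 0.2]

forward = kl_divergence(ms_i, ms_j)
backward = kl_divergence(ms_j, ms_i)
print(f"D_KL(MS_i || MS_j) = {forward:.4f}")
print(f"D_KL(MS_j || MS_i) = {backward:.4f}")
```

The two directions give different values for the same pair of services, so a similarity matrix built from formula (6) alone would not be symmetric.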

That is, $D_{KL}(MS_i \,\|\, MS_j) \neq D_{KL}(MS_j \,\|\, MS_i)$. Since the interrelationship of two topics is symmetric, it is impractical to use the KL divergence to measure the similarity between Mashup services. The JS divergence is an improved, symmetric version of the KL divergence, which can measure the divergence between topic semantics. Here, we employ the JS divergence on top of the KL divergence to measure the similarity between $MS_i$ and $MS_j$. A smaller JS divergence means a higher similarity between any two Mashup services.
$$D_{JS}(MS_i, MS_j) = \frac{1}{2}\left[ D_{KL}\!\left(MS_i \,\Big\|\, \frac{MS_i + MS_j}{2}\right) + D_{KL}\!\left(MS_j \,\Big\|\, \frac{MS_i + MS_j}{2}\right) \right] \quad (7)$$
Suppose the topic distribution probabilities of Mashup services MS1 and MS2 are as given in Table I (t1, t2 and t3 are common topics in MS1 and MS2). Based on formulas (6) and (7), we can obtain their KL and JS divergences, i.e., $D_{KL}(MS_1 \,\|\, MS_2)$ = …, $D_{KL}(MS_2 \,\|\, MS_1)$ = …, and $D_{JS}(MS_1, MS_2)$ = ….
TABLE I. TOPIC DISTRIBUTIONS OF MASHUP SERVICES
        t1      t2      t3
MS1     0.31    0.14    0.27
MS2     0.37    0.25    0.19
Therefore, the similarity of any two Mashup service documents, denoted as sim(MSi, MSj), can be calculated by formula (7), i.e., sim(MSi, MSj) = $D_{JS}(MS_i, MS_j)$. It is used as the matrix element to construct a similarity matrix among all Mashup service documents for clustering. Similar to our previous work [22], the K-Means and Agnes algorithms are used to perform the following Mashup service clustering process in terms of their similarities.
(1) Build a Mashup service similarity matrix MSim, and an average similarity set ASS for each Mashup service;
(2) Rank the Mashup services and select those with higher average similarity to build N’ atom-clusters by the K-Means algorithm. Build a Mashup cluster similarity matrix CSim between the N’ atom-clusters;
(3) Use the Agnes algorithm to conduct hierarchical clustering for the N’ atom-clusters, merging those whose similarity exceeds a threshold Thr0;
(4) Output the clustering result with K clusters once a termination condition is reached.
Algorithm 1 below presents a brief implementation of Mashup service clustering.
Algorithm 1. Mashup Service Clustering
Input: MS={MS1, MS2,..., MSN}, MSi(Z1, Z2,..., ZT), Thr0
// MS is the original Mashup services set for clustering; MSi(Z1, Z2,..., ZT) is the topic vector of MSi, which is derived from the two-level topic model in Section 3.2.1 and the random walk processes in Section 3.2.2 //
Output: K Mashup Service Clusters {M1, M2, ..., MK}
For i=1 to N
  For j=1 to N
    For t=1 to T
      Calculate Djs(MSi, MSj) by formula (7);
      Sim(MSi, MSj) = Djs(MSi, MSj);
    End For
  End For
End For
Construct MSim by matrix element Sim(MSi, MSj);
AveSim(MSi) = Σ_{j=1..N} Sim(MSi, MSj) / N;
ASS = {AveSim(MS1), AveSim(MS2), …, AveSim(MSN)};
For k=1 to N’   // build N’ atom-clusters by the K-Means algorithm //
  Select MSi with the max average similarity in ASS;
  Divide MSi into Mk, and remove MSi from ASS;
  For j=1 to N
    If (Sim(MSj, MSi) > AveSim(MSi))
      Divide MSj into cluster Mk, and remove MSj from ASS;
  End For
End For
For x=1 to N’
  For y=1 to N’   // hierarchical clustering for N’ atom-clusters by the Agnes algorithm //
    For p=1 to P
      For q=1 to Q   // P, Q represent the numbers of Mashup services in Mx, My respectively //
        Sim(Mx, My) = Σ_{p=1..P} Σ_{q=1..Q} Sim(MSp, MSq) / (P×Q);
        Construct CSim by matrix element Sim(Mx, My);
      End For
    End For
  End For
End For
Do {
  Find Mx and My with the highest similarity in CSim;
  If (Sim(Mx, My) > Thr0)
    Merge Mx and My into a new Mashup cluster;
  End If
  Update CSim; the similarity between the merged new cluster and other clusters is equal to their mean similarity;
} While (!(all clusters are merged into one cluster || Sim(Ci, Cj) < Thr0))
Return K Mashup Service Clusters.
3.3 Web APIs Recommendation based on CF Algorithm for Mashup Cluster
As described in Figure 2, when a user submits a Mashup development requirement, the textual description of the requirement is first converted to a Mashup requirement topic feature vector by using the LDA topic model. LDA assesses the probability of a word in a document based on the probability of a topic in the document and the probability of the word in the topic. The textual description of the requirement may be viewed as a mixture of various topics, and it can be characterized by a particular set of topics. Similarly, the same LDA-based process can be applied to the generated K Mashup service clusters, and the topic feature vector of each Mashup service cluster is derived. Then, we use the JS divergence to measure the topic matching probabilities of all Mashup service clusters regarding the given Mashup requirement. Therefore, the Mashup service cluster with the highest matching similarity to the Mashup requirement should be the one that best satisfies the users’ Mashup requirement among all clusters. Afterwards, a recommendation method is designed to rank and recommend the top-R Web APIs within the Mashup service cluster with the highest similarity measurement. Since the number of Web APIs is huge while the number of Mashup service clusters is small, we choose the item-based CF algorithm as our recommendation method. It uses historical invocation records between Mashup clusters and Web APIs to rank and recommend diverse Web APIs, including popular and unpopular ones.
3.3.1 Problem Definition
Suppose that there are K Mashup service clusters {M1, M2, ..., MK} and N Web APIs {WA1, WA2, ..., WAN}. The invocation relationship between Mashup service clusters and Web APIs can be represented by a matrix $R = [r_{ij}]_{K \times N}$, where the value of $r_{ij}$ is equal to the normalized popularity of WAj in Mi, $0 \le r_{ij} \le 1$, representing the total frequency with which all Mashup services in Mi historically invoked WAj. If none of the Mashup services in Mi invoked WAj, then $r_{ij} = 0$. As shown in Table II, there are three Mashup clusters (i.e., M1, M2, M3) and five Web APIs (i.e., WA1, WA2, WA3, WA4, WA5). As M3 does not invoke WA4, M3 is regarded as an active user and WA4 is regarded as a target item.
TABLE II. THE INVOCATION RELATIONSHIP MATRIX R
        WA1    WA2    WA3    WA4    WA5
M1      0.6    0.3    0.4    0.3    0
M2      0      0.5    0.7    0.4    0
M3      0.1    0.2    0.7    0      0.8
3.3.2 Personalized Similarity between Web APIs
The Pearson Correlation Coefficient (PCC) is used to calculate similarity in recommendation systems. The item-based CF algorithm adopts PCC to calculate the similarity between WAi and WAj, as follows:
$$sim(WA_i, WA_j) = \frac{\sum_{m \in M} (r_{m,i} - \overline{WA_i})(r_{m,j} - \overline{WA_j})}{\sqrt{\sum_{m \in M} (r_{m,i} - \overline{WA_i})^2}\;\sqrt{\sum_{m \in M} (r_{m,j} - \overline{WA_j})^2}} \quad (8)$$
Here, M is the set of Mashup service clusters that invoked both WAi and WAj, and $r_{m,i}$ and $r_{m,j}$ represent the frequencies with which the Mashup services in cluster m historically invoked WAi and WAj respectively. $\overline{WA_i}$ and $\overline{WA_j}$ represent the average frequency values of WAi and WAj invoked by different Mashup service clusters respectively. For Table II, the similarities between WA4 and the other Web APIs are computed by formula (8), i.e., sim(WA4, WA1)=0.1, sim(WA4, WA2)=0.83, sim(WA4, WA3)=0.95.
3.3.3 Similar Neighbor Selection and Missing Value Prediction
After calculating the similarities between Web APIs, we obtain a Web APIs similarity matrix and apply the top-K algorithm, which is widely used in recommendation systems, to identify a set of similar neighbors for WAi, denoted as S(WAi). Then we use S(WAi) to predict the missing values for the target Web API by the following formula:
$$P(r_{m,i}) = \overline{WA_i} + \frac{\sum_{WA_j \in S(WA_i)} sim(WA_i, WA_j)\,(r_{m,j} - \overline{WA_j})}{\sum_{WA_j \in S(WA_i)} sim(WA_i, WA_j)} \quad (9)$$
Here, $\overline{WA_i}$ is the average frequency value of the target Web API item WAi invoked by different Mashup service clusters, and $\overline{WA_j}$ is the average frequency value of the similar Web API item WAj invoked by different Mashup service clusters. As we know from Section 3.3.2, the similarities between WA4 and WA3, WA2 are large, so we first select them as similar neighbors of WA4, i.e., S(WA4)={WA3, WA2}, and then perform the missing value prediction of WA4 to M3 by using formula (9), which yields 0.266.
3.3.4 Web APIs Recommendation for Mashup Clusters
The predicted missing values can be employed to recommend optimal Web APIs for Mashup clusters. The recommendation list is generated by combining the known popularity values and the predicted missing values of Web APIs for a specific Mashup service cluster. When an active user submits a Mashup requirement, the following process is performed to recommend the top-R Web APIs.
(1) The users’ Mashup requirement is converted into a topic feature vector Tuser by applying LDA.
(2) For each Mashup cluster, we also obtain its topic feature vector TMi by applying LDA.
(3) Similarity matching sim(Tuser, TMi) is performed between Tuser and TMi by JS divergence, and the Mashup cluster with the highest similarity is selected.
(4) For the selected Mashup cluster, a recommendation list with the top-R Web APIs is returned by applying the item-based CF algorithm.
Algorithm 2 presents the implementation of Web APIs recommendation for Mashup clusters.
Algorithm 2. Web APIs Recommendation
Input: MR, M={M1, M2,..., MK}, WA={WA1, WA2,..., WAN}
// MR is the users’ Mashup development requirement description; M is the set of Mashup service clusters obtained by Algorithm 1; WA is the set of Web APIs waiting for recommendation //
Output: Top-R Web APIs for Mashup cluster
Tuser = LDA(MR);
For i=1 to K
  TMi = LDA(Mi);
  Calculate sim(Tuser, TMi) by formula (7);
End For
Rank sim(Tuser, TMi) and identify Mp with the highest similarity;
For j=1 to P   // the number of Web APIs in cluster Mp is P //
  For k=j+1 to P
    Calculate sim(WAj, WAk) by formula (8);
  End For
  Select top similar neighbors for WAj and construct S(WAj);
  If the value of Mp to WAj is missing then
    Predict the missing value for Mp to WAj by formula (9);
  End If
End For
For l=1 to P   // the number of Web APIs in cluster Mp is P //
  Rank the popularity values or prediction values of WAl;
End For
Return top-R Web APIs for Mp.
4 EXPERIMENTS
To demonstrate the performance of our proposed method, experiments on Mashup service clustering and Web APIs recommendation are conducted.
4.1 Experiment Dataset, Platform and Settings
We crawled 6960 real Mashup services and their related data from the ProgrammableWeb site, as of November 2015. For each Mashup service, we first obtained the metadata of the Mashup service’s category, name, descriptive text, Web APIs and tags. After preprocessing the crawled dataset, we obtained the standard content information of the Mashup services. From the dataset, we know that the total number of categories for the 6960 Mashup services is 445, and the average number of Mashup services per category is 15.64. It has also been observed that the numbers of Mashup services differ greatly between categories. For example, the category Mapping contains 1038 Mashup services, but the category Addresses contains only 1 Mashup service. Since small numbers of Mashup services lead to poor service clustering, we choose the top 20 categories, which involve 3929 Mashup services and 62078 words in the dataset, as the experimental dataset. The detailed distribution of Mashup services in the top 20 categories is shown in Table III, in which each category contains more than 50 Mashup services. The crawled dataset can be accessed and downloaded at http://49.123.0.60:8080/MashupNetwork2.0/dataset.jsp.
TABLE III. THE DISTRIBUTION OF MASHUP SERVICES IN TOP 20 CATEGORIES
and false positives (FP), which are items incorrectly labeled as belonging to the class). Recall for a class is the number of TP divided by the total number of elements that actually belong to the class (i.e., the sum of TP and false negatives (FN), which are items that were not labeled as belonging to the class but should have been).
We firlstly created a document to include a standard/positive classification result for all original Mashup service docu- ments MS, which can be jointly determined by domain experts and users. Then, we represent the stand- ard/positive classification result for MS as SM={SM1, SM2, ..., SMV}, and the experimental service clustering result as M={M1, M2, ..., MK}. So the precision and recall are respectively defined as formulas (10) and (11) below. 𝑒𝑐 𝑜 ( ) | | (10) || 𝑒𝑐𝑎()|| (11) || Here,       TP—the number of Mashup services in Mashup cluster correctly labeled as belonging to the corresponding Category Mapping Search Social Ecommerce Photos Music Travel Video Messaging Mobile Number of Mashup Service 1038 305 298 295 260 251 192 174 137 126 Category Number of Mashup Service Mashup cluster ; FP—the number of Mashup services in incorrectly labeled as belonging to ; FN—the number of Mashup services which were not labeled as belonging to but should have been; | | | appeared in SMi and Mi. Based on the dataset and relationships among Mashup services, we have developed a Mashup Service Network Platform (http://49.123.0.60:8080/MashupNetwork2.0). In our experiment, we used the platform to construct the MSN, and employed random walk process on the MSN, which served as a basis of service clustering. For the sim- plicity of experiment, we set and in formula (1) as 1/2. Besides, according to the existing experience [23-24], we set 𝛾 in formula (5) to 1. When using two-level topic model to perform unsupervised learning, Dirichlet hyper- parameters α is set to 0.01, and Gibbs sampling iteration is set to 2000. In algorithm 1, we choose the best value of Thr0 to produce optimal Mashup clustering quality. 4.2 Mashup Service Clustering To consolidate the relationship among Mashups on clus- tering accuracy, we perform service clustering experi- ments, which is as a basis of Web APIs recommendation. 
4.2.1 Evaluation Metrics We choose precision and recall from information retrieval to evaluate the accuracy of Mashup service clustering. Actually, Mashup service clustering is a classification process. The precision for a class is the number of true positives (TP) (i.e. the number of items correctly labeled as belonging to the class) divided by the total number of elements labeled as belonging to class (i.e. the sum of TP Besides, we also employ purity and entropy to evaluate the accuracy of service clustering. The purity of each Mi and the mean purity of all Mashup service clusters in M are respectively defined in the formulas (12) and (13). Similarily, the entropy of each Mi and the mean entropy of all Mashup service clusters in M are respectively de- fined in the formulas (14) and (15). 𝑢𝑡𝑦( ) | |𝑚𝑎𝑥 𝐾 V(12) 𝑢 𝑡𝑦( ) ∑ | | 𝑢 𝑡𝑦( ) (13) || 𝐸 𝑡 𝑜𝑝𝑦( ) − ∑ 𝑜𝑔( ) (14) || || 𝐸 𝑡 𝑜𝑝𝑦( ) ∑ | | 𝐸 𝑡 𝑜𝑝𝑦( ) (15) || | is the number of Mashup services in cluster Mi, is the number of Mashup services belong to SMj which are successfully divided into Mi, and | | is the total amount of Mashup services in the MS. In short, the bigger recall, precision, purity and the smaller entropy, mean that the clustering accuracy is the better. 4.2.2 Baseline Methods We compare our method with three existing methods: • K-Means. Both description text and tags of Mashup services are used. K-Means is used to cluster Mashup services according to the composite similarity [25]. • LDA. LDA technique is used to cluster Mashup ser- vices, in terms of both their description text and tags [20]. The similarities between services are measured Sports Telephony Blogging Reference Electronic Signature Widgets Visualizations Humor Government Games 112 99 98 98 95 86 78 67 66 54 | —the number of Mashup services in SMi; | —the number of Mashup services in Mi; | —the number of Mashup services jointly 1939-1374 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. 
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Here, | This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSC.2017.2686390, IEEE Transactions on Services Computing CAO ET AL.: INTEGRATED CONTENT AND NETWORK-BASED SERVICE CLUSTERING AND WEB APIS RECOMMENDATION FOR MASHUP DEVELOPMENT 9 by KL divergence of services’ topic feature vectors [1]. • DAT-LDA. In our previous work [22], we proposed a Mashup service clustering method by using LDA model built from multiple data sources. • ICNC. The proposed clustering method in this paper, which integrates Mashup service’s content (descrip- tion, Web APIs and tags) and network by using a two- level topic model to cluster Mashup service. 4.2.3 Experimental Results We compare different approaches to evaluate their clus- tering performance in terms of precision, recall, purity and entropy, and investigate the impact of the parameter 𝑇 and 𝛽 on the clustering result in our ICNC approach. (1) Clustering Performance Comparison To study the clustering performance, we compare our method with other three baseline methods. As for ICNC, we choose optimal 𝑇 and 𝛽 for different sets of categories to achieve the best performance. A detailed investigation of 𝑇 and 𝛽 will be described in subsequent Section. Fig- ure 5 reports the comparisons on the ProgrammableWeb dataset when the number of categories (clusters) in Table III ranging from 5 to 20 with a step 5 (i.e. top- k’=5/10/15/20). The comparisons show that our ICNC significantly improves the clustering accuracy and out- performs all other baseline methods in terms of the mean values of precision, recall, purity and entropy. siders relationships among Mashup services and de- rives useful topics from Mashup service network, which significantly improves clustering accuracy. 
• The performance of all methods in top-10 categories surpasses to those of top-5, 15 and 20 categories in most cases. When the number of categories is small, an increase in the number of categories (e.g. from 5 to 10) improves clustering performance, since more Mashup services in these categories can be used to construct MSN for better clustering accuracy. However, when the number of categories continues to increase from 10 to 20, the clustering accuracy decreases. The reason for this is that additional categories only have fewer Mashup services with less linked relationships, which leads to a few isolated nodes in the MSN and therefore weakens the clustering accuracy. The observations in- dicate that it is important to choose an appropriate number of categories for clustering. (2) Impacts of 𝑇 and 𝛽 The experiments investigate the impact of parameter 𝑇 on Mashup service clustering in our ICNC. During the exper- iments, we select the best values of 𝛽 for different sets of categories (i.e. 𝛽=0.8 for top 5 and 10 categories, 𝛽=0.6 for top 15 and 20 categories). We change the value of 𝑇 from 0 to 1 with a step of 0.25, and obtain the mean values of preci- sion, recall, purity and entropy in Figure 6. The experi- mental results indicate that when we choose T=0.5 for top 5 and 10 categories and T=0.25 for top 15 and 20 catego- ries, the precision, recall, purity and entropy of them reach their peak values. When T increases from 0.5 to 1, the clustering performances at all different sets of catego- ries constantly decrease. The reason for this is a larger T leads to smaller MSN with high-similarity Mashup ser- vices, in which the number of incorporated Mashup ser- vices from network level to content level becomes smaller. In addition, when T decreases from 0.25 to 0, the cluster- ing performance at all different sets of categories also constantly decrease. 
Smaller T means a larger MSN with more low-similarity Mashup services, which may bring too much noise and decrease clustering performance. (a) Precision (c) Purity (b) Recall (d) Entropy 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 Top-5 Categories Top-10 Categories Top-15 Categories Top-20 Categories 0.25 0.5 0.75 1 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0 0.25 0.5 0.75 1 Top-5 Top-1 Categories 0 Categories Top-1 Top-2 5 Categories 0 Categories Fig. 5. Mashup Services Clustering Performance Comparison Specifically, we have the below observations: • The performance of K-Means is the worst in most cases. This is because K-Means only uses the term-based vec- tor space model to represent the functional features of Mashup services without considering latent semantic correlation behind the terms of Mashup service docu- ments. LDA, DAT-LDA and ICNC all exploit the LDA technique to mine latent functional topics from Mashup service documents to improve clustering ac- curacy. Compared with LDA, DAT-LDA shows the bet- ter performance since it simultaneously takes multiple data sources and a hybrid clustering algorithm into TT (a) Precision (b) Recall Top-5 Top-1 Categories 0 Categories Top-1 Top-2 5 Categories 0 Categories 0.8 0.75 0.7 0.65 0.6 0.55 Top-5 Categories Top-10 Categories Top-15 Categories Top-20 Categories 0 0.25 0.5 0.75 1 0.5 0 0.25 0.5 0.75 1 TT (c) Purity (d) Entropy Fig. 6. Impact of T on Mashup Service Clustering Results information. consideration. The two-level topic model of ICNC con- 1939-1374 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more Purity Precision Entropy Recall This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. 
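The four clustering metrics reported in Figures 5 and 6 (formulas (10) to (15) in Section 4.2.1) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `clustering_metrics`, the use of the natural logarithm in the entropy, and the choice to match each cluster Mi to its majority reference category SMi are all assumptions made here for the example.

```python
import math
from collections import Counter

def clustering_metrics(truth, pred):
    """truth[s] is the reference category of service s, pred[s] its assigned cluster.

    Returns (mean purity, mean entropy, {cluster: (precision, recall)}).
    Assumption: each cluster is matched to its majority reference category.
    """
    n = len(truth)                      # |MS|
    class_sizes = Counter(truth)        # |SM_j| for every reference category
    clusters = {}
    for t, p in zip(truth, pred):
        clusters.setdefault(p, Counter())[t] += 1   # n_i^j counts

    purity = entropy = 0.0
    per_cluster = {}
    for c, counts in clusters.items():
        size = sum(counts.values())                 # |M_i|
        label, hits = counts.most_common(1)[0]      # majority category and max_j n_i^j
        purity += hits / n                          # (|M_i|/|MS|) * (hits/|M_i|)
        ent = -sum((k / size) * math.log(k / size) for k in counts.values())
        entropy += (size / n) * ent                 # weighted by cluster size
        per_cluster[c] = (hits / size, hits / class_sizes[label])
    return purity, entropy, per_cluster

purity, entropy, per_cluster = clustering_metrics(
    ["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1])
print(purity, entropy, per_cluster)
```

In this toy input, cluster 0 is pure while cluster 1 mixes one "a" service with three "b" services, so the purity is below 1 and the entropy above 0.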
Figure 7 shows the impact of different values of β, from 0 to 0.8 with a step of 0.2, on Mashup service clustering in our ICNC. Similarly, we select the best values of T for the different sets of categories (i.e. T=0.5 for the top 5 and 10 categories, T=0.25 for the top 15 and 20 categories). The experimental results indicate that when we choose β=0.8 for the top 5 and 10 categories and β=0.6 for the top 15 and 20 categories, the mean values of precision, recall, purity and entropy reach their best values. As expected, when β is close to 0, the performance of our ICNC is similar to that of DAT-LDA, since the relationships among Mashup services are not considered at all. As β becomes larger, ICNC provides better clustering performance in most cases for the different sets of categories, which indicates that a larger β achieves better clustering performance when the linked Mashup services in the MSN are closely related to each other.

Fig. 7. Impact of β on Mashup Service Clustering Results: (a) Precision, (b) Recall, (c) Purity, (d) Entropy.

4.3 Web APIs Recommendation
To validate the effect of the implicit co-invocation relationship between Web APIs, built on top of the Mashup service clustering results of Section 4.2, on recommendation accuracy and diversity, we perform Web APIs recommendation experiments.

4.3.1 Evaluation Metrics
We evaluate the performance of the different recommendation methods in terms of accuracy and diversity, using metrics from information retrieval. Discounted Cumulative Gain at top R (DCG@R) is a measure of ranking quality, where R is the number of Web APIs in the recommendation list. It is often used to measure the accuracy of recommendation based on the relevance of the top-R recommended items in the recommendation list. The larger DCG@R is, the better the recommendation accuracy. It is defined as follows:

$$DCG@R = \sum_{i=1}^{R} \frac{2^{r(i)} - 1}{\log_2(1 + i)} \tag{16}$$

Here, i is the rank position in the top-R Web APIs recommendation list, and r(i) is the relevance score of the i-th rank position in the top-R recommendation list, $0 \le r(i) \le 1$. If the recommended Web API is used by the Mashup cluster, then r(i) is equal to the normalized popularity of the Web API; otherwise, it is the normalized predicted missing value of the Web API obtained by formula (9) in Section 3.3.3.

Moreover, we use the Hamming Distance (HMD) to evaluate the diversity of recommendation. The Hamming distance measures the quality of recommending diverse Web APIs for different Mashup requirements. A higher Hamming distance means better diversity. It is defined as follows:

$$HMD(m_i, m_j)@R = 1 - \frac{C(m_i, m_j)}{R} \tag{17}$$

Here, $C(m_i, m_j)$ represents the number of common Web APIs in the top-R recommended lists of Mashup service clusters $m_i$ and $m_j$, and R is the total length (the number of Web APIs) of the recommended list. If the Web APIs recommendation lists of $m_i$ and $m_j$ are identical, $HMD(m_i, m_j) = 0$; if there are no common Web APIs in their recommendation lists, $HMD(m_i, m_j) = 1$.

4.3.2 Baseline Methods
We compare our method with four baseline methods, built on top of the clustering methods in Section 4.2.2.
• PopR. This method recommends the Web APIs that are most popular in the Mashup cluster matching the user's Mashup requirement. The popularity of a Web API is measured by the number of Mashup services in the cluster that contain the Web API. It ranks and recommends the Web APIs with top-R popularity.
• KCF. This method first uses the K-Means algorithm to cluster Mashup services into different Mashup clusters (categories) according to the composite similarity [25], and then applies the item-based CF algorithm to rank and recommend the top-R Web APIs for Mashup clusters. The method has been adopted in research work [18].
• LCF. In this method, LDA is used to cluster Mashups according to the services' topic feature vectors [1] [20]. The item-based CF algorithm is employed to rank and recommend the top-R Web APIs for a Mashup cluster.
• DL-CF. In this method, DAT-LDA is applied to cluster Mashups based on the services' description, Web APIs and tags [22]. The item-based CF algorithm is used to rank and recommend the top-R Web APIs for a Mashup cluster.
• ICNC-CF. The method proposed in this paper. It considers the relationships among Mashup services, and integrates the Mashup service's content (description, Web APIs and tags) and network by using a two-level topic model to cluster Mashup services. It also considers the historical invocations between Mashup clusters and Web APIs, and exploits the item-based CF algorithm to rank and recommend diverse Web APIs (popular and unpopular) for users' Mashup requirements.

4.3.3 Experimental Results
Figure 8 reports the comparisons of experimental results when the number of categories in Table III ranges from 5 to 20 with a step of 5 (i.e. top-k'=5/10/15/20). The comparisons show that our ICNC-CF achieves better recommendation performance than all other baseline methods in all cases, in terms of the mean values of DCG and HMD with different numbers of recommended Web APIs (i.e. R=5/10/20/50). Specifically, we have the following observations:
• The recommendation accuracy (i.e. DCG) of ICNC-CF in Figure 8 (a), (b), (c) and (d) consistently and significantly outperforms DL-CF, LCF and KCF, and slightly surpasses PopR. The clustering method in ICNC-CF considers the relationships among Mashup services and exploits more useful topics for clustering, which significantly improves recommendation accuracy. Moreover, ICNC-CF and PopR use the same clustering approach, which makes the difference in their recommendation accuracy small. The slight advantage of ICNC-CF over PopR illustrates that popularity still plays a key role in Web APIs recommendation. Even so, our ICNC-CF recommends both popular and unpopular Web APIs for a Mashup cluster by using the CF algorithm.
• The recommendation diversity (i.e. HMD) of ICNC-CF in Figure 8 (e), (f), (g) and (h) significantly exceeds DL-CF, LCF, KCF and PopR. The HMD of PopR is the worst. This is because PopR only uses popularity to rank and recommend Web APIs for a Mashup service cluster, without considering the implicit co-invocation relationship between Web APIs, resulting in uniform recommendation lists with many common Web APIs. ICNC-CF, DL-CF, LCF and KCF all exploit the CF technique to mine the implicit co-invocation relationship between Web APIs to improve the diversity of recommendation. Compared with DL-CF, LCF and KCF, ICNC-CF shows better performance since it considers both the accuracy and the diversity of recommendation.
• The performance of all methods in the top-15 categories in Figure 8 is the worst, i.e. the lowest points of DCG and HMD for all methods appear when k'=15.
The reason is that the total number of common historical Web API invocation records for different Mashup services in the top-15 categories decreases. That is to say, compared to the top-5, 10 and 20 categories, the addition of the Mashup categories "Sports", "Telephony", "Blogging", "Reference" and "Electronic Signature" in Table III lowers the total number of common historical Web API invocation records, and thus decreases the accuracy of the personalized similarity between Web APIs, of the missing value prediction of the target Web API, and of the DCG measurement. Besides, the sparser the Web API invocation records are, the more common Web APIs appear in the top-R recommendation lists of Mashup service clusters. This directly leads to the decrease of HMD.

Fig. 8. Web APIs Recommendation Performance Comparison: (a)-(d) DCG, (e)-(h) HMD.

5 RELATED WORK
The related work is reviewed in three parts: service clustering, service recommendation and clustering-based service recommendation.

5.1 Service Clustering
Web service clustering technology plays an important role in service searching and effectively improves the quality of service discovery and recommendation [1, 26]. A number of research works have been done on it.
Service documents are a main information source for service clustering [27]. The functional feature vectors of a service are generally characterized as a term-based vector space model by processing the service document [19]. The similarity among services is measured by using similarity methods, such as cosine similarity. In addition, tags were considered as an important data source to boost the performance of service clustering. In particular, L. Chen et al.
used both WSDL documents and tags to cluster Web services, and proposed WTCluster [25] and WT-LDA [20], respectively. However, because these works use only a limited number of terms and tags for similarity measurement, they may result in unsatisfactory clustering accuracy.
Several recent works show a promising advancement in clustering accuracy through mining latent functional factors from service documents [28]. Factor analysis [29], topic models [21] and matrix factorization [3] are used to identify the latent functional factors and discover the implicit semantic correlation among service documents. In particular, Q. Yu et al. identified a suitable number of latent functional factors for large-scale service clustering [28]. These methods clearly boost service clustering performance. However, few of them recognize that service documents are related to each other in many ways instead of being independent [30], and none of them uses these relationships for service clustering.
The relationships among services can be used to derive useful topics for achieving better clustering accuracy. Z. Guo et al. [30] proposed a two-level topic model for knowledge discovery from citation networks, in order to discover latent topics and recommend important citations. J. Tang et al. [23-24] developed a topic model and integrated it into a random walk framework for academic search. Motivated by the above approaches and by our observation of the impact of relationships among Mashup services on their clustering, we develop a two-level topic model to mine more useful topics for better service clustering.
5.2 Service Recommendation
Many research works have been performed on service recommendation, which can be grouped into four categories: content-based [2], structure-based [11-12, 31], QoS-based [7-10], and hybrid methods [14-17].
First of all, content-based service recommendation mainly explores similarity matching between users' requirements and services to rank and recommend the services with the best match [2, 6]. The similarity is commonly measured by using the services' functionality descriptions. X. Liu et al. [2] proposed to use collaborative topic regression to recommend Web APIs.
Secondly, recommending Web APIs requires identifying components that fit the Mashup under composition from the structural perspective as well, not only from the content point of view. In particular, Q. Greenshpan et al. presented an efficient and intuitive Mashup development technique based on a novel autocompletion mechanism [11]. It uses similarities between the ways users glue together Mashup components and recommends possible completions for a user's partial Mashup specification, based on the syntactic inheritance of Mashlet components and GP compositions. The inheritance plays a central role in the autocompletion mechanism: a Mashlet is modeled as its set of input and output relations, and a GP is modeled as a graph of connections between the input and output relations of the Mashlets that it links. However, the inheritance relationship among Mashlets/GPs is currently not recorded in the ProgrammableWeb repository. Different from this work, the way Web APIs are glued together in our paper is clearly specified by the existing Web API compositions in common Mashups or derived from the historical invocations between Mashups and Web APIs. M. A. Soliman et al.
proposed a Mashup authoring and processing system, MashRank, based on concepts from rank-aware processing, probabilistic databases, and information extraction, to enable ranked Mashups of unstructured sources with uncertain ranking attributes [12].
Thirdly, Quality of Service (QoS) plays a very important role in Web service recommendation [7-10]. Z. Zheng et al. [7] presented a QoS-aware Web service recommendation approach, WSRec. K. Fletch et al. [8] proposed an elastic personalized nonfunctional attribute preference and trade-off based service recommendation. S. Wang et al. [9] explored reputation measurement and malicious feedback rating prevention in Web service recommendation systems. More importantly, the quality of the Mashup under construction needs to be considered. The recommended APIs should help the developers improve their Mashup applications. C. Cappiello et al. first defined a Mashup component quality model from the perspectives of API quality, data quality and presentation quality, and identified a set of related quality dimensions and metrics [35]. Then they designed a quality model for Mashups, which emphasizes the component-based nature of Mashups and focuses on composition quality such as added value, component suitability, component usage, consistency, and Mashup availability [36]. They also used quality as a driving factor for recommendation and developed a tool for Mashup design [37-38].
Hybrid service recommendation integrates the above methods to recommend services. Kang et al. [14] incorporated functional interest, QoS preference, and diversity features to recommend top-k diversified Web services. Y. Zhong et al. [15] explored service evolution, collaborative filtering and content matching for time-aware service recommendation. C. Li et al. [16] used a relational topic model to recommend Web APIs in Mashup creation. L. Yao et al.
[17] first proposed a novel hybrid Web service recommendation method using a three-way aspect model, and then explored both the explicit textual similarity and the implicit correlation of APIs to make API recommendations [5].

5.3 Clustering-based Service Recommendation
Although clustering and recommendation are two well-known techniques in service-oriented computing, they are usually regarded as two independent processes [32], which may result in poor and undiversified recommendation results. Currently, few researchers have addressed this problem and incorporated service clustering into WSDL/SOAP-based Web service recommendation [32-34]. Most recently, W. Gao et al. [18] categorized existing Mashups into functionally similar clusters, and then recommended Web APIs for each Mashup cluster using a manifold ranking algorithm. That work used TF-IDF to measure the similarity between service documents, without considering the latent semantic correlation behind the terms of service documents. Xia et al. [1] proposed a category-aware API clustering and distributed recommendation method. They improved the similarity measurement in service clustering to provide a basis for Web APIs recommendation by using an extended K-Means algorithm based on LDA. This method only used independent Mashup services for clustering, without considering the relationships among them. L. Yao et al. [5] investigated the historical invocation relations between Web APIs and Mashups to infer the implicit correlations among Web APIs, and incorporated the correlations into a matrix factorization model for service recommendation. The diversity of recommendation is not addressed in that work. In fact, Mashups with similar functionality can be grouped into a cluster, so the invocation history between Mashups and Web APIs can be lifted to the level of Mashup clusters and Web APIs [18]. Consequently, the sparsity is largely relieved, and more implicit correlations among Web APIs can be mined to recommend diverse Web APIs.
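The sparsity argument at the end of this subsection can be illustrated with a small sketch that lifts a Mashup-level invocation matrix to the cluster level. The function names, the toy data, and the normalization (the share of a cluster's Mashups that invoke an API) are assumptions made for this illustration; the paper only states that the cluster-level values are normalized popularities between 0 and 1.

```python
def cluster_invocation_matrix(invocations, assignment, n_clusters):
    """invocations: 0/1 rows (Mashup x Web API); assignment[s]: cluster id of Mashup s."""
    n_apis = len(invocations[0])
    sizes = [0] * n_clusters
    totals = [[0] * n_apis for _ in range(n_clusters)]
    for row, c in zip(invocations, assignment):
        sizes[c] += 1
        for j, used in enumerate(row):
            totals[c][j] += used          # total invocations of API j inside cluster c
    # Assumed normalization: fraction of the cluster's Mashups that invoke the API.
    return [[t / sizes[c] if sizes[c] else 0.0 for t in totals[c]]
            for c in range(n_clusters)]

def sparsity(matrix):
    """Fraction of zero entries in a matrix given as a list of rows."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)

# Hypothetical toy data: four Mashups, four Web APIs, two clusters.
mashup_level = [[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 1, 1]]
cluster_level = cluster_invocation_matrix(mashup_level, [0, 0, 1, 1], 2)
print(sparsity(mashup_level), sparsity(cluster_level))
```

In this toy case the share of empty cells drops once rows are merged per cluster, which is the effect the paragraph above relies on; how large the drop is on real data depends on the clustering.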
6 DISCUSSION
In our paper, the ways Web APIs are combined include: (1) the existing Web API compositions in common Mashups; (2) the implicit correlation of Web APIs derived from the historical invocations between Mashup clusters and Web APIs. Useful topics are derived from the constructed service network to calculate the similarity between Mashup services. Mashup services with similar functionalities are clustered. A list of similar Web APIs in Mashup service clusters is recommended for Mashup development. As a result, the accuracy of Web APIs recommendation is significantly improved. On the other hand, the implicit correlation of Web APIs is exploited to rank and recommend a list of diverse Web APIs for Mashup clusters. The total frequency of Web APIs within Mashup clusters is used to identify a set of similar neighbors and to predict the missing values of Web APIs. The frequency or prediction values of Web APIs are ranked. A diverse set of Web APIs is recommended for the given Mashup cluster. In this way, the diversity of Web APIs recommendation is also achieved.
C. Cappiello et al. [39] analyzed the most popular Mashups published on ProgrammableWeb and identified the master and slave roles of components in the Mashup under construction. The master component is the one the user interacts with the most. The slave component's behavior depends on the master component.
As mentioned in Section 1, the top 200 most popular Web API invocations in Mashups cover about 99% of all Web API invocations in Mashups. Only a few popular Web APIs are invoked by many Mashups. The popular Web APIs can be considered master components, while the unpopular or even unused Web APIs can be considered slave components. When Web APIs are selected for Mashup development, master components with higher popularity are preferred among all matching Web API components with similar functionality. The slave components are ranked and recommended by integrating their functional similarity with their popularity or prediction values, so that they can work with the master component to provide a composable set of Web APIs.

The compatibility of recommended Web APIs with the Web APIs already in a Mashup should be considered. It represents the possibility of composing a recommended Web API with those already included in the Mashup. Unlike SOAP/WSDL-based Web services, RESTful Web APIs do not have a standard description of service capability, which makes Mashup construction harder than traditional Web service composition, where syntactic and semantic matching between input/output parameters can be used. Mashup construction is also not trivial because the names of Web APIs’ input/output variables are not always meaningful or uniform [11]. Inconsistency or incompatibility between Web APIs may cause Mashups to fail. This type of composition is not an automatic process and needs manual intervention or assistance. A user-driven inconsistency resolution technique may be a good solution [39].

Finer-grained (for example, API function-level) Mashup construction can be further explored. Additional service composition relationships can be mined and used to build novel Mashup applications based on users’ personal requirements.
However, due to the incompatibility or inconsistency between API functions (for example, different protocols, languages, and data formats, or different data types between the input/output parameters exposed by APIs [37]), API function-level Mashup construction may be inaccurate or may even fail. In this paper, only Web API-level Mashup construction is considered. The selection of specific API functions depends on users’ personal requirements. It is suggested to identify the “right” APIs by considering the technological, syntactic, and semantic compatibility of each API within the Mashup construction [38]. A type of API interface has been proposed to specify how API functions interact with each other, as well as the composition rules [11].

Quality-driven Mashup development is an attractive direction. Recently, several works [35-38] have concentrated on the quality of Mashups and their components to help a Mashup composer or user in the selection and recommendation of components and compositions. Additional in-depth research on quality-driven Mashup development can be performed, such as further investigation of metrics for quality attributes and of optimal Web API composition under quality constraints.

7 CONCLUSIONS AND FUTURE WORK

This paper presents an integrated content and network-based service clustering and Web APIs recommendation method for Mashup development. A two-level topic model based on an integration of service content and network is developed to mine topics for more accurate service clustering. In the model, via two random walk processes, novel topics are derived from the linked Mashup documents at the network level and incorporated into the topic probability distribution of the original Mashup service documents at the content level. Moreover, a CF-based Web APIs recommendation algorithm is proposed to recommend diverse Web APIs for Mashup clusters by inferring and using the invocation history between Mashup clusters and Web APIs.
The comparative experiments performed on the ProgrammableWeb dataset demonstrate the effectiveness of the proposed method and show that it significantly improves the accuracy and diversity of Web APIs recommendation. In future work, we will focus on one or more of the topics mentioned in Section 6 to facilitate and improve Mashup development.

ACKNOWLEDGMENT

The work was supported by the National Natural Science Foundation of China under grant No. 61572371, 61572186, 61572187, 61402167, 61402168, and by SKLSE of China (Wuhan University) under grant No. SKLSE2014-10-10.

REFERENCES

[1] B. Xia, Y. Fan, W. Tan, K. Huang, J. Zhang, and C. Wu. Category-aware API Clustering and Distributed Recommendation for Automatic Mashup Creation. IEEE Transactions on Services Computing, 8(5): 674-687, 2015.
[2] X. Liu and I. Fulia. Incorporating User, Topic, and Service Related Latent Factors into Web Service Recommendation. ICWS2015, pp. 185-192.
[3] Z. Zheng, H. Ma, M. Lyu, and I. King. Collaborative Web Service QoS Prediction via Neighborhood Integrated Matrix Factorization. IEEE Transactions on Services Computing, 6(3): 289-299, 2013.
[4] W. Xu, J. Cao, L. Hu, J. Wang, and M. Li. A Social-Aware Service Recommendation Approach for Mashup Creation. ICWS2013, pp. 107-114.
[5] L. Yao, X. Wang, Q. Sheng, W. Ruan, and W. Zhang. Service Recommendation for Mashup Composition with Implicit Correlation Regularization. ICWS2015, pp. 217-224.
[6] L. Liu, F. Lecue, and N. Mehandjiev. Semantic content-based recommendation of software services using context. ACM Transactions on the Web, 7(3): 17-20, 2013.
[7] Z. Zheng, H. Ma, M. Lyu, and I. King. Qos-aware web service recommendation by collaborative filtering. IEEE Transactions on Services Computing, 4(2): 140-152, 2011.
[8] K. Fletcher, F. Liu, and M. Tang. Elastic Personalized Nonfunctional Attribute Preference and Trade-off based Service Selection. ACM Transactions on the Web, 9(1): 1-26, 2015.
[9] S. Wang, Z. Zheng, Z. Wu, M. Lyu, and F. Yang. Reputation Measurement and Malicious Feedback Rating Prevention in Web Service Recommendation System. IEEE Transactions on Services Computing, 5(8): 755-767, 2015.
[10] M. Tang, Y. Jiang, J. Liu, and X. Liu. Location-Aware Collaborative Filtering for QoS-Based Service Recommendation. ICWS2012, pp. 202-209.
[11] O. Greenshpan, T. Milo, and N. Polyzotis. Autocompletion for Mashups. VLDB2009, pp. 538-549.
[12] M. A. Soliman, I. F. Ilyas, and M. Saleeb. Building Ranked Mashups of Unstructured Sources with Uncertain Information. VLDB2010, pp. 826-837.
[13] B. Cao, X. Liu, B. Li, J. Liu, M. Tang, T. Zhang, and M. Shi. Mashup Service Clustering Based on an Integration of Service Content and Network via Exploiting a Two-Level Topic Model. ICWS2016, pp. 212-219.
[14] G. Kang, M. Tang, J. Liu, X. Liu, and B. Cao. Diversifying Web Service Recommendation Results via Exploring Service Usage History. IEEE Transactions on Services Computing, 9(4): 566-579, 2016.
[15] Y. Zhong, Y. Fan, K. Huang, W. Tan, and J. Zhang. Time-Aware Service Recommendation for Mashup Creation in an Evolving Service Ecosystem. ICWS2014, pp. 25-32.
[16] C. Li, R. Zhang, J. Huai, and H. Sun. A Novel Approach for API Recommendation in Mashup Development. ICWS2014, pp. 289-296.
[17] L. Yao, Q. Z. Sheng, H. H. Ngu, Y. Yu, and A. Segev. Unified Collaborative and Content-based Web Service Recommendation. IEEE Transactions on Services Computing, 8(3): 453-466, 2015.
[18] W. Gao, L. Chen, J. Wu, and H. Gao. Manifold-learning based API Recommendation for Mashup Creation. ICWS2015.
[19] B. Cao, J. Liu, Z. Zheng, and G. Wang. Mashup Service Recommendation based on User Interest and Social Network. ICWS2013, pp. 99-106.
[20] L. Chen, Y. Wang, Q. Yu, Z. Zheng, and J. Wu. WT-LDA: User Tagging Augmented LDA for Web Service Clustering. ICSOC2013, pp. 162-176.
[21] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993-1022, 2003.
[22] B. Cao, X. Liu, J. Liu, and M. Tang. Effective Mashup Service Clustering Method by Exploiting LDA Topic Model from Multiple Data Sources. APSCC2015, pp. 165-180.
[23] J. Tang, R. Jin, and J. Zhang. A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search. ICDM2008, pp. 1055-1060.
[24] Z. Yang, J. Tang, J. Zhang, J. Li, and B. Gao. Topic-level Random Walk through Probabilistic Model. APWeb2009, pp. 162-173.
[25] L. Chen, L. Hu, J. Wu, Z. Zheng. WTcluster: utilizing tags for web service clustering. ICSOC2011, pp. 204-218.
[26] K. Elgazzar, A. Hassan, and P. Martin. Clustering WSDL Documents to Bootstrap the Discovery of Web Services. ICWS2010, pp. 147-154.
[27] L. Chen, Q. Yu, P. Yu, and J. Wu. WS-HFS: A Heterogeneous Feature Selection Framework for Web Services Mining. ICWS2015, pp. 193-200.
[28] Q. Yu, H. Wang, and L. Chen. Learning Sparse Functional Factors for Large-scale Service Clustering. ICWS2015, pp. 201-208.
[29] C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
[30] Z. Guo, Z. Zhang, S. Zhu, Y. Chi, and Y. Gong. A Two-level Topic Model Towards Knowledge Discovery from Citation Networks. IEEE Transactions on Knowledge and Data Engineering, 26(4): 780-794, 2014.
[31] K. Huang, Y. Fan, W. Tan, and X. Li. Service Recommendation in an Evolving Ecosystem: A Link Prediction Approach. ICWS2013, pp. 507-514.
[32] Y. Zhou, L. Liu, C.S. Perng, et al. Ranking Services by Service Network Structure and Service Attributes. ICWS2013, pp. 26-33.
[33] D. Skoutas, D. Sacharidis, and T. Sellis. Ranking and clustering web services using multicriteria dominance relationships. IEEE Transactions on Services Computing, 3(3): 163-177, 2010.
[34] Y. Xu, J. Yin, and Y. Li. A collaborative framework of web service recommendation with clustering-extended matrix factorisation. IJWGS, 12(1): 1-25, 2016.
[35] C. Cappiello, F. Daniel, and M. Matera. A Quality Model for Mashup Components. ICWE2009, pp. 236-250.
[36] C. Cappiello, F. Daniel, A. Koschmider, M. Matera, and M. Picozzi. A Quality Model for Mashups. ICWE2011, pp. 137-151.
[37] M. Picozzi, M. Rodolfi, C. Cappiello, and M. Matera. Quality-based Recommendations for Mashup Composition. ICWE2010, pp. 360-371.
[38] C. Cappiello, M. Matera, M. Picozzi, F. Daniel, and A. Fernandez. Quality-aware Mashup Composition: Issues, Techniques and Tools. QUATIC2012, pp. 10-19.
[39] C. Cappiello, F. Daniel, M. Matera, and C. Pautasso. Information Quality in Mashups. IEEE Internet Computing, 14(4): 14-22, 2010.

Buqing Cao is currently an associate professor in the School of Computer Science and Engineering, Hunan University of Science and Technology, China. His current interests include service computing and software engineering. He worked as a postdoctoral researcher in the Department of Computer Science and Computer Engineering at the University of Arkansas in Fayetteville, USA, from March 2015 to March 2016.

Xiaoqing (Frank) Liu is currently a professor and department head and holds the Rodger S. Kline endowed leadership chair in the Department of Computer Science and Computer Engineering at the University of Arkansas in Fayetteville, USA. His current interests include software engineering, service computing, web-based argumentation, and intelligent systems. He has published more than 120 refereed papers in numerous journals and conferences, such as TWeb, TSC, SPIP, SQJ, ICSE, JSS, ICWS and CSCW. He received his PhD in computer science from Texas A&M University in College Station in 1995.

Md Mahfuzer Rhaman is currently a PhD candidate in the Department of Computer Science and Computer Engineering at the University of Arkansas in Fayetteville, USA. He received his BSc degree from Bangladesh University of Engineering and Technology in 2014, and his research interests include service discovery and recommendation.

Bing Li is currently a professor in the software engineering department of the International School of Software at Wuhan University. His current interests include software engineering and service computing. He has published more than 80 papers in well-known conferences and journals. He received his Ph.D. degree from the School of Computer Science, Huazhong University of Science and Technology, in 2003.

Jianxun Liu is now a professor in the School of Computer Science and Engineering at the Hunan University of Science and Technology. His current interests include service computing and cloud computing. He has published more than 100 papers in well-known conferences and journals. He received his Ph.D. degree in computer science from Shanghai Jiao Tong University in 2003.

Mingdong Tang is now a professor in the School of Computer Science and Engineering, Hunan University of Science and Technology. His current interests include service computing and social networks. He has published more than 60 papers in well-known conferences and journals. He received his Ph.D. degree in computer science from the ICT of the Chinese Academy of Sciences, China, in 2009.
