Abstract— Wikidata is the large-scale knowledge base of the Wikimedia Foundation that can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which imposes high demands on its integrity [1]. However, since Wikidata is open and free and can be edited by any person or machine, vandalized and falsified information can be introduced and spread to downstream consumers. It is therefore essential to have a scheme for detecting invalid data. The WSDM Cup 2017 Vandalism Detection Challenge focuses on exactly this problem: given a Wikidata revision, compute a vandalism score denoting the likelihood that the revision is vandalism. In this paper, we present our solution to the challenge using [method name]; with our approach we achieve a ROC-AUC of [result 0.xxx] on the test data.
Keywords — Vandalism; Trust; Data Quality; [others]
Introduction
Wikipedia and its sister projects grow at a rate of over 10 edits per second, performed by editors from all over the world. Currently, the English Wikipedia includes more than 5.5 million articles and gains, on average, 800 new articles per day [2]. These huge amounts of information are created, deleted, and edited through the collaboration of volunteers with widely different backgrounds. A question therefore arises naturally: how do we know whether these data are trustworthy? For example, some people may use the platform to spread rumours intentionally; others may make mistakes due to a lack of knowledge. The problem we face is thus vandalism detection.
Problem Introduction
The goal of the vandalism detection task is to detect vandalism in near real time, as soon as it happens. The organizers provide the Wikidata Vandalism Corpus 2016 as training data, which our group uses to train our model. The following rules apply: (i) the use of any additional data newer than the provided training data is forbidden, i.e., no data newer than February 29, 2016; (ii) publicly available external data sources concerning geographical information, demographic information, natural language processing, etc., are allowed, but such data must not relate to the specific revision labels (vandalism vs. regular) [1]. After the software has been trained, it is tested on the two months of data succeeding the training and validation datasets. For each Wikidata revision in the test corpus, the software shall output a vandalism score in the range [0, 1]. Finally, the evaluation is based on the ROC-AUC score; for informational purposes, further measures such as PR-AUC and the runtime of the software may also be reported.
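To make the required output and its evaluation concrete, the following is a minimal sketch of the scoring and offline evaluation step. The names model, revisions, and features are hypothetical stand-ins (any classifier exposing a scikit-learn-style predict_proba, the test revision IDs, and the corresponding feature matrix); the actual features and classifier are described in the methods section. ROC-AUC is computed with scikit-learn's roc_auc_score.

    import csv
    from sklearn.metrics import roc_auc_score

    def score_revisions(model, revisions, features, out_path):
        """Write one vandalism score in [0, 1] per test revision.

        model, revisions, and features are hypothetical stand-ins for
        a trained classifier with predict_proba, the revision IDs, and
        the corresponding feature matrix.
        """
        # Probability of the positive (vandalism) class, one per revision.
        scores = model.predict_proba(features)[:, 1]
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["REVISION_ID", "VANDALISM_SCORE"])
            for rev_id, s in zip(revisions, scores):
                writer.writerow([rev_id, float(s)])
        return scores

    # Offline evaluation on held-out labeled data (1 = vandalism, 0 = regular):
    # roc_auc = roc_auc_score(y_true, scores)

Because ROC-AUC is rank-based, only the ordering of the scores matters, so any monotonic scoring function in [0, 1] is admissible.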
Methods description
Step 1,2,3,4,5…
Experiment design and evaluation
Data & graphs
Conclusions
In this paper, we describe our approach to the WSDM Cup 2017 Vandalism Detection Challenge.
References
[1] WSDM Cup 2017, "Vandalism Detection." [Online]. Available: http://www.wsdm-cup-2017.org/vandalism-detection.html
[2] Wikipedia, "Wikipedia:Statistics." [Online]. Available: https://en.wikipedia.org/wiki/Wikipedia:Statistics#cite_note-1