Review of Detecting Wikipedia Vandalism using WikiTrust
Keywords: Content Vandalism, Wikipedia Vandalism, Wiki Vandalism, Spam Filtering, Anomaly Detection, Machine Learning, Text Mining, Reputation-based Filtering Systems
This paper presents Wikipedia vandalism detection using a trust-based system called WikiTrust. The solution aligns with our area of interest, i.e., vandalism detection. Wikipedia itself uses the features computed by WikiTrust. Machine learning algorithms are applied to Wikipedia edit metadata along with author and content reputation to detect vandalism. WikiTrust works well for both online and historical vandalism detection.
The features generated by WikiTrust capture the latent patterns of vandalism using the edit metadata alone. However, they ignore the influence of the author and the patterns in their editing preferences. In our approach, the relation between an edit and an article is modelled as a bipartite graph. Using graph deep learning methods, node embeddings can be generated that capture both structural and hand-crafted features. These embeddings could therefore be used for better identification of vandalism, as sketched below.
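A minimal sketch of the bipartite modelling idea, assuming a list of hypothetical (editor, article) edit pairs and illustrative hand-crafted features; the truncated-SVD spectral embedding stands in for a full graph deep learning model, and all names and values are assumptions for illustration:

```python
# Sketch: model editor-article interactions as a bipartite graph and derive
# node embeddings that combine graph structure with hand-crafted features.
# The truncated-SVD embedding is a stand-in for a learned GNN embedding.
import numpy as np
import networkx as nx
from sklearn.decomposition import TruncatedSVD

# Hypothetical (editor, article) edit pairs extracted from edit metadata.
edits = [("editor_1", "article_A"), ("editor_1", "article_B"),
         ("editor_2", "article_A"), ("editor_3", "article_C")]

G = nx.Graph()
editors = sorted({e for e, _ in edits})
articles = sorted({a for _, a in edits})
G.add_nodes_from(editors, bipartite=0)
G.add_nodes_from(articles, bipartite=1)
G.add_edges_from(edits)

# Structural embeddings derived from the adjacency matrix.
nodes = editors + articles
A = nx.to_numpy_array(G, nodelist=nodes)
struct_emb = TruncatedSVD(n_components=2, random_state=0).fit_transform(A)

# Hand-crafted per-node features (e.g. edit count, a reputation score); values illustrative.
hand_crafted = np.array([[G.degree(n), 0.5] for n in nodes])

# Final node representation: structural embedding concatenated with hand-crafted features.
node_emb = {n: np.concatenate([struct_emb[i], hand_crafted[i]])
            for i, n in enumerate(nodes)}
print(node_emb["editor_1"])
```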
This work is mainly concerned with how to extract features from edit metadata and how, given the revision content and the current edit metadata, author and content reputation scores can be computed. It uses the C4.5 decision tree algorithm to compute the probability of vandalism for each revision.
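The paper uses C4.5; scikit-learn does not ship C4.5 itself, so the sketch below approximates that step with an entropy-criterion DecisionTreeClassifier on hypothetical reputation/metadata features, just to show how a per-revision vandalism probability could be obtained:

```python
# Sketch: approximate the C4.5 step with an entropy-criterion decision tree.
# Feature columns (illustrative): [author_reputation, content_reputation, comment_length]
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([[0.90, 0.80, 40],   # regular edits
                    [0.70, 0.90, 25],
                    [0.10, 0.20, 0],    # vandalistic edits
                    [0.05, 0.10, 3]])
y_train = np.array([0, 0, 1, 1])        # 1 = vandalism

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# Probability that a new revision is vandalism (class 1).
new_revision = np.array([[0.2, 0.3, 5]])
p_vandalism = clf.predict_proba(new_revision)[0, 1]
print(f"P(vandalism) = {p_vandalism:.2f}")
```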
This study uses the PAN Wikipedia 2010 workshop dataset to train and evaluate the tool. In our approach, we use the same dataset to train and evaluate our graph-based solution.
| Method | Precision | Recall |
|---|---|---|
| WikiTrust | 48.5% | 83.5% |
| Our Approach | 61.0% | 64.0% |
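For reference, precision and recall in the table are the standard edit-level metrics; a small sketch with purely illustrative labels and predictions shows how precision penalises false positives while recall penalises missed vandalism:

```python
# Sketch: precision/recall as reported above, on illustrative labels.
# Precision = TP / (TP + FP)  -> penalises false positives
# Recall    = TP / (TP + FN)  -> penalises missed vandalism
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = vandalism (gold annotation)
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # classifier output

print("precision:", precision_score(y_true, y_pred))  # 2/3 ~= 0.67
print("recall:   ", recall_score(y_true, y_pred))     # 2/4 = 0.50
```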
Our approach does better at avoiding false positives, as reflected in its higher precision. However, since deep learning techniques need large amounts of data to model the edit and article interactions, our model underperforms in recall when compared to WikiTrust, which is based on static features.
The current state-of-the-art techniques include the use of bots with regexes and rules. These bots have 100% precision but very poor recall. Hence, the revisions identified as vandalistic are further verified by human reviewers. The issue of subjective reviews remains when human reviewers are involved. We therefore propose to model this interaction as a graph problem and address it there.
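As an illustration of the rule-based baseline (these patterns are assumptions for the sketch, not any particular bot's actual ruleset), a minimal regex filter that flags only blatant vandalism, which is why such filters are precise but miss subtler cases:

```python
# Sketch: a rule/regex filter in the spirit of vandalism bots (illustrative
# patterns only). Very precise on blatant cases, but silent on subtle
# vandalism -> high precision, low recall.
import re

BLATANT_PATTERNS = [
    re.compile(r"(.)\1{9,}"),                 # long character repetition, e.g. "!!!!!!!!!!"
    re.compile(r"\b[A-Z]{12,}\b"),            # long all-caps shouting
    re.compile(r"\b(stupid|idiot)\b", re.I),  # crude insult word list
]

def is_blatant_vandalism(added_text: str) -> bool:
    """Flag an edit only if it matches an obvious pattern."""
    return any(p.search(added_text) for p in BLATANT_PATTERNS)

print(is_blatant_vandalism("THISPAGEISTRASH!!!!!!!!!!"))        # True  (caught)
print(is_blatant_vandalism("Subtly changed the date to 1993"))  # False (missed)
```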