Review on Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features

Keywords: Content Vandalism, Wikipedia Vandalism, Wiki Vandalism, Spam Filtering, Anomaly Detection, Machine Learning, Text Mining

Wikipedia is a publicly available collection of wiki pages that anyone on the internet can edit. This openness makes vandalism, that is, edits made in bad faith, likely. The paper proposes an integrated model that combines spatio-temporal analysis of metadata, reputation-based systems such as WikiTrust, and natural language processing techniques, and claims that it outperforms contemporary techniques. Contemporary approaches include bots that use rule-based checks to identify vandalism, as well as statistical and machine learning techniques. The model draws on features from the metadata of the edit, the text of the edit, the reputation of the editor, and the language, and is reported to outperform state-of-the-art models. Vandalism detection can be either immediate or historical.

The main idea of the paper is that combining multiple models outperforms any individual model. By combining the features used in various models, the authors achieve an AUC of 0.96 for immediate detection and an AUC of 0.97 on historical edits.

Strengths:

The paper gives a detailed explanation of the features and an extensive survey of related work. Reporting AUC is a good choice compared to the F1 score or other metrics, since AUC does not depend on a single classification threshold. The paper also establishes a relation between reverts and vandalism: reverted edits are more likely to be vandalism.
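To illustrate the point about AUC versus F1, here is a minimal sketch in plain Python. The labels and scores are hypothetical toy values, not data from the paper, and the helper names `auc` and `f1_at` are introduced only for this example. It shows that AUC is computed from the ranking of scores alone, while F1 changes with the threshold chosen to turn scores into decisions.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(labels, scores, threshold):
    """F1 score when every edit scoring >= threshold is flagged as vandalism."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 1 = vandalism, 0 = regular edit.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

print("AUC:", auc(labels, scores))
print("F1 at threshold 0.50:", f1_at(labels, scores, 0.50))
print("F1 at threshold 0.35:", f1_at(labels, scores, 0.35))
```

On this toy data the two F1 values differ even though the classifier scores are identical, whereas the AUC is a single number for the whole ranking.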

Weaknesses:

Neither the correlation among features nor feature importance is evaluated. Some features may be highly correlated; dropping them could improve performance. The paper does not mention which machine learning algorithm is used. The paper also lacks novelty: it recommends combining all existing approaches, which in my opinion is not a great contribution.
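The kind of correlation check meant here can be sketched as follows. This is an illustrative example only: the feature names and values are hypothetical and do not come from the paper, and the 0.95 cutoff is an arbitrary assumption. Whether pruning actually improves performance depends on the learning algorithm; for some models it mainly reduces redundancy and training cost.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(features, threshold=0.95):
    """Greedily keep each feature unless it correlates strongly with a kept one."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns, one value per edit.
features = {
    "edit_size_chars": [10, 200, 35, 400, 50],
    "edit_size_bytes": [12, 205, 38, 410, 52],    # nearly duplicates the above
    "editor_reputation": [0.9, 0.1, 0.8, 0.7, 0.2],
}

print(drop_correlated(features))
```

Here the byte-count column is dropped because it is almost a linear copy of the character-count column, while the reputation column is kept.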

Conclusion:

This paper provides good insight into the severity of vandalism and its effect on Wikipedia. The correlation it draws between the reversion of an edit and vandalism is an interesting and unique observation. The paper is elaborate in explaining the features; however, it fails to mention the machine learning algorithm used, and it does not discuss the dataset or the feature extraction techniques.