Kode_Stylers: Author Identification through Naturalness of Code: An Ensemble Approach

Abstract

Authorship identification plays an important role in detecting deceptive misuse of others’ content and in exposing the authors of anonymous harmful content. The Authorship Identification of SOurce COde (AI-SOCO) competition was held to investigate this task. Our team, Kode_Stylers, participated in the competition and used the naturalness of code as the key to our solution. In this working note, we (i) present methods to obtain features such as tokenization, N-gram TF-IDF, warning messages, and coding styles, (ii) implement our framework using Random Forest and Transformer models to classify authors from these features, and (iii) apply an ensemble approach to improve the performance of our solutions. The results suggest that authorship can be identified from features extracted from source code, using the selected classifiers, with an accuracy of up to 0.82, and that the ensemble model outperforms any single model.
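
To make the feature and ensemble ideas concrete, here is a minimal sketch in plain Python of token N-gram TF-IDF weighting and of combining several models' outputs. It is not the team's implementation: the function names (tokenize, ngrams, tfidf, soft_vote), the regular-expression tokenizer, the smoothed IDF formula, and the use of probability averaging (soft voting) as the ensemble rule are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Hypothetical tokenizer: identifiers, integer literals, then any
    # single non-whitespace character (operators, brackets, punctuation).
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens, n):
    # Contiguous token N-grams, joined with spaces so they can be dict keys.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per source file in docs."""
    counts = [Counter(ngrams(tokenize(d), n)) for d in docs]
    # Document frequency: in how many files each N-gram appears.
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency
            # (an assumed variant; other IDF formulas are equally valid).
            g: (k / length) * (math.log((1 + total) / (1 + df[g])) + 1)
            for g, k in c.items()
        })
    return vectors


def soft_vote(predictions):
    """Average per-author probabilities from several models and
    return the author with the highest mean probability."""
    totals = Counter()
    for probs in predictions:
        for author, p in probs.items():
            totals[author] += p / len(predictions)
    return max(totals, key=totals.get)
```

In a full pipeline, vectors like these would be fed to a classifier such as a Random Forest, and soft_vote would receive one probability dictionary per trained model for the same source file.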

Publication
2020 12th meeting of Forum for Information Retrieval Evaluation