bibliography
There are ■ journal articles, ● conference proceedings, and ○ working notes.
2018
-
UniNE at CLEF 2018: Author Masking: Notebook for PAN at CLEF 2018
Mirco Kocher & Jacques Savoy.
In Cappellato, L., Ferro, N., Nie, J.Y., & Soulier, L. (Eds), CLEF 2018 Labs Working Notes, Avignon, France, September 10-14, 2018, Aachen: CEUR. Paper
This paper describes and evaluates an author masking model to obfuscate the writer of a document. The suggested strategy works in English with different text genres (e.g., essays, novels, poems) and various text sizes (e.g., from fewer than 500 to 4,000 tokens). The approach mainly focuses on retaining high soundness and sensibleness in the obfuscated texts with a reduced set of modifications. To improve safety, rules with a high probability of correctness are applied by attacking the feature frequencies. Depending on the writing style in the comparable documents of an author, a feature is either increased or decreased in the masked text. The evaluations are based on 205 training and 464 test problems (PAN Author Obfuscation task at CLEF 2018). -
Text Clustering with Styles.
Mirco Kocher. Thesis to obtain the Doctor of Philosophy in Computer Science.
University of Neuchâtel, computer science department.
Jury: Prof. J. Savoy, Prof. F. Crestani, Prof. P. Rosso, and Dr. V. Schiavoni. Thesis
★ Best PhD Thesis at IIUN - JAACS Award 2018 for excellent work in computer science
This thesis mainly describes the author clustering problem where, based on a set of n texts, the goal is to determine the number k of distinct authors and regroup the texts into k classes according to their author. We iteratively build a stable and simple model for text clustering with styles. We start by designing a measure reflecting the (un)certainty of the proposed decision such that every decision comes with a confidence of correctness instead of only a single answer. Afterwards, we link those pairs of texts where we see an indication of shared authorship and have enough evidence that the same person has written them. Finally, after checking whether every text tuple can be linked, we build the final clusters based on a strategy using a distance of probability distributions. Employing a dynamic threshold, we can choose the smallest relative distance values to detect a common origin of the texts. The most similar observations (or the category with the smallest distance) to the sample in question usually determine the proposed answers. We test multiple inter-textual distance functions in theoretical and empirical tests and show that the Tanimoto and Matusita distances respect all theoretical properties. Both of them perform well in empirical tests, but the Canberra and Clark measures are even better suited even though they do not fulfill all the requirements. Our model can choose the characteristics that are the most relevant to the text in question and can analyze the author adequately. We apply our systems in various natural languages belonging to a variety of language families and in multiple text genres. -
Evaluation of Text Representation Schemes and Distance Measures for Authorship Linking.
Mirco Kocher & Jacques Savoy.
Digital Scholarship in the Humanities, June 2018. Paper
Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. To achieve this, various text representation strategies can be applied, such as characters, punctuation symbols, or letter n-grams as well as words, lemmas, Part-Of-Speech (POS) tags, and sequences of them. To estimate the stylistic distance (or similarity) between two text excerpts, different measures have been suggested based on the L1 norm (e.g., Manhattan, Tanimoto), the L2 norm (e.g., Matusita), the inner product (e.g., Cosine), or the entropy paradigm (e.g., Jeffrey divergence). Three corpora, extracted from French and English literature, have been evaluated using standard methodology. No systematic difference can be found between token- or lemma-based text representations. Simple POS tags do not provide an effective solution but short sequences of them form a good text representation. Letter n-grams (with n = 4 to 6) give high precision rates. As distance measures, this study found that the Tanimoto, Matusita, and Clark distance measures perform better than the often-used Cosine function. Finally, applying a pruning procedure (e.g., culling terms appearing once or twice or limiting the vocabulary to the 500 most frequent words) reduces the representation complexity and might even improve the effectiveness of the attribution scheme. -
Distributed Language Representation for Authorship Attribution.
Mirco Kocher & Jacques Savoy.
Digital Scholarship in the Humanities, 33(2), 2018, 425-441. Paper
Distributed language representation (deep learning) has been applied successfully in different applications in natural language processing. Using this model, we propose and implement two new authorship attribution classifiers. In this perspective, a vector-space representation can be generated for each author or disputed text according to words and their nearby context. To determine the authorship of a disputed text, the cosine similarity between vector representations can be applied. The proposed strategies can be adapted without any difficulty to different languages (such as English and Italian) or genres (essays, political speeches, and newspaper articles). Evaluations using the k-nearest neighbors and based on four test collections (the Federalist Papers, the State of the Union addresses, the Glasgow Herald, and La Stampa newspapers) indicate that the distributed language representation performs well, sometimes providing better effectiveness than state-of-the-art methods such as k-NN, NSC, chi-square, Delta, LDA (latent Dirichlet allocation), or multi-layer perceptron classifier.
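The distance measures named in the 2018 abstracts above (Manhattan, Tanimoto, Matusita, Canberra, Clark, Cosine) can be illustrated with a minimal Python sketch. It uses common textbook definitions applied to two vectors of non-negative relative frequencies; the exact normalisations and feature sets used in the papers are not given in the abstracts, so the function names and the handling of zero denominators here are assumptions made for illustration only.

```python
import math


def _pairs(a, b):
    """Pair up two equally long frequency vectors (assumed non-negative)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return list(zip(a, b))


def manhattan(a, b):
    # L1 norm: sum of absolute differences.
    return sum(abs(x - y) for x, y in _pairs(a, b))


def tanimoto(a, b):
    # L1 difference normalised by the sum of component-wise maxima.
    denom = sum(max(x, y) for x, y in _pairs(a, b))
    return manhattan(a, b) / denom if denom else 0.0


def matusita(a, b):
    # L2 norm of the differences between square roots.
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in _pairs(a, b)))


def canberra(a, b):
    # Weighted L1: each difference is divided by the sum of the two values.
    # Terms where both values are zero are skipped (an assumption).
    return sum(abs(x - y) / (x + y) for x, y in _pairs(a, b) if x + y > 0)


def clark(a, b):
    # Square root of the sum of squared Canberra-style terms.
    return math.sqrt(sum((abs(x - y) / (x + y)) ** 2 for x, y in _pairs(a, b) if x + y > 0))


def cosine_distance(a, b):
    # One minus the cosine similarity (inner-product family).
    dot = sum(x * y for x, y in _pairs(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

All six functions return 0 (up to floating-point error) for identical vectors and grow as the two frequency profiles diverge; they differ in how much weight rare features receive, which is one plausible reason the abstracts report different empirical rankings.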
2017
-
Distance Measures in Author Profiling.
Mirco Kocher & Jacques Savoy.
Information Processing and Management, 53(5), September 2017, 1103-1119. Paper
Determining some demographics about the author of a document (e.g., gender, age) has attracted many studies during the last decade. To determine the targeted category, different distance measures have been suggested without one approach clearly dominating all others. In this paper, 24 distance measures are studied, extracted from five general families of functions. Moreover, six theoretical properties are presented and we show that the Tanimoto or Matusita distance measures respect all proposed properties. To complement this analysis, 13 test collections extracted from the last CLEF evaluation campaigns are employed to evaluate empirically the effectiveness of these distance measures. The empirical evaluations indicate that the Canberra or Clark distance measures tend to produce better effectiveness, at least in the context of an author profiling task. -
A Simple and Efficient Algorithm for Authorship Verification.
Mirco Kocher & Jacques Savoy.
Journal of the Association for Information Science and Technology, 68(1), 2017, 259-269. Paper
This paper describes and evaluates an unsupervised and effective authorship verification model. The suggested strategy can be adapted without any problem to different Indo-European languages (such as Dutch, English, Spanish, and Greek) or genres (essay, novel, review, and newspaper article). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, based on a simple rule, we can define when there is enough evidence to propose an answer or when the attribution scheme is unable to take a decision with a high degree of certainty. Evaluations based on six test collections (PAN CLEF 2014 evaluation campaign) indicate that our approach usually appears in the top three best verification systems, and on an aggregate measure, presents the best performance. -
Author Clustering with an Adaptive Threshold
Mirco Kocher & Jacques Savoy.
In Jones, G. J. F., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., & Ferro, N. (Eds), Experimental IR Meets Multilinguality, Multimodality, and Interaction, 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings, 186-198. Paper
This paper describes and evaluates an unsupervised author clustering model called SPATIUM. The proposed strategy can be adapted without any difficulty to different natural languages (such as Dutch, English, and Greek) and it can be applied to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest using the m most frequent terms of each text (isolated words and punctuation symbols with m set to at most 200). Applying a distance measure, we define whether there is enough evidence that two texts were written by the same author. The evaluations are based on six test collections (PAN AUTHOR CLUSTERING task at CLEF 2016). A more detailed analysis shows the strengths of our approach but also indicates the problems and provides reasons for some of the failures of the SPATIUM model. -
Author Clustering Using SPATIUM
Mirco Kocher & Jacques Savoy.
Short Paper, JCDL 2017, Toronto, Canada, June 19-23, 2017. ACM/IEEE, 265-268. Paper
This paper presents the author clustering problem and compares it to related authorship attribution questions. The proposed model is based on a distance measure called SPATIUM derived from the Canberra measure (weighted version of L1 norm). The selected features consist of the 200 most frequent words and punctuation symbols. An evaluation methodology is presented and the test collections are extracted from the PAN CLEF 2016 evaluation campaign. In addition to those, we also consider two additional corpora reflecting the literature domain more closely. Based on four different languages, the evaluation measures demonstrate a high precision and high F1 values for all 20 test collections. A more detailed analysis provides reasons explaining some of the failures of the SPATIUM model. -
Regroupement d'auteurs : Qui a écrit cet ensemble de romans ?
Mirco Kocher & Jacques Savoy.
In Nie, J.Y., & Lamprier, S. (Eds), Proceedings of CORIA 2017, Marseille, France, March 29-31, 2017, ARIA, 311-326. Paper
In French. This paper describes the author clustering problem where, based on a set of n texts, the number k of distinct authors must be determined and the texts must be regrouped into k classes according to their author. Using two test collections, one written in French, the second in English, different distance measures are described and evaluated. To define the needed features, the m most frequent words (e.g., m between 50 and 300) or the m most frequent letters and letter bigrams have been used. Our experiments show that word-based representations usually offer the best performance. Using the cosine distance function does not produce a better F1 value compared to functions based on the L1 norm (e.g., Canberra). However, no single distance measure is best in all cases. Applying a bootstrap approach, we show that the performance measures have a relatively large variability. Finally, a deeper analysis indicates the difficulties and reasons explaining incorrect assignments. -
UniNE at CLEF 2017: Author Clustering: Notebook for PAN at CLEF 2017
Mirco Kocher & Jacques Savoy.
In Cappellato, L., Ferro, N., Goeuriot, L., & Mandl, T. (Eds), CLEF 2017 Labs Working Notes, Dublin, Ireland, September 11-14, 2017, Aachen: CEUR. Paper
This paper describes and evaluates an effective unsupervised author clustering and authorship linking model. The suggested strategy can be adapted without any difficulty to different languages (such as Dutch, English, and Greek) in different text genres (e.g., newspaper articles and reviews). As features, we suggest using the m most frequent terms (isolated words and punctuation symbols) or the m most frequent character n-grams of each text. Applying a simple distance measure, we determine whether there is enough indication that two texts were written by the same author. The evaluations are based on 60 training and 120 test problems (PAN Author Clustering task at CLEF 2017). Using the most frequent terms results in a higher clustering precision, while using the most frequent character n-grams gives a higher clustering recall. An analysis of the variability of the performance measures indicates that the system performs stably, independent of the underlying text collection, and that our parameter choices did not over-fit to the training data. -
UniNE at CLEF 2017: Author Profiling Reasoning: Notebook for PAN at CLEF 2017
Mirco Kocher & Jacques Savoy.
In Cappellato, L., Ferro, N., Goeuriot, L., & Mandl, T. (Eds), CLEF 2017 Labs Working Notes, Dublin, Ireland, September 11-14, 2017, Aachen: CEUR. Paper
This paper describes and evaluates a supervised author profiling model. The suggested strategy can be adapted without any problem to various languages (such as Arabic, English, Spanish, and Portuguese). As features, we suggest using the m most frequent terms of the query text (isolated words and punctuation symbols with m at most 200). Applying a simple distance measure and looking at the nearest text profiles, we can determine the gender (with the nominal values "male" or "female") and the language variety (e.g., in Spanish the nominal values "Argentina", "Chile", "Colombia", "Mexico", "Peru", "Spain", or "Venezuela"). The training and test data is available for Twitter tweets (PAN Author Profiling task at CLEF 2017). An analysis of the top ranked terms from a feature selection method allows a better understanding of the proposed assignments and presents typical writing styles for each category.
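Several 2017 abstracts above describe the same recipe: take the m most frequent terms of the query text (isolated words and punctuation symbols, m at most 200), compare relative frequencies with a simple distance, and keep the nearest profile. The Python sketch below is one way this could look. The tokenizer, the function names, and the use of a Canberra-style distance are assumptions made for illustration; the abstracts only state that SPATIUM is derived from the Canberra measure, not its exact formula.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters, or single punctuation symbols.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN.findall(text.lower())


def relative_frequencies(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def top_terms(tokens, m=200):
    # The m most frequent terms of the query text serve as features.
    return [term for term, _ in Counter(tokens).most_common(m)]


def canberra_on_terms(query_freq, profile_freq, terms):
    # Canberra-style weighted L1 distance restricted to the selected terms.
    distance = 0.0
    for term in terms:
        q = query_freq.get(term, 0.0)
        p = profile_freq.get(term, 0.0)
        if q + p > 0:
            distance += abs(q - p) / (q + p)
    return distance


def nearest_profile(query_text, profiles, m=200):
    """Return (label, distances) for the profile closest to the query text.

    `profiles` maps a label (an author or a category) to its concatenated text.
    """
    query_tokens = tokenize(query_text)
    terms = top_terms(query_tokens, m)
    query_freq = relative_frequencies(query_tokens)
    distances = {
        label: canberra_on_terms(query_freq, relative_frequencies(tokenize(text)), terms)
        for label, text in profiles.items()
    }
    return min(distances, key=distances.get), distances
```

For example, with `profiles = {"A": "the cat sat on the mat, the cat slept.", "B": "we shall go; we shall see; we shall win!"}`, calling `nearest_profile("the cat sat, the cat ran.", profiles)` selects `"A"`, because profile B shares none of the query's frequent terms.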
2016
-
UniNE at CLEF 2016 Author Clustering: Notebook for PAN at CLEF 2016.
Mirco Kocher.
In Balog, K., Cappellato, L., Ferro, N., & Macdonald, C. (Eds), CLEF 2016 Labs Working Notes, Évora, Portugal, September 5-8, 2016, Aachen: CEUR. Paper Poster Slides
This notebook describes and evaluates an effective unsupervised author clustering model. The suggested strategy can be adapted without any problem to different languages (such as Dutch, English, and Greek) in different genres (newspaper articles and reviews). Applying a simple distance measure, we determine whether there is enough indication that two texts were written by the same author. The proposed clustering performs very well (rank 2 over all test collections) in a comparison of 8 approaches. -
UniNE at CLEF 2016 Author Profiling: Notebook for PAN at CLEF 2016.
Mirco Kocher & Jacques Savoy.
In Balog, K., Cappellato, L., Ferro, N., & Macdonald, C. (Eds), CLEF 2016 Labs Working Notes, Évora, Portugal, September 5-8, 2016, Aachen: CEUR. Paper Poster
This notebook describes and evaluates a cross-genre author profiling model. The suggested strategy can be adapted without any problem to different Indo-European languages. Applying a simple distance measure and looking at the five nearest neighbors, we can determine the gender (male or female) and the age group (18-24 | 25-34 | 35-49 | 50-64 | >65). The proposed cross-genre profiling performs acceptably (rank 14 over all test collections) in a comparison of 23 approaches.
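The "five nearest neighbors" decision mentioned in the profiling abstract above can be sketched as a plain majority vote. This is a generic k-nearest-neighbors illustration in Python, not the notebook's actual code; the function name and the input format (a list of distance and label pairs computed beforehand) are assumptions.

```python
from collections import Counter


def knn_label(scored_profiles, k=5):
    """Majority vote among the k profiles with the smallest distance.

    `scored_profiles` is a list of (distance, label) pairs, where each distance
    was computed between the query text and one labelled training profile.
    """
    if not scored_profiles:
        raise ValueError("at least one scored profile is required")
    nearest = sorted(scored_profiles, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter.most_common keeps the label seen first,
    # i.e. the one whose closest neighbor is nearer to the query.
    return votes.most_common(1)[0][0]
```

The same function can be called once per attribute (for instance once with gender labels and once with age-group labels), since each attribute is predicted independently from the same neighbors.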
2015
-
UniNE at CLEF 2015 Author Identification: Notebook for PAN at CLEF 2015.
Mirco Kocher & Jacques Savoy.
In Cappellato, L., Ferro, N., Jones, G. J. F., & Juan, E. S. (Eds), CLEF 2015 Labs Working Notes, Toulouse, France, September 8-11, 2015, Aachen: CEUR. Paper Poster
This notebook presents and evaluates an unsupervised authorship verification model. The suggested strategy can be adapted without any problem to different languages, even when genre and topic differ significantly. Applying a simple distance measure and a set of impostors, we determine whether or not the disputed text was written by the proposed author. The proposed identification usually performs well (rank 8 over all test collections) in a comparison of 19 approaches. -
UniNE at CLEF 2015 Author Profiling: Notebook for PAN at CLEF 2015.
Mirco Kocher.
In Cappellato, L., Ferro, N., Jones, G. J. F., & Juan, E. S. (Eds), CLEF 2015 Labs Working Notes, Toulouse, France, September 8-11, 2015, Aachen: CEUR. Paper Poster
This notebook presents and evaluates an effective author profiling model. The suggested strategy can be adapted without any problem to different languages in Twitter tweets. We can determine the gender (male and female), the age group (18-24 | 25-34 | 35-49 | >50), and the Big Five personality traits (each on an interval scale containing eleven items). The proposed profiling performs well (rank 4 over all test collections) in a comparison of 23 approaches.
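The impostor-based verification described in the authorship verification entries above (a distance to the proposed author, distances to a set of impostors, and a rule for when not to answer) could be sketched as follows in Python. The ratio rule and the `delta` margin are hypothetical choices for illustration; the abstracts do not state the actual decision rule or its thresholds.

```python
def verify(distance_to_author, impostor_distances, delta=0.05):
    """Return "same", "different", or "unanswered".

    `distance_to_author` is the distance between the disputed text and the
    proposed author's profile; `impostor_distances` are distances between the
    disputed text and texts by other authors. `delta` is a hypothetical margin
    that defines the zone in which no decision is taken.
    """
    if not impostor_distances:
        raise ValueError("at least one impostor distance is required")
    baseline = sum(impostor_distances) / len(impostor_distances)
    if baseline == 0:
        raise ValueError("impostor distances must not all be zero")
    ratio = distance_to_author / baseline
    if ratio < 1 - delta:
        return "same"        # clearly closer to the proposed author
    if ratio > 1 + delta:
        return "different"   # clearly closer to the impostors
    return "unanswered"      # not enough evidence either way
```

Leaving borderline cases unanswered trades coverage for precision, which matters when the evaluation measure rewards abstaining over answering incorrectly.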