Machine Learning for Computational Stylistics
Abstract
In this talk I will describe our research into using machine learning techniques to address problems of textual stylistics. Stylistics is the study of systematic variation in writing style, e.g., between authors, genres, or time periods. A paradigmatic example is "authorship attribution", in which the goal is to decide which of a given set of candidate authors wrote a given unattributed text. Potential applications of computational stylistics range from scholarly literary analysis, to forensic analysis of texts in criminology, to improved methods for information retrieval.
Our work is among the first to use modern machine learning techniques to learn models that distinguish reliably between different textual styles. These techniques are capable of dealing effectively with the large number of lexical and syntactic features needed to characterize textual style. I will present our results for modern English texts on problems of authorship attribution, distinguishing genres of texts, and determining the sex of the author. We have also obtained significant results on Hebrew texts, showing the validity of the basic approach for a language with a very different structure from English. We obtain both good classification results and insight into the features underlying the stylistic differences we find.
-- Shlomo Argamon