Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted or mass reproduced without the explicit permission of the copyright holder.
Abstract
Author verification is a fundamental task in authorship analysis, associated with significant applications in the humanities, cyber-security, and social media analytics. In some of the relevant studies, there is evidence that heterogeneous ensembles can provide very reliable solutions, better than any individual verification model. However, there is no systematic study examining the application of ensemble methods to this task. In this paper, we start from a large set of base verification models covering the main paradigms in this area and study how they can be combined to build an accurate ensemble. We propose a simple stacking ensemble as well as a dynamic ensemble selection approach that can use the most reliable base models for each verification case separately. The experimental results on ten benchmark corpora covering multiple languages and genres verify the suitability of ensembles for this task and demonstrate the effectiveness of our method, in some cases improving the best reported results by more than 10%.
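The stacking idea above can be illustrated with a short sketch; the meta-learner, toy data, and score layout below are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal stacking sketch: a logistic-regression meta-model learns how to combine
# the scores of several heterogeneous base verifiers into one verification score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_verifiers(train_scores, train_labels, test_scores):
    """train/test_scores: arrays of shape (n_cases, n_base_models), scores in [0, 1]."""
    meta = LogisticRegression()
    meta.fit(train_scores, train_labels)           # learn how to weigh the base models
    return meta.predict_proba(test_scores)[:, 1]   # stacked same-author probabilities

# toy usage with random base-model outputs
rng = np.random.default_rng(0)
train = rng.random((100, 5))
labels = (train.mean(axis=1) > 0.5).astype(int)
print(stack_verifiers(train, labels, rng.random((3, 5))))
```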
Abstract
Author verification is a fundamental task in authorship analysis and associated with important applications in humanities and forensics. In this paper, we propose the use of an intrinsic profile-based verification method that is based on latent semantic indexing (LSI). Our proposed approach is easy-to-follow and language independent. Based on experiments using benchmark corpora from the PAN shared task in author verification, we demonstrate that LSI is both more effective and more stable than latent Dirichlet allocation in this task. Moreover, LSI models are able to outperform existing approaches especially when multiple texts of known authorship are available per verification instance and all documents belong to the same thematic area and genre. We also study several feature types and similarity measures to be combined with the proposed topic models.
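As a rough illustration of the LSI-based comparison: the character n-gram features, component count, and averaging of similarities below are assumptions of this sketch, not the settings reported in the paper.

```python
# Sketch: project known and questioned texts into a low-dimensional LSI space
# (truncated SVD over tf-idf features) and score by mean cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsi_verification_score(known_docs, questioned_doc, n_topics=2):
    docs = known_docs + [questioned_doc]
    tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(docs)
    lsi = TruncatedSVD(n_components=n_topics, random_state=0).fit_transform(tfidf)
    sims = cosine_similarity(lsi[:-1], lsi[-1:])   # known texts vs. questioned text
    return float(sims.mean())                      # higher -> more likely same author

print(lsi_verification_score(["the cat sat on the mat", "a cat lay on the mat"],
                             "the cat is on the mat"))
```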
Abstract
Authorship verification has gained a lot of attention in recent years, mainly due to the focus of the PAN@CLEF shared tasks. A verification method called Impostors, based on a set of external (impostor) documents and a random subspace ensemble, is one of the most successful approaches. Variations of this method gained top-performing positions in recent PAN evaluation campaigns. In this paper, we propose a modification of the Impostors method that focuses on both the appropriate selection of impostor documents and an enhanced comparison of impostor documents with the documents under investigation. Our approach achieves competitive performance on PAN corpora, outperforming previous versions of the Impostors method.
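For readers unfamiliar with the Impostors family, the sketch below shows the core random-subspace loop; the document vectors, similarity function, and parameters are placeholders, and the selection/comparison refinements proposed in the paper are not reproduced here.

```python
# Core Impostors loop: on random feature subsets, count how often the known-author
# document is more similar to the questioned document than every impostor document.
import numpy as np

def cosine(a, b, idx):
    return a[idx] @ b[idx] / (np.linalg.norm(a[idx]) * np.linalg.norm(b[idx]) + 1e-12)

def impostors_score(known, questioned, impostors, iters=100, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(iters):
        idx = rng.choice(known.shape[0], size=int(frac * known.shape[0]), replace=False)
        best_impostor = max(cosine(imp, questioned, idx) for imp in impostors)
        wins += cosine(known, questioned, idx) > best_impostor
    return wins / iters        # a high score supports the same-author hypothesis

rng = np.random.default_rng(1)
known, questioned = rng.random(200), rng.random(200)
impostors = [rng.random(200) for _ in range(20)]
print(impostors_score(known, questioned, impostors))
```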
Abstract
The constantly increasing amount of opinionated text found on the Web has had a significant impact on the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies demonstrates that it is appropriate for opinion comparison between topics, since it provides useful indications of popularity based on a relatively small number of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.
Abstract
Automated Genre Identification (AGI) of web pages is a
problem of increasing importance since web genre (e.g. blog, news, e-shops,
etc.) information can enhance modern Information Retrieval (IR)
systems. The state-of-the-art in this field considers AGI as a closed-set
classification problem where a variety of web page representations and machine
learning models have been intensively studied. In this paper, we study
AGI as an open-set classification problem which better formulates the
real world conditions of exploiting AGI in practice. Focusing on the use
of content information, different text representation methods (words and
character n-grams) are tested. Moreover, two classification methods are
examined, one-class SVM learners, used as a baseline, and an ensemble
of classifiers based on random feature subspacing, originally proposed for
author identification. It is demonstrated that very high precision can be
achieved in open-set AGI while recall remains relatively high.
Abstract
The paper introduces and discusses a concept of syntactic n-grams (sn-grams) that can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams allow bringing syntactic knowledge into machine learning methods. Still, previous parsing is necessary for their construction. We applied sn-grams in the
task of authorship attribution for corpora of three and seven authors with very promising results.
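As a toy illustration of how sn-grams differ from surface n-grams, the snippet below builds syntactic bigrams by following head -> dependent arcs; the hard-coded head indices stand in for the output of a dependency parser.

```python
# Toy syntactic bigrams (sn-grams of size 2): instead of adjacent words, follow
# head -> dependent arcs of a dependency tree given as a list of head indices
# (0-based, -1 for the root), which any dependency parser can supply.
def syntactic_bigrams(tokens, heads):
    return [(tokens[h], tokens[i]) for i, h in enumerate(heads) if h >= 0]

tokens = ["the", "cat", "chased", "a", "mouse"]
heads = [1, 2, -1, 4, 2]   # "the"->"cat", "cat"->"chased", "a"->"mouse", "mouse"->"chased"
print(syntactic_bigrams(tokens, heads))
# [('cat', 'the'), ('chased', 'cat'), ('mouse', 'a'), ('chased', 'mouse')]
```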
Abstract
Ruling line removal is an important pre-processing step in document image processing. Several algorithms have been proposed for this task. However, it is important to be able to take full advantage of the existing algorithms by adapting them to the specific properties of a document image collection. In this paper, a system is presented that is appropriate for fine-tuning the parameters of ruling line removal algorithms or appropriately adapting them to a specific document image collection, in order to improve the results. The application of our method to an existing line removal algorithm is presented.
Abstract
This paper overviews 18 plagiarism detectors that have been evaluated
within the fifth international competition on plagiarism detection at PAN 2013.
We report on their performance in the two tasks of external plagiarism detection:
source retrieval and text alignment. Furthermore, we continue last year’s initiative
to invite software submissions instead of run submissions, and re-evaluate
this year’s submissions on last year’s evaluation corpora and vice versa, thus
demonstrating the benefits of software submissions in terms of reproducibility.
Abstract
This overview presents the framework and results for the Author Profiling
task at PAN 2013. We describe in detail the corpus and its characteristics,
and the evaluation framework we used to measure the participants' performance to
solve the problem of identifying age and gender from anonymous texts. Finally,
the approaches of the 21 participants and their results are described.
Abstract
The author identification task at PAN-2013 focuses on author verification: given a set of documents by a single author and a questioned document, the problem is to determine if the questioned document was written by that particular author or not. In this paper we present the evaluation setup, the performance measures, the new corpus we built for this task covering three languages, and the evaluation results of the 18 participant teams that submitted their software. Moreover, we survey the characteristics of the submitted approaches and show that a very effective meta-model can be formed based on the output of the participant methods.
Abstract
This paper outlines the concepts and achievements of our evaluation
lab on digital text forensics, PAN 13, which called for original research and development
on plagiarism detection, author identification, and author profiling.
We present a standardized evaluation framework for each of the three tasks and
discuss the evaluation results of the altogether 58 submitted contributions. For
the first time, instead of accepting the output of software runs, we collected the
software itself and ran it on a computer cluster at our site. As evaluation
and experimentation platform we use TIRA, which is being developed at
the Webis Group in Weimar. TIRA can handle large-scale software submissions
by means of virtualization, sandboxed execution, tailored unit testing, and staged
submission. In addition to the achieved evaluation results, a major achievement
of our lab is that we now have the largest collection of state-of-the-art approaches
with regard to the mentioned tasks for further analysis at our disposal.
Abstract
The vast amount of user-generated content on the Web has
increased the need for handling the problem of automatically
processing content in web pages. The segmentation of web
pages and noise (non-informative segment) removal are important
pre-processing steps in a variety of applications such
as sentiment analysis, text summarization and information
retrieval. Currently, these two tasks tend to be handled separately
or are handled together without emphasizing the diversity
of the web corpora and the web page type detection.
We present a unified approach that is able to provide robust
identification of informative textual parts in web pages
along with accurate type detection. The proposed algorithm
takes into account visual and non-visual characteristics of a
web page and is able to remove noisy parts from three major
categories of pages which contain user-generated content
(News, Blogs, Discussions). Based on a human annotated
corpus consisting of diverse topics, domains and templates,
we demonstrate the learning abilities of our algorithm and
examine its effectiveness in extracting the informative textual
parts and its usage as a rule-based classifier for web
page type detection in a realistic web setting.
Abstract
The discovery of web documents about certain topics
is an important task for web-based applications including web
document retrieval, opinion mining and knowledge extraction. In
this paper, we propose an agent-based focused crawling framework
able to retrieve topic- and genre-related web documents.
Starting from a simple topic query, a set of focused crawler
agents explore in parallel topic-specific web paths using dynamic
seed URLs that belong to certain web genres and are collected
from web search engines. The agents make use of an internal
mechanism that weighs topic and genre relevance scores of
unvisited web pages. They are able to adapt to the properties
of a given topic by modifying their internal knowledge during
search, handle ambiguous queries, ignore irrelevant pages with
respect to the topic and retrieve collaboratively topic-relevant
web pages. We performed an experimental study to evaluate the
behavior of the agents for a variety of topic queries demonstrating
the benefits and the capabilities of our framework.
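A schematic version of the crawling loop described above follows; fetch, extract_links, score_topic, and score_genre are placeholders for the framework's own components, and the weights and relevance threshold are illustrative.

```python
# Sketch of a focused crawl: unvisited links are kept in a priority queue ordered by
# a weighted combination of topic and genre relevance scores.
import heapq

def focused_crawl(seed_urls, fetch, extract_links, score_topic, score_genre,
                  w_topic=0.7, w_genre=0.3, max_pages=100):
    frontier = [(-1.0, url) for url in seed_urls]      # max-heap via negated scores
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if -neg_score > 0.5:                           # keep only sufficiently relevant pages
            collected.append(url)
        for link in extract_links(page):
            if link not in visited:
                score = w_topic * score_topic(page, link) + w_genre * score_genre(page, link)
                heapq.heappush(frontier, (-score, link))
    return collected
```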
Abstract
Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches has typically avoided the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification on skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets, while the instance-based method is a better choice for balanced or slightly-skewed training sets.
Abstract
In this paper a novel method for detecting plagiarized passages in
document collections is presented. In contrast to previous work in
this field that uses mainly content terms to represent documents,
the proposed method is based on structural information provided
by occurrences of a small list of stopwords (i.e., very frequent
words). We show that stopword n-grams are able to capture local
syntactic similarities between suspicious and original documents.
Moreover, an algorithm for detecting the exact boundaries of
plagiarized and source passages is proposed. Experimental results
on a publicly-available corpus demonstrate that the performance
of the proposed approach is competitive when compared with the
best reported results. More importantly, it achieves significantly
better results when dealing with difficult plagiarism cases where
the plagiarized passages are highly modified by replacing most of
the words or phrases with synonyms to hide the similarity with the
source documents.
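To make the representation concrete, here is a small sketch of stopword n-gram overlap; the stopword list is a tiny illustrative subset rather than the list used in the paper, and the boundary-detection algorithm is not shown.

```python
# Sketch: keep only very frequent function words, form n-grams over their sequence,
# and measure overlap between a suspicious and a source text.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it", "on", "for", "with"}

def stopword_ngrams(text, n=3):
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def stopword_overlap(suspicious, source, n=3):
    a, b = stopword_ngrams(suspicious, n), stopword_ngrams(source, n)
    return len(a & b) / max(len(a), 1)   # fraction of suspicious n-grams found in the source

# the content words differ (synonym substitution) but the stopword sequence survives
print(stopword_overlap("the cause of the delay is that it is on the agenda for review",
                       "the cause of the fault is that it is on the list for repair"))
```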
Abstract
Author identification models fall into two major categories according to the way they handle the training texts: profile-based models produce one representation per author while instance-based models produce one representation per text. In this paper, we propose an approach that combines two well-known representatives of these categories, namely the Common n-Grams method and a Support Vector Machine classifier based on character n-grams. The outputs of these classifiers are combined to enrich the training set with additional documents in a repetitive semi-supervised procedure inspired by the co-training algorithm. The evaluation results on closed-set author identification are encouraging, especially when the set of candidate authors is large.
Abstract
In constraint programming there are often many choices regarding
the propagation method to be used on the constraints of a
problem. However, simple constraint solvers usually only apply a standard
method, typically (generalized) arc consistency, on all constraints
throughout search. Advanced solvers additionally allow for the modeler
to choose among an array of propagators for certain (global) constraints.
Since complex interactions exist among constraints, deciding in the modelling
phase which propagation method to use on given constraints can
be a hard task that ideally we would like to free the user from. In this paper
we propose a simple technique towards the automation of this task.
Our approach exploits information gathered from a random probing preprocessing
phase to automatically decide on the propagation method to
be used on each constraint. As we demonstrate, data gathered through
probing allows for the solver to accurately differentiate between constraints
that offer little pruning as opposed to ones that achieve many
domain reductions, and also to detect constraints and variables that are
amenable to certain propagation methods. Experimental results from an
initial evaluation of the proposed method on binary CSPs demonstrate
the benefits of our approach.
Abstract
The task of intrinsic plagiarism detection deals with cases where no reference corpus
is available and it is exclusively based on stylistic changes or inconsistencies within a given
document. In this paper a new method is presented that attempts to quantify the style variation
within a document using character n-gram profiles and a style change function based on an
appropriate dissimilarity measure originally proposed for author identification. In addition, we
propose a set of heuristic rules that attempt to detect plagiarism-free documents and
plagiarized passages, as well as to reduce the effect of irrelevant style changes within a
document. The proposed approach is evaluated on the recently-available corpus of the 1st Int.
Competition on Plagiarism Detection with promising results.
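The style change function can be sketched as follows; the character trigram profiles, window and step sizes, and the exact normalization are assumptions of this illustration, written in the spirit of the dissimilarity measure mentioned above.

```python
# Sketch of a style change curve for intrinsic plagiarism detection: the character
# trigram profile of each sliding window is compared to the whole-document profile
# with a normalized n-gram dissimilarity. Windows are assumed to be longer than n.
from collections import Counter

def char_ngram_profile(text, n=3):
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def dissimilarity(p, q):
    return sum(((2 * (p[g] - q.get(g, 0.0))) / (p[g] + q.get(g, 0.0))) ** 2
               for g in p) / (4 * len(p))

def style_change_curve(text, window=200, step=100):
    whole = char_ngram_profile(text)
    return [dissimilarity(char_ngram_profile(text[i:i + window]), whole)
            for i in range(0, max(len(text) - window, 1), step)]
```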
Abstract
Author identification is a text categorization task with
applications in intelligence, criminal law, computer forensics, etc.
Usually, in such cases there is shortage of training texts. In this
paper, we propose the use of second order tensors for representing
texts for this problem, in contrast to the traditional vector space
model. Based on a generalization of the SVM algorithm that can
handle tensors, we explore various methods for filling the matrix of
features taking into account that similar features should be placed in
the same neighborhood. To this end, we propose a frequency-based
metric. Experiments on a corpus controlled for genre and topic and
variable amount of training texts show that the proposed approach
is more effective than a traditional vector-based SVM when only a
limited amount of training texts is used.
Abstract
Authorship identification can be viewed as a text categorization task.
However, in this task the most frequent features appear to be the most important
discriminators, there is usually a shortage of training texts, and the training texts
are rarely evenly distributed over the authors. To cope with these problems, we
propose tensors of second order for representing the stylistic properties of texts.
Our approach requires the calculation of much fewer parameters in comparison
to the traditional vector space representation. We examine various methods for
building appropriate tensors taking into account that similar features should be
placed in the same neighborhood. Based on an existing generalization of SVM
able to handle tensors we perform experiments on corpora controlled for genre
and topic and show that the proposed approach can effectively handle cases
where only limited training texts are available.
Abstract
An important factor for discriminating between
webpages is their genre (e.g., blogs, personal homepages,
e-shops, online newspapers, etc.). Webpage genre
identification has a great potential in information
retrieval since users of search engines can combine
genre-based and traditional topic-based queries to
improve the quality of the results. So far, various features
have been proposed to quantify the style of webpages
including word and html-tag frequencies. In this paper,
we propose a low-level representation for this problem
based on character n-grams. Using an existing approach,
we produce feature sets of variable-length character n-grams
and combine this representation with information
about the most frequent html-tags. Based on two
benchmark corpora, we present webpage genre
identification experiments and improve the best reported
results in both cases.
Abstract
This paper deals with the problem of author
identification. The Common N-Grams (CNG) method
[6] is a language-independent profile-based approach
with good results in many author identification
experiments so far. A variation of this approach is
presented based on new distance measures that are
quite stable for large profile length values. Special
emphasis is given to the degree upon which the
effectiveness of the method is affected by the available
training text samples per author. Experiments based on
text samples on the same topic from the Reuters
Corpus Volume 1 are presented using both balanced
and imbalanced training corpora. The results show
that CNG with the proposed distance measures is more
accurate when only limited training text samples are
available, at least for some of the candidate authors, a
realistic condition in author identification problems.
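For reference, the baseline CNG dissimilarity on which the proposed distance measures build can be sketched as below; the profile size and n-gram length are illustrative values.

```python
# Baseline CNG dissimilarity between two character n-gram profiles (the L most
# frequent n-grams with their relative frequencies); the distance variants proposed
# in the paper modify this formula for stability at large profile lengths.
from collections import Counter

def profile(text, n=3, L=2000):
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}

def cng_dissimilarity(pa, pb):
    grams = set(pa) | set(pb)
    return sum(((pa.get(g, 0.0) - pb.get(g, 0.0)) /
                ((pa.get(g, 0.0) + pb.get(g, 0.0)) / 2)) ** 2 for g in grams)
```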
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
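A compact sketch of the SCAP idea follows; the n-gram length, profile size, and toy programs are illustrative, not the settings tuned in the paper.

```python
# SCAP sketch: an author profile is the set of the L most frequent byte-level n-grams
# over the author's programs; a questioned program is attributed to the author whose
# profile shares the most n-grams with it (simplified profile intersection).
from collections import Counter

def scap_profile(source_bytes, n=6, L=1500):
    counts = Counter(source_bytes[i:i + n] for i in range(len(source_bytes) - n + 1))
    return {g for g, _ in counts.most_common(L)}

def attribute(questioned_bytes, author_profiles, n=6, L=1500):
    q = scap_profile(questioned_bytes, n, L)
    return max(author_profiles, key=lambda a: len(q & author_profiles[a]))  # most shared n-grams

profiles = {"alice": scap_profile(b"for(int i=0;i<n;i++){sum+=a[i];}"),
            "bob":   scap_profile(b"while x < n:\n    total += values[x]\n    x += 1")}
print(attribute(b"for(int j=0;j<n;j++){sum+=b[j];}", profiles))   # -> 'alice'
```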
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author's style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
Automatic authorship identification offers a valuable tool for
supporting crime investigation and security. It can be seen as a multi-class,
single-label text categorization task. Character n-grams are a very successful
approach to represent text for stylistic purposes since they are able to capture
nuances at the lexical, syntactic, and structural levels. So far, character n-grams of
fixed length have been used for authorship identification. In this paper, we
propose a variable-length n-gram approach inspired by previous work for
selecting variable-length word sequences. Using a subset of the new Reuters
corpus, consisting of texts on the same topic by 50 different authors, we show
that the proposed approach is at least as effective as information gain for
selecting the most significant n-grams although the feature sets produced by the
two methods have few common members. Moreover, we explore the
significance of digits for distinguishing between authors showing that an
increase in performance can be achieved using simple text pre-processing.
Abstract
This paper deals with the problem of identifying the
most likely author of a text. Several thousands of character n-grams,
rather than lexical or syntactic information, are used to represent the
style of a text. Thus, the author identification task can be viewed as
a single-label multiclass classification problem of high dimensional
feature space and sparse data. In order to cope with such properties,
we propose a suitable learning ensemble based on feature set
subspacing. Performance results on two well-tested benchmark text
corpora for author identification show that this classification
scheme is quite effective, significantly improving the best reported
results so far. Additionally, this approach proves to be quite
stable in comparison with support vector machines when using a
limited number of training texts, a condition usually met in this kind
of problem.
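The subspacing idea can be sketched as follows; linear SVMs stand in for the base learners, the split into disjoint subspaces is simplified, and labels are assumed to be integer class ids.

```python
# Sketch of classification by feature set subspacing: simple base learners are trained
# on disjoint slices of the high-dimensional feature vector and combined by voting.
import numpy as np
from sklearn.svm import LinearSVC

class SubspaceEnsemble:
    def __init__(self, n_subspaces=10, seed=0):
        self.n_subspaces, self.seed = n_subspaces, seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        idx = rng.permutation(X.shape[1])
        self.slices_ = np.array_split(idx, self.n_subspaces)   # disjoint feature subsets
        self.models_ = [LinearSVC().fit(X[:, s], y) for s in self.slices_]
        return self

    def predict(self, X):
        votes = np.array([m.predict(X[:, s]) for m, s in zip(self.models_, self.slices_)])
        # majority vote over the base learners (labels are non-negative integer ids)
        return np.array([np.bincount(col).argmax() for col in votes.T])
```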
Abstract
Authorship identification can be seen as a single-label
multi-class text categorization problem. Very often, there are
extremely few training texts at least for some of the candidate
authors. In this paper, we present methods to handle imbalanced
multi-class textual datasets. The main idea is to segment the
training texts into sub-samples according to the size of the class.
Hence, minority classes can be segmented into many short samples
and majority classes into fewer, longer samples. Moreover, we
explore text re-sampling in order to construct a training set
according to a desirable distribution over the classes. Essentially,
text re-sampling can be viewed as providing new synthetic data that
increase the training size of a class. Based on a corpus of newswire
stories in English we present authorship identification experiments
on various multi-class imbalanced cases.
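One simple realization of the segmentation step is sketched below; giving every author the same fixed number of sub-samples is an assumption of this sketch rather than necessarily the paper's exact scheme.

```python
# Sketch: cut each author's concatenated training text into a fixed number of
# sub-samples, so authors with little text contribute short samples and prolific
# authors contribute long ones, yielding a balanced number of instances per class.
def segment_author_texts(texts_per_author, samples_per_author=10):
    segmented = {}
    for author, texts in texts_per_author.items():
        words = " ".join(texts).split()
        size = max(len(words) // samples_per_author, 1)
        segmented[author] = [" ".join(words[i:i + size])
                             for i in range(0, len(words), size)][:samples_per_author]
    return segmented
```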
Abstract
This paper presents a content-based approach to spam detection
based on low-level information. Instead of the traditional 'bag of words' representation,
we use a 'bag of character n-grams' representation which avoids the
sparse data problem that arises in n-grams on the word-level. Moreover, it is
language-independent and does not require any lemmatizer or 'deep' text preprocessing.
Based on experiments on Ling-Spam corpus we evaluate the proposed
representation in combination with support vector machines. Both binary
and term-frequency representations achieve high precision rates while maintaining
recall at an equally high level, which is a crucial factor for anti-spam filters, a
cost sensitive application.
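The representation is straightforward to reproduce; the sketch below uses binary character 3-grams with a linear SVM, and the toy texts and exact n-gram range are illustrative rather than the Ling-Spam setup.

```python
# Character n-gram spam detection sketch: binary char 3-gram features fed to a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["win money now !!!", "meeting agenda for linguistics seminar",
         "cheap meds online", "call for papers: corpus linguistics"]
labels = [1, 0, 1, 0]      # 1 = spam, 0 = legitimate

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 3), binary=True),  # binary char 3-grams
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["free money meds"]))
```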
Abstract
In this paper, we present a binarization technique
specifically designed for historical document images.
Existing methods for this problem focus on either
finding a good global threshold or adapting the
threshold for each area so that to remove smear,
strains, uneven illumination etc. We propose a hybrid
approach that first applies a global thresholding
method and, then, identifies the image areas that are
more likely to still contain noise. Each of these areas is
re-processed separately to achieve better quality of
binarization. We evaluate the proposed approach for
different kinds of degradation problems. The results
show that our method can handle hard cases while
documents already in good condition are not affected
drastically.
Abstract
It is common for libraries to provide public access
to historical and ancient document image collections.
It is common for such document images to require
specialized processing in order to remove background
noise and become more legible. In this paper, we
propose a hybrid binarization approach for improving
the quality of old documents using a combination of
global and local thresholding. First, a global
thresholding technique specifically designed for old
document images is applied to the entire image. Then,
the image areas that still contain background noise are
detected and the same technique is re-applied to each
area separately. Hence, we achieve better adaptability
of the algorithm in cases where various kinds of noise
coexist in different areas of the same image while
avoiding the computational and time cost of applying a
local thresholding to the entire image. Evaluation
results based on a collection of historical document
images indicate that the proposed approach is effective
in removing background noise and improving the
quality of degraded documents while documents
already in good condition are not affected.
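The global-then-local scheme can be sketched as follows; Otsu's method stands in here for the paper's global thresholding technique, and the block size and noise criterion are illustrative assumptions.

```python
# Hybrid sketch: one global threshold over the page, then blocks that still look
# noisy (too many dark pixels) are re-thresholded on their own histograms.
import numpy as np

def otsu_threshold(gray):
    """gray: uint8 grayscale image, 0 (black) .. 255 (white)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    w, mu = hist.cumsum(), (hist * np.arange(256)).cumsum()
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu[-1] * w - mu * w[-1]) ** 2 / (w * (w[-1] - w) * w[-1] ** 2)
    between[~np.isfinite(between)] = 0.0
    return int(between.argmax())          # threshold maximizing between-class variance

def hybrid_binarize(gray, block=64, noise_ratio=0.4):
    out = (gray > otsu_threshold(gray)).astype(np.uint8)     # global pass (1 = background)
    for y in range(0, gray.shape[0], block):
        for x in range(0, gray.shape[1], block):
            if (out[y:y + block, x:x + block] == 0).mean() > noise_ratio:   # block still noisy
                g = gray[y:y + block, x:x + block]
                out[y:y + block, x:x + block] = (g > otsu_threshold(g)).astype(np.uint8)
    return out
```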
Abstract
Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of major benefit, such as tracing the source of code left in the system after a cyber attack, authorship disputes, proof of authorship in court, etc. In this paper, we present our approach, which is based on byte-level n-gram profiles and is an extension of a method that has been successfully applied to natural language text authorship attribution. We propose a simplified profile and a new similarity measure which is less complicated than the algorithm followed in text authorship attribution and seems more suitable for source code identification, since it is better able to deal with very small training sets. Experiments were performed on two different data sets, one with programs written in C++ and the second with programs written in Java. Unlike the traditional language-dependent metrics used by previous studies, our approach can be applied to any programming language with no additional cost. The presented accuracy rates are much better than the best reported results for the same data sets.
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
In this paper the problem of music performer verification is introduced.
Given a certain performance of a musical piece and a set of candidate pianists,
the task is to examine whether or not a particular pianist is the actual performer.
A database of 22 pianists playing pieces by F. Chopin on a computer-controlled
piano is used in the presented experiments. An appropriate set of features
that captures the idiosyncrasies of music performers is proposed. Well-known
machine learning techniques for constructing learning ensembles are applied
and remarkable results are described in verifying the actual pianist, a very
difficult task even for human experts.
Abstract
In this paper, we present a trainable approach to
discriminate between machine-printed and handwritten
text. An integrated system able to localize text areas and
split them into text-lines is used. A set of simple and
easy-to-compute structural characteristics that capture the
differences between machine-printed and handwritten
text-lines is introduced. Experiments on document images
taken from IAM-DB and GRUHD databases show a
remarkable performance of the proposed approach that
requires minimal training data.
Abstract
This paper addresses the problem of identifying the
most likely music performer, given a set of performances of the
same piece by a number of skilled candidate pianists. We propose a
set of features for representing the stylistic characteristics of a
music performer. A database of piano performances of 22 pianists
playing two pieces by F. Chopin is used in the presented
experiments. Due to the limitations of the training set size and the
characteristics of the input features we propose an ensemble of
simple classifiers derived by both subsampling the training set and
subsampling the input features. Preliminary experiments show that
the resulting ensemble is able to efficiently cope with this difficult
musical task, displaying a level of accuracy unlikely to be matched
by human listeners (under similar conditions).
Abstract
In this study, a comparison of features for
discriminating between different music performers
playing the same piece is presented. Based on a series
of statistical experiments on a data set of piano pieces
played by 22 performers, it is shown that the
deviation from the performance norm (average
performance) is better able to reveal the performers’
individualities in comparison to the deviation from
the printed score. In the framework of automatic
music performer recognition, the norm-based features
prove to be very accurate in intra-piece tests (training
and test set taken from the same piece) and very
stable in inter-piece tests (training and test sets taken
from different pieces). Moreover, it is empirically
demonstrated that the average performance is at least
as effective as the best of the constituent individual
performances while ‘extreme’ performances have the
lowest discriminatory potential when used as norm.
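The norm-based features referred to above can be illustrated in a few lines; the array layout and the use of a plain mean as the norm are assumptions of this sketch.

```python
# The "norm" is the average performance of a piece; each pianist is then represented
# by note-wise deviations from that norm rather than deviations from the printed score.
import numpy as np

def norm_deviation_features(performances):
    """performances: array (n_performers, n_notes, n_dims), e.g. dims = (timing, loudness)."""
    norm = performances.mean(axis=0)      # the average performance over all pianists
    return performances - norm            # per-performer deviation from the norm

perf = np.random.default_rng(0).random((22, 120, 2))   # 22 pianists, 120 notes, 2 expressive dims
print(norm_deviation_features(perf).shape)              # (22, 120, 2)
```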
Abstract
In this study, a computational model that aims at the automatic discrimination of different human
music performers playing the same piece is presented. The proposed model is based on the note
level and does not require any deep (e.g., structural or harmonic, etc.) analysis. A set of measures
that attempts to capture both the style of the author and the style of the piece is introduced. The
presented approach has been applied to a database of piano sonatas by W.A. Mozart performed by
both a French and a Viennese pianist with very encouraging preliminary results.
Abstract
In this paper we present a practical approach to text chunking for unrestricted Modern Greek text that is based on multiple-pass parsing. Two versions of this chunker are proposed: one based on a large lexicon and one based on minimal resources. In the latter case the morphological analysis is performed using exclusively two small lexicons containing closed-class words and common suffixes of Modern Greek words. We give comparative performance results on the basis of a corpus of unrestricted text and show that very good results can be obtained by omitting the large and complicated resources. Moreover, the considerable time cost introduced by the use of the large lexicon indicates that the minimal-resources chunker is the best solution for a practical application that requires rapid response and tolerates less than perfect parsing results.
Abstract
In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language. Using as testing ground a part of the Wall Street Journal corpus, we show that the most frequent words of the British National Corpus, representing the most frequent words of the written English language, are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role in terms of accurate text categorization as well as when dealing with training data of limited size.
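The feature extraction is simple enough to show directly; the short word list below is illustrative (the paper uses the most frequent words of the British National Corpus), and the tokenization is deliberately naive.

```python
# Relative frequencies of very frequent words and common punctuation marks serve as
# genre discriminators; a classifier is then trained on these low-dimensional vectors.
MOST_FREQUENT = ["the", "of", "and", "a", "in", "to", "is", "was", "it", "for",
                 ",", ".", "!", "?", ";"]

def genre_features(text):
    for mark in ",.!?;":
        text = text.replace(mark, f" {mark} ")   # keep punctuation marks as tokens
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return [tokens.count(w) / total for w in MOST_FREQUENT]

print(genre_features("The results, however, were clear: the method works."))
```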
Abstract
This paper presents a character segmentation algorithm for unconstrained cursive handwritten text. The transformation-based learning method and a simplified variation of it are used in order to extract automatically rules that detect the segment boundaries. Comparative experimental results are given for a collection of multi-writer handwritten words. The achieved accuracy in detecting segment boundaries exceeds 82%. Moreover, limited training data can provide very satisfactory results.
Abstract
In this paper we present an approach to automatic authorship attribution dealing with real-world (or unrestricted) text. Our method is based on the computational analysis of the input text using a text-processing tool. Besides the style markers relevant to the output of this tool we also use analysis-dependent style markers, that is, measures that represent the way in which the text has been processed. No word frequency counts, nor other lexically-based measures are taken into account. We show that the proposed set of style markers is able to distinguish texts of various authors of a weekly newspaper using multiple regression. All the experiments we present were performed using real-world text downloaded from the World Wide Web. Our approach is easily trainable and fully-automated requiring no manual text preprocessing nor sampling.
Abstract
Transformation-based learning (TBL) is the most important machine learning theory aiming at the automatic extraction of rules based on already tagged corpora. However, the application of this theory to a certain application without taking into account the features that characterize this application may cause problems regarding the training time cost as well as the accuracy of the extracted rules. In this paper we present a variation of the basic idea of the TBL and we apply it to the extraction of the sentence boundary disambiguation rules in real-world text, a prerequisite for the vast majority of the natural language processing applications. We show that our approach achieves considerably higher accuracy results and, moreover, requires minimal training time in comparison to the traditional TBL.
Abstract
This paper describes a user-assisted business letter generator that meets the ever-increasing demand for more flexible and modular letter generators which draw on explicit thematic models and are easily adaptable to specific user needs. Based on a detailed analysis of requirements and taking full advantage of the end users' feedback, the presented generator not only creates a business letter according to the user's choices, but also refines it taking into consideration stylistic aspects such as written style and tone.
Abstract
Language barriers present a major problem in the effectiveness of resource sharing and in common access to the resources of libraries. In this paper we present the TRANSLIB system which stemmed from the integration of both new and already existing advanced multilingual information tools. By making use of some AI-based methods this system takes full advantage of these resources in order to provide multilingual access to library catalogues. Among its striking features, it enables searching in multiple languages, multilingual presentation of the query results, and localization of the user interface. TRANSLIB has been currently tested in existing medium-sized bibliographic databases. Early evaluation results show a remarkable improvement in the search process and report high user-friendliness, and easy and low-cost maintenance and upgrade of the system.
Abstract
The presented work is strongly motivated by the need to categorize unrestricted texts in terms of functional style (FS) in order to attain a satisfying outcome in style processing. Towards this aim, a three-level description of FS is given that comprises: (a) the basic categories of FS, (b) the main features that characterize each of the above categories, and (c) the linguistic identifiers that act as style markers in texts for the identification of the above features. Special emphasis is put on the problems faced in the computational implementation of the aforementioned findings as well as on the selection of the most appropriate stylometrics (i.e., stylistic scores) to achieve better results in text categorization. This approach is language independent, empirically-driven, and can be used in various applications including grammar and style checking, natural language generation, style verification in real-world texts, and recognition of style shift between adjacent portions of text.