Education

  • Diploma in Electrical Engineering, University of Patras (1994)
  • Doctoral degree in Electrical and Computer Engineering, University of Patras (2000)

Research Interests

  • Text mining
  • Intelligent information retrieval
  • Natural language processing
  • Machine learning
  • Computer music

Teaching Activities

  • Artificial Intelligence
  • Natural Language Processing
  • Machine Learning (Postgraduate)
  • Intelligent Systems (Postgraduate)

Journals


Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted or mass reproduced without the explicit permission of the copyright holder.


Conferences




[1]
N. Potha, E. Stamatatos, Dynamic Ensemble Selection for Author Verification, European Conference on Information Retrieval, pp. 102-115, Apr, 2019, Germany, Springer, Cham
 

Abstract
Author verification is a fundamental task in authorship analysis and associated with significant applications in humanities, cyber-security, and social media analytics. In some of the relevant studies, there is evidence that heterogeneous ensembles can provide very reliable solutions, better than any individual verification model. However, there is no systematic study examining the application of ensemble methods in this task. In this paper, we start from a large set of base verification models covering the main paradigms in this area and study how they can be combined to build an accurate ensemble. We propose a simple stacking ensemble as well as a dynamic ensemble selection approach that can use the most reliable base models for each verification case separately. The experimental results in ten benchmark corpora covering multiple languages and genres verify the suitability of ensembles for this task and demonstrate the effectiveness of our method, in some cases improving the best reported results by more than 10%.

[2]
D. Pritsos, A. Rocha, E. Stamatatos, Open-set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio, 41st European Conference on Information Retrieval (ECIR), pp. 3-11, Dec, 2019, Springer, http://dx.doi.org/10.1007/978-3-030-1571...
[3]
M. Potthast, P. Rosso, E. Stamatatos, B. Stein, A Decade of Shared Tasks in Digital Text Forensics at PAN, 41st European Conference on Information Retrieval (ECIR), pp. 291-300, Dec, 2019, Springer, http://dx.doi.org/10.1007/978-3-030-1571...
[4]
N. Potha, E. Stamatatos, Intrinsic Author Verification Using Topic Modeling, SETN '18, Jul, 2018, Patras, Greece, Proceedings of the 10th Hellenic Conference on Artificial Intelligence, 20
 

Abstract
Author verification is a fundamental task in authorship analysis and associated with important applications in humanities and forensics. In this paper, we propose the use of an intrinsic profile-based verification method that is based on latent semantic indexing (LSI). Our proposed approach is easy-to-follow and language independent. Based on experiments using benchmark corpora from the PAN shared task in author verification, we demonstrate that LSI is both more effective and more stable than latent Dirichlet allocation in this task. Moreover, LSI models are able to outperform existing approaches especially when multiple texts of known authorship are available per verification instance and all documents belong to the same thematic area and genre. We also study several feature types and similarity measures to be combined with the proposed topic models.

[5]
E. Stamatatos, F. Rangel, M. Tschuggnall, B. Stein, M. Kestemont, P. Rosso, M. Potthast, Overview of PAN 2018 - Author Identification, Author Profiling, and Author Obfuscation, Experimental IR Meets Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF Association, P. Bellot, et al., (eds), pp. 267-285, Dec, 2018, Springer, http://dx.doi.org/10.1007%2F978-3-319-98...
[6]
N. Potha, E. Stamatatos, An Improved Impostors Method for Authorship Verification, CLEF, Jones, G.J.F., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, Th., Cappellato, L., Ferro, N., (eds), pp. 138-144, Sep, 2017, Dublin, Ireland, Springer, Cham, https://doi.org/10.1007/978-3-319-65813-...
 

Abstract
Authorship verification has gained a lot of attention in recent years, mainly due to the focus of PAN@CLEF shared tasks. A verification method called Impostors, based on a set of external (impostor) documents and a random subspace ensemble, is one of the most successful approaches. Variations of this method gained top-performing positions in recent PAN evaluation campaigns. In this paper, we propose a modification of the Impostors method that focuses on both appropriate selection of impostor documents and enhanced comparison of impostor documents with the documents under investigation. Our approach achieves competitive performance on PAN corpora, outperforming previous versions of the Impostors method.

[7]
E. Stamatatos, Authorship Attribution Using Text Distortion, 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 1138–1149, Dec, 2017, ACL, http://www.aclweb.org/anthology/E17-1107
[8]
P. Rosso, F. Rangel, M. Potthast, E. Stamatatos, M. Tschuggnall, B. Stein, Overview of PAN'16 - New Challenges for Authorship Analysis: Cross-Genre Profiling, Clustering, Diarization, and Obfuscation, 7th International Conference of the CLEF Association, pp. 332-350, Dec, 2017, Springer, https://link.springer.com/chapter/10.100...
[9]
E. Stamatatos, M. Tschuggnall, B. Verhoeven, W. Daelemans, G. Specht, B. Stein, M. Potthast, Clustering by Authorship Within and Across Documents, CLEF 2016, pp. 691-715, Dec, 2016, CEUR-WS, http://ceur-ws.org/Vol-1609/16090691.pdf
[10]
E. D'hondt, C. Grouin, A. Neveol, E. Stamatatos, P. Zweigenbaum, Detection of Text Reuse in French Medical Corpora, 5th Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM), pp. 108-114, Dec, 2016, https://aclweb.org/anthology/W/W16/W16-5...
[11]
E. Stamatatos, W. Daelemans, B. Verhoeven, P. Juola, A. López López, M. Potthast, B. Stein, Overview of the Author Identification Task at PAN 2015, Working Notes Papers of the CLEF 2015 Evaluation Labs, L. Cappellato, N. Ferro, J. Gareth, E. San Juan, (eds), Dec, 2015, ceur-ws.org, http://ceur-ws.org/Vol-1391/inv-pap3-CR....
[12]
E. Stamatatos, M. Potthast, F. Rangel, P. Rosso, B. Stein, Overview of the PAN/CLEF 2015 Evaluation Lab, 6th International Conference of the CLEF Association (CLEF-2015), Josiane Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth J. F. Jones, Eric SanJuan, Linda Cappellato, Nicola Ferro, pp. 518-538, Dec, 2015, Springer, http://dx.doi.org/10.1007/978-3-319-2402...
[13]
D. Pritsos, E. Stamatatos, The Impact of Noise in Web Genre Identification, 6th International Conference of the CLEF Association (CLEF-2015), pp. 268-273, Dec, 2015, Springer LNCS 9283, http://link.springer.com/chapter/10.1007...
[14]
N. Potha, E. Stamatatos, A Profile-based Method for Authorship Verification, 8th Hellenic Conference on Artificial Intelligence (SETN), pp. 313-326, Dec, 2014, Springer, http://dx.doi.org/10.1007/978-3-319-0706...
[15]
E. Stamatatos, W. Daelemans, B. Verhoeven, B. Stein, M. Potthast, P. Juola, M.A. Sánchez-Pérez, A. Barrón-Cedeño, Overview of the Author Identification Task at PAN 2014, Working Notes for CLEF 2014 Conference, L. Cappellato, N. Ferro, M. Halvey, and W. Kraaij, (eds), pp. 877-897, Dec, 2014, ceur-ws.org, http://ceur-ws.org/Vol-1180/CLEF2014wn-P...
[16]
M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling, 5th International Conference of the CLEF Initiative, pp. 268-299, Dec, 2014, Springer, http://dx.doi.org/10.1007/978-3-319-1138...
[17]
D. Pritsos, E. Stamatatos, The Impact of Noise in Web Genre Identification, 6th International Conference of the CLEF Association, Josiane Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth J. F. Jones, Eric SanJuan, Linda Cappellato, Nicola Ferro, (eds), pp. 268-273, Dec, 2014, Springer, http://dx.doi.org/10.1007/978-3-319-2402...
[18]
N. Pappas, G. Katsimpras, E. Stamatatos, Distinguishing the Popularity between Topics: A System for Up-to-Date Opinion Retrieval and Mining in the Web, 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2013), pp. 197-209, Dec, 2013, Springer LNCS
 

Abstract
The constantly increasing amount of opinionated texts found on the Web has had a significant impact on the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies demonstrates that it is appropriate for opinion comparison between topics, since it provides useful indications of popularity based on a relatively small number of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.

[19]
D. Pritsos, E. Stamatatos, Open-Set Classification for Automated Genre Identification, Advances in Information Retrieval - 35th European Conference on IR Research (ECIR 2013), pp. 207-217, Dec, 2013, Springer LNCS
 

Abstract
Automated Genre Identification (AGI) of web pages is a problem of increasing importance since web genre (e.g. blog, news, eshops, etc.) information can enhance modern Information Retrieval (IR) systems. The state-of-the-art in this field considers AGI as a closed-set classification problem where a variety of web page representations and machine learning models have been intensively studied. In this paper, we study AGI as an open-set classification problem which better formulates the real world conditions of exploiting AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined, one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.

[20]
G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, L. Chanona-Hernández, Syntactic Dependency-Based N-grams: More Evidence of Usefulness in Classification, 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2013), pp. 13-24, Dec, 2013, Springer LNCS
 

Abstract
The paper introduces and discusses a concept of syntactic n-grams (sn-grams) that can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams allow bringing syntactic knowledge into machine learning methods. Still, prior parsing is necessary for their construction. We applied sn-grams in the task of authorship attribution for corpora of three and seven authors with very promising results.

[21]
K. Prokopiou, E. Kavallieratou, E. Stamatatos, An Image Processing Self-training System for Ruling Line Removal Algorithms, 18th Int. Conf. on Digital Signal Processing (DSP), Dec, 2013, IEEE
 

Abstract
Ruling line removal is an important pre-processing step in document image processing. Several algorithms have been proposed for this task. However, it is important to be able to take full advantage of the existing algorithms by adapting them to the specific properties of a document image collection. In this paper, a system is presented that is appropriate for fine-tuning the parameters of ruling line removal algorithms, or appropriately adapting them to a specific document image collection, in order to improve the results. The application of our method to an existing line removal algorithm is presented.

[22]
M. Potthast, M. Hagen, T. Gollub, M. Tippmann, J. Kiesel, P. Rosso, E. Stamatatos, B. Stein, Overview of the 5th International Competition on Plagiarism Detection, CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, P. Forner, R. Navigli, and D. Tufis, (eds), Dec, 2013
 

Abstract
This paper overviews 18 plagiarism detectors that have been evaluated within the fifth international competition on plagiarism detection at PAN 2013. We report on their performances for the two tasks of external plagiarism detection, source retrieval and text alignment. Furthermore, we continue last year’s initiative to invite software submissions instead of run submissions, and re-evaluate this year’s submissions on last year’s evaluation corpora and vice versa, thus demonstrating the benefits of software submissions in terms of reproducibility.

[23]
F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, G. Inches, Overview of the Author Profiling Task at PAN 2013, CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, P. Forner, R. Navigli, and D. Tufis, (eds), Dec, 2013
 

Abstract
This overview presents the framework and results for the Author Profiling task at PAN 2013. We describe in detail the corpus and its characteristics, and the evaluation framework we used to measure the participants' performance in solving the problem of identifying age and gender from anonymous texts. Finally, the approaches of the 21 participants and their results are described.

[24]
P. Juola, E. Stamatatos, Overview of the Author Identification Task at PAN 2013, CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers, P. Forner, R. Navigli, and D. Tufis, (eds), Dec, 2013
 

Abstract
The author identification task at PAN-2013 focuses on author verification where, given a set of documents by a single author and a questioned document, the problem is to determine if the questioned document was written by that particular author or not. In this paper we present the evaluation setup, the performance measures, the new corpus we built for this task covering three languages, and the evaluation results of the 18 participant teams that submitted their software. Moreover, we survey the characteristics of the submitted approaches and show that a very effective meta-model can be formed based on the output of the participant methods.

[25]
T. Gollub, M. Potthast, A. Beyer, M. Busse, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, Recent Trends in Digital Text Forensics and its Evaluation: Plagiarism Detection, Author Identification, and Author Profiling, 4th Int. Conf. of the CLEF Initiative: Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pp. 282-301, Dec, 2013, Springer LNCS 8138
 

Abstract
This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN 13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the altogether 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the software itself and ran it on a computer cluster at our site. As the evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. In addition to the achieved evaluation results, a major achievement of our lab is that we now have the largest collection of state-of-the-art approaches with regard to the mentioned tasks for further analysis at our disposal.

[26]
N. Pappas, G. Katsimpras, E. Stamatatos, Extracting Informative Textual Parts from Web Pages Containing User-Generated Content, 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW), Dec, 2012, ACM
 

Abstract
The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages. The segmentation of web pages and noise (non-informative segment) removal are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled separately or are handled together without emphasizing the diversity of the web corpora and the web page type detection. We present a unified approach that is able to provide robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content (News, Blogs, Discussions). Based on a human annotated corpus consisting of diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm, and we examine its effectiveness in extracting the informative textual parts and its usage as a rule-based classifier for web page type detection in a realistic web setting.

[27]
N. Pappas, G. Katsimpras, E. Stamatatos, An Agent-Based Focused Crawling Framework for Topic- and Genre-Related Web Document Discovery, 24th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2012), pp. 508-515, Dec, 2012
 

Abstract
The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retrieve topic- and genre-related web documents. Starting from a simple topic query, a set of focused crawler agents explore in parallel topic-specific web paths using dynamic seed URLs that belong to certain web genres and are collected from web search engines. The agents make use of an internal mechanism that weighs topic and genre relevance scores of unvisited web pages. They are able to adapt to the properties of a given topic by modifying their internal knowledge during search, handle ambiguous queries, ignore irrelevant pages with respect to the topic and retrieve collaboratively topic-relevant web pages. We performed an experimental study to evaluate the behavior of the agents for a variety of topic queries demonstrating the benefits and the capabilities of our framework.

[28]
E. Chatzicharalampous, G. Frantzeskou, E. Stamatatos, Author Identification in Imbalanced Sets of Source Code Samples, 24th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2012), pp. 790-797, Dec, 2012
 

Abstract
Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.

[29]
G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, L. Chanona-Hernández, Syntactic Dependency-Based N-grams as Classification Features, Advances in Computational Intelligence - 11th Mexican International Conference on Artificial Intelligence (MICAI 2012), pp. 1-11, Dec, 2012, Springer LNCS
[30]
E. Stamatatos, Plagiarism Detection Based on Structural Information, 20th ACM Conference on Information and Knowledge Management (CIKM-11), pp. 1221-1230, Dec, 2011, ACM
 

Abstract
In this paper a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses mainly content terms to represent documents, the proposed method is based on structural information provided by occurrences of a small list of stopwords (i.e., very frequent words). We show that stopword n-grams are able to capture local syntactic similarities between suspicious and original documents. Moreover, an algorithm for detecting the exact boundaries of plagiarized and source passages is proposed. Experimental results on a publicly-available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified by replacing most of the words or phrases with synonyms to hide the similarity with the source documents.

[31]
I. Kourtis, E. Stamatatos, Author Identification Using Semi-supervised Learning, 5th Int. Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN-11), Dec, 2011
 

Abstract
Author identification models fall into two major categories according to the way they handle the training texts: profile-based models produce one representation per author while instance-based models produce one representation per text. In this paper, we propose an approach that combines two well-known representatives of these categories, namely the Common n-Grams method and a Support Vector Machine classifier based on character n-grams. The outputs of these classifiers are combined to enrich the training set with additional documents in a repetitive semi-supervised procedure inspired by the co-training algorithm. The evaluation results on closed-set author identification are encouraging, especially when the set of candidate authors is large.

[32]
E. Stamatatos, K. Stergiou, Learning how to propagate using random probing, 6th International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (CP-AI-OR 2009), Dec, 2009
 

Abstract
In constraint programming there are often many choices regarding the propagation method to be used on the constraints of a problem. However, simple constraint solvers usually only apply a standard method, typically (generalized) arc consistency, on all constraints throughout search. Advanced solvers additionally allow for the modeler to choose among an array of propagators for certain (global) constraints. Since complex interactions exist among constraints, deciding in the modelling phase which propagation method to use on given constraints can be a hard task that ideally we would like to free the user from. In this paper we propose a simple technique towards the automation of this task. Our approach exploits information gathered from a random probing preprocessing phase to automatically decide on the propagation method to be used on each constraint. As we demonstrate, data gathered through probing allows for the solver to accurately differentiate between constraints that offer little pruning as opposed to ones that achieve many domain reductions, and also to detect constraints and variables that are amenable to certain propagation methods. Experimental results from an initial evaluation of the proposed method on binary CSPs demonstrate the benefits of our approach.

[33]
E. Stamatatos, Intrinsic Plagiarism Detection Using Character n-gram Profiles, 3rd Int. Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN-09), Dec, 2009
 

Abstract
The task of intrinsic plagiarism detection deals with cases where no reference corpus is available and it is exclusively based on stylistic changes or inconsistencies within a given document. In this paper a new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity measure originally proposed for author identification. In addition, we propose a set of heuristic rules that attempt to detect plagiarism-free documents and plagiarized passages, as well as to reduce the effect of irrelevant style changes within a document. The proposed approach is evaluated on the recently-available corpus of the 1st Int. Competition on Plagiarism Detection with promising results.

[34]
S. Plakias, E. Stamatatos, Author Identification Using a Tensor Space Representation, 18th European Conference on Artificial Intelligence (ECAI), pp. 833-834, Dec, 2008
 

Abstract
Author identification is a text categorization task with applications in intelligence, criminal law, computer forensics, etc. Usually, in such cases there is a shortage of training texts. In this paper, we propose the use of second order tensors for representing texts for this problem, in contrast to the traditional vector space model. Based on a generalization of the SVM algorithm that can handle tensors, we explore various methods for filling the matrix of features taking into account that similar features should be placed in the same neighborhood. To this end, we propose a frequency-based metric. Experiments on a corpus controlled for genre and topic and with a variable amount of training texts show that the proposed approach is more effective than traditional vector-based SVM when only a limited amount of training texts is used.

[35]
S. Plakias, E. Stamatatos, Tensor Space Models for Authorship Identification, 5th Hellenic Conference on Artificial Intelligence (SETN), pp. 239-249, Dec, 2008
 

Abstract
Authorship identification can be viewed as a text categorization task. However, in this task the most frequent features appear to be the most important discriminators, there is usually a shortage of training texts, and the training texts are rarely evenly distributed over the authors. To cope with these problems, we propose tensors of second order for representing the stylistic properties of texts. Our approach requires the calculation of far fewer parameters in comparison to the traditional vector space representation. We examine various methods for building appropriate tensors taking into account that similar features should be placed in the same neighborhood. Based on an existing generalization of SVM able to handle tensors, we perform experiments on corpora controlled for genre and topic and show that the proposed approach can effectively handle cases where only limited training texts are available.

[36]
I. Kanaris, E. Stamatatos, Webpage Genre Identification Using Variable-length Character n-grams, 19th IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI), Dec, 2007
 

Abstract
An important factor for discriminating between webpages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers, etc.). Webpage genre identification has a great potential in information retrieval since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of webpages including word and html-tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent html-tags. Based on two benchmark corpora, we present webpage genre identification experiments and improve the best reported results in both cases.

[37]
E. Stamatatos, Author Identification Using Imbalanced and Limited Training Texts, 4th International Workshop on Text-based Information Retrieval, Dec, 2007
 

Abstract
This paper deals with the problem of author identification. The Common N-Grams (CNG) method [6] is a language-independent profile-based approach with good results in many author identification experiments so far. A variation of this approach is presented based on new distance measures that are quite stable for large profile length values. Special emphasis is given to the degree to which the effectiveness of the method is affected by the available training text samples per author. Experiments based on text samples on the same topic from the Reuters Corpus Volume 1 are presented using both balanced and imbalanced training corpora. The results show that CNG with the proposed distance measures is more accurate when only limited training text samples are available, at least for some of the candidate authors, a realistic condition in author identification problems.

[38]
G. Frantzeskou, E. Stamatatos, S. Gritzalis, S. K. Katsikas, Source Code Authorship Analysis using N-grams, AIAI 2006 3rd IFIP Conference on Artificial Intelligence Applications and Innovations, M. Bramer, I. Maglogiannis, (eds), pp. 508-515, Jun, 2006, Athens, Greece, Springer, https://www.utica.edu/academic/institute...
 

Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of a major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming-language (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.

[39]
G. Frantzeskou, E. Stamatatos, S. Gritzalis, S. K. Katsikas, Effective Identification of Source Code Authors Using Byte-Level Information, 28th International Conference on Software Engineering ICSE 2006 - Emerging Results Track, B. Cheng, B. Shen, (eds), pp. 893-896, May, 2006, Shanghai, China, ACM Press, http://dl.acm.org/ft_gateway.cfm?id=1134...
 

Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of a major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author's style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.

J. Houvardas, E. Stamatatos, N-gram Feature Selection for Authorship Identification, 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications (AIMSA), J. Euzenat, and J. Domingue, (eds), pp. 77-86, Dec, 2006,
 

Abstract
Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to representing text for stylistic purposes since they are able to capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved using simple text pre-processing.

E. Stamatatos, Ensemble-based Author Identification Using Character N-grams, 3rd Int. Workshop on Text-based Information Retrieval (TIR), pp. 41-46, Dec, 2006,
 

Abstract
This paper deals with the problem of identifying the most likely author of a text. Several thousands of character n-grams, rather than lexical or syntactic information, are used to represent the style of a text. Thus, the author identification task can be viewed as a single-label multiclass classification problem with a high-dimensional feature space and sparse data. In order to cope with such properties, we propose a suitable learning ensemble based on feature set subspacing. Performance results on two well-tested benchmark text corpora for author identification show that this classification scheme is quite effective, significantly improving the best reported results so far. Additionally, this approach proves to be quite stable in comparison with support vector machines when using a limited number of training texts, a condition usually met in this kind of problem.

E. Stamatatos, Text Sampling and Re-Sampling for Imbalanced Author Identification Cases, 17th European Conference on Artificial Intelligence (ECAI), Dec, 2006,
 

Abstract
Authorship identification can be seen as a single-label multi-class text categorization problem. Very often, there are extremely few training texts at least for some of the candidate authors. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into sub-samples according to the size of the class. Hence, minority classes can be segmented into many short samples and majority classes into fewer and longer samples. Moreover, we explore text re-sampling in order to construct a training set according to a desirable distribution over the classes. Essentially, text re-sampling can be viewed as providing new synthetic data that increase the training size of a class. Based on a corpus of newswire stories in English, we present authorship identification experiments on various multi-class imbalanced cases.

I. Kanaris, K. Kanaris, E. Stamatatos, Spam Detection Using Character N-grams, 4th Hellenic Conference on AI (SETN 2006): Advances in Artificial Intelligence, G. Antoniou, G. Potamias, C. Spyropoulos, D. Plexousakis, (eds), pp. 95–104, Dec, 2006,
 

Abstract
This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation which avoids the sparse data problem that arises in n-grams on the word-level. Moreover, it is language-independent and does not require any lemmatizer or 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both binary and term-frequency representations achieve high precision rates while maintaining recall at an equally high level, which is a crucial factor for anti-spam filters, a cost-sensitive application.

E. Kavallieratou, E. Stamatatos, Adaptive Binarization of Historical Document Images, 18th Int. Conf. on Pattern Recognition, pp. 742-745, Dec, 2006,
 

Abstract
In this paper, we present a binarization technique specifically designed for historical document images. Existing methods for this problem focus on either finding a good global threshold or adapting the threshold for each area so as to remove smear, strains, uneven illumination, etc. We propose a hybrid approach that first applies a global thresholding method and then identifies the image areas that are more likely to still contain noise. Each of these areas is re-processed separately to achieve better quality of binarization. We evaluate the proposed approach for different kinds of degradation problems. The results show that our method can handle hard cases while documents already in good condition are not affected drastically.

E. Kavallieratou, E. Stamatatos, Improving the Quality of Degraded Document Images, 2nd IEEE Int. Conf. on Document Image Analysis for Libraries (DIAL), pp. 340-349, Dec, 2006,
 

Abstract
It is common for libraries to provide public access to historical and ancient document image collections. Such document images often require specialized processing in order to remove background noise and become more legible. In this paper, we propose a hybrid binarization approach for improving the quality of old documents using a combination of global and local thresholding. First, a global thresholding technique specifically designed for old document images is applied to the entire image. Then, the image areas that still contain background noise are detected and the same technique is re-applied to each area separately. Hence, we achieve better adaptability of the algorithm in cases where various kinds of noise coexist in different areas of the same image while avoiding the computational and time cost of applying local thresholding to the entire image. Evaluation results based on a collection of historical document images indicate that the proposed approach is effective in removing background noise and improving the quality of degraded documents while documents already in good condition are not affected.

G. Frantzeskou, E. Stamatatos, S. Gritzalis, Supporting the Digital Crime Investigation Process: Effective Discrimination of Source Code Authors based on Byte-level Information, ICETE‘2005 International Conference on eBusiness and Telecommunication Networks – Security and Reliability in Information Systems and Networks Track, pp. 283-290, Oct, 2005, UK, Springer, http://link.springer.com/content/pdf/10....
 

Abstract
Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of a major benefit, such as tracing the source of code left in the system after a cyber attack, authorship disputes, proof of authorship in court, etc. In this paper, we present our approach which is based on byte-level n-gram profiles and is an extension of a method that has been successfully applied to natural language text authorship attribution. We propose a simplified profile and a new similarity measure which is less complicated than the algorithm followed in text authorship attribution and seems more suitable for source code identification since it is better able to deal with very small training sets. Experiments were performed on two different data sets, one with programs written in C++ and the second with programs written in Java. Unlike the traditional language-dependent metrics used by previous studies, our approach can be applied to any programming language with no additional cost. The presented accuracy rates are much better than the best reported results for the same data sets.

G. Frantzeskou, E. Stamatatos, S. Gritzalis, Source Code Authorship Analysis using N-grams, 7th Biennial Conference on Forensic Linguistics, Jul, 2005, Cardiff, UK, http://link.springer.com/content/pdf/10....
 

Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of a major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.

E. Stamatatos, E. Kavallieratou, Music Performer Verification Based on Learning Ensembles, Methods and Applications of Artificial Intelligence, G. Vouros, (ed), pp. 122-131, Dec, 2004,
 

Abstract
In this paper, the problem of music performer verification is introduced. Given a certain performance of a musical piece and a set of candidate pianists, the task is to examine whether or not a particular pianist is the actual performer. A database of 22 pianists playing pieces by F. Chopin on a computer-controlled piano is used in the presented experiments. An appropriate set of features that captures the idiosyncrasies of music performers is proposed. Well-known machine learning techniques for constructing learning ensembles are applied and remarkable results are described in verifying the actual pianist, a very difficult task even for human experts.

E. Kavallieratou, E. Stamatatos, Discrimination of Machine-Printed from Handwritten Text Using Simple Structural Characteristics, 17th International Conference on Pattern Recognition (ICPR 2004), Dec, 2004,
 

Abstract
In this paper, we present a trainable approach to discriminate between machine-printed and handwritten text. An integrated system able to localize text areas and split them into text-lines is used. A set of simple and easy-to-compute structural characteristics that capture the differences between machine-printed and handwritten text-lines is introduced. Experiments on document images taken from IAM-DB and GRUHD databases show a remarkable performance of the proposed approach that requires minimal training data.

E. Stamatatos, G. Widmer, Music Performer Recognition Using an Ensemble of Simple Classifiers, 15th European Conference on Artificial Intelligence (ECAI’02), pp. 335-339, Dec, 2002,
 

Abstract
This paper addresses the problem of identifying the most likely music performer, given a set of performances of the same piece by a number of skilled candidate pianists. We propose a set of features for representing the stylistic characteristics of a music performer. A database of piano performances of 22 pianists playing two pieces by F. Chopin is used in the presented experiments. Due to the limitations of the training set size and the characteristics of the input features, we propose an ensemble of simple classifiers derived by both subsampling the training set and subsampling the input features. Preliminary experiments show that the resulting ensemble is able to efficiently cope with this difficult musical task, displaying a level of accuracy unlikely to be matched by human listeners (under similar conditions).

E. Stamatatos, Quantifying the Differences Between Music Performers: Score vs. Norm, International Computer Music Conference (ICMC’02), pp. 376-382, Dec, 2002,
 

Abstract
In this study, a comparison of features for discriminating between different music performers playing the same piece is presented. Based on a series of statistical experiments on a data set of piano pieces played by 22 performers, it is shown that the deviation from the performance norm (average performance) is better able to reveal the performers’ individualities in comparison to the deviation from the printed score. In the framework of automatic music performer recognition, the norm-based features prove to be very accurate in intra-piece tests (training and test set taken from the same piece) and very stable in inter-piece tests (training and test sets taken from different pieces). Moreover, it is empirically demonstrated that the average performance is at least as effective as the best of the constituent individual performances while ‘extreme’ performances have the lowest discriminatory potential when used as norm.

E. Stamatatos, A Computational Model for Discriminating Music Performers, MOSART Workshop on Current Research Directions in Computer Music, pp. 65-69, Dec, 2001,
 

Abstract
In this study, a computational model that aims at the automatic discrimination of different human music performers playing the same piece is presented. The proposed model is based on the note level and does not require any deep (e.g., structural or harmonic, etc.) analysis. A set of measures that attempts to capture both the style of the author and the style of the piece is introduced. The presented approach has been applied to a database of piano sonatas by W.A. Mozart performed by both a French and a Viennese pianist with very encouraging preliminary results.

E. Stamatatos, N. Fakotakis, G. Kokkinakis, A Practical Chunker for Unrestricted Text, Natural Language Processing, D. Christodoulakis, (ed), pp. 139-150, Dec, 2000,
 

Abstract
In this paper we present a practical approach to text chunking for unrestricted Modern Greek text that is based on multiple-pass parsing. Two versions of this chunker are proposed: one based on a large lexicon and one based on minimal resources. In the latter case, the morphological analysis is performed using exclusively two small lexicons containing closed-class words and common suffixes of Modern Greek words. We give comparative performance results on the basis of a corpus of unrestricted text and show that very good results can be obtained by omitting the large and complicated resources. Moreover, the considerable time cost introduced by the use of the large lexicon indicates that the minimal-resources chunker is the best solution for a practical application that requires rapid response and can tolerate less than perfect parsing results.

E. Stamatatos, N. Fakotakis, G. Kokkinakis, Text Genre Detection Using Common Word Frequencies, 18th Int. Conf. on Computational Linguistics (COLING2000), pp. 808-814, Dec, 2000,
 

Abstract
In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language. Using as testing ground a part of the Wall Street Journal corpus, we show that the most frequent words of the British National Corpus, representing the most frequent words of the written English language, are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role in terms of accurate text categorization as well as when dealing with training data of limited size.

E. Kavallieratou, E. Stamatatos, N. Fakotakis, G. Kokkinakis, Handwritten Character Segmentation Using Transformation-Based Learning, 15th Int. Conf. on Pattern Recognition (ICPR2000), pp. 634-637, Dec, 2000,
 

Abstract
This paper presents a character segmentation algorithm for unconstrained cursive handwritten text. The transformation-based learning method and a simplified variation of it are used in order to automatically extract rules that detect the segment boundaries. Comparative experimental results are given for a collection of multi-writer handwritten words. The achieved accuracy in detecting segment boundaries exceeds 82%. Moreover, limited training data can provide very satisfactory results.

E. Stamatatos, N. Fakotakis, G. Kokkinakis, Automatic Authorship Attribution, 9th Conf. of the European Chapter of the Association for Computational Linguistics (EACL’99), pp. 158-164, Dec, 1999,
 

Abstract
In this paper we present an approach to automatic authorship attribution dealing with real-world (or unrestricted) text. Our method is based on the computational analysis of the input text using a text-processing tool. Besides the style markers relevant to the output of this tool we also use analysis-dependent style markers, that is, measures that represent the way in which the text has been processed. No word frequency counts, nor other lexically-based measures are taken into account. We show that the proposed set of style markers is able to distinguish texts of various authors of a weekly newspaper using multiple regression. All the experiments we present were performed using real-world text downloaded from the World Wide Web. Our approach is easily trainable and fully-automated requiring no manual text preprocessing nor sampling.

E. Stamatatos, N. Fakotakis, G. Kokkinakis, Automatic Extraction of Rules for Sentence Boundary Disambiguation, Workshop in Machine Learning in Human Language Technology, Advanced Course on Artificial Intelligence (ACAI’99), pp. 88-92, Dec, 1999,
 

Abstract
Transformation-based learning (TBL) is the most important machine learning theory aiming at the automatic extraction of rules based on already tagged corpora. However, the application of this theory to a certain application without taking into account the features that characterize this application may cause problems regarding the training time cost as well as the accuracy of the extracted rules. In this paper we present a variation of the basic idea of the TBL and we apply it to the extraction of the sentence boundary disambiguation rules in real-world text, a prerequisite for the vast majority of the natural language processing applications. We show that our approach achieves considerably higher accuracy results and, moreover, requires minimal training time in comparison to the traditional TBL.

E. Stamatatos, S. Michos, N. Fakotakis, G. Kokkinakis, A User-Assisted Business Letter Generator Dealing with Text’s Stylistic Variations, 9th IEEE Conference on Tools with Artificial Intelligence (ICTAI’97), pp. 182-189, Dec, 1997,
 

Abstract
This paper describes a user-assisted business letter generator that meets the ever-increasing demand for more flexible and modular letter generators which draw on explicit thematic models and are easily adaptable to specific user needs. Based on a detailed analysis of requirements and taking full advantage of the end users' feedback, the presented generator not only creates a business letter according to the user choices, but also refines it taking into consideration stylistic aspects like written style and tone.

E. Stamatatos, S. Michos, C. Patelodimou, N. Fakotakis, TRANSLIB: An Advanced Tool for Supporting Multilingual Access to Library Catalogues, 2nd Workshop on Multilinguality in Software Industry (MULSAIC’97), pp. 33-40, Dec, 1997,
 

Abstract
Language barriers present a major problem in the effectiveness of resource sharing and in common access to the resources of libraries. In this paper we present the TRANSLIB system, which stemmed from the integration of both new and already existing advanced multilingual information tools. By making use of some AI-based methods, this system takes full advantage of these resources in order to provide multilingual access to library catalogues. Among its striking features, it enables searching in multiple languages, multilingual presentation of the query results, and localization of the user interface. TRANSLIB has so far been tested on existing medium-sized bibliographic databases. Early evaluation results show a remarkable improvement in the search process and report high user-friendliness as well as easy and low-cost maintenance and upgrade of the system.

S. Michos, E. Stamatatos, N. Fakotakis, G. Kokkinakis, An Empirical Text Categorizing Computational Model Based on Stylistic Aspects, 8th IEEE Conference on Tools with Artificial Intelligence (ICTAI’96), pp. 71-77, Dec, 1996,
 

Abstract
The presented work is strongly motivated by the need to categorize unrestricted texts in terms of functional style (FS) in order to attain a satisfying outcome in style processing. Towards this aim, a three-level description of FS is given that comprises: (a) the basic categories of FS, (b) the main features that characterize each one of the above categories, and (c) the linguistic identifiers that act as style markers in texts for the identification of the above features. Special emphasis is put on the problems faced in the computational implementation of the aforementioned findings as well as on the selection of the most appropriate stylometrics (i.e., stylistic scores) to achieve better results on text categorization. This approach is language independent, empirically-driven, and can be used in various applications including grammar and style checking, natural language generation, style verification in real-world texts, and recognition of style shift between adjacent portions of text.

Books




Chapters in Books




[1]
E. Stamatatos, Universality of Stylistic Traits in Texts, chapter in: Creativity and Universality in Language, M. Degli Esposti, G. E. Altmann, and F. Pachet, (eds), pp. 143-155, 2016, Springer, https://link.springer.com/chapter/10.100...
[2]
G. Frantzeskou, S. MacDonell, E. Stamatatos, Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process, chapter in: Handbook of Research on Computational Forensics, Digital Crime, and Investigation, Chang-Tsun Li, (ed), pp. 470-495, 2010, IGI Global,
[3]
S. Michos, E. Stamatatos, N. Fakotakis, G. Kokkinakis, Categorising Texts by Using a Three-Level Functional Style Description, chapter in: Artificial Intelligence: Methodology, Systems, Applications - Frontiers in Artificial Intelligence and Applications, vol. 35, A. Ramsay, (ed), pp. 191-198, 1996, IOS Press,

Conferences Proceedings Editor




[1]
B. Stein, E. Stamatatos, M. Koppel, (eds), International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN-08), Jul, 2008, Patras
[2]
B. Stein, M. Koppel, E. Stamatatos, (eds), International Workshop on Plagiarism Analysis, Authorship Attribution, and Near-Duplicate Detection (PAN-07), Jul, 2007, Amsterdam