Abstract
A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast
majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, it is doubtful whether such a simple and low-level representation remains equally effective under realistic conditions, where some of the above factors cannot be kept stable. In this study, the robustness of
authorship attribution based on character n-gram features is tested under cross-genre and cross-topic conditions. In addition, the distribution of texts over the candidate authors varies in
training and test corpora to imitate real cases. Comparative results with another competitive text representation approach based on very frequent words show that character n-grams are better able to capture stylistic properties of text when there are significant differences between the training and test corpora. Moreover, a set of guidelines to tune an authorship attribution model according to the properties of training and test corpora is
provided.
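A minimal sketch of the kind of character n-gram representation examined above; the value of n and the profile size are illustrative choices, not the settings used in the study.

```python
# Illustrative sketch: a character n-gram profile as relative frequencies.
# The values of n and top_k are assumptions for demonstration only.
from collections import Counter

def char_ngram_profile(text, n=3, top_k=1000):
    """Return the top_k most frequent character n-grams with relative frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.most_common(top_k)}

profile = char_ngram_profile("the quick brown fox jumps over the lazy dog")
```

Such profiles can then be compared between training and test documents regardless of genre or topic, which is what makes the representation attractive in cross-genre settings.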
Abstract
In this paper we introduce and discuss a concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in the manner in which we
construct them, i.e., in what elements are considered neighbors. In the case of sn-grams, the neighbors are taken by following syntactic relations in syntactic trees, and not by taking words as they appear in a text; i.e., sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams allow syntactic knowledge to be brought into machine learning methods; still,
previous parsing is necessary for their construction. Sn-grams can be applied in
any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines, we used traditional n-grams of words, part-of-speech (POS) tags, and characters;
three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
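For illustration, the sketch below shows one simple reading of sn-grams as word n-grams collected along head-to-dependent paths of a dependency parse; spaCy and the en_core_web_sm model are used here only as stand-ins for the parser assumed by the method.

```python
# Illustrative sketch: sn-grams read as n-grams along head -> child paths
# of a dependency tree (spaCy is an assumed stand-in for the parser).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def sn_grams(sentence, n=2):
    """Collect word n-grams by walking head -> child arcs of the parse tree."""
    doc = nlp(sentence)
    grams = []

    def walk(token, path):
        path = path + [token.text]
        if len(path) >= n:
            grams.append(tuple(path[-n:]))
        for child in token.children:
            walk(child, path)

    for sent in doc.sents:
        walk(sent.root, [])
    return grams

print(sn_grams("The cat sat on the mat", n=2))
```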
Abstract
When writing source code, programmers have varying levels of freedom when it comes to the creation and use of identifiers. Do they habitually use the same identifiers, names that are different to those used by others? Is it then possible to tell who the author of a piece of code is by examining these identifiers?
Abstract
In this paper, a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries. Experimental results on a publicly-available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified and most of the words or phrases have been replaced with synonyms.
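A minimal sketch of the representation idea; the stopword list and the n-gram length are illustrative stand-ins, not the exact settings of the paper.

```python
# Illustrative sketch: keep only stopwords, in order of appearance,
# and represent the document by their n-grams.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it", "for",
             "with", "he", "be", "on", "that", "by", "at", "as", "not", "this"}

def stopword_ngrams(text, n=8):
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
```

Two passages that share long stopword n-grams are likely to share syntactic structure even when the content words have been replaced with synonyms.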
Abstract
Webpages are mainly distinguished by their topic (e.g., politics, sports etc.) and genre (e.g.,
blogs, homepages, e-shops, etc.). Automatic detection of webpage genre could considerably
enhance the ability of modern search engines to focus on the requirements of the user's
information need. In this paper, we present an approach to webpage genre detection based
on a fully-automated extraction of the feature set that represents the style of webpages.
The features we propose (character n-grams of variable length and HTML tags) are language-
independent and easily-extracted while they can be adapted to the properties of
the still evolving web genres and the noisy environment of the web. Experiments based
on two publicly-available corpora show that the performance of the proposed approach
is superior in comparison to previously reported results. It is also shown that character
n-grams are better features than words when the dimensionality increases while the binary
representation is more effective than the term-frequency representation for both feature
types. Moreover, we perform a series of cross-check experiments (e.g., training using a
genre palette and testing using a different genre palette as well as using the features
extracted from one corpus to discriminate the genres of the other corpus) to illustrate
the robustness of our approach and its ability to capture the general stylistic properties
of genre categories even when the feature set is not optimized for the given corpus.
Abstract
Authorship attribution supported by statistical or computational
methods has a long history starting from the 19th
century and is marked by the seminal study of Mosteller
and Wallace (1964) on the authorship of the disputed
“Federalist Papers.” During the last decade, this scientific
field has been developed substantially, taking advantage
of research advances in areas such as machine learning,
information retrieval, and natural language processing.
The plethora of available electronic texts (e.g., e-mail messages,
online forum messages, blogs, source code, etc.)
indicates a wide variety of applications of this technology,
provided it is able to handle short and noisy text
from multiple candidate authors. In this article, a survey
of recent advances of the automated approaches
to attributing authorship is presented, examining their
characteristics for both text representation and text classification.
The focus of this survey is on computational
requirements and settings rather than on linguistic or
literary issues. We also discuss evaluation methodologies
and criteria for authorship attribution studies and list
open questions that will attract future work in this area.
Abstract
The use of Source Code Author Profiles (SCAP) represents a new, highly accurate approach to source code authorship identification
that is, unlike previous methods, language independent. While accuracy is clearly a crucial requirement of any author identification
method, in cases of litigation regarding authorship, plagiarism, and so on, there is also a need to know why it is claimed that a piece
of code is written by a particular author. What is it about that piece of code that suggests a particular author? What features in the code make one author more likely than another? In this study, we describe a means of identifying the high-level features that contribute to source code authorship identification, using the SCAP method as a tool. A variety of features are considered for Java and Common Lisp, and the importance of each feature in determining authorship is measured through a sequence of experiments in which we remove one feature at a time. The results show that, for these programs, comments, layout features, and package-related naming influence classification accuracy, whereas user-defined naming, an obvious programmer-related feature, does not appear to influence accuracy. A comparison is also made between the relative feature contributions in programs written in the two languages.
Abstract
Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be
seen as a single-label multi-class text categorization problem. Very often, there are extremely few training texts at least for
some of the candidate authors or there is a significant variation in the text-length among the available training texts of the
candidate authors. Moreover, in this task there is usually no similarity between the distributions of training and test texts
over the classes, that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle
imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to
the size of the class, thus producing a fairer classification model. Hence, minority classes can be segmented into many short
samples and majority classes into fewer but longer samples. We explore text sampling methods in order to construct a training
set according to a desirable distribution over the classes. Essentially, by text sampling we provide new synthetic data
that artificially increase the training size of a class. Based on two text corpora of two languages, namely, newswire stories in
English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multiclass
imbalanced cases that reveal the properties of the presented methods.
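One way to realize the segmentation idea is sketched below: giving every author roughly the same number of samples means that authors with little text end up with short samples and authors with much text with long ones. The target number of samples is an assumed parameter, not a setting from the paper.

```python
# Illustrative sketch: split each author's training text into roughly the same
# number of samples, so sample length adapts to the amount of available text.
def segment_author_texts(texts_by_author, target_samples=20):
    samples = {}
    for author, texts in texts_by_author.items():
        words = " ".join(texts).split()
        size = max(1, len(words) // target_samples)  # less text -> shorter samples
        samples[author] = [" ".join(words[i:i + size])
                           for i in range(0, len(words), size)]
    return samples
```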
Abstract
Source code author identification deals with identifying the most likely author of a computer
program, given a set of predefined author candidates. There are several scenarios where
digital evidence of this kind plays a role in investigation and adjudication, such as code
authorship disputes, intellectual property infringement, tracing the source of code left in the
system after a cyber attack, and so forth. As in any identification task, the disputed program is
compared to undisputed, known programming samples by the predefined author candidates.
We present a new approach, called the SCAP (Source Code Author Profiles) approach, based
on byte-level n-gram profiles representing the source code author’s style. The SCAP method
extends a method originally applied to natural language text authorship attribution; we show
that an n-gram approach also suits the characteristics of source code analysis. The
methodological extension includes a simplified profile and a less complicated, but more
effective, similarity measure. Experiments on data sets of different programming-language
(Java or C++) and commented/commentless code demonstrate the effectiveness of these
extensions. The SCAP approach is programming-language independent. Moreover, the SCAP
approach deals surprisingly well with cases where only a limited amount of very short
programs per programmer is available for training. Finally, it is also demonstrated that SCAP
effectiveness persists even in the absence of comments in the source code, a condition
usually met in cyber-crime cases.
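A minimal sketch of the profile idea: byte-level n-gram profiles compared by the size of their intersection. The profile length, n, and the function names are illustrative assumptions.

```python
# Illustrative sketch: byte-level n-gram profiles and a similarity based on
# how many n-grams two profiles share (a simplified intersection measure).
from collections import Counter

def byte_ngram_profile(source_code, n=6, profile_size=1500):
    data = source_code.encode("utf-8", errors="ignore")
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {g for g, _ in grams.most_common(profile_size)}

def profile_intersection(profile_a, profile_b):
    """Similarity = number of byte n-grams shared by the two profiles."""
    return len(profile_a & profile_b)
```

In this sketch, a questioned program would be attributed to the candidate whose training profile shares the most n-grams with it.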
Abstract
The goal of the workshop was to bring together experts and prospective researchers around the exciting
and future-oriented topic of plagiarism analysis, authorship identification, and high similarity
search. This topic receives increasing attention, which results, among other things, from the fact that
information about nearly any subject can be found on the World Wide Web.
Abstract
The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
Abstract
Authorship attribution can assist the criminal investigation procedure as well as cybercrime analysis.
This task can be viewed as a single-label multi-class text categorization problem. Given that the style
of a text can be represented by mere word frequencies selected in a language-independent manner,
suitable machine learning techniques able to deal with high dimensional feature spaces and sparse
data can be directly applied to solve this problem. This paper focuses on classifier ensembles based
on feature set subspacing. It is shown that an effective ensemble can be constructed using
exhaustive disjoint subspacing, a simple method producing many poor but diverse base classifiers.
The simple model can be enhanced by a variation of the technique of cross-validated committees
applied to the feature set. Experiments on two benchmark text corpora demonstrate the effectiveness
of the presented method, improving previously reported results, and compare it to support vector
machines, an alternative suitable machine learning approach to authorship attribution.
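A rough sketch of exhaustive disjoint subspacing: the feature set is partitioned into disjoint subsets, a weak base classifier is trained on each subset, and the predictions are combined by majority voting. The base classifier and the subset size below are illustrative assumptions.

```python
# Illustrative sketch of a disjoint feature-subspacing ensemble.
# X is assumed to be a dense array of non-negative word frequencies.
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB

def train_disjoint_subspace_ensemble(X, y, subset_size=50):
    models = []
    for start in range(0, X.shape[1], subset_size):
        cols = np.arange(start, min(start + subset_size, X.shape[1]))
        models.append((cols, MultinomialNB().fit(X[:, cols], y)))
    return models

def predict_ensemble(models, X):
    votes = [clf.predict(X[:, cols]) for cols, clf in models]
    return [Counter(v).most_common(1)[0][0] for v in zip(*votes)]
```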
Abstract
This paper addresses the problem of identifying the most likely music performer, given a set of performances of the same piece by a number of skilled candidate pianists. We propose a set of features for representing the stylistic characteristics of a music performer, introducing norm-based features that are relevant to the average performance. A database of piano performances of 22 pianists playing two pieces by F. Chopin is used in the presented experiments. Due to the limitations of the training set size and the characteristics of the input features we propose an ensemble of simple classifiers derived by both subsampling the training set and subsampling the input features. The presented experiments show that the proposed features are able to quantify the differences between music performers. The proposed ensemble can efficiently cope with multi-class music performer recognition under inter-piece conditions, a difficult musical task, displaying a level of accuracy unlikely to be matched by human listeners (under similar conditions). Moreover, it is empirically demonstrated that the average performance is at least as effective as the best of the constituent individual performances while ‘extreme’ performances have the lowest discriminatory potential when used as norm.
Abstract
The most important approaches to computer-assisted authorship attribution are exclusively
based on lexical measures that either represent the vocabulary richness of the author or simply
comprise frequencies of occurrence of common words. In this paper we present a fully-automated
approach to the identification of the authorship of unrestricted text that excludes any lexical measure.
Instead we adapt a set of style markers to the analysis of the text performed by an already existing
natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and
analysis-level measures. The latter represent the way in which the text has been analyzed. The
presented experiments on a Modern Greek newspaper corpus show that the proposed set of style
markers is able to distinguish reliably the authors of a randomly-chosen group and performs better
than a lexically-based approach. However, the combination of these two approaches provides the
most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of
the training data as well as tests dealing with the significance of the proposed set of style markers.
Abstract
The two main factors that characterize a text are its content and its style, and both can be used
as a means of categorization. In this paper we present an approach to text categorization in
terms of genre and author for Modern Greek. In contrast to previous stylometric approaches,
we attempt to take full advantage of existing natural language processing (NLP) tools. To this
end, we propose a set of style markers including analysis-level measures that represent the way in
which the input text has been analyzed and capture useful stylistic information without additional
cost. We present a set of small-scale but reasonable experiments in text genre detection, author
identification, and author verification tasks and show that the proposed method performs better
than the most popular distributional lexical measures, i.e., functions of vocabulary richness and
frequencies of occurrence of the most frequent words. All the presented experiments are based on
unrestricted text downloaded from the World Wide Web without any manual text preprocessing
or text sampling. Various performance issues regarding the training set size and the significance of
the proposed style markers are discussed. Our system can be used in any application that requires
fast and easily adaptable text categorization in terms of stylistically homogeneous categories.
Moreover, the procedure of defining analysis-level markers can be followed in order to extract
useful stylistic information using existing text processing tools.
Abstract
Language barriers present a major problem in the effectiveness of resource sharing and in
common access to the resources of libraries. In this paper we present the TRANSLIB
system, which consists of an integration of both new and existing multilingual information
tools. This system takes full advantage of some AI-based methods in order to provide
multilingual access to library catalogues. Its main features include functionalities for
searching in multiple languages, multilingual presentation of the query results, and
localization of the user interface. TRANSLIB has currently been tested in existing
medium-sized bibliographic databases. Evaluation results show a remarkable improvement
in the search process and report high user-friendliness and easy and low-cost maintenance
and upgrade of the system.
Abstract
Author verification is a fundamental task in authorship analysis and associated with significant applications in humanities, cyber-security, and social media analytics. In some of the relevant studies, there is evidence that heterogeneous ensembles can provide very reliable solutions, better than any individual verification model. However, there is no systematic study of examining the application of ensemble methods in this task. In this paper, we start from a large set of base verification models covering the main paradigms in this area and study how they can be combined to build an accurate ensemble. We propose a simple stacking ensemble as well as a dynamic ensemble selection approach that can use the most reliable base models for each verification case separately. The experimental results in ten benchmark corpora covering multiple languages and genres verify the suitability of ensembles for this task and demonstrate the effectiveness of our method, in some cases improving the best reported results by more than 10%.
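A minimal sketch of the stacking idea, assuming the base verifiers have already produced one score per verification problem; the logistic-regression combiner and the 0.5 threshold are illustrative choices.

```python
# Illustrative sketch: the scores of several base verifiers become the features
# of a meta-classifier that outputs the final same-author decision.
from sklearn.linear_model import LogisticRegression

def train_stacking_meta(base_scores_train, y_train):
    """base_scores_train: array (n_problems, n_base_models); y_train: 1 = same author."""
    return LogisticRegression().fit(base_scores_train, y_train)

def verify(meta, base_scores_test, threshold=0.5):
    return meta.predict_proba(base_scores_test)[:, 1] >= threshold
```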
Abstract
Author verification is a fundamental task in authorship analysis and associated with important applications in humanities and forensics. In this paper, we propose the use of an intrinsic profile-based verification method that is based on latent semantic indexing (LSI). Our proposed approach is easy-to-follow and language independent. Based on experiments using benchmark corpora from the PAN shared task in author verification, we demonstrate that LSI is both more effective and more stable than latent Dirichlet allocation in this task. Moreover, LSI models are able to outperform existing approaches especially when multiple texts of known authorship are available per verification instance and all documents belong to the same thematic area and genre. We also study several feature types and similarity measures to be combined with the proposed topic models.
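A rough sketch of an LSI-style verification score, assuming character trigram tf-idf features; the number of components and any acceptance threshold are illustrative rather than the paper's settings.

```python
# Illustrative sketch: project known and questioned documents into an LSI space
# and score the verification case by their average cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsi_verification_score(known_docs, questioned_doc, n_components=50):
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    X = vec.fit_transform(known_docs + [questioned_doc])
    k = max(1, min(n_components, X.shape[0] - 1, X.shape[1] - 1))
    Z = TruncatedSVD(n_components=k).fit_transform(X)
    return float(cosine_similarity(Z[:-1], Z[-1:]).mean())  # compare to a tuned threshold
```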
Abstract
Authorship verification has gained a lot of attention during the last years mainly due to the focus of PAN@CLEF shared tasks. A verification method called Impostors, based on a set of external (impostor) documents and a random subspace ensemble, is one of the most successful approaches. Variations of this method gained top-performing positions in recent PAN evaluation campaigns. In this paper, we propose a modification of the Impostors method that focuses on both appropriate selection of impostor documents and enhanced comparison of impostor documents with the documents under investigation. Our approach achieves competitive performance on PAN corpora, outperforming previous versions of the Impostors method.
Abstract
The constantly increasing amount of opinionated text found on the Web has had a significant impact on the development of sentiment analysis. So far, the majority of comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies demonstrates that it is appropriate for opinion comparison between topics, since it provides useful indications of popularity based on a relatively small number
of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.
Abstract
Automated Genre Identification (AGI) of web pages is a
problem of increasing importance since web genre (e.g., blogs, news, e-shops,
etc.) information can enhance modern Information Retrieval (IR)
systems. The state-of-the-art in this field considers AGI as a closed-set
classification problem where a variety of web page representation and machine
learning models have been intensively studied. In this paper, we study
AGI as an open-set classification problem which better formulates the
real world conditions of exploiting AGI in practice. Focusing on the use
of content information, different text representation methods (words and
character n-grams) are tested. Moreover, two classification methods are
examined: one-class SVM learners, used as a baseline, and an ensemble
of classifiers based on random feature subspacing, originally proposed for
author identification. It is demonstrated that very high precision can be
achieved in open-set AGI while recall remains relatively high.
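A minimal sketch of the one-class baseline mentioned above, assuming one OneClassSVM per known genre; a page rejected by every model falls outside the genre palette. Feature settings and nu are illustrative assumptions.

```python
# Illustrative sketch: open-set genre identification with one one-class model per genre.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

def train_open_set_models(pages_by_genre, nu=0.1):
    models = {}
    for genre, pages in pages_by_genre.items():
        vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
        models[genre] = (vec, OneClassSVM(nu=nu).fit(vec.fit_transform(pages)))
    return models

def predict_genres(models, page):
    accepted = [g for g, (vec, svm) in models.items()
                if svm.predict(vec.transform([page]))[0] == 1]
    return accepted or ["unknown"]  # empty acceptance set = outside the palette
```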
Abstract
The paper introduces and discusses a concept of syntactic n-grams (sn-grams) that can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams allow syntactic knowledge to be brought into machine learning methods. Still, prior parsing is necessary for their construction. We applied sn-grams in the
task of authorship attribution for corpora of three and seven authors with very promising results.
Abstract
Ruling line removal is an important pre-processing step in document image processing. Several algorithms have been proposed for this task. However, it is important to be able to take full advantage of the existing algorithms by adapting them to the specific properties of a document image collection. In this paper, a system is presented that is appropriate for fine-tuning the parameters of ruling line removal algorithms or appropriately adapting them to a specific document image collection, in order to improve the results. The application of our method to an existing line removal algorithm is presented.
Abstract
This paper overviews 18 plagiarism detectors that have been evaluated
within the fifth international competition on plagiarism detection at PAN 2013.
We report on their performances for the two tasks source retrieval and text alignment
of external plagiarism detection. Furthermore, we continue last year’s initiative
to invite software submissions instead of run submissions, and re-evaluate
this year’s submissions on last year’s evaluation corpora and vice versa, thus
demonstrating the benefits of software submissions in terms of reproducibility.
Abstract
This overview presents the framework and results for the Author Profiling
task at PAN 2013. We describe in detail the corpus and its characteristics,
and the evaluation framework we used to measure the participants' performance in
solving the problem of identifying age and gender from anonymous texts. Finally,
the approaches of the 21 participants and their results are described.
Abstract
The author identification task at PAN-2013 focuses on author verification where, given a set of documents by a single author and a questioned document, the problem is to determine whether the questioned document was written by that particular author or not. In this paper we present the evaluation setup, the performance measures, the new corpus we built for this task covering three languages, and the evaluation results of the 18 participant teams that submitted their software. Moreover, we survey the characteristics of the submitted approaches and show that a very effective meta-model can be formed based on the output of the participant methods.
Abstract
This paper outlines the concepts and achievements of our evaluation
lab on digital text forensics, PAN 13, which called for original research and development
on plagiarism detection, author identification, and author profiling.
We present a standardized evaluation framework for each of the three tasks and
discuss the evaluation results of the altogether 58 submitted contributions. For
the first time, instead of accepting the output of software runs, we collected the
software itself and ran it on a computer cluster at our site. As evaluation
and experimentation platform we use TIRA, which is being developed at
the Webis Group in Weimar. TIRA can handle large-scale software submissions
by means of virtualization, sandboxed execution, tailored unit testing, and staged
submission. In addition to the achieved evaluation results, a major achievement
of our lab is that we now have the largest collection of state-of-the-art approaches
with regard to the mentioned tasks for further analysis at our disposal.
Abstract
The vast amount of user-generated content on the Web has
increased the need for handling the problem of automatically
processing content in web pages. The segmentation of web
pages and noise (non-informative segment) removal are important
pre-processing steps in a variety of applications such
as sentiment analysis, text summarization and information
retrieval. Currently, these two tasks tend to be handled separately
or together, without taking into account the diversity
of web corpora or web page type detection.
We present a unified approach that is able to provide robust
identification of informative textual parts in web pages
along with accurate type detection. The proposed algorithm
takes into account visual and non-visual characteristics of a
web page and is able to remove noisy parts from three major
categories of pages which contain user-generated content
(News, Blogs, Discussions). Based on a human annotated
corpus consisting of diverse topics, domains and templates,
we demonstrate the learning abilities of our algorithm, we
examine its effectiveness in extracting the informative textual
parts and its usage as a rule-based classifier for web
page type detection in a realistic web setting.
Abstract
The discovery of web documents about certain topics
is an important task for web-based applications including web
document retrieval, opinion mining and knowledge extraction. In
this paper, we propose an agent-based focused crawling framework
able to retrieve topic- and genre-related web documents.
Starting from a simple topic query, a set of focused crawler
agents explore in parallel topic-specific web paths using dynamic
seed URLs that belong to certain web genres and are collected
from web search engines. The agents make use of an internal
mechanism that weighs topic and genre relevance scores of
unvisited web pages. They are able to adapt to the properties
of a given topic by modifying their internal knowledge during
search, handle ambiguous queries, ignore irrelevant pages with
respect to the topic and retrieve collaboratively topic-relevant
web pages. We performed an experimental study to evaluate the
behavior of the agents for a variety of topic queries demonstrating
the benefits and the capabilities of our framework.
Abstract
Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches avoids focusing on the class imbalance problem and its effect on the performance. In this paper, we present a systematic experimental study of author identification in skewed training sets where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined, one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets while the instance-based method is a better choice in balanced or slightly-skewed training sets.
Abstract
In this paper a novel method for detecting plagiarized passages in
document collections is presented. In contrast to previous work in
this field that uses mainly content terms to represent documents,
the proposed method is based on structural information provided
by occurrences of a small list of stopwords (i.e., very frequent
words). We show that stopword n-grams are able to capture local
syntactic similarities between suspicious and original documents.
Moreover, an algorithm for detecting the exact boundaries of
plagiarized and source passages is proposed. Experimental results
on a publicly-available corpus demonstrate that the performance
of the proposed approach is competitive when compared with the
best reported results. More importantly, it achieves significantly
better results when dealing with difficult plagiarism cases where
the plagiarized passages are highly modified by replacing most of
the words or phrases with synonyms to hide the similarity with the
source documents.
Abstract
Author identification models fall into two major categories according to the way they handle the training texts: profile-based models produce one representation per author while instance-based models produce one representation per text. In this paper, we propose an approach that combines two well-known representatives of these categories, namely the Common n-Grams method and a Support Vector Machine classifier based on character n-grams. The outputs of these classifiers are combined to enrich the training set with additional documents in a repetitive semi-supervised procedure inspired by the co-training algorithm. The evaluation results on closed-set author identification are encouraging, especially when the set of candidate authors is large.
Abstract
In constraint programming there are often many choices regarding
the propagation method to be used on the constraints of a
problem. However, simple constraint solvers usually only apply a standard
method, typically (generalized) arc consistency, on all constraints
throughout search. Advanced solvers additionally allow for the modeler
to choose among an array of propagators for certain (global) constraints.
Since complex interactions exist among constraints, deciding in the modelling
phase which propagation method to use on given constraints can
be a hard task that ideally we would like to free the user from. In this paper
we propose a simple technique towards the automation of this task.
Our approach exploits information gathered from a random probing preprocessing
phase to automatically decide on the propagation method to
be used on each constraint. As we demonstrate, data gathered through
probing allows the solver to accurately differentiate between constraints
that offer little pruning as opposed to ones that achieve many
domain reductions, and also to detect constraints and variables that are
amenable to certain propagation methods. Experimental results from an
initial evaluation of the proposed method on binary CSPs demonstrate
the benefits of our approach.
Abstract
The task of intrinsic plagiarism detection deals with cases where no reference corpus
is available and it is exclusively based on stylistic changes or inconsistencies within a given
document. In this paper a new method is presented that attempts to quantify the style variation
within a document using character n-gram profiles and a style change function based on an
appropriate dissimilarity measure originally proposed for author identification. In addition, we
propose a set of heuristic rules that attempt to detect plagiarism-free documents and
plagiarized passages, as well as to reduce the effect of irrelevant style changes within a
document. The proposed approach is evaluated on the recently-available corpus of the 1st Int.
Competition on Plagiarism Detection with promising results.
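A rough sketch of a style change function of this kind: the character n-gram profile of a sliding window is compared with the whole-document profile, and peaks in the resulting curve suggest possible style shifts. The window size, n, and the exact dissimilarity below are illustrative assumptions rather than the paper's precise choices.

```python
# Illustrative sketch: a normalized character n-gram dissimilarity between each
# sliding window and the whole document.
from collections import Counter

def ngram_freqs(text, n=3):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}

def style_change_curve(document, window=1000, step=200, n=3):
    doc_p = ngram_freqs(document, n)
    curve = []
    for start in range(0, max(1, len(document) - window + 1), step):
        win_p = ngram_freqs(document[start:start + window], n)
        d = sum(((2 * (win_p[g] - doc_p.get(g, 0.0)))
                 / (win_p[g] + doc_p.get(g, 0.0))) ** 2
                for g in win_p) / (4 * max(1, len(win_p)))
        curve.append((start, d))
    return curve
```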
Abstract
Author identification is a text categorization task with
applications in intelligence, criminal law, computer forensics, etc.
Usually, in such cases there is a shortage of training texts. In this
paper, we propose the use of second order tensors for representing
texts for this problem, in contrast to the traditional vector space
model. Based on a generalization of the SVM algorithm that can
handle tensors, we explore various methods for filling the matrix of
features taking into account that similar features should be placed in
the same neighborhood. To this end, we propose a frequency-based
metric. Experiments on a corpus controlled for genre and topic, with a
variable amount of training texts, show that the proposed approach
is more effective than a traditional vector-based SVM when only
a limited amount of training texts is used.
Abstract
Authorship identification can be viewed as a text categorization task.
However, in this task the most frequent features appear to be the most important
discriminators, there is usually a shortage of training texts, and the training texts
are rarely evenly distributed over the authors. To cope with these problems, we
propose tensors of second order for representing the stylistic properties of texts.
Our approach requires the calculation of far fewer parameters in comparison
to the traditional vector space representation. We examine various methods for
building appropriate tensors taking into account that similar features should be
placed in the same neighborhood. Based on an existing generalization of SVM
able to handle tensors we perform experiments on corpora controlled for genre
and topic and show that the proposed approach can effectively handle cases
where only limited training texts are available.
Abstract
An important factor for discriminating between
webpages is their genre (e.g., blogs, personal homepages,
e-shops, online newspapers, etc). Webpage genre
identification has a great potential in information
retrieval since users of search engines can combine
genre-based and traditional topic-based queries to
improve the quality of the results. So far, various features
have been proposed to quantify the style of webpages
including word and html-tag frequencies. In this paper,
we propose a low-level representation for this problem
based on character n-grams. Using an existing approach,
we produce feature sets of variable-length character n-grams
and combine this representation with information
about the most frequent html-tags. Based on two
benchmark corpora, we present webpage genre
identification experiments and improve the best reported
results in both cases.
Abstract
This paper deals with the problem of author
identification. The Common N-Grams (CNG) method
[6] is a language-independent profile-based approach
with good results in many author identification
experiments so far. A variation of this approach is
presented based on new distance measures that are
quite stable for large profile length values. Special
emphasis is given to the degree to which the
effectiveness of the method is affected by the available
training text samples per author. Experiments based on
text samples on the same topic from the Reuters
Corpus Volume 1 are presented using both balanced
and imbalanced training corpora. The results show
that CNG with the proposed distance measures is more
accurate when only limited training text samples are
available, at least for some of the candidate authors, a
realistic condition in author identification problems.
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of a major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of a major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author's style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
Automatic authorship identification offers a valuable tool for
supporting crime investigation and security. It can be seen as a multi-class,
single-label text categorization task. Character n-grams are a very successful
approach to represent text for stylistic purposes since they are able to capture
nuances at the lexical, syntactic, and structural levels. So far, character n-grams of
fixed length have been used for authorship identification. In this paper, we
propose a variable-length n-gram approach inspired by previous work for
selecting variable-length word sequences. Using a subset of the new Reuters
corpus, consisting of texts on the same topic by 50 different authors, we show
that the proposed approach is at least as effective as information gain for
selecting the most significant n-grams although the feature sets produced by the
two methods have few common members. Moreover, we explore the
significance of digits for distinguishing between authors showing that an
increase in performance can be achieved using simple text pre-processing.
Abstract
This paper deals with the problem of identifying the
most likely author of a text. Several thousands of character n-grams,
rather than lexical or syntactic information, are used to represent the
style of a text. Thus, the author identification task can be viewed as
a single-label multiclass classification problem with a high-dimensional
feature space and sparse data. In order to cope with such properties,
we propose a suitable learning ensemble based on feature set
subspacing. Performance results on two well-tested benchmark text
corpora for author identification show that this classification
scheme is quite effective, significantly improving the best reported
results so far. Additionally, this approach proves to be quite
stable in comparison with support vector machines when using
a limited number of training texts, a condition usually met in this kind
of problem.
Abstract
Authorship identification can be seen as a single-label
multi-class text categorization problem. Very often, there are
extremely few training texts at least for some of the candidate
authors. In this paper, we present methods to handle imbalanced
multi-class textual datasets. The main idea is to segment the
training texts into sub-samples according to the size of the class.
Hence, minority classes can be segmented into many short samples
and majority classes into fewer but longer samples. Moreover, we
explore text re-sampling in order to construct a training set
according to a desirable distribution over the classes. Essentially,
text re-sampling can be viewed as providing new synthetic data that
increase the training size of a class. Based on a corpus of newswire
stories in English we present authorship identification experiments
on various multi-class imbalanced cases.
Abstract
This paper presents a content-based approach to spam detection
based on low-level information. Instead of the traditional 'bag of words' representation,
we use a 'bag of character n-grams' representation which avoids the
sparse data problem that arises in n-grams on the word-level. Moreover, it is
language-independent and does not require any lemmatizer or 'deep' text preprocessing.
Based on experiments on Ling-Spam corpus we evaluate the proposed
representation in combination with support vector machines. Both binary
and term-frequency representations achieve high precision rates while maintaining
recall at an equally high level, which is a crucial factor for anti-spam filters, a
cost-sensitive application.
Abstract
In this paper, we present a binarization technique
specifically designed for historical document images.
Existing methods for this problem focus on either
finding a good global threshold or adapting the
threshold for each area so as to remove smear,
stains, uneven illumination, etc. We propose a hybrid
approach that first applies a global thresholding
method and, then, identifies the image areas that are
more likely to still contain noise. Each of these areas is
re-processed separately to achieve better quality of
binarization. We evaluate the proposed approach for
different kinds of degradation problems. The results
show that our method can handle hard cases while
documents already in good condition are not affected
drastically.
Abstract
It is common for libraries to provide public access
to historical and ancient document image collections.
It is common for such document images to require
specialized processing in order to remove background
noise and become more legible. In this paper, we
propose a hybrid binarization approach for improving
the quality of old documents using a combination of
global and local thresholding. First, a global
thresholding technique specifically designed for old
document images is applied to the entire image. Then,
the image areas that still contain background noise are
detected and the same technique is re-applied to each
area separately. Hence, we achieve better adaptability
of the algorithm in cases where various kinds of noise
coexist in different areas of the same image while
avoiding the computational and time cost of applying a
local thresholding to the entire image. Evaluation
results based on a collection of historical document
images indicate that the proposed approach is effective
in removing background noise and improving the
quality of degraded documents while documents
already in good condition are not affected.
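A minimal sketch of the hybrid idea, assuming Otsu's method as the global step and a crude foreground-density test to find still-noisy blocks; the specific global technique and noise detection of the paper are not reproduced here.

```python
# Illustrative sketch: global thresholding first, then local re-thresholding of
# blocks that still look noisy. 'gray' is an 8-bit grayscale document image.
import cv2
import numpy as np

def hybrid_binarize(gray, block=64, noise_ratio=0.4):
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    h, w = gray.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            region = binary[y:y + block, x:x + block]
            if np.mean(region == 0) > noise_ratio:  # too much foreground: likely noise
                _, local = cv2.threshold(gray[y:y + block, x:x + block], 0, 255,
                                         cv2.THRESH_BINARY + cv2.THRESH_OTSU)
                binary[y:y + block, x:x + block] = local
    return binary
```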
Abstract
Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of a major benefit, such as tracing the source of code left in the system after a cyber attack, authorship disputes, proof of authorship in court, etc. In this paper, we present our approach, which is based on byte-level n-gram profiles and is an extension of a method that has been successfully applied to natural language text authorship attribution. We propose a simplified profile and a new similarity measure which is less complicated than the algorithm followed in text authorship attribution and seems more suitable for source code identification since it is better able to deal with very small training sets. Experiments were performed on two different data sets, one with programs written in C++ and the second with programs written in Java. Unlike the traditional language-dependent metrics used by previous studies, our approach can be applied to any programming language with no additional cost. The presented accuracy rates are much better than the best reported results for the same data sets.
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of a major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
In this paper the problem of music performer verification is introduced.
Given a certain performance of a musical piece and a set of candidate pianists
the task is to examine whether or not a particular pianist is the actual performer.
A database of 22 pianists playing pieces by F. Chopin on a computer-controlled
piano is used in the presented experiments. An appropriate set of features
that captures the idiosyncrasies of music performers is proposed. Well-known
machine learning techniques for constructing learning ensembles are applied
and remarkable results are described in verifying the actual pianist, a very
difficult task even for human experts.
Abstract
In this paper, we present a trainable approach to
discriminate between machine-printed and handwritten
text. An integrated system able to localize text areas and
split them into text-lines is used. A set of simple and easy-to-
compute structural characteristics that capture the
differences between machine-printed and handwritten
text-lines is introduced. Experiments on document images
taken from IAM-DB and GRUHD databases show a
remarkable performance of the proposed approach that
requires minimal training data.
Abstract
This paper addresses the problem of identifying the
most likely music performer, given a set of performances of the
same piece by a number of skilled candidate pianists. We propose a
set of features for representing the stylistic characteristics of a
music performer. A database of piano performances of 22 pianists
playing two pieces by F. Chopin is used in the presented
experiments. Due to the limitations of the training set size and the
characteristics of the input features we propose an ensemble of
simple classifiers derived by both subsampling the training set and
subsampling the input features. Preliminary experiments show that
the resulting ensemble is able to efficiently cope with this difficult
musical task, displaying a level of accuracy unlikely to be matched
by human listeners (under similar conditions).
Abstract
In this study, a comparison of features for
discriminating between different music performers
playing the same piece is presented. Based on a series
of statistical experiments on a data set of piano pieces
played by 22 performers, it is shown that the
deviation from the performance norm (average
performance) is better able to reveal the performers’
individualities in comparison to the deviation from
the printed score. In the framework of automatic
music performer recognition, the norm-based features
prove to be very accurate in intra-piece tests (training
and test set taken from the same piece) and very
stable in inter-piece tests (training and test sets taken
from different pieces). Moreover, it is empirically
demonstrated that the average performance is at least
as effective as the best of the constituent individual
performances while ‘extreme’ performances have the
lowest discriminatory potential when used as norm.
Abstract
In this study, a computational model that aims at the automatic discrimination of different human
music performers playing the same piece is presented. The proposed model is based on the note
level and does not require any deep (e.g., structural or harmonic) analysis. A set of measures
that attempts to capture both the style of the author and the style of the piece is introduced. The
presented approach has been applied to a database of piano sonatas by W.A. Mozart performed by
both a French and a Viennese pianist with very encouraging preliminary results.
Abstract
In this paper we present a practical approach to text chunking for unrestricted Modern Greek text that is based on multiple-pass parsing. Two versions of this chunker are proposed: one based on a large lexicon and one based on minimal resources. In the latter case the morphological analysis is performed using exclusively two small lexicons containing closed-class words and common suffixes of Modern Greek words. We give comparative performance results on the basis of a corpus of unrestricted text and show that very good results can be obtained by omitting the large and complicated resources. Moreover, the considerable time cost introduced by the use of the large lexicon indicates that the minimal-resources chunker is the best solution for a practical application that requires rapid response and less-than-perfect parsing results.
Abstract
In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language. Using as testing ground a part of the Wall Street Journal corpus, we show that the most frequent words of the British National Corpus, representing the most frequent words of the written English language, are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role in terms of accurate text categorization as well as when dealing with training data of limited size.
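A minimal sketch of such a representation, using a tiny illustrative stand-in for the BNC-derived list of most frequent words together with common punctuation marks.

```python
# Illustrative sketch: a text represented by relative frequencies of a fixed list
# of very frequent English words and common punctuation marks.
import re
from collections import Counter

COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "it", "is", "was", "i"]
PUNCTUATION = [",", ".", ";", ":", "?", "!"]

def style_vector(text):
    tokens = re.findall(r"[a-z]+|[,.;:?!]", text.lower())
    total = len(tokens) or 1
    counts = Counter(tokens)
    return [counts[t] / total for t in COMMON_WORDS + PUNCTUATION]
```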
Abstract
This paper presents a character segmentation algorithm for unconstrained cursive handwritten text. The transformation-based learning method and a simplified variation of it are used in order to extract automatically rules that detect the segment boundaries. Comparative experimental results are given for a collection of multi-writer handwritten words. The achieved accuracy in detecting segment boundaries exceeds 82%. Moreover, limited training data can provide very satisfactory results.
Abstract
In this paper we present an approach to automatic authorship attribution dealing with real-world (or unrestricted) text. Our method is based on the computational analysis of the input text using a text-processing tool. Besides the style markers relevant to the output of this tool we also use analysis-dependent style markers, that is, measures that represent the way in which the text has been processed. No word frequency counts, nor other lexically-based measures are taken into account. We show that the proposed set of style markers is able to distinguish texts of various authors of a weekly newspaper using multiple regression. All the experiments we present were performed using real-world text downloaded from the World Wide Web. Our approach is easily trainable and fully-automated requiring no manual text preprocessing nor sampling.
Abstract
Transformation-based learning (TBL) is the most important machine learning theory aiming at the automatic extraction of rules based on already tagged corpora. However, the application of this theory to a certain application without taking into account the features that characterize this application may cause problems regarding the training time cost as well as the accuracy of the extracted rules. In this paper we present a variation of the basic idea of the TBL and we apply it to the extraction of the sentence boundary disambiguation rules in real-world text, a prerequisite for the vast majority of the natural language processing applications. We show that our approach achieves considerably higher accuracy results and, moreover, requires minimal training time in comparison to the traditional TBL.
Abstract
This paper describes a user-assisted business letter generator that meets the ever-increasing demand for more flexible and modular letter generators which draw on explicit thematic models and are easily adaptable to specific user needs. Based on a detailed analysis of requirements and taking full advantage of the end users' feedback, the presented generator not only creates a business letter according to the user's choices, but also refines it taking into consideration stylistic aspects like written style and tone.
Abstract
Language barriers present a major problem in the effectiveness of resource sharing and in common access to the resources of libraries. In this paper we present the TRANSLIB system which stemmed from the integration of both new and already existing advanced multilingual information tools. By making use of some AI-based methods this system takes full advantage of these resources in order to provide multilingual access to library catalogues. Among its striking features, it enables searching in multiple languages, multilingual presentation of the query results, and localization of the user interface. TRANSLIB has been currently tested in existing medium-sized bibliographic databases. Early evaluation results show a remarkable improvement in the search process and report high user-friendliness, and easy and low-cost maintenance and upgrade of the system.
Abstract
The presented work is strongly motivated by the need to categorize unrestricted texts in terms of functional style (FS) in order to attain a satisfying outcome in style processing. Towards this aim, a three-level description of FS is given that comprises: (a) the basic categories of FS, (b) the main features that characterize each of the above categories, and (c) the linguistic identifiers that act as style markers in texts for the identification of the above features. Special emphasis is put on the problems faced in the computational implementation of the aforementioned findings as well as on the selection of the most appropriate stylometrics (i.e., stylistic scores) to achieve better results in text categorization. This approach is language-independent, empirically-driven, and can be used in various applications including grammar and style checking, natural language generation, style verification in real-world texts, and recognition of style shift between adjacent portions of text.
Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted or mass reproduced without the explicit permission of the copyright holder.