Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted or mass reproduced without the explicit permission of the copyright holder.
Abstract
Author verification is a fundamental task in authorship analysis, associated with significant applications in the humanities, cyber-security, and social media analytics. In some of the relevant studies, there is evidence that heterogeneous ensembles can provide very reliable solutions, better than any individual verification model. However, there is no systematic study examining the application of ensemble methods to this task. In this paper, we start from a large set of base verification models covering the main paradigms in this area and study how they can be combined to build an accurate ensemble. We propose a simple stacking ensemble as well as a dynamic ensemble selection approach that can use the most reliable base models for each verification case separately. The experimental results on ten benchmark corpora covering multiple languages and genres verify the suitability of ensembles for this task and demonstrate the effectiveness of our method, in some cases improving the best reported results by more than 10%.
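The stacking idea above can be illustrated with a short sketch; the meta-learner, toy data, and score layout below are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal stacking sketch: a logistic-regression meta-model learns how to combine
# the scores of several heterogeneous base verifiers into one verification score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_verifiers(train_scores, train_labels, test_scores):
    """train/test_scores: arrays of shape (n_cases, n_base_models), scores in [0, 1]."""
    meta = LogisticRegression()
    meta.fit(train_scores, train_labels)           # learn how to weigh the base models
    return meta.predict_proba(test_scores)[:, 1]   # stacked same-author probabilities

# toy usage with random base-model outputs
rng = np.random.default_rng(0)
train = rng.random((100, 5))
labels = (train.mean(axis=1) > 0.5).astype(int)
print(stack_verifiers(train, labels, rng.random((3, 5))))
```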
Abstract
Author verification is a fundamental task in authorship analysis and associated with important applications in humanities and forensics. In this paper, we propose the use of an intrinsic profile-based verification method that is based on latent semantic indexing (LSI). Our proposed approach is easy-to-follow and language independent. Based on experiments using benchmark corpora from the PAN shared task in author verification, we demonstrate that LSI is both more effective and more stable than latent Dirichlet allocation in this task. Moreover, LSI models are able to outperform existing approaches especially when multiple texts of known authorship are available per verification instance and all documents belong to the same thematic area and genre. We also study several feature types and similarity measures to be combined with the proposed topic models.
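As a rough illustration of the LSI-based comparison: the character n-gram features, component count, and averaging of similarities below are assumptions of this sketch, not the settings reported in the paper.

```python
# Sketch: project known and questioned texts into a low-dimensional LSI space
# (truncated SVD over tf-idf features) and score by mean cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsi_verification_score(known_docs, questioned_doc, n_topics=2):
    docs = known_docs + [questioned_doc]
    tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(docs)
    lsi = TruncatedSVD(n_components=n_topics, random_state=0).fit_transform(tfidf)
    sims = cosine_similarity(lsi[:-1], lsi[-1:])   # known texts vs. questioned text
    return float(sims.mean())                      # higher -> more likely same author

print(lsi_verification_score(["the cat sat on the mat", "a cat lay on the mat"],
                             "the cat is on the mat"))
```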
Abstract
Authorship verification has gained a lot of attention in recent years, mainly due to the focus of the PAN@CLEF shared tasks. A verification method called Impostors, based on a set of external (impostor) documents and a random subspace ensemble, is one of the most successful approaches. Variations of this method gained top-performing positions in recent PAN evaluation campaigns. In this paper, we propose a modification of the Impostors method that focuses on both the appropriate selection of impostor documents and an enhanced comparison of impostor documents with the documents under investigation. Our approach achieves competitive performance on PAN corpora, outperforming previous versions of the Impostors method.
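For readers unfamiliar with the Impostors family, the sketch below shows the core random-subspace loop; the document vectors, similarity function, and parameters are placeholders, and the selection/comparison refinements proposed in the paper are not reproduced here.

```python
# Core Impostors loop: on random feature subsets, count how often the known-author
# document is more similar to the questioned document than every impostor document.
import numpy as np

def cosine(a, b, idx):
    return a[idx] @ b[idx] / (np.linalg.norm(a[idx]) * np.linalg.norm(b[idx]) + 1e-12)

def impostors_score(known, questioned, impostors, iters=100, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(iters):
        idx = rng.choice(known.shape[0], size=int(frac * known.shape[0]), replace=False)
        best_impostor = max(cosine(imp, questioned, idx) for imp in impostors)
        wins += cosine(known, questioned, idx) > best_impostor
    return wins / iters        # a high score supports the same-author hypothesis

rng = np.random.default_rng(1)
known, questioned = rng.random(200), rng.random(200)
impostors = [rng.random(200) for _ in range(20)]
print(impostors_score(known, questioned, impostors))
```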
Abstract
The constantly increasing amount of opinionated text found on the Web has had a significant impact on the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies demonstrates that it is appropriate for opinion comparison between topics, since it provides useful indications of popularity based on a relatively small number of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.
Abstract
Automated Genre Identification (AGI) of web pages is a
problem of increasing importance since web genre (e.g. blog, news, e-shops,
etc.) information can enhance modern Information Retrieval (IR)
systems. The state-of-the-art in this field considers AGI as a closed-set
classification problem where a variety of web page representations and machine
learning models have been intensively studied. In this paper, we study
AGI as an open-set classification problem which better formulates the
real world conditions of exploiting AGI in practice. Focusing on the use
of content information, different text representation methods (words and
character n-grams) are tested. Moreover, two classification methods are
examined, one-class SVM learners, used as a baseline, and an ensemble
of classifiers based on random feature subspacing, originally proposed for
author identification. It is demonstrated that very high precision can be
achieved in open-set AGI while recall remains relatively high.
Abstract
The paper introduces and discusses a concept of syntactic n-grams (sn-grams) that can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams allow bringing syntactic knowledge into machine learning methods. Still, previous parsing is necessary for their construction. We applied sn-grams in the
task of authorship attribution for corpora of three and seven authors with very promising results.
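As a toy illustration of how sn-grams differ from surface n-grams, the snippet below builds syntactic bigrams by following head -> dependent arcs; the hard-coded head indices stand in for the output of a dependency parser.

```python
# Toy syntactic bigrams (sn-grams of size 2): instead of adjacent words, follow
# head -> dependent arcs of a dependency tree given as a list of head indices
# (0-based, -1 for the root), which any dependency parser can supply.
def syntactic_bigrams(tokens, heads):
    return [(tokens[h], tokens[i]) for i, h in enumerate(heads) if h >= 0]

tokens = ["the", "cat", "chased", "a", "mouse"]
heads = [1, 2, -1, 4, 2]   # "the"->"cat", "cat"->"chased", "a"->"mouse", "mouse"->"chased"
print(syntactic_bigrams(tokens, heads))
# [('cat', 'the'), ('chased', 'cat'), ('mouse', 'a'), ('chased', 'mouse')]
```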
Abstract
Ruling line removal is an important pre-processing step in document image processing. Several algorithms have been proposed for this task. However, it is important to be able to take full advantage of the existing algorithms by adapting them to the specific properties of a document image collection. In this paper, a system is presented that is appropriate for fine-tuning the parameters of ruling line removal algorithms or appropriately adapting them to a specific document image collection, in order to improve the results. The application of our method to an existing line removal algorithm is presented.
Abstract
This paper overviews 18 plagiarism detectors that have been evaluated
within the fifth international competition on plagiarism detection at PAN 2013.
We report on their performance in the two tasks of external plagiarism detection:
source retrieval and text alignment. Furthermore, we continue last year’s initiative
to invite software submissions instead of run submissions, and re-evaluate
this year’s submissions on last year’s evaluation corpora and vice versa, thus
demonstrating the benefits of software submissions in terms of reproducibility.
Abstract
This overview presents the framework and results for the Author Profiling
task at PAN 2013. We describe in detail the corpus and its characteristics,
and the evaluation framework we used to measure the participants' performance to
solve the problem of identifying age and gender from anonymous texts. Finally,
the approaches of the 21 participants and their results are described.
Abstract
The author identification task at PAN-2013 focuses on author verification: given a set of documents by a single author and a questioned document, the problem is to determine if the questioned document was written by that particular author or not. In this paper we present the evaluation setup, the performance measures, the new corpus we built for this task covering three languages, and the evaluation results of the 18 participant teams that submitted their software. Moreover, we survey the characteristics of the submitted approaches and show that a very effective meta-model can be formed based on the output of the participant methods.
Abstract
This paper outlines the concepts and achievements of our evaluation
lab on digital text forensics, PAN 13, which called for original research and development
on plagiarism detection, author identification, and author profiling.
We present a standardized evaluation framework for each of the three tasks and
discuss the evaluation results of the altogether 58 submitted contributions. For
the first time, instead of accepting the output of software runs, we collected the
software itself and ran it on a computer cluster at our site. As evaluation
and experimentation platform we use TIRA, which is being developed at
the Webis Group in Weimar. TIRA can handle large-scale software submissions
by means of virtualization, sandboxed execution, tailored unit testing, and staged
submission. In addition to the achieved evaluation results, a major achievement
of our lab is that we now have the largest collection of state-of-the-art approaches
with regard to the mentioned tasks for further analysis at our disposal.
Abstract
The vast amount of user-generated content on the Web has
increased the need for handling the problem of automatically
processing content in web pages. The segmentation of web
pages and noise (non-informative segment) removal are important
pre-processing steps in a variety of applications such
as sentiment analysis, text summarization and information
retrieval. Currently, these two tasks tend to be handled separately
or are handled together without emphasizing the diversity
of the web corpora and the web page type detection.
We present a unified approach that is able to provide robust
identification of informative textual parts in web pages
along with accurate type detection. The proposed algorithm
takes into account visual and non-visual characteristics of a
web page and is able to remove noisy parts from three major
categories of pages which contain user-generated content
(News, Blogs, Discussions). Based on a human annotated
corpus consisting of diverse topics, domains and templates,
we demonstrate the learning abilities of our algorithm and
examine its effectiveness in extracting the informative textual
parts and its usage as a rule-based classifier for web
page type detection in a realistic web setting.
Abstract
The discovery of web documents about certain topics
is an important task for web-based applications including web
document retrieval, opinion mining and knowledge extraction. In
this paper, we propose an agent-based focused crawling framework
able to retrieve topic- and genre-related web documents.
Starting from a simple topic query, a set of focused crawler
agents explore in parallel topic-specific web paths using dynamic
seed URLs that belong to certain web genres and are collected
from web search engines. The agents make use of an internal
mechanism that weighs topic and genre relevance scores of
unvisited web pages. They are able to adapt to the properties
of a given topic by modifying their internal knowledge during
search, handle ambiguous queries, ignore irrelevant pages with
respect to the topic and retrieve collaboratively topic-relevant
web pages. We performed an experimental study to evaluate the
behavior of the agents for a variety of topic queries demonstrating
the benefits and the capabilities of our framework.
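A schematic version of the crawling loop described above follows; fetch, extract_links, score_topic, and score_genre are placeholders for the framework's own components, and the weights and relevance threshold are illustrative.

```python
# Sketch of a focused crawl: unvisited links are kept in a priority queue ordered by
# a weighted combination of topic and genre relevance scores.
import heapq

def focused_crawl(seed_urls, fetch, extract_links, score_topic, score_genre,
                  w_topic=0.7, w_genre=0.3, max_pages=100):
    frontier = [(-1.0, url) for url in seed_urls]      # max-heap via negated scores
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if -neg_score > 0.5:                           # keep only sufficiently relevant pages
            collected.append(url)
        for link in extract_links(page):
            if link not in visited:
                score = w_topic * score_topic(page, link) + w_genre * score_genre(page, link)
                heapq.heappush(frontier, (-score, link))
    return collected
```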
Abstract
Similarly to natural language texts, source code documents can be distinguished by their style. Source code author identification can be viewed as a text classification task, given that samples of known authorship by a set of candidate authors are available. Although very promising results have been reported for this task, the evaluation of existing approaches has typically avoided the class imbalance problem and its effect on performance. In this paper, we present a systematic experimental study of author identification on skewed training sets, where the training samples are unequally distributed over the candidate authors. Two representative author identification methods are examined: one follows the profile-based paradigm (where a single representation is produced for all the available training samples per author) and the other follows the instance-based paradigm (where each training sample has its own individual representation). We examine the effect of the source code representation on the performance of these methods and show that the profile-based method is better able to handle cases of highly skewed training sets, while the instance-based method is a better choice for balanced or slightly-skewed training sets.
Abstract
In this paper a novel method for detecting plagiarized passages in
document collections is presented. In contrast to previous work in
this field that uses mainly content terms to represent documents,
the proposed method is based on structural information provided
by occurrences of a small list of stopwords (i.e., very frequent
words). We show that stopword n-grams are able to capture local
syntactic similarities between suspicious and original documents.
Moreover, an algorithm for detecting the exact boundaries of
plagiarized and source passages is proposed. Experimental results
on a publicly-available corpus demonstrate that the performance
of the proposed approach is competitive when compared with the
best reported results. More importantly, it achieves significantly
better results when dealing with difficult plagiarism cases where
the plagiarized passages are highly modified by replacing most of
the words or phrases with synonyms to hide the similarity with the
source documents.
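To make the representation concrete, here is a small sketch of stopword n-gram overlap; the stopword list is a tiny illustrative subset rather than the list used in the paper, and the boundary-detection algorithm is not shown.

```python
# Sketch: keep only very frequent function words, form n-grams over their sequence,
# and measure overlap between a suspicious and a source text.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it", "on", "for", "with"}

def stopword_ngrams(text, n=3):
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def stopword_overlap(suspicious, source, n=3):
    a, b = stopword_ngrams(suspicious, n), stopword_ngrams(source, n)
    return len(a & b) / max(len(a), 1)   # fraction of suspicious n-grams found in the source

# the content words differ (synonym substitution) but the stopword sequence survives
print(stopword_overlap("the cause of the delay is that it is on the agenda for review",
                       "the cause of the fault is that it is on the list for repair"))
```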
Abstract
Author identification models fall into two major categories according to the way they handle the training texts: profile-based models produce one representation per author while instance-based models produce one representation per text. In this paper, we propose an approach that combines two well-known representatives of these categories, namely the Common n-Grams method and a Support Vector Machine classifier based on character n-grams. The outputs of these classifiers are combined to enrich the training set with additional documents in a repetitive semi-supervised procedure inspired by the co-training algorithm. The evaluation results on closed-set author identification are encouraging, especially when the set of candidate authors is large.
Abstract
In constraint programming there are often many choices regarding
the propagation method to be used on the constraints of a
problem. However, simple constraint solvers usually only apply a standard
method, typically (generalized) arc consistency, on all constraints
throughout search. Advanced solvers additionally allow for the modeler
to choose among an array of propagators for certain (global) constraints.
Since complex interactions exist among constraints, deciding in the modelling
phase which propagation method to use on given constraints can
be a hard task that ideally we would like to free the user from. In this paper
we propose a simple technique towards the automation of this task.
Our approach exploits information gathered from a random probing preprocessing
phase to automatically decide on the propagation method to
be used on each constraint. As we demonstrate, data gathered through
probing allows for the solver to accurately differentiate between constraints
that offer little pruning as opposed to ones that achieve many
domain reductions, and also to detect constraints and variables that are
amenable to certain propagation methods. Experimental results from an
initial evaluation of the proposed method on binary CSPs demonstrate
the benefits of our approach.
Abstract
The task of intrinsic plagiarism detection deals with cases where no reference corpus
is available and it is exclusively based on stylistic changes or inconsistencies within a given
document. In this paper a new method is presented that attempts to quantify the style variation
within a document using character n-gram profiles and a style change function based on an
appropriate dissimilarity measure originally proposed for author identification. In addition, we
propose a set of heuristic rules that attempt to detect plagiarism-free documents and
plagiarized passages, as well as to reduce the effect of irrelevant style changes within a
document. The proposed approach is evaluated on the recently-available corpus of the 1st Int.
Competition on Plagiarism Detection with promising results.
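The style change function can be sketched as follows; the character trigram profiles, window and step sizes, and the exact normalization are assumptions of this illustration, written in the spirit of the dissimilarity measure mentioned above.

```python
# Sketch of a style change curve for intrinsic plagiarism detection: the character
# trigram profile of each sliding window is compared to the whole-document profile
# with a normalized n-gram dissimilarity. Windows are assumed to be longer than n.
from collections import Counter

def char_ngram_profile(text, n=3):
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def dissimilarity(p, q):
    return sum(((2 * (p[g] - q.get(g, 0.0))) / (p[g] + q.get(g, 0.0))) ** 2
               for g in p) / (4 * len(p))

def style_change_curve(text, window=200, step=100):
    whole = char_ngram_profile(text)
    return [dissimilarity(char_ngram_profile(text[i:i + window]), whole)
            for i in range(0, max(len(text) - window, 1), step)]
```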
Abstract
Author identification is a text categorization task with
applications in intelligence, criminal law, computer forensics, etc.
Usually, in such cases there is shortage of training texts. In this
paper, we propose the use of second order tensors for representing
texts for this problem, in contrast to the traditional vector space
model. Based on a generalization of the SVM algorithm that can
handle tensors, we explore various methods for filling the matrix of
features taking into account that similar features should be placed in
the same neighborhood. To this end, we propose a frequency-based
metric. Experiments on a corpus controlled for genre and topic and
variable amount of training texts show that the proposed approach
is more effective than a traditional vector-based SVM when only a
limited amount of training texts is used.
Abstract
Authorship identification can be viewed as a text categorization task.
However, in this task the most frequent features appear to be the most important
discriminators, there is usually a shortage of training texts, and the training texts
are rarely evenly distributed over the authors. To cope with these problems, we
propose tensors of second order for representing the stylistic properties of texts.
Our approach requires the calculation of much fewer parameters in comparison
to the traditional vector space representation. We examine various methods for
building appropriate tensors taking into account that similar features should be
placed in the same neighborhood. Based on an existing generalization of SVM
able to handle tensors we perform experiments on corpora controlled for genre
and topic and show that the proposed approach can effectively handle cases
where only limited training texts are available.
Abstract
An important factor for discriminating between
webpages is their genre (e.g., blogs, personal homepages,
e-shops, online newspapers, etc.). Webpage genre
identification has a great potential in information
retrieval since users of search engines can combine
genre-based and traditional topic-based queries to
improve the quality of the results. So far, various features
have been proposed to quantify the style of webpages
including word and html-tag frequencies. In this paper,
we propose a low-level representation for this problem
based on character n-grams. Using an existing approach,
we produce feature sets of variable-length character n-grams
and combine this representation with information
about the most frequent html-tags. Based on two
benchmark corpora, we present webpage genre
identification experiments and improve the best reported
results in both cases.
Abstract
This paper deals with the problem of author
identification. The Common N-Grams (CNG) method
[6] is a language-independent profile-based approach
with good results in many author identification
experiments so far. A variation of this approach is
presented based on new distance measures that are
quite stable for large profile length values. Special
emphasis is given to the degree upon which the
effectiveness of the method is affected by the available
training text samples per author. Experiments based on
text samples on the same topic from the Reuters
Corpus Volume 1 are presented using both balanced
and imbalanced training corpora. The results show
that CNG with the proposed distance measures is more
accurate when only limited training text samples are
available, at least for some of the candidate authors, a
realistic condition in author identification problems.
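For reference, the baseline CNG dissimilarity on which the proposed distance measures build can be sketched as below; the profile size and n-gram length are illustrative values.

```python
# Baseline CNG dissimilarity between two character n-gram profiles (the L most
# frequent n-grams with their relative frequencies); the distance variants proposed
# in the paper modify this formula for stability at large profile lengths.
from collections import Counter

def profile(text, n=3, L=2000):
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}

def cng_dissimilarity(pa, pb):
    grams = set(pa) | set(pb)
    return sum(((pa.get(g, 0.0) - pb.get(g, 0.0)) /
                ((pa.get(g, 0.0) + pb.get(g, 0.0)) / 2)) ** 2 for g in grams)
```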
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
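A compact sketch of the SCAP idea follows; the n-gram length, profile size, and toy programs are illustrative, not the settings tuned in the paper.

```python
# SCAP sketch: an author profile is the set of the L most frequent byte-level n-grams
# over the author's programs; a questioned program is attributed to the author whose
# profile shares the most n-grams with it (simplified profile intersection).
from collections import Counter

def scap_profile(source_bytes, n=6, L=1500):
    counts = Counter(source_bytes[i:i + n] for i in range(len(source_bytes) - n + 1))
    return {g for g, _ in counts.most_common(L)}

def attribute(questioned_bytes, author_profiles, n=6, L=1500):
    q = scap_profile(questioned_bytes, n, L)
    return max(author_profiles, key=lambda a: len(q & author_profiles[a]))  # most shared n-grams

profiles = {"alice": scap_profile(b"for(int i=0;i<n;i++){sum+=a[i];}"),
            "bob":   scap_profile(b"while x < n:\n    total += values[x]\n    x += 1")}
print(attribute(b"for(int j=0;j<n;j++){sum+=b[j];}", profiles))   # -> 'alice'
```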
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author's style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
Automatic authorship identification offers a valuable tool for
supporting crime investigation and security. It can be seen as a multi-class,
single-label text categorization task. Character n-grams are a very successful
approach to represent text for stylistic purposes since they are able to capture
nuances at the lexical, syntactic, and structural levels. So far, character n-grams of
fixed length have been used for authorship identification. In this paper, we
propose a variable-length n-gram approach inspired by previous work for
selecting variable-length word sequences. Using a subset of the new Reuters
corpus, consisting of texts on the same topic by 50 different authors, we show
that the proposed approach is at least as effective as information gain for
selecting the most significant n-grams although the feature sets produced by the
two methods have few common members. Moreover, we explore the
significance of digits for distinguishing between authors showing that an
increase in performance can be achieved using simple text pre-processing.
Abstract
This paper deals with the problem of identifying the
most likely author of a text. Several thousands of character n-grams,
rather than lexical or syntactic information, are used to represent the
style of a text. Thus, the author identification task can be viewed as
a single-label multiclass classification problem of high dimensional
feature space and sparse data. In order to cope with such properties,
we propose a suitable learning ensemble based on feature set
subspacing. Performance results on two well-tested benchmark text
corpora for author identification show that this classification
scheme is quite effective, significantly improving the best reported
results so far. Additionally, this approach proves to be quite
stable in comparison with support vector machines when using a
limited number of training texts, a condition usually met in this kind
of problem.
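The subspacing idea can be sketched as follows; linear SVMs stand in for the base learners, the split into disjoint subspaces is simplified, and labels are assumed to be integer class ids.

```python
# Sketch of classification by feature set subspacing: simple base learners are trained
# on disjoint slices of the high-dimensional feature vector and combined by voting.
import numpy as np
from sklearn.svm import LinearSVC

class SubspaceEnsemble:
    def __init__(self, n_subspaces=10, seed=0):
        self.n_subspaces, self.seed = n_subspaces, seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        idx = rng.permutation(X.shape[1])
        self.slices_ = np.array_split(idx, self.n_subspaces)   # disjoint feature subsets
        self.models_ = [LinearSVC().fit(X[:, s], y) for s in self.slices_]
        return self

    def predict(self, X):
        votes = np.array([m.predict(X[:, s]) for m, s in zip(self.models_, self.slices_)])
        # majority vote over the base learners (labels are non-negative integer ids)
        return np.array([np.bincount(col).argmax() for col in votes.T])
```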
Abstract
Authorship identification can be seen as a single-label
multi-class text categorization problem. Very often, there are
extremely few training texts at least for some of the candidate
authors. In this paper, we present methods to handle imbalanced
multi-class textual datasets. The main idea is to segment the
training texts into sub-samples according to the size of the class.
Hence, minority classes can be segmented into many short samples
and majority classes into fewer, longer samples. Moreover, we
explore text re-sampling in order to construct a training set
according to a desirable distribution over the classes. Essentially,
text re-sampling can be viewed as providing new synthetic data that
increase the training size of a class. Based on a corpus of newswire
stories in English we present authorship identification experiments
on various multi-class imbalanced cases.
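One simple realization of the segmentation step is sketched below; giving every author the same fixed number of sub-samples is an assumption of this sketch rather than necessarily the paper's exact scheme.

```python
# Sketch: cut each author's concatenated training text into a fixed number of
# sub-samples, so authors with little text contribute short samples and prolific
# authors contribute long ones, yielding a balanced number of instances per class.
def segment_author_texts(texts_per_author, samples_per_author=10):
    segmented = {}
    for author, texts in texts_per_author.items():
        words = " ".join(texts).split()
        size = max(len(words) // samples_per_author, 1)
        segmented[author] = [" ".join(words[i:i + size])
                             for i in range(0, len(words), size)][:samples_per_author]
    return segmented
```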
Abstract
This paper presents a content-based approach to spam detection
based on low-level information. Instead of the traditional 'bag of words' representation,
we use a 'bag of character n-grams' representation which avoids the
sparse data problem that arises in n-grams on the word-level. Moreover, it is
language-independent and does not require any lemmatizer or 'deep' text preprocessing.
Based on experiments on Ling-Spam corpus we evaluate the proposed
representation in combination with support vector machines. Both binary
and term-frequency representations achieve high precision rates while maintaining
recall at an equally high level, which is a crucial factor for anti-spam filters, a
cost sensitive application.
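The representation is straightforward to reproduce; the sketch below uses binary character 3-grams with a linear SVM, and the toy texts and exact n-gram range are illustrative rather than the Ling-Spam setup.

```python
# Character n-gram spam detection sketch: binary char 3-gram features fed to a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["win money now !!!", "meeting agenda for linguistics seminar",
         "cheap meds online", "call for papers: corpus linguistics"]
labels = [1, 0, 1, 0]      # 1 = spam, 0 = legitimate

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 3), binary=True),  # binary char 3-grams
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["free money meds"]))
```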
Abstract
In this paper, we present a binarization technique
specifically designed for historical document images.
Existing methods for this problem focus on either
finding a good global threshold or adapting the
threshold for each area so that to remove smear,
strains, uneven illumination etc. We propose a hybrid
approach that first applies a global thresholding
method and, then, identifies the image areas that are
more likely to still contain noise. Each of these areas is
re-processed separately to achieve better quality of
binarization. We evaluate the proposed approach for
different kinds of degradation problems. The results
show that our method can handle hard cases while
documents already in good condition are not affected
drastically.
Abstract
It is common for libraries to provide public access
to historical and ancient document image collections.
It is common for such document images to require
specialized processing in order to remove background
noise and become more legible. In this paper, we
propose a hybrid binarization approach for improving
the quality of old documents using a combination of
global and local thresholding. First, a global
thresholding technique specifically designed for old
document images is applied to the entire image. Then,
the image areas that still contain background noise are
detected and the same technique is re-applied to each
area separately. Hence, we achieve better adaptability
of the algorithm in cases where various kinds of noise
coexist in different areas of the same image while
avoiding the computational and time cost of applying a
local thresholding to the entire image. Evaluation
results based on a collection of historical document
images indicate that the proposed approach is effective
in removing background noise and improving the
quality of degraded documents while documents
already in good condition are not affected.
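The global-then-local scheme can be sketched as follows; Otsu's method stands in here for the paper's global thresholding technique, and the block size and noise criterion are illustrative assumptions.

```python
# Hybrid sketch: one global threshold over the page, then blocks that still look
# noisy (too many dark pixels) are re-thresholded on their own histograms.
import numpy as np

def otsu_threshold(gray):
    """gray: uint8 grayscale image, 0 (black) .. 255 (white)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    w, mu = hist.cumsum(), (hist * np.arange(256)).cumsum()
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu[-1] * w - mu * w[-1]) ** 2 / (w * (w[-1] - w) * w[-1] ** 2)
    between[~np.isfinite(between)] = 0.0
    return int(between.argmax())          # threshold maximizing between-class variance

def hybrid_binarize(gray, block=64, noise_ratio=0.4):
    out = (gray > otsu_threshold(gray)).astype(np.uint8)     # global pass (1 = background)
    for y in range(0, gray.shape[0], block):
        for x in range(0, gray.shape[1], block):
            if (out[y:y + block, x:x + block] == 0).mean() > noise_ratio:   # block still noisy
                g = gray[y:y + block, x:x + block]
                out[y:y + block, x:x + block] = (g > otsu_threshold(g)).astype(np.uint8)
    return out
```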
Abstract
Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of major benefit, such as tracing the source of code left in the system after a cyber attack, authorship disputes, proof of authorship in court, etc. In this paper, we present our approach, which is based on byte-level n-gram profiles and is an extension of a method that has been successfully applied to natural language text authorship attribution. We propose a simplified profile and a new similarity measure which is less complicated than the algorithm followed in text authorship attribution and seems more suitable for source code identification, since it is better able to deal with very small training sets. Experiments were performed on two different data sets, one with programs written in C++ and the second with programs written in Java. Unlike the traditional language-dependent metrics used by previous studies, our approach can be applied to any programming language with no additional cost. The presented accuracy rates are much better than the best reported results for the same data sets.
Abstract
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
Abstract
In this paper the problem of music performer verification is introduced.
Given a certain performance of a musical piece and a set of candidate pianists,
the task is to examine whether or not a particular pianist is the actual performer.
A database of 22 pianists playing pieces by F. Chopin on a computer-controlled
piano is used in the presented experiments. An appropriate set of features
that captures the idiosyncrasies of music performers is proposed. Well-known
machine learning techniques for constructing learning ensembles are applied
and remarkable results are described in verifying the actual pianist, a very
difficult task even for human experts.
Abstract
In this paper, we present a trainable approach to
discriminate between machine-printed and handwritten
text. An integrated system able to localize text areas and
split them into text-lines is used. A set of simple and
easy-to-compute structural characteristics that capture the
differences between machine-printed and handwritten
text-lines is introduced. Experiments on document images
taken from IAM-DB and GRUHD databases show a
remarkable performance of the proposed approach that
requires minimal training data.
Abstract
This paper addresses the problem of identifying the
most likely music performer, given a set of performances of the
same piece by a number of skilled candidate pianists. We propose a
set of features for representing the stylistic characteristics of a
music performer. A database of piano performances of 22 pianists
playing two pieces by F. Chopin is used in the presented
experiments. Due to the limitations of the training set size and the
characteristics of the input features we propose an ensemble of
simple classifiers derived by both subsampling the training set and
subsampling the input features. Preliminary experiments show that
the resulting ensemble is able to efficiently cope with this difficult
musical task, displaying a level of accuracy unlikely to be matched
by human listeners (under similar conditions).
Abstract
In this study, a comparison of features for
discriminating between different music performers
playing the same piece is presented. Based on a series
of statistical experiments on a data set of piano pieces
played by 22 performers, it is shown that the
deviation from the performance norm (average
performance) is better able to reveal the performers’
individualities in comparison to the deviation from
the printed score. In the framework of automatic
music performer recognition, the norm-based features
prove to be very accurate in intra-piece tests (training
and test set taken from the same piece) and very
stable in inter-piece tests (training and test sets taken
from different pieces). Moreover, it is empirically
demonstrated that the average performance is at least
as effective as the best of the constituent individual
performances while ‘extreme’ performances have the
lowest discriminatory potential when used as norm.
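The norm-based features referred to above can be illustrated in a few lines; the array layout and the use of a plain mean as the norm are assumptions of this sketch.

```python
# The "norm" is the average performance of a piece; each pianist is then represented
# by note-wise deviations from that norm rather than deviations from the printed score.
import numpy as np

def norm_deviation_features(performances):
    """performances: array (n_performers, n_notes, n_dims), e.g. dims = (timing, loudness)."""
    norm = performances.mean(axis=0)      # the average performance over all pianists
    return performances - norm            # per-performer deviation from the norm

perf = np.random.default_rng(0).random((22, 120, 2))   # 22 pianists, 120 notes, 2 expressive dims
print(norm_deviation_features(perf).shape)              # (22, 120, 2)
```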
Abstract
In this study, a computational model that aims at the automatic discrimination of different human
music performers playing the same piece is presented. The proposed model is based on the note
level and does not require any deep (e.g., structural or harmonic, etc.) analysis. A set of measures
that attempts to capture both the style of the author and the style of the piece is introduced. The
presented approach has been applied to a database of piano sonatas by W.A. Mozart performed by
both a French and a Viennese pianist with very encouraging preliminary results.
Abstract
In this paper we present a practical approach to text chunking for unrestricted Modern Greek text that is based on multiple-pass parsing. Two versions of this chunker are proposed: one based on a large lexicon and one based on minimal resources. In the latter case the morphological analysis is performed using exclusively two small lexicons containing closed-class words and common suffixes of Modern Greek words. We give comparative performance results on the basis of a corpus of unrestricted text and show that very good results can be obtained by omitting the large and complicated resources. Moreover, the considerable time cost introduced by the use of the large lexicon indicates that the minimal-resources chunker is the best solution for a practical application that requires rapid response and tolerates less than perfect parsing results.
Abstract
In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language. Using as testing ground a part of the Wall Street Journal corpus, we show that the most frequent words of the British National Corpus, representing the most frequent words of the written English language, are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role in terms of accurate text categorization as well as when dealing with training data of limited size.
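The feature extraction is simple enough to show directly; the short word list below is illustrative (the paper uses the most frequent words of the British National Corpus), and the tokenization is deliberately naive.

```python
# Relative frequencies of very frequent words and common punctuation marks serve as
# genre discriminators; a classifier is then trained on these low-dimensional vectors.
MOST_FREQUENT = ["the", "of", "and", "a", "in", "to", "is", "was", "it", "for",
                 ",", ".", "!", "?", ";"]

def genre_features(text):
    for mark in ",.!?;":
        text = text.replace(mark, f" {mark} ")   # keep punctuation marks as tokens
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return [tokens.count(w) / total for w in MOST_FREQUENT]

print(genre_features("The results, however, were clear: the method works."))
```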
Abstract
This paper presents a character segmentation algorithm for unconstrained cursive handwritten text. The transformation-based learning method and a simplified variation of it are used in order to extract automatically rules that detect the segment boundaries. Comparative experimental results are given for a collection of multi-writer handwritten words. The achieved accuracy in detecting segment boundaries exceeds 82%. Moreover, limited training data can provide very satisfactory results.
Abstract
In this paper we present an approach to automatic authorship attribution dealing with real-world (or unrestricted) text. Our method is based on the computational analysis of the input text using a text-processing tool. Besides the style markers relevant to the output of this tool we also use analysis-dependent style markers, that is, measures that represent the way in which the text has been processed. No word frequency counts, nor other lexically-based measures are taken into account. We show that the proposed set of style markers is able to distinguish texts of various authors of a weekly newspaper using multiple regression. All the experiments we present were performed using real-world text downloaded from the World Wide Web. Our approach is easily trainable and fully-automated requiring no manual text preprocessing nor sampling.
Abstract
Transformation-based learning (TBL) is the most important machine learning theory aiming at the automatic extraction of rules based on already tagged corpora. However, the application of this theory to a certain application without taking into account the features that characterize this application may cause problems regarding the training time cost as well as the accuracy of the extracted rules. In this paper we present a variation of the basic idea of the TBL and we apply it to the extraction of the sentence boundary disambiguation rules in real-world text, a prerequisite for the vast majority of the natural language processing applications. We show that our approach achieves considerably higher accuracy results and, moreover, requires minimal training time in comparison to the traditional TBL.
Abstract
This paper describes a user-assisted business letter generator that meets the ever-increasing demand for more flexible and modular letter generators which draw on explicit thematic models and are easily adaptable to specific user needs. Based on a detailed analysis of requirements and taking full advantage of the end users' feedback, the presented generator not only creates a business letter according to the user's choices, but also refines it taking into consideration stylistic aspects such as written style and tone.
Abstract
Language barriers present a major problem in the effectiveness of resource sharing and in common access to the resources of libraries. In this paper we present the TRANSLIB system which stemmed from the integration of both new and already existing advanced multilingual information tools. By making use of some AI-based methods this system takes full advantage of these resources in order to provide multilingual access to library catalogues. Among its striking features, it enables searching in multiple languages, multilingual presentation of the query results, and localization of the user interface. TRANSLIB has been currently tested in existing medium-sized bibliographic databases. Early evaluation results show a remarkable improvement in the search process and report high user-friendliness, and easy and low-cost maintenance and upgrade of the system.
Abstract
The presented work is strongly motivated by the need to categorize unrestricted texts in terms of functional style (FS) in order to attain a satisfying outcome in style processing. Towards this aim, a three-level description of FS is given that comprises: (a) the basic categories of FS, (b) the main features that characterize each of the above categories, and (c) the linguistic identifiers that act as style markers in texts for the identification of the above features. Special emphasis is put on the problems faced in the computational implementation of the aforementioned findings as well as on the selection of the most appropriate stylometrics (i.e., stylistic scores) to achieve better results in text categorization. This approach is language independent, empirically-driven, and can be used in various applications including grammar and style checking, natural language generation, style verification in real-world texts, and recognition of style shift between adjacent portions of text.