Principal Component Analysis and Neural Networks for Authorship Attribution

A common problem in statistical pattern recognition is that of feature selection or feature extraction. Feature extraction refers to a process whereby a data space is transformed into a feature space that, in theory, has exactly the same dimension as the original data space. However, the transformation is designed in such a way that the data set may be represented by a reduced number of "effective" features while retaining most of the intrinsic information content of the data; in other words, the data set undergoes a dimensionality reduction. In this paper, the data collected by counting selected syntactic characteristics in around a thousand paragraphs of each of the sample books underwent a principal component analysis performed using neural networks. The first principal components are then used to distinguish the authors of the texts by means of multilayer perceptron type artificial neural networks.


INTRODUCTION
Authorship attribution is probably the oldest of all text categorization problems, as old as writing itself. Although it is also possibly the least well-organized of disciplines, and its history is marred by the mishandling of statistical techniques, it still promises to provide useful applications in spheres as diverse as law, security, and education.
Problems of authorship have always been attacked with traditional research methods: unearthing and dating original manuscripts, for instance. But since the late 19th century, statisticians have developed "non-traditional" tools that attempt to discern quantifiable patterns within a text or corpus, with the hope that these features will help to reliably identify different authors.
The origin of non-traditional authorship attribution, or stylometry, is often said to be Augustus de Morgan's suggestion in 1851 that certain authors of the Bible might be distinguishable from one another by the lengths of the words they used (Holmes 1998). In 1887, Mendenhall began investigating this hypothesis, searching for a characteristic difference in the distribution of word lengths in writings of different languages and presentation styles. In 1901, he turned his methods to Shakespeare, Bacon and Marlowe, and found that while Shakespeare and Marlowe were nearly indistinguishable, they were both significantly and consistently different from Bacon (Williams 1975). The difference was mainly observed in the relative frequency of three- and four-letter words: Shakespeare used more four-letter words, and Bacon more three-letter words.
However, it was later noted by Williams that this difference was more likely attributable to the different styles of composition: Mendenhall had compared Bacon's prose to the blank verse of Marlowe and Shakespeare (Williams 1975). Williams examined the prose and verse of a fourth contemporary, Sir Philip Sydney, and found that they differed in much the same way as Bacon's and Shakespeare's writings did. Williams concluded that Mendenhall had misclassified Shakespeare's writings as prose. In Smith's words, "Mendenhall's method now appears to be so discredited that any serious student of authorship should discard it" (cited in Juola 2006).
Authorship studies also began independently around the same time in Russia, it seems, with Morozov proposing a model for measuring style that garnered the interest of A. Markov (Kukurushkina et al. 2002). In the West, it took 30 years or so for Mendenhall's studies to be resumed by other linguists. George Zipf examined word frequencies and discovered not a stylometric feature but a universal law of language, Zipf's Law: the statistical rank of a word varies inversely with its frequency (Smith 2008). G. Udny Yule devised a feature known as "Yule's characteristic K," which estimated 'vocabulary richness' by comparing word frequencies to those expected from a Poisson distribution, but like Mendenhall's word lengths, this too was later found to be an unreliable marker of style (Holmes 1998).
In fact, most of the measurements proposed in this period proved unhelpful: among others, researchers tried average sentence length, the number of syllables per word, and other estimates of vocabulary richness such as Simpson's D index and the simple type/token ratio (the ratio of the number of unique words, or types, to the total number of words, or tokens) (Juola 2006).
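These richness measures are easy to state precisely. As a minimal sketch (not code from the period's studies, and with an assumed ASCII tokenizer), the type/token ratio and Simpson's D can be computed as:

```python
import re

def vocabulary_richness(text):
    """Compute two classic (if unreliable) vocabulary-richness measures."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens)
    # Type/token ratio: unique words (types) over total words (tokens).
    ttr = len(set(tokens)) / n
    # Simpson's D: probability that two tokens drawn at random
    # (without replacement) are the same type.
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    d = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    return ttr, d

ttr, d = vocabulary_richness("the cat sat on the mat and the dog sat too")
```

Note that the type/token ratio falls as a text grows, which is one reason it proved unreliable across texts of different lengths.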
The needed breakthrough came at last in 1963, with Mosteller and Wallace's study of the Federalist Papers. In 1787 and 1788, John Jay, Alexander Hamilton and James Madison collectively wrote 85 newspaper essays supporting the ratification of the constitution. Published under the pseudonym "Publius," the authors later revealed which of the Federalist Papers they had written; however, while the authorship of 67 was undisputed, 12 were claimed by both Hamilton and Madison. Mosteller and Wallace hoped to characterize each author's style through their choice of function words, such as "to," "by," and so forth. Function words are regarded as good markers of style because they are (assumed to be) unconsciously generated and independent of semantics (meaning, or what the author is trying to convey). That is, an author may have a preference for modes of expression (for instance, the active vs. the passive voice) that emphasize certain function words, and the same broad set of function words will be used regardless of the topic at hand (Smith 2008).
Despite the fact that Hamilton and Madison have otherwise very similar styles (nearly identical sentence-length distributions, as noted by Juola 2006), Mosteller and Wallace found sharp differences in their preference for different function words: for instance, the word "upon" appears 3.24 times per 1000 words in Hamilton, and just 0.23 times in Madison (quoted in Holmes 1998). Adjusting these frequencies with a Bayesian model, they showed that Madison had most likely written all 12 disputed papers. Traditional scholarship had already long come to the same conclusion, but Mosteller and Wallace's conclusion was independent, and thus a great achievement of the then quite exploratory field of stylometry. The Federalist Papers problem is still regarded as a very difficult test case, and as an unofficial benchmark it has been used to test most methods of authorship attribution developed since then (see, for instance, Kjell 1994, Holmes & Forsyth 1995, Bosch & Smith 1998, and Fung 2003).
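Rates such as "3.24 times per 1000 words" are simple relative frequencies. A hedged sketch (the word list and tokenizer here are illustrative assumptions, not Mosteller and Wallace's actual feature set):

```python
import re

# A tiny illustrative sample of function words; the real studies used
# much longer lists.
FUNCTION_WORDS = ["upon", "by", "to", "on", "of", "while", "whilst"]

def rates_per_1000(text, words=FUNCTION_WORDS):
    """Occurrences of each function word per 1000 tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    return {w: 1000.0 * tokens.count(w) / n for w in words}

rates = rates_per_1000("Upon the hill by the sea upon a time")
```

Vectors of such rates, one per document, are exactly the kind of input that the later PCA-based methods operate on.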

PROBLEM DEFINITION
In this paper author attribution is considered as an application of principal component analysis and as a classification task (Chaski 2001, 2005). The texts studied are literary works of three Bosnian writers: Ivo Andrić (1892-1975), Meša Selimović (1910-1982), and Derviš Sušić (1925-1990). The features selected to describe the texts are lexical and syntactical components that show promising results when used as writer invariants, because they are used rather subconsciously and reflect an individual writing style that is difficult to copy. Principal components of the data elicited from the texts possess generalization properties that allow for the required high accuracy of classification (Hayes 2008).
The novels authored by Ivo Andrić, Meša Selimović, and Derviš Sušić provide corpora wide enough to ensure that the characteristic features found in the training data can be treated as representative of other parts of the texts, and that this generalized knowledge can be used to classify the test data according to their respective authors.
Obviously, literary texts can vary greatly in length; what is more, all stylistic features can be influenced not only by the different periods in which the texts were written but also by their genre. The first of these issues is easily dealt with by dividing long texts, such as novels, into a number of smaller parts of approximately the same size.
The described approach gives an additional advantage in classification tasks: even if some of these parts are classified incorrectly, the whole text can still be properly attributed to an author by basing the final decision on the majority of outcomes rather than on every individual decision for every sample. Whether the genre of a novel is reflected in its lexical and syntactic characteristics is a question yet to be answered.
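The chunking-plus-majority-vote scheme just described can be sketched as follows (a toy illustration under assumed interfaces, not the paper's code):

```python
from collections import Counter

def split_into_chunks(paragraphs, chunk_size):
    """Divide a long text (a list of paragraphs) into equal-sized parts,
    dropping any ragged tail so all chunks are directly comparable."""
    return [paragraphs[i:i + chunk_size]
            for i in range(0, len(paragraphs) - chunk_size + 1, chunk_size)]

def attribute_by_majority(chunk_labels):
    """Attribute the whole text to the author predicted for most chunks,
    so a few misclassified chunks do not flip the final decision."""
    return Counter(chunk_labels).most_common(1)[0][0]

chunks = split_into_chunks(list(range(10)), 3)
winner = attribute_by_majority(["Andric", "Selimovic", "Andric"])
```

Majority voting makes the final attribution robust: with k chunks, up to floor((k-1)/2) individual errors are tolerated.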

Feature Selection
Establishing features that work as effective discriminators of the texts under study is one of the critical issues in research on authorship analysis. In this research, fourteen lexical textual descriptors are used: average sentence length, average word length, the numbers of words, sentences, and commas, occurrences of the conjunction "and" (in Bosnian, "i"), and other per-paragraph characteristics listed in the first column of the corresponding table. In the next chapter the pattern captured by the principal components corresponding to these data is displayed.
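Several of these descriptors are straightforward counts. The sketch below is illustrative only: the full study uses fourteen descriptors, and the tokenization rules here are assumptions, not the paper's exact definitions.

```python
import re

def paragraph_descriptors(paragraph):
    """Compute an illustrative subset of the paper's lexical/syntactic
    descriptors for a single paragraph."""
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    words = re.findall(r"\w+", paragraph)
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_commas": paragraph.count(","),
        # The Bosnian conjunction "i" ("and"), counted case-insensitively.
        "n_i": sum(1 for w in words if w.lower() == "i"),
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }

d = paragraph_descriptors("Ja i on. Ona i ja, opet.")
```

Each paragraph thus becomes a fixed-length numeric vector, which is the form of input the principal component analysis below requires.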

PRINCIPAL COMPONENT ANALYSIS
The methods of Mosteller and Wallace have proved as enduring as the problem they investigated: they were only modestly altered when Burrows described his method of stylometric analysis in a series of papers published in the late 1980s and early 1990s (Holmes 1998; see, for instance, Burrows 1992). The Burrows method essentially involves computing the frequency of each word in a list of function words (a list larger than that of Mosteller and Wallace), and performing principal component analysis (PCA) to find the linear combination of variables that best accounts for the variation in the data. Rather than analyze this result statistically, the transformed data are simply plotted. Two-dimensional plots of the first two principal components supply us with a means to inspect visually for trends, which occur as clusters of points (Holmes 1998). Cluster analysis may follow this step later.
This simple but effective method continues to be used today, partly because of the ease with which the results are communicated and interpreted. For example, Binongo used this method to study the problem of the authorship of L. Frank Baum's last book, which historians had long suspected of being mostly the work of Baum's successor, Ruth P. Thompson (Binongo 2003). He confirmed this suspicion independently, demonstrating that Thompson was much more prone than Baum to use position words such as "up," "down," "over," and "back." This was not demonstrated using complex statistical techniques; rather, function word frequencies were tallied, the authors' tallies compared, PCA used to reduce the dimensionality of the data, and the resulting plots inspected: the two authors' works form obvious clusters. Similar procedures can be found in (Holmes & Forsyth 1995, Holmes et al. 2001, and Peng & Hentgartner 2002).
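The Burrows procedure (tally frequencies, run PCA, inspect the first two components) can be sketched with plain numpy via the singular value decomposition; the random matrix below merely stands in for a real samples-by-features frequency table:

```python
import numpy as np

def first_principal_components(X, k=2):
    """Project the rows of X (samples x features) onto the top-k
    principal axes, i.e. the directions of greatest variance."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    # SVD of the centred data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value (hence decreasing variance).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # PC scores, ready for plotting

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))               # stand-in frequency table
scores = first_principal_components(X, k=2)
```

In the Burrows method the two columns of `scores` would be scattered against each other, and authorship would show up as visually separated clusters of points.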

Theory of Principal Component Analysis
Multivariate statistics deals with the relations between several random variables. The sets of observations of the random variables are represented by a multivariate data matrix $X$ of dimension $n \times p$, in which each column vector represents the data for a different variable. If $c$ is a $p \times 1$ matrix, then

$y = Xc$

is a linear combination of the set of observations.
Descriptive statistics can also be applied to a multivariate data matrix $X$: the sample mean of the $k$th variable is $\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$, and the sample variance is $s_k^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2$. Next we introduce a matrix that contains statistics relating pairs of variables $j$ and $k$: the sample covariance $s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$. It follows that $s_{jk} = s_{kj}$ and that $s_{kk} = s_k^2$, the sample variance, so the sample covariance matrix is symmetric.

[Figure 5. Points representing frequencies in the first and second principal components of the other book authored by Meša Selimović, Tvrdjeva. The writerprint of Meša Selimović is revealed by peaks twice as high as the corresponding Ivo Andrić peaks.]

[Figure 6. Pattern of the points representing frequencies in the first and second principal components of Pobune.]
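As a numeric check of the sample covariance just defined, the matrix can be built directly from centred data (toy numbers, assuming numpy is available):

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance matrix: s_jk = (1/(n-1)) * sum_i
    (x_ij - xbar_j)(x_ik - xbar_k), computed from centred columns."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)     # subtract each column's sample mean
    return (Xc.T @ Xc) / (n - 1)

X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])
S = sample_covariance(X)
```

The diagonal of `S` holds the sample variances and the matrix is symmetric, exactly as the identities $s_{kk} = s_k^2$ and $s_{jk} = s_{kj}$ require; PCA then diagonalizes this matrix.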

THEOREM
The frequency profile of the first and second principal components of the textual data seems to be invariant throughout a text. There are similarities in the frequency profiles of texts authored by the same person. Therefore these frequency profiles can be regarded as writerprints. However, a visual identification of the authors of these writerprints seems to be difficult. To help classify these writerprints, we propose to treat classification as a pattern recognition task and to use artificial neural networks, more specifically multilayer perceptrons, to do the job.

ARTIFICIAL NEURAL NETWORKS
Nervous systems existing in biological organisms have for years been the subject of studies by mathematicians who tried to develop models describing such systems in all their complexity. Artificial neural networks emerged as generalizations of these concepts, with the mathematical model of the artificial neuron due to McCulloch and Pitts (McCulloch and Pitts 1943), the definition of an unsupervised learning rule by Hebb (Hebb 1949), and the first implementation of Rosenblatt's perceptron (Rosenblatt 1958). The efficiency and applicability of artificial neural networks to computational tasks have been questioned many times, especially at the very beginning of their history, when the book "Perceptrons" by Minsky and Papert (Minsky and Papert 1969) caused a dissipation of the initial interest and enthusiasm for applications of neural networks. It was not until the 1970s and 80s, when the backpropagation algorithm for supervised learning was documented, that artificial neural networks regained their status and proved beyond doubt to be a sufficiently good approach to many problems.

Multilayer Perceptrons
Multilayer perceptrons have been applied successfully to solve some difficult and diverse problems by training them in a supervised manner with a highly popular algorithm known as the error backpropagation algorithm. This algorithm is based on the error-correction learning rule. As such, it may be viewed as a generalization of an equally popular adaptive filtering algorithm: the ubiquitous least-mean-square (LMS) algorithm.
From the architectural point of view, neural networks can be divided into two categories: feed-forward and recurrent networks. In feed-forward networks the flow of data is strictly from input to output cells; the cells can be grouped into layers, but no feedback interconnections can exist. Recurrent networks, on the other hand, contain feedback loops, and their dynamical properties are very important. The type of neural network most commonly employed in pattern classification tasks is the feed-forward network, which is constructed from layers and possesses unidirectional weighted connections between neurons.
Common examples of this category are multilayer perceptron and radial basis function networks, and committee machines.
A multilayer perceptron is more closely defined by establishing the number of neurons from which it is built. This process can be divided into three parts; two of them, finding the numbers of input and output units, are quite simple, whereas the third, specifying the number of hidden neurons, can be crucial to the accuracy of the obtained classification results.
The numbers of input and output neurons can actually be seen as an external specification of the network, and these parameters are usually found in the task specification. For classification purposes, as many input nodes are required as there are distinct features defined for the objects being analyzed. The only way to better adapt the network to the problem is to consider the data types chosen for each of the selected features. For example, instead of using the absolute value of some feature for each sample, it can be more advantageous to calculate its change, as this relative value should be smaller than the whole range of possible values, and thus variations can be picked up more easily by the artificial neural network. The number of network outputs typically reflects the number of classification classes.
The third factor in the specification of a multilayer perceptron is the number of hidden neurons and layers, and it is essential to classification ability and accuracy. With no hidden layer the network is able to properly solve only linearly separable problems, with the output neuron dividing the input space by a hyperplane. Since not many problems fall within this category, a hidden layer is usually necessary.
With a single hidden layer the network can classify objects in the input space that are sometimes, not quite formally, referred to as simplexes: single convex regions partitioned out of the space by some number of hyperplanes. With two hidden layers the network can classify arbitrary objects, since they can always be represented as a sum or difference of such simplexes classified by the second hidden layer.
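The claim about hidden layers can be made concrete with the classic XOR example: XOR is not linearly separable, so no single threshold neuron computes it, but one hidden layer of two units does. The weights below are hand-set for illustration, not learned:

```python
import numpy as np

def step(v):
    """Hard-limiting threshold activation."""
    return (v > 0).astype(float)

def xor_net(x):
    """Two-input XOR via one hidden layer of two threshold neurons."""
    # Hidden layer: one neuron fires on the OR half-plane (x1 + x2 > 0.5),
    # the other on the NAND half-plane (x1 + x2 < 1.5).
    W1 = np.array([[1.0, 1.0],
                   [-1.0, -1.0]])
    b1 = np.array([-0.5, 1.5])
    h = step(W1 @ x + b1)
    # Output: AND of the two half-planes; their intersection is the XOR region.
    return step(np.array([1.0, 1.0]) @ h - 1.5)
```

Each hidden neuron draws one hyperplane; the output neuron selects their convex intersection, which is exactly the "simplex" construction described above.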
Apart from the number of layers, there is the further issue of the number of neurons in these layers. When the number of neurons is unnecessarily high, the network learns easily but generalizes poorly on new data. This situation resembles an auto-associative memory: too many neurons retain too much information about the training set, so that the network "remembers" rather than "learns" its characteristics, which is not enough to ensure the good generalization that is needed.
On the other hand, when there are too few hidden neurons, the network may never learn the relationships among the input data. Since there is no precise indicator of how many neurons should be used in the construction of a network, it is common practice to build a network with some initial number of units and, when it trains poorly, to increase or decrease this number as required. The obtained solutions are usually task-dependent.
For the purposes of this research, a neural network with fourteen input terminals, five hidden neurons in one hidden layer, and an output layer with one neuron is chosen.

Activation Functions
The activation (or transfer) function of a neuron is a rule that defines how it reacts to the data received through its inputs, each of which has a certain weight.
Among the most frequently used activation functions are a linear or semilinear function, a hard-limiting threshold function, or a smoothly limiting threshold such as a sigmoid or a hyperbolic tangent. Due to their inherent properties, such as whether they are linear, continuous or differentiable, different activation functions perform with different efficiency in task-specific solutions.
For classification tasks, the antisymmetric sigmoid hyperbolic tangent is the most popular activation function:

$\varphi(v) = \tanh(v) = \frac{1 - e^{-2v}}{1 + e^{-2v}}.$

In order to produce the desired set of output states whenever a set of inputs is presented to a neural network, the network has to be configured by setting the strengths of the interconnections; this step corresponds to the network learning procedure. Learning rules are roughly divided into the three categories of supervised, unsupervised and reinforcement learning methods.
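A direct transcription of this activation function, together with its derivative (which backpropagation needs), might look like:

```python
import math

def phi(v):
    """Antisymmetric activation: phi(v) = tanh(v) = (1 - e^{-2v}) / (1 + e^{-2v})."""
    return (1 - math.exp(-2 * v)) / (1 + math.exp(-2 * v))

def phi_prime(v):
    """Derivative of tanh: phi'(v) = 1 - tanh(v)^2."""
    return 1 - phi(v) ** 2
```

The antisymmetry $\varphi(-v) = -\varphi(v)$ is one reason this function is preferred for classification: outputs are balanced around zero, which tends to speed up backpropagation learning.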
The term supervised indicates an external teacher who provides information about the desired answer for each input sample. Thus, in the case of supervised learning, the training data is specified in the form of pairs of input values and expected outputs. By comparing the expected outcomes with those actually obtained from the network, the error function is calculated, and its minimization leads to modifications of the connection weights so as to obtain the output values closest to the expected ones for each training sample and for the whole training set.
In unsupervised learning, no answer is specified as expected of the neural network, and it is left somewhat to itself to discover a self-organization that yields the same values at an output neuron for new samples as for the nearest samples of the training set.
Reinforcement learning relies on constant interaction between the network and its environment. The network has no indication of what is expected of it, but it can induce it by discovering which actions bring the highest reward, even if this reward is not immediate but delayed. Based on these rewards it performs the re-organization that is most advantageous in the long run [16].
The modification of the weights associated with the network interconnections can be performed with the error backpropagation algorithm. The important factor in this algorithm is the learning rate η, whose value, when too high, can cause oscillations around the local minima of the error function and, when too low, results in slow convergence. This locality is considered the drawback of the backpropagation method, but its universality is its advantage.
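The role of the learning rate η in the error-correction weight update can be illustrated with a single tanh neuron trained by gradient descent on a toy one-dimensional problem (a sketch under assumed toy data, not the paper's network):

```python
import math
import random

def train_neuron(samples, eta=0.1, epochs=200):
    """Gradient descent for a single tanh neuron. Each update is
    Delta_w = eta * delta * x, where delta = (d - y) * phi'(v):
    the error term scaled by the activation derivative."""
    random.seed(0)
    w, b = random.uniform(-0.5, 0.5), 0.0
    for _ in range(epochs):
        for x, d in samples:
            y = math.tanh(w * x + b)
            delta = (d - y) * (1 - y * y)   # error * tanh derivative
            w += eta * delta * x            # learning rate eta scales the step
            b += eta * delta
    return w, b

# Two linearly separable classes on the real line, targets -1 and +1.
data = [(-2.0, -1.0), (-1.0, -1.0), (1.0, 1.0), (2.0, 1.0)]
w, b = train_neuron(data)
```

Raising `eta` takes larger steps and risks oscillating around the minimum of the error function; lowering it slows convergence, exactly the trade-off described above.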

APPLICATION TO AUTHOR ATTRIBUTION
The author identification analysis performed within the research presented in this paper can be seen as a multistage process, as follows:
- the first step was the selection of the training and testing examples: the texts to be studied;
- the next stage was the choice of the textual descriptors to be analyzed: the writerprints of the authors of the previously selected texts;
- then followed the phase of calculating the characteristics for all descriptors;
- randomly chosen data matrices were transformed into matrices of principal components (principal component analysis);
- frequencies of the principal components were counted in bins of equal length, to be used later for training the neural network;
- the specification of the network, with its architecture and learning method, can be seen as the next step of the whole procedure;
- the following step consisted of the actual training of the network;
- the next stage was testing;
- and the final one corresponded to the analysis of the obtained results, leading to conclusions and possible indicators for improvement.
This process is applied to different input data, with an artificial neural network of twenty-five input terminals, five hidden neurons in one hidden layer, and one output neuron.
The input vector x is twenty-five-dimensional, its components being the frequencies in the corresponding bins, as shown in the signal flow graph in Figure 3. The algorithm results in a decision about the attribution of the paragraphs whose textual description entered as inputs in the form of frequencies in bins of the principal components.
Our aim is to train a neural network to distinguish paragraphs authored by two authors in a mixed text. We have chosen 100 sets of 200 paragraphs from each of the texts. Each 200-paragraph set is transformed into its principal components, and only the first principal component is taken into account. Hence we have 100 first principal components from each text. These principal components are then transformed into data vectors whose elements are frequencies in 20 uniformly specified bins. The resulting data is a 100 × 20 matrix for each text.
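The binning step that turns each set's first-principal-component scores into a 20-element frequency vector might be sketched as follows (random scores stand in for real PC1 values):

```python
import numpy as np

def pc_histogram(pc_scores, n_bins=20):
    """Turn one set's first-principal-component scores into a frequency
    vector over n_bins uniform bins spanning the scores' range; this
    vector is what gets fed to the neural network."""
    counts, _ = np.histogram(pc_scores, bins=n_bins)
    return counts

rng = np.random.default_rng(1)
# Stand-ins for 100 sets of 200 first-principal-component scores each.
sets = [rng.normal(size=200) for _ in range(100)]
data = np.stack([pc_histogram(s) for s in sets])   # 100 x 20 matrix
```

Each row of `data` is one training or test sample; every row sums to the 200 paragraphs of its set, so the representation is length-normalized by construction.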
In the training phase, the neural network succeeded in classifying the Ivo Andrić (Cuprija na Drina) and Meša Selimović (Derviš i Smrt) paragraphs with 100% probability of correct classification.
Then the test data, consisting of a random mixture of 100 Cuprija and 100 Smrt data vectors, is sent to the neural network for classification. The network classified this data with 100% probability of correct classification. Next we sent to the network data of length 200 from the other texts. The correct classification numbers are as follows. As is seen from the tables above, the neural network is successful on test data from the texts it was trained on. The success in the classification of other books by the same authors is also satisfactory.

CONCLUSIONS
The research described in this paper concerning author identification analysis shows that the method of principal component analysis (PCA), when followed by an artificial neural network, is an efficient tool. A series of future experiments should thus include a wider range of authors, the definition of new sets of textual descriptors, tests of other types and structures of neural networks, and a search for the possibility of inheritance of writerprints through translation into other languages.