Authorship Authentication of Short Messages from Social Networks Machines

The dataset consists of 17000 tweets collected from Twitter, as 500 tweets for each of 34 authors that meet certain criteria. Raw data is collected by using the software Nvivo. The collected raw data is preprocessed to extract frequencies of 200 features. In the data analysis, 128 of the features are eliminated since they are rare in tweets. As a progressive presentation, five, fifteen, twenty, twenty five, thirty and thirty four authors are selected each time. Since recurrent artificial neural networks are more stable, and in general ANNs are more successful in distinguishing two classes, for N authors, N×N neural networks are trained for pairwise classification. These experts are then organized in N competing teams (CANNT) to aggregate the decisions of these N×N experts. This procedure is repeated seven times, and committees with seven members vote for the final decision. By a commonest type voting, the accuracy is boosted by around ten percent. The number of authors is seen to have little effect on the accuracy of the authentication, and around 80% accuracy is achieved for any number of authors.


INTRODUCTION
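The Internet has become dramatically more significant. Social networks have attracted the interest of billions, and their effect grows each day. Users reach others, share opinions and transmit information. Online networks like Twitter and Facebook serve as simple virtual environments and have become rich, easy-to-use content platforms that provide knowledge. Nonetheless, several security issues arise with the wide usage of these sites. It can be considered that these sites have a trustable environment, but they are accessible to virtual attacks. Detecting fake and compromised accounts, and distinguishing them, are the main problems in authorship authentication in social networks. This work aims at developing a system that is able to find the author of messages by providing to the system posts written by a list of suspected users on social media and choosing matched authors.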
Related studies mostly focused on longer text documents rather than the short messages targeted by this research (Can, 2012). This study is important in combining stylometry, a science more than a century old, with current computational capacity for short text messages. Stylometry for text classification of short social network messages, and the appropriate methods applied in relevant contemporary research, were investigated as the base of this study.
Stylometry, also known as authorship analysis, studies linguistic style in order to determine the original author of a given text. Its methods have been primarily applied to analyze letters and literary works such as the Federalist Papers (Hamilton, et. al., 2008). A general method in stylometry is to analyze the vocabulary of an author and the frequency of use of the words in it, which is then compared with the vocabulary of another author. A more specific analysis of the frequency of use of function words, including numerals, pronouns, prepositions, auxiliary verbs, and conjunctions, is also possible. Analysis of the average sentence length or of the use of very unusual words is another stylometric method for comparing texts.
There are three main perspectives regarding today's applications: authorship attribution, authorship verification, and authorship profiling. Authorship attribution aims to determine the probable author from among several candidate authors. Authorship verification, on the other hand, finds whether an author's linguistic style matches the linguistic style of another author. Authorship profiling aims to determine attributes that are likely to reveal an anonymous author's origin, age, gender, and so on. This work focuses on the first perspective, i.e. authorship attribution.
Detecting the authorship of a document of fewer than 1000 words was thought to be difficult in the early 19th century. In the early 21st century this number decreased, and determining the authorship of a document of 250 words was thought to be possible. There is a need to decrease this limit further because of the spreading usage of many shorter communication tools such as Twitter, Facebook etc. Authorship attribution of online documents differs from authorship attribution of traditional works in two ways. The first is that online documents or text collections are frequently informal and unstructured, and are not necessarily grammatically correct in comparison to literary texts. The second is that the number of authorship disputes regarding a single online document is much larger than for traditional published documents. In this situation, the scarcity of standardized data for testing the accuracy of results is one of the challenges of authorship attribution.
The increasing popularity of social media has made it easier for researchers to direct their focus to authorship attribution in micro-blogs. Various studies on the use of authorship analysis in social networks have been published recently.
The problem of authorship attribution for the online social network Twitter is studied in this work. Twitter has recently grown in popularity, reporting a user base of over 500 million who share almost the same number of messages, called tweets, daily (internet live stats). Twitter differs from other social networks in its publishing limit: users are able to publish only 140 characters in each tweet.
Various classification methods can be applied to the authorship attribution problem. Authorship attribution techniques show an important transition from statistical methods to machine learning based approaches (Usha et al, 2017). Supervised classification methods are preferred in the current literature (Rocha et al, 2016). In this study, a machine learning based approach was used. Abbasi A. (2005) collected 20 web forum messages from each of 20 authors. The average length was 76.6 words. They used 5 authors and 30 randomly chosen messages in their experiment to compare feature types and classification techniques. 301 features were chosen, and C4.5 and SVM were used. Accuracy was 90% for C4.5 and 97% for SVM. Calix et al. (2008) updated an existing C# based stylometry system for verifying the authors of e-mails. They used 55 style features and the K-nearest neighbor algorithm for classification. The average length of the e-mails was 150 words.
Layton R. (2010) evaluated current techniques and identified some new preprocessing methods. They stated that the existing authorship attribution technique SCAP (source code authorship profile) performs well. The paper determines a threshold number of tweets for the attribution task: 120 tweets per author is reported as an important threshold, beyond which increasing the number of tweets brings no significant improvement in accuracy. Bhargava, et al (2013) grouped tweets to increase the text size under consideration. They preferred to analyze features over a group of tweets instead of a single one. They used syntactic, lexical, tweet-specific and emoticon features as author style, and the model was trained with SVM as the classifier. By increasing the length of each block, they reached 81.42% accuracy for 10 users with 200 tweets each, and 77.7% accuracy after increasing the number of tweets to 250 each. When they increased the number of users to 20 with 300 tweets per user, they achieved 64.54% accuracy. They also reported that while groups of 10 tweets gave the best result, using each tweet alone resulted in 78.1% accuracy. Green and Sheppard (2013) focused on messages collected from Twitter to analyze the most effective feature sets for authorship verification. They used the sequential minimal optimization (SMO) algorithm included in Weka to classify 10 authors with 120 tweets from each, and obtained a 44% accuracy rate. They compared style marker (SM) feature sets and bag-of-words (BOW) feature sets, and reported that SM features are more effective than BOW features for authorship verification. Further, the analysis of authorship traits for verifying the legitimacy of Twitter accounts was examined by Barbon et al (2017). To that end, syntactic, lexical, idiosyncratic and content-specific features were applied.
Arakawa et al (2014) investigated a Twitter-specific approach which evaluates the category and number of retweets. Afroz et al (2014) prepared a large-scale study of posts on forums and malicious search engine optimization. They proposed several features that are suitable for social network messages, such as word-level bigrams, numbers used in place of letters, capitalization, and the existence of foreign words. Azarbonyad et al (2015) drew attention to the dynamicity of authors and examined the temporal changes of word usage by authors of tweets and emails; based on this examination, they suggested a way to measure the dynamicity of authors' word usage. Li, et al (2016) used short posts from Facebook. Facebook posts, with an average of 20.6 words, were used as the dataset to determine whether a user is authenticated or not among 30 users. SVM Light with 233 features was applied and 12 tests were conducted. They discussed the challenge of using traditional stylometry on short texts and examined different feature sets. The success for 10 users with 233 features was 81.6%. When the number of authors was increased to 20 and 30, the success dropped slightly to 79.8% and 79.6% respectively (Demir, 2016, 2017).
To determine traits in multi-authored documents, Macke and Hirshman (2015) used deep learning techniques at the sentence level. The authors modeled the vocabulary and grammatical structure with a recurrent neural network (RNN) model, and observed that performance decreases as the number of authors increases. Schwartz, et al (2013) trained an SVM classifier on an n-gram feature set to classify Twitter messages. Tweets with fewer than 3 words were removed during preprocessing, and the k-signature of an author, i.e. features that appear in at least k% of the author's training set but do not appear in the others', was defined and used as a feature. The research studied authorship attribution in tweets with a focus on the unique signatures of users. Different numbers of authors and tweets were used in the experiments. 65% accuracy was achieved for 50 authors with 500 tweets, and 72% was achieved with 1000 tweets.
Decreasing the size of the submitted data and increasing the number of authors both decreased the accuracy rate. Rocha A. et al (2016) compared several algorithms for classifying tweets and presented an extensive review of the existing authorship analysis techniques for micro-blogs. They concluded that PMSVM had the best accuracy rate. The success was 48% for 50 authors with 100 tweets. Using more tweets increased the accuracy rate: 55% with 500 tweets and 65% with 1000 tweets. The results pointed to the need for a comprehensive method that can use the data context and process it irrespective of its multimodality, and for a system that tolerates the lack of data for all authors during training. Brocardo et al (2014) proposed a supervised technique using an n-gram feature set for authorship identification. They used the Enron e-mail dataset. They prepared their data so that each block contains 500 characters and each user has 50 blocks. They used 87 users, and the EER (equal error rate) was 14.35%. In their later work (2017), they analyzed the use of deep belief networks in an authorship verification model for continuous authentication. They achieved a 16.73% EER for 10 users with 100 blocks of 140 characters per user.
Usha et al (2017) offered an authorship attribution method that models the tone and personality patterns of an author. The method uses a convolutional neural network trained on tone and personality data. Data of the authors from Twitter was fed to the models, and psycholinguistic features were then combined with the final-level features. The obtained features were used to train a linear SVM classifier to predict the author of an unknown tweet. Their results showed that more data gave better results, whereas increasing the number of authors had the reverse impact. 15 users with 250 tweets had 51% accuracy, and with 800 tweets the result increased to 80% accuracy. However, 50 users with 250 tweets achieved 50% accuracy and 50 users with 800 tweets achieved 71% accuracy.
Sirinivasan and Nalini (2017) evaluated the effects of different classification methods for online messages. They used lexical, syntactic, structural and n-gram features, and as classifiers they examined C4.5, a fuzzy classifier and the AdaBoost classifier. 40 Amazon review messages were collected from each of 5 authors and evaluated using cross validation. The AdaBoost classifier gave the best results, with 84% accuracy for 5 authors.

A BRIEF NOTE ON ANNS
This brief presentation of artificial neural networks will focus on a particular structure of ANNs, multi-layer feedforward networks, which is the most popular and widely-used network paradigm in many applications including forecasting volatilities and prices in markets. For a general introductory account of ANNs, readers are referred to Wasserman (1989); Hertz et al. (1991); Smith (1993). Rumelhart et al. (1986a, 1986b, 1994, 1995); Lippmann (1987); Hinton (1992); Hammerstrom (1993); Haykin (1999) illustrate the basic ideas in ANNs.
Hush and Horne (1993) summarize some theoretical developments in ANNs since Lippmann's (1987) tutorial article. Masson and Wang (1990) give a detailed description of five different network models. Wilson and Sharda (1992) present a review of applications of ANNs in the business setting. Sharda (1994) provides an application bibliography for researchers in Management Science/Operations Research. A bibliography of neural network business applications research is also given by Wong et al. (1995). Kuan and White (1994) review the ANN models used by economists and econometricians and establish several theoretical frames for ANN learning. Cheng and Titterington (1994) make a detailed analysis and comparison of ANN paradigms with traditional statistical methods.
Basic structures of artificial neural networks, originally developed to mimic the human brain, are composed of a number of interconnected simple processing elements called neurons or nodes. Each node receives an input signal which is the total "information" from other nodes or external stimuli. The node processes incoming data locally through an activation function and produces a transformed output signal to other nodes or external outputs. Although each individual neuron implements its function rather slowly and imperfectly, collectively a network can perform a surprising number of tasks quite efficiently (Reilly and Cooper, 1990). This information processing characteristic makes ANNs a powerful computational device, able to learn from examples and then to generalize to examples never before seen.
Many different ANN models have been proposed since the 1980s. Perhaps the most influential models are the multilayer perceptrons (MLP), Hopfield networks, and Kohonen's self-organizing networks (Kohonen, 2001). Hopfield (1982) proposes a recurrent neural network which works as an associative memory. An associative memory can recall an example from a partial or distorted version. Hopfield networks are non-layered with complete interconnectivity between nodes. The outputs of the network are not necessarily functions of the inputs. Rather, they are stable states of an iterative process.

Multi Layer Perceptrons for Forecasting
MLP networks are used especially in forecasting because of their inherent capability of arbitrary input-output mapping. Other types of ANNs, such as radial-basis function networks (Park and Sandberg, 1991, 1993; Chng et al., 1996), ridge polynomial networks (Shin and Ghosh, 1995), and wavelet networks (Zhang and Benveniste, 1992; Delyon et al., 1995), are also very useful in some applications due to their function approximating ability.
An MLP is composed of several layers of nodes. The lowest layer is an input layer where external information is received. The last layer is an output layer where the problem solution is obtained. Hidden layers separate the input layer and output layer. The nodes in adjacent layers are usually fully connected from a lower layer to a higher layer. Fig. 1 gives an example of a fully connected MLP with one hidden layer.

Fig. 1. A typical feedforward neural network multiple layer perceptron (MLP).
For a classification problem, the inputs to an ANN are usually the independent variables. The functional relationship estimated by the ANN can be written as
$$y = f(x_1, x_2, \ldots, x_p) \qquad (1)$$
where $(x_1, x_2, \ldots, x_p)$ is the vector of p independent variables and y is a dependent variable. In this sense, the neural network is functionally equivalent to a nonlinear regression model.

Multi-Layer Feed-Forward Networks (FNN)
Multi-layer feed-forward networks (FNN) forward information from the input layer to the output layer through a number of hidden layers. Neurons in a layer connect to the neurons of the subsequent layer by weights and an activation function (Figure 1). In order to modify the weights, the backpropagation (BP) learning algorithm is adopted. This iterative mechanism works by feeding the error back through the network. The synaptic weights are iteratively updated until there is no improvement in the error function. This process requires the derivative of the error function with respect to the network weights. The sum of squared error E is the conventional least square objective function in a NN, defined as:
$$E = \sum_{t} (y_t - \hat{y}_t)^2 \qquad (2)$$
where $y_t$ denote the observed values of the time series and $\hat{y}_t$ are the fitted outputs.
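As a worked step, the derivative that backpropagation needs can be written out for a generic weight $w$. This is only a sketch: it assumes E is the plain sum of squared errors over the observations t (without a factor of 1/2), and the learning rate $\eta$ is a symbol introduced here for illustration, not one given in the text.

```latex
E = \sum_{t} (y_t - \hat{y}_t)^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial w}
  = -2 \sum_{t} (y_t - \hat{y}_t)\,\frac{\partial \hat{y}_t}{\partial w},
\qquad
w \leftarrow w - \eta\,\frac{\partial E}{\partial w}.
```

The update moves each weight against the gradient, so the iterations stop changing the weights once the error no longer improves.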
In forecasting time series, in general, alongside feedforward neural networks there is a second type of ANN, called recurrent neural networks.
FNNs as in Figure 1 are highly non-parsimonious, requiring an infinite amount of past observations as inputs to achieve the same forecasting accuracy as an RNN. Moreover, in practical applications, recurrent neural networks provide a significantly better prediction than a feed-forward network.

Recurrent Neural Networks (RNN)
Time series mostly depend nonlinearly on time, and hence recurrent neural networks (RNN) are particularly useful (Szkoła, et al, 2011; Lipton, 2015). They are constructed by taking a feedforward network and adding feedback connections from the output and/or hidden layers to the input layer. The standard backpropagation algorithm can also train these networks, on the condition that patterns are always presented in time-sequential order. The one difference in the structure is that there are extra neurons in the input layer that are connected to the hidden layer and/or output layer just like the other input neurons. These extra neurons hold the contents of one of the layers as it existed when the previous pattern was trained. In this way, the network takes into account the knowledge it has about previous inputs. These extra neurons are called context units and represent the network's long-term memory (Balkin 1997).
There are three types of RNNs: Jordan, Elman, and Jordan/Elman recurrent networks. A Jordan neural network (JNN) has additional neurons in the input layer, which are fed back from the output layer (Carcanoa, et al, 2011), while an Elman neural network (ENN) has additional neurons in the input layer, which are fed back from the hidden layer (Elman, 1990).
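The difference between the two feedback schemes can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function `recurrent_step`, its weight arguments, the tanh activation and the linear output layer are all assumptions made for the example.

```python
import math


def recurrent_step(x, context, w_in, w_ctx, w_out, kind):
    """One time step of a simple recurrent network (hypothetical helper).

    x       : current input vector
    context : values held by the context units from the previous step
    w_in    : hidden-by-input weights, w_ctx: hidden-by-context weights
    w_out   : output-by-hidden weights
    kind    : "jordan" feeds back the output layer, "elman" the hidden layer
    """
    hidden = []
    for h in range(len(w_in)):
        net = sum(w * v for w, v in zip(w_in[h], x))
        net += sum(w * v for w, v in zip(w_ctx[h], context))
        hidden.append(math.tanh(net))  # tanh is an assumed activation
    output = [sum(w * v for w, v in zip(row, hidden)) for row in w_out]
    # The only structural difference: which layer is copied to the context.
    next_context = list(output) if kind == "jordan" else list(hidden)
    return output, next_context
```

In a Jordan network the context therefore has as many units as the output layer, while in an Elman network it has as many units as the hidden layer.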

Jordan Recurrent Neural Networks (JNN)
A Jordan neural network (JNN) has several feedback connections from the output layer to the input layer. The input layer has additional neurons, which are fed back from the output layer (Carcanoa, et al, 2011).
Figure 2. JNN with a single hidden layer representing a nonlinear regression model

DATA
The dataset of this research consists of 17000 tweets collected from Twitter, as 500 tweets for each of 34 authors that meet certain criteria. Raw data was collected using the software Nvivo. The collected raw data is preprocessed in order to obtain a uniform structure and improve classification accuracy. Data preprocessing is a very critical stage that determines the quality of the next stage. Data in its original form is not in a convenient pattern for learning; it needs to be transformed into an appropriate input form. The second step was feature extraction. 200 features in four types are integrated into the feature set and used for e-mail authentication. 71 of them are function words selected from the list prepared by Zheng et al. (2006). The features are extracted by a program in Java and written to a text file.
Later, this text file was read by our program to train the classifiers and to implement author attribution.
The features that are evaluated are combinations of character-based lexical features, word-based lexical features, syntactic features, structural features and social networking-based features. We collected only textual inputs and did not collect metadata such as date of posting, location of user, application used for posting, and id, because of the scope of the research. Further, the data set is collected without any bias toward any particular content or user.
Studies showed that different types of features have different powers of discrimination. Therefore, it is important to identify the key features.
In the second phase we decreased the number of features to 72 by removing the ones which are rare in tweets (Demir and Can, 2018); nineteen of the remaining features are function words.
The reason for having such sparse feature vectors is the nature of tweets, which contain few words. Measurements of the features are normalized to the range 0-1. Normalization was done by dividing each value by the total word count of the corresponding text, in order to remove the influence of different overall text sizes. Feature vectors extracted from Twitter messages were used as inputs for modeling the artificial neural network (ANN).
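The normalization step can be illustrated with a minimal sketch for the function-word features only. The tuple `FUNCTION_WORDS` is a hypothetical five-word subset chosen for the example, not the list of Zheng et al. (2006), and whitespace tokenization is an assumption; the original extraction was done by a Java program whose details are not given.

```python
# Hypothetical subset of function words, for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def function_word_features(text):
    """Return function-word frequencies normalized by the total word count."""
    words = text.lower().split()  # assumed tokenization: split on whitespace
    total = len(words)
    if total == 0:
        # An empty text has no words to count; return an all-zero vector.
        return [0.0] * len(FUNCTION_WORDS)
    # Each count is divided by the word count of the same text, so the
    # values fall in the range 0-1 regardless of the length of the tweet.
    return [words.count(w) / total for w in FUNCTION_WORDS]
```

For a five-word tweet such as "the cat and the dog", the sketch yields 0.4 for "the" and 0.2 for "and", with zeros elsewhere, which also shows why the vectors are sparse.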

AUTHENTIFIERS FOR FIVE AUTHORS
To train a recurrent artificial neural network that will be able to distinguish tweets of the authors a1, and a2, we choose an appropriate network architecture.

Network Architecture
The input vector is 72-dimensional. For the bias, 1 is added as the first element of each data vector, and ten components are added for the recurrent information. Therefore, the neural network has 83 input neurons. This input vector is multiplied by an 83×83 synaptic weight matrix W1 to create a vector of 83 numbers at the 83 hidden layer nodes.
To this vector, 1 is added as the first element as bias. Then the activation function is applied to this 84-component vector as a nonlinear transformation. Finally, a 1×84 synaptic weight matrix W2 multiplies the resulting vector to create a single number. This number is sent to a "hard limiter" to create +1 or -1 at the output node. This number is also sent as the last component of the input vector to recur at the next iteration.
If the data entered to the ANN belongs to a tweet authored by the author a1 and the output is +1, the result is correct. Otherwise it is erroneous, and the synaptic weights must be adjusted by backpropagation of the error through iterations, until the ANN creates enough correct results at the output node.
To classify tweets authored by five authors, 25 pairwise recurrent artificial neural networks are trained, five of which are dummies that are indifferent between +1 and -1.
When a mixture of 500 tweets, 100 from each author, is classified by the pairwise classifiers, the accuracies of the 25 pairwise classifiers are found as in Table 1. The average of the pairwise classification accuracies is 89.05%.
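The forward pass described above can be sketched as follows, to make the dimensions concrete. Several details are assumptions made for the example and are not stated in the text: tanh as the activation function, the ten recurrent components acting as a shift register of the most recent outputs, and the value before the hard limiter being the one that recurs.

```python
import math

N_FEATURES = 72  # stylometric features per tweet
N_CONTEXT = 10   # recurrent components appended to the input


def hard_limiter(value):
    """Map a real number to +1 or -1 (0 is mapped to +1 by assumption)."""
    return 1 if value >= 0 else -1


def forward(features, context, W1, W2):
    """One forward pass of a pairwise expert (illustrative sketch).

    W1 is an 83x83 list of rows, W2 is a list of 84 weights.
    Returns the +1/-1 decision and the updated recurrent components.
    """
    x = [1.0] + list(features) + list(context)            # 1 + 72 + 10 = 83
    hidden = [sum(w * v for w, v in zip(row, x)) for row in W1]  # 83 numbers
    activated = [math.tanh(h) for h in [1.0] + hidden]    # bias first: 84
    score = sum(w * a for w, a in zip(W2, activated))     # single number
    # Assumed recurrence: drop the oldest component, append the new score.
    next_context = list(context[1:]) + [score]
    return hard_limiter(score), next_context
```

Training would then compare the decision with the known author label and adjust W1 and W2 by backpropagation, which this sketch leaves out.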

Aggregating Expert Votes
An authentication device is created from these 25 experts for tweets by five authors to aggregate their votes. Experts are grouped as competing teams. The team tk, k=1,…,5, of ten experts tk={eik, eki}, i=1,…,5, is trained to distinguish tweets by author ak from tweets by the other four authors.
If a data vector v belonging to a tweet by ak, say k=4, is considered for authentication, most probably the experts e4j, j=1,2,…,5, of the team t4 will raise a flag +1, while the other experts ei4, i=1,2,…,5, of the same team will raise flags -1. Since the experts of the other teams tk, k≠4, are not trained on tweets by a4, they will not be as consistent as the members of the team t4. Therefore, their votes will rather be mixed signals. Upon introduction of a data vector that belongs to a tweet written by author a4, the votes of the 25 experts may be just like the ones in Table 2.
Table 2. The votes of 25 experts (5 of them are dummy) in five competing teams.
Let us reorganize the votes of the five competing teams as in Table 3. The correct votes are the +1s in rows and the -1s in columns.
The team that has the most correct votes wins the competition. In this case the highest number of true votes (6) is collected by Team 4. Hence we conclude that the tweet whose data vector is entered is authored by a4.
Table 3. Team 4 has the largest number of correct votes and wins the competition.
Aggregating the pairwise classification votes as above, 500 shuffled tweets of five authors are classified. The percentages of true positives are as in Table 4, where the average accuracy is 82.2%.
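The team competition described in this section can be sketched as a count over an N×N vote matrix. The representation is an assumption made for the example: `votes[i][j]` holds the vote of expert eij (+1 for author ai, -1 for author aj), the dummy experts on the diagonal are encoded as 0, and ties are resolved in favor of the lowest team index.

```python
def team_scores(votes):
    """Count the correct votes of each team k.

    A +1 in row k (expert e_kj) and a -1 in column k (expert e_ik)
    both support author k; diagonal (dummy) experts are skipped.
    """
    n = len(votes)
    scores = []
    for k in range(n):
        row_hits = sum(1 for j in range(n) if j != k and votes[k][j] == 1)
        col_hits = sum(1 for i in range(n) if i != k and votes[i][k] == -1)
        scores.append(row_hits + col_hits)
    return scores


def winning_author(votes):
    """Return the index of the team with the most correct votes."""
    scores = team_scores(votes)
    return scores.index(max(scores))  # first maximum wins a tie (assumption)
```

With five authors each team can collect at most eight correct votes, so the six votes of Team 4 in the example above correspond to two of its experts voting wrongly.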

COMMITTEE MACHINES
To improve the accuracy of the decision, the above procedure is repeated several times, and the decisions obtained in these repetitions are aggregated by the "commonest" operation.

Bootstrapping
To avoid overlapping of the training, validation, and test sets during the several trainings of committee members, bootstrapping is used. For example, if the committee has five layers, we divide the data from five authors into five equal parts. Each time, different sets are chosen for training, validating, and testing.
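One way to rotate non-overlapping parts is sketched below. The text does not say how the parts are assigned in each round, so the scheme here (one part for testing, the next for validation, the remaining parts for training) and the name `rotating_splits` are assumptions made for illustration.

```python
def rotating_splits(n_items, n_parts=5):
    """Yield (train, validation, test) index lists, one triple per round.

    Items are cut into n_parts equal consecutive parts; leftover items
    that do not fill a whole part are ignored in this sketch.
    """
    size = n_items // n_parts
    parts = [list(range(p * size, (p + 1) * size)) for p in range(n_parts)]
    for r in range(n_parts):
        test_part = r
        val_part = (r + 1) % n_parts
        train = [i for p in range(n_parts)
                 if p not in (test_part, val_part)
                 for i in parts[p]]
        yield train, parts[val_part], parts[test_part]
```

Because the three sets of every round are built from different parts, no tweet is used for both training and testing within the same round.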

Aggregating Decisions of Committees
After training five committees, each with 25 ANN members, the five different decisions about the test sets are aggregated as in Table 5. Then this procedure is repeated seven times, and committees with seven members vote for the final decision. By a commonest type voting, the accuracy is boosted by around ten percent. The number of authors is seen to have little effect on the accuracy of the authentication, and around 80% accuracy is achieved for any number of authors.
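The "commonest" vote amounts to taking the most frequent decision among the committee members. A minimal sketch using the Python standard library follows; the function name `commonest` and the tie rule (the decision seen first wins) are assumptions made for the example.

```python
from collections import Counter


def commonest(decisions):
    """Return the decision made by the largest number of committee members."""
    # Counter.most_common orders equal counts by first occurrence,
    # so a tie goes to the decision that appears first in the list.
    return Counter(decisions).most_common(1)[0][0]


def aggregate(decisions_per_tweet):
    """Apply the commonest vote to each tweet's list of member decisions."""
    return [commonest(d) for d in decisions_per_tweet]
```

For instance, if seven members assign a tweet to authors 1, 1, 3, 1, 2, 1 and 4, the committee decision is author 1 even though three members disagree.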

Table 1. The accuracies achieved by 25 experts that are trained to distinguish tweets by author pairs (ai, aj).

Table 4. The accuracies in distinguishing shuffled tweets authored by five authors. The average is 82.2%.

Table 5. Aggregating decisions of seven committee members for twelve sample tweets authored by author one.

Table 6. The accuracies of the committees in distinguishing shuffled tweets authored by five authors. The averages are in the last column.

CONCLUSION
The dataset consists of 17000 tweets collected from Twitter for 34 authors. The collected raw data is preprocessed to extract frequencies of 200 features, of which 128 are eliminated since they are rare in tweets. As a progressive presentation, five, fifteen, twenty, twenty five, thirty and thirty four of these authors are selected each time. Since ANNs are more successful in distinguishing two classes, for N authors, N×N neural networks are trained for pairwise classification. These experts are then organized as N special teams (CANNT) with N experts to aggregate decisions. The accuracies remained in the 70%-80% band.