Southeast Europe Journal of Soft Computing

Diagnosis of Parkinson's Disease Using Principal Component Analysis and Boosting Committee Machines

Parkinson's disease (PD) has become one of the most common degenerative disorders of the central nervous system. In this study, our main goal was to discriminate between healthy people and people with Parkinson's disease. To achieve this, we used artificial neural networks and a dataset taken from the University of California, Irvine machine learning repository, containing 48 normal and 147 PD cases. We examine the performance of neural networks trained with back propagation and combined through a majority voting scheme. The networks were trained using the boosting by filtering technique with seven committee machines, and principal component analysis was used for data reduction. The experimental results show that the combination of these methods achieves a true positive rate of 92% on the classification of PD.


I. INTRODUCTION
Parkinson's disease (PD) is one of the most common neurodegenerative disorders. It accounts for a variety of motor and non-motor deficits that result from the loss of dopamine-producing brain cells. Parkinson's primarily affects neurons in the area of the brain called the substantia nigra. These cells normally produce dopamine, a chemical (neurotransmitter) that transmits signals between areas in the brain that, when working normally, coordinate smooth and balanced muscle movement. As Parkinson's disease progresses, the amount of dopamine produced in the brain decreases, leaving a person unable to control movement normally. The exact cause of this deterioration is not known and remains an active area of research.
The four primary symptoms of PD are tremor, or trembling in the hands, arms, legs, jaw, and face; rigidity, or stiffness of the limbs and trunk; bradykinesia, or slowness of movement; and postural instability, or impaired balance and coordination. As these symptoms become more pronounced, patients may have difficulty walking, talking, or completing other simple tasks. PD usually affects people over the age of 50, although it can start earlier, and it is more common in men than in women. Early symptoms of PD are subtle and occur gradually. In some people the disease progresses more quickly than in others. As the disease progresses, the shaking, or tremor, which affects the majority of PD patients, may begin to interfere with daily activities. Other symptoms may include depression and other emotional changes; difficulty in swallowing, chewing, and speaking; urinary problems or constipation; skin problems; and sleep disruptions.
The disease can be difficult to diagnose accurately, particularly in its early stages, when symptoms resemble those of other medical conditions, and misdiagnosis occurs occasionally. There are currently no blood or laboratory tests that have been proven to help in diagnosing PD, and the prognosis depends on the patient's age and symptoms. The diagnosis is based on the medical history and a neurological examination conducted by interviewing and observing the patient. Brain scans or laboratory tests may be used to help doctors exclude other medical conditions that produce symptoms similar to those of Parkinson's disease.
After a Parkinson's diagnosis, treatments are given to help relieve symptoms. At present, there is no cure for PD, but a variety of medications considerably reduce the typical Parkinsonian movement disorders (i.e., tremor at rest, rigidity, akinesia and postural instability). In some cases, surgery may be appropriate if the disease does not respond to drugs. Medical treatment or surgery can only alleviate some of the most troubling symptoms; no causal cure is available, so early diagnosis is critical for maximizing the effect of treatment and improving the quality of the patient's life.
Current research programs are trying to understand how the disease progresses and to develop new drug therapies. Scientists looking for the cause of PD continue to search for possible environmental factors, such as toxins, that may trigger the disorder, and study genetic factors to determine how defective genes play a role. Other scientists are working to develop new protective drugs that can delay, prevent, or reverse the disease. Recently, a group of experts found features in the voices of people with Parkinson's disease that can be used as discriminatory measures to differentiate those who have the disease from those who do not. Max Little, from the University of Oxford, has been developing software that learns to detect differences in voice patterns in order to spot distinctive clues associated with Parkinson's. Using machine learning, he collected a large amount of voice data from people whose disease status was known and trained the system to separate the true symptoms of the disease from other factors, and he succeeded in developing an algorithm that detects changes in voice purely associated with Parkinson's. He introduced a new measure of dysphonia, pitch period entropy (PPE), which is robust to many uncontrollable confounding effects, including noisy acoustic environments and normal, healthy variations in voice frequency. He collected sustained phonations from 31 people, 23 with PD, and, using a kernel support vector machine (SVM), obtained an overall correct classification performance of 91.4%.
So far, several studies focusing on PD diagnosis have been reported. These studies applied different methods to the problem, and consequently a variety of novel methods have been proposed. In general, most of these methods are based on the application of neural networks.
In [25], a Multi-Layer Perceptron (MLP) with the back-propagation learning algorithm was used to classify patients for the diagnosis of Parkinson's disease (PD). The accuracy on the training data set was 82.051% and on the validation data set 83.333%. Gil and Johnson [1] propose a hybrid system combining ANN (artificial neural network) and SVM (support vector machine) classifiers. These two classifiers, which are widely used for pattern recognition, can provide good generalization performance in the diagnosis task. They showed a high degree of certainty, above 90%. Furthermore, some of the measures reach very high values, such as sensitivity and negative predictive value, with 99.32% and 97.06% respectively. See also [5], [20]. Paper [21] deals with the application of several probabilistic neural network (PNN) variants to discriminate between healthy people and people with Parkinson's disease. Three PNN types were used in this classification process, differing in the search for the smoothing factor: incremental search (IS), Monte Carlo search (MCS) and hybrid search (HS). The application provided diagnosis accuracies ranging between 79% and 81% for new, undiagnosed patients. A comparison of multiple classification methods for the diagnosis of Parkinson's disease is given in [19]. Four independent classification schemes were applied and compared: neural networks, DMneural, regression and decision tree. Various evaluation methods were employed to calculate the performance scores of the classifiers. According to these scores, the neural network classifier yields the best results, with an overall classification score of 92.9%.
As shown above, many methods are in use today in medical research and public health for recognizing Parkinson's disease. In health care generally, neural networks play an increasingly important role. In recent years, many researchers have used neural networks for detecting PD at an early stage.
This paper deals with the application of neural networks with back propagation, together with Principal Component Analysis, to a medical dataset concerning PD, with the aim of automatically classifying patients as PD or non-PD depending on their medical attributes. To test the performance and efficiency of the proposed method, the classification accuracy, sensitivity and specificity were used. The paper is organized as follows. Section 1 reviews related work carried out in the area of Parkinson's disease. Section 2 deals with the Parkinson dataset used in this research. Section 3 gives an overview of artificial neural networks and explains the architecture used to classify a patient as PD or non-PD. Sections 4 and 5 describe committee machines and Principal Component Analysis, respectively. Section 6 presents the experimental results of the algorithms. Section 7 concludes the paper and proposes future work.

II. PARKINSON DATASET
The Parkinson database used in this study is taken from the University of California at Irvine (UCI) machine learning repository [15]. The features of the dataset are given in Table 1. The PD dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado. The data consist of 195 sustained vowel phonations from 31 male and female subjects, of whom 23 were diagnosed with PD [5,7,8]. The time since diagnosis ranged from 0 to 28 years, and the ages of the subjects ranged from 46 to 85 years. An average of six phonations was recorded from each subject, ranging from 1 to 36 s in length. The 195 instances comprise 48 normal and 147 PD cases. The essential aim of processing the data is to discriminate healthy people from those with PD according to the "status" attribute, which is set to 0 for healthy people and 1 for people with Parkinson's disease; this is a two-class classification problem. Little applied a correlation filter to the 23 attributes, and 12 of them were removed. Each attribute whose correlation coefficient is less than 0.95 is considered not to contribute to classification accuracy, and is therefore removed. A total of 11 attributes are kept after the correlation filter has been applied. Table 2 indicates which features are kept. The first 10 are used as inputs to the classifiers. The data set used for this implementation is also described in detail in [2] and at the UCI website [15]. However, a 10-dimensional input data vector would be computationally expensive; therefore Principal Component Analysis (PCA) is used to reduce the dimensionality of the input data without losing accuracy of data representation.

III. ARTIFICIAL NEURAL NETWORKS
The brain is a highly complex, nonlinear and parallel computer capable of performing certain computations (e.g., pattern recognition, perception, and motor control) many times faster than the fastest digital computer in existence today. A neural network is an artificial representation of the human brain that tries to simulate its information processing. It is an interconnected group of artificial neurons that may share some properties of biological neural networks. A neural network derives its computing power from its parallel distributed structure and its ability to learn and generalize. These two capabilities make it possible for neural networks to solve complex problems: a problem is decomposed into a number of relatively simple tasks, and neural networks are assigned the subset of tasks that matches their inherent capabilities. It is important to recognize, however, that we have a long way to go (if ever) before we can build a computer architecture that mimics a human brain.

A. Architecture
A neural network consists of a certain number of layers, and each layer contains a certain number of units. There is an input layer, an output layer, and one or more hidden layers between the input and the output layer. In general, we may identify three different classes of network architectures [35]: recurrent networks, single-layer feedforward networks, and multilayer feedforward networks.
Recurrent networks, unlike the others, have at least one feedback loop. For example, a recurrent network may consist of a single layer of neurons with each neuron feeding its output signal back to the inputs of all the other neurons. The presence of feedback loops has an impact on the learning capability of the network and on its performance.
Single-layer feedforward networks are the simplest form of layered network, in which an input layer of source nodes projects onto an output layer of neurons, but not vice versa. In other words, this network is strictly feedforward, and "single" refers to the one output layer of computation nodes.
Multilayer feedforward networks have one or more hidden layers. The function of the hidden neurons is to intervene between the external input and the network output in some useful manner. In this kind of network, there are no connections from any unit to the inputs of the previous layers (no feedback information), to other units in the same layer, or to units more than one layer ahead. Every unit acts only as an input to the immediately following layer. This class of networks is easier to analyze theoretically than other general topologies because the outputs can be represented as explicit functions of the inputs and the weights. The architectural graph in Figure 1 illustrates the layout of a multilayer feedforward neural network for the case of two hidden layers. Each neuron (see Figure 2) in the input and hidden layers is connected to all neurons in the next layer by weighted connections. These neurons compute weighted sums of their inputs and add a threshold. The resulting sums are used to calculate the activity of the neurons by applying a sigmoid activation function. This process is defined as follows:

$$y = F\left(\sum_{i=1}^{n} w_i x_i + t\right)$$

where the variables $x_1, x_2, \ldots, x_i, \ldots, x_n$ are the inputs, $t$ is a threshold, the variables $w_1, w_2, \ldots, w_i, \ldots, w_n$ are the weights associated with the inputs, signifying the relative importance of the path along which each input arrives, $F$ is the activation function of the neuron, and $y$ is the output.
In a neural network, each neuron has an activation function that specifies the output of the neuron for given inputs. Activation functions for the hidden units are needed to introduce non-linearity into the network. For classification tasks, the antisymmetric sigmoid (hyperbolic tangent) function is a common choice of activation function. It is defined as:

$$F(v) = \tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}}$$

The architecture used in this study is a multilayer feedforward network consisting of an input layer, hidden layers, and an output layer (healthy or ill) that represents the classification result. Back propagation is used to train the multilayer feedforward network. The back-propagation algorithm is a generalization of the least mean squares algorithm; it minimizes the mean squared error between the desired and actual outputs of the network by modifying the network weights. It uses supervised learning, in which the network is trained on known pairs of inputs and desired outputs. After training, the network weights are frozen and used to compute output values for new input samples.
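The neuron computation and the layer-by-layer forward pass described above can be sketched as follows. This is a minimal illustration, not the network used in the study: the function names, the tiny 2-2-1 layout, the weight values, and the rule that maps the tanh output to a class label (output at or above 0 means PD) are all assumptions made for the example.

```python
import math

def neuron_output(inputs, weights, threshold):
    """y = F(sum_i w_i * x_i + t), with F = tanh as the activation function."""
    v = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return math.tanh(v)

def forward(inputs, layers):
    """Propagate inputs through a list of layers.

    Each layer is a pair (weight_rows, thresholds), one weight row and one
    threshold per neuron. Every neuron sees all outputs of the previous layer.
    """
    activation = list(inputs)
    for weight_rows, thresholds in layers:
        activation = [neuron_output(activation, w, t)
                      for w, t in zip(weight_rows, thresholds)]
    return activation

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]

y = forward([1.0, 1.0], layers)[0]
# Assumed decision rule: tanh outputs lie in (-1, 1), so split at 0.
label = "PD" if y >= 0 else "healthy"
```

Training by back propagation would adjust the weight rows and thresholds; the sketch only shows how a trained (frozen) network turns an input vector into a decision.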

IV. COMMITTEE MACHINES
When a task is too complex, a natural approach is to divide it into smaller and simpler tasks and to combine their solutions in order to solve the whole task. In supervised learning, computational simplicity is achieved by distributing the learning task among a number of experts, which in turn divides the input space into a set of subspaces. The combination of experts is said to constitute a committee machine. It fuses the knowledge acquired by the experts to arrive at an overall decision that is supposedly superior to that attainable by any one of them acting alone. Committee machines can be built in two different ways, using static or dynamic structures [35].
In dynamic structures, the input data is involved in the mechanism that combines the expert outputs into the global output. This category includes two methods: mixture of experts, where the answers from the experts are nonlinearly combined by a single gating network; and hierarchical mixture of experts, where the answers from the experts are nonlinearly combined by several gating networks arranged in a hierarchical fashion.
In static structures, the combination mechanism does not depend on the input data. This category includes ensemble averaging, where the global output is a linear combination of the outputs of the individual experts; and boosting, where a weak learning algorithm is converted into one that reaches a higher accuracy. In a boosting machine the experts are trained on data sets with entirely different distributions. Boosting can be implemented in three different ways: boosting by filtering, boosting by subsampling and boosting by reweighting. In this paper we used the boosting by filtering technique.

A. Boosting by filtering
In boosting by filtering, the committee machine consists of three experts, arbitrarily labeled as first, second and third. Each expert is trained on a set of $N_1$ examples. Let $N_2$ and $N_3$ denote the numbers of examples that must be filtered to obtain the training sets of the second and third experts, respectively. With $N_1$ examples also needed to train the first expert, the total size of data needed to train the entire committee machine is $N_4 = N_1 + N_2 + N_3$. However, the computational cost is based on $3N_1$ examples, because $N_1$ is the number of examples actually used to train each of the three experts. The boosting algorithm described here is therefore economical in the sense that the committee machine requires a large set of examples for its operation, but only a subset of that data set is used to perform the actual training.
To evaluate the performance of the committee machine on test patterns, a simple voting scheme was used in this paper. If the first and second experts in the committee machine agree in their respective decisions, that class label is used. Otherwise, the class label produced by the third expert is used.
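The voting rule above is small enough to state directly in code. The sketch below is illustrative only: the function names are hypothetical, and `majority_vote` rests on the assumption that the decisions of several committee machines are combined by a plain majority, which the text does not spell out.

```python
from collections import Counter

def committee_decision(first, second, third):
    """Voting rule from the text: if experts 1 and 2 agree, use their label;
    otherwise use the label produced by expert 3."""
    return first if first == second else third

def majority_vote(labels):
    """Most frequent label among several committee decisions.

    Assumption: an odd number of binary voters, so no ties can occur.
    """
    return Counter(labels).most_common(1)[0][0]

# Labels follow the dataset convention: 0 = healthy, 1 = PD.
agree = committee_decision(1, 1, 0)      # experts 1 and 2 agree
disagree = committee_decision(1, 0, 0)   # expert 3 breaks the disagreement
overall = majority_vote([1, 1, 0, 1, 0, 0, 1])  # seven hypothetical committees
```

For binary labels the three-expert rule gives the same answer as a majority of the three experts, because when the first two disagree the third necessarily sides with one of them.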

V. PRINCIPAL COMPONENT ANALYSIS
Principal Components Analysis (PCA) is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.PCA is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences.Since patterns in data can be hard to find in data of high dimension, principal component analysis is used for data reduction, by compressing the data or mapping the data into a lower dimensional space.In general, reduction of dimensionality will be accompanied by a loss of some of the information.Thus, the main goal in dimensionality reduction is to preserve as much of the relevant information as possible.

A. Theory of Principal component Analysis
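Principal component analysis is useful when data have been obtained on a number of variables (possibly a large number), and there is reason to believe that the variables are partly redundant. Here, redundancy means that some of the variables are correlated with one another. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variance in the observed variables.
The observations are collected in an $n \times p$ multivariate data matrix $X = [x_{jk}]$, whose $k$th column holds the $n$ observations of the $k$th variable. If $c$ is a $p \times 1$ matrix, then $Xc$ is a linear combination of the set of observations. Descriptive statistics can also be applied to the multivariate data matrix $X$. The sample mean of the $k$th variable is

$$\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk},$$

and the sample variance is defined by

$$s_k^2 = \frac{1}{n} \sum_{j=1}^{n} \left(x_{jk} - \bar{x}_k\right)^2 .$$

Let $S_n$ be the $p \times p$ covariance matrix related to $X$, let its eigenvalues be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, and let the corresponding orthonormal eigenvectors be $u_1, u_2, \ldots, u_p$. Then the $i$th principal component $y_i$ is given by the linear combination of the original variables in the data matrix $X$ [37]:

$$y_i = X u_i .$$

The variance of $y_i$ is $\lambda_i$, and $\operatorname{cov}(y_i, y_j) = 0$ for $i \ne j$. The total variance of the data in $X$ is equal to the sum of the eigenvalues:

$$\sum_{k=1}^{p} s_k^2 = \sum_{i=1}^{p} \lambda_i ,$$

so the proportion of the total variance covered by the $k$th principal component is $\lambda_k / (\lambda_1 + \cdots + \lambda_p)$. If a large percentage of the total variance can be attributed to the first few components, then these new variables can replace the original variables without significant loss of information. Thus we can achieve a significant reduction in data.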
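The eigenvalue view of PCA described above can be tried on a toy data matrix. The sketch below is a dependency-free illustration under stated assumptions: the covariance matrix uses the divisor n, the eigenpairs are found by power iteration with deflation (a simple stand-in chosen for the example, not necessarily the solver used in the study), and all function names and the 4 x 2 data matrix are hypothetical.

```python
import math

def covariance_matrix(X):
    """p x p sample covariance matrix of the n x p data matrix X (divisor n)."""
    n, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    centered = [[row[k] - means[k] for k in range(p)] for row in X]
    return [[sum(r[j] * r[k] for r in centered) / n for k in range(p)]
            for j in range(p)]

def dominant_eigenpair(S, iterations=200):
    """Largest eigenvalue and unit eigenvector of symmetric S (power iteration)."""
    p = len(S)
    u = [float(k + 1) for k in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        v = [sum(S[j][k] * u[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:  # S maps u to zero: no variance left to extract
            break
        u = [c / norm for c in v]
    length = math.sqrt(sum(c * c for c in u))
    u = [c / length for c in u]
    Su = [sum(S[j][k] * u[k] for k in range(p)) for j in range(p)]
    return sum(u[j] * Su[j] for j in range(p)), u  # Rayleigh quotient, vector

def principal_components(X, k):
    """First k eigenvalues, their variance proportions, and scores y_i = X u_i."""
    S = covariance_matrix(X)
    p = len(S)
    total_variance = sum(S[j][j] for j in range(p))  # trace = sum of eigenvalues
    eigenvalues, eigenvectors = [], []
    for _ in range(k):
        lam, u = dominant_eigenpair(S)
        eigenvalues.append(lam)
        eigenvectors.append(u)
        # Deflation: remove the component just found so the next one dominates.
        S = [[S[i][j] - lam * u[i] * u[j] for j in range(p)] for i in range(p)]
    scores = [[sum(row[j] * u[j] for j in range(p)) for u in eigenvectors]
              for row in X]
    proportions = [lam / total_variance for lam in eigenvalues]
    return eigenvalues, proportions, scores

# Toy data: 4 observations of 2 correlated variables.
X = [[5.0, 7.0], [1.0, 3.0], [4.0, 4.0], [2.0, 6.0]]
eigenvalues, proportions, scores = principal_components(X, k=2)
```

For this toy matrix the covariance matrix is [[2.5, 1.5], [1.5, 2.5]], whose eigenvalues are 4 and 1, so the first principal component alone carries 80% of the total variance and the second could be dropped with limited loss of information.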

VI. RESULTS
True positive, true negative, false positive and false negative values have been used as the performance measures in this paper. A confusion matrix contains information about the actual and predicted classifications made by a classification system. The confusion matrix is shown in Table 3 (actual vs. predicted).
The dataset is composed of a range of biomedical voice measurements from 31 people, 23 with PD. The data set used in this study is very unbalanced: out of 195 samples, 147 are of the Parkinson's disease type and the others represent healthy people. The main problem with an imbalanced data set is that it is difficult to train a classifier to predict the presence of Parkinson's disease, since the ratio between the classes is about 3:1. Figure 5 gives a 3D graph of the Parkinson's disease data, showing samples with and without Parkinson's disease in different colors. As can be seen from Figure 5, PD and non-PD samples overlap considerably, which makes it difficult to separate the two classes. An imbalanced data set may also increase false positive rates. In order to increase the true recognition rates, parallel neural networks, boosted by filtering, are used in combination with a majority-rule-based system.
Before the classification of the Parkinson dataset, PCA is used to reduce the dimensionality of the input. After PCA, the input dataset was randomly partitioned into training and test sets. For the neural network classifier, the following settings were used: the back-propagation learning algorithm was applied to a feed-forward neural network with four hidden layers. A tangent sigmoid transfer function was used for both the hidden layers and the output layer. We used 7 committee machines. The initial weights were chosen randomly. All networks were run several times each, and training and testing performances were obtained. The obtained true/false classification counts for this case are listed in Table 4. The results show that this method yields very good classification results, with a true positive rate of 92% and a true negative rate of 70.67% (that is, false negative and false positive rates of 8% and 29.33%, respectively). Comparing the presented results with those reported in other studies, one can notice that the proposed method gives very good results, considering that it is applied to a very imbalanced dataset with a small number of samples. It can be concluded from this study that parallel neural networks, when boosted by filtering and combined with a majority-rule-based system, can increase the true recognition rates on an imbalanced data set.
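The rates quoted above are derived from the four confusion-matrix counts. The sketch below assumes the usual definitions (sensitivity is the true positive rate TP / (TP + FN), specificity is the true negative rate TN / (TN + FP)); the function name is hypothetical, and the counts in the example are invented for illustration and are not the values of Table 4.

```python
def classification_rates(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Hypothetical counts for an imbalanced test set (100 PD, 40 healthy).
rates = classification_rates(tp=90, tn=30, fp=10, fn=10)
```

With these made-up counts the sensitivity is 0.90 and the specificity 0.75, while the accuracy (120 correct out of 140) sits between the two, closer to the sensitivity because the PD class dominates. This is why accuracy alone can be misleading on an imbalanced data set.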

Fig. 1. Feedforward neural network with two hidden layers

Fig. 2. A neuron in the hidden or output layer in the MLP.

Fig. 3. Activation function

The three experts are individually trained as follows. The first expert is trained on a set consisting of $N_1$ examples. The trained first expert is then used to filter another set of examples by proceeding in the following manner: flip a fair coin to simulate a random guess. If the result is heads, pass new patterns through the first expert and discard correctly classified patterns until a pattern is misclassified. That misclassified pattern is added to the training set for the second expert. If the result is tails, do the opposite: pass new patterns through the first expert and discard incorrectly classified patterns until a pattern is classified correctly. That correctly classified pattern is added to the training set for the second expert. Continue this process until a total of $N_1$ examples has been filtered by the first expert. This set of filtered examples constitutes the training set for the second expert. Once the second expert has been trained in the usual way, a third training set is formed for the third expert by proceeding in the following manner: pass a new pattern through both the first and second experts. If the two experts agree in their decision, discard that pattern. If, on the other hand, they disagree, the pattern is added to the training set for the third expert. Continue with this process until a total of $N_1$ examples has been filtered jointly by the first and second experts. This set of jointly filtered examples constitutes the training set for the third expert. The third expert is then trained in the usual way, and the training of the entire committee machine is thereby completed. The three-point filtering procedure is illustrated in Figure 4.
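The two filtering steps of this procedure can be sketched as follows. The sketch covers only the selection of training examples, not the training of the experts themselves. Everything concrete in it is an assumption made for illustration: the function names, the stand-in "experts" (simple integer threshold rules instead of trained neural networks), and the synthetic stream of labeled patterns.

```python
import random

def filter_for_second_expert(expert1, examples, n1, rng):
    """Collect n1 examples for expert 2 by coin-flip filtering through expert 1."""
    stream = iter(examples)
    selected = []
    while len(selected) < n1:
        # Heads: keep the next misclassified pattern.
        # Tails: keep the next correctly classified pattern.
        want_error = rng.random() < 0.5
        for x, label in stream:
            if (expert1(x) != label) == want_error:
                selected.append((x, label))
                break
        else:
            break  # stream exhausted before n1 examples were found
    return selected

def filter_for_third_expert(expert1, expert2, examples, n1):
    """Collect n1 examples on which experts 1 and 2 disagree."""
    selected = []
    for x, label in examples:
        if expert1(x) != expert2(x):
            selected.append((x, label))
            if len(selected) == n1:
                break
    return selected

# Stand-in experts: hypothetical threshold rules on an integer feature.
def expert1(x):
    return int(x > 50)

def expert2(x):
    return int(x > 45)

# Synthetic labeled patterns: the true class is 1 when the feature exceeds 40.
examples = [(i % 100, int(i % 100 > 40)) for i in range(5000)]

second_set = filter_for_second_expert(expert1, examples[:2500], 20,
                                      random.Random(0))
third_set = filter_for_third_expert(expert1, expert2, examples[2500:], 20)
```

On average the set for the second expert is half misclassified and half correctly classified by the first expert, so the first expert performs no better than chance on it; the set for the third expert contains only patterns on which the first two experts disagree.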

Fig. 4. Illustration of boosting by filtering: (a) filtering of examples performed by Expert 1, (b) filtering of examples performed by Experts 2 and 3.

Multivariate statistics deals with the relation between several random variables. The sets of observations of the random variables are represented by a multivariate data matrix $X$, each column of which represents the data for a different variable. If $c$ is a $p \times 1$ matrix, then $Xc$ is a linear combination of the set of observations. Let $S_n$ be the $p \times p$ covariance matrix related to the multivariate data matrix $X$, and let the eigenvalues of $S_n$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. The proportion of the total variance covered by the $k$th principal component is given by $\lambda_k / (\lambda_1 + \cdots + \lambda_p)$.

TABLE I. ATTRIBUTES THAT ARE REMOVED AFTER APPLYING THE CORRELATION FILTER

TABLE II. FEATURES USED IN THIS PAPER