Classification of chromosomes using nearest neighbor classifier

This paper addresses automated classification of human chromosomes using the k nearest neighbor classifier. The k nearest neighbor classifier assigns an object to a class according to the closest training samples in the feature space. Various distance functions can be used to compute how close an object is to a training sample. In this work several distance functions are compared, and the Euclidean distance function was found to produce the best results.

Keywords— nearest neighbor classifier, chromosome classification


INTRODUCTION
Chromosome classification can be used in pre-natal diagnosis of genetic disorders, some cancer diagnoses, and bone marrow transplant studies. A human cell contains 46 chromosomes belonging to 24 classes: 22 pairs of autosomes and two sex chromosomes. The traditional method of human chromosome classification is karyotyping, i.e. classification by inspection under a microscope by a human expert. This method of classification takes about 10 minutes, with an error of 0.3% (Ritter and Gallegos, 1997). One of the difficulties in chromosome classification is the variability within one chromosome class, originating from different metaphases. Another difficulty is that chromosomes may overlap and touch each other, may be bent, and may have different orientations. Furthermore, the high number of classes that need to be differentiated adds to the complexity of the task. Automated analysis was initiated in the mid-sixties by Ledley and Ruddle (1966). Computer-aided techniques include parametric classifiers (Oosterlinck et al. 1997), maximum likelihood classifiers (Piper 1987), and Markov networks (Granum et al. 1989). The accuracy achieved using these methods was only between 75% and 85%. Machine learning methods have been widely employed in classification tasks; one of their main advantages is efficiency in dealing with large amounts of data. As chromosome classification is a pattern recognition problem, different artificial neural network methods have been employed in their classification: multilayer perceptron [Wu et al. 1990; Delshadpour 2003], probabilistic neural networks [8], neuro-fuzzy classifiers (Ruan 2000), etc. In this paper the k-nearest neighbor (k-nn) method is used for chromosome classification and a comparative study of different distance functions is performed. The steps taken during this research are as follows:
• Data retrieval

K NEAREST NEIGHBOR CLASSIFIER
Chromosome classification was carried out by the k nearest neighbor (k-nn) classifier. K-nn is one of the most popular classification methods, mainly due to its ease of implementation and successful classification results. A sample is classified according to the majority vote of its k nearest training samples in the feature space. The distance of a sample to its neighbors is defined using a distance function. For all points x, y, and z, a distance function F(·, ·) must satisfy the following:

F(x, y) \ge 0, \quad F(x, y) = 0 \iff x = y, \quad F(x, y) = F(y, x), \quad F(x, z) \le F(x, y) + F(y, z) (1)

Three distance functions that can be used in the k-nn classifier are:
• Euclidean distance, the \ell_2 norm:
F(x, y) = \sqrt{\sum_i (x_i - y_i)^2} (2)
• Manhattan or city block distance, the \ell_1 norm:
F(x, y) = \sum_i |x_i - y_i| (3)
• Mahalanobis distance, which takes into account the covariance S of the dataset:
F(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)} (4)

In the experiments carried out in this research k was taken to be 1.
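The three distance functions can be sketched as follows (a minimal sketch assuming NumPy feature vectors; the function names are illustrative, not from the paper):

```python
import numpy as np

def euclidean(x, y):
    # l2 norm of the difference vector
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # l1 (city block) norm of the difference vector
    return np.sum(np.abs(x - y))

def mahalanobis(x, y, S_inv):
    # S_inv is the inverse covariance matrix of the training data;
    # with S_inv = I this reduces to the Euclidean distance
    d = x - y
    return np.sqrt(d @ S_inv @ d)
```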
The steps that need to be carried out during the k-nn algorithm are as follows:
• divide the data into training and test data
• select a value of k
• determine which distance function is to be used
• choose a sample from the test data that needs to be classified and compute its distance to all n training samples
• sort the distances obtained and take the k nearest data samples
• assign the test sample to the class given by the majority vote of its k neighbors

Despite its ease of use, the nearest neighbor classifier has two main drawbacks:
1. High computation cost: during classification, the distance between the test sample and every stored training sample must be calculated one by one, and a list of the k closest ones is kept.
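The steps above can be sketched as a single function (a sketch assuming NumPy arrays for the training data; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(test_sample, train_X, train_y, k=1, dist=None):
    """Classify one test sample by majority vote of its k nearest training samples."""
    if dist is None:
        # Euclidean distance by default
        dist = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    # compute the distance to all n training samples
    distances = [dist(test_sample, x) for x in train_X]
    # sort the distances and take the k nearest samples
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of the k neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```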
Reducing the training set reduces the rate of successful classification; increasing it, however, increases the computation time.
One approach to overcome this problem is to reduce the dimensions of the feature space by using principal component analysis. Another approach is to modify the training set by removing some samples that belong to the same class label and exhibit similar features.
2. The algorithm's performance depends on the training set used. If the training data set is not representative enough, poor classification results may be obtained. [Anil 2006; Lindenbaum et al. 2004] describe some techniques that try to overcome these problems.

Table 1 shows a gradual rise in the success rate starting from p-norm 0.5 until p-norm 2 (i.e. the Euclidean distance) is reached, followed by a steady fall.
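The training-set condensing idea mentioned above, removing samples that share a class label and exhibit similar features, can be sketched as follows (a sketch; the distance threshold eps is an assumed tuning parameter, not taken from the paper):

```python
import numpy as np

def condense(train_X, train_y, eps):
    """Drop a training sample if an already-kept sample of the SAME class
    lies within Euclidean distance eps of it."""
    kept_X, kept_y = [], []
    for x, label in zip(train_X, train_y):
        close = any(kl == label and np.linalg.norm(x - kx) < eps
                    for kx, kl in zip(kept_X, kept_y))
        if not close:
            kept_X.append(x)
            kept_y.append(label)
    return np.array(kept_X), np.array(kept_y)
```

Samples of different classes are never merged, so class boundaries are preserved while near-duplicate samples within a class are removed.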

RESULTS
Experiments were also carried out using the Mahalanobis distance function. However, the Mahalanobis distance function is computationally very expensive.
The time taken to complete the calculations is considerably longer than that needed for the p-norm distance functions. To overcome this problem, principal component analysis (PCA) was carried out. Principal component analysis is a common method used to reduce data dimensions without losing too much information.
Principal component analysis transforms the original data such that the new data has the same number of variables, but most of the variation of the original data is captured by a small number of components. Since the Euclidean distance function produced the best results, it is not surprising that it is the most widely used distance function with the k-nn classifier.
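PCA can be sketched in a few lines using the SVD of the centered data (a sketch, not necessarily the implementation used in the experiments):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (samples x features) onto its first n_components principal axes."""
    # center each feature at zero mean
    X_centered = X - X.mean(axis=0)
    # the right singular vectors of the centered data are the principal axes,
    # ordered by decreasing variance explained
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T
```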
The success rate of classification depends highly on the quality of the dataset. The quality of the Copenhagen chromosome dataset is considered to be good, since the chromosomes were measured carefully using densitometry of photographic negatives from selected high-quality cells. All chromosomes in the Copenhagen dataset were classified by a cytogeneticist, and none of them exhibit any abnormalities. The Copenhagen dataset was pre-processed: all the text features were converted to digits for further processing. A total of 4400 data samples were used in the experiments carried out in this research, and the data was divided into two parts: 2200 samples were used for training purposes (100 data samples for each chromosome class) and the remaining 2200 for testing purposes. The features used include the chromosome length, the centromere index and the gray banding pattern. The longest chromosome in the dataset consisted of 100 bands in the banding profile, thus the feature vector for each chromosome consists of 102 numbers.
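One plausible way to assemble the 102-number feature vector described above (assuming shorter banding profiles are zero-padded to the 100-band maximum; the layout and names here are assumptions, not the dataset specification):

```python
import numpy as np

N_BANDS = 100  # length of the longest banding profile in the dataset

def make_feature_vector(length, centromere_index, banding_profile):
    # layout assumed: [length, centromere index, band_1 .. band_100] -> 102 numbers
    bands = np.zeros(N_BANDS)
    bands[:len(banding_profile)] = banding_profile  # zero-pad shorter profiles
    return np.concatenate(([length, centromere_index], bands))
```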

Table 1. Classification success rates of the k-nn classifier for different p-norm distance functions.
Since computing the results with the Mahalanobis norm was computationally expensive, principal component analysis was used to reduce the data dimensionality and thus speed up the computation. In order to compare the results achieved by the Mahalanobis distance function (88% classification success rate), where the feature space was significantly reduced, with those previously achieved by the different p-norms, the classification of chromosomes was also carried out with the best performing distance function, i.e. the Euclidean distance function (94.05% classification success rate), using the same reduced feature set as for the Mahalanobis distance. Once again, much better classification results were obtained using the Euclidean distance function.