Denver Groups Classification of Human Chromosomes Using Fuzzy C-Means Clustering

Abstract: Unbanded human chromosomes can be classified into seven Denver Groups (A-G) based on their lengths and on the centromere index (CI), the ratio of the length of the shorter arm to the whole length of the chromosome. In this article, the fuzzy c-means method is used to perform the Denver Group classification of a given set of human chromosomes. The objective in clustering is to partition a given human chromosome set into homogeneous clusters; by homogeneous we mean that all points in the same cluster share similar attributes and do not share similar attributes with points in other clusters. However, the separation of clusters and the meaning of similarity are fuzzy notions and can be described as such. It is found that the clustering iterations converge, but that the resulting clusters depend strongly on the initial partition matrix.


INTRODUCTION
In 1956, Tjio and Levan, using improved cell culturing and staining techniques, discovered that the number of human chromosomes is 46 (Tjio and Levan, 1956). Since then, research on chromosomal abnormalities as a cause of disease has become one of the main branches of molecular biology.
Disorders in human chromosomes are powerful indicators in the diagnosis of leukemia, skin and breast cancers, and other genetic diseases. Clinical laboratories routinely perform tests to identify chromosome abnormalities, provide physicians with the diagnostic results, and help them decide on therapeutic treatments for patients.
The most prominent difficulty in chromosome analysis is the absence of clear microscopic chromosome images. Variation in cell culturing conditions, chromosome staining, and microscope illumination makes finding analyzable chromosomes in a clinical genetics laboratory very difficult. For human experts, identification and classification of chromosomes is a tedious and time-consuming task. Human error also introduces variation and affects the accuracy of the diagnoses made by physicians.
The development of computer-assisted metaphase finding and karyotyping systems has been slowed by noisy cell images.

HUMAN CHROMOSOMES
Since Waldeyer coined the term chromosome in 1888 (Verma and Babu, 1995), it has been known that chromosomes reside within a cell's nucleus and contain the person's deoxyribonucleic acid (DNA). Each chromosome is made up of a single, extremely long DNA molecule. Using cells cultured from fetal lung tissue, Tjio and Levan demonstrated that human cells contain 46 chromosomes as they appear during cell division, or mitosis. A healthy human cell nucleus includes 44 autosomes and 2 sex chromosomes (XX or XY).
The test cells used for chromosome imaging and analysis are taken mostly from blood samples, amniotic fluid, and bone marrow. These test samples are cultured overnight in the presence of a mitotic arresting agent. The cells are then treated with hypotonic solutions to increase cell volume, which spreads the chromosomes apart. A methanol-acetic acid mixture is used to fix them for analysis. The fixed cells are dropped onto a standard glass microscope slide and allowed to dry.
If karyotyping and classification are to be performed using banded chromosomes, the slide is then subjected to a staining process. Staining reveals the distinctive, reproducible patterns of bands along the chromosomes. These bands permit accurate identification of chromosomes and recognition of abnormalities.

Classification of Banded Chromosomes
To improve the performance of automated chromosome classification, including the recognition of disordered chromosomes, artificial intelligence and machine learning methods have been widely used in computer-assisted chromosome detection and classification systems (Gagula-Palalic and Can, 2012). Among them, the artificial neural network (ANN) is the most popular tool, owing to its capability of modeling the human decision-making process to recognize objects based on incomplete or partial information, as well as its simple topographic structure and easy training process (Mitchell, 1997).
Early studies also indicated that ANNs could achieve results comparable to those obtained by simpler statistical methods (Sweeney, 1993). A large number of feature-based and pixel-value-distribution-based ANNs have been tested and evaluated in the classification of banded chromosomes. These include supervised multi-layer neural networks (Delshadpour, 2003; Wu et al., 1990), Hopfield networks (Ruan, 2000), unsupervised self-organizing nonlinear maps (Lerner et al., 1996), self-organizing feature maps (SOFM) (Kyan et al., 1999), and a training method based on mutual information maximization (Mousavi et al., 1999).
However, one study found that the performance of unsupervised nonlinear learning methods was lower than that of a supervised nonlinear paradigm (Lerner et al., 1996). Although the ANN is a powerful machine learning tool in pattern recognition and classification, its major disadvantages are its relatively poor robustness in detecting and classifying abnormalities depicted on complicated chromosome images and its 'black box' type of optimization approach.
To provide researchers and clinicians with a better understanding of the logic or reasoning in automated classification of chromosomes, a variety of knowledge-based 'expert' systems were developed and evaluated (Gagula-Palalic and Can, 2012). Since clinical technicians are trained to recognize chromosomes under non-ideal conditions, many researchers tried to record the rules of manual karyotyping and diagnosis of chromosome irregularity and to apply or mimic them in a knowledge-based automated classification system, in an attempt to minimize classification errors.
Hence, researchers worked with clinicians, observed their diagnostic process, summarized and quantified the diagnostic rules, and then converted these rules into computer classification systems (Wu et al., 1989; Lu and Ya, 1989; Ramstein et al., 1992). The systems would then be trained on a bank of chromosome images, refining the rules as needed until the recognition rate was maximized. A major problem with such a knowledge-based approach is the difficulty of converting karyotyping guidelines and intuitive notions (empirical diagnostic rules) into concrete rules that can be effectively programmed and applied in a computer-assisted scheme. Owing to this difficulty, the most popular knowledge-based classification system is the fuzzy logic rule-based system, which offers great promise for improving the recognition rate (Keller et al., 1995). One blind test involving a dataset of 180 chromosomes distributed in three classes demonstrated 88% classification accuracy using an automated system involving six phases of fuzzy logic rules (Sjahputera and Keller, 1999).

Classification of Unbanded Chromosomes
When the chromosomes are not banded, they can be classified into seven Denver Groups (A-G) (H. C. S. Group, 1960), as shown in Table 1. Denver Group classification is mainly based on (1) the length or size of each chromosome and (2) the centromere index (CI), which is the ratio of the length of the shorter arm to the whole length of the chromosome.
In this article, the fuzzy c-means method will be used to perform the Denver Group classification of a given set of human chromosomes.
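The two features named above can be computed from arm-length measurements. The following is a minimal sketch, assuming that the lengths of the two arms of each chromosome are already available in some common unit; the function names `centromere_index` and `feature_vector` are hypothetical and are not taken from the original study.

```python
def centromere_index(arm_a: float, arm_b: float) -> float:
    """Ratio of the shorter arm length to the whole chromosome length."""
    total = arm_a + arm_b
    if arm_a < 0 or arm_b < 0 or total <= 0:
        raise ValueError("arm lengths must be non-negative with a positive sum")
    # The shorter arm is taken explicitly, so the argument order does not matter.
    return min(arm_a, arm_b) / total


def feature_vector(arm_a: float, arm_b: float) -> tuple[float, float]:
    """Return (length, CI), the two features used for Denver Group classification."""
    return (arm_a + arm_b, centromere_index(arm_a, arm_b))


# Illustrative values only (not measured data): arms of 2.0 and 6.0 units.
length, ci = feature_vector(2.0, 6.0)
print(length, ci)
```

By construction the CI never exceeds 0.5, with values near 0.5 corresponding to a centromere close to the middle of the chromosome.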

FUZZY c-MEANS (FCM)
The concept of a fuzzy set first arose in the study of problems related to pattern classification (Bellman et al., 1966). Since the recognition and classification of patterns is integral to human perception, and since these perceptions are fuzzy, this study seems a likely beginning (Zadeh, 1971). This section presents a simple idea in the area of classification and deals in depth with a particular form of classification using a popular clustering method: FCM.
The objective in clustering is to partition a given data set into homogeneous clusters; by homogeneous we mean that all points in the same cluster share similar attributes and do not share similar attributes with points in other clusters. However, the separation of clusters and the meaning of similarity are fuzzy notions and can be described as such. One of the first introductions to the clustering of data was in the area of fuzzy partitions (Ruspini, 1969, 1970, 1973a), where similarity was measured using membership values. In this case, the classification metric was a function involving a distance measure that was minimized.
Ruspini (1973b) points out that a definite benefit of fuzzy clustering is that stray points (outliers) or points isolated between clusters (Figure 1) may be classified this way; they will have low membership values in the clusters from which they are isolated. In crisp classification methods, these stray points need to belong to at least one of the clusters, and their membership in the cluster to which they are assigned is unity; their distance, or the extent of their isolation, cannot be measured by their membership. The notions of fuzzy classification described in this section provide a point of departure for the recognition of known patterns.
Figure 1. In fuzzy clustering, outliers or points isolated between clusters will have low membership values in the clusters from which they are isolated.
To develop fuzzy methods in classification, we define a family of fuzzy sets $\{\tilde{A}_i,\ i = 1, 2, \ldots, c\}$ as a fuzzy c-partition on a universe of data points, $X$. Because fuzzy sets allow for degrees of membership, we can assign membership to the various data points in each fuzzy set. Hence, a single point can have partial membership in more than one class. It will be useful to describe the membership value that the kth data point has in the ith class with the following notation:
$$\mu_{ik} = \mu_{\tilde{A}_i}(x_k) \in [0, 1],$$
with the restriction that the sum of all membership values for a single data point in all of the classes has to be unity:
$$\sum_{i=1}^{c} \mu_{ik} = 1 \quad \text{for all } k = 1, 2, \ldots, n. \qquad (1)$$
There can be no empty classes and there can be no class that contains all the data points. This qualification is depicted by the following expression:
$$0 < \sum_{k=1}^{n} \mu_{ik} < n. \qquad (2)$$
Because each data point can have partial membership in more than one class, one has, in general, for two classes $i \neq j$,
$$\mu_{ik} \wedge \mu_{jk} \neq 0. \qquad (3)$$
We can now define fuzzy c-partitions as the $c \times n$ matrices $\tilde{U} = [\mu_{ik}]$ whose entries satisfy these conditions.
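The conditions on a fuzzy c-partition can be checked mechanically. The sketch below assumes that the partition matrix is stored as a list of c rows (one per class), each holding n membership values (one per data point); the function name `is_fuzzy_c_partition` and the tolerance `tol` are illustrative choices, not part of the original text.

```python
def is_fuzzy_c_partition(U, tol=1e-9):
    """Check the membership constraints for a c-by-n partition matrix U.

    U[i][k] is the membership of data point k in class i.
    """
    if not U or not U[0]:
        return False
    c, n = len(U), len(U[0])
    if any(len(row) != n for row in U):
        return False
    # Every membership value must lie in [0, 1].
    if any(not (0.0 <= mu <= 1.0) for row in U for mu in row):
        return False
    # The memberships of each data point must sum to one over all classes.
    for k in range(n):
        if abs(sum(U[i][k] for i in range(c)) - 1.0) > tol:
            return False
    # No class may be empty, and no class may contain all the data points.
    return all(0.0 < sum(row) < n for row in U)


print(is_fuzzy_c_partition([[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]]))
print(is_fuzzy_c_partition([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]))
```

The first matrix satisfies all conditions; the second fails because one class is empty and the other contains every point.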

Fuzzy c-Means Algorithm
To describe a method to determine the fuzzy c-partition matrix $\tilde{U}$ for grouping a collection of n data points into c classes, we define an objective function $J_m$ for a fuzzy c-partition:
$$J_m(\tilde{U}, \mathbf{v}) = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^{\alpha} (d_{ik})^{2}, \qquad (4)$$
where $\tilde{U}$ is the partition matrix, $\mathbf{v}_i$ are the cluster centers, $d_{ik}$ is the Euclidean distance, in m-dimensional feature space, between the kth data sample $\mathbf{x}_k$ and the ith cluster center $\mathbf{v}_i$, and $\mu_{ik}$ is the membership of the kth data point in the ith class.
The partition matrix $\tilde{U}$ is used for grouping a collection of n data points into c classes, and each of its entries is a value of the membership function, $\mu_{ik}$. The Euclidean distances and the cluster centers are given by equations (5) and (6):
$$d_{ik} = \lVert \mathbf{x}_k - \mathbf{v}_i \rVert = \left[ \sum_{p=1}^{m} (x_{kp} - v_{ip})^{2} \right]^{1/2}, \qquad (5)$$
$$v_{ip} = \frac{\sum_{k=1}^{n} (\mu_{ik})^{\alpha} x_{kp}}{\sum_{k=1}^{n} (\mu_{ik})^{\alpha}}, \quad p = 1, 2, \ldots, m. \qquad (6)$$
Fuzzy c-means tunes the partition matrix, the centers, and the distances so that the objective function $J_m$ is minimized (Ross, 2004).
Equation (4) introduces a new parameter, the weighting parameter $\alpha$ (Bezdek, 1981), which has the range $\alpha \in [1, \infty)$. This parameter controls the amount of fuzziness in the classification process.
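To make the roles of the cluster centers, the Euclidean distances, and the objective function concrete, the following sketch evaluates them for a fixed partition matrix. It assumes data points stored as lists of feature values and a partition matrix stored as c rows of n memberships; the function names and the default weighting value of 2.0 are assumptions made for illustration only.

```python
import math


def cluster_centers(X, U, alpha=2.0):
    """Membership-weighted mean of the data for each class (weights mu**alpha)."""
    n, m = len(X), len(X[0])
    centers = []
    for row in U:
        w = [mu ** alpha for mu in row]
        total = sum(w)
        centers.append([sum(w[k] * X[k][p] for k in range(n)) / total
                        for p in range(m)])
    return centers


def distance(x, v):
    """Euclidean distance between a data point x and a cluster center v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v)))


def objective(X, U, centers, alpha=2.0):
    """Sum over all points and classes of mu**alpha times squared distance."""
    return sum((U[i][k] ** alpha) * distance(X[k], centers[i]) ** 2
               for i in range(len(U)) for k in range(len(X)))


# Two illustrative points; a maximally fuzzy partition versus a crisp one.
X = [[0.0, 0.0], [2.0, 0.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5]]
crisp = [[1.0, 0.0], [0.0, 1.0]]
print(objective(X, fuzzy, cluster_centers(X, fuzzy)))
print(objective(X, crisp, cluster_centers(X, crisp)))
```

In this toy case the crisp partition places each center on its own point and so yields a smaller objective value than the fully fuzzy partition, whose two centers coincide at the midpoint.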
As with many optimization processes, the minimized objective function $J_m$ cannot be guaranteed to be a global optimum. What we seek is the best solution available within a prespecified level of accuracy. An effective algorithm for fuzzy classification, called iterative optimization, was proposed by Bezdek (1981). The steps in this algorithm are as follows:
1. Fix c ($2 \le c < n$) and select a value for the parameter $\alpha$. Initialize the partition matrix, $\tilde{U}^{(0)}$. Each step in this algorithm will be labeled r, where r = 0, 1, 2, ...
2. Calculate the c centers $\{\mathbf{v}_i^{(r)}\}$ for each step.
3. Update the partition matrix for the rth step, $\tilde{U}^{(r)}$, as follows:
$$\mu_{ik}^{(r+1)} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}^{(r)}}{d_{jk}^{(r)}} \right)^{2/(\alpha - 1)} \right]^{-1}.$$
4. If $\lVert \tilde{U}^{(r+1)} - \tilde{U}^{(r)} \rVert \le \epsilon_L$, stop; otherwise set r = r + 1 and return to step 2.
In step 4, we compare a matrix norm $\lVert \cdot \rVert$ of two successive fuzzy partitions to a prescribed level of accuracy, $\epsilon_L$, to determine whether the solution is good enough. In step 3, the distance $d_{jk}^{(r)}$ appears in the denominator of a fraction; when it is zero, the operation is undefined mathematically, and computer calculations are abruptly halted.
So when some of the distance measures $d_{jk}^{(r)}$ are zero, or extremely small in a computational sense, they are replaced by a small positive real number.
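The iterative optimization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `fcm`, the random initialization controlled by `seed`, the maximum-absolute-difference norm for the stopping test, and the default values of `alpha`, `eps`, `max_iter` and `tiny` are all choices made here for the example. The sketch requires `alpha` greater than 1 so that the exponent in the membership update is defined.

```python
import math
import random


def fcm(X, c, alpha=2.0, eps=1e-5, max_iter=300, seed=0, tiny=1e-12):
    """Fuzzy c-means by iterative optimization; returns (U, centers)."""
    n, m = len(X), len(X[0])
    if not 2 <= c < n:
        raise ValueError("c must satisfy 2 <= c < n")
    if alpha <= 1.0:
        raise ValueError("this sketch needs alpha > 1")
    rng = random.Random(seed)
    # Step 1: random initial partition, each column normalized to sum to one.
    U = [[rng.random() + tiny for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s
    centers = []
    for _ in range(max_iter):
        # Step 2: centers as membership-weighted means of the data.
        centers = []
        for i in range(c):
            w = [U[i][k] ** alpha for k in range(n)]
            total = sum(w)
            centers.append([sum(w[k] * X[k][p] for k in range(n)) / total
                            for p in range(m)])
        # Step 3: membership update; zero distances are replaced by `tiny`.
        U_new = [[0.0] * n for _ in range(c)]
        for k in range(n):
            d = [max(math.dist(X[k], centers[i]), tiny) for i in range(c)]
            for i in range(c):
                U_new[i][k] = 1.0 / sum((d[i] / d[j]) ** (2.0 / (alpha - 1.0))
                                        for j in range(c))
        # Step 4: stop when successive partitions differ by at most eps.
        change = max(abs(U_new[i][k] - U[i][k])
                     for i in range(c) for k in range(n))
        U = U_new
        if change <= eps:
            break
    return U, centers


# Illustrative data: two well-separated groups of three points each.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
U, centers = fcm(X, c=2)
labels = [max(range(2), key=lambda i: U[i][k]) for k in range(len(X))]
print(labels)
```

Because the starting partition is random, running the sketch with different `seed` values is a simple way to observe how the final clusters can vary with the initial partition matrix.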

DATA DESCRIPTION
The data used in this work are taken from the Copenhagen database. We omitted the gray-level features and kept only (1) the length of each chromosome and (2) the centromere index (CI), which is the ratio of the length of the shorter arm to the whole length of the chromosome.
Figure 2. The distribution of 2200 human chromosomes into the seven Denver Group classes A to G.

CLASSIFICATION USING FUZZY c-MEANS (FCM)
Using the fuzzy c-means algorithm described in Section 3, it is found that the clustering iterations converge, but that the resulting clusters depend strongly on the initial partition matrix, $\tilde{U}^{(0)}$. The Denver Groups A to G are distributed over the clusters C1 to C7 as shown in Table 2.
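A cross-tabulation of Denver Groups against clusters can be built from the hardened memberships. The sketch below is one possible way to do this, assuming that each chromosome is assigned to the cluster in which it has the highest membership and that each cluster is then labeled by its most frequent Denver Group; the source does not state how its classification rate was computed, so the majority-vote rule and the names `harden`, `contingency` and `majority_rate` are assumptions.

```python
from collections import Counter


def harden(U):
    """Assign each data point to the class in which its membership is highest."""
    c, n = len(U), len(U[0])
    return [max(range(c), key=lambda i: U[i][k]) for k in range(n)]


def contingency(true_groups, cluster_ids):
    """Count, for each cluster, how many members come from each Denver Group."""
    table = {}
    for group, cid in zip(true_groups, cluster_ids):
        table.setdefault(cid, Counter())[group] += 1
    return table


def majority_rate(true_groups, cluster_ids):
    """Fraction of points whose group matches the majority group of their cluster.

    Assumption: two clusters may end up labeled with the same group.
    """
    table = contingency(true_groups, cluster_ids)
    correct = sum(max(counts.values()) for counts in table.values())
    return correct / len(true_groups)


# Illustrative labels only (not results from the study).
groups = ["A", "A", "B", "B", "B", "C"]
clusters = [0, 0, 1, 1, 0, 2]
print(contingency(groups, clusters))
print(majority_rate(groups, clusters))
```

A stricter alternative would require a one-to-one matching between clusters and groups, which can only lower the reported rate.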

SUMMARY
This article presents a simple idea in the area of classification and deals in depth with a particular form of classification using a popular clustering method, FCM. Although the idea behind the method is very simple, it succeeds in classifying the given 700 human chromosomes into the seven Denver Group classes A to G with a rate of 81.86%.