Southeast Europe Journal of Soft Computing, Vol.5 No.2 September 2016 (11-15)
Word Identification According to Syllabic Property

Abstract
Natural Language Processing (NLP) is a field of computer science, artificial intelligence and computational linguistics that is concerned with the interactions between computers and natural languages. With developing computer technologies and social networks, research on natural language tasks such as machine translation, content summarization and information retrieval has become one of the most studied fields of NLP. To make a general solution for a problem, it is important to classify words and find out the category of language. In this paper, according to the syllabic property of Uyghur words, a simple Uyghur word identification approach is suggested.


INTRODUCTION
This paper describes identifying or defining Uyghur words according to their syllabic properties. Uyghur is a Turkic language spoken mainly in the Sin Kiang Uyghur Autonomous Region in China. By morphological structure, all Uyghur words have standard syllabic properties, and all words can be split into syllables by applying general syllabic rules [1]. However, Uyghur is one of the oldest languages in the Turkic language family and is spoken across a wide geographic area, including the Uyghur Autonomous Region in China, Afghanistan, Kazakhstan, Kyrgyzstan, Uzbekistan, Turkey, the USA and some European countries. It includes many words that are not of Uyghur origin [1]. Most Uyghur speakers live in the Uyghur Autonomous Region in China, and the contemporary Uyghur language is heavily affected by Chinese words. There are also many words adopted from Russian in the Central Asian republics. In addition, the Uyghur language is affected by Arabic and Persian words because of religion and geographic relations. Therefore, to study or analyze the Uyghur language with computer-based methods, it is necessary to determine the origin of a word properly.
In natural language studies, the alphabet is one of the most important factors. A range of alphabets with different numbers of characters have been used in different parts of the world to write the Uyghur language. For example, the Arabic-based alphabet is used in China (Figure 1), while the Cyrillic alphabet is used in the Central Asian republics (Figure 2) and the Latin-based alphabet is used in western countries. Therefore, to study the Uyghur language, it is necessary to study the relationship between these different alphabets. Characters used in an alphabet directly affect word structures and spelling rules. For example, in the Central Asian republics, some Russian characters are used to write words adopted from Russian, but a single Russian character can represent two characters in the Uyghur language. To study the Uyghur language as a single unit, it is important to implement a correct algorithm to convert one alphabet into another. Even though the Arabic-based Uyghur alphabet is the official alphabet in the Sin Kiang Uyghur Autonomous Region, the Latin-based alphabet is commonly used. In this paper, Uyghur words are split according to the Latin-based alphabet adapted by the UKIJ [3-4] (Figure 4).
In the Turkic language family, Turkish as spoken in Turkey is one of the languages with a large body of research on computational linguistic methods. Important progress has been made, such as morphological analyzers, corpora and machine translation applications [5][6][7][8][9][10]. These results provide important fundamentals for studying other Turkic languages. Unfortunately, NLP studies of other Turkic languages are still in the early stages; resources are insufficient, and inestimable differences exist among different Turkic languages [11].
This paper mainly describes how to identify or define Uyghur native words according to their syllabic properties. Comparing the syllabic properties of different Turkic or non-Turkic languages is out of the scope of this paper. This paper is organized as follows: after providing short information about NLP and the Uyghur language in the first section, the syllabic and morphological properties of Uyghur words are explained in the second section. The third section describes the implementation of the algorithm that splits words into syllables, and in the last section the algorithm is evaluated and the results are explained.

SYLLABIC and MORPHOLOGICAL PROPERTIES OF UYGHUR WORDS
MORPHOLOGICAL PROPERTIES
To study the syllabic properties of words, the first thing to do is to analyze their morphological properties. Uyghur is an agglutinative language in which word structures are formed by productive affixation of derivational and inflectional suffixes to root words. For example:

SHEHIRDEKILERNINGKIMISHDEK

which can be broken down into morphemes as follows:

SHEHIR+DE+KI+LER+NING+KI+MISH+DEK

where "+" indicates morpheme boundaries. This word can be translated into English as "as if they belong to those who live in a city". The root of this word is "SHEHIR", and the rest of the morphemes add additional meaning to the root word. Whenever a new morpheme is affixed, a new category is created. When a new morpheme or suffix is affixed, the vowels in the morpheme have to agree with the preceding vowel in certain aspects to achieve vowel harmony, although there are a small number of exceptions. In some cases, vowels are changed or deleted from the root words [1,12]. Similar modifications apply to consonants in root words and affixed morphemes.
The complicated morphological structure of words, especially in agglutinative languages, makes it more complicated to study the morphological, lexical and syntactic properties of a language.
Native words of Uyghur origin and adopted words have different morphological structures, therefore some computational morphological analyzers cannot analyze all Uyghur words correctly [13][14]. If a word is identified before it is analyzed and a word of non-Uyghur origin is detected, a different method of analysis can be suggested, and the performance of the morphological analysis may be improved. In natural language processing for agglutinative languages, a morphological analyzer is the most important part, and it provides the fundamentals for further research.
In the Uyghur language, vowels are the central parts of syllables. Without a vowel, a syllable cannot be created [1]. The main syllabic rule for a Uyghur word is that every syllable contains exactly one vowel sound, so the number of vowels in a word determines the number of syllables in that word.
The contemporary Uyghur language has eight vowels and 24 consonants.
Both vowels and consonants can be categorized according to different criteria, but this is not the topic of this paper. In the Uyghur language, some words consist of only one syllable and some words consist of multiple syllables. Even a single vowel can be considered a valid syllable.
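The rule that each syllable holds one vowel can be illustrated with a minimal sketch. It assumes the eight vowels are written as the Latin letters a, e, é, i, o, ö, u, ü; the constant and function names are hypothetical and not taken from the paper.

```python
import unicodedata

# Assumption: the eight vowels in the Latin-based Uyghur alphabet.
UYGHUR_VOWELS = frozenset("aeéioöuü")


def count_syllables(word: str) -> int:
    """Return the syllable count, taken as the number of vowel letters."""
    # NFC normalization keeps accented vowels such as "é" as single characters.
    text = unicodedata.normalize("NFC", word.lower())
    return sum(1 for ch in text if ch in UYGHUR_VOWELS)


if __name__ == "__main__":
    for w in ["u", "kün", "shehir", "almaqchimen"]:
        print(w, count_syllables(w))
```

Under this assumption, a single vowel such as "u" counts as one syllable, in line with the statement that a lone vowel is a valid syllable.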
To describe the general syllabic properties of Uyghur words, if consonants are represented as "C" and vowels as "V", the following cases can be summarized for Uyghur native words [1] (the explained syllable is underlined with bold characters). There are many adopted words in the Uyghur language, and it is also possible to describe syllabic styles for some of them [1]. In general, words used in the contemporary Uyghur language can be analyzed according to the six standard rules described above [1].
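The C/V notation can be sketched in code as follows. The six patterns listed here (V, VC, CV, CVC, VCC, CVCC) are an assumption: they are the syllable shapes commonly given for native Uyghur words and agree with the one-to-four character range stated later, but the paper's own list is not reproduced in this text. The digraph list and all names are likewise illustrative.

```python
import unicodedata

VOWELS = frozenset("aeéioöuü")               # assumed Latin vowel letters
DIGRAPHS = ("ch", "gh", "ng", "sh", "zh")    # two letters, one consonant
# Assumption: the six standard syllable shapes for native words.
NATIVE_PATTERNS = frozenset({"V", "VC", "CV", "CVC", "VCC", "CVCC"})


def tokenize(word: str) -> list[str]:
    """Split a Latin-script word into sound units, keeping digraphs together."""
    text = unicodedata.normalize("NFC", word.lower())
    tokens, i = [], 0
    while i < len(text):
        step = 2 if text[i:i + 2] in DIGRAPHS else 1
        tokens.append(text[i:i + step])
        i += step
    return tokens


def cv_skeleton(word: str) -> str:
    """Map each sound unit to "V" (vowel) or "C" (anything else)."""
    return "".join("V" if t in VOWELS else "C" for t in tokenize(word))


if __name__ == "__main__":
    for w in ["u", "at", "su", "kün", "ast", "tört", "shehir"]:
        skeleton = cv_skeleton(w)
        print(w, skeleton, skeleton in NATIVE_PATTERNS)
```

Treating every "ng" as one consonant is a simplification; a sequence of "n" followed by "g" across a morpheme boundary would need extra handling.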
According to those rules, a Uyghur word may consist of a single syllable or, through affixed suffixes, an unlimited number of syllables. To find the syllables in a word correctly, it is important to define the borders of syllables. In general, the syllable borders can be defined according to the following rules [1]. There are some special cases that do not follow these rules. In these cases the border of a syllable depends on vowel harmonization: when a vowel is changed, it also affects some consonants, and these changes affect the border of the syllable. Such very specific cases are not included in this paper.
In some cases, a well-implemented morphological analyzer can be used as a syllable splitter, but a morphological analyzer cannot split root words, and it only gives information about morphemes according to the suffixes that follow.

IMPLEMENTATION of the ALGORITHM
To implement the algorithm that splits Uyghur words into syllables, both Cyrillic and Arabic characters are first converted into Latin characters, according to the source of the files. After that, the six rules are applied to all words. In the last step, the syllable borders are adjusted and it is decided whether the created syllables match the standard Uyghur syllable styles. The algorithm that splits words into syllables is represented in Figure 5. If a word cannot be split, it is considered a word adopted from another language. To analyze these adopted words, alternative methods could be suggested; this remains an open topic, because there are many adopted words from different languages.
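The conversion step can be illustrated with a small character-mapping sketch for Cyrillic input. The table below is an assumed, partial Cyrillic-to-Latin correspondence written for illustration; it is not the conversion table used in the paper, and the names are hypothetical. It shows how a single Cyrillic character such as "я" or "ю" expands into two Latin characters.

```python
# Assumption: an illustrative Uyghur Cyrillic -> Latin letter mapping.
CYRILLIC_TO_LATIN = {
    "а": "a", "ә": "e", "е": "é", "и": "i",
    "о": "o", "ө": "ö", "у": "u", "ү": "ü",
    "б": "b", "п": "p", "т": "t", "д": "d",
    "к": "k", "г": "g", "қ": "q", "ғ": "gh",
    "л": "l", "м": "m", "н": "n", "ң": "ng",
    "р": "r", "с": "s", "з": "z", "ш": "sh",
    "ч": "ch", "җ": "j", "ж": "zh", "х": "x",
    "һ": "h", "ф": "f", "в": "w", "й": "y",
    # One Cyrillic character that stands for two Latin characters:
    "ю": "yu", "я": "ya",
}


def cyrillic_to_latin(text: str) -> str:
    """Convert Cyrillic-script text to Latin script, letter by letter."""
    # Characters missing from the table (spaces, digits, punctuation)
    # are passed through unchanged.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    for w in ["шәһәр", "күн", "яхши", "трактор"]:
        print(w, "->", cyrillic_to_latin(w))
```

A letter-by-letter table is the simplest possible design; conversion from the Arabic-based script would need a separate table and is not sketched here.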
As shown in Figure 5, the algorithm has two main parts: splitting into syllables and adjusting syllable borders. When the borders are adjusted, the number of characters in a syllable may change. In general, the maximum number of characters in a syllable is four and the minimum is one (one vowel).
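The two parts can be sketched together as follows. This is a sketch under stated assumptions, not the algorithm of Figure 5: the border rule used here (the last consonant before a vowel opens the next syllable, and any earlier consonants close the previous one) and the six syllable patterns are assumptions, and all names are hypothetical.

```python
import unicodedata

VOWELS = frozenset("aeéioöuü")               # assumed Latin vowel letters
DIGRAPHS = ("ch", "gh", "ng", "sh", "zh")    # two letters, one consonant
# Assumption: the six standard syllable shapes for native words.
NATIVE_PATTERNS = frozenset({"V", "VC", "CV", "CVC", "VCC", "CVCC"})


def tokenize(word):
    """Split a word into sound units, keeping digraphs together."""
    text = unicodedata.normalize("NFC", word.lower())
    tokens, i = [], 0
    while i < len(text):
        step = 2 if text[i:i + 2] in DIGRAPHS else 1
        tokens.append(text[i:i + step])
        i += step
    return tokens


def syllable_tokens(word):
    """Group sound units into syllables, one vowel per syllable."""
    tokens = tokenize(word)
    vowel_pos = [i for i, t in enumerate(tokens) if t in VOWELS]
    if not vowel_pos:
        return []  # without a vowel no syllable can be formed
    borders = [0]
    for prev, nxt in zip(vowel_pos, vowel_pos[1:]):
        # Assumed border rule: if consonants separate two vowels, the last
        # one starts the next syllable; otherwise the border is at the vowel.
        borders.append(nxt - 1 if nxt - prev > 1 else nxt)
    borders.append(len(tokens))
    return [tokens[a:b] for a, b in zip(borders, borders[1:])]


def pattern(syllable):
    """Return the C/V shape of one syllable given as a list of sound units."""
    return "".join("V" if t in VOWELS else "C" for t in syllable)


def analyze(word):
    """Return the '+'-joined syllables and whether all fit a native pattern."""
    groups = syllable_tokens(word)
    joined = "+".join("".join(g) for g in groups)
    native = bool(groups) and all(pattern(g) in NATIVE_PATTERNS for g in groups)
    return joined, native


if __name__ == "__main__":
    for w in ["shehir", "almaqchimen", "kün", "traktor"]:
        print(w, analyze(w))
```

With these assumptions, "traktor" yields a first syllable of shape CCVC, which is not among the six patterns, so the word is flagged as adopted. A foreign word whose syllables happen to fit the patterns would still be accepted by this sketch.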

RESULTS AND DISCUSSION
This algorithm was tested with two different articles. One of the articles was published in Kazakhstan and the other was published in the Sin Kiang Uyghur Autonomous Region in China.
As a result, the algorithm was able to split words into valid syllables successfully, except for Chinese and Russian words. In these two short articles, a total of 200 words were used. If the number of words is increased or the type of article is changed, the error rate may change and increase compared to shorter articles.
This is because adopted words mainly appear in technical and political articles. If the sentence "Men ikki kün kéyin GuangZhougha birip traktor almaqchimen" (I am going to Guang Zhou in two days to buy a tractor) is analyzed, the following syllables are generated, with multiple syllables joined by the "+" sign. Although almost all words of Uyghur origin can be identified with this approach, a few words of foreign origin are also classified as standard Uyghur words, as mentioned in section 4. These kinds of words may be considered a special word category and have to be analyzed with a different method. With this approach it is possible not only to identify a word but also to generate random words according to the standard structure of Uyghur words.
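Random word generation from the syllable structure could look like the sketch below. It is illustrative only: the six syllable patterns and the letter inventories are assumptions, restricting non-initial syllables to consonant-initial shapes is a simplification introduced here, and vowel harmony is ignored, so the output has a plausible shape but is not guaranteed to be a well-formed Uyghur word.

```python
import random

# Assumed Latin-script inventories: eight vowels and 24 consonants.
VOWELS = list("aeéioöuü")
CONSONANTS = ["b", "p", "t", "j", "ch", "x", "d", "r", "z", "zh", "s", "sh",
              "gh", "f", "q", "k", "g", "ng", "l", "m", "n", "h", "w", "y"]
# Assumption: any of the six shapes may start a word; later syllables are
# limited to consonant-initial shapes to keep syllable borders unambiguous.
FIRST_PATTERNS = ["V", "VC", "CV", "CVC", "VCC", "CVCC"]
LATER_PATTERNS = ["CV", "CVC", "CVCC"]


def random_syllable(pattern, rng):
    """Fill a C/V pattern with randomly chosen consonants and vowels."""
    return "".join(
        rng.choice(VOWELS) if slot == "V" else rng.choice(CONSONANTS)
        for slot in pattern
    )


def random_word(n_syllables, rng):
    """Return a list of n_syllables random syllables."""
    if n_syllables < 1:
        raise ValueError("a word needs at least one syllable")
    patterns = [rng.choice(FIRST_PATTERNS)]
    patterns += [rng.choice(LATER_PATTERNS) for _ in range(n_syllables - 1)]
    return [random_syllable(p, rng) for p in patterns]


if __name__ == "__main__":
    rng = random.Random(2016)  # fixed seed so runs are repeatable
    for n in (1, 2, 3):
        print("+".join(random_word(n, rng)))
```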
Words in all Turkic languages have almost the same structure, therefore this approach may be applied to other Turkic languages as well. Because different Turkic languages have different numbers of characters and use different alphabets, the syllable rules may differ to some extent.