Hierarchical Afaan Oromoo News Text Classification
New Media and Mass Communication www.iiste.org
ISSN 2224-3267 (Paper) ISSN 2224-3275 (Online)
Vol.88, 2020
Abstract
The advancement of present-day technology enables the production of huge amounts of information. Retrieving useful information from these huge collections necessitates proper organization and structuring, and automatic text classification is an inevitable solution in this regard. However, the prevailing approach focuses on flat classification, where each topic is treated as a separate class; this is inadequate when there are a large number of classes and a huge number of relevant features is needed to distinguish between them. This paper explores the use of a hierarchical structure for classifying a large, heterogeneous collection of Afaan Oromoo news text. The approach utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. An experiment was conducted on categorical data collected from the Ethiopian News Agency (ENA), using SVM, to see the performance of hierarchical classifiers on Afaan Oromoo news text. The findings show that the accuracy of flat classification decreases as the number of classes and documents (features) increases. Moreover, the accuracy of the flat classifier declines as more top features are added; its peak accuracy was 68.84% when the top 3 features were used. The experiment on hierarchical classification shows increasing classifier performance as we move down the hierarchy: the maximum accuracy achieved was 90.37% at level-2, the last level of the category tree. Moreover, in contrast to the flat classifier, the accuracy of the hierarchical classifiers increases with an increasing number of top features; the peak accuracy was 89.06% for the level-2 classifier when the top 15 features were used. Furthermore, the flat and hierarchical classifiers were compared on the same test data. The use of the hierarchical structure during classification resulted in a significant relative improvement of 29.42% in exact-match precision over the flat classifier.
Keywords: Automatic Text Classification
DOI: 10.7176/NMMC/88-01
Publication date: February 29th 2020
1 Introduction
Humans use classification techniques to organize things in various activities of their lives. People make their own judgments to classify things in everyday life: they group things based on similarity of color, size, concept, idea, subject and so on (Koller & Sahami, 1997).
The need to classify information resources has become an important issue as the production of such resources increases dramatically from time to time. More specifically, over the last six decades there has been a great increase in the production of information (Dumais & Chen, 2000). Manuscripts, newspapers, journals, magazines, theses and dissertations are available electronically in different formats such as text, audio, video, and graphics. Several studies have shown that these collections are constantly growing due to the advancement of technology (Chien et al., 2004). However, finding the information one needs in such huge collections requires organization. Especially in a system with a large collection of documents, retrieval of a given document or set of documents is possible only if the collection is organized systematically. Many web sites offer a hierarchically organized view of the Web, e-mail clients offer systems for filtering e-mail, and academic communities often maintain web sites that allow searching over papers and show an organization of them.
Nowadays, news items are produced every day on digital devices and organized in some order (Rennie, 2001). However, most of the time text classification is done manually, which brings enormous costs in terms of time and money: organizing documents by hand, or creating rules for filtering, is painstaking and labor-intensive. Therefore, automatic classification systems are very desirable, since they minimize such problems (Neumann & Schmeier, 1999; Rennie, 2001).
Automatic text classification can be done using two approaches (Sebastiani, 2002; Rasmussen, 1992): clustering or classification. Text clustering is the automatic identification of a set of natural categories and the grouping of documents under each. Text classification, on the other hand, is the automatic assignment of documents to a predefined set of categories. Many previous studies focus on flat classification, in which the predefined categories are treated individually and equally, so that no structures exist to define relationships among them (Yang & Liu, 1999; D'Alessio et al., 2000). A single huge classifier is trained which categorizes each document directly into one of the predefined classes.
In Afaan Oromo, punctuation marks are placed in text to make the meaning clear and reading easier. Analysis of Afaan Oromo texts reveals that its punctuation marks follow the same patterns used in English and other languages that follow the Latin writing system. Similar to English, the following are some of the most commonly used punctuation marks in Afaan Oromo:
number of documents explosively increases, the task is no longer amenable to manual categorization, which requires a vast amount of time and cost. This has led to numerous studies of automatic document classification. Automatic text categorization is generally divided into two main approaches: flat text categorization and hierarchical text categorization.
1.3.2. Text Clustering
It is easy to collect unlabeled documents, so a mechanism is required for organizing documents without predefined labels; this mechanism is called text clustering. Text clustering is unsupervised learning that requires neither predefined categories nor labeled documents. Its main aim is to determine the intrinsic grouping in a set of unlabeled data, where the intrinsic groups have high intra-group similarity and low inter-group similarity. Text clustering algorithms fall into two main groups: hierarchical algorithms, which build a tree of nested clusters, and partitioning algorithms, which split the data into k partitions where each partition represents a cluster.
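A partitioning-style algorithm of the kind just described can be sketched with a minimal k-means over toy term-count vectors. This is an illustration under simplified assumptions (Euclidean distance, deterministic initialization), not the study's own implementation:

```python
def kmeans(vectors, k, iters=10):
    """Partition `vectors` into k groups; each partition represents a cluster."""
    centroids = [list(v) for v in vectors[:k]]   # simple deterministic init
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy term-count vectors forming two obvious groups:
docs = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (0.0, 0.0), (5.1, 5.0), (5.0, 4.9)]
labels = kmeans(docs, k=2)
```

With well-separated groups such as these, the two partitions recover the intrinsic grouping: high intra-group similarity, low inter-group similarity.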
Text Preprocessing
Text preprocessing is a crucial step for the subsequent text clustering and classification tasks. During preprocessing, a sequence of steps is applied to generate content-bearing terms and to assign weights that show how important each term is for representing the document it was identified from. First, tokenization is performed, which attempts to identify the words in the document corpus. The common method of representing document text is the bag-of-words approach, where each word in a document corresponds to a feature and each document is treated as a feature vector. Under this representation only words can discriminate between documents in the categorization process, so punctuation marks and digits are irrelevant components of the text. In the present study, punctuation marks and digits are removed and replaced with a space. Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the sets of terms that are indexed. Tokenization in this work is also used for splitting a document into tokens and detaching certain characters such as punctuation marks.
Algorithm 1: Tokenization
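The original algorithm listing is not reproduced here; the following minimal sketch is consistent with the description above (punctuation marks and digits replaced with spaces, then whitespace splitting). The Afaan Oromoo example sentence is illustrative:

```python
import re

def tokenize(text):
    """Split a document into tokens: punctuation marks and digits are
    treated as irrelevant, replaced with spaces, and the remaining
    text is split on whitespace."""
    cleaned = re.sub(r"[^\w\s]|\d", " ", text)
    return cleaned.split()

tokenize("Oduu guyyaa: gatiin midhaanii 15% dabale!")
# → ["Oduu", "guyyaa", "gatiin", "midhaanii", "dabale"]
```

Note that this blanket rule would also strip the hudhaa (') inside words such as har'aa; a production tokenizer for Afaan Oromoo might need to treat that character as a letter.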
Normalization involves handling differences in the writing system. Primarily, every term in the document should be converted into a uniform case format, in this study lower case. For instance, "INFORMATION", "Information" and "information" are all normalized to "information" so that the system treats them as the same term.
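Case folding of this kind can be sketched in a line of Python (a minimal illustration, not the study's own code):

```python
def normalize(tokens):
    """Case-fold tokens so spelling variants map to a single term."""
    return [token.lower() for token in tokens]

normalize(["INFORMATION", "Information", "information"])
# → ["information", "information", "information"]
```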
2. Experiments and Results
2.1.1 Experimental Setup
Training data
The number of news documents used in this experiment was 5100. Since hierarchical classification emphasizes the relationships among classes, rather than building a single huge classifier, classification is accomplished with the cooperation of classifiers built at each level of the tree. The training data is organized into three levels, from level-0 (the root level) to level-2, where each level represents the classes or subclasses in the classification tree. Thus, there were 8 classes at level-0, 20 classes at level-1,
and 69 classes at level-2, with at least 14 documents in each. The classifier at each level was trained using the associated documents of all subclasses of that class. Thus, the level-0 classifier was trained using documents of all subclasses from level-1 through level-2, while each level-1 classifier was trained with documents from the appropriate level-1 subclasses down to level-2.
Testing data
The accuracy of the classifiers was tested using test data selected from level-2 (leaf) documents. These documents were excluded from the training process and were selected from different level-2 classes. Since the class from which each test document was selected is known, the accuracy of a classifier is evaluated by how often it assigns the test documents to the classes from which they originally came.
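This evaluation criterion amounts to exact-match accuracy, which can be sketched as follows (the class labels are hypothetical; this is not the study's own evaluation code):

```python
def exact_match_accuracy(true_classes, predicted_classes):
    """Fraction of test documents assigned back to the class they came from."""
    correct = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return correct / len(true_classes)

# Hypothetical labels: 3 of 4 held-out documents routed to their original class.
acc = exact_match_accuracy(
    ["sport", "economy", "health", "sport"],
    ["sport", "economy", "health", "politics"],
)
# → 0.75
```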
Effects of the number of classes and documents on flat classification
We created a single classification system by training one flat classifier for all classes in the top three levels of the classification tree, ignoring structure. In other words, each of the 97 classes was trained using 70% of the documents from each class. We broke the classification process into groups of classes taken separately, to observe the classifier's performance as the number of classes and documents (features) increases. Thus, the classes at level-0, level-1, and level-2 were considered separately for the first, second, and third experiments, respectively. Since each document is assigned to a leaf node of the classification tree, the level-0 classes would contain the same number of documents as level-1 and level-2 if each level were used with the full collection; hence, documents were sampled using 50% of the collection for the first experiment, 70% for the second, and 90% for the third. Moreover, the effect of the number of top features on the performance of the flat classifier, and the reasons behind it, are addressed in this section.
Experiments I, II and III present the results and discussion for an increasing number of classes and documents, whereas Experiment IV shows the effect of the number of top features on the performance of the flat classifier.
Classification with 8 classes
In this experiment, 8 classes and 50% of the total number of documents in the collection were considered, with 70% of these data used for training and 30% for testing. Hence, 1785 documents were used for training and 765 for testing. The resulting accuracy was 80.34%.
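The split arithmetic above can be checked directly with integer arithmetic over the figures quoted in this section:

```python
total_docs = 5100                          # full collection
experiment_docs = total_docs * 50 // 100   # 50% used in this experiment: 2550
train_docs = experiment_docs * 70 // 100   # 70% for training: 1785
test_docs = experiment_docs - train_docs   # remaining 30% for testing: 765
```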
The increasing performance of hierarchical classifiers
Table 2.3 shows the improved performance of the hierarchical classifiers at each level of the classification tree. This is because each classifier deals only with the documents associated with that class or its subclasses, so it concentrates on a smaller set of documents, those relevant to the task at hand. As shown above, exploiting the relationships among classes and utilizing the hierarchical topic structure results in a considerable increase in classifier accuracy. The testing data in Table 2.3 are the instances that participate in testing the level-0 classifier, extracted from the corresponding classes in level-1 through level-2. This is because LibSVM only evaluates whether the classifier assigns the test documents to the classes from which they originally came, from which the accuracy is calculated as described above. Thus, in this study the level-1 and level-2 classifiers are tested with documents selected from those classes and their subclasses; other documents are left out, as including them would always degrade the measured accuracy of the classifier.
Effects of the number of top features in hierarchical classifiers
The documents were initially classified at level-0 using a varying number of features per document, where the features were selected based on their tf-idf weights. The first run used only the highest-weighted feature for classifying the documents, and the number of features was increased in each subsequent run up to a maximum of 20 features. The level-0 classifier had a peak accuracy of 81.50% when the top 5 features were used. Figure 2.3 shows the result of the experiment.
Figure 2.3: The effect of the number of top features selected from test documents on level-0 classification accuracy
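Top-feature selection by tf-idf weight, as described above, can be sketched as follows. This is a minimal illustration assuming the standard tf × log(N/df) weighting; the paper does not give its exact tf-idf formula, and the terms and corpus here are hypothetical:

```python
import math
from collections import Counter

def top_features(doc_tokens, corpus, k):
    """Rank a document's terms by tf-idf and keep only the top k."""
    n = len(corpus)
    tf = Counter(doc_tokens)
    def tfidf(term):
        # assumes every term of the document occurs somewhere in the corpus
        df = sum(1 for d in corpus if term in d)   # document frequency
        return tf[term] * math.log(n / df)         # tf * idf
    return sorted(tf, key=tfidf, reverse=True)[:k]

# Tiny illustrative corpus of token lists:
corpus = [["oduu", "ispoortii", "kubbaa"],
          ["oduu", "dinagdee", "gabaa"],
          ["oduu", "fayyaa"]]
top_features(["oduu", "kubbaa", "kubbaa"], corpus, 2)
# → ["kubbaa", "oduu"]  ("oduu" occurs in every document, so its idf is 0)
```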
The test documents were then classified at level-1, again varying the number of top features from 1 to 20. At level-1 the classification process is the same as above, but it is constrained to consider only the subclasses of the best-matching class at level-0. As shown in Figure 2.4 below, the level-1 classifier had a peak accuracy of 85.07% when the top 10 features were used.
Figure 2.4: The effect of the number of top features selected from test documents on level-1 classification accuracy
Finally, the test documents were classified at level-2, with the classification process now constrained to consider only the subclasses of the best-matching class at level-1. Since all the test documents originally came from level-2 classes, the overall accuracy of the classifier is best judged by the accuracy at level-2. The level-2 classifier had an exact-match precision of 89.06% when the top 15 features were used. This means that, out of a set of 97 classes, the hierarchical classifier correctly assigned 89.06% of the documents to their original class.
Figure 2.5: The effect of the number of top features selected from test documents on level-2 classification accuracy
It is interesting to note that, as we move down the hierarchy, the classifiers perform better with more features extracted from the test documents. This is because they need more information in order to make finer-grained distinctions between the classes.
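The top-down constraint described above (each level considers only the subclasses of the best match at the previous level) can be sketched as follows. The tree fragment, class names and scoring callables are hypothetical stand-ins; the actual system uses the SVM trained at each node:

```python
def classify_top_down(doc, tree, classifiers):
    """Route a document down the category tree; at each node only the
    children of the best-matching class are considered."""
    node, path = "root", []
    while tree.get(node):                  # stop when a leaf class is reached
        scores = classifiers[node](doc)    # per-child scores at this node
        node = max(tree[node], key=lambda c: scores.get(c, 0.0))
        path.append(node)
    return path

# Hypothetical fragment of the category tree (names are illustrative):
tree = {"root": ["sport", "economy"], "economy": ["trade", "finance"]}
# Stand-in scorers returning {child_class: score}:
classifiers = {
    "root":    lambda d: {"economy": 0.9, "sport": 0.1},
    "economy": lambda d: {"trade": 0.8, "finance": 0.2},
}
classify_top_down("gabaa fi daldalaa ...", tree, classifiers)
# → ["economy", "trade"]
```

Because only one branch is explored at each level, each node's classifier deals with a small set of sibling classes rather than all 97 classes at once.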
Comparison between the flat classifier and the hierarchical classifiers
Analyzing the results of the above experiments, as the number of classes and documents increases, the performance of the flat classifier decreases, whereas the performance of the hierarchical classifiers increases as we move down the hierarchy. This shows that flat classification is more sensitive to the number of classes and documents than hierarchical classification. To compare the relative performance of the flat and hierarchical classifiers, the same set of test documents was used. As shown in Figures 2.2 and 2.5, the flat classifier produced an exact-match accuracy of only 68.84% when the top 3 features were used, whereas the hierarchical classifiers correctly classified 89.09% of the documents when the top 15 features were taken. This implies that fewer features are needed to discriminate among a large number of classes when there is no relationship among them, whereas more features are needed to discriminate among classes that are closely similar to one another, as in hierarchical classification. Thus, we can conclude that the use of hierarchy for text classification results in a significant relative improvement of 29.42% in exact-match accuracy over the flat classifier.
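The 29.42% figure is the improvement relative to the flat classifier's own accuracy, as the following check shows:

```python
flat_acc = 68.84    # flat classifier, top 3 features
hier_acc = 89.09    # hierarchical classifiers, top 15 features

# Relative improvement over the flat classifier:
relative_gain = (hier_acc - flat_acc) / flat_acc * 100
# (89.09 - 68.84) / 68.84 * 100 ≈ 29.42
```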
Conclusion
Along with the advancement of information technology, we are flooded by huge amounts of information, which makes it difficult for people to extract useful information from these huge collections. Hence, a proper way of organizing and managing data is needed so that it can be easily accessed by those who seek it. One way is manual document classification, which requires individuals to assign each document to one or more categories. However, as the amount of data and information increases, this approach becomes tedious and time-consuming. An alternative is to organize data and information using automated systems, which are often
referred to as automatic document classification: automatic solutions that classify an electronic document into one or more categories based on the characteristics of its contents. Much research has been done on automatic document classification using different machine learning approaches, with good results. However, most of it focuses on flat classification, i.e. each topic (category) is considered independent of the others, with no relationships defined among them. Even though flat classification has been a well-established research area for the last 30 to 40 years and many good classifiers have been developed, the approach is not a feasible solution for most real-world applications, which need structures that define the relationships among categories. As technologies such as the Internet grow, the number of possible categories increases and the borderlines between document classes become blurred. With a large corpus we may have hundreds of classes and thousands of features, and the computational cost of training a single classifier for a problem of this size is prohibitive. To solve these problems, an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems is proposed; it is often referred to as the hierarchical classification approach. Hierarchical text classification uses a divide-and-conquer (top-down) approach to break the large classification problem into a set of simpler subproblems, one at each node in the classification tree. In such a hierarchical structure, document types become more specific as we go down the hierarchy. Thus, hierarchical classification makes it much easier to access a specific document or group of documents from a hierarchically organized collection than flat classification does. This paper also introduces support vector machines (SVM) for hierarchical text categorization. It
provides both theoretical and empirical evidence that SVMs are very well suited for text categorization in general and hierarchical classification in particular. The theoretical analysis concludes that SVMs acknowledge the particular properties of text: (a) high-dimensional feature spaces, (b) few irrelevant features (a dense concept vector), and (c) sparse instance vectors. The experimental results show that SVMs consistently achieve good performance on hierarchical text categorization tasks, outperforming existing methods substantially and significantly. With their ability to generalize well in high-dimensional feature spaces, SVMs reduce the need for feature selection, making text categorization considerably easier to apply. Another advantage of SVMs over conventional methods is their robustness. Furthermore, SVMs require little manual parameter tuning, since good parameter settings can be found automatically. All this makes SVMs a very promising and easy-to-use method for learning text classifiers from examples. LibSVM was used as the experimentation tool because it supports efficient multiclass classification and automatic model selection and contains different SVM formulations.
The data collected for this study were one-level (flat) data. However, they were preprocessed so that hierarchical (categorically leveled) data were obtained to fit the purpose of the study: a document-similarity matrix and experts' judgment were used to generate the category levels. Three-level hierarchical data were thus obtained, with 8 level-0 classes, 20 level-1 classes and 69 level-2 classes. The experiment followed three approaches:
1. Assessing whether the traditional (flat) classification method depends on the number of classes and features;
2. Constructing hierarchical classifiers at each level of the classification tree and checking whether their performance improves as we move down the hierarchy;
3. Evaluating the classification performance of the traditional flat classifier against the hierarchical classifiers on the same test data.
Accordingly, the following results were obtained from these three approaches. Based on the first approach, the 97 classes were divided into
8, 20 and 69 separate classes with an increasing number of documents in them. The accuracy decreased from 80.34% to 66.09% and then to 52.32% as the number of classes increased from 8 to 20 and then to 69, and as the number of documents increased from 2550 to 3570 and then to 4590, respectively. This shows that as the number of classes and documents increases, the performance of a flat classifier decreases. The reason is that the number of support vectors grows with the number of documents. Since classification is done in a multidimensional space in which many hyperplanes can be drawn, the growing number of support vectors narrows the margin between these hyperplanes; the smaller margin then causes more misclassification errors on unseen test instances, apart from the difficulty of finding the maximum-margin hyperplane (MMH).
According to the second approach, an experiment was done on randomly selected class levels, as shown in Table 2.3. It shows improved performance as we move down the classification tree: the maximum accuracy achieved is at level-2 (the last level of the category tree), 90.37% for the economy class, designated by code 2.2 in Table 2.3. The improved performance of the hierarchical classifiers at each level of the tree arises because each classifier deals only with the documents associated with that class or its subclasses, so it concentrates on a smaller set of documents, those relevant to the task at hand. Hence, the maximum-margin hyperplane can be generated easily on the one hand, and a linear SVM classifier can be applied easily on the other. Moreover, we can deduce that the considerable increase in classifier accuracy results from exploiting the relationships among classes and utilizing the hierarchical topic structure.
Based on the third approach, an experiment was done using the top features selected from the same test documents, with the single flat classifier and with the classifiers at each level of the category tree. Accordingly, the flat
classifier showed its maximum exact-match accuracy when the top 3 features were used, whereas the hierarchical classifiers showed improved exact-match accuracy with an increasing number of top features at each level of the category tree. The flat classifier produced an exact-match accuracy of only 68.84% when the top 3 words were used, whereas the hierarchical classifiers correctly classified 89.09% of the documents when the top 15 features were taken. This means that the use of hierarchy for text classification results in a significant relative improvement of 29.42% in exact-match accuracy over the flat classifier. From this experiment we can see that more words are needed to discriminate classes (topics) that are close to each other in the hierarchy, as they have more in common with each other than classes (topics) that are far apart. Apart from the improved classification performance, classification speed is an advantage of the hierarchical approach using support vector machines. The main limitation of SVMs is the long learning/training time, which can exceed a day and increases with the amount of training data. Finally, the findings of this research could be highly significant for content-based information retrieval, in addition to the other applications discussed in the paper.
References
1. Koller, D., & Sahami, M. Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning. Computer Science Department, Stanford University, 1997.
2. Dumais, S., & Chen, H. Hierarchical classification of Web content. Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
3. Chien, L.-F., Huang, C.-C., & Chuang, S.-L. Creating hierarchical text classifiers through web corpora. WWW '04: Proceedings of the 13th International Conference on World Wide Web, pages 184-192. New York, NY, USA: ACM Press, 2004.
4. Neumann, G., & Schmeier, S. Combining Shallow Text Processing and Machine Learning in Real World Applications. 1999. Available at http://www.dfki.de/~neumann/publications/newps/ijcai99-ws.pdf.
5. Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 2002, pages 1-47.
6. Yang, Y., & Liu, X. A re-examination of text categorization methods. Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, CA, 1999, pages 42-49.
7. Tilahun, G. Qubee Afaan Oromo: Reasons for choosing the Latin script for developing an Afaan Oromo alphabet. Journal of Oromo Studies, 1993.
8. Duwairi, R. Arabic text categorization. The International Arab Journal of Information Technology, 4(2), Jordan University of Science and Technology, Jordan, April 2007.
9. Ranganatan, S. Text classification combining clustering and hierarchical approaches. Masters thesis, Computer Science and Engineering, University of Madras, Chennai, India, 2001.
10. Steinbach, M., Karypis, G., & Kumar, V. A comparison of document clustering techniques. KDD Workshop on Text Mining, 2000.