Compilation, analysis and application of a comprehensive Bangla corpus KUMono
Category:- Journal; Year:- 2022
Discipline:- Computer Science & Engineering Discipline
School:- Science, Engineering & Technology School
Abstract
Research in Natural Language
Processing (NLP) and computational linguistics highly depends on a good quality
representative corpus of any specific language. Bangla is one of the most
spoken languages in the world but Bangla NLP research is in its early stage of
development due to the lack of quality public corpus. This article describes
the detailed compilation methodology of a comprehensive monolingual Bangla
corpus, KUMono (Khulna University Monolingual corpus). The newly developed
corpus consists of more than 350 million word tokens and more than one million
unique tokens from 18 major text categories of online Bangla websites. We have
conducted several word-level and character-level linguistic phenomenon analyses
based on empirical studies of the developed corpus. The corpus follows Zipf’s
curve and hapax legomena rule. The quality of the corpus is also assessed by
analyzing and comparing the inherent sparseness of the corpus with existing
Bangla corpora, by analyzing the distribution of function words of the corpus
and vocabulary growth rate. We have developed a Bangla article categorization
application based on the KUMono corpus and received compelling results by
comparing to the state-of-the-art models.