Corpus Development

Building comprehensive text collections and language databases for Kurdish NLP research

July 20, 2023
Kurdish Language Research
Ongoing

Corpus development for the Kurdish language means building large-scale, annotated text collections that serve as the foundation for natural language processing applications and linguistic research.

Research Scope

This project focuses on building corpora that cover different Kurdish dialects, genres, and domains, providing resources for computational linguistics research.

Corpus Components

  • Multi-dialectal text collections
  • Morphological annotations
  • Syntactic parsing data
  • Semantic role labeling
  • Named entity recognition datasets
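
To make the components above concrete, here is a minimal sketch of how one annotated sentence might be stored. It is an illustration, not the project's actual schema: the class names (`Token`, `EntitySpan`, `Sentence`), the field names, and the placeholder tokens are all hypothetical. The dialect code `kmr` is the ISO 639-3 code for Northern Kurdish (Kurmanji); semantic role labels are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    form: str                 # surface form as it appears in the text
    lemma: str                # dictionary form
    upos: str                 # part-of-speech tag
    feats: Dict[str, str] = field(default_factory=dict)  # morphological features
    head: int = 0             # 1-based index of the syntactic head, 0 = root
    deprel: str = "root"      # dependency relation to the head


@dataclass
class EntitySpan:
    start: int   # index of the first token (0-based, inclusive)
    end: int     # index one past the last token (exclusive)
    label: str   # entity type, e.g. "LOC" or "PER"


@dataclass
class Sentence:
    sent_id: str
    dialect: str              # e.g. "kmr" (Kurmanji) or "ckb" (Sorani)
    domain: str               # e.g. "news", "literature"
    tokens: List[Token]
    entities: List[EntitySpan] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of structural problems (empty if none were found)."""
        problems = []
        n = len(self.tokens)
        roots = sum(1 for t in self.tokens if t.head == 0)
        if roots != 1:
            problems.append(f"expected exactly one root, found {roots}")
        for i, t in enumerate(self.tokens, start=1):
            if not 0 <= t.head <= n:
                problems.append(f"token {i}: head {t.head} out of range")
            if t.head == i:
                problems.append(f"token {i}: token is its own head")
        for e in self.entities:
            if not 0 <= e.start < e.end <= n:
                problems.append(f"entity {e.label}: span out of range")
        return problems


# Placeholder tokens only; no real Kurdish annotation is implied.
example = Sentence(
    sent_id="demo-0001",
    dialect="kmr",
    domain="news",
    tokens=[
        Token("TOKEN_A", "LEMMA_A", "PROPN", head=2, deprel="nsubj"),
        Token("TOKEN_B", "LEMMA_B", "VERB", {"Tense": "Past"}, head=0),
    ],
    entities=[EntitySpan(0, 1, "LOC")],
)
problems = example.validate()
```

Keeping all annotation layers on one record makes it easy to run structural checks like `validate()` before a sentence enters the corpus.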

Data Collection

Kurdish texts are collected systematically from literature, news, social media, and academic publications, with the aim of balanced representation across domains.
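
One simple way to approach balanced representation is to cap how many documents each domain contributes. The sketch below assumes each document is a dictionary with a `domain` key; the function name `balanced_sample` and its parameters are hypothetical and do not describe the project's actual pipeline.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def balanced_sample(docs: List[Dict[str, str]], per_domain: int, seed: int = 0) -> List[Dict[str, str]]:
    """Draw at most `per_domain` documents from every domain."""
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)

    sample = []
    for domain in sorted(by_domain):   # sorted for a deterministic order
        members = by_domain[domain]
        k = min(per_domain, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy input with placeholder text: news is over-represented on purpose.
docs = (
    [{"domain": "news", "text": f"news-{i}"} for i in range(10)]
    + [{"domain": "literature", "text": f"lit-{i}"} for i in range(3)]
    + [{"domain": "social", "text": "social-0"}]
)
counts = Counter(d["domain"] for d in balanced_sample(docs, per_domain=3))
```

Domains with fewer documents than the cap contribute everything they have, so the result still shows where more collection is needed.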

Student Contribution

Fanar is leading the corpus annotation efforts, developing standardized guidelines for linguistic markup and quality assurance protocols.
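
The source does not say which quality checks the protocols use. As one common option, agreement between two annotators on the same items can be measured with Cohen's kappa; the sketch below is an assumption about what such a check could look like, and the function name `cohen_kappa` and the toy labels are illustrative.

```python
from collections import Counter
from typing import List


def cohen_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in freq_a)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy part-of-speech labels from two hypothetical annotators.
annotator_1 = ["NOUN", "VERB", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "VERB", "VERB"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Raw agreement alone can look high when one label dominates; kappa corrects for the agreement expected by chance.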

Research Impact

These corpora are intended to support Kurdish NLP applications such as machine translation, information retrieval, and language modeling, and so to advance Kurdish computational linguistics.

Open Science

The developed corpora will be made available to the research community, fostering collaborative development of Kurdish language technologies.