N-gram Models for Text Generation in Yoruba Language

Natural Language Processing for Low-Resource African Languages with Custom Web-Scraped BBC News Yoruba Corpus and N-gram Language Modeling

Category: AI/ML, Natural Language Processing, Data Engineering
Tools & Technologies: Python, BeautifulSoup4, Requests, Pandas, PyYAML, Multiprocessing, Google Colab, Jupyter Notebook

Status: Completed

Introduction

This project addresses the challenge of Natural Language Processing (NLP) for low-resource African languages by implementing N-gram language models for automatic text generation in Yoruba. It covers a complete data pipeline — from building a custom Yoruba text corpus by scraping BBC News Yoruba articles across multiple categories (sports, politics, health, entertainment, and more) to training N-gram models capable of generating coherent Yoruba text. The automated web scraper uses Python multiprocessing for parallel category scraping and BeautifulSoup4 for HTML parsing, producing a cleaned corpus of 1,281 lines and over 595,000 words. The N-gram models are then trained on this corpus to learn Yoruba language patterns and generate new text sequences.

Project Overview


Aim and Objectives

Aim:
Develop N-gram language models for automatic text generation in Yoruba, a low-resource African language, using a custom-built web-scraped corpus.

Objectives:

  • Build a comprehensive Yoruba text corpus by scraping BBC News Yoruba articles across multiple news categories.
  • Implement an automated, configurable web scraper with parallel processing for efficient data collection.
  • Clean and preprocess the scraped corpus by merging headlines with article text and removing metadata.
  • Train N-gram language models (unigram, bigram, trigram) on the Yoruba corpus.
  • Generate coherent Yoruba text sequences using the trained N-gram models.
  • Evaluate text generation quality and language pattern adherence.

Features & Deliverables

  • Automated Web Scraper: Configurable BBC News Yoruba scraper with multi-category support and automatic topic discovery from homepage.
  • Parallel Processing: Python multiprocessing for concurrent category scraping, significantly reducing data collection time.
  • YAML Configuration: CSS selectors and class names stored in YAML config, making the scraper resilient to BBC site structure changes.
  • CLI Interface: Command-line arguments for language, output file, article count, categories, time delay, and category spread control.
  • Custom Corpus: 1,281 lines, 595,340 words (~3.5 MB) of cleaned Yoruba text from BBC News articles.
  • Article Deduplication: Cross-category deduplication prevents duplicate articles from inflating the corpus.
  • N-gram Modeling: Language model training and text generation pipeline for Yoruba language.
  • Data Cleaning Pipeline: Automated merging of headlines and article body text with metadata removal.
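The YAML-driven selector configuration described above might look like the following sketch. The file layout, keys, and CSS classes here are illustrative assumptions, not the project's actual config:

```python
# Illustrative sketch of externalizing BBC scraper selectors to YAML.
# Keys and CSS classes are assumptions, not the project's real values.
import yaml  # PyYAML

CONFIG_YAML = """
base_url: https://www.bbc.com/yoruba
selectors:
  article_link: "a.promo-link"
  headline: "h1#content"
  paragraphs: "main p"
"""

# When BBC changes its markup, only this YAML needs updating — no code edits.
config = yaml.safe_load(CONFIG_YAML)
print(config["selectors"]["headline"])
```

Because the selectors live in data rather than code, a site redesign becomes a one-line config change instead of a scraper rewrite.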

Process / Methodology

Data Collection

  • Surveyed available Yoruba language datasets including Niger-Volta LTI corpus, HuggingFace FLEURS, CohereForAI Aya collection, and Glot500.
  • Built custom web scraper targeting BBC News Yoruba for domain-specific news text.
  • Scraped articles across multiple categories: Idaraya (Sports), Gbajumọ (Popular), and auto-discovered topic categories.
  • Implemented multiprocessing-based parallel scraping for efficient large-scale data collection.
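The parallel collection step can be sketched with `multiprocessing.Pool`. The category slugs and the `scrape_category` function below are placeholders standing in for the real Requests/BeautifulSoup4 logic:

```python
# Sketch of the parallel category-scraping pattern; scrape_category is a
# placeholder for the real network + HTML-parsing code.
from multiprocessing import Pool

CATEGORIES = ["idaraya", "gbajumo", "ilera"]  # sports, popular, health

def scrape_category(category: str) -> list[str]:
    """Placeholder: fetch article URLs for one category with requests,
    parse each page with BeautifulSoup4, and return the article texts."""
    return [f"{category}-article-{i}" for i in range(2)]

if __name__ == "__main__":
    # One worker per category; Pool.map returns results in input order.
    with Pool(processes=len(CATEGORIES)) as pool:
        per_category = pool.map(scrape_category, CATEGORIES)
    corpus = [text for batch in per_category for text in batch]
    print(len(corpus))
```

Since each category is an independent crawl, this pattern scales roughly linearly with worker count up to the number of categories.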

Data Preprocessing

  • Merged article headlines with body text for contextual completeness.
  • Removed metadata columns (category, URL) to produce clean text-only corpus.
  • Deduplicated articles that appeared across multiple BBC News categories.
  • Exported final corpus as clean text file suitable for language model training.
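A minimal Pandas sketch of these cleaning steps, assuming illustrative column names (`headline`, `body`, `category`, `url`) rather than the project's actual schema:

```python
# Toy version of the cleaning pipeline: merge headline + body,
# deduplicate cross-category articles, drop metadata columns.
import pandas as pd

df = pd.DataFrame({
    "headline": ["Akanse ere", "Akanse ere", "Iroyin ilera"],
    "body": ["ara ilu...", "ara ilu...", "dokita so..."],
    "category": ["idaraya", "gbajumo", "ilera"],  # same article, two categories
    "url": ["u1", "u2", "u3"],
})

# 1. Merge headline with body text for contextual completeness.
df["text"] = df["headline"] + " " + df["body"]
# 2. Drop the article scraped under two different categories.
df = df.drop_duplicates(subset="text")
# 3. Keep only the text column — category and URL metadata removed.
corpus_lines = df["text"].tolist()
print(len(corpus_lines))  # 2
```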

Model Training & Generation

  • Tokenized Yoruba text corpus respecting diacritical marks and tone markers.
  • Built N-gram frequency distributions from the preprocessed corpus.
  • Implemented text generation using probability-based next-word prediction.
  • Evaluated generated text for grammatical coherence and Yoruba language patterns.
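The train-and-generate loop above can be illustrated with a toy bigram model using only the standard library; the sample sentence and the frequency-weighted sampling scheme are illustrative assumptions, not the project's exact implementation:

```python
# Toy bigram model: build frequency distributions, then generate text by
# sampling each next word in proportion to its bigram count.
import random
from collections import Counter, defaultdict

corpus = "mo fẹ́ jẹun mo fẹ́ sùn mo ti jẹun".split()

# word -> Counter of the words observed to follow it
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def generate(start: str, length: int, seed: int = 0) -> list[str]:
    """Sample a word sequence, weighting choices by bigram frequency."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = bigrams.get(out[-1])
        if not followers:          # dead end: no observed continuation
            break
        words, counts = zip(*followers.items())
        out.append(rng.choices(words, weights=counts)[0])
    return out

print(generate("mo", 4))
```

The same structure extends to trigrams by keying the distribution on word pairs instead of single words.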

Challenges & Solutions

  • Challenge: Limited availability of large-scale Yoruba text datasets for NLP research.
    Solution: Built custom web scraper to create purpose-built corpus from BBC News Yoruba, yielding 595K+ words.
  • Challenge: Yoruba diacritical marks and tone markers complicating text tokenization.
    Solution: Implemented Unicode-aware text processing to preserve Yoruba orthographic features.
  • Challenge: BBC website structure changes breaking scraper CSS selectors.
    Solution: Externalized all selectors to YAML configuration file for easy updates without code changes.
  • Challenge: Duplicate articles appearing across multiple news categories.
    Solution: Implemented article-level deduplication based on content hashing before corpus compilation.
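The diacritics challenge above comes down to the fact that one Yoruba syllable can be encoded as different Unicode byte sequences. A sketch of Unicode-aware handling with `unicodedata` — using NFC normalization here is an assumption about the approach, not the project's confirmed implementation:

```python
# Sketch: normalize Yoruba text so equivalent diacritic spellings
# compare (and therefore tokenize) identically.
import unicodedata

# Two byte-level spellings of the same syllable "fẹ́":
nfc_form = "f\u1eb9\u0301"    # f + ẹ (U+1EB9) + combining acute
raw_form = "fe\u0323\u0301"   # f + e + combining dot below + combining acute

def normalize(text: str) -> str:
    """Normalize to NFC so equivalent spellings become identical strings."""
    return unicodedata.normalize("NFC", text)

print(nfc_form == raw_form)                        # False: raw strings differ
print(normalize(nfc_form) == normalize(raw_form))  # True after normalization
```

Without this step, the same word would split into multiple token types and fragment the N-gram counts.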

Results & Impact

  • Corpus Size: Successfully built a 1,281-line, 595,340-word Yoruba text corpus from BBC News articles.
  • Reusable Scraper: Configurable web scraping tool adaptable to other BBC language services.
  • NLP Contribution: Provided resources and methodology for NLP research on low-resource African languages.
  • Text Generation: N-gram models produced recognizable Yoruba language patterns and word sequences.

Future Enhancements

  • Expand corpus with additional Yoruba text sources (books, social media, government documents).
  • Implement transformer-based language models (GPT-style) for improved text generation quality.
  • Add language evaluation metrics (perplexity, BLEU score) for quantitative model assessment.
  • Extend methodology to other low-resource Nigerian languages (Igbo, Hausa, Pidgin).

Demonstration / Access

  • GitHub Repository: Coming soon
  • Live Demonstration: Coming soon

Thank You for Visiting My Portfolio

I sincerely appreciate you taking the time to explore my portfolio and learn about my work and expertise. If you have any questions or wish to discuss potential collaborations, please feel free to reach out via the Contact section.

Best regards,
Damilare Lekan, Adekeye.