N-gram Models for Text Generation in Yoruba Language
Natural Language Processing for Low-Resource African Languages with Custom Web-Scraped BBC News Yoruba
Corpus and N-gram Language Modeling
Category: AI/ML, Natural Language Processing, Data Engineering
Tools & Technologies: Python, BeautifulSoup4, Requests, Pandas, PyYAML, Multiprocessing, Google Colab,
Jupyter Notebook
Status: Completed
Introduction
This project addresses the challenge of Natural Language Processing (NLP) for low-resource African
languages by implementing N-gram language models for automatic text generation in Yoruba. The
project includes a complete data pipeline — from building a custom Yoruba text corpus by scraping
BBC News Yoruba articles across multiple categories (sports, politics, health, entertainment, and
more) to training N-gram models capable of generating coherent Yoruba text. The automated web
scraper uses Python multiprocessing for parallel category scraping and BeautifulSoup4 for HTML
parsing, producing a cleaned corpus of over 1,280 lines and 595,000 words. The N-gram models are
then trained on this corpus to learn Yoruba language patterns and generate new text sequences.
Project Overview
Aim and Objectives
Aim:
Develop N-gram language models for automatic text generation in Yoruba, a
low-resource African language, using a custom-built web-scraped corpus.
Objectives:
- Build a comprehensive Yoruba text corpus by scraping BBC News Yoruba articles across multiple
news categories.
- Implement an automated, configurable web scraper with parallel processing for efficient data
collection.
- Clean and preprocess the scraped corpus by merging headlines with article text and removing
metadata.
- Train N-gram language models (unigram, bigram, trigram) on the Yoruba corpus.
- Generate coherent Yoruba text sequences using the trained N-gram models.
- Evaluate text generation quality and language pattern adherence.
Features & Deliverables
- Automated Web Scraper: Configurable BBC News Yoruba scraper with multi-category
support and automatic topic discovery from homepage.
- Parallel Processing: Python multiprocessing for concurrent category scraping,
significantly reducing data collection time.
- YAML Configuration: CSS selectors and class names stored in YAML config, making
the scraper resilient to BBC site structure changes.
- CLI Interface: Command-line arguments for language, output file, article count,
categories, time delay, and category spread control.
- Custom Corpus: 1,281 lines, 595,340 words (~3.5 MB) of cleaned Yoruba text from
BBC News articles.
- Article Deduplication: Cross-category deduplication prevents duplicate articles
from inflating the corpus.
- N-gram Modeling: Language model training and text generation pipeline for
Yoruba language.
- Data Cleaning Pipeline: Automated merging of headlines and article body text
with metadata removal.
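The YAML-driven configuration and CLI described above can be sketched as follows. The selector names, flag names, and defaults here are illustrative assumptions, not the project's exact interface:

```python
import argparse
import yaml  # PyYAML; selectors live in config, not code

# Illustrative YAML config; the real selector names depend on BBC's current markup.
CONFIG_YAML = """
selectors:
  article_link: "a.promo-link"
  headline: "h1#content"
  body: "div[data-component='text-block']"
"""

def load_selectors(text: str) -> dict:
    """Selectors come from YAML, so a BBC markup change needs no code edit."""
    return yaml.safe_load(text)["selectors"]

def build_parser() -> argparse.ArgumentParser:
    # Flag names are assumptions mirroring the features listed above.
    p = argparse.ArgumentParser(description="BBC News Yoruba scraper (sketch)")
    p.add_argument("--language", default="yoruba")
    p.add_argument("--output", default="corpus.txt")
    p.add_argument("--articles", type=int, default=100, help="total article count")
    p.add_argument("--categories", nargs="*", default=None,
                   help="categories to scrape; default: auto-discover from homepage")
    p.add_argument("--delay", type=float, default=1.0, help="seconds between requests")
    p.add_argument("--spread", action="store_true",
                   help="spread articles evenly across categories")
    return p

selectors = load_selectors(CONFIG_YAML)
args = build_parser().parse_args(["--articles", "50", "--delay", "2"])
print(selectors["headline"], args.articles)
```

Keeping selectors in YAML means a site redesign only requires editing the config file, not redeploying code.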
Process / Methodology
Data Collection
- Surveyed available Yoruba language datasets including Niger-Volta LTI corpus, HuggingFace
FLEURS, CohereForAI Aya collection, and Glot500.
- Built custom web scraper targeting BBC News Yoruba for domain-specific news text.
- Scraped articles across multiple categories: Idaraya (Sports), Gbajumọ (Popular), and
auto-discovered topic categories.
- Implemented multiprocessing-based parallel scraping for efficient large-scale data collection.
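The parallel collection step above can be sketched with `multiprocessing.Pool`. The fetch logic is stubbed so the example runs offline; the real worker would use Requests and BeautifulSoup4:

```python
from multiprocessing import Pool

def scrape_category(category: str) -> list[str]:
    # The real worker fetches the category page with Requests, extracts
    # article links with BeautifulSoup4, and returns article texts.
    # Stubbed here so the sketch runs without network access.
    return [f"{category}-article-{i}" for i in range(3)]

def scrape_all(categories: list[str]) -> list[str]:
    # One task per category; Pool.map preserves input order.
    with Pool(processes=min(len(categories), 4)) as pool:
        per_category = pool.map(scrape_category, categories)
    # Flatten per-category batches into a single article list.
    return [article for batch in per_category for article in batch]

if __name__ == "__main__":
    articles = scrape_all(["idaraya", "gbajumo", "ilera"])
    print(len(articles))  # 9 with the stubbed worker
```

Because each category is an independent page tree, categories parallelize cleanly, while the per-request delay is respected inside each worker.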
Data Preprocessing
- Merged article headlines with body text for contextual completeness.
- Removed metadata columns (category, URL) to produce clean text-only corpus.
- Deduplicated articles that appeared across multiple BBC News categories.
- Exported final corpus as clean text file suitable for language model training.
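The cleaning steps above can be sketched in Pandas; the column names are assumptions about the scraper's output schema, not the project's exact one:

```python
import pandas as pd

# Toy frame mimicking scraper output (schema assumed for illustration).
df = pd.DataFrame({
    "category": ["idaraya", "gbajumo", "idaraya"],
    "url": ["u1", "u2", "u1"],
    "headline": ["Akọle kan", "Akọle keji", "Akọle kan"],
    "body": ["Ara ọrọ...", "Ara ọrọ miiran...", "Ara ọrọ..."],
})

# Drop articles that appeared under more than one category (same URL).
df = df.drop_duplicates(subset="url")
# Merge headline with body for contextual completeness.
df["text"] = df["headline"] + ". " + df["body"]
# Discard metadata columns, keeping a text-only corpus.
corpus = df["text"].tolist()
print(len(corpus))
```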
Model Training & Generation
- Tokenized Yoruba text corpus respecting diacritical marks and tone markers.
- Built N-gram frequency distributions from the preprocessed corpus.
- Implemented text generation using probability-based next-word prediction.
- Evaluated generated text for grammatical coherence and Yoruba language patterns.
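A minimal version of the training and generation steps might look like this: count-based n-gram tables with next-word sampling proportional to observed frequency. The context size generalizes to unigram/bigram/trigram via `n`; the sample sentence is a toy, not project data:

```python
import random
from collections import Counter, defaultdict

def train_ngrams(tokens, n=2):
    """Map each (n-1)-token context to a Counter of observed next tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, nxt = tuple(tokens[i : i + n - 1]), tokens[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, n, start, length=10, seed=0):
    """Sample next words with probability proportional to their counts."""
    rng = random.Random(seed)
    out = list(start)
    for _ in range(length):
        context = tuple(out[-(n - 1):]) if n > 1 else ()
        counts = model.get(context)
        if not counts:
            break  # unseen context: stop (a fuller model would back off)
        words, weights = zip(*counts.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

# Tiny illustrative sentence; the real model trains on the 595K-word corpus.
# Python strings are Unicode, so Yoruba diacritics and tone marks survive tokenization.
tokens = "ọmọ náà lọ sí ilé ìwé ọmọ náà fẹ́ràn ìwé".split()
bigram = train_ngrams(tokens, n=2)
print(generate(bigram, 2, ("ọmọ",), length=5))
```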
Challenges & Solutions
- Challenge: Limited availability of large-scale Yoruba text datasets for NLP
research.
Solution: Built custom web scraper to create purpose-built corpus from BBC News
Yoruba, yielding 595K+ words.
- Challenge: Yoruba diacritical marks and tone markers complicating text
tokenization.
Solution: Implemented Unicode-aware text processing to preserve Yoruba
orthographic features.
- Challenge: BBC website structure changes breaking scraper CSS selectors.
Solution: Externalized all selectors to YAML configuration file for easy
updates without code changes.
- Challenge: Duplicate articles appearing across multiple news categories.
Solution: Implemented article-level deduplication based on content hashing
before corpus compilation.
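The content-hash deduplication described in the last challenge can be sketched as follows. The normalization (strip + lowercase) is an assumption about what counts as "the same" article:

```python
import hashlib

def dedupe(articles: list[str]) -> list[str]:
    """Keep the first occurrence of each article, keyed by a hash of its text.

    Hashing normalized text (rather than comparing URLs) also catches the
    same story republished under different category URLs.
    """
    seen: set[str] = set()
    unique = []
    for text in articles:
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(len(dedupe(["Ìròyìn kan", "Ìròyìn kan", "Ìròyìn míì"])))
```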
Results & Impact
- Corpus Size: Successfully built a 1,281-line, 595,340-word Yoruba text corpus
from BBC News articles.
- Reusable Scraper: Configurable web scraping tool adaptable to other BBC
language services.
- NLP Contribution: Provided resources and methodology for NLP research on
low-resource African languages.
- Text Generation: N-gram models produced recognizable Yoruba language patterns
and word sequences.
Future Enhancements
- Expand corpus with additional Yoruba text sources (books, social media, government documents).
- Implement transformer-based language models (GPT-style) for improved text generation quality.
- Add language evaluation metrics (perplexity, BLEU score) for quantitative model assessment.
- Extend methodology to other low-resource Nigerian languages (Igbo, Hausa, Pidgin).
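As an illustration of the perplexity metric proposed above, a bigram perplexity with add-one (Laplace) smoothing might look like this; the sentences are toy data, not project results:

```python
import math
from collections import Counter, defaultdict

def train_bigrams(tokens):
    model = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        model[(a,)][b] += 1
    return model

def perplexity(model, tokens, vocab_size):
    """Per-word perplexity on held-out text; lower means a better fit."""
    log_prob = 0.0
    for i in range(1, len(tokens)):
        counts = model.get((tokens[i - 1],), {})
        total = sum(counts.values())
        # Add-one smoothing gives unseen continuations nonzero probability.
        p = (counts.get(tokens[i], 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

train = "ọmọ lọ sí ilé ọmọ lọ sí ọjà".split()
model = train_bigrams(train)
held_out = "ọmọ lọ sí ilé".split()
print(round(perplexity(model, held_out, vocab_size=len(set(train))), 3))
```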
Demonstration / Access
- GitHub Repository: Coming soon
- Live Demonstration: Coming soon