A unified pipeline to download, preprocess, quality-assess, merge, and push multilingual parallel corpora to the Hugging Face Hub. Supports sources such as Hugging Face Datasets, GitHub, and OPUS.
- Download datasets from:
  - Hugging Face Hub (`hf`)
  - GitHub (`github`)
  - OPUS (`opus`)
- Preprocessing:
  - Rule-based filtering
  - Semantic filtering
  - Language detection filtering
- Quality Assessment:
  - Model-based quality estimation
- Merge and Push:
  - Select datasets that meet the quality bar
  - Combine all processed datasets into one
  - Push the merged dataset to the Hugging Face Hub
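The rule-based and semantic filtering steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: function names, defaults, and the toy embeddings are assumptions, and a real semantic filter would embed both sides with a multilingual encoder (e.g., LaBSE) rather than use stand-in vectors.

```python
import numpy as np

def rule_filter(pairs, min_length=1, max_length=200, max_length_ratio=2.0):
    """Drop pairs with out-of-bounds token counts or skewed length ratios."""
    kept = []
    for src, tgt in pairs:
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if not (min_length <= src_len <= max_length and
                min_length <= tgt_len <= max_length):
            continue
        if max(src_len, tgt_len) / max(1, min(src_len, tgt_len)) > max_length_ratio:
            continue
        kept.append((src, tgt))
    return kept

def semantic_filter(src_emb, tgt_emb, threshold=0.8, chunk_size=1024):
    """Keep indices whose source/target embeddings have cosine
    similarity >= threshold, processed chunk_size rows at a time."""
    keep = []
    for start in range(0, len(src_emb), chunk_size):
        s = src_emb[start:start + chunk_size]
        t = tgt_emb[start:start + chunk_size]
        sims = np.sum(s * t, axis=1) / (
            np.linalg.norm(s, axis=1) * np.linalg.norm(t, axis=1))
        keep.extend(int(start + i) for i in np.nonzero(sims >= threshold)[0])
    return keep

pairs = [
    ("Hello world", "Bonjour le monde"),   # kept: 2 vs 3 tokens
    ("one two three four five six", "x"),  # dropped: length ratio 6.0
]
print(rule_filter(pairs))
```

In practice the two filters compose: rule filtering is cheap and runs first, so the expensive embedding pass only sees pairs that already look plausible.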
- Clone the repository

  ```bash
  git clone https://github.com/amaneth/mt-data-processing.git
  cd mt-data-processing
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Configure settings

  Modify the `config.yaml` file to define your language pair, data sources, and pipeline settings.

- Preprocess the dataset

  ```bash
  python process.py --config am_config.yaml
  ```
- Merge the datasets

  ```bash
  python merge.py --datasets dataset1 dataset2 ...
  ```

- Push to Hugging Face Hub

  ```bash
  python push_to_hub.py --dataset data
  ```
The `config.yaml` file controls the entire pipeline. Here's an overview of its sections:
- `lang_pair`: Source and target languages (e.g., `en-am`)
- `sources`: List of dataset types to include (`hf`, `github`, `opus`)
- For `hf` sources:
  - `name`: Identifier for the dataset
  - `path`: HF dataset ID
  - `split`: Train/test/dev split
  - `config_name`: Config name, if needed
  - `src_col` / `tgt_col`: Source and target language fields
- For `github` sources:
  - `name`: Identifier for the dataset
  - `src_url`: URL to the source-language file
  - `tgt_url`: URL to the target-language file
- For `opus` sources:
  - `name`: Identifier for the dataset
  - `url`: Download URL
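For illustration, a source configuration might look like the sketch below. The nesting is an assumption about the YAML layout; the field names come from the lists above, but all dataset names, IDs, and URLs are placeholders.

```yaml
sources:
  - hf
  - github
  - opus

hf:
  - name: example-hf-corpus          # placeholder identifier
    path: user/dataset-id            # placeholder HF dataset ID
    split: train
    config_name: default
    src_col: en
    tgt_col: am

github:
  - name: example-github-corpus
    src_url: https://example.com/data.en
    tgt_url: https://example.com/data.am

opus:
  - name: example-opus-corpus
    url: https://example.com/corpus.zip
```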
- `pipelines`: List of preprocessing steps, e.g., `rule_filter`, `semantic_filter`
- `from_cache`: If true, checks whether the dataset has already been preprocessed in `save_dir` and skips preprocessing
- Rule filter:
  - `min_length`, `max_length`: Sentence length bounds
  - `max_length_ratio`: Maximum source/target length ratio
- Semantic filter:
  - `threshold`: Similarity threshold
  - `chunk_size`: Batch size for filtering
- Language detect filter:
  - `batch_size`: Batch size for fastText processing
  - `min_score`: Threshold value for filtering
- `prefix`: Prefix for filtered dataset files
- `format`: Final dataset format (e.g., `json`, `csv`, `parquet`)
- `save_dir`: Output directory for saving results
- `log_file`: Log filename
- `log_dir`: Directory for storing logs
- `level`: Log level (`INFO`, `DEBUG`, etc.)
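Putting the pieces together, a hypothetical `config.yaml` might look like the following sketch. Only the individual keys are documented above; the section names (such as `output` and `logging`), the nesting, and all values are assumptions for illustration.

```yaml
pipelines:
  - rule_filter
  - semantic_filter
  - lang_detect_filter
from_cache: true

rule_filter:
  min_length: 1
  max_length: 200
  max_length_ratio: 2.0

semantic_filter:
  threshold: 0.8
  chunk_size: 1024

lang_detect_filter:
  batch_size: 64
  min_score: 0.5

output:
  prefix: filtered
  format: parquet
  save_dir: ./data

logging:
  log_file: pipeline.log
  log_dir: ./logs
  level: INFO
```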