A unified pipeline to download, preprocess, quality-assess, merge, and push multilingual parallel corpora to the Hugging Face Hub. Supports sources such as Hugging Face Datasets, GitHub, and OPUS.
- Download datasets from:
  - Hugging Face Hub (`hf`)
  - GitHub (`github`)
  - OPUS (`opus`)
- Preprocessing:
  - Rule-based filtering
  - Semantic filtering
  - Language detection filtering
- Quality Assessment:
  - Model-based quality estimation
- Merge and Push:
  - Select datasets that meet the quality bar
  - Combine all processed datasets into one
  - Push the merged dataset to the Hugging Face Hub
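The rule-based and semantic filtering steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: function names, defaults, and the toy embeddings are assumptions, and a real semantic filter would embed both sides with a multilingual encoder (e.g., LaBSE) rather than use stand-in vectors.

```python
import numpy as np

def rule_filter(pairs, min_length=1, max_length=200, max_length_ratio=2.0):
    """Drop pairs with out-of-bounds token counts or skewed length ratios."""
    kept = []
    for src, tgt in pairs:
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if not (min_length <= src_len <= max_length and
                min_length <= tgt_len <= max_length):
            continue
        if max(src_len, tgt_len) / max(1, min(src_len, tgt_len)) > max_length_ratio:
            continue
        kept.append((src, tgt))
    return kept

def semantic_filter(src_emb, tgt_emb, threshold=0.8, chunk_size=1024):
    """Keep indices whose source/target embeddings have cosine
    similarity >= threshold, processed chunk_size rows at a time."""
    keep = []
    for start in range(0, len(src_emb), chunk_size):
        s = src_emb[start:start + chunk_size]
        t = tgt_emb[start:start + chunk_size]
        sims = np.sum(s * t, axis=1) / (
            np.linalg.norm(s, axis=1) * np.linalg.norm(t, axis=1))
        keep.extend(int(start + i) for i in np.nonzero(sims >= threshold)[0])
    return keep

pairs = [
    ("Hello world", "Bonjour le monde"),   # kept: 2 vs 3 tokens
    ("one two three four five six", "x"),  # dropped: length ratio 6.0
]
print(rule_filter(pairs))
```

In practice the two filters compose: rule filtering is cheap and runs first, so the expensive embedding pass only sees pairs that already look plausible.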
- Clone the repository

  ```bash
  git clone https://github.com/amaneth/mt-data-processing.git
  cd mt-data-processing
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Configure settings

  Modify the `config.yaml` file to define your language pair, data sources, and pipeline settings.

- Preprocess the dataset

  ```bash
  python process.py --config am_config.yaml
  ```
- Merge the datasets

  ```bash
  python merge.py --datasets dataset1 dataset2 ...
  ```

- Push to Hugging Face Hub

  ```bash
  python push_to_hub.py --dataset data
  ```
The `config.yaml` file controls the entire pipeline. Here's an overview of its sections:
- `lang_pair`: Source and target languages (e.g., `en-am`)
- `sources`: List of dataset types to include (`hf`, `github`, `opus`)
- For `hf` sources:
  - `name`: Identifier for the dataset
  - `path`: HF dataset ID
  - `split`: Train/test/dev split
  - `config_name`: Config name, if needed
  - `src_col` / `tgt_col`: Source and target language fields
- For `github` sources:
  - `name`: Identifier for the dataset
  - `src_url`: URL to the source-language file
  - `tgt_url`: URL to the target-language file
- For `opus` sources:
  - `name`: Identifier for the dataset
  - `url`: Download URL
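For illustration, a source configuration might look like the sketch below. The nesting is an assumption about the YAML layout; the field names come from the lists above, but all dataset names, IDs, and URLs are placeholders.

```yaml
sources:
  - hf
  - github
  - opus

hf:
  - name: example-hf-corpus          # placeholder identifier
    path: user/dataset-id            # placeholder HF dataset ID
    split: train
    config_name: default
    src_col: en
    tgt_col: am

github:
  - name: example-github-corpus
    src_url: https://example.com/data.en
    tgt_url: https://example.com/data.am

opus:
  - name: example-opus-corpus
    url: https://example.com/corpus.zip
```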
- `pipelines`: List of preprocessing steps, e.g., `rule_filter`, `semantic_filter`
- `from_cache`: If true, checks whether the dataset has already been preprocessed in `save_dir` and skips preprocessing
- Rule filter:
  - `min_length`, `max_length`: Sentence length bounds
  - `max_length_ratio`: Maximum source/target length ratio
- Semantic filter:
  - `threshold`: Similarity threshold
  - `chunk_size`: Batch size for filtering
- Language detect filter:
  - `batch_size`: Batch size for fastText processing
  - `min_score`: Threshold value for filtering
- `prefix`: Prefix for filtered dataset files
- `format`: Final dataset format (e.g., `json`, `csv`, `parquet`)
- `save_dir`: Output directory for saving results
- `log_file`: Log filename
- `log_dir`: Directory for storing logs
- `level`: Log level (`INFO`, `DEBUG`, etc.)
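Putting the pieces together, a hypothetical `config.yaml` might look like the following sketch. Only the individual keys are documented above; the section names (such as `output` and `logging`), the nesting, and all values are assumptions for illustration.

```yaml
pipelines:
  - rule_filter
  - semantic_filter
  - lang_detect_filter
from_cache: true

rule_filter:
  min_length: 1
  max_length: 200
  max_length_ratio: 2.0

semantic_filter:
  threshold: 0.8
  chunk_size: 1024

lang_detect_filter:
  batch_size: 64
  min_score: 0.5

output:
  prefix: filtered
  format: parquet
  save_dir: ./data

logging:
  log_file: pipeline.log
  log_dir: ./logs
  level: INFO
```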