Fix syntax errors and stabilize dataset preprocessing/tokenizer creation #19

Open

Ankitaghavate wants to merge 1 commit into ML4SCI:main from Ankitaghavate:fix-preprocessing-and-tokenizer-errors

Ankitaghavate commented Jan 31, 2026

Summary

This PR fixes syntax errors and runtime failures in the preprocessing and tokenizer creation pipeline without changing the original logic.

Changes

Fixed syntax errors in print() statements
Removed non-Python text causing runtime failure
Ensured dataset directories are created before saving files
Added encoding safety for CSV reading
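
For reviewers, here is a minimal sketch of the last two items in the list above. It is not the code from this PR. The helper names `read_csv_rows` and `save_lines` and the fallback encoding order are assumptions made for illustration. It creates the parent directory before writing and retries the CSV read with a fallback encoding when decoding fails.

```python
import csv
from pathlib import Path


def read_csv_rows(path, encodings=("utf-8-sig", "latin-1")):
    """Read a CSV file, trying each encoding in turn (hypothetical helper).

    "utf-8-sig" handles UTF-8 with or without a BOM; "latin-1" can decode
    any byte sequence, so it serves as a last-resort fallback.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as f:
                return list(csv.reader(f))
        except UnicodeDecodeError as err:
            last_error = err  # try the next encoding
    raise last_error


def save_lines(lines, out_path):
    """Write lines to out_path, creating missing parent directories first."""
    out_path = Path(out_path)
    # exist_ok=True makes this safe to call when the directory already exists.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```

Because `latin-1` never raises a decode error, a file in some other encoding would load with wrong characters instead of failing, so the fallback list should match the encodings the dataset is actually expected to use.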

Notes

No logic or data-processing behavior was modified
Changes are limited to bug fixes and stability improvements

Checklist

Code runs without syntax errors
No logic changes introduced
Existing structure preserved


Fix syntax errors and ensure tokenizer preprocessing runs correctly

9428c98
