Flow Preprocessing Package

Python package for preprocessing PageXML datasets for OCR/HTR tasks with HuggingFace integration.

Features

✅ Process ZIP files (local or remote URLs)
✅ Process HuggingFace datasets
✅ Optional image segmentation with YOLO (GPU-accelerated)
✅ Multiple export modes (line, region, text, window, raw_xml)
✅ Train/test splitting
✅ Line filtering by dimensions
✅ Direct upload to HuggingFace Hub
✅ FastAPI-compatible (non-blocking async with asyncio.to_thread())
✅ GPU support with optimal performance
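
The non-blocking behaviour mentioned in the feature list relies on asyncio.to_thread(), which is part of the standard library (Python 3.9+). The following is a minimal sketch of that pattern in isolation; blocking_preprocess and handle_request are hypothetical stand-ins and not functions from this package.

```python
import asyncio
import time


def blocking_preprocess(path: str) -> str:
    """Hypothetical stand-in for blocking work; not the package's code."""
    time.sleep(0.1)  # simulate CPU- or I/O-bound processing
    return f"processed:{path}"


async def handle_request(path: str) -> str:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_preprocess, path)


async def main() -> list[str]:
    # Both jobs run in worker threads and overlap instead of running back to back.
    return await asyncio.gather(handle_request("a.zip"), handle_request("b.zip"))


results = asyncio.run(main())
print(results)
```

In a FastAPI handler the same await asyncio.to_thread(...) call would sit inside the async endpoint function, so other requests keep being served while the work runs.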

Quick Start

Option 1: Using PreprocessorConfig (Explicit)

import asyncio

from flow_preprocessor import ZipPreprocessor
from flow_preprocessor.preprocessing_logic.config import PreprocessorConfig

# Create configuration
config = PreprocessorConfig(
    huggingface_target_repo_name="username/dataset-name",
    huggingface_token="your_hf_token",
    export_mode="line",
    min_width_line=40,
)


async def main() -> None:
    # Create and run the preprocessor; preprocess() is a coroutine and must be awaited
    preprocessor = ZipPreprocessor("path/to/data.zip", config)
    repo_url = await preprocessor.preprocess()
    print(f"Dataset available at: {repo_url}")


# A bare top-level await only works in a notebook or async REPL; scripts need asyncio.run()
asyncio.run(main())
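
The snippet above hard-codes the token for brevity. A common alternative is to read it from an environment variable, sketched here under assumptions: read_hf_token is a hypothetical helper, and HF_TOKEN is the variable name used by the huggingface_hub library. Whether this package reads that variable on its own is not stated here, so the sketch returns the value for you to pass explicitly.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of flow_preprocessor.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

The result can then be passed as huggingface_token=read_hf_token() when building the configuration.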

Option 2: Using Builder Pattern (Fluent API)

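import asyncio

from flow_preprocessor import PreprocessorBuilder

# Build the preprocessor with the fluent API
preprocessor = (PreprocessorBuilder("username/dataset-name")
    .with_token("your_hf_token")
    .with_export_mode("line")
    .with_line_filtering(min_width=40)
    .build_for_zip("path/to/data.zip"))


async def main() -> None:
    # preprocess() is a coroutine and must be awaited
    repo_url = await preprocessor.preprocess()
    print(f"Dataset available at: {repo_url}")


# A bare top-level await only works in a notebook or async REPL; scripts need asyncio.run()
asyncio.run(main())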
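
To illustrate what a minimum line width of 40 could mean in practice, here is a rough sketch that filters PageXML text lines by the horizontal extent of their Coords polygon. It assumes the threshold is in pixels and is compared against the bounding-box width; the package's actual filtering logic may differ. The helper names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# PAGE content namespace (2013-07-15 schema version assumed for this sample).
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p1.jpg" imageWidth="1000" imageHeight="800">
    <TextRegion id="r1">
      <TextLine id="l1"><Coords points="10,10 300,10 300,40 10,40"/></TextLine>
      <TextLine id="l2"><Coords points="10,60 35,60 35,90 10,90"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def line_width(points: str) -> int:
    """Width of the bounding box around a 'x1,y1 x2,y2 ...' polygon string."""
    xs = [int(pair.split(",")[0]) for pair in points.split()]
    return max(xs) - min(xs)


def lines_at_least(xml_text: str, min_width: int) -> list[str]:
    """Return the ids of TextLine elements that are at least min_width wide."""
    root = ET.fromstring(xml_text)
    kept = []
    for line in root.iterfind(".//pc:TextLine", PAGE_NS):
        coords = line.find("pc:Coords", PAGE_NS)
        points = coords.get("points") if coords is not None else None
        if not points:
            continue  # skip lines without usable coordinates
        if line_width(points) >= min_width:
            kept.append(line.get("id"))
    return kept


kept_ids = lines_at_least(SAMPLE, min_width=40)
print(kept_ids)
```

In the sample, line l1 spans 290 pixels and is kept, while l2 spans 25 pixels and is dropped.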

Installation

Install with pip:

# Clone the repository
git clone <repository-url>
cd package-preprocessing

# Install the package
pip install .

Install with uv:

# Clone the repository and navigate to the directory
git clone <repository-url>
cd package-preprocessing

# Install with uv
uv pip install .
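
After either install route, a quick way to confirm that the package is importable is to look up its module spec. This is a small sketch using only the standard library; is_installed is a hypothetical helper, and flow_preprocessor is the import name used in the Quick Start examples.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current path."""
    return importlib.util.find_spec(module_name) is not None


print("flow_preprocessor installed:", is_installed("flow_preprocessor"))
```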
