Refactor Dataset Pipeline: Modular LensDataset & Transforms #128

Open

BeathovenGala wants to merge 2 commits into ML4SCI:main from BeathovenGala:refactor/dataset-pipeline

Conversation

@BeathovenGala

Description for issue #126

The current implementation in `dataset/preprocessing_model_2.py` tightly couples data loading, hardcoded category logic (e.g., `if file_name.startswith('axion')`), and transformations. The result is a fragile, duplicated, and untestable codebase.

Issues with current approach:

  • Fragile: Breaks on new .npy datasets whose file names don't match the hardcoded prefixes.
  • Duplicated: Min-Max normalization logic is copied across 3 different files, multiplying the risk of bugs (e.g., unhandled division by zero).
  • Untestable: Loading and processing are not isolated, so transformations cannot be verified independently.

Fixes

This PR refactors the pipeline into modular, reusable components:

  • Introduced `LensDataset` (pure loading) and `WrapperDataset` (resolves categories dynamically via config, removing the hardcoded logic); see the sketch after this list.
  • Added `get_transforms(config)` for a modular, configuration-driven augmentation pipeline (second sketch below).
  • Added comprehensive test suites for the datasets and pipelines.
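
To make the split concrete, here is a minimal sketch of how the two pieces could fit together. Only the names `LensDataset` and `WrapperDataset` and the idea of a category config come from this PR; the constructor signatures, the `config["categories"]` key, and all internals below are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
from pathlib import Path
from torch.utils.data import Dataset


class LensDataset(Dataset):
    """Pure loading: reads .npy files from a directory and nothing else."""

    def __init__(self, root):
        self.files = sorted(Path(root).glob("*.npy"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Return the raw array plus the file name; category logic and
        # transforms belong to the wrapper, not here.
        return np.load(self.files[idx]), self.files[idx].name


class WrapperDataset(Dataset):
    """Resolves categories dynamically from config instead of hardcoded prefixes."""

    def __init__(self, base, config, transform=None):
        self.base = base
        # Hypothetical config shape:
        # {"categories": {"axion": 0, "cdm": 1, "no_sub": 2}}
        self.categories = config["categories"]
        self.transform = transform

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        data, name = self.base[idx]
        # Dynamic lookup replaces `if file_name.startswith('axion')`.
        label = next(
            (lab for prefix, lab in self.categories.items()
             if name.startswith(prefix)),
            None,
        )
        if label is None:
            raise ValueError(f"No category configured for file {name!r}")
        if self.transform is not None:
            data = self.transform(data)
        return data, label
```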
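
Similarly, a configuration-driven `get_transforms(config)` could be sketched with torchvision; the config keys (`hflip`, `rotation_degrees`) are invented here for illustration and may differ from the PR's actual schema.

```python
from torchvision import transforms


def get_transforms(config):
    """Builds an augmentation pipeline from config (keys are hypothetical)."""
    ops = [transforms.ToTensor()]
    if config.get("hflip", False):
        ops.append(transforms.RandomHorizontalFlip(p=0.5))
    if config.get("rotation_degrees", 0):
        ops.append(transforms.RandomRotation(config["rotation_degrees"]))
    return transforms.Compose(ops)


# Example usage:
# train_tf = get_transforms({"hflip": True, "rotation_degrees": 15})
```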

Problem Demo

The Error (Existing Code):

```python
# Fragile: fails for files whose names don't start with 'axion'
if file_name.startswith('axion'):
    data_point = data_point[0]

# Duplicated: normalization logic repeated in 3 locations;
# the division-by-zero risk is not handled consistently
normalized = (data - min) / (max - min)
```
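
The Fix (Sketch):

With the refactor, a single shared helper can own the normalization. The name `min_max_normalize` and the epsilon guard below are illustrative, not necessarily the PR's exact code.

```python
import numpy as np


def min_max_normalize(data: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """One shared implementation; the epsilon guards against division
    by zero when the array is constant (max == min)."""
    d_min, d_max = data.min(), data.max()
    return (data - d_min) / max(d_max - d_min, eps)
```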
