tensorflow: sm_drivers directory not found #5493

@straygar

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
When running on the latest TensorFlow CPU training image (763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-training:2.19-cpu-py312), the training job fails at startup: the sm_drivers directory seems to be missing from the container, so /opt/ml/input/data/sm_drivers/scripts/environment.py cannot be opened (see logs).

To reproduce

from pathlib import Path

from sagemaker.train.configs import Compute, InputData, SourceCode
from sagemaker.train.model_trainer import ModelTrainer

EXPERIMENT_NAME = "my-experiment"
ROLE_ARN = "arn:aws:iam::111222333444:role/SageMakerS3AccessRole"
TRAINING_IMAGE = (
    "763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-training:2.19-cpu-py312"
)
INSTANCE_TYPE = "ml.c5.4xlarge"
DATASET_PATH = "s3://my-bucket/data/"

if __name__ == "__main__":
    trainer = ModelTrainer(
        base_job_name=EXPERIMENT_NAME,
        role=ROLE_ARN,
        training_image=TRAINING_IMAGE,
        source_code=SourceCode(
            source_dir=str(Path(__file__).parent),
            entry_script="train.py",
            requirements="requirements.sagemaker.txt",
            ignore_patterns=[
                "data",
                ".venv",
                "notebooks",
                "environment",
                "scripts",
                "pipelines",
                ".ipynb_checkpoints",
                ".github",
                "__pycache__",
                "*.ipynb",
            ],
        ),
        compute=Compute(
            instance_type=INSTANCE_TYPE
        ),
    )

    input_data = InputData(
        channel_name="training",
        data_source=DATASET_PATH,
    )

    trainer.train(input_data_config=[input_data])

Expected behavior
The training job starts and runs train.py.

Screenshots or logs

See logs:


{"code":{"TrainingInputMode":"File","S3DistributionType":"FullyReplicated","RecordWrapperType":"None"},"sm_drivers":{"TrainingInputMode":"File","S3DistributionType":"FullyReplicated","RecordWrapperType":"None"},"training":{"TrainingInputMode":"File","S3DistributionType":"FullyReplicated","RecordWrapperType":"None"}}
++ /usr/local/bin/python3 /opt/ml/input/data/sm_drivers/scripts/environment.py
Setting up environment variables
/usr/local/bin/python3: can't open file '/opt/ml/input/data/sm_drivers/scripts/environment.py': [Errno 2] No such file or directory
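
The log lists three configured channels (code, sm_drivers, training), yet the sm_drivers path cannot be opened. A minimal diagnostic sketch for narrowing this down: it compares the channels named in the input data config against the directories that actually exist. The helper name `missing_channel_dirs` is hypothetical, and the two paths are assumed to follow the standard SageMaker training container layout; nothing here is part of the SDK.

```python
import json
from pathlib import Path


def missing_channel_dirs(input_data_config: dict, data_root: Path) -> list[str]:
    """Return the configured channel names that have no directory under data_root."""
    return sorted(
        name for name in input_data_config
        if not (data_root / name).is_dir()
    )


if __name__ == "__main__":
    # Assumed locations (standard SageMaker training container layout).
    config_path = Path("/opt/ml/input/config/inputdataconfig.json")
    data_root = Path("/opt/ml/input/data")
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        print("Channels without a directory:", missing_channel_dirs(config, data_root))
    else:
        print(f"{config_path} not found; probably not inside a training container")
```

If sm_drivers shows up in the result while code and training do not, only the driver channel failed to materialize, which would point at the channel upload or download rather than at the image itself.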

System information

  • SageMaker Python SDK version:
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): tensorflow
  • Framework version: 2.19
  • Python version: 3.12
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

