Open
Description
PySDK Version
- PySDK V2 (2.x)
- PySDK V3 (3.x)
Describe the bug
When running a training job on the latest TensorFlow CPU training image, 763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-training:2.19-cpu-py312, the job fails at startup because the sm_drivers directory does not appear to be present in the container (see logs).
To reproduce
from pathlib import Path

from sagemaker.train.configs import Compute, InputData, SourceCode
from sagemaker.train.model_trainer import ModelTrainer

EXPERIMENT_NAME = "my-experiment"
ROLE_ARN = "arn:aws:iam::111222333444:role/SageMakerS3AccessRole"
TRAINING_IMAGE = (
    "763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-training:2.19-cpu-py312"
)
INSTANCE_TYPE = "ml.c5.4xlarge"
DATASET_PATH = "s3://my-bucket/data/"

if __name__ == "__main__":
    trainer = ModelTrainer(
        base_job_name=EXPERIMENT_NAME,
        role=ROLE_ARN,
        training_image=TRAINING_IMAGE,
        source_code=SourceCode(
            source_dir=str(Path(__file__).parent),
            entry_script="train.py",
            requirements="requirements.sagemaker.txt",
            ignore_patterns=[
                "data",
                ".venv",
                "notebooks",
                "environment",
                "scripts",
                "pipelines",
                ".ipynb_checkpoints",
                ".github",
                "__pycache__",
                "*.ipynb",
            ],
        ),
        compute=Compute(
            instance_type=INSTANCE_TYPE
        ),
    )
    input_data = InputData(
        channel_name="training",
        data_source=DATASET_PATH,
    )
    trainer.train(input_data_config=[input_data])
Expected behavior
The job runs.
Screenshots or logs
See logs:
{"code":{"TrainingInputMode":"File","S3DistributionType":"FullyReplicated","RecordWrapperType":"None"},"sm_drivers":{"TrainingInputMode":"File","S3DistributionType":"FullyReplicated","RecordWrapperType":"None"},"training":{"TrainingInputMode":"File","S3DistributionType":"FullyReplicated","RecordWrapperType":"None"}}
++ /usr/local/bin/python3 /opt/ml/input/data/sm_drivers/scripts/environment.py
Setting up environment variables
/usr/local/bin/python3: can't open file '/opt/ml/input/data/sm_drivers/scripts/environment.py': [Errno 2] No such file or directory
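The log above lists three input channels (code, sm_drivers, training), yet the script under the sm_drivers channel cannot be opened. A minimal sketch of how one might check which channel directories actually exist inside the container is shown here. It assumes the standard SageMaker layout in which each channel is placed under /opt/ml/input/data/<channel_name>; the helper name missing_channel_dirs is hypothetical and is not part of the SageMaker SDK.

```python
import json
from pathlib import Path

# Assumption: standard SageMaker location for input channels in a training container.
DATA_ROOT = Path("/opt/ml/input/data")


def missing_channel_dirs(input_data_config: str, data_root: Path = DATA_ROOT) -> list[str]:
    """Return the names of configured channels that have no directory under data_root.

    input_data_config is the JSON object printed in the job log, keyed by channel name.
    """
    channels = json.loads(input_data_config)
    return sorted(name for name in channels if not (data_root / name).is_dir())


if __name__ == "__main__":
    # Channel names copied from the log in this report; per-channel settings omitted.
    logged_config = '{"code": {}, "sm_drivers": {}, "training": {}}'
    for name in missing_channel_dirs(logged_config):
        print(f"channel directory not found: {DATA_ROOT / name}")
```

Because the failure happens before the entry script starts, a check like this would have to run from a shell or debug session in the container rather than from train.py.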
System information
A description of your system. Please provide:
- SageMaker Python SDK version:
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): tensorflow
- Framework version: 2.19
- Python version: 3.12
- CPU or GPU: CPU
- Custom Docker image (Y/N): N
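The SDK version field above is empty. A small sketch of how the installed versions could be collected for a report like this follows. It assumes the distributions are installed under the names sagemaker and tensorflow; the helpers package_version and system_report are hypothetical names used only for illustration.

```python
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def system_report(packages=("sagemaker", "tensorflow")) -> dict[str, str]:
    """Collect the Python version plus the version of each named distribution."""
    report = {"python": platform.python_version()}
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Running it in the environment that submits the job gives the client-side SDK version; the framework version inside the training image may differ.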