[gpu] Enhance driver installer and update README for custom images, versions, and performance #1320
Closed
cjac wants to merge 1 commit into GoogleCloudDataproc:main from
/gcbrun
/gcbrun
This PR significantly refactors the GPU initialization action to improve support for custom image builds, enhance robustness, and update documentation.
**Key Changes:**
1. **Custom Image Building (`invocation-type=custom-images`):**
* The script now detects the `invocation-type=custom-images` metadata.
* When detected, Hadoop/Spark configurations are deferred to the first boot of a cluster instance created from the custom image. This is managed by a new systemd service, `dataproc-gpu-config.service`.
* This prevents issues where configurations are applied too early in the image build process.
2. **GCS Caching and Performance:**
* The README now extensively details the GCS caching mechanism for downloaded artifacts (drivers, CUDA) and compiled components (kernel modules, NCCL).
* It highlights the significant time savings on subsequent runs once the cache is warm.
* It warns that the first run can take up to 150 minutes on small instances if components must be built from source, and recommends pre-warming the cache on a larger instance.
* It notes the security benefit of cached artifacts: cluster nodes have less need for build tools.
3. **Hash Validation:**
* Added SHA256 hash verification for downloaded NVIDIA driver and CUDA `.run` files to ensure integrity.
4. **Documentation (`gpu/README.md`):**
* Fully revamped to reflect the script changes.
* Updated default CUDA versions and tested configurations.
* Clearer `gcloud` examples.
* New section on custom image usage.
* Updated metadata parameters list.
* Improved Secure Boot and troubleshooting sections.
* Clarified GPU agent metric reporting.
5. **Script Enhancements (`gpu/install_gpu_driver.sh`):**
* Refactored configuration logic into functions called conditionally.
* Improved GPG key fetching behind a proxy.
* Adjusted Conda paths for Dataproc 2.3+.
* More robust `kernel-devel` fetching on Rocky Linux.
* Better `DATAPROC_IMAGE_VERSION` detection.
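The deferral described in item 1 could look roughly like the shell sketch below. It is an illustration under assumptions, not the code in `gpu/install_gpu_driver.sh`: the helper names `get_metadata_attribute`, `config_phase`, and `render_unit`, the script path passed to `render_unit`, and the unit contents are all hypothetical. Only the `invocation-type=custom-images` metadata key and the `dataproc-gpu-config.service` name come from this PR.

```shell
# Sketch only: helper names and unit contents are assumptions, not the PR's code.

# Read one instance metadata attribute from the GCE metadata server.
# Prints nothing (and still succeeds) when the attribute is unset.
get_metadata_attribute() {
  curl -fs -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/$1" \
    || true
}

# Decide when Hadoop/Spark configuration should run for an invocation type.
config_phase() {
  if [ "$1" = "custom-images" ]; then
    echo "deferred"    # image build: configure on first boot of a cluster node
  else
    echo "immediate"   # regular init-action run: configure right away
  fi
}

# Print a oneshot unit that runs the deferred configuration script at boot.
# A real unit would also need a marker file or self-disable step to run once.
render_unit() {
  cat <<EOF
[Unit]
Description=Deferred GPU configuration for Hadoop/Spark
After=network-online.target

[Service]
Type=oneshot
ExecStart=$1

[Install]
WantedBy=multi-user.target
EOF
}
```

On an instance, the decision would be driven by something like `config_phase "$(get_metadata_attribute invocation-type)"`, and the output of `render_unit` would be written to a unit file named `dataproc-gpu-config.service` when the result is `deferred`.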
**Purpose:**
These changes make the GPU initialization action more flexible for use in custom image pipelines, improve the reliability of installations, and provide users with better guidance on performance and security implications.
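To make the caching and hash-validation items concrete, here is a minimal shell sketch of a cache-first fetch followed by a SHA256 check. The function names (`verify_sha256`, `cache_has`, `cache_get`, `cache_put`, `download`, `fetch_artifact`) and the order of steps are assumptions for illustration; the actual script may structure this differently. The wrappers call `gsutil` and `curl`, and are kept separate so a different backend can be swapped in.

```shell
# Sketch only: function names and flow are assumptions, not the PR's code.

# Succeed only if the SHA256 digest of file $1 equals the hex string $2.
verify_sha256() {
  actual="$(sha256sum "$1" | cut -d' ' -f1)"
  [ "${actual}" = "$2" ]
}

# Thin wrappers so the cache backend and downloader are easy to replace.
cache_has() { gsutil -q stat "$1"; }      # exit 0 if the object exists
cache_get() { gsutil -q cp "$1" "$2"; }   # cache object -> local file
cache_put() { gsutil -q cp "$1" "$2"; }   # local file -> cache object
download()  { curl -fsSL -o "$2" "$1"; }  # upstream URL -> local file

# fetch_artifact CACHE_URI SOURCE_URL DEST EXPECTED_SHA256
# Uses the cached copy when present, otherwise downloads from upstream.
# Either way the file is verified; only verified downloads are cached.
fetch_artifact() {
  cache_uri="$1"; source_url="$2"; dest="$3"; expected="$4"
  if cache_has "${cache_uri}"; then
    cache_get "${cache_uri}" "${dest}" || return 1
    from_cache=1
  else
    download "${source_url}" "${dest}" || return 1
    from_cache=0
  fi
  if ! verify_sha256 "${dest}" "${expected}"; then
    rm -f "${dest}"
    echo "SHA256 mismatch for ${dest}" >&2
    return 1
  fi
  if [ "${from_cache}" -eq 0 ]; then
    cache_put "${dest}" "${cache_uri}"
  fi
}
```

Verifying before uploading means a corrupted download never warms the cache, and verifying cache hits as well guards against a stale or tampered object in the bucket.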
Closing and starting a new one.
Merged into #1363.