-
Notifications
You must be signed in to change notification settings - Fork 244
Adding support for AMDAMA (Supernova) GPUs. #7697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
PR Title Lint Failed ❌Current Title: Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns: Conventional Commits Format:
Guidelines:
Examples:
Please update your PR title and the lint check will run again automatically. |
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method duplicated 3 times. Should be no more than 2. (Ideally 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also. would you recommend breaking form the model to make lint happy?
| "userAssignedIdentityID": config.UserAssignedIdentityClientID, | ||
| "isVHD": isVHD(profile), | ||
| "gpuNode": strconv.FormatBool(config.EnableNvidia), | ||
| "amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unused?
| }) | ||
| } | ||
|
|
||
| func Test_AzureLinuxV3_MA35D(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AzureLinux is the only OS that support this VMSku correct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With AKS, yes.
| } | ||
|
|
||
| setupAmdAma() { | ||
| if [ "$(isARM64)" -eq 1 ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should never happen.
| return | ||
| fi | ||
|
|
||
| if isMarinerOrAzureLinux "$OS"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ?
One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. If someone did try to create an AKS cluster with a different OS, it just wouldn't install the driver, but AKS cluster creation would succeed. That's fine.
No idea for OSGuard...
|
|
||
| if isMarinerOrAzureLinux "$OS"; then | ||
| # Install driver | ||
| sudo tdnf install -y azurelinux-repos-amd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need sudo, since the CSE extension is running as root. we recently had an outage cause some scripts were calling SUDO using an expirer root password..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should use the helper function for any dnf command. which handle retries, etc
| sudo tdnf install -y azurelinux-repos-amd | ||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we combine into a single dnf call ?
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE | ||
| # Install core package | ||
| sudo tdnf install -y libzip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all tdnf calls will not work on VM with restricted egress network.
| # Install core package | ||
| sudo tdnf install -y libzip | ||
| sudo tdnf install -y azurelinux-repos-extended | ||
| sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR.
| if [ "${AMDAMA_NODE}" = "true" ]; then | ||
| logs_to_events "AKS.CSE.setupAmdAma" setupAmdAma | ||
| else | ||
| logs_to_events "AKS.CSE.setupAmdAma" "echo AMD AMA HW not found!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to log this eventk, knowing it will be the case 99.9% of the time ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
| } | ||
|
|
||
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func IsAmdAmaEnabledSKU(vmSize string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because aks-rp is unaware that this is a GPU SKU so it doesn't interfere with Nvidia/AMD GPUs, and so we don't have to add a whole new mechanism just for this SKU. This was Yi's decision.
| fi | ||
|
|
||
| # Install and configure AMD AMA (Supernova) drivers if this is an AMA node | ||
| if [ "${AMDAMA_NODE}" = "true" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can query instance metadata server to get the sku and install it directly here instead of going through all the hoops if the intention is default enabled for the SKU
# Retrieves the VM SKU (size) from the cached IMDS instance metadata.
# Outputs:
# The VM size/SKU (e.g., "Standard_NM16ads_MA35D") or empty string if not found.
# Returns:
# 0 on success, non-zero on failure
get_imds_vm_sku() {
local vm_sku=""
vm_sku=$(jq -r '.compute.vmSize // ""' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
echo "$vm_sku"
}
if [ "$(get_imds_vm_sku)" = "Standard_NM16ads_MA35D" ]; then
echo "Exact match!"
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for AMD AMA (Supernova) GPU hardware by enabling detection and configuration for the Standard_NM16ads_MA35D VM SKU. The changes introduce a new VM size detection mechanism, install necessary drivers and packages for AMD AMA GPUs, and configure the required system settings for Azure Linux environments.
Changes:
- Added VM SKU detection logic for AMD AMA-enabled instances (Standard_NM16ads_MA35D)
- Implemented driver installation and configuration for AMD AMA GPUs on Azure Linux
- Added E2E tests to validate AMD AMA GPU functionality
- Updated firewall rules to allow access to download.microsoft.com for AMD AMA package installation
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| aks-node-controller/helpers/const.go | Added constant for Standard_NM16ads_MA35D VM size |
| aks-node-controller/parser/helper.go | Added helper function to detect AMD AMA enabled SKUs |
| aks-node-controller/parser/parser.go | Added AMDAMA_NODE environment variable to CSE configuration |
| pkg/agent/baker.go | Added IsAmdAmaEnabledSKU function for template rendering |
| pkg/agent/datamodel/helper.go | Added IsAmdAmaEnabledSKU function for VM SKU detection |
| pkg/agent/variables.go | Added amdamaNode variable to CSE command variables |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Declared AMDAMA_NODE variable for shell scripts |
| parts/linux/cloud-init/artifacts/cse_config.sh | Implemented setupAmdAma function to install drivers and configure system |
| parts/linux/cloud-init/artifacts/cse_main.sh | Added call to setupAmdAma during node preparation |
| e2e/scenario_test.go | Added E2E tests for AMD AMA GPU support (standard and scriptless) |
| e2e/aks_model.go | Added firewall rule to allow download.microsoft.com access |
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func IsAmdAmaEnabledSKU(vmSize string) bool { | ||
| switch vmSize { | ||
| case "Standard_NM16ads_MA35D": | ||
| return true | ||
| } | ||
| return false | ||
| } | ||
|
|
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IsAmdAmaEnabledSKU function is duplicated in both pkg/agent/baker.go and pkg/agent/datamodel/helper.go. This creates maintenance overhead as both implementations need to be updated when new AMD AMA SKUs are added. Consider removing this function from baker.go and calling datamodel.IsAmdAmaEnabledSKU instead, following the pattern used by IsSgxEnabledSKU which only exists in datamodel/helper.go.
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | |
| func IsAmdAmaEnabledSKU(vmSize string) bool { | |
| switch vmSize { | |
| case "Standard_NM16ads_MA35D": | |
| return true | |
| } | |
| return false | |
| } |
| // getIsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func getIsAmdAmaEnabledSKU(vmSize string) bool { | ||
| switch vmSize { | ||
| case helpers.VMSizeStandardNM16adsMA35D: | ||
| return true | ||
| } | ||
| return false | ||
| } |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new getIsAmdAmaEnabledSKU function lacks test coverage. The codebase has comprehensive test coverage for similar functions (e.g., Test_getIsSgxEnabledSKU starting at line 1221). Add a test function following the same pattern to verify the function correctly identifies AMD AMA enabled SKUs and returns false for other SKUs.
| setupAmdAma() { | ||
| if [ "$(isARM64)" -eq 1 ]; then | ||
| return | ||
| fi | ||
|
|
||
| if isMarinerOrAzureLinux "$OS"; then | ||
| # Install driver | ||
| dnf install -y azurelinux-repos-amd | ||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| dnf install -y $AMD_AMA_DRIVER_PACKAGE | ||
| # Install core package | ||
| dnf install -y libzip | ||
| dnf install -y azurelinux-repos-extended | ||
| RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm | ||
| # Install AKS device plugin | ||
| dnf install -y amdama-device-plugin.x86_64 | ||
| # Configure huge pages | ||
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | ||
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | ||
| if [ $(systemctl is-active kubelet) = "active" ]; then | ||
| systemctl restart kubelet | ||
| fi | ||
| fi | ||
| } |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The setupAmdAma function lacks comprehensive error handling. Multiple dnf install commands (lines 990, 993, 995-999) could fail without causing the function to exit with an error code. This is inconsistent with the error handling in ensureGPUDrivers (line 979) which uses || exit $ERR_GPU_DRIVERS_START_FAIL. Consider adding error handling or exit codes for critical installation failures to ensure the node setup fails explicitly rather than continuing with a partially configured state.
| @@ -44,6 +44,14 @@ func IsSgxEnabledSKU(vmSize string) bool { | |||
| return false | |||
| } | |||
|
|
|||
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IsAmdAmaEnabledSKU function is missing a documentation comment, which is inconsistent with the pattern used for IsSgxEnabledSKU above (line 38-39). Add a comment describing the function's purpose to maintain consistency with similar functions in this file.
| // IsAmdAmaEnabledSKU determines if a VM SKU has AMD AMA driver support. |
| dnf install -y azurelinux-repos-amd | ||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| dnf install -y $AMD_AMA_DRIVER_PACKAGE |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AMD_AMA_DRIVER_PACKAGE variable may be empty if no matching driver package is found. Installing an empty package with dnf on line 993 could fail silently or cause an unexpected error. Add validation to check if AMD_AMA_DRIVER_PACKAGE is non-empty before attempting installation, and handle the error case appropriately.
| dnf install -y $AMD_AMA_DRIVER_PACKAGE | |
| if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then | |
| echo "Failed to find a matching amd-ama-driver package for kernel version ${KERNEL_VERSION}" >&2 | |
| return 1 | |
| fi | |
| dnf install -y "$AMD_AMA_DRIVER_PACKAGE" |
| # Configure huge pages | ||
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | ||
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | ||
| if [ $(systemctl is-active kubelet) = "active" ]; then |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command output from systemctl is-active should be quoted in the comparison. Without quotes, if the command fails or returns unexpected output, the test condition could fail with a syntax error. Change to: if [ "$(systemctl is-active kubelet)" = "active" ]; then
| if [ $(systemctl is-active kubelet) = "active" ]; then | |
| if [ "$(systemctl is-active kubelet)" = "active" ]; then |
| func IsAmdAmaEnabledSKU(vmSize string) bool { | ||
| switch vmSize { | ||
| case "Standard_NM16ads_MA35D": | ||
| return true | ||
| } | ||
| return false | ||
| } |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new IsAmdAmaEnabledSKU function lacks test coverage. The codebase has comprehensive test coverage for similar functions (e.g., TestIsSGXEnabledSKU in pkg/agent/datamodel/helper_test.go). Add a test function following the same pattern to verify the function correctly identifies AMD AMA enabled SKUs and returns false for other SKUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| dnf install -y azurelinux-repos-amd | ||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| dnf install -y $AMD_AMA_DRIVER_PACKAGE | ||
| # Install core package | ||
| dnf install -y libzip | ||
| dnf install -y azurelinux-repos-extended | ||
| RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm | ||
| # Install AKS device plugin | ||
| dnf install -y amdama-device-plugin.x86_64 | ||
| # Configure huge pages | ||
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | ||
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | ||
| if [ $(systemctl is-active kubelet) = "active" ]; then | ||
| systemctl restart kubelet |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The setupAmdAma function lacks comprehensive error handling. All dnf install commands and other operations that could fail should include error handling with appropriate exit codes, similar to how GPU driver installation handles errors (e.g., lines 914, 926-928, 935, 948, etc. use || exit $ERR_GPU_DRIVERS_START_FAIL).
Consider defining appropriate error codes (e.g., ERR_AMD_AMA_INSTALL_FAIL, ERR_AMD_AMA_DRIVER_START_FAIL) in cse_helpers.sh and adding proper error handling for each critical operation.
| dnf install -y azurelinux-repos-amd | |
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | |
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | |
| dnf install -y $AMD_AMA_DRIVER_PACKAGE | |
| # Install core package | |
| dnf install -y libzip | |
| dnf install -y azurelinux-repos-extended | |
| RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm | |
| # Install AKS device plugin | |
| dnf install -y amdama-device-plugin.x86_64 | |
| # Configure huge pages | |
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | |
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | |
| if [ $(systemctl is-active kubelet) = "active" ]; then | |
| systemctl restart kubelet | |
| dnf install -y azurelinux-repos-amd || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | |
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | |
| if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then | |
| exit $ERR_AMD_AMA_INSTALL_FAIL | |
| fi | |
| dnf install -y "$AMD_AMA_DRIVER_PACKAGE" || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| # Install core package | |
| dnf install -y libzip || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| dnf install -y azurelinux-repos-extended || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| # Install AKS device plugin | |
| dnf install -y amdama-device-plugin.x86_64 || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| # Configure huge pages | |
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" || exit $ERR_AMD_AMA_INSTALL_FAIL | |
| if [ "$(systemctl is-active kubelet)" = "active" ]; then | |
| systemctl restart kubelet || exit $ERR_AMD_AMA_DRIVER_START_FAIL |
| dnf install -y amdama-device-plugin.x86_64 | ||
| # Configure huge pages | ||
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | ||
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The configuration is written to both /etc/sysctl.conf (line 1001) for persistence and directly to /proc/sys/vm/nr_hugepages (line 1002) for immediate effect. However, these changes could be better managed using the 'sysctl' command instead of direct file writes, similar to how it's done at line 372 with 'sysctl --system'.
Consider using: sysctl -w vm.nr_hugepages=4096
This approach is more consistent with how sysctl settings are managed elsewhere in the codebase and provides better error reporting.
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | |
| sysctl -w vm.nr_hugepages=4096 |
What this PR does / why we need it:
Adding support for AMDAMA (Supernova) GPUs.