Skip to content

Conversation

@mipresmsft
Copy link

What this PR does / why we need it:

Adding support for AMDAMA (Supernova) GPUs.

@mipresmsft mipresmsft changed the title Moved changes to sync with main branch. Adding support for AMDAMA (Supernova) GPUs. Jan 21, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Jan 21, 2026

PR Title Lint Failed ❌

Current Title: Adding support for AMDAMA (Supernova) GPUs.

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • feat(windows): add secure TLS bootstrapping for Windows nodes
  • fix: resolve kubelet certificate rotation issue
  • docs: update installation guide
  • Added new feature
  • Fix bug.
  • Update docs

Please update your PR title and the lint check will run again automatically.

return true
}
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method duplicated 3 times. Should be no more than 2. (Ideally 1)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also. would you recommend breaking form the model to make lint happy?

"userAssignedIdentityID": config.UserAssignedIdentityClientID,
"isVHD": isVHD(profile),
"gpuNode": strconv.FormatBool(config.EnableNvidia),
"amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unused?

})
}

func Test_AzureLinuxV3_MA35D(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AzureLinux is the only OS that support this VMSku correct ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With AKS, yes.

}

setupAmdAma() {
if [ "$(isARM64)" -eq 1 ]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should never happen.

return
fi

if isMarinerOrAzureLinux "$OS"; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ?

One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. If someone did try to create an AKS cluster with a different OS, it just wouldn't install the driver, but AKS cluster creation would succeed. That's fine.

No idea for OSGuard...


if isMarinerOrAzureLinux "$OS"; then
# Install driver
sudo tdnf install -y azurelinux-repos-amd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need sudo, since the CSE extension is running as root. we recently had an outage cause some scripts were calling SUDO using an expirer root password..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use the helper function for any dnf command. which handle retries, etc

sudo tdnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we combine into a single dnf call ?

AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE
# Install core package
sudo tdnf install -y libzip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all tdnf calls will not work on VM with restricted egress network.

# Install core package
sudo tdnf install -y libzip
sudo tdnf install -y azurelinux-repos-extended
sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR.

if [ "${AMDAMA_NODE}" = "true" ]; then
logs_to_events "AKS.CSE.setupAmdAma" setupAmdAma
else
logs_to_events "AKS.CSE.setupAmdAma" "echo AMD AMA HW not found!"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to log this eventk, knowing it will be the case 99.9% of the time ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

}

// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP.

Copy link
Author

@mipresmsft mipresmsft Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because aks-rp is unaware that this is a GPU SKU so it doesn't interfere with Nvidia/AMD GPUs, and so we don't have to add a whole new mechanism just for this SKU. This was Yi's decision.

fi

# Install and configure AMD AMA (Supernova) drivers if this is an AMA node
if [ "${AMDAMA_NODE}" = "true" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can query instance metadata server to get the sku and install it directly here instead of going through all the hoops if the intention is default enabled for the SKU

# Retrieves the VM SKU (size) from the cached IMDS instance metadata.
# Outputs:
#   The VM size/SKU (e.g., "Standard_NM16ads_MA35D") or empty string if not found.
# Returns:
#   0 on success, non-zero on failure
get_imds_vm_sku() {
    local vm_sku=""
    vm_sku=$(jq -r '.compute.vmSize // ""' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
    echo "$vm_sku"
}

if [ "$(get_imds_vm_sku)" = "Standard_NM16ads_MA35D" ]; then
    echo "Exact match!"
fi

Copilot AI review requested due to automatic review settings January 30, 2026 21:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for AMD AMA (Supernova) GPU hardware by enabling detection and configuration for the Standard_NM16ads_MA35D VM SKU. The changes introduce a new VM size detection mechanism, install necessary drivers and packages for AMD AMA GPUs, and configure the required system settings for Azure Linux environments.

Changes:

  • Added VM SKU detection logic for AMD AMA-enabled instances (Standard_NM16ads_MA35D)
  • Implemented driver installation and configuration for AMD AMA GPUs on Azure Linux
  • Added E2E tests to validate AMD AMA GPU functionality
  • Updated firewall rules to allow access to download.microsoft.com for AMD AMA package installation

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
aks-node-controller/helpers/const.go Added constant for Standard_NM16ads_MA35D VM size
aks-node-controller/parser/helper.go Added helper function to detect AMD AMA enabled SKUs
aks-node-controller/parser/parser.go Added AMDAMA_NODE environment variable to CSE configuration
pkg/agent/baker.go Added IsAmdAmaEnabledSKU function for template rendering
pkg/agent/datamodel/helper.go Added IsAmdAmaEnabledSKU function for VM SKU detection
pkg/agent/variables.go Added amdamaNode variable to CSE command variables
parts/linux/cloud-init/artifacts/cse_cmd.sh Declared AMDAMA_NODE variable for shell scripts
parts/linux/cloud-init/artifacts/cse_config.sh Implemented setupAmdAma function to install drivers and configure system
parts/linux/cloud-init/artifacts/cse_main.sh Added call to setupAmdAma during node preparation
e2e/scenario_test.go Added E2E tests for AMD AMA GPU support (standard and scriptless)
e2e/aks_model.go Added firewall rule to allow download.microsoft.com access

Comment on lines +1275 to +1283
// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
switch vmSize {
case "Standard_NM16ads_MA35D":
return true
}
return false
}

Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IsAmdAmaEnabledSKU function is duplicated in both pkg/agent/baker.go and pkg/agent/datamodel/helper.go. This creates maintenance overhead as both implementations need to be updated when new AMD AMA SKUs are added. Consider removing this function from baker.go and calling datamodel.IsAmdAmaEnabledSKU instead, following the pattern used by IsSgxEnabledSKU which only exists in datamodel/helper.go.

Suggested change
// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
switch vmSize {
case "Standard_NM16ads_MA35D":
return true
}
return false
}

Copilot uses AI. Check for mistakes.
Comment on lines +528 to +535
// getIsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func getIsAmdAmaEnabledSKU(vmSize string) bool {
switch vmSize {
case helpers.VMSizeStandardNM16adsMA35D:
return true
}
return false
}
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new getIsAmdAmaEnabledSKU function lacks test coverage. The codebase has comprehensive test coverage for similar functions (e.g., Test_getIsSgxEnabledSKU starting at line 1221). Add a test function following the same pattern to verify the function correctly identifies AMD AMA enabled SKUs and returns false for other SKUs.

Copilot uses AI. Check for mistakes.
Comment on lines +983 to +1007
setupAmdAma() {
if [ "$(isARM64)" -eq 1 ]; then
return
fi

if isMarinerOrAzureLinux "$OS"; then
# Install driver
dnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
dnf install -y $AMD_AMA_DRIVER_PACKAGE
# Install core package
dnf install -y libzip
dnf install -y azurelinux-repos-extended
RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm
# Install AKS device plugin
dnf install -y amdama-device-plugin.x86_64
# Configure huge pages
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf"
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
if [ $(systemctl is-active kubelet) = "active" ]; then
systemctl restart kubelet
fi
fi
}
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setupAmdAma function lacks comprehensive error handling. Multiple dnf install commands (lines 990, 993, 995-999) could fail without causing the function to exit with an error code. This is inconsistent with the error handling in ensureGPUDrivers (line 979) which uses || exit $ERR_GPU_DRIVERS_START_FAIL. Consider adding error handling or exit codes for critical installation failures to ensure the node setup fails explicitly rather than continuing with a partially configured state.

Copilot uses AI. Check for mistakes.
@@ -44,6 +44,14 @@ func IsSgxEnabledSKU(vmSize string) bool {
return false
}

Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IsAmdAmaEnabledSKU function is missing a documentation comment, which is inconsistent with the pattern used for IsSgxEnabledSKU above (line 38-39). Add a comment describing the function's purpose to maintain consistency with similar functions in this file.

Suggested change
// IsAmdAmaEnabledSKU determines if a VM SKU has AMD AMA driver support.

Copilot uses AI. Check for mistakes.
dnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
dnf install -y $AMD_AMA_DRIVER_PACKAGE
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AMD_AMA_DRIVER_PACKAGE variable may be empty if no matching driver package is found. Installing an empty package with dnf on line 993 could fail silently or cause an unexpected error. Add validation to check if AMD_AMA_DRIVER_PACKAGE is non-empty before attempting installation, and handle the error case appropriately.

Suggested change
dnf install -y $AMD_AMA_DRIVER_PACKAGE
if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then
echo "Failed to find a matching amd-ama-driver package for kernel version ${KERNEL_VERSION}" >&2
return 1
fi
dnf install -y "$AMD_AMA_DRIVER_PACKAGE"

Copilot uses AI. Check for mistakes.
# Configure huge pages
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf"
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
if [ $(systemctl is-active kubelet) = "active" ]; then
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command output from systemctl is-active should be quoted in the comparison. Without quotes, if the command fails or returns unexpected output, the test condition could fail with a syntax error. Change to: if [ "$(systemctl is-active kubelet)" = "active" ]; then

Suggested change
if [ $(systemctl is-active kubelet) = "active" ]; then
if [ "$(systemctl is-active kubelet)" = "active" ]; then

Copilot uses AI. Check for mistakes.
Comment on lines +47 to +53
func IsAmdAmaEnabledSKU(vmSize string) bool {
switch vmSize {
case "Standard_NM16ads_MA35D":
return true
}
return false
}
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new IsAmdAmaEnabledSKU function lacks test coverage. The codebase has comprehensive test coverage for similar functions (e.g., TestIsSGXEnabledSKU in pkg/agent/datamodel/helper_test.go). Add a test function following the same pattern to verify the function correctly identifies AMD AMA enabled SKUs and returns false for other SKUs.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 30, 2026 21:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment on lines +990 to +1004
dnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
dnf install -y $AMD_AMA_DRIVER_PACKAGE
# Install core package
dnf install -y libzip
dnf install -y azurelinux-repos-extended
RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm
# Install AKS device plugin
dnf install -y amdama-device-plugin.x86_64
# Configure huge pages
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf"
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
if [ $(systemctl is-active kubelet) = "active" ]; then
systemctl restart kubelet
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setupAmdAma function lacks comprehensive error handling. All dnf install commands and other operations that could fail should include error handling with appropriate exit codes, similar to how GPU driver installation handles errors (e.g., lines 914, 926-928, 935, 948, etc. use || exit $ERR_GPU_DRIVERS_START_FAIL).

Consider defining appropriate error codes (e.g., ERR_AMD_AMA_INSTALL_FAIL, ERR_AMD_AMA_DRIVER_START_FAIL) in cse_helpers.sh and adding proper error handling for each critical operation.

Suggested change
dnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
dnf install -y $AMD_AMA_DRIVER_PACKAGE
# Install core package
dnf install -y libzip
dnf install -y azurelinux-repos-extended
RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm
# Install AKS device plugin
dnf install -y amdama-device-plugin.x86_64
# Configure huge pages
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf"
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
if [ $(systemctl is-active kubelet) = "active" ]; then
systemctl restart kubelet
dnf install -y azurelinux-repos-amd || exit $ERR_AMD_AMA_INSTALL_FAIL
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then
exit $ERR_AMD_AMA_INSTALL_FAIL
fi
dnf install -y "$AMD_AMA_DRIVER_PACKAGE" || exit $ERR_AMD_AMA_INSTALL_FAIL
# Install core package
dnf install -y libzip || exit $ERR_AMD_AMA_INSTALL_FAIL
dnf install -y azurelinux-repos-extended || exit $ERR_AMD_AMA_INSTALL_FAIL
RPM_FRONTEND=noninteractive dnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm || exit $ERR_AMD_AMA_INSTALL_FAIL
# Install AKS device plugin
dnf install -y amdama-device-plugin.x86_64 || exit $ERR_AMD_AMA_INSTALL_FAIL
# Configure huge pages
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" || exit $ERR_AMD_AMA_INSTALL_FAIL
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" || exit $ERR_AMD_AMA_INSTALL_FAIL
if [ "$(systemctl is-active kubelet)" = "active" ]; then
systemctl restart kubelet || exit $ERR_AMD_AMA_DRIVER_START_FAIL

Copilot uses AI. Check for mistakes.
dnf install -y amdama-device-plugin.x86_64
# Configure huge pages
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf"
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration is written to both /etc/sysctl.conf (line 1001) for persistence and directly to /proc/sys/vm/nr_hugepages (line 1002) for immediate effect. However, these changes could be better managed using the 'sysctl' command instead of direct file writes, similar to how it's done at line 372 with 'sysctl --system'.

Consider using: sysctl -w vm.nr_hugepages=4096

This approach is more consistent with how sysctl settings are managed elsewhere in the codebase and provides better error reporting.

Suggested change
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
sysctl -w vm.nr_hugepages=4096

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants