@xumanbu xumanbu commented Jan 17, 2026

What changes were proposed in this pull request?

Similar to SPARK-47475 for jars, this commit adds support for avoiding archive downloads in Kubernetes cluster mode when the archives are big and executor counts are high, to prevent network saturation and timeouts.

Why are the changes needed?

Does this PR introduce any user-facing change?

Changes:

  • Add KUBERNETES_ARCHIVES_AVOID_DOWNLOAD_SCHEMES configuration
  • Implement avoidArchiveDownload function in SparkSubmit
    The configuration accepts a comma-separated list of schemes (e.g., s3a, hdfs) or wildcard '*' to avoid downloading archives for any scheme.
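
    The matching rule described above can be illustrated with a minimal sketch. This is not the code from this PR: the object name `AvoidDownloadSketch` and the helpers `parseSchemes` and `shouldAvoidDownload` are hypothetical, and the sketch assumes that a path without a URI scheme matches only the wildcard.

```scala
import java.net.URI

// Hypothetical illustration only; names and behavior are assumptions,
// not the implementation added to SparkSubmit by this PR.
object AvoidDownloadSketch {

  // Split a comma-separated config value such as "s3a, hdfs" into schemes,
  // trimming whitespace and dropping empty entries.
  def parseSchemes(confValue: String): Seq[String] =
    confValue.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // Return true when the archive should NOT be downloaded:
  // either the wildcard "*" is configured, or the archive's URI scheme
  // is one of the configured schemes.
  def shouldAvoidDownload(archive: String, schemes: Seq[String]): Boolean = {
    // getScheme returns null for scheme-less paths, hence the Option wrapper.
    val scheme = Option(new URI(archive).getScheme)
    schemes.contains("*") || scheme.exists(s => schemes.contains(s))
  }
}
```

    Under these assumptions, `AvoidDownloadSketch.shouldAvoidDownload("s3a://bucket/env.tar.gz#env", AvoidDownloadSketch.parseSchemes("s3a,hdfs"))` evaluates to `true`, while the same call with an `hdfs://` archive and only `s3a` configured evaluates to `false`.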

How was this patch tested?

  • Add a test case to verify that archive downloads are avoided when the scheme matches

Was this patch authored or co-authored using generative AI tooling?

NO

…loadSchemes for K8s Cluster Mode

Similar to SPARK-47475 for jars, this commit adds support for avoiding
archive downloads in Kubernetes cluster mode when the archives are big
and executor counts are high, to prevent network saturation and timeouts.

Changes:
- Add KUBERNETES_ARCHIVES_AVOID_DOWNLOAD_SCHEMES configuration
- Implement avoidArchiveDownload function in SparkSubmit
- Add test case to verify archives avoid download functionality

The configuration accepts a comma-separated list of schemes (e.g., s3a, hdfs)
or wildcard '*' to avoid downloading archives for any scheme.
@github-actions github-actions bot added the CORE label Jan 17, 2026

JIRA Issue Information

=== Improvement SPARK-55077 ===
Summary: [CORE][K8S] Support spark.kubernetes.archives.avoidDownloadSchemes for K8s Cluster Mode
Assignee: None
Status: Open
Affected: ["4.0.0"]


This comment was automatically generated by GitHub Actions

"For use in cases when the archives are big and executor counts are high, " +
"concurrent download causes network saturation and timeouts. " +
"Wildcard '*' is denoted to not downloading archives for any the schemes.")
.version("4.0.0")
Member

Since the Apache Spark master branch is at 4.2.0-SNAPSHOT, the new configuration should use version 4.2.0, @xumanbu.

}
}

test("Avoid archives download if scheme matches " +
Member

Please use the JIRA ID prefix style for the test name.

test("Avoid archives download if scheme matches " +
test("SPARK-55077: Avoid archives download if scheme matches " +

Author

Got it. I'll fix it.

Author

xumanbu commented Jan 17, 2026

@dongjoon-hyun I have addressed all the comments; please take a look. The build failure may be caused by PR #53720; it is not caused by this PR.
