⚡ Improve git-fetch.py performance and memory usage #157
base: main
Conversation
- Implement streaming downloads to minimize memory footprint.
- Implement persistent connections (Keep-Alive) with `http.client`.
- Use Queue-based worker pool for dynamic load balancing.
- Fix subdirectory download bug.
- Handle `Connection: close` gracefully.

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Code Review
This pull request is a great improvement, significantly optimizing performance and memory usage by switching to http.client for streaming downloads and using a queue-based worker pool. The fix for subdirectory downloads is also a valuable correction. My review includes a suggestion to further improve the new download_worker function by refactoring some duplicated logic for better maintainability, and also to consider the implications of not handling HTTP redirects, which was a feature of the previous implementation.
conn.request("GET", url_path, headers=headers)
resp = conn.getresponse()

# Check if the server wants to close the connection
connection_header = resp.getheader("Connection", "").lower()
should_close = connection_header == "close"

if resp.status == 200:
    with open(local_path, "wb") as f:
        while True:
            chunk = resp.read(16384)
            if not chunk:
                break
            f.write(chunk)
    print(f"✓ {display_path}")

    if should_close:
        conn.close()
        # Reconnect for the next file in the queue
        conn = http.client.HTTPSConnection(host, timeout=30)

    break
elif resp.status in (301, 302, 307, 308):
    loc = resp.getheader("Location")
    resp.read()  # Consume body
    if loc:
        print(
            f"✗ {display_path}: Redirect to {loc} not handled in persistent mode"
        )
    else:
        print(f"✗ {display_path}: HTTP {resp.status}")

    if should_close:
        conn.close()
        conn = http.client.HTTPSConnection(host, timeout=30)
    break
else:
    print(f"✗ {display_path}: HTTP {resp.status}")
    resp.read()  # Consume body
    if should_close:
        conn.close()
        conn = http.client.HTTPSConnection(host, timeout=30)
    break
This block contains duplicated logic for handling the should_close condition and breaking from the retry loop across different response statuses. You can simplify this and improve maintainability by restructuring it to avoid repetition.
Additionally, the switch to http.client means HTTP redirects are no longer handled automatically, which is a functional regression. The suggested refactoring maintains the current behavior of logging redirects, but you might consider implementing basic redirect handling in the future to make the script more robust.
conn.request("GET", url_path, headers=headers)
resp = conn.getresponse()

# Check if the server wants to close the connection
connection_header = resp.getheader("Connection", "").lower()
should_close = connection_header == "close"

if resp.status == 200:
    with open(local_path, "wb") as f:
        while True:
            chunk = resp.read(16384)
            if not chunk:
                break
            f.write(chunk)
    print(f"✓ {display_path}")
elif resp.status in (301, 302, 307, 308):
    loc = resp.getheader("Location")
    resp.read()  # Consume body
    if loc:
        print(
            f"✗ {display_path}: Redirect to {loc} not handled in persistent mode"
        )
    else:
        print(f"✗ {display_path}: HTTP {resp.status}")
else:
    print(f"✗ {display_path}: HTTP {resp.status}")
    resp.read()  # Consume body

if should_close:
    conn.close()
    # Reconnect for the next file in the queue
    conn = http.client.HTTPSConnection(host, timeout=30)

break
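If basic redirect handling is added later, one possible shape under the same `http.client` approach is sketched below. The function name, hop limit, and overall structure are assumptions for illustration, not part of this PR.

import http.client
import urllib.parse

def get_with_redirects(host: str, path: str, headers: dict[str, str], max_hops: int = 3):
    """Follow up to max_hops redirects; return (connection, final response)."""
    conn = http.client.HTTPSConnection(host, timeout=30)
    resp = None
    for _ in range(max_hops):
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()
        if resp.status not in (301, 302, 307, 308):
            return conn, resp
        loc = resp.getheader("Location")
        resp.read()  # drain the body so the connection can be reused
        if not loc:
            return conn, resp
        parts = urllib.parse.urlsplit(loc)
        if parts.netloc and parts.netloc != host:
            # Cross-host redirect: open a fresh connection to the new host.
            conn.close()
            host = parts.netloc
            conn = http.client.HTTPSConnection(host, timeout=30)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
    return conn, resp  # give up after max_hops redirects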
Pull request overview
This PR refactors git-fetch.py to improve performance and memory efficiency by replacing urllib.request with http.client for streaming downloads and implementing a queue-based worker pool pattern for parallel downloads with connection reuse. The PR also fixes a bug where subdirectory downloads would fail when the path didn't have a trailing slash.
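To make the new architecture concrete, here is a minimal, self-contained sketch of a queue-based worker pool of the kind described above. It is illustrative only: `fetch_one`, the stand-in file list, and the worker count are placeholders, not code from the PR (in the PR each worker additionally owns a persistent `HTTPSConnection`).

import queue
import threading

def fetch_one(item: str) -> None:
    # Placeholder for the real per-file download over a persistent connection.
    print(f"downloading {item}")

def worker(file_q: queue.Queue) -> None:
    # Each worker drains the shared queue until it is empty.
    while True:
        try:
            item = file_q.get_nowait()
        except queue.Empty:
            return
        try:
            fetch_one(item)
        finally:
            file_q.task_done()

files_to_download = ["a.txt", "b.txt", "c.txt"]  # stand-in file list
file_q: queue.Queue = queue.Queue()
for f in files_to_download:
    file_q.put(f)

num_workers = min(4, len(files_to_download))
threads = [threading.Thread(target=worker, args=(file_q,), daemon=True) for _ in range(num_workers)]
for t in threads:
    t.start()
file_q.join()  # returns once task_done() has been called for every queued item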
Changes:
- Replaced synchronous per-file downloads with a queue-based worker pool that uses persistent HTTP connections
- Implemented streaming downloads with chunked reads to reduce memory usage for large files
- Fixed subdirectory download logic to accept both blob and tree types when path lacks trailing slash
- Added connection close detection and automatic reconnection logic
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| Cachyos/git-fetch.py | Core refactoring: added download_worker function for persistent connections, modified fetch_github and fetch_gitlab to use queue-based worker pool, fixed subdirectory type validation |
| .gitignore | Added bench_output/ directory to ignore benchmark results |
def download_worker(host: str, file_q: queue.Queue, headers: dict[str, str]) -> None:
    """Worker thread: process files from queue using a persistent connection."""
Copilot AI · Feb 2, 2026
The headers dictionary is being mutated in-place within the worker thread. Since the same headers dictionary is shared across multiple worker threads (passed to each worker), this creates a race condition where multiple threads could simultaneously check and modify the "User-Agent" key.
While the impact is low (all threads would set the same value), this is still a thread-safety issue. The headers dictionary should be copied before modification, or the User-Agent should be set before passing to workers.
| """Worker thread: process files from queue using a persistent connection.""" | |
| """Worker thread: process files from queue using a persistent connection.""" | |
| # Make a per-thread copy so we don't mutate a shared headers dict. | |
| headers = dict(headers) |
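The reviewer's second option, resolving the default once in the caller before any workers start, could look roughly like this sketch (the surrounding context in `fetch_github`/`fetch_gitlab` is assumed):

headers: dict[str, str] = {}  # headers built by the caller (stand-in)
# Resolve the default once, before any worker threads are created,
# so no worker ever needs to mutate the shared dict.
headers.setdefault("User-Agent", "git-fetch.py")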
Cachyos/git-fetch.py (Outdated)
| elif item["type"] == "blob": | ||
| encoded_path = "/".join(urllib.parse.quote(p) for p in item_path.split("/")) | ||
| raw_url = f"https://raw.githubusercontent.com/{spec.owner}/{spec.repo}/{spec.branch}/{encoded_path}" | ||
| files_to_download.append((raw_url, local_path, item_path)) | ||
| # Host: raw.githubusercontent.com | ||
| path_part = f"/{spec.owner}/{spec.repo}/{spec.branch}/{encoded_path}" | ||
| files_to_download.append((path_part, local_path, item_path)) |
Copilot AI · Feb 2, 2026
The GitHub implementation doesn't ensure parent directories exist before adding files to the download queue, unlike the GitLab implementation which explicitly creates parent directories (lines 348-350).
When filtering the tree based on target_path (lines 220-225), intermediate directory entries may be filtered out. If a file's parent directory wasn't included as a "tree" type entry in the filtered results, the download_worker will fail when trying to write the file.
Consider adding the same parent directory creation logic used in fetch_gitlab (lines 348-350) to ensure parent directories exist before queueing files for download.
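A minimal, self-contained sketch of such a guard is shown below; the helper name and example path are illustrative, and the PR would call the equivalent logic before appending each blob to the download queue.

import os

def ensure_parent_dir(local_path: str) -> None:
    # Create the file's parent directory (no-op if it already exists or is the CWD).
    parent = os.path.dirname(local_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

ensure_parent_dir("out/src/requests/models.py")  # example path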
except (http.client.HTTPException, OSError) as e:
    # Connection might have been closed by server unexpectedly
    conn.close()
    retries -= 1
    if retries > 0:
        conn = http.client.HTTPSConnection(host, timeout=30)
    else:
        print(f"✗ {display_path}: {e}")
Copilot AI · Feb 2, 2026
The exception handling catches both http.client.HTTPException and OSError but retries on all of them. However, some OSError exceptions like PermissionError when writing to local_path (line 125) should not be retried as they indicate a local filesystem issue, not a network problem.
Consider distinguishing between network errors (which should be retried) and local I/O errors (which should fail immediately).
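One way to draw that distinction is a small classifier used by the `except` handler. This is a sketch, not the PR's code, and the exact set of non-retryable exceptions is an assumption; note that `PermissionError` and friends are subclasses of `OSError`, so they must be checked first.

import http.client

def is_retryable(exc: BaseException) -> bool:
    # Local filesystem errors should fail fast; everything else in the
    # (HTTPException, OSError) family is treated as a transient network issue.
    if isinstance(exc, (PermissionError, IsADirectoryError, NotADirectoryError)):
        return False
    return isinstance(exc, (http.client.HTTPException, OSError))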
if resp.status == 200:
    with open(local_path, "wb") as f:
        while True:
            chunk = resp.read(16384)
Copilot AI · Feb 2, 2026
The chunk size of 16384 bytes (16KB) is relatively small for modern systems. While this works, larger chunk sizes (64KB-256KB) typically provide better performance for file downloads by reducing the number of system calls and better utilizing network bandwidth.
Consider using a larger chunk size like 65536 (64KB) or 131072 (128KB) for better throughput, especially for large files.
Suggested change:

            chunk = resp.read(65536)
        raise


def download_worker(host: str, file_q: queue.Queue, headers: dict[str, str]) -> None:
Copilot AI · Feb 2, 2026
The function parameter file_q uses an abbreviated name. While the abbreviation is understandable in context, following Python naming conventions, it would be clearer as file_queue to improve readability.
This is a minor style preference, but it aligns with PEP 8's guidance on avoiding unnecessary abbreviations.
retries = 3
while retries > 0:
    try:
        conn.request("GET", url_path, headers=headers)
        resp = conn.getresponse()

        # Check if the server wants to close the connection
        connection_header = resp.getheader("Connection", "").lower()
        should_close = connection_header == "close"

        if resp.status == 200:
            with open(local_path, "wb") as f:
                while True:
                    chunk = resp.read(16384)
                    if not chunk:
                        break
                    f.write(chunk)
            print(f"✓ {display_path}")

            if should_close:
                conn.close()
                # Reconnect for the next file in the queue
                conn = http.client.HTTPSConnection(host, timeout=30)

            break
        elif resp.status in (301, 302, 307, 308):
            loc = resp.getheader("Location")
            resp.read()  # Consume body
            if loc:
                print(
                    f"✗ {display_path}: Redirect to {loc} not handled in persistent mode"
                )
            else:
                print(f"✗ {display_path}: HTTP {resp.status}")

            if should_close:
                conn.close()
                conn = http.client.HTTPSConnection(host, timeout=30)
            break
        else:
            print(f"✗ {display_path}: HTTP {resp.status}")
            resp.read()  # Consume body
            if should_close:
                conn.close()
                conn = http.client.HTTPSConnection(host, timeout=30)
            break
    except (http.client.HTTPException, OSError) as e:
        # Connection might have been closed by server unexpectedly
        conn.close()
        retries -= 1
        if retries > 0:
            conn = http.client.HTTPSConnection(host, timeout=30)
        else:
            print(f"✗ {display_path}: {e}")

file_q.task_done()
Copilot AI · Feb 2, 2026
The retry logic doesn't re-add failed items back to the queue. When a file download fails after exhausting all retries, the task_done() is still called (line 169), which marks the item as complete even though the download failed. This means the script will report success even when some files failed to download.
Consider either re-raising the exception after all retries are exhausted, or tracking failed downloads to report at the end and return an appropriate exit code.
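A lightweight way to do that is to record failures in a thread-safe list and let the caller turn them into a non-zero exit code after `file_q.join()`. The helper names below are illustrative, not code from the PR.

import sys
import threading

failed: list[str] = []
failed_lock = threading.Lock()

def record_failure(display_path: str) -> None:
    # Workers call this when a download has exhausted its retries.
    with failed_lock:
        failed.append(display_path)

def report_and_exit() -> None:
    # The caller invokes this after file_q.join() so the script's exit code reflects failures.
    if failed:
        print(f"{len(failed)} file(s) failed to download", file=sys.stderr)
        sys.exit(1)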
resp.read()  # Consume body
if should_close:
    conn.close()
    conn = http.client.HTTPSConnection(host, timeout=30)
Copilot AI · Feb 2, 2026
When a non-200 status code is received (lines 139-159), the function breaks out of the retry loop, meaning it won't retry on 4xx/5xx errors. This is inconsistent with typical download behavior where transient 5xx errors should be retried.
Consider only breaking for certain non-retriable status codes (like 404, 401, 403) and retrying for 5xx server errors and certain 4xx errors like 429 (rate limiting).
Suggested change:

            conn = http.client.HTTPSConnection(host, timeout=30)
        # Non-retriable client errors: fail fast
        if resp.status in (401, 403, 404):
            break
        # Retry on transient server errors and rate limiting
        if 500 <= resp.status < 600 or resp.status == 429:
            retries -= 1
            if retries > 0:
                continue
            # Out of retries, give up
            break
        # Default: treat other statuses as non-retriable
| f"Requested path {spec.path!r} is a file, but repository " | ||
| f"contains a {item['type']} at that path." |
Copilot AI · Feb 2, 2026
The added condition allows item["type"] == "tree" to be treated as valid when the requested path doesn't end with a slash. This fixes the bug where downloading a subdirectory without a trailing slash would fail.
However, this changes the behavior: now requesting a path without a trailing slash will work for both files and directories. This might be the intended behavior, but it's worth noting that the error message on line 244 still says "is a file" which is now incorrect since it could also be a directory.
Consider updating the error message to reflect that both blob and tree types are valid, or adjusting the logic to maintain the distinction between file and directory requests.
| f"Requested path {spec.path!r} is a file, but repository " | |
| f"contains a {item['type']} at that path." | |
| f"Requested path {spec.path!r} is a file or directory, but " | |
| f"repository contains a {item['type']} at that path." |
| if "User-Agent" not in headers: | ||
| headers["User-Agent"] = "git-fetch.py" | ||
|
|
||
| conn = http.client.HTTPSConnection(host, timeout=30) |
Copilot AI · Feb 2, 2026
The connection is initialized but never explicitly connected before the first request. While http.client.HTTPSConnection will implicitly connect on the first request() call, explicitly calling conn.connect() after initialization would make the code's intent clearer and allow for better error handling during the connection phase.
This is a minor style issue - the current code works correctly since request() will establish the connection if needed.
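For completeness, an explicit-connect variant might look like the sketch below; the helper name and error handling are illustrative assumptions rather than the PR's code.

import http.client

def open_connection(host: str) -> http.client.HTTPSConnection | None:
    # Connect eagerly so DNS/TLS failures surface before the first request().
    conn = http.client.HTTPSConnection(host, timeout=30)
    try:
        conn.connect()
    except OSError as e:
        print(f"✗ could not connect to {host}: {e}")
        return None
    return conn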
    file_q.put(f)

max_workers = min(32, (os.cpu_count() or 1) * 4)
num_workers = min(max_workers, len(files_to_download))
Copilot AI · Feb 2, 2026
When there are very few files to download (e.g., 1-2 files), creating worker threads adds unnecessary overhead since thread creation and the queue mechanism have their own costs.
Consider adding a threshold (e.g., if len(files_to_download) < 3) to download files synchronously without threading for small batches, or setting a minimum of 1 worker explicitly to handle the edge case where len(files_to_download) could theoretically be 0 (though line 275-276 prevents this).
Suggested change:

num_workers = max(1, min(max_workers, len(files_to_download)))
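If the small-batch threshold is also wanted, one illustrative way to express both ideas is a tiny helper where a return value of 0 means "download synchronously in the caller"; the helper name and the threshold of 3 are assumptions, not code from the PR.

def choose_num_workers(n_files: int, max_workers: int) -> int:
    # 0 signals the caller to skip threading entirely for tiny batches.
    if n_files < 3:
        return 0
    return max(1, min(max_workers, n_files))

assert choose_num_workers(1, 32) == 0
assert choose_num_workers(100, 32) == 32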
This PR optimizes `Cachyos/git-fetch.py` for performance and memory efficiency.

💡 What
- Replaced `urllib.request` with `http.client` to enable streaming downloads and persistent connections.
- Handle `Connection: close` headers and reconnect automatically (necessary for GitHub Raw).
- Fixed the subdirectory download bug: a path such as `src/requests` would fail if the tree API returned it as a `tree` type and the URL didn't have a trailing slash.

🎯 Why
- The previous implementation read each file fully into memory (`resp.read()`). Streaming (a `shutil.copyfileobj` equivalent) keeps memory usage low regardless of file size.

📊 Measured Improvement
PR created automatically by Jules for task 17867593564761135098 started by @Ven0m0