Added duplicate key detection forked_projects_util.py #1982
Kcruz28 wants to merge 1 commit into TestingResearchIllinois:main
Conversation
darko-marinov left a comment:
Good starting point, but some changes are needed before this can be accepted into the checker.
```python
from collections import Counter
import re

with open(file_path, "r") as f:
```
darko-marinov:
Use the JSON library to read the file. It's presumably already done somewhere in this script.
Kcruz28:
json.load() is the standard way to parse JSON, but it automatically collapses duplicate keys by keeping only the last occurrence. Because of this behavior, any duplicate keys in the original file are lost during parsing. For that reason I did not use json.load().
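For illustration (not from the PR), a minimal demonstration of the behavior described above:

```python
import json

# json.loads() silently keeps only the LAST value for a repeated key,
# so duplicates in the raw file leave no trace after parsing.
raw = '{"a": 1, "a": 2}'
print(json.loads(raw))  # prints {'a': 2}; the first "a" is gone
```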
darko-marinov:
Thanks for clarifying about json.load(); I wasn't aware of the default behavior. It seems that json.loads(..., object_pairs_hook=...) allows checking for duplicates, and it doesn't require you to parse keys yourself (which only works if each key-value pair sits on a single line).
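As a sketch of this suggestion (reusing the PR's function name, but not its actual body), find_duplicate_json_keys could let the parser report duplicates via object_pairs_hook:

```python
import json

def find_duplicate_json_keys(file_path):
    """Hypothetical hook-based variant: collect keys that repeat within
    any single JSON object, instead of scanning the raw text."""
    dups = []

    def record_duplicates(pairs):
        # Called once per JSON object with ALL its (key, value) pairs,
        # including duplicates that json.load() would otherwise collapse.
        seen = {}
        for key, value in pairs:
            if key in seen:
                dups.append(key)
            seen[key] = value
        return seen

    with open(file_path, "r") as f:
        json.load(f, object_pairs_hook=record_duplicates)
    return dups
```

Unlike a line-based scan, this works regardless of how the file is formatted.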
```python
dups = find_duplicate_json_keys(file_path)
if dups:
    log_esp_error(file_path, log, f"Duplicate in forked-projects.json keys detected: {dups}")
elif not is_fp_sorted(data):
```
darko-marinov:
Does the sorting check enforce strictly increasing order or just non-decreasing? Can we make it strictly increasing so that it subsumes the check for duplicates?
Kcruz28:
Making the sorting check strictly increasing would catch duplicates in the loaded data. But json.load() collapses duplicates when parsing, so any duplicates in the original file wouldn’t be detected.
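For comparison, a strictly increasing check over the loaded keys is short; the body below is an assumption for illustration, not the PR's actual is_fp_sorted:

```python
def is_fp_sorted(data):
    # Strict "<" rejects adjacent equal keys, but only among keys that
    # survived json.load(); raw-file duplicates are already gone here.
    # Assumes `data` preserves insertion order (dicts do since Python 3.7).
    keys = list(data)
    return all(a < b for a, b in zip(keys, keys[1:]))
```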
This PR adds a check for duplicate keys in forked_projects.json. I added a new function, find_duplicate_json_keys, which scans the raw JSON file to detect duplicate keys before it is parsed. I then updated run_checks_sort_fp to report an error if any duplicates are found. This helps catch cases where duplicate keys would otherwise be silently overwritten, making the checks more reliable.

Example error:
```
ERROR: On file format_checker/forked-projects.json: Duplicate in forked-projects.json keys detected: ['https://github.com/apache/incubator-shardingsphere']
```
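Given the Counter and re imports in the diff, the raw scan plausibly looks something like the sketch below. This is a reconstruction assuming one "key": value pair per line, not the PR's verbatim code:

```python
import re
from collections import Counter

# Hypothetical pattern: an optional indent, a double-quoted key
# (escapes allowed), then a colon.
KEY_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*:')

def find_duplicate_json_keys(file_path):
    """Scan the raw file so duplicates are seen before json.load()
    can collapse them. Assumes each key starts its own line."""
    with open(file_path, "r") as f:
        keys = [m.group(1) for line in f if (m := KEY_RE.match(line))]
    return [key for key, n in Counter(keys).items() if n > 1]
```

On the example file this would return ['https://github.com/apache/incubator-shardingsphere'], matching the error shown above.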