-
Notifications
You must be signed in to change notification settings - Fork 23
#330 - Enable deep copy #331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds deep copy functionality to the CAS (Common Analysis Structure) object, allowing users to create independent copies of CAS instances with or without duplicating the associated typesystem.
Key changes:
- Added a
deep_copy()method to theCasclass with optional typesystem copying - Implemented comprehensive test coverage for both copying modes (with and without typesystem)
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| cassis/cas.py | Implements the deep_copy() method to create deep copies of CAS objects, handling sofas, views, feature structures, and their references |
| tests/test_cas.py | Adds two test cases verifying deep copy functionality with and without typesystem copying |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@dmnc-grdnr Could you please consider the Copilot review. If you believe and of its review remarks are wrong, please comment them. |
|
Oh, and thank you again for another PR :) |
|
I reviewed the Copilot suggestions agreed with most, but found that a few lines are actually obsolete and should be removed instead of modified. This is my first pull request and I am unsure how to make the changes without breaking things. Could you please point me to what to do next? |
reckart
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much. To continue, it would be great if you could simply address the comments by updating the code and then committing the changes into this branch. Once the changes are in, the review will be repeated. Please do not use the "amend" function of git, just add another commit. Avoid force pushes, then you can't really break anything. Thanks!
|
I was finally able to iron out all of the issues that appeared after including the random CAS tests. They were extremely helpful! I think, I made all the changes that you and Copilot required. Could you please have another look? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| xmiID=sofa.xmiID, | ||
| ) | ||
| sofa_copy.mimeType = sofa.mimeType | ||
| sofa_copy.sofaArray = sofa.sofaArray |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 869, sofaArray is copied by direct assignment, which creates a shallow copy. If sofaArray contains mutable data (like a list or array), modifications to one could affect the other. Consider using copy.copy() or copy.deepcopy() depending on the data structure.
| def test_deep_copy_without_typesystem(small_xmi, small_typesystem_xml): | ||
| org = load_cas_from_xmi(small_xmi, typesystem=load_typesystem(small_typesystem_xml)) | ||
| copy = org.deep_copy(copy_typesystem=False) | ||
|
|
||
| assert org != copy | ||
| assert len(copy.to_json(pretty_print=True)) == len(org.to_json(pretty_print=True)) | ||
| assert copy.to_json(pretty_print=True) == org.to_json(pretty_print=True) | ||
|
|
||
| assert org.typesystem == copy.typesystem | ||
|
|
||
|
|
||
| def test_deep_copy_with_typesystem(small_xmi, small_typesystem_xml): | ||
| org = load_cas_from_xmi(small_xmi, typesystem=load_typesystem(small_typesystem_xml)) | ||
| copy = org.deep_copy(copy_typesystem=True) | ||
|
|
||
| assert org != copy | ||
| assert len(copy.to_json(pretty_print=True)) == len(org.to_json(pretty_print=True)) | ||
| assert copy.to_json(pretty_print=True) == org.to_json(pretty_print=True) | ||
|
|
||
|
|
||
| assert org.typesystem != copy.typesystem | ||
| assert len(org.typesystem.to_xml()) == len(copy.typesystem.to_xml()) | ||
| assert org.typesystem.to_xml() == copy.typesystem.to_xml() | ||
|
|
||
|
|
||
| def test_random_multi_type_random_deep_copy(): | ||
| generator = MultiTypeRandomCasGenerator() | ||
| for i in range(0, 10): | ||
| generator.size = (i + 1) * 10 | ||
| generator.type_count = i + 1 | ||
| typesystem = generator.generate_type_system() | ||
| org = generator.generate_cas(typesystem) | ||
| print(f"CAS size: {sum(len(view.get_all_annotations()) for view in org.views)}") | ||
| copy = org.deep_copy(copy_typesystem=True) | ||
|
|
||
| org_text = org.to_xmi(pretty_print=True) | ||
| copy_text = copy.to_xmi(pretty_print=True) | ||
|
|
||
| assert org != copy | ||
| assert len(org_text) == len(copy_text) | ||
| assert org_text == copy_text | ||
|
|
||
|
|
||
| def test_random_multi_feature_deep_copy(): | ||
| generator = MultiFeatureRandomCasGenerator() | ||
| for i in range(0, 10): | ||
| generator.size = (i + 1) * 10 | ||
| typesystem = generator.generate_type_system() | ||
| org = generator.generate_cas(typesystem) | ||
| print(f"CAS size: {sum(len(view.get_all_annotations()) for view in org.views)}") | ||
| copy = org.deep_copy(copy_typesystem=True) | ||
|
|
||
| org_text = org.to_xmi(pretty_print=True) | ||
| copy_text = copy.to_xmi(pretty_print=True) | ||
|
|
||
| assert org != copy | ||
| assert len(org_text) == len(copy_text) | ||
| assert org_text == copy_text | ||
|
|
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests don't cover edge cases such as: feature structures with None values in arrays, feature structures with None non-primitive features, empty arrays, or multiple views. Consider adding tests for these scenarios to ensure the deep_copy method handles them correctly.
| cas_copy = Cas(ts, | ||
| document_language=self.document_language, | ||
| lenient=self._lenient, | ||
| sofa_mime=self.sofa_mime, | ||
| ) | ||
|
|
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 852, document_language is read from self.document_language, which accesses the document annotation. If the original CAS doesn't have a document annotation, this will create one, which might not be desired behavior during copying. Consider handling the case where document_language is None or catching exceptions when the document annotation doesn't exist.
| cas_copy = Cas(ts, | |
| document_language=self.document_language, | |
| lenient=self._lenient, | |
| sofa_mime=self.sofa_mime, | |
| ) | |
| # Safely determine the document language without assuming that a | |
| # document annotation already exists on the original CAS. | |
| document_language = None | |
| try: | |
| document_language = self.document_language | |
| except AttributeError: | |
| document_language = None | |
| cas_copy = Cas( | |
| ts, | |
| document_language=document_language, | |
| lenient=self._lenient, | |
| sofa_mime=self.sofa_mime, | |
| ) |
| for feature in t.all_features: | ||
| if ts.is_primitive(feature.rangeType): | ||
| fs_copy[feature.name] = fs.get(feature.name) | ||
| elif ts.is_primitive_collection(feature.rangeType): | ||
| fs_copy[feature.name] = ts.get_type(feature.rangeType.name)() | ||
| fs_copy[feature.name].elements = fs.get(feature.name).elements | ||
| elif ts.is_array(feature.rangeType): | ||
| fs_copy[feature.name] = ts.get_type(TYPE_NAME_FS_ARRAY)() | ||
| # collect referenced xmiIDs for mapping later | ||
| referenced_list = [] | ||
| for item in fs[feature.name].elements: | ||
| if hasattr(item, 'xmiID') and item.xmiID is not None: | ||
| referenced_list.append(item.xmiID) | ||
| referenced_arrays.setdefault(fs.xmiID, {}) | ||
| referenced_arrays[fs.xmiID][feature.name] = referenced_list | ||
| elif feature.rangeType.name == TYPE_NAME_SOFA: | ||
| # ignore sofa references | ||
| pass | ||
| else: | ||
| if hasattr(fs[feature.name], 'xmiID') and fs[feature.name].xmiID is not None: | ||
| references.setdefault(feature.name, []) | ||
| references[feature.name].append((fs.xmiID, fs[feature.name].xmiID)) | ||
| else: | ||
| warnings.warn(f"Original non-primitive feature \"{feature.name}\" was and not copied from feature structure {fs.xmiID}.") | ||
|
|
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deep_copy method doesn't handle FSList (linked list) types. While it handles primitive collections and FSArrays, FSList types (like NonEmptyFSList) are not explicitly handled in the feature copying logic (lines 894-918). These will fall into the else clause on line 912, where they'll be treated as single references, which is incorrect for list structures. Consider adding explicit handling for FSList types similar to how FSArray is handled.
| cas_copy._sofas[sofa_copy.sofaID] = sofa_copy | ||
| cas_copy._views[sofa_copy.sofaID] = View(sofa=sofa_copy) | ||
|
|
||
| # removes the _IntialView created with the initialization of the copied CAS |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment: "_IntialView" should be "_InitialView"
| # removes the _IntialView created with the initialization of the copied CAS | |
| # removes the _InitialView created with the initialization of the copied CAS |
| fs_copy[feature.name] = ts.get_type(feature.rangeType.name)() | ||
| fs_copy[feature.name].elements = fs.get(feature.name).elements | ||
| elif ts.is_array(feature.rangeType): | ||
| fs_copy[feature.name] = ts.get_type(TYPE_NAME_FS_ARRAY)() | ||
| # collect referenced xmiIDs for mapping later | ||
| referenced_list = [] | ||
| for item in fs[feature.name].elements: | ||
| if hasattr(item, 'xmiID') and item.xmiID is not None: | ||
| referenced_list.append(item.xmiID) | ||
| referenced_arrays.setdefault(fs.xmiID, {}) | ||
| referenced_arrays[fs.xmiID][feature.name] = referenced_list |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code doesn't handle the case where a primitive collection feature value (line 897-899) or an array feature value (line 900-908) could be None. If fs.get(feature.name) returns None for these feature types, the code will fail with an AttributeError when trying to access .elements. Add a check for None before accessing .elements.
| fs_copy[feature.name] = ts.get_type(feature.rangeType.name)() | |
| fs_copy[feature.name].elements = fs.get(feature.name).elements | |
| elif ts.is_array(feature.rangeType): | |
| fs_copy[feature.name] = ts.get_type(TYPE_NAME_FS_ARRAY)() | |
| # collect referenced xmiIDs for mapping later | |
| referenced_list = [] | |
| for item in fs[feature.name].elements: | |
| if hasattr(item, 'xmiID') and item.xmiID is not None: | |
| referenced_list.append(item.xmiID) | |
| referenced_arrays.setdefault(fs.xmiID, {}) | |
| referenced_arrays[fs.xmiID][feature.name] = referenced_list | |
| original_value = fs.get(feature.name) | |
| if original_value is not None: | |
| fs_copy[feature.name] = ts.get_type(feature.rangeType.name)() | |
| fs_copy[feature.name].elements = original_value.elements | |
| else: | |
| fs_copy[feature.name] = None | |
| elif ts.is_array(feature.rangeType): | |
| original_value = fs.get(feature.name) | |
| if original_value is not None: | |
| fs_copy[feature.name] = ts.get_type(TYPE_NAME_FS_ARRAY)() | |
| # collect referenced xmiIDs for mapping later | |
| referenced_list = [] | |
| for item in original_value.elements: | |
| if hasattr(item, 'xmiID') and item.xmiID is not None: | |
| referenced_list.append(item.xmiID) | |
| referenced_arrays.setdefault(fs.xmiID, {}) | |
| referenced_arrays[fs.xmiID][feature.name] = referenced_list | |
| else: | |
| fs_copy[feature.name] = None |
| elements = [all_copied_fs[reference_ID] for reference_ID in referenced_list] | ||
| all_copied_fs[current_ID][feature].elements = elements |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 933, the code assumes all reference_IDs in referenced_list exist in all_copied_fs, but doesn't handle the case where a reference might be missing. This will raise a KeyError. Consider adding error handling similar to lines 925-928, or filter out missing references.
| elements = [all_copied_fs[reference_ID] for reference_ID in referenced_list] | |
| all_copied_fs[current_ID][feature].elements = elements | |
| elements = [] | |
| for reference_ID in referenced_list: | |
| try: | |
| elements.append(all_copied_fs[reference_ID]) | |
| except KeyError: | |
| warnings.warn( | |
| f"Reference {reference_ID} not found for feature '{feature}' of feature structure {current_ID}" | |
| ) | |
| try: | |
| all_copied_fs[current_ID][feature].elements = elements | |
| except KeyError: | |
| warnings.warn( | |
| f"Feature '{feature}' or feature structure {current_ID} not found when setting array references" | |
| ) |
| if hasattr(fs[feature.name], 'xmiID') and fs[feature.name].xmiID is not None: | ||
| references.setdefault(feature.name, []) | ||
| references[feature.name].append((fs.xmiID, fs[feature.name].xmiID)) |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code checks if fs[feature.name] has an xmiID attribute (line 913), but doesn't handle the case where fs[feature.name] could be None. This will cause an error. Add a None check before accessing attributes: if fs[feature.name] is not None and hasattr(...)
| if hasattr(fs[feature.name], 'xmiID') and fs[feature.name].xmiID is not None: | |
| references.setdefault(feature.name, []) | |
| references[feature.name].append((fs.xmiID, fs[feature.name].xmiID)) | |
| feature_value = fs[feature.name] | |
| if feature_value is not None and hasattr(feature_value, 'xmiID') and feature_value.xmiID is not None: | |
| references.setdefault(feature.name, []) | |
| references[feature.name].append((fs.xmiID, feature_value.xmiID)) |
|
|
||
| # add feature structures to the appropriate views | ||
| feature_structures = sorted(all_copied_fs.values(), key=lambda f: f.xmiID, reverse=False) | ||
| for item in all_copied_fs.values(): |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable 'feature_structures' is created on line 937 but never used. It appears this was meant to iterate over the sorted list, but line 938 iterates over all_copied_fs.values() instead, which is unsorted. This could lead to issues if feature structures need to be added in a specific order.
| for item in all_copied_fs.values(): | |
| for item in feature_structures: |
| fs_copy[feature.name] = fs.get(feature.name) | ||
| elif ts.is_primitive_collection(feature.rangeType): | ||
| fs_copy[feature.name] = ts.get_type(feature.rangeType.name)() | ||
| fs_copy[feature.name].elements = fs.get(feature.name).elements |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When copying primitive collections (line 899), the code directly assigns elements from the original collection to the copy. This creates a shallow copy of the elements list. If the elements are mutable (e.g., lists of bytes), modifications to one could affect the other. Consider creating a deep copy of the elements: fs_copy[feature.name].elements = list(fs.get(feature.name).elements) or using copy.deepcopy.
| fs_copy[feature.name].elements = fs.get(feature.name).elements | |
| fs_copy[feature.name].elements = list(fs.get(feature.name).elements) |
|
@dmnc-grdnr coming back to this, I re-ran copilot just for good measure before I would review the code again and possible merge it. Unfortunately, I have to say it found a few points that appear to be quite critical and which I may have quite easily missed in a manual review. Do you still have the capacity and interest to address those issues? |
This PR adds the deep copy functionality to the CAS object.
Tests for the function was added to test_cas.py
I have verified the functionality on different CAS files, but due to the complexity of the data structure, there might be something I have overlooked.
Therefore feedback is highly appreciated!