Skip to content

Conversation

@kmontemayor2-sc
Copy link
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Jan 27, 2026

Scope of work done

Per discussion in #438, migrating to some get_node_ids which is more flexible and can fetch a split and optionally shard, etc.

Additionally, breaking out the dataset building utils to their own (tested) file, tests/test_assets/distributed/test_dataset.py and then creating remote_dist_dataset_test.py as unit tests for RDI.

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test_py

@kmontemayor2-sc
Copy link
Collaborator Author

/integration_test

@kmontemayor2-sc
Copy link
Collaborator Author

/e2e_test

@github-actions
Copy link
Contributor

github-actions bot commented Jan 27, 2026

GiGL Automation

@ 17:56:25UTC : 🔄 Python Unit Test started.

@ 19:10:59UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Contributor

GiGL Automation

@ 17:56:36UTC : 🔄 Integration Test started.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 27, 2026

GiGL Automation

@ 17:56:55UTC : 🔄 E2E Test started.

@ 19:17:55UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test_py

@kmontemayor2-sc
Copy link
Collaborator Author

/integration_test

@kmontemayor2-sc
Copy link
Collaborator Author

/e2e_test

@github-actions
Copy link
Contributor

github-actions bot commented Jan 28, 2026

GiGL Automation

@ 23:31:16UTC : 🔄 Python Unit Test started.

@ 24:39:48UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Contributor

GiGL Automation

@ 23:31:19UTC : 🔄 Integration Test started.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 28, 2026

GiGL Automation

@ 23:31:21UTC : 🔄 E2E Test started.

@ 24:55:47UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test_py

@kmontemayor2-sc
Copy link
Collaborator Author

/integration_test

@github-actions
Copy link
Contributor

github-actions bot commented Jan 28, 2026

GiGL Automation

@ 23:32:48UTC : 🔄 Python Unit Test started.

@ 24:51:43UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Contributor

GiGL Automation

@ 23:32:52UTC : 🔄 Integration Test started.

@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test_py

@kmontemayor2-sc
Copy link
Collaborator Author

/integration_test

@kmontemayor2-sc
Copy link
Collaborator Author

/e2e_test

@github-actions
Copy link
Contributor

github-actions bot commented Jan 29, 2026

GiGL Automation

@ 18:21:41UTC : 🔄 Python Unit Test started.

@ 19:40:00UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 29, 2026

GiGL Automation

@ 18:21:49UTC : 🔄 E2E Test started.

@ 19:43:14UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Contributor

GiGL Automation

@ 18:21:52UTC : 🔄 Integration Test started.

Copy link
Collaborator

@mkolodner-sc mkolodner-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle! Left a few small comments, generally LGTM.

In the future, it might be easier to review this if the testing utility changes were moved to a separate PR from the get_node_ids change here.

# =============================================================================

USER: Final[NodeType] = NodeType("user")
STORY: Final[NodeType] = NodeType("story")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there was a comment prior that we should prefer item type to story to be more generic?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

man I can update but is there really no place for fun and whimsy in the world...

def create_homogeneous_dataset(
edge_index: torch.Tensor,
node_features: Optional[torch.Tensor] = None,
node_feature_dim: int = DEFAULT_HOMOGENEOUS_NODE_FEATURE_DIM,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should node feature dim be inferred from node features?

I see that right now, if it's None, we are creating an empty tensor. I feel like if it's None, this should mean my dataset has no NodeFeatures, and if we want to indicate there's no node features on the machine, an empty tensor should be provided as input

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do you think it's worth adding an arg here for edge features while we are making this, as well as node_labels to be similar to the below function?

node_features: Mapping of NodeType -> feature tensor [num_nodes, feature_dim].
If None, creates zero features with the specified dimension.
node_labels: Mapping of NodeType -> label tensor [num_nodes, 1].
If None, creates labels equal to node indices.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reusability in other tests, I feel like the None behavior again should be to default to no labels/features, not to auto-populate them.

return dataset


def create_heterogeneous_dataset_with_labels(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth specifying in function header that this is ABLP labels, not node classification

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants