-
Notifications
You must be signed in to change notification settings - Fork 3
Add comprehensive property graph documentation #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Created 5 comprehensive documentation files to help users understand the iSamples property graph structure: 1. UNDERSTANDING_THE_GRAPH.md (13K words) - Foundation document explaining the 8 entity types - Details on the 14 relationship types (predicates) - The 14 sentence types as the "grammar" of iSamples - Graph traversal patterns and design rationale - Storage format explanation 2. PREDICATES_REFERENCE.md (10K words) - Detailed reference for each of the 14 predicates - YAML usage examples for each predicate - SQL query patterns for common operations - OpenContext data statistics showing actual usage - Common issues and solutions - Cross-domain usage comparison 3. EXAMPLES_BY_DOMAIN.md (12K words) - Complete real-world examples from 3 scientific domains - Archaeology: Pottery sherd from Çatalhöyük (OpenContext) - Geology: Basalt core from mid-ocean ridge (SESAR) - Biology: Coral tissue sample (GEOME) - Full YAML examples (500+ lines each) - Domain-specific patterns and best practices - Cross-domain comparison tables 4. QUERYING_THE_GRAPH.md (15K words) - Practical SQL query patterns for DuckDB - Basic entity queries and single-hop traversals - Multi-hop traversal patterns (2-hop, 3-hop) - Aggregation and statistics queries - Filtering and search patterns - Complex query patterns (spatial, hierarchical) - Performance optimization techniques - Common query recipes (export, validation, GeoJSON) 5. EDGE_TYPES_VISUAL.md (9K words) - Mermaid diagrams showing entity relationships - Complete ERD of all 8 entity types and 14 edge types - Edge type matrix and connectivity heatmaps - Sample-centric and event-centric views - Graph traversal examples with path visualizations - Storage structure diagrams - Predicate usage patterns from real data - Cross-domain comparison charts These documents address the challenge of discovering and understanding the major structures in PQG files by making the "14 sentence types" (the underlying grammar) explicit and accessible. Each document cross-references the others for comprehensive coverage, and all include real SQL examples, YAML snippets, and visualizations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive documentation for the iSamples property graph (PQG) data format, making the underlying structure explicit and accessible. The documentation introduces the "14 sentence types" that form the complete grammar of iSamples metadata, along with the 8 entity types that compose the graph.
Key additions:
- Foundation document explaining graph structure and the 14 relationship types
- Detailed reference guide for each predicate with SQL and YAML examples
- Real-world examples across three scientific domains (archaeology, geology, biology)
- Practical SQL query patterns and optimization techniques
- Visual diagrams showing entity relationships and graph traversal patterns
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/docs/UNDERSTANDING_THE_GRAPH.md | Introduces the 8 entity types, 14 predicates, and explains the property graph model with traversal patterns and storage format |
| src/docs/PREDICATES_REFERENCE.md | Comprehensive reference for all 14 predicates including usage examples, SQL patterns, OpenContext statistics, and cross-domain comparison |
| src/docs/EXAMPLES_BY_DOMAIN.md | Complete YAML examples from archaeology (OpenContext), geology (SESAR), and biology (GEOME) demonstrating domain-agnostic design |
| src/docs/QUERYING_THE_GRAPH.md | Practical SQL guide with query patterns for DuckDB, including basic to complex traversals, aggregations, and 10+ copy-paste recipes |
| src/docs/EDGE_TYPES_VISUAL.md | Visual guide with Mermaid diagrams showing entity relationships, connectivity matrices, traversal paths, and usage heatmaps |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| --- | ||
|
|
||
| **Document Version:** 1.0 | ||
| **Last Updated:** 2025-11-14 |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.
| **Last Updated:** 2025-11-14 | |
| **Last Updated:** 2024-11-14 |
|
|
||
| This table shows which entity types (subjects) connect to which entity types (objects) via which predicates. | ||
|
|
||
| | **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** | |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent capitalization in markdown table header. The header "Multivalued" should match the style of other headers. Consider using "Multi-valued" for consistency with hyphenated compound adjectives elsewhere in the documentation.
| | **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** | | |
| | **Subject Type** | **Predicate** | **Object Type** | **Multi-valued** | **Required** | |
src/docs/PREDICATES_REFERENCE.md
Outdated
| | [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Material type | | ||
| | [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Domain context | | ||
| | [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Physical form | |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation inconsistency: The table lists has_material_category, has_context_category, and has_sample_object_type as "required" with checkmarks (✅ Yes), but their cardinality is listed as "Many" rather than a minimum requirement. According to line 126-127, has_material_category is "required, minimum 1", which should be more clearly indicated. Consider adding a column for minimum cardinality or clarifying in the "Cardinality" column (e.g., "Many (≥1)").
| | [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Material type | | |
| | [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Domain context | | |
| | [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Physical form | | |
| | [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many (≥1) | ✅ Yes | Material type | | |
| | [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many (≥1) | ✅ Yes | Domain context | | |
| | [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many (≥1) | ✅ Yes | Physical form | |
|
|
||
| --- | ||
|
|
||
| **Last updated:** 2025-11-14 |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.
| **Last updated:** 2025-11-14 | |
| **Last updated:** 2024-11-14 |
| 2. **Edge rows** have `otype = '_edge_'` | ||
| 3. **Edge `s` field** points to subject entity's `row_id` | ||
| 4. **Edge `p` field** contains the predicate name (e.g., `produced_by`) | ||
| 5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued) |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in the comment. "Multivalued" should be "Multi-valued" to match the hyphenated form used elsewhere in the documentation for this compound adjective.
| 5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued) | |
| 5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multi-valued) |
| # Edges (multivalued - can have multiple material types) | ||
| edge_001: | ||
| s: sample_001 | ||
| p: has_material_category | ||
| o: [concept_earthenware] | ||
|
|
||
| edge_002: | ||
| s: sample_001 | ||
| p: has_material_category | ||
| o: [concept_anthropogenic] | ||
| ``` |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Documentation clarity: The comment "# Edges (multivalued - can have multiple material types)" on line 174 is misleading. While edges can be multivalued, this specific comment appears in a YAML structure where each edge only connects to one concept. The multivalued nature means there can be multiple separate edges with the same predicate, not that a single edge has multiple targets. Consider clarifying: "# Edges (can have multiple edges with same predicate for different material types)"
| | **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** | | ||
| |------------------|---------------|-----------------|-----------------|--------------| | ||
| | MaterialSampleRecord | `produced_by` | SamplingEvent | No | Yes | | ||
| | MaterialSampleRecord | `has_material_category` | IdentifiedConcept | Yes | No | |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent table formatting: The table on line 82 has "No" for multivalued, but the description on line 126 says "Many (required, minimum 1)". The table should indicate "Yes" for multivalued since a sample can have multiple material categories. This is inconsistent with the actual behavior described in the detailed section.
| ```sql | ||
| -- Create GeoJSON for web mapping | ||
| SELECT json_object( | ||
| 'type', 'FeatureCollection', | ||
| 'features', json_group_array( | ||
| json_object( | ||
| 'type', 'Feature', | ||
| 'geometry', json_object( | ||
| 'type', 'Point', | ||
| 'coordinates', json_array(coords.longitude, coords.latitude) | ||
| ), | ||
| 'properties', json_object( | ||
| 'id', sample.pid, | ||
| 'label', sample.label, | ||
| 'material', material.label | ||
| ) | ||
| ) | ||
| ) | ||
| ) AS geojson | ||
| FROM pqg AS sample | ||
| JOIN pqg AS mat_edge ON mat_edge.s = sample.row_id AND mat_edge.p = 'has_material_category' | ||
| JOIN pqg AS material ON material.row_id = ANY(mat_edge.o) | ||
| JOIN pqg AS event_edge ON event_edge.s = sample.row_id AND event_edge.p = 'produced_by' | ||
| JOIN pqg AS event ON event.row_id = ANY(event_edge.o) | ||
| JOIN pqg AS coord_edge ON coord_edge.s = event.row_id AND coord_edge.p = 'sample_location' | ||
| JOIN pqg AS coords ON coords.row_id = ANY(coord_edge.o) | ||
| WHERE sample.otype = 'MaterialSampleRecord' | ||
| AND coords.latitude IS NOT NULL | ||
| AND coords.longitude IS NOT NULL | ||
| LIMIT 1000; | ||
| ``` |
Copilot
AI
Nov 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL syntax warning: The query uses json_group_array() which is SQLite-specific syntax. Since the document states "All queries are designed for DuckDB" (line 3), this should use DuckDB's JSON functions instead. DuckDB uses different JSON aggregation functions like list() or array aggregation with to_json(). Consider updating this example to use DuckDB-compatible syntax or noting that this specific example requires SQLite.
- Add Quick Start section for different user types - Add Model at a Glance summary (8 types, 14 predicates) - Add Related Repositories table - Add Data Access section with R2 URL - Part of MVP cleanup strategy (issue isamplesorg#49) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Addresses all issues from Codex code review: ## SQL Fixes (QUERYING_THE_GRAPH.md) - `event.event_date` → `event.result_time` (schema field name) - `agent.label` → `agent.name` (Agent uses 'name' not 'label') - Fixed hierarchical query: `relation.relationship_type` → `relation.relationship` - Fixed partial index syntax (DuckDB doesn't support WHERE in CREATE INDEX) - Clarified column `n` as "Named graph / source identifier" ## Required vs Strongly Recommended - Changed "Required: ✅ Yes" to "🔶 Strongly Recommended" for 4 key predicates - Added note: LinkML schema only requires pid, label, last_modified_time - These predicates are essential for interoperability but not schema-mandated ## is_part_of Predicate - Added notes explaining exclusion from "14 predicates" count - is_part_of is for site containment, not sample description ## Consistency Fixes - EDGE_TYPES_VISUAL.md: "3 relationship types" → "4 relationship types" - EXAMPLES_BY_DOMAIN.md: "Marine > Submerged terrestrial" → "Marine water body" - README.md: Fixed empty Quarto links, clarified repo name (isamples-python → examples) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Schema/PQG Alignment QuestionsAfter multiple rounds of Codex review, we've fixed most documentation issues. However, two questions require group discussion to ensure perfect alignment between the LinkML schema and PQG implementation: 1. "Required" vs "Strongly Recommended" PredicatesCurrent state: The LinkML schema ( However, the documentation states that these 4 predicates are "required/MUST":
Question: Should we:
2.
|
**SCHEMA CHANGES (isamples_core.yaml):** 1. **Mark 4 predicates as recommended on MaterialSampleRecord:** - `produced_by` - essential for provenance - `has_context_category` - domain context - `has_material_category` - material classification - `has_sample_object_type` - physical form These are now formally `recommended: true` in slot_usage, not just documentation guidance. 2. **Add `pid` slot to two entity types:** - `GeospatialCoordLocation` - for consistent entity identification - `SampleRelation` - for referencing relationship nodes PQG implementations assign identifiers to all entity nodes; schema now reflects this practical reality. **DOCUMENTATION UPDATES:** - Updated "strongly recommended" → "recommended (marked in schema)" - Added schema alignment notes with datestamp (2026-01-29) - Updated version to 20260129 **Rationale:** Following established principle of updating schema to match practical reality rather than constraining documentation to match a minimal schema. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
MaterialSampleRecord |
|
personally I'd prefer to make at least produced_by, has_material_category, |
|
pid for GeospatialCoordLocation and SampleRelation |
Response to Stephen's FeedbackThanks @smrgeoinfo for the thoughtful responses - they've helped clarify my thinking significantly. My Original MisunderstandingI came into this PR with the assumption that the LinkML schema and PQG parquet should be perfectly aligned - that if PQG wide has But Stephen's comment crystallized something important:
The Realization: One Conceptual Model, Multiple Valid SerializationsThe LinkML schema and PQG parquet aren't meant to be 1:1 mirrors - they're different valid serializations of the same conceptual model:
In JSON, when you nest a GeospatialCoordLocation inside a SamplingEvent, the nesting is the relationship - no These aren't in conflict - they're complementary representations with different structural requirements. Proposed Framing for the TeamGiven this, I'd reframe the original questions: On
On "required" predicates:
Broader question for the team:
Action ItemsI'll revert my recent commit that added For the "required" predicates question, I'll wait for team consensus on which predicates to mark as required before making schema changes. Thanks again for the clarifying feedback! |
This reverts commit af42400.
|
Related: Created isamplesorg/pqg#16 to track the question of whether we should define a formal PQG Parquet Schema specification. This addresses the broader architectural question raised in this discussion: if LinkML is the JSON serialization spec, do we need a separate spec for parquet-specific conventions (row_id, edge structure, p__* columns, etc.)? |
Summary
This PR adds comprehensive documentation for understanding and working with iSamples property graph (PQG) data. Created in response to the challenge of discovering the major structures within PQG files - particularly the "14 sentence types" that form the underlying grammar of iSamples metadata.
What's Included
Five interconnected documentation files totaling 59K+ words:
1. UNDERSTANDING_THE_GRAPH.md (Foundation)
2. PREDICATES_REFERENCE.md (Detailed Reference)
3. EXAMPLES_BY_DOMAIN.md (Real-World Examples)
4. QUERYING_THE_GRAPH.md (Practical SQL Guide)
5. EDGE_TYPES_VISUAL.md (Visual Guide)
Key Features
✅ Cross-referenced - Each document links to related sections
✅ Real examples - SQL queries tested on actual OpenContext data
✅ Multi-domain - Demonstrates archaeology, geology, and biology usage
✅ Visual - Mermaid diagrams for complex relationships
✅ Practical - Copy-paste query recipes for immediate use
✅ Complete - Covers all 8 entity types and 14 edge types
Why This Matters
The iSamples property graph format is powerful but complex. These docs make the underlying structure explicit and accessible:
Testing
Files Changed
Total: 5 new files, 3,914 lines
Related Work
This documentation builds on recent work:
oc_parquet_analysis_enhanced.ipynbsrc/schemas/isamples_core.yamlQuestions for Discussion
src/docs/the right place, or should these go elsewhere?Looking forward to feedback! 🙏
🤖 Generated with Claude Code