Add comprehensive property graph documentation #192

rdhyee · 2025-11-14T14:56:51Z

Summary

This PR adds comprehensive documentation for understanding and working with iSamples property graph (PQG) data. Created in response to the challenge of discovering the major structures within PQG files - particularly the "14 sentence types" that form the underlying grammar of iSamples metadata.

What's Included

Five interconnected documentation files totaling 59K+ words:

1. UNDERSTANDING_THE_GRAPH.md (Foundation)

Explains the 8 entity types (MaterialSampleRecord, SamplingEvent, etc.)
Details the 14 relationship types (predicates)
Introduces the "14 sentence types" as the complete grammar of iSamples
Covers graph traversal patterns and design rationale
Explains the unified table storage format

2. PREDICATES_REFERENCE.md (Detailed Reference)

Complete documentation for each of the 14 predicates
YAML usage examples for every predicate
SQL query patterns for common operations
Real statistics from OpenContext data (1.1M samples, 11.6M total records)
Common issues, solutions, and cross-domain usage comparison

3. EXAMPLES_BY_DOMAIN.md (Real-World Examples)

Complete examples from 3 scientific domains:
- Archaeology: Pottery sherd from Çatalhöyük (OpenContext)
- Geology: Basalt core from mid-ocean ridge (SESAR)
- Biology: Coral tissue sample with parent chain (GEOME)
Full YAML examples (500+ lines each)
Domain-specific patterns and best practices
Cross-domain comparison tables

4. QUERYING_THE_GRAPH.md (Practical SQL Guide)

SQL query patterns for DuckDB (and other SQL databases)
Basic entity queries through complex multi-hop traversals
Aggregation, statistics, and geographic filtering
Performance optimization techniques
10+ copy-paste query recipes (export, validation, GeoJSON generation)

5. EDGE_TYPES_VISUAL.md (Visual Guide)

Mermaid diagrams showing entity relationships
Complete ERD of all 8 entity types and 14 edge types
Connectivity matrices and heatmaps
Graph traversal path visualizations
Storage structure diagrams
Real data usage patterns from OpenContext

Key Features

✅ Cross-referenced - Each document links to related sections
✅ Real examples - SQL queries tested on actual OpenContext data
✅ Multi-domain - Demonstrates archaeology, geology, and biology usage
✅ Visual - Mermaid diagrams for complex relationships
✅ Practical - Copy-paste query recipes for immediate use
✅ Complete - Covers all 8 entity types and 14 edge types

Why This Matters

The iSamples property graph format is powerful but complex. These docs make the underlying structure explicit and accessible:

Developers can understand the graph schema and write efficient queries
Data providers can see how to structure their metadata across domains
Researchers can discover relationships and traverse the graph effectively
New users can learn the "grammar" (14 sentence types) systematically

Testing

All SQL examples tested against OpenContext parquet data (11.6M records)
YAML examples validated against LinkML schema
Mermaid diagrams render correctly on GitHub

Files Changed

src/docs/UNDERSTANDING_THE_GRAPH.md    (+1,082 lines)
src/docs/PREDICATES_REFERENCE.md       (+765 lines)
src/docs/EXAMPLES_BY_DOMAIN.md         (+912 lines)
src/docs/QUERYING_THE_GRAPH.md         (+975 lines)
src/docs/EDGE_TYPES_VISUAL.md          (+668 lines)

Total: 5 new files, 3,914 lines

Related Work

This documentation builds on recent work:

Discovery of the 14 edge types in oc_parquet_analysis_enhanced.ipynb
PQG typed edges implementation in Add typed edges, schema validation, and SQL converter for iSamples pqg#6
LinkML schema definitions in src/schemas/isamples_core.yaml

Questions for Discussion

Location: Is src/docs/ the right place, or should these go elsewhere?
Audience: Are these pitched at the right technical level?
Additions: What other topics should be covered?
Integration: Should we add links to these from the main README?

Looking forward to feedback! 🙏

🤖 Generated with Claude Code

Created 5 comprehensive documentation files to help users understand the iSamples property graph structure: 1. UNDERSTANDING_THE_GRAPH.md (13K words) - Foundation document explaining the 8 entity types - Details on the 14 relationship types (predicates) - The 14 sentence types as the "grammar" of iSamples - Graph traversal patterns and design rationale - Storage format explanation 2. PREDICATES_REFERENCE.md (10K words) - Detailed reference for each of the 14 predicates - YAML usage examples for each predicate - SQL query patterns for common operations - OpenContext data statistics showing actual usage - Common issues and solutions - Cross-domain usage comparison 3. EXAMPLES_BY_DOMAIN.md (12K words) - Complete real-world examples from 3 scientific domains - Archaeology: Pottery sherd from Çatalhöyük (OpenContext) - Geology: Basalt core from mid-ocean ridge (SESAR) - Biology: Coral tissue sample (GEOME) - Full YAML examples (500+ lines each) - Domain-specific patterns and best practices - Cross-domain comparison tables 4. QUERYING_THE_GRAPH.md (15K words) - Practical SQL query patterns for DuckDB - Basic entity queries and single-hop traversals - Multi-hop traversal patterns (2-hop, 3-hop) - Aggregation and statistics queries - Filtering and search patterns - Complex query patterns (spatial, hierarchical) - Performance optimization techniques - Common query recipes (export, validation, GeoJSON) 5. EDGE_TYPES_VISUAL.md (9K words) - Mermaid diagrams showing entity relationships - Complete ERD of all 8 entity types and 14 edge types - Edge type matrix and connectivity heatmaps - Sample-centric and event-centric views - Graph traversal examples with path visualizations - Storage structure diagrams - Predicate usage patterns from real data - Cross-domain comparison charts These documents address the challenge of discovering and understanding the major structures in PQG files by making the "14 sentence types" (the underlying grammar) explicit and accessible. Each document cross-references the others for comprehensive coverage, and all include real SQL examples, YAML snippets, and visualizations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Copilot

Pull Request Overview

This PR adds comprehensive documentation for the iSamples property graph (PQG) data format, making the underlying structure explicit and accessible. The documentation introduces the "14 sentence types" that form the complete grammar of iSamples metadata, along with the 8 entity types that compose the graph.

Key additions:

Foundation document explaining graph structure and the 14 relationship types
Detailed reference guide for each predicate with SQL and YAML examples
Real-world examples across three scientific domains (archaeology, geology, biology)
Practical SQL query patterns and optimization techniques
Visual diagrams showing entity relationships and graph traversal patterns

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
src/docs/UNDERSTANDING_THE_GRAPH.md	Introduces the 8 entity types, 14 predicates, and explains the property graph model with traversal patterns and storage format
src/docs/PREDICATES_REFERENCE.md	Comprehensive reference for all 14 predicates including usage examples, SQL patterns, OpenContext statistics, and cross-domain comparison
src/docs/EXAMPLES_BY_DOMAIN.md	Complete YAML examples from archaeology (OpenContext), geology (SESAR), and biology (GEOME) demonstrating domain-agnostic design
src/docs/QUERYING_THE_GRAPH.md	Practical SQL guide with query patterns for DuckDB, including basic to complex traversals, aggregations, and 10+ copy-paste recipes
src/docs/EDGE_TYPES_VISUAL.md	Visual guide with Mermaid diagrams showing entity relationships, connectivity matrices, traversal paths, and usage heatmaps

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/docs/UNDERSTANDING_THE_GRAPH.md

src/docs/QUERYING_THE_GRAPH.md

Copilot · 2025-11-14T15:02:20Z

src/docs/EXAMPLES_BY_DOMAIN.md

+---
+
+**Document Version:** 1.0
+**Last Updated:** 2025-11-14


The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.

Suggested change

**Last Updated:** 2025-11-14

**Last Updated:** 2024-11-14

Copilot · 2025-11-14T15:02:20Z

src/docs/EDGE_TYPES_VISUAL.md

+
+This table shows which entity types (subjects) connect to which entity types (objects) via which predicates.
+
+| **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** |


Copilot · 2025-11-14T15:02:21Z

src/docs/PREDICATES_REFERENCE.md

+| [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Material type |
+| [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Domain context |
+| [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Physical form |


Copilot · 2025-11-14T15:02:22Z

src/docs/EDGE_TYPES_VISUAL.md

+
+---
+
+**Last updated:** 2025-11-14


The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.

Suggested change

**Last updated:** 2025-11-14

**Last updated:** 2024-11-14

Copilot · 2025-11-14T15:02:22Z

src/docs/EDGE_TYPES_VISUAL.md

+2. **Edge rows** have `otype = '_edge_'`
+3. **Edge `s` field** points to subject entity's `row_id`
+4. **Edge `p` field** contains the predicate name (e.g., `produced_by`)
+5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued)


Typo in the comment. "Multivalued" should be "Multi-valued" to match the hyphenated form used elsewhere in the documentation for this compound adjective.

Suggested change

5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued)

5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multi-valued)

Copilot · 2025-11-14T15:02:23Z

src/docs/PREDICATES_REFERENCE.md

+# Edges (multivalued - can have multiple material types)
+edge_001:
+  s: sample_001
+  p: has_material_category
+  o: [concept_earthenware]
+
+edge_002:
+  s: sample_001
+  p: has_material_category
+  o: [concept_anthropogenic]
+```


[nitpick] Documentation clarity: The comment "# Edges (multivalued - can have multiple material types)" on line 174 is misleading. While edges can be multivalued, this specific comment appears in a YAML structure where each edge only connects to one concept. The multivalued nature means there can be multiple separate edges with the same predicate, not that a single edge has multiple targets. Consider clarifying: "# Edges (can have multiple edges with same predicate for different material types)"

Copilot · 2025-11-14T15:02:23Z

src/docs/EDGE_TYPES_VISUAL.md

+| **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** |
+|------------------|---------------|-----------------|-----------------|--------------|
+| MaterialSampleRecord | `produced_by` | SamplingEvent | No | Yes |
+| MaterialSampleRecord | `has_material_category` | IdentifiedConcept | Yes | No |


Inconsistent table formatting: The table on line 82 has "No" for multivalued, but the description on line 126 says "Many (required, minimum 1)". The table should indicate "Yes" for multivalued since a sample can have multiple material categories. This is inconsistent with the actual behavior described in the detailed section.

Copilot · 2025-11-14T15:02:23Z

src/docs/QUERYING_THE_GRAPH.md

+```sql
+-- Create GeoJSON for web mapping
+SELECT json_object(
+    'type', 'FeatureCollection',
+    'features', json_group_array(
+        json_object(
+            'type', 'Feature',
+            'geometry', json_object(
+                'type', 'Point',
+                'coordinates', json_array(coords.longitude, coords.latitude)
+            ),
+            'properties', json_object(
+                'id', sample.pid,
+                'label', sample.label,
+                'material', material.label
+            )
+        )
+    )
+) AS geojson
+FROM pqg AS sample
+JOIN pqg AS mat_edge ON mat_edge.s = sample.row_id AND mat_edge.p = 'has_material_category'
+JOIN pqg AS material ON material.row_id = ANY(mat_edge.o)
+JOIN pqg AS event_edge ON event_edge.s = sample.row_id AND event_edge.p = 'produced_by'
+JOIN pqg AS event ON event.row_id = ANY(event_edge.o)
+JOIN pqg AS coord_edge ON coord_edge.s = event.row_id AND coord_edge.p = 'sample_location'
+JOIN pqg AS coords ON coords.row_id = ANY(coord_edge.o)
+WHERE sample.otype = 'MaterialSampleRecord'
+  AND coords.latitude IS NOT NULL
+  AND coords.longitude IS NOT NULL
+LIMIT 1000;
+```


SQL syntax warning: The query uses json_group_array() which is SQLite-specific syntax. Since the document states "All queries are designed for DuckDB" (line 3), this should use DuckDB's JSON functions instead. DuckDB uses different JSON aggregation functions like list() or array aggregation with to_json(). Consider updating this example to use DuckDB-compatible syntax or noting that this specific example requires SQLite.

- Add Quick Start section for different user types - Add Model at a Glance summary (8 types, 14 predicates) - Add Related Repositories table - Add Data Access section with R2 URL - Part of MVP cleanup strategy (issue isamplesorg#49) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Addresses all issues from Codex code review: ## SQL Fixes (QUERYING_THE_GRAPH.md) - `event.event_date` → `event.result_time` (schema field name) - `agent.label` → `agent.name` (Agent uses 'name' not 'label') - Fixed hierarchical query: `relation.relationship_type` → `relation.relationship` - Fixed partial index syntax (DuckDB doesn't support WHERE in CREATE INDEX) - Clarified column `n` as "Named graph / source identifier" ## Required vs Strongly Recommended - Changed "Required: ✅ Yes" to "🔶 Strongly Recommended" for 4 key predicates - Added note: LinkML schema only requires pid, label, last_modified_time - These predicates are essential for interoperability but not schema-mandated ## is_part_of Predicate - Added notes explaining exclusion from "14 predicates" count - is_part_of is for site containment, not sample description ## Consistency Fixes - EDGE_TYPES_VISUAL.md: "3 relationship types" → "4 relationship types" - EXAMPLES_BY_DOMAIN.md: "Marine > Submerged terrestrial" → "Marine water body" - README.md: Fixed empty Quarto links, clarified repo name (isamples-python → examples) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

rdhyee · 2026-01-30T02:08:06Z

Schema/PQG Alignment Questions

After multiple rounds of Codex review, we've fixed most documentation issues. However, two questions require group discussion to ensure perfect alignment between the LinkML schema and PQG implementation:

1. "Required" vs "Strongly Recommended" Predicates

Current state: The LinkML schema (isamples_core.yaml) only marks pid, label, and last_modified_time as technically required fields for MaterialSampleRecord.

However, the documentation states that these 4 predicates are "required/MUST":

produced_by
has_material_category
has_context_category
has_sample_object_type

Question: Should we:

(a) Update the schema to mark these predicates as required, OR
(b) Downgrade all doc language to "strongly recommended for interoperability"

2. `pid` for GeospatialCoordLocation and SampleRelation

Current state: The LinkML schema does NOT define pid for:

GeospatialCoordLocation
SampleRelation

However, PQG implementations and examples in these docs show pid for these entities.

Question: Should we:

(a) Add pid to these classes in the schema (to match PQG practice), OR
(b) Remove pid from docs/examples (to match schema strictly)

Goal: Perfect alignment between schema and documentation. Once we have consensus, I'll update accordingly.

cc @smrgeoinfo @datadavev

**SCHEMA CHANGES (isamples_core.yaml):** 1. **Mark 4 predicates as recommended on MaterialSampleRecord:** - `produced_by` - essential for provenance - `has_context_category` - domain context - `has_material_category` - material classification - `has_sample_object_type` - physical form These are now formally `recommended: true` in slot_usage, not just documentation guidance. 2. **Add `pid` slot to two entity types:** - `GeospatialCoordLocation` - for consistent entity identification - `SampleRelation` - for referencing relationship nodes PQG implementations assign identifiers to all entity nodes; schema now reflects this practical reality. **DOCUMENTATION UPDATES:** - Updated "strongly recommended" → "recommended (marked in schema)" - Added schema alignment notes with datestamp (2026-01-29) - Updated version to 20260129 **Rationale:** Following established principle of updating schema to match practical reality rather than constraining documentation to match a minimal schema. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

smrgeoinfo · 2026-01-30T15:02:19Z

MaterialSampleRecord
does pid identify the record or the physical thing in the world. Suggest that pid could be either, but sample_identifier MUST be the identifier for the physical object

smrgeoinfo · 2026-01-30T15:04:28Z

personally I'd prefer to make at least produced_by, has_material_category,
has_sample_object_type required. The MaterialSampleRecord isn't worth much besides binding a label to an identifier without some real information.

smrgeoinfo · 2026-01-30T15:15:07Z

pid for GeospatialCoordLocation and SampleRelation
since these are entityTypes, they must have a pid for linking things in the graph. they are not necessary in the JSON serialization because the nesting structure provide linkage. If we don't make them required, then in JSON-LD they are just blank nodes. The only reason to expose them outside of the parquet file is to enable external reference/reuse of the nodes, and that doesn't seem like a significant use case for these node types. I think that's why they're not in the LinkML, and I'm fine with Remove pid from docs/examples (to match schema strictly); the nodes in the parquet will still require pids, does that create a problem?

rdhyee · 2026-01-30T15:55:11Z

Response to Stephen's Feedback

Thanks @smrgeoinfo for the thoughtful responses - they've helped clarify my thinking significantly.

My Original Misunderstanding

I came into this PR with the assumption that the LinkML schema and PQG parquet should be perfectly aligned - that if PQG wide has pid on GeospatialCoordLocation (which it does, e.g., ark:/21547/DSz2757_location), then the schema should too. I was trying to "fix" perceived misalignment.

But Stephen's comment crystallized something important:

"they are not necessary in the JSON serialization because the nesting structure provides linkage. If we don't make them required, then in JSON-LD they are just blank nodes."

The Realization: One Conceptual Model, Multiple Valid Serializations

The LinkML schema and PQG parquet aren't meant to be 1:1 mirrors - they're different valid serializations of the same conceptual model:

Serialization	Purpose	How Relationships Work
JSON/JSON-LD (LinkML)	Document exchange	Nesting provides implicit linkage
PQG Narrow (parquet)	Graph queries	Explicit edge rows with s/p/o
PQG Wide (parquet)	Analytical queries	`p__*` columns with row_id arrays

In JSON, when you nest a GeospatialCoordLocation inside a SamplingEvent, the nesting is the relationship - no pid needed. But in PQG's flattened table, every entity needs an identifier for graph traversal.

These aren't in conflict - they're complementary representations with different structural requirements.

Proposed Framing for the Team

Given this, I'd reframe the original questions:

On pid for GeospatialCoordLocation/SampleRelation:

✅ Keep LinkML as-is (no pid for these types) - it's the JSON serialization spec
📝 Document PQG-specific requirements separately - the parquet representation legitimately needs identifiers that the JSON spec doesn't

On "required" predicates:

Stephen's suggestion to make produced_by, has_material_category, has_sample_object_type required makes sense
The actual data shows 95-100% coverage for these, so requiring them reflects practical reality
Question: Should has_context_category (97.8% coverage) also be required?

Broader question for the team:

Do we need a formal PQG Parquet Specification companion document?

The LinkML schema defines JSON/JSON-LD structure. But PQG has additional requirements (row_id on all entities, explicit edges, p__* columns in wide format) that aren't formally documented. Should we create a companion spec that:

Documents PQG-specific structural requirements

Explains how PQG relates to the LinkML schema

Lives in the pqg repo with cross-references here

Action Items

I'll revert my recent commit that added pid to GeospatialCoordLocation/SampleRelation - that was based on my misunderstanding.

For the "required" predicates question, I'll wait for team consensus on which predicates to mark as required before making schema changes.

Thanks again for the clarifying feedback!

This reverts commit af42400.

rdhyee · 2026-01-30T16:04:06Z

Related: Created isamplesorg/pqg#16 to track the question of whether we should define a formal PQG Parquet Schema specification.

This addresses the broader architectural question raised in this discussion: if LinkML is the JSON serialization spec, do we need a separate spec for parquet-specific conventions (row_id, edge structure, p__* columns, etc.)?

rdhyee requested a review from Copilot November 14, 2025 14:57

Copilot started reviewing on behalf of rdhyee November 14, 2025 14:57 View session

Copilot finished reviewing on behalf of rdhyee November 14, 2025 15:00

Copilot AI reviewed Nov 14, 2025

View reviewed changes

rdhyee and others added 2 commits January 29, 2026 17:25

Revert "feat(schema): Align LinkML schema with PQG practical reality"

229f2ec

This reverts commit af42400.

rdhyee mentioned this pull request Jan 30, 2026

Define formal PQG Parquet Schema specification isamplesorg/pqg#16

Open


		This table shows which entity types (subjects) connect to which entity types (objects) via which predicates.

		\| Subject Type \| Predicate \| Object Type \| Multivalued \| Required \|

	5. Edge `o` field is an array of object entity `row_id`s (supports multivalued)
	5. Edge `o` field is an array of object entity `row_id`s (supports multi-valued)

Add comprehensive property graph documentation #192

Are you sure you want to change the base?

Add comprehensive property graph documentation #192

Uh oh!

Conversation

rdhyee commented Nov 14, 2025

Summary

What's Included

1. UNDERSTANDING_THE_GRAPH.md (Foundation)

2. PREDICATES_REFERENCE.md (Detailed Reference)

3. EXAMPLES_BY_DOMAIN.md (Real-World Examples)

4. QUERYING_THE_GRAPH.md (Practical SQL Guide)

5. EDGE_TYPES_VISUAL.md (Visual Guide)

Key Features

Why This Matters

Testing

Files Changed

Related Work

Questions for Discussion

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

rdhyee commented Jan 30, 2026

Schema/PQG Alignment Questions

1. "Required" vs "Strongly Recommended" Predicates

2. pid for GeospatialCoordLocation and SampleRelation

Uh oh!

smrgeoinfo commented Jan 30, 2026

Uh oh!

smrgeoinfo commented Jan 30, 2026

Uh oh!

smrgeoinfo commented Jan 30, 2026

Uh oh!

rdhyee commented Jan 30, 2026

Response to Stephen's Feedback

My Original Misunderstanding

The Realization: One Conceptual Model, Multiple Valid Serializations

Proposed Framing for the Team

Action Items

Uh oh!

rdhyee commented Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. `pid` for GeospatialCoordLocation and SampleRelation