
Graph OLAP Platform - Design Requirements

A platform for customer analytics workloads that lets analysts create ad-hoc graph instances from Starburst SQL queries, run graph/network algorithms, and share their work.

User Interface: The platform is accessed exclusively via the Python SDK in Jupyter notebook environments. There is no web interface.

Scale: Tens of analysts, potentially hundreds of concurrent instances, ≤2GB graphs, <24hr typical lifespan


| Term | Definition |
| --- | --- |
| Mapping | A user-created configuration that defines how to structure graph data from Starburst SQL queries. Contains node definitions (SQL + schema) and edge definitions (SQL + relationships). This is a resource, not a process. |
| Snapshot | A point-in-time export of data based on a Mapping. Created by running the Mapping's SQL queries against Starburst and storing results as Parquet files in GCS. |
| Instance | A running Ryugraph database loaded with data from a Snapshot. Provides query and algorithm execution capabilities. |

| Term | Definition |
| --- | --- |
| Export | The process of running Starburst UNLOAD queries and writing Parquet files to GCS. Performed by the Export Worker during Snapshot creation. |
| Load | The process of reading Parquet files from GCS and importing them into a graph database. Performed by the graph wrapper (Ryugraph or FalkorDB) during Instance creation. |

| Term | Definition |
| --- | --- |
| Control Plane | The central management component: Python/FastAPI REST API backend. Manages all resource lifecycle, orchestrates operations, and serves as the single source of truth. Includes the Mapping Generator subsystem. Accessed exclusively via the Jupyter SDK. |
| Export Worker | Background service that handles Snapshot creation. The Export Submitter submits UNLOAD queries to Starburst; the Export Poller uses APScheduler to poll the export_jobs table and call Starburst directly until completion. |
| Ryugraph Wrapper | Per-instance FastAPI service wrapping an embedded Ryugraph database. Creates schema, loads data from Snapshots, and provides query/algorithm APIs. |
| FalkorDB Wrapper | Per-instance FastAPI service wrapping an embedded FalkorDB database. In-memory graph database with native Cypher algorithms. |
| Mapping Generator | Subsystem within the Control Plane that validates Mappings against Starburst, infers column types, and generates Ryugraph schema DDL. |

| Component | Technology |
| --- | --- |
| Graph DB | Ryugraph (KuzuDB fork), FalkorDB |
| Infrastructure | GKE on Google Cloud |
| Data Source | Starburst Galaxy (managed Trino SaaS) + BigQuery |
| Export Platform | Starburst Galaxy system.unload() with PyArrow fallback |
| Storage | GCS at gs://bucket/{user_id}/{mapping_id}/{snapshot_id}/ (user_id = snapshot owner) |
| Control Plane | Python/FastAPI REST API, PostgreSQL (Cloud SQL, DO Managed, or local pod), raw SQL |
| Ryugraph Wrapper | Python (FastAPI), per-instance REST API with embedded Ryugraph |
| FalkorDB Wrapper | Python (FastAPI), per-instance REST API with embedded FalkorDB |
| Jupyter SDK | Python (full control plane + query/algorithm interface) |
| IaC/GitOps | Terraform, Jenkins (CI), ./infrastructure/cd/deploy.sh + kubectl apply -f infrastructure/cd/resources/ (CD). No Helm (except Zero-to-JupyterHub), no ArgoCD, no GitHub Actions. |

References:

  • ADR-070: Starburst Galaxy + BigQuery Export Platform
  • ADR-071: PyArrow Fallback Export Strategy
  • ADR-072: Removal of Local Trino Stack

Three resource types, each with a single owner (all resources visible to all analysts):

  1. Mapping: SQL query defining graph schema. Retention: until deleted/inactivity timeout
  2. Data Snapshot: Timestamped Parquet files from mapping. Retention: 7 days default (configurable). Re-running mapping creates new versioned snapshot
  3. Graph Instance: Running Ryugraph pod from snapshot. Retention: <24hr typical (configurable)

Mapping (header):

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique identifier |
| name | string | Display name |
| description | string | General description of the mapping |
| owner_username | string | Owner username (DB-backed user record) |
| current_version | integer | Latest version number |
| created_at | timestamp | Creation time |
| ttl | duration | Time-to-live (null = no expiry) |
| inactivity_timeout | duration | Delete after no snapshots created |

Mapping Version (immutable):

| Field | Type | Description |
| --- | --- | --- |
| mapping_id | UUID | Parent mapping reference |
| version | integer | Version number (1, 2, 3, …) |
| change_description | string | Description of changes (null for initial version) |
| node_definitions | JSON | Array of node definitions (see structure below) |
| edge_definitions | JSON | Array of edge definitions (see structure below) |
| created_at | timestamp | When this version was created |

Versioning rules:

  • Versions are immutable once created
  • Editing a mapping creates a new version (requires change description)
  • Cannot delete a mapping if any snapshots exist for any version
  • Must delete all snapshots before deleting a mapping

Node Definition Structure:

```json
{
  "label": "Customer",
  "sql": "SELECT customer_id, name, city FROM analytics.customers",
  "primary_key": {"name": "customer_id", "type": "STRING"},
  "properties": [
    {"name": "name", "type": "STRING"},
    {"name": "city", "type": "STRING"}
  ]
}
```
  • label: Ryugraph node table name
  • sql: Starburst SQL query (primary_key column must be first in SELECT)
  • primary_key: Column name and Ryugraph type for node primary key
  • properties: Property columns with Ryugraph types, in SELECT order (after primary key)
  • Supported types: STRING, INT64, INT32, INT16, INT8, DOUBLE, FLOAT, DATE, TIMESTAMP, BOOL, BLOB, UUID, LIST, MAP, STRUCT

Edge Definition Structure:

```json
{
  "type": "PURCHASED",
  "from_node": "Customer",
  "to_node": "Product",
  "sql": "SELECT customer_id, product_id, amount, purchase_date FROM analytics.transactions",
  "from_key": "customer_id",
  "to_key": "product_id",
  "properties": [
    {"name": "amount", "type": "DOUBLE"},
    {"name": "purchase_date", "type": "DATE"}
  ]
}
```
  • type: Ryugraph relationship table name
  • from_node/to_node: Source and target node labels
  • sql: Starburst SQL query (from_key first, to_key second, then properties)
  • from_key/to_key: Column names for source/target node references (types inferred from referenced node primary keys)
  • properties: Property columns with Ryugraph types, in SELECT order (after from/to keys)
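The shape rules above can be checked client-side before a mapping version is submitted. A minimal validation sketch (the helper name and error format are illustrative, not part of the SDK; SQL column-order checks are omitted):

```python
import re

# Supported Ryugraph property types, per the list above
SUPPORTED_TYPES = {
    "STRING", "INT64", "INT32", "INT16", "INT8", "DOUBLE", "FLOAT",
    "DATE", "TIMESTAMP", "BOOL", "BLOB", "UUID", "LIST", "MAP", "STRUCT",
}

# Node label: 1-64 ASCII letters/numbers/_, starting with a letter
NODE_LABEL_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,63}$")

def validate_node_definition(node: dict) -> list[str]:
    """Return a list of validation errors (empty list = valid)."""
    errors = []
    if not NODE_LABEL_RE.match(node.get("label", "")):
        errors.append("invalid node label")
    pk = node.get("primary_key", {})
    if pk.get("type") not in SUPPORTED_TYPES:
        errors.append(f"unsupported primary key type: {pk.get('type')}")
    for prop in node.get("properties", []):
        if prop["type"] not in SUPPORTED_TYPES:
            errors.append(f"unsupported property type: {prop['type']}")
    return errors
```

The Control Plane's Mapping Generator performs the authoritative validation against Starburst; a client-side pass like this only catches obvious mistakes early.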

Copy Mapping Behavior:

When copying a mapping:

| Field | Behavior |
| --- | --- |
| name | Defaults to "Copy of {source.name}" |
| description | Copied from source |
| node_definitions | Copied from source current_version |
| edge_definitions | Copied from source current_version |
| current_version | Reset to 1 |
| owner | Set to copying user |
| ttl | Reset to global default |
| inactivity_timeout | Reset to global default |
| created_at | Set to now |

What is NOT copied: Version history (only current version copied as v1), snapshots, favorites.
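The copy rules above reduce to a simple transform. A sketch (function name and the `GLOBAL_DEFAULTS` values are hypothetical; the real defaults come from global_config):

```python
from datetime import datetime, timezone

GLOBAL_DEFAULTS = {"ttl": None, "inactivity_timeout": "P30D"}  # illustrative values

def copy_mapping(source: dict, current_version: dict, copying_user: str) -> dict:
    """Apply the copy-mapping rules: only the current version's definitions
    carry over; ownership, lifecycle, and version history reset."""
    return {
        "name": f"Copy of {source['name']}",
        "description": source["description"],
        "node_definitions": current_version["node_definitions"],
        "edge_definitions": current_version["edge_definitions"],
        "current_version": 1,  # version history is NOT copied
        "owner_username": copying_user,
        "ttl": GLOBAL_DEFAULTS["ttl"],
        "inactivity_timeout": GLOBAL_DEFAULTS["inactivity_timeout"],
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```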

Snapshot:

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique identifier |
| name | string | Display name |
| description | string | Optional description |
| mapping_id | UUID | Source mapping reference |
| mapping_version | integer | Version of mapping used to create this snapshot |
| owner_username | string | Owner username (DB-backed user record) |
| gcs_path | string | GCS location of Parquet files |
| size_bytes | integer | Total storage size |
| node_count | integer | Number of nodes (per node type) |
| edge_count | integer | Number of edges (per edge type) |
| status | enum | pending, creating, ready, failed, cancelled |
| error_message | string | Failure details (if status=failed) |
| created_at | timestamp | Creation time |
| ttl | duration | Time-to-live |
| inactivity_timeout | duration | Delete after no instances created |

Deletion rule: Cannot delete a snapshot if any instances exist for that snapshot. Must terminate all instances before deleting a snapshot.

GCS File Structure (per snapshot):

```
gs://bucket/{user_id}/{mapping_id}/{snapshot_id}/   (user_id = snapshot owner)
├── nodes/
│   ├── {node_label_1}/
│   │   └── *.parquet   (multiple files, written in parallel by Starburst)
│   └── {node_label_2}/
│       └── *.parquet
└── edges/
    ├── {edge_type_1}/
    │   └── *.parquet   (multiple files, written in parallel by Starburst)
    └── {edge_type_2}/
        └── *.parquet
```

Constraints:

  • Node Parquet files: columns in order [primary_key, property1, property2, …]
  • Edge Parquet files: columns in order [from_key, to_key, property1, property2, …]
  • Nodes must be loaded before edges (edges reference node primary keys)
  • Starburst UNLOAD writes multiple files in parallel; Ryugraph COPY FROM reads all files via glob pattern (*.parquet)
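The ordering constraints above can be sketched as the load plan a wrapper might execute on startup (the `COPY ... FROM` statement text here is illustrative, not exact Ryugraph syntax):

```python
def build_load_plan(snapshot_path: str,
                    node_labels: list[str],
                    edge_types: list[str]) -> list[str]:
    """Nodes first, then edges, each table reading all parallel Parquet
    parts via a glob pattern."""
    plan = []
    for label in node_labels:
        plan.append(f'COPY {label} FROM "{snapshot_path}/nodes/{label}/*.parquet"')
    for etype in edge_types:
        plan.append(f'COPY {etype} FROM "{snapshot_path}/edges/{etype}/*.parquet"')
    return plan
```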

Graph Instance:

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique identifier |
| name | string | Display name |
| description | string | Optional description |
| snapshot_id | UUID | Source snapshot reference |
| owner_username | string | Owner username (DB-backed user record) |
| instance_url | string | Unique access URL |
| pod_name | string | Kubernetes pod name |
| pod_ip | string | Internal pod IP |
| status | enum | starting, running, stopping, failed (stopping = terminating; instance deleted when complete) |
| error_message | string | Failure details (if status=failed) |
| created_at | timestamp | Creation time |
| last_activity_at | timestamp | Last query/algorithm execution |
| ttl | duration | Time-to-live |
| inactivity_timeout | duration | Terminate after no activity |
| memory_usage_bytes | integer | Current memory consumption |
| disk_usage_bytes | integer | Current disk consumption |
| lock_holder_id | UUID | User holding algorithm lock (null if unlocked) |
| lock_algorithm | string | Algorithm name being executed |
| lock_acquired_at | timestamp | When lock was acquired |

| Field | Min | Max | Allowed Characters | Unique Scope |
| --- | --- | --- | --- | --- |
| Mapping name | 1 | 255 | Unicode letters, numbers, spaces, -_. | Not unique |
| Snapshot name | 1 | 255 | Unicode letters, numbers, spaces, -_. | Not unique |
| Instance name | 1 | 255 | Unicode letters, numbers, spaces, -_. | Not unique |
| Description | 0 | 4000 | Any Unicode | N/A |

| Field | Min | Max | Allowed Characters | Unique Scope |
| --- | --- | --- | --- | --- |
| Node label | 1 | 64 | ASCII letters, numbers, _ (start with letter) | Per mapping version |
| Edge type | 1 | 64 | ASCII uppercase letters, numbers, _ | Per mapping version |
| Property name | 1 | 64 | ASCII letters, numbers, _ (start with letter) | Per node/edge |

The following names cannot be used for node labels or edge types:

  • Cypher keywords: NODE, RELATIONSHIP, MATCH, WHERE, RETURN, CREATE, DELETE, SET, REMOVE, WITH, ORDER, LIMIT, SKIP, UNION, CALL, YIELD
  • System prefixes: _internal_, _system_, _ryugraph_

| Check | When | Error Code |
| --- | --- | --- |
| Name too long | Create/Update | VALIDATION_FAILED |
| Invalid characters | Create/Update | VALIDATION_FAILED |
| Reserved name | Create/Update | VALIDATION_FAILED |
| Duplicate node label | Create mapping version | VALIDATION_FAILED |
| Duplicate edge type | Create mapping version | VALIDATION_FAILED |

Role hierarchy (least to most privileged): Analyst < Admin < Ops. Each higher role inherits ALL capabilities of the roles below it.

| Resource | View | Create | Update | Delete | Other |
| --- | --- | --- | --- | --- | --- |
| Mapping | All | Own | Own only\* | Own only\*\* | Copy any; list versions; list snapshots |
| Snapshot | All | From any mapping (owns result) | Own only | Own only | - |
| Instance | All | From any snapshot (owns result) | Own only | Own only | Query any; algorithms own only |
| Export Queue | Own only | - | - | - | - |

\* Update creates a new immutable version (requires change description).
\*\* Delete fails if any snapshots exist for any version of the mapping.

Admin: all Analyst capabilities, plus:

  • CRUD any user’s resources (mappings, snapshots, instances)
  • Run algorithms on any instance
  • View query logs and algorithm run history
  • Read all export queue entries

Ops-only endpoints (config, cluster, jobs) are NOT accessible to Admin.

Ops: all Admin capabilities (Ops inherits every Admin and Analyst capability), plus:

  • View cluster health status and metrics
  • Configure global lifecycle defaults and hard limits
  • Configure concurrency limits (per-analyst, cluster-wide)
  • Manage allowed data sources (catalogs/schemas for Schema Browser)
  • Trigger metadata cache refresh
  • Enable/disable maintenance mode
  • Force terminate stuck instances
  • Retry/cancel failed exports
  • Manage background jobs (trigger, cancel, view status)

See system-design/authorization.spec.md for the authoritative RBAC matrix.


| Operation | Concurrency | Notes |
| --- | --- | --- |
| Read (Cypher queries) | Concurrent | Allowed |
| Algorithm writes | Exclusive lock | One algorithm run per instance at a time |
| Structure modification | Not permitted | No add/delete nodes/edges |

  • Lock is implicit: created automatically when algorithm starts, released automatically when it finishes
  • Lock includes: holder user ID, algorithm name, start time
  • If algorithm hangs, lock remains held - user must terminate the instance
  • Clear error message when instance is locked (e.g., “Instance locked by user X running algorithm Y since Z”)
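The implicit lock lifecycle above can be sketched as a context manager around algorithm execution (field names follow the instance schema; the helper itself is hypothetical, and note that a hung algorithm never reaches the release path, matching the "lock remains held" rule):

```python
from contextlib import contextmanager
from datetime import datetime, timezone

class InstanceLockedError(Exception):
    pass

@contextmanager
def algorithm_lock(instance: dict, user_id: str, algorithm: str):
    """Acquire the per-instance algorithm lock on entry and release it on
    exit (success or failure); raise a clear error if already locked."""
    if instance.get("lock_holder_id") is not None:
        raise InstanceLockedError(
            f"Instance locked by user {instance['lock_holder_id']} running "
            f"{instance['lock_algorithm']} since {instance['lock_acquired_at']}"
        )
    instance.update(
        lock_holder_id=user_id,
        lock_algorithm=algorithm,
        lock_acquired_at=datetime.now(timezone.utc).isoformat(),
    )
    try:
        yield
    finally:
        instance.update(lock_holder_id=None, lock_algorithm=None,
                        lock_acquired_at=None)
```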

Hierarchy: Global Defaults & Hard Limits (Ops only) → Local Override (resource owner, must be ≤ hard limit)

Parameters per resource type:

  • Max age (TTL): Delete/terminate after duration
  • Inactivity timeout: Delete/terminate after no activity

Activity definitions:

  • Mapping: Used to create snapshot
  • Snapshot: Used to create instance
  • Instance: Query executed or algorithm run

Concurrency limits (Ops configurable):

  • Instances per analyst
  • Total cluster capacity

Limit enforcement (instance creation only):

  • HTTP 409 Conflict when limit exceeded
  • Error message includes: current count, max allowed, which limit hit (per-analyst or cluster)
  • Email notification sent to: analyst, all admins, all ops users
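The enforcement rule above amounts to two counter checks at instance creation. A sketch (function and payload shape are illustrative; the real response is an HTTP 409 from the Control Plane):

```python
def check_instance_limits(user_count: int, per_analyst_max: int,
                          cluster_count: int, cluster_max: int):
    """Return a 409-style error payload naming which limit was hit,
    or None when creation may proceed."""
    if user_count >= per_analyst_max:
        return {"status": 409, "limit": "per-analyst",
                "current": user_count, "max": per_analyst_max}
    if cluster_count >= cluster_max:
        return {"status": 409, "limit": "cluster",
                "current": cluster_count, "max": cluster_max}
    return None
```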

| Resource | Field | Default | Source |
| --- | --- | --- | --- |
| Mapping | ttl | null (no expiry) | global_config |
| Mapping | inactivity_timeout | P30D (30 days) | global_config |
| Snapshot | ttl | P7D (7 days) | global_config |
| Snapshot | inactivity_timeout | P3D (3 days) | global_config |
| Instance | ttl | PT24H (24 hours) | global_config |
| Instance | inactivity_timeout | PT4H (4 hours) | global_config |

| Child Resource | Inherits From | Override Allowed |
| --- | --- | --- |
| Snapshot | Parent mapping's lifecycle settings | Yes (can be shorter only) |
| Instance | Parent snapshot's lifecycle settings | Yes (can be shorter only) |

| Endpoint | Parameter | Default |
| --- | --- | --- |
| List endpoints | limit | 50 |
| List endpoints | offset | 0 |
| List endpoints | sort | created_at DESC |
| Snapshot create | mapping_version | mapping.current_version |
| Query execute | timeout_ms | 60000 |
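The "can be shorter only" inheritance rule above can be sketched as follows (seconds are used instead of ISO 8601 durations for brevity; clamping an over-long override to the parent value is one possible policy, rejecting it is another):

```python
def effective_lifecycle(parent_value, override):
    """Resolve a child resource's effective TTL/inactivity timeout:
    inherit when no override is given, and never let the override
    exceed the parent's setting."""
    if override is None:
        return parent_value            # inherit from parent
    if parent_value is None:
        return override                # parent has no expiry; any override is shorter
    return min(parent_value, override)  # clamp to the parent's setting
```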

| Resource | Searchable Fields | Match Type |
| --- | --- | --- |
| Mapping | name, description | Case-insensitive substring |
| Snapshot | name, description | Case-insensitive substring |
| Instance | name, description | Case-insensitive substring |

| Resource | Filterable Fields | Operators |
| --- | --- | --- |
| Mapping | owner, created_at | equals, date range |
| Snapshot | owner, status, mapping_id, created_at | equals, date range |
| Instance | owner, status, snapshot_id, created_at | equals, date range |

| Resource | Sortable Fields | Default |
| --- | --- | --- |
| Mapping | name, created_at, updated_at | created_at DESC |
| Snapshot | name, created_at, status | created_at DESC |
| Instance | name, created_at, status | created_at DESC |

| Parameter | Default | Maximum |
| --- | --- | --- |
| limit | 50 | 100 |
| offset | 0 | No limit |

Response includes:

  • total_count: Total matching resources
  • has_more: Boolean indicating more pages exist
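A generator that walks a list endpoint using these parameters and the `has_more` flag might look like this (the `fetch_page` callable stands in for an HTTP GET against any list endpoint; the helper itself is not part of the SDK):

```python
def paginate(fetch_page, limit: int = 50):
    """Yield every resource from a paginated list endpoint, advancing
    offset by limit until the response reports no more pages."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page["items"]
        if not page["has_more"]:
            break
        offset += limit
```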

All algorithms execute server-side within the Ryugraph pod.

| Method | Algorithms |
| --- | --- |
| Ryugraph algo extension | PageRank, Connected Components, Shortest Path, Louvain, Label Propagation, Triangle Count |
| NetworkX (in-process) | Centrality (degree, closeness, betweenness, eigenvector), Community (Louvain, Girvan-Newman), Paths, Clustering, Link prediction |

Key capabilities:

  • Ryugraph is embedded in Python - NetworkX runs in the same process
  • Bidirectional integration: Ryugraph query results convert to NetworkX graphs, algorithm results write back to Ryugraph node/edge properties
  • Native Ryugraph algorithms (C++) are faster; NetworkX provides broader algorithm coverage

Result storage: Algorithm results written to node/edge properties. Changes persist for instance lifetime but are not exportable - no facility to persist algorithm results back to snapshots.
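The round trip described above (query results out, algorithm values back in as properties) can be illustrated without NetworkX using a toy degree computation; the Cypher write-back text is illustrative only, since real write-back goes through the wrapper:

```python
def degree_from_edges(edge_rows):
    """Toy stand-in for a NetworkX algorithm: edge_rows are
    (from_key, to_key) pairs from a Cypher query; returns a per-node
    value to be written back as a node property."""
    degree: dict[str, int] = {}
    for src, dst in edge_rows:
        degree[src] = degree.get(src, 0) + 1
        degree[dst] = degree.get(dst, 0) + 1
    return degree

def write_back_statements(label: str, prop: str, values: dict) -> list[str]:
    # Illustrative Cypher SET statements, one per node
    return [
        f'MATCH (n:{label} {{id: "{k}"}}) SET n.{prop} = {v}'
        for k, v in sorted(values.items())
    ]
```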


| Resource | Soft Limit | Hard Limit | Notes |
| --- | --- | --- | --- |
| Graph size (memory) | 1.5 GB | 2 GB | Per instance |
| Node count | 10 million | 50 million | Per instance |
| Edge count | 50 million | 200 million | Per instance |
| Properties per node/edge | 50 | 100 | Schema constraint |
| Concurrent queries | 10 | 20 | Per instance |

| Operation | Expected (p50) | Target (p95) | Timeout |
| --- | --- | --- | --- |
| Simple query (<1000 results) | < 500ms | < 2s | 60s |
| Complex query (aggregation) | < 2s | < 10s | 60s |
| Algorithm (PageRank, 1M nodes) | < 30s | < 2 min | 30 min |
| Algorithm (Betweenness, 100K nodes) | < 2 min | < 10 min | 30 min |
| Instance startup | < 1 min | < 3 min | 5 min |
| Snapshot export (1M rows) | < 5 min | < 15 min | 30 min |

| Operation | Expected Throughput |
| --- | --- |
| Snapshot exports (concurrent) | 10 parallel |
| Instance startups (concurrent) | 20 parallel |
| API requests (Control Plane) | 100 req/sec |

1. Control Plane (Python/FastAPI REST API)


Note: The platform is accessed exclusively via the Python SDK in Jupyter environments. There is no web interface.

REST API (accessed via Jupyter SDK):

Mappings:

  • GET/POST/DELETE /api/mappings
  • GET /api/mappings/:id
  • PUT /api/mappings/:id - Creates new version (requires change_description)
  • POST /api/mappings/:id/copy
  • PUT /api/mappings/:id/lifecycle
  • GET /api/mappings/:id/versions - List all versions
  • GET /api/mappings/:id/versions/:v - Get specific version
  • GET /api/mappings/:id/snapshots - List snapshots across all versions (NOTE: route is currently commented out in packages/control-plane/src/control_plane/routers/api/mappings.py:594; ADR-149 Tier-A.8 follow-up to re-enable.)

Snapshots:

  • GET /api/snapshots (read-only, CRUD disabled)
  • GET /api/snapshots/:id (read-only, CRUD disabled)
  • PUT /api/snapshots/:id/lifecycle

Instances:

  • GET/POST /api/instances
  • POST /api/instances/from-mapping (creates snapshot automatically)
  • GET/PUT/DELETE /api/instances/:id
  • PUT /api/instances/:id/lifecycle
  • POST /api/instances/:id/terminate

Locks:

  • GET /api/instances/:id/lock - Check lock status (no create/delete - lock is implicit)

Config (Ops):

  • GET/PUT /api/config/lifecycle
  • GET/PUT /api/config/concurrency

Cluster (Ops):

  • GET /api/cluster/health
  • GET /api/cluster/instances
  • GET /api/cluster/metrics

Note: Audit logging is handled by the company’s external observability stack (TBD), not the Control Plane API.

Snapshot creation is asynchronous to avoid blocking the user during long-running Starburst exports.

Export Strategy (Two-Tier):

| Tier | Method | When Used |
| --- | --- | --- |
| Primary | Starburst Galaxy system.unload() | Server-side, distributed execution |
| Fallback | PyArrow client-side export | When system.unload() unavailable |

The export worker automatically falls back to PyArrow if server-side export fails. Both paths produce identical Parquet files.
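The two-tier strategy reduces to a try/except around the primary path. A sketch (the callables and `ExportError` are placeholders for the worker's real export routines):

```python
class ExportError(Exception):
    pass

def export_snapshot(run_unload, run_pyarrow) -> str:
    """Try the server-side system.unload() path first; on failure, fall
    back to the client-side PyArrow path. Both callables must produce the
    same Parquet file layout. Returns which tier succeeded."""
    try:
        run_unload()
        return "primary"
    except ExportError:
        run_pyarrow()
        return "fallback"
```

If the fallback also fails, the exception propagates and the snapshot is marked failed with a clear error.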

User-Observable Behavior:

  1. User triggers snapshot creation via Jupyter SDK
  2. Snapshot record created immediately with status pending
  3. Status progresses to creating during export (progress visible via SDK polling)
  4. On success: status becomes ready, node/edge counts and size recorded
  5. On failure: status becomes failed, error message recorded
  6. On cancellation (Ops): status becomes cancelled, partial data cleaned up

Progress Visibility:

During the creating phase, users can see:

  • Current phase (exporting nodes vs edges)
  • Which node/edge type is being processed
  • Completed tables with row counts

Retry Behavior:

  • Failed snapshots can be retried via SDK
  • Retry creates a new export attempt (overwrites partial data)

Cancellation:

  • Ops users can cancel in-progress exports
  • Cancellation is best-effort (current Starburst query completes)

Note: Implementation details (compute platform, message queue) are defined in export-worker.design.md and ADR-018.

3. Ryugraph Wrapper (Python/FastAPI, per instance)


Each graph instance runs in its own Kubernetes pod. Ryugraph is an embedded database - it runs in-process within the Python wrapper (not as a separate server).

Constraints:

  • One Ryugraph database per pod (file locking prevents multiple writers)
  • On startup: Create schema (CREATE NODE TABLE, CREATE REL TABLE) from mapping definition, then COPY FROM GCS Parquet files (nodes before edges)
  • Supports concurrent read queries, exclusive lock for algorithm writes

Instance URL structure: https://{domain}/{instance-id}/

  • Single domain, path-based routing via Ingress

REST API:

Control Plane endpoints:

GET /health, GET /status, POST /shutdown

Jupyter SDK endpoints:

POST /query - Execute Cypher query
POST /algo/{name} - Run Ryugraph native algorithm
POST /networkx/{name} - Run NetworkX algorithm
POST /subgraph - Extract subgraph
GET /lock - Check lock status (lock is implicit with algorithm execution)
GET /schema - Get graph schema

Ryugraph Explorer: Hosted at https://{domain}/{instance-id}/explorer/

4. Jupyter SDK (Python) - Primary User Interface


The Jupyter SDK is the only user interface for the Graph OLAP Platform. All platform interactions - from mapping creation to query execution - are performed through Python code in Jupyter notebooks.

Capabilities:

| Category | Features |
| --- | --- |
| Control Plane | Mapping, Snapshot, Instance CRUD with versioning, lifecycle, and favorites |
| Queries | Cypher execution with multiple return formats (DataFrame, dict, NetworkX) |
| Algorithms | Native Ryugraph + any NetworkX algorithm via dynamic discovery (500+) |
| Visualization | Smart auto-visualization, interactive tables, graph rendering |
| Deployment | Zero-config Jupyter integration, Docker images, JupyterHub |

Quick Start (2 lines):

```python
from graph_olap import notebook

client = notebook.connect()  # Auto-discovers config from environment
```

Example Workflow:

```python
# Create and wait for snapshot
snapshot = client.snapshots.create_and_wait(mapping_id=1, name="Analysis")

# Create and connect to instance
instance = client.instances.create_and_wait(snapshot_id=snapshot.id, name="Analysis")
conn = client.instances.connect(instance.id)

# Query and visualize
result = conn.query("MATCH (c:Customer)-[p]->(pr:Product) RETURN c, p, pr LIMIT 1000")
result.show()  # Auto-selects best visualization

# Run algorithms
conn.networkx.run("pagerank", node_label="Customer", property_name="pr")
conn.networkx.run("louvain_communities", property_name="community")

# Export results
df = conn.query_df("MATCH (n) RETURN n.name, n.pr, n.community")
df.to_csv("results.csv")
```

Implementation Details: See jupyter-sdk.design.md

Deployment: See jupyter-sdk.deployment.design.md

See ryugraph-performance.reference.md for detailed sizing rationale.

  • Memory request: 3Gi (2GB buffer pool + algorithm overhead)
  • Memory limit: 8Gi (burst capacity for NetworkX algorithms)
  • CPU request: 1 vCPU, limit: 4 vCPU
  • Ryugraph threads: 16 (4x CPU for I/O-bound GCS reads)
  • Buffer pool: 2GB (optimal for GCS COPY FROM)
  • Disk: Persistent volume for buffer pool spilling
  • Pod QoS: Burstable (enables memory bursting without eviction risk)

Snapshot export failures:

| Failure Point | System Behavior | User Action Required |
| --- | --- | --- |
| Starburst query fails | Retry 3x with exponential backoff, then fail | User can retry via SDK |
| GCS write fails | Retry 3x with exponential backoff, then fail | User can retry via SDK |
| Worker crashes mid-export | Message redelivered, export restarts | None (automatic) |
| Max retries exhausted | Message to DLQ, status=failed | Contact ops or retry manually |

Instance startup failures:

| Failure Point | System Behavior | User Action Required |
| --- | --- | --- |
| Pod scheduling fails | Instance marked failed immediately | Create new instance |
| GCS read fails | Instance marked failed | Verify snapshot, create new instance |
| Schema creation fails | Instance marked failed | Check mapping definition |
| Startup timeout (>5 min) | Instance marked failed, pod deleted | Create new instance |

  • Snapshot export: All-or-nothing. Partial files may exist in GCS but are overwritten on retry.
  • Instance startup: All-or-nothing. Partial load results in failed instance.
  • Algorithm execution: All-or-nothing. Partial results are not persisted.
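The "retry 3x with exponential backoff, then fail" behavior can be sketched generically (the helper is illustrative; the `sleep` parameter is injected so the backoff schedule is testable):

```python
import time

def retry_with_backoff(operation, max_attempts: int = 3,
                       base_delay: float = 1.0, sleep=time.sleep):
    """Run operation, retrying up to max_attempts with exponential backoff
    (base_delay, 2x, 4x, ...); re-raise the last error so the caller can
    mark the job failed or dead-letter it."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```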

Ops can enable maintenance mode to gracefully stop accepting new work during upgrades or incidents.

Behavior when enabled:

| Request Type | Behavior |
| --- | --- |
| New resource creation (mappings, snapshots, instances) | Rejected with 503 Service Unavailable |
| Read operations (GET requests) | Allowed |
| In-flight operations (running exports, starting instances) | Continue to completion |
| Terminate/delete operations | Allowed |

API response during maintenance:

```json
{
  "error": {
    "code": "SERVICE_UNAVAILABLE",
    "message": "System is in maintenance mode. {maintenance_message}"
  }
}
```

Configuration:

  • maintenance_mode: boolean (default: false)
  • maintenance_message: string (optional, shown to users)
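The behavior table above can be sketched as a simple request gate (the routing heuristics here, like matching `/terminate` suffixes, are a simplification of what real FastAPI middleware would do):

```python
def gate_request(config: dict, method: str, path: str):
    """Return a 503-style error payload for requests rejected in
    maintenance mode; return None when the request may proceed."""
    if not config.get("maintenance_mode"):
        return None
    is_read = method == "GET"
    is_teardown = method == "DELETE" or path.endswith("/terminate")
    if is_read or is_teardown:
        return None  # reads and terminate/delete are always allowed
    return {
        "error": {
            "code": "SERVICE_UNAVAILABLE",
            "message": ("System is in maintenance mode. "
                        + config.get("maintenance_message", "")).strip(),
        }
    }
```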

| Metric | Granularity | Purpose |
| --- | --- | --- |
| Active instance count | By status, by user | Capacity planning |
| Query latency | p50, p95, p99 | Performance monitoring |
| Algorithm execution time | By algorithm type | Performance monitoring |
| Snapshot export duration | Per snapshot | Pipeline health |
| Memory/disk utilization | Per instance | Resource monitoring |
| Export queue depth | Cluster-wide | Pipeline health |

| Condition | Severity | Action |
| --- | --- | --- |
| Instance failure rate > 10% in 5 min | Critical | Page on-call |
| Export queue depth > 50 | Warning | Notify ops channel |
| Cluster capacity > 80% | Warning | Notify ops channel |
| Control Plane error rate > 5% | Critical | Page on-call |

| Metric | Target | Measurement Window |
| --- | --- | --- |
| Control Plane availability | 99.5% | Monthly |
| Query latency (p95) | < 10 seconds | Daily |
| Instance startup time (p95) | < 3 minutes | Daily |
| Snapshot export success rate | > 95% | Weekly |

| Data Store | Backup Frequency | Retention | Recovery Objective |
| --- | --- | --- | --- |
| Cloud SQL (Control Plane) | Daily automated | 30 days | RPO: 24 hours |
| GCS (Parquet snapshots) | None (source reproducible) | N/A | Re-export from Starburst |

| Scenario | RTO | RPO | Recovery Method |
| --- | --- | --- | --- |
| Control Plane DB corruption | 4 hours | 24 hours | Restore from backup |
| GCS bucket deletion | 8 hours | N/A | Re-export snapshots |
| Single instance failure | Immediate | N/A | Create new instance |
| Cluster failure | 2 hours | 24 hours | Restore DB, re-create instances |

  • GCS: 99.999999999% (11 nines) durability (GCP SLA)
  • Cloud SQL: Automated backups with point-in-time recovery
  • Instance data: Ephemeral (not backed up, recreatable from snapshot)

Reference: ADR-070: Starburst Galaxy + BigQuery Export Platform

| Aspect | Requirement |
| --- | --- |
| Service | Starburst Galaxy (managed Trino SaaS) |
| Expected availability | 99% during business hours |
| Query timeout | 30 minutes (configurable) |
| Export method | system.unload() with PyArrow fallback |
| Data source | BigQuery tables via Starburst connector |
| Fallback behavior | Automatic PyArrow fallback; fail snapshot with clear error if both paths fail |
| Authentication | Service account with query execution permissions |

Note: Local Trino emulation stack has been removed (see ADR-072). E2E tests require network connectivity to Starburst Galaxy.

GCS:

| Aspect | Requirement |
| --- | --- |
| Required IAM roles | storage.objectAdmin on snapshot bucket |
| Authentication | Workload Identity (GKE pods) |
| Bucket location | Same region as GKE cluster |

Cloud SQL:

| Aspect | Requirement |
| --- | --- |
| Required IAM roles | cloudsql.client |
| Authentication | Workload Identity or Cloud SQL Proxy |
| Connection | Private IP within VPC |

| Data State | Requirement |
| --- | --- |
| Data in transit | TLS 1.2+ for all external connections |
| Data at rest (Cloud SQL) | Google-managed encryption (default) |
| Data at rest (GCS) | Google-managed encryption (default) |
| Secrets | Stored in Secret Manager, not in code/config |

| Boundary | Requirement |
| --- | --- |
| Control Plane | Internal load balancer, not public |
| Instance pods | Ingress-routed only, no direct public access |
| Cloud SQL | Private IP, no public access |
| GCS | Private Google Access, no public buckets |

| Data Type | Classification | Handling |
| --- | --- | --- |
| Graph data | Business Confidential | Access logged, GCS IAM controlled |
| User credentials | N/A | Not stored (external IdP) |
| API keys | Secret | Stored encrypted, rotatable |

Typical analyst workflow (all via Jupyter SDK):

  1. Create Mapping: Analyst creates a mapping programmatically using the SDK, defining SQL queries and node/edge definitions in Python
  2. Create Snapshot: Analyst triggers snapshot creation via SDK; system queues Starburst export job; SDK provides polling methods to wait for completion (status: pending → creating → ready)
  3. Create Instance: Analyst creates a graph instance from the snapshot via SDK; system provisions pod and loads graph data
  4. Analyze: Analyst runs Cypher queries and graph algorithms through the SDK, with results returned as DataFrames or visualizations
  5. Collaborate: Analyst shares notebook with colleague; colleague can see and query all instances via their own SDK connection
  6. Auto-cleanup: Instance auto-terminates after TTL or inactivity timeout (whichever comes first)

Versioning workflow:

  1. Analyst creates mapping M1 (v1) with initial node/edge definitions
  2. Analyst creates snapshot S1 from M1 (references M1 v1)
  3. Analyst edits M1, provides change description → creates M1 v2
  4. S1 remains unchanged (still references M1 v1)
  5. Analyst creates snapshot S2 from M1 (references M1 v2)
  6. Both S1 and S2 coexist; instances can be created from either

Multi-user scenario:

  • Analyst A creates mapping M1, creates snapshot S1, creates instance I1
  • Analyst B sees M1, S1, I1 in their lists (all resources visible)
  • Analyst B queries I1 (read access to any instance)
  • Analyst B creates their own instance I2 from S1 (B owns I2)
  • Analyst B runs algorithms on I2 (can only modify own instances)
  • Analyst B cannot delete M1, S1, or I1 (owned by A)

Tables: users, mappings, mapping_versions, snapshots, instances, algorithm_locks, global_config, export_queue

Design: TEXT for strings/UUIDs/timestamps (ISO 8601), INTEGER for numbers. PostgreSQL required in all environments.

Audit Event Categories (for external observability stack):

The following event categories should be captured by the company’s external logging/observability stack:

| Category | Event Types |
| --- | --- |
| resource | create, update, delete, copy, set_lifecycle, terminate |
| query | execute_cypher, extract_subgraph |
| algorithm | algo_start, algo_complete, algo_failed (includes algorithm name, duration) |
| config | update_lifecycle_defaults, update_concurrency_limits |
| auth | login, logout, token_refresh, permission_denied |
| system | snapshot_export_start, snapshot_export_complete, snapshot_export_failed, instance_startup, instance_shutdown, limit_exceeded |

The following UX capabilities are provided through the Jupyter SDK. All user interactions occur via Python code in notebooks.

| # | Requirement | SDK Implementation |
| --- | --- | --- |
| 1 | Progress visibility | SDK provides wait_for_* methods with progress callbacks and status polling for snapshot creation and instance startup. |
| 2 | Version diffing | SDK provides diff_versions() method to compare mapping versions programmatically. |
| 3 | Resource relationships view | SDK provides navigation methods: mapping.snapshots, snapshot.instances, etc. |
| 4 | Validation/dry run | SDK provides validate() method to check SQL and mapping definitions before export. |
| 5 | Favorites/bookmarks | SDK provides favorite() / unfavorite() methods for quick access patterns. |

Snapshot Creation Progress:

The SDK’s create_and_wait() method provides progress information through callbacks:

| Phase | Status | Available Data |
| --- | --- | --- |
| pending | SnapshotStatus.PENDING | Queue position (if available) |
| creating (nodes) | SnapshotStatus.CREATING | Current table, completed tables with row counts |
| creating (edges) | SnapshotStatus.CREATING | Current table, completed tables with row counts |
| ready | SnapshotStatus.READY | Total size, all row counts |
| failed | SnapshotStatus.FAILED | Error message, failed step |

Instance Startup Progress:

| Phase | Status | Available Data |
| --- | --- | --- |
| starting (init) | InstanceStatus.STARTING | Pod scheduling status |
| starting (schema) | InstanceStatus.STARTING | Schema creation phase |
| starting (nodes) | InstanceStatus.STARTING | Current table, completed tables with row counts |
| starting (edges) | InstanceStatus.STARTING | Current table, completed tables with row counts |
| running | InstanceStatus.RUNNING | Graph stats (node count, edge count) |
| failed | InstanceStatus.FAILED | Error message, failed step |

SDK exceptions provide structured error information:

Exception Types:

| Exception | When Raised |
| --- | --- |
| PermissionDeniedError | User lacks permission to modify resource |
| ConcurrencyLimitError | Instance limit exceeded (includes current/max counts) |
| DependencyError | Resource has dependent resources blocking deletion |
| ValidationError | Invalid field values (includes field name and reason) |
| ResourceLockedError | Instance locked by running algorithm |
| ServiceUnavailableError | External system (Starburst, GCS) unavailable |

Error Attributes:

All exceptions include:

  • message: Human-readable error description
  • code: Error code for support reference
  • details: Structured data (e.g., current_count, max_count for concurrency errors)

The SDK supports bulk operations through iteration helpers:

```python
# Terminate multiple instances
for instance in client.instances.list(status="running"):
    instance.terminate()

# Delete snapshots (respects dependencies)
for snapshot in mapping.snapshots:
    try:
        snapshot.delete()
    except DependencyError as e:
        print(f"Skipping {snapshot.name}: {e.message}")
```
  • All timestamps returned as Python datetime objects in UTC
  • SDK display methods format timestamps in user’s local timezone
  • Relative formatting available: snapshot.created_at_relative returns “2 minutes ago”

| # | Area | Question |
| --- | --- | --- |
| 1 | Networking | How do Jupyter notebooks reach Ryugraph pods? (Ingress path routing? Service mesh?) |
| 2 | Auth | Authentication mechanism for Jupyter SDK? (API keys? OAuth tokens?) |
| 3 | Domain | What domain for Control Plane and Explorer? |
| 4 | Jupyter Location | Where do Jupyter notebooks run? (Same cluster? Same VPC? External?) |
| 5 | Observability | What monitoring/alerting platform is standard in the organization? |
| 6 | Data | What are RTO/RPO requirements for Control Plane database? (Defaults provided, confirm acceptable) |
| 7 | Performance | What are acceptable query latency targets for "typical" queries? (Defaults provided, confirm acceptable) |
| 8 | UX | Should the platform support internationalization/localization? |
| 9 | Compliance | Are there data residency requirements for GCS bucket location? |
| 10 | Operations | What is the expected growth rate (users, instances, data volume)? |

  • Security & compliance (auth, encryption, audit logging, PCI-DSS) - deferred (baseline security requirements captured above)
  • Networking configuration (Jupyter runs in separate VPC; connectivity TBD based on target environment)
  • Jupyter environment - existing, not building
  • Notifications (email, Teams) - deferred
  • Persisting algorithm results to snapshots