Graph OLAP Platform - Platform Operations

**Document Type:** Platform Operations Specification
**Version:** 1.0
**Status:** Ready for Architectural Review
**Author:** Graph OLAP Platform Team
**Last Updated:** 2026-02-03


This architecture documentation is organized into five focused documents:

| Document | Content |
|---|---|
| Detailed Architecture | Executive Summary + C4 Architecture Viewpoints + Resource Management |
| SDK Architecture | Python SDK, Resource Managers, Authentication |
| Domain & Data Architecture | Domain Model, State Machines, Data Flows |
| This document | Technology, Security, Integration, Operations, NFRs |
| Authorization & Access Control | RBAC Roles, Permission Matrix, Ownership Model, Enforcement |

| Layer | Technology | Version | License | Purpose |
|---|---|---|---|---|
| Graph Database | Ryugraph (KuzuDB fork) | 0.3.x | MIT | Embedded columnar graph database |
| Graph Database | FalkorDB | 4.x | Source Available | Redis-based in-memory graph |
| Algorithm Engine | NetworkX | 3.x | BSD | Graph algorithms (Ryugraph only) |
| Backend Framework | Python FastAPI | 0.109+ | MIT | REST APIs |
| Client SDK | Python + httpx | N/A | MIT | Jupyter notebook integration |
| Container Runtime | GKE (Kubernetes) | 1.28+ | N/A | Managed Kubernetes |
| Database | Cloud SQL PostgreSQL | 14 | N/A | Metadata store |
| Object Storage | Google Cloud Storage | N/A | N/A | Parquet snapshot storage |
| Autoscaling | KEDA | 2.12+ | Apache 2.0 | Event-driven autoscaling |
| Continuous Delivery | Jenkins (enterprise) | N/A | N/A | Build, scan, deploy pipeline (`./infrastructure/cd/deploy.sh`) |
| Infrastructure | Terraform | 1.5+ | MPL-2.0 | Infrastructure as Code |

Component Dependencies
Mermaid Source
```mermaid
---
config:
  layout: elk
  elk:
    mergeEdges: false
    nodePlacementStrategy: BRANDES_KOEPF
---
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#E3F2FD", "primaryTextColor": "#0D47A1", "primaryBorderColor": "#1565C0", "lineColor": "#37474F"}}}%%
flowchart TB
    accTitle: Graph OLAP Platform Component Dependencies
    accDescr: Shows data flow from external services through GKE cluster components to graph instances

    %% Style definitions - Cagle palette
    classDef external fill:#ECEFF1,stroke:#455A64,stroke-width:2px,color:#263238
    classDef service fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    classDef data fill:#FFF8E1,stroke:#F57F17,stroke-width:2px,color:#E65100
    classDef infra fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1
    classDef security fill:#E0F2F1,stroke:#00695C,stroke-width:2px,color:#004D40

    subgraph External["External Services"]
        Starburst["Starburst Galaxy"]:::external
        SSO["Enterprise SSO / IdP"]:::security
        Auth0["Auth0 (Demo)"]:::security
    end

    subgraph GKE["GKE Cluster"]
        subgraph Apps["Application Layer"]
            ControlPlane["Control Plane<br/>(FastAPI)"]:::service
            ExportWorker["Export Worker<br/>(Python)"]:::service
        end
        subgraph Data["Data Layer"]
            CloudSQL["Cloud SQL<br/>(PostgreSQL)"]:::data
            GCS["GCS<br/>(Parquet)"]:::data
        end
        subgraph Graph["Graph Layer"]
            GraphInstance["Graph Instance<br/>(Ryugraph/FalkorDB)"]:::infra
        end
    end

    Starburst --> ExportWorker
    SSO --> ControlPlane
    Auth0 --> ControlPlane
    ControlPlane --> ExportWorker
    ControlPlane --> CloudSQL
    ExportWorker --> Starburst
    ExportWorker --> GCS
    GCS --> GraphInstance
```

The platform supports multiple graph database backends through a pluggable wrapper architecture:

| Wrapper | Database | Memory Model | Algorithm Engine | Best For |
|---|---|---|---|---|
| Ryugraph | KuzuDB (embedded) | Buffer pool + disk | NetworkX (Python) | NetworkX integration, Python algorithms |
| FalkorDB | FalkorDB (embedded) | In-memory only | Native C algorithms | Larger graphs, native performance |

Both wrappers provide:

  • Equivalent Cypher query support
  • Same REST API interface
  • GCS Parquet data loading
  • Algorithm lock mechanism

Selection Flow:

  1. User specifies wrapper_type when creating instance
  2. Control Plane queries WrapperFactory for configuration
  3. K8s Service creates pod with wrapper-specific image and resources
  4. Wrapper loads data and reports ready state
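As a hedged illustration of steps 1-3, the factory lookup might look like the sketch below. The class, registry, image names, and sizing here are illustrative assumptions, not the platform's actual `WrapperFactory` API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WrapperConfig:
    image: str             # wrapper-specific container image (illustrative tags)
    memory_request: str    # pod memory request handed to the K8s Service
    algorithm_engine: str  # "networkx" for Ryugraph, "native" for FalkorDB

# Illustrative registry; real images, tags, and sizing live in the Control Plane.
WRAPPER_REGISTRY = {
    "ryugraph": WrapperConfig("registry.example/ryugraph:1.0", "8Gi", "networkx"),
    "falkordb": WrapperConfig("registry.example/falkordb:1.0", "16Gi", "native"),
}

def resolve_wrapper(wrapper_type: str) -> WrapperConfig:
    """Step 2: map the user-supplied wrapper_type to pod configuration."""
    try:
        return WRAPPER_REGISTRY[wrapper_type]
    except KeyError:
        # Reject unknown wrapper types before any pod is created
        raise ValueError(f"unknown wrapper_type: {wrapper_type!r}") from None
```

Validating the wrapper type at this stage keeps step 3 (pod creation) from ever seeing an unsupported backend.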

| # | Control Category | Control | Implementation | NIST CSF |
|---|---|---|---|---|
| 1 | Authentication | SSO Integration | Enterprise IdP via auth proxy headers | PR.AC-1 |
| 2 | Authorization | Role-Based Access | Analyst/Admin/Ops roles with ownership model | PR.AC-4 |
| 3 | Data Protection (Transit) | TLS 1.3 | Ingress termination; WireGuard pod-to-pod | PR.DS-2 |
| 4 | Data Protection (Rest) | Encryption | Google-managed AES-256 for Cloud SQL/GCS | PR.DS-1 |
| 5 | Network Security | Private Cluster | No public endpoints; VPC-native networking | PR.AC-5 |
| 6 | Secret Management | External Secrets | Google Secret Manager via ESO | PR.DS-5 |
| 7 | Audit Logging | Cloud Audit Logs | All API access logged; 400-day retention | DE.AE-3 |
| 8 | Container Security | PSA Restricted | No root, no privilege escalation | PR.IP-1 |
| 9 | Image Security | Jenkins pipeline-driven image scanning (enterprise scanner TBC during integration) | Critical-CVE gate enforced by the enterprise build pipeline; image provenance tracked via content-addressable `1.0_N_<sha7>` tags published to the enterprise container registry. See ADR-149 Tier-A.1 follow-up; the specific scanner/SBOM tooling is owned by the ops team's CI and is not asserted by this document. | DE.CM-8 |
| 10 | Network Isolation | Network Policies | Default-deny with explicit allow (Cilium) | PR.AC-5 |

Graph OLAP uses a hierarchical role model: Analyst < Admin < Ops. Each higher role is a strict superset of the one below it.

| Capability | Analyst | Admin | Ops |
|---|---|---|---|
| Read all resources | Yes | Yes | Yes |
| CRUD own resources | Yes | Yes | Yes |
| CRUD any resource | No | Yes | Yes |
| Bulk delete | No | Yes | Yes |
| Config/Cluster/Jobs | No | No | Yes |

See Authorization & Access Control for the complete specification including the full permission matrix and enforcement architecture.
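Because each role is a strict superset of the one below it, enforcement reduces to an ordered comparison. A minimal sketch of that idea (the real enforcement is specified in the Authorization & Access Control document; the names below are illustrative):

```python
from enum import IntEnum

class Role(IntEnum):
    # Ordered so that numeric comparison encodes the strict-superset hierarchy
    ANALYST = 1
    ADMIN = 2
    OPS = 3

def can(actor_role: Role, required: Role) -> bool:
    """A capability gated at `required` is granted to that role and every role above it."""
    return actor_role >= required

# From the capability matrix: "CRUD any resource" requires ADMIN,
# "Config/Cluster/Jobs" requires OPS.
```

The ordered-enum form only works because the hierarchy is linear; a lattice of overlapping roles would need an explicit permission matrix instead.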

External (Ingress):

  • Protocol: TLS 1.2+
  • Termination: GKE Managed Certificates (auto-renewal)
  • WAF: Cloud Armor for API protection

Internal (Pod-to-Pod):

  • Protocol: WireGuard (ChaCha20-Poly1305)
  • Implementation: Cilium Transparent Encryption via GKE Dataplane V2
  • Overhead: ~5% CPU (kernel-level encryption)

Backend Connections:

  • Cloud SQL: TLS required (sslmode=require)
  • GCS: HTTPS (default)
  • Starburst: HTTPS
Browser (SSO) flow: User → Enterprise SSO → Session Cookie → Auth Proxy → X-Username Header → Services → DB role lookup

  • SDK/API: API key presented as a Bearer token; auth middleware resolves the username and looks up the role column in the DB-backed users table. No JWT parsing occurs in the control plane. (Updated for ADR-104)
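A sketch of the dual-path resolution described above: browser traffic arrives with an `X-Username` header set by the auth proxy, SDK traffic with a Bearer API key, and both paths end in a role lookup. The table shapes and function below are illustrative assumptions, not the control plane's actual middleware:

```python
# Illustrative stand-ins for the DB-backed users table.
API_KEYS = {"key-abc": ("alice", "analyst")}   # api_key -> (username, role)
USER_ROLES = {"alice": "analyst", "bob": "ops"}  # username -> role

def resolve_identity(headers: dict) -> tuple[str, str]:
    """Return (username, role); note there is no JWT parsing anywhere (ADR-104)."""
    if username := headers.get("X-Username"):
        # Browser path: the auth proxy has already authenticated the session
        return username, USER_ROLES[username]
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        # SDK path: API key resolved to a username, then the role column lookup
        username, role = API_KEYS[auth.removeprefix("Bearer ")]
        return username, role
    raise PermissionError("no credentials presented")
```

In the real service the dict lookups would be queries against the users table, and unknown keys would map to a 401 response.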
| Data Type | Classification | Storage | Encryption | Access Control |
|---|---|---|---|---|
| Graph Mapping Definitions | Internal | Cloud SQL | AES-256 | Platform users |
| Parquet Snapshots | Business Confidential | GCS | AES-256 | IAM + Workload Identity |
| Query Results | Transient | In-memory | N/A | Session-bound |
| Algorithm Results | Ephemeral | In-memory | N/A | Instance owner |
| Audit Logs | Compliance | Cloud Logging | AES-256 | Security team |

| System | Integration Type | Protocol | Authentication | Purpose |
|---|---|---|---|---|
| Starburst Galaxy | REST API | HTTPS | Service account token | Data export (UNLOAD) |
| Enterprise SSO | SAML/OIDC | HTTPS | Session cookies | User authentication |
| Jupyter Enterprise | Python SDK | HTTPS | API key (Bearer) | Notebook integration |
| Cloud SQL | PostgreSQL Wire | TLS | IAM auth | Metadata storage |
| GCS | REST API | HTTPS | Workload Identity | Snapshot storage |

The Python SDK (graph-olap-sdk) provides:

  • Async/await pattern for long-running operations
  • Automatic polling with exponential backoff
  • Type-safe models via Pydantic
  • Jupyter-friendly progress indicators
```python
from graph_olap import Client

client = Client(base_url="https://api.example.com", token="...")

# Create an instance directly from a mapping (recommended).
# Snapshots are created implicitly during instance creation.
instance = await client.instances.create_from_mapping(
    mapping_id=123,
    wrapper_type="ryugraph",
    wait=True,  # Polls until the instance is running
)

# Execute an algorithm on the instance
result = await instance.algorithms.pagerank(
    damping_factor=0.85,
    max_iterations=100,
)

# Access the underlying snapshot if needed
snapshot = await instance.get_snapshot()
print(f"Node counts: {snapshot.node_counts}")

# For metadata-only operations (no instance creation)
info_snapshot = await client.snapshots.create(
    mapping_id=123,
    info_only=True,  # Only collects row counts, no data export
    wait=True,
)
print(f"Node counts: {info_snapshot.node_counts}")
```
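The `wait=True` behavior above rests on the SDK's automatic polling with exponential backoff. A simplified sketch of that loop, with delay constants and the `fetch_state` callable as illustrative assumptions (the shipped SDK's internals may differ):

```python
import asyncio

async def poll_until(fetch_state, *, target="running",
                     initial_delay=1.0, factor=2.0, max_delay=30.0,
                     timeout=600.0, sleep=asyncio.sleep):
    """Poll fetch_state() until it returns `target`, backing off exponentially."""
    delay, elapsed = initial_delay, 0.0
    while elapsed < timeout:
        state = await fetch_state()
        if state == target:
            return state
        if state == "failed":
            raise RuntimeError("instance entered failed state")
        await sleep(delay)
        elapsed += delay
        delay = min(delay * factor, max_delay)  # 1s, 2s, 4s, ... capped at 30s
    raise TimeoutError(f"not {target!r} after {timeout}s")
```

Injecting `sleep` keeps the loop testable without real waits; the cap prevents the backoff from stretching into minutes on long-starting instances.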
| From | To | Protocol | Trigger | Error Handling |
|---|---|---|---|---|
| SDK | Control Plane | HTTPS REST | User action | SDK exception |
| SDK | Graph Instance | HTTPS REST | Query/algorithm | SDK exception |
| Control Plane | Cloud SQL | PostgreSQL | API requests | 503 until recovered |
| Control Plane | K8s API | HTTPS | Instance lifecycle | Mark failed |
| Export Worker | Control Plane | HTTPS REST | Job claim/status | Retry with backoff |
| Export Worker | Starburst | HTTPS REST | UNLOAD submission | Retry 3x → fail |
| Export Worker | GCS | HTTPS | Row count | Retry 3x |
| Graph Instance | GCS | HTTPS | Data load | Fail fast |
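The "Retry 3x → fail" and "Retry with backoff" policies in the matrix can be expressed as one small helper. A sketch under those table values (the workers' real implementation is not asserted here; `sleep` is injected so the policy can be exercised without real delays):

```python
import time

def retry(op, *, attempts=3, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Run `op` up to `attempts` times, backing off between failures.

    Mirrors the export worker's 'Retry 3x -> fail' policy: the final
    exception propagates so the job can be marked failed and picked up
    by the reconciliation loop.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure
            sleep(delay)
            delay *= factor
```

"Fail fast" rows are simply `attempts=1`: the graph instance would rather die and be recreated from the snapshot than serve a partial load.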

| Node Pool | Machine Type | Nodes | Purpose | Autoscaling |
|---|---|---|---|---|
| System | n1-standard-2 | 2 | Cluster infrastructure | Fixed |
| Control Plane | n1-standard-4 | 2-4 | Platform services | HPA |
| Graph Instances | n1-highmem-4 | 0-50 | User graph pods | Cluster autoscaler |

  • Pipeline: Jenkins builds images, pushes them to the enterprise container registry, and invokes `./infrastructure/cd/deploy.sh`, which applies Kubernetes manifests via `kubectl apply`.
  • Rolling Updates: Zero-downtime deployments via Kubernetes Deployment rolling strategy.
  • Manual Approval: Production pipeline stages require explicit human approval before kubectl apply.
| SLI | SLO Target | Measurement |
|---|---|---|
| Control Plane Availability | 99.5% monthly | Successful requests / Total requests |
| API Latency (P95) | < 2 seconds | Request duration histogram |
| Query Latency (P95) | < 10 seconds | Cypher query execution duration |
| Export Success Rate | > 95% weekly | Successful exports / Total exports |
| Instance Startup Time (P95) | < 3 minutes | Time from creation to running |

| Component | Tool | Purpose |
|---|---|---|
| Metrics | Managed Prometheus | Resource and application metrics |
| Logging | Cloud Logging | Structured JSON logs |
| Tracing | Cloud Trace | Distributed request tracing |
| Dashboards | Grafana | Operational dashboards |
| Alerting | Cloud Monitoring | SLO-based alerts |

| Scenario | RTO | RPO | Recovery Method |
|---|---|---|---|
| Instance Pod Failure | Immediate | N/A | Create new instance from snapshot |
| Control Plane Pod Failure | < 1 minute | 0 | Kubernetes restart (stateless) |
| Cloud SQL Failure | 4 hours | 5 minutes | Point-in-time recovery |
| GCS Data Loss | 8 hours | N/A | Re-export from Starburst |
| Regional Outage | 4 hours | 5 minutes | Restore to alternate region |

| Job | Interval | Purpose |
|---|---|---|
| Instance Orchestration | 5 sec | Transition waiting_for_snapshot instances to starting once the snapshot is ready |
| Instance Reconciliation | 5 min | Sync pod state with database |
| Export Reconciliation | 5 sec | Reset stale claims, finalize exports (deliberate exception to the ADR-040 default policy; near-real-time propagation requirement) |
| Lifecycle Cleanup | 5 min | TTL enforcement, orphan cleanup |
| Schema Cache Refresh | 24 hrs | Update Starburst schema cache |
| Resource Monitor | 60 sec | Proactive OOM prevention via memory tier upgrades |
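Fixed-interval jobs like these are commonly driven by one asyncio loop per job. The sketch below only illustrates that interval-and-isolate pattern (one failing run must not stop the loop); the platform's actual scheduler is not specified by this document:

```python
import asyncio

async def run_periodic(job, interval_s: float, stop: asyncio.Event):
    """Run `job` every interval_s seconds until `stop` is set."""
    while not stop.is_set():
        try:
            await job()  # e.g. instance reconciliation, TTL cleanup
        except Exception:
            pass  # log in practice; a failed run must not kill the loop
        try:
            # Sleep for the interval, but wake early if stop is set meanwhile
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
```

Using an `Event` for shutdown (rather than `asyncio.sleep`) lets the control plane stop all loops promptly during a rolling update.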

Manual Triggers: Operators can manually trigger jobs via the OpsResource API:

```python
# Trigger a background job immediately
await client.ops.trigger_job("reconciliation")  # or "lifecycle", "export_reconciliation", "schema_cache"

# Check job status
status = await client.ops.get_job_status()
```

The SDK provides an OpsResource for platform operations and monitoring:

| Method | Purpose |
|---|---|
| `get_cluster_health()` | Component health status (control-plane, export-worker, wrappers) |
| `get_metrics()` | Prometheus metrics (instances, exports, queries) |
| `get_state()` | System state summary |
| `trigger_job(job_type)` | Manually trigger background job |
| `get_job_status()` | Background job last-run timestamps |
| `get_lifecycle_config()` | View lifecycle settings (TTL, idle timeout) |
| `update_lifecycle_config()` | Update lifecycle settings |
| `get_concurrency_config()` | View concurrency limits |
| `update_concurrency_config()` | Update concurrency limits |
| `get_export_config()` | View export job duration settings |
| `update_export_config()` | Update export job duration settings |
| `get_export_jobs()` | Export job debugging information |

The platform provides a schema metadata API for discovering available tables and columns in Starburst:

| Endpoint | Purpose |
|---|---|
| `GET /api/schema/catalogs` | List available catalogs |
| `GET /api/schema/catalogs/{catalog}/schemas` | List schemas in catalog |
| `GET /api/schema/catalogs/{catalog}/schemas/{schema}/tables` | List tables in schema |
| `GET /api/schema/.../tables/{table}/columns` | List columns in table |
| `GET /api/schema/search/tables?q=pattern` | Search tables by pattern |
| `GET /api/schema/search/columns?q=pattern` | Search columns by pattern |
| `POST /api/schema/admin/refresh` | Trigger cache refresh (admin) |
| `GET /api/schema/stats` | Cache statistics (admin) |

| Metric | Requirement | Notes |
|---|---|---|
| API Response Time (P95) | < 2 seconds | Excluding long-running operations |
| Query Response Time | < 10 seconds | For typical graph queries |
| Algorithm Execution | < 5 minutes | For graphs up to 1M edges |
| Instance Startup | < 3 minutes | Including data load |
| Export Throughput | > 100K rows/minute | Per export job |

| Dimension | Target | Mechanism |
|---|---|---|
| Concurrent Users | 50+ | HPA on control plane |
| Concurrent Instances | 100+ | Node pool autoscaling |
| Graph Size | 10M nodes, 50M edges | Memory-based sizing |
| Export Parallelism | 5 workers | KEDA scaling |

| Component | Target | Mechanism |
|---|---|---|
| Control Plane | 99.5% | Multi-replica deployment |
| Cloud SQL | 99.95% | Regional HA configuration |
| GCS | 99.99% | Multi-zone redundancy |

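When a request path traverses these components serially and failures are independent (both simplifying assumptions, made only for illustration), the composite availability is the product of the individual targets, and a monthly error budget follows directly:

```python
from math import prod

def serial_availability(targets: list[float]) -> float:
    """Composite availability of serially-dependent components,
    assuming independent failures."""
    return prod(targets)

def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Error budget implied by an availability target over one month."""
    return days * 24 * 60 * (1 - availability)

# Control Plane * Cloud SQL * GCS ~= 0.9944: the end-to-end path is
# dominated by the 99.5% control-plane SLO (~216 minutes/month budget).
composite = serial_availability([0.995, 0.9995, 0.9999])
```

This is why the SLO table holds the control plane, not the managed dependencies, to the binding 99.5% target.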
| Requirement | Implementation |
|---|---|
| Encryption in Transit | TLS 1.2+ external, WireGuard internal |
| Encryption at Rest | Google-managed AES-256 |
| Access Logging | Cloud Audit Logs (400-day retention) |
| Network Isolation | Private GKE cluster, VPC-native |
| Data Residency | Configurable GCP region |

| Component | Monthly Cost | Notes |
|---|---|---|
| GKE Cluster | $1,500 - $2,500 | Base cluster + node pools |
| Graph Instance Nodes | $2,000 - $8,000 | Variable; Spot VMs reduce 60-70% |
| Cloud SQL (HA) | $400 - $600 | db-custom-4-16384 |
| GCS Storage | $50 - $200 | ~100GB with lifecycle |
| Cloud NAT / Networking | $100 - $300 | Egress + NAT |
| Observability | $200 - $400 | Logging + Prometheus |
| Total | $4,250 - $12,000 | Highly variable based on usage |

  • Spot VMs: 60-70% cost reduction for graph instance nodes
  • KEDA Scale-to-Zero: Export workers scale to 0 when idle
  • GCS Lifecycle: Auto-delete old snapshots
  • Instance TTL: Auto-stop idle instances

| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Starburst Unavailability | Medium | High | Retry with backoff; clear error messages |
| 2 | Graph Size Exceeds Memory | Medium | Medium | Pre-export validation; soft/hard limits |
| 3 | Unauthorized Data Access | Low | Critical | RBAC; ownership model; audit logging |
| 4 | Pod Scheduling Delays | Low | Medium | Reserved capacity; priority classes |
| 5 | Export Worker Crashes | Medium | Low | Stateless workers; reconciliation recovery |
| 6 | Data Leakage via Snapshots | Low | High | GCS IAM; no public access |
| 7 | Memory Exhaustion | Medium | Medium | Burstable QoS; timeout enforcement |
| 8 | Network Issues | Low | High | Private endpoints; health checks |
| 9 | Orphaned Resources | Medium | Low | Reconciliation job; TTL enforcement |
| 10 | Deployment Failure | Low | Medium | GitOps; rolling updates; manual approval |

| GCP Component | Lock-in Level | Exit Strategy |
|---|---|---|
| GKE | Medium | Standard K8s; portable to any K8s |
| Cloud SQL | Low | PostgreSQL standard; pg_dump |
| GCS | Low | S3-compatible API; gsutil export |
| Workload Identity | Medium | Alternative: service account keys |
| Cloud Logging | Medium | OpenTelemetry export |
| Managed Prometheus | Low | Standard Prometheus; remote_write |

| Change Type | Approval Required | Deployment Method |
|---|---|---|
| Configuration | Team lead | Jenkins pipeline (`./infrastructure/cd/deploy.sh` → `kubectl apply`) |
| Feature release | Product owner | Jenkins pipeline with manual approval stage |
| Infrastructure | Platform team | Terraform apply |
| Security patch | Security team | Expedited Jenkins pipeline |

| Scenario | Runbook |
|---|---|
| High export queue depth | Scale KEDA min replicas |
| Instance stuck starting | Check GCS access, pod events |
| Database connection errors | Verify Cloud SQL status, connection pool |
| Algorithm timeout | Increase timeout, check graph size |


This is part of the Graph OLAP Platform architecture documentation. See also: Detailed Architecture, SDK Architecture, Domain & Data Architecture, Authorization.