Architectural Guardrails
Overview
This document defines the hard constraints that ALL designs and implementations must follow. These are non-negotiable decisions that have been locked to ensure consistency across the Graph OLAP Platform.
Prerequisites
- requirements.md - Functional requirements and scope
Technology Stack (Locked)
These technology choices are final. Do NOT propose alternatives.
| Layer | Technology | Rationale |
|---|---|---|
| Graph Database | Ryugraph (KuzuDB fork), FalkorDB | Multiple wrappers via pluggable architecture (see ADR-049) |
| Control Plane Backend | Python + FastAPI | Unified stack with Wrapper/Worker/SDK, async-native, GCP SDK support |
| Control Plane DB | PostgreSQL (Cloud SQL, DO Managed, or local pod) | Standard RDBMS, PostgreSQL everywhere |
| Ryugraph Wrapper | Python + FastAPI | Required for Ryugraph bindings, NetworkX in-process |
| FalkorDB Wrapper | Python 3.12+ + FastAPI | FalkorDBLite subprocess architecture, Cypher procedures |
| Jupyter SDK | Python | Target environment is Jupyter notebooks |
| Export Job Polling | APScheduler background job | Polls export_jobs table, calls Starburst Galaxy directly (see ADR-025) |
| Object Storage | Google Cloud Storage (GCS) | Parquet file storage, Starburst integration |
| Container Orchestration | GKE (Kubernetes) | GCP-native, managed control plane |
| IaC/GitOps | Terraform (IaC), Jenkins (CI), ./infrastructure/cd/deploy.sh + kubectl apply -f infrastructure/cd/resources/ (CD) | your approved toolchain. No Helm (except Zero-to-JupyterHub), no ArgoCD, no GitHub Actions. |
Architecture Patterns (MUST Follow)
1. Control Plane is the Single Source of Truth
- All state lives in the Control Plane database
- Workers, Ryugraph pods, and external clients MUST update state via Control Plane API
- No direct database access from workers or pods
- Control Plane API is the only interface for CRUD operations on mappings, snapshots, instances
DO:     Worker → HTTP POST /api/snapshots/:id/status → Control Plane → Database
DO NOT: Worker → Direct SQL INSERT → Database
2. Ryugraph Runs Embedded (Not as Server)
- Each graph instance runs Ryugraph in-process within a Python FastAPI wrapper
- One Ryugraph database per pod (file locking prevents multiple writers)
- Pod-per-instance model: each instance gets its own Kubernetes pod
- No shared Ryugraph server between instances
DO:     [Pod] FastAPI → Ryugraph embedded → Database files on PVC
DO NOT: [Pod] FastAPI → HTTP → [Separate Ryugraph Server]
3. Stateless Workers, Stateful Pods
- Export workers are stateless: can be scaled horizontally, restartable
- Ryugraph pods are stateful: each holds a graph instance in memory/disk
- Worker state flows through the export_jobs table and Control Plane API
- Pod state includes graph data (ephemeral, recreatable from snapshot)
4. Parquet as Interchange Format
- Starburst exports data as Parquet to GCS
- Ryugraph loads data via COPY FROM Parquet files
- No intermediate formats (CSV, JSON, etc.)
- Parquet column order must match schema definition order
4a. Async Export with Adaptive Polling
See ADR-025 for full rationale.
- Export is split into two phases: Submit (fast) and Poll (scheduled)
- Export Submitter submits UNLOAD queries and returns immediately
- Export Poller is an APScheduler background job that polls the export_jobs table and calls Starburst Galaxy directly
- Snapshot status: pending → exporting → ready (or failed)
- Instance creation blocked until snapshot status = ready
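
The status flow above can be sketched as a small transition table. This is an illustrative sketch rather than the platform's implementation: the names `SnapshotStatus`, `ALLOWED_TRANSITIONS`, `advance`, and `can_create_instance` are hypothetical. Only the four status values and the rule that instance creation waits for `ready` come from this document. Allowing `pending` to move directly to `failed` (for example, when submission itself fails) is an assumption.

```python
from enum import Enum


class SnapshotStatus(str, Enum):
    PENDING = "pending"
    EXPORTING = "exporting"
    READY = "ready"
    FAILED = "failed"


# Hypothetical transition table; ready and failed are treated as terminal.
ALLOWED_TRANSITIONS = {
    SnapshotStatus.PENDING: {SnapshotStatus.EXPORTING, SnapshotStatus.FAILED},
    SnapshotStatus.EXPORTING: {SnapshotStatus.READY, SnapshotStatus.FAILED},
    SnapshotStatus.READY: set(),
    SnapshotStatus.FAILED: set(),
}


def advance(current: SnapshotStatus, new: SnapshotStatus) -> SnapshotStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


def can_create_instance(status: SnapshotStatus) -> bool:
    """Instance creation is blocked until the snapshot is ready."""
    return status is SnapshotStatus.READY
```

Keeping the transitions in one table means a poller and an API handler can share the same check instead of each encoding the flow separately.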
DO:     Export Submitter → export_jobs table → APScheduler Poller → Starburst Galaxy
DO NOT: Export Worker sits idle polling Starburst for 30 minutes
5. Implicit Locking for Algorithms
- No explicit lock API - locks are internal implementation detail
- Lock acquired automatically when algorithm starts
- Lock released automatically when algorithm completes (success or failure)
- Lock includes: holder user ID, algorithm name, start timestamp
- If algorithm hangs, user must terminate instance (no manual lock release)
DO:     POST /algo/pagerank → [Wrapper acquires lock] → [Run algo] → [Wrapper releases lock]
DO NOT: POST /lock → POST /algo/pagerank → DELETE /lock
Race Condition Prevention
Lock acquisition must be atomic to prevent race conditions when concurrent requests arrive:
Implementation requirement:
- Use mutex/semaphore in Wrapper Pod to serialize lock check + acquire
- Pattern: acquire_mutex → check_lock → set_lock → release_mutex → run_algorithm
- If lock already held, return 409 immediately (no retry/wait)
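
The pattern above can be sketched with a short-lived mutex that guards only the check and set steps. This is a minimal illustration, not the Wrapper's actual code: `AlgorithmLock`, `LockInfo`, and `AlgorithmLockHeld` are hypothetical names, and `threading.Lock` stands in for whatever primitive the Wrapper uses (an async service would more likely use `asyncio.Lock`). The API layer is assumed to translate `AlgorithmLockHeld` into a 409 response.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LockInfo:
    holder_user_id: str
    algorithm: str
    started_at: float


class AlgorithmLockHeld(Exception):
    """Lock already held; the API layer would map this to HTTP 409."""


class AlgorithmLock:
    def __init__(self) -> None:
        self._mutex = threading.Lock()  # guards check + set only
        self._lock: Optional[LockInfo] = None

    def _try_acquire(self, user_id: str, algorithm: str) -> None:
        with self._mutex:  # acquire_mutex ... release_mutex
            if self._lock is not None:  # check_lock
                raise AlgorithmLockHeld(self._lock.algorithm)
            self._lock = LockInfo(user_id, algorithm, time.time())  # set_lock

    def _release(self) -> None:
        with self._mutex:
            self._lock = None

    def run(self, user_id: str, algorithm: str, fn: Callable[[], object]) -> object:
        self._try_acquire(user_id, algorithm)  # fails fast, no retry or wait
        try:
            return fn()  # the algorithm runs outside the mutex
        finally:
            self._release()  # released on success or failure
```

The mutex is never held while the algorithm runs, so a second request is rejected immediately instead of queuing behind a long computation.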
Sequence (safe):
  Request A: acquire_mutex → check (free) → set_lock(A) → release_mutex → run
  Request B: acquire_mutex → check (held by A) → release_mutex → return 409
Sequence (race without mutex - UNSAFE):
  Request A: check (free) ─────────────────────────► set_lock(A) → run
  Request B: ───────────────► check (free) → set_lock(B) → run ← CONFLICT!
6. Immutable Versions, Mutable Headers
- Mapping versions are immutable once created
- Mapping header (name, description, lifecycle settings) is mutable
- Editing node/edge definitions creates a new version (requires change description)
- Snapshots reference a specific mapping version
- Deleting a mapping requires deleting all snapshots first
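
One way to picture the split between mutable header and immutable versions is a frozen dataclass for versions inside an ordinary dataclass for the mapping. The sketch below is illustrative only: `Mapping`, `MappingVersion`, `add_version`, and the field names are hypothetical and do not reflect the shared schema definitions.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import List, Tuple


@dataclass(frozen=True)
class MappingVersion:
    """Immutable once created; attribute assignment raises FrozenInstanceError."""
    version: int
    node_definitions: Tuple[str, ...]
    change_description: str


@dataclass
class Mapping:
    name: str          # mutable header field
    description: str   # mutable header field
    versions: List[MappingVersion] = field(default_factory=list)

    def add_version(
        self, node_definitions: Tuple[str, ...], change_description: str
    ) -> MappingVersion:
        """Editing definitions appends a new version instead of mutating one."""
        if not change_description.strip():
            raise ValueError("change_description is required for a new version")
        version = MappingVersion(
            len(self.versions) + 1, node_definitions, change_description
        )
        self.versions.append(version)
        return version
```

Because snapshots reference a specific version, earlier entries in `versions` stay valid after later edits.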
7. Shared Schemas as Single Source of Truth
The graph-olap-schemas package defines authoritative Pydantic models for all inter-component communication. These schemas are the API contract between components.
Purpose of Shared Schemas:
- Compile-time validation - Type mismatches are caught during development, not at runtime
- Single source of truth - All components agree on data structures
- Documentation as code - Schemas document the API contracts
- Code generation - Can generate OpenAPI specs, API clients, JSON Schema
Consuming Components:
- Control Plane (defines and validates)
- Ryugraph Wrapper (consumes for API calls)
- Export Worker (consumes for status updates)
- Jupyter SDK (consumes for client models)
Schema Categories:
| Category | Purpose | Examples |
|---|---|---|
| definitions | Core domain models | NodeDefinition, EdgeDefinition, PropertyDefinition |
| api_resources | External API request/response | CreateMappingRequest, MappingResponse |
| api_internal | Internal component communication | UpdateInstanceStatusRequest, InstanceMappingResponse |
| api_common | Shared patterns | DataResponse, ErrorResponse, PaginationParams |
Correct Usage Pattern:
# In component code (e.g., wrapper, worker)
from graph_olap_schemas import (
    NodeDefinition,
    UpdateInstanceStatusRequest,
    InstanceMappingResponse,
)

# Use shared types directly for API communication
async def update_status(self, status: str) -> None:
    request = UpdateInstanceStatusRequest(status=status)
    await self._post("/status", json=request.model_dump(exclude_none=True))

# Use utility functions for component-specific logic
def generate_node_ddl(node: NodeDefinition) -> str:
    """Generate DDL from shared NodeDefinition - NOT a subclass."""
    columns = [f"{node.primary_key.name} {node.primary_key.type.value} PRIMARY KEY"]
    for prop in node.properties:
        columns.append(f"{prop.name} {prop.type.value}")
    return f"CREATE NODE TABLE {node.label}({', '.join(columns)})"
Why Schemas Must Not Be Extended:
- Extended schemas become incompatible types (Pydantic model inheritance creates distinct types)
- Values from API responses can’t be used directly with extended types
- Defeats compile-time validation (mismatches hidden by type conversion)
- Creates maintenance burden keeping extensions synchronized
See Anti-Patterns section for prohibited patterns.
8. Pluggable Multi-Wrapper Architecture
- Wrapper types are enum-defined in shared schemas (prevents typos, enables compile-time validation)
- Capabilities are declarative in WRAPPER_CAPABILITIES registry (feature discovery, validation)
- Configuration is centralized in WrapperFactory service (no scattered wrapper-specific logic)
- Wrapper selection is user-driven at instance creation time via wrapper_type parameter
- Adding new wrappers requires: enum value, capabilities entry, factory config, wrapper package, helm chart
- Backward compatibility maintained via defaults (Ryugraph is default wrapper type)
See ADR-049: Multi-Wrapper Pluggable Architecture for full design.
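
As a rough illustration of how the enum, capabilities registry, and factory fit together, consider the sketch below. The names `WrapperType`, `WRAPPER_CAPABILITIES`, `WrapperFactory`, and `get_wrapper_config` appear in this document, but everything else is assumed for the example: the `WrapperConfig` fields, the image names, the capability strings, and the `supports` helper are hypothetical, and ADR-049 remains the authoritative design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet


class WrapperType(str, Enum):
    RYUGRAPH = "ryugraph"
    FALKORDB = "falkordb"


@dataclass(frozen=True)
class WrapperConfig:
    image: str  # hypothetical field
    port: int   # hypothetical field


# Declarative registry; the capability names are illustrative placeholders.
WRAPPER_CAPABILITIES: Dict[WrapperType, FrozenSet[str]] = {
    WrapperType.RYUGRAPH: frozenset({"networkx_algorithms"}),
    WrapperType.FALKORDB: frozenset({"cypher_procedures"}),
}


class WrapperFactory:
    """Single place that maps a wrapper type to its runtime configuration."""

    _CONFIGS: Dict[WrapperType, WrapperConfig] = {
        WrapperType.RYUGRAPH: WrapperConfig("ryugraph-wrapper:example", 8000),
        WrapperType.FALKORDB: WrapperConfig("falkordb-wrapper:example", 8000),
    }

    def get_wrapper_config(
        self, wrapper_type: WrapperType = WrapperType.RYUGRAPH
    ) -> WrapperConfig:
        # WrapperType(...) rejects unknown strings with ValueError.
        return self._CONFIGS[WrapperType(wrapper_type)]


def supports(wrapper_type: WrapperType, capability: str) -> bool:
    """Query the registry instead of assuming a feature exists."""
    return capability in WRAPPER_CAPABILITIES[WrapperType(wrapper_type)]
```

Defaulting the parameter to Ryugraph mirrors the backward-compatibility rule above; callers never branch on the wrapper type themselves.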
DO:
    wrapper_config = factory.get_wrapper_config(instance.wrapper_type)
DO NOT:
    if instance.wrapper_type == "ryugraph":
        config = {...}
    elif instance.wrapper_type == "falkordb":
        config = {...}
9. Owner-Based Permissions
- All resources are visible to all analysts (no access control lists)
- Only owner can modify/delete their resources
- Admins can modify/delete any resource
- Instance queries are open (anyone can query any instance)
- Instance algorithms are owner-restricted (analysts can only run on their own instances)
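
These rules reduce to a few predicate functions. The sketch below is a simplified illustration under stated assumptions: `Role`, `User`, and the three function names are hypothetical, and treating the ops role like admin for modify and algorithm checks follows the permission matrix in this document rather than the bullets above.

```python
from dataclasses import dataclass
from enum import Enum


class Role(str, Enum):
    ANALYST = "analyst"
    ADMIN = "admin"
    OPS = "ops"


@dataclass(frozen=True)
class User:
    username: str
    role: Role


def can_modify(user: User, owner_username: str) -> bool:
    """Owners modify their own resources; admin and ops may modify any."""
    return user.role in (Role.ADMIN, Role.OPS) or user.username == owner_username


def can_query_instance(user: User) -> bool:
    """Instance queries are open to every authenticated user."""
    return True


def can_run_algorithm(user: User, instance_owner: str) -> bool:
    """Algorithms are owner-restricted for analysts."""
    return user.role in (Role.ADMIN, Role.OPS) or user.username == instance_owner
```

A handler that gets `False` from one of these checks would respond with 403 and the PERMISSION_DENIED error code.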
10. Authentication Model
References:
- ADR-104: DB-Backed User Auth Model
The platform authenticates users via DB-backed user records:
| Path | Use Case | Flow | Validation |
|---|---|---|---|
| API Access | SDK, scripts, CI/CD | X-Username header | Control Plane looks up user record in users table |
| User Sessions | Browser (JupyterHub) | OAuth2 redirect | oauth2-proxy handles OIDC flow, sets X-Username |
API Access Path: Client → Control Plane (X-Username header → users table lookup) → Response
Browser Path: Browser → oauth2-proxy → Auth0 → Cookie → JupyterHub (X-Username injected)
Key Constraints:
- User identity is resolved via X-Username header, mapped to a DB-backed user record
- role column in users table determines permissions (analyst, admin, ops)
- Network policies must ensure Control Plane is only accessible via proxy or internal traffic
- Updated for ADR-104
11. AMD64 Platform Enforcement
Reference: ADR-076: Earthfile Build System Modernization
All container images MUST be built for AMD64 architecture:
# All Dockerfiles MUST include explicit platform
FROM --platform=linux/amd64 python:3.11-slim
Rationale:
- GKE runs AMD64 nodes
- ARM64 (Apple Silicon) development machines build for wrong architecture
- Content-addressable builds require consistent platform hashes
Permission Matrix by Endpoint
| Endpoint Category | Analyst | Admin | Ops |
|---|---|---|---|
| Mappings | List all, CRUD own, copy any | CRUD any | CRUD any |
| Snapshots | List all, CRUD own | CRUD any | CRUD any |
| Instances | List all, CRUD own, query any | CRUD any, query any | CRUD any, query any |
| Algorithms | Run on own instances only | Run on any instance | Run on any instance |
| Favorites | CRUD own favorites | CRUD own favorites | CRUD own favorites |
| Config (lifecycle, concurrency) | No access | No access | Read/Write |
| Cluster (health, metrics) | No access | No access | Read |
| Export Queue | Read own only | Read all | Read all |
| Ops Jobs | No access | No access | Full access |
See system-design/authorization.spec.md for the complete authoritative RBAC specification.
Authorization Error Responses
| Scenario | HTTP Status | Error Code |
|---|---|---|
| Missing/invalid auth token | 401 | UNAUTHORIZED |
| Analyst tries to modify other’s resource | 403 | PERMISSION_DENIED |
| Analyst tries to run algorithm on other’s instance | 403 | PERMISSION_DENIED |
| Analyst tries to access admin/ops endpoints | 403 | PERMISSION_DENIED |
| Admin tries to access ops-only endpoints | 403 | PERMISSION_DENIED |
Anti-Patterns (MUST NOT Do)
This is the authoritative list of anti-patterns. Other documents reference this section.
Database & Schema
- DO NOT access Control Plane database from Export Worker
- DO NOT access Control Plane database from Ryugraph Wrapper
- DO NOT have Control Plane access Ryugraph database directly
- DO NOT store algorithm results in Control Plane database
- PostgreSQL is the only supported database (SQLite not supported)
- DO NOT store raw Starburst query results (only metadata)
- DO NOT add columns for derived/computed data (compute at query time)
- DO NOT create circular foreign key dependencies
- DO NOT bypass application layer for direct database modifications
Query Patterns
- DO NOT use SELECT * in application queries (always list columns)
- DO NOT update multiple tables in a single transaction without necessity
- DO NOT use ORM lazy loading patterns (explicit joins only)
- DO NOT perform unbounded queries without LIMIT
Shared Schemas
- DO NOT extend shared schema classes with subclasses (use utility functions instead)
- DO NOT define duplicate models that mirror shared schemas
- DO NOT build request/response payloads as raw dicts when a shared schema exists
- DO NOT add component-specific fields to shared schemas (propose changes to shared package)
- DO NOT use .model_dump() without validating against the shared schema first
- DO NOT ignore Pydantic validation errors (they indicate API contract violations)
DO:
    from graph_olap_schemas import NodeDefinition
    def generate_ddl(node: NodeDefinition) -> str: ...
DO NOT:
    class NodeDefinition(BaseNodeDefinition):  # Extending schema
        def to_ddl(self) -> str: ...
API Design
- DO NOT create bulk modification endpoints (handle in clients)
- DO NOT create explicit lock/unlock endpoints (lock is implicit)
- DO NOT return 429 for concurrency limits (use 409 Conflict)
- DO NOT use cursor-based pagination (use offset/limit)
- DO NOT accept requests without Content-Type: application/json header
- DO NOT return HTML error pages; always return JSON error responses
- DO NOT expose internal error details (stack traces, SQL queries) in production
- DO NOT allow PUT/PATCH to modify immutable fields (id, created_at, owner_username)
- DO NOT create new mapping versions without change_description
- DO NOT allow snapshot creation from non-existent mapping versions
- DO NOT allow instance creation from snapshots not in ‘ready’ status
- DO NOT modify user SQL queries during validation (validate only, never transform)
- DO NOT skip SQL validation when updating mapping definitions
- DO NOT allow snapshot creation without successful SQL validation
Authentication & Authorization
- DO NOT allow unauthenticated access to any endpoint (except health checks)
- DO NOT allow analysts to modify resources they don’t own (except admins)
- DO NOT allow analysts to run algorithms on instances they don’t own (except admins)
- DO NOT expose user credentials or API keys in responses or logs
- DO NOT store user passwords (use external IdP)
- DO NOT store sensitive data unencrypted (API keys, tokens)
Data Handling & GCS
- DO NOT export algorithm results back to snapshots
- DO NOT modify graph structure after instance creation (no add/delete nodes/edges)
- DO NOT store Parquet files outside the defined path structure
- DO NOT load edges before nodes (edges reference node primary keys)
- DO NOT store algorithm results in GCS (results stay in Ryugraph only)
- DO NOT access GCS from Control Plane (only Worker and Wrapper access GCS)
- DO NOT share GCS paths between unrelated snapshots
- DO NOT delete GCS data before database record deletion succeeds
Resource Lifecycle
- DO NOT delete snapshots while active instances exist (starting, running)
- DO NOT delete mappings while snapshots exist
- DO NOT allow lifecycle settings (TTL, inactivity_timeout) to exceed hard limits
- DO NOT allow inactivity timeout > TTL (logical constraint)
- DO NOT set lifecycle values on resources that reference deleted parents
- DO NOT terminate instances during algorithm execution (lock must be respected)
- DO NOT bypass lifecycle cleanup for any resource type
- DO NOT allow instance creation when concurrency limits exceeded (return 409)
- DO NOT delete a resource without also cleaning up its favorites references
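
The two lifecycle constraints above (hard limits and inactivity timeout not exceeding TTL) lend themselves to one validation function. The following is a sketch under assumptions: `validate_lifecycle` and its `max_ttl` parameter are hypothetical, the hard limit is passed in because this document does not state its value, and rejecting non-positive durations is an added assumption.

```python
from datetime import timedelta


def validate_lifecycle(
    ttl: timedelta, inactivity_timeout: timedelta, max_ttl: timedelta
) -> None:
    """Raise ValueError if the lifecycle settings break a constraint."""
    zero = timedelta(0)
    if ttl <= zero or inactivity_timeout <= zero:
        # Assumption: zero or negative durations are never meaningful.
        raise ValueError("lifecycle values must be positive")
    if ttl > max_ttl:
        raise ValueError("TTL exceeds hard limit")
    if inactivity_timeout > ttl:
        raise ValueError("inactivity timeout must not exceed TTL")
```

Running this check on both create and update paths keeps the constraints from being bypassed by a later edit to the header.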
Concurrency & Pod Lifecycle
- DO NOT run multiple algorithms simultaneously on same instance
- DO NOT open multiple write transactions to same Ryugraph database
- DO NOT share Ryugraph Connection objects across async tasks
- DO NOT allow multiple Ryugraph processes in one pod (file locking)
- DO NOT keep instance pods running after termination request while locked
- DO NOT store persistent state in Wrapper pods (ephemeral only)
- DO NOT store lock state in Control Plane (query Wrapper Pod directly)
Component Communication & Network
- DO NOT use synchronous calls for long-running operations (use async job queue via export_jobs table)
- DO NOT bypass the Control Plane for status updates
- DO NOT expose internal APIs to external traffic
- DO NOT allow direct pod-to-pod communication (route through services)
- DO NOT bypass Ingress for external instance access
- DO NOT allow duplicate favorites for same user/resource combination
Multi-Wrapper Architecture
- DO NOT hardcode wrapper-specific configuration in K8s service (use WrapperFactory)
- DO NOT use string literals for wrapper types (use WrapperType enum)
- DO NOT add wrapper-specific logic outside WrapperFactory (centralize configuration)
- DO NOT bypass wrapper capabilities registry (query capabilities, don’t assume)
- DO NOT allow wrapper_type mutation after instance creation (immutable property)
- DO NOT create wrapper-specific API endpoints (use wrapper_type parameter instead)
- DO NOT duplicate wrapper configuration between WrapperFactory and Helm charts (WrapperFactory is source of truth for runtime config)
Export Processing
- DO NOT poll Starburst in a long-running blocking call (use APScheduler background job polling export_jobs table)
- DO NOT use fixed-interval polling (use adaptive backoff intervals)
- DO NOT allow instance creation from snapshots with status != ‘ready’
- DO NOT mark snapshot as ‘ready’ until ALL export_jobs are ‘completed’
- DO NOT continue polling after Starburst returns FINISHED or FAILED state
- DO NOT lose Starburst query_id or next_uri (required for polling resumption)
SQL Validation Strategy
Mapping SQL queries are validated at multiple points to provide early feedback and prevent runtime errors.
Validation Timing
| Stage | Validation | Blocking |
|---|---|---|
| Mapping Create/Update | SQL syntax check (EXPLAIN) | Yes |
| Mapping Create/Update | Column type inference | Yes |
| Snapshot Create | Full execution dry-run (LIMIT 0) | Yes |
| Snapshot Export | Actual execution | Yes (fails snapshot) |
Validation Details
On Mapping Create/Update (synchronous, via Control Plane):
- For each node/edge SQL query:
- Run EXPLAIN on query → validates syntax
- Run query with LIMIT 0 → validates tables/columns exist
- Infer column types from result schema
- Validate primary_key column exists and type matches
- Validate from_key/to_key columns exist for edges
- If any validation fails, reject with VALIDATION_FAILED error
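
The key-column checks at the end of this list can be sketched as pure functions over an inferred schema. The example assumes the LIMIT 0 run has already produced a column-name-to-type dictionary; `ValidationFailed`, `validate_node_columns`, and `validate_edge_columns` are hypothetical names, and the message text loosely follows the examples in the Error Messages table.

```python
from typing import Dict


class ValidationFailed(Exception):
    code = "VALIDATION_FAILED"


def validate_node_columns(
    label: str, inferred: Dict[str, str], primary_key: str, expected_type: str
) -> None:
    """Check that the primary key column exists and has the expected type."""
    if primary_key not in inferred:
        raise ValidationFailed(
            f"Column '{primary_key}' not found in result of node '{label}'"
        )
    actual = inferred[primary_key]
    if actual != expected_type:
        raise ValidationFailed(
            f"Primary key '{primary_key}' has type {actual}, expected {expected_type}"
        )


def validate_edge_columns(
    edge_type: str, inferred: Dict[str, str], from_key: str, to_key: str
) -> None:
    """Check that both endpoint key columns exist in the edge query result."""
    missing = [key for key in (from_key, to_key) if key not in inferred]
    if missing:
        raise ValidationFailed(
            f"Column(s) {', '.join(missing)} not found in result of edge '{edge_type}'"
        )
```

Because the functions only inspect the inferred schema, they never touch or rewrite the user's SQL, which keeps validation separate from transformation.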
On Snapshot Create (synchronous, before queueing):
- Run each query with LIMIT 0 to verify schema hasn’t changed
- This catches cases where:
- Tables were renamed/dropped since mapping created
- Columns were removed
- Permissions changed
- If validation fails, reject with STARBURST_ERROR (don’t queue)
Stale Mapping Handling
If Starburst schema changes after a mapping is created:
- Existing snapshots remain valid (data already exported)
- New snapshot creation fails with clear error message
- User must update mapping to fix SQL queries
- No automatic migration or fixing
Error Messages
| Scenario | Error Code | Message Example |
|---|---|---|
| SQL syntax error | VALIDATION_FAILED | "SQL syntax error at position 45: unexpected token 'FORM'" |
| Table not found | VALIDATION_FAILED | "Table 'analytics.customers' does not exist" |
| Column not found | VALIDATION_FAILED | "Column 'customer_id' not found in result of node 'Customer'" |
| Type mismatch | VALIDATION_FAILED | "Primary key 'id' has type INTEGER, expected STRING" |
| Schema changed | STARBURST_ERROR | "Column 'city' no longer exists in 'analytics.customers'" |
Naming Conventions
API Endpoints
Pattern: /api/{resource}s              (plural, lowercase)
Pattern: /api/{resource}s/:id          (singular resource by ID)
Pattern: /api/{resource}s/:id/{action} (verb for operations)

Examples:
  GET /api/mappings
  GET /api/mappings/:id
  PUT /api/mappings/:id/lifecycle
  POST /api/instances/:id/terminate
Database Tables
Pattern: {entity}_plural (lowercase, snake_case)
Examples: users, mappings, mapping_versions, snapshots, instances, export_queue
GCS Paths
Pattern: gs://bucket/{user_id}/{mapping_id}/{snapshot_id}/
         ├── nodes/{node_label}/*.parquet
         └── edges/{edge_type}/*.parquet

Note: user_id is the snapshot owner (not mapping owner)
Kubernetes Resources
Pattern: graph-instance-{instance_id} (pod name)
Pattern: graph-svc-{instance_id}      (service name)
Environment Variables
Pattern: GRAPH_OLAP_{COMPONENT}_{SETTING}
Examples:
  GRAPH_OLAP_CONTROL_PLANE_DB_URL
  GRAPH_OLAP_WORKER_PUBSUB_TOPIC
  GRAPH_OLAP_WRAPPER_BUFFER_POOL_SIZE
Error Handling Standards
HTTP Status Codes
| Code | Usage |
|---|---|
| 200 | Successful GET, PUT, DELETE |
| 201 | Successful POST (resource created) |
| 400 | Invalid request body, validation failure |
| 401 | Missing or invalid authentication |
| 403 | Permission denied (not owner, not admin) |
| 404 | Resource not found |
| 409 | Conflict (concurrency limit, resource has dependencies, locked) |
| 500 | Internal server error |
| 503 | Service unavailable (maintenance mode, Starburst unreachable, database unavailable) |
Error Response Format
All errors MUST return this JSON structure:
{
  "error": {
    "code": "CONCURRENCY_LIMIT_EXCEEDED",
    "message": "Cannot create instance: analyst limit exceeded",
    "details": {
      "current_count": 5,
      "max_allowed": 5,
      "limit_type": "per_analyst"
    }
  }
}
Request Tracing
All API responses MUST include X-Request-ID header:
- Generate UUID for each incoming request
- Include in all log entries for that request
- Return in response header for client correlation
- Accept client-provided X-Request-ID if present (use theirs instead of generating)
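
The accept-or-generate rule can be expressed as one small helper. This is a framework-agnostic sketch, assuming request headers are available as a plain mapping; `resolve_request_id` is a hypothetical name, and ignoring a blank client value is an added assumption.

```python
import uuid
from typing import Mapping


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Use the client's X-Request-ID if present, otherwise generate a UUID."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-request-id" and value.strip():
            return value.strip()
    return str(uuid.uuid4())
```

The returned value would then be attached to every log entry for the request and echoed back in the response header.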
Error Codes
See api.common.spec.md for the authoritative list of error codes and HTTP status mappings.
Performance Constraints
Instance Limits
| Metric | Limit | Notes |
|---|---|---|
| Max graph size | ~2GB in memory | Ryugraph buffer pool + disk spilling |
| Max instances per analyst | Configurable (Ops) | Enforced at creation time |
| Max cluster instances | Configurable (Ops) | Enforced at creation time |
| Default TTL | <24 hours | Configurable per instance |
| Inactivity timeout | Configurable | Auto-terminate after no queries/algos |
Timeouts
| Operation | Timeout | Notes |
|---|---|---|
| Starburst UNLOAD | Configurable | Worker config, default 30 min |
| Instance startup | 5 minutes | Including COPY FROM parquet |
| Cypher query | 60 seconds | Default, configurable |
| Algorithm execution | 30 minutes | Default, per-algorithm configurable |
| Health check | 10 seconds | Kubernetes liveness probe |
API Pagination
| Parameter | Default | Max |
|---|---|---|
| limit | 50 | 100 |
| offset | 0 | No limit |
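
A helper that applies these defaults and the maximum might look like the sketch below. `resolve_pagination` is a hypothetical name, and two behaviours are assumptions because this document does not specify them: an oversized `limit` is clamped to the maximum rather than rejected, and non-positive limits or negative offsets raise an error.

```python
from typing import Optional, Tuple

DEFAULT_LIMIT = 50
MAX_LIMIT = 100


def resolve_pagination(
    limit: Optional[int] = None, offset: Optional[int] = None
) -> Tuple[int, int]:
    """Apply defaults and clamp limit to MAX_LIMIT (clamping is an assumption)."""
    limit = DEFAULT_LIMIT if limit is None else limit
    offset = 0 if offset is None else offset
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return min(limit, MAX_LIMIT), offset
```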
Retry Configuration
| Operation | Max Retries | Backoff Strategy | Notes |
|---|---|---|---|
| Starburst query submission | 3 | Exponential (1s, 2s, 4s) | Export Submitter retries |
| Export polling (APScheduler) | Unlimited | Adaptive intervals | Polls until done or failed |
| Status update to Control Plane | 5 | Fixed (1s) | Critical path |
| GCS operations | 3 | Exponential (500ms, 1s, 2s) | Transient failures |
| Instance startup | 0 | None | Fail fast, user retries manually |
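
The exponential rows in this table follow a doubling schedule, which the sketch below reproduces. The function names are hypothetical, the `sleep` parameter exists only so the delay can be stubbed in tests, and catching every `Exception` is a simplification: real code would retry only on errors known to be transient.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def exponential_delays(base_seconds: float, max_retries: int) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... for max_retries retries."""
    for attempt in range(max_retries):
        yield base_seconds * (2 ** attempt)


def call_with_retries(
    fn: Callable[[], T],
    base_seconds: float,
    max_retries: int,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then retry up to max_retries times with doubling delays."""
    delays = exponential_delays(base_seconds, max_retries)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

With `base_seconds=1` and `max_retries=3` the delays match the Starburst submission row; `base_seconds=0.5` matches the GCS row.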
Export Polling Failure Handling (APScheduler):
- APScheduler polls the export_jobs table at configurable intervals
- On persistent failure, snapshot status is set to failed with error message
- Failed exports require manual investigation or retry via SDK
- Ops dashboard shows failed export count for alerting
Security Boundaries
Network Zones
Mermaid Source
---
title: Network Zones
config:
  layout: elk
---
flowchart TB
    accTitle: Graph OLAP Platform Network Zones
    accDescr: Shows GCP project network boundaries with GKE cluster, managed services, and external connections

    subgraph GCP["GCP Project"]
        subgraph GKE["GKE Cluster"]
            CP["Control Plane<br/>(Internal LB)"]
            Pods["Instance Pods<br/>(Ingress routed)"]
            Ingress["Ingress Controller"]

            CP --> Ingress
            Pods --> Ingress
        end

        SQL["Cloud SQL<br/>(Private IP)"]
        Worker["Export Worker<br/>(GKE Pod)"]

        CP --> SQL
        CP --> Worker
    end

    subgraph External["External Systems"]
        Starburst["Starburst<br/>(VPC Peering)"]
        Jupyter["Jupyter Notebooks<br/>(Separate VPC)"]
    end

    Worker --> Starburst
    Jupyter --> Ingress

    classDef infra fill:#E3F2FD,stroke:#1565C0,stroke-width:2px,color:#0D47A1
    classDef service fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    classDef data fill:#FFF8E1,stroke:#F57F17,stroke-width:2px,color:#E65100
    classDef external fill:#ECEFF1,stroke:#455A64,stroke-width:2px,color:#263238

    class CP,Pods,Ingress,Worker service
    class SQL data
    class Starburst,Jupyter external
Credential Management
| Credential | Storage | Access |
|---|---|---|
| Starburst credentials | Secret Manager | Worker only |
| GCS service account | Workload Identity | Worker, Wrapper pods |
| Cloud SQL credentials | Secret Manager | Control Plane only |
| API keys (Jupyter SDK) | Control Plane DB | Issued to users |
Data Classification
| Data Type | Sensitivity | Handling |
|---|---|---|
| Graph data (Parquet) | Business confidential | GCS bucket with IAM |
| Mapping definitions | Internal | Stored in Control Plane DB |
| Query/algorithm logs | Audit | External observability stack (retained per policy) |
| Algorithm results | Ephemeral | In-memory, not persisted |
Compatibility Requirements
Database Requirements
PostgreSQL is required in all environments (SQLite is not supported).
Target cluster deployment (per ADR-128):
- GCP Staging/Production: Cloud SQL PostgreSQL (project your-gcp-project, region asia-east2)
-- PostgreSQL-specific features are allowed
SELECT * FROM mappings WHERE owner_username = ? AND created_at > ?;
SELECT * FROM export_queue FOR UPDATE SKIP LOCKED; -- PostgreSQL row locking
Type Mappings
| Logical Type | PostgreSQL | Python |
|---|---|---|
| UUID | TEXT | str |
| Timestamp | TEXT (ISO 8601) | datetime |
| Duration | TEXT (ISO 8601) | timedelta |
| Boolean | INTEGER (0/1) | bool |
| JSON | TEXT | dict |
Observability Standards
Structured Logging
All log entries MUST be JSON with these fields:
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "component": "control-plane",
  "trace_id": "abc123",
  "user_id": "user-uuid",
  "message": "Snapshot created",
  "resource_type": "snapshot",
  "resource_id": "snapshot-uuid",
  "duration_ms": 1523
}
Metrics (Prometheus format)
| Metric | Type | Labels |
|---|---|---|
| graph_olap_request_duration_seconds | Histogram | method, endpoint, status |
| graph_olap_instances_active | Gauge | status |
| graph_olap_snapshots_total | Counter | status |
| graph_olap_algorithm_duration_seconds | Histogram | algorithm, status |
| graph_olap_query_duration_seconds | Histogram | status |
Health Checks
| Endpoint | Component | Checks |
|---|---|---|
| /health | Control Plane | DB connectivity |
| /health | Worker | Starburst Galaxy connectivity, export_jobs table access |
| /health | Wrapper | Ryugraph ready, GCS accessible |
| /ready | Wrapper | Graph data loaded |
Open Questions
See decision.log.md for consolidated open questions and architecture decision records.