RyuGraph Performance Reference

This document captures performance characteristics, threading models, and optimal configuration for RyuGraph (KuzuDB fork) in the Graph OLAP Platform context.

KuzuDB uses a memory-mapped, disk-backed storage architecture:

┌─────────────────────────────────────────────────────────┐
│                       Application                       │
├─────────────────────────────────────────────────────────┤
│                       Buffer Pool                       │
│                   (configurable size)                   │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │
│  │ Page 1  │  │ Page 2  │  │ Page 3  │  │   ...   │     │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘     │
├─────────────────────────────────────────────────────────┤
│                Disk Storage (Persistent)                │
│         + Spill Files (Temporary, auto-cleaned)         │
└─────────────────────────────────────────────────────────┘

Key characteristics:

  • Buffer pool: In-memory cache for frequently accessed pages
  • Disk spilling: Automatic overflow to disk when buffer pool is full
  • Columnar storage: Data stored in columns for vectorized processing
  • MVCC: Multi-version concurrency control for transactions

KuzuDB uses a single-writer, multiple-reader model:

| Operation | Concurrency | Notes |
|-----------|-------------|-------|
| Read queries | Parallel | Multiple connections can read simultaneously |
| Write transactions | Serialized | Only one write at a time (internal locking) |
| COPY FROM | Serialized | Each COPY FROM is a write transaction |

Important: Even with multiple connections, write operations (including COPY FROM) are serialized internally by KuzuDB’s transaction manager.


From KuzuDB COPY Pipeline documentation:

┌─────────────────────────────────────────────────────────────────┐
│                       COPY FROM Pipeline                        │
│                                                                 │
│  ┌─────────┐    ┌───────────┐    ┌─────────────┐    ┌──────┐    │
│  │ READER  │───→│ COPY_NODE │───→│ BUILD_INDEX │───→│ SINK │    │
│  │ Thread1 │    └───────────┘    └─────────────┘    └──────┘    │
│  ├─────────┤                                                    │
│  │ READER  │    All READERs share ReaderSharedState             │
│  │ Thread2 │    (coordinates file chunk distribution)           │
│  ├─────────┤                                                    │
│  │ READER  │    Each processes "morsels" (2048 tuples)          │
│  │ Thread3 │    independently                                   │
│  └─────────┘                                                    │
└─────────────────────────────────────────────────────────────────┘

Morsel-driven parallelism: Work is divided into small chunks (morsels) that are dynamically distributed across threads during runtime. Each thread processes its chunk independently, synchronizing only when necessary.
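A minimal sketch of the idea (illustrative Python, not KuzuDB's actual C++ scheduler): a shared cursor hands out 2048-tuple morsels, and each worker claims its next morsel only after finishing the current one, so faster threads naturally take more work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MORSEL_SIZE = 2048
TOTAL_TUPLES = 10_000

cursor = 0
cursor_lock = threading.Lock()
processed = []

def claim_morsel():
    """Atomically claim the next [start, end) morsel, or None when done."""
    global cursor
    with cursor_lock:
        if cursor >= TOTAL_TUPLES:
            return None
        start, cursor = cursor, min(cursor + MORSEL_SIZE, TOTAL_TUPLES)
        return start, cursor

def worker():
    while (morsel := claim_morsel()) is not None:
        # ... decode / copy the tuples in this range independently ...
        with cursor_lock:  # synchronize only to publish the result
            processed.append(morsel)

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(worker)
# processed now covers [0, 10000) in five morsels
```

Dynamic claiming is what distinguishes morsel-driven scheduling from static range partitioning: no thread can be stuck with a disproportionately slow chunk.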

Node COPY Pipeline:

READER → COPY_NODE → BUILD_INDEX → COPY_SINK

Relationship COPY Pipeline (more complex):

Pipeline 1 (child): READER → INDEX_LOOKUP → REL_SHUFFLE
Pipeline 2 (parent): DATA_CHUNK_SCAN → COPY_REL_COLUMNS → COPY_REL_LISTS → COPY_SINK

The REL_SHUFFLE operator accumulates and partitions relationship data for node-group-specific processing.
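The partition step can be pictured as bucketing each relationship by its source node's group. A toy sketch (the `NODE_GROUP_SIZE` value and `shuffle_rels` helper are made up for illustration; the real operator works on columnar chunks with a much larger group size):

```python
from collections import defaultdict

# Hypothetical group size for illustration only.
NODE_GROUP_SIZE = 4

def shuffle_rels(rels):
    """Partition (src, dst) tuples by the source node's group."""
    partitions = defaultdict(list)
    for src, dst in rels:
        partitions[src // NODE_GROUP_SIZE].append((src, dst))
    return dict(partitions)

parts = shuffle_rels([(0, 9), (1, 3), (5, 2), (7, 0)])
# nodes 0 and 1 fall in group 0; nodes 5 and 7 fall in group 1
```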

COPY FROM reader modes:

| Mode | Use Case | Behavior |
|------|----------|----------|
| Serial | Node tables with SERIAL primary key | Sequential reads, one node group at a time |
| Non-Serial | All other tables | Parallel reads, 2048-tuple chunks |

Time breakdown per operation (approximate):

├── Network I/O (GCS HTTP requests) ─────────────── 70-85%
│   ├── Latency: 50-200ms per request
│   └── Waiting for bytes over network
├── Parquet Decoding ───────────────────────────── 10-20%
│   ├── Decompression (snappy/zstd)
│   ├── Column parsing
│   └── SIMD-optimized, fast
├── Memory/Buffer Operations ───────────────────── 3-8%
│   ├── Buffer pool allocation
│   └── Data structure construction
└── Disk I/O (storage write) ───────────────────── 2-5%
    ├── SSD/NVMe is fast
    └── Only bottleneck at very high throughput

Key insight: COPY FROM GCS is I/O bound, not CPU bound. Network latency dominates execution time.

From KuzuDB 0.7.0 Release:

| Buffer Pool | Load Time (17B edges) | Relative |
|-------------|-----------------------|----------|
| 420 GB | 60 min | 1.0x |
| 205 GB | 68 min | 1.13x |
| 102 GB | 70 min | 1.17x |

Analysis: 4x less memory = only 17% slower. This confirms I/O-bound behavior.
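The "Relative" column is each load time divided by the 420 GB baseline:

```python
# Relative slowdown = load time / baseline load time (420 GB, 60 min).
baseline = 60
relative = {420: 60 / baseline, 205: 68 / baseline, 102: 70 / baseline}
# -> {420: 1.0, 205: ~1.13, 102: ~1.17}
```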

From DuckDB GCS Performance Issue:

| Method | Time |
|--------|------|
| Sequential HTTP | 100s |
| 8 parallel threads | 15-17s |
| Speedup | 6x |

The bottleneck is request-level concurrency, not bandwidth or CPU.
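A simple latency model reproduces these numbers. Assuming roughly 1,000 requests at ~100ms each (an assumption consistent with the 100s sequential figure), ideal 8-way concurrency gives 12.5s; the observed 15-17s is that plus scheduling overhead and per-request variance:

```python
import math

# Request-level concurrency model: n identical requests of latency L on
# t parallel threads take roughly ceil(n / t) * L wall-clock seconds.
def wall_time(n_requests: int, latency_s: float, threads: int) -> float:
    return math.ceil(n_requests / threads) * latency_s

sequential = wall_time(1000, 0.1, 1)  # 100.0s, matching the table
parallel = wall_time(1000, 0.1, 8)    # 12.5s ideal; observed 15-17s
```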


For I/O-bound workloads, optimal thread count follows:

optimal_threads = CPUs × (1 + wait_time / compute_time)

For GCS Parquet loading:

  • Wait time (network): ~100ms average
  • Compute time (decode): ~10ms average
  • Ratio: 10:1
  • Optimal: CPUs × 4-10
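A sketch of the sizing calculation, with the practical 4-10x clamp applied on top of the raw formula (the `optimal_io_threads` helper is illustrative, not an actual RyuGraph API):

```python
# Little's-law style sizing: cpus * (1 + wait/compute), with the multiplier
# clamped to the practical 4-10x range used in this document.
def optimal_io_threads(cpus: int, wait_ms: float, compute_ms: float,
                       lo: float = 4, hi: float = 10) -> int:
    raw = cpus * (1 + wait_ms / compute_ms)
    multiplier = min(hi, max(lo, raw / cpus))
    return int(cpus * multiplier)

# GCS numbers from above: ~100ms wait, ~10ms decode -> raw multiplier 11x,
# clamped to 10x by rate limits and per-thread memory.
optimal_io_threads(4, wait_ms=100, compute_ms=10)  # -> 40
```

The raw formula actually yields an 11x multiplier for the 10:1 wait/compute ratio; the upper bounds listed below are why the recommendation stops at 4-10x.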

With threads = CPUs (suboptimal):

Thread 1: [WAIT GCS ~~~~~~~~][decode][WAIT GCS ~~~~~~~~][decode]
Thread 2: [WAIT GCS ~~~~~~~~][decode][WAIT GCS ~~~~~~~~][decode]
Thread 3: [WAIT GCS ~~~~~~~~][decode][WAIT GCS ~~~~~~~~][decode]
Thread 4: [WAIT GCS ~~~~~~~~][decode][WAIT GCS ~~~~~~~~][decode]
CPU Cores: [idle~~~][busy][idle~~~~~~~~~~~][busy][idle~~~]
Average CPU utilization: ~15-25% ← CPUs mostly idle!

With threads = CPUs × 4 (optimal):

Threads 1-4: [WAIT][decode][WAIT][decode]...
Threads 5-8: [WAIT][decode][WAIT][decode]...
Threads 9-12: [WAIT][decode][WAIT][decode]...
Threads 13-16: [WAIT][decode][WAIT][decode]...
CPU Cores: Always have work from threads finishing I/O
Average CPU utilization: ~60-80% ← Much better!

| Pod vCPUs | max_num_threads | Rationale |
|-----------|-----------------|-----------|
| 1 | 8 | 8x multiplier for I/O bound |
| 2 | 12-16 | Good parallelism |
| 4 | 16-24 | Matches KuzuDB benchmarks |
| 8 | 24-32 | Diminishing returns beyond this |

Upper bounds:

  • GCS rate limits (~5,000 reads/sec per bucket)
  • Memory per thread (~10-50MB)
  • Kernel scheduling overhead

| Phase | Buffer Pool Usage |
|-------|-------------------|
| COPY FROM | Caches pages being written, handles overflow |
| Idle | Minimal usage |
| Queries | Caches frequently accessed data pages |
| Algorithms | Not used (NetworkX uses the Python heap) |

Formula:

```python
def optimal_buffer_pool(pod_memory_limit_bytes: int) -> int:
    """Calculate optimal buffer pool size.

    Guidelines:
    - Min 512MB (KuzuDB minimum effective)
    - Max 2GB (diminishing returns beyond this for typical workloads)
    - ~25% of pod limit (leave room for algorithms)
    """
    min_buffer = 512 * 1024 * 1024        # 512MB
    max_buffer = 2 * 1024 * 1024 * 1024   # 2GB
    target_ratio = 0.25                   # 25% of limit
    calculated = int(pod_memory_limit_bytes * target_ratio)
    return max(min_buffer, min(calculated, max_buffer))
```

Examples:

| Pod Memory Limit | Buffer Pool | Rationale |
|------------------|-------------|-----------|
| 4 GB | 1 GB | Leave 3GB for Python/algorithms |
| 6 GB | 1.5 GB | Balanced |
| 8 GB | 2 GB | Capped at 2GB |
| 16 GB | 2 GB | Diminishing returns beyond 2GB |
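These table values follow directly from the 25% clamp rule; a quick self-contained check (restating the formula above in condensed form):

```python
# Condensed restatement of the clamp rule so this check is self-contained.
GIB = 1024**3

def buffer_pool(limit_bytes: int) -> int:
    return max(512 * 1024**2, min(int(limit_bytes * 0.25), 2 * GIB))

assert buffer_pool(4 * GIB) == 1 * GIB      # 4 GB pod  -> 1 GB
assert buffer_pool(6 * GIB) == 1610612736   # 6 GB pod  -> 1.5 GB
assert buffer_pool(8 * GIB) == 2 * GIB      # 8 GB pod  -> 2 GB cap
assert buffer_pool(16 * GIB) == 2 * GIB     # 16 GB pod -> still capped
```

Note that the 6 GB row's 1.5 GB is exactly the `1610612736` used for RYUGRAPH_BUFFER_POOL_SIZE in the Strategy A configuration below.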

From benchmarks:

  • Buffer pool has diminishing returns for COPY FROM (I/O bound)
  • Query caching benefits plateau around 2GB for typical graph sizes
  • Larger buffer pool = less memory for NetworkX algorithms
  • Disk spilling handles overflow efficiently

CAN be parallelized (within single COPY FROM):

  • Multiple READER threads fetch files concurrently
  • Parquet decoding happens in parallel
  • GCS requests are concurrent

CANNOT be parallelized (between COPY FROM statements):

  • Each COPY FROM is a write transaction
  • KuzuDB serializes write transactions
  • COPY Customer → COPY Product → COPY PURCHASED must be sequential

```python
# Sequential approach (correct)
for node_def in node_definitions:
    # Internal parallelism: multiple threads read Parquet files
    conn.execute(f"COPY {node_def.label} FROM 'gs://.../*.parquet'")

for edge_def in edge_definitions:
    # Internal parallelism: multiple threads read Parquet files
    conn.execute(f"COPY {edge_def.type} FROM 'gs://.../*.parquet'")
```

Application-level parallelism does NOT help because:

  1. Single connection is not thread-safe
  2. Multiple connections still serialize writes internally
  3. The only parallelism is WITHIN each COPY FROM

Memory Usage Over Pod Lifecycle

8GB ┤                              ╭──────╮  Algorithm
    │                              │      │  Execution
6GB ┤                         ╭────╯      ╰────╮
    │        ╭────────╮       │                │
4GB ┤   ╭────╯        ╰───────╯                ╰───╮
    │   │     COPY FROM                            │
2GB ┤───╯                                          ╰────
    │  Startup        Idle         Query       Idle
0GB ┼────────────────────────────────────────────────────
                         Time →

| Phase | Memory Profile | Duration |
|-------|----------------|----------|
| Startup | ~500MB (Python, libs) | 10-30s |
| COPY FROM | 1-3GB (buffer pool active) | 30s-5min |
| Idle/Queries | 1-2GB (buffer pool + overhead) | Majority of lifetime |
| Algorithm | 3-8GB spike (NetworkX in heap) | 10s-10min |

From Kubernetes Pod QoS:

| QoS Class | Condition | Eviction Priority |
|-----------|-----------|-------------------|
| Guaranteed | requests == limits | Last (highest priority) |
| Burstable | requests < limits | Middle |
| BestEffort | No requests/limits | First (lowest priority) |
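The class assignment can be sketched as a small decision function (simplified: the real kubelet logic examines every container and both cpu and memory resources):

```python
# Simplified QoS classification: Guaranteed requires requests == limits,
# BestEffort means nothing is set, everything else is Burstable.
def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

qos_class({"memory": "4Gi"}, {"memory": "4Gi"})  # -> "Guaranteed"
qos_class({"memory": "2Gi"}, {"memory": "6Gi"})  # -> "Burstable"
```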

Strategy A: Guaranteed (Production-Critical)

```yaml
resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"  # Same = Guaranteed QoS
    cpu: "2000m"
env:
  - name: RYUGRAPH_BUFFER_POOL_SIZE
    value: "1610612736"  # 1.5GB
  - name: RYUGRAPH_MAX_THREADS
    value: "8"
```

| Aspect | Value |
|--------|-------|
| QoS | Guaranteed |
| Pods per 32GB node | ~7 |
| OOM Risk | Low |
| Cost | Higher |
Strategy B: Burstable

```yaml
resources:
  requests:
    memory: "2Gi"
    cpu: "500m"
  limits:
    memory: "6Gi"
    cpu: "2000m"
env:
  - name: RYUGRAPH_BUFFER_POOL_SIZE
    value: "1073741824"  # 1GB
  - name: RYUGRAPH_MAX_THREADS
    value: "16"
```

| Aspect | Value |
|--------|-------|
| QoS | Burstable |
| Pods per 32GB node | ~12-14 |
| OOM Risk | Medium |
| Cost | Moderate |
Strategy C: Dedicated Node Pool (Recommended)
```yaml
resources:
  requests:
    memory: "3Gi"
    cpu: "1000m"
  limits:
    memory: "8Gi"
    cpu: "2000m"
env:
  - name: RYUGRAPH_BUFFER_POOL_SIZE
    value: "2147483648"  # 2GB
  - name: RYUGRAPH_MAX_THREADS
    value: "16"
```

With dedicated node pool (n2-highmem-4, 32GB):

| Aspect | Value |
|--------|-------|
| QoS | Burstable (isolated) |
| Pods per node | 3-4 |
| OOM Risk | Very Low |
| Cost | Optimal |
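The pods-per-node figures are simple bin-packing against a 32GB node. Two assumptions here: roughly 28GB is allocatable after system and kubelet reservations, and Strategy C is planned against its 8Gi limit rather than its 3Gi request so that every pod can burst safely:

```python
# Rough bin-packing behind the pods-per-node numbers. The 28GB allocatable
# value (32GB node minus system/kubelet reservations) is an assumption.
ALLOCATABLE_GB = 28

pods_a = ALLOCATABLE_GB // 4  # Strategy A schedules on its 4Gi request -> 7
pods_b = ALLOCATABLE_GB // 2  # Strategy B schedules on its 2Gi request -> 14
pods_c = ALLOCATABLE_GB // 8  # Strategy C planned against its 8Gi limit -> 3
```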

| Machine | vCPU | Memory | Memory/vCPU | Recommendation |
|---------|------|--------|-------------|----------------|
| e2-standard-4 | 4 | 16 GB | 4 GB/vCPU | Budget |
| e2-highmem-4 | 4 | 32 GB | 8 GB/vCPU | Good |
| n2-highmem-4 | 4 | 32 GB | 8 GB/vCPU | Recommended |
| n2-highmem-8 | 8 | 64 GB | 8 GB/vCPU | Large clusters |

n2-highmem-4 provides the best balance of memory capacity and cost for graph workloads.

```hcl
# Terraform example
resource "google_container_node_pool" "graph_instances" {
  name    = "graph-instances"
  cluster = google_container_cluster.primary.name

  node_config {
    machine_type = "n2-highmem-4"

    taint {
      key    = "workload"
      value  = "graph-instance"
      effect = "NO_SCHEDULE"
    }

    labels = {
      "workload-type" = "graph-instance"
    }
  }

  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }
}
```
```yaml
spec:
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "graph-instance"
      effect: "NoSchedule"
  nodeSelector:
    workload-type: graph-instance
```

```yaml
env:
  - name: RYUGRAPH_BUFFER_POOL_SIZE
    value: "2147483648"  # 2GB
  - name: RYUGRAPH_MAX_THREADS
    value: "16"  # 4x vCPU for I/O-bound GCS
```

Canonical shipped defaults (from packages/control-plane/src/control_plane/services/wrapper_factory.py, the authoritative spawn-time source):

```yaml
resources:
  requests:
    memory: "4Gi"  # ryugraph_memory_request default
    cpu: "2"       # ryugraph_cpu_request default
  limits:
    memory: "8Gi"  # ryugraph_memory_limit default
    cpu: "4"       # ryugraph_cpu_limit default
```

These are the values used by WrapperFactory when spawning Ryugraph wrapper pods in production. The earlier “Strategy C (3Gi/8Gi)” reasoning in the Strategy comparison above is kept for context on how the sizing was derived, but the actually-shipped request is 4Gi, not 3Gi. The ryugraph-wrapper package has no Helm chart — pods are created imperatively by K8sService.create_wrapper_pod, so wrapper_factory.py is the single source of truth.

┌──────────────────────────────────────────────┐
│            8 GB Pod Memory Limit             │
├──────────────────────────────────────────────┤
│  Python + FastAPI + libs        ~500 MB      │
│  Ryugraph overhead              ~200 MB      │
│  Buffer pool                   2,048 MB      │
│  ──────────────────────────────────────      │
│  Available for algorithms     ~5,250 MB      │
└──────────────────────────────────────────────┘
  • Machine type: n2-highmem-4 (32GB RAM, 4 vCPU)
  • Pods per node: 3-4 (safe burst capacity)
  • Autoscaling: 0-10 nodes
  • Taints: Isolate graph workloads

| Configuration | COPY FROM | Queries | Algorithms | Stability |
|---------------|-----------|---------|------------|-----------|
| Current (512Mi/4Gi, 2GB buffer) | Good | Good | OOM risk | Poor |
| Strategy A (4Gi/4Gi, 1.5GB) | Good | Good | Limited | Excellent |
| Strategy B (2Gi/6Gi, 1GB) | Good | Moderate | Good | Good |
| Strategy C (3Gi/8Gi, 2GB) | Excellent | Good | Excellent | Excellent |