Skip to content

Query Optimization

Tutorial

Query Optimization

Query planning, indexing, and performance

45 min Advanced
PerformanceOptimizationIndexing

What You'll Learn

  • Query Planning - EXPLAIN and PROFILE
  • Indexing - Speed up lookups
  • Query Patterns - Efficient patterns
  • Common Pitfalls - Avoid performance traps
# Cell 1 — Parameters
USERNAME = "_FILL_ME_IN_" # Set your email before running
# Cell 2 — Connect
from graph_olap import GraphOLAPClient
client = GraphOLAPClient(username=USERNAME)
# Cell 3 — Provision
from notebook_setup import provision
personas, conn = provision(USERNAME)
analyst = personas["analyst"]
admin = personas["admin"]
ops = personas["ops"]
client = analyst
print(f"Connected | {conn.query_scalar('MATCH (n) RETURN count(n)')} nodes")
Info:
Prerequisites
Complete G1_property_graphs before this tutorial.

Cypher provides two tools for understanding how the query engine executes your query:

  • EXPLAIN — shows the planned execution steps without actually running the query. Use it to check whether the planner chose a good strategy before executing.
  • PROFILE — runs the query AND shows the execution plan with actual row counts and timing for each step. Use it to find bottlenecks in slow queries.

Reading the output:

The plan is a tree of operations. Each operation shows:

  • Operator — the type of step (e.g. NodeByLabelScan, Filter, ProduceResults).
  • Rows — how many rows flowed through that step (PROFILE only).
  • DB Hits — how many low-level storage lookups were performed (fewer is better).

When to use which:

ToolRuns query?Shows actual stats?Use case
EXPLAINNoNoQuick check before running expensive queries
PROFILEYesYesDiagnosing slow queries after the fact
# --- EXPLAIN: view the execution plan without running the query ---
result = conn.query("EXPLAIN MATCH (n:Person)-[r]->(m) RETURN n.name, type(r), m.name")
result.show()
# --- PROFILE: run the query and see actual execution statistics ---
result = conn.query("PROFILE MATCH (n:Person)-[r]->(m) RETURN n.name, type(r), m.name")
result.show()
# --- Compare: PROFILE a query with and without a label filter ---
# Without label: scans all nodes
result = conn.query("PROFILE MATCH (n) WHERE n.name IS NOT NULL RETURN n.name LIMIT 5")
result.show()
# With label: scans only Person nodes (fewer DB hits)
result = conn.query("PROFILE MATCH (n:Person) WHERE n.name IS NOT NULL RETURN n.name LIMIT 5")
result.show()

Indexes speed up property lookups from O(n) scans to O(log n) or O(1) lookups.

When to create an index:

  • A property is used in WHERE equality or range filters.
  • A property is used in MERGE for finding existing nodes.
  • A query is slow and PROFILE shows NodeByLabelScan + Filter instead of NodeIndexSeek.

Index types:

TypeSyntaxUse case
Range indexCREATE INDEX FOR (n:Label) ON (n.property)Equality, range, prefix lookups
CompositeCREATE INDEX FOR (n:Label) ON (n.prop1, n.prop2)Multi-property filters
Full-textCREATE FULLTEXT INDEX ...Text search with tokenization

Rules of thumb:

  • Index properties that appear in WHERE clauses.
  • Do not over-index — each index adds write overhead.
  • Composite indexes help when you always filter on multiple properties together.
  • After creating an index, re-run PROFILE to confirm the planner uses it.
# --- List existing indexes ---
df = conn.query_df("CALL db.indexes()")
df
# --- Create an index on Person.name (if one does not already exist) ---
conn.query("CREATE INDEX IF NOT EXISTS FOR (p:Person) ON (p.name)")
print("Index on Person.name ensured.")
# --- PROFILE after indexing: the planner should now use NodeIndexSeek ---
result = conn.query("PROFILE MATCH (p:Person) WHERE p.name = 'Alice' RETURN p")
result.show()

Writing efficient Cypher is about helping the query planner reduce work early.

Tip 1: Always specify labels.

-- Slow: scans every node
MATCH (n) WHERE n.name = 'Alice' RETURN n
-- Fast: scans only Person nodes
MATCH (n:Person) WHERE n.name = 'Alice' RETURN n

Tip 2: Filter early with WITH + WHERE.

Push filters as close to the MATCH as possible so that fewer rows flow downstream:

MATCH (p:Person)-[r]->(m)
WITH p, count(r) AS rels
WHERE rels > 3
RETURN p.name, rels

Tip 3: Use LIMIT early.

If you only need the top N results, add LIMIT as soon as possible:

MATCH (p:Person)
WITH p ORDER BY p.name LIMIT 10
MATCH (p)-[r]->(m)
RETURN p.name, count(r) AS rels

Tip 4: Use count { } instead of expanding + counting.

-- Instead of expanding all relationships then counting:
MATCH (p:Person)-[r]->() RETURN p.name, count(r)
-- Use a count subquery (avoids materializing every relationship):
MATCH (p:Person) RETURN p.name, count { (p)-[]->() } AS rels
# --- Tip 1: Label filtering reduces scanned nodes ---
# Count nodes with vs without label
total = conn.query_scalar("MATCH (n) RETURN count(n)")
persons = conn.query_scalar("MATCH (n:Person) RETURN count(n)")
print(f"All nodes: {total} | Person nodes: {persons}")
print(f"Label filter skips {total - persons} irrelevant nodes.")
# --- Tip 2 + 3: Filter early and LIMIT early ---
df = conn.query_df("""
MATCH (n)
WITH n ORDER BY n.name LIMIT 5
OPTIONAL MATCH (n)-[r]->()
RETURN n.name AS node, count(r) AS outgoing_rels
""")
df

Certain query patterns can cause dramatic slowdowns. Learn to recognize and avoid them.

Pitfall 1: Cartesian products.

When two MATCH clauses share no variables, every row from the first is paired with every row from the second — an N x M explosion:

-- DANGER: this creates |Person| x |Company| rows!
MATCH (p:Person), (c:Company)
RETURN p.name, c.name

Fix: always connect patterns with a relationship or use WITH to pipeline.

Pitfall 2: Unbounded variable-length paths.

-- DANGER: [*] explores every possible path in the graph!
MATCH (a)-[*]->(b) RETURN a, b

Fix: always set an upper bound — [*..5] or [*1..3].

Pitfall 3: Missing labels on relationship endpoints.

-- Slow: both a and b scan all nodes
MATCH (a)-[:WORKS_AT]->(b) RETURN a.name, b.name
-- Fast: labels help the planner start from a smaller set
MATCH (a:Person)-[:WORKS_AT]->(b:Company) RETURN a.name, b.name

Pitfall 4: Collecting then filtering (instead of filtering then collecting).

-- Slow: collects all names, then filters
MATCH (p:Person)
WITH COLLECT(p.name) AS names
RETURN [n IN names WHERE n STARTS WITH 'A']
-- Fast: filter first, then collect
MATCH (p:Person)
WHERE p.name STARTS WITH 'A'
RETURN COLLECT(p.name)
# --- Pitfall demo: Cartesian product (use with caution on large graphs) ---
# See how the row count explodes when MATCH clauses share no variables
result = conn.query("""
PROFILE
MATCH (a:Person), (b:Person)
WHERE a <> b
RETURN count(*) AS cartesian_pairs
""")
result.show()
# --- Safe alternative: bounded variable-length path ---
df = conn.query_df("""
MATCH (a:Person)
WITH a LIMIT 1
MATCH path = (a)-[*1..3]-(b:Person)
WHERE a <> b
RETURN
a.name AS start,
b.name AS end,
length(path) AS hops
LIMIT 10
""")
df
# --- Best practice: filter first, then collect (not the other way around) ---
df = conn.query_df("""
MATCH (n)
WHERE n.name IS NOT NULL AND n.name STARTS WITH 'A'
RETURN labels(n)[0] AS label, COLLECT(n.name) AS a_names
""")
df

Key Takeaways

  • PROFILE shows actual execution statistics
  • Index properties used in WHERE clauses
  • Start MATCH with most selective pattern
  • Avoid Cartesian products and unbounded paths