Query Optimization
Query Optimization
Query planning, indexing, and performance
What You'll Learn
- Query Planning - EXPLAIN and PROFILE
- Indexing - Speed up lookups
- Query Patterns - Efficient patterns
- Common Pitfalls - Avoid performance traps
# Cell 1 — ParametersUSERNAME = "_FILL_ME_IN_" # Set your email before running# Cell 2 — Connectfrom graph_olap import GraphOLAPClientclient = GraphOLAPClient(username=USERNAME)
# Cell 3 — Provisionfrom notebook_setup import provisionpersonas, conn = provision(USERNAME)analyst = personas["analyst"]admin = personas["admin"]ops = personas["ops"]client = analyst
print(f"Connected | {conn.query_scalar('MATCH (n) RETURN count(n)')} nodes")Cypher provides two tools for understanding how the query engine executes your query:
EXPLAIN— shows the planned execution steps without actually running the query. Use it to check whether the planner chose a good strategy before executing.PROFILE— runs the query AND shows the execution plan with actual row counts and timing for each step. Use it to find bottlenecks in slow queries.
Reading the output:
The plan is a tree of operations. Each operation shows:
- Operator — the type of step (e.g.
NodeByLabelScan,Filter,ProduceResults). - Rows — how many rows flowed through that step (PROFILE only).
- DB Hits — how many low-level storage lookups were performed (fewer is better).
When to use which:
| Tool | Runs query? | Shows actual stats? | Use case |
|---|---|---|---|
EXPLAIN | No | No | Quick check before running expensive queries |
PROFILE | Yes | Yes | Diagnosing slow queries after the fact |
# --- EXPLAIN: view the execution plan without running the query ---result = conn.query("EXPLAIN MATCH (n:Person)-[r]->(m) RETURN n.name, type(r), m.name")result.show()# --- PROFILE: run the query and see actual execution statistics ---result = conn.query("PROFILE MATCH (n:Person)-[r]->(m) RETURN n.name, type(r), m.name")result.show()# --- Compare: PROFILE a query with and without a label filter ---# Without label: scans all nodesresult = conn.query("PROFILE MATCH (n) WHERE n.name IS NOT NULL RETURN n.name LIMIT 5")result.show()# With label: scans only Person nodes (fewer DB hits)result = conn.query("PROFILE MATCH (n:Person) WHERE n.name IS NOT NULL RETURN n.name LIMIT 5")result.show()Indexes speed up property lookups from O(n) scans to O(log n) or O(1) lookups.
When to create an index:
- A property is used in
WHEREequality or range filters. - A property is used in
MERGEfor finding existing nodes. - A query is slow and PROFILE shows
NodeByLabelScan+Filterinstead ofNodeIndexSeek.
Index types:
| Type | Syntax | Use case |
|---|---|---|
| Range index | CREATE INDEX FOR (n:Label) ON (n.property) | Equality, range, prefix lookups |
| Composite | CREATE INDEX FOR (n:Label) ON (n.prop1, n.prop2) | Multi-property filters |
| Full-text | CREATE FULLTEXT INDEX ... | Text search with tokenization |
Rules of thumb:
- Index properties that appear in
WHEREclauses. - Do not over-index — each index adds write overhead.
- Composite indexes help when you always filter on multiple properties together.
- After creating an index, re-run
PROFILEto confirm the planner uses it.
# --- List existing indexes ---df = conn.query_df("CALL db.indexes()")df# --- Create an index on Person.name (if one does not already exist) ---conn.query("CREATE INDEX IF NOT EXISTS FOR (p:Person) ON (p.name)")print("Index on Person.name ensured.")# --- PROFILE after indexing: the planner should now use NodeIndexSeek ---result = conn.query("PROFILE MATCH (p:Person) WHERE p.name = 'Alice' RETURN p")result.show()Writing efficient Cypher is about helping the query planner reduce work early.
Tip 1: Always specify labels.
-- Slow: scans every nodeMATCH (n) WHERE n.name = 'Alice' RETURN n
-- Fast: scans only Person nodesMATCH (n:Person) WHERE n.name = 'Alice' RETURN nTip 2: Filter early with WITH + WHERE.
Push filters as close to the MATCH as possible so that fewer rows flow downstream:
MATCH (p:Person)-[r]->(m)WITH p, count(r) AS relsWHERE rels > 3RETURN p.name, relsTip 3: Use LIMIT early.
If you only need the top N results, add LIMIT as soon as possible:
MATCH (p:Person)WITH p ORDER BY p.name LIMIT 10MATCH (p)-[r]->(m)RETURN p.name, count(r) AS relsTip 4: Use count { } instead of expanding + counting.
-- Instead of expanding all relationships then counting:MATCH (p:Person)-[r]->() RETURN p.name, count(r)
-- Use a count subquery (avoids materializing every relationship):MATCH (p:Person) RETURN p.name, count { (p)-[]->() } AS rels# --- Tip 1: Label filtering reduces scanned nodes ---# Count nodes with vs without labeltotal = conn.query_scalar("MATCH (n) RETURN count(n)")persons = conn.query_scalar("MATCH (n:Person) RETURN count(n)")print(f"All nodes: {total} | Person nodes: {persons}")print(f"Label filter skips {total - persons} irrelevant nodes.")# --- Tip 2 + 3: Filter early and LIMIT early ---df = conn.query_df(""" MATCH (n) WITH n ORDER BY n.name LIMIT 5 OPTIONAL MATCH (n)-[r]->() RETURN n.name AS node, count(r) AS outgoing_rels""")dfCertain query patterns can cause dramatic slowdowns. Learn to recognize and avoid them.
Pitfall 1: Cartesian products.
When two MATCH clauses share no variables, every row from the first is paired with
every row from the second — an N x M explosion:
-- DANGER: this creates |Person| x |Company| rows!MATCH (p:Person), (c:Company)RETURN p.name, c.nameFix: always connect patterns with a relationship or use WITH to pipeline.
Pitfall 2: Unbounded variable-length paths.
-- DANGER: [*] explores every possible path in the graph!MATCH (a)-[*]->(b) RETURN a, bFix: always set an upper bound — [*..5] or [*1..3].
Pitfall 3: Missing labels on relationship endpoints.
-- Slow: both a and b scan all nodesMATCH (a)-[:WORKS_AT]->(b) RETURN a.name, b.name
-- Fast: labels help the planner start from a smaller setMATCH (a:Person)-[:WORKS_AT]->(b:Company) RETURN a.name, b.namePitfall 4: Collecting then filtering (instead of filtering then collecting).
-- Slow: collects all names, then filtersMATCH (p:Person)WITH COLLECT(p.name) AS namesRETURN [n IN names WHERE n STARTS WITH 'A']
-- Fast: filter first, then collectMATCH (p:Person)WHERE p.name STARTS WITH 'A'RETURN COLLECT(p.name)# --- Pitfall demo: Cartesian product (use with caution on large graphs) ---# See how the row count explodes when MATCH clauses share no variablesresult = conn.query(""" PROFILE MATCH (a:Person), (b:Person) WHERE a <> b RETURN count(*) AS cartesian_pairs""")result.show()# --- Safe alternative: bounded variable-length path ---df = conn.query_df(""" MATCH (a:Person) WITH a LIMIT 1 MATCH path = (a)-[*1..3]-(b:Person) WHERE a <> b RETURN a.name AS start, b.name AS end, length(path) AS hops LIMIT 10""")df# --- Best practice: filter first, then collect (not the other way around) ---df = conn.query_df(""" MATCH (n) WHERE n.name IS NOT NULL AND n.name STARTS WITH 'A' RETURN labels(n)[0] AS label, COLLECT(n.name) AS a_names""")dfKey Takeaways
- PROFILE shows actual execution statistics
- Index properties used in WHERE clauses
- Start MATCH with most selective pattern
- Avoid Cartesian products and unbounded paths