Rijmwoordenboek: Solving the 3-Second Phonetic Search Problem
About the Author
Rob Schoenaker
Managing Partner at UpstreamAds and Partner at Ludulicious B.V. with over 20 years of experience in software development, specializing in .NET Core, ServiceStack, C# and database design.
The Problem: Phonetic Search Too Slow for Real-Time Use
In 2020, Van Dale Rijmwoordenboek faced a critical performance bottleneck. Users searching for rhyming words were waiting 3.2 seconds for results. For a rhyming dictionary, this was completely unacceptable.
The Challenge:
- 200,000+ Dutch words with phonetic codes
- Complex phonetic matching algorithms
- Users expecting instant rhyming suggestions
- Similarity calculations killing performance
The Numbers:
```sql
-- This query was taking 3.2+ seconds
SELECT w1.word, w2.word,
       similarity(w1.phonetic_code, w2.phonetic_code) AS sim
FROM words w1, words w2
WHERE w1.id < w2.id
  AND similarity(w1.phonetic_code, w2.phonetic_code) > 0.8
ORDER BY sim DESC
LIMIT 20;
```
The Root Cause: Missing Phonetic Indexes
The problem was clear from the execution plan:
```sql
EXPLAIN ANALYZE
SELECT w1.word, w2.word,
       similarity(w1.phonetic_code, w2.phonetic_code) AS sim
FROM words w1, words w2
WHERE similarity(w1.phonetic_code, w2.phonetic_code) > 0.8;

-- Result: Nested Loop (cost=0.00..5000000.00 rows=1000000 width=32)
-- Execution time: 3200.456 ms
```
What was happening:
- PostgreSQL was doing a nested loop join on 200,000+ records
- No indexes existed for phonetic similarity operations
- Every similarity calculation was computed from scratch
- Cartesian product of words was killing performance
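To get a feel for the scale, here is a back-of-the-envelope sketch (in Python, purely illustrative) of how many similarity() calls the self-join implies:

```python
def pairwise_comparisons(n: int) -> int:
    """Number of similarity() calls implied by a self-join with w1.id < w2.id."""
    return n * (n - 1) // 2

# For a corpus of roughly 200,000 words:
print(pairwise_comparisons(200_000))  # ~20 billion comparisons
```

At roughly 20 billion comparisons, no amount of per-call tuning would have saved this query; the shape of the join had to change.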
The Solution: Multi-Layered Phonetic Indexing
Step 1: Create B-tree Index for Exact Phonetic Matches
The first breakthrough came with a proper phonetic index:
```sql
-- Custom phonetic index for exact matches
CREATE INDEX CONCURRENTLY idx_words_phonetic_btree
ON words USING btree (phonetic_code, frequency DESC);
```
Why This Works:
- `btree (phonetic_code, frequency DESC)`: B-tree indexes provide fast exact matches and efficient range scans
- Ordering by `frequency DESC` ensures the most common words appear first
- Enables fast lookups for identical phonetic codes
- `CONCURRENTLY` allows index creation without blocking writes
Immediate Result: Exact phonetic matches dropped from 3.2 seconds to 1.8 seconds (1.8x improvement)
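The effect of the composite index can be sketched in a few lines of Python: a hypothetical in-memory model where entries kept sorted by (phonetic_code, frequency DESC) turn an exact-match lookup into a binary search plus a short ordered scan. All words and frequencies here are invented for illustration:

```python
import bisect

# Invented sample entries, sorted the way the B-tree orders them:
# phonetic_code ascending, frequency descending within a code.
index = sorted(
    [("KAT", 1000, "kat"), ("KAT", 800, "mat"), ("ROS", 500, "roos")],
    key=lambda e: (e[0], -e[1]),
)
keys = [e[0] for e in index]

def exact_matches(code: str) -> list[str]:
    """Words sharing a phonetic code, most frequent first (binary search + scan)."""
    lo = bisect.bisect_left(keys, code)
    hi = bisect.bisect_right(keys, code)
    return [word for _, _, word in index[lo:hi]]

print(exact_matches("KAT"))  # ['kat', 'mat']
```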
Step 2: Add Trigram Index for Fuzzy Matching
For phonetic variations and typos, we added trigram support:
```sql
-- Trigram index for fuzzy matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_words_trgm
ON words USING gin (word gin_trgm_ops);
```
Why This Works:
- `gin (word gin_trgm_ops)`: GIN indexes with trigram operators enable fast fuzzy text matching
- Trigram indexes work by breaking words into 3-character substrings
- Makes searches resilient to typos and phonetic variations
- Perfect for finding words with similar phonetic patterns
Result: Fuzzy phonetic searches improved to 1.2 seconds (2.7x improvement)
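For intuition, here is a minimal Python re-implementation of pg_trgm-style matching. This is a simplified sketch, not the extension's exact behavior (the real pg_trgm also strips non-alphanumeric characters before slicing):

```python
def trigrams(word: str) -> set[str]:
    # pg_trgm lowercases, then pads with two spaces in front and one behind
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams, roughly as pg_trgm does."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigrams("kat"))            # {'  k', ' ka', 'kat', 'at '}
print(similarity("kat", "katje"))
```

Because the padding trigrams survive small edits, a typo changes only a few of a word's trigrams, which is exactly what makes the index resilient to misspellings.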
Step 3: Create Partial Index for Common Words
Most queries focused on common words, so we created a partial index:
```sql
-- Partial index for common words (massive performance win)
CREATE INDEX CONCURRENTLY idx_words_common
ON words (phonetic_code)
WHERE frequency > 1000;
```
Why This Works:
- `WHERE frequency > 1000`: only indexes common words, dramatically reducing index size
- Covers 80% of queries while using only 20% of the index space
- Faster index scans and better cache utilization
- Most rhyming searches focus on common words anyway
Result: Common word searches dropped to 600ms (5.3x improvement)
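The partial-index idea can be sketched in Python: only rows matching the predicate are materialized in the lookup structure, and everything else is left to the base table. The sample words and frequencies are invented:

```python
# Invented sample rows: (phonetic_code, word, frequency)
words = [("KAT", "kat", 5000), ("KAT", "kater", 40), ("ROS", "roos", 2000)]

partial_index: dict[str, list[str]] = {}
for code, word, freq in words:
    if freq > 1000:  # the partial index's WHERE clause
        partial_index.setdefault(code, []).append(word)

print(partial_index)  # only the common words get index entries
```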
The Game Changer: PostgreSQL 13 B-tree Deduplication
The Problem: Bloated Indexes with Duplicate Keys
In 2021, we discovered our indexes were bloated with duplicate phonetic codes:

```sql
-- Before PostgreSQL 13: each duplicate stored separately
CREATE INDEX idx_words_old
ON words (phonetic_code, frequency, word);
-- Result: 450MB index with lots of duplicates
```
The Solution: Automatic Deduplication
The automatic B-tree deduplication introduced in PostgreSQL 13 solved this:

```sql
-- PostgreSQL 13+: automatic deduplication
CREATE INDEX CONCURRENTLY idx_words_dedup
ON words (phonetic_code, frequency, word);
-- Duplicate key values automatically deduplicated
-- Result: 180MB index (60% reduction!)
```
Why This Works:
- B-tree indexes traditionally store each duplicate key as a separate entry
- PostgreSQL 13+ recognizes when multiple rows have identical index key values
- Instead of storing an identical key like ("KAT", 1000, "kat") once per matching row, it stores the key once with a list of pointers to all the matching rows
- Dramatically reduces index size when you have many duplicate values (common in phonetic codes)
Result: Index size reduced by 60%, query performance improved to 200ms (16x improvement)
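Conceptually, deduplication replaces one index entry per row with one entry per distinct key plus a batched list of row pointers (TIDs). A toy Python sketch with invented data:

```python
# Invented index entries: (key, row pointer)
entries = [("KAT", 1), ("KAT", 2), ("KAT", 3), ("ROS", 4)]

# Pre-13 layout: one (key, pointer) tuple stored per row
naive = list(entries)

# Deduplicated layout: each key stored once, pointers batched
dedup: dict[str, list[int]] = {}
for key, tid in entries:
    dedup.setdefault(key, []).append(tid)

print(len(naive), "entries vs", len(dedup), "keys")  # 4 entries vs 2 keys
```

The savings grow with the number of duplicates per key, which is why phonetic codes (many words per code) benefit so much.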
The Final Optimization: Query Rewriting Strategy
The Problem: Inefficient Similarity Calculations
The original query was still doing expensive similarity calculations:
```sql
-- Original query (inefficient)
SELECT w1.word, w2.word,
       similarity(w1.phonetic_code, w2.phonetic_code) AS sim
FROM words w1, words w2
WHERE w1.id < w2.id
  AND similarity(w1.phonetic_code, w2.phonetic_code) > 0.8
ORDER BY sim DESC;
```
The Solution: Smart Query Rewriting
We rewrote the query to leverage our indexes:
```sql
-- Rewritten query (much faster)
WITH phonetic_groups AS (
    SELECT phonetic_code,
           array_agg(word ORDER BY frequency DESC) AS words
    FROM words
    WHERE phonetic_code IS NOT NULL
    GROUP BY phonetic_code
    HAVING count(*) > 1
)
SELECT words[1] AS word1,
       words[2] AS word2,
       1.0 AS similarity
FROM phonetic_groups
WHERE array_length(words, 1) >= 2;
```
Why This Works:
- Groups words by identical phonetic codes first
- Uses array operations instead of expensive similarity calculations
- Leverages our phonetic indexes for fast grouping
- Eliminates the need for cross-joining all words
Result: Query time dropped to 85ms (37x improvement from original)
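The rewrite in miniature, as a Python sketch with invented data: group words by identical phonetic code, then emit a rhyme pair from each group, with no similarity() call anywhere:

```python
from collections import defaultdict

# Invented sample rows: (word, phonetic_code, frequency)
words = [("kat", "KAT", 5000), ("mat", "KAT", 3000), ("roos", "ROS", 2000)]

groups: dict[str, list[str]] = defaultdict(list)
for word, code, freq in sorted(words, key=lambda w: -w[2]):
    groups[code].append(word)  # ordered by frequency, like array_agg(... ORDER BY)

# One pair per group of identical phonetic codes; similarity is 1.0 by construction
pairs = [(ws[0], ws[1]) for ws in groups.values() if len(ws) >= 2]
print(pairs)  # [('kat', 'mat')]
```

Because words sharing a phonetic code rhyme by definition, the expensive fuzzy comparison degenerates into a GROUP BY that the indexes can serve directly.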
Performance Results Summary
| Optimization Step | Query Time | Improvement |
|---|---|---|
| Original (No Indexes) | 3,200ms | Baseline |
| B-tree Phonetic Index | 1,800ms | 1.8x faster |
| Trigram Fuzzy Index | 1,200ms | 2.7x faster |
| Partial Index (Common Words) | 600ms | 5.3x faster |
| B-tree Deduplication | 200ms | 16x faster |
| Query Rewriting | 85ms | 37x faster |
Key Lessons Learned
1. Phonetic Data Requires Specialized Indexes
- Regular indexes don't work for phonetic similarity
- B-tree indexes excel at exact phonetic matches
- GIN indexes with trigram operators handle fuzzy matching
2. Partial Indexes Are Powerful for Filtered Data
- Only index data you actually query
- Dramatically reduces index size and improves performance
- Perfect for frequency-based filtering
3. B-tree Deduplication Saves Massive Space
- PostgreSQL 13+ automatically deduplicates identical keys
- Essential for data with many duplicate values
- Improves cache utilization and query performance
4. Query Rewriting Can Eliminate Expensive Operations
- Sometimes the best optimization is changing the approach
- Grouping operations can replace expensive similarity calculations
- Leverage indexes to avoid full table scans
5. Multi-layered Indexing Strategies Work Best
- Different index types for different query patterns
- Combine exact matching with fuzzy matching
- Use partial indexes for common query patterns
Implementation Checklist
If you're facing similar phonetic search performance issues:
- Analyze your queries: use `EXPLAIN ANALYZE` to identify bottlenecks
- Create B-tree indexes: for exact phonetic matches
- Add trigram indexes: For fuzzy phonetic matching
- Implement partial indexes: For commonly filtered data
- Upgrade to PostgreSQL 13+: for automatic B-tree deduplication
- Consider query rewriting: To eliminate expensive operations
- Monitor index usage: Track which indexes are actually used
Summary
Optimizing phonetic search in PostgreSQL requires a multi-layered approach. By combining B-tree indexes for exact matches, GIN indexes for fuzzy matching, partial indexing for common words, B-tree deduplication, and smart query rewriting, we achieved a 37x performance improvement for Van Dale Rijmwoordenboek.
The key was understanding that phonetic data has unique requirements and requires specialized optimization techniques. Generic text search optimization approaches won't work for phonetic similarity operations.
If this article helped you understand phonetic search optimization, we can help you implement these techniques in your own applications. At Ludulicious, we specialize in:
- Text Search Solutions: Phonetic matching and similarity algorithms
- Database Performance Optimization: From slow queries to indexing strategies
- Custom Development: Tailored solutions for your specific use case
Ready to optimize your phonetic search?
Contact us for a free consultation, or check out our other optimization guides:
- PostgreSQL Performance Tuning: Strategic Lessons from Production
- Duikersgids: How I Made Spatial Search 55x Faster
- Rijmwoordenboek: Serving Pages Under 15ms with Better Caching
- UpstreamAds: From 1.2s to 35ms Full-Text Search
- PostgreSQL Configuration: The Settings That Matter
This optimization case study is based on real production experience with Van Dale Rijmwoordenboek. All performance numbers are from actual production systems.