Rijmwoordenboek: Solving the 3-Second Phonetic Search Problem
About the Author
Rob Schoenaker
Managing Partner at UpstreamAds and Partner at Ludulicious B.V. with over 20 years of experience in software development, specializing in .NET Core, ServiceStack, C# and database design.
The Problem: Phonetic Search Too Slow for Real-Time Use
In 2020, Van Dale Rijmwoordenboek faced a critical performance bottleneck. Users searching for rhyming words were waiting 3.2 seconds for results. For a rhyming dictionary, this was completely unacceptable.
The Challenge:
- 200,000+ Dutch words with phonetic codes
- Complex phonetic matching algorithms
- Users expecting instant rhyming suggestions
- Similarity calculations killing performance
The Numbers:
```sql
-- This query was taking 3.2+ seconds
SELECT w1.word, w2.word,
       similarity(w1.phonetic_code, w2.phonetic_code) AS sim
FROM words w1, words w2
WHERE w1.id < w2.id
  AND similarity(w1.phonetic_code, w2.phonetic_code) > 0.8
ORDER BY sim DESC
LIMIT 20;
```
The Root Cause: Missing Phonetic Indexes
The problem was clear from the execution plan:
```sql
EXPLAIN ANALYZE
SELECT w1.word, w2.word,
       similarity(w1.phonetic_code, w2.phonetic_code) AS sim
FROM words w1, words w2
WHERE similarity(w1.phonetic_code, w2.phonetic_code) > 0.8;

-- Result: Nested Loop (cost=0.00..5000000.00 rows=1000000 width=32)
-- Execution time: 3200.456 ms
```
What was happening:
- PostgreSQL was doing a nested loop join on 200,000+ records
- No indexes existed for phonetic similarity operations
- Every similarity calculation was computed from scratch
- Cartesian product of words was killing performance
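To get a feel for the scale, here is a back-of-the-envelope sketch (in Python, purely illustrative) of how many similarity() calls the self-join implies:

```python
def pairwise_comparisons(n: int) -> int:
    """Number of similarity() calls implied by a self-join with w1.id < w2.id."""
    return n * (n - 1) // 2

# For a corpus of roughly 200,000 words:
print(pairwise_comparisons(200_000))  # ~20 billion comparisons
```

At roughly 20 billion comparisons, no amount of per-call tuning would have saved this query; the shape of the join had to change.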
The Solution: Multi-Layered Phonetic Indexing
Step 1: Create B-tree Index for Exact Phonetic Matches
The first breakthrough came with a proper phonetic index:
```sql
-- Custom phonetic index for exact matches
CREATE INDEX CONCURRENTLY idx_words_phonetic_btree
ON words USING btree (phonetic_code, frequency DESC);
```
Why This Works:
- `btree (phonetic_code, frequency DESC)`: B-tree indexes provide fast exact matches and efficient range scans
- Ordering by `frequency DESC` ensures the most common words appear first
- Enables fast lookups for identical phonetic codes
- `CONCURRENTLY` allows index creation without blocking writes
Immediate Result: Exact phonetic matches dropped from 3.2 seconds to 1.8 seconds (1.8x improvement)
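The effect of the composite index can be sketched in a few lines of Python: a hypothetical in-memory model where entries kept sorted by (phonetic_code, frequency DESC) turn an exact-match lookup into a binary search plus a short ordered scan. All words and frequencies here are invented for illustration:

```python
import bisect

# Invented sample entries, sorted the way the B-tree orders them:
# phonetic_code ascending, frequency descending within a code.
index = sorted(
    [("KAT", 1000, "kat"), ("KAT", 800, "mat"), ("ROS", 500, "roos")],
    key=lambda e: (e[0], -e[1]),
)
keys = [e[0] for e in index]

def exact_matches(code: str) -> list[str]:
    """Words sharing a phonetic code, most frequent first (binary search + scan)."""
    lo = bisect.bisect_left(keys, code)
    hi = bisect.bisect_right(keys, code)
    return [word for _, _, word in index[lo:hi]]

print(exact_matches("KAT"))  # ['kat', 'mat']
```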
Step 2: Add Trigram Index for Fuzzy Matching
For phonetic variations and typos, we added trigram support:
```sql
-- Trigram index for fuzzy matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_words_trgm
ON words USING gin (word gin_trgm_ops);
```
Why This Works:
- `gin (word gin_trgm_ops)`: GIN indexes with trigram operators enable fast fuzzy text matching
- Trigram indexes work by breaking words into 3-character substrings
- Makes searches resilient to typos and phonetic variations
- Perfect for finding words with similar phonetic patterns
Result: Fuzzy phonetic searches improved to 1.2 seconds (2.7x improvement)
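For intuition, here is a minimal Python re-implementation of pg_trgm-style matching. This is a simplified sketch, not the extension's exact behavior (the real pg_trgm also strips non-alphanumeric characters before slicing):

```python
def trigrams(word: str) -> set[str]:
    # pg_trgm lowercases, then pads with two spaces in front and one behind
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams, roughly as pg_trgm does."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigrams("kat"))            # {'  k', ' ka', 'kat', 'at '}
print(similarity("kat", "katje"))
```

Because the padding trigrams survive small edits, a typo changes only a few of a word's trigrams, which is exactly what makes the index resilient to misspellings.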
Step 3: Create Partial Index for Common Words
Most queries focused on common words, so we created a partial index:
```sql
-- Partial index for common words (massive performance win)
CREATE INDEX CONCURRENTLY idx_words_common
ON words (phonetic_code)
WHERE frequency > 1000;
```
Why This Works:
- `WHERE frequency > 1000`: only indexes common words, dramatically reducing index size
- Covers 80% of queries while using only 20% of the index space
- Faster index scans and better cache utilization
- Most rhyming searches focus on common words anyway
Result: Common word searches dropped to 600ms (5.3x improvement)
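The partial-index idea can be sketched in Python: only rows matching the predicate are materialized in the lookup structure, and everything else is left to the base table. The sample words and frequencies are invented:

```python
# Invented sample rows: (phonetic_code, word, frequency)
words = [("KAT", "kat", 5000), ("KAT", "kater", 40), ("ROS", "roos", 2000)]

partial_index: dict[str, list[str]] = {}
for code, word, freq in words:
    if freq > 1000:  # the partial index's WHERE clause
        partial_index.setdefault(code, []).append(word)

print(partial_index)  # only the common words get index entries
```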
The Game Changer: PostgreSQL 13 B-tree Deduplication
The Problem: Bloated Indexes with Duplicate Keys
In 2021, we discovered our indexes were bloated with duplicate phonetic codes:

```sql
-- Before PostgreSQL 13: each duplicate stored separately
CREATE INDEX idx_words_old
ON words (phonetic_code, frequency, word);
-- Result: 450MB index with lots of duplicates
```
The Solution: Automatic Deduplication
The automatic B-tree deduplication introduced in PostgreSQL 13 solved this:

```sql
-- PostgreSQL 13+: automatic deduplication
CREATE INDEX CONCURRENTLY idx_words_dedup
ON words (phonetic_code, frequency, word);
-- Duplicate key values automatically deduplicated
-- Result: 180MB index (60% reduction!)
```
Why This Works:
- B-tree indexes traditionally store each duplicate key as a separate entry
- PostgreSQL 13+ recognizes when multiple rows have identical index key values
- Instead of storing an identical key like ("KAT", 1000, "kat") once per matching row, it stores the key once with a list of pointers to all the matching rows
- Dramatically reduces index size when you have many duplicate values (common in phonetic codes)
Result: Index size reduced by 60%, query performance improved to 200ms (16x improvement)
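Conceptually, deduplication replaces one index entry per row with one entry per distinct key plus a batched list of row pointers (TIDs). A toy Python sketch with invented data:

```python
# Invented index entries: (key, row pointer)
entries = [("KAT", 1), ("KAT", 2), ("KAT", 3), ("ROS", 4)]

# Pre-13 layout: one (key, pointer) tuple stored per row
naive = list(entries)

# Deduplicated layout: each key stored once, pointers batched
dedup: dict[str, list[int]] = {}
for key, tid in entries:
    dedup.setdefault(key, []).append(tid)

print(len(naive), "entries vs", len(dedup), "keys")  # 4 entries vs 2 keys
```

The savings grow with the number of duplicates per key, which is why phonetic codes (many words per code) benefit so much.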
The Final Optimization: Query Rewriting Strategy
The Problem: Inefficient Similarity Calculations
The original query was still doing expensive similarity calculations:
```sql
-- Original query (inefficient)
SELECT w1.word, w2.word,
       similarity(w1.phonetic_code, w2.phonetic_code) AS sim
FROM words w1, words w2
WHERE w1.id < w2.id
  AND similarity(w1.phonetic_code, w2.phonetic_code) > 0.8
ORDER BY sim DESC;
```
The Solution: Smart Query Rewriting
We rewrote the query to leverage our indexes:
```sql
-- Rewritten query (much faster)
WITH phonetic_groups AS (
    SELECT phonetic_code,
           array_agg(word ORDER BY frequency DESC) AS words
    FROM words
    WHERE phonetic_code IS NOT NULL
    GROUP BY phonetic_code
    HAVING count(*) > 1
)
SELECT words[1] AS word1,
       words[2] AS word2,
       1.0 AS similarity
FROM phonetic_groups
WHERE array_length(words, 1) >= 2;
```
Why This Works:
- Groups words by identical phonetic codes first
- Uses array operations instead of expensive similarity calculations
- Leverages our phonetic indexes for fast grouping
- Eliminates the need for cross-joining all words
Result: Query time dropped to 85ms (37x improvement from original)
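The rewrite in miniature, as a Python sketch with invented data: group words by identical phonetic code, then emit a rhyme pair from each group, with no similarity() call anywhere:

```python
from collections import defaultdict

# Invented sample rows: (word, phonetic_code, frequency)
words = [("kat", "KAT", 5000), ("mat", "KAT", 3000), ("roos", "ROS", 2000)]

groups: dict[str, list[str]] = defaultdict(list)
for word, code, freq in sorted(words, key=lambda w: -w[2]):
    groups[code].append(word)  # ordered by frequency, like array_agg(... ORDER BY)

# One pair per group of identical phonetic codes; similarity is 1.0 by construction
pairs = [(ws[0], ws[1]) for ws in groups.values() if len(ws) >= 2]
print(pairs)  # [('kat', 'mat')]
```

Because words sharing a phonetic code rhyme by definition, the expensive fuzzy comparison degenerates into a GROUP BY that the indexes can serve directly.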
Performance Results Summary
| Optimization Step | Query Time | Improvement |
|---|---|---|
| Original (No Indexes) | 3,200ms | Baseline |
| B-tree Phonetic Index | 1,800ms | 1.8x faster |
| Trigram Fuzzy Index | 1,200ms | 2.7x faster |
| Partial Index (Common Words) | 600ms | 5.3x faster |
| B-tree Deduplication | 200ms | 16x faster |
| Query Rewriting | 85ms | 37x faster |
Key Lessons Learned
1. Phonetic Data Requires Specialized Indexes
- Regular indexes don't work for phonetic similarity
- B-tree indexes excel at exact phonetic matches
- GIN indexes with trigram operators handle fuzzy matching
2. Partial Indexes Are Powerful for Filtered Data
- Only index data you actually query
- Dramatically reduces index size and improves performance
- Perfect for frequency-based filtering
3. B-tree Deduplication Saves Massive Space
- PostgreSQL 13+ automatically deduplicates identical keys
- Essential for data with many duplicate values
- Improves cache utilization and query performance
4. Query Rewriting Can Eliminate Expensive Operations
- Sometimes the best optimization is changing the approach
- Grouping operations can replace expensive similarity calculations
- Leverage indexes to avoid full table scans
5. Multi-layered Indexing Strategies Work Best
- Different index types for different query patterns
- Combine exact matching with fuzzy matching
- Use partial indexes for common query patterns
Implementation Checklist
If you're facing similar phonetic search performance issues:
- Analyze your queries: use `EXPLAIN ANALYZE` to identify bottlenecks
- Create B-tree indexes: for exact phonetic matches
- Add trigram indexes: For fuzzy phonetic matching
- Implement partial indexes: For commonly filtered data
- Upgrade to PostgreSQL 13+: for automatic B-tree deduplication
- Consider query rewriting: To eliminate expensive operations
- Monitor index usage: Track which indexes are actually used
Summary
Optimizing phonetic search in PostgreSQL requires a multi-layered approach. By combining B-tree indexes for exact matches, GIN indexes for fuzzy matching, partial indexing for common words, B-tree deduplication, and smart query rewriting, we achieved a 37x performance improvement for Van Dale Rijmwoordenboek.
The key was understanding that phonetic data has unique requirements and requires specialized optimization techniques. Generic text search optimization approaches won't work for phonetic similarity operations.
If this article helped you understand phonetic search optimization, we can help you implement these techniques in your own applications. At Ludulicious, we specialize in:
- Text Search Solutions: Phonetic matching and similarity algorithms
- Database Performance Optimization: From slow queries to indexing strategies
- Custom Development: Tailored solutions for your specific use case
Ready to optimize your phonetic search?
Contact us for a free consultation, or check out our other optimization guides:
- PostgreSQL Performance Tuning: Strategic Lessons from Production
- Duikersgids: How I Made Spatial Search 55x Faster
- Rijmwoordenboek: Serving Pages Under 15ms with Better Caching
- UpstreamAds: From 1.2s to 35ms Full-Text Search
- PostgreSQL Configuration: The Settings That Matter
This optimization case study is based on real production experience with Van Dale Rijmwoordenboek. All performance numbers are from actual production systems.