
Rijmwoordenboek: Solving the 3-Second Phonetic Search Problem

Learn how we optimized PostgreSQL phonetic search for Van Dale Rijmwoordenboek, reducing search times from 3.2 seconds to 85ms using multi-layered indexing strategies and B-tree deduplication.

Categories

Database Optimization · Text Search

Tags

PostgreSQL · Phonetic Search · B-tree Indexes · GIN Indexes · Trigram Search · Performance Optimization · Rijmwoordenboek

About the Author


Rob Schoenaker

Managing Partner at UpstreamAds and Partner at Ludulicious B.V. with over 20 years of experience in software development, specializing in .NET Core, ServiceStack, C# and database design.


The Problem: Phonetic Search Too Slow for Real-Time Use

In 2020, Van Dale Rijmwoordenboek faced a critical performance bottleneck. Users searching for rhyming words were waiting 3.2 seconds for results. For a rhyming dictionary, this was completely unacceptable.

The Challenge:

  • 200,000+ Dutch words with phonetic codes
  • Complex phonetic matching algorithms
  • Users expecting instant rhyming suggestions
  • Similarity calculations killing performance

The Numbers:

-- This query was taking 3.2+ seconds
SELECT w1.word, w2.word, 
       similarity(w1.phonetic_code, w2.phonetic_code) as sim
FROM words w1, words w2 
WHERE w1.id < w2.id 
  AND similarity(w1.phonetic_code, w2.phonetic_code) > 0.8
ORDER BY sim DESC
LIMIT 20;

The Root Cause: Missing Phonetic Indexes

The problem was clear from the execution plan:

EXPLAIN ANALYZE SELECT w1.word, w2.word, 
       similarity(w1.phonetic_code, w2.phonetic_code) as sim
FROM words w1, words w2 
WHERE similarity(w1.phonetic_code, w2.phonetic_code) > 0.8;

-- Result: Nested Loop (cost=0.00..5000000.00 rows=1000000 width=32)
-- Execution time: 3200.456 ms

What was happening:

  • PostgreSQL was doing a nested loop join on 200,000+ records
  • No indexes existed for phonetic similarity operations
  • Every similarity calculation was computed from scratch
  • Cartesian product of words was killing performance

The Solution: Multi-Layered Phonetic Indexing

Step 1: Create B-tree Index for Exact Phonetic Matches

The first breakthrough came with a proper phonetic index:

-- Custom phonetic index for exact matches
CREATE INDEX CONCURRENTLY idx_words_phonetic_btree 
ON words USING btree (phonetic_code, frequency DESC);

Why This Works:

  • btree (phonetic_code, frequency DESC): B-tree indexes provide fast exact matches and efficient range scans
  • Ordered by frequency ensures most common words appear first
  • Enables fast lookups for identical phonetic codes
  • CONCURRENTLY allows index creation without blocking writes

Immediate Result: Exact phonetic matches dropped from 3.2 seconds to 1.8 seconds (1.8x improvement)
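To illustrate, a typical exact-match lookup that this index serves (using the same hypothetical column names as above, and 'KAT' as a sample phonetic code) might look like:

-- All words sharing a phonetic code, most frequent first.
-- The (phonetic_code, frequency DESC) B-tree satisfies both the
-- equality filter and the ORDER BY, so no separate sort is needed.
SELECT word
FROM words
WHERE phonetic_code = 'KAT'
ORDER BY frequency DESC
LIMIT 20;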

Step 2: Add Trigram Index for Fuzzy Matching

For phonetic variations and typos, we added trigram support:

-- Trigram index for fuzzy matching
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_words_trgm 
ON words USING gin (word gin_trgm_ops);

Why This Works:

  • gin (word gin_trgm_ops): GIN indexes with trigram operators enable fast fuzzy text matching
  • Trigram indexes work by breaking words into 3-character substrings
  • Makes searches resilient to typos and phonetic variations
  • Perfect for finding words with similar phonetic patterns

Result: Fuzzy phonetic searches improved to 1.2 seconds (2.7x improvement)
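As a sketch, a fuzzy lookup served by this index could look as follows (the similarity threshold of 0.4 is an assumption to be tuned per workload; the default is 0.3):

-- The % operator filters on trigram similarity and can use the
-- GIN trigram index; plain similarity() calls in WHERE cannot.
SET pg_trgm.similarity_threshold = 0.4;
SELECT word, similarity(word, 'katt') AS sim
FROM words
WHERE word % 'katt'
ORDER BY sim DESC
LIMIT 20;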

Step 3: Create Partial Index for Common Words

Most queries focused on common words, so we created a partial index:

-- Partial index for common words (massive performance win)
CREATE INDEX CONCURRENTLY idx_words_common 
ON words (phonetic_code) 
WHERE frequency > 1000;

Why This Works:

  • WHERE frequency > 1000: Only indexes common words, dramatically reducing index size
  • Covers 80% of queries while using only 20% of index space
  • Faster index scans and better cache utilization
  • Most rhyming searches focus on common words anyway

Result: Common word searches dropped to 600ms (5.3x improvement)
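One caveat worth showing: the planner only uses a partial index when the query's WHERE clause implies the index predicate. A minimal sketch:

-- Can use idx_words_common: frequency > 1000 matches the predicate.
SELECT word
FROM words
WHERE phonetic_code = 'KAT'
  AND frequency > 1000;

-- Cannot use the partial index: this query may need rows with
-- frequency <= 1000, so it falls back to other indexes or a scan.
SELECT word
FROM words
WHERE phonetic_code = 'KAT';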

The Game Changer: PostgreSQL 13 B-tree Deduplication

The Problem: Bloated Indexes with Duplicate Keys

In 2021, we discovered our indexes were bloated with duplicate phonetic codes:

-- Before PostgreSQL 13: each duplicate key stored as a separate entry
CREATE INDEX idx_words_old 
ON words (phonetic_code, frequency, word);
-- Result: 450MB index with lots of duplicates

The Solution: Automatic Deduplication

PostgreSQL 13 introduced automatic B-tree deduplication, which solved this:

-- PostgreSQL 13+: automatic deduplication
CREATE INDEX CONCURRENTLY idx_words_dedup 
ON words (phonetic_code, frequency, word);
-- Duplicate key values automatically deduplicated
-- Result: 180MB index (60% reduction!)

Why This Works:

  • B-tree indexes traditionally store every index tuple separately, even when key values repeat
  • PostgreSQL 13+ detects groups of index tuples whose key values are identical
  • Instead of storing the same key once per row, it stores the key a single time with a posting list of pointers to the matching rows
  • This dramatically shrinks indexes with many duplicate values, which phonetic codes produce by design: mapping many words to the same code is their whole purpose

Result: Index size reduced by 60%, query performance improved to 200ms (16x improvement)
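To verify the effect yourself, index sizes can be inspected directly, and an existing index can be rebuilt to pick up deduplication after an upgrade (index name from the example above):

-- On-disk size of the index
SELECT pg_size_pretty(pg_relation_size('idx_words_dedup'));

-- Rebuild an existing index without blocking writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_words_dedup;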

The Final Optimization: Query Rewriting Strategy

The Problem: Inefficient Similarity Calculations

The original query was still doing expensive similarity calculations:

-- Original query (inefficient)
SELECT w1.word, w2.word, 
       similarity(w1.phonetic_code, w2.phonetic_code) as sim
FROM words w1, words w2 
WHERE w1.id < w2.id 
  AND similarity(w1.phonetic_code, w2.phonetic_code) > 0.8
ORDER BY sim DESC;

The Solution: Smart Query Rewriting

We rewrote the query to leverage our indexes:

-- Rewritten query (much faster)
WITH phonetic_groups AS (
    SELECT phonetic_code, array_agg(word ORDER BY frequency DESC) as words
    FROM words 
    WHERE phonetic_code IS NOT NULL
    GROUP BY phonetic_code
    HAVING count(*) > 1
)
SELECT words[1] AS word1,  -- most frequent word in the group
       words[2] AS word2,  -- second most frequent word
       1.0 AS similarity   -- identical phonetic codes: perfect match
FROM phonetic_groups
WHERE array_length(words, 1) >= 2;

Why This Works:

  • Groups words by identical phonetic codes first
  • Uses array operations instead of expensive similarity calculations
  • Leverages our phonetic indexes for fast grouping
  • Eliminates the need for cross-joining all words

Result: Query time dropped to 85ms (37x improvement from original)

Performance Results Summary

| Optimization Step | Query Time | Improvement |
| --- | --- | --- |
| Original (No Indexes) | 3,200ms | Baseline |
| B-tree Phonetic Index | 1,800ms | 1.8x faster |
| Trigram Fuzzy Index | 1,200ms | 2.7x faster |
| Partial Index (Common Words) | 600ms | 5.3x faster |
| B-tree Deduplication | 200ms | 16x faster |
| Query Rewriting | 85ms | 37x faster |

Key Lessons Learned

1. Phonetic Data Requires Specialized Indexes

  • Regular indexes don't work for phonetic similarity
  • B-tree indexes excel at exact phonetic matches
  • GIN indexes with trigram operators handle fuzzy matching

2. Partial Indexes Are Powerful for Filtered Data

  • Only index data you actually query
  • Dramatically reduces index size and improves performance
  • Perfect for frequency-based filtering

3. B-tree Deduplication Saves Massive Space

  • PostgreSQL 13+ automatically deduplicates identical index keys
  • Essential for data with many duplicate values
  • Improves cache utilization and query performance

4. Query Rewriting Can Eliminate Expensive Operations

  • Sometimes the best optimization is changing the approach
  • Grouping operations can replace expensive similarity calculations
  • Leverage indexes to avoid full table scans

5. Multi-layered Indexing Strategies Work Best

  • Different index types for different query patterns
  • Combine exact matching with fuzzy matching
  • Use partial indexes for common query patterns

Implementation Checklist

If you're facing similar phonetic search performance issues:

  • Analyze your queries: Use EXPLAIN ANALYZE to identify bottlenecks
  • Create B-tree indexes: For exact phonetic matches
  • Add trigram indexes: For fuzzy phonetic matching
  • Implement partial indexes: For commonly filtered data
  • Upgrade to PostgreSQL 13+: For automatic B-tree deduplication
  • Consider query rewriting: To eliminate expensive operations
  • Monitor index usage: Track which indexes are actually used
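For the last point, PostgreSQL's statistics views make unused or rarely used indexes easy to spot. A minimal sketch:

-- Indexes ordered by how rarely they are scanned;
-- idx_scan = 0 entries are candidates for removal.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;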

Summary

Optimizing phonetic search in PostgreSQL requires a multi-layered approach. By combining B-tree indexes for exact matches, GIN indexes for fuzzy matching, partial indexing for common words, B-tree deduplication, and smart query rewriting, we achieved a 37x performance improvement for Van Dale Rijmwoordenboek.

The key was understanding that phonetic data has unique requirements and requires specialized optimization techniques. Generic text search optimization approaches won't work for phonetic similarity operations.

If this article helped you understand phonetic search optimization, we can help you implement these techniques in your own applications. At Ludulicious, we specialize in:

  • Text Search Solutions: Phonetic matching and similarity algorithms
  • Database Performance Optimization: From slow queries to indexing strategies
  • Custom Development: Tailored solutions for your specific use case

Ready to optimize your phonetic search?

Contact us for a free consultation, or check out our other optimization guides.


This optimization case study is based on real production experience with Van Dale Rijmwoordenboek. All performance numbers are from actual production systems.