A Go utility that parses PostgreSQL WAL dump output (from pg_waldump
) and stores it in a DuckDB database for analysis.
- Parses pg_waldump text output with high-performance state machine parser
- Parallel processing with configurable worker pool
- Stores WAL records in DuckDB format for fast analytical queries
- High-performance bulk loading using DuckDB's Appender API
- Batch processing for efficient database writes
- Support for both stdin and file input
- Configurable batch size and parallelism for performance tuning
- Progress reporting every 5 seconds during processing
go mod download
go build -o walstats
pg_waldump /path/to/wal/000000010000000000000001 | ./walstats -output wal.duckdb
# First save pg_waldump output
pg_waldump /path/to/wal/000000010000000000000001 > waldump.txt
# Then process it
./walstats -input waldump.txt -output wal.duckdb
-input
: Input file containing pg_waldump output (use "-" for stdin, default: "-")-output
: Output DuckDB database file (default: "waldump.duckdb")-batch-size
: Number of records to batch before writing (default: 1000)-workers
: Number of parallel workers (0 = number of CPU cores, default: 0)-help
: Show help message
The utility creates a wal_records
table with the following columns:
lsn
: Log Sequence Number (VARCHAR, PRIMARY KEY)prev_lsn
: Previous LSN (VARCHAR)xid
: Transaction ID (BIGINT)resource_mgr
: Resource manager name (VARCHAR)record_type
: Type of WAL record (VARCHAR)length
: Record length (INTEGER)total_length
: Total record length (INTEGER)description
: Full description from pg_waldump (TEXT)timestamp
: Transaction timestamp if available (TIMESTAMP)database
: Database OID (INTEGER) - parsed from rel field (2nd component)tablespace
: Tablespace OID (INTEGER) - parsed from rel field (1st component)relfilenode
: Relation file node (INTEGER)block_number
: Block number within relation (INTEGER)fork_name
: Fork name (main, fsm, vm, etc.) (VARCHAR)inserted_at
: When the record was inserted into DuckDB (TIMESTAMP)
Once the data is in DuckDB, you can analyze it using SQL:
-- Connect to DuckDB
duckdb wal.duckdb
-- Count records by resource manager
SELECT resource_mgr, COUNT(*) as count
FROM wal_records
GROUP BY resource_mgr
ORDER BY count DESC;
-- Find largest WAL records
SELECT lsn, resource_mgr, total_length, description
FROM wal_records
ORDER BY total_length DESC
LIMIT 10;
-- Analyze transaction patterns
SELECT xid, COUNT(*) as record_count, SUM(total_length) as total_size
FROM wal_records
WHERE xid > 0
GROUP BY xid
ORDER BY total_size DESC;
-- Analyze activity by relation
SELECT database, tablespace, relfilenode,
COUNT(*) as record_count,
SUM(total_length) as total_size
FROM wal_records
WHERE relfilenode IS NOT NULL
GROUP BY database, tablespace, relfilenode
ORDER BY total_size DESC;
-- Find hot blocks (frequently modified)
SELECT database, tablespace, relfilenode, block_number,
COUNT(*) as modification_count
FROM wal_records
WHERE block_number IS NOT NULL
GROUP BY database, tablespace, relfilenode, block_number
ORDER BY modification_count DESC
LIMIT 20;
- Go 1.21 or higher
- PostgreSQL 10 or higher (for pg_waldump)
- DuckDB Go driver (automatically installed via go.mod)
- State machine parser: 3-7x faster than regex-based parsing
- Parallel processing: Scales linearly with CPU cores for parsing
- DuckDB Appender API: 5-10x faster than standard SQL inserts
- Benchmarks: Can process millions of WAL records per minute on modern hardware
- Workers: Set to number of CPU cores for CPU-bound parsing (default: auto-detect)
- Batch size: Larger batches (5000-10000) can improve throughput for large files
- Memory usage: Scales with workers × batch-size × average record size
The utility is compatible with pg_waldump output from PostgreSQL 10 and later versions. The core format has remained consistent:
- PostgreSQL 10-16: Fully compatible
- PostgreSQL 9.6 and earlier: Uses
pg_xlogdump
(may need minor adjustments)
The parser handles the standard pg_waldump format:
rmgr: <manager> len (rec/tot): <rec>/<tot>, tx: <xid>, lsn: <lsn>, prev <prev_lsn>, desc: <description>
Key fields that remain consistent across versions:
- Resource manager (rmgr)
- Record lengths (len)
- Transaction ID (tx)
- Log Sequence Numbers (lsn, prev)
- Description with optional relation info (rel tablespace/database/relfilenode)
While the core format is stable, some differences exist:
- Relation format: Always
tablespace/database/relfilenode
in PG 10+ - Resource managers: New types added in newer versions (handled gracefully)
- Description details: May vary but core structure remains same
- Progress is reported every 5 seconds during processing
- DuckDB writes are serialized but parsing happens in parallel
- Unknown resource managers or description formats are stored as-is