Easily create database dumps with support for redacting data (And replacing that data with valid random values).
Supported databases
- MySQL
- PostgreSQL
- Microsoft SQL Server
- SQLite
More coming soon...
To install redactdump, run the following command:
pip install redactdumpUsage: redactdump [OPTIONS]
Create a redacted database dump.
Options:
-c, --config TEXT Path to dump configuration. [required]
-u, --user TEXT Connection username.
-p, --password TEXT Connection password.
--max-workers INTEGER Max number of tables dumped concurrently. [default: 4]
-d, --debug Enable debug mode.
--dry-run Show which rule would redact each column, without
reading or writing any data.
--help Show this message and exit.To verify that your patterns hit the right columns before trusting a dump,
run with --dry-run. It connects, lists every table and column, and shows
which rule (if any) would redact each column - without reading any rows or
creating any output files. Data-pattern rules depend on each cell's value, so
they are listed separately as evaluated per cell.
To create a dump you currently must use a configuration file, however in the future you might be able to do it all via CLI.
redactdump uses faker to generate random data.
replacement can therefore be any function from the following providers:
https://faker.readthedocs.io/en/stable/providers.html
The standard faker providers are validated in CI: each is exercised through the redaction pipeline to confirm it produces a value that is written out as a valid SQL literal.
These work normally once their package is installed alongside redactdump:
-
imagegenerates raw image bytes and needs Pillow. Afterpip install Pillow, target abyteacolumn so the bytes are written as a hex literal (on a text column you would get a Python byte-string repr instead):redact: patterns: column: - pattern: '^avatar$' replacement: image
-
xmlgenerates an XML document and needs xmltodict. Afterpip install xmltodictit returns an XML string, suitable fortextorcharacter varyingcolumns.
Some providers take arguments (for example random_int accepts min and
max). Add an arguments mapping to a rule and it is passed to the provider as
keyword arguments:
redact:
patterns:
column:
- pattern: '^age$'
replacement: random_int
arguments:
min: 18
max: 99An argument written as { import: module.attr } is resolved to the imported
object before the provider is called. This is what makes enum usable, as it
needs an enum class rather than a plain value:
redact:
patterns:
column:
- pattern: '^status$'
replacement: enum
arguments:
enum_cls:
import: myapp.models.StatusTo replace matched cells with a fixed literal instead of generated data, give
the rule a value instead of a replacement. The literal is written verbatim
and keeps its type, so 'REDACTED', 0, an empty string or null (a real
SQL NULL) all work. A rule must carry exactly one of the two keys:
redact:
patterns:
column:
- pattern: '^internal_notes$'
value: 'REDACTED'By default every cell gets a fresh random value, so the same real value
becomes a different fake value in every row. Set consistent: true on a rule
to map identical inputs to identical outputs instead, stable across rows and
tables, so joins, grouping and dedup on redacted columns keep working:
redact:
patterns:
column:
- pattern: '^email$'
replacement: email
consistent: trueThe output is derived by seeding the generator with HMAC-SHA256 over the
original value, keyed with a secret. Without configuration the secret is
random per run: identical values map consistently within one dump but
differently on the next, and someone holding only the dump cannot recompute
the mapping. To keep mappings stable across runs, provide the secret
yourself, either as redact.seed or (preferably, so it stays out of config
files) the REDACTDUMP_SEED environment variable, which takes precedence
over the config key.
Security notes:
- Treat the seed like a password. Anyone holding it can recover originals of guessable values (emails, phone numbers, sequential ids, ...) by enumerating candidates and comparing the resulting fakes against the dump.
- Consistent redaction intentionally reveals equality: two rows sharing a fake value shared the real value. Use it only where that is acceptable.
faker freely repeats values, so redacting a column that carries a UNIQUE
constraint can produce a dump that fails to replay. Set unique: true on a
rule to guarantee the rule never emits the same value twice within a run:
redact:
patterns:
column:
- pattern: '^username$'
replacement: user_name
unique: trueIf the replacement cannot produce enough distinct values for the rows being
dumped, the run aborts with an error naming the provider. unique cannot be
combined with consistent: a consistent rule must repeat its output whenever
the input repeats, which contradicts uniqueness.
By default a matching rule fabricates a value even when the original cell is
NULL. Set preserve_null: true on a rule to keep NULL cells NULL, so the
data's null distribution survives redaction:
redact:
patterns:
column:
- pattern: '^phone'
replacement: phone_number
preserve_null: trueA preserved NULL still counts as handled: no later rule will fabricate a
value for that column. Note that whether a cell is NULL is itself
information (for example, a NULL phone reveals the user supplied no phone
number); leave the flag off where that must not leak.
redact.columns redacts specific columns of specific tables, without pattern
matching. Each entry names the column (exact match) and the replacement to
use; a null replacement writes NULL. These rules apply only to the named
table and take precedence over redact.patterns rules for the same column.
The entries accept value, consistent, unique and preserve_null as well:
redact:
columns:
users:
- name: email
replacement: email
consistent: true
- name: ssn
replacement: nullfaker's community providers
are also supported. Install the provider package alongside redactdump and list
the provider classes (as dotted import paths) under redact.providers. Their
methods then become available as replacement values.
redact:
providers:
- faker_vehicle.VehicleProvider
patterns:
column:
- pattern: '^car_model'
replacement: vehicle_makeconnection.type selects the database engine:
-
pgsql(orpostgresql): PostgreSQL viapsycopg. -
mysql: MySQL viaaiomysql. -
mssql: Microsoft SQL Server viaaioodbc. This requires the Microsoft ODBC Driver 18 for SQL Server to be installed; a different installed driver can be selected withconnection.driver. The server's TLS certificate is trusted implicitly (TrustServerCertificate=yes), matching the self-signed certificate SQL Server ships with. -
sqlite: SQLite viaaiosqlite. There is no server:connection.databaseis the path to the database file, and host, port and credentials are neither needed nor used.connection: type: sqlite database: ./app.db
connection.schema selects the schema to dump. It defaults to public on
PostgreSQL, dbo on SQL Server and the connection database on MySQL. When a
schema is configured, the table names in the generated INSERT and DDL
statements are schema-qualified so the dump replays into that same schema:
connection:
type: pgsql
host: 127.0.0.1
port: 5432
database: postgres
schema: accountingconnection:
type: pgsql
host: 127.0.0.1
port: 5432
database: postgres
redact:
patterns:
column:
- pattern: '^[a-zA-Z]+_name'
replacement: name
data:
- pattern: '192.168.0.1'
replacement: ipv4
- pattern: 'John Doe'
replacement: name
output:
type: multi_file
naming: 'dump-[table_name]-[timestamp]' # Default: [table_name]-[timestamp]
location: './output/'limits.tables restricts the dump to the listed tables, and
limits.exclude_tables skips tables (audit logs, migration bookkeeping,
huge append-only tables). Both accept exact names or regular expressions;
an entry must match the whole table name, and exclusion wins when both
match:
limits:
tables:
- users
- 'orders_.*'
exclude_tables:
- alembic_version
- 'audit_.*'limits.max_rows_per_table caps every table globally. limits.per_table
overrides it for specific tables (exact names): max_rows caps that table's
rows, and where filters which rows are dumped, so "all of users, only the
last 30 days of events" is expressible:
limits:
per_table:
events:
max_rows: 10000
where: "created_at > now() - interval '30 days'"The where clause is passed to the database verbatim, so it can use any
expression the engine supports. It applies to both the row count and the
reads; max_rows then caps the filtered rows.
output.type controls how dumps are written:
multi_file: one.sqlfile per table insidelocation(a directory).file: every table is written into a single file. Withoutnamingthe file is{location}.sql; withnamingthe templated name (with[table_name]dropped) is placed in the directory oflocation. The file is recreated on each run.
Set output.ddl: true to prepend each table's CREATE TABLE statement before
its data, so the dump can recreate the schema as well as the rows. The DDL is
produced by the database itself and is dialect-aware:
- MySQL uses
SHOW CREATE TABLE, so the definition is authoritative and complete (exact types, primary key, indexes, constraints, engine and charset). - PostgreSQL has no such statement, so the definition is reconstructed from
pg_catalog: exact types (including lengths and precision such ascharacter varying(64)ornumeric(10,2)),NOT NULL, column defaults, the primary key (in column order) and secondary indexes. Foreign keys are emitted asALTER TABLE ... ADD CONSTRAINTstatements at the very end of the dump, so referenced tables and rows already exist when they are applied. - SQL Server is reconstructed the same way from
INFORMATION_SCHEMAand thesyscatalog: exact types (nvarchar(255),nvarchar(max),decimal(10,2),datetime2(7), ...),NOT NULL, column defaults, the primary key, secondary indexes and deferred foreign keys. Identity properties are not reproduced, so the dumped rows replay withoutSET IDENTITY_INSERT.
The DDL is written once per table (empty tables still get their schema).
output:
type: multi_file
location: './output/'
ddl: trueAn example of the generated PostgreSQL DDL:
CREATE TABLE "users" (
"id" integer NOT NULL DEFAULT nextval('users_id_seq'::regclass),
"email" character varying(255) NOT NULL,
"note" text,
PRIMARY KEY ("id")
);The configuration schema can be found here
Configuration
connection:
type: pgsql
host: 127.0.0.1
port: 5432
database: postgres
redact:
patterns:
column:
- pattern: '^new_'
replacement: name
data:
- pattern: '6'
replacement: random_int
output:
type: multi_file
naming: 'dump-[table_name]-[timestamp]' # Default: [table_name]-[timestamp]
location: './output/'Original data
(column_1, new_column)
6,"""John Doe"""
6,"John Doe"
6,"John Doe"
6,John Doe
1,\John Doe
1,--John Doe
12312, John Doe
99,!John Doe
99,(John Doe)
Output
INSERT INTO "table_name" ("column_1", "new_column") VALUES (890, 'Yolanda Mcdonald');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (1982, 'Stephen Lewis');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (2952, 'Janet Woodward');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (9307, 'Joshua Price');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (1, 'Tina Morrison');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (1, 'Juan Mejia');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (12312, 'Michael Thornton');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (99, 'Adrian White');
INSERT INTO "table_name" ("column_1", "new_column") VALUES (99, 'Robin Jefferson');Values are rendered from their Python type for the connection's dialect, so a dump can be replayed into the same engine it came from. Booleans, numbers, dates/times, binary, JSON and (on PostgreSQL) arrays are all rendered as proper literals rather than a quoted Python representation.
- PostgreSQL: identifiers are double-quoted; booleans are
TRUE/FALSE;byteais a'\x..'::byteahex literal; JSON is serialised and cast ('..'::jsonb); arrays use array literals ('{1,2,3}'); and the PostgreSQL-specific types (inet,cidr,macaddr,macaddr8,interval,point,line,lseg,box,circle,polygon,tsvector,tsquery,pg_lsn,pg_snapshot,txid_snapshot) keep an explicit::typecast. - MySQL: identifiers are backtick-quoted; booleans are
1/0; binary (blob,varbinary, ...) is anX'..'hex literal;BITis its integer value; and backslashes in strings are escaped (MySQL treats them as escape characters). - SQL Server: identifiers are bracket-quoted (
[name]);bitvalues are1/0; binary (varbinary, ...) is a0x..hex literal; and strings are UnicodeN'..'literals where only single quotes need doubling.
Redacting one of the dialect-specific types requires a replacement that produces a value valid for the type.
redactdump runs on an asyncio pipeline. Tables are dumped concurrently (bounded
by --max-workers) using async database drivers (psycopg for PostgreSQL,
aiomysql for MySQL, aioodbc for SQL Server) and async file writes via
aiofiles.
Each table is read over a single connection holding one transaction (REPEATABLE READ on PostgreSQL and MySQL), and batches are ordered by the table's primary key (or every column when there is none), so a dump taken while the database is being written to neither skips nor duplicates rows within a table.
benchmarks/benchmark_dump.py seeds a live PostgreSQL database, runs the full
dump pipeline and reports throughput in rows per second. Connection settings
come from BENCH_DB_* environment variables and default to the docker-compose
Postgres service.
docker compose up -d
uv run python benchmarks/benchmark_dump.py --rows 40000 --tables 4 --iterations 3CI runs the same benchmark on every pull request via
.github/workflows/benchmark.yaml, tracks the result over time and fails the
build if throughput regresses significantly against the main baseline.
