SyntechD — synthetic population for pandemic preparedness

The problem

Why build a fake city?

To predict how a virus spreads, epidemiologists need data about individual people: their age, who they live with, what kind of home they're in, how they move around.

That data is far too sensitive to publish. What is public are neighbourhood totals from the Dutch statistics office (CBS): "this buurt has 1,880 residents, 35% aged 25–44, 60% living in apartments." Useful — but a model can't infect a percentage.

Two people we built this for

Lena, epidemiologist — needs realistic individuals to feed her outbreak model, and can load this dataset in seconds.

Daan, wastewater analyst — sees a virus spike at a treatment plant but doesn't know who lives in its catchment. "An alarm without an address."

The idea

What is a synthetic population?

It's a set of invented people who add up to the real public totals but describe no real person. Think of it as upscaling a blurry photo: the output has more detail than the input, yet stays faithful to it.

Public input (neighbourhood totals & percentages) → Synthetic output (individual people & households)

No real micro-data is used at any step — every person is generated from the public aggregates.

The method

How we build it, in five steps

Start from public CBS totals

For each of the 109 Utrecht buurten we read the published counts — age bands, household types, housing types, income, urbanicity — from CBS table 86165NED. Totals only, never individuals.

Fit a national "recipe" to each neighbourhood (IPF)

Iterative Proportional Fitting takes a national pattern of how traits combine (e.g. young singles tend to live in apartments) and rescales it so it matches each buurt's real totals — preserving the relationships, not just the headline numbers.
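The rescaling idea can be sketched in a few lines of Python. This is a minimal two-dimensional illustration of standard IPF, not the project's Rust implementation; the function name `ipf`, the 2×2 "national" table and the buurt totals are all made up for the example.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # copy, so the seed pattern is left untouched
    for _ in range(max_iter):
        # Row step: scale every row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [cell * factor for cell in table[i]]
        # Column step: scale every column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns are exact after the column step; stop once rows agree too.
        worst = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if worst < tol:
            break
    return table


# Hypothetical national pattern: rows = age (young, old),
# columns = housing (apartment, terraced). Not real CBS figures.
national = [[40.0, 10.0],
            [20.0, 30.0]]

# Hypothetical buurt totals; rows and columns both sum to 1,000 residents.
fitted = ipf(national, row_targets=[700, 300], col_targets=[600, 400])
```

Because every step only multiplies whole rows or whole columns, the odds ratio of the seed table (here 40·30 / (10·20) = 6) carries over to the fitted table. That is the sense in which the relationships are preserved while the totals change.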

Build people and households one by one

We create exactly the right number of each household type, then fill them with people drawn from an age pool whose band counts match the fitted totals, keeping families together and children with their parents.
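Fitted weights are generally fractional, while households come in whole numbers. One common way to get integer counts that still hit the total is largest-remainder rounding. The sketch below assumes that approach purely for illustration; the text does not say which rounding rule the project uses, and `round_preserving_total` is a hypothetical helper.

```python
import math


def round_preserving_total(weights, total):
    """Round fractional weights to integers that sum exactly to `total`.

    Assumes the weights themselves sum to (about) `total`.
    """
    counts = [math.floor(w) for w in weights]
    shortfall = total - sum(counts)
    # Hand the leftover units to the cells with the largest fractional parts.
    by_remainder = sorted(
        range(len(weights)),
        key=lambda i: weights[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[:shortfall]:
        counts[i] += 1
    return counts


# Hypothetical buurt with 100 households, split by the city-wide shares
# (single, no kids, with kids).
counts = round_preserving_total([51.6, 22.4, 26.0], total=100)
```

Here the floors add up to 99, so the one remaining household goes to the cell with the largest remainder (single).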

Place each household on the map

Every household is dropped at a random point inside its real neighbourhood outline (RD-New coordinates). Because the point is random rather than geocoded, it carries no information about any real address.
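A simple way to draw such a point is rejection sampling: pick a uniform point in the polygon's bounding box and keep it only if it falls inside the outline. The sketch below assumes that technique and a plain ray-casting test; the L-shaped outline is invented and is not a real buurt boundary.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray casting: True if (x, y) lies inside the polygon (list of vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_inside(polygon, rng):
    """Draw uniform points in the bounding box until one lands inside."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    while True:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            return x, y


# Hypothetical L-shaped outline in RD-New metres (not a real boundary).
outline = [(136000, 455000), (136400, 455000), (136400, 455200),
           (136200, 455200), (136200, 455400), (136000, 455400)]

rng = random.Random(42)  # fixed seed, so the placement is repeatable
points = [random_point_inside(outline, rng) for _ in range(200)]
```

Seeding the generator keeps the placement deterministic, which fits the fixed-seed reproducibility the project advertises.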

Link to the sewers

Each neighbourhood is matched to the wastewater catchment that serves it — so a signal measured at a treatment plant can finally be read against the population behind it.

The result, visualised

Who lives in synthetic Utrecht?

These breakdowns match the CBS neighbourhood statistics. They're what makes the dataset epidemiologically useful — age and household structure drive how a respiratory virus spreads.

Age

Age band	Persons	Share
0–14	58,640	15.6%
15–24	57,775	15.3%
25–44	138,765	36.8%
45–64	80,130	21.3%
65+	41,460	11.0%

A young university city: the 25–44 band is the largest at 36.8%, and about two-thirds of residents are under 45. Older people cluster differently in space, which matters for shielding the vulnerable.

Households

Household type	Share
Single	51.6%
No kids	22.4%
With kids	26.0%

Over half are single-person households — typical of a university city.

Housing type

Housing type	Share
Apartment	60.1%
Terraced	37.0%
Detached	1.4%
Other	1.5%

Housing density, here the share of apartments, shapes contact rates and is a key model input.

How good is it?

"Realistic" — but measured, not claimed

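We grade the synthetic city against the real CBS totals. The error metric (weighted MAPE) is the total absolute difference between synthetic and CBS counts, divided by the CBS total. Read it as roughly the share of people or households placed in the wrong category. Lower is better.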

Marginal fit

Domain	WMAPE
Age	0.10%
Household	0.00% (exact)
Housing	0.04%

Errors this small mean a near-perfect match.

In plain terms: out of 194,055 households, only 74 sit in the wrong housing category. Household types match the published statistics exactly, and age bands match to within 0.10%.

We also keep cross-domain consistency — a synthetic person's age, household and housing form a believable whole (100% of children live in households with children), not just three independent percentages.

Layer 2

A wastewater signal with an address

Sewage surveillance is a powerful early-warning system, but a spike at a treatment plant is hard to interpret without knowing whose wastewater it receives. We link every neighbourhood to its GWSW catchment and add up the synthetic population behind each one, turning an anonymous signal into demographic context.

Catchment (Utrecht)	Synthetic people	Density (people/km²)
Zuilen / Ondiep	38,235	9,723
Overvecht	32,400	5,480
Baden Powellweg	32,295	10,224
Korte Baanstraat	26,450	4,413
Kanaalweg	22,940	7,151

Each of the 109 buurten maps to one of 33 catchments; densities reflect real urban structure.
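The roll-up from buurt to catchment is a group-and-sum followed by a division by area. A minimal sketch, assuming one catchment per buurt as stated above; the buurt codes, catchment names, populations and areas below are hypothetical, and the real link lives in catchment_join.csv.

```python
from collections import defaultdict


def catchment_summary(buurt_population, buurt_to_catchment, catchment_area_km2):
    """Sum synthetic people per catchment and derive density = people / area."""
    totals = defaultdict(int)
    for buurt, people in buurt_population.items():
        totals[buurt_to_catchment[buurt]] += people
    return {
        name: {"people": people, "density": people / catchment_area_km2[name]}
        for name, people in totals.items()
    }


# Hypothetical inputs, for illustration only.
buurt_population = {"B1": 1880, "B2": 1120, "B3": 2500}
buurt_to_catchment = {"B1": "C-north", "B2": "C-north", "B3": "C-south"}
catchment_area_km2 = {"C-north": 0.5, "C-south": 1.25}

summary = catchment_summary(buurt_population, buurt_to_catchment, catchment_area_km2)
```

Since every buurt lands in exactly one catchment, the catchment totals add back up to the city total, which is what the reconciliation check in the data-integrity list verifies.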

Results — gemeente Utrecht

Metric	Value
Synthetic persons / households	376,770 / 194,055
Buurten	109
Age-band marginal fit (WMAPE)	0.10 %
Household-type marginal fit (WMAPE)	0.000 % (exact)
Housing-type marginal fit (WMAPE)	0.04 %
Children (0-14) in with-kids households	100 %
Buurten → catchment	109 / 109 → 33 catchments

WMAPE = total absolute error / total (the challenge template's metric), robust to tiny industrial buurten. Full methodology and per-buurt numbers in the quality report below.
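The formula is short enough to write out. The sketch below implements WMAPE as defined above, on made-up housing counts, and then redoes the arithmetic behind the 0.04 % housing figure using the two numbers quoted in this page (74 and 194,055). The function name `wmape` and the example counts are illustrative.

```python
def wmape(synthetic, target):
    """Total absolute error divided by the target total, over all categories."""
    categories = set(synthetic) | set(target)
    total_abs_error = sum(
        abs(synthetic.get(c, 0) - target.get(c, 0)) for c in categories
    )
    return total_abs_error / sum(target.values())


# Hypothetical counts for one domain: one household sits in "apartment"
# instead of "terraced", so the absolute errors are 1 + 1 + 0 = 2.
target = {"apartment": 600, "terraced": 370, "detached": 30}
synthetic = {"apartment": 601, "terraced": 369, "detached": 30}
error = wmape(synthetic, target)  # 2 / 1000

# The housing figure quoted on this page: 74 out of 194,055 households.
housing_share = 74 / 194_055  # about 0.00038, which rounds to 0.04 %
```

Errors are pooled across all cells before dividing, so a buurt with a handful of residents cannot dominate the score the way it would in an unweighted mean of per-buurt percentages.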

Data integrity: 14 / 14 checks passed

Every headline number is independently re-derived from the raw output files by scripts/sanity_check.py — it trusts no precomputed metric.

✓ persons == CBS total population (376,770)

✓ household sizes sum exactly to persons

✓ no orphan person→household references

✓ person/household attributes consistent

✓ children (0–14) only in with-kids households

✓ no empty households, no null fields

✓ coordinates within NL RD-New bounds

✓ all 109 buurten mapped to a catchment

✓ catchment populations reconcile to the total

✓ catchment density = population / area
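A few of these checks can be sketched as plain re-derivations from the person and household records. This is an illustration only, not the contents of scripts/sanity_check.py; the field names (`household_id`, `size`, `type`, `age_band`) and the tiny example records are assumptions.

```python
from collections import Counter


def failed_checks(persons, households):
    """Re-derive a few integrity checks; return the names of those that fail."""
    failures = []
    by_id = {h["household_id"]: h for h in households}
    members = Counter(p["household_id"] for p in persons)

    if any(hid not in by_id for hid in members):
        failures.append("orphan person -> household reference")
    if any(members[h["household_id"]] != h["size"] for h in households):
        failures.append("household size does not match its members")
    if any(h["size"] == 0 for h in households):
        failures.append("empty household")
    for p in persons:
        home = by_id.get(p["household_id"])
        if p["age_band"] == "0-14" and home is not None and home["type"] != "with_kids":
            failures.append("child outside a with-kids household")
            break
    return failures


# Hypothetical records: one family of three and one single person.
households = [
    {"household_id": 1, "type": "with_kids", "size": 3},
    {"household_id": 2, "type": "single", "size": 1},
]
persons = [
    {"person_id": 1, "household_id": 1, "age_band": "25-44"},
    {"person_id": 2, "household_id": 1, "age_band": "25-44"},
    {"person_id": 3, "household_id": 1, "age_band": "0-14"},
    {"person_id": 4, "household_id": 2, "age_band": "15-24"},
]

ok = failed_checks(persons, households)  # empty list: all checks pass

# Moving the child into the single household breaks two checks at once.
broken = [dict(p, household_id=2) if p["person_id"] == 3 else p for p in persons]
problems = failed_checks(broken, households)
```

Counting members from the person records, rather than trusting the stored `size`, mirrors the "trusts no precomputed metric" principle described above.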

Privacy: k-anonymity, min k = 5

All data is synthetic, but we ship a disclosure-control pass anyway. The quasi-identifiers (QI) are buurt, age, household and housing; income, education and migration background are treated as sensitive attributes, not QI. The pass touches only the records in cells with fewer than k members and generalises them locally, in this order: mask housing, then household, then age, then coarsen buurt to wijk.

✓ achieved minimum k = 5 (every QI cell ≥ 5)

✓ only 0.08 % of records generalised

✓ released as population_kanon.csv with a qi_level column

✓ coordinates jittered inside the buurt polygon, never a real address
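To make the k-anonymity claim concrete, the sketch below counts how many records share each QI combination and applies only the first generalisation step (masking housing) to records in cells smaller than k. It is an assumption-laden illustration, not the project's pass: the `"*"` mask value, the helper names and the ten example records are invented.

```python
from collections import Counter

QI = ("buurt", "age", "household", "housing")  # quasi-identifiers from the text


def cell_sizes(records):
    """Count how many records share each combination of QI values."""
    return Counter(tuple(r[c] for c in QI) for r in records)


def min_k(records):
    """Size of the smallest QI cell; the release is k-anonymous for this k."""
    return min(cell_sizes(records).values())


def mask_sub_k(records, column, k):
    """Replace `column` with '*' only for records whose QI cell has < k members."""
    sizes = cell_sizes(records)
    masked = []
    for r in records:
        if sizes[tuple(r[c] for c in QI)] < k:
            r = {**r, column: "*"}  # new dict, the input record is not modified
        masked.append(r)
    return masked


def person(housing):
    # Hypothetical record; only the housing type varies in this example.
    return {"buurt": "B1", "age": "25-44", "household": "single", "housing": housing}


records = [person("apartment")] * 5 + [person("terraced")] * 3 + [person("detached")] * 2
released = mask_sub_k(records, "housing", k=5)
```

Before masking, the smallest cell has 2 records; afterwards the 3 terraced and 2 detached records share one masked cell of 5, while the 5 apartment records are left untouched. If a cell were still too small after this step, the order given above would move on to household, then age, then buurt to wijk.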

Download the dataset

population.parquet · 11 MB · 376,770 rows
population sample (CSV) · 1,000 rows
k-anon release sample · min k=5
households.parquet · 5 MB
catchment_join.csv · per-buurt link
catchments.csv · per-catchment
quality-report.md · full metrics
pitch-deck.pdf · 9 slides

All output is synthetic — generated from public CBS/PDOK aggregates, no real person-level data. The full 376k-row CSV is reproducible from the open-source repo with synthpop run.

Reproduce it

git clone https://github.com/BreachWhite/HackAthonnie.git
cd HackAthonnie/onegov2-synthetic-data/synthpop
cargo build --release
./target/release/synthpop run --config configs/utrecht.toml

Apache-2.0 code · CC BY 4.0 data · public sources only (CBS OData, PDOK WFS). Deterministic: a fixed seed reproduces the dataset bit-for-bit.
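One way to check the bit-for-bit claim yourself is to run the pipeline twice and compare SHA-256 digests of the outputs. The helper below is a generic sketch using Python's standard library; it is not part of the repository, and the paths in the closing comment are placeholders for wherever your two runs write their files.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if both files have identical content, byte for byte."""
    return sha256_of(path_a) == sha256_of(path_b)


# Example with placeholder paths (adjust to your own output directories):
# same_bytes("run1/population.parquet", "run2/population.parquet")
```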