The problem
Why build a fake city?
To predict how a virus spreads, epidemiologists need data about individual people: their age, who they live with, what kind of home they're in, how they move around.
That data is far too sensitive to publish. What is public are neighbourhood totals from the Dutch statistics office (CBS): "this buurt has 1,880 residents, 35% aged 25–44, 60% living in apartments." Useful — but a model can't infect a percentage.
Two people we built this for
Lena, epidemiologist — needs realistic individuals to feed her outbreak model, and can load this dataset in seconds.
Daan, wastewater analyst — sees a virus spike at a treatment plant but doesn't know who lives in its catchment. "An alarm without an address."
The idea
What is a synthetic population?
It's a set of invented people that adds up to the real public totals (to within a fraction of a percent) but describes no real person. Think of it as upscaling a blurry photo: the output has more detail than the input, yet stays faithful to it.
neighbourhood totals & percentages → individual people & households
No real micro-data is used at any step — every person is generated from the public aggregates.
The method
How we build it, in five steps
Start from public CBS totals
For each of the 109 Utrecht buurten we read the published counts — age bands, household types, housing types, income, urbanicity — from CBS table 86165NED. Totals only, never individuals.
Fit a national "recipe" to each neighbourhood (IPF)
Iterative Proportional Fitting takes a national pattern of how traits combine (e.g. young singles tend to live in apartments) and rescales it so it matches each buurt's real totals — preserving the relationships, not just the headline numbers.
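The rescaling idea can be sketched in a few lines. This is a minimal two-dimensional illustration in plain Python, not the project's Rust implementation; the function name `ipf`, the 2×2 national pattern and the buurt totals are all invented for the example.

```python
def ipf(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Rescale a seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # copy so the national seed is untouched
    for _ in range(iters):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [v * target / s for v in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # The column step disturbs the row sums; stop once they are back on target.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical national pattern: rows = age (younger, older), columns = housing (apartment, house).
national = [[30.0, 10.0],
            [15.0, 45.0]]
# Hypothetical buurt totals; both margins sum to 1,880 residents.
age_totals = [1200.0, 680.0]
housing_totals = [1128.0, 752.0]

fitted = ipf(national, age_totals, housing_totals)
for row in fitted:
    print([round(v, 1) for v in row])
```

Row and column scaling leave the cross-product ratio of the table unchanged (here 30·45 / (10·15) = 9), which is one way to see that the national relationship between traits survives while the margins move to the buurt's totals.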
Build people and households one by one
We create exactly the right number of each household type, then fill them with people drawn from an age pool sized to the buurt's exact age-band counts, keeping families together and children with their parents.
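A rough sketch of that step, under strong simplifying assumptions: each household type has a fixed composition, and the age pool is already split into adults and children. The names `build_households`, `composition` and the example numbers are hypothetical and do not come from the project.

```python
import random


def build_households(household_counts, composition, age_pool, seed=42):
    """Create the requested households and fill them from a fixed age pool."""
    rng = random.Random(seed)
    # Copy and shuffle the pool, then pop from it: drawing without replacement
    # means every age in the pool is used exactly once.
    pool = {role: ages[:] for role, ages in age_pool.items()}
    for ages in pool.values():
        rng.shuffle(ages)
    households = []
    for hh_type, count in household_counts.items():
        adults, children = composition[hh_type]
        for _ in range(count):
            members = [pool["adult"].pop() for _ in range(adults)]
            # Children are only ever drawn into types whose composition includes them.
            members += [pool["child"].pop() for _ in range(children)]
            households.append({"type": hh_type, "ages": members})
    return households


# Hypothetical buurt: 5 households, 7 adults and 2 children in total.
household_counts = {"single": 3, "couple": 1, "couple_with_kids": 1}
composition = {"single": (1, 0), "couple": (2, 0), "couple_with_kids": (2, 2)}
age_pool = {"adult": [22, 27, 31, 35, 40, 52, 68], "child": [4, 9]}

households = build_households(household_counts, composition, age_pool)
for h in households:
    print(h)
```

If the pool and the household counts disagree, `pop()` raises an `IndexError`, which makes an inconsistent input fail loudly instead of silently producing the wrong number of people.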
Place each household on the map
Every household is dropped at a random point inside its real neighbourhood outline (RD-New coordinates), jittered so it can never point at a real address.
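One common way to draw a random point inside an outline is rejection sampling: pick points in the bounding box and keep the first one that passes a point-in-polygon test. The sketch below assumes a simple polygon without holes and uses an invented L-shaped outline with coordinates in metres, roughly in the RD-New range for Utrecht; whether the project samples this way is an assumption.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray casting: toggle on every edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_inside(polygon, rng, max_tries=10_000):
    """Sample uniformly in the bounding box until a point falls inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    for _ in range(max_tries):
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            return x, y
    raise RuntimeError("no interior point found; is the polygon degenerate?")


# Hypothetical L-shaped buurt outline (the top-right corner is cut out).
outline = [(136000, 455000), (136400, 455000), (136400, 455200),
           (136200, 455200), (136200, 455400), (136000, 455400)]

rng = random.Random(7)  # fixed seed so the placement is repeatable
points = [random_point_inside(outline, rng) for _ in range(200)]
print(points[:3])
```

Because the points are drawn from the outline alone and never from an address register, they carry no information about where real front doors are.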
Link to the sewers
Each neighbourhood is matched to the wastewater catchment that serves it — so a signal measured at a treatment plant can finally be read against the population behind it.
The result, visualised
Who lives in synthetic Utrecht?
These breakdowns match the CBS neighbourhood statistics. They're what makes the dataset epidemiologically useful — age and household structure drive how a respiratory virus spreads.
Age
A young, student-heavy city — the 25–44 group dominates. Older people cluster differently in space, which matters for shielding the vulnerable.
Households
Over half are single-person households — typical of a university city.
Housing type
Density (apartments) shapes contact rates — a key model input.
How good is it?
"Realistic" — but measured, not claimed
We grade the synthetic city against the real CBS totals. The error metric (weighted MAPE) is simply the share of people or households placed in the wrong category. Lower is better.
Marginal fit
Bars are tiny on purpose — that's a near-perfect match.
In plain terms: out of 194,055 households, only 74 sit in the wrong housing category. Age and household structure match the published statistics essentially exactly.
We also keep cross-domain consistency — a synthetic person's age, household and housing form a believable whole (100% of children live in households with children), not just three independent percentages.
Layer 2
A wastewater signal with an address
Sewage surveillance is a powerful early-warning system, but a spike at a treatment plant is meaningless without knowing who lives in the area it drains. We link every neighbourhood to its GWSW (Gegevenswoordenboek Stedelijk Water, the Dutch urban-water data standard) catchment and add up the synthetic population behind each one, turning an anonymous signal into demographic context.
| Catchment (Utrecht) | Synthetic people | Density (people/km²) |
|---|---|---|
| Zuilen / Ondiep | 38,235 | 9,723 |
| Overvecht | 32,400 | 5,480 |
| Baden Powellweg | 32,295 | 10,224 |
| Korte Baanstraat | 26,450 | 4,413 |
| Kanaalweg | 22,940 | 7,151 |
All 109 buurten map to one of 33 catchments; densities reflect real urban structure.
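The aggregation behind a table like this can be sketched as a lookup followed by a sum. The buurt codes, catchment names, person counts and areas below are invented, and dividing by a per-catchment area is an assumption about how the density column is derived.

```python
from collections import defaultdict


def population_by_catchment(persons_per_buurt, buurt_to_catchment, area_km2):
    """Sum synthetic persons per catchment and derive a density per km²."""
    totals = defaultdict(int)
    for buurt, persons in persons_per_buurt.items():
        # A buurt missing from the mapping raises KeyError, so an incomplete
        # buurt-to-catchment link is caught rather than silently dropped.
        totals[buurt_to_catchment[buurt]] += persons
    return {
        catchment: {"persons": n, "density_per_km2": n / area_km2[catchment]}
        for catchment, n in totals.items()
    }


# Hypothetical inputs: three buurten draining into two catchments.
persons_per_buurt = {"BU_A": 1880, "BU_B": 2400, "BU_C": 950}
buurt_to_catchment = {"BU_A": "C1", "BU_B": "C1", "BU_C": "C2"}
area_km2 = {"C1": 2.0, "C2": 0.5}

summary = population_by_catchment(persons_per_buurt, buurt_to_catchment, area_km2)
print(summary)
```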
Results — gemeente Utrecht
| Metric | Value |
|---|---|
| Synthetic persons / households | 376,770 / 194,055 |
| Buurten | 109 |
| Age-band marginal fit (WMAPE) | 0.10 % |
| Household-type marginal fit (WMAPE) | 0.000 % (exact) |
| Housing-type marginal fit (WMAPE) | 0.04 % |
| Children (0–14) in with-kids households | 100 % |
| Buurten → catchment | 109 / 109 → 33 catchments |
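WMAPE = total absolute error / total count (the challenge template's metric). Because errors are summed before dividing, tiny industrial buurten, where one misplaced person would swing a per-buurt percentage, cannot dominate the score. Full methodology and per-buurt numbers are in the quality report below.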
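As a worked illustration of the formula, the sketch below computes WMAPE over (buurt, category) cells and compares it with an unweighted mean of per-cell percentage errors. The counts are invented, and the exact cell definition used by the project's quality report is an assumption.

```python
def wmape(synthetic, target):
    """Total absolute error divided by the total target count."""
    total_abs_error = sum(abs(synthetic.get(cell, 0) - target[cell]) for cell in target)
    return total_abs_error / sum(target.values())


# Hypothetical targets: one large residential buurt and one tiny industrial one.
target = {
    ("large", "apartment"): 600, ("large", "house"): 400,
    ("tiny", "apartment"): 5, ("tiny", "house"): 5,
}
synthetic = {
    ("large", "apartment"): 598, ("large", "house"): 402,
    ("tiny", "apartment"): 6, ("tiny", "house"): 4,
}

score = wmape(synthetic, target)

# Unweighted MAPE for comparison: every cell counts equally, however small.
plain_mape = sum(abs(synthetic[c] - target[c]) / target[c] for c in target) / len(target)

print(f"WMAPE = {score:.4%}, unweighted MAPE = {plain_mape:.4%}")
```

In this toy case the total absolute error is 6 on 1,010 households, about 0.59 %, while the unweighted version jumps above 10 % purely because of the five-household cells.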
Data integrity: 14 / 14 checks passed
Every headline number is independently re-derived from the raw output files by
scripts/sanity_check.py — it trusts no precomputed metric.
Privacy: k-anonymity, min k = 5
All data is synthetic, but we ship a disclosure-control pass anyway. The quasi-identifiers (QI) are buurt · age · household · housing; income, education and migration background are treated as sensitive attributes, not QI. Any QI combination shared by fewer than k records is generalised locally, masking housing → household → age and finally coarsening buurt → wijk, and only the records in those sub-k cells are touched.
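The ladder can be sketched as a loop that recounts QI cells after each masking step. This is an illustration only: the function names are hypothetical, the final buurt → wijk step is left out, and the reading of `qi_level` as "how many rungs of the ladder were applied to this record" is an assumption.

```python
from collections import Counter

QI = ("buurt", "age", "household", "housing")
MASK_ORDER = ("housing", "household", "age")  # buurt -> wijk is omitted in this sketch


def cell(record):
    """The QI combination a record belongs to."""
    return tuple(record[q] for q in QI)


def min_cell_size(records):
    return min(Counter(cell(r) for r in records).values())


def k_anonymise(records, k=5):
    """Mask one more field per round, but only for records in cells smaller than k."""
    out = [dict(r, qi_level=0) for r in records]
    for level, field in enumerate(MASK_ORDER, start=1):
        sizes = Counter(cell(r) for r in out)
        small = [r for r in out if sizes[cell(r)] < k]
        if not small:
            break
        for r in small:
            r[field] = "*"
            r["qi_level"] = level
    return out  # may still hold sub-k cells that need the buurt -> wijk step


# Hypothetical records: one cell of 5, one of 3 and one of 2.
base = {"buurt": "B1", "age": "25-44"}
records = (
    [dict(base, household="single", housing="apartment")] * 5
    + [dict(base, household="single", housing="house")] * 3
    + [dict(base, household="couple", housing="house")] * 2
)

safe = k_anonymise(records, k=5)
print(min_cell_size(records), "->", min_cell_size(safe))
```

In this example the two small cells only merge into a cell of five after both housing and household are masked, while the five apartment records are left untouched.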
population_kanon.csv with a qi_level column
Download the dataset
All output is synthetic — generated from public CBS/PDOK aggregates, no real
person-level data. The full 376k-row CSV is reproducible from the
open-source repo with synthpop run.
Reproduce it
    git clone https://github.com/BreachWhite/HackAthonnie.git
    cd HackAthonnie/onegov2-synthetic-data/synthpop
    cargo build --release
    ./target/release/synthpop run --config configs/utrecht.toml
Apache-2.0 code · CC BY 4.0 data · public sources only (CBS OData, PDOK WFS). Deterministic: a fixed seed reproduces the dataset bit-for-bit.