SyntechD

A whole synthetic city, built from public statistics.

When a new disease appears, models need to know who lives where, but that personal data can't be shared for privacy reasons. SyntechD invents a realistic population that matches the public neighbourhood statistics almost exactly, so researchers can model outbreaks without ever touching real people's data.

OneGov #2 · Synthetische Data · Team BringOn · Built in Rust · 100% synthetic data
376,770 synthetic people
194,055 households
109 neighbourhoods (buurten)
33 wastewater catchments

The problem

Why build a fake city?

To predict how a virus spreads, epidemiologists need data about individual people: their age, who they live with, what kind of home they're in, how they move around.

That data is far too sensitive to publish. What is public are neighbourhood totals from the Dutch statistics office (CBS): "this buurt has 1,880 residents, 35% aged 25–44, 60% living in apartments." Useful — but a model can't infect a percentage.

Two people we built this for

Lena, epidemiologist — needs realistic individuals to feed her outbreak model, and can load this dataset in seconds.

Daan, wastewater analyst — sees a virus spike at a treatment plant but doesn't know who lives in its catchment. "An alarm without an address."

The idea

What is a synthetic population?

It's a set of invented people that adds up to the real public totals but describes no real person. Think of it as upscaling a blurry photo: the output has more detail than the input, yet stays faithful to it.

Public input: neighbourhood totals & percentages
Synthetic output: individual people & households

No real micro-data is used at any step — every person is generated from the public aggregates.

The method

How we build it, in five steps

1. Start from public CBS totals

For each of the 109 Utrecht buurten we read the published counts — age bands, household types, housing types, income, urbanicity — from CBS table 86165NED. Totals only, never individuals.

2. Fit a national "recipe" to each neighbourhood (IPF)

Iterative Proportional Fitting takes a national pattern of how traits combine (e.g. young singles tend to live in apartments) and rescales it so it matches each buurt's real totals — preserving the relationships, not just the headline numbers.
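To make the rescaling concrete, here is a minimal two-dimensional IPF sketch in plain Python. It is an illustration only, not the project's Rust implementation: the function name `ipf`, the 2×2 "national" table and the buurt totals are all made-up assumptions, and the real pipeline fits more dimensions than age and housing.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy, leave the seed untouched
    for _ in range(max_iter):
        # Step 1: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            if row_sum > 0:
                table[i] = [cell * target / row_sum for cell in table[i]]
        # Step 2: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                for row in table:
                    row[j] *= target / col_sum
        # Converged once the rows still match after the column step.
        worst = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if worst < tol:
            break
    return table


# Hypothetical national pattern: rows = age group, columns = housing type.
national = [[30.0, 10.0],   # younger: mostly apartments
            [15.0, 45.0]]   # older: mostly houses
# Hypothetical buurt totals; both sets must add up to the same grand total.
age_totals = [600.0, 400.0]
housing_totals = [700.0, 300.0]

fitted = ipf(national, age_totals, housing_totals)
for row in fitted:
    print([round(cell, 1) for cell in row])
```

The fitted table reproduces both sets of buurt totals while keeping the seed's odds ratio (here 9), which is the sense in which IPF preserves the relationships between traits rather than only the headline numbers.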

3. Build people and households one by one

We create exactly the right number of each household type, then fill them with people drawn from an exact age pool — keeping families together and children with their parents.
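The sketch below shows one simple way such a step could work. It is a deliberately simplified stand-in, not the project's algorithm: the function `build_households`, the household type labels and the round-robin dealing are assumptions made for illustration, and it presumes there are at least as many adults as households.

```python
def build_households(household_counts, age_pool):
    """Create households by type, then deal people from the age pool into them.

    household_counts: e.g. {"single": 2, "no_kids": 1, "with_kids": 1}
    age_pool: one age-band label per person to place.
    """
    households = [{"type": t, "members": []}
                  for t, n in household_counts.items() for _ in range(n)]
    adults = [a for a in age_pool if a != "0-14"]
    children = [a for a in age_pool if a == "0-14"]
    if len(adults) < len(households):
        raise ValueError("need at least one adult per household")

    # Every household gets one adult first, so none is left empty.
    for hh in households:
        hh["members"].append(adults.pop())
    # Remaining adults go round-robin to the multi-person households.
    shared = [hh for hh in households if hh["type"] != "single"]
    for i, adult in enumerate(adults):
        shared[i % len(shared)]["members"].append(adult)
    # Children are only ever placed in with-kids households.
    with_kids = [hh for hh in households if hh["type"] == "with_kids"]
    for i, child in enumerate(children):
        with_kids[i % len(with_kids)]["members"].append(child)
    return households


# Hypothetical pool for a tiny buurt: six adults and two children.
age_pool = ["25-44"] * 4 + ["65+"] * 2 + ["0-14"] * 2
households = build_households({"single": 2, "no_kids": 1, "with_kids": 1}, age_pool)
for hh in households:
    print(hh["type"], hh["members"])
```

Because every person in the pool is placed exactly once, the age totals of the pool carry over unchanged to the finished households.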

4. Place each household on the map

Every household is dropped at a random point inside its real neighbourhood outline (RD-New coordinates), jittered so it can never point at a real address.
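A common way to draw a random point inside a polygon is rejection sampling: pick a point in the bounding box and keep it only if a point-in-polygon test passes. The sketch below assumes that approach; the outline coordinates are invented, and the project's own Rust placement code may differ.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_in_polygon(polygon, rng):
    """Rejection sampling: draw from the bounding box until a point is inside."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    while True:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            return x, y


# Hypothetical L-shaped buurt outline in RD-New metres (x = easting, y = northing).
outline = [(136000, 455000), (136400, 455000), (136400, 455200),
           (136200, 455200), (136200, 455400), (136000, 455400)]
rng = random.Random(42)  # fixed seed, so the placement is reproducible
points = [random_point_in_polygon(outline, rng) for _ in range(100)]
```

Sampling uniformly over the outline, rather than snapping to buildings, is what keeps a placed household from corresponding to a real address.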

5. Link to the sewers

Each neighbourhood is matched to the wastewater catchment that serves it — so a signal measured at a treatment plant can finally be read against the population behind it.

The result, visualised

Who lives in synthetic Utrecht?

These breakdowns match the CBS neighbourhood statistics. They're what makes the dataset epidemiologically useful — age and household structure drive how a respiratory virus spreads.

Age

0–14: 58,640 · 15.6%
15–24: 57,775 · 15.3%
25–44: 138,765 · 36.8%
45–64: 80,130 · 21.3%
65+: 41,460 · 11.0%

A young, student-heavy city: the 25–44 group is the largest, at 36.8%. Older people cluster differently in space, which matters for shielding the vulnerable.

Households

Single: 51.6%
No kids: 22.4%
With kids: 26.0%

Over half are single-person households — typical of a university city.

Housing type

Apartment: 60.1%
Terraced: 37.0%
Detached: 1.4%
Other: 1.5%

Density (apartments) shapes contact rates — a key model input.

How good is it?

"Realistic" — but measured, not claimed

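We grade the synthetic city against the real CBS totals. The error metric, weighted MAPE (WMAPE), is the total absolute difference between synthetic and CBS category counts, divided by the CBS total. Read it as the share of people or households that miss their category. Lower is better.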

Marginal fit

Age: 0.10%
Household: 0.00% (exact)
Housing: 0.04%

Bars are tiny on purpose — that's a near-perfect match.

In plain terms: out of 194,055 households, only 74 sit in the wrong housing category. Age and household structure match the published statistics essentially exactly.
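As a small worked example of the metric, the sketch below computes WMAPE as total absolute error divided by the target total. The function name `wmape` and the per-category counts are hypothetical; only the 74 and 194,055 figures come from the text above.

```python
def wmape(synthetic, target):
    """Weighted MAPE: total absolute error divided by the target total."""
    total_error = sum(abs(s - t) for s, t in zip(synthetic, target))
    return total_error / sum(target)


# Hypothetical housing counts for one area: apartment, terraced, detached, other.
cbs_counts = [600, 370, 15, 15]
synthetic_counts = [597, 371, 16, 16]
print(f"{wmape(synthetic_counts, cbs_counts):.2%}")  # 6 / 1000 -> 0.60%

# The headline housing figure: 74 out of 194,055 households.
print(f"{74 / 194_055:.2%}")  # 0.04%
```

Dividing by the overall total, instead of averaging per-category percentages, keeps a category with a handful of units from dominating the score.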

We also keep cross-domain consistency — a synthetic person's age, household and housing form a believable whole (100% of children live in households with children), not just three independent percentages.

Layer 2

A wastewater signal with an address

Sewage surveillance is a powerful early-warning system, but a spike at a treatment plant is meaningless without knowing who it drains. We link every neighbourhood to its GWSW catchment and add up the synthetic population behind each one — turning an anonymous signal into demographic context.

Catchment (Utrecht) | Synthetic people | Density (people/km²)
--- | --- | ---
Zuilen / Ondiep | 38,235 | 9,723
Overvecht | 32,400 | 5,480
Baden Powellweg | 32,295 | 10,224
Korte Baanstraat | 26,450 | 4,413
Kanaalweg | 22,940 | 7,151

Each of the 109 buurten maps to one of 33 catchments; densities reflect real urban structure.
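The roll-up from buurt to catchment is a grouped sum followed by a division by area. A minimal sketch, assuming a simple one-buurt-to-one-catchment lookup; the buurt codes, populations, catchment names and areas below are invented for illustration.

```python
from collections import defaultdict


def catchment_summary(buurt_population, buurt_to_catchment, catchment_area_km2):
    """Sum buurt populations per catchment and derive people per km²."""
    population = defaultdict(int)
    for buurt, people in buurt_population.items():
        population[buurt_to_catchment[buurt]] += people
    return {
        name: {"people": people, "density": people / catchment_area_km2[name]}
        for name, people in population.items()
    }


# Hypothetical inputs, not taken from the dataset.
buurt_population = {"BU0001": 1880, "BU0002": 2400, "BU0003": 950}
buurt_to_catchment = {"BU0001": "A", "BU0002": "A", "BU0003": "B"}
catchment_area_km2 = {"A": 0.8, "B": 0.5}

summary = catchment_summary(buurt_population, buurt_to_catchment, catchment_area_km2)
for name, row in summary.items():
    print(name, row["people"], round(row["density"]))
```

Since every buurt contributes to exactly one catchment, the catchment populations add back up to the city total, which is the reconciliation the integrity checks rely on.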

Results — gemeente Utrecht

Metric | Value
--- | ---
Synthetic persons / households | 376,770 / 194,055
Buurten | 109
Age-band marginal fit (WMAPE) | 0.10%
Household-type marginal fit (WMAPE) | 0.000% (exact)
Housing-type marginal fit (WMAPE) | 0.04%
Children (0–14) in with-kids households | 100%
Buurten → catchment | 109 / 109 → 33 catchments

WMAPE = total absolute error / total (the challenge template's metric), robust to tiny industrial buurten. Full methodology and per-buurt numbers in the quality report below.

Data integrity · 14 / 14 checks passed

Every headline number is independently re-derived from the raw output files by scripts/sanity_check.py, which trusts no precomputed metric. The checks include:

persons == CBS total population (376,770)
household sizes sum exactly to persons
no orphan person→household references
person/household attributes consistent
children (0–14) only in with-kids households
no empty households, no null fields
coordinates within NL RD-New bounds
all 109 buurten mapped to a catchment
catchment populations reconcile to the total
catchment density = population / area
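A few of these invariants can be expressed in a handful of lines. The sketch below is not the contents of scripts/sanity_check.py; the function `check_integrity` and the field names (`household_id`, `size`, `type`, `age_band`) are assumptions chosen for the example.

```python
def check_integrity(persons, households):
    """Return {check name: passed?} for a few structural invariants."""
    household_ids = {h["household_id"] for h in households}
    type_by_id = {h["household_id"]: h["type"] for h in households}
    # Count members per household directly from the person records.
    members = {hid: 0 for hid in household_ids}
    for p in persons:
        if p["household_id"] in members:
            members[p["household_id"]] += 1
    return {
        "no orphan references": all(p["household_id"] in household_ids for p in persons),
        "sizes sum to persons": sum(h["size"] for h in households) == len(persons),
        "sizes match members": all(h["size"] == members[h["household_id"]] for h in households),
        "no empty households": all(count > 0 for count in members.values()),
        "children only with kids": all(
            type_by_id.get(p["household_id"]) == "with_kids"
            for p in persons if p["age_band"] == "0-14"
        ),
    }


# Hypothetical miniature dataset: one single household, one family of two.
households = [
    {"household_id": 1, "type": "single", "size": 1},
    {"household_id": 2, "type": "with_kids", "size": 2},
]
persons = [
    {"person_id": 1, "household_id": 1, "age_band": "25-44"},
    {"person_id": 2, "household_id": 2, "age_band": "25-44"},
    {"person_id": 3, "household_id": 2, "age_band": "0-14"},
]
results = check_integrity(persons, households)
for name, passed in results.items():
    print("PASS" if passed else "FAIL", name)
```

Recounting members from the person records, rather than trusting the stored household size, mirrors the idea of re-deriving every number from the raw output.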

Privacy · k-anonymity: min k = 5

All data is synthetic, but we ship a disclosure-control pass anyway. The quasi-identifiers (QI) are buurt · age · household · housing; income, education and migration background are treated as sensitive attributes, not QI. A k-anonymity suppression pass applies local generalisation only to the records in cells smaller than k, masking in the order housing → household → age, and finally coarsening buurt to wijk.

achieved minimum k = 5 (every QI cell ≥ 5)
only 0.08 % of records generalised
released as population_kanon.csv with a qi_level column
coordinates jittered inside the buurt polygon — never a real address
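The sketch below illustrates the general idea of local generalisation over those four quasi-identifiers. It is an assumption-laden toy, not the released pass: the functions `min_k` and `generalise_sub_k`, the "*" mask value and the meaning given to `qi_level` (number of masking steps applied) are illustrative, and the final buurt-to-wijk step is left out because it needs a lookup table.

```python
from collections import Counter

QI = ("buurt", "age", "household", "housing")  # quasi-identifiers named in the text


def min_k(records):
    """Size of the smallest quasi-identifier cell."""
    cells = Counter(tuple(r[q] for q in QI) for r in records)
    return min(cells.values())


def generalise_sub_k(records, k, order=("housing", "household", "age")):
    """Mask one more attribute per pass, only for records in cells smaller than k."""
    records = [dict(r, qi_level=0) for r in records]  # copy, keep the input intact
    for level, attribute in enumerate(order, start=1):
        cells = Counter(tuple(r[q] for q in QI) for r in records)
        for r in records:
            if cells[tuple(r[q] for q in QI)] < k:
                r[attribute] = "*"
                r["qi_level"] = level
    return records


# Hypothetical records: one buurt, one age band, three housing types.
base = {"buurt": "BU0001", "age": "25-44", "household": "single"}
records = ([dict(base, housing="apartment") for _ in range(5)]
           + [dict(base, housing="detached") for _ in range(3)]
           + [dict(base, housing="terraced") for _ in range(2)])

released = generalise_sub_k(records, k=5)
print(min_k(records), "->", min_k(released))  # 2 -> 5
```

Here only the five records in the two small cells lose their housing type; the five apartment records are released unchanged, which is why local generalisation touches so few rows.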

Download the dataset

All output is synthetic: generated from public CBS/PDOK aggregates, with no real person-level data. The full 376k-row CSV can be reproduced from the open-source repo with the synthpop run command.

Reproduce it

git clone https://github.com/BreachWhite/HackAthonnie.git
cd HackAthonnie/onegov2-synthetic-data/synthpop
cargo build --release
./target/release/synthpop run --config configs/utrecht.toml

Apache-2.0 code · CC BY 4.0 data · public sources only (CBS OData, PDOK WFS). Deterministic: a fixed seed reproduces the dataset bit-for-bit.