[[4]{.chapter-number}  [Data Preparation]{.chapter-title}]{#sec-data-prep .quarto-section-identifier}

4 Data Preparation

Note

This phase corresponds to scripts 00b_*, 01_* and 02_* in the pipeline.

4.1 Study Area

00b_data_load.R sets up globally-used variables (study area boundaries, grid, ids correspondence between levels, etc.) that are reused across scripts.

To run for different study areas, this script must be updated to reflect the desired boundaries.

4.2 Administrative Boundaries

Source: CAOP 2024.1 (DGT 2024)

The full continental Portugal polygon file is filtered to the Grande Lisboa and Península de Setúbal NUTS-II sub-regions. For the municipality of Lisboa, parish boundaries with the Tagus estuary are replaced by clipped geometries (FREGUESIASgeo.gpkg - using CAOP 2016 parish limits) to remove the water area in Lisbon municipality (Figure 4.1).

For parishes with multiple polygon features (due to small islands), only the polygon with the largest area is retained.

4.2.1 Conversion of 2016 to 2024 Parish Codes

The Census 2021 and IMOB 2017 travel survey use 2016 parish codes (DICOFRE). Several parishes were split between 2016 and 2024. A manual conversion table was created after inspecting the differences:

Table 4.1: Parish code conversions from 2016 to 2024

Old DICOFRE (2016)	New DICOFRE(s) (2024)
151007	151008, 151009, 151010
111127	111134, 111135
111123	111129, 111131, 111132
111126	111130, 111133

Trips associated with split parishes were redistributed proportionally to the area of the new parishes. Conservation of total trip counts was validated using assertthat.

4.3 H3 Hexagonal Grid

flowchart LR
    AML["AML boundary<br/>CAOP 2024"] --> H3["h3jsr::polygon_to_cells<br/>Resolution 8"]
    H3 --> Cells["3,686 hexagons<br/>531m edge, 0.74 km2 avg"]
    Cells --> C["Grid centroids<br/>for routing"]
    Cells --> G["Grid index table<br/>mapping levels"]

Figure 4.3: H3 grid generation and spatial assignment

The H3 hexagonal grid (Uber Technologies 2018) is generated at Resolution 8 using the h3jsr package (O’Brien 2023). Resolution 8 (example at Figure 4.4) provides a spatial unit with an average edge length of ~531 m and average area of ~0.74 km², which offers a good balance between spatial detail and computational tractability for routing-based accessibility measures.

Figure 4.4: Example of h3_grid at resolution 8 at Lisbon, from h3geo.org

The H3 grid covers the AML with 3,686 active hexagons (those intersecting the administrative boundaries, see Figure 4.5). At the limits, if a hexagon covers less than 50% of the administrative area, it is excluded from the analysis. Each hexagon is assigned to a parish and municipality via a cross-reference table (grid_nuts.csv).

Figure 4.5: H3 grid at resolution 8 intersecting the administrative boundaries of the Lisbon Metropolitan Area

4.4 Census Data Processing

Source: INE Census 2021, BGRI 2021 at the base statistical unit level (INE 2021a), NUTS-3 Área Metropolitana de Lisboa (code 170).

Census BGRI polygons are converted to centroids (Figure 4.6). The following variables are extracted:

Table 4.2: Census variables used in IMPT

Variable	INE field	Description
`population`	`N_INDIVIDUOS`	Total residents
`households`	`N_NUCLEOS_FAMILIARES`	Total households
`youth`	`N_INDIVIDUOS_0_14`	Residents aged 0–14
`adults`	`N_INDIVIDUOS_15_24` + `N_INDIVIDUOS_25_64`	Residents aged 15–64
`elderly`	`N_INDIVIDUOS_65_OU_MAIS`	Residents aged 65+
`women`	`N_INDIVIDUOS_M`	Female residents
`buildings`	`N_EDIFICIOS_CLASSICOS`	Classic buildings
`buildings_pre1945`	`N_EDIFICIOS_CONSTR_ANTES_1945`	Buildings built before 1945

For parish- and municipality-level aggregation, census data from 2016-era parishes are redistributed to 2024 parishes using the area-weighted conversion table described in Section 4.2.1.

Figure 4.6: Example of census BGRI polygons and centroids in a neighborhood of Lisbon

4.5 Dasymetric Grid Population Redistribution (COS 2023)

Source: COS 2023 v1 (DGT 2023), residential land use classes

To achieve a more accurate population distribution within the H3 grid, a dasymetric mapping approach is applied using the Carta de Ocupação do Solo (COS) 2023 (example in Figure 4.7). This is a relevant method, in particular when dealing with BGRI polygons that cover large areas, across multiple hexagons cells - preventing that one single hexagon is assigned the entire population of the BGRI polygon with it’s centroid.

Residential areas are extracted with the following classes and assigned weighting factors:

Table 4.3: COS 2023 residential classes and dasymetric weights

COS 2023 classification	Weight
Contínuas predominantemente verticais	3.0
Contínuas predominantemente horizontais	1.0
Descontínuas	0.6
Descontínuas esparsas	0.3

Figure 4.7: Example of COS 2023 residential classes at Almada

The procedure (02_census_grid_with_cos.R):

Census BGRI polygons are intersected with COS residential patches and H3 grid cells.
For each BGRI fragment, a weighted score is computed as fragment_area × weight.
Each BGRI’s population is distributed proportionally to the weighted score of each fragment.
Results are summed per H3 cell.

This recovers approximately 99.4% of the total AML population (2,854,660 vs. 2,870,208 in the census), a significant improvement over simple centroid assignment.

Important

The land use and census data from the dasymetric method is only used at the grid level for the calculation of fine-grain indicators. For the other levels, the Census 2021 BGRI data is used directly.

4.6 Points of Interest (POIs)

POIs are collected from three sources: OpenStreetMap (OpenStreetMap 2026), Carris Metropolitana open datasets (Carris Metropolitana 2026), and GTFS feeds. The following categories are used:

Table 4.4: POI categories, sources, and filters

Category	Identifier	Source	Filter / Description
Healthcare (all)	health	CM	All health centres
Hospitals	health_hospital	CM	name contains 'Hospital' or 'Centro de Medicina'
Primary care centres	health_primary	CM	name contains 'Centro de Saúde'
Supermarkets/grocery	groceries	OSM	`type %in% c('supermarket','convenience')`
Green spaces	greenspaces	OSM	`group == 'leisure' & type %in% c('park','garden')`
Recreation	recreation	OSM	`type %in% c('library','theatre','cinema','restaurant','cafe',` `'bar','pub','pitch','fitness_station',` `'swimming_pool','sports_centre','fitness_center')`
Schools (all)	schools	CM	All school types
Primary schools	schools_primary	CM	`pre_school==1 \| basic_1==1 \| basic_2==1 \| basic_3==1`
Transit stops (all)	transit	GTFS (all operators)	All stops from all operators
Transit stops (bus)	transit_bus	GTFS (Carris, CM, TCB, Cascais próxima)	Bus operators only
Transit stops (mass transit)	transit_mass	GTFS (CP, Fertagus, Metro, MTS, TTSL)	Rail, metro, ferry operators
Jobs	jobs	IMOB 2018	Jittered OD destinations (see #sec-jittering)

4.7 Travel Survey Data and OD Jittering

Source: IMOB 2017/18 (INE 2018)

The IMOB survey served as base to produce the OD matrix at parish level with modal split (walk, bike, car, public transit, other), available at biclaR project.

4.7.1 Jobs as a proxy

Since high-resolution employment data was not available (or open) for the AML during the report development, employment locations were proxied using the jittered destinations of trips from the IMOB 2017 travel survey. Specifically, trips with the purpose “Ir para o trabalho” (Commute to work) and “Tratar de assuntos profissionais” (Work-related business) were extracted, and their jittered destination points were used as POIs representing the spatial distribution of jobs.

Since accessibility and cost computations require disaggregated spatial locations, jittering is applied to generate point-level OD pairs.

Alternative data source: Ministério do Trabalho

TML has reported to use Ministério do Trabalho dataset on jobs for the Metropolitan Sustainable Urban Mobility Plan. However, they only kept them aggregated at the grid level, using a custom grid that is not compatible with the H3 grid used in this project. For future versions of IMPT, a request can be made to Ministério do Trabalho for the original data.

4.7.2 Jittering Algorithm

Jittering is performed using the odjitter R package (Carlino and Lovelace 2022), following the approach described in Lovelace, Félix, and Carlino (2022).

The algorithm disaggregates parish-level trip counts into point-to-point flows. The following layers and weights are used:

Origins: Randomly sampled points within residential buildings.
Destinations: Sampled from a building layer (from GDA), weighted by building volume (estimated as height * footprint_m2). This ensures that areas with higher office and industrial density attract more jittered trips.
Threshold: A disaggregation threshold of 50 trips is applied, meaning any OD flow with more than 50 trips is split into multiple point-to-point pairs to provide higher spatial resolution.
Deterministic Randomization: A fixed seed (rng_seed = 42) is used to ensure the generated points are consistent across pipeline runs.

The process, illustrated in Figure 4.9, generates approximately 18,500 jittered OD pairs representing the daily commuting pattern in the AML. This is then expanded to the real number of trips using the PESOFIN weights.

4.8 Road Network and DEM

Road network: Exported from HOT Export Tool (HOT 2026) in .pbf format for use with r5r (Pereira et al. 2021). Main network shown in Figure 4.10. The network includes all OSM road types and GTFS transit feeds.

Elevation: Digital Elevation Model at 30 m resolution (Figure 4.11) downloaded from Copernicus ESA (EU 2026). Used to account for topographic effects on walking and cycling travel times via the Minetti metabolic-equivalent cost function.

The r5r network is built using r5r::build_network() with the elevation = "MINETTI" option.

4.10 Vehicle Ownership

Source: IMOB 2017 data (INE 2018)

Vehicle ownership variables include average motorised vehicles per household (Figure 4.13) and percentage of households without a vehicle (Figure 4.14), aggregated at parish and municipality level. Calculations performed in 02_veh_ownership.R.

Figure 4.13: Average motorised vehicles per household

Figure 4.14: Percentage of households without a vehicle