4  Data Preparation

Note

This phase corresponds to scripts 00b_*, 01_* and 02_* in the pipeline.

4.1 Study Area

00b_data_load.R sets up globally-used variables (study area boundaries, grid, ids correspondence between levels, etc.) that are reused across scripts.

To run for different study areas, this script must be updated to reflect the desired boundaries.

4.2 Administrative Boundaries

Source: CAOP 2024.1 (DGT 2024)

The full continental Portugal polygon file is filtered to the Grande Lisboa and Península de Setúbal NUTS-II sub-regions. For the municipality of Lisboa, parish boundaries with the Tagus estuary are replaced by clipped geometries (FREGUESIASgeo.gpkg - using CAOP 2016 parish limits) to remove the water area in Lisbon municipality (Figure 4.1).

CAOP 2024.1 original geometry

Lisbon geomerty adjusted
Figure 4.1: Lisbon geometry administrative boundaries adjustment

For parishes with multiple polygon features (due to small islands), only the polygon with the largest area is retained.

4.2.1 Conversion of 2016 to 2024 Parish Codes

The Census 2021 and IMOB 2017 travel survey use 2016 parish codes (DICOFRE). Several parishes were split between 2016 and 2024. A manual conversion table was created after inspecting the differences:

Table 4.1: Parish code conversions from 2016 to 2024
Old DICOFRE (2016) New DICOFRE(s) (2024)
151007 151008, 151009, 151010
111127 111134, 111135
111123 111129, 111131, 111132
111126 111130, 111133

Trips associated with split parishes were redistributed proportionally to the area of the new parishes. Conservation of total trip counts was validated using assertthat.

Original 2016 parish boundaries

2024 parish boundaries
Figure 4.2: Example of Seixal parish split

4.3 H3 Hexagonal Grid

flowchart LR
    AML["AML boundary<br/>CAOP 2024"] --> H3["h3jsr::polygon_to_cells<br/>Resolution 8"]
    H3 --> Cells["3,686 hexagons<br/>531m edge, 0.74 km2 avg"]
    Cells --> C["Grid centroids<br/>for routing"]
    Cells --> G["Grid index table<br/>mapping levels"]
Figure 4.3: H3 grid generation and spatial assignment

The H3 hexagonal grid (Uber Technologies 2018) is generated at Resolution 8 using the h3jsr package (O’Brien 2023). Resolution 8 (example at Figure 4.4) provides a spatial unit with an average edge length of ~531 m and average area of ~0.74 km², which offers a good balance between spatial detail and computational tractability for routing-based accessibility measures.

Figure 4.4: Example of h3_grid at resolution 8 at Lisbon, from h3geo.org

The H3 grid covers the AML with 3,686 active hexagons (those intersecting the administrative boundaries, see Figure 4.5). At the limits, if a hexagon covers less than 50% of the administrative area, it is excluded from the analysis. Each hexagon is assigned to a parish and municipality via a cross-reference table (grid_nuts.csv).

Figure 4.5: H3 grid at resolution 8 intersecting the administrative boundaries of the Lisbon Metropolitan Area

4.4 Census Data Processing

Source: INE Census 2021, BGRI 2021 at the base statistical unit level (INE 2021a), NUTS-3 Área Metropolitana de Lisboa (code 170).

Census BGRI polygons are converted to centroids (Figure 4.6). The following variables are extracted:

Table 4.2: Census variables used in IMPT
Variable INE field Description
population N_INDIVIDUOS Total residents
households N_NUCLEOS_FAMILIARES Total households
youth N_INDIVIDUOS_0_14 Residents aged 0–14
adults N_INDIVIDUOS_15_24 + N_INDIVIDUOS_25_64 Residents aged 15–64
elderly N_INDIVIDUOS_65_OU_MAIS Residents aged 65+
women N_INDIVIDUOS_M Female residents
buildings N_EDIFICIOS_CLASSICOS Classic buildings
buildings_pre1945 N_EDIFICIOS_CONSTR_ANTES_1945 Buildings built before 1945

For parish- and municipality-level aggregation, census data from 2016-era parishes are redistributed to 2024 parishes using the area-weighted conversion table described in Section 4.2.1.

Figure 4.6: Example of census BGRI polygons and centroids in a neighborhood of Lisbon

4.5 Dasymetric Grid Population Redistribution (COS 2023)

Source: COS 2023 v1 (DGT 2023), residential land use classes

To achieve a more accurate population distribution within the H3 grid, a dasymetric mapping approach is applied using the Carta de Ocupação do Solo (COS) 2023 (example in Figure 4.7). This is a relevant method, in particular when dealing with BGRI polygons that cover large areas, across multiple hexagons cells - preventing that one single hexagon is assigned the entire population of the BGRI polygon with it’s centroid.

Residential areas are extracted with the following classes and assigned weighting factors:

Table 4.3: COS 2023 residential classes and dasymetric weights
COS 2023 classification Weight
Contínuas predominantemente verticais 3.0
Contínuas predominantemente horizontais 1.0
Descontínuas 0.6
Descontínuas esparsas 0.3
Figure 4.7: Example of COS 2023 residential classes at Almada

The procedure (02_census_grid_with_cos.R):

  1. Census BGRI polygons are intersected with COS residential patches and H3 grid cells.
  2. For each BGRI fragment, a weighted score is computed as fragment_area × weight.
  3. Each BGRI’s population is distributed proportionally to the weighted score of each fragment.
  4. Results are summed per H3 cell.

This recovers approximately 99.4% of the total AML population (2,854,660 vs. 2,870,208 in the census), a significant improvement over simple centroid assignment.

Important

The land use and census data from the dasymetric method is only used at the grid level for the calculation of fine-grain indicators. For the other levels, the Census 2021 BGRI data is used directly.

4.6 Points of Interest (POIs)

POIs are collected from three sources: OpenStreetMap (OpenStreetMap 2026), Carris Metropolitana open datasets (Carris Metropolitana 2026), and GTFS feeds. The following categories are used:

Table 4.4: POI categories, sources, and filters
Category Identifier Source Filter / Description
Healthcare (all) health CM All health centres
Hospitals health_hospital CM name contains 'Hospital' or 'Centro de Medicina'
Primary care centres health_primary CM name contains 'Centro de Saúde'
Supermarkets/grocery groceries OSM type %in% c('supermarket','convenience')
Green spaces greenspaces OSM group == 'leisure' & type %in% c('park','garden')
Recreation recreation OSM type %in% c('library','theatre','cinema','restaurant','cafe',
'bar','pub','pitch','fitness_station',
'swimming_pool','sports_centre','fitness_center')
Schools (all) schools CM All school types
Primary schools schools_primary CM pre_school==1 | basic_1==1 | basic_2==1 | basic_3==1
Transit stops (all) transit GTFS (all operators) All stops from all operators
Transit stops (bus) transit_bus GTFS (Carris, CM, TCB, Cascais próxima) Bus operators only
Transit stops (mass transit) transit_mass GTFS (CP, Fertagus, Metro, MTS, TTSL) Rail, metro, ferry operators
Jobs jobs IMOB 2018 Jittered OD destinations (see #sec-jittering)
Figure 4.8: Health POIs distribution

4.7 Travel Survey Data and OD Jittering

Source: IMOB 2017/18 (INE 2018)

The IMOB survey served as base to produce the OD matrix at parish level with modal split (walk, bike, car, public transit, other), available at biclaR project.

4.7.1 Jobs as a proxy

Since high-resolution employment data was not available (or open) for the AML during the report development, employment locations were proxied using the jittered destinations of trips from the IMOB 2017 travel survey. Specifically, trips with the purpose “Ir para o trabalho” (Commute to work) and “Tratar de assuntos profissionais” (Work-related business) were extracted, and their jittered destination points were used as POIs representing the spatial distribution of jobs.

Since accessibility and cost computations require disaggregated spatial locations, jittering is applied to generate point-level OD pairs.

TML has reported to use Ministério do Trabalho dataset on jobs for the Metropolitan Sustainable Urban Mobility Plan. However, they only kept them aggregated at the grid level, using a custom grid that is not compatible with the H3 grid used in this project. For future versions of IMPT, a request can be made to Ministério do Trabalho for the original data.

4.7.2 Jittering Algorithm

Jittering is performed using the odjitter R package (Carlino and Lovelace 2022), following the approach described in Lovelace, Félix, and Carlino (2022).

The algorithm disaggregates parish-level trip counts into point-to-point flows. The following layers and weights are used:

  • Origins: Randomly sampled points within residential buildings.
  • Destinations: Sampled from a building layer (from GDA), weighted by building volume (estimated as height * footprint_m2). This ensures that areas with higher office and industrial density attract more jittered trips.
  • Threshold: A disaggregation threshold of 50 trips is applied, meaning any OD flow with more than 50 trips is split into multiple point-to-point pairs to provide higher spatial resolution.
  • Deterministic Randomization: A fixed seed (rng_seed = 42) is used to ensure the generated points are consistent across pipeline runs.

The process, illustrated in Figure 4.9, generates approximately 18,500 jittered OD pairs representing the daily commuting pattern in the AML. This is then expanded to the real number of trips using the PESOFIN weights.

Residential buildings

Jittered jobs
Figure 4.9: Jittering algorithm for the Areeiro parish, in Lisbon

4.8 Road Network and DEM

Road network: Exported from HOT Export Tool (HOT 2026) in .pbf format for use with r5r (Pereira et al. 2021). Main network shown in Figure 4.10. The network includes all OSM road types and GTFS transit feeds.

Elevation: Digital Elevation Model at 30 m resolution (Figure 4.11) downloaded from Copernicus ESA (EU 2026). Used to account for topographic effects on walking and cycling travel times via the Minetti metabolic-equivalent cost function.

The r5r network is built using r5r::build_network() with the elevation = "MINETTI" option.

Figure 4.10: Main road network
Figure 4.11: Digital Elevation Model

4.9 Modal Share from Census

Source: Census 2021 (INE 2021b)

Script 02_census_modalShare.R extracts and computes modal shares (private vehicle, public transport, active modes) at parish (Figure 4.12) and municipality level. These are used to compute modal-share-weighted affordability metrics (see Chapter 8).

Public transport

Car

Walk

Bike
Figure 4.12: Census modal share for different modes

4.10 Vehicle Ownership

Source: IMOB 2017 data (INE 2018)

Vehicle ownership variables include average motorised vehicles per household (Figure 4.13) and percentage of households without a vehicle (Figure 4.14), aggregated at parish and municipality level. Calculations performed in 02_veh_ownership.R.

Figure 4.13: Average motorised vehicles per household
Figure 4.14: Percentage of households without a vehicle