TrafficFlowBench: A Reproducible Big-Data Benchmark for Traffic-State, OD-Demand, and Congestion Analytics

IEEE Big Data Cup 2026 Traffic Data Competition: Dataset Description and Evaluation Criteria

Version: Draft 1.0

Date: June 2026

Organized by: Shenzhen Urban Transport Planning Center Co., Ltd. (SUTPC) and the IEEE ITS Society Technical Committee on Travel Information and Traffic Management

Competition co-chairs: Xuesong (Simon) Zhou (co-Chair, IEEE ITS Society TC on Travel Information & Traffic Management) and Xiaochun Zhang (Chief Scientist, SUTPC; tentative)

Data foundation: an OSM2GMNS network backbone, public loop-detector observations (PeMS-style), open GPS/trajectory traces, and POI-based trip generation (public, multi-source)

This competition welcomes AI-based and hybrid submissions — graph neural networks, spatiotemporal foundation models, physics-informed neural networks (PINNs), reinforcement learning, computational-graph ODME, and LLM-agent analytics — alongside classical traffic-flow theory, data assimilation, and optimization.

1. Introduction

The IEEE Big Data Cup 2026 Traffic Data Competition addresses a fundamental challenge in transportation big data: how to reconstruct, predict, and explain physically meaningful regional traffic states from sparse, heterogeneous observations — reproducibly, and at genuine scale. The released benchmark is built on a real metropolitan freeway network — an OpenStreetMap-derived (OSM2GMNS) network backbone carrying public loop-detector observations from a PeMS-style panel of 150 mainline detectors across five Los Angeles freeways (I-5, I-10, I-110, I-210, I-405), both directions, at 5-minute resolution for all of 2025 — roughly 11.3 million speed/flow/occupancy observations — together with open GPS/trajectory traces, POI-based trip generation, and NGSIM-style vehicle trajectories that take the same corridors down to sub-second, vehicle-level resolution. The task is big-data in size yet physical in nature: every quantity is tied to traffic-flow theory, so an outsider can enter with a modern ML toolkit and still produce results a traffic engineer can trust.

Most existing traffic-data competitions reward short-term speed prediction or missing-value imputation. They do not ask whether the predicted field is physically consistent, where congestion originates, how queues propagate, or which travel demand causes which bottleneck. TrafficFlowBench is built around four connected scientific questions, each grounded in traffic-flow theory and each made concretely scorable:

Fundamental-diagram recovery. Can a method recover the per-link speed–density–flow relationship (free-flow speed, capacity, critical density, jam density) that governs every other quantity in the benchmark?
Traffic-state estimation and prediction. Can it reconstruct the state at unobserved locations (estimation) and forecast it forward in time (prediction), while respecting the kinematic-wave conservation law — the setting in which physics-informed neural networks are natural?
Congestion-pattern recognition. Can it detect congestion episodes, identify active bottlenecks, and measure shockwave propagation in terms consistent with the Rankine–Hugoniot condition, rather than as black-box anomaly flags?
Demand-to-congestion OD estimation. Can it estimate time-dependent OD demand that not only matches counts but also explains the observed congestion — tracing which OD flows load which bottleneck?

The goal is not black-box accuracy alone, but methods that are accurate, reproducible, interpretable, and consistent with transportation-flow principles. Reproducibility is a first-class objective: the benchmark is anchored on a fully public California dataset and is designed so that an optional second region can be added later under one shared schema, validator, assignment operator, and scoring pipeline.

1.1 Two participation divisions: Open and Expert

To welcome the broad big-data community while still rewarding deep traffic-science work, the competition runs in two divisions with separate leaderboards and prizes.

Open Division — no transportation background needed. One familiar task: reconstruct and predict traffic speed and flow at held-out detectors and times, scored on standard accuracy (RMSE/MAE). Anyone comfortable with time-series or spatiotemporal machine learning can make a valid submission in an afternoon using the starter kit. Large student and newcomer prizes. This is the on-ramp.

Expert Division — the full physical benchmark. Adds fundamental-diagram recovery, OD demand estimation, congestion queue/wave diagnostics, physical-consistency scoring, and scalability. Higher reward, dedicated awards, and the citable benchmark contribution.

A team may enter either or both; Open-Division accuracy carries over as the backbone of the Expert Division. No one needs to know fundamental-diagram theory, shockwaves, or ODME to compete — and win — in the Open Division; the physics that grounds the Expert Division is collected in the optional Appendix A.

1.2 The Primary Dataset and a Prospective Second Region

Dataset (1) — California Open Integrated Mobility Benchmark. An integrated, research-derived open mobility dataset, not a direct redistribution of any single operational database. The foundation is an OpenStreetMap-derived freeway network converted into the GMNS format through the OSM2GMNS workflow; this network layer defines nodes, directed links, geometry, link length, facility type, lanes, free-flow speed, and corridor identifiers, and serves as the common spatial reference for all other layers. On top of it the benchmark integrates several open or permission-cleared layers: public loop-detector observations — PeMS-style 5-minute speed, flow, and occupancy where permitted — provide the traffic-state calibration and validation layer; open GPS traces, trajectory samples, and POI-based trip generation provide route priors, through-traffic patterns, and the base OD demand (§3.2); and the organizers compute derived research labels — congestion onset, recovery, duration, bottleneck location, shockwave propagation, and OD-to-congestion attribution. Original data ownership remains with the respective source agencies and data providers; the released package consists of harmonized schemas, derived labels, reproducible processing scripts, the assignment/loading operator, and scoring tools. Fully public and end-to-end reproducible, this dataset anchors the primary leaderboards.

Positioning. TrafficFlowBench is not a repackaged PeMS forecasting dataset. It is a GMNS-centered, multi-source California traffic-flow benchmark that uses OSM2GMNS as the network foundation, open GPS/trajectory and POI information for route and OD priors, public loop-detector observations for state calibration, and reproducible algorithms to generate physically meaningful congestion, shockwave, and demand-attribution labels. Sensor observations derive from public California sources (PeMS-style detector data); network connectivity derives from open-source references (OSM-style). The released California data should be cited as research-derived data integrating public sensor references and open-source connectivity — not a direct Caltrans or OSM extract — and each source agency retains ownership of its original data. A fuller provenance and limitation statement is given in §10.

Dataset (2), prospective — possible Shenzhen support. A potential second region with possible support from the Shenzhen Urban Transport Planning Center Co., Ltd. (SUTPC). Whether, when, and in what form any Shenzhen data could be shared is not yet determined and remains subject to privacy and legal review. If it were available, it might contribute anonymized, aggregated traffic measurements and demand references from a different traffic regime, enabling cross-region replication. The benchmark is designed so that California stands alone; any Shenzhen contribution is an enhancement, not a dependency.

If both regions are available they would be complementary: California offers dense, public, high-quality fixed-sensor coverage with proxy OD, while Shenzhen would add additional demand references in a different regime. A method that transfers across them would demonstrate a transferable mechanism rather than a single-city fit — but the cross-region track is activated only if and when the partner data clear review, and no part of the core competition depends on it.

1.3 Reproducible-Research Positioning

TrafficFlowBench is a reproducible-research benchmark for transportation engineering, jointly organized by SUTPC and the IEEE ITS Society Technical Committee on Travel Information and Traffic Management with contributions from members of RERITE (Reproducible Research in Transportation Engineering). It operationalizes three levels of reproducibility. The third level, scientific replicability, is tested across corridors and scenarios within California; an optional second region, if released, would make it directly measurable across regions.

Level	Name	What the benchmark guarantees
Level 1	Data reproducibility	Standardized GMNS node/link/sensor/count/OD/trajectory schemas with explicit units, time intervals, spatial references, and anonymization/provenance notes; a datasheet and per-region data card.
Level 2	Computational reproducibility	Runnable code, pinned containers, fixed seeds, a shipped validator, baselines, the frozen assignment operator, and scoring tools, so any submission re-runs under a consistent environment.
Level 3	Scientific replicability	A public primary region (California) plus an optional, prospective second region that — if partner data clear privacy review — enables a cross-region transfer track measuring performance retention; replicability is otherwise tested across corridors and scenarios within California.

1.4 Relationship to Prior Traffic Benchmarks

Speed-forecasting datasets (METR-LA, PEMS-BAY; Li et al. 2018) and frame-prediction challenges (Traffic4Cast) evaluate prediction error on a single region and a single modality. The machine-learning literature for traffic-state estimation and prediction is large and fast-moving — spanning statistical models, spatiotemporal graph neural networks, transformers, physics-informed neural networks, and low-rank/tensor methods (surveyed by Seo et al. 2017; Tedjopurnomo et al. 2020; Jiang & Luo 2022) — yet results are rarely comparable because datasets, preprocessing, and evaluation differ across papers.

TrafficFlowBench differs on three axes: (i) it scores physical consistency and fundamental-diagram recovery, not error alone; (ii) it unifies state, demand (ODME), and congestion into one causal chain with a shipped assignment operator; and (iii) it is reproducibility-scored and designed for cross-region replication (with an optional second region if partner data are cleared). A methodological landscape mapping these literature families to shipped baselines and tracks is given in §8.1, and required ablation reporting (§7.8) makes component contributions comparable across submissions.

Operationally, the benchmark supports the digital-twin loop of accurately understanding the present state (estimation), precisely predicting the near future (prediction), and explaining/attributing congestion so agencies can intervene effectively. It is intended to be cited as community infrastructure, not a one-time leaderboard.

1.5 Glossary (for non-transportation teams)

Term	Meaning
FD (fundamental diagram)	The equilibrium relationship among flow q, density k, and speed v on a road link; q = k·v.
GMNS	General Modeling Network Specification — an open node/link tabular format for road networks.
ODME	Origin–Destination Matrix Estimation — inferring how much demand travels between each origin and destination.
TSE	Traffic-State Estimation — inferring speed/flow/density where/when not directly measured.
PINN	Physics-Informed Neural Network — a network trained to fit data and a governing PDE residual.
Shockwave speed	Propagation speed of a boundary between two traffic states; the Rankine–Hugoniot chord slope on the flow–density plane.
Active bottleneck	A link discharging at capacity with a queue upstream and free flow downstream.
VDS / detector	A fixed sensor (PeMS Vehicle Detector Station) reporting speed/flow/occupancy.
Space–time IoU	Intersection-over-union of predicted vs. true congested regions in the link×time plane.
k-anonymity	An anonymized record is indistinguishable from at least k−1 others.
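
Two of the glossary entries reduce to one-line formulas, sketched below for readers new to the terms. The function names are hypothetical, and treating two empty regions as a perfect match (IoU = 1) is an assumption rather than a stated benchmark rule.

```python
def shockwave_speed(q_a, k_a, q_b, k_b):
    """Rankine-Hugoniot chord slope between states A and B on the flow-density plane.

    Flows in veh/h and densities in veh/km give a speed in km/h.
    """
    if k_a == k_b:
        raise ValueError("the two states must differ in density")
    return (q_b - q_a) / (k_b - k_a)


def space_time_iou(pred_cells, true_cells):
    """Intersection-over-union of two sets of (link_id, time_bin) cells."""
    pred, true = set(pred_cells), set(true_cells)
    union = pred | true
    if not union:
        return 1.0  # assumption: nothing predicted and nothing observed is a match
    return len(pred & true) / len(union)


# Free-flowing arrivals (1800 veh/h, 20 veh/km) meeting a queue (1200 veh/h, 80 veh/km).
w = shockwave_speed(1800.0, 20.0, 1200.0, 80.0)

iou = space_time_iou(
    {("L1", 0), ("L1", 1), ("L2", 1)},
    {("L1", 1), ("L2", 1), ("L2", 2)},
)
print(w, iou)
```

With position measured in the direction of travel, a negative shockwave speed means the boundary moves upstream, which is the usual behavior of a growing queue.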

2. Scientific Foundations (optional — see Appendix A)

Every quantity scored in the Expert Division is grounded in traffic-flow theory — the fundamental diagram, the kinematic-wave (LWR) conservation law, the Rankine–Hugoniot shockwave condition, and Newell queueing. These definitions are collected in Appendix A and are not needed to enter the Open Division.

Newcomers can skip straight to the tasks (§3); the appendix is there for teams that want to build physical consistency into their methods or compete in the Expert Division.

3. Problem Description and Tracks

Three technical tracks are evaluated independently. Track 1's speed/flow accuracy is the Open Division (accessible to any ML team); the FD-recovery, OD-estimation, and congestion components form the Expert Division (§1.1). Teams may enter one track or several, and one region or both. Region-specific leaderboards and a combined cross-region leaderboard run in parallel; a one-region team is never disadvantaged on the per-region boards (the combined board is opt-in).

3.1 Track 1 — TrafficStateBench

Track 1 has two clearly separated sub-tasks plus an FD-recovery component, with distinct holdout protocols (§6) to prevent leakage. Methods from every major estimation/prediction family are welcome and each has a shipped same-data baseline (§8.1); finalists report a standardized ablation isolating their key components (§7.8).

3.1.1 Sub-task 1a — Estimation (spatial imputation / nowcast)

Reconstruct speed, volume, and occupancy at spatially held-out sensors over the same time window the inputs cover. A fixed fraction of sensor_ids (30% / 50% / 70%, set by missing_rate_scenario) is masked: all rows for those sensors are removed from the released files and retained by the organizers as ground truth for scoring. The challenge is inferring a complete, physically plausible field from sparse measurements.
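
The sensor-level holdout can be mimicked locally for validation. The sketch below assumes readings are held as a list of dicts keyed by sensor_id; the function name mask_sensors and the seeding scheme are illustrative and not part of the released tooling.

```python
import random


def mask_sensors(rows, missing_rate, seed=0):
    """Hold out whole sensors (not individual rows) at the given rate."""
    sensor_ids = sorted({r["sensor_id"] for r in rows})
    n_masked = round(missing_rate * len(sensor_ids))
    masked = set(random.Random(seed).sample(sensor_ids, n_masked))
    released = [r for r in rows if r["sensor_id"] not in masked]
    held_out = [r for r in rows if r["sensor_id"] in masked]
    return released, held_out, masked


# Toy panel: 10 sensors, 3 time steps each.
rows = [
    {"sensor_id": f"S{i}", "timestamp": t, "speed": 100.0 - i}
    for i in range(10)
    for t in range(3)
]
released, held_out, masked = mask_sensors(rows, missing_rate=0.30)
print(len(masked), len(released), len(held_out))
```

Masking by sensor rather than by row is what makes the task spatial imputation: a held-out sensor contributes no observations at any time step.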

3.1.2 Sub-task 1b — Prediction (forecast)

Given all observations up to a forecast origin t0 (in setting.csv), predict speed and volume on all visible sensors at horizons +15, +30, and +60 minutes. The test window is strictly later than training; no row of any file in the test window is released until scoring.
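
As a point of reference for the forecast layout, the sketch below implements a naive persistence forecast that repeats the last speed observed at or before t0 for each horizon. It is not one of the shipped baselines; the input structure, the function name, and the example date are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

HORIZONS_MIN = (15, 30, 60)


def persistence_forecast(history, t0):
    """history: {sensor_id: [(timestamp, speed), ...]}; rows after t0 are ignored."""
    forecasts = []
    for sensor_id, series in history.items():
        past = [(ts, speed) for ts, speed in series if ts <= t0]
        if not past:
            continue
        last_speed = max(past, key=lambda p: p[0])[1]
        for h in HORIZONS_MIN:
            forecasts.append({
                "sensor_id": sensor_id,
                "timestamp": t0 + timedelta(minutes=h),
                "horizon_minutes": h,
                "predicted_speed": last_speed,
            })
    return forecasts


t0 = datetime(2025, 3, 4, 15, 0, tzinfo=timezone.utc)  # hypothetical forecast origin
history = {
    "S1": [
        (t0 - timedelta(minutes=5), 92.0),
        (t0, 88.0),
        (t0 + timedelta(minutes=5), 40.0),  # after t0: must not influence the forecast
    ]
}
forecasts = persistence_forecast(history, t0)
```

Filtering on ts <= t0 before anything else is the simplest way to keep a local validation loop from leaking test-window rows.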

3.1.3 FD recovery

Teams optionally report per-link FD parameters (v_f, q_cap, k_crit, k_jam). Recovery is credited by S_FD (§7.2), rewarding congested-branch fidelity that speed RMSE alone ignores and connecting directly to the organizers' six-model FD suite.

3.2 Track 2 — ODMEBench

Estimate time-dependent OD demand that is consistent with counts AND explains congestion. The track is made well-posed by three shipped artifacts.

3.2.1 The shipped assignment / loading operator

Because mapping OD to link flow must be identical for every team, the organizers release loading.py: a frozen path-link incidence matrix Δ (link × path × departure-bin) plus a kinematic-wave network-loading model. Teams output an OD matrix only; the organizers run the operator to obtain link flows v = Δ·g (g = path flows from OD via the released route set). A static time-expanded assignment matrix A is used for L0/L1; the dynamic loader is used for L2.
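
The static mapping v = Δ·g can be sketched in a few lines. This is an illustration of the incidence product only, under the assumption that Δ is stored as a nested list indexed [link][path][departure_bin]; it is not the interface of loading.py, and it ignores the travel-time lag that the dynamic loader would model.

```python
def load_links(delta, path_flow):
    """Link flows from path flows through a link x path x departure-bin incidence.

    delta[l][p][t]  -- 1 if path p uses link l for departure bin t, else 0
    path_flow[p][t] -- flow on path p departing in bin t (veh/h)
    """
    n_links = len(delta)
    n_paths = len(path_flow)
    n_bins = len(path_flow[0])
    return [
        [
            sum(delta[l][p][t] * path_flow[p][t] for p in range(n_paths))
            for t in range(n_bins)
        ]
        for l in range(n_links)
    ]


# Two links, two paths, two departure bins.
# Path 0 uses links 0 and 1; path 1 uses link 1 only.
delta = [
    [[1, 1], [0, 0]],
    [[1, 1], [1, 1]],
]
path_flow = [[100.0, 200.0], [50.0, 80.0]]
link_flow = load_links(delta, path_flow)
print(link_flow)
```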

3.2.2 Well-posedness: an open-data base OD + reasonable deviation

Dynamic OD from counts is underdetermined, so each instance ships a base (reference) OD built entirely from open data — no toll or proprietary demand data is used. An OpenStreetMap network and GPS trace files supply route and through-traffic structure, and POI-based trip generation (trip production and attraction from points of interest) supplies zonal demand totals. With a fixed route set and a stated route-choice parameter θ, the reference solution follows the generalized-least-squares computational-graph formulation (Ma et al. 2020; Zhang et al. 2024) and minimizes

    min_g  ‖ v(g) − v_obs ‖²_W  +  λ · D(g, g_base)

The first term reproduces link counts and the second keeps the deviation reasonable; v_obs are observed link counts, g_base is the open-data base OD, and D is a deviation penalty (entropy / GLS). A valid estimate must (i) reproduce observed link counts, (ii) deviate only reasonably from the base OD, and (iii) preserve the corridor's overall through-traffic routes and average travel distance (trip-length distribution). Submissions are judged against this base-anchored reference, converting an ill-posed inverse problem into a uniquely scored, physically reasonable one.
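
A minimal sketch of evaluating this objective follows. It assumes a static assignment matrix A (link × OD), a diagonal weight matrix W given as a vector w, and a squared-difference deviation D; the benchmark may use an entropy or full GLS form instead, and the function name is hypothetical.

```python
def odme_objective(g, g_base, A, v_obs, w, lam):
    """Weighted count misfit plus lam times the squared deviation from the base OD."""
    n_links, n_od = len(A), len(g)
    v = [sum(A[l][i] * g[i] for i in range(n_od)) for l in range(n_links)]
    count_misfit = sum(w[l] * (v[l] - v_obs[l]) ** 2 for l in range(n_links))
    deviation = sum((g[i] - g_base[i]) ** 2 for i in range(n_od))
    return count_misfit + lam * deviation


A = [[1, 0], [1, 1]]      # link 0 carries OD 0; link 1 carries OD 0 and OD 1
g_base = [100.0, 50.0]    # base OD demand
v_obs = [120.0, 170.0]    # observed link counts
w = [1.0, 1.0]

j_base = odme_objective(g_base, g_base, A, v_obs, w, lam=0.1)
j_adjusted = odme_objective([120.0, 50.0], g_base, A, v_obs, w, lam=0.1)
print(j_base, j_adjusted)
```

In this toy case, raising the first OD flow by 20 vehicles removes the count misfit entirely at the price of a small deviation penalty, which is the trade-off that λ controls.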

3.2.3 Demand-to-congestion attribution (the flagship sub-task)

For each ground-truth congestion episode e with bottleneck link ℓ over its active window T_e, teams submit an OD-attribution vector od_contribution[e] = {(o,d): share}, summing to 1, naming which OD pairs load that bottleneck. The organizers compute the ground-truth share from the reference loading,

    c*(o,d) = ( Σ_{t∈T_e} Δ[ℓ, paths(o,d), t] · g*[o,d,t] ) / ( Σ_{t∈T_e} Σ_{(o′,d′)} Δ[ℓ, paths(o′,d′), t] · g*[o′,d′,t] ),

where the denominator runs over all OD pairs so that the shares sum to 1, and score it by

    S_attribution = 1 − ½ ‖c_team − c*‖₁

(top-k overlap reported as secondary). This makes “tracing demand to queues” directly and reproducibly scorable, and is the competition's principal scientific differentiator.
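
The share and score definitions can be sketched as follows. The per-OD bottleneck loads are taken as given here (in the benchmark they come from the reference loading); the function names and the toy numbers are illustrative.

```python
def attribution_shares(loads):
    """Normalize per-OD bottleneck loads over the episode window into shares."""
    total = sum(loads.values())
    if total <= 0:
        raise ValueError("no demand loads the bottleneck in this window")
    return {od: x / total for od, x in loads.items()}


def s_attribution(c_team, c_ref):
    """1 minus half the L1 distance between two share vectors."""
    keys = set(c_team) | set(c_ref)
    l1 = sum(abs(c_team.get(k, 0.0) - c_ref.get(k, 0.0)) for k in keys)
    return 1.0 - 0.5 * l1


c_ref = attribution_shares({("A", "C"): 600.0, ("B", "C"): 300.0, ("A", "D"): 100.0})
c_team = {("A", "C"): 0.5, ("B", "C"): 0.5}
score = s_attribution(c_team, c_ref)
print(round(score, 3))
```

Because both vectors are nonnegative and sum to 1, half the L1 distance is the total-variation distance, so the score always lies between 0 and 1.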

3.3 Track 3 — ShockwaveBench (queue localization, wave tracking, congestion duration)

This track is explicitly about where a queue forms, how it propagates, and how long it lasts. It is anchored by a reproducible congestion-event kernel — a deterministic rule that turns speed, flow, and occupancy fields into scorable congestion episodes — and is offered at three levels of difficulty so that Open-Division teams can enter at Level 1 while transportation- and physics-oriented teams compete on Levels 2–3. Episodes are reconstructed in their space–time evolution using the Appendix A definitions (§A.3).

3.3.1 The congestion-event kernel

Congestion is labeled by a link-specific speed cutoff rather than a single global anomaly threshold, so that the same rule produces comparable episodes on a 130 km/h freeway and a 70 km/h connector. The default cutoff couples the free-flow speed to the capacity-regime speed:

    v_cut(ℓ) = min( 0.60 · v_free(ℓ), v_cap(ℓ) ),  with  v_cap(ℓ) = q_cap(ℓ) / k_crit(ℓ).

A link–time cell is congested when v(ℓ,t) ≤ v_cut(ℓ), written as the indicator

    I(ℓ,t) = 1[ v(ℓ,t) ≤ v_cut(ℓ) ].

Short isolated drops are filtered by requiring at least m = 3 consecutive 5-minute intervals (15 min) below the cutoff before an episode opens. The default 0.60 multiplier is released in setting.csv (congestion_cutoff_ratio) and the organizers report a sensitivity check at 0.55, 0.60, and 0.65. For each congestion episode the benchmark defines the following scorable quantities:

Quantity	Benchmark definition
T0	Onset — first time speed stays below v_cut for at least m consecutive intervals (default 3 × 5 min = 15 min).
T2	Time of lowest speed during the episode: T2 = argmin v(t) over [T0, T3].
v_t2	Lowest observed or predicted speed, attained at T2.
T3	Recovery — first time after T2 that speed stays above v_cut for at least r consecutive intervals (default r = 3).
P	Congestion duration P = T3 − T0.
N_cong	Congested-period throughput — vehicles passing the bottleneck/discharge link over [T0, T3] (see below), reported for AM and PM peaks.
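
The cutoff and the T0/T2/T3 rules can be sketched for a single link as below. This is not the organizers' kernel. It assumes that T3 is the first run of r recovered intervals after the onset run and that T2 is then the minimum over [T0, T3], which resolves the mutual reference between the two definitions; it returns interval indices rather than timestamps, and it skips episodes that never recover within the series.

```python
def speed_cutoff(v_free, q_cap, k_crit, ratio=0.60):
    """v_cut = min(ratio * v_free, q_cap / k_crit), in km/h."""
    return min(ratio * v_free, q_cap / k_crit)


def first_run(flags, start, length):
    """Start index of the first run of `length` True values at or after `start`."""
    run = 0
    for i in range(start, len(flags)):
        run = run + 1 if flags[i] else 0
        if run == length:
            return i - length + 1
    return None


def detect_episode(speeds, v_cut, m=3, r=3, interval_min=5):
    """First congestion episode in a single-link speed series, or None."""
    congested = [v <= v_cut for v in speeds]
    t0 = first_run(congested, 0, m)
    if t0 is None:
        return None  # no drop lasted m consecutive intervals
    recovered = [v > v_cut for v in speeds]
    t3 = first_run(recovered, t0 + m, r)
    if t3 is None:
        return None  # assumption: unrecovered episodes are skipped
    t2 = min(range(t0, t3 + 1), key=lambda i: speeds[i])
    return {
        "T0": t0,
        "T2": t2,
        "T3": t3,
        "P_min": (t3 - t0) * interval_min,
        "v_t2": speeds[t2],
    }


v_cut = speed_cutoff(v_free=110.0, q_cap=2000.0, k_crit=25.0)  # about 66 km/h
speeds = [100, 95, 60, 55, 40, 58, 62, 90, 95, 98, 100]
episode = detect_episode(speeds, v_cut)
print(episode)
```

A single-interval dip such as [100, 60, 100, 100, 100] opens no episode, which is the purpose of the m-interval filter.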

Congested-period throughput converts the queue into a demand-loading quantity. For an episode e on the bottleneck/discharge link ℓ*, it is the volume integral over the active window:

N_cong(e) = Σ_{ t ∈ [T0_e, T3_e] } q(ℓ*, t) · Δt, with Δt = 5/60 h for 5-minute data.

This yields three linked, separately scored outputs per episode: duration (how long the queue lasts, P), severity (how far speed falls, v_t2), and throughput/loading (how much traffic the bottleneck discharges during the congested window, N_cong).

Throughput is what ties ODMEBench to ShockwaveBench: an OD estimate is judged not only by count matching but by whether it loads the bottleneck with the observed congested-period volume (§3.2.3).
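
The throughput sum is a direct translation of the formula for N_cong. The sketch assumes flows are already normalized to veh/h and that T0 and T3 are interval indices, both inclusive; the function name is illustrative.

```python
def congested_throughput(flow_veh_per_h, t0, t3, interval_min=5):
    """Vehicles discharged over [t0, t3] (inclusive) on the bottleneck link."""
    dt_h = interval_min / 60.0
    return sum(q * dt_h for q in flow_veh_per_h[t0 : t3 + 1])


# A bottleneck discharging a steady 1800 veh/h, with an episode over intervals 2..7.
flows = [1800.0] * 12
n_cong = congested_throughput(flows, t0=2, t3=7)
print(round(n_cong, 1))
```

Six 5-minute intervals at 1800 veh/h give 900 vehicles, which is the quantity an OD estimate must reproduce at the bottleneck.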

Recurring vs. non-recurring. Each episode also carries a recurrence label built from history alone. Let R(ℓ,τ) = P( I(ℓ,t)=1 | link ℓ, time-of-day τ, day type ) be the historical congestion probability. An episode is recurring when congestion appears frequently at the same link and time-of-day across comparable days; non-recurring when it is rare under that profile, or associated with an incident, weather, or work zone; and mixed when a recurring bottleneck shows abnormal duration or severity. The label is computable from the detector history; an optional incident feed (e.g. the CHP incident records) can sharpen the non-recurring class but the benchmark does not depend on it.
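
The historical profile R(ℓ,τ) is an empirical frequency and can be estimated as sketched below. The record layout and function name are assumptions; the text above does not state numeric cutoffs for the recurring / non-recurring / mixed classes, so none are applied here.

```python
from collections import defaultdict


def recurrence_profile(history):
    """Empirical P(I = 1) per (link_id, time-of-day bin, day type).

    history: iterable of (link_id, tod_bin, day_type, indicator) with indicator 0 or 1.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for link_id, tod_bin, day_type, indicator in history:
        key = (link_id, tod_bin, day_type)
        hits[key] += indicator
        counts[key] += 1
    return {key: hits[key] / counts[key] for key in counts}


# Four weekdays of history for one link at two times of day.
history = (
    [("L1", "08:00", "weekday", flag) for flag in (1, 1, 1, 0)]
    + [("L1", "13:00", "weekday", flag) for flag in (0, 0, 0, 1)]
)
profile = recurrence_profile(history)
print(profile)
```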

3.3.2 Three levels of difficulty

Shockwave estimation is hard, so Track 3 is graded in three levels. Open-Division teams can compete at Level 1 using only the congestion-event kernel; transportation- and physics-oriented teams compete on Levels 2–3.

Level	Name	Required output
Level 1	Congestion duration	T0, T2, T3, P, v_t2, and congested-period throughput N_cong.
Level 2	Queue localization	Bottleneck link, affected upstream links, and maximum queue extent (km).
Level 3	Shockwave tracking	Queue-front trajectory on v(x,t) and the shockwave speed (validated against u_shock, §A.4).

Depending on the level entered, teams report for each episode:

Congestion onset and duration: onset T0, recovery T3, and the congestion duration P = T3 − T0 (the central quantity of the QVDF queueing volume-delay model; Cheng et al. 2022);
Queue localization: the bottleneck location (§A.5), the affected links, and the maximum queue extent (km of contiguous upstream queue);
Wave tracking: the backward-forming queue front traced on the speed contour v(x,t), with its propagation direction and shockwave speed (validated against u_shock, §A.4);
Severity and recovery: the lowest speed v_t2, the discharge rate μ as the queue clears, and the recovery pattern.

These are exactly the queue/wave quantities of the QVDF model (P, v_t2, the queue Q(t), discharge μ), so a method's congestion diagnostics are physically interpretable and directly comparable to a calibrated reference. Scoring (§7.4) uses a fixed, organizer-defined set of episodes so teams cannot drop hard ones, and includes a recall floor.

3.4 Objective Function and Per-Track Scoring

Each track has its own composite. A team entering a single track is ranked on that track's composite plus the reproducibility component, with weights renormalized so that absent terms never penalize. The combined “full” score is a weighted blend of the three track composites and the reproducibility component.

Ranking under partial participation. Every (division × track × region) combination has its own leaderboard, and a team is ranked only against others who submitted to that same cell, so entering a single track, a single region, or only the Open Division is never penalized. The combined all-tracks leaderboard is opt-in: a team that skips a track is simply absent from it, not scored zero.

Score	Definition
S_State (Track 1)	0.35·S_Estimation(1a) + 0.30·S_Prediction(1b) + 0.20·S_FD + 0.15·S_phys
S_ODME (Track 2)	0.45·S_OD-volume + 0.25·S_link-count + 0.30·S_attribution
S_Congestion (Track 3)	0.25·F1_event + 0.20·IoU_space-time + 0.20·S_duration + 0.15·S_throughput + 0.10·S_propagation + 0.10·S_recurrence
Full combined score	0.30·S_State + 0.30·S_ODME + 0.25·S_Congestion + 0.15·S_Reproducibility
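
The weights in the last row, and one possible reading of renormalization under partial participation, are sketched below. Rescaling the remaining weights proportionally is an assumption; the official scorer may renormalize differently, and the function name is hypothetical.

```python
FULL_WEIGHTS = {
    "S_State": 0.30,
    "S_ODME": 0.30,
    "S_Congestion": 0.25,
    "S_Reproducibility": 0.15,
}


def renormalized_score(scores, weights=FULL_WEIGHTS):
    """Weighted mean over the terms a team actually submitted."""
    present = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total_weight = sum(present.values())
    if total_weight == 0:
        return None
    return sum(w * scores[k] for k, w in present.items()) / total_weight


full = renormalized_score(
    {"S_State": 0.8, "S_ODME": 0.6, "S_Congestion": 0.7, "S_Reproducibility": 1.0}
)
track1_only = renormalized_score({"S_State": 0.8, "S_Reproducibility": 1.0})
print(round(full, 3), round(track1_only, 3))
```

Under this reading a Track 1 team is scored as (0.30·S_State + 0.15·S_Reproducibility) / 0.45, so the missing tracks contribute neither credit nor penalty.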

The reproducibility component is restricted to objective, auto-checkable items (§7.6); subjective documentation/model-card quality is a finalist/prize gate, not a leaderboard term, so it cannot reorder near-tied methods.

Separate leaderboards are maintained per track, per region, and for the transfer track.

Scalability — the largest instance a method solves within the declared compute budget — is separately credited (§7.9) and maintained as its own leaderboard axis, so methods that hold at full network scale are rewarded over those that work only on a single corridor.

3.5 Physical and Data-Consistency Requirements

A validator checks these automatically before any score is computed.

C1. Network consistency. All outputs reference valid nodes, links, sensors, zones, and time intervals for the chosen region and instance.

C2. Nonnegative demand and flow. OD demand, link flows, and volume predictions are nonnegative.

C3. Time-bin consistency. Outputs follow the released time-aggregation interval; periods may not be shifted, merged, or redefined unless allowed.

C4. OD-to-count consistency. Estimated OD, after the shipped loading operator (§3.2.1), reproduces observed counts within tolerance; the residual feeds S_link-count.

C5. Physical (FD) consistency. Hard checks: (a) q = k·v within tolerance ε for every reported (v, q, occupancy) triple; (b) q ≤ q_cap·(1+ε); (c) no point above the free-flow line (higher density must not coincide with higher speed on the congested branch); capacity drop is rewarded, not penalized.

C6. No re-identification. No attempt to identify travelers, vehicles, devices, or accounts. Applies with particular force to any anonymized OD or trajectory data, if shared.

C7. Schema and unit fidelity. Released field names, units, and identifiers used exactly; not-applicable numeric fields encoded as -1; timestamps in UTC plus local_tz.

C8. Reproducibility. Finalists submit a pinned container, run.sh, fixed seeds, and a short report; organizers reproduce results within tolerance on fixed hardware (§7.6).

C9. No leakage / no external joins. Held-out cells must not be reconstructed from input fields that should have been masked (trajectory speed, OD avg_travel_time); no joining benchmark data to external GPS/registration/social data that could re-identify.

C10. Transfer honesty. For the transfer track, the diagnostics.training_data field must truthfully declare the source region; no target-region labels may be used in training (validator-checked).
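
Checks C5(a) and C5(b) could be pre-screened locally along the following lines. This is a sketch, not the shipped validator. It assumes density is derived from occupancy through the effective vehicle length g_eff (in metres) as k = (occupancy / 100) · 1000 / g_eff, and that ε is a relative tolerance; the g_eff and ε values shown are placeholders.

```python
def density_from_occupancy(occ_pct, g_eff_m):
    """Density in veh/km from occupancy in percent (assumed conversion)."""
    return (occ_pct / 100.0) * 1000.0 / g_eff_m


def check_c5(v, q, occ_pct, q_cap, g_eff_m=6.0, eps=0.05):
    """Return pass/fail flags for C5(a) q = k*v and C5(b) q <= q_cap*(1 + eps)."""
    k = density_from_occupancy(occ_pct, g_eff_m)
    kv = k * v
    return {
        "a_q_equals_kv": abs(q - kv) <= eps * max(q, kv, 1.0),
        "b_within_capacity": q <= q_cap * (1.0 + eps),
    }


# 12 % occupancy at 90 km/h implies about 20 veh/km and 1800 veh/h.
consistent = check_c5(v=90.0, q=1800.0, occ_pct=12.0, q_cap=2000.0)
inconsistent = check_c5(v=90.0, q=2500.0, occ_pct=12.0, q_cap=2000.0)
print(consistent, inconsistent)
```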

4. Input Data Format

Reference section — skim on first read. An Open-Division entrant needs only sensor_readings.csv (speed, flow, occupancy); the remaining files matter mainly for the Expert Division. The full per-field tables below are provided for completeness and may be treated as a data dictionary.

Each regional instance is a set of CSV/Parquet files in a GMNS-style schema. All regions share one schema, so only the values differ between California and any second region. All timestamps are UTC with a local_tz field; volume is normalized to veh/h regardless of interval length.

4.1 node.csv

Field	Type	Description	Notes
node_id	int/string	Unique node identifier	Sequential
node_type	string	Node category	road, sensor, zone_centroid, toll_gantry, junction
x_coord	float	Longitude (WGS84)
y_coord	float	Latitude (WGS84)
name	string	Optional node name
zone_id	string	Associated zone, if any	-1/empty if N/A

4.2 link.csv

California geometry/attributes are OSM-derived; a prospective Shenzhen region, if available, would be SUTPC-curated. New fields support the FD/PINN framework of §2.

Field	Type	Description	Notes
link_id	int/string	Unique link identifier	Sequential
from_node_id / to_node_id	int/string	Upstream / downstream node	References node_id
length	float	Link length (km)
x_start_km / x_end_km	float	1-D corridor coordinate	Continuous space frame for PINNs (§A.6)
facility_type	string	Facility class	freeway, express_lane, arterial, ramp, connector, tunnel/bridge
free_speed	float	Free-flow speed v_f (km/h)	Reference
capacity	float	Capacity q_cap (veh/h)	Per direction
critical_density	float	k_crit (veh/km)	FD parameter (§A.2)
jam_density	float	k_jam (veh/km)	FD parameter (§A.2)
lanes	int	Number of lanes
direction	string	Direction / corridor label

4.3 sensor_readings.csv

California: PeMS 5-minute speed/flow/occupancy. A prospective Shenzhen region, if available: SUTPC section detectors (speed and flow; occupancy where available).

Field	Type	Description	Notes
sensor_id	string	Sensor / station id	PeMS VDS or anonymized section id
link_id	int/string	Associated link	References link_id
timestamp	datetime	Observation time (UTC)	Aligned to time_interval_minutes
speed	float	Measured speed (km/h)
volume	float	Measured flow (veh/h)	Interval-normalized
occupancy	float	Occupancy (%)	-1 if unavailable
quality_flag	string	Quality / imputation indicator	observed, imputed
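
A minimal reader for this file might look like the sketch below. It assumes comma-separated values, ISO 8601 timestamps with an explicit UTC offset, and the column order of the table above; the sample rows are invented for illustration and the function name is hypothetical.

```python
import csv
import io
from datetime import datetime

SAMPLE = """sensor_id,link_id,timestamp,speed,volume,occupancy,quality_flag
S1,101,2025-03-04T15:00:00+00:00,88.5,1620,11.2,observed
S2,102,2025-03-04T15:00:00+00:00,64.0,1500,-1,imputed
"""


def parse_readings(text):
    """Parse sensor readings, mapping the -1 occupancy sentinel to None."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        occupancy = float(rec["occupancy"])
        rows.append({
            "sensor_id": rec["sensor_id"],
            "link_id": rec["link_id"],
            "timestamp": datetime.fromisoformat(rec["timestamp"]),
            "speed": float(rec["speed"]),        # km/h
            "volume": float(rec["volume"]),      # veh/h
            "occupancy": None if occupancy == -1 else occupancy,  # percent
            "quality_flag": rec["quality_flag"],
        })
    return rows


rows = parse_readings(SAMPLE)
print(rows[1]["occupancy"], rows[0]["speed"])
```

Converting the -1 sentinel to None early keeps it from being averaged into occupancy statistics as if it were a measurement.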

4.4 traffic_counts.csv, 4.5 trajectory_sample.csv

File	Fields
traffic_counts.csv	count_id, link_id, timestamp, count, vehicle_class (car/truck/bus/all), direction, quality_flag.
trajectory_sample.csv	trip_id, timestamp, link_id, speed, travel_time, path_sequence_id, quality_flag.

Per C9, trajectory speed/travel-time for held-out links and forecast windows are withheld or coarsened so they cannot leak the answer.

4.6 base_od.csv (open-data reference OD)

The reference OD is built from open sources only — no toll or proprietary demand data is required. An OpenStreetMap network plus GPS trace files provide route and through-traffic structure, and POI-based trip production/attraction (trip generation from points of interest) provides zonal demand totals. The released base_od.csv is the ODME prior g_base (§3.2.2); observed link counts are the calibration target the estimate must reproduce.

Field	Type	Description	Notes
od_record_id	string	OD-record id
origin_zone / destination_zone	string	Origin / destination zone	References zone_id
departure_time_bin	datetime/string	Departure interval	Aligned to interval
volume	float	Base OD demand (veh)	Nonnegative; from POI trip generation
avg_travel_time	float	Reference OD travel time (min)	Optional; for trip-length checks
data_source	string	Provenance	osm_gps, poi, fused

4.7 setting.csv, base_od.csv, and shipped operators

setting.csv carries: region; time_interval_minutes; g_eff (effective vehicle length); train/val/test dates; forecast_origin and horizon_minutes (1b); missing_rate_scenario; demand_multiplier; congestion_speed_threshold θ; congestion_cutoff_ratio (default 0.60 for v_cut, §3.3.1); min_onset_intervals m and min_recovery_intervals r (default 3 each); tau; route-choice θ and λ; evaluation_track. Each instance also ships base_od.csv (open-data prior OD), the route set, initial/boundary conditions for test corridors, and the frozen loading operator loading.py.

4.8 Congestion-event label files (Expert Division)

These derived-label files ship with each Expert-Division instance and make the congestion-event kernel (§3.3.1) reproducible. They are organizer-computed from the released observations; teams consume them as ground truth and reference priors, not as additional inputs to be re-predicted unless the track calls for it.

File	Purpose
congestion_episode_label.csv	Ground-truth episodes: episode_id, link_id, T0, T2, T3, P, v_t2, bottleneck_link, affected_upstream_links, max_queue_extent_km, recurrence_class.
congested_throughput.csv	Congested-period throughput N_cong over [T0, T3] on the bottleneck/discharge link, reported separately for AM and PM peaks.
recurrence_profile.csv	Historical recurrence probability R(ℓ,τ) by link, time-of-day bin, and day type, used for the recurring / non-recurring / mixed label.
incident_event.csv	Optional external incident, work-zone, or weather labels (e.g. CHP incident records) that sharpen the non-recurring class; the benchmark does not depend on it.
base_path_flow.csv	OD–path–departure-bin flow reference supporting OD-to-congestion attribution (§3.2.3): od_record_id, path_sequence_id, departure_time_bin, flow.

5. Solution Output Format

A single JSON (recommended) or Parquet file with two top-level keys: inputs (metadata) and outputs (predictions). The // annotations in the template below are explanatory only; standard JSON does not allow comments, so they must be omitted from a submitted file.

{
  "inputs": {
    "benchmark_name": "TrafficFlowBench",
    "region": "california",
    "track": "full",
    "instance": "A-L2",
    "scenario": "sparse_sensor",
    "time_interval_minutes": 15,
    "forecast_origin": "...",
    "demand_multiplier": 1.0
  },
  "outputs": {
    "traffic_states": [],        // 1a estimation at held-out sensors
    "forecasts": [],             // 1b prediction at +15/+30/+60 min
    "fd_parameters": [],         // per-link v_f, q_cap, k_crit, k_jam
    "od_estimates": [],          // time-dependent OD matrix
    "od_contribution": [],       // per-episode OD attribution (sums to 1)
    "congestion_episodes": [],
    "diagnostics": { "training_data": "california", "runtime_s": 0, "seeds": [] }
  }
}

Components: traffic_states/forecasts (link_id, timestamp, [horizon_minutes], predicted_speed, predicted_volume, predicted_occupancy, congestion_probability); fd_parameters (link_id, v_f, q_cap, k_crit, k_jam); od_estimates (origin_zone, destination_zone, departure_time_bin, estimated_volume, estimated_travel_time); od_contribution (episode_id, list of {origin_zone, destination_zone, share}); congestion_episodes (episode_id, t0, t2, t3, duration_min, v_t2, affected_links, bottleneck_link_id, max_queue_extent_km, propagation_direction, propagation_speed, congested_throughput_am, congested_throughput_pm, recurrence_class, recovery_pattern).

Level-1 entries (§3.3.2) need only t0, t2, t3, duration_min, v_t2, and congested-throughput fields; queue/wave fields are required at Levels 2–3.
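
As a rough illustration of the layout above, the following Python sketch assembles a skeleton submission and serializes it as strict JSON. The `//` annotations in the listing are explanatory and would not parse in a real JSON file. The helper name build_submission and the numeric values are hypothetical and are not part of the shipped tooling; forecast_origin is omitted because the listing leaves its value unspecified.

```python
import json

# Output arrays named in the section 5 listing.
OUTPUT_KEYS = [
    "traffic_states", "forecasts", "fd_parameters",
    "od_estimates", "od_contribution", "congestion_episodes",
]

def build_submission(instance, scenario, interval_min=15):
    """Return a skeleton with the two top-level keys: inputs and outputs."""
    outputs = {key: [] for key in OUTPUT_KEYS}
    outputs["diagnostics"] = {"training_data": "california",
                              "runtime_s": 0, "seeds": []}
    inputs = {
        "benchmark_name": "TrafficFlowBench",
        "region": "california",
        "track": "full",
        "instance": instance,
        "scenario": scenario,
        "time_interval_minutes": interval_min,
        "demand_multiplier": 1.0,
    }
    return {"inputs": inputs, "outputs": outputs}

submission = build_submission("A-L2", "sparse_sensor")

# One fd_parameters record; the numbers are made up for illustration.
submission["outputs"]["fd_parameters"].append(
    {"link_id": "L1", "v_f": 105.0, "q_cap": 2000.0,
     "k_crit": 19.0, "k_jam": 120.0}
)

# json.dumps emits comment-free JSON that any validator can parse.
text = json.dumps(submission, indent=2)
```

Writing text to solution_result.json would give a file with the expected top-level shape; the released validator remains the authority on which fields are required per track.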

6. Problem Instances, Holdout Design, and Anti-Leakage

The primary regional family is A = California, with instances of increasing scale. A prospective family B = Shenzhen and a cross-region transfer family (T) are released only if and when the partner data clear SUTPC privacy and legal review; their scope and timing are not yet confirmed. The competition runs in full on California alone, so the design does not depend on the Shenzhen release.

6.1 Region A — California Open Integrated Mobility Benchmark: a real 150-detector freeway network

The California family is built from the Caltrans PeMS panel for Los Angeles District 7 (year 2018, 5-min) on an OSM-derived GMNS network, with NGSIM-style trajectories for the microscopic instance — four instances of increasing scale, analogous to the RAS competition's L0–L3 ladder. Participants are encouraged to solve the largest instance their method can scale to; the ladder exists to reward scalability, from a single corridor up to the full ≈11.3 M-observation network and on to vehicle-trajectory resolution (scored per §7.9).

Instance	Network / corridors	Scale (data points)
A-L0 (Practice)	A single freeway segment (e.g. an I-5 N sub-corridor); for code/validator testing.	~5 detectors, 1 day (~1.4k records)
A-L1 (Corridor)	One freeway × direction (e.g. I-210 E, 18 detectors; or I-5 N, 15) with speed/flow/occupancy, counts, base OD, and labeled congestion episodes.	15–18 detectors × a year of 5-min data (~1.0–1.3 M records)
A-L2 (Regional)	The full LA District 7 freeway network — five freeways (I-5/10/110/210/405) × both directions = 10 corridors, 3–6 lane strata, ramps, OD zones. Primary CA ranking instance.	150 detectors, ≈11.3 M observations (150 × 75,166)
A-L3 (Microscopic)	A vehicle-trajectory corridor (NGSIM US-101/I-80; the I-95 trajectory testbed) converted to GMNS — vehicle-resolution wave tracking and queue localization.	millions of (vehicle, t, x) points, sub-second
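
The scale column can be sanity-checked with a few lines of arithmetic. This sketch assumes every detector contributes the same 75,166 five-minute intervals quoted for A-L2; the corridor figures are approximate under that assumption and are not organizer-published counts.

```python
# Per-detector interval count quoted in the A-L2 row.
INTERVALS_PER_DETECTOR = 75_166

# A-L0: about 5 detectors for one day of 5-minute records.
practice_records = 5 * (24 * 60 // 5)

# A-L1: 15 to 18 detectors, assuming the same interval count per detector.
corridor_low = 15 * INTERVALS_PER_DETECTOR
corridor_high = 18 * INTERVALS_PER_DETECTOR

# A-L2: the full 150-detector network.
regional_records = 150 * INTERVALS_PER_DETECTOR

print(practice_records, corridor_low, corridor_high, regional_records)
```

Under this assumption the corridor instance lands at roughly 1.1 to 1.35 million records and the regional instance at about 11.3 million, in line with the table.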

6.2 Region B — Shenzhen (prospective, subject to privacy review)

The following is an indicative design should the partner data be cleared for release; none of it is committed, and the figures are illustrative rather than guaranteed.

Instance	Description	Scale (if released)
B-L0 (Practice)	Small anonymized section of a Shenzhen corridor with section flow/speed and sample OD.	small section, 1 day
B-L1 (Corridor)	A Shenzhen expressway corridor: section flow/speed, anonymized OD, trip paths, trajectories.	≈1 corridor, ≈2 weeks
B-L2 (Regional subnetwork)	Broader Shenzhen expressway subnetwork where data permit; anonymized OD matrices and section counts.	multi-corridor (if released)

6.3 Cross-Region Transfer Track (T)

Train/calibrate on one region, evaluate on the other, with no target-region labels (C10). Transfer is scored on a fixed harmonized common-field set (speed km/h, flow veh/h, congestion under one θ, episodes); units, time zones, and facility vocabularies are crosswalked in the datasheet.

Reference-consistency instances (optional, multi-source). Two public datasets serve as small “golden” instances for certifying multi-source consistency (§A.8) rather than for ranking.

Mobile Century (UC Berkeley, I-880, 2008; Herrera et al. 2010) provides loop detectors, GPS probes, ramps, and path travel times over one corridor — the cleanest test that a method fuses heterogeneous data consistently.

NGSIM trajectories, converted to the GMNS network frame, give a vehicle-level ground-truth field against which the trajectory→(k,q) mapping can be certified (line integrals vs. Edie's definitions) before such data are used as attribution ground truth.

Both feed the congestion-event kernel directly: NGSIM gives a vehicle-resolution queue front for validating Level-3 shockwave tracking and the T0/T2/T3 episode boundaries, while Mobile Century checks that the same kernel yields consistent onset, duration, and v_t2 when computed from fused loop-plus-probe speeds rather than loops alone.

Instance	Train on	Evaluate on
T1 — CA → SZ	Region A (A-L1/A-L2)	Region B (B-L1/B-L2)
T2 — SZ → CA	Region B (B-L1/B-L2)	Region A (A-L1/A-L2)

6.4 Holdout design and anti-leakage

Three disjoint windows per region: train, val (next block), test (sealed; released only at scoring).
1a estimation: a fixed spatial mask of sensor_ids (30/50/70%) identical across train/test; held-out rows are removed from released files and audited.
1b prediction: the test window is strictly after train; no shuffling across the t0 boundary.
C9 firewall: trajectory speed/travel-time and OD avg_travel_time are withheld for held-out links/times; a spot-check audit rejects submissions whose held-out accuracy is explainable by a field that should have been masked.
Public-validation labels are a small, disjoint subset clearly distinguished from the hidden test set; the public/hidden relationship is published to discourage public-val overfitting.
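
A minimal sketch of the two holdout mechanisms, for teams that want to build a local validation split mirroring the rules above. The function names are hypothetical, and the seeded random draw is an assumption: the organizers' actual mask is fixed and released with the data, and may be chosen differently (for example, as a spatial neighborhood).

```python
import numpy as np

def spatial_holdout(sensor_ids, fraction, seed=0):
    """Pick a reproducible set of held-out sensor_ids (1a estimation)."""
    ids = np.array(sorted(sensor_ids))
    n_hidden = int(round(fraction * len(ids)))
    rng = np.random.default_rng(seed)
    return set(rng.choice(ids, size=n_hidden, replace=False).tolist())

def temporal_split(timestamps, t0):
    """Split strictly at t0 with no shuffling (1b prediction)."""
    ts = np.asarray(timestamps, dtype="datetime64[m]")
    boundary = np.datetime64(t0, "m")
    return ts < boundary, ts >= boundary

# Illustrative sensor names and timestamps, not real identifiers.
sensors = [f"S{i:03d}" for i in range(10)]
hidden = spatial_holdout(sensors, 0.30, seed=42)
visible = [s for s in sensors if s not in hidden]

stamps = ["2025-03-01T00:00", "2025-03-01T00:05", "2025-03-01T00:10"]
train_mask, test_mask = temporal_split(stamps, "2025-03-01T00:05")
```

Reusing the same seed yields the same hidden set, which is the property that keeps the mask identical between training and testing.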

6.5 Observation scenarios (core vs. optional)

To keep the contest approachable, only two core scenarios define the launch — full_observation and sparse_sensor — and the Open Division runs entirely on these. The others are optional stress tests the Expert Division and later phases may enable; participants are never required to handle all of them.

Scenario	Tier	Description
full_observation	Required	Most sensor/count fields provided.
sparse_sensor	Required	A fixed spatial neighborhood of sensors is hidden (estimation, 1a).
missing_count	Optional	Some count records withheld.
partial_od	Optional	Some OD records aggregated/hidden (only base-level OD ever exposed).
stress_demand	Optional	Demand scaled up (demand_multiplier > 1).
transfer_test	Optional	Model trained in one region, evaluated in the other (§6.3).

7. Evaluation Criteria

Two phases: (1) format/feasibility validation, then (2) score-based ranking. All metrics are computed by released, open-source scripts so any score is independently reproducible. Estimation/prediction errors are computed on held-out cells only, indexed by the held-out set H.

7.1 Format and feasibility validation

The validator checks required fields, valid identifiers, numeric ranges, nonnegativity, UTC timestamps aligned to intervals, the C5 physical identities, and the C9/C10 leakage and transfer-honesty rules. Failing submissions are rejected or penalized.

7.2 TrafficStateBench metrics

Estimation (1a) and prediction (1b), all on held-out cells:

masked RMSE = sqrt( Σ_H (ŷ−y)² / |H| ); congestion-weighted MAE = Σ_H w|ŷ−y| / Σ_H w, w = 1 + λ·1[v < θ·v_f]

masked speed/volume RMSE and MAE; congestion-weighted MAE (λ default 3);
prediction (1b) reported horizon-stratified at +15/+30/+60 min;
congestion_probability scored by Brier score and log-loss, with a 10-bin reliability/ECE check; ranked on Brier, required skill-positive vs. the climatological base rate;
physical-residual score S_phys = mean over cells of |∂k/∂t + ∂q/∂x| (finite differences on the §A.6 chain) — lower is better, rewarding conservation-consistent fields.
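
The masked RMSE, congestion-weighted MAE, and Brier score defined above can be sketched in a few lines of numpy. This is an illustration under stated assumptions, not the released scoring script: the function names are hypothetical, held_out is a boolean mask standing in for the set H, and theta is passed explicitly because its value comes from setting.csv.

```python
import numpy as np

def masked_rmse(y_hat, y, held_out):
    """sqrt( sum_H (y_hat - y)^2 / |H| ) over held-out cells only."""
    h = np.asarray(held_out, bool)
    err = (np.asarray(y_hat, float) - np.asarray(y, float))[h]
    return float(np.sqrt(np.mean(err ** 2)))

def congestion_weighted_mae(y_hat, y, v, v_f, theta, held_out, lam=3.0):
    """Weighted MAE with w = 1 + lam * 1[v < theta * v_f]."""
    h = np.asarray(held_out, bool)
    abs_err = np.abs(np.asarray(y_hat, float) - np.asarray(y, float))[h]
    congested = (np.asarray(v, float) < theta * np.asarray(v_f, float))[h]
    w = 1.0 + lam * congested
    return float(np.sum(w * abs_err) / np.sum(w))

def brier_score(p, event):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    p = np.asarray(p, float)
    event = np.asarray(event, float)
    return float(np.mean((p - event) ** 2))

# Toy speeds in km/h; the third cell is observed, so it is not scored.
y = [60.0, 30.0, 50.0, 20.0]
y_hat = [58.0, 36.0, 50.0, 25.0]
H = [True, True, False, True]

rmse = masked_rmse(y_hat, y, H)
wmae = congestion_weighted_mae(y_hat, y, v=y, v_f=100.0, theta=0.5, held_out=H)
brier = brier_score([0.9, 0.2], [1, 0])
```

In the toy data the two congested cells carry weight 4 and the free-flow cell weight 1, so errors made during congestion dominate the weighted MAE.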

FD recovery (S_FD), credited per link against the reference FD:

7.3 ODMEBench metrics (anti-gaming)

Scored over a fixed, organizer-defined OD set (not the team's submitted subset), volume-weighted so high-flow corridors dominate and teams cannot drop hard ODs:

S_OD-volume = 1 − Σ w_od·|ĝ−g*| / Σ w_od·g*, w_od ∝ g*; plus a top-quartile-by-volume subset scored alone.

S_link-count: residual between assigned and observed link counts after the shipped loading operator (§3.2.1) — the primary, hard target, identical for all teams;
reasonable deviation from the open-data base OD g_base (the estimate must not drift arbitrarily far from the prior);
average travel distance and trip-length-distribution match (EMD/KS between assigned trip-length histograms) — ensures realistic through-traffic;
temporal-profile correlation and OD-matrix structural similarity (cosine / SSIM on the zone×zone aggregate);
overall spatial-pattern consistency across the network.
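
The S_OD-volume formula above can be sketched as follows. The function names are hypothetical, and reading "top quartile by volume" as the OD pairs at or above the 75th percentile of g* is an assumption; the released scoring script defines the exact subset.

```python
import numpy as np

def s_od_volume(g_hat, g_ref):
    """1 - sum(w * |g_hat - g*|) / sum(w * g*), with w proportional to g*."""
    g_hat = np.asarray(g_hat, float)
    g_ref = np.asarray(g_ref, float)
    w = g_ref / g_ref.sum()
    return float(1.0 - np.sum(w * np.abs(g_hat - g_ref)) / np.sum(w * g_ref))

def s_od_top_quartile(g_hat, g_ref):
    """Same score restricted to the highest-volume OD pairs (assumed cutoff)."""
    g_hat = np.asarray(g_hat, float)
    g_ref = np.asarray(g_ref, float)
    top = g_ref >= np.quantile(g_ref, 0.75)
    return s_od_volume(g_hat[top], g_ref[top])

# Three OD pairs on the fixed organizer set; volumes are invented.
g_ref = [100.0, 50.0, 10.0]
g_hat = [90.0, 60.0, 0.0]

score_all = s_od_volume(g_hat, g_ref)
score_top = s_od_top_quartile(g_hat, g_ref)
```

Because the weights are proportional to g*, zeroing out the 10-trip OD pair costs far less than the same absolute error on the 100-trip pair.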

7.4 ShockwaveBench and demand-to-congestion attribution

S_Congestion = 0.25·F1_event + 0.20·IoU_space-time + 0.20·S_duration + 0.15·S_throughput + 0.10·S_propagation + 0.10·S_recurrence

onset/recovery detection F1 with a recall floor and per-region episode-coverage requirement (no dropping hard episodes);
space–time queue-region IoU over the fixed organizer episode set;
S_duration: error in congestion duration P = T3 − T0, with onset T0, lowest-speed time T2, and recovery T3 produced by the §3.3.1 cutoff rule (also scoring v_t2 and max-queue-extent error);
S_throughput: error in congested-period throughput N_cong = Σ q(ℓ*,t)·Δt over the active window [T0, T3], scored separately for AM and PM peaks;
S_propagation: propagation_speed error against u_shock (§A.4); bottleneck location (§A.5) and affected upstream links enter the Level-2 queue-region IoU term;
S_recurrence: agreement of the recurring / non-recurring / mixed label against the historical recurrence profile R(ℓ,τ) (§3.3.1); an optional incident feed may sharpen the non-recurring class but is not required;
attribution (Track 2 × 3): S_attribution = 1 − ½‖c_team − c*‖₁ against the reference share c* (§3.2.3).
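
The composite score, the congested-period throughput, and the attribution score above reduce to short computations. The sketch below assumes each component score has already been normalized to [0, 1]; the function and key names are hypothetical, and the sample numbers are invented.

```python
import numpy as np

# Weights copied from the S_Congestion formula; they sum to 1.
WEIGHTS = {
    "F1_event": 0.25, "IoU_space_time": 0.20, "S_duration": 0.20,
    "S_throughput": 0.15, "S_propagation": 0.10, "S_recurrence": 0.10,
}

def s_congestion(components):
    """Weighted sum of the six component scores."""
    return float(sum(WEIGHTS[name] * components[name] for name in WEIGHTS))

def congested_throughput(q_vph, dt_hours):
    """N_cong = sum of q * dt over the active window [T0, T3], in vehicles."""
    return float(np.sum(np.asarray(q_vph, float)) * dt_hours)

def s_attribution(c_team, c_ref):
    """1 - 0.5 * L1 distance between two share vectors that each sum to 1."""
    diff = np.asarray(c_team, float) - np.asarray(c_ref, float)
    return float(1.0 - 0.5 * np.sum(np.abs(diff)))

perfect = s_congestion({name: 1.0 for name in WEIGHTS})
n_cong = congested_throughput([1800.0, 1600.0, 1700.0], dt_hours=5 / 60)
attribution = s_attribution([0.5, 0.3, 0.2], [0.6, 0.3, 0.1])
```

With shares that sum to 1, half the L1 distance is the total-variation distance, so S_attribution runs from 0 (disjoint attribution) to 1 (identical shares).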

7.5 Cross-region transfer score (guarded)

Transfer rewards performance retention, with guards against the unstable-ratio and under-tuning failure modes:

TransferScore = clip( M(target | source) / max(M(target | in-region), M_floor), 0, 1.2 )

Here M is the chosen track metric — normalized to [0,1], higher-is-better (e.g. 1 − normalized RMSE for Track 1) — and is distinct from the held-out cell set H of §7.2; M_floor is a small positive constant that prevents division by a near-zero denominator.

A team is transfer-eligible only if its in-region M exceeds a no-skill baseline (historical-mean/persistence); the in-region model must be the team's best submitted model (removing the incentive to under-tune the denominator); and absolute in-region and transfer scores are reported alongside the ratio.
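
A minimal sketch of the guarded ratio and the eligibility test described above. The value 0.05 for M_floor is an assumption chosen for illustration (the text specifies only a small positive constant), and the function names are hypothetical.

```python
def transfer_score(m_transfer, m_in_region, m_floor=0.05):
    """clip( M(target | source) / max(M(target | in-region), M_floor), 0, 1.2 )."""
    ratio = m_transfer / max(m_in_region, m_floor)
    return min(max(ratio, 0.0), 1.2)

def transfer_eligible(m_in_region, m_baseline):
    """Eligible only if the in-region metric beats the no-skill baseline."""
    return m_in_region > m_baseline

retained = transfer_score(0.60, 0.80)   # retains 75% of in-region skill
capped = transfer_score(0.90, 0.50)     # raw ratio 1.8 is clipped to 1.2
guarded = transfer_score(0.30, 0.0)     # floor avoids division by zero
```

The upper clip at 1.2 limits the reward for a weak in-region denominator, which complements the rule that the in-region model must be the team's best submission.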

7.6 Reproducibility component (objective, auto-checkable)

The 15% reproducibility term is restricted to machine-verifiable items; subjective quality is a finalist/prize gate, not a leaderboard term.

Item	Check
Validator pass	Submission passes format/feasibility (§7.1) — automatic.
Container + run.sh	A pinned Docker/Apptainer image (digest) and a single run.sh reproduce all reported numbers.
Seeds + environment	Fixed RNG seeds (torch/numpy/python, cudnn.deterministic) and a hash-pinned env are declared; ≥3-seed mean±std reported.
Re-run within tolerance	Organizers re-run on fixed published hardware (e.g., 1×A100 40GB, 24h cap); “reproduced” = headline metrics within ±2% relative (or ±1 F1 point).
Artifact badges	ACM-style Available / Functional / Reproduced badges; NeurIPS-style reproducibility checklist at submission.
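
The seed-reporting and re-run tolerance items in the table can be checked mechanically. The sketch below is one possible reading, not the organizers' script: it treats "±1 F1 point" as an absolute difference of 0.01 on a 0 to 1 scale, which is an assumption, and the function names are hypothetical.

```python
import statistics

def seed_summary(scores):
    """Mean and sample standard deviation over at least three seeds."""
    if len(scores) < 3:
        raise ValueError("at least 3 seeds are required")
    return statistics.mean(scores), statistics.stdev(scores)

def within_tolerance(reported, rerun, is_f1=False):
    """True if the re-run is within 2% relative, or 0.01 absolute for F1."""
    gap = abs(rerun - reported)
    if is_f1:
        return gap <= 0.01 + 1e-12  # small slack for float rounding
    return gap <= 0.02 * abs(reported)

mean_f1, std_f1 = seed_summary([0.80, 0.82, 0.81])
rmse_ok = within_tolerance(4.00, 4.07)           # 1.75% off
rmse_bad = within_tolerance(4.00, 4.10)          # 2.5% off
f1_ok = within_tolerance(0.80, 0.805, is_f1=True)
```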

7.7 Ablation and component analysis (required for finalists)

Because the dominant methods (ST-GNNs, transformers, PINNs, low-rank/tensor models) are composed of separable components, raw leaderboard rank reveals little about why a method works. Finalists must report a small, standardized ablation study so component contributions are comparable across submissions. At minimum, a finalist reports the change in the relevant track metric when each applicable component is removed or replaced:

Physics term — remove the conservation/LWR residual or FD closure and re-measure S_phys and accuracy;
Spatial structure — replace the graph/attention adjacency with an identity (no spatial coupling) to isolate the ST-GNN contribution;
Temporal context — shorten or remove the historical lookback window;
Data modality — drop one input source (counts, probe trajectories, or seed OD) to quantify its marginal value;
Low-rank/tensor prior — vary the rank or remove the low-rank regularizer (tensor/matrix-factorization methods);
Prediction horizon — report the metric stratified at +15/+30/+60 min (1b) to expose horizon sensitivity.

Ablations are reported in the finalist technical report and reproduced from the submitted code; they inform the methodology gate (§7.8) and a “best component analysis” recognition, but do not by themselves reorder the leaderboard.

7.8 Methodology and report (finalist/prize gate)

Finalist reports are assessed for method clarity, computational efficiency, data-ethics compliance, interpretability, and transportation-domain insight, including a model card (intended use, limitations, failure modes). This gate governs prizes and does not reorder the leaderboard.

7.9 Scalability and compute

A big-data benchmark must reward methods that hold at network scale, not only on a single corridor. Beyond the per-track accuracy scores, each submission earns a scalability credit for the largest instance it solves within the declared compute budget: the full 150-detector LA District 7 network (≈11.3 M observations, A-L2) and the vehicle-trajectory instance (A-L3) carry progressively more credit than the practice and single-corridor instances.

Teams report runtime, peak memory, and hardware in diagnostics; the organizers verify the claimed scale during finalist reproduction (§7.6), and a method's wall-clock growth from A-L1 → A-L2 is reported as an empirical scaling curve.

Scalability is maintained as its own leaderboard axis and carries a dedicated award (mirroring the RAS competition's scalability criterion); on the combined leaderboard it breaks ties, so an equally accurate method demonstrated at full network scale ranks above one shown only on a small corridor.

8. Provided Files, Baselines, and Tooling

For each region the package includes:

input files (node, link, sensor_readings, traffic_counts, base_od, trajectory_sample, setting), the route set, initial/boundary conditions, and — for Expert-Division instances — the congestion-event label files (congestion_episode_label, congested_throughput, recurrence_profile, base_path_flow; incident_event where available, §4.8);
the frozen assignment/loading operator loading.py and a sample solution_result.json;
the validator (validator.py / fast_validator.py) enforcing C1–C10;

a complete baseline suite: (i) historical-average / day-of-week mean; (ii) kNN + graph spatial interpolation; (iii) ST-GNN baselines (DCRNN and Graph WaveNet); (iv) a PINN baseline (LWR residual + FD closure) for 1a and 1b; (v) a data-assimilation baseline (Cell-Transmission Model + Extended/Ensemble Kalman Filter); (vi) a computational-graph ODME baseline against the shipped operator;
scoring scripts for every metric in §7, including the transfer score and the attribution score;
visualization scripts (speed contours, queue propagation, OD matrices, FD scatter with fitted branches);
a pinned container and environment lock so baselines and scores reproduce bit-for-bit.

8.1 Methodological landscape and baseline-to-literature mapping

The shipped baselines span the major method families in the traffic-state estimation and prediction literature, so every entrant has a credible same-data yardstick from their own school of methods.

The table maps each family to representative references, the baseline that implements it, and the track it primarily targets.

The computational-graph / differentiable-assignment lineage (Wu et al. 2018; Ma et al. 2020; Lu et al. 2023; Kim et al. 2024) is the source of the shipped assignment operator and the PINN/ODME baselines, connecting demand estimation and state estimation under one differentiable graph.

Method family	Representative references	Shipped baseline	Track
Statistical / classical	ARIMA (Williams & Hoel 2003); historical average; Kalman filter	HA / day-of-week mean	1a/1b
Classical ML	SVR, kNN regression	kNN + graph interpolation	1a
Spatiotemporal GNN	DCRNN (Li et al. 2018); STGCN (Yu et al. 2018); Graph WaveNet (Wu et al. 2019); AGCRN (Bai et al. 2020)	DCRNN, Graph WaveNet	1a/1b
Transformer / attention	ASTGCN (Guo et al. 2019); STAEformer reference	optional	1b
Physics-informed (PINN)	physics-regularized DL (Shi et al. 2021); computational-graph PINN (Lu et al. 2023)	LWR-residual PINN	1a/1b
Low-rank / tensor	Bayesian temporal factorization (Chen & Sun 2019, 2021); LATC	low-rank completion	1a
Data assimilation	CTM + Kalman/EnKF (Work et al. 2010)	CTM + EnKF	1a/1b
Computational-graph ODME	Wu et al. 2018; Ma et al. 2020; Kim et al. 2024	comp.-graph ODME	2

8.2 Benchmark output format and starter teaching guide

Submissions are scored through a shipped network-loading operator that maps demand to link flows by the standard OD → path → link assignment (link_flow = A_link-path · (P_route · f_OD)), so a submitted OD matrix is loaded the same way for every team. Fundamental-diagram parameters and traffic-state estimates are reported per link in that same layered layout, keeping demand, supply, and state in one consistent structure rather than three disconnected files. This is a format-and-scoring convention, not a required method: teams may use any model and only need to emit the standard fields.
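
The OD → path → link mapping can be illustrated on a toy network. This sketch is not the shipped loading.py: the matrices, their shapes, and the variable names are assumptions chosen to mirror link_flow = A_link-path · (P_route · f_OD), with P_route holding fixed route-choice proportions.

```python
import numpy as np

# Two OD pairs with invented demand in vehicles per interval.
f_od = np.array([1000.0, 500.0])

# P_route (paths x ODs): OD 0 splits 60/40 over paths 0 and 1; OD 1 uses path 2.
P_route = np.array([
    [0.6, 0.0],
    [0.4, 0.0],
    [0.0, 1.0],
])

# A_link_path (links x paths): entry is 1 where the path traverses the link.
A_link_path = np.array([
    [1, 1, 0],   # link 0 is on paths 0 and 1
    [1, 0, 0],   # link 1 is on path 0
    [0, 1, 1],   # link 2 is on paths 1 and 2
    [0, 0, 1],   # link 3 is on path 2
], dtype=float)

path_flow = P_route @ f_od          # OD -> path
link_flow = A_link_path @ path_flow  # path -> link
```

Each column of P_route sums to 1, so path flows conserve the OD totals; link 2 carries 900 vehicles because it is shared by paths from both OD pairs.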

To lower the entry barrier for big-data teams new to transportation, the package ships an illustrative, numpy-only starter teaching guide (under 150 lines) that walks a participant from raw CSVs to a valid solution_result.json. In one script it demonstrates (i) OD→path→link loading on a toy network with OD-to-congestion attribution; (ii) triangular-FD recovery from noisy detector points, reproducing S_FD ≈ 0.99 on synthetic data; and (iii) masked traffic-state estimation via the q = k·v identity, which makes explicit why density (not flow alone) resolves the FD branch, the exact place where PINN and data-assimilation methods add value.

9. Reproducibility, Data Ethics, Licensing, and Sustainability

9.1 Data ethics and privacy

k-anonymity k ≥ 10 for any released OD/trajectory aggregate; cells below the threshold are suppressed or merged; any released OD aggregate is reported at zone level with ≥ 15-min bins;
C9 prohibits external-data joins that could re-identify (no linking to external GPS, registration, or social data);
a click-through Data Use Agreement / EULA gates download: benchmark-use only, no redistribution, no re-identification, deletion on request, mirroring embargo;
an IRB/ethics determination (or exemption) and SUTPC legal clearance would be cited for any Shenzhen data; a takedown procedure is published.
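
The k ≥ 10 threshold in the first item above can be illustrated with a simple suppression pass. This sketch assumes each cell's count is the number of contributing records and shows suppression only, not merging; the function name and the sample cells are hypothetical.

```python
def suppress_small_cells(cell_counts, k=10):
    """Split cells into those released (count >= k) and those suppressed."""
    released = {cell: n for cell, n in cell_counts.items() if n >= k}
    suppressed = sorted(cell for cell, n in cell_counts.items() if n < k)
    return released, suppressed

# Keys are (origin_zone, destination_zone, 15-min departure bin).
counts = {
    ("Z1", "Z2", "08:00"): 42,
    ("Z1", "Z3", "08:00"): 7,
    ("Z2", "Z3", "08:15"): 10,
}

released, suppressed = suppress_small_cells(counts)
```

A cell with exactly 10 contributors is kept, since the threshold is inclusive; the 7-record cell is withheld.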

9.2 Licensing, DOI, and datasheet

Benchmark data are released under CC BY 4.0 (CC BY-NC 4.0 where SUTPC requires); baselines, validator, loading operator, and scoring under Apache-2.0. A Zenodo DOI with semantic versioning (v1.0.0 frozen at launch) and a required BibTeX citation are conditions of use. A datasheet for datasets (Gebru et al. 2021) and per-region data cards (motivation, composition, collection, anonymization, known biases) accompany the release, satisfying FAIR concretely.

9.3 Sustainability

A named maintaining organization and contact owner host a permanent post-2026 leaderboard (e.g., Codabench/EvalAI) with an evergreen rolling-submission mode; a versioning and deprecation policy is published; each version is pinned as a GitHub release plus a Zenodo snapshot; and the SUTPC data-partnership renewal terms are documented so Region B remains available.

9.4 Competition logistics

Eligibility is open to academic and industry teams worldwide; a team may enter any subset of tracks/regions.

A compute cap (declared per submission; finalist re-runs bounded by the §7.6 hardware/time budget) keeps the contest fair across resource levels. Per-region and per-track prizes ensure single-region teams can win.

The tentative schedule below is aligned to IEEE Big Data 2026 (Phoenix, AZ, USA; December 14–17, 2026); exact dates are confirmed in the official call.

Milestone	Tentative date
Registration opens	July 15, 2026
Practice data + starter kit release (A-L0 / A-L1)	August 1, 2026
Full data release; public-validation phase opens	August 17, 2026
Public-validation phase (rolling, public leaderboard)	August–October 2026
Final submission deadline (sealed test set)	November 6, 2026
Finalist reproduction window (§7.6)	November 9–27, 2026
Winners notified	December 1, 2026
Results workshop & awards (IEEE Big Data 2026, Phoenix, AZ)	December 14–17, 2026

Dates are tentative and aligned to the IEEE Big Data 2026 calendar; the public-validation and sealed-test phases run on a hosted, evergreen leaderboard (§9.6), which remains open for rolling post-event submissions.

9.5 Prizes

Cash prizes and certificates are awarded per division and track:

Gold Prize: $1,500 + Certificate of Excellence.
Silver Prize: $1,000 + Certificate of Excellence.
Bronze Prize: $500 + Certificate of Recognition.
Student / Newcomer Prize: $500 + Certificate of Recognition (best Open-Division entry from a student or first-time team).

A dedicated scalability award (§7.9) recognizes the method demonstrated at the largest scale, and per-division, per-track, and per-region awards ensure that Open-Division and single-region teams can win.

Final cash amounts are confirmed within the competition's allocated sponsorship budget (§9.9).

9.6 Distribution plan

The benchmark is distributed through open, persistent channels.

Code (validator, baselines, the loading operator, and the starter kit) is released on GitHub under Apache-2.0; each version is snapshotted to a citable Zenodo DOI for FAIR archival. The public California (PeMS+OSM) package is released under CC BY 4.0.

The contest runs on a hosted, evergreen leaderboard (Codabench/EvalAI) with a public practice phase and a sealed test phase, so submissions can be re-scored after the event.

Any partner-contributed data (e.g. a future Shenzhen release) are gated behind a click-through Data Use Agreement and published only after privacy clearance, with a datasheet and per-region data cards.

A permanent mirror (GitHub release + Zenodo) protects against URL rot.

9.7 Promotion plan

The competition is announced in the IEEE Big Data Cup track and promoted through the IEEE ITS Society Technical Committee on Travel Information and Traffic Management, the RERITE working group, and standard community channels (INFORMS lists, transportation and machine-learning mailing lists, and the organizers' networks).

To grow participation and lower the entry barrier, the package ships a numpy-only starter teaching guide and baseline notebooks, with a kickoff webinar/tutorial and worked examples; the contest is cross-posted on Kaggle/Codabench-style platforms with student-friendly practice instances and per-track/per-region prizes so single-region teams can win.

Sponsor and data-partner outreach supports prizes and travel grants. A results workshop at the host venue, an open-access summary, and an invited journal special issue or RERITE report disseminate the winning methods and keep the benchmark a living, cited resource beyond 2026.

9.8 Organizing committee and governance

The competition is co-organized by the Shenzhen Urban Transport Planning Center Co., Ltd. (SUTPC) and the IEEE ITS Society Technical Committee on Travel Information and Traffic Management, with contributions from the RERITE Working Group (Reproducible Research in Transportation Engineering).

Competition chair (organizer). Dr. Xuesong (Simon) Zhou serves as co-Chair of the IEEE ITS Society Technical Committee on Travel Information and Traffic Management and organizes this competition under the committee’s endorsement and on its behalf. He currently serves as Chair of the Transportation Research Board Standing Committee on Transportation Network Analysis (AEP13), as a Senior Associate Editor of Transportation Research Part B: Methodological, and as an Associate Editor of Transportation Science; and is Vice-Chair and Major Event Coordinator of the RERITE Working Group (https://reriteworkinggroup.github.io/RERITE_website/).

Co-chair (tentative). Dr. Xiaochun Zhang, Chief Scientist of SUTPC, is expected to serve as co-chair, anchoring the data partnership and the prospective cross-region (Shenzhen) track. This appointment is tentative pending SUTPC confirmation.

Governance and endorsement. The IEEE ITS Society Technical Committee provides community oversight and endorsement; SUTPC provides organizing and data-partnership support; and the RERITE Working Group provides the reproducibility review underpinning the three-level reproducibility design (§1.3). This governance keeps the benchmark maintained as community infrastructure beyond 2026 (§9.3), independent of any single contributor.

International co-organizers (technical committee, tentative). The organizing technical committee includes international members from leading institutions, tentatively Dr. Cathy Wu (Massachusetts Institute of Technology) and Dr. Yudai Honma (The University of Tokyo), extending the benchmark’s reach across the AI-for-mobility and transportation-science communities in North America and Asia.

Track record and expected participation. The organizers previously ran the INFORMS RAS competition, which drew 50+ teams on Kaggle; the L0–L3 instance ladder used here mirrors that proven design. Building on that base, and on the Open Division’s afternoon-entry path and student/newcomer prizes, the organizers target 100+ registered teams, including first-time and non-transportation machine-learning entrants.

9.9 Sponsorship

A Gold-level sponsorship of US$15,000 has been confirmed for the host conference, of which approximately US$3,500 is allocated to this competition’s cash prize pool (the Gold, Silver, and Bronze awards of §9.5), together with a US$500 student/newcomer prize and operating costs. Scalability recognition is certificate-based to broaden participation; additional sponsor and data-partner outreach (§9.7) may expand the pool and support travel grants.

10. Data Provenance and Limitation Disclaimer

The released data are research-derived and privacy-preserving.

Region A — California. Sensor observations derive from Caltrans PeMS; network connectivity/geometry from open-source references, principally OpenStreetMap, encoded in GMNS. OD demand is a probe-derived or synthetic proxy, not a measured record. Cite as research-derived data integrated from public sensor references (PeMS-style detector data) and open-source network references (OSM-style connectivity) — not a direct Caltrans or OSM extract.

Region B — Shenzhen (prospective). Any possible Shenzhen contribution from SUTPC remains subject to privacy and legal review, and might be shared only in limited, anonymized, aggregated form, or deferred, or not shared at all. The core competition does not depend on it. If shared, the records would not be raw personally identifiable data: only selected fields would be published after SUTPC review, schema harmonization, anonymization, and legal clearance, behind a Data Use Agreement.

Participants treat all data as benchmark data for algorithmic evaluation only and must not attempt to identify individuals, vehicles, accounts, or devices. Cite the public portion as research-derived data integrating public sensor references (PeMS-style) and open-source network references (OSM-style) — not a direct release of any agency's raw operational database.

11. Expected Impact

TrafficFlowBench is designed as durable, reproducible infrastructure for transportation big data, physically informed machine learning, OD estimation, and congestion analytics. By grounding every scored quantity in traffic-flow theory (FD, LWR, Rankine–Hugoniot, Newell queueing), by unifying state, demand, and congestion through a shipped assignment operator, and by measuring whether results replicate across corridors — and, if a second region is released, across cities — it moves the field from single-city case studies toward shared, transferable, physically interpretable evaluation — supporting planning, traffic management, toll-corridor operations, digital-twin calibration, and connected/automated mobility research, in the spirit of RERITE open science.

12. References and Resources

Lighthill MJ, Whitham GB (1955) On kinematic waves II: A theory of traffic flow on long crowded roads. Proc. R. Soc. A 229:317–345.
Richards PI (1956) Shock waves on the highway. Operations Research 4(1):42–51.
Newell GF (1993) A simplified theory of kinematic waves in highway traffic, Parts I–III. Transportation Research Part B 27(4):281–313.
Daganzo CF (1994) The cell transmission model. Transportation Research Part B 28(4):269–287.
Cheng Q, Liu Z, Lin Y, Zhou X (2021) An s-shaped three-parameter (S3) traffic stream model with consistent car-following relationship. Transportation Research Part B 153:246–271.
Cheng Q, Liu Z, Guo J, Wu X, Pendyala R, Belezamo B, Zhou X (2022) Estimating key traffic state parameters through parsimonious spatial queue models. Transportation Research Part C 137:103596.
Seo T, Bayen AM, Kusakabe T, Asakura Y (2017) Traffic state estimation on highway: A comprehensive survey. Annual Reviews in Control 43:128–151.
Li Y, Yu R, Shahabi C, Liu Y (2018) Diffusion convolutional recurrent neural network: Data-driven traffic forecasting (DCRNN). ICLR. (METR-LA, PEMS-BAY benchmarks.)
Yu B, Yin H, Zhu Z (2018) Spatio-temporal graph convolutional networks (STGCN). IJCAI.
Wu Z, Pan S, Long G, Jiang J, Zhang C (2019) Graph WaveNet for deep spatial-temporal graph modeling. IJCAI.
Guo S, Lin Y, Feng N, Song C, Wan H (2019) Attention-based spatial-temporal graph convolutional networks (ASTGCN). AAAI.
Bai L, Yao L, Li C, Wang X, Wang C (2020) Adaptive graph convolutional recurrent network (AGCRN). NeurIPS.
Tedjopurnomo DA, et al. (2020) A survey on modern deep neural network for traffic prediction. IEEE TKDE.
Jiang W, Luo J (2022) Graph neural network for traffic forecasting: A survey. Expert Systems with Applications 207:117921.
Chen X, Sun L (2021) Bayesian temporal factorization for multidimensional time series prediction. IEEE TPAMI.
Shi R, Mo Z, Di X (2021) Physics-informed deep learning for traffic state estimation. AAAI.
Zhang J, Mao S, Yang L, Ma W, Li S, Gao Z (2024) Physics-informed deep learning for traffic state estimation based on the traffic flow model and computational graph method. Information Fusion 101:101971.
Lu J, et al. (2023) Physics-informed neural networks on computational graphs for traffic state estimation. (Multi-source data as linear functionals of a shared state field.)
Herrera JC, Work DB, Herring R, Ban X, Jacobson Q, Bayen AM (2010) Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment. Transportation Research Part C 18(4):568–583.
U.S. FHWA, Next Generation Simulation (NGSIM) vehicle trajectory dataset. (Trajectory ground truth; convertible to GMNS.)
Wu X, Guo J, Xian K, Zhou X (2018) Hierarchical travel demand estimation: a forward-and-backward propagation framework on a layered computational graph. Transportation Research Part C 96:321–346.
Ma W, Pi X, Qian S (2020) Estimating multi-class dynamic OD demand through a forward-backward algorithm on computational graphs. Transportation Research Part C 119:102747.
Gebru T, et al. (2021) Datasheets for datasets. Communications of the ACM 64(12):86–92.
Caltrans Performance Measurement System (PeMS). https://pems.dot.ca.gov/
California Highway Patrol (CHP) Traffic Incident Information (optional non-recurring-congestion labels). https://cad.chp.ca.gov/
OpenStreetMap (OSM). https://www.openstreetmap.org/
General Modeling Network Specification (GMNS). https://github.com/zephyr-data-specs/GMNS
RERITE — Reproducible Research in Transportation Engineering. https://www.reproducibletransportation.org/
IEEE ITS Society Technical Committee on Travel Information and Traffic Management (co-organizer). https://ieee-itss.org/chapters-committees/traffic-travel-management/
TAP101 — Traffic Assignment resources. https://github.com/jdlph/TAP101

Any Shenzhen data support would be provided by SUTPC, subject to privacy and legal review.

Appendix A — Scientific Foundations (the physics behind the scored quantities)

Every quantity scored in this competition is defined here in flow-theoretic terms, so that ground truth is unambiguous and physically interpretable. Teams are not required to use these models, but the benchmark is generated and scored with respect to them, and methods that respect them are rewarded.

A.1 State variables and the q = k·v identity

Each link–time cell carries three coupled variables: flow q (veh/h), density k (veh/km), and space-mean speed v (km/h), related by the hydrodynamic identity

q = k · v.

Detectors report speed and flow directly; density is derived consistently as k = q / v where both are observed, or from occupancy as k = occupancy / g_eff, where g_eff (effective vehicle length, m; convert to km so that k is in veh/km) is a per-region constant released in setting.csv. This makes every (k, q) point, and therefore every fundamental-diagram fit and shockwave slope, reproducible from the released fields.
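As an illustration of the two density routes described above, the following minimal sketch derives k for one link–time cell. The function name, the argument names, and the treatment of occupancy as a fraction between 0 and 1 are assumptions made for this example; they are not part of the released schema.

```python
from typing import Optional


def density_veh_per_km(flow_vph: Optional[float],
                       speed_kmh: Optional[float],
                       occupancy: Optional[float],
                       g_eff_m: float) -> Optional[float]:
    """Return density k in veh/km, or None when it cannot be derived."""
    # Preferred route: hydrodynamic identity k = q / v when both are observed.
    if flow_vph is not None and speed_kmh is not None and speed_kmh > 0:
        return flow_vph / speed_kmh
    # Fallback: occupancy (assumed to be a fraction 0..1) over effective length.
    # g_eff is given in metres, so it is converted to km to obtain veh/km.
    if occupancy is not None and g_eff_m > 0:
        return occupancy / (g_eff_m / 1000.0)
    return None


# 1800 veh/h at 60 km/h gives 30 veh/km.
print(density_veh_per_km(1800.0, 60.0, None, 6.0))
# 12% occupancy with a 6 m effective length gives about 20 veh/km.
print(density_veh_per_km(None, None, 0.12, 6.0))
```

The q / v route is tried first so that the derived density stays algebraically consistent with the reported speed and flow.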

A.2 The link fundamental diagram

Each link carries an equilibrium FD q = Q(k) with four interpretable parameters: free-flow speed v_f, capacity q_cap, critical density k_crit, and jam density k_jam, with q_cap = Q(k_crit). The reference form is the triangular (Newell) FD:

Q(k) = v_f · k for k ≤ k_crit (free-flow branch)

Q(k) = w · (k_jam − k) for k > k_crit (congested branch)

w = q_cap / (k_jam − k_crit) (backward kinematic-wave speed)

The free-flow branch has slope v_f > 0; the congested branch has slope dq/dk = −w < 0.

Baselines and the FD-recovery score also accept S3 (Cheng et al. 2021), Greenshields, Van Aerde, and logistic forms; the organizers provide a six-model calibration suite. The critical speed is v_crit = q_cap / k_crit.
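The triangular reference form can be written down in a few lines. The sketch below is illustrative only: the class name and the numeric parameters are hypothetical, and it uses the fact that continuity at k_crit gives q_cap = v_f · k_crit for this particular form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangularFD:
    v_f: float     # free-flow speed, km/h
    k_crit: float  # critical density, veh/km
    k_jam: float   # jam density, veh/km

    @property
    def q_cap(self) -> float:
        # Capacity is the flow at critical density on the free-flow branch.
        return self.v_f * self.k_crit

    @property
    def w(self) -> float:
        # Backward kinematic-wave speed: w = q_cap / (k_jam - k_crit).
        return self.q_cap / (self.k_jam - self.k_crit)

    def flow(self, k: float) -> float:
        # Free-flow branch for k <= k_crit, congested branch otherwise.
        if k <= self.k_crit:
            return self.v_f * k
        return max(0.0, self.w * (self.k_jam - k))


fd = TriangularFD(v_f=100.0, k_crit=20.0, k_jam=120.0)
print(fd.q_cap, fd.w)               # capacity (veh/h) and wave speed (km/h)
print(fd.flow(10.0), fd.flow(70.0))  # one point on each branch
```

With these illustrative parameters the two branches meet at (20 veh/km, 2000 veh/h) and the congested branch reaches zero flow at k_jam.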

A.3 Congestion regime and the congestion-event kernel

A link–time cell is congested when it lies on the congested branch of the FD — that is, k ≥ k_crit (equivalently v ≤ v_crit with dq/dk < 0). A speed-ratio proxy v/v_f ≤ θ (default θ = 0.7) is used as an operational label only after the regime test, because two cells with the same speed ratio can lie on opposite FD branches. Per-link v_crit is reported from the FD fit so the regime test is well-defined everywhere.

Operational episodes are produced by a deterministic congestion-event kernel (the rule scored in §3.3). The regime test above sets when a cell is congested; the kernel fixes the cutoff used to open and close an episode. The default link cutoff couples free-flow speed to the capacity-regime speed,

v_cut(ℓ) = min( 0.60 · v_f(ℓ), v_cap(ℓ) ),

v_cap(ℓ) = q_cap(ℓ) / k_crit(ℓ), which is the critical speed v_crit of §A.2,

with the congested indicator I(ℓ,t) = 1[ v(ℓ,t) ≤ v_cut(ℓ) ].

An episode opens at onset T0, the first time I stays 1 for at least m consecutive intervals (default m = 3, i.e. 15 min at 5-min resolution), reaches lowest speed v_t2 at T2 = argmin v(t), and closes at recovery T3, the first time after T2 that speed stays above v_cut for at least r consecutive intervals. Its duration is P = T3 − T0.

The cutoff ratio (default 0.60) and the persistence counts m, r are released in setting.csv; the organizers report a 0.55/0.60/0.65 sensitivity.
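The kernel can be sketched for a single link as follows. This is a minimal reading of the rule above, not the organizers' scoring code: the function names are hypothetical, the default r = 3 is an assumption (only m has a stated default), and times are returned as interval indices.

```python
from typing import Dict, List, Optional


def first_run(flags: List[bool], start: int, length: int) -> Optional[int]:
    """Start index of the first run of `length` True values at or after `start`."""
    run = 0
    for i in range(start, len(flags)):
        run = run + 1 if flags[i] else 0
        if run == length:
            return i - length + 1
    return None


def detect_episode(speeds: List[float], v_f: float, v_cap: float,
                   ratio: float = 0.60, m: int = 3, r: int = 3) -> Optional[Dict]:
    """Return the first congestion episode on a link, or None if there is none."""
    v_cut = min(ratio * v_f, v_cap)
    congested = [v <= v_cut for v in speeds]
    t0 = first_run(congested, 0, m)            # onset T0
    if t0 is None:
        return None
    recovered = [not c for c in congested]
    t3 = first_run(recovered, t0 + m, r)       # recovery T3 (None if still open)
    end = t3 if t3 is not None else len(speeds)
    t2 = min(range(t0, end), key=lambda i: speeds[i])  # lowest-speed interval T2
    duration = (t3 - t0) if t3 is not None else None   # P, in intervals
    return {"v_cut": v_cut, "T0": t0, "T2": t2, "T3": t3, "P": duration}


speeds = [100, 95, 55, 50, 40, 45, 58, 70, 90, 95, 100]  # km/h, 5-min intervals
print(detect_episode(speeds, v_f=100.0, v_cap=65.0))
```

In this toy series the cutoff is 60 km/h, the episode opens at interval 2, bottoms out at interval 4, and closes at interval 7, so P is 5 intervals (25 min at 5-min resolution).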

Congested-period throughput loads the bottleneck/discharge link ℓ* over the active window,

N_cong(e) = Σ_{ t ∈ [T0_e, T3_e] } q(ℓ*, t) · Δt,

and is reported for AM and PM peaks.
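A minimal sketch of the throughput sum, assuming flows in veh/h on the bottleneck/discharge link, interval indices for T0 and T3, and a closed window that includes both endpoints. The function name and the sample flows are hypothetical.

```python
from typing import List


def congested_throughput(flow_vph: List[float], t0: int, t3: int,
                         interval_min: float = 5.0) -> float:
    """N_cong = sum of q(l*, t) * dt over the closed window [T0, T3], in vehicles."""
    # Flows are hourly rates, so each interval contributes q * (interval / 60) vehicles.
    return sum(flow_vph[t0:t3 + 1]) * interval_min / 60.0


flows = [1200, 1500, 1800, 1800, 1740, 1680, 1800, 1500]  # veh/h per 5-min interval
print(congested_throughput(flows, t0=2, t3=5))
```

For this window the four hourly rates sum to 7020 veh/h, which corresponds to 585 vehicles over 20 minutes.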

The recurrence label uses the historical probability R(ℓ,τ) = P( I(ℓ,t)=1 | ℓ, time-of-day τ, day type ): recurring when high and stable across comparable days, non-recurring when rare under that profile or incident-associated, and mixed when a recurring bottleneck shows abnormal P or v_t2.
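The historical probability R(ℓ,τ) can be estimated empirically as the share of comparable days on which the indicator is 1 at each time-of-day slot. The sketch below does that for one link and one day type. The thresholds used to label a congested cell are hypothetical placeholders, the stability test across days is omitted, and the mixed label is not modelled because it also depends on P and v_t2.

```python
from typing import Dict, List


def recurrence_profile(indicator_by_day: Dict[str, List[int]]) -> List[float]:
    """Empirical P(I = 1) per time-of-day slot over comparable days."""
    days = list(indicator_by_day.values())
    n_slots = len(days[0])
    return [sum(day[s] for day in days) / len(days) for s in range(n_slots)]


def label_congested_cell(r_value: float, high: float = 0.6, low: float = 0.1) -> str:
    """Label a congested cell from R; `high` and `low` are assumed thresholds."""
    if r_value >= high:
        return "recurring"
    if r_value <= low:
        return "non-recurring"
    return "unclassified"


history = {
    "day1": [0, 1, 1],
    "day2": [0, 1, 0],
    "day3": [0, 1, 0],
    "day4": [0, 1, 0],
}
profile = recurrence_profile(history)
print(profile)
print([label_congested_cell(r) for r in profile])
```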

Duration, severity (v_t2), throughput (N_cong), and the recurrence label are exactly the quantities scored in §7.4, and feed the QVDF queueing parameters (P, v_t2, Q(t), μ) of §A.7.

A.4 Shockwaves: the Rankine–Hugoniot condition

A shockwave is the boundary between an upstream state (k_u, q_u) and a downstream state (k_d, q_d). Its propagation speed is the chord slope on the flow–density plane:

u_shock = (q_d − q_u) / (k_d − k_u) (km/h, signed; negative = upstream-propagating)

A queue forming behind a bottleneck propagates upstream at u_shock ≈ −w (typically −12 to −20 km/h). Submitted propagation_speed values are scored against u_shock computed from detector (k, q) states bracketing the front; |u_shock| > v_f is flagged infeasible.
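The chord-slope formula and the feasibility flag can be checked numerically with a short sketch. The function names and the two bracketing states are illustrative assumptions, not values from the released data.

```python
def shock_speed(k_u: float, q_u: float, k_d: float, q_d: float) -> float:
    """Rankine-Hugoniot speed (km/h) between upstream and downstream states."""
    if k_d == k_u:
        raise ValueError("equal densities: no shock front between the states")
    return (q_d - q_u) / (k_d - k_u)


def is_feasible(u_shock: float, v_f: float) -> bool:
    """A front cannot move faster than free-flow speed in either direction."""
    return abs(u_shock) <= v_f


# Upstream free-flow state (15 veh/km, 1500 veh/h) meets a queued state
# (80 veh/km, 800 veh/h) downstream.
u = shock_speed(k_u=15.0, q_u=1500.0, k_d=80.0, q_d=800.0)
print(round(u, 2), is_feasible(u, v_f=100.0))
```

The negative sign indicates that the front moves upstream, as expected for a queue growing behind a bottleneck.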

A.5 Active bottlenecks and queue extent

A link b is an active bottleneck over [t1, t2] if and only if all three hold:

sustained speed drop v_b ≤ θ·v_f for at least τ (default 15 min);
the immediately upstream link is congested (k ≥ k_crit); and
the immediately downstream link is free-flowing (v ≈ v_f, discharging near q_cap).

The maximum queue extent is the spatial length from b to the farthest contiguous upstream link with v ≤ θ·v_f, reported in km for cross-region comparability.

A capacity drop — discharge flow settling to roughly 0.85–0.95·q_cap after queue formation — is rewarded, not penalized, when reproduced.
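The three-part bottleneck test and the queue-extent measure can be sketched as below. Several details are assumptions made for illustration: the series cover only the candidate window [t1, t2], the tolerance `free_tol` stands in for "v ≈ v_f", the persistence is counted in intervals (3 intervals of 5 min for the 15-min default), and the extent sums the lengths of links upstream of b without counting b itself.

```python
from typing import List


def is_active_bottleneck(v_b: List[float], k_up: List[float], v_down: List[float],
                         v_f_b: float, v_f_down: float, k_crit_up: float,
                         theta: float = 0.7, min_intervals: int = 3,
                         free_tol: float = 0.9) -> bool:
    """Check the three conditions over equal-length series for one window."""
    sustained_drop = (len(v_b) >= min_intervals
                      and all(v <= theta * v_f_b for v in v_b))
    upstream_congested = all(k >= k_crit_up for k in k_up)
    downstream_free = all(v >= free_tol * v_f_down for v in v_down)
    return sustained_drop and upstream_congested and downstream_free


def queue_extent_km(up_speeds: List[float], up_v_f: List[float],
                    up_len_km: List[float], theta: float = 0.7) -> float:
    """Length of contiguous slow links, ordered from just upstream of b outward."""
    extent = 0.0
    for v, v_f, length in zip(up_speeds, up_v_f, up_len_km):
        if v > theta * v_f:
            break  # the queue ends at the first link that is not slow
        extent += length
    return extent


print(is_active_bottleneck([40, 42, 38], [45, 50, 55], [98, 97, 99],
                           v_f_b=100.0, v_f_down=100.0, k_crit_up=20.0))
print(queue_extent_km([40, 55, 95], [100, 100, 100], [0.8, 1.2, 0.5]))
```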

A.6 Conservation law (the PINN setting)

On a 1-D corridor with coordinate x, the kinematic-wave (LWR) conservation law governs the state:

∂k/∂t + ∂q/∂x = 0, with q = k·v and closure q = Q(k).

To make this directly usable by PINNs and data-assimilation methods, the benchmark exposes a continuous space–time frame: link.csv carries x_start_km and x_end_km (a 1-D chain per corridor), critical_density and jam_density per link, and each test corridor ships an initial condition k(x, 0) and boundary conditions (upstream inflow, downstream state) from the first/last observed cross-sections.

The physical residual r = |∂k/∂t + ∂q/∂x|, evaluated by finite differences on this chain, is a reported and weighted score (§7).
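One way to evaluate the residual on a regular space–time grid is a forward difference in time and a forward difference in space, as sketched below. The stencil choice is an assumption for this example; the benchmark's exact discretization is not specified here, and the function name and grid values are hypothetical.

```python
from typing import List


def lwr_residual(k: List[List[float]], q: List[List[float]],
                 dt_h: float, dx_km: float) -> List[List[float]]:
    """|dk/dt + dq/dx| with k[t][x] in veh/km and q[t][x] in veh/h."""
    n_t, n_x = len(k), len(k[0])
    residual = []
    for t in range(n_t - 1):
        row = []
        for x in range(n_x - 1):
            dk_dt = (k[t + 1][x] - k[t][x]) / dt_h   # veh/km per hour
            dq_dx = (q[t][x + 1] - q[t][x]) / dx_km  # veh/h per km
            row.append(abs(dk_dt + dq_dx))
        residual.append(row)
    return residual


# Density rises by 5 veh/km in 0.25 h while flow drops by 20 veh/h over 1 km:
# the two terms cancel, so vehicles are conserved on this cell.
k_grid = [[20.0, 20.0], [25.0, 25.0]]
q_grid = [[2000.0, 1980.0], [2000.0, 1980.0]]
print(lwr_residual(k_grid, q_grid, dt_h=0.25, dx_km=1.0))
```

A grid in which density rises while flow is spatially uniform would instead return a positive residual, signalling a conservation violation.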

A.7 From demand to congestion: the causal chain

ODMEBench and ShockwaveBench are unified by the demand-exceeds-capacity mechanism. A bottleneck b activates when the assigned approach demand D_b(t) exceeds its capacity; the queue length L_b then evolves by conservation (Newell; in practice estimated through parsimonious spatial queue models, Cheng et al. 2022):

dL_b/dt = [ D_b(t) − q_cap(b) ] / (k_jam − k_crit), active while L_b > 0.

An estimated OD matrix is therefore required not merely to reproduce counts, but to explain congestion: after assignment, the demand loaded onto an observed bottleneck must exceed its capacity during the observed onset window. This is what the demand-to-congestion attribution sub-task (§3.2.3, §7.4) scores.
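The queue equation can be integrated with a simple forward-Euler step, clamping the queue at zero so that it only dissipates while L_b > 0. This is a minimal sketch under assumed inputs: the function names, the 15-min step, and the demand series are hypothetical, and the attribution check is only the bare "demand exceeds capacity in the onset window" condition.

```python
from typing import List


def queue_length_km(demand_vph: List[float], q_cap: float, k_jam: float,
                    k_crit: float, interval_min: float = 5.0) -> List[float]:
    """Forward-Euler integration of dL/dt = (D - q_cap) / (k_jam - k_crit)."""
    dt_h = interval_min / 60.0
    length = 0.0
    series = []
    for demand in demand_vph:
        rate_kmh = (demand - q_cap) / (k_jam - k_crit)
        # The queue cannot be negative: excess capacity only drains an existing queue.
        length = max(0.0, length + rate_kmh * dt_h)
        series.append(length)
    return series


def demand_explains_onset(demand_vph: List[float], q_cap: float,
                          onset_start: int, onset_end: int) -> bool:
    """True when assigned demand exceeds capacity somewhere in the onset window."""
    return any(d > q_cap for d in demand_vph[onset_start:onset_end + 1])


demand = [1800.0, 2400.0, 2400.0, 1600.0, 1600.0]  # veh/h per 15-min interval
print(queue_length_km(demand, q_cap=2000.0, k_jam=120.0, k_crit=20.0,
                      interval_min=15.0))
print(demand_explains_onset(demand, q_cap=2000.0, onset_start=1, onset_end=2))
```

In this toy case the queue grows by 1 km per interval while demand is 400 veh/h above capacity and drains at the same rate once demand falls 400 veh/h below it.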

A.8 Multi-source consistency as the validator backbone

The benchmark fuses heterogeneous data types — loop/section detectors, GPS/probe speeds, trajectory fragments, path travel times, and ramp flows. The unifying principle is that every observation is written as a linear functional of one shared space–time state field, with the functional fixed by sensor geometry.

A loop reading is a section-flow integral; a probe speed is a point sample; a trajectory segment yields both a count (density) and a section flow via line integrals; a travel time is a path integral of 1/v; a ramp imposes a mass-conservation correction.

Consistency is enforced once, jointly, by requiring the field to reproduce every source through its own mapping under the conservation law (§A.6) and the FD closure (§A.2). This is why the same pipeline adapts to a new corridor by simply rebuilding the mapping matrices from geometry.

The validator therefore performs a cross-source consistency check, with one subtlety confirmed on real data: on the released Caltrans PeMS panels the identity q = k·v holds to a median relative error of ≈ 0.1%, because PeMS density is derived as d = f/s.

Algebraically dependent fields like (f, s, d) are thus two degrees of freedom, not three; the validator treats them as a single shared observation of the field rather than as independent evidence, so a method cannot triple-count one measurement. Where a residual is large, it flags a sensor-health or unit problem before any score is computed.
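The identity check itself is easy to reproduce. The sketch below computes the median relative error of f against d · s over a set of rows, assuming the three fields are already in mutually consistent units (veh/h, km/h, veh/km). The function name, the sample rows, and the 1% flagging tolerance are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def qkv_median_rel_error(rows: List[Tuple[float, float, float]]) -> Optional[float]:
    """Median of |f - d*s| / f over rows of (flow f, speed s, density d)."""
    errors = [abs(f - d * s) / f
              for f, s, d in rows
              if f > 0 and s > 0 and d > 0]  # skip empty or invalid cells
    return median(errors) if errors else None


rows = [
    (1800.0, 60.0, 30.0),  # consistent: 30 * 60 = 1800
    (1200.0, 80.0, 15.0),  # consistent: 15 * 80 = 1200
    (900.0, 30.0, 30.3),   # about 1% off
]
err = qkv_median_rel_error(rows)
print(err)
print("flag sensor/unit problem" if err is not None and err > 0.01 else "consistent")
```

A median is used rather than a mean so that a few faulty cells do not mask an otherwise consistent panel.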

Appendix B — Related Literature by Track

This appendix collects the literature each track of TrafficFlowBench draws on, organized to mirror the benchmark structure: traffic-state estimation under three observation regimes (random missing with detectors, complete missing without detectors, and measurement-error correction) and prediction feed TrafficStateBench; OD estimation feeds ODMEBench; and shockwave, queue, and incident analysis feeds ShockwaveBench. Entries are listed alphabetically within each group. The works actually cited in the proposal text appear in §12; the lists below are a broader map of related methods rather than a citation list.

B.1 Traffic-state estimation — random missing (with detectors)

Asif, M. T., Dauwels, J., Goh, C. Y., et al. (2016). A compressive sensing approach to urban traffic estimation. IEEE Transactions on Intelligent Transportation Systems.
Chen, C., Kwon, J., Rice, J., Skabardonis, A., & Varaiya, P. (2003). Detecting Errors and Imputing Missing Data for Single-Loop Surveillance Systems. Transportation Research Record: Journal of the Transportation Research Board, 1855(1), 160-167. https://doi.org/10.3141/1855-20
Chen, X., Chen, Y., Saunier, N., & Sun, L. (2021). Scalable low-rank tensor learning for spatiotemporal traffic data imputation. Transportation Research Part C: Emerging Technologies, 129, 103226. https://doi.org/10.1016/j.trc.2021.103226
Chen, X., He, Z., & Sun, L. (2019). A Bayesian tensor decomposition approach for spatiotemporal traffic data imputation. Transportation Research Part C: Emerging Technologies, 98, 73-84. https://doi.org/10.1016/j.trc.2018.11.003
Deng, W. L., Lei, H., & Zhou, X. (2013). Traffic state estimation and uncertainty quantification based on heterogeneous data sources: A three detector approach. Transportation Research Part B: Methodological, 57, 132-157. https://doi.org/10.1016/j.trb.2013.08.015
Duan, Y., Lv, Y., Liu, Y., & Wang, F. (2016). An efficient realization of deep learning for traffic data imputation. Transportation Research Part C: Emerging Technologies, 72, 168-181. https://doi.org/10.1016/j.trc.2016.09.015
Li, H., Li, M., Lin, X., He, F., & Wang, Y. (2020). A spatiotemporal approach for traffic data imputation with complicated missing patterns. Transportation Research Part C: Emerging Technologies, 119, 102730. https://doi.org/10.1016/j.trc.2020.102730
Li, L., Li, Y., & Li, Z. (2013). Efficient missing data imputing for traffic flow by considering temporal and spatial dependence. Transportation Research Part C: Emerging Technologies, 34, 108-120. https://doi.org/10.1016/j.trc.2013.05.008
Nie, T., Qin, G., Ma, W., Mei, Y., & Sun, J. (2024). ImputeFormer: Low Rankness-Induced Transformers for Generalizable Spatiotemporal Imputation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 2260-2271). https://doi.org/10.1145/3637528.3671751
Qu, L., Hu, J., Li, L., & Zhang, Y. (2009). PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematical Approach. IEEE Transactions on Intelligent Transportation Systems, 10(3), 512-522. https://doi.org/10.1109/tits.2009.2026312
Tan, H., Feng, G., Feng, J., Wang, W., Zhang, Y., & Li, F. (2013). A tensor-based method for missing traffic data completion. Transportation Research Part C: Emerging Technologies, 28, 15-27. https://doi.org/10.1016/j.trc.2012.12.007
Yoon, J., Jordon, J., & Schaar, M. V. D. (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets. arXiv. https://doi.org/10.48550/arxiv.1806.02920
Zheng, Z., He, Y., Wang, Z., & Ma, W. (2025). A Physics-Regularized Multiscale Attention Network for Spatiotemporal Traffic Data Imputation. IEEE Transactions on Intelligent Transportation Systems, 26(11), 19522-19537. https://doi.org/10.1109/tits.2025.3595779

B.2 Traffic-state estimation — complete missing (no detectors)

Ambühl, L., & Menéndez, M. (2016). Data fusion algorithm for macroscopic fundamental diagram estimation. Transportation Research Part C: Emerging Technologies, 71, 184-197. https://doi.org/10.1016/j.trc.2016.07.013
Dakic, I., & Menéndez, M. (2018). On the use of Lagrangian observations from public transport and probe vehicles to estimate car space-mean speeds in bi-modal urban networks. Transportation Research Part C: Emerging Technologies, 91, 317-334. https://doi.org/10.1016/j.trc.2018.04.004
Fedorov, A., Nikolskaia, K., Ivanov, S., Shepelev, V., & Minbaleev, A. (2019). Traffic flow estimation with data from a video surveillance camera. Journal of Big Data, 6(1), 73. https://doi.org/10.1186/s40537-019-0234-z
Hu, Z., Zheng, Z., Menendez, M., & Ma, W. (2025). From global open multi-source data to network-wide traffic flow: A large-scale case study across multiple cities. Communications in Transportation Research, 5, 100222. https://doi.org/10.1016/j.commtr.2025.100222
Liu, Z., Liu, Y., Meng, Q., & Cheng, Q. (2019). A tailored machine learning approach for urban transport network flow estimation. Transportation Research Part C: Emerging Technologies, 108, 130-150. https://doi.org/10.1016/j.trc.2019.09.006
Mahajan, V., Cantelmo, G., Rothfeld, R., & Antoniou, C. (2022). Predicting network flows from speeds using open data and transfer learning. IET Intelligent Transport Systems, 17(4), 804-824. https://doi.org/10.1049/itr2.12305
Pendyala, R. M., & Pas, E. I. (2000). Multi-day and multi-period data for travel demand analysis and modeling. Transportation Research Circular.
Wang, P., Lai, J., Huang, Z., Tan, Q., & Lin, T. (2020). Estimating Traffic Flow in Large Road Networks Based on Multi-Source Traffic Data. IEEE Transactions on Intelligent Transportation Systems, 22(9), 5672-5683. https://doi.org/10.1109/tits.2020.2988801
Zhan, X., Li, R., & Ukkusuri, S. V. (2020). Link-based traffic state estimation and prediction for arterial networks using license-plate recognition data. Transportation Research Part C: Emerging Technologies, 117, 102660. https://doi.org/10.1016/j.trc.2020.102660
Zhan, X., Zheng, Y., Yi, X., & Ukkusuri, S. V. (2016). Citywide Traffic Volume Estimation Using Trajectory Data. IEEE Transactions on Knowledge and Data Engineering, 29(2), 272-285. https://doi.org/10.1109/tkde.2016.2621104
Zhang, J., Mao, S., Yang, L., Ma, W., Li, S., & Gao, Z. (2024). Physics-informed deep learning for traffic state estimation based on the traffic flow model and computational graph method. Information Fusion, 101, 101971. https://doi.org/10.1016/j.inffus.2023.101971
Zhang, Z., Li, M., Lin, X., & Wang, Y. (2020). Network-wide traffic flow estimation with insufficient volume detection and crowdsourcing data. Transportation Research Part C: Emerging Technologies, 121, 102870. https://doi.org/10.1016/j.trc.2020.102870
Zhong, M., Lingras, P., & Sharma, S. C. (2004). Estimation of missing traffic counts using factor, genetic, neural, and regression techniques. Transportation Research Part C: Emerging Technologies, 12(2), 139-166. https://doi.org/10.1016/j.trc.2004.07.006
Zhou, X. S., Kim, T., Ameli, M., Zhu, H. B., Honma, Y., & Pendyala, R. M. (2025). Flow-through tensors: A unified computational graph architecture for multi-layer transportation network optimization. Artificial Intelligence for Transportation, 1, 100006. https://doi.org/10.1016/j.ait.2025.100006

B.3 Traffic-state estimation — measurement errors

Feng, X., Zhang, H., Wang, C., & Zheng, H. (2022). Traffic Data Recovery From Corrupted and Incomplete Observations via Spatial-Temporal TRPCA. IEEE Transactions on Intelligent Transportation Systems, 23(10), 17835-17848. https://doi.org/10.1109/tits.2022.3151925
Kikuchi, S., Mangalpally, S., & Gupta, A. (2006). Method for Balancing Observed Boarding and Alighting Counts on a Transit Line. Transportation Research Record: Journal of the Transportation Research Board, 1971(1), 42-50. https://doi.org/10.1177/0361198106197100105
Kim, D., Shin, S., Park, D., & Kim, J. (2018). Correction of measured traffic volume on expressways based on traffic volume balancing. WIT Transactions on the Built Environment, 1, 361-371. https://doi.org/10.2495/ut180331
Vanajakshi, L., & Rilett, L. R. (2004). Loop Detector Data Diagnostics Based on Conservation-of-Vehicles Principle. Transportation Research Record: Journal of the Transportation Research Board, 1870(1), 162-169. https://doi.org/10.3141/1870-21
Yang, Y., Yang, H., & Fan, Y. (2019). Networked sensor data error estimation. Transportation Research Part B: Methodological, 122, 20-39. https://doi.org/10.1016/j.trb.2019.01.013
Yin, P., Sun, Z., Jin, W., & Xin, J. (2017). l1-minimization method for link flow correction. Transportation Research Part B: Methodological, 104, 398-408. https://doi.org/10.1016/j.trb.2017.08.006
Zheng, Z., Ahn, S., Chen, D., & Laval, J. (2010). Applications of wavelet transform for analysis of freeway traffic: Bottlenecks, transient traffic, and traffic oscillations. Transportation Research Part B: Methodological, 45(2), 372-390. https://doi.org/10.1016/j.trb.2010.08.002
Zheng, Z., & Su, D. (2016). Traffic state estimation through compressed sensing and Markov random field. Transportation Research Part B: Methodological, 91, 525-554. https://doi.org/10.1016/j.trb.2016.06.009
Zheng, Z., Wang, Z., Fu, H., & Ma, W. (2025). Estimating Erratic Measurement Errors in Network-Wide Traffic Flow via Virtual Balance Sensors. Transportation Science, 59(4), 721-742. https://doi.org/10.1287/trsc.2023.0493
Zheng, Z., Wang, Z., Hu, Z., Wan, Z., & Ma, W. (2024). Recovering traffic data from the corrupted noise: A doubly physics-regularized denoising diffusion model. Transportation Research Part C: Emerging Technologies, 160, 104513. https://doi.org/10.1016/j.trc.2024.104513
Zhou, X. S., Luo, X. R., Abbasi, M., Huang, Z., & Tyagi, A. (2024). Emerging Data Cleaning and Fusion for Traffic Model Calibration-Data Fusion for Microsimulation Model Calibration.

B.4 Traffic-state prediction

Bai, L., Yao, L., Li, C., Wang, X., & Wang, C. (2020). Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting. arXiv. https://doi.org/10.48550/arxiv.2007.02842
Fei, X., Lu, C., & Liu, K. (2011). A Bayesian dynamic linear model approach for real-time short-term freeway travel time prediction. Transportation Research Part C: Emerging Technologies, 19(6), 1306-1318. https://doi.org/10.1016/j.trc.2010.10.005
Guo, K., Hu, Y., Qian, S., Liu, H., Zhang, K., Sun, Y., Gao, J., & Yin, B. (2020). Optimized Graph Convolution Recurrent Neural Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, 22(2), 1138-1149. https://doi.org/10.1109/tits.2019.2963722
Guo, S., Lin, Y., Feng, N., Song, C., & Wan, H. (2019). Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 922-929). https://doi.org/10.1609/aaai.v33i01.3301922
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. NeurIPS. https://arxiv.org/abs/1612.01474
Li, M., & Zhu, Z. (2021). Spatial-Temporal Fusion Graph Neural Networks for Traffic Flow Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 4189-4196). https://doi.org/10.1609/aaai.v35i5.16542
Li, Y., Yu, R., Shahabi, C., & Liu, Y. (2018). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. ICLR.
Lv, Y., Duan, Y., Kang, W., Li, Z., & Wang, F.-Y. (2015). Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 16(2), 865-873. https://doi.org/10.1109/TITS.2014.2345663
Ma, X., Tao, Z., Wang, Y., Yu, H., & Wang, Y. (2015). Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 54, 187-197. https://doi.org/10.1016/j.trc.2015.03.014
Mallick, T., Macfarlane, J., & Balaprakash, P. (2024). Uncertainty Quantification for Traffic Forecasting Using Deep-Ensemble-Based Spatiotemporal Graph Neural Networks. IEEE Transactions on Intelligent Transportation Systems, 25(8), 9141-9152. https://doi.org/10.1109/tits.2024.3381099
Min, W., & Wynter, L. (2010). Real-time road traffic prediction with spatio-temporal correlations. Transportation Research Part C: Emerging Technologies, 19(4), 606-616. https://doi.org/10.1016/j.trc.2010.10.002
Peng, C., Xu, C., Kudryavtsev, F., Ai, Q., Gao, Y., & Jiao, Y. (2025). Spatiotemporal Factorized Graph Neural Networks for Joint Large-Scale Traffic Prediction and Online Pattern Recognition. IEEE Transactions on Intelligent Transportation Systems, 26(10), 14896-14909. https://doi.org/10.1109/tits.2025.3585197
Shao, J., Li, S., Zhang, K., Wang, A., & Li, M. (2025). Cross-City traffic prediction based on deep domain adaptive transfer learning. Transportation Research Part C: Emerging Technologies, 176, 105152. https://doi.org/10.1016/j.trc.2025.105152
Shi, X., Qi, H., Shen, Y., Wu, G., & Yin, B. (2020). A Spatial-Temporal Attention Approach for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, 22(8), 4909-4918. https://doi.org/10.1109/tits.2020.2983651
Sun, S., Zhang, C., & Yu, G. (2006). A Bayesian Network Approach to Traffic Flow Forecasting. IEEE Transactions on Intelligent Transportation Systems, 7(1), 124-132. https://doi.org/10.1109/tits.2006.869623
van Lint, J. W. C., & van Zuylen, H. J. (2005). Monitoring and predicting freeway travel time reliability: Using width and skewness of day-to-day travel time distributions. Transportation Research Record.
Vlahogianni, E. I., Karlaftis, M. G., & Golias, J. C. (2014). Short-term traffic forecasting: Where we are and where we're going. Transportation Research Part C: Emerging Technologies, 43, 3-19. https://doi.org/10.1016/j.trc.2014.01.005
Williams, B. M., & Nihan, N. L. (1995). Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process. Journal of Transportation Engineering, 121(6), 471-481. https://doi.org/10.1061/(ASCE)0733-947X(1995)121:6(471)
Yan, Y., Cui, S., Liu, J., Zhao, Y., Zhou, B., & Kuo, Y. (2024). Multimodal fusion for large-scale traffic prediction with heterogeneous retentive networks. Information Fusion, 114, 102695. https://doi.org/10.1016/j.inffus.2024.102695
Yang, A., Li, Z., Li, X., Liu, W., Yang, X., Sun, H., Chen, M., Zheng, Y., & Gong, Y. (2025). Spatio-Temporal Multivariate Probabilistic Modeling for Traffic Prediction. IEEE Transactions on Knowledge and Data Engineering, 37(5), 2986-3005. https://doi.org/10.1109/tkde.2025.3539680
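Yu, B., Yin, H., & Zhu, Z. (2018). Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (pp. 3634-3640). https://doi.org/10.24963/ijcai.2018/505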
Zhao, L., Song, Y., Zhang, C., Liu, Y., Wang, P., Lin, T., Deng, M., & Li, H. (2019). T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, 21(9), 3848-3858. https://doi.org/10.1109/tits.2019.2935152

B.5 OD estimation (ODMEBench)

Antoniou, C., Barceló, J., Breen, M., Bullejos, M., Casas, J., Cipriani, E., Ciuffo, B., Djukic, T., Hoogendoorn, S., Marzano, V., Montero, L., Nigro, M., Perarnau, J., Punzo, V., Toledo, T., & van Lint, H. (2016). Towards a generic benchmarking platform for origin-destination flows estimation/updating algorithms: Design, demonstration and validation. Transportation Research Part C: Emerging Technologies, 66, 79-98. https://doi.org/10.1016/j.trc.2015.08.009
Antoniou, C., Ben-Akiva, M., & Koutsopoulos, H. N. (2007). Nonlinear Kalman filtering algorithms for on-line calibration of dynamic traffic assignment models. IEEE Transactions on Intelligent Transportation Systems, 8(4), 661-670. https://doi.org/10.1109/TITS.2007.908569
Ashok, K., & Ben-Akiva, M. E. (1993). Dynamic origin-destination matrix estimation and prediction for real-time traffic management systems. Transportation and Traffic Theory.
Ashok, K., & Ben-Akiva, M. E. (2000). Alternative Approaches for Real-Time Estimation and Prediction of Time-Dependent Origin-Destination Flows. Transportation Science, 34(1), 21-36. https://doi.org/10.1287/trsc.34.1.21.12282
Balakrishna, R., Koutsopoulos, H. N., & Ben-Akiva, M. (2007). Calibration and validation of dynamic traffic assignment systems. Transportation and Traffic Theory, 407-426. https://doi.org/10.1016/B978-008044680-6/50023-4
Bierlaire, M., & Crittin, F. (2006). Solving Noisy, Large-Scale Fixed-Point Problems and Systems of Nonlinear Equations. Transportation Science, 40(1), 44-63. https://doi.org/10.1287/trsc.1050.0119
Cascetta, E., Inaudi, D., & Marquis, G. P. (1993). Dynamic Estimators of Origin-Destination Matrices Using Traffic Counts. Transportation Science, 27(4), 363-373. https://doi.org/10.1287/trsc.27.4.363
Castillo, E., Menéndez, J. M., & Jiménez, P. (2007). Trip matrix and path flow reconstruction and estimation based on plate scanning and link observations. Transportation Research Part B: Methodological, 42(5), 455-481. https://doi.org/10.1016/j.trb.2007.09.004
Fisk, C. S. (1988). On combining maximum entropy trip matrix estimation with user optimal assignment. Transportation Research Part B: Methodological, 22(1), 69-73. https://doi.org/10.1016/0191-2615(88)90035-5
Hazelton, M. L. (2000). Estimation of origin-destination matrices from link flows on uncongested networks. Transportation Research Part B: Methodological, 34(7), 549-566. https://doi.org/10.1016/s0191-2615(99)00037-5
Ma, W., Pi, X., & Qian, S. (2020). Estimating multi-class dynamic origin-destination demand through a forward-backward algorithm on computational graphs. Transportation Research Part C: Emerging Technologies, 119, 102747. https://doi.org/10.1016/j.trc.2020.102747
Ma, W., & Qian, Z. S. (2018). Estimating multi-year 24/7 origin-destination demand using high-granular multi-source traffic data. Transportation Research Part C: Emerging Technologies, 96, 96-121. https://doi.org/10.1016/j.trc.2018.09.002
Nie, Y. (2006). A Variational Inequality Approach For Inferring Dynamic Origin-Destination Travel Demands.
Sherali, H. D., & Park, T. (2001). Estimation of dynamic origin-destination trip tables for a general network. Transportation Research Part B: Methodological, 35(3), 217-235. https://doi.org/10.1016/s0191-2615(99)00048-x
Spiess, H. (1987). A maximum likelihood model for estimating origin-destination matrices. Transportation Research Part B: Methodological, 21(5), 395-412. https://doi.org/10.1016/0191-2615(87)90037-3
van Zuylen, H. J., & Willumsen, L. G. (1980). The most likely trip matrix estimated from traffic counts. Transportation Research Part B: Methodological, 14(3), 281-293. https://doi.org/10.1016/0191-2615(80)90008-9
Yang, H. (1995). Heuristic algorithms for the bilevel origin-destination matrix estimation problem. Transportation Research Part B: Methodological, 29(4), 231-242. https://doi.org/10.1016/0191-2615(95)00003-v
Yang, H., Sasaki, T., Iida, Y., & Asakura, Y. (1992). Estimation of origin-destination matrices from link traffic counts on congested networks. Transportation Research Part B: Methodological, 26(6), 417-434. https://doi.org/10.1016/0191-2615(92)90008-k
Zhang, J., Che, H., Chen, F., Ma, W., & He, Z. (2021). Short-term origin-destination demand prediction in urban rail transit systems: A channel-wise attentive split-convolutional neural network method. Transportation Research Part C: Emerging Technologies, 124, 102928. https://doi.org/10.1016/j.trc.2020.102928
Zhou, X., & Mahmassani, H. (2006). Dynamic Origin-Destination Demand Estimation Using Automatic Vehicle Identification Data. IEEE Transactions on Intelligent Transportation Systems, 7(1), 105-114. https://doi.org/10.1109/tits.2006.869629
Zhou, X., & Mahmassani, H. S. (2007). A structural state space model for real-time traffic origin-destination demand estimation and prediction in a day-to-day learning framework. Transportation Research Part B: Methodological, 41(8), 823-840. https://doi.org/10.1016/j.trb.2007.02.004

B.6 Shockwaves, queues, and incidents (ShockwaveBench)

Elfar, A., Xavier, C. B., Talebpour, A., & Mahmassani, H. S. (2018). Traffic Shockwave Detection in a Connected Environment using the Speed Distribution of Individual Vehicles. Transportation Research Record: Journal of the Transportation Research Board, 2672(20), 203-214. https://doi.org/10.1177/0361198118794717
Wang, Z., Liu, K., Zhu, L., & Jiang, H. (2021). Detecting the occurrence times and locations of multiple traffic crashes simultaneously with probe vehicle data. Transportation Research Part C: Emerging Technologies, 126, 103014. https://doi.org/10.1016/j.trc.2021.103014
Wang, Z., Qi, X., & Jiang, H. (2018). Estimating the spatiotemporal impact of traffic incidents: An integer programming approach consistent with the propagation of shockwaves. Transportation Research Part B: Methodological, 111, 356-369. https://doi.org/10.1016/j.trb.2018.02.014
Wu, X., & Liu, H. X. (2011). A shockwave profile model for traffic flow on congested urban arterials. Transportation Research Part B: Methodological, 45(10), 1768-1786. https://doi.org/10.1016/j.trb.2011.07.013
Zheng, Z., Wang, Z., Chen, X., Ma, W., & Ran, B. (2025). Spatiotemporal clustering for the impact region caused by a traffic incident: An improved fuzzy C-means approach with guaranteed consistency. Transportmetrica A: Transport Science, 21(1), 2236719. https://doi.org/10.1080/23249935.2023.2236719