2025-08-27

Route Optimization Is a Data Problem, Not an Algorithms Problem

By Anders Nylund · Principal · 9 min read

The vehicle routing problem (VRP) and its variants have been studied by operations researchers for 60 years. There are excellent solvers — both exact methods like branch-and-price and heuristic approaches like large neighborhood search — that are well-understood, well-documented, and available in open-source libraries. If you have clean input data and a well-specified problem, getting to a high-quality routing solution is not a research problem. It's an engineering problem with known solutions.

So why, in 2025, do so many logistics operators still run suboptimal routes? It's not because they're using bad algorithms. It's because the data feeding those algorithms is incomplete, stale, and inconsistently structured, and it rarely includes the feedback signal that would allow the model to improve over time. Route optimization is a data problem, not an algorithms problem. The solvers are good. The data pipelines are a mess.

What good routing data actually requires

A routing optimizer needs, at minimum: stop locations with reliable geocoding, time windows that reflect actual customer availability (not default 9–5), service time estimates that account for real dwell time at each stop type, vehicle capacity constraints that include not just weight and volume but operational constraints like refrigeration zones or hazardous goods separation requirements, and driver constraints like maximum hours of service and shift start/end times.

Most operators have some version of most of these inputs. The problem is quality and completeness. Geocoding errors accumulate over time as customer addresses change and nobody updates the master address record. Time windows are often defaults set when the customer account was created, not actual delivery windows that were verified recently. Service time estimates are usually averages that don't account for customer-specific idiosyncrasies — the stop that always takes 45 minutes because there's only one loading dock, or the stop that's consistently faster because the customer pre-stages orders.

Each of these data quality issues is individually small. Collectively, they mean that the "optimization" the solver is performing is optimization against a problem specification that doesn't match reality. The solution is mathematically optimal for the wrong problem. In practice, this manifests as routes that look good on paper and then regularly get modified by drivers in the field — stops resequenced, time windows missed, service times exceeded. The route plan becomes a starting point for improvisation rather than an executable instruction.
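To make the input-quality point concrete, here is a minimal sketch of the kind of audit that can run before any solver sees the data. Everything in it is an assumption for illustration: the `Stop` fields, the 09:00 to 17:00 default window, and the one-year staleness threshold are hypothetical and would need to match how a given operator actually stores its master data.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional

# Illustrative assumptions, not values from any real system.
DEFAULT_WINDOW = (time(9, 0), time(17, 0))  # the account-creation default
MAX_AGE_DAYS = 365                          # older than this counts as stale


@dataclass
class Stop:
    stop_id: str
    lat: Optional[float]
    lon: Optional[float]
    window_start: time
    window_end: time
    geocode_verified_on: Optional[date]
    window_verified_on: Optional[date]


def _is_stale(verified_on: Optional[date], today: date) -> bool:
    """A field is stale if it was never verified or was verified too long ago."""
    return verified_on is None or (today - verified_on).days > MAX_AGE_DAYS


def audit_stop(stop: Stop, today: date) -> list[str]:
    """Return the data-quality flags raised by a single stop record."""
    issues = []
    if stop.lat is None or stop.lon is None:
        issues.append("missing_geocode")
    elif not (-90 <= stop.lat <= 90 and -180 <= stop.lon <= 180):
        issues.append("invalid_geocode")
    elif _is_stale(stop.geocode_verified_on, today):
        issues.append("stale_geocode")
    if (stop.window_start, stop.window_end) == DEFAULT_WINDOW:
        issues.append("default_time_window")
    if _is_stale(stop.window_verified_on, today):
        issues.append("unverified_time_window")
    return issues
```

The point of a check like this is not to block routing. It is to measure how much of the problem specification is unverified, so the share of flagged stops can be tracked and driven down over time.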

The feedback loop problem

What separates good route optimization systems from mediocre ones is not the solver — it's the feedback loop. A system that captures actual execution data (GPS traces, actual dwell times, actual sequence followed, exception events) and feeds that back into the model's parameters improves over time. A system that generates routes and then discards the execution data learns nothing.

Here's a concrete example of why this matters. Service time estimation is one of the most impactful parameters in routing quality. If your model underestimates service time at a particular stop type by 8 minutes on average, the error compounds down the route: on a route made up of that stop type, the driver is more than 90 minutes behind plan by stop 12 (12 × 8 = 96 minutes). The fix is to let the model learn actual service times from execution data (GPS traces showing time at location, or manual driver inputs confirming dwell time) and update the service time estimates per stop type, or per specific customer, on a rolling basis.
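A rolling update of this kind can be very simple. The sketch below uses an exponentially weighted moving average, which is one reasonable choice among several and not a claim about how any particular product does it. The class name, the `alpha` smoothing factor, and the fallback from customer-level to stop-type-level estimates are all assumptions made for illustration.

```python
class ServiceTimeModel:
    """Per-customer service time estimates with a stop-type fallback.

    Hypothetical sketch: an exponentially weighted moving average, where
    alpha controls how quickly new observations override the old estimate.
    """

    def __init__(self, type_defaults: dict[str, float], alpha: float = 0.2):
        self.type_defaults = dict(type_defaults)   # minutes per stop type
        self.alpha = alpha
        self.by_customer: dict[str, float] = {}    # learned minutes per customer

    def estimate(self, customer_id: str, stop_type: str) -> float:
        """Use the learned customer value if one exists, else the type default."""
        return self.by_customer.get(customer_id, self.type_defaults[stop_type])

    def observe(self, customer_id: str, stop_type: str, actual_min: float) -> None:
        """Blend one observed dwell time into the customer's estimate."""
        prior = self.estimate(customer_id, stop_type)
        self.by_customer[customer_id] = (
            (1 - self.alpha) * prior + self.alpha * actual_min
        )


def cumulative_delay(underestimate_min: float, stops: int) -> float:
    """Schedule slip after `stops` stops if each one runs over by the same amount."""
    return underestimate_min * stops
```

With a 20-minute default and `alpha=0.5`, two observed 28-minute visits move a customer's estimate to 24 and then 26 minutes, while customers with no observations keep the default. `cumulative_delay(8, 12)` gives 96 minutes, about an hour and a half of slip by stop 12.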

Most operators do not have this feedback loop running. The route optimization system and the execution system are separate applications that do not share data. The routing software runs its solver, generates a plan, pushes it to the TMS, and then has no visibility into what actually happened. Actual service times are either not captured at all, or captured in a format that requires manual extraction and cleaning before it could be useful as model input. The result is that the same estimation errors persist indefinitely because there's no mechanism for the model to observe and correct them.
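Capturing actual service time does not require much once GPS pings are available. The following is a rough sketch, assuming pings arrive as time-sorted `(timestamp, lat, lon)` tuples and that "at the stop" means within a fixed radius of the stop's coordinates. The 75-metre radius and the function names are hypothetical, and a production version would also need to handle GPS noise and gaps in the trace.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def dwell_minutes(pings, stop_lat, stop_lon, radius_m=75.0):
    """Longest contiguous stay inside the radius, in minutes.

    `pings` must be sorted by time. Returns None if the vehicle never
    entered the radius; a single in-radius ping yields 0.0 minutes.
    """
    best = None
    run_start = run_end = None
    for ts, lat, lon in pings:
        if haversine_m(lat, lon, stop_lat, stop_lon) <= radius_m:
            if run_start is None:
                run_start = ts
            run_end = ts
        elif run_start is not None:
            duration = run_end - run_start
            if best is None or duration > best:
                best = duration
            run_start = None
    if run_start is not None:  # trace ended while still at the stop
        duration = run_end - run_start
        if best is None or duration > best:
            best = duration
    return None if best is None else best.total_seconds() / 60
```

The output of a function like this is exactly the observation a service time model needs, which is why the gap between the routing system and the execution system is so costly: the raw signal usually exists, it just never reaches the model.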

What FourKites is doing differently

When we backed FourKites in 2024, what differentiated them was the scope and depth of their real-time tracking platform. Their system covers ocean, rail, truck, and parcel across 200+ countries — a multimodal coverage footprint that no carrier-specific or mode-specific solution can replicate. That breadth matters because supply chains don't respect modal boundaries: a delay in an ocean leg propagates into a rail leg which propagates into a last-mile leg, and you can't optimize the downstream without visibility into the upstream.

What makes FourKites structurally defensible is not the tracking itself — tracking is increasingly a commodity — but the predictive ETA model built on top of it. Their model learns from billions of historical shipment data points, continuously improving accuracy as it observes deviations between predicted and actual arrival times. The value isn't in knowing where a shipment is right now; it's in knowing when it will arrive and why it might not. The automated exception workflows built on top of that prediction layer — alerting, rerouting, carrier escalation — are where the operational value concentrates. Tracking tells you what happened. The FourKites prediction layer tells you what's about to happen and what to do about it.

The integration problem: where the data gap lives

The reason most operators don't have clean, complete routing data is not that the data doesn't exist — it's that it lives in too many places. Order data in the OMS. Customer master data in the ERP. Vehicle and driver data in the fleet management system. Historical execution data in the TMS, if it's captured at all. Real-time traffic and road condition data from external providers. Each of these sources has different update frequencies, different data quality standards, and different APIs or file formats for extraction.

The practical question for a route optimization software company is: how do you build the data integration layer without making it a custom integration project for every customer? The companies that have answered this well have built connectors for the 8–10 TMS and ERP systems that cover 70–80% of their target customer base, and built a data normalization layer that handles the variation in how the same fields (stop address, time window, vehicle capacity) are represented across those systems. The companies that have answered it poorly have a sales process where every new customer requires 3–6 months of custom integration work before the optimization can run.
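As a rough illustration of what a normalization layer does, the sketch below maps records from two made-up source systems into one canonical shape. The system names (`tms_a`, `erp_b`), their field names, and the choice of a per-stop demand weight are all hypothetical. The same pattern of one small adapter per source plus unit conversion would apply to vehicle capacity fields.

```python
from dataclasses import dataclass
from datetime import time

LB_TO_KG = 0.45359237


@dataclass(frozen=True)
class CanonicalStop:
    address: str
    window_start: time
    window_end: time
    demand_kg: float


def _parse_hhmm(text: str) -> time:
    """Parse 'H:MM' or 'HH:MM' into a time."""
    hours, minutes = text.strip().split(":")
    return time(int(hours), int(minutes))


def from_tms_a(rec: dict) -> CanonicalStop:
    # Hypothetical source A: one address string, window as 'HH:MM-HH:MM', kilograms.
    start, end = rec["tw"].split("-")
    return CanonicalStop(
        address=rec["addr"].strip(),
        window_start=_parse_hhmm(start),
        window_end=_parse_hhmm(end),
        demand_kg=float(rec["demand_kg"]),
    )


def from_erp_b(rec: dict) -> CanonicalStop:
    # Hypothetical source B: split address, separate open/close fields, pounds.
    return CanonicalStop(
        address=f'{rec["street"].strip()}, {rec["city"].strip()}',
        window_start=_parse_hhmm(rec["open"]),
        window_end=_parse_hhmm(rec["close"]),
        demand_kg=float(rec["demand_lb"]) * LB_TO_KG,
    )


NORMALIZERS = {"tms_a": from_tms_a, "erp_b": from_erp_b}


def normalize(source: str, rec: dict) -> CanonicalStop:
    """Dispatch a raw record to the adapter for its source system."""
    if source not in NORMALIZERS:
        raise ValueError(f"no connector for source system: {source}")
    return NORMALIZERS[source](rec)
```

Adding a new customer system then means writing one more adapter function and registering it, rather than touching the solver or the rest of the pipeline.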

For seed-stage companies building in this space, the integration architecture is a make-or-break decision. Build the connectors for the systems your first 10 customers actually run. Don't build a general-purpose middleware layer — that's a different product. Build the connectors, normalize the data, run the solver, close the feedback loop. The route quality will compound from there.

The algorithm ceiling

One more point worth making directly: there's a ceiling on what algorithm improvement can deliver in practical routing quality. Modern VRP solvers, operating on clean problem instances, get within 1–2% of optimal on typical commercial routing problems. The algorithmic gains available from switching from a good heuristic solver to a marginally better one are measured in fractions of a percent. The quality gains available from fixing data quality — correcting service time estimates, updating time windows, improving geocoding accuracy — are often measured in 10–20% improvements in on-time delivery rates and driver utilization.

This is why we're skeptical of routing optimization pitches that lead with algorithmic differentiation. If a company's primary claim is "our solver is better," we immediately want to know: how are they ensuring the data feeding that solver is accurate, complete, and continuously improving? The solver is table stakes. The data infrastructure is the product.