2022-05-24

Portfolio Lessons from Year One of Deployment

By Anders Nylund · Principal · 8 min read

I joined Kvistlund in early 2021 after five years as an ML engineer at a Nordic logistics software company. My background is in production systems: the unglamorous engineering work of getting a model from a Jupyter notebook into something that runs reliably in a customer's environment, handles data quality failures gracefully, and produces predictions that are calibrated well enough that a human operator can act on them without constantly second-guessing the output.

Fourteen months into my time here, I want to share what we've learned from watching our portfolio companies build and deploy production ML systems in the European logistics context. This is not a research summary or a framework — it's a set of observations from direct technical engagement with founders who are building real products and running into real problems.

Data drift in logistics is faster and more structured than most ML teams expect

In standard ML deployment practice, the concern about data drift is often framed in fairly abstract terms: the distribution of your input features changes over time, your model's predictions degrade, you need to monitor and retrain. In logistics, this abstract concern has a very concrete set of causes that experienced practitioners know about and inexperienced ones get blindsided by.

Carrier behavior is a primary driver. A road freight ETA model trained on carrier event data from 2020-2021 will have encoded assumptions about reporting frequency, geolocation precision, and event completeness that reflect the carrier's technology stack at that time. When a carrier upgrades their mobile driver app, switches telematics providers, or changes their event reporting protocol, the data distribution that the model was trained on shifts — sometimes gradually, sometimes suddenly. The model doesn't know the carrier changed its app. The predictions start degrading in ways that are hard to debug without explicit monitoring for carrier-level data quality metrics.
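One way to make "carrier-level data quality metrics" concrete is to summarize each carrier's event feed over a baseline window and a recent window, then compare the two. The sketch below is a minimal illustration, not a description of any portfolio company's system. The event schema (`ts`, `lat`, `lon`), the function names, and the limits of 2x on reporting gap and 10 points on missing location are all assumptions chosen for the example.

```python
from statistics import median


def carrier_quality(events):
    """Summarize one carrier's event feed over a time window.

    `events` is a non-empty list of dicts with 'ts' (epoch seconds) and
    optional 'lat'/'lon'. This schema is hypothetical.
    """
    timestamps = sorted(e["ts"] for e in events)
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    missing = sum(1 for e in events if e.get("lat") is None or e.get("lon") is None)
    return {
        "median_gap_s": median(gaps) if gaps else None,
        "missing_location_rate": missing / len(events),
    }


def quality_shift(baseline, recent, gap_ratio_limit=2.0, missing_delta_limit=0.10):
    """Return the names of metrics that moved beyond the (assumed) limits."""
    flags = []
    base_gap, recent_gap = baseline["median_gap_s"], recent["median_gap_s"]
    if base_gap and recent_gap:
        ratio = recent_gap / base_gap
        # Flag both slower and faster reporting: either changes the inputs.
        if ratio > gap_ratio_limit or ratio < 1 / gap_ratio_limit:
            flags.append("reporting_frequency")
    if recent["missing_location_rate"] - baseline["missing_location_rate"] > missing_delta_limit:
        flags.append("event_completeness")
    return flags


# Hypothetical feeds: the carrier used to report every 10 minutes with
# positions; after an app change it reports every 30 minutes, half without.
baseline = carrier_quality(
    [{"ts": t, "lat": 59.3, "lon": 18.1} for t in (0, 600, 1200, 1800)]
)
recent = carrier_quality([
    {"ts": 0, "lat": 59.3, "lon": 18.1},
    {"ts": 1800, "lat": None, "lon": None},
    {"ts": 3600, "lat": 59.4, "lon": 18.2},
    {"ts": 5400, "lat": None, "lon": None},
])
flags = quality_shift(baseline, recent)
```

A check like this runs on inputs only, so it can surface a carrier-side change before enough delivered shipments have accumulated to show up in accuracy numbers.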

Port operations are a second major drift source. A model trained on port dwell time data from a period of normal operations will mispredict when port congestion rises significantly. Congestion creates systematic deviations from historical patterns (vessels waiting at anchor for berths that are operating at 150% of planned throughput, dwell times 3–4x historical averages) that were not present in the training data. A model without explicit congestion features will not handle this gracefully. During the 2021 port congestion period, we saw this play out in real time: several companies whose ocean visibility products had been performing well in 2019-2020 saw their prediction accuracy drop significantly, not because their engineering had gotten worse, but because the operational environment they were predicting in had moved far outside their training distribution.

We are not saying every logistics AI company needs to build a full data observability platform from day one. For early-stage companies, the minimum viable monitoring is carrier-level and lane-level accuracy tracking with automated alerting when accuracy drops below a set threshold on a specific carrier or trade lane. This lets you identify drift early and diagnose whether it's a model problem, a data problem, or an operations problem. Without this, you often don't find out until a customer's operations team calls to ask why your predictions have been wrong for three weeks.
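The minimum viable version of this can be very small. The sketch below assumes each completed shipment yields a record with a carrier, a lane, a predicted ETA, and an actual arrival, and it counts a prediction as accurate if it lands within a tolerance. The field names, the 4-hour tolerance, the 80% threshold, and the minimum sample size are all placeholders for illustration; real values depend on the product and the customer.

```python
from collections import defaultdict


def segment_accuracy(records, tolerance_hours=4.0):
    """Accuracy and sample count per carrier and per lane.

    Each record is a dict with 'carrier', 'lane', 'predicted_h', 'actual_h'
    (hours from a common reference). The schema is hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        within_tolerance = abs(r["predicted_h"] - r["actual_h"]) <= tolerance_hours
        for segment in (("carrier", r["carrier"]), ("lane", r["lane"])):
            totals[segment] += 1
            hits[segment] += within_tolerance  # True counts as 1
    return {seg: (hits[seg] / totals[seg], totals[seg]) for seg in totals}


def segments_to_alert(stats, threshold=0.80, min_samples=3):
    """Segments with enough volume whose accuracy is below the threshold."""
    return sorted(
        seg for seg, (accuracy, n) in stats.items()
        if n >= min_samples and accuracy < threshold
    )


# Made-up records for illustration only.
records = [
    {"carrier": "A", "lane": "RTM-GOT", "predicted_h": 10, "actual_h": 11},
    {"carrier": "A", "lane": "RTM-GOT", "predicted_h": 10, "actual_h": 12},
    {"carrier": "A", "lane": "HAM-OSL", "predicted_h": 10, "actual_h": 13},
    {"carrier": "B", "lane": "HAM-OSL", "predicted_h": 10, "actual_h": 20},
    {"carrier": "B", "lane": "HAM-OSL", "predicted_h": 10, "actual_h": 30},
    {"carrier": "B", "lane": "RTM-GOT", "predicted_h": 10, "actual_h": 11},
]
stats = segment_accuracy(records)
alerts = segments_to_alert(stats)
```

Tracking carrier and lane separately is what supports the diagnosis step: in this toy data both carrier B and the HAM-OSL lane fall below the threshold, which tells you where to look next rather than only that something is wrong.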

Model architecture choices at seed stage have longer tails than founders usually anticipate

The ML architecture decisions made during the first six months of product development tend to persist far longer than most founding teams expect. This is not unique to logistics software — it's a general property of production ML systems — but in logistics, the architectural commitments are particularly sticky because the data pipelines that feed the models are expensive to rebuild.

The most common pattern we see is founding teams choosing a model architecture for their initial use case that works well on the problem it was designed for but constrains what the product can evolve into. A demand forecasting model that was designed as a univariate time series predictor for a specific product category can produce accurate forecasts for that category, but extending it to multivariate prediction — incorporating external signals like carrier rate trends, weather, or promotional calendars — often requires rebuilding rather than extending. The teams that build multi-input prediction infrastructure from the beginning, even when they don't immediately need all the inputs, have an easier time expanding their product surface later.

A related pattern: feature stores. Logistics AI products that end up serving multiple prediction use cases — ETA prediction and capacity prediction and demand forecasting — eventually need a shared feature computation infrastructure. Companies that build this from day one have a significant advantage over companies that rebuild it twice. The founders who have production ML experience in a large-scale environment tend to anticipate this need. The founders who are strong data scientists but have less production deployment experience tend to rediscover it.
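The core idea of shared feature computation does not require a heavyweight platform to get started. The sketch below is a deliberately tiny, in-process illustration: each feature is defined once under a name, and each prediction use case declares which names it needs. The class, the feature names, and the shipment fields are all hypothetical, and a production feature store would add storage, versioning, and point-in-time correctness that this omits.

```python
from typing import Callable, Dict, Iterable


class FeatureRegistry:
    """Single place where feature logic is defined and looked up by name."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str):
        def decorator(fn: Callable[[dict], float]):
            if name in self._features:
                raise ValueError(f"feature {name!r} is already defined")
            self._features[name] = fn
            return fn
        return decorator

    def compute(self, entity: dict, names: Iterable[str]) -> Dict[str, float]:
        return {name: self._features[name](entity) for name in names}


registry = FeatureRegistry()


@registry.register("distance_km")
def distance_km(shipment: dict) -> float:
    return shipment["distance_km"]


@registry.register("carrier_on_time_rate")
def carrier_on_time_rate(shipment: dict) -> float:
    return shipment["carrier_on_time"] / shipment["carrier_total"]


# Two use cases share one definition of carrier_on_time_rate.
ETA_FEATURES = ["distance_km", "carrier_on_time_rate"]
CAPACITY_FEATURES = ["carrier_on_time_rate"]

shipment = {"distance_km": 420.0, "carrier_on_time": 45, "carrier_total": 50}
eta_inputs = registry.compute(shipment, ETA_FEATURES)
capacity_inputs = registry.compute(shipment, CAPACITY_FEATURES)
```

The point of the pattern is that when the definition of a feature changes, it changes in one place for every model that consumes it.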

Customer retention in production ML products is driven by accuracy consistency, not peak accuracy

When we evaluate logistics AI products as investors, we pay close attention to accuracy metrics. When we watch portfolio companies try to retain customers, we observe that accuracy metrics are necessary but not sufficient for explaining retention outcomes.

The more predictive factor for customer retention, based on what we've observed across our portfolio, is accuracy consistency: whether the model performs at a reliable level across the full range of conditions the customer encounters, not just under the conditions where the model performs best. A freight visibility product that produces 90% ETA accuracy on major standard-route shipments and 55% accuracy on shipments with port transshipments is not actually a high-accuracy product for a customer whose freight mix includes significant transshipment volume — it's a product that performs well in conditions that don't represent their problem.
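The arithmetic behind this is simple enough to write down. Using the 90% and 55% figures from the example above, the accuracy a customer actually experiences is the volume-weighted average over their freight mix. The 60/40 split below is a hypothetical mix chosen for illustration, not a figure from our portfolio.

```python
def blended_accuracy(accuracy_by_segment, mix):
    """Volume-weighted accuracy for a freight mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(accuracy_by_segment[seg] * share for seg, share in mix.items())


# Segment accuracies taken from the example in the text.
accuracy = {"standard_route": 0.90, "transshipment": 0.55}

# Hypothetical mixes: all standard-route freight vs. 40% transshipment.
standard_only = blended_accuracy(
    accuracy, {"standard_route": 1.0, "transshipment": 0.0}
)
mixed = blended_accuracy(
    accuracy, {"standard_route": 0.6, "transshipment": 0.4}
)
# standard_only is 0.90; mixed is 0.6 * 0.90 + 0.4 * 0.55 = 0.76
```

Under these assumptions the same model is a 90% product for one customer and a 76% product for another, which is why the headline number alone says little about retention.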

Customers understand this fairly quickly in production. The initial pilot is often structured around the favorable conditions where the product performs best, which creates an overly favorable initial impression of model quality. When the product goes into production and the full range of the customer's freight mix is flowing through it, the average accuracy the customer experiences is lower than the pilot suggested. This expectation gap is a significant contributor to churn at the 9–12 month mark, in our observation.

The founding teams that handle this well are the ones who are explicit about accuracy by segment — who can show a customer their predicted accuracy on specific trade lanes, carrier combinations, and shipment types before go-live, and who build their expansion narrative around improving accuracy on the harder segments rather than overpromising on the full mix.

Infrastructure cost unit economics matter earlier than most seed-stage ML companies assume

AI infrastructure costs — GPU compute for model training, inference serving at scale, data storage for training data at the volume that production logistics operations generate — are not uniformly significant at seed stage, but they are not uniformly negligible either. For logistics AI companies that are processing high volumes of carrier event data in real time, the infrastructure cost per unit of prediction can become a meaningful margin constraint at relatively modest scale.

We've seen this most acutely in companies building real-time ocean visibility products. AIS vessel tracking data is available at high frequency for tens of thousands of vessels simultaneously. If you are ingesting full AIS data streams and processing them through a real-time prediction pipeline, the data ingestion and storage costs scale with the number of vessels you're tracking, not with the number of your customers' shipments. This creates a cost structure that needs careful management as the product scales. Companies that started tracking a broader universe than their customers' actual freight mix — "we need the full picture to build accurate models" — found that the data costs were non-trivial before their revenue base justified them.
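A back-of-the-envelope model makes the scaling issue visible. Every number in the sketch below is a placeholder rather than a real AIS price or message rate: the vessel counts, messages per vessel per day, cost per million messages, and shipment volume are all assumed for illustration. What matters is the structure: ingestion cost is driven by vessels tracked, while revenue is driven by customer shipments.

```python
def monthly_ingest_cost(vessels_tracked, messages_per_vessel_per_day,
                        cost_per_million_messages, days=30):
    """Ingestion and storage cost for one month under a linear cost assumption."""
    messages = vessels_tracked * messages_per_vessel_per_day * days
    return messages / 1_000_000 * cost_per_million_messages


def cost_per_shipment(vessels_tracked, customer_shipments, **cost_params):
    """Data cost allocated to each customer shipment in the month."""
    return monthly_ingest_cost(vessels_tracked, **cost_params) / customer_shipments


# Placeholder parameters, not real pricing or real message rates.
params = {"messages_per_vessel_per_day": 1000, "cost_per_million_messages": 2.0}

# Same 1,000 customer shipments; only the tracked universe differs.
full_universe = cost_per_shipment(50_000, 1_000, **params)
scoped = cost_per_shipment(2_000, 1_000, **params)
```

With these made-up inputs the per-shipment data cost differs by a factor of 25 between the two scopes, purely because of how many vessels are tracked. A team can plug in its own figures to see where that ratio sits for its product.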

The lesson is not to under-invest in data. The lesson is to be deliberate about scoping data collection to the specific prediction problems your product needs to solve for your current customers, and to build the architecture with a clear view of how data costs scale. Founders with strong ML research backgrounds sometimes approach data collection with a research mindset — more data is better, and the cost is a secondary concern. In a production product with real margin requirements, that mindset needs an operational check.

What we look for in technical diligence as a result of these observations

After fourteen months of close engagement with production ML systems in the logistics domain, our technical diligence on new investments has evolved. We spend more time on data quality monitoring infrastructure — whether the team has built carrier-level accuracy tracking and automated alerting, or whether they're still relying on customer calls to detect degradation. We probe more carefully on model architecture scalability — whether the initial design constrains what the product can become, and whether the founding team is aware of those constraints. We ask specifically about accuracy by segment, not just headline accuracy, and we try to understand what the customer sees in production versus what the pilot showed.

These are not qualitative soft signals. They are specific technical questions with specific good and bad answers. Founders who have built production ML systems at scale answer them confidently and in detail. Founders who haven't are often surprised that we're asking.