I spent four years building demand forecasting models at a Nordic grocery retailer before joining Kvistlund. The gap between what is marketed as ML forecasting and what is actually running in production at most retail and grocery logistics operations is large enough to define an investment thesis.
Most supply chain software vendors with "AI-powered forecasting" products are running exponential smoothing or ARIMA variants with a gradient boosting layer on top for promotional events. That's not inherently bad — these methods work reasonably well for stable, high-velocity SKUs. What they do not handle well is the long tail of SKUs, new product introductions, extreme demand events, and the complex cross-product demand interactions that define real grocery operations. When a vendor says "our ML accurately predicts demand across your assortment," what they usually mean is "our ML accurately predicts demand for the 20% of your assortment that moves enough volume to generate a reliable historical signal."
The statistical baseline and its limits
Classical time series forecasting methods (exponential smoothing, Holt-Winters, ARIMA) were designed for single time series with stable trend and seasonality structure. They require sufficient history (typically 12–24 months of regular sales data) and perform well when the data-generating process is stable over time: the same SKU, sold in the same store or channel, with predictable seasonal variation.
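To make the dependence on history concrete, here is a minimal sketch of simple exponential smoothing, the most basic member of this family. The function name, the smoothing weight of 0.3, and the sample numbers are illustrative assumptions, not a description of any vendor's implementation.

```python
def simple_exponential_smoothing(history, alpha=0.3):
    """Return the one-step-ahead forecast from past demand, oldest value first.

    alpha is the smoothing weight: higher values react faster to recent sales.
    """
    if not history:
        # With no sales history there is nothing to smooth, hence no forecast.
        raise ValueError("at least one observation is required")
    level = history[0]
    for observed in history[1:]:
        # New level = weighted blend of the latest observation and the old level.
        level = alpha * observed + (1 - alpha) * level
    return level


# Hypothetical weekly unit sales for one SKU in one store.
weekly_units = [120, 130, 125, 128, 132, 127]
next_week = simple_exponential_smoothing(weekly_units)
```

The forecast is nothing more than a weighted memory of the series itself, so a promotional spike or a stock-out week feeds straight into the level, and a SKU with an empty history cannot be forecast at all.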
Grocery and retail logistics violate this assumption continuously. New SKUs enter the assortment with no history. Existing SKUs are promoted, discounted, or temporarily out of stock in ways that corrupt the historical signal. Seasonal patterns shift year-over-year with weather variation. Competitor activity affects category velocity. The cross-price elasticity between a branded product and its private label equivalent means that forecasting them independently produces systematically wrong results for both.
Practitioners who have worked with these methods in production understand their limitations intuitively. The statistical baseline is a useful starting point, but it requires constant manual override and exception management by demand planners who are effectively correcting for the model's blind spots with their own judgment. This is expensive in planner time and inconsistent in quality.
Where ML adds genuine value
ML-based forecasting approaches — gradient boosted trees, deep learning sequence models, global neural network models — add genuine value in specific, well-defined ways. First, they handle the feature engineering problem better than univariate time series methods. A gradient boosting model that ingests price, promotional flag, competitor pricing, weather, local events, and category trend as features alongside historical demand produces more accurate forecasts for promoted items than any univariate method can, because it explicitly models the relationship between those drivers and demand outcomes.
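Whatever learner sits on top, the input is a table with one row per SKU, store, and date that places the drivers next to the demand history. The sketch below assembles one such row. Every field name and the 28-day minimum are hypothetical choices for illustration, and the model that would consume the row is left out.

```python
from statistics import mean


def build_feature_row(daily_units, price, regular_price, promo_flag, temperature_c):
    """Combine demand history with known drivers for one SKU, store, and date.

    daily_units: past daily sales, oldest first, ending the day before the
    forecast date. Field names are illustrative, not a fixed schema.
    """
    if len(daily_units) < 28:
        raise ValueError("need at least 28 days of history")
    return {
        "units_lag_7": daily_units[-7],            # same weekday one week earlier
        "units_mean_7": mean(daily_units[-7:]),    # recent demand level
        "units_mean_28": mean(daily_units[-28:]),  # longer-run demand level
        "discount_pct": 1.0 - price / regular_price,
        "promo_flag": int(promo_flag),
        "temperature_c": temperature_c,
    }


# Hypothetical example: a promoted item at 20% off on a warm day.
row = build_feature_row(list(range(1, 29)), 8.0, 10.0, True, 21.5)
```

A univariate method only ever sees the first three fields. The remaining ones are what let a tree-based model learn how demand responds to price and promotion.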
Second, global models (models trained across all SKUs simultaneously rather than fitting a separate model per SKU) address the cold-start problem for new products. A global neural network that has learned the demand profile shape for 50,000 SKUs can make a reasonable forecast for a new SKU that shares product attributes and shelf placement characteristics with known SKUs, even before any sales history has accumulated. This is a capability that per-SKU statistical methods do not have.
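The intuition behind cold-start forecasting can be shown with a deliberately simplified stand-in: instead of a neural network, average the demand of the known SKUs that share the most attributes with the new one. This is not a global model, only a sketch of the idea that attributes can substitute for history. The attribute names and figures are made up.

```python
def cold_start_forecast(new_attrs, known_skus, k=3):
    """Average weekly demand of the k known SKUs sharing the most attributes.

    new_attrs: dict of attribute name -> value for the new SKU.
    known_skus: list of (attrs_dict, mean_weekly_units) tuples.
    """
    if not known_skus:
        raise ValueError("no reference SKUs available")

    def overlap(attrs):
        # Count attributes on which the known SKU matches the new SKU.
        return sum(1 for key, value in new_attrs.items() if attrs.get(key) == value)

    ranked = sorted(known_skus, key=lambda item: overlap(item[0]), reverse=True)
    neighbours = ranked[:k]
    return sum(units for _, units in neighbours) / len(neighbours)


# Hypothetical assortment: two private label yoghurts and one beer.
known = [
    ({"category": "yoghurt", "brand_tier": "private", "shelf": "eye"}, 40.0),
    ({"category": "yoghurt", "brand_tier": "private", "shelf": "low"}, 30.0),
    ({"category": "beer", "brand_tier": "branded", "shelf": "eye"}, 200.0),
]
new_sku = {"category": "yoghurt", "brand_tier": "private", "shelf": "eye"}
estimate = cold_start_forecast(new_sku, known, k=2)
```

A trained global model learns a far richer version of this mapping from attributes to demand shape, but the input is the same: what the product is and where it sits, rather than how it has sold.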
Third, ML models handle intermittent demand (the long tail of slow-moving SKUs that generate one or two sales per week) better than ARIMA variants, which break down on sparse data. Approaches built on count likelihoods such as Poisson or negative binomial, or on zero-inflated distributions, can produce calibrated uncertainty estimates for slow-moving items that help replenishment systems make better order decisions under uncertainty.
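As a rough illustration of why a distribution is more useful than a point forecast for slow movers, the sketch below evaluates a zero-inflated Poisson and reads off the quantile a replenishment system might stock to. The parameter values are invented, and in practice they would be predicted per SKU by a fitted model rather than typed in by hand.

```python
import math


def zip_pmf(k, pi_zero, lam):
    """P(demand == k) under a zero-inflated Poisson.

    pi_zero: probability of a structural zero (no demand occasion at all).
    lam: Poisson mean on periods where demand can occur; must be positive.
    """
    # Computed in log space so large k does not overflow.
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


def zip_quantile(q, pi_zero, lam, max_k=10_000):
    """Smallest k whose cumulative probability reaches q."""
    cumulative = 0.0
    for k in range(max_k + 1):
        cumulative += zip_pmf(k, pi_zero, lam)
        if cumulative >= q:
            return k
    raise ValueError("quantile not reached; increase max_k")


# Hypothetical slow mover: no demand occasion on 60% of days, mean 2 otherwise.
median_units = zip_quantile(0.5, pi_zero=0.6, lam=2.0)
stock_to_units = zip_quantile(0.9, pi_zero=0.6, lam=2.0)
```

With these numbers the expected demand is 0.8 units per day and the median is zero, yet covering 90% of days requires three units on the shelf. A point forecast alone hides that spread.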
The production engineering challenge
Building ML forecasting models that perform well on academic benchmarks is comparatively straightforward. Running them in production at scale (daily forecast refreshes for 50,000 SKUs across 200 stores, with results pushed to replenishment systems before the next order window) is an engineering discipline that is fundamentally different from model development.
Feature pipelines are the hardest part. Promotional calendars need to be ingested, cleaned, and encoded before the promotional event, not after it. Weather data needs to be joined to store locations with appropriate lag structures. Competitor pricing data needs normalization. Each data source has its own latency characteristics, failure modes, and quality issues. A model that produces excellent forecasts when the features are clean produces degraded forecasts when the promotional flag is late because the campaign management system had an API outage. Building resilient feature pipelines that degrade gracefully under data quality failures is unglamorous engineering work that most academic ML teams have never done.
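One common shape for graceful degradation is a fallback chain per feature: use the live value if it arrived, otherwise a recent cached value, otherwise a neutral default, and always record which one was used. The sketch below assumes that pattern. The class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FeatureValue:
    value: float
    source: str  # "live", "stale", or "default"


def resolve_feature(live_value, cached_value, cached_at, now, max_age, default):
    """Pick the best available value for one feature and record its origin."""
    if live_value is not None:
        return FeatureValue(live_value, "live")
    # Fall back to the cache only if it is recent enough to be trusted.
    if cached_value is not None and now - cached_at <= max_age:
        return FeatureValue(cached_value, "stale")
    # Last resort: a neutral default, for example "no promotion".
    return FeatureValue(default, "default")


# Hypothetical outage: the campaign system is down and the cache is 6 hours old.
now = datetime(2024, 3, 1, 4, 0)
promo = resolve_feature(
    live_value=None,
    cached_value=1.0,
    cached_at=now - timedelta(hours=6),
    now=now,
    max_age=timedelta(hours=24),
    default=0.0,
)
```

Carrying the source label through to the forecast output matters as much as the fallback itself, because it lets planners and monitoring see which forecasts were produced on degraded inputs.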
Model retraining cadence is the second operational discipline. A model trained on two-year-old data will be out of step with a market that has since had supply chain disruptions, category mix shifts, and assortment changes. Retraining frequency needs to be balanced against compute cost and the risk of overfitting to recent noise. The right cadence varies by category: fresh food SKUs need more frequent retraining than shelf-stable grocery. These decisions require both ML knowledge and operational context.
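A simple way to express such a policy is a rule with two triggers: a maximum model age per category, and a tolerance on how far recent error may rise above the validation baseline. The sketch below uses weighted absolute percentage error (WAPE) as the error metric. The thresholds and category ages are placeholder assumptions, not recommendations.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: total absolute error / total volume."""
    total = sum(abs(a) for a in actuals)
    if total == 0:
        raise ValueError("WAPE is undefined when total actual volume is zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total


def should_retrain(recent_wape, baseline_wape, days_since_training,
                   max_age_days, tolerance=0.2):
    """Retrain when the model is too old or recent error has drifted upward.

    tolerance is relative: 0.2 allows recent error up to 20% above baseline.
    """
    if days_since_training >= max_age_days:
        return True
    return recent_wape > baseline_wape * (1.0 + tolerance)


# Hypothetical maximum model age by category, in days.
MAX_AGE_DAYS = {"fresh": 7, "ambient": 28}

recent = wape(actuals=[10, 20], forecasts=[8, 25])
retrain_fresh = should_retrain(recent, baseline_wape=0.18,
                               days_since_training=3,
                               max_age_days=MAX_AGE_DAYS["fresh"])
```

The age trigger caps how stale a model can get even when error looks fine, while the error trigger catches sudden shifts between scheduled retrains. The tolerance is what keeps the system from chasing recent noise.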
What this means for evaluating forecasting products
When we evaluate a demand forecasting startup, we ask specific questions about production architecture. What is the model retraining cadence? How are promotional events handled — are they a feature input or a separate overlay process? What is the fallback when a data source fails? How is model performance tracked in production, and what triggers a model refresh versus a manual override?
Founders who have run these systems in production have immediate, specific answers to these questions. Founders who built good research models and then productized them without operational experience tend to give answers that describe the demo environment rather than the reality of production deployments. The gap is usually visible within the first 15 minutes of a technical diligence conversation.
We're not saying that teams without grocery-specific operational experience can't build good forecasting products — they can, and the best teams learn fast. What we are saying is that the production engineering discipline required to run ML forecasting reliably in a high-SKU, high-velocity retail environment is distinct from model building skill, and the time to learn it is not post-investment. Investors who evaluate demand forecasting startups on model performance benchmarks alone are missing the harder part of the technical risk.
The enterprise sales reality
Grocery chains and large retailers buying demand forecasting software are not buying a model — they're buying a workflow change. Demand planners who have spent years with the incumbent system need to trust the new model's outputs enough to act on them. This trust-building process is slow, and it requires the forecasting product to generate observable wins — specific SKUs or categories where the model clearly outperforms the previous approach on metrics the planners understand and care about, like shelf availability during promotional peaks or waste reduction on fresh categories.
The companies building in this space that we find most compelling are the ones who treat the planner's trust-building journey as a product design problem, not just a sales problem. That means explainability features that let planners understand why the model generated a specific forecast, exception interfaces that surface the high-uncertainty cases for human review, and performance dashboards that make model accuracy visible over time in terms the business cares about. These features don't show up in a model performance benchmark. They show up in whether the customer actually changes their behavior based on the forecasts.
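An exception interface of this kind can start from something as simple as ranking forecasts by how wide the prediction interval is relative to the point forecast. The sketch below assumes each forecast arrives with 10th, 50th, and 90th percentile estimates. The field names, the threshold, and the queue size are illustrative.

```python
def exception_queue(forecasts, max_relative_width=1.0, limit=50):
    """Return SKUs whose prediction interval is wide relative to the median.

    forecasts: list of dicts with keys "sku", "p10", "p50", "p90".
    The result is ordered from most to least uncertain.
    """
    flagged = []
    for forecast in forecasts:
        width = forecast["p90"] - forecast["p10"]
        # Floor the denominator at 1 unit so slow movers do not divide by ~0.
        relative = width / max(forecast["p50"], 1.0)
        if relative > max_relative_width:
            flagged.append((relative, forecast["sku"]))
    flagged.sort(reverse=True)
    return [sku for _, sku in flagged[:limit]]


# Hypothetical forecasts: A is tight, B and C are highly uncertain.
todays_forecasts = [
    {"sku": "A", "p10": 90, "p50": 100, "p90": 110},
    {"sku": "B", "p10": 0, "p50": 10, "p90": 40},
    {"sku": "C", "p10": 5, "p50": 20, "p90": 50},
]
review_list = exception_queue(todays_forecasts)
```

Capping the queue length is a product decision as much as a technical one: a planner who is shown fifty cases a day will review them, while one who is shown five thousand will go back to overriding by habit.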