Against Time Series Foundation Models
Or: My Experience in Modern Forecasting
Time-series foundational models (TSFMs) are currently engaged in a knife fight to prove their worth against statistical models that have been around for half a century. TSFMs borrow similar architectural patterns from LLMs, and take a similar approach to curating large foundational datasets and pretraining. Only instead of words, the primitive is a time-series.
Why are they struggling to prove their worth? This is obviously not what happened in other domains where foundational models have emerged. Is it some property of reality that forecasting time-series is harder? Or are these models just not solving the right problems? And what exactly is the problem we solve when we build forecasts?
I’ve spent a lot of time using and tweaking these models, along with most other modalities of time-series prediction. After studying economics, I started my career at the Federal Reserve building term structure forecasts, moved on to Amazon on the supply-chain side, and recently finished a four-year run at Stripe working on cohort and financial forecasting. I’ve built a lot of forecasts throughout that period, but I also came to think more carefully about the deeper problem forecasts are actually trying to solve.
My prediction is that TSFMs will have some limited uses, but the future is not going to be larger and larger time-series foundation models. The future is going to be general agentic models doing search over specific forecasting problems, and then fitting something closer to a structural time-series model.
The TSFM bet
The foundation-model bet in time series is that pretraining across domains, and increasingly on synthetic data, will allow models to learn common temporal patterns across relevant kinds of data. This is not a crazy hypothesis.
Within forecasting, foundational datasets tend to consist of huge amounts of data indexed by time. Sometimes they also include tags for the domain or subdomain they belong to. But for the most part, they are just massive collections of real-world measurements over time. Increasingly, they also include synthetically generated time series.
The hope is that the model weights will absorb general facts about how data behaves through time. Consider, for example, that if two time series from totally different domains are both chugging along happily and then drop to zero for an extended period, something pathological has probably happened to them. Whether it is a heart rate or an asset price, once things hit zero they have often entered some sort of dead state. Or take seasonality: huge parts of the world, from human behavior to weather systems, are governed by recurring patterns across the calendar.
So the underlying idea is straightforward. Maybe there really are broad temporal priors that transfer across domains, and maybe sufficiently large models trained on sufficiently broad data can internalize them. When we train on all these time series, we are implicitly asking: what if all time series came from some kind of global time-series data generator?
Why foundation models underwhelm
Before I talk too much about the philosophy of why TSFMs don’t work very well, I should ground what I’m saying in some actual empirics. On newer and more realistic benchmarks like fev-bench, the latest foundation models do beat statistical baselines.
But if you look at the actual numbers, the story is much less impressive than it first sounds. On fev-bench, the strongest models achieve skill scores in roughly the 35–40% range, meaning they reduce error relative to Seasonal Naive by about a third. That is not nothing. But it is also not what a major breakthrough looks like. They also still lose to naive models on a number of composite series. These models are still in a competition where Seasonal Naive is a serious contender.
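To make the skill-score arithmetic concrete, here is a minimal sketch of how such a score relates a model's error to Seasonal Naive. It is a simplified single-series version using mean absolute error; the benchmark's own aggregation across many tasks is more involved. The toy series and the `model_forecast` values are invented for illustration.

```python
def seasonal_naive(history, season_length, horizon):
    """Repeat the last observed season forward."""
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def mean_absolute_error(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def skill_score(model_error, baseline_error):
    """Fraction of the baseline's error that the model removes."""
    return 1.0 - model_error / baseline_error


# Toy quarterly series (invented numbers): two years of history, one year held out.
history = [10, 20, 30, 40, 11, 21, 31, 41]
actual = [14, 24, 34, 44]

baseline_forecast = seasonal_naive(history, season_length=4, horizon=4)  # [11, 21, 31, 41]
model_forecast = [12, 26, 32, 46]  # hypothetical model output

baseline_mae = mean_absolute_error(actual, baseline_forecast)  # 3.0
model_mae = mean_absolute_error(actual, model_forecast)        # 2.0
skill = skill_score(model_mae, baseline_mae)                   # about 0.33
```

A skill score of one third here means the model is still making two thirds of the error that "copy last year" makes.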
Jeff Dean recently highlighted some of Google’s cutting-edge time-series foundation research. If we peek at it, we see it wins on the benchmark. One of the models it beats is N-BEATS. N-BEATS was a clever combination of structural priors and deep learning, but I could train it on my desktop back in 2019. We’ve since had the AI revolution and new GPUs, and Google can barely beat it?
This observation is true across all foundational time-series papers. The problem is that the depth of structure available to learn from in language, audio, video, and other rich modalities simply doesn’t exist in a time-series. I view a lot of these time-series benchmarks as similar to the 100m sprint. Beating them is a big accomplishment, but substantively you’re not winning by a lot, as we’re reaching the limits of predictive power in forecasts. Lots of these benchmarks look like the famous airline passengers time-series: a steady trend with a regular seasonal cycle layered on top. There is often not that much more juice to squeeze from these benchmarks.
As a final example on this point, consider DLinear. Zeng et al. (2023) showed that a model with basically a single linear projection could outperform transformer-based architectures on standard long-term forecasting benchmarks. The complex attention mechanisms, the positional encodings, the multi-head self-attention supposedly learning temporal dependencies across domains: all of it bought you less than a linear projection. That is not what you would expect to see if these datasets contained a deep reservoir of nonlinear structure waiting to be discovered.
That is really the key empirical point. If you come to a forecasting problem with millions of parameters trained on millions of other time series, and on average you beat a decent seasonal baseline by roughly a third, the natural conclusion is not that we are on the edge of a time-series foundation-model revolution. It is that the amount of learnable structure in these benchmark problems is fairly limited.
Why the gains are smaller than they look
The argument against my position is that there are patterns across all time-series, like cyclicality or trends, where we do get a useful prior from other time-series. This is true. But it also argues against the universality of a foundational model trained only on time-series. In this world it’s not that we’re learning a universal property from other time-series, we’re learning a specific property from a curated set of time-series. We’ve restricted our time-series to the set of what we consider well-behaved and forecastable.
If you look at these benchmarks they typically do not have series that are pathologically noisy, structurally explosive, or dominated by bizarre edge-cases. They are made up of the kinds of series we have decided are worth collecting, cleaning, publishing, and forecasting. So yes, electricity demand probably does have shapes that tell us something about other time-series. But part of the reason for that is that these are all forecastable time-series to begin with. We picked them because they live in a relatively tame part of the space.
In practice, my experience is that pretrained foundation models can at best alleviate the need for algorithmic tricks to catch model pathologies. When I built production forecasting systems, it was never sufficient to just pick a few out-of-the-box models and run them on each time-series. That is where the scientific libraries end. In practice, we would always have to build a complicated harness to catch edge-cases, explosive forecasts, pathological issues in our data, and other guards against forecasts that would result in us accidentally building an extra data-center or stocking too many shoes.
When you use a model that was pretrained on forecastable time-series, it learns a high-dimensional prior that your series probably should not blow up and go up by 10,000%. An ARIMA model that is specified incorrectly on growth rates has no conception of this. So part of the gain from TSFMs may simply be that they learn a prior against insane forecasts, not that they are uncovering deep nonlinear structure.
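As a rough illustration of the kind of explicit guard a harness has to provide when the model itself has no such prior, here is a minimal sketch. The function name `guard_forecast`, the 3x growth ceiling, and the 12-period window are all assumptions chosen for the example, not a description of any production system.

```python
def guard_forecast(history, forecast, max_growth=3.0, floor=0.0, window=12):
    """Clamp each forecast step into [floor, max_growth * recent average level].

    Returns the clamped forecast plus a per-step flag marking where the raw
    forecast was overridden, so a human or downstream check can review it.
    """
    recent = history[-window:]
    recent_level = sum(recent) / len(recent)
    ceiling = max_growth * recent_level
    guarded = [min(max(f, floor), ceiling) for f in forecast]
    flagged = [g != f for g, f in zip(guarded, forecast)]
    return guarded, flagged


history = [100.0] * 12
raw_forecast = [110.0, 500.0, 10100.0, -5.0]  # includes a 10,000% jump and a negative value
guarded, flagged = guard_forecast(history, raw_forecast)
# guarded -> [110.0, 300.0, 300.0, 0.0]
# flagged -> [False, True, True, True]
```

A pretrained model plausibly absorbs a softer version of this rule from its training data, which is a convenience but not evidence of deep structure.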
Why real forecasting problems break the whole frame
A useful example here is forecasts of asset prices. This is a case where the data is historically time-indexed, and it should be self-evident that forecasting the returns of a stock will allow you to print money.
Yet in fact forecasting the direct returns of a given asset in the S&P 500 is functionally impossible. Even quant shops try to forecast adjacent time-series instead. A quant may try to forecast the short-run ratio between an asset’s value and the price of some esoteric commodity to look for mean reversion (perhaps the firm depends on that commodity). This pivots the problem from forecasting the non-stationary price of the asset to forecasting a stationary ratio X/Y. Most complex predictions in adversarial markets operate this way.
At the Fed, I saw the same thing: the 10-year rate forecast depended on a structural theory connecting maturities, inflation-protected securities, and swaps, none of which a TSFM knows about. At Amazon we would forecast units per customer rather than only units. We spent as much time deciding on the actual object to forecast as we did estimating the model to forecast it.
As a final example, at Stripe we worked on cold-start cohort forecasts to model nascent trends. Defining subsets of users across dimensions, and building an algorithm to recursively forecast them with only a few samples wasn’t the obvious and straightforward approach. In fact, it took us nearly a year to transform the data into the exact correct object that would encapsulate what the company required.
The exact definition of when a customer officially becomes a customer, and how long we should track a cohort for, are intrinsically related to the strategic space leadership operates within. Should we forecast a cohort forward for one year? Or two years? Should we identify AI companies separately? These are in some sense scientific questions, but they’re scientific questions to optimize the global loss function of max(revenue | business actions). We are not bestowed with some values and simply tasked with minimizing forecasting error.
At that point the problem is no longer just “fit a forecasting function conditional on the data.” Sometimes you have to define the forecasting function conditional on the data. Sometimes you have to search for the right data and then define the function. Sometimes you are searching over both the data and the function at the same time. You want some method of indexing probably-current information within your domain. Historically, algorithmic forecasting has not considered itself a search problem, because algorithmically searching for knowledge was too difficult. As a result, the field doubled down on the time-series approach to forecasting, which is historical signal detection and extrapolation.
Consider one example to show just how large this function space is. Suppose you’re dealing with financial markets, specifically commodity futures and equity returns, looking at five-minute sampled returns.
We know there tend to be long-run structural equilibriums in the price of these assets. We also know that in the short run, institutional players who are hedging or rebalancing can cause short-term deviations. These markets are usually quite efficient and liquid. Short-term deviations are generally on the order of basis points, but really astute traders and dialed-in statistical arbitrage will get rid of those.
You come to that data with a prior: that there’s some very, very subtle but theoretically justified signal for why there should be a lead-lag covariance matrix in asset returns. If you have that prior, then when you notice it in your data, first, your model has to be structured such that there’s a functional expression that could pick up that signal. Second, how do you know if that signal is noise?
If your model is a very flexible attention model, it might see that sometimes — not on all occasions — there’s a very small relationship between the fourth-lagged covariance of some asset and 20 minutes in the future on some other one, some sort of mean reversion. If your model were aware that this was an equity market and had some theoretical financial prior, it would probably give more weight to that signal. It would also be aware that in this world, the loss function is extremely sensitive, because even predicting a few basis points can make a ton of money.
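To put a number on "how do you know if that signal is noise," here is a minimal sketch of a lagged cross-correlation alongside the approximate band you would expect from two unrelated white-noise series. The 1.96/sqrt(n) band is the standard large-sample approximation and assumes roughly independent observations, which real five-minute returns only loosely satisfy. The series values are invented.

```python
import math


def lagged_correlation(leader, follower, lag):
    """Pearson correlation between leader[t] and follower[t + lag]."""
    x = leader[:len(leader) - lag]
    y = follower[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def noise_band(n_pairs):
    """Approximate 95% band for the sample correlation of unrelated white noise."""
    return 1.96 / math.sqrt(n_pairs)


# Invented returns where the follower copies the leader exactly four bars
# (20 minutes at five-minute sampling) later.
leader = [0.3, -0.1, 0.4, -0.2, 0.1, 0.5, -0.3, 0.2, -0.4, 0.0, 0.6, -0.5]
follower = [0.0] * 4 + leader[:-4]
perfect = lagged_correlation(leader, follower, lag=4)  # 1.0 up to rounding

one_session = noise_band(78)      # roughly one 6.5-hour session of five-minute bars
ten_thousand = noise_band(10000)  # 0.0196
```

With 78 bars the band is above 0.2, so a true correlation of a few hundredths is invisible without far more data or a prior that tells you where to look. That is the sense in which the structure has to come from theory rather than from the sample.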
In fact, much of time-series econometrics and asset pricing can be thought of as imposing theoretically-based functional priors on how a time-series should be allowed to behave. Essentially the economist is asserting, based on their knowledge of the game theoretic dynamics of a market, that data violating these functions is itself wrong. This can be seen in no-arbitrage conditions, but really in the entire field more generally.
This is built on the philosophy of science in economics, which would say something like: “If you have a time-series of $20 bills on the ground, it’s fine if they drop then are quickly picked up, but if one is on the ground for three days in a row in a busy location, the data itself is probably wrong and we shouldn’t learn from it.”
Of course, this is only economics. Every field has its own structural invariants. Electricity forecasting has physics-based constraints, database pricing has compute restrictions, retail or manufacturing demand has to follow operational constraints. These all bound the problems, but in a way that doesn’t itself flow into the models.
There is a naive idea that if you throw in this additional information as a covariate it will automatically correct for these structural misspecifications, but that’s cope. In data with a temporally constrained sample size, covariates are not a replacement for structural specifications. A hyper-flexible TSFM that encourages you to dump in a ton of covariates is extremely unlikely to uncover these subtle signals, which instead need to be built into the structure of the model.
Despite this, you have these attention-based models, these deep learning models that are capable of picking up such relationships, and they don’t have that prior. All they’re getting is temporal data. They don’t know anything about the domain.
Who are these models even for?
This raises a question. Yes, you can build a model that tries to be as accurate as possible across a wide set of domains using only quantitative temporal data. But why would you do that? Who is that helping? When do people say, “I really need an architecture right now that works on any domain, that doesn’t have to know any priors or theory about my domain”? Why would you want that?
There is one reason: If you just need a half-decent forecast from an API and don’t have time to build your own system, these models are fine. They’re like the modern day Facebook Prophet. Use them and move on with your life. This essay isn’t about that use case. It’s about whether TSFMs are the future of the field, and I don’t think they are.
I don’t see that approach scaling to a universal solution, though, for even slightly complicated domains with heavy context.
Why we didn’t just do this before
One question here is: if structural time-series models work so well, why did anyone even start building time-series foundational models? In fact, the answer to this question was so obvious there was no point in asking it for a very long time.
The reason is that having a human craft a custom ARIMA or state-space model did not scale. When you have to build 1,000 forecasts a night, you can’t hand-build a separate model for each one. Even if you could build that many, maintaining this many idiosyncratic models would require a massive and expensive team. So the fundamental problem of the applied forecasting field was how to build automatic algorithms that were almost as good as bespoke custom models made for each time-series by an artisanal econometrician.
(As an aside, there was one Saturday at Amazon a long time ago where we did exactly that. For a critical release we had five people come in and manually review and adjust thousands of forecasts. This sounds insane, and yeah it sort of was, but at the time there honestly wasn’t any better way to produce accurate forecasts at scale for a critical seasonal release.)
From here though you can see where we’re going, because we’re approaching a world where we can have 1,000 custom agents.
What I mean by a small model
A lot of my argument so far has been empirical and based on my own experience as to why low-parameter models work well. I also want to explain the mechanics a bit from first principles. Everyone who has worked in forecasting knows what an auto-regressive integrated moving average (ARIMA) model is. This essay isn’t a statistics lecture, but a simple way to think about it is that ARIMA is a family of relatively small statistical functions that learn how to weight previous observations to predict the future, sometimes on the level of the series itself and sometimes on its changes.
The important point is not the textbook definition. It’s that ARIMA lives in a small but expressive function space. ETS models are also small. State space models are often small. EWMA is small. Croston is small. The broader point is that recursive statistical models are much more expressive than they look. A small number of parameters can still generate a very wide range of behaviors once they govern a recursive process through time. As a smart man once said, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Level, trend, seasonality, damping, differencing, error propagation, these are not toy choices. Small parameter count here does not mean a weak function space. These are compact structural guesses about how a series behaves.
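As one illustration of how much behavior a few recursive parameters can generate, here is a minimal sketch of Holt's damped-trend exponential smoothing with three parameters. The initialization (first value as level, first difference as trend) is a crude assumption for brevity, and the series is invented.

```python
def damped_holt_forecast(series, alpha, beta, phi, horizon):
    """Holt's damped-trend method: alpha smooths the level, beta the trend,
    and phi controls how quickly the trend fades over the forecast horizon."""
    level, trend = series[0], series[1] - series[0]  # crude initialization (assumption)
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + phi * trend)
        trend = beta * (level - previous_level) + (1 - beta) * phi * trend
    forecasts, damping = [], 0.0
    for h in range(1, horizon + 1):
        damping += phi ** h
        forecasts.append(level + damping * trend)
    return forecasts


series = [1.0, 2.0, 3.0, 4.0, 5.0]
linear = damped_holt_forecast(series, alpha=0.5, beta=0.5, phi=1.0, horizon=3)  # [6.0, 7.0, 8.0]
damped = damped_holt_forecast(series, alpha=0.5, beta=0.5, phi=0.5, horizon=3)  # growth that flattens
flat = damped_holt_forecast(series, alpha=0.5, beta=0.5, phi=0.0, horizon=3)    # [4.0625, 4.0625, 4.0625]
```

Changing a single number, `phi`, moves the same recursion from straight-line extrapolation, to growth that levels off, to a flat forecast equivalent to simple exponential smoothing.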
Once you see that, the scale mismatch becomes hard to ignore. If a model with a handful of parameters is often competitive, and a model with hundreds of millions of parameters only reduces error by roughly a third on optimistic benchmark readings, then you have to ask what all those extra parameters are really buying you.
We do not really know the full answer to that. But I can say what I think they are buying you. I think some nontrivial amount of it is the sort of edge-case handling and harness logic that people used to have to build explicitly around smaller models. Because these systems are trained on large collections of forecastable series, they are probably learning something like an ARIMA or ETS functional space internally, but with broader global priors on the parameters that prevent pathological cases from blowing up. A lot of the ugliness here comes from estimation error, missing values, bad preprocessing, and other issues of that nature, not from some deep nonlinear forecasting signal waiting to be discovered.
And even these benchmark comparisons against ARIMA are not really true comparisons. In practice they are usually comparisons against algorithmic wrappers on ARIMA, like auto-ARIMA or auto-ETS, which try to guess the right specification using statistical heuristics. That is a very different thing from asking a talented econometrician with actual domain knowledge to hand-craft a statistical model, taking care to deal with outliers in a theoretically justifiable way.
In practice, I have never once run auto-ARIMA on some time-series and called that a production forecast. I’ve spent years building harnesses to make it easy and scalable to add domain-knowledge restrictions into these spaces. The Amazon ARIMA was very different from the Stripe ARIMA. The core algorithm might be the same, but I would often spend months crafting the surrounding harness so it could work at scale in ways that were accurate and credible.
The number of choices here that require context is difficult to overstate. Should you log-transform the data? Difference it? Add parameter restrictions? Add guardrails based on prior beliefs? Decide whether historical errors should propagate? These are not cosmetic choices. In temporally constrained data, they often determine whether the model is sensible at all. Because the sample size is fundamentally limited, the data usually does not just speak for itself here in the way it can in richer domains like language.
Here’s an example: at my last job I had to forecast more than a million time series. That’s both a lot and not a lot depending on your perspective and tech stack. It’s too much to just throw into pandas and run sequentially, particularly including backtests. You could make it work with distributed compute, but then you’re dealing with distributed compute.
The existing foundational models and open-source architectures aren’t built for this kind of low-signal, sparse data. Most of the time-series were noisy series from smaller users. I know things about the users. I know the upstream data pipelines. I know what additional data I could get if I needed it, which would not be trivial to get. From four years of working on the problem, I know which patterns, users, and times are more likely to be seasonal vs. noise. A lot of times it’s sparse demand. Sometimes it’s broken data, and I know which it is from context.
For probably half the series, I think we would want an EWMA — I know enough about the data to have the good sense we can’t beat that. For the other series, we need something to model seasonality, some extrapolation. The series inform each other in the sense that knowing what other series look like helps you say, “This series is probably going to look the same as them.”
But there is no reason they should have any lead-lag relationship with each other. I know what a user did last month won’t explain much about what a different user does this month. Any model flexible enough to learn such a relationship will only pick up estimation error, i.e. learn from noise. And remember, real-world time-series often only have 24 months of observations, so unlike traditional deep learning in the LLM space, you can’t saturate the models with parameters and trust the sheer size of your data will overwhelm noise.
So what architecture do I use for my data? What’s the single model that I should pick out of a benchmark table that will “just work”?
The answer is that I just built my own model, really quickly. I knew that basically a de-trended ETS model works on all of these, but I wanted my parameters to be informed by other series. So I worked with GPT-5 Pro and said: I need an architecture where I’m basically fitting an ETS model to every series, but I’d like to regularize each series toward some sort of global distribution. If there’s some really noisy short series and I’m trying to estimate the parameters of this model, they might be penalized toward a global mean.
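To give a feel for the shape of that idea, here is a minimal sketch, not the model described above. It assumes simple exponential smoothing instead of a full ETS specification, a grid search instead of a proper optimizer, and a quadratic penalty that pulls each series' smoothing weight toward the panel average. All names and numbers are hypothetical.

```python
GRID = [i / 20 for i in range(1, 20)]  # candidate smoothing weights 0.05 ... 0.95


def one_step_sse(series, alpha):
    """Sum of squared one-step-ahead errors for simple exponential smoothing."""
    level, sse = series[0], 0.0
    for y in series[1:]:
        sse += (y - level) ** 2
        level = alpha * y + (1 - alpha) * level
    return sse


def fit_alpha(series, global_alpha=None, penalty=0.0):
    """Grid-search alpha, optionally penalizing distance from a global alpha."""
    def objective(alpha):
        shrink = 0.0 if global_alpha is None else penalty * (alpha - global_alpha) ** 2
        return one_step_sse(series, alpha) + shrink
    return min(GRID, key=objective)


def fit_panel(panel, penalty):
    """Fit every series alone, average the results, then refit with shrinkage."""
    local = {name: fit_alpha(s) for name, s in panel.items()}
    global_alpha = sum(local.values()) / len(local)
    shrunk = {name: fit_alpha(s, global_alpha, penalty) for name, s in panel.items()}
    return local, global_alpha, shrunk


panel = {  # invented data
    "long_steady": [100, 102, 101, 103, 104, 103, 105, 106, 105, 107, 108, 107],
    "long_shifting": [50, 50, 51, 60, 61, 60, 72, 71, 73, 80, 81, 80],
    "short_noisy": [5, 9, 2, 8],
}
local, global_alpha, shrunk = fit_panel(panel, penalty=50.0)
```

Because the fit term is a sum over observations, a short series has less evidence to push back against the penalty, so its parameter tends to move further toward the global value. In practice you would also put the series on a common scale first so the penalty means the same thing for each one.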
You could obviously extend this. You could say you have categorical variables, so you want to penalize toward the local category. You can do stuff with seasonality too: what percentage of this is seasonal? How should I trust whether a series is seasonal or not? If I have some prior from looking at the series, like half of them are seasonal, then I can bake that into the model.
This model is accurate and fast because I’ve made it sample efficient for my data. The pure attention of these other models is so flexible, and flexibility when it’s not needed is not a great thing in statistics. It lets you overfit, makes you more sample hungry, and in time series, you’re often very sample constrained.
Yes, I have 1.7 million time series, but they all have the same time index. So yes, I have a high cross-sectional sample, but my temporal sample is still very low. If you think of each year as a single growth number for a company (which is both how companies think about it and also sort of true, since the latent growth for a company is not super volatile day-to-day), then I might actually only have four yearly data points. That’s not a lot. Sample efficiency is important.
This model ended up being used for about 800,000 series, with an EWMA for the small noisy series. The ETS structure had a handful of parameters for each one, and then there were a handful of global parameters. So we ended up with around 4 million parameters. Of course that is not the same thing as a 200M parameter deep model. Most of these were local parameters inside an explicit recursive structure. They were not all jointly learning some giant hidden representation of the world. Each one could be introspected and had a clear effect on a specific series or on the global shrinkage behavior.
When I see these hundred million or billion parameter deep learning models, what I think is: you’re burning millions of parameters to try and learn the optimal 10-parameter structural representation. Wouldn’t it be better if you had some sort of AI construct the correct structural representation, estimate the parameters for it, and give you a forecast the agent thinks is reasonable?
(If you really, truly have nonlinear functions to learn, then you should go for a semi-parametric approach: fit a structural time-series model to your data, but add a nonlinear layer on the residuals to pick them up.)
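A minimal sketch of that semi-parametric shape, under loud assumptions: the "structural" part is just a least-squares linear trend, and the "nonlinear layer" is a piecewise-constant lookup on one covariate. The covariate, the bin edge, and the data are invented, and a real version would use a richer structural model and a proper residual learner.

```python
def fit_linear_trend(y):
    """Closed-form least squares for y[t] = intercept + slope * t."""
    n = len(y)
    t_mean, y_mean = (n - 1) / 2, sum(y) / n
    slope = (sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return y_mean - slope * t_mean, slope


def bin_index(x, edges):
    """Number of edges that x meets or exceeds, used as the bin label."""
    return sum(x >= e for e in edges)


def fit_residual_bins(covariate, residuals, edges):
    """Nonlinear residual layer: mean residual within each covariate bin."""
    totals = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for x, r in zip(covariate, residuals):
        totals[bin_index(x, edges)] += r
        counts[bin_index(x, edges)] += 1
    return [tot / c if c else 0.0 for tot, c in zip(totals, counts)]


# Invented data: a linear trend plus a jump whenever the covariate is 30 or more.
temperature = [20, 35, 25, 31, 22, 33, 24, 36]
y = [10 + 2 * t + (6 if x >= 30 else 0) for t, x in enumerate(temperature)]
edges = [30]

intercept, slope = fit_linear_trend(y)
residuals = [v - (intercept + slope * t) for t, v in enumerate(y)]
bin_means = fit_residual_bins(temperature, residuals, edges)

sse_trend_only = sum(r ** 2 for r in residuals)
sse_with_layer = sum((r - bin_means[bin_index(x, edges)]) ** 2
                     for r, x in zip(residuals, temperature))
```

The structural part stays small and inspectable, and the residual layer only has to explain what the structure could not.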
I would rather have a time-series agent observe the full context of our data pipelines, data, and problem, then propose and implement solutions of this shape, rather than pass all those series into a TSFM and cross my fingers that the pre-training and fine-tuning gets it right.
TimesFM is 200M parameters. Chronos ranges from 20M to 710M. TimeGPT is closed but reportedly similar scale. Compare that to an ARIMA(1,1,1), which has 3 parameters, or ETS, which has maybe 10–15.
What I’m really describing here are two separate axes. One is whether you are fitting inside a fixed model class, or searching over the model structure itself. The other is whether you have rich outside context, or almost none. Once you lay the space out this way, I think the current time-series foundation-model paradigm looks much less impressive, and the alternative becomes easier to see.
The problem in forecasting is often not that we lack a function class expressive enough to fit the future. It is that we need to find the right small function, with the right constraints, for the right object.
What the agentic system actually looks like
My argument is basically that forecasting has spent a long time moving between the bottom two boxes, because those were the only two scalable production options in the past. The real opportunity now is in the top-right box, because we can increasingly spin up agents.
The first forecasting library I built at Stripe was an attempt to operationalize the bottom-left box at scale, with production-grade harnesses and execution environments. Over time, we pushed parts of the system toward the bottom-right box, with mixed results. The main benefit there was that when you have to build forecasts for millions of time-series, it is more efficient to push more of that work onto the GPU. If I were still there, I would be working on the top-right box.
I would assign a single agent to each account. I would not think of this as building a custom forecasting model so much as synthetically creating a business analyst and an econometrician responsible for forecasting a single user. These agents would not just produce forecasts.
They would also function as forecasters, business analysts, anomaly detectors, and advocates for their customer. Instead of going ever deeper into just time-series forecasts, the field should instead expand upwards and downwards on the forecasting stack. Forecasting analytics should be an intrinsic part of the problem space we solve.
Each agent could pass messages to other agents, and pass information upward to agents responsible for larger parts of the forecasting space. Some higher-order agents might be responsible for products, others for regions, and they could themselves build forecasts at higher levels of aggregation, then reconcile those forecasts back down to the lower levels. The complexity here stays in search, context, and memory, and then gets compressed into small symbolic models when needed.
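To make "reconcile those forecasts back down" concrete, here is a minimal sketch for the simplest possible hierarchy: one higher-level forecast and the lower-level forecasts beneath it. It uses ordinary least squares reconciliation, which is one standard choice among several, and the numbers are invented.

```python
def reconcile_two_level(total_forecast, bottom_forecasts):
    """OLS reconciliation for one total and n bottom-level forecasts.

    The disagreement (total minus sum of bottoms) is spread evenly across all
    n + 1 forecasts, which is the least-squares projection onto the set of
    forecasts where the bottoms add up exactly to the total.
    """
    n = len(bottom_forecasts)
    gap = total_forecast - sum(bottom_forecasts)
    adjusted_bottoms = [b + gap / (n + 1) for b in bottom_forecasts]
    return sum(adjusted_bottoms), adjusted_bottoms


# Hypothetical: a regional agent forecasts 72, its three account agents forecast 10, 20, 30.
region, accounts = reconcile_two_level(72.0, [10.0, 20.0, 30.0])
# region -> 69.0, accounts -> [13.0, 23.0, 33.0]
```

An agentic system could go further and weight the adjustment by how much each agent trusts its own forecast, but the coherent-by-construction output is the important part.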
In that world, the forecasting system stops looking like a time-series model and starts looking more like an actual forecasting organization. The tribal knowledge of the forecasting system no longer lives only in code, dashboards, and scattered human judgment. It becomes owned by the agents themselves. There would still be context engineering to impose common patterns and guardrails, but each agent would own its own module, maintain its own state, and evolve as it learned new things or got new context.
Why this is better than TSFMs
In that world, the difference between statistics and deep learning, or machine learning and statistics, stops being as stark as it is today. Today it is usually either humans pick functions, or machines pick functions that you do not get to observe. That is what a lot of machine learning really is. There is some function being learned, but once the function becomes too complicated for a human to reason about, there is not much alternative. Language models really do seem to live in that world. The function is so complicated that no human could ever write it down, hold it in their head, or reason through it directly.
But forecasting does not seem to have this property. Reality is too chaotic, and predicting the future is too difficult, for there to be some remarkable hidden function just waiting to be summoned through enough GPUs. In the forecasting domains we actually care about, richer function classes usually do not reveal some deep nonlinear truth. They overfit, or they mistake noise for signal, adding estimation error.
In that sense, this is a kind of epistemic regularization. The set of functions a human can actually observe is smaller, more tractable, and more parsimonious than the space of all possible functions. In forecasting, that seems to be a feature rather than a bug.
There are at least two other benefits to this setup. First, we do not really know why TSFMs make the forecasts they make. We know the architecture, we know the training setup, but once the parameter count is in the millions, it becomes very difficult to speak mechanically about what the model is actually doing. Second, while a human probably would not read the symbolic logic for a million separate forecasts, they could. That legibility matters. High-context reasoning mapped into a low-parameter symbolic model remains something a human can inspect, critique, and build institutional knowledge around.
A better research agenda
It’s not trivial. You can’t just paste all your data into GPT and ask it to do this. You have to build it. You need a general purpose analytics agent that observes your data, takes in whatever context matters, and then proposes an architecture that encodes those contextual priors. At that point you are no longer just throwing an attention model at all your time series. You are using search and context to craft an architecture that scales and is actually suited to your data. And you are not locked into deep learning here. The agent can propose a statistical model, a state space model, a hybrid, or even a judgement-based forecast.
This is almost the opposite of what a foundation time-series model does. A TSFM’s “beliefs” are locked in weights, inaccessible to critique, and unable to be updated with a single piece of new context without retraining.
The agentic approach recreates more of the workflow a human forecaster follows: gather context, form explicit hypotheses about the structure of the problem, choose a model that encodes those hypotheses, estimate parameters, and explain why. If someone pushes back and says “you ignored seasonality,” you can actually revisit the reasoning rather than shrug at a loss curve.
And from first principles, I do not think we live in a world where sufficiently large models discover some hidden nonlinear signal that reveals the future in deep detail. Questions like “when will the Ukraine war be resolved” are not going to be answered by a billion-parameter transformer that has finally found the right latent pattern. We do not live in the world of Nostradamus.
The hard problems with the agentic approach are not estimation. They are search, verification, and failure modes. When an agent decides your data needs a hierarchical state space model with some specific constraints, how do you know it is right? But at least now the opacity has moved from the weights into an explicit data generating process with an explanation of the structure. That is progress, and it is much closer to how economists traditionally forecast.
We should be a little careful here, as we do not want to assume traditional forecasts are the optimal thing to recreate. With that said, on any forecasting system I have ever built, I could always beat an automated system forecast if I went forecast by forecast, looked at the results, looked up additional context, and tweaked it based on that information.
None of this guarantees the agent is calibrated or that the search process finds good hypotheses. Those are real open problems. But they are problems we know something about from decades of research on human judgment, and they can be attacked with legible interventions: better retrieval, explicit uncertainty quantification, adversarial review of model choices. That seems like a more tractable research agenda than hoping scale solves interpretability.
As humans, for now, we still own instructing the agents, the forecasts themselves and how they are used. In this world our job starts looking a little less like scientists, who often pass on forecasts to the decision makers. We should instead build the agents, then promote ourselves to the decision makers.




