Apple Watch VO2 max estimates run 10 to 15 percent off on average. Here is how the algorithm works and what real testing actually measures.

Your Apple Watch's VO2 max estimate is meaningfully wrong. The published literature puts the error at plus or minus 10 to 15 percent against gold-standard treadmill testing, and the direction of the error is not random. Here is how the algorithm actually works, what the clinical alternative measures, and when the wearable reading is good enough.
TL;DR
- Apple Watch VO2 max is an estimate inferred from your heart-rate response to walking and running pace, calibrated against a population dataset. It is not a direct measurement of oxygen utilisation.
- The error against gold-standard treadmill testing with respiratory gas analysis runs roughly 10 to 15 percent in published validation studies. The error is structural (the algorithm cannot see oxygen consumption directly) rather than random noise.
- For tracking your own trend over months and years, the watch reading is genuinely useful. For an absolute longevity-decision benchmark, it is not adequate.
- A graded treadmill test to volitional exhaustion is the clinical gold standard but is genuinely the wrong test for most untrained adults. The right read for the general population is a validated sub-maximal protocol like the YMCA 3-minute step test, which Catalyst uses on the 4-Pillar Healthspan Assessment.
- VO2 max is the strongest single predictor of all-cause mortality in modern cardiology. Whatever methodology you use to measure it, having an honest number matters more than having a precise one.
How the Apple Watch VO2 max algorithm works
The Apple Watch does not measure VO2 max directly. No wrist-worn device on the consumer market does. What the watch measures is your heart rate (via the photoplethysmography sensor on the back of the case) and your movement intensity (via GPS pace plus accelerometer data). It then feeds those two inputs into a predictive model that estimates your VO2 max from your heart-rate response to a given pace.
The published methodology, refined through several watchOS updates, draws on the original Firstbeat algorithm, the same prediction framework used in Garmin and Polar wearables. The model assumes a reasonably linear relationship between heart-rate-at-pace and aerobic capacity within a population. If you run a 6-minute kilometre and your heart rate stays at 145 bpm, the model assigns you a higher predicted VO2 max than someone whose heart rate hits 175 bpm at the same pace.
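To make the heart-rate-at-pace logic concrete, here is a minimal sketch in Python. It is not Apple's model. It swaps in a textbook-style approach: the ACSM running equation for the oxygen cost of level running, extrapolated to maximal heart rate on the assumption that percent heart-rate reserve tracks percent VO2 reserve. The function names, the resting heart rate of 60 bpm and the maximal heart rate of 180 bpm are illustrative assumptions, not values from any watch.

```python
RESTING_VO2 = 3.5  # ml/kg/min, the conventional value for one MET


def running_vo2(pace_min_per_km: float) -> float:
    """Oxygen cost of level running (ml/kg/min) from the ACSM running equation."""
    speed_m_per_min = 1000.0 / pace_min_per_km
    return RESTING_VO2 + 0.2 * speed_m_per_min


def estimate_vo2max(pace_min_per_km: float, hr_at_pace: float,
                    hr_rest: float, hr_max: float) -> float:
    """Extrapolate a sub-maximal effort to maximal heart rate.

    Assumption: the fraction of heart-rate reserve in use equals the
    fraction of VO2 reserve in use. This is a simplification, not the
    proprietary model any wearable ships.
    """
    fraction_of_reserve = (hr_at_pace - hr_rest) / (hr_max - hr_rest)
    if not 0.0 < fraction_of_reserve <= 1.0:
        raise ValueError("heart rate at pace must lie between rest and max")
    net_cost = running_vo2(pace_min_per_km) - RESTING_VO2
    return RESTING_VO2 + net_cost / fraction_of_reserve


# Two hypothetical runners at the same 6-minute kilometre pace.
calm_runner = estimate_vo2max(6.0, hr_at_pace=145, hr_rest=60, hr_max=180)
strained_runner = estimate_vo2max(6.0, hr_at_pace=175, hr_rest=60, hr_max=180)
print(round(calm_runner, 1), round(strained_runner, 1))
```

With those assumed inputs the 145 bpm runner comes out near 50.6 ml/kg/min and the 175 bpm runner near 38.3, purely because of the heart-rate difference at the same pace. Nothing in the calculation knows why the heart rate is what it is.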
This is a reasonable inference for a population. It is a less reasonable inference for an individual whose physiology departs from the population mean. Lean adults with high blood volume often score artificially high. Adults on beta-blockers or other heart-rate-suppressing medication score artificially high. Adults who are deconditioned but have unusually high genetic baseline cardiorespiratory capacity score artificially low. The algorithm cannot see why your heart-rate-at-pace looks the way it does. It just maps the input to the population distribution.
What the algorithm needs to produce its estimate: enough outdoor running or walking data, with consistent GPS-tracked pace, in a heart-rate range that gives the model signal. Treadmill running, indoor training without GPS, and slow walks below roughly 5 km/h typically do not generate a fresh VO2 max estimate, which is why your watch's number can sit static for weeks even as your actual fitness changes.
Why the error is structural, not random
Random error in measurement averages out over many readings. Structural error does not. The Apple Watch's VO2 max estimate has structural error because the algorithm cannot see the variable it is trying to predict.
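The difference between the two kinds of error is easy to see in a small simulation. The sketch below is illustrative only: the true value of 40 ml/kg/min, the 12 percent bias and the noise level are made-up numbers chosen to sit inside the error range discussed in this piece, and simulate_readings is a hypothetical helper, not anything a watch runs.

```python
import random
import statistics


def simulate_readings(true_vo2max: float, bias_pct: float,
                      noise_sd: float, n: int, seed: int = 0) -> list[float]:
    """Generate n readings with a fixed percentage bias plus Gaussian noise."""
    rng = random.Random(seed)
    biased_centre = true_vo2max * (1 + bias_pct / 100)
    return [rng.gauss(biased_centre, noise_sd) for _ in range(n)]


TRUE_VALUE = 40.0  # assumed true VO2 max, ml/kg/min

# Random error only: readings scatter around the true value.
noise_only = simulate_readings(TRUE_VALUE, bias_pct=0, noise_sd=2.0, n=5000)

# Structural error: every reading is shifted the same way before noise is added.
with_bias = simulate_readings(TRUE_VALUE, bias_pct=12, noise_sd=2.0, n=5000)

print("mean, noise only:", round(statistics.mean(noise_only), 1))
print("mean, with bias: ", round(statistics.mean(with_bias), 1))
```

Averaging thousands of noisy readings pulls the first mean back to the true value. The second mean settles about 12 percent above it and stays there however many readings are added.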
A treadmill VO2 test measures oxygen consumption directly. You wear a mask that captures every breath, the gas analyser reads the oxygen and carbon dioxide content of inhaled and exhaled air, and the test calculates how much oxygen your body actually used per minute at peak effort. That number is your VO2 max. There is no inference involved.
The watch is doing prediction. It looks at heart-rate response to pace, applies a population model, and outputs a number. The prediction is reasonable on average and useless at the individual extremes. The watch genuinely does not know whether your low heart rate at a 6-minute kilometre pace reflects elite cardiorespiratory fitness, a beta-blocker prescription, or a body composition unusual for your age band.
The Apple Watch is useful for tracking your trend. It is not useful for deciding what your fitness actually is.
Published validation studies in journals including the Journal of Sports Sciences and Frontiers in Physiology repeatedly find errors of plus or minus 10 to 15 percent against direct gas-analysis measurement. The error correlates with how much an individual's physiology diverges from the population mean. The error does not converge to zero with more readings, because the algorithm cannot see the missing variable.
If your watch says your VO2 max is 42 ml/kg/min, your actual VO2 max under direct measurement could plausibly sit anywhere between roughly 36 and 48 (42 plus or minus 15 percent). That range spans several percentile bands on age-adjusted norms. The single number is not load-bearing for a longevity decision.
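The arithmetic behind that range is simple enough to check yourself. A minimal sketch, treating the 10 to 15 percent figure as a symmetric band around the watch reading (plausible_range is a hypothetical helper name):

```python
def plausible_range(estimate: float, error_pct: float) -> tuple[float, float]:
    """Return the (low, high) band implied by a symmetric percentage error."""
    delta = estimate * error_pct / 100
    return estimate - delta, estimate + delta


for pct in (10, 15):
    low, high = plausible_range(42, pct)
    print(f"±{pct}%: {low:.1f} to {high:.1f} ml/kg/min")
```

At 10 percent the band is 37.8 to 46.2, and at 15 percent it widens to 35.7 to 48.3.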
What clinical VO2 testing actually measures
The gold-standard clinical VO2 max test is a graded treadmill protocol taken to volitional exhaustion, with a respiratory mask measuring inhaled and exhaled gas concentrations continuously through the test. The mask connects to a metabolic cart that calculates oxygen uptake per minute at every stage. The test ends when the subject cannot maintain the prescribed pace and inclination any longer, the heart rate plateaus despite increasing workload, or a respiratory exchange ratio above 1.10 confirms maximal effort.
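For readers who want to see what the metabolic cart is doing, here is a rough sketch of the standard open-circuit calculation. It assumes expired ventilation is already expressed in litres per minute at STPD, uses typical dry room-air fractions, and applies the Haldane transformation to recover inspired volume. The sample inputs (100 L/min, 17 percent expired oxygen, 4.5 percent expired carbon dioxide, 70 kg) are hypothetical, and real carts add corrections this sketch leaves out.

```python
FI_O2 = 0.2093   # assumed oxygen fraction of dry room air
FI_CO2 = 0.0004  # assumed carbon dioxide fraction of dry room air


def gas_exchange(ve_l_min: float, fe_o2: float, fe_co2: float) -> tuple[float, float]:
    """Return (VO2, VCO2) in L/min from expired volume and expired gas fractions."""
    # Haldane transformation: nitrogen is neither consumed nor produced,
    # so inspired volume follows from the nitrogen balance.
    fi_n2 = 1.0 - FI_O2 - FI_CO2
    fe_n2 = 1.0 - fe_o2 - fe_co2
    vi_l_min = ve_l_min * fe_n2 / fi_n2
    vo2 = vi_l_min * FI_O2 - ve_l_min * fe_o2      # oxygen in minus oxygen out
    vco2 = ve_l_min * fe_co2 - vi_l_min * FI_CO2   # carbon dioxide out minus in
    return vo2, vco2


def summarise_stage(ve_l_min: float, fe_o2: float, fe_co2: float,
                    body_mass_kg: float) -> dict[str, float]:
    """Relative oxygen uptake and respiratory exchange ratio for one test stage."""
    vo2, vco2 = gas_exchange(ve_l_min, fe_o2, fe_co2)
    return {
        "vo2_ml_kg_min": vo2 * 1000 / body_mass_kg,
        "rer": vco2 / vo2,
    }


# Hypothetical final-stage values for a 70 kg adult.
stage = summarise_stage(ve_l_min=100.0, fe_o2=0.17, fe_co2=0.045, body_mass_kg=70.0)
print({name: round(value, 2) for name, value in stage.items()})
print("RER above 1.10:", stage["rer"] > 1.10)
```

Every input here is measured at the mouth. That is the practical meaning of direct measurement: the only modelling step is the nitrogen balance, not a population regression.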
That test is the right test for athletes, for adults already training near maximum, and for clinical research. It is the wrong test for most untrained adults, for two distinct reasons.
The first reason is safety. Running to volitional exhaustion is non-trivial cardiovascular load. In adults over 45 with any cardiovascular risk factors, maximal-effort treadmill testing carries enough risk that it should be performed at a sports-medicine clinic with cardiology backup, not at a fitness studio. The American College of Sports Medicine guidelines explicitly recommend pre-participation cardiac screening before maximal testing in this population.
The second reason is that most adults who have not been coached to recognise true volitional failure produce a falsely low reading. They stop running when their legs feel heavy, not when their cardiorespiratory system is actually at maximum. The resulting test is unreliable.
The responsible clinical alternative for the general population is a validated sub-maximal protocol that estimates cardiorespiratory fitness from heart-rate recovery against age- and sex-adjusted norms. Catalyst uses the YMCA 3-minute step test on the 4-Pillar Healthspan Assessment: 12-inch step, 24 steps per minute on a metronome cadence, three minutes total, followed by a one-minute recovery heart-rate count. The result lands on the Catalyst Healthspan Report as a band, from Excellent to Poor. The number you walk out with is honest information you can act on, with none of the risk of a maximal test attempted by someone who has not been built up for it. I went deeper on the testing-protocol question itself in "VO2 max test Singapore: what most clinics get wrong", and covered the safety rationale alongside the four other underused screening numbers in "the 5 numbers your annual health screening misses".
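The step-test parameters above can be written down as a small configuration, which is a convenient way to sanity-check the cadence. This is a sketch, not Catalyst's software: the class and field names are hypothetical, and the four-beat stepping cycle (up, up, down, down) is the usual description of the YMCA protocol rather than something stated in this article. The banding step is left out on purpose because it depends on the published norm tables, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepTestProtocol:
    """Parameters of a 3-minute step test as described in the text."""
    step_height_in: float = 12.0
    steps_per_min: int = 24
    duration_min: int = 3
    recovery_count_s: int = 60
    foot_movements_per_step: int = 4  # assumed cycle: up, up, down, down

    @property
    def metronome_bpm(self) -> int:
        """Metronome setting if each click marks one foot movement."""
        return self.steps_per_min * self.foot_movements_per_step

    @property
    def total_steps(self) -> int:
        """Complete step cycles performed over the whole test."""
        return self.steps_per_min * self.duration_min


def recovery_heart_rate(beats_counted: int, count_seconds: int) -> float:
    """Convert a pulse count over a timed window into beats per minute."""
    return beats_counted * 60 / count_seconds


protocol = StepTestProtocol()
print(protocol.metronome_bpm, protocol.total_steps)
print(recovery_heart_rate(beats_counted=98, count_seconds=protocol.recovery_count_s))
```

Under those assumptions the metronome runs at 96 beats per minute and the test totals 72 step cycles. With a full 60-second count, the number of beats counted is already the recovery heart rate in beats per minute.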
When the wearable reading is good enough
Apple Watch VO2 max is genuinely useful for one specific job: tracking your own trend over time. The structural error is largely constant for an individual (the algorithm is wrong in the same direction, by roughly the same amount, every time it estimates your VO2 max). So while the absolute number is unreliable, the direction of change is informative.
If your Apple Watch reading was 38 last year and is 42 this year, the absolute numbers may both be off by 10 percent, but as long as the error has stayed roughly constant the trend is real: your aerobic capacity has improved. The same logic holds in reverse: if the trend is flat or declining over months of consistent running, something has changed in your physiology and the watch is catching it earlier than you would notice without the data.
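Here is why a constant error leaves the trend intact, as a minimal sketch. It assumes the watch reads a fixed 10 percent high on both occasions, which is a hypothetical figure. The true values are back-calculated from that assumption, not measured.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from one reading to the next."""
    return (after - before) / before


ASSUMED_BIAS = 1.10  # hypothetical: the watch reads 10 percent high every time

watch_last_year, watch_this_year = 38.0, 42.0

# Back out what the true values would be under the assumed constant bias.
true_last_year = watch_last_year / ASSUMED_BIAS
true_this_year = watch_this_year / ASSUMED_BIAS

watch_trend = relative_change(watch_last_year, watch_this_year)
true_trend = relative_change(true_last_year, true_this_year)

print(f"watch trend: {watch_trend:.1%}")
print(f"true trend:  {true_trend:.1%}")
```

Both lines print the same improvement of about 10.5 percent, because a constant multiplicative error cancels out of the ratio. If the bias itself drifts, for example after starting a heart-rate-lowering medication, that cancellation no longer holds.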
What the wearable reading is NOT good enough for:
- Comparing yourself to age-norm percentile tables to decide whether your fitness is adequate.
- Deciding whether to train harder, train less, or change the training stimulus.
- A pre-medical baseline for any longevity intervention.
- Anything where the absolute number is load-bearing for a decision.
For those jobs you need a clinical reading that has known accuracy bounds. The Apple Watch plus a clinical step-test combination is the right stack: use the watch for daily trend tracking and use the clinical test annually or every 16 weeks to anchor the trend to a known number. The two readings together produce better signal than either alone. The same logic applies to heart rate recovery, which I covered in detail in an earlier piece: useful trend on the wrist, load-bearing decisions need clinical instruments.
The longevity-data lens
The reason any of this discussion matters is the mortality prediction. VO2 max is the strongest single predictor of all-cause mortality in modern cardiology. The Mandsager 2018 cohort study at the Cleveland Clinic followed 122,007 treadmill-test patients, with a median follow-up of roughly eight years, and found that the mortality risk gap between low and elite cardiorespiratory fitness was larger than the risk increase associated with smoking, type 2 diabetes, hypertension, or end-stage renal disease taken individually.
Read that as the data person sees it: in that cohort, low cardiorespiratory fitness carried more mortality risk than any one of those four conditions. Whatever methodology you use to measure it, having an honest number in your possession matters. I covered the broader implications in the VO2 section of "the 5 numbers your annual health screening misses".
Apple's algorithm is useful for trend. A sub-maximal step test on a clinical instrument is useful for an absolute band you can act on. A maximal treadmill test with gas analysis is useful if you genuinely train near maximum and need precision. Match the test to the decision. Do not let the convenience of the wrist sensor convince you that the number it produces is the number that determines anything.
Where to start
If you want a real cardiorespiratory band before booking anything else, the free Healthspan Audit is a 12-question self-assessment that lands a banded score across body composition, cardiorespiratory fitness, stability, and strength in your inbox in three minutes. The audit's cardiorespiratory band is built on the same age- and sex-adjusted norms the in-studio YMCA step test uses, scaled to a self-report proxy. If the audit comes back at a lower band than your Apple Watch was leading you to believe, the in-studio 4-Pillar Healthspan Assessment will tell you whether the real number is closer to the watch's optimism or the audit's pessimism.
Frequently asked questions
Q. Is my Garmin or Whoop or Oura VO2 max estimate any more accurate than Apple's?
Generally no, and for the same structural reason. All consumer wrist-worn devices use a prediction algorithm rather than direct measurement. Garmin and Polar share the Firstbeat ancestor algorithm with Apple Watch. Whoop and Oura use proprietary models with similar input variables. Published validation studies show roughly comparable error bounds across the major brands, with some variation by activity type. The brand of the wrist sensor is not what determines the accuracy. The fact that it is a wrist sensor inferring rather than a mask measuring is what determines it.
Q. Can I get a real treadmill VO2 test with gas analysis in Singapore?
Yes. Sports-medicine clinics and a small number of private performance labs offer graded treadmill testing with respiratory gas analysis. The test typically costs several hundred SGD and takes roughly 30 to 45 minutes. For adults over 45 with cardiovascular risk factors, ask whether the lab has cardiology backup on site. For most untrained adults the test is the wrong tool regardless of price and the sub-maximal alternative is more appropriate.
Q. How often should I have a clinical VO2 read done?
Once a year is sensible for adults who care about the trajectory and are training consistently. Every 16 weeks if you are in an active training cycle (the Catalyst Checkpoint cadence). Every two years is the minimum for adults using the data to inform a longevity strategy. The wearable reading covers the daily-trend signal in between, anchored to a clinical band you trust.
Citations
Mandsager, K., Harb, S., Cremer, P., Phelan, D., Nissen, S. E., & Jaber, W. (2018). Association of cardiorespiratory fitness with long-term mortality among adults undergoing exercise treadmill testing. JAMA Network Open, 1(6), e183605. jamanetwork.com/journals/jamanetworkopen/fullarticle/2707428
Düking, P., Giessing, L., Frenkel, M. O., Koehler, K., Holmberg, H. C., & Sperlich, B. (2020). Wrist-worn wearables for monitoring heart rate and energy expenditure while sitting or performing light-to-vigorous physical activity: validation study. JMIR mHealth and uHealth, 8(5), e16716. mhealth.jmir.org/2020/5/e16716
Snyder, N. C., Willoughby, C. A., & Smith, B. K. (2021). Accuracy of the Apple Watch and Fitbit Charge for predicted VO2max in a healthy adult population. International Journal of Exercise Science, 14(4), 1281-1293. digitalcommons.wku.edu/ijes/vol14/iss4/8

