MTBF and MTTR explained: the two metrics that cost you the most money
A €30,000 pump that fails every 18 months isn't "expensive". What's expensive is the 6 hours of downtime every time it fails at 3 a.m. in the middle of a shift. That's the MTBF/MTTR equation in one sentence — and it's the equation most maintenance teams optimise wrong, because they conflate the two metrics or measure them with bad data.
This article walks through what each metric actually means, how to compute them with industry-standard taxonomy (ISO 14224), what realistic targets look like by asset type, and a 4-step playbook to improve each one independently.
MTBF: the misunderstood metric
Mean Time Between Failures = total operating time / number of failures.
Worked example: a centrifugal pump runs 8,760 hours per year (continuous duty) and suffers 3 unplanned failures. MTBF = 8,760 / 3 = 2,920 hours, or about 4 months between failures.
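The arithmetic above can be sketched in a few lines of Python. The function name `mtbf_hours` and the 730-hours-per-month conversion (8,760 / 12) are illustrative assumptions, not part of any standard library or CMMS.

```python
def mtbf_hours(operating_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures = total operating time / number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return operating_hours / failure_count


HOURS_PER_YEAR = 8_760                  # continuous duty, as in the worked example
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # assumption: average month of 730 hours

pump_mtbf = mtbf_hours(HOURS_PER_YEAR, 3)
print(f"MTBF: {pump_mtbf:,.0f} h (~{pump_mtbf / HOURS_PER_MONTH:.1f} months)")
```

Guarding against a zero failure count matters in practice: an asset with no failures in the window has no computable MTBF, only a lower bound.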
That's the formula. The errors are everywhere else.
Confusing MTBF with lifespan. A pump with an MTBF of 2,920 hours doesn't fail every 4 months in lockstep; it fails on average every 4 months over a long enough sample. The actual time-to-failure is commonly modelled with a Weibull distribution. For wear-out modes (β > 1 in Weibull terminology), failures cluster around the expected lifespan and MTBF underestimates the real risk near end-of-life. For random failure modes (β ≈ 1), the failure rate is roughly constant and MTBF is a meaningful expectation.
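To make the β distinction concrete, here is a minimal sketch that builds two Weibull models with the same mean (2,920 hours) and compares their failure rate early and late in life. The function names, the choice of β = 3 for the wear-out case, and the two sample ages are assumptions chosen for illustration.

```python
import math


def weibull_mean(beta: float, eta: float) -> float:
    """Mean time to failure of a Weibull distribution (shape beta, scale eta)."""
    return eta * math.gamma(1 + 1 / beta)


def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


TARGET_MEAN = 2_920  # hours, same mean for both models

for beta in (1.0, 3.0):  # 1.0 = random failures, 3.0 = assumed wear-out mode
    eta = TARGET_MEAN / math.gamma(1 + 1 / beta)  # scale that yields the target mean
    early = weibull_hazard(0.25 * TARGET_MEAN, beta, eta)
    late = weibull_hazard(1.5 * TARGET_MEAN, beta, eta)
    print(f"beta={beta}: mean={weibull_mean(beta, eta):.0f} h, "
          f"late/early hazard ratio={late / early:.1f}")
```

With β = 1 the hazard is flat, so the ratio is 1. With β = 3 the hazard grows with the square of age, so the same "MTBF" hides a much higher failure rate late in life.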
Counting planned downtime as failure. Cleaning, changeovers, and scheduled maintenance are not failures. If your CMMS lumps them in, your MTBF will collapse and you'll be optimising a metric you can't actually move.
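A small sketch of the effect, assuming a hypothetical export where each downtime event carries a `category` label. The label names and the asset ID `P-101` are invented for the example; your CMMS will use its own codes.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    asset_id: str
    category: str      # hypothetical labels, see PLANNED below
    duration_h: float


# Assumed labels for planned stops; these are not failures
PLANNED = {"cleaning", "changeover", "scheduled_maintenance"}


def unplanned_failures(events):
    """Keep only the events that count as failures for MTBF purposes."""
    return [e for e in events if e.category not in PLANNED]


events = [
    DowntimeEvent("P-101", "failure", 6.0),
    DowntimeEvent("P-101", "cleaning", 1.0),
    DowntimeEvent("P-101", "changeover", 0.5),
    DowntimeEvent("P-101", "failure", 4.0),
    DowntimeEvent("P-101", "scheduled_maintenance", 8.0),
    DowntimeEvent("P-101", "failure", 3.5),
]

operating_hours = 8_760
naive_mtbf = operating_hours / len(events)                       # every stop counted
clean_mtbf = operating_hours / len(unplanned_failures(events))   # failures only
print(f"naive: {naive_mtbf:.0f} h, failures only: {clean_mtbf:.0f} h")
```

In this made-up log, lumping three planned stops in with three failures halves the reported MTBF.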
Too short a measurement window. MTBF computed over 3 months on a single pump produces noise. You need at least 12 months and ideally a population of similar assets for the statistics to be stable.
Realistic ranges by asset type. Industrial centrifugal pumps: 18–60 months MTBF in continuous duty. CNC spindles: 5,000–15,000 operating hours. Conveyor motors in logistics: 30,000–50,000 hours (roughly 3.4–5.7 years of 24/7 operation). Hydraulic systems: 10,000–20,000 hours. Industrial robots: 80,000+ hours when properly maintained. These are population averages; your specific asset will diverge based on duty cycle, environment, and maintenance practice.
MTTR: the metric you actually control today
Mean Time To Repair = total downtime / number of repairs.
The number that matters every day. MTBF tells you how often things fail; MTTR tells you how long you bleed when they do. And here's the trap: most teams treat MTTR as "how long the technician was turning a wrench". The reality is much wider.
Decompose a typical 4.5-hour MTTR event at a Belgian food plant:
- Detection: 30 minutes (operator notices output drop, escalates)
- Travel/wait for technician: 90 minutes (on-call from a different site)
- Diagnosis: 60 minutes (intermittent fault, multiple suspects)
- Actual repair: 60 minutes (replacing a contactor)
- Test and restart: 30 minutes (validation under load)
Total: 4.5 hours. The "repair" is 22% of MTTR. The other 78% is logistics, diagnosis, and validation.
The fix isn't a faster wrench. It's:
- Vibration sensors and live alarms — detection drops from 30 minutes to under 5
- Pre-positioned spare parts kit on-site — eliminates travel
- Operators trained for level-1 diagnostics with a documented decision tree — diagnosis falls from 60 to 20 minutes
- Validation procedure standardised and pre-rehearsed
After: detection 5 min + travel 0 + diagnosis 20 + repair 60 + test 15 = 100 minutes. MTTR cut by 63% without changing the actual repair time.
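The before/after decomposition can be checked with a short script. The dictionary keys are shorthand for the five segments listed above, and the minute values are the ones from the example.

```python
# Minutes per segment, taken from the example above
before = {"detection": 30, "travel": 90, "diagnosis": 60, "repair": 60, "test": 30}
after = {"detection": 5, "travel": 0, "diagnosis": 20, "repair": 60, "test": 15}


def segment_shares(segments: dict) -> dict:
    """Return each segment's share of the total event duration."""
    total = sum(segments.values())
    return {name: minutes / total for name, minutes in segments.items()}


total_before = sum(before.values())
total_after = sum(after.values())
reduction = 1 - total_after / total_before

print(f"before: {total_before} min, repair share {segment_shares(before)['repair']:.0%}")
print(f"after:  {total_after} min, reduction {reduction:.0%}")
```

Logging events in this segmented shape, rather than as a single duration, is what makes the non-repair share visible in the first place.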
The MTTR levers, in priority order: detection time, diagnosis time, parts availability, repair efficiency, validation. Most plants have spent years optimising the last two and ignored the first two — where 70% of the gain lives.
Availability: the metric that combines both
The two metrics fuse in one formula:
Availability = MTBF / (MTBF + MTTR)
With MTBF = 2,920h and MTTR = 4h:
- Availability = 2,920 / (2,920 + 4) = 2,920 / 2,924 = 99.86%
Now compare the same MTBF with MTTR tripled to 12 hours:
- Availability = 2,920 / 2,932 = 99.59%
The difference looks small (0.27 percentage points) but over 8,760 hours per year it's about 24 hours of additional production lost per asset. On a Belgian factory floor with 50 critical assets, that's 1,200 asset-hours per year, equivalent to well over a month of continuous running on a single asset.
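A quick way to reproduce these figures, as a sketch. The function name `availability` and the 50-asset multiplier mirror the text; nothing here comes from a specific library.

```python
def availability(mtbf_h: float, mttr_h: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)


HOURS_PER_YEAR = 8_760

a_short = availability(2_920, 4)    # MTTR = 4 h
a_long = availability(2_920, 12)    # MTTR = 12 h

extra_downtime_h = (a_short - a_long) * HOURS_PER_YEAR   # per asset, per year
fleet_downtime_h = extra_downtime_h * 50                 # 50 critical assets

print(f"{a_short:.2%} vs {a_long:.2%}")
print(f"extra downtime: {extra_downtime_h:.1f} h per asset, "
      f"{fleet_downtime_h:.0f} h across 50 assets")
```

The per-asset result lands just under 24 hours, which is also what you get by reasoning directly: 3 failures per year, each 8 hours longer.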
Common availability benchmarks:
- 99.0% — typical for non-critical industrial equipment
- 99.5% — well-managed plants on critical assets
- 99.9% — top quartile, requires structured reliability programme
- 99.99% — rare and expensive (semiconductor fabs, pharma critical lines)
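To read these benchmarks as a downtime budget, the following sketch converts each availability target into allowed downtime hours per year, assuming continuous duty (8,760 hours). The helper name is illustrative.

```python
HOURS_PER_YEAR = 8_760  # assumption: 24/7 operation


def downtime_budget_h(availability_pct: float) -> float:
    """Hours of downtime per year allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.5, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_h(target):.1f} h of downtime per year")
```

Each extra "nine" cuts the annual budget by a factor of ten, which is why the top tiers get expensive quickly.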
Availability is what executives understand. MTBF and MTTR are the levers.
How to actually measure these in your plant
You need a CMMS — there's no clean alternative. Excel logs work for one asset for one year as a proof-of-concept, but at scale you need structured failure events with timestamps, asset IDs, failure modes, repair durations, and parts consumed.
The minimum log fields, drawn from ISO 14224:
- Equipment ID and class
- Failure date and time of detection
- Failure mode (mechanical, electrical, instrumentation, human error)
- Failure mechanism (wear, fatigue, corrosion, fracture, lubrication, contamination)
- Detection method (operator, alarm, scheduled inspection, condition monitoring)
- Time to start repair (work order to technician on-site)
- Time to complete repair
- Parts consumed (codes from your inventory)
- Root cause (filled in within 7 days, not at the moment of repair)
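As a sketch of what such a record might look like in code, here is a Python dataclass carrying the fields listed above. The field names, the example values, and the choice to count the 7-day root-cause window from repair completion are assumptions for illustration; this is not an ISO 14224 schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class FailureRecord:
    equipment_id: str
    equipment_class: str
    detected_at: datetime
    failure_mode: str
    failure_mechanism: str
    detection_method: str
    repair_started_at: datetime
    repair_completed_at: datetime
    parts_consumed: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None   # filled in later, not at the moment of repair

    @property
    def downtime_hours(self) -> float:
        """Elapsed time from detection to completed repair."""
        return (self.repair_completed_at - self.detected_at).total_seconds() / 3600

    def root_cause_overdue(self, now: datetime, max_days: int = 7) -> bool:
        """True if no root cause was logged within max_days of repair completion."""
        return self.root_cause is None and (now - self.repair_completed_at).days > max_days


# Hypothetical event
record = FailureRecord(
    equipment_id="P-101",
    equipment_class="centrifugal pump",
    detected_at=datetime(2024, 3, 4, 3, 0),
    failure_mode="electrical",
    failure_mechanism="contamination",
    detection_method="operator",
    repair_started_at=datetime(2024, 3, 4, 5, 0),
    repair_completed_at=datetime(2024, 3, 4, 7, 30),
    parts_consumed=["CONTACTOR-24V"],
)
print(f"{record.downtime_hours:.1f} h of downtime")
```

Storing timestamps rather than pre-computed durations lets you derive any segment later and makes inconsistent entries easier to spot.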
Twelve months of clean data is the minimum to derive meaningful MTBF. Less than that and you're chasing noise.
The data quality audit you must run before trusting numbers. Pick a random month, pull 20 failure events, and verify with the technician on the floor: did this really happen as logged? You'll typically find 20–40% of events with wrong duration, wrong failure mode, or compounded events logged as one. Until you fix that, your MTBF is fiction.
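The sampling step of that audit can be sketched as follows. The event structure (`event_id`, `detected_on`), the synthetic log, and the illustrative verdicts are all hypothetical; in a real audit the verdicts come from the conversation with the technician.

```python
import random
from datetime import date


def audit_sample(events, year, month, k=20, seed=None):
    """Draw up to k events detected in the given month, without replacement."""
    in_month = [e for e in events
                if e["detected_on"].year == year and e["detected_on"].month == month]
    return random.Random(seed).sample(in_month, min(k, len(in_month)))


def error_rate(verdicts):
    """Share of audited events the technician could not confirm as logged."""
    return sum(1 for confirmed in verdicts.values() if not confirmed) / len(verdicts)


# Hypothetical log: 45 events in March 2024 and 10 in April 2024
log = [{"event_id": i, "detected_on": date(2024, 3, 1 + i % 28)} for i in range(45)]
log += [{"event_id": 100 + i, "detected_on": date(2024, 4, 1 + i)} for i in range(10)]

sample = audit_sample(log, 2024, 3, seed=42)

# Verdicts are collected on the floor; here 6 of 20 are marked wrong for illustration
verdicts = {e["event_id"]: idx >= 6 for idx, e in enumerate(sample)}
print(f"audited {len(sample)} events, error rate {error_rate(verdicts):.0%}")
```

Fixing the seed makes the sample reproducible, so a second auditor can pull the same 20 events.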
Improving MTBF: the 4-step playbook
1. Failure mode analysis on the top 5 critical assets. Take the assets that, if down, stop production. For each, list the 3 most common failure modes from the past 24 months. This is FMEA-light — not the full automotive-grade exercise, just enough to focus.
2. Root cause for each top failure mode. Bearing wear → is it lubrication, alignment, overload, or end-of-life? Electrical contactor failure → is it duty cycle exceeded, harmonics, or contamination? Don't accept "it just broke" as a root cause.
3. Targeted intervention per root cause.
- Lubrication issue → revised lubrication schedule + ultrasonic listening to confirm grease application
- Alignment → laser alignment after every motor swap, before ramp-up
- Overload → motor sizing review + soft-start
- Contamination → environmental sealing + filtered air
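4. Measure before/after over at least 6 months. Don't celebrate after 3 weeks of zero failures when the previous failure cycle was 4 months. As a rule of thumb, the observation period needs to reach about 2× the previous MTBF (roughly 8 months in this example) before you can claim a real improvement.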
Realistic gain: a structured programme on the top 5 assets delivers +30–50% MTBF in 12 months. Anyone promising more is selling something.
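Steps 1 and 4 of the playbook lend themselves to a short script. This is a minimal sketch assuming failure history is available as (asset ID, failure mode) pairs; the asset IDs, mode labels, and function names are invented for the example.

```python
from collections import Counter, defaultdict


def top_failure_modes(history, n=3):
    """Step 1: the n most frequent failure modes per asset."""
    by_asset = defaultdict(Counter)
    for asset_id, mode in history:
        by_asset[asset_id][mode] += 1
    return {asset: counts.most_common(n) for asset, counts in by_asset.items()}


def observed_long_enough(previous_mtbf_h, observed_h, factor=2.0):
    """Step 4: has the observation window reached factor x the previous MTBF?"""
    return observed_h >= factor * previous_mtbf_h


# Hypothetical 24-month history
history = [
    ("P-101", "bearing wear"), ("P-101", "bearing wear"), ("P-101", "bearing wear"),
    ("P-101", "seal leak"), ("P-101", "seal leak"),
    ("P-101", "contactor failure"),
    ("C-205", "belt misalignment"), ("C-205", "belt misalignment"),
    ("C-205", "motor overload"),
]

for asset, modes in top_failure_modes(history).items():
    print(asset, modes)

three_weeks_h = 3 * 7 * 24
print(observed_long_enough(previous_mtbf_h=2_920, observed_h=three_weeks_h))
```

The second function is deliberately blunt: it only answers whether the window is long enough, not whether the improvement is statistically significant.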
Improving MTTR: the 4-step playbook
1. Decompose your current MTTR. For each failure event, log the time spent in each segment (detection, travel/wait, diagnosis, parts wait, repair, test and restart). Sample 30 events to find the longest segment.
2. Attack the longest segment first. Almost universally, it's either parts (30 minutes to several hours waiting for a courier from a regional warehouse) or diagnosis (technician arriving without a clear protocol).
3. Targeted intervention.
- Parts → critical spares kit on-site for the top 20 SKUs (typical investment: €15–40k for a medium plant; payback in 3–6 months on avoided downtime)
- Diagnosis → documented decision trees per asset class, level-1 operator training on standard fault patterns, condition monitoring data accessible to technicians from their phone
- Detection → IoT sensors + alarm escalation directly to on-call technician (skip the operator–supervisor–technician chain)
4. Track per-event MTTR, not just monthly average. Averages hide outliers. A single 12-hour repair drags up the average; the 30 four-hour repairs are the real workload. Dashboard the distribution, not just the mean.
Quick wins: 50% MTTR reduction in 3–6 months is normal with focused effort. The 70/30 rule applies — 70% of the gain comes from logistics and diagnosis, not faster physical repair.
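Two of the steps above, finding the longest segment and looking at the distribution rather than the mean, can be sketched with the standard library. The segment names and minute values are hypothetical; the repair durations reproduce the "30 four-hour repairs plus one 12-hour repair" case from step 4.

```python
import statistics


def longest_segment(events):
    """Sum each segment across events and return the one with the largest total."""
    totals = {}
    for event in events:
        for segment, minutes in event.items():
            totals[segment] = totals.get(segment, 0) + minutes
    return max(totals, key=totals.get)


# Hypothetical sampled events, minutes per segment
sampled = [
    {"detection": 20, "travel": 30, "diagnosis": 45, "parts": 120, "repair": 60},
    {"detection": 35, "travel": 25, "diagnosis": 70, "parts": 90, "repair": 50},
    {"detection": 10, "travel": 40, "diagnosis": 30, "parts": 180, "repair": 65},
]
print("longest segment:", longest_segment(sampled))

# 30 four-hour repairs plus a single 12-hour outlier
repair_hours = [4.0] * 30 + [12.0]
mean_h = statistics.mean(repair_hours)
median_h = statistics.median(repair_hours)
print(f"mean {mean_h:.2f} h, median {median_h:.1f} h, max {max(repair_hours):.1f} h")
```

Reporting the median and the maximum next to the mean shows at a glance whether the average reflects the typical repair or a single outlier.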
Closing
MTBF and MTTR are not numbers to report — they're decisions. Every euro you spend should reduce one or the other in a way that improves availability where it matters financially. The starting move is unglamorous: 12 months of clean failure data through a structured CMMS, with the data quality verified at the floor level. Without that, every "improvement" you publish is noise.
If you want a free reliability audit on your top 5 critical assets (failure mode analysis, MTBF estimate, MTTR decomposition and an actionable plan), reach out. We've run these on Belgian, French and Spanish sites for ten years and the patterns are consistent. You can also see our offers for predictive maintenance and corrective maintenance, which directly impact MTBF and MTTR respectively.
Want to apply this at your plant?
Request a free diagnostic and we'll show you how to implement it.