How phone-only sleep tracking works, and where it loses to a Watch

An engineering log on building an iOS sleep tracker with nothing but the accelerometer: a two-layer detector, a self-calibrating staging engine that runs once at end of night, and the limits I can't code my way out of.

What this is, and what it isn't

I built an iOS sleep tracker that uses only the phone's accelerometer. No wearable, no account, no subscription. I come from a backend and tooling background, and I'd never written a line of Swift before this; I leaned on an LLM for the unfamiliar APIs as I went. A year of my own nightly use is why I trust it doesn't crash on my phone, on the one or two devices and OS versions I run. That is a sample of one user; it says nothing about crash rates across the 150 phones in the dataset below, and I don't want it read as if the larger dataset backs the reliability claim. It doesn't.

The accuracy numbers below come from a separate, larger source: 565 paired phone-plus-wearable nights across 150 distinct iPhones, measured against consumer wearables pulled through HealthKit, not lab polysomnography. I should be clear about what that dataset is: it's an opt-in convenience sample of self-selected users who happened to also log a wearable on the same night, not a controlled cohort with random assignment. 565 nights across 150 phones is a mean of roughly 3.8 nights per device (565 / 150), so most phones contributed only a handful of nights. I'll keep flagging exactly what that reference is and isn't.

Up front, so nobody wastes their time: if you want minute-accurate wake detection or accurate sleep staging, buy a wrist wearable. A wrist-worn device has access to physiological signals a phone resting on the mattress simply does not. This post is about how far you can get without that hardware, and an honest accounting of where the gap is unbridgeable. My reference throughout is a consumer wearable, not clinical PSG, and as I'll show below it isn't even uniformly an Apple Watch. I have not done clinical validation, and I'm not going to pretend agreement-with-a-wearable is the same thing as clinical accuracy.

The one behavior I actually cared about, the reason this exists at all, is mundane: I want ambient sound (rain, waves) to keep playing while I'm awake and stop on its own once the phone decides I'm asleep. Not cut out on a fixed timer while I'm still lying there awake. Everything technical below is in service of getting that one side-effect to fire at the right moment, and then, separately, producing a defensible hypnogram after the fact.

Two layers, and why they're separate

There are two detectors. The split is online vs offline processing, a real-time path and a batch path. That's not novel; the reason it matters here is a specific failure mode I'll get to in the next section.

Layer 1 is a real-time threshold state machine in Swift. It samples the accelerometer at a constant 5Hz, computes the variance of the magnitude sqrt(x^2 + y^2 + z^2) over a sliding window, and drives the live side-effect off that. When it classifies you as asleep and about ten quiet minutes hold, it hard-pauses the audio. To be precise about what 'pause' means: it calls pause() on the player. It is not a volume fade. The sound stops.
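
A minimal sketch of that feature, written in JavaScript for consistency with the offline engine even though the production Layer 1 is Swift. The function names, the window size, and the step are illustrative assumptions; the post does not state the real window length or the threshold applied to the result.

```javascript
// Magnitude of one accelerometer sample {x, y, z}.
function magnitude({ x, y, z }) {
  return Math.sqrt(x * x + y * y + z * z);
}

// Population variance of the magnitudes inside one window of samples.
function magnitudeVariance(windowSamples) {
  if (windowSamples.length === 0) return 0;
  const mags = windowSamples.map(magnitude);
  const mean = mags.reduce((sum, m) => sum + m, 0) / mags.length;
  return mags.reduce((sum, m) => sum + (m - mean) ** 2, 0) / mags.length;
}

// One variance value per window position. windowSize and step are in samples;
// both are hypothetical parameters chosen by the caller.
function slidingVariance(samples, windowSize, step) {
  const out = [];
  for (let start = 0; start + windowSize <= samples.length; start += step) {
    out.push(magnitudeVariance(samples.slice(start, start + windowSize)));
  }
  return out;
}
```

At 5Hz, a ten-second window would be 50 samples. A phone lying still produces a near-constant magnitude (gravity), so its variance sits close to zero, which is what makes variance a usable stillness signal at all.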

There's a backstop so you don't get an infinite-playback bug if the detector never trips, but I should be precise about what it is rather than oversell it. It's a 14-hour maximum-session cap, checked hourly, that force-stops the session if nothing else has stopped it first. It is not a tight, per-minute safety net. There is a per-minute guardian countdown, but that only runs when the user has picked a fixed timed-close duration, not in smart or unlimited mode, so for the main use case the 14-hour cap is the real outer limit.

Layer 2 is the offline staging pipeline: a platform-agnostic JavaScript engine that compiles the entire night in one burst after you stop recording. It does the real staging work. Layer 1 only ever has to answer one cheap question in real time, 'is this person probably asleep right now,' and it can afford to be a little wrong because all it controls is whether the rain keeps playing. Layer 2 answers the expensive question, 'what did the whole night actually look like,' and it gets to see all the data at once.

Keeping them separate means the live detector's job and the staging pipeline's job never compete. The live one optimizes for a responsive side-effect; the offline one optimizes for correctness on the full record. Again, online-vs-offline is bread-and-butter; the interesting part is what happens when you let the live layer feed the offline one.

Why constant 5Hz, and the pollution loop I deleted

An earlier version was cleverer about battery. It had eco and burst adaptive sampling modes: sample less often when things looked quiet, ramp up when they didn't. I deleted all of it. Here's the failure mode that forced the decision.

Adaptive sampling means the sample rate depends on the real-time classification. If Layer 1 mis-detects 'asleep' early, while you're actually still awake and reading, it downsamples. The downsampled, degraded stretch then gets fed forward into the offline analyzer, which now has to stage a night where the data density itself was decided by a guess that may have been wrong. A bad real-time call corrupts the very evidence the offline engine needs to catch that the real-time call was bad. It's a chicken-and-egg pollution loop: the error feeds the condition that produced the error.

Constant 5Hz severs the loop. The offline engine always sees a uniform, full-rate record regardless of what Layer 1 believed in the moment. This costs battery I could have saved, and I made that trade on purpose: correctness over battery. The collection is cheap enough that paying it constantly is fine; what I refuse to do is let a live heuristic decide what data the offline analyzer is allowed to see.

I should connect this to the accuracy section before it sounds more consequential than it is. Severing the pollution loop protects data integrity: it guarantees the offline engine sees an uncorrupted record. It did not buy meaningful sleep/wake discrimination: as you'll see, the measured binary classifier lands at a median Cohen's kappa of about 0.00, no better than a constant 'asleep' label. So treat constant 5Hz as defensible engineering hygiene that removes one class of self-inflicted error, not as a demonstrated accuracy win. The hardware ceiling sits well below where this decision matters.

The offline staging pipeline: baseline, Cole-Kripke, Viterbi

The offline pipeline runs once, at end of night, but it is not a clean three-step chain; it's a series of passes. It starts with a per-night self-calibrating baseline. People, mattresses, and phone placements differ enough that a fixed threshold is hopeless, so the engine first establishes what 'quiet' means for this specific night by taking the night's own quietest 20% of windows (the p20 of the variance distribution) as its reference floor before it stages anything. From there it runs the 1992 Cole-Kripke actigraphy algorithm, a 7-minute weighted sliding window, to make the asleep-vs-awake call. Crucially, that asleep/awake decision is Cole-Kripke's, made upstream, not the HMM's.
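
A minimal sketch of those two steps, under loud assumptions. The weights and scale below are the values commonly cited for the Cole-Kripke one-minute scoring and should be checked against the 1992 paper before anyone relies on them. The mapping from variance to an 'activity' value (subtract the p20 floor, multiply by a gain) is hypothetical; the post does not say how the real engine converts mattress variance into something the wrist-derived coefficients can consume.

```javascript
// Nearest-rank percentile of a numeric array, p in [0, 100].
function percentile(values, p) {
  if (values.length === 0) throw new Error("percentile of empty array");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Hypothetical normalization: activity is whatever rises above this night's
// own quiet floor (p20 by default), scaled by an assumed gain.
function activityAboveFloor(minuteVariance, floorPercentile = 20, gain = 1) {
  const floor = percentile(minuteVariance, floorPercentile);
  return minuteVariance.map((v) => Math.max(0, v - floor) * gain);
}

// Commonly cited weights for minutes t-4 .. t+2 (assumption: verify against
// the paper; a tuned engine may use different values).
const CK_WEIGHTS = [106, 54, 58, 76, 230, 74, 67];
const CK_SCALE = 0.001;

// Scores each minute from a 7-minute weighted window: four minutes before,
// the current minute, and two minutes after. Minutes outside the record
// count as zero activity. A scaled score below 1 is labelled asleep.
function coleKripke(activity, weights = CK_WEIGHTS, scale = CK_SCALE) {
  return activity.map((_, t) => {
    let score = 0;
    for (let k = 0; k < weights.length; k++) {
      const idx = t - 4 + k;
      const a = idx >= 0 && idx < activity.length ? activity[idx] : 0;
      score += weights[k] * a;
    }
    return scale * score < 1 ? "asleep" : "awake";
  });
}
```

The asymmetric window is the point of the algorithm: movement in the preceding minutes raises the current minute's score, so one burst of motion can keep the label at 'awake' for a while after the motion itself has stopped.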

Only after that, and after several smoothing and cleanup passes (event fusion, binary smoothing, wake cleanup, short-island pruning), does the hidden Markov model run. And it runs only over the minutes already labelled asleep. Its three stages are not wake/light/deep; they are deep, core, and light, three levels of sleep depth. The HMM does not decide whether you're asleep; it assigns how deep, over a record where the awake minutes have already been carved out upstream. It uses log-space Viterbi for the most-likely path plus forward-backward for the posteriors. Log-space because you're multiplying long chains of small probabilities across a whole night and you will underflow otherwise.
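
A minimal log-space Viterbi sketch over three depth states. The state order, the probability tables passed in, and the function name are illustrative assumptions, not the engine's actual parameters. The forward-backward pass for posteriors is not shown; it would need a log-sum-exp in place of the max used here.

```javascript
const DEPTH_STATES = ["deep", "core", "light"];

// logInit[s]        log P(first state = s)
// logTrans[a][b]    log P(next state = b | current state = a)
// logEmit[t][s]     log P(observation at step t | state = s)
// Returns the most likely state index for every step.
function viterbiLog(logInit, logTrans, logEmit) {
  const T = logEmit.length;
  const S = logInit.length;
  if (T === 0) return [];
  const score = [logInit.map((li, s) => li + logEmit[0][s])];
  const back = [new Array(S).fill(-1)];
  for (let t = 1; t < T; t++) {
    score.push(new Array(S).fill(-Infinity));
    back.push(new Array(S).fill(0));
    for (let to = 0; to < S; to++) {
      for (let from = 0; from < S; from++) {
        // Sums of logs replace products of probabilities.
        const candidate = score[t - 1][from] + logTrans[from][to];
        if (candidate > score[t][to]) {
          score[t][to] = candidate;
          back[t][to] = from;
        }
      }
      score[t][to] += logEmit[t][to];
    }
  }
  // Pick the best final state, then walk the back-pointers to the start.
  let best = 0;
  for (let s = 1; s < S; s++) {
    if (score[T - 1][s] > score[T - 1][best]) best = s;
  }
  const path = new Array(T);
  for (let t = T - 1; t >= 0; t--) {
    path[t] = best;
    best = back[t][best];
  }
  return path;
}
```

The underflow argument is easy to check by hand: a chain of a few thousand steps at probability 0.01 each is far below the smallest positive double if multiplied directly, while the equivalent sum of logs is an ordinary negative number.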

One mismatch I should name rather than paper over: Cole-Kripke was developed and validated on wrist actigraphy, and I'm feeding it mattress motion, not wrist motion. Whole-body motion through a mattress means something different from wrist movement, so the activity counts don't carry the same semantics the original coefficients assume. The per-night self-calibration is partly compensating for using a wrist algorithm off-wrist. That's a known compromise, not a clean application of the paper.

Worth being clear: the HMM is live in production, not shadow-mode. I flipped the default on 2026-05-07.

The engine is platform-agnostic JavaScript on purpose. The iOS coupling is a single JSContext bridge (it happens to be 714 lines today, but the code carries dead paths, so I won't lean on line counts as evidence of anything). On top of that bridge there's a hot-update channel that does an optional SHA256 content-hash check over HTTPS. I want to be precise about what that check does and doesn't do: the hash travels over the same HTTPS channel as the JS payload it's hashing, so anyone who controls the server or the channel controls both. That means it provides no tamper resistance against that threat; it only detects accidental corruption or truncation. It is not a cryptographic signature and not signed-code verification, and it's skipped entirely when the server omits the hash.
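
A sketch of what a check like that amounts to. It uses Node's built-in crypto module purely for illustration; the shipped check presumably runs on the native side of the bridge, and the function names and return values here are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Hex-encoded SHA-256 of a UTF-8 string payload.
function sha256Hex(payload) {
  return createHash("sha256").update(payload, "utf8").digest("hex");
}

// Returns "ok", "mismatch", or "skipped".
// "skipped" models the case where the server sent no hash at all.
function checkPayload(payload, expectedHex) {
  if (expectedHex == null || expectedHex === "") return "skipped";
  return sha256Hex(payload) === expectedHex.toLowerCase() ? "ok" : "mismatch";
}
```

Because the expected hash arrives over the same channel as the payload, "ok" only means the bytes match what the server sent. A truncated download shows up as "mismatch"; a server that swaps both payload and hash still produces "ok".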

The same channel lets me ship algorithm changes without going through App Store review, because the staging logic is interpreted JS config, not a native binary. I'll state the risk straight rather than call it a gray area: shipping interpreted logic that changes app behavior outside review most likely violates the App Store review guidelines, not merely a fuzzy edge of them. Apple could reject or pull the app over it, which would remove the hot-update benefit entirely. I'd stop if asked, and I'm not going to pretend the content-hash check makes the review bypass legitimate, because it doesn't.

Because the engine is plain JS with no iOS dependencies, an Android port is plausible in principle, under QuickJS instead of JavaScriptCore. I want to be careful not to sell that as a payoff I've banked: I haven't tried it, and JS-engine differences (floating-point edge cases, Math behavior, performance) could change the numeric staging output, so 'one implementation that doesn't drift' is a hope, not a result. Treat the portability as a design goal only.

JSContext performance, and the one trap that crashes

Running your algorithms in JSContext is not novel; React Native has done exactly this for years. I mention it only to preempt the objection that JS is too slow for signal processing. A full-night compile clocked around 100ms in a single measurement on my own recent iPhone for an eight-hour night; I haven't characterized variance across devices, and older supported phones will be slower, possibly several times slower, so don't read 100ms as a general number. It works over 30-second epochs rather than raw 5Hz samples: 5Hz over eight hours is about 144k raw samples by arithmetic (5 * 3600 * 8), but the HMM and Cole-Kripke passes run over on the order of a thousand 30-second windows (8 * 3600 / 30 = 960 for the full night), not those 144k samples. It runs serially on one queue, and that's fine, because it runs exactly once at the end of the night and is never in a hot path. There is nothing real-time about it.
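
The reduction from raw samples to epochs can be sketched as follows. The constant names, the choice of mean and variance as the per-epoch summary, and the handling of a trailing partial epoch are assumptions of this sketch, not a description of the engine's actual features.

```javascript
const RAW_HZ = 5;
const EPOCH_SECONDS = 30;
const SAMPLES_PER_EPOCH = RAW_HZ * EPOCH_SECONDS; // 150 raw samples per epoch

// Collapse raw magnitude samples into one summary per 30-second epoch.
// A trailing partial epoch is dropped (an assumption of this sketch).
function toEpochs(magnitudes) {
  const epochs = [];
  for (let start = 0; start + SAMPLES_PER_EPOCH <= magnitudes.length; start += SAMPLES_PER_EPOCH) {
    let sum = 0;
    let sumSquares = 0;
    for (let i = start; i < start + SAMPLES_PER_EPOCH; i++) {
      sum += magnitudes[i];
      sumSquares += magnitudes[i] * magnitudes[i];
    }
    const mean = sum / SAMPLES_PER_EPOCH;
    // Population variance as E[x^2] - mean^2, clamped against rounding error.
    const variance = Math.max(0, sumSquares / SAMPLES_PER_EPOCH - mean * mean);
    epochs.push({ mean, variance });
  }
  return epochs;
}
```

Every later pass then iterates over the epoch array, so its cost scales with the number of epochs, which is 150 times smaller than the number of raw samples.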

There is one real trap, and it's commented in the code so I don't re-step on it. If you call queue.sync from a Swift cooperative thread, you trip a libdispatch precondition and the app crashes on iOS 17/18 in Release builds specifically. So the async path is forced: queue.async plus a continuation to bridge back. It's the kind of thing that passes in Debug, passes in the simulator, and then crashes on a real device in a Release build, which is the worst possible place to find it.

Accuracy, measured against consumer wearables

These numbers come from 565 paired nights, where the same person ran the phone tracker and also had a consumer wearable logging the same night, across 150 distinct iPhones over about six weeks (2026-04-26 to 2026-06-07). Again, this is an opt-in convenience sample, not a controlled study, and most phones contributed only a few nights each. The reference timelines come in through HealthKit: by segment, roughly 78% are tagged apple_healthkit and about 22% are other HealthKit timelines that were imported but not identified by HealthKit as an Apple Watch, plus a handful from Huawei and Garmin. That 22% is the part I most want to flag as a validity problem, not a footnote: roughly a fifth of my ground truth comes from devices I can't identify, of unknown sleep-staging quality, and different wearables have materially different accuracy. Aggregating them as a single 'reference' pushes error into every comparison number below, and I did not slice the figures down to an identified-Apple-Watch-only subset, so I can't tell you how much. Everything below is population aggregate, computed server-side; there is no per-user data here and no per-user timelines left the script.

Start with the number I most want you not to misread. On binary sleep/wake, the phone and the wearable agree on a median 94.6% of minutes (p25 88.6%, p75 97.7%). That looks great until you compute Cohen's kappa, which corrects for agreement you'd get by chance, and the median kappa is about 0.00 (p75 only 0.108). What that means in plain terms: almost all of the 94.6% is both timelines simply agreeing 'asleep' during the long stretch when you obviously are asleep. Minute by minute, at the job that actually matters, telling asleep from awake, the phone-only classifier is not meaningfully better than a function that blindly labels every minute 'asleep.' So please do not read 94.6% as '94.6% correct.' It isn't. This is the most important admission in the post and I'd rather say it myself than have it found.
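
To make the gap between raw agreement and kappa concrete, here is a small sketch of both for two per-minute binary timelines. The function name and the input shape (arrays of booleans, true meaning asleep) are assumptions, and the example night is constructed for illustration, not drawn from the cohort.

```javascript
// phone, reference: equal-length arrays of booleans, true = asleep.
// Returns raw agreement and Cohen's kappa for the two labelings.
function agreementAndKappa(phone, reference) {
  if (phone.length !== reference.length || phone.length === 0) {
    throw new Error("timelines must be non-empty and the same length");
  }
  const n = phone.length;
  let agree = 0;
  let phoneAsleep = 0;
  let refAsleep = 0;
  for (let i = 0; i < n; i++) {
    if (phone[i] === reference[i]) agree++;
    if (phone[i]) phoneAsleep++;
    if (reference[i]) refAsleep++;
  }
  const observed = agree / n;
  const pPhone = phoneAsleep / n;
  const pRef = refAsleep / n;
  // Agreement expected by chance, given each rater's own base rates.
  const expected = pPhone * pRef + (1 - pPhone) * (1 - pRef);
  // Kappa is undefined when expected agreement is 1; reported as 0 here
  // by assumption.
  const kappa = expected === 1 ? 0 : (observed - expected) / (1 - expected);
  return { agreement: observed, kappa };
}

// Constructed example: an 8-hour night where the reference saw 26 awake
// minutes and the phone labelled every minute asleep.
const referenceNight = new Array(480).fill(true).fill(false, 100, 126);
const phoneNight = new Array(480).fill(true);
const exampleResult = agreementAndKappa(phoneNight, referenceNight);
```

In that constructed night the two timelines agree on 454 of 480 minutes, about 94.6%, and kappa is exactly 0, because a labeler that never says 'awake' earns all of its agreement from the base rate.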

The wake side is where you can see the limit physically. On 47% of nights the phone reports zero wake-bouts during the night, versus 8% of nights on the wearable. Brief arousals, the few-minute wake-ups a wrist sensor catches, are essentially invisible to a phone resting on the mattress. That's a sensor limitation, not a knob I failed to tune. There is no threshold I can set that makes a mattress feel a thirty-second arousal the way a wrist does.

Where the phone does best is total sleep duration, and even here I should give you the spread, not just the middle. Median absolute error is 42 minutes per night, but that's the median: the p25 is 18 minutes and the p75 is 90 minutes, so a quarter of nights are off by an hour and a half or more, and I'm not quoting a p90 or a hard upper bound. There's no consistent direction to the error: it isn't systematically over- or under-counting, and the bias sits near zero with a wide spread. For a typical seven-hour night (420 minutes), a 42-minute error is roughly 10% at the median, with an unquoted tail above it. So I'd call this usable for trends and for the broad shape of the night, not validated accuracy. The wake-detail and the staging are where it doesn't hold up.

On depth, across the whole cohort, the phone's deep-sleep percentage runs a median of about 5.2 percentage points below the wearable's (the deep% bias is -5.2pp). At cohort scale the phone reports less deep sleep than the wearable, and that is consistent with the small residual deep undercount I'd seen on my own device. I'll stop short of saying it validates anything: both observations could share the same systematic cause, and the reference here is a heterogeneous set of consumer wearables whose own deep-sleep staging is itself unreliable. So I can't confidently attribute the -5.2pp gap to the phone rather than to error in the reference. It's a real, directionally consistent gap of modest size; that's all I can claim.

I want to be careful about the 4-class staging, because it does not come from this cohort at all. My only staging-agreement evidence is a single-subject, 9-night calibration on my own device and my own phone, and I cannot generalize it; the 565-night cohort did not validate 4-class staging. So I'm deliberately not quoting a staging-agreement percentage as if it were a result, because the only number I have is one body on one phone. The big cohort validated four things and only four: binary sleep/wake, total duration, the cohort-level deep-sleep bias, and the WASO/wake-bout behavior. On that same 9-night set, before calibration, the model over-reported deep sleep by about 60 percentage points (a +59.5pp figure I can't source to the cohort file, only to that calibration artifact). I'd read that less as 'calibration helps' and more as evidence the staging is fit to one person: if a single calibration step swings deep sleep by that much, the post-calibration output is mostly the calibration doing the work on one subject, and I have no reason to believe the fit transfers to other bodies.

REM is not in any of this. The phone classifier never emits REM at all; it emits deep, core, light, and awake only. Accelerometer-only data cannot distinguish REM, so the code refuses to claim it rather than guess and present the guess as a stage. REM is therefore neither emitted nor validated.

One thing that's worth saying because it reads as a contradiction and isn't: these wearable timelines are used in exactly one direction. They measure the phone's accuracy. They are never fed back to train or tune the phone classifier. That keeps this consistent with the privacy wall described later: no wearable data is used to improve the model; it's only used, after the fact and in aggregate, to grade it.

The bottom line, stated plainly: a wrist wearable beats this for minute-accurate wake detection and for staging, and it isn't close on either. The phone is usable for total sleep duration and the broad shape of the night. At the cohort level it reports modestly less deep sleep than the wearable, and that comparison is itself against an imperfect reference. If you need accurate staging or accurate wake detection, that's a wrist device's job, and I'd rather tell you that than dress up a 94.6% that mostly means 'you were asleep, and so was the obvious guess.'

The 8-hour Live Activity wall

Bottom line first: the lock-screen Live Activity is not reliable across a full night, and I can't make it reliable from userspace. On the iOS versions and devices I've tested, iOS consistently ends a Live Activity at around 8 hours of runtime in my device logs. After that, every update() and end() call becomes a no-op, and the lock screen freezes mid-state, stuck showing whatever it last rendered. I have device-log evidence of this on the versions I ran, and Apple documents the Live Activity budget as approximate and it has changed across OS versions, so I'm reporting it as what I consistently observed, not as a proven universal '8 hours, every device, every time.'

Here's what I do to make freezes rarer, knowing none of it is a fix. I recreate the activity on foreground to reset the runtime clock. I use a push-to-start renewal (iOS 17.2+) in the background past the 7.5h mark. And I call a graceful end() at around 7.83h (about 7h50m) so the system tears it down cleanly rather than welding it shut at 8h, which would leave it frozen and unrecoverable for the rest of the session. These moves reduce how often you see a stuck lock screen. They don't guarantee you never will, because the underlying limit is the OS's and all I can do is manage my odds against it.

Battery: record-then-compile

The battery strategy follows directly from the two-layer split. Collection is cheap and runs all night at constant 5Hz. The expensive math, Cole-Kripke and Viterbi, runs exactly once at the end of the night, not continuously. You don't pay for staging until you're done sleeping.

I have to own a real gap here, not dress it up as modesty. I don't have an overnight battery-drain figure I'd defend across phones, OS versions, and battery health, and the honest read of that is uncomfortable: I framed constant 5Hz plus a silent audio track as a deliberate 'correctness over battery' trade-off, but I never actually measured what that trade-off costs. You can't really call a cost-benefit decision deliberate if you never put a number on the cost. So treat 'correctness over battery' as my design intent, not as a measured, justified trade. Measuring overnight drain on named devices is work I still owe.

What I can tell you about where the cost goes: a silent audio track keeps the AVAudioSession alive so CoreMotion keeps firing through the night. That session is the real overnight cost driver, not the arithmetic. If you're battery-sensitive, that's the line item to watch.

What I do throttle overnight is the state machine's feature-processing cadence, and I want to state it precisely because it's easy to overstate. While you're awake there is no throttle at all: it just processes features at the natural 1Hz cadence, every per-second update, no skipping. Once you're classified asleep, it adds a 10-second skip gate, so roughly one in ten of the per-second feature updates actually gets processed, except that intense motion bypasses the gate so a real movement is never skipped. None of this touches the raw accelerometer: the 5Hz raw sampling is never throttled, awake or asleep. That distinction is the whole point of the pollution-loop section above: processing cadence is safe to vary, data density is not.
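
The gate described above can be sketched as a small closure. This is JavaScript for illustration only; the production state machine is Swift, and the names, the caller-supplied intensity test, and the choice to restart the ten-second window after an intense-motion bypass are all assumptions of this sketch.

```javascript
const SKIP_GATE_SECONDS = 10;

// isIntense: caller-supplied predicate on a feature value (hypothetical).
// Returns a function deciding whether a per-second feature update is
// processed. Raw sampling is untouched; only processing is gated.
function makeFeatureGate(isIntense) {
  let lastProcessedAt = -Infinity;
  return function shouldProcess(tSeconds, asleep, feature) {
    const gateOpen = tSeconds - lastProcessedAt >= SKIP_GATE_SECONDS;
    // Awake: no throttle. Asleep: process when the gate has elapsed,
    // or immediately when the motion is intense.
    if (!asleep || gateOpen || isIntense(feature)) {
      lastProcessedAt = tSeconds;
      return true;
    }
    return false;
  };
}
```

Over thirty seconds of quiet sleep this processes the updates at 0, 10, and 20 seconds and skips the rest, which is the one-in-ten ratio; the same thirty seconds while awake processes all thirty.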

Privacy: not fully offline, and I'll say so

I'm not going to claim 'nothing leaves your phone,' because it isn't true. There's no account and no login. The only identifier is identifierForVendor: a stable, per-device, per-vendor ID. That's pseudonymous, not anonymous, and I'm not going to dress it up as the latter.

Here's the actual split. The app does not upload raw snore and sleep-talk audio (PCM); it writes it to local Documents on the device, and the privacy manifest explicitly explains why AudioData is intentionally not declared. I'll scope that claim carefully rather than say 'never leaves the device' as an absolute: I'm describing what the current code does, raw audio is recorded locally and not uploaded. I won't promise it about all code paths for all time, because the snore feature runs through Apple's on-device SoundAnalysis model and audio session (system-level processing I don't fully control), and because the hot-update channel described earlier can change app behavior outside App Store review. So the guarantee is about the code shipping today, not about every future config I might push.

What does upload is minute-level motion features and stage labels, keyed to that identifierForVendor, used to tune the per-night calibration. I should be exact about what 'tune the per-night calibration' means, because it reads as a contradiction with the accuracy section's 'never tuned.' The per-night calibration is the per-night self-calibrating baseline from the staging section: it derives thresholds from the night's own variance distribution, on-device, for that one night. Uploaded motion features and stage labels let me see, in aggregate across users, whether that calibration logic is behaving and adjust the algorithm I ship. What never happens: no wearable-derived signal feeds calibration, directly or indirectly through the aggregate. The wearable timelines only grade accuracy after the fact. So when the accuracy section says the model is never tuned with wearable data, that's the precise claim: your motion features can inform the calibration logic I ship; wearable data does not.

I'm not claiming you can't be re-identified from a long enough series of minute-level motion features tied to a stable device ID; that's a behavioral fingerprint and I won't pretend otherwise. I'm telling you exactly what goes up and what doesn't: motion features and stage labels go up, your recorded audio does not.

Snore detection, and the 1-star review that caused it

Snore detection is not homegrown DSP, and I'm not going to imply it is. It's Apple's SoundAnalysis ML model plus an RMS gate. I used Apple's trained model because I couldn't build a better sound classifier from scratch for this, not because I've shown it's the right tool; I haven't measured that. To match the rest of this post: I haven't measured its false-positive rate, and SoundAnalysis will occasionally tag a fan or a partner's breathing as snoring. So treat the snore clips as a recording you review, not a metric, and not 'detection' in any validated sense.
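
For readers unfamiliar with the term, an RMS gate is just a loudness floor applied alongside the classifier. The sketch below is JavaScript for illustration; the real feature runs on the native audio path, and the thresholds, the dBFS conversion, and the way the gate is combined with the classifier's confidence are all hypothetical.

```javascript
// samples: PCM values in [-1, 1] (plain array or Float32Array).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Level relative to full scale, in decibels. Silence maps to -Infinity.
function toDbfs(rmsValue) {
  return 20 * Math.log10(rmsValue);
}

// Keep a clip only if it is loud enough AND the classifier is confident
// enough. Both default thresholds are placeholder assumptions.
function keepSnoreClip(samples, classifierConfidence, options = {}) {
  const { minDbfs = -40, minConfidence = 0.6 } = options;
  return toDbfs(rms(samples)) >= minDbfs && classifierConfidence >= minConfidence;
}
```

A gate like this only rejects quiet clips. It does nothing about a loud fan that the classifier mislabels, which is why the unmeasured false-positive rate still applies.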

The feature exists for an unglamorous reason. My first-ever App Store review was a 1-star, written in Chinese, asking for snore recording. That was the entire review. They were right to want it, and they were right that the app was incomplete without it, so I built it. I'd rather credit the actual origin than invent a tidier product-vision story.

The code is not clean

Since this is an engineering log, here's the part I'd normally leave out. The staging engine and the live detector are both large, and the line counts are high partly because of duplication and dead paths I haven't cut, not because the problem needs that much code. A clean reimplementation would be a fraction of the size, which is exactly why I won't use any of these line counts as a measure of anything. SimplifiedSleepDetector is a roughly 2,500-line god object; the name is a small joke at this point. I'm calling out the volume explicitly because the alternative read, given that I leaned on an LLM for the unfamiliar APIs, is 'unreviewed bulk,' and the honest answer is that some of it is exactly that.

The stop logic, the thing that decides recording is over, is actually decided in one place: a single evaluation function makes the call. What's spread out is the enforcement. That one decision then has to be carried out across three layers: a guardian-notification fan-out to three-plus subscribers, a 15-second polling timer, and a shared playback-reconciliation check. So the decision point is single; the execution is triplicated. There's also at least one genuinely dead path in there: .autoStopTriggered is posted and has no subscribers. It does nothing, and it's still there.

It would be a little too convenient to call the parts I like 'decisions' and the parts I don't 'scar tissue.' I won't pretend the stop path is clean. The decision is in one place, which is fine, but having to enforce that one decision across three separate mechanisms, a notification fan-out, a poll, and a reconciliation check, is more machinery than the job should need, and the dead notification still sitting in it is evidence I haven't cleaned up after myself. That's a design and tidiness failure, not just untidiness I can wave off. The architecture I do stand behind, the two-layer split, constant 5Hz, the platform-agnostic JS engine, refusing to emit REM, is real. So is the mess. Pretending only the first half is true would make everything else in this post less trustworthy.

Sleep Island is the iOS app this write-up describes.