Designing an Offline-First Sync Engine for Mobile Apps
A deep dive into building a reliable sync engine that keeps mobile apps functional without connectivity, covering conflict resolution, queue management, and real-world trade-offs.

Context
Mobile apps operate in unreliable network environments. Users expect instant feedback regardless of connectivity. An offline-first sync engine treats the local database as the source of truth and syncs with the server asynchronously.
Problem
Most mobile apps treat the network as a given. They show a spinner, make a request, render the response. This breaks in three common scenarios:
- Flaky connections: elevators, tunnels, rural areas, crowded venues
- High latency: emerging markets where round trips take 2 to 5 seconds
- Aggressive battery optimization: the OS kills background connections on both Android and iOS
The core problem: how do you keep the app fully functional offline while ensuring data consistency when connectivity returns?
Constraints
- Local database must be the single source of truth for reads
- Mutations must be captured and queued for async sync
- Conflict resolution must be deterministic and predictable
- Sync must be idempotent (safe to retry any operation)
- Battery and bandwidth must be respected (no sync on every keystroke)
- The engine must recover from mid-sync crashes without data loss
Design
The sync engine sits between the app's data layer and the remote API. Four responsibilities:
- Local persistence: all reads and writes hit a local database
- Change tracking: mutations captured as an append-only operation log
- Sync scheduling: background process pushes and pulls when connectivity allows
- Conflict resolution: deterministic strategy when local and remote diverge
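A minimal in-memory sketch of how the first two responsibilities might fit together. The names `LocalStore` and `LoggedOp` are hypothetical, and plain collections stand in for a real database; a production engine would wrap the log append and the row update in one database transaction rather than a lock.

```kotlin
// Hypothetical in-memory stand-in for a local database plus operation log.
data class LoggedOp(val seq: Long, val entityId: String, val payload: Map<String, Any?>)

class LocalStore {
    private val rows = mutableMapOf<String, Map<String, Any?>>()
    private val log = mutableListOf<LoggedOp>()
    private var nextSeq = 0L

    // Reads are always served locally; the network is never consulted.
    @Synchronized
    fun read(entityId: String): Map<String, Any?>? = rows[entityId]

    // Append to the log first, then apply the change, under one lock so
    // no caller can observe one without the other.
    @Synchronized
    fun write(entityId: String, payload: Map<String, Any?>) {
        nextSeq += 1
        log.add(LoggedOp(nextSeq, entityId, payload))
        rows[entityId] = payload
    }

    // Snapshot of logged operations that a sync scheduler could later push.
    @Synchronized
    fun pendingOps(): List<LoggedOp> = log.toList()
}
```

The UI only ever talks to `read` and `write`, which is what makes feedback instant regardless of connectivity.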

Operation Log
Every mutation gets written to an append-only log before touching the local database. Each entry contains:
- Unique operation ID
- Entity type and entity ID
- Operation type (create / update / delete)
- Logical timestamp (monotonically increasing counter, not wall clock)
- Payload (for creates and updates)
    import java.util.UUID

    // One entry in the append-only operation log.
    data class SyncOperation(
        val id: String = UUID.randomUUID().toString(),
        val entityType: String,
        val entityId: String,
        val type: OperationType,
        val timestamp: Long, // logical clock value, not wall-clock time
        val payload: Map<String, Any?>?, // null for deletes
        val status: SyncStatus = SyncStatus.PENDING
    )

    enum class OperationType { CREATE, UPDATE, DELETE }
    enum class SyncStatus { PENDING, IN_FLIGHT, SYNCED, FAILED }

Logical clocks sidestep problems caused by users changing the device time and by clock or timezone differences across devices.
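One way to produce such timestamps is a Lamport-style counter. This is a sketch under that assumption; the article only requires a monotonically increasing counter, and the class name `LogicalClock` is hypothetical.

```kotlin
// Lamport-style logical clock: a counter that only moves forward.
class LogicalClock(private var counter: Long = 0L) {
    // Call for every local mutation; returns the timestamp to store on the operation.
    @Synchronized
    fun tick(): Long {
        counter += 1
        return counter
    }

    // Call when a remote timestamp is seen, so later local ticks order after it.
    @Synchronized
    fun observe(remote: Long): Long {
        counter = maxOf(counter, remote) + 1
        return counter
    }
}
```

The counter would need to be persisted alongside the operation log, otherwise an app restart resets it and breaks monotonicity.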
Sync Scheduling
Batch operations. Sync when conditions are favorable:
| Trigger | Strategy |
|---|---|
| Network available | ConnectivityManager (Android) / NWPathMonitor (iOS) |
| Debounce | Wait 2 to 5 seconds after last write |
| Retry | Exponential backoff: 1s, 2s, 4s, 8s, capped at 60s |
| Periodic fallback | WorkManager / BGTaskScheduler every 15 minutes |
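The retry row can be expressed as a small pure function. A sketch, with `backoffDelayMillis` and its parameter names as assumptions rather than anything from a specific library:

```kotlin
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ... capped at 60s.
fun backoffDelayMillis(
    attempt: Int,
    baseMillis: Long = 1_000L,
    capMillis: Long = 60_000L
): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    // Clamp the shift so very large attempt counts cannot overflow a Long.
    val shift = minOf(attempt, 20)
    return minOf(baseMillis shl shift, capMillis)
}
```

With the defaults, attempts 0 through 5 wait 1, 2, 4, 8, 16, and 32 seconds; every later attempt waits the 60 second cap.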
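    import kotlinx.coroutines.CoroutineScope
    import kotlinx.coroutines.Job
    import kotlinx.coroutines.delay
    import kotlinx.coroutines.launch

    // ConnectivityMonitor and SyncEngine are app-defined types.
    class SyncScheduler(
        private val scope: CoroutineScope,
        private val connectivityMonitor: ConnectivityMonitor,
        private val syncEngine: SyncEngine
    ) {
        private var debounceJob: Job? = null

        // Call after every local write; each call restarts the debounce window.
        fun onLocalWrite() {
            debounceJob?.cancel()
            debounceJob = scope.launch {
                delay(3_000) // inside the 2 to 5 second debounce range
                if (connectivityMonitor.isConnected()) {
                    syncEngine.push()
                }
            }
        }
    }

Conflict Resolution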
Two devices edit the same record while both are offline. Three strategies, ordered by complexity:
Last-Write-Wins (LWW): highest logical timestamp wins. Simple. Silently discards changes. Acceptable for user preferences, read receipts.
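A sketch of LWW as a pure function. The `Versioned` type and the `deviceId` tie-breaker are assumptions added here: the article does not say what happens when two logical timestamps are equal, and some fixed tie-break is needed to keep the outcome deterministic.

```kotlin
// Hypothetical versioned value; `deviceId` is an assumed tie-breaker.
data class Versioned<T>(val value: T, val timestamp: Long, val deviceId: String)

// Highest logical timestamp wins. Equal timestamps fall back to comparing
// device IDs so every replica picks the same winner regardless of argument order.
fun <T> lastWriteWins(a: Versioned<T>, b: Versioned<T>): Versioned<T> =
    when {
        a.timestamp != b.timestamp -> if (a.timestamp > b.timestamp) a else b
        else -> if (a.deviceId >= b.deviceId) a else b
    }
```

The losing value is dropped without any signal to the user, which is why this strategy suits low-stakes data.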
Field-Level Merge: merge at field level. Device A changes name, device B changes email, both survive. Conflict only when the same field is modified on both sides.
    // Three-way merge: `base` is the last state both sides agreed on.
    fun mergeFields(
        base: Map<String, Any?>,
        local: Map<String, Any?>,
        remote: Map<String, Any?>
    ): Map<String, Any?> {
        val merged = base.toMutableMap()
        for (key in (local.keys + remote.keys)) {
            val localChanged = local[key] != base[key]
            val remoteChanged = remote[key] != base[key]
            merged[key] = when {
                localChanged && !remoteChanged -> local[key]
                !localChanged && remoteChanged -> remote[key]
                // Same field changed on both sides: remote wins here. A true
                // LWW fallback would compare per-field timestamps instead.
                localChanged && remoteChanged -> remote[key]
                else -> base[key]
            }
        }
        return merged
    }

Application-Level Resolution: domain-specific logic. Inventory systems sum deltas. Collaborative editors use CRDTs. Financial transactions require explicit user resolution.
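As an illustration of the inventory case, a sketch of delta summing. `StockDelta` and `applyDeltas` are hypothetical names; the point is that each device records how much it changed a count rather than the resulting total.

```kotlin
// Hypothetical delta operation: the change in stock, not the new total.
data class StockDelta(val operationId: String, val sku: String, val delta: Int)

// Applies every delta once (deduplicated by operation ID), so concurrent
// offline edits add up instead of overwriting each other.
fun applyDeltas(baseStock: Map<String, Int>, deltas: List<StockDelta>): Map<String, Int> {
    val result = baseStock.toMutableMap()
    for (op in deltas.distinctBy { it.operationId }) {
        result[op.sku] = (result[op.sku] ?: 0) + op.delta
    }
    return result
}
```

Because addition is commutative, the order in which devices sync does not change the final count, which is what makes this resolution deterministic.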
Handling Deletes
Physical deletion creates a re-creation problem: if one device deletes a record while another still holds a copy it has not synced, that second device pushes the record back on its next sync and re-creates it.
Solution: tombstones. Mark records as deleted with a deletedAt timestamp. Propagate the tombstone via sync. Purge tombstones older than 30 days.
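The filtering and purge rules can be sketched as follows. `Row` is a hypothetical, stripped-down shape, and `deletedAtMillis` is assumed to be wall-clock epoch milliseconds, since a 30 day window needs real time rather than a logical counter.

```kotlin
// Hypothetical row shape holding only what the purge rule needs.
data class Row(val id: String, val deletedAtMillis: Long? = null)

// 30 days expressed in milliseconds.
const val TOMBSTONE_TTL_MILLIS: Long = 30L * 24 * 60 * 60 * 1000

// Live rows are what the UI should see; tombstones stay in storage for sync.
fun visibleRows(rows: List<Row>): List<Row> = rows.filter { it.deletedAtMillis == null }

// Drops tombstones older than the TTL and keeps everything else.
fun purgeExpiredTombstones(rows: List<Row>, nowMillis: Long): List<Row> =
    rows.filter { row ->
        val deletedAt = row.deletedAtMillis
        deletedAt == null || nowMillis - deletedAt <= TOMBSTONE_TTL_MILLIS
    }
```

A device that stays offline longer than the purge window may never see the tombstone, so the window is a trade-off between storage and the risk of re-creation.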
    data class Entity(
        val id: String,
        val data: Map<String, Any?>,
        val updatedAt: Long,
        val deletedAt: Long? = null // null = alive, non-null = tombstone
    )

Ordering Guarantees
Operations on the same entity must be applied in order. Operations on different entities can be applied in any order.
- Group pending operations by entity ID
- Sort each group by logical timestamp
- Send sequentially per entity, wait for acknowledgment
- Different entities can sync concurrently
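The first two steps above map directly onto standard collection operations. A sketch, with `PendingOp` and `planSync` as hypothetical names:

```kotlin
// Minimal hypothetical shape: only the fields needed for ordering.
data class PendingOp(val id: String, val entityId: String, val timestamp: Long)

// Groups pending operations by entity and sorts each group by logical timestamp.
// Each resulting list must be sent sequentially; separate lists may go concurrently.
fun planSync(pending: List<PendingOp>): Map<String, List<PendingOp>> =
    pending
        .groupBy { it.entityId }
        .mapValues { (_, ops) -> ops.sortedBy { it.timestamp } }
```

Each map entry can then be handed to its own worker, which sends one operation, waits for the acknowledgment, and only then sends the next.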
Trade-offs
| Gain | Cost |
|---|---|
| Works offline | Local database + operation log storage overhead |
| Instant UI feedback | Eventual consistency, UI may show stale data |
| Resilient to network failures | Conflict resolution complexity is domain-specific |
| Battery-friendly batching | Sync delay means data is not immediately available on other devices |
For real-time multiplayer games or live auctions, this architecture is wrong. Know which category your app falls into before committing.
Failure Modes
| Failure | Mitigation |
|---|---|
| Network drops mid-sync | Idempotent operations with operation ID as server-side idempotency key |
| App killed by OS during sync | Transactional batches: local DB update + queue insertion in one transaction |
| Double-send of operations | Mark as IN_FLIGHT during sync, reset to PENDING on failure |
| Permanently failing operations | Dead letter queue after N retries for manual inspection |
| Clock skew between devices | Logical clocks instead of wall-clock timestamps |
| Tombstone not propagated | Periodic full-state reconciliation as fallback |
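The first mitigation in the table can be sketched as a receiver that remembers which operation IDs it has already applied. `IdempotentReceiver` is a hypothetical name, and the in-memory set is an assumption; a real server would persist the seen IDs in the same transaction as the write.

```kotlin
// Hypothetical server-side receiver using the operation ID as an idempotency key.
class IdempotentReceiver(private val apply: (entityId: String, payload: String) -> Unit) {
    private val seenOperationIds = mutableSetOf<String>()

    // Returns true if the operation was applied now, false if it was a retry of
    // one already processed. Either way the client can safely mark it SYNCED.
    @Synchronized
    fun receive(operationId: String, entityId: String, payload: String): Boolean {
        if (operationId in seenOperationIds) return false
        apply(entityId, payload)
        // Record the ID only after a successful apply, so a failed apply can be retried.
        seenOperationIds.add(operationId)
        return true
    }
}
```

With this in place, a client that loses the connection before receiving an acknowledgment can resend the same operation without risking a double write.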
Scaling Considerations
- Operation log growth: compact the log periodically. Merge consecutive updates to the same entity into a single operation
- Large backlogs: if a device comes online after extended offline, paginate sync. Do not send 10,000 operations in one batch
- Server-side fan-out: when multiple devices sync for the same user, the server must handle concurrent writes with proper locking or CAS (compare-and-swap)
- Selective sync: not all entities need to be synced. Allow per-entity-type opt-in to reduce bandwidth and storage
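A sketch of log compaction under simplifying assumptions: `LogOp` and `OpType` are hypothetical, update payloads are assumed to carry only the changed fields, the input is assumed to be sorted by timestamp, and only updates that are adjacent in the log are merged.

```kotlin
enum class OpType { CREATE, UPDATE, DELETE }

// Hypothetical simplified operation; payload holds only the changed fields.
data class LogOp(
    val entityId: String,
    val type: OpType,
    val timestamp: Long,
    val payload: Map<String, Any?> = emptyMap()
)

// Merges runs of adjacent UPDATEs to the same entity into a single UPDATE.
// Later fields overwrite earlier ones; the merged op keeps the latest timestamp.
fun compact(log: List<LogOp>): List<LogOp> {
    val out = mutableListOf<LogOp>()
    for (op in log) {
        val last = out.lastOrNull()
        val mergeable = last != null &&
            last.type == OpType.UPDATE &&
            op.type == OpType.UPDATE &&
            last.entityId == op.entityId
        if (last != null && mergeable) {
            out[out.lastIndex] = last.copy(
                timestamp = op.timestamp,
                payload = last.payload + op.payload
            )
        } else {
            out.add(op)
        }
    }
    return out
}
```

For large backlogs, the compacted list can then be paginated with the standard library's `chunked`, for example `compact(log).chunked(500)`, where the batch size of 500 is an arbitrary illustration.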
Observability
Track these metrics to understand sync health in production:
- Sync latency: time between local mutation and server acknowledgment
- Queue depth: number of pending operations per device (alert if it keeps growing)
- Conflict rate: percentage of sync operations that trigger conflict resolution
- Failure rate: percentage of operations that enter the dead letter queue
- Tombstone accumulation: count of active tombstones (indicates deletion patterns)
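Instrument the sync engine to emit a structured log entry at each stage of the operation lifecycle: PENDING, IN_FLIGHT, SYNCED, FAILED, and DEAD_LETTER. DEAD_LETTER marks operations moved to the dead letter queue; the SyncStatus enum shown earlier does not include it and would need extending.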
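A sketch of how three of these metrics might be derived from per-operation events. `SyncMetrics` is a hypothetical in-memory collector; a real app would forward the values to whatever metrics backend it already uses.

```kotlin
// Hypothetical in-memory counters for sync health.
class SyncMetrics {
    private var synced = 0
    private var conflicts = 0
    private var deadLettered = 0
    private var totalLatencyMillis = 0L

    // Record a server-acknowledged operation.
    @Synchronized
    fun recordSynced(createdAtMillis: Long, ackedAtMillis: Long, hadConflict: Boolean) {
        synced += 1
        if (hadConflict) conflicts += 1
        totalLatencyMillis += ackedAtMillis - createdAtMillis
    }

    // Record an operation that exhausted its retries.
    @Synchronized
    fun recordDeadLetter() {
        deadLettered += 1
    }

    // Share of synced operations that needed conflict resolution, in percent.
    @Synchronized
    fun conflictRatePercent(): Double =
        if (synced == 0) 0.0 else 100.0 * conflicts / synced

    // Share of finished operations (synced or dead-lettered) that were dead-lettered.
    @Synchronized
    fun failureRatePercent(): Double {
        val finished = synced + deadLettered
        return if (finished == 0) 0.0 else 100.0 * deadLettered / finished
    }

    // Mean time between local mutation and server acknowledgment.
    @Synchronized
    fun meanSyncLatencyMillis(): Double =
        if (synced == 0) 0.0 else totalLatencyMillis.toDouble() / synced
}
```

A mean hides long tails, so percentile latencies would usually be worth tracking as well; the mean is used here only to keep the sketch short.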
Key Takeaways
- Local database is the source of truth. The server is a peer that eventually catches up
- Use logical clocks, not wall clocks
- Conflict resolution strategy depends on the domain. Start with LWW, graduate to field-level merge when needed
- Tombstones solve the delete propagation problem
- Idempotency is non-negotiable. Every operation must be safe to retry
- Start simple: local persistence, operation queue, LWW. Layer complexity as requirements demand
Final Thoughts
The best sync engines are invisible. The user edits data, puts the phone in a pocket, and everything converges. Building that experience requires careful thinking about operation logs, conflict resolution, ordering guarantees, and failure recovery.
Start with the minimum viable sync: local persistence, an operation queue, last-write-wins. Layer in field-level merging, smarter scheduling, and observability as usage patterns emerge. The architecture should grow with the product, not ahead of it.