Designing an Offline-First Sync Engine for Mobile Apps

Context

Mobile apps operate in unreliable network environments. Users expect instant feedback regardless of connectivity. An offline-first sync engine treats the local database as the source of truth and syncs with the server asynchronously.

Problem

Most mobile apps treat the network as a given. They show a spinner, make a request, render the response. This breaks in three common scenarios:

Flaky connections: elevators, tunnels, rural areas, crowded venues
High latency: emerging markets where round trips take 2 to 5 seconds
Aggressive battery optimization: the OS kills background connections on both Android and iOS

The core problem: how do you keep the app fully functional offline while ensuring data consistency when connectivity returns?

Constraints

Local database must be the single source of truth for reads
Mutations must be captured and queued for async sync
Conflict resolution must be deterministic and predictable
Sync must be idempotent (safe to retry any operation)
Battery and bandwidth must be respected (no sync on every keystroke)
The engine must recover from mid-sync crashes without data loss

Design

The sync engine sits between the app's data layer and the remote API. Four responsibilities:

Local persistence: all reads and writes hit a local database
Change tracking: mutations captured as an append-only operation log
Sync scheduling: background process pushes and pulls when connectivity allows
Conflict resolution: deterministic strategy when local and remote diverge
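
The first two responsibilities can be sketched together as a single write path. This is a minimal in-memory illustration, not a definitive implementation: `LocalStore` and `LoggedOp` are hypothetical names, and `@Synchronized` stands in for what would be a database transaction in a real engine.

```kotlin
// Hypothetical in-memory stand-ins; a real engine would back both with one local database.
data class LoggedOp(val seq: Long, val entityId: String, val payload: Map<String, Any?>?)

class LocalStore {
    private val rows = mutableMapOf<String, Map<String, Any?>>()
    private val log = mutableListOf<LoggedOp>()
    private var nextSeq = 1L

    // Reads never touch the network: the local copy is authoritative.
    @Synchronized
    fun read(entityId: String): Map<String, Any?>? = rows[entityId]

    // A write appends to the log and updates the row as one unit.
    @Synchronized
    fun write(entityId: String, payload: Map<String, Any?>): LoggedOp {
        val op = LoggedOp(nextSeq++, entityId, payload)
        log.add(op)
        rows[entityId] = payload
        return op
    }

    // Snapshot of the log that a sync scheduler would hand to the push step.
    @Synchronized
    fun pendingOps(): List<LoggedOp> = log.toList()
}
```

The UI only ever calls `read` and `write`; everything network-related consumes `pendingOps()` in the background.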

[Figure: Offline-First Sync Engine Architecture]

Operation Log

Every mutation is appended to the operation log in the same local transaction that applies it to the database, so a crash can never leave one without the other. Each entry contains:

Unique operation ID
Entity type and entity ID
Operation type (create / update / delete)
Logical timestamp (monotonically increasing counter, not wall clock)
Payload (for creates and updates)

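import java.util.UUID

// One entry in the append-only operation log.
data class SyncOperation(
    val id: String = UUID.randomUUID().toString(), // doubles as the server-side idempotency key
    val entityType: String,
    val entityId: String,
    val type: OperationType,
    val timestamp: Long,              // logical clock value, not wall-clock time
    val payload: Map<String, Any?>?,  // null for DELETE
    val status: SyncStatus = SyncStatus.PENDING
)

enum class OperationType { CREATE, UPDATE, DELETE }
enum class SyncStatus { PENDING, IN_FLIGHT, SYNCED, FAILED, DEAD_LETTER }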

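Logical clocks avoid the problems wall clocks bring: users changing the device time, and clock skew between devices. A counter that only increments locally orders operations from a single device; to compare timestamps across devices, each device must also advance its counter past any timestamp it receives during sync (a Lamport clock).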
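
A logical clock of this kind can be sketched in a few lines. The class name `LogicalClock` and its method names are illustrative assumptions; the `observe` step follows the Lamport clock rule of moving the counter past any timestamp seen from elsewhere.

```kotlin
// Lamport-style logical clock (hypothetical class name).
class LogicalClock(private var counter: Long = 0L) {

    // Call for every local mutation; returns a strictly increasing timestamp.
    @Synchronized
    fun tick(): Long = ++counter

    // Call with every timestamp received from the server or another device,
    // so later local operations are ordered after everything already seen.
    @Synchronized
    fun observe(remoteTimestamp: Long) {
        counter = maxOf(counter, remoteTimestamp)
    }
}
```

In practice the counter would need to be persisted alongside the operation log so that it does not restart from zero after the app is killed.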

Sync Scheduling

Batch operations. Sync when conditions are favorable:

Trigger	Strategy
Network available	`ConnectivityManager` (Android) / `NWPathMonitor` (iOS)
Debounce	Wait 2 to 5 seconds after last write
Retry	Exponential backoff: 1s, 2s, 4s, 8s, capped at 60s
Periodic fallback	WorkManager (15-minute minimum interval) / BGTaskScheduler (the OS picks the actual run time)

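import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

class SyncScheduler(
    private val connectivityMonitor: ConnectivityMonitor,
    private val syncEngine: SyncEngine,
    private val scope: CoroutineScope  // owned by the app, outlives individual screens
) {
    private var debounceJob: Job? = null

    // Call after every local write; each call restarts the debounce window.
    fun onLocalWrite() {
        debounceJob?.cancel()
        debounceJob = scope.launch {
            delay(3_000)  // within the 2 to 5 second debounce range
            if (connectivityMonitor.isConnected()) {
                syncEngine.push()
            }
            // If offline, the network-available trigger picks the queue up later.
        }
    }
}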
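
The retry row of the table (1s, 2s, 4s, 8s, capped at 60s) can be computed with a small helper. This is a sketch under assumptions: the function name and parameters are hypothetical, and the attempt counter is taken to be zero-based.

```kotlin
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ... capped at 60s.
fun backoffDelayMillis(
    attempt: Int,
    baseMillis: Long = 1_000L,
    capMillis: Long = 60_000L
): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    // Clamp the shift so very large attempt numbers cannot overflow a Long.
    val shift = minOf(attempt, 20)
    return minOf(baseMillis shl shift, capMillis)
}
```

Production schedulers often add random jitter on top of this delay so that many devices recovering at once do not retry in lockstep; that is left out here to keep the sequence identical to the table.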

Conflict Resolution

Two devices edit the same record while both are offline. Three strategies, ordered by complexity:

Last-Write-Wins (LWW): the write with the highest logical timestamp wins. Simple, but it silently discards the losing write. Acceptable for user preferences and read receipts.
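
A minimal sketch of LWW follows. The `Versioned` type is hypothetical, and the device ID tie-break is an assumption added here: without some tie-break, two writes with equal logical timestamps would not resolve deterministically.

```kotlin
// Hypothetical versioned record: `deviceId` is only used to break timestamp ties.
data class Versioned(
    val value: Map<String, Any?>,
    val timestamp: Long,
    val deviceId: String
)

// Picks the same winner on every device, regardless of argument order.
fun lastWriteWins(a: Versioned, b: Versioned): Versioned =
    maxOf(a, b, compareBy<Versioned>({ it.timestamp }, { it.deviceId }))
```

Because the comparison depends only on the two records, every device and the server reach the same result independently.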

Field-Level Merge: merge at field level. Device A changes name, device B changes email, both survive. Conflict only when the same field is modified on both sides.

fun mergeFields(
    base: Map<String, Any?>,
    local: Map<String, Any?>,
    remote: Map<String, Any?>
): Map<String, Any?> {
    val merged = base.toMutableMap()
    for (key in (local.keys + remote.keys)) {
        val localChanged = local[key] != base[key]
        val remoteChanged = remote[key] != base[key]
        merged[key] = when {
            localChanged && !remoteChanged -> local[key]
            !localChanged && remoteChanged -> remote[key]
            localChanged && remoteChanged -> remote[key] // both changed: remote wins here; true LWW would compare per-field timestamps
            else -> base[key]
        }
    }
    return merged
}

Application-Level Resolution: domain-specific logic. Inventory systems sum deltas. Collaborative editors use CRDTs. Financial transactions require explicit user resolution.
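
As one illustration of domain-specific resolution, the inventory case can be sketched by syncing deltas instead of absolute values. `StockDelta` and `applyDeltas` are hypothetical names, and the sketch assumes each delta carries a unique operation ID.

```kotlin
// Hypothetical delta operation: "add `delta` units to this stock item".
data class StockDelta(val opId: String, val itemId: String, val delta: Int)

// Applies every delta exactly once; replayed operation IDs are ignored.
fun applyDeltas(baseQuantity: Int, deltas: List<StockDelta>): Int =
    baseQuantity + deltas.distinctBy { it.opId }.sumOf { it.delta }
```

Since addition is commutative, two devices that each sell or restock while offline converge to the same quantity whatever order their deltas arrive in, which is why no write has to be discarded.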

Handling Deletes

Physical deletion creates a re-creation problem: if one device deletes a record while another still holds a copy it has not synced, that stale copy looks like a new record on the next sync and the record is re-created.

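Solution: tombstones. Mark records as deleted with a deletedAt timestamp. Propagate the tombstone via sync. Purge tombstones older than 30 days; a device that has been offline longer than that retention window must do a full resync rather than push its stale state, or purged records can reappear.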

data class Entity(
    val id: String,
    val data: Map<String, Any?>,
    val updatedAt: Long,
    val deletedAt: Long? = null  // null = alive, non-null = tombstone
)
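
A sketch of how tombstones might be applied and purged. It makes two assumptions that the text does not spell out: a delete bumps `updatedAt` (the logical version) like any other write, and `deletedAt` is wall-clock milliseconds used only to decide when to purge. `Row` is a trimmed-down stand-in for `Entity`.

```kotlin
// Trimmed-down stand-in for Entity, to keep the sketch self-contained.
data class Row(val id: String, val updatedAt: Long, val deletedAt: Long? = null)

// Applies an incoming row only if it is newer, so a stale edit
// can never overwrite (and thereby resurrect) a newer tombstone.
fun applyRemote(local: MutableMap<String, Row>, incoming: Row) {
    val current = local[incoming.id]
    if (current == null || incoming.updatedAt > current.updatedAt) {
        local[incoming.id] = incoming
    }
}

// Physically removes tombstones older than the retention window.
fun purgeTombstones(local: MutableMap<String, Row>, now: Long, retention: Long) {
    local.values.removeAll { it.deletedAt != null && now - it.deletedAt > retention }
}
```

Once a tombstone has been purged, `applyRemote` can no longer tell a stale edit from a new record, which is why the retention window has to exceed the longest offline period the engine is expected to tolerate.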

Ordering Guarantees

Operations on the same entity must be applied in order. Operations on different entities can be applied in any order.

Group pending operations by entity ID
Sort each group by logical timestamp
Send sequentially per entity, wait for acknowledgment
Different entities can sync concurrently
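
The four steps above can be sketched as a grouping step plus an in-order sender. `QueuedOp` is a trimmed-down stand-in for `SyncOperation`, and `send` is a hypothetical transport callback that returns true once the server acknowledges the operation.

```kotlin
// Trimmed-down stand-in for SyncOperation.
data class QueuedOp(val id: String, val entityId: String, val timestamp: Long)

// Groups pending operations by entity and orders each group by logical timestamp.
fun buildSendPlan(pending: List<QueuedOp>): Map<String, List<QueuedOp>> =
    pending.groupBy { it.entityId }
        .mapValues { (_, ops) -> ops.sortedBy { it.timestamp } }

// Sends one entity's operations in order and stops at the first failure, so a
// later operation is never applied ahead of an unacknowledged earlier one.
// Returns the operations that were acknowledged.
fun sendInOrder(ops: List<QueuedOp>, send: (QueuedOp) -> Boolean): List<QueuedOp> {
    val acknowledged = mutableListOf<QueuedOp>()
    for (op in ops) {
        if (!send(op)) break
        acknowledged += op
    }
    return acknowledged
}
```

Each entry in the plan is independent of the others, so a caller could run `sendInOrder` for different entities concurrently, for example one coroutine per entity.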

Trade-offs

Gain	Cost
Works offline	Local database + operation log storage overhead
Instant UI feedback	Eventual consistency, UI may show stale data
Resilient to network failures	Conflict resolution complexity is domain-specific
Battery-friendly batching	Sync delay means data is not immediately available on other devices

For real-time multiplayer games or live auctions, this architecture is wrong. Know which category your app falls into before committing.

Failure Modes

Failure	Mitigation
Network drops mid-sync	Idempotent operations with operation ID as server-side idempotency key
App killed by OS during sync	Transactional batches: local DB update + queue insertion in one transaction
Double-send of operations	Mark as `IN_FLIGHT` during sync; reset to `PENDING` on failure and on app restart, relying on the idempotency key to make any resend harmless
Permanently failing operations	Dead letter queue after N retries for manual inspection
Clock skew between devices	Logical clocks instead of wall-clock timestamps
Tombstone not propagated	Periodic full-state reconciliation as fallback
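
The first mitigation in the table depends on the server remembering which operation IDs it has already applied. A minimal in-memory sketch, assuming a hypothetical `IdempotentReceiver` class (a real server would keep the ID set in durable storage and update it in the same transaction as the state change):

```kotlin
// Hypothetical server-side receiver keyed by operation ID.
class IdempotentReceiver {
    private val appliedIds = mutableSetOf<String>()
    private val state = mutableMapOf<String, String>()

    // Returns true if the operation was applied now, false if it was a replay.
    // Either way the client can treat the call as acknowledged.
    fun apply(opId: String, entityId: String, value: String): Boolean {
        if (!appliedIds.add(opId)) return false
        state[entityId] = value
        return true
    }

    fun get(entityId: String): String? = state[entityId]
}
```

A client that loses the connection before it sees the acknowledgment can simply resend: the replay is recognized and does not overwrite newer state.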

Scaling Considerations

Operation log growth: compact the log periodically. Merge consecutive updates to the same entity into a single operation
Large backlogs: if a device comes back online after an extended offline period, paginate the sync. Do not send 10,000 operations in one batch
Server-side fan-out: when multiple devices sync for the same user, the server must handle concurrent writes with proper locking or CAS (compare-and-swap)
Selective sync: not all entities need to be synced. Allow per-entity-type opt-in to reduce bandwidth and storage
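
Log compaction and pagination can be sketched as two small functions. `UpdateOp` is a hypothetical stand-in for an UPDATE entry in the operation log, and the default page size of 500 is an arbitrary assumption, not a recommendation from the text.

```kotlin
// Trimmed-down stand-in for an UPDATE entry in the operation log.
data class UpdateOp(val entityId: String, val timestamp: Long, val payload: Map<String, Any?>)

// Folds runs of consecutive updates to the same entity into one operation.
// Later fields overwrite earlier ones; the merged op keeps the latest timestamp.
fun compact(log: List<UpdateOp>): List<UpdateOp> {
    val out = mutableListOf<UpdateOp>()
    for (op in log) {
        val last = out.lastOrNull()
        if (last != null && last.entityId == op.entityId) {
            out[out.lastIndex] = UpdateOp(op.entityId, op.timestamp, last.payload + op.payload)
        } else {
            out += op
        }
    }
    return out
}

// Splits a backlog into fixed-size pages so no single request is unbounded.
fun paginate(ops: List<UpdateOp>, pageSize: Int = 500): List<List<UpdateOp>> =
    ops.chunked(pageSize)
```

Only operations that have not yet been sent should be compacted; merging an entry that is already in flight would change what its operation ID means to the server.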

Observability

Track these metrics to understand sync health in production:

Sync latency: time between local mutation and server acknowledgment
Queue depth: number of pending operations per device (alert if it grows consistently)
Conflict rate: percentage of sync operations that trigger conflict resolution
Failure rate: percentage of operations that enter the dead letter queue
Tombstone accumulation: count of active tombstones (indicates deletion patterns)

Instrument the sync engine to emit structured logs for each operation lifecycle: PENDING, IN_FLIGHT, SYNCED, FAILED, DEAD_LETTER.
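
One way to sketch the lifecycle and its log output is shown below. The `Lifecycle` enum mirrors the states listed above; `Tracked`, `afterAttempt`, `logLine`, and the `maxAttempts` default are hypothetical, with `maxAttempts` standing for the "N retries" threshold from the Failure Modes table.

```kotlin
// Lifecycle states named in the text; DEAD_LETTER is terminal.
enum class Lifecycle { PENDING, IN_FLIGHT, SYNCED, FAILED, DEAD_LETTER }

// Hypothetical tracked operation: `attempts` counts failed sends so far.
data class Tracked(val opId: String, val state: Lifecycle, val attempts: Int = 0)

// Moves an in-flight operation to its next state after a send attempt.
fun afterAttempt(op: Tracked, succeeded: Boolean, maxAttempts: Int = 5): Tracked {
    require(op.state == Lifecycle.IN_FLIGHT) { "only in-flight operations can complete" }
    if (succeeded) return op.copy(state = Lifecycle.SYNCED)
    val attempts = op.attempts + 1
    val next = if (attempts >= maxAttempts) Lifecycle.DEAD_LETTER else Lifecycle.FAILED
    return op.copy(state = next, attempts = attempts)
}

// One structured log line per transition, as key=value pairs.
fun logLine(before: Tracked, after: Tracked): String =
    "op=${before.opId} from=${before.state} to=${after.state} attempts=${after.attempts}"
```

Emitting one line per transition makes the metrics above derivable from logs alone: failure rate, for instance, is the share of operations whose final transition ends in `DEAD_LETTER`.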

Key Takeaways

Local database is the source of truth. The server is a peer that eventually catches up
Use logical clocks, not wall clocks
Conflict resolution strategy depends on the domain. Start with LWW, graduate to field-level merge when needed
Tombstones solve the delete propagation problem
Idempotency is non-negotiable. Every operation must be safe to retry
Start simple: local persistence, operation queue, LWW. Layer complexity as requirements demand

Final Thoughts

The best sync engines are invisible. The user edits data, puts the phone in a pocket, and everything converges. Building that experience requires careful thinking about operation logs, conflict resolution, ordering guarantees, and failure recovery.

Start with the minimum viable sync: local persistence, an operation queue, last-write-wins. Layer in field-level merging, smarter scheduling, and observability as usage patterns emerge. The architecture should grow with the product, not ahead of it.