Expand-migrate-contract migration recipes


Concrete recipes for the expand → migrate → contract pattern mandated by `policies/always-on.kmd § R3.1`, by storage tier. Components copy-paste-tweak; they don't reinvent (and don't run a big-bang ALTER that locks up the fleet).

Spec: Expand-migrate-contract recipes (R3.1)

Status: draft v0.1 (2026-05-24). Recipes validated in production in ≥ 1 component will be promoted to stable. Components consult this doc before touching schema/layout so they don't take down the fleet.

The rule (from `always-on.kmd § R3.1`)

Every migration that changes the shape of data spans at least 3 releases:

Expand (release N): adds the new shape; release-N code writes to both shapes; reads prefer the new shape and fall back to the old one.
Migrate (between N and N+1): an asynchronous backfill converts old rows to the new shape (background job, idempotent, resumable, per-tenant).
Contract (release N+1, after the migrate is confirmed): code stops writing the old shape; optionally drops the column/table.

Validate that N+1 and N coexist during the rollout interval (T4 in specs/testing/always-on-recipes.kmd).
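The coexistence requirement can be made concrete with a minimal in-memory sketch. It is not taken from any Koder component: `Row`, `writeN`, `readN`, and `readNMinus1` are hypothetical names, and a struct with two pointers stands in for whatever storage holds both shapes. It shows that a release-N dual-write keeps an N-1 reader working, and that a release-N reader still finds rows that were never backfilled.

```go
package main

import "fmt"

// Row models one record that carries both shapes during the expand phase.
// Field names are illustrative only.
type Row struct {
	Old *string // old shape, the only one release N-1 knows about
	New *string // new shape, preferred by release N and later
}

// writeN is what release N does: write both shapes.
func writeN(r *Row, v string) {
	r.Old, r.New = &v, &v
}

// readNMinus1 is what a not-yet-upgraded instance does: it only reads Old.
func readNMinus1(r Row) (string, bool) {
	if r.Old == nil {
		return "", false
	}
	return *r.Old, true
}

// readN prefers the new shape and falls back to the old one.
func readN(r Row) (string, bool) {
	if r.New != nil {
		return *r.New, true
	}
	return readNMinus1(r)
}

func main() {
	legacy := "555-0100"
	notBackfilled := Row{Old: &legacy} // written before release N, not yet migrated

	var fresh Row
	writeN(&fresh, "555-0199") // written by release N

	fmt.Println(readN(notBackfilled)) // release N reading an unmigrated row
	fmt.Println(readNMinus1(fresh))   // release N-1 reading a dual-written row
}
```

If `writeN` wrote only `New`, `readNMinus1(fresh)` would report a miss, which is the failure the contract phase must not trigger early.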

Why this exists

R3.1 is the dimension that breaks rollouts most often in practice. ALTER TABLE ... ADD COLUMN NOT NULL DEFAULT now() on a live table can block writes for minutes whenever it forces a table rewrite (PostgreSQL < 11, or any version when the default is volatile). Worse: dropping a column in release N while an N-1 client already in production still reads it = a silent 500 for the user. Without a formal pattern, each component invents a different path and the Stack accumulates a debt of unsafe migrations.

By storage tier

Postgres / kdb-next (SQL OLTP)

Expand: adicionar coluna nullable

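-- release N: migration 0042_users_add_phone.up.sql
ALTER TABLE users
  ADD COLUMN phone TEXT;          -- nullable, no default: metadata-only, no table rewrite

-- Index online (won't block writes).
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so this
-- statement must not be wrapped in the migration tool's transaction.
CREATE INDEX CONCURRENTLY idx_users_phone ON users (phone);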

Code change in release N:

// Assumes: import "database/sql"; db is a package-level *sql.DB.

// Write: populate both old and new path during expand
type UserUpdate struct {
    LegacyPhone *string  // deprecated, kept until contract release
    Phone       *string  // new canonical
}

func WriteUser(u UserUpdate) error {
    if u.Phone != nil && u.LegacyPhone == nil {
        u.LegacyPhone = u.Phone  // dual-write
    }
    // INSERT / UPDATE using both columns (statement omitted here)
    return nil
}

// Read: prefer new, fallback to old
func ReadUser(id int64) (User, error) {
    var phone, legacy sql.NullString
    // QueryRow returns a single *sql.Row; the columns must be scanned out
    err := db.QueryRow(
        "SELECT phone, legacy_phone FROM users WHERE id=$1", id,
    ).Scan(&phone, &legacy)
    if err != nil { return User{}, err }
    if phone.Valid { return User{Phone: phone.String}, nil }
    return User{Phone: legacy.String}, nil
}

Migrate: per-tenant backfill

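// run by a separate worker, idempotent, per-tenant checkpoint
// Assumes db is a package-level *sql.DB.
func BackfillPhone(ctx context.Context, tenantID int64) error {
    type item struct{ id int64; legacy string }
    cursor := loadCheckpoint(tenantID)
    for {
        // legacy_phone IS NOT NULL: rows with nothing to copy are skipped
        rows, err := db.QueryContext(ctx, `
            SELECT id, legacy_phone FROM users
            WHERE koder_user_id = $1 AND phone IS NULL
              AND legacy_phone IS NOT NULL AND id > $2
            ORDER BY id LIMIT 1000`,
            tenantID, cursor)
        if err != nil { return err }
        // *sql.Rows has no length; drain it into a slice first
        var batch []item
        for rows.Next() {
            var it item
            if err := rows.Scan(&it.id, &it.legacy); err != nil {
                rows.Close()
                return err
            }
            batch = append(batch, it)
        }
        rows.Close()
        if err := rows.Err(); err != nil { return err }
        if len(batch) == 0 { break }
        for _, it := range batch {
            // "AND phone IS NULL" keeps the UPDATE idempotent and avoids
            // overwriting a value written by live traffic in the meantime
            _, err := db.ExecContext(ctx,
              "UPDATE users SET phone = $1 WHERE id = $2 AND phone IS NULL",
              normalizePhone(it.legacy), it.id)
            if err != nil { return err }
        }
        cursor = batch[len(batch)-1].id
        saveCheckpoint(tenantID, cursor)
    }
    return nil
}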
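The backfill leans on `loadCheckpoint` and `saveCheckpoint`, which this recipe does not define. As a rough illustration of the contract they need to honour, here is a minimal sketch with an in-memory store. `CheckpointStore`, `RunBatches`, and `errSimulatedCrash` are hypothetical names, and a real worker would persist the checkpoint (for example in a table) rather than keep it in a map.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errSimulatedCrash = errors.New("simulated crash")

// CheckpointStore is a hypothetical stand-in for loadCheckpoint/saveCheckpoint.
type CheckpointStore struct {
	mu   sync.Mutex
	last map[int64]int64 // tenantID -> last row ID of the last finished batch
}

func NewCheckpointStore() *CheckpointStore {
	return &CheckpointStore{last: make(map[int64]int64)}
}

// Load returns 0 for a tenant that has never been processed.
func (s *CheckpointStore) Load(tenantID int64) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.last[tenantID]
}

func (s *CheckpointStore) Save(tenantID, rowID int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.last[tenantID] = rowID
}

// RunBatches walks ids (sorted ascending) in batches, resuming after the
// stored checkpoint. An error stops the run and leaves the checkpoint at the
// end of the last fully processed batch.
func RunBatches(s *CheckpointStore, tenantID int64, ids []int64, batch int,
	process func(id int64) error) error {
	if batch <= 0 {
		return errors.New("batch must be positive")
	}
	cursor := s.Load(tenantID)
	var pending []int64
	for _, id := range ids {
		if id > cursor {
			pending = append(pending, id)
		}
	}
	for len(pending) > 0 {
		n := batch
		if n > len(pending) {
			n = len(pending)
		}
		for _, id := range pending[:n] {
			if err := process(id); err != nil {
				return err
			}
		}
		s.Save(tenantID, pending[n-1])
		pending = pending[n:]
	}
	return nil
}

func main() {
	store := NewCheckpointStore()
	ids := []int64{1, 2, 3, 4, 5}

	// First run fails on ID 4, after batch {1, 2} was checkpointed.
	err := RunBatches(store, 7, ids, 2, func(id int64) error {
		if id == 4 {
			return errSimulatedCrash
		}
		return nil
	})
	fmt.Println(err, store.Load(7))

	// Second run resumes at ID 3; IDs 1 and 2 are not revisited.
	err = RunBatches(store, 7, ids, 2, func(id int64) error { return nil })
	fmt.Println(err, store.Load(7))
}
```

ID 3 is processed in both runs, because the crash happened before its batch was checkpointed. That is why the per-row operation has to be idempotent, as the `AND phone IS NULL` guard makes the UPDATE.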

Contract: drop the old column

-- release N+1 (after window R1.1 has elapsed since expand, default 180 days, and after backfill confirmed)
-- migration 0043_users_drop_legacy_phone.up.sql
ALTER TABLE users DROP COLUMN legacy_phone;

Checklist before running 0043:

Stage 1 confirmed: 100% of users.phone populated (run SELECT count(*) FROM users WHERE phone IS NULL AND legacy_phone IS NOT NULL — expect 0).
Window R1.1 elapsed: at least window_duration_days since N rolled out fully. Default 180 days; check koder.toml [compat].
No code path reads legacy_phone (grep the monorepo).
T4 (specs/testing/always-on-recipes.kmd § T4) green for the drop.
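The checklist above can be encoded as a gate that a release script evaluates before applying the drop. The following is only a sketch: `ContractGate`, its fields, and the `date` helper are hypothetical, and how each fact is collected (the count query, the grep, the T4 result) is left to the component.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ContractGate holds the facts the checklist asks for. All names are
// illustrative; nothing here is an existing Koder API.
type ContractGate struct {
	UnmigratedRows int64     // result of the "expect 0" count query
	RolledOutAt    time.Time // when release N finished rolling out
	WindowDays     int       // window_duration_days from koder.toml [compat]
	LegacyReaders  int       // code paths still referencing legacy_phone
	T4Green        bool      // T4 passed for the drop migration
}

// Check returns nil only when every checklist item holds at time now.
func (g ContractGate) Check(now time.Time) error {
	if g.UnmigratedRows != 0 {
		return fmt.Errorf("%d rows still unmigrated", g.UnmigratedRows)
	}
	earliest := g.RolledOutAt.AddDate(0, 0, g.WindowDays)
	if now.Before(earliest) {
		return fmt.Errorf("window R1.1 not elapsed; earliest drop is %s",
			earliest.Format("2006-01-02"))
	}
	if g.LegacyReaders != 0 {
		return fmt.Errorf("%d code paths still read the legacy column", g.LegacyReaders)
	}
	if !g.T4Green {
		return errors.New("T4 is not green for the drop")
	}
	return nil
}

// date is a small helper for building UTC dates in examples.
func date(y, m, d int) time.Time {
	return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC)
}

func main() {
	g := ContractGate{RolledOutAt: date(2026, 1, 1), WindowDays: 180, T4Green: true}
	fmt.Println(g.Check(date(2026, 3, 1))) // inside the window: blocked
	fmt.Println(g.Check(date(2026, 7, 1))) // after the window: allowed
}
```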

Forbidden one-shot patterns

-- ✗ ALTER ... ADD COLUMN NOT NULL DEFAULT … on populated table: full rewrite on PG < 11, or with a volatile default on any version
ALTER TABLE users ADD COLUMN phone TEXT NOT NULL DEFAULT '';

-- ✗ ALTER ... DROP COLUMN inside same release as the code change
ALTER TABLE users DROP COLUMN legacy_phone;  -- in release N, NOT OK

-- ✗ CREATE INDEX (without CONCURRENTLY) on busy table — blocks writes
CREATE INDEX idx_users_phone ON users (phone);

-- ✗ Adding NOT NULL via ALTER without prior backfill
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
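One way to keep these patterns out of a release is a small lint over migration files in CI. The sketch below is a heuristic, not a SQL parser, and `LintMigration` is a hypothetical name. It flags statements for human review; `SET NOT NULL` and `DROP COLUMN` are legitimate in a contract release, so a finding means "confirm the phase", not "always wrong". Comment stripping is naive and would misread a `--` inside a string literal.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	commentRe    = regexp.MustCompile(`--[^\n]*`)
	addNotNullRe = regexp.MustCompile(`(?is)\bADD\s+COLUMN\b.*\bNOT\s+NULL\b`)
	setNotNullRe = regexp.MustCompile(`(?is)\bSET\s+NOT\s+NULL\b`)
	createIdxRe  = regexp.MustCompile(`(?is)\bCREATE\s+(?:UNIQUE\s+)?INDEX\b`)
	concurrentRe = regexp.MustCompile(`(?is)\bCREATE\s+(?:UNIQUE\s+)?INDEX\s+CONCURRENTLY\b`)
	dropColumnRe = regexp.MustCompile(`(?is)\bDROP\s+COLUMN\b`)
)

// LintMigration returns one finding per suspicious statement.
func LintMigration(sql string) []string {
	var findings []string
	clean := commentRe.ReplaceAllString(sql, "") // drop "-- ..." comments
	for _, stmt := range strings.Split(clean, ";") {
		s := strings.TrimSpace(stmt)
		switch {
		case s == "":
			// nothing to check
		case addNotNullRe.MatchString(s):
			findings = append(findings, "ADD COLUMN with NOT NULL: add nullable, backfill, then tighten")
		case setNotNullRe.MatchString(s):
			findings = append(findings, "SET NOT NULL: confirm the backfill reached 100% first")
		case createIdxRe.MatchString(s) && !concurrentRe.MatchString(s):
			findings = append(findings, "CREATE INDEX without CONCURRENTLY")
		case dropColumnRe.MatchString(s):
			findings = append(findings, "DROP COLUMN: only in the contract release")
		}
	}
	return findings
}

func main() {
	bad := "ALTER TABLE users ADD COLUMN phone TEXT NOT NULL DEFAULT '';\n" +
		"CREATE INDEX idx_users_phone ON users (phone);"
	good := "ALTER TABLE users ADD COLUMN phone TEXT; -- nullable\n" +
		"CREATE INDEX CONCURRENTLY idx_users_phone ON users (phone);"

	for _, f := range LintMigration(bad) {
		fmt.Println("flagged:", f)
	}
	fmt.Println("findings in the expand migration:", len(LintMigration(good)))
}
```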

NOT NULL: 3 steps

-- step 1 (release N): add nullable column
ALTER TABLE users ADD COLUMN phone TEXT;

-- step 2 (release N..N+1): backfill via worker (see Migrate section above)

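-- step 3 (release N+1, AFTER 100% populated): tighten constraint
-- SET NOT NULL scans the whole table under an ACCESS EXCLUSIVE lock. On large
-- tables, first add CHECK (phone IS NOT NULL) NOT VALID and VALIDATE it;
-- PG >= 12 then skips the scan.
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;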

kdb (Koder DB — TiKV-backed) — RFC-001 substrate

kdb's key-prefix model and versioned descriptors per rfcs/stack-RFC-001-kdb-as-unified-data-plane.kmd make expand/migrate/contract a key-namespace operation, not DDL.

Expand: dual-write to new key shape

// release N: write to BOTH old and new key namespaces
err = kdb.PutMulti(ctx, []kdb.Op{
    {Key: oldKey(tenantID, itemID), Value: encodeOld(payload)},
    {Key: newKey(tenantID, itemID), Value: encodeNew(payload)},
})

Reads prefer new namespace, fall back to old:

v, err := kdb.Get(ctx, newKey(tenantID, itemID))
if errors.Is(err, kdb.ErrNotFound) {
    v, err = kdb.Get(ctx, oldKey(tenantID, itemID))
}

Migrate: per-tenant range scan + transform

// background job, resumable via checkpoint
func migrateKDBPrefix(ctx context.Context, tenantID int64) error {
    cursor := loadCheckpoint(tenantID)
    for {
        scan, err := kdb.RangeScan(ctx, kdb.Range{
            From:  oldKeyPrefix(tenantID, cursor),
            To:    oldKeyPrefixEnd(tenantID),
            Limit: 1000,
        })
        if err != nil { return err }
        if len(scan) == 0 { break }
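        var ops []kdb.Op
        for _, entry := range scan {
            converted := transform(entry.Value) // named to avoid shadowing the builtin new
            ops = append(ops, kdb.Op{
                Key: newKeyFromOld(entry.Key), Value: converted,
            })
        }
        if err := kdb.PutMulti(ctx, ops); err != nil { return err }
        // Checkpoint the last key seen. The loop only terminates if the next
        // scan starts strictly after this key; if RangeScan treats From as
        // inclusive, oldKeyPrefix must advance past it.
        cursor = scan[len(scan)-1].Key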
        saveCheckpoint(tenantID, cursor)
    }
    return nil
}

Contract: stop writing old prefix, GC old keys

// release N+1 (after migration confirmed)
// 1. Code stops writing old prefix
// 2. Async GC worker deletes oldKey range after window R1.1 expires:
err = kdb.DeleteRange(ctx, kdb.Range{
    From: oldKeyPrefix(tenantID, 0),
    To:   oldKeyPrefixEnd(tenantID),
})

kdb-specific guidance

Range scans + checkpoint: kdb.RangeScan with paging avoids loading whole tenant into memory.
TiKV regions: very large per-tenant migrations may benefit from splitting the range across regions and migrating the pieces in parallel; see infra/data/kdb/docs/ for the canonical split helpers.
Multi-tenancy: per policies/multi-tenant-by-default.kmd, migrate one tenant at a time; don't lock all tenants together.
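Paging a range scan with a checkpoint has one classic trap in byte-ordered key-value stores: if the scan's start bound is inclusive, resuming from the last key returns that key again. A common remedy is to resume from the key's successor, which in lexicographic byte order is the key with a single 0x00 byte appended. The sketch below demonstrates this over an in-memory sorted slice; `scan`, `successor`, and `visitAll` are hypothetical and are not the kdb API, whose bound semantics should be checked in its own docs.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// successor returns the smallest key that sorts strictly after k in
// lexicographic byte order: k with one 0x00 byte appended.
func successor(k []byte) []byte {
	return append(append([]byte{}, k...), 0x00)
}

// scan mimics an inclusive-start, exclusive-end range scan over keys, which
// must be sorted ascending. It stands in for a real store.
func scan(keys [][]byte, from, to []byte, limit int) [][]byte {
	i := sort.Search(len(keys), func(i int) bool {
		return bytes.Compare(keys[i], from) >= 0
	})
	var out [][]byte
	for ; i < len(keys) && len(out) < limit && bytes.Compare(keys[i], to) < 0; i++ {
		out = append(out, keys[i])
	}
	return out
}

// visitAll pages through [from, to) and returns how many keys it visited.
func visitAll(keys [][]byte, from, to []byte, limit int) int {
	visited := 0
	cursor := from
	for {
		page := scan(keys, cursor, to, limit)
		if len(page) == 0 {
			return visited
		}
		visited += len(page)
		// Resume strictly after the last key, so it is not scanned twice.
		cursor = successor(page[len(page)-1])
	}
}

func main() {
	keys := [][]byte{[]byte("t1/a"), []byte("t1/b"), []byte("t1/c"), []byte("t2/a")}
	// "t10" is the exclusive end of the "t1/" prefix: '/' (0x2F) + 1 = '0' (0x30).
	fmt.Println(visitAll(keys, []byte("t1/"), []byte("t10"), 2))
}
```

Resuming from `page[len(page)-1]` itself instead of its successor would make the final page return the same single key forever.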

Redis / KV (hot-path cache or session store)

Expand: dual-write with TTL

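// release N: write to both keys, same TTL
// rdb is assumed to be a go-redis client (*redis.Client); naming the variable
// "redis" would shadow the package
pipe := rdb.Pipeline()
pipe.Set(ctx, oldKey(uid), encodeOld(v), ttl)
pipe.Set(ctx, newKey(uid), encodeNew(v), ttl)
_, err := pipe.Exec(ctx)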

Read prefers new, fallback to old:

// rdb is assumed to be a go-redis client; redis.Nil is the package's "key not found" error
v, err := rdb.Get(ctx, newKey(uid)).Result()
if errors.Is(err, redis.Nil) {
    v, err = rdb.Get(ctx, oldKey(uid)).Result()
}

Migrate: scan + transform

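// SCAN cursor-based iteration; idempotent
// rdb is assumed to be a go-redis client (*redis.Client)
var cursor uint64
for {
    keys, next, err := rdb.Scan(ctx, cursor, oldKeyPattern(), 1000).Result()
    if err != nil { return err }
    for _, k := range keys {
        v, err := rdb.Get(ctx, k).Result()
        if errors.Is(err, redis.Nil) { continue } // expired between SCAN and GET
        if err != nil { return err }
        ttl, err := rdb.TTL(ctx, k).Result()
        if err != nil { return err }
        if ttl < 0 { ttl = 0 } // negative TTL means no expiry; 0 tells Set "no expiration"
        // SetNX: never clobber a fresher value already dual-written to the new key
        if err := rdb.SetNX(ctx, translateKey(k), transform(v), ttl).Err(); err != nil {
            return err
        }
    }
    if next == 0 { break }
    cursor = next
}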

Contract: drop old prefix

// release N+1: stop dual-write, then GC
// Option A: rely on TTL expiry (waits up to max TTL)
// Option B: scan + DEL the old prefix explicitly

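// rdb is assumed to be a go-redis client (*redis.Client)
for cursor := uint64(0); ; {
    keys, next, err := rdb.Scan(ctx, cursor, oldKeyPattern(), 1000).Result()
    if err != nil { return err }
    if len(keys) > 0 {
        if err := rdb.Del(ctx, keys...).Err(); err != nil { return err }
    }
    if next == 0 { break }
    cursor = next
}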

Redis-specific guidance

KEYS pattern is forbidden in production (blocks the single Redis thread). Use SCAN cursor iteration.
For tens of millions of keys, batch DEL in groups of 1000 with a sleep between batches to avoid latency spikes on the single Redis thread.
TTL preservation during migrate: read TTL of each key, set new key with same remaining TTL.
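The batching advice can be sketched without a Redis server by hiding the delete call behind a one-method interface. `Deleter`, `DeleteInBatches`, `fakeRedis`, and `runDemo` are hypothetical names introduced for this example; a go-redis client could be adapted to `Deleter` with a thin wrapper around its `Del` call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Deleter is the single call this sketch needs from a Redis client.
type Deleter interface {
	Del(ctx context.Context, keys ...string) error
}

// DeleteInBatches removes keys in groups of batchSize, pausing between groups
// so other commands get a turn on the Redis thread.
func DeleteInBatches(ctx context.Context, d Deleter, keys []string,
	batchSize int, pause time.Duration) error {
	if batchSize <= 0 {
		return fmt.Errorf("batchSize must be positive, got %d", batchSize)
	}
	for start := 0; start < len(keys); start += batchSize {
		end := start + batchSize
		if end > len(keys) {
			end = len(keys)
		}
		if err := d.Del(ctx, keys[start:end]...); err != nil {
			return err
		}
		if end == len(keys) {
			break // no pause needed after the final batch
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // stop promptly when the worker is cancelled
		case <-time.After(pause):
		}
	}
	return nil
}

// fakeRedis records batch sizes so the behaviour is observable in a test.
type fakeRedis struct{ batches []int }

func (f *fakeRedis) Del(_ context.Context, keys ...string) error {
	f.batches = append(f.batches, len(keys))
	return nil
}

// runDemo deletes n generated keys through the fake and reports batch sizes.
func runDemo(n, batchSize int) ([]int, error) {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("sess:v1:%d", i)
	}
	f := &fakeRedis{}
	err := DeleteInBatches(context.Background(), f, keys, batchSize, time.Millisecond)
	return f.batches, err
}

func main() {
	fmt.Println(runDemo(2500, 1000))
}
```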

S3 / object storage (kdrive blobs, attachments, exports)

Expand: dual-write paths

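// release N: upload to both new and legacy paths
// s3client is assumed to be an AWS SDK for Go v2 *s3.Client; ptrf is assumed
// to be a local helper that formats a string and returns a *string
oldKey := fmt.Sprintf("koder/uploads/%d/%s", oldFlat, blobID)
newKey := fmt.Sprintf("koder/%d/%s/%s", uid, wsID, blobID)
if _, err := s3client.PutObject(ctx, &s3.PutObjectInput{
    Bucket: &bucket, Key: &newKey, Body: body,
}); err != nil { return err }
// server-side copy new -> legacy, so the body is uploaded only once
if _, err := s3client.CopyObject(ctx, &s3.CopyObjectInput{
    Bucket:     &bucket,
    CopySource: ptrf("%s/%s", bucket, newKey),
    Key:        &oldKey,
}); err != nil { return err }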

Reads prefer new, fall back to old:

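// s3client is assumed to be an AWS SDK for Go v2 *s3.Client
obj, err := s3client.GetObject(ctx, &s3.GetObjectInput{Bucket: &bucket, Key: &newKey})
if isNotFound(err) {
    obj, err = s3client.GetObject(ctx, &s3.GetObjectInput{Bucket: &bucket, Key: &oldKey})
}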

Migrate: server-side copy per tenant

// background job: uses S3's server-side copy (no download/upload)
listInput := &s3.ListObjectsV2Input{
    Bucket: &bucket,
    Prefix: ptrf("koder/uploads/%d/", oldFlat),
}
paginator := s3.NewListObjectsV2Paginator(s3client, listInput)
for paginator.HasMorePages() {
    page, err := paginator.NextPage(ctx)
    if err != nil { return err }
    for _, obj := range page.Contents {
        newKey := translateS3Key(*obj.Key)
        // CopyObject is a method on the client, not the package; surface
        // failures so the job can be retried
        _, err := s3client.CopyObject(ctx, &s3.CopyObjectInput{
            Bucket:     &bucket,
            CopySource: ptrf("%s/%s", bucket, *obj.Key),
            Key:        &newKey,
        })
        if err != nil { return err }
    }
}

Contract: lifecycle policy or batch delete

// release N+1: lifecycle rule that expires the legacy prefix after the R1.1 window
// (Days is counted from each object's creation time, not from when the rule is enabled)
{
  "Rules": [{
    "ID": "expire-legacy-uploads",
    "Filter": { "Prefix": "koder/uploads/" },
    "Status": "Enabled",
    "Expiration": { "Days": 180 }
  }]
}

S3-specific guidance

Server-side copy (s3:CopyObject) avoids egress and is atomic per object.
For very large buckets, S3 Batch Operations handles billions of objects with retry and reporting.
Multipart blobs > 5 GB require s3:UploadPartCopy; same idempotency story.
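For blobs above the single-copy limit, each UploadPartCopy call takes a byte range of the source in the form `bytes=first-last` (zero-based, inclusive). The sketch below only computes those ranges; it makes no S3 calls, and `CopyRanges` is a hypothetical helper name. The part size is the caller's choice within S3's per-part limits.

```go
package main

import (
	"errors"
	"fmt"
)

const maxParts = 10000 // S3 multipart uploads allow at most 10,000 parts

// CopyRanges splits an object of size bytes into inclusive byte ranges,
// formatted for the CopySourceRange parameter of UploadPartCopy.
func CopyRanges(size, partSize int64) ([]string, error) {
	if size <= 0 || partSize <= 0 {
		return nil, errors.New("size and partSize must be positive")
	}
	if (size+partSize-1)/partSize > maxParts {
		return nil, fmt.Errorf("partSize %d needs more than %d parts", partSize, maxParts)
	}
	var ranges []string
	for first := int64(0); first < size; first += partSize {
		last := first + partSize - 1
		if last >= size {
			last = size - 1 // final part is usually shorter
		}
		ranges = append(ranges, fmt.Sprintf("bytes=%d-%d", first, last))
	}
	return ranges, nil
}

func main() {
	const GiB = int64(1) << 30
	ranges, err := CopyRanges(12*GiB, 5*GiB)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, r := range ranges {
		fmt.Printf("part %d: %s\n", i+1, r)
	}
}
```

Because the ranges are a pure function of size and part size, a retried job recomputes the same parts, which keeps the copy resumable.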

Common pitfalls (per-tier)

| Pitfall | Why it bites | Fix |
|---|---|---|
| Single-release ADD + use | Old code on N-1 servers can't read the new column; deserialization fails | Always expand+migrate+contract across ≥ 3 releases |
| Migrate runs synchronously in API request | Request times out on large tenants; user sees 504 | Background worker, per-tenant checkpoint, idempotent |
| Backfill without `WHERE phone IS NULL` | Re-overwrites already-migrated rows; race with live writes | Always filter on "still in old form" |
| DROP COLUMN in same release as code change | Client N-1 still in production tries to read dropped column → 500 | DROP only after window R1.1 elapsed |
| Lifecycle delete before clients drained | Reads from N-1 clients hit 404 (object gone) | Lifecycle window ≥ R1.1 `window_duration_days` |
| Migration job without resumability | One crash mid-tenant restarts the whole tenant; idempotency check expensive | Per-tenant checkpoint, idempotent UPDATEs |

Test pairing

Each migration MUST pair with T4 (specs/testing/always-on-recipes.kmd § T4):

Restore prod-like snapshot.
Start sustained read+write load.
Apply migration in another connection.
Assert peak per-second latency < 100 ms.

Migrations that fail T4 don't ship until reworked into the expand-migrate-contract pattern.
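The final assertion can be sketched as a function over latency samples collected during the load run. This is an interpretation, not the T4 definition: it assumes "peak per-second latency" means the worst latency observed in each one-second bucket, and T4 may define it differently (for example as a per-second percentile). `Sample`, `BreachedSeconds`, `mk`, and `checkT4` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Sample is one request observed during the T4 run.
type Sample struct {
	At      time.Time     // when the request completed
	Latency time.Duration // how long it took
}

// BreachedSeconds buckets samples by whole seconds elapsed since start, takes
// the worst latency in each bucket, and returns the buckets at or above limit
// in ascending order.
func BreachedSeconds(samples []Sample, start time.Time, limit time.Duration) []int {
	worst := make(map[int]time.Duration) // second offset -> max latency
	for _, s := range samples {
		sec := int(s.At.Sub(start) / time.Second)
		if s.Latency > worst[sec] {
			worst[sec] = s.Latency
		}
	}
	var breached []int
	for sec, l := range worst {
		if l >= limit {
			breached = append(breached, sec)
		}
	}
	sort.Ints(breached)
	return breached
}

var t0 = time.Date(2026, 5, 24, 12, 0, 0, 0, time.UTC) // arbitrary run start

// mk builds a sample from millisecond offsets, to keep examples short.
func mk(offsetMs, latencyMs int) Sample {
	return Sample{
		At:      t0.Add(time.Duration(offsetMs) * time.Millisecond),
		Latency: time.Duration(latencyMs) * time.Millisecond,
	}
}

// checkT4 applies the 100 ms limit from the recipe.
func checkT4(samples []Sample) []int {
	return BreachedSeconds(samples, t0, 100*time.Millisecond)
}

func main() {
	run := []Sample{mk(100, 20), mk(900, 40), mk(1500, 250), mk(2100, 99)}
	if b := checkT4(run); len(b) > 0 {
		fmt.Println("T4 failed in seconds:", b)
		return
	}
	fmt.Println("T4 passed")
}
```

Reporting the offending seconds rather than a single boolean makes it easier to line a latency spike up with the moment the migration took its lock.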

Per-component override

A component can TIGHTEN the migration cadence (e.g., window_duration_days = 365 implies DROP only after 365 days elapsed). It cannot LOOSEN it below the Stack defaults (R1.1).

<component>/koder.toml [compat] is the source of truth for the component's own window; recipes here use the Stack default 180.
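The tighten-only rule is easy to enforce when the component's config is loaded. A minimal sketch, assuming an override of 0 means "not set in koder.toml" (that convention and the name `EffectiveWindow` are assumptions of this example):

```go
package main

import "fmt"

const stackDefaultWindowDays = 180 // Stack default used by the recipes in this spec

// EffectiveWindow returns the window a component must honour. override is the
// component's window_duration_days, or 0 when koder.toml does not set it.
func EffectiveWindow(override int) (int, error) {
	switch {
	case override == 0:
		return stackDefaultWindowDays, nil
	case override < stackDefaultWindowDays:
		return 0, fmt.Errorf("window_duration_days=%d loosens the Stack default of %d",
			override, stackDefaultWindowDays)
	default:
		return override, nil // equal or longer: tightening is allowed
	}
}

func main() {
	for _, o := range []int{0, 365, 90} {
		days, err := EffectiveWindow(o)
		fmt.Println(o, days, err)
	}
}
```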

Status

v0.1 (2026-05-24): 4 tiers covered (Postgres/kdb-next, kdb, Redis, S3). Patterns extracted from services/foundation/id/engine LGPD cascade migration + multi-tenant rollout.
v0.2 planned: vector DB (embedding migrations between dims/models), time-series Timescale hypertables, queue/stream schema evolution (event versioning).
Promotion to v1.0: when ≥ 3 components ship a real expand-migrate-contract migration following these recipes and validate them in production rollouts.

References

policies/always-on.kmd
policies/multi-tenant-by-default.kmd
specs/testing/always-on-recipes.kmd
rfcs/stack-RFC-001-kdb-as-unified-data-plane.kmd