plox: lazy-trust verifiable, on-PDS bulk did:plc operation archival

atproto ethos as applied to open data? an experiment

tl;dr: plox is a canonical encoding of the did:plc directory, stored in an atproto PDS. as of may 2026, at://cerulea.blue/blue.cerulea.plox.bundle takes up 35 GB on disk (zstd-compressed; ≈80 GB raw). you can use it to bootstrap a PLC directory mirror in minutes rather than hours, and since anyone can recompute the plox CIDs and verify them against the canonical plc.directory, you get lazy-trust properties: use the data now, check it whenever you like.

anyway, i’ll get to the long version:

stable identity on atproto

please humor me while i define my terms:

a DID is essentially a reference to a blob of user-controlled JSON (the “DID Document”): you “resolve” (or “dereference”) the DID to get the contents of the DID doc. on the mainline AT Protocol, we can choose between did:plc and did:web. these have a bunch of different trade-offs, but in principle a did:web points to a location, while a did:plc refers to an object.

to resolve an atproto did:web, you basically send an HTTPS request for /.well-known/did.json to the hostname in the identifier [1]. web DIDs are ripe for shenanigans: whoever controls the DID becomes privy to metadata about when, how, and by which applications their DID is dereferenced, can serve different versions of the DID document to different requesters [2], or can perform other HTTP attacks (e.g. slow-loris) against a requesting application. if your application is resolving a did:web, make sure you guard against slow, very large, or otherwise unusual responses.
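to make the "guard against slow or very large responses" advice concrete, here is a minimal python sketch. the size cap, timeouts, and function names are all made up for illustration (not taken from any atproto library), and it assumes atproto's restriction to hostname-level did:web identifiers, so anything with a path or port is rejected outright.

```python
import json
import time
import urllib.request

# All three limits are arbitrary values picked for this sketch.
MAX_DID_DOC_BYTES = 64 * 1024   # refuse documents larger than this
SOCKET_TIMEOUT_SECONDS = 5      # per-socket-operation timeout
TOTAL_DEADLINE_SECONDS = 10     # overall budget for reading the body


def did_web_url(did: str) -> str:
    """Map a hostname-level did:web to its well-known document URL."""
    prefix = "did:web:"
    if not did.startswith(prefix):
        raise ValueError("not a did:web identifier")
    host = did[len(prefix):]
    # Assumption: only bare hostnames are accepted (no path segments, no ports).
    if not host or any(ch in host for ch in ":/%?#@"):
        raise ValueError("unsupported did:web identifier")
    return f"https://{host}/.well-known/did.json"


def read_capped(stream, limit: int, deadline: float) -> bytes:
    """Read a body in small pieces, enforcing a byte cap and a total deadline.

    The deadline is only checked between reads, so the per-socket timeout
    is still needed to bound each individual read.
    """
    chunks, total = [], 0
    while True:
        if time.monotonic() > deadline:
            raise TimeoutError("DID document took too long to arrive")
        chunk = stream.read1(4096)  # returns whatever is available, up to 4096 bytes
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise ValueError("DID document exceeds size cap")
        chunks.append(chunk)
    return b"".join(chunks)


def resolve_did_web(did: str) -> dict:
    """Fetch and parse a did:web document with the guards above applied."""
    url = did_web_url(did)
    deadline = time.monotonic() + TOTAL_DEADLINE_SECONDS
    with urllib.request.urlopen(url, timeout=SOCKET_TIMEOUT_SECONDS) as resp:
        body = read_capped(resp, MAX_DID_DOC_BYTES, deadline)
    return json.loads(body)
```

note that urlopen follows redirects by default; a stricter resolver would probably want to refuse or limit those too.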

to resolve a did:plc, you ask the authoritative “DID PLC Directory” (at plc.directory) to give you the document for the given identifier. this is a severe centralization risk (see below), but the upshot is that we have very cheap resolution of an object that’s referred to by a non-arbitrary, deterministic identifier: PLC DIDs are controlled by cryptographic keys, and their identifiers are a truncated hash of the creation data, so replicas of the ledger can validate and reproduce the exact same ID and key mutation log when presented with the same (signed) operations [3].
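as a rough illustration of "a truncated hash of the creation data": to my understanding of the did:plc method, the identifier is the SHA-256 of the signed genesis operation (encoded as DAG-CBOR), base32-encoded, lowercased, and cut to 24 characters. the sketch below assumes you already have those encoded bytes in hand (the CBOR encoding step is skipped entirely), and the function name is made up.

```python
import base64
import hashlib


def plc_did_from_genesis(signed_genesis_op_cbor: bytes) -> str:
    """Derive a did:plc identifier from an already-encoded signed genesis op.

    Assumption: the input is the DAG-CBOR encoding of the signed genesis
    operation; producing that encoding is out of scope for this sketch.
    """
    digest = hashlib.sha256(signed_genesis_op_cbor).digest()
    # Standard RFC 4648 base32, lowercased; only the first 24 chars are kept,
    # so the trailing padding never ends up in the identifier.
    b32 = base64.b32encode(digest).decode("ascii").lower()
    return "did:plc:" + b32[:24]
```

because the function is pure, any replica fed the same genesis bytes computes the same identifier, which is the property that makes mirroring the ledger verifiable.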

but there’s a problem: the directory of PLC DIDs only allows you to submit (in testing) 500 requests every 5 minutes. this is fine for most atproto services, but not for running a full-scale relay (where every DID must be dereferenced to check if a PDS/signing key is authoritative for an event) or for extremely scaled-up atproto apps (where 500 DID cache misses in 5 minutes is plausible). this is reason A to run a read replica.

reason B to run a replica is that we need one in the case of did:plc exit: we’ve all (passively) agreed to use the central service controlled by Bluesky PBC (for now; a Verein soon) to look up PLC DIDs. therefore, they essentially form a Proof-by-Decree ledger (fiat DIDs, if you will): the central directory is supposed to accept all valid operations (subject to rate limits), but can technically refuse at any time. in that case, we can all agree on a directory run by someone else, but they need to already have a complete copy of the operation log: we don’t want to have to beg our adversary for it.

running a replica

a PLC directory has to perform the following tasks:

to be able to mirror all the operations, we must first fetch all the operations. as mentioned before, we can only submit a request every ≈600ms (5min / 500 requests), but there are almost a hundred million plc ops. at up to 1,000 ops per /export request, that is a hundred thousand /export requests in sequence: about 17 hours of non-stop backfill!
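the arithmetic above, spelled out as a tiny python sketch. the 1,000-ops-per-request page size is inferred from the numbers in this post (a hundred million ops over a hundred thousand requests), and the function name is just for illustration.

```python
import math


def backfill_estimate(total_ops, ops_per_request=1000,
                      requests_per_window=500, window_seconds=300):
    """Return (number of /export requests, seconds needed at the rate limit)."""
    requests = math.ceil(total_ops / ops_per_request)
    # 500 requests per 300 s works out to one request every 0.6 s.
    seconds = requests * window_seconds / requests_per_window
    return requests, seconds


requests, seconds = backfill_estimate(100_000_000)
print(requests, round(seconds / 3600, 1))  # 100000 16.7
```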

in comparison, plox bundles contain 500,000 records each and can be fetched in parallel (from as many separate PDSes as are hosting the data). you backfill up to the latest plox snapshot, then cut over to the plc directory. since you will probably be at most 500,000 records behind at that point, you only have to make at most about 500 /export requests to the canonical directory, which takes about five minutes.
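a sketch of that backfill-then-cut-over shape in python. everything here is hypothetical: `fetch_bundle` stands in for however you actually download and decode one plox bundle (the real record format is not shown), and the page size of 1,000 ops per /export request is the same inferred number as before.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def backfill_from_bundles(bundle_ids, fetch_bundle, max_workers=8):
    """Fetch bundles in parallel and return their operations in bundle order.

    `fetch_bundle` is a caller-supplied (hypothetical) function that takes a
    bundle id and returns that bundle's operations as a list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even if downloads finish
        # out of order, so the operation log stays correctly sequenced.
        bundles = list(pool.map(fetch_bundle, bundle_ids))
    return [op for bundle in bundles for op in bundle]


def catch_up_requests(ops_behind, ops_per_request=1000):
    """How many sequential /export requests the cut-over needs."""
    return math.ceil(ops_behind / ops_per_request)
```

after the bundles are applied, the last operation you hold tells you where to resume against the canonical directory; with one bundle's worth of lag, `catch_up_requests(500_000)` comes out to 500 requests, i.e. one rate-limit window.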

release

the plox source code is on tangled and is still work-in-progress & evolving - expect some related projects to come soon, like scripts to bootstrap the reference plc replica from the plox data & cerulea’s own plc replica implementation.