
© 2026 Abel Sintaro. All rights reserved.


Scalable RAG Pipelines

Architecture
February 15, 2026

For my Technical Ledger section, I wanted the AI to actually understand the content — not just keyword-match it. The goal of this script is simple: pull my ledger entries from DatoCMS, turn them into clean embeddings with OpenAI, and store everything in Upstash Vector so the assistant can retrieve the right context later.

This isn’t meant to be overly clever. It’s meant to be reliable and easy to rerun whenever new ledger entries are published.


The Retrieval Dilemma

The tricky part wasn’t generating embeddings — that part is straightforward. The real challenge was deciding what the embedding input should look like so retrieval stays accurate as the ledger grows.

Each ledger entry in DatoCMS is modular. Instead of embedding one big blob of markdown, I flatten the structured promptNotes into a predictable format. That way the model sees clear semantic boundaries instead of noisy free-form text.
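To make that concrete, here is a minimal sketch of the flattening step. The `PromptNoteBlock` type is my illustrative guess at the shape; the `contextTitle` and `contextContent` fields come from the actual script shown at the end of this post.

```typescript
// Hypothetical shape of one modular promptNotes block from DatoCMS.
type PromptNoteBlock = {
  contextTitle: string;
  contextContent: string;
};

// Flatten the modular blocks into one labeled string so the embedding
// model sees clear semantic boundaries instead of free-form text.
function flattenPromptNotes(blocks: PromptNoteBlock[]): string {
  return blocks
    .map((block) => `[${block.contextTitle}]\n${block.contextContent}`)
    .join('\n\n');
}

const flattened = flattenPromptNotes([
  { contextTitle: 'Problem', contextContent: 'Retrieval drifts as data grows.' },
  { contextTitle: 'Approach', contextContent: 'Flatten blocks before embedding.' },
]);
```

The bracketed titles act as lightweight section markers, which keeps semantically distinct blocks from bleeding into one another inside a single embedding.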


Scalability at the Edge

This script itself runs as an offline seed step, but I still optimized the flow to behave well at scale. The important decision was batching embeddings instead of generating them one by one.

Once the CMS data is fetched, everything flows through the same pipeline: normalize → embed → upsert. Even if the ledger grows significantly, the process doesn’t really change — it just handles more rows.
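The flow can be sketched roughly like this; every name here is illustrative rather than the real script's API, but the shape of the pipeline is the same:

```typescript
// A minimal sketch of the normalize -> embed -> upsert flow.
type SeedRow = { id: string; text: string };

async function runSeed(
  fetchRows: () => Promise<SeedRow[]>,
  embedBatch: (texts: string[]) => Promise<number[][]>,
  upsertAll: (records: { id: string; vector: number[] }[]) => Promise<void>,
): Promise<number> {
  const rows = await fetchRows();
  if (rows.length === 0) return 0; // nothing to seed, bail out early

  // One batched embedding request instead of one request per row.
  const vectors = await embedBatch(rows.map((r) => r.text));

  await upsertAll(rows.map((r, i) => ({ id: r.id, vector: vectors[i] })));
  return rows.length;
}
```

Because each stage is just a function of the previous stage's output, growing the ledger only changes how much data moves through the pipe, not the pipe itself.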



Infrastructure Considerations

The seeding flow is intentionally linear and easy to reason about. First, I fetch all ledger entries from DatoCMS. If nothing comes back, the script exits early so CI doesn’t silently succeed with empty data.
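A sketch of that guard (the helper name is mine; the real script simply checks the fetched array and exits):

```typescript
// Illustrative guard: fail loudly instead of seeding an empty index.
// In the actual script this would call process.exit(1) so CI fails
// visibly; throwing keeps the sketch self-contained and testable.
function guardNonEmpty<T>(rows: T[], label: string): T[] {
  if (rows.length === 0) {
    throw new Error(`No ${label} returned from DatoCMS, aborting seed.`);
  }
  return rows;
}

const ledgers = guardNonEmpty(
  [{ slug: 'scalable-rag-pipelines' }],
  'ledger entries',
);
```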

From there, each ledger is transformed into a clean embedding payload. The helper is doing most of the important work here — flattening the modular content while preserving meaning.

This produces consistent, retrieval-friendly text instead of raw CMS noise.

When it’s time to generate embeddings, everything is sent in one batch. This keeps the process fast and avoids unnecessary API overhead.

Finally, I upsert into Upstash Vector together with rich metadata. The metadata is important because the retrieval layer later uses things like slug, category, and title to build the final response and links.
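As an illustration of the record shape: `slug`, `category`, and `title` are the metadata fields mentioned above, while the surrounding structure is my assumption about how the records are assembled.

```typescript
// Illustrative upsert payload: one record per ledger entry, carrying
// the metadata the retrieval layer needs to build responses and links.
type LedgerMeta = { slug: string; category: string; title: string };
type VectorRecord = { id: string; vector: number[]; metadata: LedgerMeta };

function toRecords(
  items: ({ id: string } & LedgerMeta)[],
  embeddings: number[][],
): VectorRecord[] {
  return items.map((item, i) => ({
    id: item.id,
    vector: embeddings[i], // embeddings come back in input order
    metadata: { slug: item.slug, category: item.category, title: item.title },
  }));
}

const records = toRecords(
  [{ id: '1', slug: 'scalable-rag-pipelines', category: 'Architecture', title: 'Scalable RAG Pipelines' }],
  [[0.01, 0.02]],
);
```

Relying on the batch API returning embeddings in input order is what makes the positional `embeddings[i]` pairing safe.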

At this point the ledger content becomes searchable context for the AI.
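The retrieval side is not part of this script, but to show why the metadata matters, here is a hypothetical sketch of turning query matches into the links an answer cites; the `Match` shape and the `/ledgers/` route are assumptions for illustration:

```typescript
// Sketch: map vector-store matches to citation links for the assistant.
type Match = {
  id: string;
  score: number;
  metadata?: { slug: string; title: string; category: string };
};

function toSourceLinks(matches: Match[]): { label: string; href: string }[] {
  return matches
    .filter((m) => m.metadata !== undefined)
    .map((m) => ({
      label: `${m.metadata!.title} (${m.metadata!.category})`,
      href: `/ledgers/${m.metadata!.slug}`, // illustrative route
    }));
}
```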


Conclusion

This script is the quiet workhorse behind the Technical Ledger assistant. It takes structured knowledge from DatoCMS, normalizes it into embedding-friendly text, and pushes everything into Upstash Vector in one clean pass.

Nothing fancy — just a predictable pipeline that I can rerun whenever new ledger entries ship. The result is that the AI responses stay grounded in real content instead of drifting into generic answers.

The key snippets from the seed script:

```typescript
// scripts/seed.ts
// Setup: load env vars, connect to Upstash Vector and OpenAI.
import { Index } from '@upstash/vector';
import OpenAI from 'openai';
import * as dotenv from 'dotenv';

dotenv.config({ path: '.env' });

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
```

```typescript
// Fetch all ledger entries from DatoCMS. datoCMS, getCombinedQuery and
// allTechnicalLedgersQuery are project helpers defined elsewhere.
const {
  allTechnicalLedgers,
} = await datoCMS({
  query: getCombinedQuery([allTechnicalLedgersQuery]),
  variables: { locale: 'en' },
});
```

```typescript
// Flatten each entry's modular promptNotes into one labeled string.
const flattenedContext = note.promptNotes
  .map((block) => `[${block.contextTitle}]\n${block.contextContent}`)
  .join('\n\n');
```

```typescript
// Generate embeddings for every entry in a single batched call.
const embeddingsResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: seedData.map((item) => item.text),
});
```

```typescript
// Upsert all records (vectors plus metadata) into Upstash Vector.
await index.upsert(records);
```