OpenSearch with Zero ETL: Why We Still Keep a Wrench Handy

Abstract

On paper, AWS’s Zero ETL from DynamoDB to OpenSearch sounds like magic: define a pipeline, load your table, and watch your index build itself. In practice? It works great, until it doesn’t. From unexpected mapping gaps to painful data resets, branch management quirks, and cost gotchas, we learned that “zero ETL” still requires hands-on control to achieve predictable results.


Why We Did It This Way

Our architecture uses DynamoDB as the primary store for operational data, and OpenSearch for search and analytics. AWS’s Zero ETL pipelines looked like a perfect fit: no custom ETL scripts, direct sync from table to index, automatic mapping creation.

The challenge? We also use Amplify for our infrastructure, and Amplify’s handling of OpenSearch in multi-branch setups was (and still is) problematic. It wants to create a separate OpenSearch domain for each branch, along with separate pipelines, which multiplies cost and complexity. Our architecture works better with:

  • One shared domain for all test environments, with different pipelines per environment.
  • A separate production domain, scaled and managed independently from the test domain.

This required us to manage OpenSearch domains and pipelines manually, outside of Amplify’s default resource management.

How Zero ETL Works in Practice

When you first define a Zero ETL pipeline and populate your DynamoDB table, OpenSearch automatically:

  1. Creates the index.
  2. Builds the mappings based on the initial data it sees.

That’s the catch: the mappings are inferred from whatever fields are present in that first sync. If a field is null for all rows during that initial load, no mapping is created for it, and future data for that field won’t be indexed.
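
If you’re not sure which mappings the initial sync actually produced, you can check before data goes missing. Here’s a minimal sketch using Python’s requests library against the domain’s REST API; the endpoint, index name, and credentials are placeholders, and it assumes basic auth with fine-grained access control:

    import requests

    DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
    INDEX = "my-zero-etl-index"                              # placeholder index name
    AUTH = ("master-user", "master-password")                # placeholder credentials

    # Fetch the mappings OpenSearch inferred during the initial sync.
    resp = requests.get(f"{DOMAIN}/{INDEX}/_mapping", auth=AUTH, timeout=30)
    resp.raise_for_status()

    inferred = resp.json()[INDEX]["mappings"].get("properties", {})
    print(sorted(inferred))  # compare against the full field list you expect

Any field that was null across every row of the first load simply won’t appear in that list.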

At the time, our approach was to drop and rebuild the index when this happened, because it was the fastest way to ensure all mappings were correct. Later, we learned that you can add new fields to an existing mapping via the _mapping API — but if the field type needs to change, a full rebuild is still required.

The Mapping Gap Problem

We ran into this with a dataset that intentionally included sparse fields. The result: OpenSearch silently skipped creating mappings for them, and the data never showed up in the index.

Our fix at the time (sketched below):

  • Drop the index.
  • Recreate it with the correct mappings.
  • Push the data through the pipeline again.
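
The drop-and-recreate steps are scriptable. A minimal sketch, reusing the same placeholder names, with a trimmed-down mapping for illustration; once the index exists with explicit mappings, the pipeline repopulates it:

    import requests

    DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
    INDEX = "my-zero-etl-index"                              # placeholder index name
    AUTH = ("master-user", "master-password")                # placeholder credentials

    # Drop the index with the incomplete mappings.
    requests.delete(f"{DOMAIN}/{INDEX}", auth=AUTH, timeout=30).raise_for_status()

    # Recreate it with explicit mappings so sparse fields are covered up front.
    body = {
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "lastUpdated": {"type": "date"},
                "sparseField": {"type": "text"},  # hypothetical sparse field
            }
        }
    }
    requests.put(f"{DOMAIN}/{INDEX}", json=body, auth=AUTH, timeout=30).raise_for_status()

Explicit mappings cost a little setup time up front, but they take the guesswork out of what the first sync happened to infer.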

What we know now:

  • You can add missing fields via _mapping if you catch it early (see the sketch after this list).
  • For type changes or multiple mismatches, rebuilding is still the most straightforward option.
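
Adding a missing field to an existing index is a single call to the _mapping API. A minimal sketch with the same placeholder names; “missingField” is hypothetical, and existing field types can’t be changed this way:

    import requests

    DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
    INDEX = "my-zero-etl-index"                              # placeholder index name
    AUTH = ("master-user", "master-password")                # placeholder credentials

    # Map a field that was null during the initial sync.
    body = {"properties": {"missingField": {"type": "keyword"}}}  # hypothetical field
    resp = requests.put(f"{DOMAIN}/{INDEX}/_mapping", json=body, auth=AUTH, timeout=30)
    resp.raise_for_status()

Keep in mind this only applies to documents indexed from then on; rows that synced while the field was unmapped still need to be pushed through the pipeline again.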

Data Reset Pain

Testing or rebuilding an index often means clearing data from DynamoDB and re-ingesting it. That’s easy for small datasets, but with 100,000+ rows, it gets slow:

  • DynamoDB batch operations are capped at 25 items per request.
  • The DynamoDB console lets you bulk delete 300 rows at a time, but it’s still tedious at scale.

We did build a Python script to force updates (e.g., changing a lastUpdated timestamp) so the pipeline would reprocess rows, but it was still constrained by the same 25-item batch limit, making it slow for very large datasets.
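
That forced-update pass looked roughly like the sketch below: scan the table page by page and rewrite each item with a fresh lastUpdated. The table name is a placeholder; note that boto3’s batch_writer batches the puts for you, but each underlying BatchWriteItem call is still capped at 25 items.

    import time

    import boto3

    table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        with table.batch_writer() as batch:
            for item in page["Items"]:
                # A changed value is a real modification, so the stream
                # fires and the pipeline reprocesses the row.
                item["lastUpdated"] = int(time.time() * 1000)
                batch.put_item(Item=item)
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

Even so, at 100,000+ rows this takes a while; the 25-item ceiling is the bottleneck.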

We looked into Step Functions and parallelized Lambda processing for faster deletes and updates, but since Step Functions aren’t “native” to Amplify, our first attempts weren’t cohesive with our deployment model. It’s on our list to revisit.

Touching Rows Doesn’t Work (Without Changes)

We experimented with “touching” rows: issuing an update call without actually changing any data, hoping to trigger DynamoDB Streams. This doesn’t work, because DynamoDB Streams only fire on real changes.

The workaround was to actually update a field (like a timestamp), which successfully triggers the Zero ETL pipeline.
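
For a single row, that looks like a plain update_item with a new timestamp. A minimal sketch, with placeholder table name and key schema:

    import time

    import boto3

    table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

    # Writing a new timestamp is a real change, so a stream record is emitted
    # and the Zero ETL pipeline picks the row back up.
    table.update_item(
        Key={"id": "some-row-id"},  # placeholder key schema
        UpdateExpression="SET lastUpdated = :now",
        ExpressionAttributeValues={":now": int(time.time() * 1000)},
    )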

The Mapping Mystery Bug

In one case, everything looked fine: DynamoDB had all the right data, the mappings looked correct, and yet the index wasn’t updating. The fix? Another manual mapping update.

This reinforced the lesson that Zero ETL is not a set-and-forget process. Sometimes it needs a nudge, or even a full rebuild.

Cost Control Tips

We discovered two key cost factors for OpenSearch pipelines:

  1. Pipelines cost money even when idle: if a pipeline is enabled, you’re paying for it, regardless of document flow.
  2. OCUs (OpenSearch Compute Units) matter: if you set a pipeline to run with 2–4 OCUs, you’ll be billed for the 2-OCU minimum for as long as that pipeline is enabled.

Our solution:

  • Keep pipelines disabled until we actually need to run them (see the sketch after this list).
  • Use minimal OCUs for test pipelines and scale up only for production workloads.
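
Stopping and starting pipelines is scriptable, which makes it easy to keep test pipelines off by default. A minimal sketch using boto3’s osis client; the pipeline name is a placeholder, and it’s worth double-checking the current API docs for the exact capacity parameters:

    import boto3

    osis = boto3.client("osis")
    PIPELINE = "my-test-pipeline"  # placeholder pipeline name

    # Stop the pipeline between test runs so it stops accruing OCU charges.
    osis.stop_pipeline(PipelineName=PIPELINE)

    # Before the next run, keep capacity minimal and start it back up.
    osis.update_pipeline(PipelineName=PIPELINE, MinUnits=1, MaxUnits=2)
    osis.start_pipeline(PipelineName=PIPELINE)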

Lessons Learned

  • Don’t let Amplify auto-manage OpenSearch domains in multi-branch setups; it creates unnecessary domains and complexity.
  • Group test environments into a shared domain with multiple pipelines; keep production on its own domain.
  • Populate all fields in your initial dataset to ensure Zero ETL generates complete mappings. If a field is null for all rows during the first load, it won’t get mapped.
  • Missing fields? You can add them with _mapping, but type changes require a rebuild.
  • DynamoDB deletes/updates are slow at scale; plan ahead for large resets.
  • “Touch” updates don’t work unless you actually change a value.
  • Keep pipelines off until needed to save costs.
  • OCU settings directly impact billing; be intentional about your min/max settings.
  • Sometimes a manual mapping update or full rebuild is the fastest path forward.

Final Thoughts

Zero ETL from DynamoDB to OpenSearch can save a lot of boilerplate ETL work, when everything lines up. But if you have sparse datasets, large tables, or multi-branch deployments, expect to get your hands dirty.

We’re still using it, but with guardrails: manual domain management, careful initial loads, and the occasional full index rebuild. As we learn more about OpenSearch reindexing and mapping management, we’re adding more tools to our toolbox.

If you’ve run into similar challenges, I’d love to hear how you handled them.

Matt Pitts, Sr Architect
