Idempotency is a beautiful property for distributed systems. With it, you can perform the same operation many times and the outcome is the same as if you had done it just once.
This drastically simplifies failure handling. If an API call for an account balance increase times out, what do you do? Repeating the call might double-credit; not doing so might under-credit. Peeking at the balance before and after the API request doesn't help either, since somebody else may modify the balance in between calls. (Race condition!)
Luckily, we can make this API call idempotent. The caller generates a unique identifier and sends it along with the request. If the API has already applied a change with that identifier, it ignores the request; otherwise it applies the change.
Failure handling becomes simple: if an API request times out, just keep retrying until the API responds with "success". At that point we know the credit has been applied exactly once.
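As a concrete illustration, the caller might look like this rough Python sketch (the endpoint, payload fields, and use of the requests library are assumptions for illustration, not part of any particular API):

import time
import uuid

import requests  # assumed HTTP client; any equivalent works


def credit_account(account_id, amount):
    # One idempotency key per logical credit; every retry reuses it.
    idempotency_key = str(uuid.uuid4())
    payload = {
        "account_id": account_id,
        "amount": amount,
        "idempotency_key": idempotency_key,
    }
    delay = 1
    while True:
        try:
            resp = requests.post(
                "https://api.example.com/credits",  # hypothetical endpoint
                json=payload,
                timeout=5,
            )
            resp.raise_for_status()
            return resp.json()  # success: the credit was applied exactly once
        except (requests.Timeout, requests.ConnectionError):
            time.sleep(delay)  # timed out: retry with the same key
            delay = min(delay * 2, 60)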
This same technique can ensure reliable data movement between sources and a downstream datastore.
The sources tag each data element with a unique identifier.
The compute stage uses a SQL transaction to ensure elements aren't loaded more than once. We're leveraging the ACID properties of the database at the end of our processing chain. Here is the SQL:
START TRANSACTION;
-- Fails with a duplicate-key error if this element was already loaded.
INSERT INTO loaded_elements (unique_id)
VALUES ('8708dcd6-75ce-4c8a-847c-b96e535710b8');
INSERT INTO output_data (...) VALUES (...);
COMMIT;
The trick is placing a unique constraint on the unique_id column in loaded_elements. The first INSERT statement fails if the element has already been loaded. In a race condition, only one transaction will successfully commit; the other hits a serialization error and rolls back. In MySQL this table could be created as:
CREATE TABLE loaded_elements (
unique_id char(36) character set ascii,
PRIMARY KEY( unique_id )
) ENGINE=InnoDB;
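Wired up from application code, the load might look roughly like this Python sketch (mysql-connector-python, the helper name, and the output_data columns are assumptions; any DB-API driver would look much the same):

from mysql.connector import IntegrityError  # assumed MySQL driver


def load_element(conn, unique_id, payload):
    # conn comes from mysql.connector.connect(...); autocommit is off,
    # so both inserts share one transaction.
    cursor = conn.cursor()
    try:
        # A duplicate unique_id raises IntegrityError and nothing commits.
        cursor.execute(
            "INSERT INTO loaded_elements (unique_id) VALUES (%s)",
            (unique_id,),
        )
        cursor.execute(
            "INSERT INTO output_data (payload) VALUES (%s)",  # placeholder column
            (payload,),
        )
        conn.commit()
        return True   # loaded for the first time
    except IntegrityError:
        conn.rollback()  # already loaded, or lost the race: safe to skip
        return False
    finally:
        cursor.close()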
The unique identifiers can be generated as UUIDs. This requires no shared state or coordination between the data sources.
Using a type 1 UUID makes the most sense. When new uuid1s are inserted, they'll typically append to the end of the loaded_elements table. In contrast, uuid4s insert at random positions throughout the table, causing poor performance in a typical B-tree based database. (An alternative combines a timestamp with a uuid4, defines the unique constraint over both the timestamp and the uuid4, and transmits both fields from source to the datastore.)
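In Python, for instance, either flavor is a one-liner (the record layout for the timestamp-plus-uuid4 alternative is only illustrative):

import time
import uuid

# uuid1 derives the id from the current timestamp and the host's node id,
# which is what keeps newly generated ids clustered together.
element_id = str(uuid.uuid1())

# The alternative: a random uuid4 paired with a timestamp; the unique
# constraint in loaded_elements would then cover both columns.
record = {"created_at": int(time.time()), "unique_id": str(uuid.uuid4())}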
We can batch at the data sources to reduce the number of entries tracked in loaded_elements. The idea is the same; the uuid just applies to the batch rather than to individual data elements.
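A batch-level loader has the same shape as the per-element sketch above, with one guard row per batch (again an illustrative sketch, not a prescribed interface):

from mysql.connector import IntegrityError  # same assumed driver as above


def load_batch(conn, batch_id, elements):
    cursor = conn.cursor()
    try:
        # One guard row covers the whole batch.
        cursor.execute(
            "INSERT INTO loaded_elements (unique_id) VALUES (%s)",
            (batch_id,),
        )
        cursor.executemany(
            "INSERT INTO output_data (payload) VALUES (%s)",
            [(element,) for element in elements],
        )
        conn.commit()
        return True
    except IntegrityError:
        conn.rollback()  # this batch was already loaded: skip it wholesale
        return False
    finally:
        cursor.close()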
Whether tracking elements or batches, it is important that the original data is immutable: once an identifier is assigned, the associated data must not change.
A benefit of transactional deduplication is that smart delivery logic is an optimization rather than a requirement. If data is lost in transit, we can query the database to see what has been loaded and replay only what's missing. Being idempotent, we don't need to worry about any batches already in flight.
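A recovery pass, for example, can simply ask the table which of the batch ids we sent actually arrived and replay the rest (a sketch under the same assumptions as above):

def missing_batch_ids(conn, sent_batch_ids):
    # Which of the ids we believe we sent never reached loaded_elements?
    if not sent_batch_ids:
        return []
    cursor = conn.cursor()
    try:
        placeholders = ", ".join(["%s"] * len(sent_batch_ids))
        cursor.execute(
            "SELECT unique_id FROM loaded_elements WHERE unique_id IN ("
            + placeholders + ")",
            list(sent_batch_ids),
        )
        loaded = {row[0] for row in cursor.fetchall()}
        return [b for b in sent_batch_ids if b not in loaded]
    finally:
        cursor.close()

Anything in that list can be resent as-is; if a copy was already in flight, the loader will simply skip it.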
Overall, making data flows idempotent is easy if you have an ACID database at the end of the processing chain and the intermediate processing stages are stateless. Being able to resend data with confidence simplifies the entire transport layer, since it can provide "best-effort" delivery rather than trying to be transactional along the way.
At HasOffers, we used this technique to make a reliable, near real-time data loader from our attribution servers to AWS Redshift.