Hi, I was going through the Chronon docs to understand how online/offline consistency is guaranteed, but could not find the exact details.
I'm particularly interested in 3 scenarios:
- Joining data in online vs offline modes
- Fetching data from the KVStore vs fetching data from the batch store (Hive?)
- Stateless request-time user-defined transforms (on-demand features) - both online and offline
Details:
- I'm interested in the implementation details of Chronon's join operator. A streaming join is fundamentally different from a batch join (the most common streaming approach is a windowed join, afaik), so running both on the same data may produce different results.
  How does Chronon guarantee consistency between online and offline data, specifically for joins? Does it use a Kappa architecture (e.g. running the same streaming pipeline over offline data)? If so, what kind of streaming join is used? I'd like to understand this in depth for both the Spark Structured Streaming and Flink engines.
- For fetching/loading online/offline data: my understanding is that in offline mode Chronon writes the resulting data to Hive, while online data goes to the KVStore. Is there any guarantee that loading data at a specific timestamp from the offline store (Hive) returns exactly the same result as fetching from the KVStore at that same timestamp? If so, how exactly does it work?
- Does Chronon allow last-mile, request-time, user-defined stateless transformations? (In Tecton these are called on-demand features, e.g. getting the user's request time at millisecond granularity.) If so, how are they computed online and offline, and how is consistency guaranteed between the two?
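
To make the join concern from the first item concrete, here is a minimal sketch in plain Python. It is not Chronon, Spark, or Flink code: the event data, the function names `batch_join` and `windowed_join`, and the 10-second window are all hypothetical and only illustrate why the two join styles can disagree on the same input.

```python
from collections import defaultdict

# Hypothetical toy events: (key, event_time_seconds, value)
left = [("u1", 0, "click_a"), ("u1", 100, "click_b")]
right = [("u1", 5, "view_x"), ("u1", 500, "view_y")]


def index_by_key(events):
    """Group (event_time, value) pairs by join key."""
    by_key = defaultdict(list)
    for key, ts, value in events:
        by_key[key].append((ts, value))
    return by_key


def batch_join(left, right):
    """Equi-join on key over the full dataset, ignoring event time."""
    by_key = index_by_key(right)
    return [
        (key, left_value, right_value)
        for key, _, left_value in left
        for _, right_value in by_key[key]
    ]


def windowed_join(left, right, window_seconds):
    """Equi-join on key, keeping only pairs whose event times
    differ by at most window_seconds."""
    by_key = index_by_key(right)
    return [
        (key, left_value, right_value)
        for key, left_ts, left_value in left
        for right_ts, right_value in by_key[key]
        if abs(left_ts - right_ts) <= window_seconds
    ]


batch_result = batch_join(left, right)
windowed_result = windowed_join(left, right, window_seconds=10)

print(len(batch_result))     # 4 pairs: every left row matches every right row
print(len(windowed_result))  # 1 pair: only click_a / view_x are within 10 s
```

Same input, different output: this is the kind of divergence I'd like to understand how Chronon avoids.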