Chronon, Airbnb's ML feature platform, is now open source

2024-04-08 17:27 · medium.com

A feature platform that offers observability and management tools, lets ML practitioners use a variety of data sources, handles the complexity of data engineering, and provides low-latency streaming and serving.

By: Varant Zanoyan, Nikhil Simha Raprolu

Chronon allows ML practitioners to use a variety of data sources as inputs to feature transformations. It handles the complexity of data plumbing, such as batch and streaming compute, provides low latency serving, and offers a host of observability and management tools.

Airbnb is happy to announce that Chronon, our ML Feature Platform, is now open source. Join our community Discord channel to chat with us.

We’re excited to be making this announcement along with our partners at Stripe, who are early adopters and co-maintainers of the project.

This blog post covers the main motivation and functionality of Chronon. For an overview of core concepts in Chronon, please see this previous post.

Background

We built Chronon to relieve a common pain point for ML practitioners: they were spending the majority of their time managing the data that powers their models rather than on modeling itself.

Prior to Chronon, practitioners would use one of the following two approaches:

  1. Replicate offline-online: ML practitioners train the model with data from the data warehouse, then figure out ways to replicate those features in the online environment. The benefit of this approach is that it allows practitioners to utilize the full data warehouse, both the data sources and powerful tools for large-scale data transformation. The downside is that there is no clear path to serving these features online, and the ad-hoc replication that results tends to introduce inconsistencies and label leakage that severely affect model performance.
  2. Log and wait: ML practitioners start with the data that is available in the online serving environment from which the model inference will run. They log relevant features to the data warehouse. Once enough data has accumulated, they train the model on the logs and serve with the same data. The benefit of this approach is that consistency is guaranteed and leakage is unlikely. However, the major drawback is that it can result in long wait times, hindering the ability to respond quickly to changing user behavior.

The Chronon approach allows for the best of both worlds. Chronon requires ML practitioners to define their features only once, powering both offline flows for model training as well as online flows for model inference. Additionally, Chronon offers powerful tooling for feature chaining, observability and data quality, and feature sharing and management.

How It Works

Below we explore the main components that power most of Chronon’s functionality using a simple example derived from the quickstart guide. You can follow that guide to run this example.

Let’s assume that we’re a large online retailer, and we’ve detected a fraud vector based on users making purchases and later returning items. We want to train a model to predict whether a given transaction is likely to result in a fraudulent return. We will call this model each time a user starts the checkout flow.

Defining Features

Purchases Data: We can aggregate the purchases log data to the user level to give us a view into this user’s previous activity on our platform. Specifically, we can compute SUMs, COUNTs and AVERAGEs of their previous purchase amounts over various time windows.

source = Source(
    events=EventSource(
        table="data.purchases",  # This points to the log table in the warehouse with historical purchase events, updated in batch daily
        topic="events/purchases",  # The streaming source topic
        query=Query(
            selects=select("user_id", "purchase_price"),  # Select the fields we care about
            time_column="ts",  # The event time
        ),
    )
)

window_sizes = [Window(length=day, timeUnit=TimeUnit.DAYS) for day in [3, 14, 30]]  # Define some window sizes to use below

v1 = GroupBy(
    sources=[source],
    keys=["user_id"],  # We are aggregating by user
    online=True,
    aggregations=[
        Aggregation(
            input_column="purchase_price",
            operation=Operation.SUM,
            windows=window_sizes,
        ),  # The sum of purchase prices in various windows
        Aggregation(
            input_column="purchase_price",
            operation=Operation.COUNT,
            windows=window_sizes,
        ),  # The count of purchases in various windows
        Aggregation(
            input_column="purchase_price",
            operation=Operation.AVERAGE,
            windows=window_sizes,
        ),  # The average purchase price in various windows
        Aggregation(
            input_column="purchase_price",
            operation=Operation.LAST_K(10),
        ),  # The last 10 purchase prices, collected into a list
    ],
)

This creates a `GroupBy` which transforms the `purchases` event data into useful features by aggregating various fields over various time windows, with `user_id` as a primary key.

This transforms raw purchases log data into useful features at the user level.
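
To make the output concrete, here is an illustrative sketch of the feature row this GroupBy would produce for a single user. The values and exact column names are hypothetical; the naming pattern follows the `purchase_price_avg_3d`-style names shown in the fetch response later in this post.

# Illustrative only -- one row keyed by user_id, with one column per
# (input_column, operation, window) combination defined above.
{
    "user_id": "123",
    "purchase_price_sum_3d": 57.98, "purchase_price_sum_14d": 214.50, "purchase_price_sum_30d": 370.23,
    "purchase_price_count_3d": 2, "purchase_price_count_14d": 9, "purchase_price_count_30d": 16,
    "purchase_price_avg_3d": 28.99, "purchase_price_avg_14d": 23.83, "purchase_price_avg_30d": 23.14,
    "purchase_price_last10": [28.99, 28.99, 31.50],  # hypothetical name for the LAST_K(10) output, truncated here
}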

User Data: Turning User data into features is a little simpler, primarily because we don’t have to worry about performing aggregations. In this case, the primary key of the source data is the same as the primary key of the feature, so we can simply extract column values rather than perform aggregations over rows:

source = Source(
    entities=EntitySource(
        snapshotTable="data.users",  # This points to a table that contains daily snapshots of all users
        query=Query(
            selects=select("user_id", "account_created_ds", "email_verified"),  # Select the fields we care about
        ),
    )
)

v1 = GroupBy(
    sources=[source],
    keys=["user_id"],  # Primary key is the same as the primary key for the source table
    aggregations=None,  # In this case, there are no aggregations or windows to define
    online=True,
)

This creates a `GroupBy` which extracts dimensions from the `data.users` table for use as features, with `user_id` as a primary key.

Joining these features together: Next, we need to combine the previously defined features into a single view that can be both backfilled for model training and served online as a complete vector for model inference. We can achieve this using the Join API.

For our use case, it’s very important that features are computed as of the correct timestamp. Because our model runs when the checkout flow begins, we want to use the corresponding timestamp in our backfill, such that feature values for model training logically match what the model will see in online inference.

Here’s what the definition would look like. Note that it combines our previously defined features in the right_parts portion of the API (along with another feature set called returns).

source = Source(
    events=EventSource(
        table="data.checkouts",
        query=Query(
            selects=select("user_id"),  # The primary key used to join various GroupBys together
            time_column="ts",  # The event time used to compute feature values as-of
        ),
    )
)

v1 = Join(
    left=source,
    right_parts=[JoinPart(group_by=group_by) for group_by in [purchases_v1, returns_v1, users]],  # Include the three GroupBys
)

Backfills/Offline Computation

The first thing that a user would likely do with the above Join definition is run a backfill with it to produce historical feature values for model training. Chronon performs this backfill with a few key benefits:

  1. Point-in-time accuracy: Notice the source that is used as the “left” side of the join above. It is built on top of the “data.checkouts” source, which includes a “ts” timestamp on each row that corresponds to the logical time of that particular checkout. Every feature computation is guaranteed to be window-accurate as of that timestamp. So for the one-month sum of previous user purchases, every row will be computed for the user as of the timestamp provided by the left-hand source.
  2. Skew handling: Chronon’s backfill algorithms are optimized for handling highly skewed datasets, avoiding frustrating OOMs and hanging jobs.
  3. Computational efficiency optimizations: Chronon is able to bake in a number of optimizations directly into the backend, reducing compute time and cost.
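
To make this step concrete, here is a hedged sketch of how the backfill is kicked off in the quickstart setup; the exact paths, config names, and flags may differ in your environment, so treat this as an approximation of the workflow rather than a definitive command reference.

# Compile the Python Join definition into a serialized config, then backfill it up to an end date.
compile.py --conf=joins/quickstart/training_set.py
run.py --mode=backfill --conf=production/joins/quickstart/training_set.v1 --ds=2023-11-30

The output of the backfill is a warehouse table keyed by the left source’s keys and timestamps, with one column per feature, ready to be joined with labels for model training.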

Online Computation

Chronon abstracts away a lot of complexity for online feature computation. In the examples above, how it computes features online depends on whether the feature is a batch feature or a streaming feature.

Batch features (for example, the User features above)

Because the User features are built on top of a batch table, Chronon will simply run a daily batch job to compute the new feature values as new data lands in the batch data store and upload them to the online KV store for serving.
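
As a rough sketch of the operational side (again based on the quickstart guide; the config name and flags below are assumptions and may differ), the daily batch upload for a GroupBy like the Users features looks something like:

# Compute the latest daily snapshot of the GroupBy and write it to the online KV store.
run.py --mode=upload --conf=production/group_bys/quickstart/users.v1 --ds=2023-11-30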

Streaming features (for example, the Purchases features above)

The Purchases features are built on a source that includes a streaming component, as indicated by the inclusion of a “topic” in the source. In this case, Chronon will still run a batch upload in addition to a streaming job for real-time updates. The batch job is responsible for:

  1. Seeding the values: For long windows, it wouldn’t be practical to rewind the stream and play back all raw events.
  2. Compressing “the middle of the window” and providing tail accuracy: For precise window accuracy, we need raw events at both the head and the tail of the window.

The streaming job then writes updates to the KV store to keep feature values up to date at fetch time.
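
To illustrate why the batch and streaming jobs split the work this way, here is a conceptual sketch in plain Python (not the Chronon API or its actual storage format): a windowed sum is assembled from batch-computed partial aggregates covering the middle of the window, plus raw events at the head and tail for precise accuracy.

# Conceptual sketch only -- not Chronon's implementation.
from datetime import timedelta

def windowed_sum(query_time, daily_partials, head_events, tail_events,
                 window=timedelta(days=30)):
    """daily_partials: {date -> precomputed SUM over that whole day} (batch upload)
    head_events: raw (ts, amount) events since the last batch landing (streaming)
    tail_events: raw (ts, amount) events from the oldest day touched by the window"""
    window_start = query_time - window
    # Middle: whole days strictly inside the window, served from batch partial aggregates.
    middle = sum(v for day, v in daily_partials.items()
                 if window_start.date() < day < query_time.date())
    # Head: the most recent raw events, kept fresh by the streaming job.
    head = sum(amount for ts, amount in head_events if ts <= query_time)
    # Tail: raw events on the boundary day, counted only while still inside the window.
    tail = sum(amount for ts, amount in tail_events if ts >= window_start)
    return middle + head + tail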

Online Serving / Fetch API

Chronon offers an API to fetch features with low latency. We can either fetch values for individual GroupBys (i.e. the Users or Purchases features defined above) or for a Join. Here’s an example of what one such request and response for a Join would look like:

// Fetching all features for user=123
Map<String, String> keyMap = new HashMap<>();
keyMap.put("user", "123");
Fetcher.fetch_join(new Request("quickstart_training_set_v1", keyMap));

// Sample response (map of feature name to value)
'{"purchase_price_avg_3d":14.2341, "purchase_price_avg_14d":11.89352, ...}'

Java code that fetches all features for user 123. The return type is a map of feature name to feature value.

The above example uses the Java client. There is also a Scala client and a Python CLI tool for easy testing and debugging:

run.py --mode=fetch -k '{"user_id":123}' -n quickstart/training_set -t join

> {"purchase_price_avg_3d":14.2341, "purchase_price_avg_14d":11.89352, ...}

Utilizes the run.py CLI tool to make the same fetch request as the Java code above. run.py is a convenient way to quickly test Chronon workflows like fetching.

Another option is to wrap these APIs into a service and make requests via a REST endpoint. This approach is used within Airbnb for fetching features in non-Java environments such as Ruby.

Online-Offline Consistency

Chronon not only helps achieve online-offline consistency, it also offers a way to measure it. The measurement pipeline starts with the logs of the online fetch requests. These logs include the primary keys and timestamp of the request, along with the fetched feature values. Chronon then passes the keys and timestamps to a Join backfill as the left side, asking the compute engine to backfill the feature values. It then compares the backfilled values to the actual fetched values to measure consistency.
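
Conceptually, the final comparison boils down to something like the following PySpark sketch; the table and feature names are hypothetical, and Chronon automates this pipeline rather than requiring you to write it by hand.

# A conceptual sketch of the consistency check, not Chronon's built-in implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

logged = spark.table("logs.online_feature_fetches")        # keys, request ts, and fetched feature values (hypothetical table)
backfilled = spark.table("chronon.training_set_backfill")  # same keys/timestamps, features recomputed offline (hypothetical table)

feature = "purchase_price_sum_30d"
comparison = (
    logged.alias("online")
    .join(backfilled.alias("offline"), on=["user_id", "ts"])
    .withColumn("is_consistent",
                F.col(f"online.{feature}") == F.col(f"offline.{feature}"))
)

# Fraction of online requests whose served value matches the offline backfill.
comparison.agg(F.avg(F.col("is_consistent").cast("double")).alias("consistency_rate")).show()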

What’s Next?

Open source is just the first step in an exciting journey that we look forward to taking with our partners at Stripe and the broader community.

Our vision is to create a platform that enables ML practitioners to make the best possible decisions about how to leverage their data and makes enacting those decisions as easy as possible. Here are some questions that we’re currently using to inform our roadmap:

How much further can we lower the cost of iteration and computation?

Chronon is already built for the scale of data processed by large companies such as Airbnb and Stripe. However, there are always further optimizations that we can make to our compute engine, both to reduce the compute cost and the “time cost” of creating and experimenting with new features.

How much easier can we make authoring a new feature?

Feature engineering is the process by which humans express their domain knowledge to create signals that the model can leverage. Chronon could integrate NLP to allow ML practitioners to express these feature ideas in natural language and generate working feature definition code as a starting point for their iteration.

Lowering the technical bar to feature creation would in turn open the door to new kinds of collaboration between ML practitioners and partners who have valuable domain expertise.

Can we improve the way models are maintained?

Changing user behavior can cause shifts in model performance because the data that the model was trained on no longer applies to the current situation. We imagine a platform that can detect these shifts and create a strategy to address them early and proactively, either by retraining, adding new features, modifying existing features, or some combination of the above.

Can the platform itself become an intelligent agent that helps ML practitioners build and deploy the best possible models?

The more metadata that we gather into the platform layer, the more powerful it can become as a general ML assistant.

We mentioned the goal of creating a platform that can automatically run experiments with new data to identify ways to improve models. Such a platform might also help with data management by allowing ML practitioners to ask questions such as “What kinds of features tend to be most useful when modeling this use case?” or “What data sources might help me create features that capture signal about this target?” A platform that could answer these types of questions represents the next level of intelligent automation.

Getting Started

Here are some resources to help you get started or to evaluate if Chronon is a good fit for your team.

Interested in this type of work? Check out our open roles here — we’re hiring.

Acknowledgements

Sponsors: Henry Saputra, Yi Li, Jack Song

Contributors: Pengyu Hou, Cristian Figueroa, Haozhen Ding, Sophie Wang, Vamsee Yarlagadda, Haichun Chen, Donghan Zhang, Hao Cen, Yuli Han, Evgenii Shapiro, Atul Kale, Patrick Yoon


Read the original article

Comments

  • By sfink 2024-04-09 15:55

    It's refreshing to read something about ML and inference and have it not be anything related to a transformer architecture sending up fruit growing from a huge heap of rotten, unknown, mostly irrelevant data. With traditional ML, it's useful to talk about the sources of bias and error, and even measure some of them. You can do things that improve them without starting over on everything else.

    With LLMs, it's more like you buy a large pancake machine that you dump all of your compost into (and you suspect the installers might have hooked up to your sewage line as input too). It triples your electricity bill, it makes bizarre screeching noises as it runs, you haven't seen your cat in a week, but at the end out come some damn fine pancakes.

    I apologize. I'm talking about the thing that I was saying was a relief to be not talking about.

    • By nikhilsimha 2024-04-09 17:57

      I agree with you - about the sentiment around the GenAI megaphone.

      FWIW, Chronon does serve context within prompts to personalize LLM responses. It is also used to time-travel new prompts for evaluation.

      • By cactusplant7374 2024-04-09 19:11

        > time-travel new prompts for evaluation

        What does this mean?

        • By nikhilsimha 2024-04-09 19:26

          Imagine you are building a customer support bot for a food delivery app.

          The user might say - I need a refund. The bot needs to know contextual information - order details, delivery tracking details etc.

          Now you have written a prompt template that needs to be rendered with contextual information. This rendered prompt is what the model will use to decide whether to issue a refund or not.

          Before you deploy this prompt to prod, you want to evaluate its performance - instances where it correctly decided to issue or decline a refund.

          To evaluate, you can “replay” historical refund requests. The issue is that the information in the context changes with time. You want to instead simulate the value of the context at a historical point in time - or time-travel.

          • By jamesblonde 2024-04-09 19:41

            Are you using function calling for the context info?

          • By jamesblonde 2024-04-09 19:31

            Time-travel evals, nice.

          • By uoaei 2024-04-09 21:50

            In what world is it appropriate or even legal to decide on refunds via LLM?

            Can you give an example that's not ripe for abuse? This really doesn't sell LLMs as anything useful except insulation from the consequences of bad decisions.

            • By throwaway2037 2024-04-10 2:55

              Don't think of LLM as completely replacing the support agent here; rather augmenting. A lot of customer service is setting/finding context: customer name, account, order, item, etc. If an LLM chatbot can do all of that, then handoff to a human support agent, there is real cost savings to be had, without reducing the quality of service.

              • By uoaei 2024-04-10 5:01

                I'd love for others to think that way. I am a very vocal (in my own bubble) advocate for human-in-the-loop ML.

            • By amenhotep 2024-04-10 13:23

              Have you requested a refund off Amazon lately? They have an automated system where, iirc, a wizard will ask you a few questions and then process it, presumably inspecting your customer history and so on. If the system thinks your request looks genuine and it's within whatever parameters they've set, it'll accept instantly, refund you, sometimes without even asking you to send the item back. If it's less sure, it will pass the request on to a human agent to be dealt with like it would have been in the Before Times.

              I can see no reason why it would be illegal or inappropriate to use an LLM as part of the initial flow there. In fact I see no reason why it would be illegal for Amazon to simply flip a coin to decide whether to immediately accept your refund. (Appropriateness is another matter!)

              I guess you're assuming the LLM would be the only point of contact with no recourse if it rejects you? Which strikes me as very pessimistic, unless you live in a very poorly regulated country.

            • By nikhilsimha 2024-04-09 22:07

              "Imagine" is the operative word :-)

    • By haolez 2024-04-09 23:12

      What do you think of the approach in DSPy[0]? It seems to give a more traditional ML feel to LLM optimization.

      [0] https://dspy.ai/

  • By giovannibonetti 2024-04-09 16:11

    What is the difference between an ML feature store and a low-latency OLAP DB platform/data warehouse? I see many similarities between both, like the possibility of performing aggregation of large data sets in a very short time.

    • By csmpltn 2024-04-09 18:13

      There is none. The industry is being flooded with DS and "AI" majors (and other generally non-technical people) that have zero historical context on storage and database systems - and so everything needs to be reinvented (but in Python this time) and rebranded. At the end of the day you're simply looking at different mixtures of relational databases, key-value stores, graph databases, caches, time-series databases, column stores, etc. The same stuff we've had for 50+ years.

      • By nikhilsimha 2024-04-09 18:38

        Two main differences - ability to time travel for training data generation and the ability to push compute to the write side of the view rather than the read side for low latency feature serving.

        • By csmpltn 2024-04-09 19:29

          > "ability to time travel for training"

          Nah, this is nothing new.

          We've solved this for ages with "snapshots" or "archives", or fancy indexing strategies, or just a freaking "timestamp" column in your tables.

          • By ezvz 2024-04-09 19:48

            There's a lot more to it than snapshots or timestamped columns when it comes to ML training data generation. We often have windowed aggregations that need to be computed as of precise intra-day timestamps in order to achieve parity between training data (backfilled in batch) and the data that is being served online realtime (with streaming aggregations being computed realtime).

            Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge.

            And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possible for efficiency.

            All of this is well beyond the scope that is addressed by standard OLAP data solutions.

            Not to mention the fact that the offline computation needs to translate seamlessly to power online serving (i.e. seeding feature values, and combining with streaming realtime aggregations), and the need for online/offline consistency measurement.

            That's why a lot of teams don't even bother with this, and basically just log their feature values from online to offline. But this limits what kind of data they can use, and also how quickly they can iterate on new features (need to wait for enough log data to accumulate before you can train).

            • By giovannibonetti 2024-04-10 1:26

              > Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge.

              As long as your OLAP table/projection/materialized view is sorted/clustered by that timestamp, it will be able to efficiently pick only the data in that interval for your query, regardless of the precision you need.

              > And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possibly for efficiency.

              > All of this is well beyond the scope that is addressed by standard OLAP data solutions.

              I think the StarRocks open-source OLAP DB supports this as a query rewrite mechanism that optimizes performance by using data from materialized views. It can build UNION queries to handle date ranges [1]

              [1] https://docs.starrocks.io/docs/using_starrocks/query_rewrite...

            • By mulmen 2024-04-09 20:42

              I’m still not seeing how this is a novel problem. You just apply a filter to your timestamp column and re-run the window function. It will give you the same value down to the resolution of the timestamp every time.

              • By ezvz 2024-04-09 21:34

                Let's try an example: `average page views in the last 1, 7, 30, 60, 180 days`

                You need these values accurate as of ~500k timestamps for 10k different page ids, with significant skew for some page ids.

                So you have a "left" table with 500k rows, each with a page id and timestamp. Then you have a `page_views` table with many millions/billions/whatever rows that need to be aggregated.

                Sure, you could do this with backfill with SQL and fancy window functions. But let's just look at what you would need to do to actually make this work, assuming you wanted it to be served online with realtime updates (from a page_views kafka topic that is the source of the page views table):

                For online serving:
                1. Decompose the batch computation to SUM and COUNT and seed the values in your KV store.
                2. Write the streaming job that does realtime updates to your SUMs/COUNTs.
                3. Have an API for fetching and finalizing the AVERAGE value.

                For Backfilling:
                1. Write your verbose query with windowed aggregations (I encourage you to actually try it).
                2. Often you also want a daily front-fill job for scheduled retraining. Now you're also thinking about how to reuse previous values. Maybe you reuse your decomposed SUMs/COUNTs above, but if so you're now orchestrating these pipelines.

                For making sure you didn't mess it up:
                1. Compare logs of fetched features to backfilled values to make sure that they're temporally consistent.

                For sharing:
                1. Let's say other ML practitioners are also playing around with this feature, but with different timelines (i.e. different timestamps). Are they redoing all of the computation? Or are you orchestrating caching and reusing partial windows?

                So you can do all that, or you can write a few lines of python in Chronon.

                Now let's say you want to add a window. Or say you want to change it so it's aggregated by `user_id` rather than `page_id`. Or say you want to add other aggregations other than AVERAGE. You can redo all of that again, or change a few lines of Python.

                • By mulmen 2024-04-09 22:26

                  I admit this is a bit outside my wheelhouse so I’m probably still missing something.

                  Isn’t this just a table with 5bn rows of timestamp, page_type, page_views_t1d, page_views_t7d, page_views_t30d, page_views_t60d, and page_views_t180d? You can even compute this incrementally or in parallel by timestamp and/or page_type.

                  What’s the magic Chronon is doing?

                  • By better365 2024-04-10 3:49

                    For offline computation, the table with 5bn rows is okay. But for online serving, it would be really challenging to serve the features within a few milliseconds.

                    But even for offline computation, the code for the same computation logic gets duplicated in lots of places. We have observed ML practitioners copying SQL queries all over. In the end, that makes debugging, feature interpretability, and lineage practically impossible.

                    Chronon abstracts all those away so that ML practitioners can focus on the core problems they are dealing with, rather than spending time on the ML Ops.

                    For an extreme use case, one user defined 1000 features with 250 lines of code, which is definitely impossible with SQL queries, not to even mention the extra work to serve those features.

                    • By mulmen 2024-04-10 4:31

                      How does Chronon do this faster than the precomputed table? And in a single docker container? Is it doing logically similar operations but just automating the creation and orchestration of the aggregation tasks? How does it work?

                      • By better365 2024-04-11 17:45

                        We utilize a lambda architecture, which incorporates the concept of precomputed tables as well. Those precomputed tables store intermediate representations of the final results and are capable of providing snapshot or daily-accuracy features. However, when it comes to real-time features that require point-in-time correctness, using precomputed tables may present challenges.

                        For offline computation, we reuse those intermediate results to avoid recomputing from scratch. So the engine can actually scale sub-linearly.

                        • By mulmen 2024-04-11 18:56

                          Thanks. How does Chronon serve the real-time features without precomputed tables?

                • By throwaway2037 2024-04-10 3:00

                  This is a good post. You had me until this part:

                      > So you can do all that, or you can write a few lines of python in Chronon.
                  
                  It all seems a bit hand-wavy here. Will Chronon work as well as the SQL version or be correct? I vote for an LLM tool to help you write those queries. Or is that effectively what Chronon is doing?

                  • By better365 2024-04-10 3:42

                    For correctness, yes, it works as well as the SQL version. And the aggregations are easily extensible with other operations. For example, we have a last operation, which is not even available in standard SQL.

                    • By mulmen 2024-04-10 20:01

                      I’ll stop short of calling comparisons to standard SQL disingenuous but it’s definitely unrealistic because no standard SQL implementation exists.

                      What does this “last” operation do? There’s definitely a LAST_VALUE() window function in the databases I use. It is available in Postgres, Redshift, EMR, Oracle, MySQL, MSSQL, Bigquery, and certainly others I am not aware of.

                      • By better365 2024-04-11 17:50

                        That's fair.

                        Actually, Last is usually called last_k(n), so that you can specify the number of values in the result array. For example, if the input column is page_view_id and n = 300, it will return the last 300 page_view_id values as an array. If a window is used, for example, 7d, it will truncate the results to the past 7d. LAST_VALUE() seems to return the last value from an ordered set. Hope that helps. Thanks for your interest.

                        • By mulmen 2024-04-11 19:04

                          In SQL we do that with a RANK window function then apply a filter to that rank. It can also be done with a correlated subquery.

          • By echrisinger 2024-04-09 20:51

            What's with the dismissiveness? The author is a senior staff engineer at a huge company & has worked in this space for years. I'd suspect they've done their diligence...

          • By nikhilsimha 2024-04-09 19:41

            Snapshots can’t travel back with millisecond precision or even minute-level precision. They are just full dumps at regular fixed intervals in time.

            • By hobs 2024-04-09 20:53

              https://en.wikipedia.org/wiki/Sixth_normal_form Basically we've had time travel (via triggers or built-in temporal tables or just writing the data) for a long time, it's just expensive to have it all for an OLTP database.

              We've also had slowly changing dimensions to solve this type of problem for a decent amount of time for the labels that sit on top of everything, though really these are just fact tables with a similar historical approach.

              • By ezvz 2024-04-09 21:14

                6NF works well for some temporal data, but I haven't seen it work well for windowed aggregations because the start/end time format of saving values doesn't handle events "falling out of the window" too well. At least the examples I've seen have values change due to explicit mutation events.

                • By hobs 2024-04-09 23:32

                  Agree, you don't really want to pre-aggregate your temporal data, or it will effectively only aggregate at each row-time boundary and the value is lower than just keeping the individual calculations.

            • By _se 2024-04-09 19:48

              Databases have had many forms of time travel for 30+ years now.

              • By threeseed 2024-04-09 21:52

                Not at the latency needed for feature serving and most databases struggle with column limits.

                But please enlighten us on which databases to use so Airbnb (and the rest of us) can stop wasting time.

                • By refset 2024-04-09 22:49

                  Shameless plug, but XTDB v2 is being built for low-latency bitemporal queries over columnar storage and might be applicable: https://docs.xtdb.com/quickstart/query-the-past.html

                  We've not been developing v2 with ML feature serving in mind so far, but I would love to speak with anyone interested in this use case and figure out where the gaps are.

            • By mulmen 2024-04-09 20:44

              Snapshots don’t have to be at regular intervals and can be at whatever resolution you choose. You could snapshot as the first step of training then keep that snapshot for the life of the resulting model. Or you could use some other time travel methodology. Snapshots are only one of many options.

              • By nikhilsimha 2024-04-09 21:01

                These are reconstructions of features/columns that don’t exist yet.

                • By mulmen 2024-04-11 8:35

                  I don’t understand what this means. How can something be reconstructed without first existing? Is this not just a caching exercise?

        • By jyhu 2024-04-09 20:48

          Have you guys considered Rockset? What you mentioned are some classic real-time aggregation use cases and Rockset seems to support that well: https://docs.rockset.com/documentation/docs/ingestion-rollup...

        • By ShamelessC 2024-04-09 19:24

          > ability to time travel for training data generation

          What now?

          • By nikhilsimha 2024-04-09 19:30

            Pardon the jargon. But it is a necessary addition to the vocabulary.

            To evaluate if a feature is valuable, you could attach the value of the feature to past inferences and retrain a new model to check for improvement in performance.

            But this “attach”-ing needs the feature value to be as of the time of the past inference.

            • By mulmen 2024-04-09 20:22

              That’s not a new concept.

              • By better365 2024-04-10 3:54

                True. But it is not necessary to reinvent the wheel for engineers. :)

                • By mulmen 2024-04-10 5:05

                  That’s the point of this subthread though. What’s the new thing Chronon is doing? It can’t just be point in time features because that’s already a thing.

    • By jamesblonde 2024-04-09 17:02

      You need the columnar store for both training data and batch inference data. If you have a batch ML system that works with time series data, the feature store will help you create point-in-time correct training data snapshots from the mutable feature data (no future data leakage), as well as batch inference data.

      For real-time ML systems, it gives you row-oriented retrieval latencies for features.

      Most importantly, it helps modularize your ML system into feature pipelines, training pipelines, and inference pipelines. No monolithic ML pipelines.

    • By uoaei 2024-04-09 16:37

      Feature stores are more for fast read and moderate write/update for ML training and inference flows. Good organization and fast query of relatively clean data.

      Data warehouse is more for relatively unstructured or blobby data with moderate read access and capacity for massive files.

      OLAP is mostly for feeding streaming and event-driven flows, including but not limited to ML.

    • By nikhilsimha 2024-04-09 17:38

      the ability to generate training sets against historical inferences to back-test new features

      another one is the focus on pushing as much compute to the write-side as possible (within Chronon) - especially joins and aggregations.

      OLAP databases and even graph databases don't scale well to high read traffic. Even when they do, the latencies are very high.

      • By giovannibonetti 2024-04-09 18:17

        You may want to take a look at Starrocks [1]. It is an open-source DB [2] that competes with Clickhouse [3] and claims to scale well – even with joins – to handle use cases like real-time and user-facing analytics, where most queries should run in a fraction of a second.

        [1] https://www.starrocks.io/ [2] https://github.com/StarRocks/starrocks [3] https://www.starrocks.io/blog/starrocks-vs-clickhouse-the-qu...

        • By nikhilsimha 2024-04-09 18:34

          We did and gave up due to scalability limitations.

          Fundamentally most of the computation needs to happen before the read request is sent.

          • By jvican 2024-04-09 18:42

            Hey! I work on the ML Feature Infra at Netflix, operating a similar system to Chronon but with some crucial differences. What other alternatives aside from Starrocks did you evaluate as potential replacements prior to building Chronon? Curious if you got to try Tecton or Materialize.com.

            • By nikhilsimha 2024-04-09 19:19

              We haven’t tried Materialize - IIUC Materialize is pure kappa. Since we need to correct upstream data errors and forget selective data (GDPR) automatically - we need a lambda system.

              Tecton, we evaluated, but decided that the time-travel strategy wasn’t scalable for our needs at the time.

              A philosophical difference with tecton is that, we believe the compute primitives (aggregation and enrichment) need to be composable. We don’t have a FeatureSet or a TrainingSet for that reason - we instead have GroupBy and Join.

              This enables chaining or composition to handle normalization (think 3NF) / star-schema in the warehouse.

              A side benefit is that non-ML use cases are able to leverage functionality within Chronon.

              • By jamesblonde 2024-04-09 19:57

                FeatureSets are mutable data and TrainingSets are consistent snapshots of feature data (from FeatureSets). I fail to see what that has to do with composability. Join is still available for FeatureSets to enable composable feature views - join is reuse of feature data. GroupBy is just an aggregation in a feature pipeline, so I'm not sure what your point is here. You can still do star schema (and even snowflake schema if you have the right abstractions).

              • By jamesblonde 2024-04-09 19:58

                Normalization is a model-dependent transformation and happens after the feature store - needs to be consistent between training and inference pipelines.

                • By nikhilsimha 2024-04-09 20:31

                  Normalization is overloaded. I was referring to schema normalization (3NF etc) not feature normalization - like standard scaling etc.

                  • By jamesblonde 2024-04-09 20:49

                    Ok, but star schema is denormalized. Snowflake is normalized.

                    • By nikhilsimha 2024-04-09 22:10

                      To be pedantic, even in star schema - the dim tables are denormalized, fact tables are not.

                      I agree that my statement would be much better if I had used snowflake schema instead.

              • By throwaway2037 2024-04-10 3:04

                What is the meaning of pure kappa?

              • By jvican 2024-04-09 21:19

                Thank you for sharing!

              • By _mdb 2024-04-10 4:00

                [dead]

            • By omeze 2024-04-09 19:12

            That evaluation would be an amazing addendum or engineering blog post! I know it’s not as sexy as announcing a product, but from an engineering perspective the process matters as much as the outcome :)

            • By esafak 2024-04-09 18:50

            Please can you expand? What limitations, computations?

              • By nikhilsimha 2024-04-09 19:08

                Let’s say you want to compute the avg transaction value of a user in the last 90 days. You could pull individual transactions and average at request time - or you could pre-compute partial aggregates and re-aggregate on read.

                OLAP systems are fundamentally designed to scale the read path - the former approach. Feature serving needs the latter.

                • By esafak 2024-04-09 20:50

                Does Chronon automatically determine what intermediate calculations should be cached? Does it accept hints?

                  • By nikhilsimha 2024-04-09 22:10

                  We don't accept hints yet - but we determine what to cache.

  • By whiplash451 2024-04-09 14:56

    First of all, congrats on the release! Well done. A few questions:

    - Since the platform is designed to scale, it would be nice to see scalability benchmarks

    - Is the platform compatible with human-in-the-loop workflows? In my experience, those workflows tend to require vastly different needs than fully automated workflows (e.g. online advertising)

    • By nikhilsimha 2024-04-09 17:40

      re: scalability benchmarks - we plan to publish more benchmark information against publicly available datasets in the near future.

      re: human-in-the-loop workflows - do you mean labeling?

HackerNews