As of today, we are still using a Redis Sorted Set to serve our live daily leaderboard screen. Back then, we tried serving it from Rockset.
So every document would look like this:
We queried the collection for the top 10 users with the most steps on a certain day. When querying the collection to build the leaderboard we would use the RANK() operator. That method was not practical and was too slow/demanded high resources.
Now with Rollups, there might be a better way using this rollup definition:
SELECT
userId,
day,
SUM(steps) AS steps,
RANK() OVER(PARTITION BY day ORDER BY SUM(steps) DESC) AS position,
FROM
_input
GROUP BY
userId,
day
This Rollups work and seem to sustain correct results as data is coming in. If you want to get the top 10 users you just do a simple “leaf” query for rows with rank 1,2,3… No sorting is needed so querying this collection is an order of magnitude faster than the old way.
My question is, will that Rollup scale under high traffic? Is it some kind of a “hack” I’ve managed to create or a legitimate Rollup definition?
You are doing the aggregations correctly: SUM(steps) *AS* steps
This will be computed at ingestion time and not at query time.
Now, for the window part of the query: RANK() OVER(PARTITION BY day ORDER BY SUM(steps) DESC) AS position
this will be actually run at query time, NOT, at ingestion time. Yes, this is scalable because the aggregation part of the query is now computed at ingest time and does not take up compute resources at query time.
Please let me know if this answers your questions, and I hope this helps.
Nadine
Sure, no problem. So, it’s a mix of both. SUM(steps) AS stepswill run at ingestion time. And, RANK() OVER(PARTITION BY day ORDER BY SUM(steps) DESC) AS positionwill run at query time. Does this help explain it? Let me know which part I should elaborate further.
Mm… are you sure?
Because when I look at the collection in the dashboard, the position (rank) property is there, always ready and updated. I don’t need to query the collection.
In short, SUM() is applied at data ingestion time. Then, we apply RANK() on top of the pre-calculated result of SUM() and this is ran at query time.
When you look at the preview and collection, you’ll see the full rollup query being run, including the RANK() part, and not just the part that happens at data ingestion. The part that’s important, is you’re saving on compute at query time when you apply the SUM() during rollups. But, compute is being used at query time for RANK().
Does this help a little more with your question? Please let me know.