Creating a "live leaderboard" with a Rollup collection

yaron · September 17, 2021, 7:49pm

Hi!

As of today, we are still using a Redis Sorted Set to serve our live daily leaderboard screen. Back then, we tried serving it from Rockset.
So every document would look like this:

We queried the collection for the top 10 users with the most steps on a certain day. When querying the collection to build the leaderboard we would use the RANK() operator. That method was not practical and was too slow/demanded high resources.

Now with Rollups, there might be a better way using this rollup definition:

SELECT
  userId,
  day,
  SUM(steps) AS steps,
  RANK() OVER(PARTITION BY day ORDER BY SUM(steps) DESC) AS position,
FROM
  _input
GROUP BY
  userId,
  day

This Rollups work and seem to sustain correct results as data is coming in. If you want to get the top 10 users you just do a simple “leaf” query for rows with rank 1,2,3… No sorting is needed so querying this collection is an order of magnitude faster than the old way.

My question is, will that Rollup scale under high traffic? Is it some kind of a “hack” I’ve managed to create or a legitimate Rollup definition?

Any help would be much appreciated.

nadine · September 17, 2021, 8:49pm

Hi @yaron,

You are doing the aggregations correctly:
SUM(steps) *AS* steps

This will be computed at ingestion time and not at query time.

Now, for the window part of the query:
RANK() OVER(PARTITION BY day ORDER BY SUM(steps) DESC) AS position

this will be actually run at query time, NOT, at ingestion time. Yes, this is scalable because the aggregation part of the query is now computed at ingest time and does not take up compute resources at query time.

Please let me know if this answers your questions, and I hope this helps.
Nadine

yaron · September 18, 2021, 10:25am

I didn’t understand (-:

So in query time nothing will run?
Everything will be done in ingest time?

nadine · September 18, 2021, 3:37pm

Hi @yaron!

Sure, no problem. So, it’s a mix of both. SUM(steps) AS steps will run at ingestion time. And, RANK() OVER(PARTITION BY day ORDER BY SUM(steps) DESC) AS position will run at query time. Does this help explain it? Let me know which part I should elaborate further.

Best,
n

yaron · September 19, 2021, 1:33am

Mm… are you sure?
Because when I look at the collection in the dashboard, the position (rank) property is there, always ready and updated. I don’t need to query the collection.

nadine · September 20, 2021, 4:51pm

Hi @yaron! This is a very great question. I think this blog will help you:

Particularly this part of the blog:

In short, SUM() is applied at data ingestion time. Then, we apply RANK() on top of the pre-calculated result of SUM() and this is ran at query time.

When you look at the preview and collection, you’ll see the full rollup query being run, including the RANK() part, and not just the part that happens at data ingestion. The part that’s important, is you’re saving on compute at query time when you apply the SUM() during rollups. But, compute is being used at query time for RANK().

Does this help a little more with your question? Please let me know.

Best,
n

yaron · September 20, 2021, 5:05pm

Thanks! Got it…

Topic		Replies	Views
Using DISTINCT in Rollups Open Q & A	1	435	September 13, 2021
Aggregating data Querying your data	1	481	June 30, 2021
Data clustering Open Q & A	5	463	September 13, 2021
How should I transform already ingested data Querying your data	0	437	March 14, 2022
Creating APIs Programmatically SDKs	1	521	August 20, 2021

Creating a "live leaderboard" with a Rollup collection

Related Topics