Data clustering

dserna · September 9, 2021, 7:30am

We have a huge table of 30M rows that we want to filter and run some GROUP BY queries on. Looking at the documentation, data clustering could be a good solution. Does anyone have more information about this feature? I think it could be a good way to improve performance on a huge data set that cannot be aggregated down to fewer than 1M rows.

Thanks.

bryfi · September 9, 2021, 4:15pm

Hey @dserna,

For your collection size, the data clustering option could definitely be helpful. Some things to keep in mind when using Data Clustering:

Data Clustering only works with the Rockset Column Index. This means that you would need to add HINT(access_path=column_scan) after each table in the FROM clause of your queries to leverage it (see this doc for more).

Data Clustering is a collection setting that must be set when the collection is created. Once it is turned on for your org, an existing collection would need to be recreated to take advantage of Data Clustering.

Data Clustering must be set via the Collection Creation API (see this doc for more). This means that any collection that uses Data Clustering has to be created through that API with Data Clustering enabled, and you will need to specify the attribute to cluster on (the API calls it the clustering_key).
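
As a rough sketch of the query hint and the creation request, the Python below builds a query string with the hint placed after the table in FROM, and a JSON body carrying a clustering_key. The collection name commons.events, the field names country and status, and both helper functions are hypothetical. The shape of each clustering_key entry (field_name, type) is an assumption that should be checked against the API doc before use. Nothing here sends a request.

```python
import json

# Hint text quoted from the post above.
COLUMN_SCAN_HINT = "HINT(access_path=column_scan)"


def from_with_column_scan(collection: str) -> str:
    """Return a FROM-clause fragment with the column-scan hint after the table."""
    return f"{collection} {COLUMN_SCAN_HINT}"


def build_create_collection_body(name: str, clustering_field: str) -> dict:
    """Build a JSON body for a create-collection request with clustering set.

    Only the key name `clustering_key` comes from the post; the shape of each
    entry (`field_name`, `type`) is an assumption.
    """
    return {
        "name": name,
        "clustering_key": [{"field_name": clustering_field, "type": "AUTO"}],
    }


# Hypothetical collection and field names, for illustration only.
query = (
    "SELECT country, COUNT(*) AS n "
    f"FROM {from_with_column_scan('commons.events')} "
    "WHERE status = 'active' "
    "GROUP BY country"
)
body = build_create_collection_body("events", "country")

print(query)
print(json.dumps(body, indent=2))
```

Clustering on a field that the queries filter or group on is the natural choice in this sketch, but which attribute works best depends on the actual query.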

Hopefully this helps. Let me know if you need anything else on it!

dserna · September 9, 2021, 4:34pm

Thanks for the help. We will try it and post the results here.

veeve · September 9, 2021, 5:21pm

Hi @dserna, have you shared that query with Rockset support? There may be other opportunities to tweak it to perform better that don’t involve clustering.

dserna · September 10, 2021, 1:07pm

Hi, yes, I shared the query some time ago. The problem is that it has a heavy “group by” over 30M rows and two “where” conditions. We are going to try a better ingest with rollups, leaving a final table where the two “where” conditions are no longer needed.

nadine · September 13, 2021, 6:25pm

Thanks @dserna! Let us know if more questions pop up!
