Data clustering

We have a huge collection of about 30M rows that we need to filter and group by. From the documentation, data clustering looks like it could be a good solution. Does anyone have more information about this feature? I think it could be a good way to improve performance on a huge data set that cannot be aggregated down to fewer than 1M rows.

Thanks.

Hey @dserna,

For your collection size, the data clustering option could be helpful for sure. Some things to keep in mind with using Data Clustering:

Data Clustering only works with the Rockset Column Index. This means you would need to add HINT(access_path=column_scan) after each collection in the FROM clause of your queries to leverage it (see this doc for more).

Data Clustering is a collection setting that must be set when the collection is created. If this is turned on for your org, any existing collection would need to be recreated to take advantage of Data Clustering.

Data Clustering must be set via the Collection Creation API (see this doc for more). Any collection that uses Data Clustering has to be created through that API with Data Clustering enabled, and you need to specify the attribute you want to cluster on (the API calls it the clustering_key).
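
To make the points above concrete, here is a rough Python sketch that only builds the two pieces involved: a request body for the Collection Creation API and a query carrying the column scan hint. It does not call Rockset. The collection name (events), the field names, and the filters are made up for illustration, and the shape of each clustering_key entry (field_name plus type) is an assumption that should be checked against the API doc before use.

```python
import json


def build_create_collection_body(name, clustering_field):
    """Return a request body for creating a collection with clustering enabled."""
    return {
        "name": name,
        # "clustering_key" is the attribute named in the post above; the shape
        # of each entry (field_name/type) is an assumption to verify in the docs.
        "clustering_key": [{"field_name": clustering_field, "type": "AUTO"}],
    }


def build_clustered_query(collection, group_field, filters):
    """Return a filter + GROUP BY query that asks for the column index."""
    where = " AND ".join(filters)
    return (
        f"SELECT {group_field}, COUNT(*) AS n\n"
        # The hint goes right after the collection name in the FROM clause.
        f"FROM {collection} HINT(access_path=column_scan)\n"
        f"WHERE {where}\n"
        f"GROUP BY {group_field}"
    )


# Hypothetical collection, field, and filters, for illustration only.
body = build_create_collection_body("events", "customer_id")
query = build_clustered_query(
    "events", "customer_id", ["status = 'active'", "country = 'ES'"]
)

print(json.dumps(body, indent=2))
print(query)
```

The body would be sent as JSON when creating the collection, and the query text is what you would run afterwards against the recreated collection.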

Hopefully this helps. Let me know if you need anything else on it!

Thanks for the help. We will try it and post the results here.

Hi @dserna, have you shared that query with Rockset support? There may be other opportunities to tweak it to perform better that don’t involve clustering.

Hi, yes, some time ago I shared the query. The problem is that it has a heavy “group by” over 30M rows and two “where” filters. We are going to try a better ingest with rollups and end up with a final table that no longer needs the two “where” filters.
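
For anyone following along, the idea behind the rollup approach can be sketched in plain Python. This is not Rockset's rollup syntax, only an illustration of pre-aggregating rows by the grouping fields at ingest time so that the query-time "group by" has far fewer rows to scan. The field names and sample rows are hypothetical.

```python
from collections import defaultdict


def rollup(rows, dimensions, measure):
    """Collapse raw rows into one aggregated row per combination of dimensions."""
    totals = defaultdict(lambda: {"row_count": 0, "total": 0})
    for row in rows:
        # Rows that share the same dimension values land in the same bucket.
        key = tuple(row[d] for d in dimensions)
        totals[key]["row_count"] += 1
        totals[key]["total"] += row[measure]
    # Rebuild flat records: dimension values plus the aggregated measures.
    return [dict(zip(dimensions, key), **agg) for key, agg in totals.items()]


# Hypothetical raw rows, for illustration only.
raw_rows = [
    {"country": "ES", "status": "active", "amount": 10},
    {"country": "ES", "status": "active", "amount": 5},
    {"country": "FR", "status": "inactive", "amount": 7},
]

rolled = rollup(raw_rows, ["country", "status"], "amount")
print(rolled)
```

The more raw rows share the same dimension values, the smaller the rolled-up table becomes relative to the original.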

Thanks @dserna! Let us know if any more questions pop up!