Ways to improve performance of array_intersect?

I’m currently trying to improve a query where the bottleneck seems to be evaluating if two arrays intersect. For context, we have text campaigns as one table and another table with an audience of users to text. Filtering by an audience is optional, and it’s only when we do this filter that the query performance is significantly slower. When filtering campaigns by audience, we want to return only the campaigns where at least one of the users that were messaged are in the audience. My goal is to get under 30 seconds in all cases but closer to 10 seconds consistently would be reasonable.

  • Specifying an audience is optional (hence the audience_id=‘’ logic)
  • campaigns.messagedUsers = array of strings
  • audiences.contact_ids = array of strings
  • array size is usually between 0-100k in length
  • using large VI

Simplified query:

SELECT
	...
FROM
	development.campaigns
WHERE campaigns.messagingService = :messaging_service
	AND campaigns.scheduledAt >= :start_date
	AND campaigns.scheduledAt < :end_date
    -- If audience is provided, check that members of the audience were part of the messaged users
    AND (
      (:audience_id != ''
         AND LENGTH(ARRAY_INTERSECT(campaigns.messagedUsers, (SELECT audiences.contact_ids
        FROM development.audiences
        WHERE audiences.audience_id = :audience_id)))> 0
      )
         OR
         (:audience_id = '')
         )
GROUP BY
	...
ORDER BY
	... ASC;

Results:

  • 24 rows in 37354ms
  • 24 rows in 7757ms
  • 24 rows in 14136ms


The query profiler shows the most expensive operation is the Evaluate operation.

I’m assuming some of the variation is due to ingestion but currently, there are no other queries being run against the instance.

What I’ve tried:
I thought that I might be able to get more performance by using a Javascript UDF because I could return as soon as I found at least one instance where a messaged user was in an audience vs evaluating both arrays fully and then checking the results.

script {{{
  export function array_any(arr1, arr2) {
    return arr1.some(elem=>arr2.indexOf(elem) !== -1);
  }
}}}

SELECT
	...
FROM
	development.campaigns
WHERE campaigns.messagingService = :messaging_service
	AND campaigns.scheduledAt >= :start_date
	AND campaigns.scheduledAt < :end_date
    -- If audience is provided, check that members of the audience were part of the messaged users
    AND (
      (:audience_id != ''
        AND _script.array_any(campaigns.messagedUsers, (SELECT audiences.contact_ids
        FROM development.audiences
        WHERE audiences.audience_id = :audience_id)) = true
      )
         OR
         (:audience_id = '')
         )
GROUP BY
	...
ORDER BY
	... ASC;

Results:

This query profile was from the last execution (23s) and one thing I noticed is that the hash join is now the most expensive operation but it doesn’t seem right beceause the cpu time for the udf seems to be way smaller than I would expect. And the cpu time totals on the right overall don’t add up to 23s but maybe they aren’t supposed to?

At this point, I felt like it would be better to ask for help than continue blindly trying things. Any direction would be greatly appreciated!

Hello Jonathan,

I think I see a way to improve the performance here. If I understand correctly, this is the expensive part of your query:

LENGTH(ARRAY_INTERSECT(campaigns.messagedUsers, (SELECT audiences.contact_ids
        FROM development.audiences
        WHERE audiences.audience_id = :audience_id)))> 0

I think you may be able to get better performance by pushing down the predicate on messagedUsers into the subquery. That would look something like this:

EXISTS(SELECT 1
        FROM development.audiences
        WHERE audiences.audience_id = :audience_id AND ARRAY_CONTAINS(campaigns.messagedUsers, audiences.contact_ids)
)

This will allow us to use the index on audiences to more quickly detect intersection. Please forgive me if I’ve made any syntax mistakes.

If you would like more help, please reach out over Live Chat so we can assist you one-on-one. If you can also grant us access to your VI to provide support (under Settings → Security → Rockset Support Access) we will be able to provide more in-depth support for your data set.

Have a great day,
Ben

1 Like