I’m currently trying to improve a query where the bottleneck seems to be evaluating if two arrays intersect. For context, we have text campaigns as one table and another table with an audience of users to text. Filtering by an audience is optional, and it’s only when we do this filter that the query performance is significantly slower. When filtering campaigns by audience, we want to return only the campaigns where at least one of the users that were messaged are in the audience. My goal is to get under 30 seconds in all cases but closer to 10 seconds consistently would be reasonable.
- Specifying an audience is optional (hence the audience_id=‘’ logic)
- campaigns.messagedUsers = array of strings
- audiences.contact_ids = array of strings
- array size is usually between 0-100k in length
- using large VI
Simplified query:
SELECT
...
FROM
development.campaigns
WHERE campaigns.messagingService = :messaging_service
AND campaigns.scheduledAt >= :start_date
AND campaigns.scheduledAt < :end_date
-- If audience is provided, check that members of the audience were part of the messaged users
AND (
(:audience_id != ''
AND LENGTH(ARRAY_INTERSECT(campaigns.messagedUsers, (SELECT audiences.contact_ids
FROM development.audiences
WHERE audiences.audience_id = :audience_id)))> 0
)
OR
(:audience_id = '')
)
GROUP BY
...
ORDER BY
... ASC;
Results:
- 24 rows in 37354ms
- 24 rows in 7757ms
- 24 rows in 14136ms
The query profiler shows the most expensive operation is the Evaluate operation.
I’m assuming some of the variation is due to ingestion but currently, there are no other queries being run against the instance.
What I’ve tried:
I thought that I might be able to get more performance by using a Javascript UDF because I could return as soon as I found at least one instance where a messaged user was in an audience vs evaluating both arrays fully and then checking the results.
script {{{
export function array_any(arr1, arr2) {
return arr1.some(elem=>arr2.indexOf(elem) !== -1);
}
}}}
SELECT
...
FROM
development.campaigns
WHERE campaigns.messagingService = :messaging_service
AND campaigns.scheduledAt >= :start_date
AND campaigns.scheduledAt < :end_date
-- If audience is provided, check that members of the audience were part of the messaged users
AND (
(:audience_id != ''
AND _script.array_any(campaigns.messagedUsers, (SELECT audiences.contact_ids
FROM development.audiences
WHERE audiences.audience_id = :audience_id)) = true
)
OR
(:audience_id = '')
)
GROUP BY
...
ORDER BY
... ASC;
Results:
- 24 rows in 26358ms
- 24 rows in 23068ms
- 24 rows in 57817ms
- 24 rows in 23185ms
This query profile was from the last execution (23s) and one thing I noticed is that the hash join is now the most expensive operation but it doesn’t seem right beceause the cpu time for the udf seems to be way smaller than I would expect. And the cpu time totals on the right overall don’t add up to 23s but maybe they aren’t supposed to?
At this point, I felt like it would be better to ask for help than continue blindly trying things. Any direction would be greatly appreciated!