Query for selecting sequence of hits consumes large quantity of data - sql

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so it selects a massive quantity of data: just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way, and I gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.
Here's the offending query in full:
WITH
s1 AS (
SELECT
fullVisitorId,
visitId,
LOWER(h.page.pagePath) AS path,
device.deviceCategory AS platform,
MIN(h.time) AS s1_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
(LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%' OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%')
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
AND
h.type = "PAGE"
GROUP BY
path,
platform,
fullVisitorId,
visitId
ORDER BY
fullVisitorId ASC, visitId ASC
),
confirmations AS (
SELECT
fullVisitorId,
visitId,
MIN(h.time) AS confirmation_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
h.type = "PAGE"
AND
(LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%' OR LOWER(h.page.pagePath) LIKE '{confirmation-url-2}%')
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
GROUP BY
fullVisitorId,
visitId
)
SELECT
platform,
path,
COUNT(path) AS Views,
SUM(
CASE
WHEN s1.s1_time < confirmations.confirmation_time
THEN 1
ELSE 0
END
) AS SubsequentPurchases
FROM
s1
LEFT JOIN
confirmations
ON
s1.fullVisitorId = confirmations.fullVisitorId
AND
s1.visitId = confirmations.visitId
GROUP BY
platform,
path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally any method should be able to measure multiple different routes, but I'd settle for sustainability at this point.

There are probably a few ways you can optimize your query, but it seems like that won't entirely solve your issue (as I'll try to explain below).
As for the query, this one does the same but avoids re-selecting data and the LEFT JOIN operation:
SELECT
path,
platform,
COUNT(path) views,
COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
FROM (
SELECT
fv,
v,
platform,
path,
first_hn,
MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
FROM (
SELECT
fullvisitorid fv,
visitid v,
device.devicecategory platform,
LOWER(hits.page.pagepath) path,
MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE NULL END) first_hn,
MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') THEN hits.hitnumber ELSE NULL END) last_hn
FROM `project_id.data_set.ga_sessions_20170112`,
UNNEST(hits) hits
WHERE
REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
AND totals.visits = 1
AND hits.type = 'PAGE'
GROUP BY
fv, v, path, platform
)
)
GROUP BY
path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel-start URL (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success"; you could add more values to the regex). Also, using the MIN and MAX operations together with the analytic function gives you some optimization in the query.
There are a few points though to make here:
If you insert WHERE hits.hithour = 20, BigQuery still has to scan the whole table to work out which rows belong to hour 20 and which don't. That means the 23GB you observed still accounts for the whole day.
For comparison, I tested your query against our ga_sessions and it took around 31 days of data to reach 23GB. As you are not selecting that many fields, it shouldn't be that easy to reach this amount unless you have a considerably high traffic volume coming from your data source.
Given current BigQuery pricing, 23GB costs roughly $0.11 to process, which is quite cost-efficient.
Another possibility is that you are running this query several times a day without any caching or a proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change that much in the end, as you seem to have quite a high volume of data. Processing 23GB a few times shouldn't be a problem, but if you are concerned about reaching your quota then it does sound like you are running this query several times a day.
This being the case, see if using the cache flag, or saving the results into another table and querying that instead, helps. You could also start saving daily tables with just the sessions you are interested in (those containing the URL patterns you are looking for) and then run your final query against these newly created tables, which would let you query over a bigger range of days while spending much less.
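As a rough illustration of that last idea, the daily extraction could be something like the query below, run with a destination table configured (for example dataset.funnel_sessions_20170107; the table name and column list here are just placeholders, not something from your project). The funnel query would then read this much smaller table instead of ga_sessions_*:
-- keep only the hits the funnel analysis needs for one day
SELECT
fullVisitorId,
visitId,
device.deviceCategory AS platform,
h.hitNumber,
h.time,
LOWER(h.page.pagePath) AS path
FROM
`project.dataset.ga_sessions_20170107`, UNNEST(hits) AS h
WHERE
totals.visits = 1
AND h.type = 'PAGE'
AND (LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%'
OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%'
OR LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%'
OR LOWER(h.page.pagePath) LIKE '{confirmation-url-2}%')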

Related

AWS Timestream query to get average measure for the first month of samples

In AWS Timestream I am trying to get the average heart rate for the first month since we have received heart rate samples for a specific user and the average for the last week. I'm having trouble with the query to get the first month part. When I try to use MIN(time) in the where clause I get the error: WHERE clause cannot contain aggregations, window functions or grouping operations.
SELECT * FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time < min(time) + 30
If I add it as a column and try to query on the column, I get the error: Column 'first_sample_time' does not exist
SELECT MIN(time) AS first_sample_time FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time > first_sample_time
Also if I try to add to MIN(time) I get the error: line 1:18: '+' cannot be applied to timestamp, integer
SELECT MIN(time) + 30 AS first_sample_time FROM "DATABASE"."TABLE"
Here is what I finally came up with, but I'm wondering if there is a better way to do it:
WITH first_month AS (
SELECT
Min(time) AS creation_date,
From_milliseconds(
To_milliseconds(
Min(time)
) + 2628000000
) AS end_of_first_month,
USER
FROM
"DATABASE"."TABLE"
WHERE
USER = 'xxx'
AND measure_name = 'heart_rate'
GROUP BY
USER
),
first_month_avg AS (
SELECT
Avg(hm.measure_value :: DOUBLE) AS first_month_average,
fm.USER
FROM
"DATABASE"."TABLE" hm
JOIN first_month fm ON hm.USER = fm.USER
WHERE
measure_name = 'heart_rate'
AND hm.time BETWEEN fm.creation_date
AND fm.end_of_first_month
GROUP BY
fm.USER
),
last_week_avg AS (
SELECT
Avg(measure_value :: DOUBLE) AS last_week_average,
USER
FROM
"DATABASE"."TABLE"
WHERE
measure_name = 'heart_rate'
AND time > ago(14d)
AND USER = 'xxx'
GROUP BY
USER
)
SELECT
lwa.last_week_average,
fma.first_month_average,
lwa.USER
FROM
first_month_avg fma
JOIN last_week_avg lwa ON fma.USER = lwa.USER
Is there a better or more efficient way to do this?
I can see you've run into a few challenges along the way to your solution, and hopefully I can clear these up for you and also propose a cleaner way of reaching your solution.
Filtering on aggregates
As you've experienced first hand, SQL doesn't allow aggregates in the WHERE clause, and you also cannot filter on new columns you've created in the SELECT list, such as aggregates or CASE expressions, as those columns/results are not present in the table you're querying.
Fortunately there are ways around this, such as:
Making your main query a subquery, and then filtering on the result of that query, like below
Select * from (select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3) where total_good_stuff > 69
This works because the aggregate column (the count) is no longer an aggregate at the time it's referenced in the WHERE clause; it's just a column in the result of the subquery.
Having clause
If a subquery isn't your cup of tea, you can use the HAVING clause straight after your GROUP BY, which acts like a WHERE clause but exclusively for handling aggregates.
This is better than resorting to a subquery in most cases, as it's more readable and, I believe, more efficient.
select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3 having total_good_stuff > 69
Finally, window functions are fantastic... they've really helped condense many queries I've made in the past by removing the need for subqueries/CTEs. If you could share some example raw data (with any PII removed, of course) I'd be happy to share an example for your use case.
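In the meantime, here's a rough, untested sketch of that idea applied to your heart-rate table; it assumes Timestream accepts a window function in a derived table like this, and it reuses the millisecond arithmetic from your own query:
SELECT Avg(hr) AS first_month_average
FROM (
SELECT
measure_value :: DOUBLE AS hr,
time,
-- earliest heart_rate sample for this user, repeated on every row
Min(time) OVER () AS first_sample_time
FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate'
AND USER = 'xxx'
)
WHERE time <= From_milliseconds(To_milliseconds(first_sample_time) + 2628000000)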
Nevertheless, hope this helps!
Tom

Is there a performance benefit to repeating a WHERE filter in subqueries?

I have the following query:
WITH prices AS (
SELECT itemId
, monthId
, MIN(lastPrice / firstPrice) AS gain
FROM (
SELECT *
, FIRST_VALUE(price) OVER (PARTITION BY monthId
ORDER BY date) AS firstPrice
, LAST_VALUE(price) OVER (PARTITION BY monthId
ORDER BY date) AS lastPrice
FROM (
SELECT *
FROM foo
WHERE monthId = 82 -- a repeat of the final WHERE
) x
) x
WHERE firstPrice != 0
AND lastPrice != 0
GROUP BY itemId
, monthId
)
SELECT f.monthId
, f.itemId
, p.gain
FROM foo f
LEFT JOIN prices p
ON f.itemId = p.itemId
AND f.monthId = p.monthId
WHERE gain IS NOT NULL
AND monthId = 82 -- repeated above
As noted, the full query ends with a WHERE monthId = 82 clause, which is also present in the prices subquery.
If I remove the WHERE from the subquery, the result is the same. This makes sense since the result would be naturally constrained by the final WHERE.
However, the version without the subquery WHERE runs dramatically slower (40 vs. 3 minutes). I'm not proficient enough at SQL to know if this is expected or if it's merely an artifact of statistics (I've run the version with the subquery WHERE many, many times already and only now tried removing it).
It'd make sense for this to improve performance since it allows the server to only perform the operations within prices (there are many more in my real case) on the subset of rows with monthId = 82. However, I don't know if the compiler already optimizes the subquery to filter it with that subset regardless and therefore the benefit I'm seeing is merely an illusion.
For the record, my actual FIRST_VALUE/LAST_VALUE calls include ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING; I omitted them to simplify the query.
The SQL Server optimizer is smart enough to push WHERE filters into subqueries under many circumstances. However, optimizers make mistakes and miss situations, as would appear to be the case here. In general, you can check the query plan to see if it makes a difference.
I would be inclined to repeat the logic, just to be sure that the query is as efficient as possible.
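If you want to confirm rather than guess, a quick way to compare the two variants in SQL Server (assuming you're running them from SSMS) is:
-- run once per session, then execute each variant and compare the output in the Messages tab
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- compare the logical reads against foo and the elapsed times for the two variants,
-- or view the estimated/actual execution plans (Ctrl+L / Ctrl+M in SSMS) to see
-- whether the monthId = 82 predicate is pushed below the window functions.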

Google Analytics session-scoped fields returning multiple values

I've discovered that there are certain GA "session" scoped fields in BigQuery that have multiple values for the same fullVisitorId and visitId fields. See the example below:
Grouping the fields doesn't help either. In GA, I've checked the number of users vs number of users split by different devices. The user count is different:
This explains what's going on: a user gets grouped under multiple devices. My conclusion is that at some point during the user's session their browser user-agent changes, and in the subsequent hit a new device type is set in GA.
I'd have hoped GA would use either the first or last value to avoid this scenario, but apparently it doesn't. So, accepting this as a "flaw" in GA, I'd rather pick one value. What's the best way to select the last or first device value from the below query:
SELECT
fullVisitorId,
visitId,
device.deviceCategory
FROM (
SELECT
*
FROM
`project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT
*
FROM
`project.dataset.ga_sessions_*` mobile ) table
I've tried doing a sub-select with STRING_AGG(), ordering by hits.time and limiting to one value, and that still creates another row.
I've tested and found that the below fields all have the same issue:
visitNumber
totals.hits
totals.pageviews
totals.timeOnSite
trafficSource.campaign
trafficSource.medium
trafficSource.source
device.deviceCategory
totals.sessionQualityDim
channelGrouping
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceBranding
UPDATE
See below queries around this particular fullVisitorId and visitId - UNION has been removed:
visitStartTime added:
visitStartTime and hits.time added:
Well, from the looks of things, I think you have 3 options:
1 - Group by fullVisitorId and visitId, and use MAX or MIN of deviceCategory. That should prevent a device switcher from being double-counted. It's kind of arbitrary, but then so is the GA data.
2 - Option two is similar but, if the deviceCategory result can be anything (i.e. isn't constrained in the results to just the valid deviceCategory values), you can use a CASE to check whether MAX(deviceCategory) = MIN(deviceCategory) and, if they differ, return 'Multiple Devices'.
3 - You could go further: count the number of different devices used, construct a concatenation that lists them in some way, and so on (a sketch of this is included at the end of this answer).
I'm going to write up Number 2 for you. In your question, you have 2 different queries: one with [date] and one without - I'll provide both.
Without [date]:
SELECT
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY fullVisitorId, visitId
With [date]:
SELECT
[date],
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY [date], fullVisitorId, visitId
I'm assuming here that the Selects and Union that you gave are sound.
Also, I should point out that those {metric aggregations} should be something other than SUMs, otherwise you will still be double-counting.
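For completeness, here is an untested sketch of option 3, counting and listing the devices per session (metric aggregations would be added just as in the queries above):
SELECT
fullVisitorId,
visitId,
COUNT(DISTINCT device.deviceCategory) AS deviceCount,
STRING_AGG(DISTINCT device.deviceCategory, ', ') AS devicesUsed
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
)
GROUP BY fullVisitorId, visitId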
I hope this helps.
It's simply not possible to have two values for one row in this field, because it can only contain one value.
There are 2 possibilities:
you're actually querying two separate datasets / two different views - that's not clearly visible from the example code. Client id (= fullVisitorId) is only unique per property (the tracking id, the UA-xxxxx value). If you query two views from different properties, you have to expect to see the same ids used twice.
Given they are coming from one property, these two rows could actually be one session split across midnight, which means visitId stays the same but visitStartTime changes. But that would also mean the decision algorithm for device type changed in the meantime ... that would be curious.
Try using visitStartTime and see what happens.
If you're using two different properties, either use a user id to combine the sessions, or keep them separate by adding a constant per property - without a user id you can't combine them.
SELECT 'property_A' AS constant FROM ...
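A fuller illustration of that idea, with made-up dataset names standing in for the two properties:
SELECT 'property_A' AS property, fullVisitorId, visitId, device.deviceCategory
FROM `project.dataset_a.ga_sessions_*`
UNION ALL
SELECT 'property_B' AS property, fullVisitorId, visitId, device.deviceCategory
FROM `project.dataset_b.ga_sessions_*`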
hth

Understanding what leads to "resources exceeded" error in GBQ?

I'm working in BigQuery on Google Analytics data. At various points in developing the query I get the error: "Resources exceeded". I want to further my understanding of what's happening. I've successfully worked around the problem, but only via trial and error.
When I use the explain tool it seems to be the 'compute' part of any query or sub-query that looks to have exceeded resources.
Here's an example of a standard SQL query that succeeds/fails depending on whether certain parts are left in:
SELECT
fullVisitorId,
visitId,
h.type AS type,
h.hitNumber AS hitNumber,
h.eventInfo.eventAction AS action,
LOWER(h.eventInfo.eventCategory) AS category,
h.page.pagePath AS page,
h.page.pageTitle AS landingTitle,
h.page.searchKeyword AS searchTerm,
LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick
FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170430'
AND (
(
h.eventInfo.eventAction = 'click' AND LOWER(h.eventInfo.eventCategory) LIKE '/search%'
)
OR type = 'PAGE'
)
ORDER BY
fullVisitorId ASC, visitId ASC, h.hitNumber ASC
When removing any one of these sets of elements the query runs:
ORDER BY
fullVisitorId ASC, visitId ASC, h.hitNumber ASC
Or:
LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick
Or:
When running on a single date partition the entire query runs.
I would describe my current level of understanding as superficial; I know little of the inner workings of GBQ and how it allocates/permits compute resources. I do know that it performs calculations on separate machines where possible. I've heard these described as shards before.
What do I need to know about GBQ compute resources in order to understand why the above will work/not work?
N.B: I only have Tier 1 access, but that doesn't mean I can't gain increased access if I can justify a need. Obviously I don't want to do this with current level of understanding.
I think the only thing that should be causing a problem in your query is the ORDER BY operation. As you can see in this answer from Jordan, this operation is not parallelizable. You can also check the docs for some ideas of what causes the Resources Exceeded error.
The rest of the query seems to be fine, though. I tested your query against our data and it processed almost 300GB in 20s.
If you still get the error then maybe you are querying quite a large amount of data. In that case, you could try breaking the query into smaller date ranges, querying fewer columns, adding WHERE conditions to filter out rows, changing billing tier, and so on.
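For what it's worth, a commonly suggested workaround when the global sort is the problem is to cap it with a LIMIT, since ORDER BY together with LIMIT can be satisfied with a partial top-N sort instead of sorting the full result on one worker. A trimmed-down version of your query just to show the shape (column list shortened for brevity):
SELECT
fullVisitorId,
visitId,
h.hitNumber AS hitNumber,
h.page.pagePath AS page
FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170430'
AND h.type = 'PAGE'
ORDER BY fullVisitorId ASC, visitId ASC, h.hitNumber ASC
LIMIT 1000 -- an arbitrary cap; drop the ORDER BY entirely if you don't need ordered output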

Selecting different customDimensions on BigQuery integration of GA

Sorry for the specific question but I feel like I've hit a dead end because my knowledge of SQL doesn't go that far.
The data that comes out from the BigQuery implementation of GoogleAnalytics raw data looks like this:
|-visitId
|- date
|- (....)
+- hits
|- time
+- customDimensions
|- index
|- value
+- customMetrics
|- index
|- value
I know there are hits that always send some data to GA. Specifically, I want customDimensions.index = 43, customDimensions.index = 24 and customMetrics.index = 14. Dimension 43 is the object being viewed or sold, dimension 24 tells me whether it is being viewed, and metric 14 has the value 1 when it has just been sold.
My final result should have three columns, grouped by the first one:
customDimensions.value (when index = 43)
count of hits where customDimensions.index = 24 and customDimensions.value = 'ficha'
count of hits where customMetrics.index = 14 and customMetrics.value = 1
I know that every time a hit is sent with customMetrics.index = 14, the same hit also has customDimensions.index = 43; in the same way, a hit with customDimensions.index = 24 always has a customDimensions.index = 43.
I actually managed to write a query that does what I want, but at what cost? It's big, it's slow, it's ugly. What I'm currently doing is:
Create three tables, all having visitId, hit.time and the value when index=14,24,43
Left join 43 with 24 ON 43.visitId==24.visitId AND 43.hits.time==24.hits.time as result
Left join result with 14 ON 14.visitId==result.visitId AND 14.hits.time==result.hits.time
I'm not interested in visitId or hits.time themselves; they're just a way to relate the same hits (and to know which product was bought when customMetrics.index = 14 and value = 1).
This is my code:
SELECT Tviviendasvisitas.viviendaId as ViviendaID ,sum(Tviviendasvisitas.NumeroVisitas) as NumeroVisitas,sum(Ttransacciones.Transactions) as Transactions FROM (
SELECT Tviviendas.visitId as visitId, Tviviendas.hits.time as visitTime, Tviviendas.ViviendaID as viviendaId,Tvisitas.visitas as NumeroVisitas FROM (
SELECT visitId,hits.time,hits.customDimensions.value as ViviendaID FROM ((TABLE_DATE_RANGE([-------.ga_sessions_], TIMESTAMP('2014-09-01'), TIMESTAMP('2014-09-30'))))
WHERE hits.customDimensions.index = 43
GROUP EACH BY visitId,hits.time, ViviendaID)as Tviviendas
LEFT JOIN EACH(
SELECT visitId,hits.time,count(*) as visitas FROM ((TABLE_DATE_RANGE([-------.ga_sessions_], TIMESTAMP('2014-09-01'), TIMESTAMP('2014-09-30'))))
WHERE hits.customDimensions.index = 24 AND hits.customDimensions.value=='ficha'
GROUP EACH BY visitId,hits.time) as Tvisitas
ON Tvisitas.visitId==Tviviendas.visitId AND Tvisitas.time==Tviviendas.time) as Tviviendasvisitas
LEFT JOIN EACH (
SELECT visitId ,hits.time as transactionTime, sum(hits.customMetrics.value) as Transactions FROM(TABLE_DATE_RANGE([-------.ga_sessions_], TIMESTAMP('2014-09-01'), TIMESTAMP('2014-09-30')))
WHERE hits.customMetrics.index = 14 AND hits.customMetrics.value=1
GROUP BY visitId, transactionTime) as Ttransacciones
ON Tviviendasvisitas.visitId==Ttransacciones.visitId AND Tviviendasvisitas.visitTime==Ttransacciones.transactionTime
GROUP BY ViviendaID
Running this query takes way too much time for me to create a proper dashboard with the results.
So help me God if that's my final result. I feel like there should be a WAY more elegant solution to this problem but I can't seem to find it on my own.
Help?
You should be able to structure this query without the joins by using BigQuery's scoped aggregation (the WITHIN clause). Here is a small example, which may not be exactly the logic you want, but should illustrate some of the possibilities:
SELECT visitId, hits.time,
SOME(hits.customDimensions.index = 43) WITHIN RECORD AS has43,
SUM(IF(hits.customDimensions.index = 24 AND hits.customDimensions.value = 'ficha', 1, 0)) WITHIN RECORD AS numFichas,
SUM(IF(hits.customMetrics.index = 14, hits.customMetrics.value, 0)) WITHIN RECORD AS totalValues
FROM ((TABLE_DATE_RANGE([-------.ga_sessions_], TIMESTAMP('2014-09-01'), TIMESTAMP('2014-09-30'))))
HAVING has43
The example shows three WITHIN RECORD aggregations, meaning they will be computed over the repeated fields of a single record. SOME() takes a boolean expression and returns true if any field within the record satisfies that expression. So has43 will be true for visits that have one or more hits with customDimensions.index = 43. The HAVING clause filters out records where that is false.
The SUM(IF(...)) expressions compute the total number of customDimensions with index = 24 and value = 'ficha' and the total values associated with the customMetrics with index = 14.
If you just want to get the value from a hit-level customDimension and add it to its own column, here is a neat trick:
SELECT fullVisitorId, visitId, hits.hitNumber,
MAX(IF(hits.customDimensions.index=43,
hits.customDimensions.value,
NULL)) WITHIN hits AS product,
FROM [tableID.ga_sessions_20150305]
LIMIT 100
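Putting the two ideas together, here is an untested sketch (still legacy SQL, reusing your date range and index numbers) of the grouped result you described. Legacy SQL's flattening rules can be fussy, so you may need to wrap the inner query in FLATTEN((...), hits):
SELECT
product,
SUM(fichas) AS NumeroVisitas,
SUM(transactions) AS Transactions
FROM (
SELECT
fullVisitorId,
hits.hitNumber,
-- one row per hit: the index=43 value sits alongside the counts from the same hit
MAX(IF(hits.customDimensions.index = 43, hits.customDimensions.value, NULL)) WITHIN hits AS product,
SUM(IF(hits.customDimensions.index = 24 AND hits.customDimensions.value = 'ficha', 1, 0)) WITHIN hits AS fichas,
SUM(IF(hits.customMetrics.index = 14, hits.customMetrics.value, 0)) WITHIN hits AS transactions
FROM (TABLE_DATE_RANGE([-------.ga_sessions_], TIMESTAMP('2014-09-01'), TIMESTAMP('2014-09-30')))
)
WHERE product IS NOT NULL
GROUP BY product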