Quick one on BigQuery SQL - Ecommerce Data

I am trying to replicate Google Analytics data in BigQuery but couldn't get the numbers to match.
Basically, I am using Custom Dimension 40 (user subscription status),
but I am getting wrong numbers in BQ.
Can someone help me with this?
This is the query I am using, but I couldn't work out the correct one.
SELECT
  (SELECT value FROM hits.customDimensions WHERE index = 40) AS UserStatus,
  COUNT(hits.transaction.transactionId) AS Unique_Purchases
FROM
  `xxxxxxxxxxxxx.ga_sessions_2020*` AS GA, --new rollup
  UNNEST(GA.hits) AS hits
WHERE
  (SELECT value FROM hits.customDimensions WHERE index = 40) IN ("xx001","xxx002")
GROUP BY 1
This is what I get from BigQuery, and it is wrong.
I have checked the dates as well but don't know why it's wrong.

Your question is rather unclear, but since you want something to be unique and the numbers are mysteriously not what you expect, I would suggest using COUNT(DISTINCT):
COUNT(DISTINCT hits.transaction.transactionId) AS Unique_Purchases
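Applied to the query from the question, that would look roughly like this (a sketch only; the table name and dimension values are the placeholders from the question):
SELECT
  (SELECT value FROM hits.customDimensions WHERE index = 40) AS UserStatus,
  -- COUNT(DISTINCT ...) collapses any duplicate transactionId values across unnested hits
  COUNT(DISTINCT hits.transaction.transactionId) AS Unique_Purchases
FROM
  `xxxxxxxxxxxxx.ga_sessions_2020*` AS GA,
  UNNEST(GA.hits) AS hits
WHERE
  (SELECT value FROM hits.customDimensions WHERE index = 40) IN ("xx001","xxx002")
GROUP BY 1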

As far as I understand, you imported Google Analytics data into BigQuery and you are trying to group by the custom dimension with index 40 and values ("xx001","xxx002") in order to know how many hit transactions were performed for each of these dimension values.
Replicating your scenario and trying to execute the query you posted, I got an error.
However, I created a query that could help with your use case. First, it selects the transactionId and the dimension value for hits where the transactionId is not null and the dimension index equals 40; then it groups by the dimension value, filtered to the values "xx001" and "xxx002".
WITH tx AS (
  SELECT
    HIT.transaction.transactionId,
    CD.value
  FROM
    `xxxxxxxxxxxxx.ga_sessions_2020*` AS GA,
    UNNEST(GA.hits) AS HIT,
    UNNEST(HIT.customDimensions) AS CD
  WHERE
    HIT.transaction.transactionId IS NOT NULL
    AND CD.index = 40
)
SELECT
  tx.value AS UserStatus,
  COUNT(tx.transactionId) AS Unique_Purchases
FROM tx
WHERE tx.value IN ("xx001","xxx002")
GROUP BY tx.value
For further details about the format and schema of the data that is imported into BigQuery, I found this document.

Related

Best practices for dealing with duplicate rows caused by unnested records in BigQuery?

Working with data coming from Facebook more often than not involves working with nested records, which, in my case, is where all the “spicy” data lives. There is a downside, however, namely the huge number of duplicate rows, which, when not handled properly, can cause over-reporting and/or data discrepancies.
Below is a use case that, when joined with my primary data (coming from tables that do not involve any unnesting), causes a slight discrepancy in the final numbers.
Technologies used - Facebook Data -> Stitch -> BigQuery -> dbt -> Google Data Studio
I would usually create separate models where I’d unnest a record, transform the data and then join it into the rest of my models. An example of this is getting all website purchase conversions from the ads_insights table’s actions record.
Here is the difference though:

Query:
SELECT count(*) AS row_count
FROM ads_insights
Result:
 row_count - 316

Query:
SELECT count(*) AS row_count
FROM ads_insights,
UNNEST(actions) AS actions
Result:
 row_count - 5612

After unnesting, I’d use the row data to create columns for each conversion like so:
CASE WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase' THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0) ELSE 0 END AS website_purchase

And finally I would join this model to the rest of my models. The only problem is that those 5600 rows cause a slight discrepancy when joined with the rest, and since I’ve already used the row data to create the columns, I don’t care about the unnested record data anymore, and I can go back to my original 316 rows. The only question is how? What techniques are out there that will help me clean up my model?
Solution:
Even though at some point I'd aggregate and group all the fields in my query as dylanbaker suggested in his answer, the discrepancy would still persist. After doing a deep dive into my data I found that the unnested query returns 279 rows, whereas the nested one returns 314. This focused my attention on the unnesting, which removes 35 rows, and those 35 rows happened to have null records. After some Googling I found this StackOverflow article, which suggests using LEFT JOIN UNNEST to preserve all rows that have null record values, instead of CROSS JOIN UNNEST, which removes them.
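A minimal sketch of that fix, reusing the ads_insights table and actions record from above (names are taken from the post, not a definitive implementation):
SELECT
  ads_insights.*,
  action.action_type
FROM ads_insights
-- LEFT JOIN UNNEST keeps rows whose actions array is NULL or empty (action comes back as NULL),
-- whereas the comma form `FROM ads_insights, UNNEST(actions)` is a CROSS JOIN and drops them
LEFT JOIN UNNEST(actions) AS action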
You would typically want to do a 'pivot' here. You're most of the way there, you just need to sum and group by the relevant columns in order to get this back to the grain that you originally had and want.
I believe you'll want something like this:
select
  ads_insights.some_column,
  ads_insights.some_other_column,
  sum(case
        when value.action_type = 'offsite_conversion.fb_pixel_purchase'
          then coalesce(value._28d_click, 0) + coalesce(value._1d_view, 0)
        else 0
      end) as website_purchase
from ads_insights,
  unnest(actions) as actions
group by 1, 2
The initial columns would be whatever you want from the original table. The 'sum case whens' would be to pivot and aggregate the unnested data.
You can actually do some magic with UNNEST inside the SELECT statement.
Does this work for you?
SELECT
some_column,
(SELECT coalesce(_28d_click, 0) + coalesce(_1d_view, 0) from unnest(actions) WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights

SQL: Reduce resultset to X rows?

I have the following MySQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains more than 1 billion entries. I want to be able to visualize any time window. The time window can range from one day to many years. There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary widely, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a chart on a webpage.
If the chart is, let's say, 800 px wide, it does not make sense to fetch thousands of rows from the database when the time window is large. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the result set directly on the DB side?
I know "average" and "sum" etc. as aggregate functions. But how can I, for example, aggregate 100k rows from a big time window down to, let's say, 800 final rows?
Just getting those 100k rows and letting the chart do the magic is not the preferred option. Transfer size is one reason why this is not an option.
Isn't there something on the DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple trick to just skip every nth row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DBs.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, truncate it, average the values and group by the truncated date. It could work for me, but it would require reworking my table structure. Hmm... maybe there's more... still researching...
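For reference, the aggregation described there looks roughly like this in PostgreSQL (a sketch only; it assumes the measuredata columns from above and a real timestamp column rather than a Unix epoch):
SELECT
  date_trunc('day', "timestamp") AS bucket,  -- one row per day
  entityid,
  avg(value) AS avg_value
FROM measuredata
WHERE entityid = 38
  AND "timestamp" >= '2019-01-25'
GROUP BY bucket, entityid
ORDER BY bucket;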
update3:
Inspired by update 2, I came up with this query:
SELECT
  (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp,
  `entity`,
  `value`
FROM `measuredata`
WHERE `entity` = 38
  AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp
Works, but my DB/index/structure doesn't seem really optimized for this: a query for the last year took ~75 sec (slow test machine), even though it ultimately returns only one value per day. This can be combined with avg(value), but that further increases the query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially with aggregation in combination with GROUP BY.
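For completeness, the avg(value) variant mentioned above would look something like this (a sketch based on the query from update 3):
SELECT
  (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp,
  `entity`,
  AVG(`value`) AS avg_value  -- average all per-minute values falling into the same day bucket
FROM `measuredata`
WHERE `entity` = 38
  AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp, `entity`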
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (
    select md.*,
           row_number() over (partition by tile order by timestamp) as seqnum
    from (
        select md.*, ntile(800) over (order by timestamp) as tile
        from measuredata md
        where . . . -- your filtering conditions here
    ) md
) md
where seqnum = 1;

Getting a unique value from an aggregated result set

I've got an aggregated query that checks if I have more than one record matching certain conditions.
SELECT RegardingObjectId, COUNT(*) FROM [CRM_MSCRM].[dbo].[AsyncOperationBase] a
where WorkflowActivationId IN ('55D9A3CF-4BB7-E311-B56B-0050569512FE',
'1BF5B3B9-0CAE-E211-AEB5-0050569512FE',
'EB231B79-84A4-E211-97E9-0050569512FE',
'F0DDF5AE-83A3-E211-97E9-0050569512FE',
'9C34F416-F99A-464E-8309-D3B56686FE58')
and StatusCode = 10
group by RegardingObjectId
having COUNT(*) > 1
That's nice, but there is one field in AsyncOperationBase that will be unique. Say COUNT(*) = 3; then AsyncOperationBaseId will have 3 different values, since AsyncOperationBaseId is the table's primary key.
To be honest, I would not even know what terms and expressions to Google to find a solution.
If anyone has a solution, is there also a name for what I'm looking for? Perhaps BI people are often faced with such a requirement.
I could do it with an SSRS report where the report would visually do the grouping and then I could expand each grouped row to get the AsyncOperationBaseId values, but simply through SQL I can't seem to find a way out.
Thanks.
SELECT *
FROM [CRM_MSCRM].[dbo].[AsyncOperationBase]
WHERE RegardingObjectId IN
(
    SELECT RegardingObjectId
    FROM [CRM_MSCRM].[dbo].[AsyncOperationBase] a
    WHERE WorkflowActivationId IN
    (
        '55D9A3CF-4BB7-E311-B56B-0050569512FE',
        '1BF5B3B9-0CAE-E211-AEB5-0050569512FE',
        'EB231B79-84A4-E211-97E9-0050569512FE',
        'F0DDF5AE-83A3-E211-97E9-0050569512FE',
        '9C34F416-F99A-464E-8309-D3B56686FE58'
    )
    AND StatusCode = 10
    GROUP BY RegardingObjectId
    HAVING COUNT(*) > 1
)
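As a possible alternative (a sketch only, not tested against your schema), a window function can attach the per-group count to every row, so the duplicate AsyncOperationBaseId values come back in a single pass; note that, unlike the query above, this only returns rows that themselves match the WorkflowActivationId/StatusCode filter:
SELECT *
FROM (
    SELECT a.*,
           COUNT(*) OVER (PARTITION BY RegardingObjectId) AS RowsInGroup
    FROM [CRM_MSCRM].[dbo].[AsyncOperationBase] a
    WHERE WorkflowActivationId IN ('55D9A3CF-4BB7-E311-B56B-0050569512FE',
                                   '1BF5B3B9-0CAE-E211-AEB5-0050569512FE',
                                   'EB231B79-84A4-E211-97E9-0050569512FE',
                                   'F0DDF5AE-83A3-E211-97E9-0050569512FE',
                                   '9C34F416-F99A-464E-8309-D3B56686FE58')
      AND StatusCode = 10
) t
WHERE RowsInGroup > 1;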

Calculate First-Time Buyers and Repeating Buyers using MAQL queries in the GoodData platform

I have recently been working with the GoodData platform. I don't have much experience in MAQL, but I am working on it. I have created some metrics and reports in GoodData. Recently I tried to create metrics for calculating Total Buyers, First-Time Buyers and Repeating Buyers. I created these three reports and they work perfectly, but when I try to add an order-date parent filter, the First-Time Buyers and Repeating Buyers values come out wrong. Please have a look at the following queries.
I can find the correct values using SQL queries.
MAQL Queries:
TOTAL ORDERS-
SELECT COUNT(NexternalOrderNo) BY CustomerNo WITHOUT PF
TOTAL FIRSTTIMEBUYERS-
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF=1) WITHOUT PF
TOTAL REPEATINGBUYERS-
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF>1) WITHOUT PF
Can anyone suggest logic for finding these values using MAQL?
It's not clear what you want to do. If you could provide more details about the report you need to get, it would be great.
It's not necessary to put "WITHOUT PF" into the metrics. This clause prevents the filter from being applied, so when you remove it, the parent filter will be used there, and you will probably get what you want. Specifically, modify this:
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF>1) WITHOUT PF
to:
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF>1)
The only thing you are missing here is "ALL IN ALL OTHER DIMENSIONS", aka "ALL OTHER".
This keyword locks and overrides all attributes in all other dimensions, keeping them from having any effect on the metric. You can read more about it in the MAQL Reference Guide.
FIRSTTIMEBUYERS:
SELECT COUNT(CustomerNo)
WHERE (SELECT IFNULL(COUNT(NexternalOrderNo), 0) BY Customer ID, ALL OTHER) = 1
REPEATINGBUYERS:
SELECT COUNT(CustomerNo)
WHERE (SELECT IFNULL(COUNT(NexternalOrderNo), 0) BY Customer ID, ALL OTHER) > 1

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, QUANTILES currently do not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the WHERE clause.
For example, I run the query below for every distinct value in column-y to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> == <each distinct row in column-y>
Does the BigQuery team plan to add functionality that allows grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
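In current BigQuery standard SQL (assuming that dialect is an option for you), a grouped approximate median can also be computed directly with APPROX_QUANTILES; a rough sketch against the same table and columns as above:
SELECT
  room,
  APPROX_QUANTILES(data, 100)[OFFSET(50)] AS approx_median  -- 50th percentile
FROM `io_sensor_data.moscone_io13`
WHERE sensortype = 'temperature'
GROUP BY room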
While there are efficient algorithms to compute quantiles, they are somewhat memory-intensive, and trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
If your per-group size is fixed, you may be able to hack it using a combination of ORDER, NEST and NTH. For instance, if there are 9 distinct values of f2 per value of f1, for the median:
select f1, nth(5, f2) within record
from (
  select f1, nest(f2) f2
  from (
    select f1, f2
    from table
    group by f1, f2
    order by f2
  )
  group by f1
);
I'm not sure the sorted order in the subquery is guaranteed to survive the second GROUP BY, but it worked in a simple test I tried.