I've been trying to solve a problem for a couple of days now, but I'm completely stuck:
I have a basic page views table from Snowplow Analytics, and I'm creating a sessions table from it. This table uses arrays to structure my data.
When I do a sum(count_page_views), the totals are right.
As soon as I add a date dimension, date(session_start), the sum for that day is completely wrong.
This is what the table should look like (count distinct on page view id).
This is what it looks like with my session table SQL:
I'm pretty certain I misunderstand something about how summing arrays and array_length work, but I have no idea what is wrong...
SQL for session table
with all_page_views as (
select
*
from
`page_views_table`
),
sessions_agg as (
select
pv.session_id,
array_agg(
pv
order by
pv.page_view_in_session_index
) as all_pageviews
from
all_page_views as pv
group by
1
),
sessions_agg_xf as (
select
session_id,
all_pageviews,
(
select
struct(
min(page_view_start) as session_start,
max(page_view_end) as session_end
)
from
unnest(all_pageviews)
) as timing
from
sessions_agg
),
sessions as (
select
timing.session_start,
timing.session_end,
array_length(all_pageviews) as count_page_views
from
sessions_agg_xf
)
select
sum(count_page_views)
from
sessions
where date(session_start) = "2020-02-01"
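For what it's worth, a quick way to see the discrepancy is to compare against the per-day totals computed straight from the raw table, which is what the expected numbers above are based on. A sketch (page_view_id and page_view_start are assumed column names in page_views_table):
select
  date(page_view_start) as day,
  count(distinct page_view_id) as count_page_views
from
  `page_views_table`
group by
  1
order by
  1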
I believe I've found the problem somewhere else. There was a bug in Snowplow that didn't reset the session id, so my sessionization is wrong...
https://github.com/snowplow/snowplow-javascript-tracker/issues/718
Related
When creating a table, let's say "orders", with partitioning in the following way, my result gets truncated compared to creating it without partitioning (commenting and uncommenting lines 5 and 6).
I suspect that it might have something to do with the BQ limits (found here), but I can't figure out what. ts is a TIMESTAMP field and order_id is a UUID string.
I.e., the count distinct on the last line will yield very different results: when partitioned it returns far fewer order_ids than without partitioning.
DROP TABLE IF EXISTS
`project.dataset.orders`;
CREATE OR REPLACE TABLE
`project.dataset.orders`
-- PARTITION BY
-- DATE(ts)
AS
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2;
SELECT COUNT(DISTINCT order_id) FROM `project.dataset.orders`;
(This is not a valid 'answer'; I just need a better place to write SQL than the comment box. I don't mind if a moderator converts this answer into a comment AFTER it serves its purpose.)
What is the number you get if you run the query below, and which one does it align with (partitioned or non-partitioned)?
SELECT COUNT(DISTINCT order_id) FROM (
SELECT
ts,
order_id,
SUM(order_value) AS order_value
FROM
`project.dataset.raw_orders`
GROUP BY
1, 2
) t;
It turns out that there's a 60-day partition expiration!
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
So by updating the partition expiration I could get the full range.
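For reference, the table-level expiration can be changed with a DDL statement along these lines, using the example table from above (a sketch; setting the option to NULL should remove the expiration entirely):
ALTER TABLE `project.dataset.orders`
SET OPTIONS (partition_expiration_days = 365);  -- or NULL to remove the expiration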
I've seen these two questions mention the same issue. The answers are almost a year old, and I'm wondering whether there have been any updates in BQ; I could not find any concrete answers in the documentation.
I'm trying to do repeated sampling and would like consistent results. This is important for me.
The solution provided in this question does not give consistent results.
Here is my code:
SELECT
*
FROM (
SELECT
*,
ROW_NUMBER() OVER() as incremental_number
FROM (
SELECT
*
FROM
Table1 as cmd
WHERE
Member NOT IN (
SELECT
Member
FROM
table2
WHERE
Idx = '6'
)
) as t
WHERE
MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 5
)
ORDER BY Member
Maybe just the order of the rows is different? Try sorting them.
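If it is only the ordering that changes between runs, making the row numbering deterministic should make the output reproducible. A minimal sketch of that idea against the query above (assuming Member, or some other column, gives a stable ordering):
SELECT
  *,
  ROW_NUMBER() OVER (ORDER BY Member) AS incremental_number
FROM (
  SELECT
    *
  FROM
    Table1 AS cmd
  WHERE
    Member NOT IN (SELECT Member FROM table2 WHERE Idx = '6')
) AS t
WHERE
  MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 5
ORDER BY Member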
Let there be a table named data with columns time, sensor, value:
I want to pivot this table on Athena (Presto) to get a new table like this one:
To do so, one can run the following query:
SELECT time,
sensor_value['temperature'] as "temperature",
sensor_value['pressure'] as "pressure"
FROM (
SELECT time, map_agg(sensor, value) sensor_value
FROM data
GROUP BY time
)
This works well, but I must specify the keys of sensor_value, so I need to know the unique values of sensor in order to write the query by hand. The problem is that I don't have that information. Do you know a generic (and efficient) solution to this issue? I would really appreciate any help. Thanks.
This will give you the answer; you just have to create a row number to pivot your table.
SELECT
  MAX(TIME) AS TIME,
  MAX(CASE WHEN SENSOR = 'temperature' THEN VALUE END) AS TEMPERATURE,
  MAX(CASE WHEN SENSOR = 'pressure' THEN VALUE END) AS PRESSURE
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY SENSOR ORDER BY TIME) AS ROW_GROUP FROM data) t
GROUP BY ROW_GROUP
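Since the question asks for something generic, one hedged workaround (two steps rather than a single query) is to generate the pivot statement from the distinct sensor values and then run the generated SQL. A sketch against the data table:
SELECT
  'SELECT time, '
  || array_join(
       array_agg('MAX(CASE WHEN sensor = ''' || sensor || ''' THEN value END) AS "' || sensor || '"'),
       ', ')
  || ' FROM data GROUP BY time' AS generated_sql
FROM (SELECT DISTINCT sensor FROM data) AS s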
I have a table with the following Columns...
Node, Date_Time, Market, Price
I would like to delete all but one record for each Node, Date_Time.
SELECT Node, Date_Time, MAX(Price)
FROM Hourly_Data
Group BY Node, Date_Time
That gets the results I would like to see, but I can't figure out how to remove the other records.
Note: there is no ID column on this table.
Here are steps that are more of a workaround than a simple one-command solution, but they will work in any relational database (a sketch follows below):
Create new table that looks just like the one you already have
Insert the data computed by your group-by query to newly created table
Drop the old table
Rename new table to the name the old one used to have
Just remember that locking takes place and you need to have some maintenance time to perform this action.
There are simpler ways to achieve this, but they are DBMS specific.
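A sketch of those steps using the poster's group-by query (generic SQL; the exact CREATE TABLE ... AS and rename syntax varies by DBMS, and like the original query this keeps only Node, Date_Time and the max Price):
-- 1) Build the deduplicated table from the group-by query
CREATE TABLE Hourly_Data_dedup AS
SELECT Node, Date_Time, MAX(Price) AS Price
FROM Hourly_Data
GROUP BY Node, Date_Time;

-- 2) Drop the old table
DROP TABLE Hourly_Data;

-- 3) Rename the new table to the old name (syntax differs per DBMS)
ALTER TABLE Hourly_Data_dedup RENAME TO Hourly_Data;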
Here is an easy SQL Server method that creates a row number within a CTE and deletes from it. I believe this also works in other RDBMSs that support window functions, Common Table Expressions, and deleting through a CTE.
;WITH cte AS (
SELECT
*
,RowNum = ROW_NUMBER() OVER (PARTITION BY Node, Date_Time ORDER BY Price DESC)
FROM
Hourly_Data
)
DELETE
FROM
cte
WHERE
RowNum > 1
I have built cohorts of accounts based on the date of first usage of our service. I need to use these cohorts in a handful of different queries, but I don't want to have to recreate the cohort query in each of these downstream queries. Reason: getting the data the first time took more than 60 minutes, so I don't want to pay that tax for all the other queries.
I know that I could do a statement like the below:
WHERE ACCOUNT_ID IN ('1234567','7891011','1213141'...)
But I'm wondering if there is a way to create a temporary table that I prepopulate with my data, something like:
WITH MAY_COHORT AS ( SELECT ACCOUNT_ID Account_ID, '1234567' Account_ID, '7891011' Account_ID, '1213141' )
I know that the above won't work, but I would appreciate any advice or counsel here.
Thanks.
Unless I am missing something, you're already on the right track; just an adjustment to your CTE should work:
WITH MAY_COHORT AS ( SELECT Account_ID from TableName WHERE ACCOUNT_ID IN ('1234567','7891011','1213141'...) )
This should give you the May_Cohort table to use for subsequent queries.
You can also use a sub-select for your IDs (without the WITH MAY_COHORT):
WHERE ACCOUNT_ID IN (
  SELECT Account_ID
  FROM TableName
  WHERE ... -- your condition to build your cohort
)
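If you want to hardcode the cohort IDs so downstream queries never touch the expensive source query, a literal-value CTE in the spirit of the original attempt also works in most dialects. A sketch, using the sample IDs from the question (your_downstream_table is a placeholder):
WITH MAY_COHORT AS (
  SELECT '1234567' AS Account_ID
  UNION ALL SELECT '7891011'
  UNION ALL SELECT '1213141'
)
SELECT *
FROM your_downstream_table
WHERE ACCOUNT_ID IN (SELECT Account_ID FROM MAY_COHORT)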