SQL Query to partition aggregate data from Amazon AMS - sql

I am using Google BigQuery to store data from Amazon AMS. Each day a CSV file is loaded into the database that contains the lifetime spend, clicks and impressions. The data looks something like this:
date_uploaded,campaign,spend,impressions,clicks
2017-11-01,product a,100,1000,50
2017-11-01,product b,50,500,20
2017-11-02,product a,175,1600,75
2017-11-02,product b,100,1000,50
2017-11-03,product a,250,2200,110
2017-11-03,product b,150,1500,80
I would like to transform this data to show the daily spend (difference between previous day) so the end result would look like this:
date_uploaded,campaign,spend,impressions,clicks
2017-11-02,product a,75,600,25
2017-11-02,product b,50,500,30
2017-11-03,product a,75,600,35
2017-11-03,product b,50,500,30
Is there a way to query BQ to partition data in this way?

You can do this with LAG. The difference for the first row in each partition is NULL, so those rows can be excluded with a WHERE filter. This assumes one row per (date, campaign); if there are multiple rows, sum the values for each day first and then apply LAG.
select * from (
select date_uploaded,campaign,
spend-lag(spend) over(partition by campaign order by date_uploaded) as spend_diff,
impressions-lag(impressions) over(partition by campaign order by date_uploaded) as impressions_diff,
clicks-lag(clicks) over(partition by campaign order by date_uploaded) as clicks_diff
from tbl
) t
where spend_diff is not null and impressions_diff is not null and clicks_diff is not null
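To check the LAG approach against the sample rows in the question, here is a minimal sketch that runs the same query shape on an in-memory SQLite database from Python. SQLite stands in for BigQuery purely for illustration (it assumes a SQLite build with window-function support, 3.25 or later), and the helper name `daily_diffs` is hypothetical.

```python
import sqlite3

# Sample rows copied from the question: lifetime totals per upload date.
ROWS = [
    ("2017-11-01", "product a", 100, 1000, 50),
    ("2017-11-01", "product b", 50, 500, 20),
    ("2017-11-02", "product a", 175, 1600, 75),
    ("2017-11-02", "product b", 100, 1000, 50),
    ("2017-11-03", "product a", 250, 2200, 110),
    ("2017-11-03", "product b", 150, 1500, 80),
]

# Same structure as the answer's query: subtract the previous day's
# lifetime total (per campaign) and drop each campaign's first row.
DAILY_DIFF_SQL = """
SELECT * FROM (
    SELECT date_uploaded, campaign,
           spend - LAG(spend) OVER (PARTITION BY campaign ORDER BY date_uploaded) AS spend_diff,
           impressions - LAG(impressions) OVER (PARTITION BY campaign ORDER BY date_uploaded) AS impressions_diff,
           clicks - LAG(clicks) OVER (PARTITION BY campaign ORDER BY date_uploaded) AS clicks_diff
    FROM tbl
) AS t
WHERE spend_diff IS NOT NULL
ORDER BY date_uploaded, campaign
"""


def daily_diffs(rows):
    """Load rows into a throwaway table and return the per-day differences."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl (date_uploaded TEXT, campaign TEXT, "
        "spend INTEGER, impressions INTEGER, clicks INTEGER)"
    )
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(DAILY_DIFF_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in daily_diffs(ROWS):
        print(row)
```

On the sample data this yields the four rows listed as the desired result in the question.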

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I need an equivalent query based on window functions that reports the total value of the column named balance for all the days up to and including the given date.
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above on the entire dataset up to and including the desired day. For example, for day 2009-01-31, the result is 97.13522530000000000000, or for day 2009-01-15 when we filter time as time < '2009-01-16 00:00:00' it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The differences between the result sets of the queries were caused by the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.
Consider calculating the running total with a window function after aggregating the data to day level. And since you aggregate with a single condition, the FILTER condition can be converted to a basic WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily
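As a rough illustration of the aggregate-then-window pattern, the sketch below runs the same query shape on SQLite from Python. The rows are made up for the example, `date(time)` replaces PostgreSQL's `DATE_TRUNC('DAY', time)` because SQLite has no `DATE_TRUNC`, and the function name `running_daily_totals` is hypothetical.

```python
import sqlite3

# Hypothetical rows: (time, balance, is_spent_column). Only rows whose
# is_spent_column is NULL count towards the total.
ROWS = [
    ("2009-01-14 10:00:00", 10, None),
    ("2009-01-14 18:30:00", 5, None),
    ("2009-01-15 09:00:00", 7, "spent"),
    ("2009-01-15 12:00:00", 3, None),
    ("2009-01-17 08:00:00", 2, None),
]

RUNNING_TOTAL_SQL = """
SELECT daily,
       SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
    -- Step 1: collapse to one row per day, keeping unspent rows only.
    SELECT date(time) AS daily,
           SUM(balance) AS total_balance
    FROM tbl
    WHERE is_spent_column IS NULL
    GROUP BY date(time)
) AS daily_agg
ORDER BY daily
"""


def running_daily_totals(rows):
    """Return (day, cumulative balance up to and including that day)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (time TEXT, balance NUMERIC, is_spent_column TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    result = conn.execute(RUNNING_TOTAL_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, total in running_daily_totals(ROWS):
        print(day, total)
```

Days with no qualifying rows (2009-01-16 here) simply do not appear; if every calendar day is required, the daily aggregate would have to be joined to a generated date series first.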

partition big query LIMIT over date range

I'm quite new to SQL and BigQuery, so this might be simple. I'm running some queries on the public dataset GDELT in BQ and have a question regarding LIMIT. GDELT is massive (14.4 TB), and when I query for something, in this case a person, I could get 100k rows of results or more, which in this case is too much. But when I use LIMIT, it does not seem to spread the results evenly over the dates, so I get very random timelines. How does LIMIT work, and is there a way to get results distributed more evenly across days?
SELECT DATE,V2Tone,DocumentIdentifier as URL, Themes, Persons, Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE>=20210610000000 and _PARTITIONTIME >= TIMESTAMP(@start_date)
AND DATE<=20210818999999 and _PARTITIONTIME <= TIMESTAMP(@end_date)
AND LOWER(DocumentIdentifier) like @url_topic
LIMIT @limit
When running this query and doing some preproc, I get the following time series:
It's based on 15k results, but they are distributed very unevenly/randomly across the days (since there are over 500k results in total if I don't use limit). I would like to make a query that limits my output to 15k but partitions the data somewhat equally over the days.
You need an ORDER BY: when you do not sort the result, the order of the returned rows is not guaranteed, so LIMIT keeps an arbitrary subset.
But if you want up to the same number of rows per day, you can use window functions:
select * from (
SELECT
DATE,
V2Tone,
DocumentIdentifier as URL,
Themes,
Persons,
Locations,
-- DATE is a YYYYMMDDHHMMSS integer, so divide by 1000000 to number rows within each calendar day
row_number() over (partition by DIV(DATE, 1000000)) rn
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE >= 20210610000000 AND DATE <= 20210818999999
and _PARTITIONDATE >= @start_date and _PARTITIONDATE <= @end_date
AND LOWER(DocumentIdentifier) like @url_topic
) t where rn <= @numberofrowsperday
If you are passing dates only, you can use _PARTITIONDATE to filter the partitions.
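To make the per-day cap concrete, here is a small sketch on SQLite from Python. The table `gkg`, its two columns, and the function `sample_per_day` are hypothetical stand-ins for the GDELT table; it assumes the timestamp column holds YYYYMMDDHHMMSS integers, as the WHERE clause in the question suggests, so integer division by 1000000 gives the calendar day.

```python
import sqlite3

# Hypothetical rows: (YYYYMMDDHHMMSS timestamp, url). Three rows fall on
# 2021-06-10 and one on 2021-06-11.
ROWS = [
    (20210610000000, "a"),
    (20210610001500, "b"),
    (20210610120000, "c"),
    (20210611000000, "d"),
]

PER_DAY_SQL = """
SELECT date_ts, url FROM (
    SELECT date_ts, url,
           -- Restart the numbering for every calendar day.
           ROW_NUMBER() OVER (PARTITION BY date_ts / 1000000
                              ORDER BY date_ts) AS rn
    FROM gkg
) AS t
WHERE rn <= ?   -- keep at most N rows per day
ORDER BY date_ts
"""


def sample_per_day(rows, per_day):
    """Return at most `per_day` rows for each calendar day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gkg (date_ts INTEGER, url TEXT)")
    conn.executemany("INSERT INTO gkg VALUES (?, ?)", rows)
    result = conn.execute(PER_DAY_SQL, (per_day,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in sample_per_day(ROWS, 2):
        print(row)
```

The ORDER BY inside the window keeps the earliest rows of each day; a different ordering expression would pick a different subset, which may matter if the goal is a representative sample rather than the first N rows.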

Adding grouping in framing clause window while creating partitions

Using the dataset hosted on Google (MLB Data) as an example, here is what I am trying to accomplish: obtain the last 3 weeks' score run for a given venue.
My aggregated dataset looks like this without the strikes_3wk column -
Logic for strikes_3wk column is to partition the aggregated dataset by venueName, order by YearWeek column and then obtain the last 3 weeks aggregated strikes data.
Here is the query I have written so far. I see that the windowing function is where I need to modify the logic. So, is there a way to add grouping within the windowing function? Is there any alternative way of doing this?
In the image I added a new column 'expected', showing values for two weeks.
select inr.*
,sum(inr.strikes) over (Venue_Week rows between current row and 2 following) as strikes_3wk
from
(
select seasonType
,gameStatus
,homeTeamName
,awayTeamName
,venueName
,CAST(
CONCAT(
CAST(EXTRACT(YEAR FROM createdAt) as string)
,CAST(EXTRACT(WEEK(Monday) FROM createdAt) as string)
) as INT64)
as YearWeek
,sum(homeFinalRuns) as homeFinalRuns
,sum(strikes) as strikes
from `bigquery-public-data.baseball.games_wide`
where createdAt is not null
group by seasonType
,gameStatus
,homeTeamName
,awayTeamName
,venueName
,YearWeek
)inr
window Venue_Week as (
partition by inr.venueName
order by inr.YearWeek desc
)
So you are looking for strikes per venue regardless of who did them, right?
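Maybe something like:
WITH weekly AS (
-- one row per venue and week; YearWeek is not a column of games_wide, so compute it first
SELECT venueName,
CAST(CONCAT(CAST(EXTRACT(YEAR FROM createdAt) AS STRING),
CAST(EXTRACT(WEEK(MONDAY) FROM createdAt) AS STRING)) AS INT64) AS YearWeek,
SUM(strikes) AS strikes
FROM `bigquery-public-data.baseball.games_wide`
WHERE createdAt IS NOT NULL
GROUP BY venueName, YearWeek
),
ranked AS (
SELECT weekly.*,
ROW_NUMBER() OVER (PARTITION BY venueName ORDER BY YearWeek DESC) AS rn
FROM weekly
)
SELECT INR.*, STATS.strikes_3wk
FROM weekly INR
LEFT JOIN (
SELECT venueName, SUM(strikes) AS strikes_3wk
FROM ranked
WHERE rn <= 3 -- BigQuery has no TOP; keep the three most recent weeks per venue
GROUP BY venueName
) STATS
ON INR.venueName = STATS.venueName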
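If what is wanted is a rolling figure for every week rather than one total per venue, a window frame over a pre-aggregated weekly table may be closer to the question. The sketch below shows that idea on SQLite from Python; the table `games`, its sample rows, and the function `rolling_three_week_strikes` are hypothetical, and YearWeek is supplied directly instead of being derived from createdAt.

```python
import sqlite3

# Hypothetical game-level rows: (venueName, YearWeek, strikes).
ROWS = [
    ("Park A", 201614, 10),
    ("Park A", 201614, 5),
    ("Park A", 201615, 20),
    ("Park A", 201616, 30),
    ("Park A", 201617, 40),
    ("Park B", 201614, 7),
]

ROLLING_SQL = """
SELECT venueName, YearWeek, strikes,
       -- Current week plus the two weeks before it, per venue.
       SUM(strikes) OVER (PARTITION BY venueName
                          ORDER BY YearWeek
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS strikes_3wk
FROM (
    -- Group first so the window sees exactly one row per venue and week.
    SELECT venueName, YearWeek, SUM(strikes) AS strikes
    FROM games
    GROUP BY venueName, YearWeek
) AS weekly
ORDER BY venueName, YearWeek
"""


def rolling_three_week_strikes(rows):
    """Return (venue, week, weekly strikes, trailing 3-week strikes)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE games (venueName TEXT, YearWeek INTEGER, strikes INTEGER)")
    conn.executemany("INSERT INTO games VALUES (?, ?, ?)", rows)
    result = conn.execute(ROLLING_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rolling_three_week_strikes(ROWS):
        print(row)
```

The grouping happens in the subquery rather than inside the window, because a ROWS frame counts rows, not weeks: it only means "three weeks" when there is exactly one row per venue and week and no week is missing.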

Difference between two rows in bigquery

I have a BigQuery table in the following format.
How do I take the difference between the sum of monthly spend across months when the storeID and membership_type are the same? Example output is provided below.
You can use LAG to get the value of the previous row in BigQuery:
SELECT
STOREID,
MEMBERSHIP_TYPE,
yyyy_mm,
SUM_MONTHLY_SPEND,
IFNULL(LAG(SUM_MONTHLY_SPEND) OVER (PARTITION BY STOREID, MEMBERSHIP_TYPE ORDER BY yyyy_mm ASC) - SUM_MONTHLY_SPEND, 0) AS MONTHLY_SPEND_DIFF
FROM dataset.table

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query for this case but can't get it right, because I am still receiving duplicates. I'm hoping to get help on how to fix this.
SELECT DISTINCT
s.Client,
s.ID,
s.Thing,
s.Status,
MIN(s.StatusDate) as 'statdate'
FROM
SAMPLE s
WHERE
[]
GROUP BY
s.Client,
s.ID,
s.Thing,
s.Status
My output is as follows
Client Id Thing Status Statdate
CompanyA 123 Thing1 Approved 12/9/2019
CompanyA 123 Thing1 Denied 12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date per ID. I have about 30k rows to filter through, so I need something that does not overload the query and prevent it from running. Any help would be appreciated.
Use window functions:
SELECT s.*
FROM (SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY StatusDate) as seqnum
FROM SAMPLE s
WHERE []
) s
WHERE seqnum = 1;
This returns the first row for each id.
Use whichever of these you feel more comfortable with/understand:
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
FROM sample
WHERE ...
) x
WHERE rn = 1
The way that one works is to number all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you then collect all the number 1's together, you have your set of "first records".
Or you can join back on a MIN:
SELECT
*
FROM
sample s
INNER JOIN
(SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
ON s.ID = mins.ID and s.StatusDate = mins.MinDate
WHERE
...
This one prepares a list of every ID with its min date, then joins it back to the main table. You thus get back all the data that was lost during the grouping operation. You cannot simultaneously "keep data" and "throw away data" during a group: if you group by more than just ID, you get more groups (as you have found), and if you only group by ID, you lose the other columns. There isn't any way to say "GROUP BY id, AND take the MIN date, AND also take all the other data from the same row as the min date" without doing a "group by id, take min date, then join this data set back to the main dataset to get the other data for that min date". If you try to do it all in a single grouping you'll fail, because you either have to group by more columns or use aggregate functions for the other data in the SELECT, which mixes your data up; once the groups are formed, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records for the same ID share the minimum date. The ROW_NUMBER form doesn't return duplicate records, but if two records have the same minimum StatusDate then which one you get is arbitrary. To force a specific one, add more columns to the ORDER BY so you can be sure which row ends up with number 1.
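The difference between the two forms on tied dates can be seen in a small sketch on SQLite from Python. The rows are made up (the CompanyB pair is an invented tie), dates are written in ISO format so that text ordering matches date ordering, and the helper names are hypothetical. The tie-breaker column `Status` is just an example of "more stuff" to order by.

```python
import sqlite3

# Hypothetical rows: (Client, ID, Thing, Status, StatusDate).
# ID 456 has two rows sharing the same earliest date.
ROWS = [
    ("CompanyA", 123, "Thing1", "Approved", "2019-12-09"),
    ("CompanyA", 123, "Thing1", "Denied", "2019-12-06"),
    ("CompanyB", 456, "Thing2", "Pending", "2019-12-01"),
    ("CompanyB", 456, "Thing2", "Approved", "2019-12-01"),
]


def _load(rows):
    """Create an in-memory table named sample and fill it with rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample (Client TEXT, ID INTEGER, Thing TEXT, "
        "Status TEXT, StatusDate TEXT)"
    )
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def first_by_row_number(rows):
    """Exactly one row per ID; Status breaks ties on StatusDate."""
    conn = _load(rows)
    result = conn.execute("""
        SELECT Client, ID, Thing, Status, StatusDate FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY ID
                                         ORDER BY StatusDate, Status) AS rn
            FROM sample
        ) AS x
        WHERE rn = 1
        ORDER BY ID
    """).fetchall()
    conn.close()
    return result


def first_by_min_join(rows):
    """Every row that carries its ID's earliest date, ties included."""
    conn = _load(rows)
    result = conn.execute("""
        SELECT s.Client, s.ID, s.Thing, s.Status, s.StatusDate
        FROM sample AS s
        INNER JOIN (SELECT ID, MIN(StatusDate) AS minDate
                    FROM sample GROUP BY ID) AS mins
          ON s.ID = mins.ID AND s.StatusDate = mins.minDate
        ORDER BY s.ID, s.Status
    """).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(first_by_row_number(ROWS))
    print(first_by_min_join(ROWS))
```

With this data the ROW_NUMBER form returns two rows (one per ID), while the MIN join returns three because both CompanyB rows share the earliest date.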