How can I get the last version of each row in SQL? - sql

I am trying to use this query to get the daily cost for each campaign, but I only get the cost for one campaign. Every campaign has the same "_sdc_sequence", but for each day there are many "_sdc_sequence" values. How can I get the cost from the last version of each row, per day and per campaign, and select only some columns such as day, cost, impressions and campaign? Right now I get every column in my table.
"_sdc_sequence" is a Unix epoch attached to the record during replication; it determines the order of all the versions of a row.
I attached a picture of the table. I need to select only the last sequence (max _sdc_sequence).
#standardSQL
SELECT row.* FROM (
SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
FROM `adxxxxx_xxxxxxxx` t
GROUP BY day
)
thanks

#standardSQL
SELECT row.* FROM (
SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
FROM `adxxxxx_xxxxxxxx` t
GROUP BY day, campaign
)
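To return only some columns instead of every column in the table, one option is to name the fields of the aggregated struct in the outer SELECT. The sketch below is self-contained: the `sample` CTE, its values and its column types are made up for illustration and are only assumptions about the real table.

```sql
#standardSQL
-- Hypothetical sample data standing in for the real table.
WITH sample AS (
  SELECT DATE '2019-01-01' AS day, 'A' AS campaign, 10.0 AS cost, 100 AS impressions, 1 AS _sdc_sequence UNION ALL
  SELECT DATE '2019-01-01', 'A', 12.0, 120, 2 UNION ALL  -- newer version of the same day/campaign
  SELECT DATE '2019-01-01', 'B', 5.0, 50, 1 UNION ALL
  SELECT DATE '2019-01-02', 'A', 7.0, 70, 3
)
-- Pick only the wanted fields from the latest version of each (day, campaign).
SELECT latest.day, latest.campaign, latest.cost, latest.impressions
FROM (
  SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM sample AS t
  GROUP BY day, campaign
)
ORDER BY day, campaign;
```

With the real data, drop the CTE and point the inner FROM at the actual table.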

Related

Find the nth greatest value per group in SQL

I'm trying to find the nth greatest value in each group in a table; is there an efficient way to do this in SQL? (specifically Google BigQuery, if that's relevant)
For example, suppose we had a table sales with two fields, customer_id and amount, where each record corresponds to the sale of an item to a customer for a given amount. If I wanted the top sale to each customer, I could do
SELECT customer_id, MAX(amount) top_amount
FROM sales
GROUP BY customer_id;
If I instead wanted the 5th greatest value for each customer, is there an efficient/idiomatic way to do that in SQL?
Consider the approach below:
SELECT customer_id, array_agg(amount order by amount desc limit 5)[safe_offset(4)] top_5th_amount
FROM sales
GROUP BY customer_id;
Yet another option uses the nth_value() function:
SELECT distinct customer_id,
nth_value(amount, 5) over win top_5th_amount
FROM sales
window win as (partition by customer_id order by amount desc rows between unbounded preceding and unbounded following )
You can use qualify:
select s.*
from sales s
where 1=1
qualify row_number() over (partition by customer_id order by amount desc) = 5;
Note: Your question is unclear on how to handle tied amounts. This treats them as separate amounts (so the 5th could be the same as the 1st). If you want the 5th largest distinct value, use dense_rank() instead.
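As a rough sketch of the dense_rank() variant mentioned in the note, the query below looks for the 2nd largest distinct amount per customer (2 instead of 5 only to keep the made-up sample small). The inline `sales` rows are hypothetical.

```sql
#standardSQL
-- Hypothetical sample data; customer 1 has a tie at the top amount.
WITH sales AS (
  SELECT 1 AS customer_id, 50 AS amount UNION ALL
  SELECT 1, 50 UNION ALL
  SELECT 1, 30 UNION ALL
  SELECT 2, 20
)
-- DENSE_RANK gives tied amounts the same rank, so rank 2 is the second
-- distinct amount. DISTINCT collapses rows that share that rank.
SELECT DISTINCT customer_id, amount AS second_distinct_amount
FROM sales
WHERE TRUE  -- mirrors the "where 1=1" in the answer above
QUALIFY DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) = 2;
```

A customer with fewer distinct amounts than the requested rank (customer 2 here) produces no row at all, unlike the safe_offset() approach, which returns NULL for that customer.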

Delete repeated data in Bigquery

I am optimizing a query in BigQuery that returns non-repeated data. Currently it looks like this, and it works:
select * from (select
ROW_NUMBER() OVER (PARTITION BY id) as num,
id,
created_at,
operator_id,
description
from NAME_TABLE
where created_at >='2018-01-01') where num=1
I wanted to ask whether it is possible to do a GROUP BY over all the columns (this cannot be done in a simple way, since created_at cannot be grouped) and keep the first created_at value that appears for each id.
P.S.: DISTINCT does not work, since there are more than 80 million records (growing by 2 million per day) and it still returns repeated data.
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE` t
WHERE created_at >='2018-01-01'
GROUP BY id
Instead of processing and returning all columns, you can specify the exact list you need, as in the example below:
#standardSQL
SELECT AS VALUE ARRAY_AGG(STRUCT(id,created_at,operator_id,description) ORDER BY created_at LIMIT 1)[OFFSET(0)]
FROM `project.dataset.NAME_TABLE`
WHERE created_at >='2018-01-01'
GROUP BY id
Your query should be fine. But you can do this without a subquery (with desc this keeps the latest created_at per id; drop desc to keep the earliest):
select array_agg(nt order by created_at desc limit 1)[ordinal(1)].*
from name_table nt
where created_at >='2018-01-01'
group by id
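The sort direction inside ARRAY_AGG decides which duplicate survives. The sketch below, on a few made-up rows (the inline `name_table` data is hypothetical), puts both directions side by side so the choice can be checked before running it on the full table.

```sql
#standardSQL
-- Hypothetical sample data: id 1 appears twice, id 2 once.
WITH name_table AS (
  SELECT 1 AS id, TIMESTAMP '2018-01-05' AS created_at, 'first' AS description UNION ALL
  SELECT 1, TIMESTAMP '2018-02-01', 'second' UNION ALL
  SELECT 2, TIMESTAMP '2018-03-01', 'only'
)
SELECT
  id,
  -- ascending order keeps the earliest row per id
  ARRAY_AGG(nt ORDER BY created_at ASC LIMIT 1)[OFFSET(0)].description AS earliest_description,
  -- descending order keeps the latest row per id
  ARRAY_AGG(nt ORDER BY created_at DESC LIMIT 1)[OFFSET(0)].description AS latest_description
FROM name_table AS nt
GROUP BY id
ORDER BY id;
```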

Get the latest version of every row in SQL

I need to grab the latest version of every row so that I do not get duplicate data. "_sdc_sequence" is a Unix epoch attached to the record during replication; it determines the order of all the versions of a row.
I would like to get cost and impressions for each campaign every day.
I have tried to use INNER JOIN but I could not get the data. When I tried to use "account" and "clientname" as attributes (every row has the same clientname and account) I got zero for cost and impressions. Maybe the attributes are wrong.
SELECT DISTINCT day, cost, impressions, campaign
FROM `adxxxxx_xxxxxxxx` account
INNER JOIN (
SELECT
MAX(_sdc_sequence) AS seq,
campaignid
FROM `adxxxxx_xxxxxxxx`
GROUP BY campaignid) clientname
ON account.campaignid = clientname.campaignid
AND account._sdc_sequence = clientname.seq
ORDER by day
Is there another way to do this? Or how can I fix it?
thank you
#standardSQL
SELECT row.* FROM (
SELECT ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)] row
FROM `adxxxxx_xxxxxxxx` t
GROUP BY campaignid
)
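For the INNER JOIN attempt in the question, one likely cause of the missing rows is that the subquery takes MAX(_sdc_sequence) per campaignid only, so at most the single newest version of each campaign across all days can match. A possible fix, sketched below on made-up data, is to group by campaignid and day and join on both. The `sample` CTE, its values and its column types are assumptions for illustration.

```sql
#standardSQL
-- Hypothetical sample data standing in for the real table.
WITH sample AS (
  SELECT DATE '2019-01-01' AS day, 1 AS campaignid, 'A' AS campaign, 10.0 AS cost, 100 AS impressions, 1 AS _sdc_sequence UNION ALL
  SELECT DATE '2019-01-01', 1, 'A', 12.0, 120, 2 UNION ALL  -- newer version, same day
  SELECT DATE '2019-01-02', 1, 'A', 7.0, 70, 3 UNION ALL
  SELECT DATE '2019-01-01', 2, 'B', 5.0, 50, 1
)
SELECT t.day, t.campaign, t.cost, t.impressions
FROM sample AS t
INNER JOIN (
  -- latest sequence per campaign AND per day, not per campaign only
  SELECT campaignid, day, MAX(_sdc_sequence) AS seq
  FROM sample
  GROUP BY campaignid, day
) AS latest
ON t.campaignid = latest.campaignid
AND t.day = latest.day
AND t._sdc_sequence = latest.seq
ORDER BY t.day, t.campaign;
```

The same grouping change applies to the ARRAY_AGG form: GROUP BY campaignid, day instead of GROUP BY campaignid.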

BigQuery - removing duplicate records sometimes taking long

We implemented following ETL process in Cloud: run a query in our local database hourly => save the result as csv and load it into the cloud storage => load the file from cloud storage into BigQuery table => remove duplicate records using the following query.
SELECT
* EXCEPT (row_number)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number
FROM rawData.stock_movement
)
WHERE row_number = 1
Since 8 am (local time in Berlin) this morning, removing duplicate records has taken much longer than it usually does, even though the amount of data is not much different than usual: it normally takes 10s to remove duplicate records, whereas this morning it sometimes took half an hour.
Is the performance of removing duplicate records not stable?
It could be that you have many duplicate values for a particular id, so computing row numbers takes a long time. If you want to check for whether this is the case, you can try:
#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
With that said, it may be faster to remove duplicates with this query instead:
#standardSQL
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
FROM rawData.stock_movement AS t
GROUP BY t.id
);
Here is an example:
#standardSQL
WITH T AS (
SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
FROM T AS t
GROUP BY t.id
);
The reason that this may be faster is that BigQuery will only keep the row with the largest timestamp in memory at any given point in time.

PostgreSQL select daily max and corresponding hour of occurrence

I have the following table structure, with daily-hourly data:
time_of_ocurrence(timestamp); particles(numeric)
"2012-11-01 00:30:00";191.3
"2012-11-01 01:30:00";46
...
"2013-01-01 02:30:00";319.6
How do I select the DAILY max and THE HOUR in which this max occurs?
I've tried
SELECT date_trunc('hour', time_of_ocurrence) as hora,
MAX(particles)
from my_table WHERE time_of_ocurrence > '2013-09-01'
GROUP BY hora ORDER BY hora
But it doesn't work:
"2013-09-01 00:00:00";34.35
"2013-09-01 01:00:00";33.13
"2013-09-01 02:00:00";33.09
"2013-09-01 03:00:00";28.08
My result should be in this format instead (one max per day, showing the hour):
"2013-09-01 05:00:00";100.35
"2013-09-02 03:30:00";80.13
How can I do that? Thanks!
This type of question has come up on StackOverflow frequently, and these questions are categorized with the greatest-n-per-group tag, if you want to see other solutions.
edit: I changed the following code to group by day instead of by hour.
Here's one solution:
SELECT t.*
FROM (
SELECT date_trunc('day', time_of_ocurrence) as hora, MAX(particles) AS particles
FROM my_table
GROUP BY hora
) AS _max
INNER JOIN my_table AS t
ON _max.hora = date_trunc('day', t.time_of_ocurrence)
AND _max.particles = t.particles
WHERE time_of_ocurrence > '2013-09-01'
ORDER BY time_of_ocurrence;
This might also show more than one result per day, if more than one row has the max value.
Another solution using window functions that does not show such duplicates:
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY date_trunc('day', time_of_ocurrence)
ORDER BY particles DESC) AS _rn
FROM my_table
) AS _max
WHERE _rn = 1
ORDER BY time_of_ocurrence;
If multiple rows have the same max, one row will nevertheless be numbered row 1. If you need specific control over which row is numbered 1, add a unique column to the ORDER BY of the window clause to break such ties.
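A minimal sketch of such a tie-breaker, assuming time_of_ocurrence is unique per row and that the earliest hour should win. The inline rows are made up for illustration; the first two share the same daily max.

```sql
-- PostgreSQL. Hypothetical sample data with a tie on 2013-09-01.
WITH my_table (time_of_ocurrence, particles) AS (
  VALUES
    (TIMESTAMP '2013-09-01 03:30:00', 100.35),
    (TIMESTAMP '2013-09-01 05:30:00', 100.35),
    (TIMESTAMP '2013-09-02 03:30:00', 80.13)
)
SELECT time_of_ocurrence, particles
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY date_trunc('day', time_of_ocurrence)
      -- highest particles first; among ties, the earliest timestamp first
      ORDER BY particles DESC, time_of_ocurrence ASC
    ) AS _rn
  FROM my_table
) AS _max
WHERE _rn = 1
ORDER BY time_of_ocurrence;
```

Switching the second sort key to DESC would keep the latest hour among tied rows instead.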
Use window functions:
select distinct
date_trunc('day',time_of_ocurrence) as day,
max(particles) over (partition by date_trunc('day',time_of_ocurrence)) as particles_max_of_day,
first_value(date_trunc('hour',time_of_ocurrence)) over (partition by date_trunc('day',time_of_ocurrence) order by particles desc)
from my_table
order by 1
One edge case here is if the same MAX number of particles shows up on the same day, but in different hours. This version would pick one of them arbitrarily. If you prefer one over the other (always the earlier one, for example) you can add that to the order by clause:
first_value(date_trunc('hour',time_of_ocurrence)) over (partition by date_trunc('day',time_of_ocurrence) order by particles desc, time_of_ocurrence)