BigQuery select non-duplicate records - google-bigquery

Consider the following table (simplified version):
id int,
amount decimal,
transaction_no,
location_id int,
created_at datetime
The above schema is used to store POS receipts for restaurants. This table sometimes contains multiple receipts with the same date and transaction_no at the same location_id.
In that case, I want to get the last receipt for that location_id & transaction_no, ordered by created_at descending.
In MySQL, I use the following query, which gets me the last (max(created_at)) receipt for a location_id & transaction_no:
SELECT id, amount, transaction_no, location_id, created_at
FROM receipts r JOIN
     (SELECT transaction_no, max(created_at) AS maxca
      FROM receipts
      GROUP BY transaction_no
     ) t
     ON r.transaction_no = t.transaction_no AND r.created_at = t.maxca
GROUP BY location_id;
But when I run the same in BigQuery, I get the following error:
Query Failed Error: Shuffle reached broadcast limit for table __I0
(broadcasted at least 150393576 bytes). Consider using partitioned
joins instead of broadcast joins . Job ID:
circular-gist-812:job_A_CfsSKJICuRs07j7LHVbkqcpSg
Any idea how to make the above query work in BigQuery?

SELECT id, amount, transaction_no, location_id, created_at
FROM (
  SELECT
    id, amount, transaction_no, location_id, created_at,
    ROW_NUMBER() OVER(PARTITION BY transaction_no, location_id
                      ORDER BY created_at DESC) AS last
  FROM your_dataset.your_table
)
WHERE last = 1
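Note: if your BigQuery environment supports the QUALIFY clause, the same pattern can be written without the subselect (a sketch, with the table name assumed as above):

SELECT id, amount, transaction_no, location_id, created_at
FROM your_dataset.your_table
WHERE TRUE  -- BigQuery requires a WHERE/GROUP BY/HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER(PARTITION BY transaction_no, location_id
                          ORDER BY created_at DESC) = 1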

Related

SQL Query to get the last unique value from a column

I have a SQL table with the following columns:
id | created_at
I want to get a dictionary of all the unique ids and the latest created_at time for each.
My current query is this:
SELECT id, created_at from uscis_case_history
ORDER BY created_at DESC
LIMIT 1
This gives me the latest row overall, but I want one for each unique id.
Sounds like you just need to aggregate:
SELECT id, MAX(created_at) LatestCreated_at
FROM uscis_case_history
GROUP BY id
ORDER BY MAX(created_at) DESC;
You can use row_number to index the dates based on their id and then filter by that:
SELECT *
FROM (SELECT id,
             created_at,
             row_number() over (partition by id order by created_at desc) rn
      FROM tbl) a
WHERE rn = 1
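A minimal self-contained demo with hypothetical data:

-- hypothetical sample data
CREATE TABLE uscis_case_history (id int, created_at timestamp);
INSERT INTO uscis_case_history VALUES
    (1, '2020-01-01 10:00'),
    (1, '2020-01-02 10:00'),  -- latest for id 1
    (2, '2020-01-01 09:00');  -- only row for id 2

-- one row per id with its latest created_at
SELECT *
FROM (SELECT id,
             created_at,
             row_number() over (partition by id order by created_at desc) rn
      FROM uscis_case_history) a
WHERE rn = 1;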

How to grab the last value in a column per user for the last date

I have a table that contains three columns: ACCOUNT_ID, STATUS, CREATE_DATE.
I want to grab only the LAST status for each account_id based on the latest create_date.
In my example data, I should see only three records, with just the last STATUS for account_2.
Do you know a way to do this?
create table tbl (
    account_id int,
    status string,
    create_date date)
select account_id, max(create_date) from tbl group by account_id;
will give you the account_id and create_date at the closest past date to today (assuming create_date can never be in the future, which makes sense).
Now you can join with that data to get what you want, something along these lines, for example:
select account_id, status, create_date from tbl where (account_id, create_date) in (<the select expression from above>);
If you use that frequently (account with the latest create date), then consider defining a view for that.
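For example, a sketch of such a view, using the same tuple-IN pattern as above (the view name is made up):

create view latest_per_account as
select account_id, status, create_date
from tbl
where (account_id, create_date) in (
    select account_id, max(create_date)
    from tbl
    group by account_id
);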
If you have many columns and want to keep the whole latest row, you can use QUALIFY to run the ranking logic and keep the best, like so:
SELECT *
FROM tbl
QUALIFY row_number() over (partition by account_id order by create_date desc) = 1;
The long form is the same pattern that Ely shows in the second answer. With the MAX(CREATE_DATE) solution, if you have two rows on the same last day, the IN solution will give you both. You can get the same behavior via QUALIFY if you use RANK.
So the SQL is the same as:
SELECT account_id, status, create_date
FROM (
    SELECT *,
           row_number() over (partition by account_id order by create_date desc) as rn
    FROM tbl
)
WHERE rn = 1;
So the RANK form, which will show all tied rows, is:
SELECT *
FROM tbl
QUALIFY rank() over (partition by account_id order by create_date desc) = 1;
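To see the tie behavior concretely, insert hypothetical rows like these into the tbl defined above; rank() = 1 then returns both of account 1's tied rows, while row_number() = 1 returns only one of them (an arbitrary one, since the ORDER BY has no tiebreaker):

insert into tbl values
    (1, 'open',   '2021-06-01'),  -- tied on the latest date
    (1, 'closed', '2021-06-01'),  -- tied on the latest date
    (2, 'open',   '2021-05-01');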

Select only 3 rows per user - SQL Query

These are the columns in my table
id (autogenerated)
created_user
created_date
post_text
This table has a lot of rows. I want to take the latest 3 posts of every created_user.
I am new to SQL and need help. I ran the query below in my Postgres database and it does not give me what I need:
SELECT * FROM posts WHERE created_date IN
(SELECT MAX(created_date) FROM posts GROUP BY created_date)
You could use the row_number() window function to create an ordered row number per user. After that you can easily filter by this value:
SELECT *
FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY created_user ORDER BY created_date DESC)
    FROM posts
) s
WHERE row_number <= 3
PARTITION BY groups the rows per user.
ORDER BY created_date DESC orders each user's posts newest-first, so the most recent post gets row_number = 1, and so on. (The outer filter works because Postgres names the unaliased window column row_number.)
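A slightly more explicit variant names the window column instead of relying on that default (same logic, identifiers from the question):

SELECT *
FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY created_user ORDER BY created_date DESC) AS rn
    FROM posts
) s
WHERE rn <= 3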

Getting a unique ID for the maximum amount grouped by day in BigQuery

I have this query in BigQuery:
SELECT
ID,
max(amount) as money,
STRFTIME_UTC_USEC(TIMESTAMP(time), '%j') as day
FROM table
GROUP BY day
The console shows an error because it wants ID added to the GROUP BY clause, but if I add ID to the GROUP BY I get many IDs for a given day.
I want to print a single ID with the maximum amount for a specific day.
For ex:
ID: 1 Money:123 Day:365
It's not clear from the question, but it looks like you already have only one entry per given id for a particular day. Assuming this, the query below does what you need:
SELECT id, amount, day
FROM (
  SELECT
    id, amount, day,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY amount DESC) AS win
  FROM dataset.table
)
WHERE win = 1
 
In case the above assumption is wrong (so you have multiple entries for the same id on the same day), use the below:
SELECT id, amount, day
FROM (
  SELECT
    id, amount, day,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY amount DESC) AS win
  FROM (
    SELECT id, SUM(amount) AS amount, day
    FROM dataset.table
    GROUP BY id, day
  )
)
WHERE win = 1
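In BigQuery Standard SQL, the same top-row-per-group logic can also be written with ARRAY_AGG instead of a window function (a sketch, reusing dataset.table from above):

-- one struct per day: the row with the highest amount
SELECT
  day,
  ARRAY_AGG(STRUCT(id, amount) ORDER BY amount DESC LIMIT 1)[OFFSET(0)] AS top
FROM dataset.table
GROUP BY day

top.id and top.amount then carry the winning row's values, and the LIMIT 1 inside the aggregate keeps the per-group state small.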
 

How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" which currently allows duplicate values across the following fields:
year, user_id, sid, cid
The unique constraint is currently the first field, called "id". I am now looking to add a constraint to make sure the combination of year, user_id, sid and cid is unique, but I cannot apply the constraint because duplicate values already exist which violate it.
Is there a way to find all duplicates?
The basic idea is to use a nested query with a count aggregation:
select * from yourTable ou
where (select count(*) from yourTable inr
       where inr.sid = ou.sid) > 1
You can adjust the where clause in the inner query to narrow the search.
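For the question's four fields specifically, the inner where clause would compare all of them (a sketch; note that = never matches NULLs, so rows whose key fields are NULL are not flagged):

select * from user_links ou
where (select count(*) from user_links inr
       where inr.year = ou.year
         and inr.user_id = ou.user_id
         and inr.sid = ou.sid
         and inr.cid = ou.cid) > 1

In PostgreSQL you can swap = for IS NOT DISTINCT FROM to treat NULLs as equal.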
There is another good solution for that, mentioned in the comments (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
From "Find duplicate rows with PostgreSQL", here's a smart solution:
select * from (
    SELECT id,
           ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
    FROM tbl
) dups
where dups.Row > 1
To make it easier, I assume that you wish to apply a unique constraint only to the column year, and that the primary key is a column named id.
In order to find duplicate values you should run,
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
Using the SQL statement above, you get a table which contains all the duplicate years in your table. In order to delete all the duplicates except the latest duplicate entry, you should use the SQL statement below.
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE_AGAIN B
WHERE A.year=B.year AND A.id<B.id;
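The same statement extended to all four columns from the question would look like this (a sketch; again, = never matches NULLs, so NULL-keyed duplicates survive):

DELETE
FROM user_links A USING user_links B
WHERE A.year = B.year
  AND A.user_id = B.user_id
  AND A.sid = B.sid
  AND A.cid = B.cid
  AND A.id < B.id;  -- keeps the row with the highest id in each group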
You can join the table to itself on the fields that would be duplicated, and then anti-join on the id field. Select the id field from the first table alias (tn1) and use the array_agg function on the id field of the second table alias (tn2). Finally, for the array_agg function to work properly, group the results by tn1.id. This produces a result set that contains the id of a record together with an array of all the ids that fit the join conditions.
select tn1.id,
       array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
     tn1.year = tn2.year
     and tn1.sid = tn2.sid
     and tn1.user_id = tn2.user_id
     and tn1.cid = tn2.cid
     and tn1.id <> tn2.id
group by tn1.id;
Obviously, ids that appear in the duplicate_entries array for one id will also have their own entries in the result set. You will have to use this result set to decide which id should become the source of truth: the one record that shouldn't get deleted. Maybe you could do something like this:
with dupe_set as (
    select tn1.id,
           array_agg(tn2.id) as duplicate_entries
    from table_name tn1 join table_name tn2 on
         tn1.year = tn2.year
         and tn1.sid = tn2.sid
         and tn1.user_id = tn2.user_id
         and tn1.cid = tn2.cid
         and tn1.id <> tn2.id
    group by tn1.id
    order by tn1.id asc)
select ds.id from dupe_set ds where not exists
    (select de from unnest(ds.duplicate_entries) as de where de < ds.id)
This selects the lowest-numbered ids that have duplicates (assuming the id is an increasing int PK). These would be the ids that you keep around.
Inspired by Sandro Wiggers, I did something similar:
WITH ordered AS (
    SELECT id, year, user_id, sid, cid,
           rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
    FROM user_links
),
to_delete AS (
    SELECT id
    FROM ordered
    WHERE rnk > 1
)
DELETE
FROM user_links
USING to_delete
WHERE user_links.id = to_delete.id;
If you want to test it, change it slightly:
WITH ordered AS (
    SELECT id, year, user_id, sid, cid,
           rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
    FROM user_links
),
to_delete AS (
    SELECT id, year, user_id, sid, cid
    FROM ordered
    WHERE rnk > 1
)
SELECT * FROM to_delete;
This will give an overview of what is going to be deleted. (It is fine to keep year, user_id, sid, cid in the to_delete query when running the deletion, but then they are not needed.)
In your case, because of the constraint, you need to delete the duplicated records:
1. Find the duplicated rows
2. Organize them by created_at date (in this case I'm keeping the oldest)
3. Delete the records, with USING to filter the right rows
WITH duplicated AS (
    SELECT id,
           count(*)
    FROM products
    GROUP BY id
    HAVING count(*) > 1),
ordered AS (
    SELECT p.id,
           p.created_at,
           rank() OVER (PARTITION BY p.id ORDER BY p.created_at) AS rnk
    FROM products p
    JOIN duplicated d ON d.id = p.id),
products_to_delete AS (
    SELECT id,
           created_at
    FROM ordered
    WHERE rnk > 1
)
DELETE
FROM products
USING products_to_delete
WHERE products.id = products_to_delete.id
  AND products.created_at = products_to_delete.created_at;
The following SQL syntax provides better performance when checking for duplicate rows:
SELECT id, count(id)
FROM table1
GROUP BY id
HAVING count(id) > 1
begin;
create table user_links(id serial, year bigint, user_id bigint, sid bigint, cid bigint);
insert into user_links(year, user_id, sid, cid) values
    (null, null, null, null),
    (null, null, null, null),
    (null, null, null, null),
    (1, 2, 3, 4),
    (1, 2, 3, 4),
    (1, 2, 3, 4),
    (1, 1, 3, 8),
    (1, 1, 3, 9),
    (1, null, null, null),
    (1, null, null, null);
commit;
A set operation with DISTINCT ON and EXCEPT:
(select id, year, user_id, sid, cid from user_links order by 1)
except
select distinct on (year, user_id, sid, cid) id, year, user_id, sid, cid
from user_links order by 1;
EXCEPT ALL also works, since the serial id makes all rows unique.
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1;
So far this works for both nulls and non-nulls.
The delete:
with a as (
    (select id, year, user_id, sid, cid from user_links order by 1)
    except all
    select distinct on (year, user_id, sid, cid)
        id, year, user_id, sid, cid from user_links order by 1
)
delete from user_links using a where user_links.id = a.id returning *;
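Once the duplicates are gone, the unique constraint the question asked for can finally be added (a sketch; the constraint name is made up, and note that a plain UNIQUE constraint in PostgreSQL still allows multiple rows with NULLs in the key columns):

alter table user_links
    add constraint user_links_year_user_sid_cid_key
    unique (year, user_id, sid, cid);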