Deduping data in BigQuery

I have a query that shows only the non-duplicate values. I am looking for a way to use this deduped data in other queries.
I do not have permissions to create anything, so I need a solution that works without creating tables or views.
IDAN
EDIT (from "answer"):
These are the fields in my table "Purchases": user_id, purchase_amount, purchase_sku, source, device_type, and uuid (a unique identifier for each row).
Rows are considered duplicates when all fields except uuid are identical. I need to return deduplicated data and prepare it for use in other queries.
This is the basic data, with duplicated values in rows 5-6 and 7-8:
I want to show the non-duplicate rows and, for each set of duplicated rows, only one row, like this:
deduped data

Consider the generic solution below. You do not need to list the column names at all; only uuid is referenced in the query:
select any_value(t).*
from `project.dataset.table` t
group by to_json_string((select as struct * except(uuid) from unnest([t])))

You can use qualify with row_number():
select p.*
from purchases p
where 1=1
qualify row_number() over (partition by user_id, purchase_amount, purchase_sku, source, device_type order by uuid) = 1;
You can also use aggregation:
select user_id, purchase_amount, purchase_sku, source, device_type,
min(uuid) as uuid
from purchases
group by 1, 2, 3, 4, 5;
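Since the question also asks how to reuse the deduped rows without permission to create anything, one option is to wrap the dedup query in a WITH clause (a CTE) and query that. The following is a minimal sketch, not BigQuery itself: it runs the same idea against an in-memory SQLite database through Python's sqlite3 module. The sample rows and the CTE name `deduped` are made up for illustration; only the column names come from the question.

```python
import sqlite3

# In-memory stand-in for the Purchases table; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (user_id INTEGER, purchase_amount INTEGER, "
    "purchase_sku TEXT, source TEXT, device_type TEXT, uuid TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, "sku_a", "web", "ios", "u1"),
        (1, 10, "sku_a", "web", "ios", "u2"),  # duplicate of u1 except for uuid
        (2, 5, "sku_b", "app", "android", "u3"),
    ],
)

# The CTE acts like a temporary deduplicated table. It needs no CREATE
# permission and can be referenced by the query that follows it.
query = """
WITH deduped AS (
    SELECT user_id, purchase_amount, purchase_sku, source, device_type,
           MIN(uuid) AS uuid
    FROM purchases
    GROUP BY user_id, purchase_amount, purchase_sku, source, device_type
)
SELECT user_id, COUNT(*) AS purchases, SUM(purchase_amount) AS total
FROM deduped
GROUP BY user_id
ORDER BY user_id
"""
totals = conn.execute(query).fetchall()
print(totals)
```

Without the CTE, user 1 would be counted twice; with it, each user in this sample has exactly one purchase. The same WITH pattern should carry over to the qualify version, but that is untested here.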

Related

How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.

Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are many rows with the same userid value and the same dobyr value, one of them is kept (it doesn't matter which one) and the rest are discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid,dobyr) *
from userinteractions
where userid in (
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1 )
order by userid,dobyr;
This could also be done with a not in, not exists or exists condition. You can also choose which row of each combination to keep by adding columns at the end of order by.
Updated demo with tests and more rows.
If you don't need the other columns in the table, only something you'll later use as a filter/whitelist, plain userid's from records with (userid,dobyr) pairs matching your criteria are enough, as they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1

Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1
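To check the HAVING approach against the sample data from the question, here is a small sketch that uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for Postgres (an assumption; the query itself uses nothing engine-specific). The ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userinteractions (userid INTEGER, dobyr INTEGER)")

# Sample rows copied from the question; userid 3 has conflicting dobyr values.
conn.executemany(
    "INSERT INTO userinteractions VALUES (?, ?)",
    [(1, 1995), (1, 1995), (2, 1999), (3, 1990), (3, 1999), (4, 1989), (4, 1989)],
)

# Keep one row per userid, but only for users with a single distinct dobyr.
clean = conn.execute(
    """
    SELECT userid, MAX(dobyr) AS dobyr
    FROM userinteractions
    GROUP BY userid
    HAVING COUNT(DISTINCT dobyr) = 1
    ORDER BY userid
    """
).fetchall()
print(clean)  # [(1, 1995), (2, 1999), (4, 1989)]
```

MAX(dobyr) is safe here because the HAVING clause guarantees every surviving group holds a single dobyr value, so MAX simply returns it.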

Group by question in SQL Server, migration from MySQL

Failed finding a solution to my problem, would love your help.
~~ Post has been edited to have only one question ~~-
Group by one query while selecting multiple columns.
In MySQL you can simply group by whatever you want and it will still select all the columns. For example, suppose I wanted to select the newest 100 transactions, grouped by Email (getting only the last transaction for each email).
In MySQL I would do that:
SELECT * FROM db.transactionlog
group by Email
order by TransactionLogId desc
LIMIT 100;
In SQL Server it's not possible. Googling a bit suggested, as a hack, wrapping each column I want in an aggregate. Couldn't that cause a mix of values (mixing columns between the grouped rows)?
For example:
SELECT TOP(100)
Email,
MAX(ResultCode) as 'ResultCode',
MAX(Amount) as 'Amount',
MAX(TransactionLogId) as 'TransactionLogId'
FROM [db].[dbo].[transactionlog]
group by Email
order by TransactionLogId desc
TransactionLogId is the primary key, which is an identity column; I order by it to get the last inserted row.
I just want to know whether the ResultCode and Amount returned by such a query will be those of the last inserted row, and not the highest values among the grouped rows.
~Edit~
Sample data -
row1:
Email : test@email.com
ResultCode : 100
Amount : 27
TransactionLogId : 1
row2:
Email: test@email.com
ResultCode:50
Amount: 10
TransactionLogId: 2
Using the sample data above, my goal is to get the row details of
TransactionLogId = 2.
But what actually happens is that I get mixed values of the two: I do get TransactionLogId = 2, but the ResultCode and Amount of the first row.
How do I avoid that?
Thanks.

You should first find out which is the latest transaction log by each email, then join back against the same table to retrieve the full record:
;WITH MaxTransactionByEmail AS
(
SELECT
Email,
MAX(TransactionLogId) as LatestTransactionLogId
FROM
[db].[dbo].[transactionlog]
group by
Email
)
SELECT
T.*
FROM
[db].[dbo].[transactionlog] AS T
INNER JOIN MaxTransactionByEmail AS M ON T.TransactionLogId = M.LatestTransactionLogId
You are currently getting mixed results because aggregate functions like MAX() consider all rows that correspond to a particular value of Email, each column independently. So the MAX() value for the Amount column between values 10 and 27 is 27, even though it belongs to the row with the lower transaction log id.
Another solution is using a ROW_NUMBER() window function to get a row-ranking by each Email, then just picking the first row:
;WITH TransactionsRanking AS
(
SELECT
T.*,
MostRecentTransactionLogRanking = ROW_NUMBER() OVER (
PARTITION BY
T.Email -- Start a different ranking for each different value of Email
ORDER BY
T.TransactionLogId DESC) -- Order the rows by the TransactionLogID descending
FROM
[db].[dbo].[transactionlog] AS T
)
SELECT
T.*
FROM
TransactionsRanking AS T
WHERE
T.MostRecentTransactionLogRanking = 1
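The difference between the per-column MAX() hack and the ROW_NUMBER() approach can be reproduced with the two sample rows from the question. This is a sketch against an in-memory SQLite database through Python's sqlite3 module, not SQL Server; it assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The email address and the CTE name `ranked` are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactionlog (TransactionLogId INTEGER PRIMARY KEY, "
    "Email TEXT, ResultCode INTEGER, Amount INTEGER)"
)
# The two sample rows from the question (placeholder email address).
conn.executemany(
    "INSERT INTO transactionlog VALUES (?, ?, ?, ?)",
    [(1, "user@example.com", 100, 27), (2, "user@example.com", 50, 10)],
)

# Per-column MAX(): each aggregate is computed independently, so the
# result mixes values from different rows.
mixed = conn.execute(
    "SELECT Email, MAX(ResultCode), MAX(Amount), MAX(TransactionLogId) "
    "FROM transactionlog GROUP BY Email"
).fetchall()

# ROW_NUMBER(): rank rows per Email, newest first, and keep whole rows.
latest = conn.execute(
    """
    WITH ranked AS (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY Email ORDER BY TransactionLogId DESC
               ) AS rn
        FROM transactionlog AS t
    )
    SELECT Email, ResultCode, Amount, TransactionLogId
    FROM ranked
    WHERE rn = 1
    """
).fetchall()

print(mixed)   # [('user@example.com', 100, 27, 2)]
print(latest)  # [('user@example.com', 50, 10, 2)]
```

The first result is exactly the mixed row described in the question; the second is the intact row with TransactionLogId = 2.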

Need To Pull Most Recent Record By Timestamp Per Unique ID

I'm going to apologize up front, this is my first question on stackoverflow...
I am attempting to query a table of records where each row has a VehicleID, latitude, longitude, timestamp and various other fields. What I need is to only pull the most recent latitude and longitude for each VehicleID.
edit: removed the term unique ID as apparently I was using it incorrectly.

If the Unique ID is truly unique, then you will always have the most recent latitude and longitude, because the ID will change with every single row.
If the Unique ID is a Foreign Key (or an ID referencing a unique ID from a different table) you should do something like this:
SELECT t1.latitude, t1.longitude, t1.unique_id
FROM table AS t1 INNER JOIN
(SELECT unique_id, MAX(timestamp) AS timestamp
FROM table
GROUP BY unique_id) AS t2 ON t1.timestamp = t2.timestamp
AND t1.unique_id = t2.unique_id;

You can use the row_number() function for this purpose:
select id, latitude, longitude, timestamp, . . .
from (select t.*,
row_number() over (partition by id order by timestamp desc) as seqnum
from t
) t
where seqnum = 1
The row_number() function assigns a sequential number to the rows within each id (the partition by clause), with the most recent timestamp getting the value of 1 (the order by clause). The outer where just chooses this one row per id.
This is an example of a window function, which I encourage you to learn more about.
One quibble with your question: you describe the id as unique. However, if there are multiple values at different times, then it is not unique.

Check this link to implement row indexes and utilize the partition to reset per group. Then in your WHERE clause filter out the results that aren't the first.

Counting the number of child records in a one-to-many relationship with SQL only

I have a database with two tables: data and file.
file_id is a foreign key from data to file. So, the relationship from data to file is n to one.
Is there a way with using SQL only to find out how many records of data refer to each record of file?
For example, I can find how many records of data are referring to file with id 13:
select count(*) from data where file_id = 13;
I want to know this for every file_id. I tried the following command to achieve this, but it gives a single count covering all file_id records:
mysql> select distinct file_id, count(*) from data where file_id in (select id from file);
+---------+----------+
| file_id | count(*) |
+---------+----------+
| 9 | 3510 |
+---------+----------+

Distinct removes duplicate rows from the result; it does not form groups. MySQL allows the use of aggregate functions without a group by, which is misleading. In this case you got an arbitrary file_id and a count of all records, certainly not what you intended.
To get group count (or any other aggregate function), use group by clause:
select file_id, count(*)
from data
group by file_id
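As a quick check of the GROUP BY answer, here is a sketch using an in-memory SQLite database through Python's sqlite3 module in place of MySQL. The rows are made up, and the second query (a LEFT JOIN from file) is an extra that goes slightly beyond the answer: it may be useful if files with no data records should show up with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE data (id INTEGER PRIMARY KEY, "
    "file_id INTEGER REFERENCES file(id))"
)
# Hypothetical rows: file 20 has no data records at all.
conn.executemany("INSERT INTO file VALUES (?)", [(9,), (13,), (20,)])
conn.executemany("INSERT INTO data (file_id) VALUES (?)", [(9,), (9,), (13,)])

# One count per file_id that appears in data.
per_file = conn.execute(
    "SELECT file_id, COUNT(*) FROM data GROUP BY file_id ORDER BY file_id"
).fetchall()

# Starting from file keeps files without children; COUNT(d.id) ignores
# the NULLs produced by the LEFT JOIN, so those files get 0.
with_zeros = conn.execute(
    """
    SELECT f.id, COUNT(d.id)
    FROM file AS f
    LEFT JOIN data AS d ON d.file_id = f.id
    GROUP BY f.id
    ORDER BY f.id
    """
).fetchall()

print(per_file)    # [(9, 2), (13, 1)]
print(with_zeros)  # [(9, 2), (13, 1), (20, 0)]
```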

GROUP BY...
SELECT file_id, COUNT(*)
FROM data
GROUP BY file_id

select file_id, count(*)
from data
group by file_id

Normalizing a table, from one to the other

I'm trying to normalize a mysql database....
I currently have a table that contains 11 columns for "categories". The first column is a user_id and the other 10 are category_id_1 - category_id_10. Some rows may only contain a category_id up to category_id_1 and the rest might be NULL.
I then have a table that has 2 columns, user_id and category_id...
What is the best way to transfer all of the data into separate rows in table 2 without adding a row for columns that are NULL in table 1?
thanks!

You can create a single query to do all the work, it just takes a bit of copy and pasting, and adjusting the column name:
INSERT INTO table2
SELECT * FROM (
SELECT user_id, category_id_1 AS category_id FROM table1
UNION ALL
SELECT user_id, category_id_2 FROM table1
UNION ALL
SELECT user_id, category_id_3 FROM table1
) AS T
WHERE category_id IS NOT NULL;
Since you only have to do this 10 times, and you can throw the code away when you are finished, I would think that this is the easiest way.
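If the copy and pasting feels error-prone, the UNION ALL branches can also be generated. The sketch below builds the statement in Python and runs it against an in-memory SQLite database via sqlite3 rather than MySQL; it uses three category columns instead of ten to stay short, and the sample rows are made up. The table names follow the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (user_id INTEGER, category_id_1 INTEGER, "
    "category_id_2 INTEGER, category_id_3 INTEGER)"
)
conn.execute("CREATE TABLE table2 (user_id INTEGER, category_id INTEGER)")
# Hypothetical rows; unused category slots are NULL.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [(1, 10, 20, None), (2, 30, None, None)],
)

# One SELECT per category column, stitched together with UNION ALL.
n_columns = 3  # would be 10 for the table in the question
branches = [
    f"SELECT user_id, category_id_{i} AS category_id FROM table1"
    for i in range(1, n_columns + 1)
]
sql = (
    "INSERT INTO table2 SELECT * FROM ("
    + " UNION ALL ".join(branches)
    + ") AS t WHERE category_id IS NOT NULL"
)
conn.execute(sql)

rows = conn.execute(
    "SELECT user_id, category_id FROM table2 ORDER BY user_id, category_id"
).fetchall()
print(rows)  # [(1, 10), (1, 20), (2, 30)]
```

The column names are generated from a fixed range rather than from user input, so building the SQL string by hand is acceptable here.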

One table for users:
users(id, name, username, etc)
One for categories:
categories(id, category_name)
One to link the two, including any extra information you might want on that join.
categories_users(user_id, category_id)
-- or with extra information --
categories_users(user_id, category_id, date_created, notes)
To transfer the data across to the link table would be a case of writing a series of SQL INSERT statements, one per category column. There's probably some awesome way to do it in one go, but since there are only 10 category columns, just copy-and-paste IMO:
INSERT INTO categories_users
SELECT user_id, category_id_1
FROM old_categories
WHERE category_id_1 IS NOT NULL
