Google BigQuery - simulate Pandas removeDuplicates() in Google BigQuery SQL

Given a Google BigQuery dataset with columns col_1...col_m, how can you use BigQuery SQL to return the dataset with no duplicates in, say, [col1, col3, col7], such that when several rows share the same values in [col1, col3, col7], the first of those rows is kept and the rest are removed?
Example: removeDuplicates([col1, col3])
col1 col2 col3
---- ---- ----
r1: 20 25 30
r2: 20 70 30
r3: 40 70 30
returns
col1 col2 col3
---- ---- ----
r1: 20 25 30
r3: 40 70 30
To do this using Python pandas is easy: for a DataFrame (i.e. a matrix), you call drop_duplicates(subset=[field1, field2, ...]) (the actual pandas name for removeDuplicates). However, removeDuplicates is not specified within the context of Google BigQuery SQL.
My best guess for how to do it in Google BigQuery is to use the rank() function:
https://cloud.google.com/bigquery/query-reference#rank
I am looking for a concise solution if one exists.

You can group by all of the columns you want to remove duplicates from, and use FIRST() for the others. That is, removeDuplicates([col1, col3]) would translate to
SELECT col1, FIRST(col2) as col2, col3
FROM table
GROUP EACH BY col1, col3
Note that in BigQuery SQL, if you have more than a million distinct values for col1 and col3, you'll need the EACH keyword.
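FIRST() and GROUP EACH BY belong to BigQuery's legacy SQL; in today's standard SQL a common equivalent is ROW_NUMBER(). A minimal sketch (the table path is a placeholder):
SELECT * EXCEPT(rn)  -- drop the helper column from the output
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY col1, col3) AS rn  -- 1 for the first row of each duplicate group
  FROM `project.dataset.table`
)
WHERE rn = 1;  -- keep one row per (col1, col3) combination
Note that without an ORDER BY inside the OVER clause, which row counts as "first" is nondeterministic.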


separate nested columns into rows

I was trying with cross join and unnest, but I only managed to split one column, not all three at the same time.
I have this table in Amazon Athena, where COL4, COL5 and COL6 are array columns:
COL1   COL2  COL3   COL4   COL5   COL6
------ ----- ------ ------ ------ ------------------
765045 5782  jd938  [1,2]  [a,b]  [pickup,delivery]
41118  78995 kd982  [5,8]  [g,q]  [pickup,delivery]
411620 65852 km0899 [9,6]  [k,b]  [pickup,delivery]
and I want to separate the columns with lists into rows, leaving a table like this:
COL1   COL2  COL3   COL4 COL5 COL6
------ ----- ------ ---- ---- --------
765045 5782  jd938  1    a    pickup
765045 5782  jd938  2    b    delivery
41118  78995 kd982  5    g    pickup
41118  78995 kd982  8    q    delivery
411620 65852 km0899 9    k    pickup
411620 65852 km0899 6    b    delivery
select t.COL1, t.COL2, t.COL3, u.COL4
from t
cross join unnest(t.COL4) as u(COL4)
I was thinking of making subtables and repeating this code three times, but I wanted to know if there is a more efficient way.
unnest supports handling multiple columns in one statement. You can also use the succinct syntax that omits the CROSS JOIN:
select t.COL1, t.COL2, t.COL3, u.COL4, u.COL5, u.COL6
from t,
unnest(t.COL4, t.COL5, t.COL6) as u(COL4, COL5, COL6)
Note that for arrays of different cardinality it will substitute missing values with nulls. And if all the arrays are empty, the row will not appear in the final result at all (you can work around this by adding a dummy one-element array, as was done here).
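To illustrate that workaround, a minimal sketch (the dummy column name is hypothetical): UNNEST zips multiple arrays to the length of the longest one, so adding a one-element dummy array guarantees at least one output row per input row, with nulls standing in for the empty arrays:
select t.COL1, t.COL2, t.COL3, u.COL4, u.COL5, u.COL6
from t
cross join unnest(t.COL4, t.COL5, t.COL6, array[1]) as u(COL4, COL5, COL6, dummy)
-- dummy forces one output row even when COL4, COL5 and COL6 are all empty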

How to count all rows in a raw data file using Hive?

I am reading some raw input which looks something like this:
20 abc def
21 ghi jkl
mno pqr
23 stu
Note the first two rows are "good" rows and the last two rows are "bad" rows since they are missing some data.
Here is the snippet of my Hive query that reads this raw data into a read-only external table:
DROP TABLE IF EXISTS readonly_s3;
CREATE EXTERNAL TABLE readonly_s3 (id string, name string, data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
I need to get the count of ALL the rows, both "good" and "bad." The problem is that some of the data is missing, and if I do SELECT count(id) as total_rows, for example, that doesn't work, since not all the rows have an id.
Any suggestions on how I can count ALL the rows in this raw data file?
Hmmm... You can use (note the table's columns are id, name and data, not col1/col2/col3):
select sum(case when id is not null and name is not null and data is not null then 1 else 0 end) as num_good,
       sum(case when id is null or name is null or data is null then 1 else 0 end) as num_bad
from readonly_s3;
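num_good + num_bad adds up to the total. If you only need the total, COUNT(*) counts every row regardless of nulls (unlike COUNT(id), which skips rows whose id is null):
select count(*) as total_rows  -- counts all rows, "good" and "bad"
from readonly_s3;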

Conditionally removing duplicate records

I'm storing some real-time data in SQLite. Now I want to remove duplicate records with SQL commands, to reduce the data and enlarge its timeframe to 20 seconds.
Sample data:
id   t        col1  col2
------------------------
23   9:19:18  15    16
24   9:19:20  10    11
25   9:19:20  10    11
26   9:19:35  10    11
27   9:19:45  10    11
28   9:19:53  10    11
29   9:19:58  10    11
Logic: in the above sample, records 25-28 have the same values in the col1 and col2 fields, so they are duplicates. But because keeping one (for example, record 25) and removing the others would cause the timeframe (the time difference between consecutive records) to exceed 20s, I don't want to remove all of records 26-28. So in the above sample, row 25 will be kept because it's not a duplicate of its previous row (this walkthrough describes the sample before row 24 was added; see the edit below). Row 26 will be kept because, although it is a duplicate of its previous row, removing it would push the timeframe over 20s (19:45 - 19:20). Row 27 will be removed, since it meets both conditions, and row 28 will be kept.
I can load the data into a C# DataTable and apply this logic in a loop over the records, but that is slow compared to running SQL in the database. I'm not sure whether this can be implemented in SQL. Any help would be greatly appreciated.
Edit: I've added another row before row 25 to show rows with the same time. Fiddle is here: Link
OK, so here's an alternative answer that handles the duplicate-record scenario you've described, uses LAG and LEAD, and ends up considerably simpler, as it turns out!
delete from t1 where id in
(
  with cte as (
    select id,
           lag(t, 1) over (partition by col1, col2 order by t) as prev_t,
           lead(t, 1) over (partition by col1, col2 order by t) as next_t
    from t1
  )
  select id
  from cte
  where strftime('%H:%M:%S', next_t, '-20 seconds') < strftime('%H:%M:%S', prev_t)
)
Online demo here
I believe this accomplishes what you are after:
delete from t1 where id in
(
  select ta.id
  from t1 as ta
  join t1 as tb
    on tb.t = (select max(t) from t1
               where t < ta.t and col1 = ta.col1 and col2 = ta.col2)
   and tb.col1 = ta.col1 and tb.col2 = ta.col2
  join t1 as tc
    on tc.t = (select min(t) from t1
               where t > ta.t and col1 = ta.col1 and col2 = ta.col2)
   and tc.col1 = ta.col1 and tc.col2 = ta.col2
  where strftime('%H:%M:%S', tc.t, '-20 seconds') < strftime('%H:%M:%S', tb.t)
)
Online demo is here, where I've gone through a couple of iterations to simplify it to the above. Basically, you need to look at both the previous row and the next row to determine whether you can delete the current row, which happens only when there's a difference of less than 20 seconds between the previous and next row times, as I understand your requirement.
Note: you could probably achieve the same using LAG and LEAD, but I'll leave that as an exercise for anyone else who's interested!
EDIT: in case the time values are not unique, I've added col1 and col2 conditions to the ta/tb and ta/tc joins and updated the fiddle.
I think you can do the following (a sketch follows the list):
1. Create a result set that adds the previous row's values, ordered by id, using the LAG window function (https://www.sqlitetutorial.net/sqlite-window-functions/sqlite-lag/).
2. Calculate a new column using the CASE construct (https://www.sqlitetutorial.net/sqlite-case/). This could be a boolean column called "keep", computed as follows:
   - if the previous row's col1 and col2 values are not the same => true
   - if the previous row's col1 and col2 values are the same but the time difference > 20 sec => true
   - in all other cases => false
3. Filter on this query to select only the rows to keep (keep = true).
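A minimal sketch of that approach, assuming the table t1 from the question and that t is stored in a format SQLite's date functions accept (e.g. 'HH:MM:SS'):
with flagged as (
  select id, t, col1, col2,
         lag(col1) over (order by id) as prev_col1,  -- previous row's values
         lag(col2) over (order by id) as prev_col2,
         lag(t)    over (order by id) as prev_t
  from t1
)
select id, t, col1, col2,
       case
         when prev_t is null then 1                                   -- first row: keep
         when col1 <> prev_col1 or col2 <> prev_col2 then 1           -- not a duplicate: keep
         when strftime('%s', t) - strftime('%s', prev_t) > 20 then 1  -- gap over 20s: keep
         else 0
       end as keep
from flagged;
Wrapping this query and filtering on keep = 1 then returns only the surviving rows.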

Sum up adjacent columns in SQL

I'm asking for a solution without functions or procedures (permissions problem).
I have a table like this, where k is the number of columns (in reality, k = 500):
col1 col2 col3 col4 col5.... col(k)
10 20 30 -50 60 100
and I need to create a cumulative row like this:
col1 col2 col3 col4 col5 ... col(k)
10 30 60 10 70 X
In Excel, it's simple to write a formula and drag it across, but in SQL, if I have a lot of columns, it seems very clumsy to add them manually (col1 as col1, col1+col2 as col2, col1+col2+col3 as col3, and so on up to col(k)).
Any way of finding a good solution for this problem?
You say that you've changed your data model to rows. So let's say that the new table has three columns:
grp (some group key to identify which rows belong together, i.e. what was one row in your old table)
pos (a position number from 1 to 500 to indicate the order of the values)
value
You get the cumulative sums with SUM OVER:
select grp, pos, value,
       sum(value) over (partition by grp order by pos) as running_total
from mytable
order by grp, pos;
If this "colk" is going to be needed/used in a lot of reports, I suggest you create a computed column or a view to sum all the columns using k = cola+colb+...
There's no function in sql to sum up columns (ex. between colA and colJ)
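For illustration, a minimal sketch of such a view for the first few columns (wide_table is a hypothetical name for the original table):
create view cumulative_row as
select col1,
       col1 + col2 as col2,
       col1 + col2 + col3 as col3
       -- ...continue the pattern by hand through col(k)
from wide_table;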

PL/SQL - Multiple column equality

I'm trying to evaluate multiple columns to save myself a few keystrokes (granted, at this point, the time and effort of the search has long since negated any "benefit" I would ever receive) rather than writing multiple separate comparisons.
Basically, I have:
WHERE column1 = column2
AND column2 = column3
I want:
WHERE column1 = column2 = column3
I found this other article, that was tangentially related:
Oracle SQL Syntax - Check multiple columns for IS NOT NULL
Use:
x=all(y,z)
instead of
x=y and y=z
The above saves 1 keystroke (1/11 = 9% - not much).
If column names are longer, then it gives bigger savings:
This is 35 characters long:
column1=column2 AND column2=column3
while this one only 28
column1=ALL(column2,column3)
But for this one (95 characters):
column1=column2 AND column2=column3 AND column3=column4
AND column4=column5 AND column5=column6
you will get 43/95 = almost 50% savings
column1=all(column2,column3,column4,column5,column6)
The ALL operator is part of ANSI SQL and is supported by most databases (MySQL, PostgreSQL, SQL Server, etc.):
http://www.w3resource.com/sql/special-operators/sql_all.php
A simple test case that shows how it works:
create table t( x int, y int, z int );
insert all
into t values( 1,1,1)
into t values(1,2,2)
into t values(1,1,2)
into t values(1,2,1)
select 1 from dual;
select *
from t
where x = all(y,z);
X Y Z
---------- ---------- ----------
1 1 1
One possible trick is to utilize the LEAST and GREATEST functions: if the largest and the smallest of a list of values are equal, all the values must be equal:
LEAST(col1, col2, col3) = GREATEST(col1, col2, col3)
I'm not sure it saves any keystrokes on a three-column list, but if you have many columns, it could save some characters. Note that this solution implicitly assumes that none of the values are null, but so does your original solution, so it should be OK.
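For a quick usage sketch against the table t from the ALL test case above:
select *
from t
where least(x, y, z) = greatest(x, y, z);  -- true only when x, y and z are all equal
This returns only the (1, 1, 1) row. In Oracle, LEAST and GREATEST return null if any argument is null, so rows containing nulls are filtered out.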