BigQuery row_number to remove duplicates

I want to keep only the row with the latest timestamp for each ID in the table. Is there a more optimal and efficient way to solve the problem?
Here is the query I tried:
SELECT * except(row_number)
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY ID)
row_number
FROM employees
)
WHERE row_number = 1
employees table:
ID NAME DEPARTMENT UPDATED_AT
1 James IT 2019-05-21 12:13:14
1 James IT 2019-05-21 12:14:14
1 James IT 2019-05-21 12:18:14
2 Pam HR 2019-05-26 13:18:14
2 Pam HR 2019-05-26 14:18:14
3 David IT 2019-06-22 14:18:14
3 David IT 2019-06-23 12:18:14
result:
ID NAME DEPARTMENT UPDATED_AT
1 James IT 2019-05-21 12:18:14
2 Pam HR 2019-05-26 14:18:14
3 David IT 2019-06-23 12:18:14

You are just missing the ORDER BY clause in your subquery statement.
WITH
DATA AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS _row,
*
FROM
employees )
SELECT
* EXCEPT(_row)
FROM
DATA
WHERE
_row = 1

SELECT *
FROM employees
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) = 1
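
As a quick check, the ROW_NUMBER approach above can be run against the sample rows from the question. This is only a sketch: SQLite (3.25 or later, via Python's sqlite3) stands in for BigQuery, so QUALIFY and SELECT * EXCEPT are replaced by an outer WHERE and an explicit column list, and the function name latest_per_id is made up for the example.

```python
import sqlite3

# Sample rows copied from the question's employees table.
ROWS = [
    (1, "James", "IT", "2019-05-21 12:13:14"),
    (1, "James", "IT", "2019-05-21 12:14:14"),
    (1, "James", "IT", "2019-05-21 12:18:14"),
    (2, "Pam", "HR", "2019-05-26 13:18:14"),
    (2, "Pam", "HR", "2019-05-26 14:18:14"),
    (3, "David", "IT", "2019-06-22 14:18:14"),
    (3, "David", "IT", "2019-06-23 12:18:14"),
]


def latest_per_id(rows):
    """Keep the row with the latest updated_at for each id (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    result = conn.execute("""
        SELECT id, name, department, updated_at
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS _row
            FROM employees
        )
        WHERE _row = 1      -- row 1 is the newest because of ORDER BY ... DESC
        ORDER BY id
    """).fetchall()
    conn.close()
    return result


for row in latest_per_id(ROWS):
    print(row)
```

Without the ORDER BY inside OVER, which row gets number 1 in each partition is arbitrary, which is why the original query did not reliably keep the latest timestamp.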

Related

Count distinct over partition by

I am trying to do a distinct count of names partitioned over their roles. So, in the example below: I have a table with the names and the person's role.
I would like a role count column that gives the total number of distinct people in that role. For example, the role manager comes up four times but there are only 3 distinct people for that role - Sam comes up again on a different date.
If I remove the date column, it works fine using:
select
a.date,
a.Name,
a.Role,
count(a.Role) over (partition by a.Role) as Role_Count
from table a
group by a.date, a.name, a.role
Including the date column then makes it count the total roles rather than by distinct name (which I know I haven't identified in the partition). Giving 4 managers and 3 analysts.
How do I fix this?
Desired output:
Date Name Role Role_Count
01/01 Sam Manager 3
02/01 Sam Manager 3
01/01 John Manager 3
01/01 Dan Manager 3
01/01 Bob Analyst 2
02/01 Bob Analyst 2
01/01 Mike Analyst 2
Current output:
Date Name Role Role_Count
01/01 Sam Manager 4
02/01 Sam Manager 4
01/01 John Manager 4
01/01 Dan Manager 4
01/01 Bob Analyst 3
02/01 Bob Analyst 3
01/01 Mike Analyst 3

Unfortunately, SQL Server (like some other databases) doesn't support COUNT(DISTINCT) as a window function. Fortunately, there is a simple trick to work around this: the sum of two DENSE_RANK()s, one ascending and one descending, minus one:
select a.Name, a.Role,
(dense_rank() over (partition by a.Role order by a.Name asc) +
dense_rank() over (partition by a.Role order by a.Name desc) -
1
) as distinct_names_in_role
from table a
group by a.name, a.role
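
To see why the trick works: if a name sits at position k out of n distinct names in ascending order, it sits at position n - k + 1 in descending order, so the two ranks minus one always add up to n. Below is a small sketch that runs it on the question's rows, keeping the date column. SQLite stands in for SQL Server here, and the table name staff, the column name work_date and the function name role_counts are made up for the example.

```python
import sqlite3

# Rows from the question: (date, name, role).
ROLES = [
    ("01/01", "Sam", "Manager"),
    ("02/01", "Sam", "Manager"),
    ("01/01", "John", "Manager"),
    ("01/01", "Dan", "Manager"),
    ("01/01", "Bob", "Analyst"),
    ("02/01", "Bob", "Analyst"),
    ("01/01", "Mike", "Analyst"),
]


def role_counts(rows):
    """Return (work_date, name, role, role_count) with a distinct-name count per role."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (work_date TEXT, name TEXT, role TEXT)")
    conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", rows)
    result = conn.execute("""
        SELECT work_date, name, role,
               DENSE_RANK() OVER (PARTITION BY role ORDER BY name ASC)
             + DENSE_RANK() OVER (PARTITION BY role ORDER BY name DESC)
             - 1 AS role_count
        FROM staff
    """).fetchall()
    conn.close()
    return result


for row in role_counts(ROLES):
    print(row)
```

Because DENSE_RANK gives tied names the same rank, the repeated Sam and Bob rows do not inflate the count, so no GROUP BY is needed to keep the date column.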

Unfortunately, COUNT(DISTINCT ...) is not available as a window aggregate. But we can use a combination of DENSE_RANK and MAX to simulate it:
select
a.date,
a.Name,
a.Role,
MAX(rnk) OVER (PARTITION BY Role) as Role_Count
from (
SELECT *,
-- partition by Role only, so the count spans all dates
DENSE_RANK() OVER (PARTITION BY Role ORDER BY Name) AS rnk
FROM table
) a
If Name may have nulls then we need to take that into account:
select
a.date,
a.Name,
a.Role,
MAX(CASE WHEN Name IS NOT NULL THEN rnk END) OVER (PARTITION BY Role) as Role_Count
from (
SELECT *,
-- rank NULL names in their own partition so they do not shift the other ranks
DENSE_RANK() OVER (PARTITION BY Role, CASE WHEN Name IS NULL THEN 0 ELSE 1 END ORDER BY Name) AS rnk
FROM table
) a
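
The effect of the NULL handling can be checked with a small sketch. It partitions by role only, which is what the desired output in the question calls for, and adds one hypothetical Manager row with no name. SQLite stands in for SQL Server (both sort NULLs first in ascending order); the table name staff and the function name distinct_name_counts are made up for the example.

```python
import sqlite3

# Hypothetical sample: the question's names plus one Manager row with a NULL name.
ROWS = [
    ("Sam", "Manager"), ("Sam", "Manager"), ("John", "Manager"),
    ("Dan", "Manager"), (None, "Manager"),
    ("Bob", "Analyst"), ("Bob", "Analyst"), ("Mike", "Analyst"),
]


def distinct_name_counts(rows):
    """Return {role: (naive_count, null_aware_count)}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (name TEXT, role TEXT)")
    conn.executemany("INSERT INTO staff VALUES (?, ?)", rows)
    result = conn.execute("""
        SELECT DISTINCT
            role,
            MAX(rnk_naive) OVER (PARTITION BY role) AS naive_count,
            MAX(CASE WHEN name IS NOT NULL THEN rnk END)
                OVER (PARTITION BY role) AS role_count
        FROM (
            SELECT *,
                -- plain ranking: a NULL name takes rank 1 and shifts the rest up
                DENSE_RANK() OVER (PARTITION BY role ORDER BY name) AS rnk_naive,
                -- NULL names are ranked in their own partition
                DENSE_RANK() OVER (
                    PARTITION BY role, CASE WHEN name IS NULL THEN 0 ELSE 1 END
                    ORDER BY name
                ) AS rnk
            FROM staff
        ) a
    """).fetchall()
    conn.close()
    return {role: (naive, aware) for role, naive, aware in result}


print(distinct_name_counts(ROWS))
```

With the NULL row present, the plain MAX(rnk) reports 4 managers, while the NULL-aware version reports 3, matching what COUNT(DISTINCT Name) would return since it ignores NULLs.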

Not able to get exact latest records with two columns having same value - in SQL Server

I am trying to get distinct records for a specific department from the table employee.
I have tried with this code in SQL Server, and I'm getting this error:
Error: employeeId is invalid in the select list because it is not contained in either aggregate function or the GROUP BY clause.
My code:
SELECT
name, department, MAX(jointime) LatestDate, employeeId
FROM
employee
WHERE
department = 'Mechanical'
GROUP BY
name
Records in DB:
name department joinTime EmployeeId
-----------------------------------------------------------
Erik Mechanical 2019-07-06 11:59:59 456
Tom Mechanical 2019-07-06 11:59:59 789
Erik Computer 2019-07-05 11:59:59 222
Erik Computer 2019-07-04 11:59:59 111
Erik Mechanical 2019-07-01 11:59:59 123
This is the result I want when the query is run for 'Mechanical': only the latest record for each name in that department should be fetched.
name department joinTime EmployeeId
-----------------------------------------------------------
Erik Mechanical 2019-07-06 11:59:59 456
Tom Mechanical 2019-07-06 11:59:59 789

Assuming the key is [Name] and not [EmployeeId]
One option is the WITH TIES clause, and thus no need for aggregation
Example
Select Top 1 with ties *
From employee
Where department='Mechanical'
Order By Row_Number() over (Partition By [Name] order by joinTime Desc)
Returns
name department joinTime EmployeeId
Erik Mechanical 2019-07-06 11:59:59.000 456
Tom Mechanical 2019-07-06 11:59:59.000 789

You can use NOT EXISTS:
SELECT e.*
FROM employee e
WHERE e.department='Mechanical'
AND NOT EXISTS (
SELECT 1 FROM employee
WHERE department = e.department
AND name = e.name AND joinTime > e.joinTime
)
Results:
name department joinTime EmployeeId
Erik Mechanical 2019-07-06 11:59:59 456
Tom Mechanical 2019-07-06 11:59:59 789
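
For anyone who wants to try the NOT EXISTS query without a SQL Server instance, here is a minimal sketch that runs it on the question's data. SQLite stands in for SQL Server, and the function name latest_in_department is made up for the example.

```python
import sqlite3

# Records from the question: (name, department, joinTime, EmployeeId).
EMPLOYEES = [
    ("Erik", "Mechanical", "2019-07-06 11:59:59", 456),
    ("Tom", "Mechanical", "2019-07-06 11:59:59", 789),
    ("Erik", "Computer", "2019-07-05 11:59:59", 222),
    ("Erik", "Computer", "2019-07-04 11:59:59", 111),
    ("Erik", "Mechanical", "2019-07-01 11:59:59", 123),
]


def latest_in_department(rows, department):
    """Rows of one department for which no newer row exists for the same name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (name TEXT, department TEXT, joinTime TEXT, EmployeeId INTEGER)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    result = conn.execute("""
        SELECT e.*
        FROM employee e
        WHERE e.department = ?
          AND NOT EXISTS (
              SELECT 1 FROM employee
              WHERE department = e.department
                AND name = e.name AND joinTime > e.joinTime
          )
        ORDER BY e.name
    """, (department,)).fetchall()
    conn.close()
    return result


for row in latest_in_department(EMPLOYEES, "Mechanical"):
    print(row)
```

Note that if two rows for the same name share the same latest joinTime, both are returned, since neither is strictly newer than the other.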

You can use ROW_NUMBER to mark the latest row for each employee, or CROSS APPLY to run a correlated subquery for each employee.
with q as
(
SELECT name, department, jointime, employeeId,
row_number() over (partition by name order by joinTime desc) rn
FROM employee where department='Mechanical'
)
select name, department, jointime, employeeId
from q
where rn = 1
or
with emp as
(
select distinct name from employee where department='Mechanical'
)
select e.*
from emp
cross apply
(
select top 1 *
from employee e2
where e2.name = emp.name and e2.department='Mechanical'
order by joinTime desc
) e

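Just add department and employeeId to the GROUP BY. This clears the error, but because employeeId is different on every row, each row becomes its own group, so Erik's older Mechanical record (123) is returned as well: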
SELECT name , department, MAX(jointime) LatestDate , employeeId
FROM employee where department='Mechanical'
GROUP BY name, department, employeeId

You need to use aggregate functions for the fields that appear in the SELECT list but not in the GROUP BY clause:
SELECT name
, MIN(department)
, MAX(jointime) LatestDate
, MIN(employeeId)
FROM employee where department='Mechanical'
GROUP BY name
SQL Server finds all the records named Tom or Erik, but it does not know which of the several values to pick for fields such as department or employeeId. By using aggregate functions, you tell SQL Server to return the MIN, MAX, SUM, or COUNT of those columns.
Or add those columns to the GROUP BY clause to get all unique rows:
SELECT name
, department
, jointime
, employeeId
FROM employee where department='Mechanical'
GROUP BY name
, department
, jointime
, employeeId
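
One thing worth checking with the aggregate-function version of this answer: each aggregate is evaluated on its own, so the output row does not have to match any single stored row. The sketch below runs that query on the question's data, with SQLite standing in for SQL Server; the function name aggregated_per_name is made up for the example.

```python
import sqlite3

# Records from the question: (name, department, joinTime, EmployeeId).
EMPLOYEES = [
    ("Erik", "Mechanical", "2019-07-06 11:59:59", 456),
    ("Tom", "Mechanical", "2019-07-06 11:59:59", 789),
    ("Erik", "Computer", "2019-07-05 11:59:59", 222),
    ("Erik", "Computer", "2019-07-04 11:59:59", 111),
    ("Erik", "Mechanical", "2019-07-01 11:59:59", 123),
]


def aggregated_per_name(rows):
    """One row per name, with every other column aggregated independently."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (name TEXT, department TEXT, joinTime TEXT, EmployeeId INTEGER)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    result = conn.execute("""
        SELECT name, MIN(department), MAX(joinTime), MIN(EmployeeId)
        FROM employee
        WHERE department = 'Mechanical'
        GROUP BY name
        ORDER BY name
    """).fetchall()
    conn.close()
    return result


for row in aggregated_per_name(EMPLOYEES):
    print(row)
```

For Erik this pairs the latest joinTime (2019-07-06) with the smallest EmployeeId (123), which belongs to the older record. If the EmployeeId of the latest row is needed, the ROW_NUMBER or NOT EXISTS approaches keep whole rows together.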

Get employee with on-off-on weekend work pattern

I have an employee table which has columns like
employee_ID, punch_in_date, punch_out_date.
Now I need to find the employees who have worked an on-off-on weekend pattern.
That is, if an employee worked in week 1, then he/she must not have worked in week 2 and must have worked in week 3.
Week 1, week 2, and week 3 are consecutive weekends.
I tried using the lag function of sql.
SELECT employee_id,
punch_in_date,
Lag(punch_in_date) OVER(partition BY employee_id ORDER BY employee_id) AS week_lag,
Datediff(day,Lag(punch_in_date) OVER(partition BY employee_id ORDER BY employee_id) ,punch_in_date) AS days
FROM employee
WHERE Datediff(day,Lag(punch_in_date) OVER(partition BY employee_id ORDER BY employee_id) ,punch_in_date)>= 14
AND datediff(day, punch_in_date, 'Today's date') <= 90 /*This means the data must falls under 3 months duration*/;
But I am getting an error like
SQL Error [4108] [S0001]: Windowed functions can only appear in the
SELECT or ORDER BY clauses.
How can I get the required result?
sample data:
employee_ID |punch_in_date |punch_out_date |
------------|--------------|---------------|
2 |2015-12-05 |2015-12-05 |
2 |2015-12-12 |2015-12-12 |
2 |2015-12-19 |2015-12-19 |
2 |2016-01-02 |2016-01-02 |
2 |2016-01-23 |2016-01-24 |
2 |2016-01-24 |2016-01-25 |
2 |2016-01-30 |2016-01-30 |
2 |2016-02-06 |2016-02-06 |
2 |2016-02-06 |2016-02-06 |
2 |2016-02-06 |2016-02-07 |
2 |2016-02-13 |2016-02-14 |
2 |2016-02-27 |2016-02-28 |
2 |2016-03-12 |2016-03-13 |

I suspect you want:
select employee_id, punch_in_date, week_lag,
datediff(day, week_lag, punch_in_date) AS days
from (select e.*,
lag(punch_in_date) over (partition by employee_id order by punch_in_date) as week_lag
from employee e
) e
where datediff(day, week_lag, punch_in_date) >= 14 and
datediff(day, punch_in_date, getdate()) <= 90 ;
When using window functions, be very careful about filtering in the WHERE clause. A WHERE at the same query level is applied before the window function is evaluated, so you might miss some rows that you want.
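
The point about WHERE filtering can be seen on the question's sample data. The sketch below compares a date filter placed next to the LAG call with the same filter placed in the outer query. It makes several assumptions: SQLite stands in for SQL Server, julianday arithmetic replaces DATEDIFF, a fixed cutoff of 2016-01-01 replaces the 90-day GETDATE() window, and the names FILTER_INSIDE, FILTER_OUTSIDE and run are made up for the example.

```python
import sqlite3

# punch_in_date values for employee 2, copied from the question's sample data.
PUNCH_IN_DATES = [
    "2015-12-05", "2015-12-12", "2015-12-19", "2016-01-02", "2016-01-23",
    "2016-01-24", "2016-01-30", "2016-02-06", "2016-02-06", "2016-02-06",
    "2016-02-13", "2016-02-27", "2016-03-12",
]

# Date filter in the same query level as LAG: rows are removed before LAG runs.
FILTER_INSIDE = """
    SELECT punch_in_date
    FROM (
        SELECT punch_in_date,
               LAG(punch_in_date) OVER (PARTITION BY employee_id ORDER BY punch_in_date) AS week_lag
        FROM employee
        WHERE punch_in_date >= '2016-01-01'
    ) e
    WHERE CAST(julianday(punch_in_date) - julianday(week_lag) AS INTEGER) >= 14
    ORDER BY punch_in_date
"""

# Date filter in the outer query: LAG sees every row first.
FILTER_OUTSIDE = """
    SELECT punch_in_date
    FROM (
        SELECT punch_in_date,
               LAG(punch_in_date) OVER (PARTITION BY employee_id ORDER BY punch_in_date) AS week_lag
        FROM employee
    ) e
    WHERE CAST(julianday(punch_in_date) - julianday(week_lag) AS INTEGER) >= 14
      AND punch_in_date >= '2016-01-01'
    ORDER BY punch_in_date
"""


def run(sql):
    """Load the sample rows and return the punch_in_date values the query keeps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (employee_id INTEGER, punch_in_date TEXT)")
    conn.executemany("INSERT INTO employee VALUES (2, ?)", [(d,) for d in PUNCH_IN_DATES])
    result = [row[0] for row in conn.execute(sql)]
    conn.close()
    return result


print(run(FILTER_INSIDE))
print(run(FILTER_OUTSIDE))
```

With the filter inside, 2016-01-02 loses its predecessor (2015-12-19), its lag becomes NULL, and the 14-day gap before it goes undetected. With the filter outside, that row is kept.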

As the error message states, windowed functions are only allowed in the SELECT and ORDER BY clauses.
What you can do is wrap your query in a subquery:
Select Employee_id,punch_in_date, week_lag,[days] FROM(
SELECT employee_id,
punch_in_date,
Lag(punch_in_date) OVER(partition BY employee_id ORDER BY punch_in_date)
AS week_lag,
Datediff(day,Lag(punch_in_date) OVER(partition BY employee_id ORDER BY
punch_in_date) ,punch_in_date) AS [days]
FROM employee
where punch_in_date >= dateadd(day,-90,getdate())
) q
WHERE [days]>= 14

Display the latest modified record for each employee

The emp table looks like this:
id Name Date Modified
1 Ram 2017-01-05
2 Kishore 2017-02-04
3 John 2017-04-22
1 Ram K 2017-04-25
1 Ram Kumar 2017-05-01
2 Kishore Babu 2017-05-05
3 John B 2017-06-01

Assuming you're using a reasonable rdbms that supports window functions, row_number should do the trick:
SELECT id, name, date_modified
FROM (SELECT id, name, date_modified,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_modified DESC) rn
FROM emp) t
WHERE rn = 1

How to refine last but one?

I have the following table. I need to get the last but one event associate for each event.
event_id event_date event_associate
1 2/14/2014 ben
1 2/15/2014 ben
1 2/16/2014 steve
1 2/17/2014 steve // this associate is the last but one for event 1
1 2/18/2014 paul
2 2/19/2014 paul
2 2/20/2014 paul // this associate is the last but one for event 2
2 2/21/2014 ben
3 2/22/2014 paul
3 2/23/2014 paul
3 2/24/2014 ben
3 2/25/2014 steve // this associate is the last but one for event 3
3 2/26/2014 ben
I need to find out who was the last but one event_associate for each event. The result should be:
event_id event_associate
1 steve
2 paul
3 steve
I know in order to do this I need to maximize event_date and exclude the last event_associate
So I tried
SELECT event_id , event_associate
WHERE NOT EXISTS (
SELECT *
FROM mytable
WHERE event_date = MAX(event_date)
)
QUALIFY ROW_NUMBER() OVER ( PARTITION BY event_id ORDER BY event_date DESC) = 1
But I do not know how to use EXISTS in this case.

You are quite close, you just need the 2nd row based on ROW_NUMBER:
select t.*,
row_number()
over (partition by event_id
order by event_date desc) as rn
from tab as t
qualify
row_number()
over (partition by event_id
order by event_date desc) = 2
-- or simply
-- qualify rn = 2
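
A runnable sketch of the same idea on the question's data follows. It makes a few assumptions: SQLite stands in for the original database and has no QUALIFY, so the rn = 2 filter moves to an outer WHERE; the dates are rewritten in ISO format so that they sort correctly as text; and the function name second_to_last is made up for the example.

```python
import sqlite3

# Rows from the question, with dates rewritten as YYYY-MM-DD.
EVENTS = [
    (1, "2014-02-14", "ben"), (1, "2014-02-15", "ben"), (1, "2014-02-16", "steve"),
    (1, "2014-02-17", "steve"), (1, "2014-02-18", "paul"),
    (2, "2014-02-19", "paul"), (2, "2014-02-20", "paul"), (2, "2014-02-21", "ben"),
    (3, "2014-02-22", "paul"), (3, "2014-02-23", "paul"), (3, "2014-02-24", "ben"),
    (3, "2014-02-25", "steve"), (3, "2014-02-26", "ben"),
]


def second_to_last(rows):
    """Return (event_id, event_associate) for the second-newest row of each event."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tab (event_id INTEGER, event_date TEXT, event_associate TEXT)")
    conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", rows)
    result = conn.execute("""
        SELECT event_id, event_associate
        FROM (
            SELECT t.*,
                   ROW_NUMBER() OVER (PARTITION BY event_id
                                      ORDER BY event_date DESC) AS rn
            FROM tab AS t
        )
        WHERE rn = 2        -- rn = 1 is the last row, rn = 2 the last but one
        ORDER BY event_id
    """).fetchall()
    conn.close()
    return result


print(second_to_last(EVENTS))
```

An event with only one row has no rn = 2 and simply drops out of the result.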
