applying window function to big data set (how to optimize?) - hive

I have to do some data analysis on a table with 400+ million rows. I got this to work on a small sample but I'm sure it will run out of memory in production.
The table structure is like this (for millions of serial numbers):
+------------+---------------+------------+----------+
| date | serial_number | status_1 | status_2 |
+------------+---------------+------------+----------+
| 10/1/2018 | 123 | warehouse | v |
| 10/10/2018 | 123 | warehouse | w |
| 10/20/2018 | 123 | warehouse | x |
| 11/2/2018 | 123 | in transit | y |
+------------+---------------+------------+----------+
I need to get the dates where status_1 = 'in transit' currently and status_2 = 'x' on a previous date. That should look like this:
+-----------+---------------+------------+----------+------------+
| date_1 | serial_number | status_1 | status_2 | date_2 |
+-----------+---------------+------------+----------+------------+
| 11/2/2018 | 123 | in transit | x | 10/20/2018 |
+-----------+---------------+------------+----------+------------+
I got it using two rank functions, but this will probably choke on a big table.
with transit as (
select
*
from (
select *,
rank() over(partition by serial_number order by date desc) rnk
from sample_t
order by serial_number, date asc
)
where rnk=1 and status_1 = 'in transit'
),
x_type as (
select
*
from (
select *,
rank() over(partition by serial_number order by date desc) rnk
from sample_t
order by serial_number, date asc
)
where rnk>1 and status_2 = 'x'
)
select tr.date date_1,
tr.serial_number,
tr.status_1,
x.status_2,
x.date date_2
from transit tr left join x_type x on tr.serial_number = x.serial_number
I can't see how to do this with one rank function. Is there a better, more efficient way?

You can use lag to do this. Note that lag only looks at the immediately preceding row for each serial number.
select *
from (select t.*
,lag(status_2) over(partition by serial_number order by date) as prev_status_2
,lag(date) over(partition by serial_number order by date) as prev_date
from sample_t t
) t
where status_1 = 'in transit' and prev_status_2 = 'x'
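The lag approach can be checked on the sample rows from the question. This is a minimal sketch, not a Hive run: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in engine, and it stores the dates in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table sample_t (date text, serial_number int, status_1 text, status_2 text)"
)
# Sample rows from the question; dates rewritten as ISO strings (assumption)
conn.executemany(
    "insert into sample_t values (?, ?, ?, ?)",
    [
        ("2018-10-01", 123, "warehouse", "v"),
        ("2018-10-10", 123, "warehouse", "w"),
        ("2018-10-20", 123, "warehouse", "x"),
        ("2018-11-02", 123, "in transit", "y"),
    ],
)

query = """
select date as date_1, serial_number, status_1,
       prev_status_2 as status_2, prev_date as date_2
from (select t.*,
             lag(status_2) over (partition by serial_number order by date) as prev_status_2,
             lag(date)     over (partition by serial_number order by date) as prev_date
      from sample_t t) t
where status_1 = 'in transit' and prev_status_2 = 'x'
"""
lag_rows = conn.execute(query).fetchall()
print(lag_rows)
```

Each serial number is scanned once in a single window pass, instead of ranking the table twice and joining the results. If the 'x' row may be more than one step before the 'in transit' row, lag alone will not find it.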

Related

SQL - get rid of the nested aggregate select

There is a table Payment, which for example tracks the amount of money user puts into account, simplified as
===================================
Id | UserId | Amount | PayDate |
===================================
1 | 42 | 11 | 01.02.99 |
2 | 42 | 31 | 05.06.99 |
3 | 42 | 21 | 04.11.99 |
4 | 24 | 12 | 05.11.99 |
What I need is a table that shows the balance before each payment, e.g.:
===============================================
Id | UserId | Amount | PayDate | Balance |
===============================================
1 | 42 | 11 | 01.02.99 | 0 |
2 | 42 | 31 | 05.06.99 | 11 |
3 | 42 | 21 | 04.11.99 | 42 |
4 | 24 | 12 | 05.11.99 | 0 |
Currently the select statement looks something like
SELECT
Id,
UserId,
Amount,
PayDate,
(SELECT sum(amount) FROM Payments nestedp
WHERE nestedp.UserId = outerp.UserId AND
nestedp.PayDate < outerp.PayDate) as Balance
FROM
Payments outerp
How can I rewrite this select to get rid of the nested aggregate selection? The database in question is SQL Server 2019.
You can use a CTE that computes a running total per user and then joins each row to the row before it.
WITH PaymentCte
AS (
SELECT ROW_NUMBER() OVER (
PARTITION BY UserId ORDER BY Id
) AS RowId
,Id
,UserId
,PayDate
,Amount
,SUM(Amount) OVER (
PARTITION BY UserId ORDER BY Id
) AS Balance
FROM Payment
)
SELECT X.Id
,X.UserId
,X.Amount
,X.PayDate
,Y.Balance
FROM PaymentCte x
INNER JOIN PaymentCte y ON x.userId = y.UserId
AND X.RowId = Y.RowId + 1
UNION
SELECT X.Id
,X.UserId
,X.Amount
,X.PayDate
,0 AS Balance
FROM PaymentCte x
WHERE X.RowId = 1
This provides the desired output.
You can try the following using lag with a cumulative sum
with b as (
select * , isnull(lag(amount) over (partition by userid order by id),0) Amt
from t
)
select Id, UserId, Amount, PayDate,
Sum(Amt) over (partition by userid order by id) Balance
from b
order by Id
Thanks to other participants' leads I came up with a query that seems to work:
SELECT
Id,
UserId,
Amount,
PayDate,
COALESCE(sum(Amount) over (partition by UserId
order by PayDate
rows between unbounded preceding and 1 preceding), 0) as Balance
FROM
Payments
ORDER BY
UserId, PayDate
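The window-frame version can be tried against the sample payments. This is a sketch under assumptions: it uses Python's sqlite3 module (SQLite 3.25+) rather than SQL Server, and it reads the sample dates as DD.MM.YY and stores them as ISO strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Payments (Id int, UserId int, Amount int, PayDate text)")
# Sample rows from the question; dates converted to ISO strings (assumption)
conn.executemany(
    "insert into Payments values (?, ?, ?, ?)",
    [
        (1, 42, 11, "1999-02-01"),
        (2, 42, 31, "1999-06-05"),
        (3, 42, 21, "1999-11-04"),
        (4, 24, 12, "1999-11-05"),
    ],
)

query = """
select Id,
       coalesce(sum(Amount) over (partition by UserId
                                  order by PayDate
                                  rows between unbounded preceding and 1 preceding), 0) as Balance
from Payments
order by Id
"""
# Map each payment Id to the balance that existed before that payment
balances = dict(conn.execute(query).fetchall())
print(balances)
```

The frame ends at "1 preceding", so the current payment is excluded from its own sum. The first payment of each user has an empty frame, sum returns NULL, and coalesce turns that into 0.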

PostgreSQL: Filter select query by comparing against other rows

Suppose I have a table of Events that lists a userId and the time the Event occurred:
+----+--------+----------------------------+
| id | userId | time |
+----+--------+----------------------------+
| 1 | 46 | 2020-07-22 11:22:55.307+00 |
| 2 | 190 | 2020-07-13 20:57:07.138+00 |
| 3 | 17 | 2020-07-11 11:33:21.919+00 |
| 4 | 46 | 2020-07-22 10:17:11.104+00 |
| 5 | 97 | 2020-07-13 20:57:07.138+00 |
| 6 | 17 | 2020-07-04 11:33:21.919+00 |
| 6 | 17 | 2020-07-11 09:23:21.919+00 |
+----+--------+----------------------------+
I want to get the list of events that had a previous event on the same day, by the same user. The result for the above table would be:
+----+--------+----------------------------+
| id | userId | time |
+----+--------+----------------------------+
| 1 | 46 | 2020-07-22 11:22:55.307+00 |
| 3 | 17 | 2020-07-11 11:33:21.919+00 |
+----+--------+----------------------------+
How can I perform a select query that filters results by evaluating them against other rows in the table?
This can be done using an EXISTS condition:
select t1.*
from the_table t1
where exists (select *
from the_table t2
where t2.userid = t1.userid -- for the same user
and t2.time::date = t1.time::date -- on the same day
and t2.time < t1.time); -- but previously on that day
You can use lag():
select t.*
from (select t.*,
lag(time) over (partition by userid, time::date order by time) as prev_time
from t
) t
where prev_time is not null;
Or row_number():
select t.*
from (select t.*,
row_number() over (partition by userid, time::date order by time) as seqnum
from t
) t
where seqnum >= 2;
You can use LAG() to find the previous row for a user. Then a simple comparison will tell whether it occurred on the same day or not.
For example:
select *
from (
select
*,
lag(time) over(partition by userId order by time) as prev_time
from t
) x
where time::date = prev_time::date
You can use the ROW_NUMBER() analytic function:
SELECT id , userId , time
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY UserId, date_trunc('day',time) ORDER BY time) AS rn,
t.*
FROM Events t
) q
WHERE rn > 1
This returns every event that was preceded by another event from the same user on the same day.
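The "partition by user and day" idea can be checked on the sample events. This is a rough sketch with several assumptions: SQLite (through Python's sqlite3, version 3.25+) replaces PostgreSQL, so date(time) stands in for time::date. The "+00" offsets are dropped from the timestamps, and the second row with id 6 is renumbered to 7 so the ids are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Events (id int, userId int, time text)")
# Sample rows from the question, offsets removed and the duplicate id renumbered (assumptions)
conn.executemany(
    "insert into Events values (?, ?, ?)",
    [
        (1, 46, "2020-07-22 11:22:55.307"),
        (2, 190, "2020-07-13 20:57:07.138"),
        (3, 17, "2020-07-11 11:33:21.919"),
        (4, 46, "2020-07-22 10:17:11.104"),
        (5, 97, "2020-07-13 20:57:07.138"),
        (6, 17, "2020-07-04 11:33:21.919"),
        (7, 17, "2020-07-11 09:23:21.919"),
    ],
)

query = """
select id
from (select e.*,
             lag(time) over (partition by userId, date(time) order by time) as prev_time
      from Events e) x
where prev_time is not null
order by id
"""
# Ids of events that had an earlier event by the same user on the same day
event_ids = [row[0] for row in conn.execute(query)]
print(event_ids)
```

Because the day is part of the partition, a non-null prev_time already means "same user, same day, earlier", so no extra date comparison is needed.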

SELECT based on multiple fields in MS-SQL

I have a table with 4 columns:
AcctNumb | PeriodEndingDate | WaterConsumption | ReadingType
There are multiple records for each AcctNumb, with the date that each record was recorded.
What I want to do is grab the most recent date, consumption reading, and reading type for each account.
I have tried using MAX(PeriodEndingDate) and GROUP BY AcctNumb, but I would need to aggregate all the other values, and none of the aggregate functions help me for the WaterConsumption, etc.
Can anyone point me in the right direction?
Thanks
EDIT
Here is a sample table
+----------+------------------+------------------+-------------+
| AcctNumb | PeriodEndingDate | WaterConsumption | ReadingType |
+----------+------------------+------------------+-------------+
| 1000 | 2018-03-31 | 122230 | A |
| 1001 | 2018-03-31 | 24850 | A |
| 1002 | 2018-03-31 | 88540 | A |
| 1000 | 2017-12-31 | 123800 | A |
| 1001 | 2017-12-31 | 3000 | E |
+----------+------------------+------------------+-------------+
The ReadingType is whether it's an actual (A) reading, or an estimate (E).
Try this
SELECT
AcctNumb,
PeriodEndingDate,
WaterConsumption,
ReadingType
FROM (SELECT
AcctNumb,
PeriodEndingDate,
WaterConsumption,
ReadingType,
ROW_NUMBER() OVER (PARTITION BY AcctNumb ORDER BY PeriodEndingDate DESC) AS MostrecentRecord
FROM <TableName>) dt
WHERE MostrecentRecord= 1
This can be done using ROW_NUMBER. It has been asked and answered thousands of times, but the query is easier to write than to find a duplicate.
select *
from
(
select *
, RowNum = ROW_NUMBER() over(partition by AcctNumb order by PeriodEndingDate desc)
from YourTable
) x
where x.RowNum = 1
SELECT DQ.* FROM
(SELECT *,
Row_Number() OVER (PARTITION BY AcctNumb ORDER BY PeriodEndingDate DESC) AS RN
FROM YourTable
) AS DQ
WHERE DQ.RN = 1
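The ROW_NUMBER pattern can be run against the sample table to confirm that it picks the latest reading per account. This is a sketch only: it uses Python's sqlite3 module (SQLite 3.25+) instead of MS-SQL, and the table name Readings is a hypothetical placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Readings" is a hypothetical table name; the question does not give one
conn.execute(
    "create table Readings (AcctNumb int, PeriodEndingDate text, "
    "WaterConsumption int, ReadingType text)"
)
conn.executemany(
    "insert into Readings values (?, ?, ?, ?)",
    [
        (1000, "2018-03-31", 122230, "A"),
        (1001, "2018-03-31", 24850, "A"),
        (1002, "2018-03-31", 88540, "A"),
        (1000, "2017-12-31", 123800, "A"),
        (1001, "2017-12-31", 3000, "E"),
    ],
)

query = """
select AcctNumb, PeriodEndingDate, WaterConsumption, ReadingType
from (select r.*,
             row_number() over (partition by AcctNumb
                                order by PeriodEndingDate desc) as rn
      from Readings r) dt
where rn = 1
order by AcctNumb
"""
latest = conn.execute(query).fetchall()
print(latest)
```

The descending sort is what makes row number 1 the most recent record. With an ascending sort the same filter would return the oldest reading instead.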

select first and last record of each group horizontally

I have a table (sample not shown).
I want to select the first and last record of every group, by facility_id and created_at, horizontally (side by side in one row).
I can do it vertically, but I need the output laid out horizontally.
with CTE as (
select
*
,ROW_NUMBER() over (partition by facility_id,name order by created_at asc ) ascrnk
,ROW_NUMBER() over (partition by facility_id,name order by created_at desc ) desrnk
from TestTable
)
select T1.facility_id,T1.name,
T1.value as "First_value",
T1.created_at as "First created_at",
T2.value as "Last_value",
T2.created_at as "Last created_at"
from (
select * from cte
where ascrnk = 1
) T1
left join (
select * from cte
where desrnk = 1
) T2 on T1.facility_id = T2.facility_id and T1.name = T2.name
Result:
| facility_id | name | First_value | First created_at | Last_value | Last created_at |
|-------------|------|-------------|----------------------|------------|----------------------|
| 2011 | A | 200 | 2015-05-30T11:50:17Z | 300 | 2017-05-30T11:50:17Z |
| 2012 | B | 124 | 2015-05-30T11:50:17Z | 195 | 2017-05-30T11:50:17Z |
| 2013 | C | 231 | 2015-05-30T11:50:17Z | 275 | 2017-06-30T11:50:17Z |
| 2014 | D | 279 | 2017-06-30T11:50:17Z | 263 | 2018-06-30T11:50:17Z |
I think this is much simpler using window functions and select distinct:
select distinct facility_id, name,
first_value(value) over (partition by facility_id, name order by created_at asc) as first_value,
min(created_at) over (partition by facility_id, name) as first_created_at,
first_value(value) over (partition by facility_id, name order by created_at desc) as last_value,
max(created_at) over (partition by facility_id, name) as last_created_at
from t;
No subqueries. No joins.
You can also use arrays to accomplish the same functionality, using group by. It is a shame that first_value() is available only as a window function and not directly as an aggregate function.
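The select distinct variant can be sketched as follows. The source table is not shown in the thread, so the rows below are hypothetical, chosen only to match the first two lines of the result table. The engine is SQLite via Python's sqlite3 (3.25+), and the aliases first_val and last_val are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table TestTable (facility_id int, name text, value int, created_at text)"
)
# Hypothetical rows; the original sample table is missing from the thread
conn.executemany(
    "insert into TestTable values (?, ?, ?, ?)",
    [
        (2011, "A", 200, "2015-05-30"),
        (2011, "A", 250, "2016-05-30"),
        (2011, "A", 300, "2017-05-30"),
        (2012, "B", 124, "2015-05-30"),
        (2012, "B", 195, "2017-05-30"),
    ],
)

query = """
select distinct facility_id, name,
       first_value(value) over (partition by facility_id, name
                                order by created_at asc)  as first_val,
       min(created_at)    over (partition by facility_id, name) as first_created_at,
       first_value(value) over (partition by facility_id, name
                                order by created_at desc) as last_val,
       max(created_at)    over (partition by facility_id, name) as last_created_at
from TestTable
order by facility_id
"""
first_last = conn.execute(query).fetchall()
print(first_last)
```

Every row in a group computes the same six values, so distinct collapses each group to one row. Sorting descending and taking first_value avoids having to spell out a full frame for last_value.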

Partition By over Two Columns in Row_Number function

I am trying to RANK the records using the following query:
SELECT
ROW_NUMBER() over (partition by
TW.EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate
order by TW.EMPL_ID,TW.Effective_Bdate) RN,
TW.EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate,Effective_BDate from
TT_EMPLOYEE_WORKDAY TW
where TW.HR_DOMAIN_CODE = 'SGP'
However the resultant Row_Number computed column only displays partition for the first column. Ideally I expected to have the same value for Row_Number where the partition by column data is identical.
Any clue where I might be going wrong?
Using RANK or DENSE_RANK isn't an option, as I want to identify all such rows for multiple employees where EMPL_ID, HR_DEPT_ID and Transfer_StartDate are the same (RN=1)
Sample data:
RN AON_EMPL_ID HR_DEPT_ID Transfer_Startdate Effective_BDate
1 0100690 69895 01/01/2017 2017-01-01
2 0100690 69895 01/01/2017 2017-01-03
3 0100690 69895 01/01/2017 2017-01-04
expanding sample data to:
create table t (
aon_empl_id varchar(16)
, hr_dept_id varchar(16)
, Transfer_Startdate date
, Effective_bdate date
);
insert into t values
('0100690','69895','01/01/2017','2017-01-01')
,('0100690','69895','01/01/2017','2017-01-03')
,('0100690','69895','01/01/2017','2017-01-04')
,('0200700','69895','01/01/2016','2016-01-01')
,('0200700','69895','01/01/2016','2016-01-03')
,('0200700','69896','01/01/2017','2017-01-04')
,('0200700','69896','01/01/2017','2017-01-04');
using top with ties
select top 1 with ties
aon_empl_id
, hr_dept_id
, Transfer_Startdate = convert(char(10),Transfer_Startdate,120)
, Effective_bdate = convert(char(10),Effective_bdate,120)
from t
order by row_number() over (
partition by aon_empl_id, hr_dept_id, Transfer_Startdate
order by Effective_bdate
)
rextester demo: http://rextester.com/KOIZ42069
returns:
+-------------+------------+--------------------+-----------------+
| aon_empl_id | hr_dept_id | Transfer_Startdate | Effective_bdate |
+-------------+------------+--------------------+-----------------+
| 0100690 | 69895 | 2017-01-01 | 2017-01-01 |
| 0200700 | 69895 | 2016-01-01 | 2016-01-01 |
| 0200700 | 69896 | 2017-01-01 | 2017-01-04 |
+-------------+------------+--------------------+-----------------+
Alternative using a common table expression with row_number():
;with cte as (
select
rn = row_number() over (
partition by aon_empl_id, hr_dept_id, Transfer_Startdate
order by Effective_bdate
)
, aon_empl_id
, hr_dept_id
, Transfer_Startdate = convert(char(10),Transfer_Startdate,120)
, Effective_bdate = convert(char(10),Effective_bdate,120)
from t tw
)
select *
from cte
where rn = 1
returns:
+----+-------------+------------+--------------------+-----------------+
| rn | aon_empl_id | hr_dept_id | Transfer_Startdate | Effective_bdate |
+----+-------------+------------+--------------------+-----------------+
| 1 | 0100690 | 69895 | 2017-01-01 | 2017-01-01 |
| 1 | 0200700 | 69895 | 2016-01-01 | 2016-01-01 |
| 1 | 0200700 | 69896 | 2017-01-01 | 2017-01-04 |
+----+-------------+------------+--------------------+-----------------+
SELECT
RANK() over (partition by --or DENSE_RANK()
TW.EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate
order by TW.EMPL_ID,TW.Effective_Bdate) RN,
TW.EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate,Effective_BDate from
TT_EMPLOYEE_WORKDAY TW
where TW.HR_DOMAIN_CODE = 'SGP'
UPDATE
SELECT
RANK() over (partition by --or DENSE_RANK()
TW.EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate
order by TW.EMPL_ID) RN,
TW.EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate,Effective_BDate from
TT_EMPLOYEE_WORKDAY TW
where TW.HR_DOMAIN_CODE = 'SGP'
Order by RN,TW.Effective_Bdate
This bit of code appears to be working:
SELECT
dense_rank() over (partition by AON_EMPL_ID
order by AON_EMPL_ID,HR_DEPT_ID,Transfer_StartDate) RN,
TW.AON_EMPL_ID,TW.HR_DEPT_ID,TW.Transfer_Startdate,Effective_BDate from
TT_AON_EMPLOYEE_WORKDAY TW
where TW.HR_DOMAIN_CODE = 'SGP'
Apparently, I just need to partition by AON_EMPL_ID, and everything else should go in the ORDER BY clause.
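The difference between the two numbering schemes can be seen side by side on the expanded sample data. This is an illustrative sketch: it runs on SQLite through Python's sqlite3 (3.25+) rather than SQL Server, writes the transfer dates in ISO format, and uses the column aliases rn and grp as made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (aon_empl_id text, hr_dept_id text, "
    "Transfer_Startdate text, Effective_bdate text)"
)
# Expanded sample data from the answer above, dates in ISO format (assumption)
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        ("0100690", "69895", "2017-01-01", "2017-01-01"),
        ("0100690", "69895", "2017-01-01", "2017-01-03"),
        ("0100690", "69895", "2017-01-01", "2017-01-04"),
        ("0200700", "69895", "2016-01-01", "2016-01-01"),
        ("0200700", "69895", "2016-01-01", "2016-01-03"),
        ("0200700", "69896", "2017-01-01", "2017-01-04"),
        ("0200700", "69896", "2017-01-01", "2017-01-04"),
    ],
)

query = """
select row_number() over (partition by aon_empl_id, hr_dept_id, Transfer_Startdate
                          order by Effective_bdate) as rn,
       dense_rank() over (partition by aon_empl_id
                          order by hr_dept_id, Transfer_Startdate) as grp
from t
order by aon_empl_id, hr_dept_id, Transfer_Startdate, Effective_bdate, rn
"""
numbered = conn.execute(query).fetchall()
rn_values = [r[0] for r in numbered]   # position of the row inside its group
grp_values = [r[1] for r in numbered]  # number of the group within the employee
print(rn_values, grp_values)
```

row_number restarts at 1 inside every (employee, department, transfer date) group, so identical partition values never share a number. dense_rank partitioned by employee alone gives all rows with the same department and transfer date the same number, which is the behaviour the question expected.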