input dataset
Act status from to
123 1 2011-03-29 00:00:00 2011-03-29 23:59:59
123 1 2011-03-30 00:00:00 2011-03-30 23:59:59
123 1 2011-03-31 00:00:00 2011-03-31 23:59:59
123 0 2011-04-01 00:00:00 2011-04-03 23:59:59
123 0 2011-04-04 00:00:00 2011-04-04 23:59:59
123 0 2011-04-05 00:00:00 2011-04-05 23:59:59
123 1 2011-04-06 00:00:00 2011-04-06 23:59:59
123 1 2011-04-07 00:00:00 2011-04-07 23:59:59
123 1 2011-04-08 00:00:00 2011-04-10 23:59:59
I want the output to be:
act status from to
123 1 2011-03-29 00:00:00 2011-03-31 23:59:59
123 0 2011-04-01 00:00:00 2011-04-05 23:59:59
123 1 2011-04-06 00:00:00 2011-04-10 23:59:59
You can use the lag function to track the status change. After applying the lag function, you would use the results to build your rankings and use the rankings as your groupBy parameter. For example:
status | lag  | changed | rankings
-------+------+---------+---------
     1 | null |       1 |        1
     1 |    1 |       0 |        1
     0 |    1 |       1 |        2
     1 |    0 |       1 |        3
     0 |    1 |       1 |        4
     1 |    0 |       1 |        5
     1 |    1 |       0 |        5
where:
status : current status
lag : status from previous row
changed : 0 if status == lag, otherwise 1
rankings : cumulative sum of changed
Anyway, here's the answer to your question in SQL.
spark.sql('''
    SELECT Act, status, from, to
    FROM (
        -- collapse each run of identical status to its first "from" and last "to"
        SELECT Act, status, MIN(from) AS from, MAX(to) AS to, rankings
        FROM (
            -- running count of changes = group id for consecutive rows with the same status
            SELECT *, SUM(changed) OVER(ORDER BY from) AS rankings
            FROM (
                -- 1 on the first row and whenever the status differs from the previous row
                SELECT *, IF(LAG(status) OVER(PARTITION BY Act ORDER BY from) = status, 0, 1) AS changed
                FROM sample
            )
        )
        GROUP BY 1, 2, 5
    )
''').show()
output:
+---+------+-------------------+-------------------+
|Act|status| from| to|
+---+------+-------------------+-------------------+
|123| 1|2011-03-29 00:00:00|2011-03-31 23:59:59|
|123| 0|2011-04-01 00:00:00|2011-04-05 23:59:59|
|123| 1|2011-04-06 00:00:00|2011-04-10 23:59:59|
+---+------+-------------------+-------------------+
And, here's the pyspark version:
from pyspark.sql.functions import lag, sum, max, min, when
from pyspark.sql.window import Window

# status from the previous row, per Act, ordered by the "from" timestamp
df2 = df.withColumn("lag_tracker", lag("status", 1).over(Window.partitionBy("Act").orderBy("from")))
# 1 on the first row and whenever the status changed compared to the previous row, else 0
df2 = df2.withColumn("changed", when(df2.lag_tracker == df2.status, 0).otherwise(1))
# running count of changes = group id for consecutive rows with the same status
df2 = df2.withColumn("rankings", sum("changed").over(Window.orderBy("from")))
# collapse each group to its first "from" and last "to"
df2 = df2.groupBy("Act", "status", "rankings").agg(min("from").alias("from"), max("to").alias("to"))
df2.select("Act", "status", "from", "to").show()
output:
+---+------+-------------------+-------------------+
|Act|status| from| to|
+---+------+-------------------+-------------------+
|123| 1|2011-03-29 00:00:00|2011-03-31 23:59:59|
|123| 0|2011-04-01 00:00:00|2011-04-05 23:59:59|
|123| 1|2011-04-06 00:00:00|2011-04-10 23:59:59|
+---+------+-------------------+-------------------+
If you have no gaps in the dates, I would suggest using the difference of row numbers:
select act, status, min(from), max(to)
from (select t.*,
             row_number() over (partition by act order by from) as seqnum,
             row_number() over (partition by act, status order by from) as seqnum_2
      from t
     ) t
group by act, status, (seqnum - seqnum_2);
Why this works is a little tricky to explain. However, if you look at the results from the subquery, you will see that the difference between seqnum and seqnum_2 is constant on adjacent rows with the same status.
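For illustration, here is the subquery run on its own with the difference exposed as an extra column (just a sketch; table and column names follow the query above):
select t.*,
       row_number() over (partition by act order by from) as seqnum,
       row_number() over (partition by act, status order by from) as seqnum_2,
       row_number() over (partition by act order by from)
         - row_number() over (partition by act, status order by from) as diff
from t;
-- For the sample data the extra columns come out as:
--   status 1,1,1 : seqnum 1,2,3  seqnum_2 1,2,3  diff 0
--   status 0,0,0 : seqnum 4,5,6  seqnum_2 1,2,3  diff 3
--   status 1,1,1 : seqnum 7,8,9  seqnum_2 4,5,6  diff 3
-- so each island of consecutive equal statuses gets its own (status, diff) pair,
-- which is exactly what the outer GROUP BY relies on.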
Note: I would advise you to fix your data model so you don't miss the last second on each day. The to datetime of one row should be the same as the from datetime of the next row. When querying, you can use >= and < to get the values that the row matches.
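As a sketch of what a lookup would look like once the intervals are stored half-open (Spark SQL against the same sample table, as in the first answer; the timestamp is just a hypothetical example value, and the backticks are there because from and to are keywords):
SELECT *
FROM sample
WHERE TIMESTAMP '2011-04-02 10:30:00' >= `from`
  AND TIMESTAMP '2011-04-02 10:30:00' < `to`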
Here is my query. I have a column called cum_balance which is supposed to calculate the cumulative balance, but after row number 10 there is an anomaly and it doesn't work as expected; all I notice is that from row number 10 onwards the hour column has the same value. What's the right syntax for this?
select
    hour,
    symbol,
    amount_usd,
    category,
    sum(amount_usd) over (
        order by hour asc
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cum_balance
from
    combined_transfers_usd_netflow
order by
    hour
I have tried removing RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, adding a PARTITION BY hour, and adding a GROUP BY hour. None of them gave the expected result or an error.
Row Number | Hour                | SYMBOL | AMOUNT_USD   | CATEGORY | CUM_BALANCE
-----------+---------------------+--------+--------------+----------+-------------
         1 | 2021-12-02 23:00:00 | WETH   | 227.2795     | in       | 227.2795
         2 | 2021-12-03 00:00:00 | WETH   | -226.4801153 | out      | 0.7993847087
         3 | 2022-01-05 21:00:00 | WETH   | 5123.716203  | in       | 5124.515587
         4 | 2022-01-18 14:00:00 | WETH   | -4466.2366   | out      | 658.2789873
         5 | 2022-01-19 00:00:00 | WETH   | 2442.618599  | in       | 3100.897586
         6 | 2022-01-21 14:00:00 | USDC   | 99928.68644  | in       | 103029.584
         7 | 2022-03-01 16:00:00 | UNI    | 8545.36098   | in       | 111574.945
         8 | 2022-03-04 22:00:00 | USDC   | -2999.343    | out      | 108575.602
         9 | 2022-03-09 22:00:00 | USDC   | -5042.947675 | out      | 103532.6543
        10 | 2022-03-16 21:00:00 | USDC   | -4110.6579   | out      | 98594.35101
        11 | 2022-03-16 21:00:00 | UNI    | -3.209306045 | out      | 98594.35101
        12 | 2022-03-16 21:00:00 | UNI    | -16.04653022 | out      | 98594.35101
        13 | 2022-03-16 21:00:00 | UNI    | -16.04653022 | out      | 98594.35101
        14 | 2022-03-16 21:00:00 | UNI    | -16.04653022 | out      | 98594.35101
        15 | 2022-03-16 21:00:00 | UNI    | -6.418612089 | out      | 98594.35101
The "problem" with your data in all the ORDER BY values after row 10 are the same.
So if we shrink the data down a little, and use for groups to repeat the experiment:
with data(grp, date, val) as (
select * from values
(1,'2021-01-01'::date, 10),
(1,'2021-01-02'::date, 11),
(1,'2021-01-03'::date, 12),
(2,'2021-01-01'::date, 20),
(2,'2021-01-02'::date, 21),
(2,'2021-01-02'::date, 22),
(2,'2021-01-04'::date, 23)
)
select d.*
,sum(val) over ( partition by grp order by date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as cum_val_1
,sum(val) over ( partition by grp order by date ) as cum_val_2
from data as d
order by 1,2;
we get:
GRP | DATE       | VAL | CUM_VAL_1 | CUM_VAL_2
----+------------+-----+-----------+----------
  1 | 2021-01-01 |  10 |        10 |        10
  1 | 2021-01-02 |  11 |        21 |        21
  1 | 2021-01-03 |  12 |        33 |        33
  2 | 2021-01-01 |  20 |        20 |        20
  2 | 2021-01-02 |  21 |        63 |        63
  2 | 2021-01-02 |  22 |        63 |        63
  2 | 2021-01-04 |  23 |        86 |        86
We see with group 1 that the values accumulate as we expect. For group 2 we put in duplicate ORDER BY values and see that those rows get the same cumulative value, while the rows after them work as expected again.
This tells us how the function behaves across ties in the ORDER BY (values that sort the same): all tied rows are stepped in one leap.
Thus, if you want each row to be different, the ORDER BY needs extra distinctness. You could force this by adding random values, but random values can themselves collide, so you really should use ROW_NUMBER() or a sequence (SEQx) to get guaranteed-unique tie-breakers.
The second cumulative sum also shows the two forms are equal: the problem is the ORDER BY ties, not the framing of which rows are included.
with data(grp, date, val) as (
select * from values
(1,'2021-01-01'::date, 10),
(1,'2021-01-02'::date, 11),
(1,'2021-01-03'::date, 12),
(2,'2021-01-01'::date, 20),
(2,'2021-01-02'::date, 21),
(2,'2021-01-02'::date, 22),
(2,'2021-01-04'::date, 23)
)
select d.*
,seq8() as s
,sum(val) over ( partition by grp order by date ) as cum_val_1
,sum(val) over ( partition by grp order by date, s ) as cum_val_2
,sum(val) over ( partition by grp order by date, seq8() ) as cum_val_3
from data as d
order by 1,2;
gives:
GRP | DATE       | VAL | S | CUM_VAL_1 | CUM_VAL_2
----+------------+-----+---+-----------+----------
  1 | 2021-01-01 |  10 | 0 |        10 |        10
  1 | 2021-01-02 |  11 | 1 |        21 |        21
  1 | 2021-01-03 |  12 | 2 |        33 |        33
  2 | 2021-01-01 |  20 | 3 |        20 |        20
  2 | 2021-01-02 |  21 | 4 |        63 |        41
  2 | 2021-01-02 |  22 | 5 |        63 |        63
  2 | 2021-01-04 |  23 | 6 |        86 |        86
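Applied to the original query, the fix could look something like this (only a sketch; it assumes the same combined_transfers_usd_netflow table and uses ROW_NUMBER() purely as a unique tie-breaker, as suggested above; SEQ8() would work just as well):
with ordered as (
    select
        hour,
        symbol,
        amount_usd,
        category,
        row_number() over (order by hour) as rn  -- unique tie-breaker for rows sharing the same hour
    from combined_transfers_usd_netflow
)
select
    hour,
    symbol,
    amount_usd,
    category,
    sum(amount_usd) over (order by hour, rn) as cum_balance
from ordered
order by hour, rn;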
I have a table called BOOK (memberId, ISBN, dateBorrowed)
For example:
isbn          | memberId | borrowed
--------------+----------+-----------
9998-01-101-9 |          |
9998-01-101-9 |          |
9998-01-101-9 |          |
9998-01-101-9 | 1000     | 2018-10-02
9998-01-101-9 | 1010     | 2018-09-04
9998-01-101-9 | 1021     | 2018-09-14
9998-01-101-9 |          |
9998-01-101-9 | 1001     | 2018-10-02
I have to SELECT all dates where the total count of borrowed books per day is larger than the average over all days. How can I do that?
I have selected each date and how many times it was picked with:
SELECT borrowed, COUNT(*) AS dates
FROM BOOK
WHERE borrowed IS NOT NULL
GROUP BY borrowed;
Another query I have written counts the average:
SELECT SUM(dates)/COUNT(borrowed) AS average
FROM (
SELECT borrowed, COUNT(*) AS dates
FROM BOOKS
WHERE borrowed IS NOT NULL GROUP BY borrowed
) AS average;
Now, how can I combine these two queries into one clean query?
Window functions can help you a lot here: https://www.postgresql.org/docs/current/static/tutorial-window.html
demo: db<>fiddle
My test data:
isbn borrowed
9998-01-101-1 2018-08-01
9998-01-101-2 2018-08-01
9998-01-101-3 2018-08-01
9998-01-101-4 2018-08-01
9998-01-101-5 2018-08-01
9998-01-101-1 2018-08-02
9998-01-101-2 2018-08-02
9998-01-101-3 2018-08-02
9998-01-101-4 2018-08-03
9998-01-101-5 2018-08-03
9998-01-101-1 2018-08-04
9998-01-101-2 2018-08-04
9998-01-101-3 2018-08-04
9998-01-101-4 2018-08-04
9998-01-101-5 2018-08-05
9998-01-101-1 2018-08-05
The query:
SELECT
*
FROM (
SELECT
*,
borrowed_all_time::decimal / COUNT(*) OVER () as avg_borrows_per_day -- D
FROM (
SELECT DISTINCT -- C
borrowed,
COUNT(*) OVER (PARTITION BY borrowed) as borrowed_on_day, -- A
COUNT(*) OVER () as borrowed_all_time -- B
FROM book
)s
)s
WHERE borrowed_on_day > avg_borrows_per_day -- E
A: This window function counts the rows per borrowed date.
B: This window function counts all rows, which equals the total number of borrows of all time.
The result so far looks like this:
borrowed borrowed_on_day borrowed_all_time
2018-08-01 5 16
2018-08-01 5 16
2018-08-01 5 16
2018-08-01 5 16
2018-08-01 5 16
2018-08-02 3 16
2018-08-02 3 16
2018-08-02 3 16
2018-08-03 2 16
2018-08-03 2 16
2018-08-04 4 16
2018-08-04 4 16
2018-08-04 4 16
2018-08-04 4 16
2018-08-05 2 16
2018-08-05 2 16
C: Because we need no duplicates, we eliminate them with DISTINCT.
D: Counting all rows after eliminating the duplicates gives the number of distinct days. Dividing the all-time borrow count by this gives the average borrows per day. The decimal cast is necessary: it turns the integer division (16 / 5 == 3) into a decimal division (16 / 5 == 3.2); see the small example below.
E: Now we can filter for days where the borrows on that day > the average borrows per day.
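A tiny illustration of the cast in point D (PostgreSQL):
SELECT 16 / 5           AS integer_division,  -- 3
       16::decimal / 5  AS decimal_division;  -- 3.2000...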
The result:
borrowed
2018-08-01
2018-08-04
This looks a bit like homework, so window functions might be out of bounds.
SELECT *
FROM (
    SELECT BOOK.*,
           CAST(COUNT(1) OVER (PARTITION BY borrowed) AS FLOAT) AS cntThatDay,
           CAST(SUM(1) OVER () AS FLOAT)
               / CAST((SELECT COUNT(DISTINCT borrowed) FROM BOOK) AS FLOAT) AS totalAverage
    FROM BOOK
    WHERE borrowed IS NOT NULL
) TMP
WHERE cntThatDay >= totalAverage;
There is a log table with a lot of events. I would like to get some statistics from it, i.e. how many events happened at each hour of each month.
Data sample:
date_create | event
---------------------+---------------------------
2018-03-01 18:00:00 | Something happened
2018-03-05 18:15:00 | Something else happened
2018-03-06 19:00:00 | Something happened again
2018-04-01 18:00:00 | and again
The result should look like this:
hour | 03 | 04
------+----+----
18 | 2 | 1
19 | 1 | 0
I can do it with a CTE, but that is significant manual work each time. My guess would be that it can be done with a function, and probably one already exists.
You can use aggregation. I'm thinking:
select extract(hour from date_create) as hh,
sum(case when extract(month from date_create) = 3 then 1 else 0 end) as month_03,
sum(case when extract(month from date_create) = 4 then 1 else 0 end) as month_04
from t
group by hh
order by hh;
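If the hard-coded per-month columns are the main pain point, one alternative (just a sketch, assuming PostgreSQL, not part of the answer above) is to skip the pivot and return one row per (hour, month) pair:
select extract(hour from date_create)  as hh,
       to_char(date_create, 'YYYY-MM') as month,
       count(*)                        as events
from t
group by hh, month
order by hh, month;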
I have a database with the following data:
Group ID Time
1 1 16:00:00
1 2 16:02:00
1 3 16:03:00
2 4 16:09:00
2 5 16:10:00
2 6 16:14:00
I am trying to find the difference in times between the consecutive rows within each group. Using LAG() and DATEDIFF() (i.e. https://stackoverflow.com/a/43055820), right now I have the following result set:
Group ID Difference
1 1 NULL
1 2 00:02:00
1 3 00:01:00
2 4 00:06:00
2 5 00:01:00
2 6 00:04:00
However I need the difference to reset when a new group is reached, as in below. Can anyone advise?
Group ID Difference
1 1 NULL
1 2 00:02:00
1 3 00:01:00
2 4 NULL
2 5 00:01:00
2 6 00:04:00
The code would look something like:
select t.*,
datediff(second, lag(time) over (partition by group order by id), time)
from t;
This returns the difference as a number of seconds, but you seem to know how to convert that to a time representation. You also seem to know that group is not acceptable as a column name, because it is a SQL keyword.
Based on the question, it appears you have put group in the order by clause of the lag(), not in the partition by.
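For completeness, here is a sketch of the full expression in SQL Server (assumed from the use of DATEDIFF), with [Group] bracket-quoted because it is a keyword and the seconds converted back into a time-style duration (valid for gaps under 24 hours):
select t.*,
       convert(time(0),
               dateadd(second,
                       datediff(second,
                                lag([Time]) over (partition by [Group] order by ID),
                                [Time]),
                       0)) as Difference   -- NULL on the first row of each group
from t;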
I am dealing with the following problem in SQL (using Vertica):
In short -- Create a timeline for each ID (in a table where I have multiple lines, orders in my example, per ID)
What I would like to achieve -- At my disposal I have a table of historical order dates, and I would like to compute the rates of new customers (first order ever in the past month), active customers (>1 order in the last 1-3 months), passive customers (no order for the last 3-6 months) and inactive customers (no order for >6 months).
Which steps I have taken so far -- I was able to construct a table similar to the example presented below:
CustomerID | Current order date  | Time between current/previous order | First order date (all-time)
-----------+---------------------+-------------------------------------+----------------------------
001        | 2015-04-30 12:06:58 | (null)                              | 2015-04-30 12:06:58
001        | 2015-09-24 17:30:59 | 147 05:24:01                        | 2015-04-30 12:06:58
001        | 2016-02-11 13:21:10 | 139 19:50:11                        | 2015-04-30 12:06:58
002        | 2015-10-21 10:38:29 | (null)                              | 2015-10-21 10:38:29
003        | 2015-05-22 12:13:01 | (null)                              | 2015-05-22 12:13:01
003        | 2015-07-09 01:04:51 | 47 12:51:50                         | 2015-05-22 12:13:01
003        | 2015-10-23 00:23:48 | 105 23:18:57                        | 2015-05-22 12:13:01
A little bit of intuition: customer 001 placed three orders, of which the second one was placed 147 days after the first. Customer 002 has only placed one order in total.
What I think that the next steps should be -- I would like to know for each date (also dates on which a certain user did not place an order), for each CustomerID, how long it has been since his/her last order. This would imply that I would create some sort of timeline for each CustomerID. In the example presented above I would get 287 (days between 1st of May 2015 and 11th of February 2016, the timespan of this table) lines for each CustomerID. I have difficulties solving this previous step. When I have performed this step I want to create a field which shows at each date the last order date, the period between the last order date and the current date, and what state someone is in at the current date. For the example presented earlier, this would look something like this:
CustomerID | Last order date     | Current date        | Time between current date/last order | State
-----------+---------------------+---------------------+--------------------------------------+----------
001        | 2015-04-30 12:06:58 | 2015-05-01 00:00:00 | 0 00:00:00                           | New
...
001        | 2015-04-30 12:06:58 | 2015-06-30 00:00:00 | 60 11:53:02                          | Active
...
001        | 2015-09-24 17:30:59 | 2016-02-01 00:00:00 | 129 11:53:02                         | Passive
...
...
002        | 2015-10-21 17:30:59 | 2015-10-22 00:00:00 | 0 06:29:01                           | New
...
002        | 2015-10-21 17:30:59 | 2015-11-30 00:00:00 | 39 06:29:01                          | Active
...
...
003        | 2015-05-22 12:13:01 | 2015-06-23 00:00:00 | 31 11:46:59                          | Active
...
003        | 2015-07-09 01:04:51 | 2015-10-22 00:00:00 | 105 11:46:59                         | Inactive
...
At the dots there should be all the in-between dates, but for the sake of space I have left these out of the table.
When I know for each date what the state is of each customer (active/passive/inactive) my plan is to sum the states and group by date which should give me the sum of new, active, passive and inactive customers. From here on I can easily compute the rates at each date.
Does anybody know how I could achieve this task?
Note -- If anyone has other ideas how to achieve the goal presented above (using some other approach compared to the approach I had in mind) please let me know!
EDIT
Suppose you start from a table like this:
SQL> select * from ord order by custid, ord_date ;
custid | ord_date
--------+---------------------
1 | 2015-04-30 12:06:58
1 | 2015-09-24 17:30:59
1 | 2016-02-11 13:21:10
2 | 2015-10-21 10:38:29
3 | 2015-05-22 12:13:01
3 | 2015-07-09 01:04:51
3 | 2015-10-23 00:23:48
(7 rows)
You can use Vertica's timeseries analytic functions TS_FIRST_VALUE() and TS_LAST_VALUE() to fill the gaps and carry the last order date forward to the current date.
You just have to apply them to a time series generated from the same table with an interval of one day, starting from the day each customer placed his/her first order and running up to now (current_date):
select
custid,
status_dt,
last_order_dt,
case
when status_dt::date - last_order_dt::date < 30 then case
when nord = 1 then 'New' else 'Active' end
when status_dt::date - last_order_dt::date < 90 then 'Active'
when status_dt::date - last_order_dt::date < 180 then 'Passive'
else 'Inactive'
end as status
from (
select
custid,
last_order_dt,
status_dt,
conditional_true_event (first_order_dt is null or
last_order_dt > lag(last_order_dt))
over(partition by custid order by status_dt) as nord
from (
select
custid,
ts_first_value(ord_date) as first_order_dt ,
ts_last_value(ord_date) as last_order_dt ,
dt::date as status_dt
from
( select custid, ord_date from ord
union all
select distinct(custid) as custid, current_date + 1 as ord_date from ord
) z timeseries dt as '1 day' over (partition by custid order by ord_date)
) x
) y
where status_dt <= current_date
order by 1, 2
;
And you will get something like this:
custid | status_dt | last_order_dt | status
--------+------------+---------------------+---------
1 | 2015-04-30 | 2015-04-30 12:06:58 | New
1 | 2015-05-01 | 2015-04-30 12:06:58 | New
1 | 2015-05-02 | 2015-04-30 12:06:58 | New
...
1 | 2015-05-29 | 2015-04-30 12:06:58 | New
1 | 2015-05-30 | 2015-04-30 12:06:58 | Active
1 | 2015-05-31 | 2015-04-30 12:06:58 | Active
...
etc.
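From there, the per-day counts the question asks for could be obtained by aggregating this output. A sketch, assuming the query above is saved (for example as a view) under the hypothetical name customer_status(custid, status_dt, last_order_dt, status):
select
    status_dt,
    sum(case when status = 'New'      then 1 else 0 end) as new_customers,
    sum(case when status = 'Active'   then 1 else 0 end) as active_customers,
    sum(case when status = 'Passive'  then 1 else 0 end) as passive_customers,
    sum(case when status = 'Inactive' then 1 else 0 end) as inactive_customers
from customer_status
group by status_dt
order by status_dt;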