How can I convert a long format dataframe into a wide format in PySpark


I'm working in PySpark and I have a long-format dataframe like this:
KPI      GROUP  TIME    VALUE
Sales    A      Before  100
Sales    A      After   135
Sales    B      Before  90
Sales    B      After   98
Revenue  A      Before  10
Revenue  A      After   12
Revenue  B      Before  5
Revenue  B      After   8
And what I expect to have is something like this:
KPI      GROUP  BEFORE  AFTER
Sales    A      100     135
Sales    B      90      98
Revenue  A      10      12
Revenue  B      5       8

Just pivot on TIME, grouping by the remaining key columns:
from pyspark.sql.functions import first
df1.groupBy('KPI', 'GROUP').pivot('TIME').agg(first('VALUE')).show()
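To see what the pivot computes without starting a Spark session, the same reshaping can be sketched in plain Python. This is only an illustration of the logic, not Spark's implementation; the helper name pivot_first is made up for this example. In Spark itself, passing the values explicitly, as in pivot('TIME', ['Before', 'After']), fixes the column order.

```python
# Plain-Python illustration of groupBy('KPI', 'GROUP').pivot('TIME').agg(first('VALUE')).
# pivot_first is a hypothetical helper, not part of PySpark.
rows = [
    ("Sales", "A", "Before", 100),
    ("Sales", "A", "After", 135),
    ("Sales", "B", "Before", 90),
    ("Sales", "B", "After", 98),
    ("Revenue", "A", "Before", 10),
    ("Revenue", "A", "After", 12),
    ("Revenue", "B", "Before", 5),
    ("Revenue", "B", "After", 8),
]


def pivot_first(rows):
    """Group by (KPI, GROUP) and spread TIME into columns, keeping the first VALUE seen."""
    wide = {}
    for kpi, group, time, value in rows:
        columns = wide.setdefault((kpi, group), {})
        columns.setdefault(time, value)  # like first(): later duplicates are ignored
    return wide


wide = pivot_first(rows)
for (kpi, group), columns in wide.items():
    print(kpi, group, columns.get("Before"), columns.get("After"))
```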

Related

Calculating moving sum (or SUM OVER) for the last X months, but with irregular number of rows

I want to do a window function (like the SUM() OVER() function), but there are two catches:
I want to consider the last 3 months in my moving sum, but the number of rows is not consistent. Some months have 3 entries, others may have 2, 4, 5, etc.;
There is also a "group" column, and the moving sum should sum only the amounts of the same group.
In summary, I have a table that has the following structure:
id  date     group    amount
1   2022-01  group A  1100
2   2022-01  group D  2500
3   2022-02  group A  3000
4   2022-02  group B  1000
5   2022-02  group C  2500
6   2022-03  group A  2000
7   2022-04  group C  1000
8   2022-05  group A  1500
9   2022-05  group D  2000
10  2022-06  group B  1000
So, I want to add a moving sum column containing the sum of the amount for each group over the last 3 months. The sum should not reset every 3 months, but should consider only the values from the 3 months prior, and only those of the same group.
The end result should look like:
id  date     group    amount  moving_sum_three_months
1   2022-01  group A  1100    1100
2   2022-01  group D  2500    2500
3   2022-02  group A  3000    4100
4   2022-02  group B  1000    1000
5   2022-02  group C  2500    2500
6   2022-03  group A  2000    6100
7   2022-04  group C  1000    3500
8   2022-05  group A  1500    3500
9   2022-05  group D  2000    2000
10  2022-06  group B  1200    1200
The best example of how the sum works is line 8.
It considers only lines 8 and 6 for the sum, because they are the only ones that meet the criteria;
Lines 1 and 3 do not meet the criteria, because they are more than 3 months old relative to line 8's date;
All the other lines are not from group A, so they are also excluded from the sum.
Any ideas? Thanks in advance for the help!

Use SUM() as a window function, partitioning the window by "group" and using RANGE mode. Set the frame to reach back 3 months before the current record with INTERVAL '3 months', e.g.
SELECT *, SUM(amount) OVER w AS moving_sum_three_months
FROM t
WINDOW w AS (PARTITION BY "group" ORDER BY "date"
RANGE BETWEEN INTERVAL '3 months' PRECEDING AND CURRENT ROW)
ORDER BY id
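As a cross-check of the expected numbers, here is a minimal plain-Python sketch of the same per-group moving sum. It assumes the window covers the current month plus the two preceding months, which is what reproduces the question's expected output (line 8 sums only lines 6 and 8). If "date" is stored as a first-of-month date, a RANGE frame of INTERVAL '3 months' PRECEDING would also include a row exactly three months back, so the interval may need to be '2 months' to match. The sketch uses the amounts from the input table; line 10 appears as 1000 there and as 1200 in the expected output, and 1000 is used here.

```python
# Plain-Python sketch of a per-group moving sum over calendar months.
# Assumption: the window is the current month plus the two preceding months.
rows = [
    (1, "2022-01", "group A", 1100),
    (2, "2022-01", "group D", 2500),
    (3, "2022-02", "group A", 3000),
    (4, "2022-02", "group B", 1000),
    (5, "2022-02", "group C", 2500),
    (6, "2022-03", "group A", 2000),
    (7, "2022-04", "group C", 1000),
    (8, "2022-05", "group A", 1500),
    (9, "2022-05", "group D", 2000),
    (10, "2022-06", "group B", 1000),
]


def month_index(year_month):
    """Turn 'YYYY-MM' into a running month count so months can be subtracted."""
    year, month = year_month.split("-")
    return int(year) * 12 + int(month)


def moving_sums(rows, window_months=3):
    """Map each row id to the sum of its group's amounts inside the month window."""
    sums = {}
    for row_id, date, group, _amount in rows:
        newest = month_index(date)
        oldest = newest - (window_months - 1)
        sums[row_id] = sum(
            other_amount
            for _id, other_date, other_group, other_amount in rows
            if other_group == group and oldest <= month_index(other_date) <= newest
        )
    return sums


sums = moving_sums(rows)
for row_id, total in sums.items():
    print(row_id, total)
```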

Finding Max Price and displaying multiple columns SQL

I have a table that looks like this:
customer_id item price cost
1 Shoe 120 36
1 Bag 180 50
1 Shirt 30 9
2 Shoe 150 40
3 Shirt 30 9
4 Shoe 120 36
5 Shorts 65 14
I am trying to find the most expensive item each customer bought along with the cost of item and the item name.
I'm able to do the first part:
SELECT customer_id, max(price)
FROM sales
GROUP BY customer_id;
Which gives me:
customer_id price
1 180
2 150
3 30
4 120
5 65
How do I get this output to also show me the item and its cost? The output should look like this:
customer_id price item cost
1 180 Bag 50
2 150 Shoe 40
3 30 Shirt 9
4 120 Shoe 36
5 65 Shorts 14
I'm assuming it's a SELECT statement within a SELECT? I would appreciate the help as I'm fairly new to SQL.

One method that usually has good performance is a correlated subquery:
select s.*
from sales s
where s.price = (select max(s2.price)
                 from sales s2
                 where s2.customer_id = s.customer_id
                );
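The correlated subquery can be tried end to end with Python's built-in sqlite3 module. This is a small sketch using the question's sample rows. The column types are assumptions, and the column list is spelled out only to match the requested output order.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer_id INTEGER, item TEXT, price INTEGER, cost INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "Shoe", 120, 36),
        (1, "Bag", 180, 50),
        (1, "Shirt", 30, 9),
        (2, "Shoe", 150, 40),
        (3, "Shirt", 30, 9),
        (4, "Shoe", 120, 36),
        (5, "Shorts", 65, 14),
    ],
)

# Keep only the rows whose price equals the maximum price for that customer.
query = """
SELECT s.customer_id, s.price, s.item, s.cost
FROM sales s
WHERE s.price = (SELECT MAX(s2.price)
                 FROM sales s2
                 WHERE s2.customer_id = s.customer_id)
ORDER BY s.customer_id
"""
top_items = conn.execute(query).fetchall()
for row in top_items:
    print(row)
```

Note that if a customer bought two items at the same maximum price, this query returns both rows.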

I want to do some aggregations with the help of Group By function in pandas

My dataset has a date column of dtype 'datetime64[ns]', plus a price column and a number-of-sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price) of the stock.
(VWAP = sum(price * no. of sales) / sum(no. of sales))
What I have done so far:
created new month and year columns in the dataframe using pandas functions.
Now I want the monthly VWAP from this modified dataset, and it should be distinct by year.
For example, March 2016 and March 2017 should have their separate monthly VWAP values.

Start by defining a function that computes the VWAP for the current
month (a group of rows):
def vwap(grp):
    return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both
years from the source data.
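To sanity-check these figures without pandas, the same per-month calculation can be sketched with the standard library. The function name monthly_vwap and the tuple layout are choices made for this illustration only. Grouping by (year, month) plays the role of to_period('M').

```python
from collections import defaultdict
from datetime import date

# Same test data as above: (date, price, number of sales).
trades = [
    (date(2018, 5, 14), 120.5, 10),
    (date(2018, 5, 16), 80.0, 22),
    (date(2018, 5, 20), 30.2, 12),
    (date(2018, 8, 10), 75.1, 41),
    (date(2018, 8, 20), 92.3, 18),
    (date(2019, 5, 10), 10.0, 33),
    (date(2019, 5, 20), 20.0, 41),
]


def monthly_vwap(trades):
    """Return {(year, month): sum(price * volume) / sum(volume)}."""
    traded_value = defaultdict(float)
    volume = defaultdict(int)
    for day, price, sales_no in trades:
        key = (day.year, day.month)  # year is part of the key, so May 2018 != May 2019
        traded_value[key] += price * sales_no
        volume[key] += sales_no
    return {key: traded_value[key] / volume[key] for key in traded_value}


vwap_by_month = monthly_vwap(trades)
for key in sorted(vwap_by_month):
    print(key, round(vwap_by_month[key], 6))
```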

Last 3 months average next to current month value in hive

I have a table that holds the monthly sales values for each item. For each item, I need the average sales of the previous 3 months next to the current month's sales.
I need to do this in Hive.
The sample input table looks like this:
Item_ID Sales Month
A 4295 Dec-2018
A 245 Nov-2018
A 1337 Oct-2018
A 3290 Sep-2018
A 2000 Aug-2018
B 856 Dec-2018
B 1694 Nov-2018
B 4286 Oct-2018
B 2780 Sep-2018
B 3100 Aug-2018
The result table should look like this
Item_ID Sales_Current_Month Month Sales_Last_3_months_average
A 4295 Dec-2018 1624
A 245 Nov-2018 2209
B 856 Dec-2018 2920
B 1694 Nov-2018 3388.67

Assuming no months are missing from the data, you can use the avg window function to do this.
select t.*
,avg(sales) over(partition by item_id order by month rows between 3 preceding and 1 preceding) as avg_sales_prev_3_months
from tbl t
If the month column is in a format other than yyyyMM, convert it appropriately so that the ordering works as expected.
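The frame "rows between 3 preceding and 1 preceding" can be illustrated in plain Python on the sample data. This is a sketch of the windowing logic only, not Hive code; the helper names month_key and previous_three_average are made up for this example. It converts labels such as Dec-2018 to a sortable yyyyMM integer first, as the note above suggests.

```python
# Month abbreviations mapped by hand to avoid any locale dependence.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

rows = [
    ("A", 4295, "Dec-2018"), ("A", 245, "Nov-2018"), ("A", 1337, "Oct-2018"),
    ("A", 3290, "Sep-2018"), ("A", 2000, "Aug-2018"),
    ("B", 856, "Dec-2018"), ("B", 1694, "Nov-2018"), ("B", 4286, "Oct-2018"),
    ("B", 2780, "Sep-2018"), ("B", 3100, "Aug-2018"),
]


def month_key(label):
    """Convert 'Dec-2018' to the sortable integer 201812."""
    name, year = label.split("-")
    return int(year) * 100 + MONTHS[name]


def previous_three_average(rows):
    """Map (item, yyyyMM) to (sales, average of up to 3 preceding rows or None)."""
    by_item = {}
    for item, sales, month in rows:
        by_item.setdefault(item, []).append((month_key(month), sales))
    result = {}
    for item, series in by_item.items():
        series.sort()  # order by month within each item (the partition)
        for i, (key, sales) in enumerate(series):
            window = [s for _, s in series[max(0, i - 3):i]]  # 3 preceding .. 1 preceding
            average = sum(window) / len(window) if window else None
            result[(item, key)] = (sales, average)
    return result


averages = previous_three_average(rows)
for (item, key), (sales, average) in sorted(averages.items()):
    print(item, key, sales, average)
```

Months with fewer than three earlier rows get an average over whatever rows exist, and the earliest month gets no average at all. The expected result in the question lists only the months with a full three-month history, so an extra filter would be needed to drop the others.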

Subtract nonconsecutive values in same row in t-SQL

I have a data table that has annual data points and quarterly data points. I want to subtract the quarterly data points from the corresponding prior annual entry, e.g. Annual 2014 - Q3 2014, using t-SQL. I have an id variable for each entry, plus a reconcile id variable that shows which quarterly entry corresponds to which annual entry. See below:
CurrentDate PreviousDate Value Entry Id Reconcile Id Annual/Quarterly
9/30/2012 9/30/2011 112 2 3 Annual
9/30/2013 9/30/2012 123 1 2 Annual
9/30/2014 9/30/2013 123.5 9 1 Annual
12/31/2013 9/30/2014 124 4 1 Quarterly
3/31/2014 12/31/2013 124.5 5 1 Quarterly
6/30/2014 3/31/2014 125 6 1 Quarterly
9/30/2014 6/30/2014 125.5 7 1 Quarterly
12/31/2014 9/30/2014 126 10 9 Quarterly
3/31/2015 12/31/2014 126.5 11 9 Quarterly
6/30/2015 3/31/2015 127 12 9 Quarterly
For example, Reconcile ID 9 for the quarterly entries corresponds to Entry ID 9, which is an annual entry.
I have code to just subtract the prior entry from the current entry, but I cannot figure out how to subtract quarterly entries from annual entries where the Entry ID and Reconcile ID are the same.
Here is the code I am using, which is resulting in the right calculation, but increasing the number of results by many rows. I have also tried this as an inner join. I only want the original 10 rows, plus a new difference column:
SELECT DISTINCT T1.[EntryID]
, [T1].[RECONCILEID]
, [T1].[CurrentDate]
, [T1].[Annual_Quarterly]
, [T1].[Value]
, [T1].[Value]-T2.[Value] AS Difference
FROM Table T1
LEFT JOIN Table T2 ON T2.EntryID = T1.RECONCILEID;

Your code should be fine; here are the results I'm getting:
EntryId Annual_Quarterly CurrentDate ReconcileId Value recVal diff
2 Annual 9/30/2012 3 112
1 Annual 9/30/2013 2 123 112 11
9 Annual 9/30/2014 1 123.5 123 0.5
4 Quarterly 12/31/2013 1 124 123 1
5 Quarterly 3/31/2014 1 124.5 123 1.5
6 Quarterly 6/30/2014 1 125 123 2
7 Quarterly 9/30/2014 1 125.5 123 2.5
10 Quarterly 12/31/2014 9 126 123.5 2.5
11 Quarterly 3/31/2015 9 126.5 123.5 3
12 Quarterly 6/30/2015 9 127 123.5 3.5
with your data and this SQL:
SELECT
tr.EntryId,
tr.Annual_Quarterly,
tr.CurrentDate,
tr.ReconcileId,
tr.Value,
te.Value AS recVal,
tr.[VALUE]-te.[VALUE] AS diff
FROM
t AS tr LEFT JOIN
t AS te ON
tr.ReconcileId = te.EntryId
ORDER BY
tr.Annual_Quarterly,
tr.CurrentDate;

Your question is a bit vague about how you want to subtract these values, but this should give you some idea.
Select T1.*, T1.Value - Coalesce(T2.Value, 0) As Difference
From Table T1
Left Join Table T2 On T2.[Entry Id] = T1.[Reconcile Id]
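Both answers rely on a LEFT JOIN of the table to itself on the reconcile id. A quick way to confirm that this keeps the original 10 rows is to run it against the sample data with Python's built-in sqlite3 module. This is a sketch: SQLite stands in for SQL Server, the table name t and the column types are assumptions, and the date columns are left out. The row count stays at 10 here because every EntryId is unique. Duplicate entry ids in the real table would be one possible cause of the extra rows described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (EntryId INTEGER, ReconcileId INTEGER, Value REAL, Annual_Quarterly TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (2, 3, 112, "Annual"),
        (1, 2, 123, "Annual"),
        (9, 1, 123.5, "Annual"),
        (4, 1, 124, "Quarterly"),
        (5, 1, 124.5, "Quarterly"),
        (6, 1, 125, "Quarterly"),
        (7, 1, 125.5, "Quarterly"),
        (10, 9, 126, "Quarterly"),
        (11, 9, 126.5, "Quarterly"),
        (12, 9, 127, "Quarterly"),
    ],
)

# Each row looks up the entry named by its ReconcileId; unmatched rows get NULL.
query = """
SELECT tr.EntryId, tr.Value - te.Value AS diff
FROM t AS tr
LEFT JOIN t AS te ON tr.ReconcileId = te.EntryId
ORDER BY tr.EntryId
"""
differences = conn.execute(query).fetchall()
for entry_id, diff in differences:
    print(entry_id, diff)
```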
