Computing statistics on trends in a time series in pandas

I have time series data on the prices of items in different periods:
import pandas as pd

d = {'ItemID': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1', 5: '1', 6: '1', 7: '1', 8: '1', 9: '1', 10: '1',
                11: '2', 12: '2', 13: '2', 14: '2', 15: '2', 16: '2', 17: '2', 18: '2', 19: '2', 20: '2', 21: '2'},
     'Period': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1', 5: '1', 6: '2', 7: '2', 8: '2', 9: '2', 10: '2',
                11: '1', 12: '1', 13: '1', 14: '1', 15: '1', 16: '2', 17: '2', 18: '2', 19: '2', 20: '2', 21: '2'},
     'Price': {0: 1, 1: 2, 2: 1, 3: 2, 4: 2, 5: 3, 6: 6, 7: 6, 8: 7, 9: 7, 10: 8,
               11: 50, 12: 49, 13: 50, 14: 49, 15: 48, 16: 61, 17: 62, 18: 63, 19: 64, 20: 64, 21: 65}}
df = pd.DataFrame(d)
I would like to compute the following statistics about the price changes per item and period:
number of streaks
max streak length
avg streak length
A streak is, essentially, a list of either non-decreasing or non-increasing values. In the following list [0,5,4,3,3] there are 2 streaks: [0,5] and [4,3,3].
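For illustration only, the direction changes can be read off the signs of the consecutive differences (a zero difference, i.e. equal values, does not break a streak); this is a sketch of the definition, not the solution:
import numpy as np

values = [0, 5, 4, 3, 3]
steps = np.sign(np.diff(values))   # array([ 1, -1, -1,  0])
# the sign flips between the first and second step, so a new streak starts at
# the value 4, giving the two streaks [0, 5] and [4, 3, 3]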
For the above dataframe the correct output would be:
s = {'ItemID': {0: '1', 1: '1', 2: '2', 3: '2'},
     'Period': {0: '1', 1: '2', 2: '1', 3: '2'},
     'MaxStreakLength': {0: 4, 1: 5, 2: 3, 3: 6},
     'AvgStreakLength': {0: 3, 1: 3, 2: 2.5, 3: 6},
     'NumStreaks': {0: 2, 1: 1, 2: 2, 3: 1}}
How can I do this efficiently? The actual dataframe is quite large (millions of records).

I assume there is no direct method for this kind of sequence splitting, so here I have added a conditional sequence split:
import numpy as np

def sequential_split(p):
    # Flag non-negative differences, then use the cumsum-minus-ffill trick to
    # number the positions inside each run of consecutive non-negative diffs.
    a = p >= 0
    b = a.cumsum()
    arr = b - b.mask(a).ffill().fillna(0).astype(int)
    # Positions just before each negative difference, i.e. where a run ends.
    streak_ends = np.where(a == 0)[0] - 1
    return arr, streak_ends

def get_data(p):
    # Split on forward differences (non-decreasing runs) and on backward
    # differences (non-increasing runs), keeping whichever split has fewer breaks.
    arr, s_e = sequential_split(p.diff())
    arr1, s_e1 = sequential_split(p.diff(-1))
    if len(s_e) > len(s_e1):
        s_e, arr = s_e1, arr1
        streak_peaks = arr.iloc[s_e].add(1).tolist()
    else:
        # s_e holds positional indices, so use iloc here as well
        streak_peaks = arr.iloc[s_e[1:]].add(1).tolist() + [arr.iloc[-1] + 1]
    return [arr.max() + 1,
            sum(streak_peaks) / len(streak_peaks),
            arr[arr == 0].shape[0]]

columns = ['MaxStreakLength', 'AvgStreakLength', 'NumStreaks']
a = df.groupby(['ItemID', 'Period'])['Price'].apply(get_data)
a.apply(lambda x: pd.Series(x, index=columns)).reset_index()
Out:
  ItemID Period  MaxStreakLength  AvgStreakLength  NumStreaks
0      1      1              4.0              3.0         2.0
1      1      2              5.0              5.0         1.0
2      2      1              3.0              2.5         2.0
3      2      2              6.0              6.0         1.0
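As a cross-check on small data, a plain greedy loop per group is easy to reason about (a sketch that follows the streak definition above; it will likely be slower than the vectorized version on millions of rows):
import numpy as np
import pandas as pd

def streak_stats(prices):
    # Greedy walk: extend the current streak while its direction (set by the
    # first non-zero step inside it) is maintained; a flip starts a new streak.
    values = prices.to_numpy()
    lengths = []
    start = 0
    direction = 0   # 0 = undecided, 1 = non-decreasing, -1 = non-increasing
    for i in range(1, len(values)):
        step = np.sign(values[i] - values[i - 1])
        if direction == 0:
            direction = step
        elif step != 0 and step != direction:
            lengths.append(i - start)   # close the current streak
            start = i                   # the breaking value opens the next one
            direction = 0
    lengths.append(len(values) - start)
    return pd.Series({'MaxStreakLength': max(lengths),
                      'AvgStreakLength': sum(lengths) / len(lengths),
                      'NumStreaks': len(lengths)})

df.groupby(['ItemID', 'Period'])['Price'].apply(streak_stats).unstack().reset_index()
This reproduces the table above for the sample frame and can be used to validate the faster implementation on a subset of the real data.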

Related

text showing up in hoverinfo not just displayed

So I'm trying to add data labels so you can see the values of each of my stacks when looking at the graph. I added the text option and set it to the column I want displayed, but the values only appear in the hover information and are not displayed on the graph itself. How do I change this?
import pandas as pd
import plotly.graph_objects as go

df2 = pd.DataFrame.from_dict(
    {'Country': {0: 'Europe', 1: 'America', 2: 'Asia', 3: 'Europe', 4: 'America', 5: 'Asia',
                 6: 'Europe', 7: 'America', 8: 'Asia', 9: 'Europe', 10: 'America', 11: 'Asia'},
     'Year': {0: 2014, 1: 2014, 2: 2014, 3: 2015, 4: 2015, 5: 2015,
              6: 2016, 7: 2016, 8: 2016, 9: 2017, 10: 2017, 11: 2017},
     'Amount': {0: 1600, 1: 410, 2: 150, 3: 1300, 4: 300, 5: 170,
                6: 1000, 7: 500, 8: 200, 9: 900, 10: 500, 11: 210}})
fig = go.Figure()
x = []
for i in df2['Year'].unique():
    x.append(str(i))
for c in df2['Country'].unique():
    df3 = df2[df2['Country'] == c]
    fig.add_trace(go.Bar(x=x, y=df3['Amount'], name=c, text=df3['Amount']))
fig.update_layout(title="Personnel at Work",
                  barmode='stack',
                  title_x=.5,
                  yaxis={'showgrid': False, 'visible': False},
                  xaxis=dict(tick0=0, dtick=1),
                  plot_bgcolor='rgba(0,0,0,0)')
fig.show()
I had a similar problem and this block of code helped me! I'm not sure if it covers your case, but give it a try (your_labels stands for whichever template variable holds your labels):
fig.update_traces(texttemplate='%{your_labels:.1f}', textposition='outside')
Go through all the use cases here,
https://plotly.com/python/text-and-annotations/
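Applied to the code above, the traces already carry the labels via text=df3['Amount'], so the documented %{text} template variable can be referenced directly; a minimal sketch (the uniformtext settings are optional and only keep the labels legible inside stacked bars):
fig.update_traces(texttemplate='%{text:,.0f}', textposition='inside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()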

Plot the distribution of values in a column based on category in another column

I have two columns of data as indicated below:
data = {'labels': {0: '00', 1: '08', 2: '00', 3: '08', 4: '5', 5: '04', 6: '08', 7: '04', 8: '08', 9: '5',
                   10: '5', 11: '04', 12: '5', 13: '00', 14: '08', 15: '5', 16: '00', 17: '00', 18: '04', 19: '04'},
        'scores': {0: 0.0023585677121699122, 1: 0.056371229170055104, 2: 0.005376756883710199,
                   3: 0.05694460526172507, 4: 0.1049131006122696, 5: 0.008102266910447686,
                   6: 0.09154342979296892, 7: -0.03761723194472211, 8: 0.010718527281161072,
                   9: 0.11988838522095685, 10: 0.09070139731152083, 11: 0.02994813107318378,
                   12: 0.09277903598030868, 13: 0.062223925985664286, 14: 0.1377963110579728,
                   15: 0.11898618005936024, 16: -0.021227109409528988, 17: 0.008938944493717238,
                   18: 0.03413068403999525, 19: 0.058688416622356965}}
df = pd.DataFrame(data)
I am trying to plot the values in the scores column and color them according to the labels. I have tried
sns.scatterplot(data=df,x='labels',y='scores');
This works, but it doesn't show the clusters (each x value is plotted in its own separate column), as shown here.
I want all the points to share the same space and to be colored differently according to df['labels'].
sns.scatterplot(data=df,x='labels',y='scores', hue='labels')
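If the goal is for every point to share the same x-space instead of one column per label, one option (a sketch, assuming the default integer index) is to plot against the row position and keep the label only for colour:
import seaborn as sns
import matplotlib.pyplot as plt

# reset_index() exposes the row position as a column named 'index', so all
# points share one x-axis while hue colours them by label
sns.scatterplot(data=df.reset_index(), x='index', y='scores', hue='labels')
plt.show()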

Matching Buy Sell entries from two dataframes and creating a new one. Python 3.8 / W10

Python / Pandas.
Matching Buy and Sell entries row by row.
BuyDF and SellDF are obtained from one Excel file and sorted by ascending Time (column L).
The image shows how the matching has to be done.
Match Buy and Sell entries by Name following the first-in-first-out method.
Take the very first entry (Name AAA) from BuyDF, match it with the first/topmost entry (Name AAA) from SellDF, place the matching row from SellDF next to the corresponding row of BuyDF, and delete that row from SellDF.
Go back to the second entry of BuyDF, match it against SellDF in the same way, place the matching row from SellDF next to the corresponding row of BuyDF, and delete that row from SellDF ... and so on.
For names that have no match, leave the matched columns blank.
The ascending order (Time / column L) should not be changed, so that first-in-first-out is maintained.
I tried using merge but it didn't work for me. How do I proceed?
BuyDF
{'Date': {0: '2019-04-01', 1: '2019-04-01', 2: '2019-04-01', 3: '2019-04-01', 4: '2019-04-02', 5: '2019-04-02', 6: '2019-04-02', 7: '2019-04-02', 8: '2019-04-05'}, 'Name': {0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'BBB', 5: 'CCC', 6: 'CCC', 7: 'BBB', 8: 'AAA'}, 'Ref': {0: 1, 1: 1, 2: 1, 3: 1, 4: 5, 5: 7, 6: 7, 7: 6, 8: 1}, 'Seg': {0: 'S', 1: 'S', 2: 'S', 3: 'S', 4: 'L', 5: 'XL', 6: 'XL', 7: 'L', 8: 'S'}, 'Trans': {0: 'buy', 1: 'buy', 2: 'buy', 3: 'buy', 4: 'buy', 5: 'buy', 6: 'buy', 7: 'buy', 8: 'buy'}, 'Qty': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}, 'Price': {0: 225, 1: 225, 2: 225, 3: 225, 4: 210, 5: 210, 6: 210, 7: 210, 8: 225}, 'Order ID': {0: 8249, 1: 111, 2: 654, 3: 111, 4: 888, 5: 444, 6: 444, 7: 888, 8: 111}, 'Trade ID': {0: 1010, 1: 1010, 2: 1010, 3: 1010, 4: 4645, 5: 132, 6: 132, 7: 4700, 8: 1010}, 'Time': {0: '2019-04-01 11:05:18', 1: '2019-04-01 13:05:18', 2: '2019-04-01 13:05:18', 3: '2019-04-01 13:05:59', 4: '2019-04-02 13:20:05', 5: '2019-04-02 13:35:02', 6: '2019-04-02 13:35:02', 7: '2019-04-02 14:20:12', 8: '2019-04-05 13:05:18'}}
SellDF
{'Date': {5: '2019-04-01', 6: '2019-04-02', 7: '2019-04-02', 8: '2019-04-02', 13: '2019-04-03', 14: '2019-04-05', 15: '2019-04-05'}, 'Name': {5: 'AAA', 6: 'BBB', 7: 'BBB', 8: 'BBB', 13: 'DDD', 14: 'AAA', 15: 'AAA'}, 'Ref': {5: 3, 6: 2, 7: 2, 8: 2, 13: 8, 14: 4, 15: 4}, 'Seg': {5: 'L', 6: 'X', 7: 'X', 8: 'X', 13: 'XS', 14: 'L', 15: 'L'}, 'Trans': {5: 'sell', 6: 'sell', 7: 'sell', 8: 'sell', 13: 'sell', 14: 'sell', 15: 'sell'}, 'Qty': {5: 1, 6: 1, 7: 1, 8: 1, 13: 1, 14: 1, 15: 1}, 'Price': {5: 210, 6: 210, 7: 210, 8: 210, 13: 210, 14: 210, 15: 210}, 'Order ID': {5: 555, 6: 222, 7: 222, 8: 222, 13: 999, 14: 555, 15: 555}, 'Trade ID': {5: 1640, 6: 1532, 7: 1532, 8: 1532, 13: 14623, 14: 1645, 15: 1645}, 'Time': {5: '2019-04-01 14:13:40', 6: '2019-04-02 13:10:32', 7: '2019-04-02 13:10:32', 8: '2019-04-02 13:10:32', 13: '2019-04-03 15:25:50', 14: '2019-04-05 14:41:45', 15: '2019-04-05 14:41:45'}}
Image posted for ease of understanding.
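A hedged sketch of one way to do the FIFO match (assuming the two dicts above have been loaded as buy_dict and sell_dict): sort each frame by Time, number the occurrences of each Name, then left-merge on (Name, occurrence), so the n-th buy of a name lines up with the n-th sell of the same name:
import pandas as pd

buy = pd.DataFrame(buy_dict).sort_values('Time').reset_index(drop=True)
sell = pd.DataFrame(sell_dict).sort_values('Time').reset_index(drop=True)

# cumcount() numbers the buys/sells of each Name in time order (0, 1, 2, ...)
buy['match'] = buy.groupby('Name').cumcount()
sell['match'] = sell.groupby('Name').cumcount()

# A left merge keeps every buy row in its original order; buys with no sell
# counterpart get NaN (blank) in the sell columns.
matched = buy.merge(sell, on=['Name', 'match'], how='left',
                    suffixes=('_buy', '_sell')).drop(columns='match')
If unmatched sell rows (such as DDD here) also need to appear, how='outer' could be used instead of how='left'.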

ggplot/plotnine - adding a legend from geom_text() with specific color

I have this dataframe:
df = pd.DataFrame({'Segment': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D'},
'Average': {0: 55341, 1: 55159, 2: 55394, 3: 56960, 4: 55341, 5: 55159, 6: 55394, 7: 56960},
'Order': {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3},
'Variable': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'One', 5: 'One', 6: 'One', 7: 'One'},
'$': {0: 40.6, 1: 18.2, 2: 78.5, 3: 123.3, 4: 42.4, 5: 24.2, 6: 89.7, 7: 144.1},
'ypos': {0: 96.0, 1: 55.4, 2: 181.2, 3: 280.4, 4: 96.0, 5: 55.4, 6: 181.2, 7: 280.4},
'yticks': {0: 20.3,1: 9.1,2: 39.25,3: 61.65,4: 21.2,5: 12.1,6: 44.85,7: 72.05}})
With it I plot this:
(ggplot(df, aes(x="Segment", y="$", ymin=0, ymax=300, fill="Variable"))
 + geom_col(position=position_stack(reverse=True), alpha=0.7)
 + geom_text(aes(x="Segment", y="ypos", label="Average"), size=8,
             format_string="Average: \n ${:,.0f} CLP")
 + geom_text(aes(label="$"), show_legend=True, position=position_stack(vjust=0.5),
             size=8, format_string="%s" % (u"\N{dollar sign}{:,.0f} MM"))
)
I have been looking for a way to add a legend for Average; once it exists I will drop the word 'Average' from the bars and leave just the number. However, for this to be understandable, the additional legend should be the same color as the Average values (it could be yellow, orange, or any other color, but not red or sky blue, as those colors are already in use).
You can just add color as a variable to geom_text:
import plotnine
from plotnine import (ggplot, geom_col, aes, position_stack, geom_text,
                      scale_color_brewer, guides, guide_legend)

(ggplot(df, aes(x="Segment", y="$", ymin=0, ymax=300, fill="Variable"))
 + geom_col(position=position_stack(reverse=True), alpha=0.7)
 + geom_text(aes(y="ypos", color="Segment", label="Average"), size=8,
             show_legend=True, format_string="${:,.0f} CLP")
 + geom_text(aes(label="$"), show_legend=True, position=position_stack(vjust=0.5),
             size=8, format_string="%s" % (u"\N{dollar sign}{:,.0f} MM"))
 + scale_color_brewer(type='qual', palette=2)
 + guides(color=guide_legend(title="Averages"))
)

Apply np.average in pandas pivot aggfunc

I am trying to calculate weighted average prices using pandas pivot table.
I have tried passing in a dictionary using aggfunc.
Passing the following into aggfunc does not work, although it should calculate the correct weighted average:
'Price': lambda x: np.average(x, weights=df['Balance'])
I have also tried using a manual groupby:
df.groupby('Product').agg({
    'Balance': sum,
    'Price': lambda x: np.average(x, weights='Balance'),
    'Value': sum
})
This also yields the error:
TypeError: Axis must be specified when shapes of a and weights differ.
Here is the sample data:
import pandas as pd
import numpy as np

price_dict = {
    'Product': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B',
                10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
    'Balance': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60, 6: 70, 7: 80, 8: 90, 9: 100,
                10: 110, 11: 120, 12: 130, 13: 140, 14: 150},
    'Price': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10,
              10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
    'Value': {0: 10, 1: 40, 2: 90, 3: 160, 4: 250, 5: 360, 6: 490, 7: 640, 8: 810, 9: 1000,
              10: 1210, 11: 1440, 12: 1690, 13: 1960, 14: 2250}}
Attempting to calculate the weighted average by passing a dict into aggfunc:
df = pd.DataFrame(price_dict)
df.pivot_table(
    index='Product',
    aggfunc={
        'Balance': sum,
        'Price': np.mean,
        'Value': sum
    }
)
Output:
         Balance  Price  Value
Product
A            150      3    550
B            400      8   3300
C            650     13   8550
The expected outcome should be:
         Balance  Price  Value
Product
A            150   3.66    550
B            400   8.25   3300
C            650  13.15   8550
Here is one way using apply; the lambda receives the whole group, so the weights can be taken from the group's own Balance column:
df.groupby('Product').apply(lambda x: pd.Series(
    {'Balance': x['Balance'].sum(),
     'Price': np.average(x['Price'], weights=x['Balance']),
     'Value': x['Value'].sum()}))
Out[57]:
         Balance      Price   Value
Product
A          150.0   3.666667   550.0
B          400.0   8.250000  3300.0
C          650.0  13.153846  8550.0
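A hedged alternative sketch that avoids the weighted lambda altogether: precompute Price * Balance, aggregate everything with plain sums, then divide (the column name PxB is only illustrative):
import pandas as pd

df = pd.DataFrame(price_dict)

# sum(Price * Balance) / sum(Balance) is the balance-weighted average price
out = (df.assign(PxB=df['Price'] * df['Balance'])
         .groupby('Product')
         .agg(Balance=('Balance', 'sum'),
              PxB=('PxB', 'sum'),
              Value=('Value', 'sum')))
out['Price'] = out['PxB'] / out['Balance']
out = out.drop(columns='PxB')[['Balance', 'Price', 'Value']]
This keeps every aggregation a plain sum, which pivot_table and groupby handle without custom callables.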