Pandas DataFrame: get trend in column

I have a dataframe:
import numpy as np
import pandas as pd

np.random.seed(1)
df1 = pd.DataFrame({'day': [3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
                    'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
                    'price': np.random.randint(1, 30, 11)})
    day  item  price
0     3     1      6
1     4     1     12
2     4     2     13
3     4     2      9
4     5     1     10
5     5     2     12
6     5     3      6
7     5     3     16
8     5     4      1
9     6     3     17
10    6     4      2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day  item
3    1        6
4    1       12
     2       11
5    1       10
     2       12
     3       11
     4        1
6    3       17
     4        2
Name: price, dtype: int64
I want to compute the trend from the groupby series and write it back into the dataframe's price column. The trend is the change in an item's price relative to its price on the previous day:
    day  item  price
0     3     1    nan
1     4     1      6
2     4     2    nan
3     4     2    nan
4     5     1     -2
5     5     2      1
6     5     3    nan
7     5     3    nan
8     5     4    nan
9     6     3      6
10    6     4      1
Please help me code this last step. A one- or two-line solution would be ideal; since the actual dataframe is huge, I would like to avoid iteration.

Hope this helps!
# get the average values
mean_df = df1.groupby(['day', 'item'])['price'].mean().reset_index()
# rename columns
mean_df.columns = ['day', 'item', 'average_price']
# sort by day and item in ascending order
mean_df = mean_df.sort_values(by=['day', 'item'])
# for each item, shift its average price so each row sees the previous day's value
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
# combine with the original df
df1 = pd.merge(df1, mean_df, on=['day', 'item'])
# replace the price by the difference from the previous day's average
df1['price'] = df1['price'] - df1['shifted_average_price']
# drop the unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)
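If you want something closer to the two-liner the question asks for, here is a compact sketch along the same lines (my addition, not part of the answer above; like the merge solution, it treats "previous day" as the item's previous appearance in the data, starting from the original df1):
# per-(day, item) mean, shifted to each item's next appearance
prev = (df1.groupby(['day', 'item'])['price'].mean()
           .groupby(level='item').shift(1)
           .rename('prev_mean'))
# trend = today's price minus the item's previous daily mean
df1['price'] = df1['price'] - df1.join(prev, on=['day', 'item'])['prev_mean']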


Viewing frequency of multiple values in grouped Pandas data frame

I have a data frame with three columns A, B, C, taking numeric values in {1, 2}, {6, 7}, and {11, 12} respectively. I would like to know the following: for what fraction of the observed (A, B) pairs do we have both observations with C=11 and observations with C=12?
I start by entering the dataframe:
df = pd.DataFrame({"A": [1, 2, 1, 1, 2, 1, 1, 2], "B": [6,7,7,6,7,6,6,6], "C": [11,12,11,11,12,12,11,12]})
--------
   A  B   C
0  1  6  11
1  2  7  12
2  1  7  11
3  1  6  11
4  2  7  12
5  1  6  12
6  1  6  11
7  2  6  12
Then I think I need to use groupby. I run:
g = df.groupby(["A", "B"])
g.C.value_counts()
-----------
A  B  C
1  6  11    3
      12    1
   7  11    1
2  6  12    1
   7  12    2
Name: C, dtype: int64
This shows that there is one (A, B) pair for which we have both a C=11 and a C=12, and 3 pairs of (A, B) for which we only have either C=11 or C=12. So I would like pandas to tell me that for 25% of the (A, B) pairs C takes both values, and for 75% it takes only one value.
How can I accomplish this? I would like to do it for a big data frame where I can't just eyeball it from the value_counts; this small dataframe is just to illustrate.
Thanks!
Pass normalize=True
out = df.groupby(["A", "B"]).C.value_counts(normalize=True)
Out[791]:
A  B  C
1  6  11    0.75
      12    0.25
   7  11    1.00
2  6  12    1.00
   7  12    1.00
Name: C, dtype: float64
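If what you ultimately want is the single 25% / 75% split rather than per-group proportions, a small sketch (my addition, not part of the answer above) counts the distinct C values per (A, B) pair first:
# number of distinct C values observed for each (A, B) pair
n_vals = df.groupby(["A", "B"])["C"].nunique()
# fraction of pairs with both C values vs. only one
print((n_vals == 2).mean())                  # 0.25
print(n_vals.value_counts(normalize=True))   # 1 -> 0.75, 2 -> 0.25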

Concatenate all combinations of sub-level columns in a pandas DataFrame

Given the following DataFrame:
cols = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])
example = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)
example
   A     B
   a  b  a   b
0  0  1  2   3
1  4  5  6   7
2  8  9  10  11
I would like to end up with the following one:
    A  B
0   0   2
1   4   6
2   8  10
3   0   3
4   4   7
5   8  11
6   1   2
7   5   6
8   9  10
9   1   3
10  5   7
11  9  11
I used this code:
concatenated = pd.DataFrame([])
for A_sub_col in ('a', 'b'):
    for B_sub_col in ('a', 'b'):
        new_frame = example[[('A', A_sub_col), ('B', B_sub_col)]]
        new_frame.columns = ['A', 'B']
        concatenated = pd.concat([concatenated, new_frame])
However, I strongly suspect that there is a more straightforward, idiomatic way to do this with pandas. How would one go about it?
Here's an option using list comprehension:
pd.concat([
    example[[('A', i), ('B', j)]].droplevel(level=1, axis=1)
    for i in example['A'].columns
    for j in example['B'].columns
]).reset_index(drop=True)
Output:
    A  B
0   0   2
1   4   6
2   8  10
3   0   3
4   4   7
5   8  11
6   1   2
7   5   6
8   9  10
9   1   3
10  5   7
11  9  11
Here is one way. Not sure how much more pythonic this is; it is definitely less readable :-) but on the other hand it does not use explicit loops:
(example
 .apply(lambda c: [list(c)])
 .stack(level=1)
 .apply(lambda c: [list(c)])
 .explode('A')
 .explode('B')
 .apply(pd.Series.explode)
 .reset_index(drop=True)
)
To understand what's going on, it helps to run this one step at a time, but the end result is:
    A  B
0   0   2
1   4   6
2   8  10
3   0   3
4   4   7
5   8  11
6   1   2
7   5   6
8   9  10
9   1   3
10  5   7
11  9  11
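As a small variant of the list-comprehension answer (my addition, just a sketch), the Cartesian product of sub-columns can be made explicit with itertools.product, which generalizes nicely if the sub-levels of A and B ever differ:
from itertools import product

# all (A-sub-column, B-sub-column) combinations, then stack the selections
pairs = product(example['A'].columns, example['B'].columns)
out = pd.concat(
    [example[[('A', i), ('B', j)]].droplevel(1, axis=1) for i, j in pairs],
    ignore_index=True,
)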

Meaning of mode() in pandas

df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})
df5.mode()
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13
Where does the NaN come from here?
The reason is in the documentation for DataFrame.mode:
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
So the missing values arise because column A has only one mode while column B has 3 modes; the shorter column is padded with NaN so that both fit in the same rows.
With my sample data below it is the other way around: A has 2 modes and B only one, because 2 and 3 both appear 11 times in A:
np.random.seed(20)
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})
print(df5.mode())
   A    B
0  2  8.0
1  3  NaN
print(df5.A.value_counts())
3    11   <- both top1
2    11   <- both top1
6     9
5     8
0     5
1     4
4     2
Name: A, dtype: int64
print(df5.B.value_counts())
  8    6   <- only one top1
  0    4
  4    4
 -4    3
 10    3
 -2    3
  1    3
 12    3
  6    3
  7    2
  3    2
  5    2
 -9    2
 -6    2
 14    2
  9    2
 -1    1
 11    1
 -3    1
 -7    1
Name: B, dtype: int64
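If you only want one mode per column and the padding NaNs are in the way, a small sketch (my addition, not part of the answer above) is to take just the first row of the result; mode() sorts the modes, so the padding only ever appears in later rows:
# smallest mode of each column; the NaN padding never lands in row 0
single_mode = df5.mode().iloc[0]
print(single_mode)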

Conditional filter of entire group for DataFrameGroupBy

If I have the following data
>>> data = pd.DataFrame({'day': [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
...                      'hour': [4, 5, 6, 7, 4, 5, 6, 7, 4, 7]})
>>> data
   day  hour
0    1     4
1    1     5
2    1     6
3    1     7
4    2     4
5    2     5
6    2     6
7    2     7
8    3     4
9    4     7
I would like to keep only the days where hour has 4 unique values, so I would think to do something like this:
>>> data.groupby('day').apply(lambda x: x[x['hour'].nunique() == 4])
But this returns KeyError: True
I am hoping to get this
>>> data
   day  hour
0    1     4
1    1     5
2    1     6
3    1     7
4    2     4
5    2     5
6    2     6
7    2     7
Here we see that day == 3 and day == 4 have been filtered out because, when grouped by day, they don't have 4 unique values of hour. I'm doing this at scale, so simply filtering on (day == 3) & (day == 4) is not an option. I think grouping is a good way to do this but I can't get it to work. Does anyone have experience with applying functions to a DataFrameGroupBy?
I think you actually need to filter the data:
>>> data.groupby('day').filter(lambda x: x['hour'].nunique() == 4)
   day  hour
0    1     4
1    1     5
2    1     6
3    1     7
4    2     4
5    2     5
6    2     6
7    2     7
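As a side note (my addition, not from the answer above), the same result can be obtained with a boolean mask built from transform, which avoids calling a Python lambda once per group and tends to scale better on large frames:
# keep rows whose day has exactly 4 distinct hours
mask = data.groupby('day')['hour'].transform('nunique') == 4
result = data[mask]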

Merging dataframes in pandas

I am new to pandas and I am facing the following problem:
I have 2 data frames:
df1:
x    y
1    3 4
2    nan
3    6
4    nan
5    9 2
6    1 4 9

df2:
x    y
1    2 3 6 1 5
2    4 1 8 7 5
3    6 3 1 4 5
4    2 1 3 5 4
5    9 2 3 8 7
6    1 4 5 3 7
The two dataframes are the same size.
I want to merge the two dataframes so that the result I get is the following:
result:
x    y
1    3 4 6 1 5
2    4 1 8 7 5
3    6 3 1 4 5
4    2 1 3 5 4
5    9 2 3 8 7
6    1 4 5 6 7
So in the result, priority is given to df2: if there is a value in df2, it is put first, and the remaining values are taken from df1 (keeping the same position they had in df1). There should be no repeated values in the result (i.e. if a value is in position 1 in df1 and position 3 in df2, that value should appear only in position 1 in the result and not be repeated).
Any kind of help will be appreciated.
Thanks!
IIUC
Setup
df1 = pd.DataFrame(dict(x=range(1, 7),
                        y=[[3, 4], None, [6], None, [9, 2], [1, 4, 9]]))
df2 = pd.DataFrame(dict(x=range(1, 7), y=[[2, 3, 6, 1, 5], [4, 1, 8, 7, 5],
                                          [6, 3, 1, 4, 5], [2, 1, 3, 5, 4],
                                          [9, 2, 3, 8, 7], [1, 4, 5, 3, 7]]))
print(df1)
print()
print(df2)
   x          y
0  1     [3, 4]
1  2       None
2  3        [6]
3  4       None
4  5     [9, 2]
5  6  [1, 4, 9]

   x                y
0  1  [2, 3, 6, 1, 5]
1  2  [4, 1, 8, 7, 5]
2  3  [6, 3, 1, 4, 5]
3  4  [2, 1, 3, 5, 4]
4  5  [9, 2, 3, 8, 7]
5  6  [1, 4, 5, 3, 7]
convert to something more usable:
df1_ = df1.set_index('x').y.apply(pd.Series)
df2_ = df2.set_index('x').y.apply(pd.Series)
print(df1_)
print()
print(df2_)
     0    1    2
x
1  3.0  4.0  NaN
2  NaN  NaN  NaN
3  6.0  NaN  NaN
4  NaN  NaN  NaN
5  9.0  2.0  NaN
6  1.0  4.0  9.0

   0  1  2  3  4
x
1  2  3  6  1  5
2  4  1  8  7  5
3  6  3  1  4  5
4  2  1  3  5  4
5  9  2  3  8  7
6  1  4  5  3  7
Combine with priority given to df1 (I think you meant df1, as that is what is consistent with my interpretation of your question and the expected output you provided), then reduce to eliminate duplicates:
print(df1_.combine_first(df2_).apply(lambda x: x.unique(), axis=1))
   0  1  2  3  4
x
1  3  4  6  1  5
2  4  1  8  7  5
3  6  3  1  4  5
4  2  1  3  5  4
5  9  2  3  8  7
6  1  4  9  3  7
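If you need the result back in the original one-list-per-row layout, a short sketch (my addition, under the same df1-priority interpretation) collapses each combined row into an order-preserving, de-duplicated list:
combined = df1_.combine_first(df2_)
# per row: drop the NaN padding, cast back to int, de-duplicate while preserving order
result = (combined.apply(lambda row: list(pd.unique(row.dropna().astype(int))), axis=1)
                  .rename('y')
                  .reset_index())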