Timeseries: Groupby and calculate variance - pandas

I have the following dataframe with timeseries data:
df = pd.DataFrame(columns = ['id', 'value'])
df['value'] =[9, 16, 10, 12, 11, 14]
df['id'] = [1, 1, 1, 2, 2, 2]
For each timeseries (defined by column 'id') I want to calculate the variance, to find timeseries that do not change at all or change only very little.
The final dataframe should look like this:
df_end = pd.DataFrame(columns = ['id','value', 'var'])
df_end['value'] =[9, 16, 10, 12, 11, 14]
df_end['id'] = [1, 1, 1, 2, 2, 2]
df_end['var'] = [14.3, 14.3, 14.3, 2.3, 2.3, 2.3]
I tried:
df.groupby(df['id']).var()
which gives me the values, but I couldn't put them back into the df in the right form. I am sure there is a handy function for this that I don't know about yet!
Thanks for helping out!

Use GroupBy.transform and specify the value column:
df['var'] = df.groupby('id')['value'].transform('var')
print (df)
id value var
0 1 9 14.333333
1 1 16 14.333333
2 1 10 14.333333
3 2 12 2.333333
4 2 11 2.333333
5 2 14 2.333333
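To then pick out the series that barely change, you can filter on the new column; a minimal sketch, with an arbitrary cutoff chosen only for illustration:
threshold = 5                                 # arbitrary cutoff for "changes only very little"
low_var = df[df['var'] < threshold]           # rows belonging to low-variance series
low_var_ids = low_var['id'].unique()          # the ids of those series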

Related

Pandas : How to Apply a Condition on Every Values of a Dataframe, Based on a Second Symmetrical Dataframe

I have a dictionary with 2 DFs: "quantity variation in %" and "prices". Both DFs have the same shape.
Let's say I want to set the price = 0 if the quantity variation in percentage is greater than 100 %
import numpy as np; import pandas as pd
d = {'qty_pct': pd.DataFrame({'2020': [200, 0.5, 0.4],
                              '2021': [0.9, 0.5, 500],
                              '2022': [0.9, 300, 0.4]}),
     'price': pd.DataFrame({'2020': [-6, -2, -9],
                            '2021': [ 2,  3,  4],
                            '2022': [ 4,  6,  8]})}
# I had something like that in mind ...
df = d['price'].applymap(lambda x: 0 if x[d['qty_pct']] >=1 else x)
P.S. If by any chance there is a way to do this on DFs of different shapes, I would be curious to see how it's done.
Thanks,
I want to obtain this DF :
price = pd.DataFrame({'2020': [ 0, -2, -9],
                      '2021': [ 2,  3,  0],
                      '2022': [ 4,  0,  8]})
Assuming price and qty_pct always have the same shape, you can just do:
d['price'][d['qty_pct'] >= 1] = 0
d['price']
2020 2021 2022
0 0 2 4
1 -2 3 0
2 -9 0 8
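For the P.S.: if the two DataFrames do not have the same shape, a sketch (assuming the labels of price should drive the result) is to align qty_pct onto price with DataFrame.reindex_like and build a mask from it:
# sketch for differently shaped frames: align qty_pct onto price's index/columns first
mask = d['qty_pct'].reindex_like(d['price']) >= 1   # labels missing from qty_pct compare as False
d['price'] = d['price'].mask(mask, 0)               # set price to 0 wherever the mask is True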

how to change value based on criteria pandas

I have a following problem. I have this df:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
I would like to have a new column whose value is 1 if any row with the same id has value 1. See the desired output:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1], 'newvalue': [1, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
How can I do it please?
If you need to set 0/1 by a condition (here: at least one 1 per group), use GroupBy.transform with GroupBy.any to build a mask, then cast to integers to map True/False to 1/0:
df['newvalue'] = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
Alternative:
df['newvalue'] = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
Or, if only 0 and 1 values are possible, simplify the solution by taking the maximum value per group:
df['newvalue'] = df.groupby('id')['value'].transform('max')
print (df)
id value newvalue
0 1 0 1
1 1 1 1
2 2 0 0
3 2 0 0
4 3 1 1
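The same transform pattern works for any per-group condition, not only equality with 1; a small sketch (the threshold and the column name has_large are made up for illustration):
threshold = 0  # hypothetical cutoff, only to illustrate the pattern
df['has_large'] = (df.groupby('id')['value']
                     .transform(lambda s: (s > threshold).any())
                     .astype(int))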

pandas row wise comparison and apply condition

This is my dataframe:
df = pd.DataFrame(
    {
        "name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
        "score": [3, 5, 6, 2, 4, 1],
    }
)
I want to compare the score of bob_x with bob_y and retain the row with the lowest score, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
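If you would rather keep the original rows (including the _x/_y suffixes in name), one option, sketched here against the original df before the name column is overwritten, is to group on the stripped name and select rows with idxmin:
base = df['name'].str.split('_').str[0]                          # 'bob_x' -> 'bob', 'mad' -> 'mad'
result = df.loc[df.groupby(base)['score'].idxmin()].sort_index()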

Add new pandas DataFrame column whose values are an array of random numbers with length gleaned from another column

I have the following DataFrame of dummy data
data = { 'user_id': np.random.randint(1000000, 10000000, size=(10)), 'week': np.random.randint(1, 10, size=(10)) }
df = pd.DataFrame(data = data)
I would like to add a new column whose values are arrays of length week (with those arrays containing random values). None of these work
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(10))
(gives one integer per DataFrame cell, not an array of them)
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(df['week']))
ValueError: Length of values does not match length of index
df.loc[:,'inputs'] = np.random.randint(0, 28, size=(10, df['week']))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
These are obviously all wrong, but I cannot see how to make a new column whose entries are arrays whose length depends on another column's value in the same row.
Use a list comprehension to build one array per row from the week numbers:
df['inputs'] = [np.random.randint(0, 28, size=x) for x in df['week']]
print (df)
user_id week inputs
0 9168288 4 [15, 5, 10, 9]
1 2765768 7 [21, 26, 6, 6, 22, 21, 4]
2 2948278 6 [6, 14, 4, 2, 3, 20]
3 9302275 1 [23]
4 5737115 5 [1, 20, 9, 19, 18]
5 5214343 9 [16, 25, 1, 10, 2, 23, 1, 16, 18]
6 9332184 7 [8, 27, 14, 8, 14, 11, 5]
7 1569483 5 [6, 19, 3, 10, 16]
8 2931319 2 [0, 15]
9 2126334 2 [20, 22]
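If the arrays need to be reproducible, a sketch using NumPy's Generator API (the seed value is arbitrary):
rng = np.random.default_rng(0)  # fixed seed chosen arbitrarily for reproducibility
df['inputs'] = [rng.integers(0, 28, size=x) for x in df['week']]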

Generating boolean dataframe based on contents in series and dataframe

I have:
df = pd.DataFrame(
    [
        [22, 33, 44],
        [55, 11, 22],
        [33, 55, 11],
    ],
    index=["abc", "def", "ghi"],
    columns=list("abc"),
)  # size(3,3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # size(1,5)
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
    [
        [0, 1, 1],
        [1, 0, 1],
        [1, 1, 0],
        [0, 0, 1],
        [1, 1, 0],
    ],
    index=unique,
    columns=df.columns,
)
This new df is true or false depending on whether each value is present in the corresponding column of the original dataframe. For example, the first column has three values: [22, 55, 33]. In a df with dimensions (5,3), this first column would be: [0, 1, 1, 0, 1], i.e. [0, 22, 33, 0, 55].
I tried filter2 = unique.isin(df), but this doesn't work, and neither does notnull. I tried applying a filter, but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with DataFrame.reset_index and DataFrame.pivot, then check for non-missing values with DataFrame.notna, cast to integers to map True/False to 1/0, and finally remove the index and columns names with DataFrame.rename_axis:
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print (df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
The helper Series is not necessary, but if there are more values, or the order from the helper Series is needed, add DataFrame.reindex:
# added 66
unique = pd.Series([11, 22, 33, 44, 55, 66])
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .reindex(unique)
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print (df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
66 0 0 0
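An alternative sketch that avoids the reshape entirely is to check each column of df against unique with Series.isin; values such as the added 66 that appear nowhere come out as 0 automatically:
# build the boolean frame column by column: is each unique value present in that column?
df_new = pd.DataFrame(
    {col: unique.isin(df[col]).astype(int).to_numpy() for col in df.columns},
    index=unique,
)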