Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can take either of two forms. Like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
and like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The only difference between the two is that in the first case one zone (or several, but not all) already has data for the highest time period (column date), while in the second none do. My desired result is to complete the dataframe up to a given period (3 in the example), in the following way in each of the cases:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the result above.
I hope my explanation is clear.
Many thanks in advance.

You can use reindex to create entries for all dates in the range, and then forward-fill the last known value into the new rows.
import pandas as pd

df1 = pd.DataFrame([['A', 1, 154, 2],
                    ['B', 1, 2647, 7],
                    ['C', 1, 0, 0],
                    ['A', 2, 1280, 3],
                    ['B', 2, 6809, 20],
                    ['C', 2, 288, 5],
                    ['A', 3, 2000, 4]],
                   columns=['zone', 'date', 'p1', 'p2'])

# for each zone: index by date, extend the index to the full range
# 1..3, and forward-fill the new rows from the last known values
result = df1.groupby("zone").apply(
    lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5
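If you prefer the flat layout from the question, one possible cleanup (a sketch, assuming the result above, where the zone column is duplicated between the group index and the data):
# drop the 'zone' data column (it is repeated in the group index),
# then turn the (zone, date) index levels back into ordinary columns
result = result.drop(columns='zone').reset_index()
print(result)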

IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna with the max of each zone subgroup.
First, build your index:
import numpy as np

ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n = len(levels[0])
# zone codes cycle through A, B, C while date codes step up every n
# rows, reproducing the date-major row order shown below
codes = [np.tile(np.arange(n), n), np.repeat(np.arange(0, n), n)]
Then, use the pd.MultiIndex constructor to reindex (note that the constructor's labels= keyword was renamed to codes= in pandas 0.24):
df1.set_index(['zone', 'date'])\
   .reindex(pd.MultiIndex(levels=levels, codes=codes))\
   .fillna(df1.groupby(['zone']).max())
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2, change df1 to df2 in this last snippet only, keeping ind, levels and codes as built from df1 above: df2's own date level stops at 2, so the target index has to come from a frame that covers the whole range (or be built explicitly, as in the sketch after the output). You get
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 1280.0 3.0
B 3 6809.0 20.0
C 3 288.0 5.0
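For a standalone df2, with no df1 around to borrow the full index from, here is a minimal sketch; note it swaps the max-based fill for a per-zone forward fill, which yields the same desired result for this data:
# build the complete (zone, date) grid explicitly, since df2's own
# index levels stop at date 2
full_ind = pd.MultiIndex.from_product(
    [sorted(df2['zone'].unique()), range(1, 4)],
    names=['zone', 'date'])
# reindex to the grid, then forward-fill within each zone so the
# missing date-3 rows take their zone's last known values
res = (df2.set_index(['zone', 'date'])
          .reindex(full_ind)
          .groupby(level='zone').ffill()
          .reset_index())
print(res)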
I suggest not copying and pasting the code directly, but rather understanding the process and making slight changes as needed, depending on how your original dataframe differs from what you posted.

Related

Using groupby() and cut() in pandas

I have a dataframe, and for each group I want to label its values: if a value is less than the group mean the label is 1, and if it is more than the group mean the label is 2.
The input dataframe is:
groups num1
0 a 2
1 a 5
2 a NaN
3 b 10
4 b 4
5 b 0
6 b 7
7 c 2
8 c 4
9 c 1
Here the mean values for groups a, b, and c are 3.5, 5.25, and 2.33 respectively, and the output dataframe is:
groups out
0 a 1
1 a 2
2 a NaN
3 b 2
4 b 1
5 b 1
6 b 2
7 c 1
8 c 2
9 c 1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
import numpy as np

df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1']
                                     .transform('mean')),
                     1, 2)
Output (as new column "out" for clarity):
groups num1 out
0 a 2 1
1 a 5 2
2 a NaN 2
3 b 10 2
4 b 4 1
5 b 0 1
6 b 7 2
7 c 2 1
8 c 4 2
9 c 1 1
I really want cut
OK, but it's neither particularly nice nor performant:
(df.groupby('groups')['num1']
   .transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2]))
)
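Note that neither snippet addresses the null question: NaN comparisons evaluate to False, so the row with a missing num1 is labelled 2. A minimal sketch to keep it as NaN instead (assuming the label column is named out as above):
# blank out the labels wherever the source value was missing
df['out'] = df['out'].mask(df['num1'].isna())
Since NaN forces a float dtype, out becomes a float column after this.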

Viewing frequency of multiple values in grouped Pandas data frame

I have a data frame with three column variables A, B, C, taking numeric values in {1,2}, {6,7}, and {11,12} respectively. I would like to know the following: for what fraction of the observed pairs (A,B) do we have both observations with C=11 and observations with C=12?
I start by entering the dataframe:
df = pd.DataFrame({"A": [1, 2, 1, 1, 2, 1, 1, 2], "B": [6,7,7,6,7,6,6,6], "C": [11,12,11,11,12,12,11,12]})
A B C
0 1 6 11
1 2 7 12
2 1 7 11
3 1 6 11
4 2 7 12
5 1 6 12
6 1 6 11
7 2 6 12
Then I think I need to use groupby. I run
g = df.groupby(["A", "B"])
"g.C.value_counts()"
A B C
1 6 11 3
12 1
7 11 1
2 6 12 1
7 12 2
Name: C, dtype: int64
This shows that we have one pair (A,B) for which we have both C=11 and C=12, and three pairs (A,B) for which C takes only one of the two values. So I would like pandas to tell me that for 25% of the (A,B) pairs C takes both values, and for 75% it takes only one.
How can I accomplish this? I would like to do so for a big data frame where I can't just eyeball it from the value_counts; this small dataframe is just to illustrate.
Thanks!
Pass normalize=True:
out = df.groupby(["A", "B"]).C.value_counts(normalize=True)
Out[791]:
A B C
1 6 11 0.75
12 0.25
7 11 1.00
2 6 12 1.00
7 12 1.00
Name: C, dtype: float64
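Note that normalize=True gives proportions within each (A,B) group, which is not quite the fraction of pairs asked for. A minimal sketch for that number (assuming the df above; nunique counts the distinct C values per pair):
# count distinct C values per (A, B) pair, then take the share of
# pairs that saw both values versus only one
counts = df.groupby(["A", "B"])["C"].nunique()
both = (counts >= 2).mean()   # fraction of pairs with both C values
print(both, 1 - both)         # 0.25 0.75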

Calculating temporal and spatial gradients while using groupby in a multi-index pandas dataframe

Say I have the following sample pandas dataframe of water content (i.e. "wc") values at specified depths along a column of soil:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 5, 3, 1], [1, 3, 5, 3, 2], [4, 6, 6, 3, 1],
                   [1, 2, 5, 3, 1], [1, 3, 5, 3, 2], [4, 6, 6, 3, 1]],
                  columns=pd.MultiIndex.from_product([['wc'], [10, 20, 30, 45, 80]]))
df['model'] = [5, 5, 5, 6, 6, 6]
df['time'] = [0, 1, 2, 0, 1, 2]
df.set_index(['time', 'model'], inplace=True)
>> df
[Out]:
wc
10 20 30 45 80
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
I would like to calculate the spatial (between-columns) and temporal (between-rows) gradients for each model "group", in the following structure:
wc temp_grad spat_grad
10 20 30 45 80 10 20 30 45 80 10 20 30 45
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
My attempt involved writing a function first for the temporal gradients and combining this with groupby:
def temp_grad(df):
    temp_grad = np.gradient(df[('wc', 10.0)], df.index.get_level_values(0))
    return pd.Series(temp_grad, index=df.index)

df[('temp_grad', 10.0)] = (df.groupby(level=['model'], group_keys=False)
                             .apply(temp_grad))
but I am not sure how to extend this to all of the wc columns, nor how to navigate the multi-indexing issues.
Assuming the function you wrote is actually what you want, then for temp_grad you can handle all the columns at once inside the apply: use np.gradient the same way you did in your function, but specify axis=0 (rows), and build a dataframe with the same index and columns as the original data. For spat_grad, the model does not really matter, so no groupby is needed: apply np.gradient directly to df['wc'], this time along axis=1 (columns), and build a dataframe the same way. To get the expected output, concat all three of them:
df = pd.concat([
    df['wc'],  # original data
    # add the temp_grad
    df['wc'].groupby(level=['model'], group_keys=False)
            .apply(lambda x:  # do all the columns at once, specifying the axis in gradient
                   pd.DataFrame(np.gradient(x, x.index.get_level_values(0), axis=0),
                                columns=x.columns, index=x.index)),  # build a dataframe
    # for spat, no need of groupby as it is a row-wise operation;
    # change the axis, and the values for the x
    pd.DataFrame(np.gradient(df['wc'], df['wc'].columns, axis=1),
                 columns=df['wc'].columns, index=df['wc'].index)
    ],
    keys=['wc', 'temp_grad', 'spat_grad'],  # redefine the multiindex columns
    axis=1  # concat along the columns
)
and you get
print(df)
wc temp_grad spat_grad \
10 20 30 45 80 10 20 30 45 80 10 20
time model
0 5 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 5 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 5 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
0 6 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 6 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 6 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
30 45 80
time model
0 5 0.126667 -0.110476 -0.057143
1 5 0.066667 -0.101905 -0.028571
2 5 -0.080000 -0.157143 -0.057143
0 6 0.126667 -0.110476 -0.057143
1 6 0.066667 -0.101905 -0.028571
2 6 -0.080000 -0.157143 -0.057143
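(For reference, np.gradient uses second-order central differences in the interior of each axis and one-sided differences at the boundaries, which is why the edge rows and columns above are not simple one-step diffs.)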

Assign column values from another dataframe with repeating key values

Please help me with this in pandas, I can't find a good solution.
I've tried map, assign, merge, join, set_index.
Maybe I'm just too tired :)
df:
m_num A B
0 1 0 9
1 1 1 8
2 2 2 7
3 2 3 6
4 3 4 5
5 3 5 4
df1:
m_num C
0 2 99
1 2 88
df_final:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99
3 2 3 6 88
4 3 4 5 NaN
5 3 5 4 NaN
Try:
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
df2 = pd.merge(df2,df1,on=[df1.index,'m_num']).drop('key_0',axis=1)
df2 = pd.merge(df,df2,on=['m_num','A','B'],how='left')
print(df2)
Prints:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99.0
3 2 3 6 88.0
4 3 4 5 NaN
5 3 5 4 NaN
Explanation:
There may be better solutions out there, but this was my thought process. The problem is slightly tricky because 'm_num' is the only common key and it has repeating values.
So first I created a dataframe holding only the df rows whose m_num appears in df1, so that I can use the index as another key for the subsequent merge.
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
This prints:
m_num A B
0 2 2 7
1 2 3 6
As you can see, we now have index values 0 and 1 in addition to m_num as keys, which we can use to match with df1.
df2 = pd.merge(df2,df1,on=[df1.index,'m_num']).drop('key_0',axis=1)
This prints:
m_num A B C
0 2 2 7 99
1 2 3 6 88
Then tie the resulting dataframe back to the original df with a left join to get the output.
df2 = pd.merge(df,df2,on=['m_num','A','B'],how='left')
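An alternative sketch that avoids the intermediate frame by numbering repeated keys (groupby.cumcount gives each repeated m_num an occurrence number 0, 1, ... in both frames; this assumes rows should pair up by order of appearance):
# number repeated m_num values in order of appearance in both frames
df['occ'] = df.groupby('m_num').cumcount()
df1['occ'] = df1.groupby('m_num').cumcount()
# merge on the (m_num, occurrence) pair, then drop the helper column
df_final = df.merge(df1, on=['m_num', 'occ'], how='left').drop(columns='occ')
print(df_final)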

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3 I need value_mean = 1.5 (because (1+2)/2). And of course, there will be NaN values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea how I can do this, please? I found this, but it seems not relevant to my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np

# sort the values as they need to be counted
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)

# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)
ser_same_name = df['name'] == df['name'].shift(1)

# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)

# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum'] = ser_sum
#df['count'] = ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
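A more compact alternative sketch, using expanding().mean() plus a shift inside transform (same sort-first idea as above):
import pandas as pd

# sort so the expanding window runs in round order within each name
df = df.sort_values(['name', 'round']).reset_index(drop=True)

# running mean of all earlier rounds per name; the shift makes it
# lagged, so round 1 of every name gets NaN
df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))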