Calculating temporal and spatial gradients while using groupby in multi-index pandas dataframe - pandas

Say I have the following sample pandas dataframe of water content (i.e. "wc") values at specified depths along a column of soil:
import pandas as pd
df = pd.DataFrame([[1, 2,5,3,1], [1, 3, 5,3, 2], [4, 6, 6,3,1], [1, 2,5,3,1], [1, 3, 5,3, 2], [4, 6, 6,3,1]], columns=pd.MultiIndex.from_product([['wc'], [10, 20, 30, 45, 80]]))
df['model'] = [5,5, 5, 6,6,6]
df['time'] = [0, 1, 2,0, 1, 2]
df.set_index(['time', 'model'], inplace=True)
>> df
[Out]:
wc
10 20 30 45 80
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
I would like to calculate the spatial (between columns) and temporal (between rows) gradients for each model "group" in the following structure:
wc temp_grad spat_grad
10 20 30 45 80 10 20 30 45 80 10 20 30 45
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
My attempt involved writing a function first for the temporal gradients and combining this with groupby:
import numpy as np

def temp_grad(df):
    # gradient of the depth-10 column with respect to the time index level
    grad = np.gradient(df[('wc', 10.0)], df.index.get_level_values(0))
    return pd.Series(grad, index=df.index)
df[('temp_grad', 10.0)] = (df.groupby(level=['model'], group_keys=False)
                             .apply(temp_grad))
but I am not sure how to automate this to apply for all wc columns as well as navigate the multi-indexing issues.

Assuming your function computes what you actually want, you can handle all columns at once inside the apply for temp_grad: use np.gradient the same way you did in your function, but specify axis=0 (rows), then build a dataframe with the same index and columns as the original data. For spat_grad, I think the model does not really matter, so there is no need for groupby: call np.gradient directly on df['wc'], this time along axis=1 (columns), and build a dataframe the same way. To get the expected output, concat all three like this:
import numpy as np

df = pd.concat([
    df['wc'],  # original data
    # add the temp_grad
    df['wc'].groupby(level=['model'], group_keys=False)
            .apply(lambda x:  # do all the columns at once, specifying the axis in gradient
                   pd.DataFrame(np.gradient(x, x.index.get_level_values(0), axis=0),
                                columns=x.columns, index=x.index)),  # build a dataframe
    # for spat, no need of groupby as it is a row-wise operation
    # change the axis, and the values for the x
    pd.DataFrame(np.gradient(df['wc'], df['wc'].columns, axis=1),
                 columns=df['wc'].columns, index=df['wc'].index)
    ],
    keys=['wc', 'temp_grad', 'spat_grad'],  # redefine the multiindex columns
    axis=1  # concat along the columns
)
and you get
print(df)
wc temp_grad spat_grad \
10 20 30 45 80 10 20 30 45 80 10 20
time model
0 5 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 5 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 5 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
0 6 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 6 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 6 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
30 45 80
time model
0 5 0.126667 -0.110476 -0.057143
1 5 0.066667 -0.101905 -0.028571
2 5 -0.080000 -0.157143 -0.057143
0 6 0.126667 -0.110476 -0.057143
1 6 0.066667 -0.101905 -0.028571
2 6 -0.080000 -0.157143 -0.057143
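To see what np.gradient does with unevenly spaced coordinates, here is a minimal sketch on a plain NumPy array holding the values of one model from the question. The names depths, times, spat and temp are only illustrative and are not part of the answer above.

```python
import numpy as np

# water content for one model: rows are times 0, 1, 2; columns are depths
depths = np.array([10, 20, 30, 45, 80])
times = np.array([0, 1, 2])
wc = np.array([[1, 2, 5, 3, 1],
               [1, 3, 5, 3, 2],
               [4, 6, 6, 3, 1]], dtype=float)

# spatial gradient: differentiate along the columns, using the depths as coordinates
spat = np.gradient(wc, depths, axis=1)

# temporal gradient: differentiate along the rows, using the times as coordinates
temp = np.gradient(wc, times, axis=0)

print(spat[0].round(6))  # first row, matches the spat_grad row for time 0
print(temp[:, 0])        # first column, matches the temp_grad column for depth 10
```

Passing the coordinate array as the second argument is what makes the uneven depth spacing (10, 20, 30, 45, 80) count; without it np.gradient assumes unit spacing.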

Related

Meaning of mode() in pandas

df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})
df5.mode()
A B
0 1.0 -9
1 NaN 10
2 NaN 13
Where do the NaN values come from here?
The reason is given in the documentation of DataFrame.mode:
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
So the missing values mean that column A has only one mode value while column B has three; the shorter column is padded with missing values so that both columns have the same number of rows.
In my sample data it is the other way round: column A has two modes and column B only one, because 2 and 3 both appear 11 times in A:
np.random.seed(20)
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})
print (df5.mode())
A B
0 2 8.0
1 3 NaN
print (df5.A.value_counts())
3 11 <- both top1
2 11 <- both top1
6 9
5 8
0 5
1 4
4 2
Name: A, dtype: int64
print (df5.B.value_counts())
8 6 <- only one top1
0 4
4 4
-4 3
10 3
-2 3
1 3
12 3
6 3
7 2
3 2
5 2
-9 2
-6 2
14 2
9 2
-1 1
11 1
-3 1
-7 1
Name: B, dtype: int64
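The padding is easier to see on a tiny deterministic frame. This is a minimal sketch with made-up data (the name modes and the values are only for illustration):

```python
import pandas as pd

# A has a single most frequent value (1); B has a tie between 4 and 5
df = pd.DataFrame({"A": [1, 1, 2, 3],
                   "B": [4, 4, 5, 5]})

modes = df.mode()
print(modes)
# Column B needs two rows to list both of its modes, so column A is
# padded with NaN in the second row (which also turns A into floats).
```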

Operations with multiple dataframes partially sharing indexes in pandas

I have two dataframes: (i) one has two index levels and two header levels, and (ii) the other has one index and one header. The second level of each axis in the first dataframe relates to the corresponding axis of the second dataframe. I need to multiply both dataframes based on that relation between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd
df = pd.DataFrame([[9,10,2,1,6,5],
[4, 0,3,4,6,6],
[9, 3,9,1,2,3],
[3, 5,9,3,9,0],
[4,4,8,5,10,5],
[5, 3,1,8,5,6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3+[2021]*3,[1,2,3,1,2,3]])
df.index = pd.MultiIndex.from_arrays([[1]*3+[2]*3,[1,2,3,1,2,3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1,.3,.6],[.4,.4,.3],[.5,.4,.1]], index=[1,2,3], columns=[1,2,3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
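As a cross-check, the same result can be built by hand: expand df2 to the shape of df by looking up the second level of each axis, then multiply element-wise. This is only a sketch of an alternative, not what mul does internally; the names expanded and df_check are made up for the example.

```python
import pandas as pd

df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
df2 = pd.DataFrame([[.1, .3, .6], [.4, .4, .3], [.5, .4, .1]],
                   index=[1, 2, 3], columns=[1, 2, 3])

# repeat the rows/columns of df2 so they line up with the second level of df
expanded = df2.reindex(index=df.index.get_level_values(1),
                       columns=df.columns.get_level_values(1))

# multiply position by position; the labels of df are kept
df_check = df * expanded.to_numpy()
print(df_check)
```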

Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can be both like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
and like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The difference between the two is only that the case may arise in which one, or several but not all, zones do have data for the highest of the time periods (column date). My desired result is to be able to complete the dataframe until a certain period of time (3 in the example), in the following way in each of the cases:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the previous result.
I hope my explanation was understood.
Many thanks in advance.
You can use reindex to create entries for all dates in the range, and then forward fill the last value into it.
import pandas as pd
df1 = pd.DataFrame([['A', 1,154, 2],
['B', 1,2647, 7],
['C', 1,0, 0],
['A', 2,1280, 3],
['B', 2,6809, 20],
['C', 2,288, 5],
['A', 3,2000, 4]],
columns=['zone', 'date', 'p1', 'p2'])
result = df1.groupby("zone").apply(lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5
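For the second case (df2, where no zone has data for date 3 yet), a similar idea can be written without apply. This is a minimal sketch, assuming the target period is 3 and that carrying the last known value forward is what you want; full_index and completed are names chosen just for this example.

```python
import pandas as pd

df2 = pd.DataFrame([['A', 1, 154, 2],
                    ['B', 1, 2647, 7],
                    ['C', 1, 0, 0],
                    ['A', 2, 1280, 3],
                    ['B', 2, 6809, 20],
                    ['C', 2, 288, 5]],
                   columns=['zone', 'date', 'p1', 'p2'])

# every (zone, date) pair that should exist, up to date 3
full_index = pd.MultiIndex.from_product(
    [sorted(df2['zone'].unique()), range(1, 4)], names=['zone', 'date'])

completed = (df2.set_index(['zone', 'date'])
                .reindex(full_index)      # missing pairs become NaN rows
                .groupby(level='zone')
                .ffill()                  # carry the last value forward within each zone
                .reset_index())
print(completed)
```

Note that the reindex step introduces NaN, so p1 and p2 come back as floats; cast them back with astype(int) if that matters.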
IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna to get the max from each subgroup of zone you have.
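First, build your index (this needs import numpy as np):
ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n_zone, n_date = len(levels[0]), len(levels[1])
# zone codes cycle fastest so that the rows come out ordered by date
codes = [np.tile(np.arange(n_zone), n_date), np.repeat(np.arange(n_date), n_zone)]
Then, use the pd.MultiIndex constructor to reindex (the labels argument was renamed to codes in pandas 0.24):
df1.set_index(['zone', 'date'])\
.reindex(pd.MultiIndex(levels=levels, codes=codes))\
.fillna(df1.groupby(['zone']).max())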
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2, just change from df1 in this last line of code to df2 and you get
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
I suggest not to copy/paste directly the code and try to run, but rather try to understand the process and make slight changes if needed depending on how different your original data frame is from what you posted.

Concat and append in pandas dataframe

I have three data frames with the same dimensions, and I need to concatenate them into a single data frame.
df1 = pd.DataFrame({'AD': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007'],
'FC': [0.5, 0.7, 0.7, 2.6, 2.9],
'EX':['12', '13', '14', '15', '16'],
't' : [2, 2, 3, 3, 3],
'P' :[3,7,8,9,1]})
df2 = df1.copy()
df3 = df1.copy()
df = df1.append([df2, df3])
I tried append and concat; both return a data frame without the first column.
This is what I tried,
pd.concat([df1,df2,df3]) and df1.append([df2,df3])
Concat works if I set the first column of all data frames as index using df1.set_index('col1') and so for df2 and df3. Then with pd.concat it works, not otherwise. Would be great if there is a direct solution
Thank you
Is this what you are looking for?
pd.concat([df1,df2,df3], ignore_index=True)
AD EX FC P t
0 CTA15 12 0.5 3 2
1 CTA15 13 0.7 7 2
2 AC007 14 0.7 8 3
3 AC007 15 2.6 9 3
4 AC007 16 2.9 1 3
5 CTA15 12 0.5 3 2
6 CTA15 13 0.7 7 2
7 AC007 14 0.7 8 3
8 AC007 15 2.6 9 3
9 AC007 16 2.9 1 3
10 CTA15 12 0.5 3 2
11 CTA15 13 0.7 7 2
12 AC007 14 0.7 8 3
13 AC007 15 2.6 9 3
14 AC007 16 2.9 1 3
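Here is a self-contained version of the above as a small sketch. It only uses pd.concat, because DataFrame.append was removed in pandas 2.0; the name combined is just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'AD': ['CTA15', 'CTA15', 'AC007', 'AC007', 'AC007'],
                    'FC': [0.5, 0.7, 0.7, 2.6, 2.9],
                    'EX': ['12', '13', '14', '15', '16'],
                    't': [2, 2, 3, 3, 3],
                    'P': [3, 7, 8, 9, 1]})
df2 = df1.copy()
df3 = df1.copy()

# stack the three frames on top of each other and renumber the rows 0..14
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(combined)
```

All five columns, including AD, are kept; ignore_index=True only replaces the repeated 0..4 row labels with a fresh range.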

Create a new column using a shift within groupby values

I want to create a new column which is the result of a shift function applied to grouped values.
df = pd.DataFrame({'X': [0,1,0,1,0,1,0,1], 'Y':[2,4,3,1,2,3,4,5]})
df
X Y
0 0 2
1 1 4
2 0 3
3 1 1
4 0 2
5 1 3
6 0 4
7 1 5
def func(x):
    x['Z'] = test['Y']-test['Y'].shift(1)
    return x
df_new = df.groupby('X').apply(func)
X Y Z
0 0 2 NaN
1 1 4 2.0
2 0 3 -1.0
3 1 1 -2.0
4 0 2 1.0
5 1 3 1.0
6 0 4 1.0
7 1 5 1.0
As you can see from the output, the values are shifted sequentially without accounting for the groupby.
I have seen a similar question, but I could not figure out why it does not work as expected.
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation
The values are shifted without accounting for the groups because your func uses test (presumably some other object, likely another name for what you call df) directly instead of simply the group x.
def func(x):
    x['Z'] = x['Y']-x['Y'].shift(1)
    return x
gives me
In [8]: df_new
Out[8]:
X Y Z
0 0 2 NaN
1 1 4 NaN
2 0 3 1.0
3 1 1 -3.0
4 0 2 -1.0
5 1 3 2.0
6 0 4 2.0
7 1 5 2.0
but note that in this particular case you don't need to write a custom function, you can just call diff on the groupby object directly. (Of course other functions you might want to work with may be more complicated).
In [13]: df_new["Z2"] = df.groupby("X")["Y"].diff()
In [14]: df_new
Out[14]:
X Y Z Z2
0 0 2 NaN NaN
1 1 4 NaN NaN
2 0 3 1.0 1.0
3 1 1 -3.0 -3.0
4 0 2 -1.0 -1.0
5 1 3 2.0 2.0
6 0 4 2.0 2.0
7 1 5 2.0 2.0