How to append to a data frame from multiple loops - pandas

I have code that reads files from CSV and takes a price difference, but to make it simpler I made the reproducible example below. I want to append each result to the end of a specific column, named after the current size and minute. For example, the first loop iteration goes through size 1 and minute 1, so it should append to the column named 1;1, for file2, file3, and file4. So the output should be:
1;1  1;2  1;3  2;1  2;2  2;3
0    0    0    (the 2;x columns hold the same values as the 1;x columns)
0    0    0
2    2    2
2    2    2
4    4    4
4    4    4
5    5    5
0    0    0
0    0    0
0    0    0
2    2    2
2    2    2
4    4    4
4    4    4
6    6    6
6    6    6
0    0    0
0    0    0
0    0    0
2    2    2
2    2    2
4    4    4
4    4    4
6    6    6
7    7    7
I am using a loop to build the prefixed data frame columns, because in my original code the number of minutes, sizes, and files is supplied by the user.
import numpy as np
import pandas as pd
file =[1,2,3,4,5,6,6,2]
file2=[1,2,3,4,5,6,7,8]
file3=[1,2,3,4,5,6,7,9]
file4=[1,2,1,2,1,2,1,2]
size=[1,2]
minutes=[1,2,3]
data=pd.DataFrame(file)
data2=pd.DataFrame(file2)
data3=pd.DataFrame(file3)
list1=(data,data2,data3)
datas=pd.DataFrame(file4)
col_names = [str(sizer)+';'+str(number) for sizer in size for number in minutes]
datanew=pd.DataFrame(columns=col_names)
for sizes in size:
    for minute in minutes:
        for files in list1:
            pricediff=files-data
            datanew[str(sizes)+';'+str(minute)]=datanew[str(sizes)+';'+str(minute)].append(pricediff,ignore_index=True)
print(datanew)
Edit: When trying this line: datanew=datanew.append({str(sizes)+';'+str(minute): df['pricediff']},ignore_index=True) it appends the data, but the result isn't "clean".
The result from my original data gives me this:
111;5.0,1111;5.0
"0 4.5
1 0.5
2 8
3 8
4 8
...
704 3.5
705 0.5
706 11.5
707 0.5
708 9.0
Name: pricediff, Length: 709, dtype: object",
"price 0.0
0 0.0
Name: pricediff, dtype: float64",
"0 6.5
1 6.5
2 3.5
3 13.0
Name: pricediff, Length: 4, dtype: float64",

What you are looking for is:
datanew=datanew.append({str(sizes)+';'+str(minute): pricediff}, ignore_index=True)
This happens because you cannot change the length of a single column of a DataFrame without changing the length of the whole DataFrame.
Now consider the below as an example:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqr"), "b": [1,3,5,4,2,7], "c": list("pqrtuv")})
print(df)
#this will fail:
#df["c"]=df["c"].append("abc", ignore_index=True)
#print(df)
#what you can do instead:
df=df.append({"c": "abc"}, ignore_index=True)
print(df)
#you can even create new column that way:
df=df.append({"x": "abc"}, ignore_index=True)
Edit
To append a pd.Series, convert it to a one-column frame first; passing the Series inside a dict would nest the whole Series into a single cell, which is exactly the "not clean" result shown above:
abc=pd.Series([-1,-2,-3], name="c")
df=df.append(abc.to_frame(), ignore_index=True)
print(df)
abc=pd.Series([-1,-2,-3], name="x")
df=df.append(abc.to_frame(), ignore_index=True)

Related

pandas function to find index of the first future instance where column is less than each row's value

I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index  Val
0        1
1        2
2        3
3        4
4        0
5        1
6       -1
7       -2
8       -3
I am looking for the output:
Index  F(Val)
0        4
1        4
2        4
3        4
4        6
5        6
6        7
7        8
8      NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but obviously this is extremely slow on the large dataset I am working with, and not a very elegant or optimal solution. My hope is that the solution is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist): how might the maximum of Val between each row's index and its F(Val) index be computed using vectorization? The output should look like:
Index  G(Val)
0        4
1        4
2        4
3        4
4        1
5        1
6       -1
7       -2
8      NaN
Thanks!
You can use:
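# grp starts a new group one row after each decrease in Val, so each group's
# last index is the position of the next smaller value; shift(-1) aligns that
# position with the row asking the question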
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
Using numpy broadcasting and the lower triangle (note this materializes an n×n boolean matrix, so it suits moderately sized inputs):
a = df['Val'].to_numpy()
m = np.tril(a[:,None]<=a, k=-1)
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
                  .apply(lambda x: s.idxmax() if len(s := (x.iloc[-2::-1] <= x.iloc[-1]))
                         else np.nan)
                )
Output (computed on slightly different sample values, with a difference to the provided expected output):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
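For the bonus question, once F(Val) is available from any of the solutions above, a simple sketch (the comprehension itself is not vectorized) takes the maximum of Val over each slice from a row's index to its F(Val) index:
a = df['Val'].to_numpy()
f = df['F(Val)'].to_numpy()
# max of Val between each index and its F(Val) index (inclusive), NaN where F(Val) is NaN
df['G(Val)'] = [a[i:int(j) + 1].max() if not np.isnan(j) else np.nan
                for i, j in enumerate(f)]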

Calculating temporal and spatial gradients while using groupby in a multi-index pandas dataframe

Say I have the following sample pandas dataframe of water content (i.e. "wc") values at specified depths along a column of soil:
import pandas as pd
df = pd.DataFrame([[1, 2, 5, 3, 1], [1, 3, 5, 3, 2], [4, 6, 6, 3, 1],
                   [1, 2, 5, 3, 1], [1, 3, 5, 3, 2], [4, 6, 6, 3, 1]],
                  columns=pd.MultiIndex.from_product([['wc'], [10, 20, 30, 45, 80]]))
df['model'] = [5,5, 5, 6,6,6]
df['time'] = [0, 1, 2,0, 1, 2]
df.set_index(['time', 'model'], inplace=True)
>> df
[Out]:
wc
10 20 30 45 80
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
I would like to calculate the spatial (between columns) and temporal (between rows) gradients for each model "group", in the following structure:
wc temp_grad spat_grad
10 20 30 45 80 10 20 30 45 80 10 20 30 45
time model
0 5 1 2 5 3 1
1 5 1 3 5 3 2
2 5 4 6 6 3 1
0 6 1 2 5 3 1
1 6 1 3 5 3 2
2 6 4 6 6 3 1
My attempt involved writing a function first for the temporal gradients and combining this with groupby:
def temp_grad(df):
    temp_grad = np.gradient(df[('wc', 10.0)], df.index.get_level_values(0))
    return pd.Series(temp_grad, index=df.index)

df[('temp_grad', 10.0)] = (df.groupby(level=['model'], group_keys=False)
                             .apply(temp_grad))
but I am not sure how to automate this to apply to all wc columns, or how to navigate the multi-indexing issues.
Assuming the function you wrote does what you want: for temp_grad, you can handle all the columns at once inside the apply. Use np.gradient the same way you did in your function, but specify axis=0 (rows), and build a DataFrame with the same index and columns as the original data. For spat_grad, the model grouping does not matter, so the groupby is unnecessary; call np.gradient directly on df['wc'], this time along axis=1 (columns), and build a DataFrame the same way. To get the expected output, concat all three of them:
df = pd.concat([
    df['wc'],  # original data
    # temp_grad: do all the columns at once, specifying the axis in gradient
    df['wc'].groupby(level=['model'], group_keys=False)
            .apply(lambda x: pd.DataFrame(  # build a dataframe
                       np.gradient(x, x.index.get_level_values(0), axis=0),
                       columns=x.columns, index=x.index)),
    # spat_grad: no groupby needed, it is a row-wise operation;
    # change the axis, and use the depth columns as the x values
    pd.DataFrame(np.gradient(df['wc'], df['wc'].columns, axis=1),
                 columns=df['wc'].columns, index=df['wc'].index)
],
    keys=['wc', 'temp_grad', 'spat_grad'],  # redefine the multiindex columns
    axis=1  # concat along the columns
)
and you get
print(df)
wc temp_grad spat_grad \
10 20 30 45 80 10 20 30 45 80 10 20
time model
0 5 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 5 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 5 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
0 6 1 2 5 3 1 0.0 1.0 0.0 0.0 1.0 0.1 0.2
1 6 1 3 5 3 2 1.5 2.0 0.5 0.0 0.0 0.2 0.2
2 6 4 6 6 3 1 3.0 3.0 1.0 0.0 -1.0 0.2 0.1
30 45 80
time model
0 5 0.126667 -0.110476 -0.057143
1 5 0.066667 -0.101905 -0.028571
2 5 -0.080000 -0.157143 -0.057143
0 6 0.126667 -0.110476 -0.057143
1 6 0.066667 -0.101905 -0.028571
2 6 -0.080000 -0.157143 -0.057143
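Since the question mentions navigating the multi-indexing: individual pieces of the result can be selected with tuple keys on the column MultiIndex, for example:
# temporal gradient at depth 10, for all times and models
print (df[('temp_grad', 10)])
# or every spatial-gradient column at once
print (df['spat_grad'])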

Meaning of mode() in pandas

df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})
df5.mode()
A B
0 1.0 -9
1 NaN 10
2 NaN 13
Where does the NaN come from here?
The reason is in DataFrame.mode:
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
So the missing values appear because column A has only one mode while column B has three; the result gets as many rows as the largest number of modes, and the shorter columns are padded with NaN. In my sample data below it is the other way around: A has two modes and B only one, because 2 and 3 both appear 11 times in the data:
np.random.seed(20)
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})
print (df5.mode())
A B
0 2 8.0
1 3 NaN
print (df5.A.value_counts())
3 11 <- both top1
2 11 <- both top1
6 9
5 8
0 5
1 4
4 2
Name: A, dtype: int64
print (df5.B.value_counts())
8 6 <- only one top1
0 4
4 4
-4 3
10 3
-2 3
1 3
12 3
6 3
7 2
3 2
5 2
-9 2
-6 2
14 2
9 2
-1 1
11 1
-3 1
-7 1
Name: B, dtype: int64
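If you want a single mode per column (and no NaN padding), take the first row of the result:
# mode() sorts the modes, so this keeps the smallest when there are ties
print (df5.mode().iloc[0])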

pandas mean per row in chunks of size 5

I have a dataframe of shape [100, 50000] and I want to reduce it by taking the mean per row in chunks of 5 (so I will get a dataframe of shape [100, 10000]).
For example, if the row is
[1,8,-1,0,2, 6,8,11,4,6]
the output will be
[2,7]
What is the most efficient way to do so?
Thanks
If shape (100, 50000) means 100 rows and 50000 columns, the solution is GroupBy.mean with a helper array built by np.arange from the number of columns, and axis=1:
df = pd.DataFrame([[1, 8, -1, 0, 2, 6, 8, 11, 4, 6],
                   [1, 8, -1, 0, 2, 6, 8, 11, 4, 6]])
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 8 -1 0 2 6 8 11 4 6
1 1 8 -1 0 2 6 8 11 4 6
print (df.shape)
(2, 10)
df = df.groupby(np.arange(len(df.columns)) // 5, axis=1).mean()
print (df)
0 1
0 2 7
1 2 7
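Note that groupby(..., axis=1) is deprecated in recent pandas versions. A reshape-based sketch that computes the same chunked row means from the original df (before the reassignment above), assuming the column count is a multiple of 5:
# reshape each row into chunks of 5 and average within each chunk
chunked = pd.DataFrame(df.to_numpy().reshape(len(df), -1, 5).mean(axis=2),
                       index=df.index)
print(chunked)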
If shape (100, 50000) means 100 columns and 50000 rows, the solution is GroupBy.mean with a helper array built by np.arange from the length of the DataFrame:
df = pd.DataFrame({'a': [1, 8, -1, 0, 2, 6, 8, 11, 4, 6],
                   'b': [1, 8, -1, 0, 2, 6, 8, 11, 4, 6]})
print (df)
a b
0 1 1
1 8 8
2 -1 -1
3 0 0
4 2 2
5 6 6
6 8 8
7 11 11
8 4 4
9 6 6
print (df.shape)
(10, 2)
df = df.groupby(np.arange(len(df)) // 5).mean()
print (df)
a b
0 2 2
1 7 7

How to find average of two tables in pandas?

I have one table with 1000s of rows that looks like this:
file1:
apples1 + hate 0 0 0 2 4 6 0 1
apples2 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 2 0 4 4 6 0 2
and another file, file2, with the same layout; note that some labels present in file2 (e.g. apples3) are missing from file1:
apples1 + hate 0 0 0 1 4 6 0 2
apples2 + hate 0 1 0 6 4 6 0 2
apples3 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 1 0 3 4 3 0 1
I want to compare the two files in pandas and average the values across entries common to both files. I do not want to keep rows that are in one file only. So the resultant file would look like:
apples1 + hate 0 0 0 1.5 4 6 0 1.5
apples2 + hate 0 1.5 0 5 4 6 0 2
apples4 + hate 0 1.5 0 3.5 4 4.5 0 1.5
There are two steps in this solution:
1. Concatenate your dataframes by stacking them vertically (axis=0, the default) using pandas.concat(...), specifying join='inner' to keep only the columns that are present in all the dataframes.
2. Call mean(...) on the resultant dataframe.
Example:
In [1]: df1 = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a','b','c'])
In [2]: df2 = pd.DataFrame([[1,2],[3,4]], columns=['a','c'])
In [3]: df1
Out[3]:
a b c
0 1 2 3
1 4 5 6
In [4]: df2
Out[4]:
a c
0 1 2
1 3 4
In [5]: df3 = pd.concat([df1, df2], join='inner')
In [6]: df3
Out[6]:
a c
0 1 3
1 4 6
0 1 2
1 3 4
In [7]: df3.mean()
Out[7]:
a 2.25
c 3.75
dtype: float64
Let's try this:
df1 = pd.read_csv('file1', header=None)
df2 = pd.read_csv('file2', header=None)
Set the index to the first three columns, i.e. "apples1 + hate":
df1 = df1.set_index([0,1,2])
df2 = df2.set_index([0,1,2])
Let's use merge to inner-join the dataframes on their indexes, then group columns with the same name and aggregate with mean:
df1.merge(df2, right_index=True, left_index=True)\
   .pipe(lambda x: x.groupby(x.columns.str.extract(r'(\w+)_[xy]', expand=False),
                             axis=1, sort=False).mean())\
   .reset_index()
Output:
0 1 2 3 4 5 6 7 8 9 10
0 apples1 + hate 0.0 0.0 0.0 1.5 4.0 6.0 0.0 1.5
1 apples2 + hate 0.0 1.5 0.0 5.0 4.0 6.0 0.0 2.0
2 apples4 + hate 0.0 1.5 0.0 3.5 4.0 4.5 0.0 1.5
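For completeness, a shorter sketch of the same inner-join average: with df1 and df2 indexed on the first three columns as above, DataFrame arithmetic aligns on the index, labels present in only one file become NaN, and dropna then removes them:
out = df1.add(df2).div(2).dropna().reset_index()
print(out)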