apply() function to generate new value in a new column - pandas

I am new to python 3 and pandas. I tried to add a new column into a dataframe where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO
x="""a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
new=pd.read_csv(df)
print (new)
y=new.copy()
y.loc[:,"d"]=0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"]=y["d"].apply(lambda x:y["a"]-y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.

You need y only for DataFrame for DataFrame.apply with axis=1 for process by rows:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For better debugging is possible create custom function:
def f(x):
print (x)
a = x["a"]-x["b"]
return a
y["d"]= y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
Better solution if need only subtract columns:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1

Related

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplifed sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
index = timestamps )
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
[
df.iloc[i : i + window_size, :]
.T.reset_index()
.assign(other_index=i)
.set_index(["other_index", "index"])
.set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
for i in range(df.shape[0] - window_size + 1)
]
)
new_df.index.names = ["", ""]
print(df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6

Pandas Series Chaining: Filter on boolean value

How can I filter a pandas series based on boolean values?
Currently I have:
s.apply(lambda x: myfunc(x, myparam).where(lambda x: x).dropna()
What I want is only keep entries where myfunc returns true.myfunc is complex function using 3rd party code and operates only on individual elements.
How can i make this more understandable?
You can understand it with below given sample code
import pandas as pd
data = pd.Series([1,12,15,3,5,3,6,9,10,5])
print(data)
# filter data based on a condition keep only rows which are multiple of 3
filter_cond = data.apply(lambda x:x%3==0)
print(filter_cond)
filter_data = data[filter_cond]
print(filter_data)
This code is about to filter the series data which are of the multiples of 3. To do that, we just put the filter condition and apply it on the series data. You can verify it with below generated output.
The sample series data:
0 1
1 12
2 15
3 3
4 5
5 3
6 6
7 9
8 10
9 5
dtype: int64
The conditional filter output:
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
The final required filter data:
1 12
2 15
3 3
5 3
6 6
7 9
dtype: int64
Hope, this helps you to understand that how we can apply conditional filters on the series data.
Use boolean indexing:
mask = s.apply(lambda x: myfunc(x, myparam))
print (s[mask])
If also is changed index values in mask filter by 1d array:
#pandas 0.24+
print (s[mask.to_numpy()])
#pandas below
print (s[mask.values])
EDIT:
s = pd.Series([1,2,3])
def myfunc(x, n):
return x > n
myparam = 1
a = s[s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64
Solution with callable is possible, but a bit overcomplicated in my opinion:
a = s.loc[lambda s: s.apply(lambda x: myfunc(x, myparam))]
print (a)
1 2
2 3
dtype: int64

Different outcome using pandas nunique() and unique()

I have a big DF with 10 millions rows and I need to find the unique number for each column.
I wrote the function below:
(need to return a series)
def count_unique_values(df):
return pd.Series(df.nunique())
and I get this output:
Area 210
Item 436
Element 4
Year 53
Unit 2
Value 313640
dtype: int64
expected result should be value 313641.
when I just do
df['Value'].unique()
I do get that answer. Didn't figure out why I get less with nunique() just there.
Because DataFrame.nunique omit missing values, because default parameter dropna=True, Series.unique function not.
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'D':[np.nan,3,5,5,3,5],
})
print (df)
A D
0 a NaN
1 b 3.0
2 c 5.0
3 d 5.0
4 e 3.0
5 f 5.0
def count_unique_values(df):
return df.nunique()
print (count_unique_values(df))
A 6
D 2
dtype: int64
print (df['D'].unique())
[nan 3. 5.]
print (df['D'].nunique())
2
print (df['D'].unique())
[nan 3. 5.]
Solution is add parameter dropna=False:
print (df['D'].nunique(dropna=False))
3
print (df['D'].unique())
3
So in your function:
def count_unique_values(df):
return df.nunique(dropna=False)
print (count_unique_values(df))
A 6
D 3
dtype: int64

pandas dataframe filter by sequence of values in a specific column

I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to take only rows where there is a sequence of 3,4 in columns C (in this scenario - first two rows)
What will be the best way to do so?
You can use rolling for general solution working with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all(), raw=True)
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
Explanation:
use rolling.apply and test pattern
replace 0s to NaNs by mask
use bfill with limit for filling first NANs values by last previous one
fillna NaNs to 0
last cast to bool by astype
Use shift
In [1085]: s = df.eq(3).any(1) & df.shift(-1).eq(4).any(1)
In [1086]: df[s | s.shift()]
Out[1086]:
A B C
0 1 2 3
1 2 3 4

How to set a pandas dataframe equal to a row?

I know how to set the pandas data frame equal to a column.
i.e.:
df = df['col1']
what is the equivalent for a row? let's say taking the index? and would I eliminate one or more of them?
Many thanks.
If you want to take a copy of a row then you can either use loc for label indexing or iloc for integer based indexing:
In [104]:
df = pd.DataFrame({'a':np.random.randn(10),'b':np.random.randn(10)})
df
Out[104]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
In [106]:
row = df.iloc[3]
row
Out[106]:
a 0.531293
b -0.386598
Name: 3, dtype: float64
If you want to remove that row then you can use drop:
In [107]:
df.drop(3)
Out[107]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
You can also use a slice or pass a list of labels:
In [109]:
rows = df.loc[[3,5]]
row_slice = df.loc[3:5]
print(rows)
print(row_slice)
a b
3 0.531293 -0.386598
5 0.491417 -0.498816
a b
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
Similarly you can pass a list to drop:
In [110]:
df.drop([3,5])
Out[110]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
If you wanted to drop a slice then you can slice your index and pass this to drop:
In [112]:
df.drop(df.index[3:5])
Out[112]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186