Pandas Series shift and conditions return "The truth value of a Series is ambiguous"

I have a pandas Series df containing 10 values (all doubles).
My aim is to create a new Series as follows:
newSerie = 1 if df > df.shift(1) else 0
In other words, newSerie should be 1 if the current value of df is bigger than its previous value, and 0 otherwise.
However, I get:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In addition, I then want to concatenate df and newSerie into a DataFrame, but newSerie only has 9 meaningful values, because the first value of df cannot be compared with shift(1). Hence I need the first value of newSerie to be an empty value in order to be able to concatenate.
How can I do that?
To give an example, imagine my input is only the Series df. My output should be as in the following image:

You can use shift or diff:
# example dataframe:
data = pd.DataFrame({'df':[10,9,12,13,14,15,18,16,20,1]})
df
0 10
1 9
2 12
3 13
4 14
5 15
6 18
7 16
8 20
9 1
Using Series.shift:
data['NewSerie'] = data['df'].gt(data['df'].shift()).astype(int)
Or Series.diff:
data['NewSerie'] = data['df'].diff().gt(0).astype(int)
Output
df NewSerie
0 10 0
1 9 0
2 12 1
3 13 1
4 14 1
5 15 1
6 18 1
7 16 0
8 20 1
9 1 0
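The error in the question arises because Python's if ... else needs a single True/False, while df > df.shift(1) yields a whole Series of booleans. Note also that both one-liners above put 0 in the first row, whereas the question asked for an empty first value. The following is a minimal sketch of one way to get that, assuming the nullable "Int64" dtype is acceptable; the names s, new_serie and result are only illustrative.

```python
import pandas as pd

# Same ten values as the example above, stored as floats ("doubles").
s = pd.Series([10, 9, 12, 13, 14, 15, 18, 16, 20, 1], dtype="float64", name="df")

# Element-wise comparison with the previous value; no Python `if` involved.
new_serie = s.gt(s.shift()).astype("Int64")

# The first row has no previous value, so mark it as missing instead of 0.
new_serie.iloc[0] = pd.NA

# Both Series share the same index, so they concatenate column-wise cleanly.
result = pd.concat([s, new_serie.rename("NewSerie")], axis=1)
print(result)
```

Because new_serie keeps all 10 index labels, no padding is needed before concatenating; only the first entry is missing.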

Related

Python Lambda Apply Function Multiple Conditions using OR

I've searched for this and cannot find a solution. I have a condition on my data where a row is counted when either of two conditions is met. In my dataset, I have used apply with a lambda function for a single condition (< or >). However, I have a continuous data column where the count is based on either a low value OR a high value. I have tried variations of the code below but keep getting:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Let's say my data (dfdata) looks like this:
Site data month day year
A 4 1 1 2021
A 17 1 2 2021
A 8 1 3 2021
A 7 1 1 2022
A 0 1 2 2022
A 2 1 3 2022
B 3 1 1 2021
B 16 1 2 2021
B 9 1 3 2021
B 2 1 1 2022
B 18 1 2 2022
B 5 1 3 2022
I've used a for loop that should give the following result below for evaluating the "data" column and counting the instances of the value < 4 OR > 15. I think that the "|" operator might do this but I get a True/False...
sites = ['A','B']
n = len(sites)
dft = pd.DataFrame();
for n in sites:
    dft.loc[:,n] = dfdata[dfdata['Site']==n].groupby(["month", "day"])["data"].apply(lambda x: (x < 4) or (x > 15).sum())
the result.
month day A B
1 1 0 2
1 2 2 2
1 3 1 0
Thanks for your help.
You don't have to use (and should avoid) loops in pandas. Aside from being slow, they also make your intent harder to read.
Here's one solution using pandas functions:
dft = (
    dfdata.query("data < 4 or data > 15")
    .groupby(["month", "day", "Site"])["data"]
    .count()
    .unstack(fill_value=0)
)
The query filters for rows whose data is < 4 or > 15. The rest counts the matching rows per group and reshapes the result so that each site becomes a column.
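The ValueError in the question comes from the Python keyword or, which tries to reduce each Series to a single True/False. The element-wise operator | (with each comparison in parentheses) avoids that, and summing the resulting boolean mask counts the rows that match. Below is a minimal sketch along those lines, rebuilding the sample data from the question; the column name out_of_range is only illustrative.

```python
import pandas as pd

# Sample data from the question.
dfdata = pd.DataFrame({
    "Site": ["A"] * 6 + ["B"] * 6,
    "data": [4, 17, 8, 7, 0, 2, 3, 16, 9, 2, 18, 5],
    "month": [1] * 12,
    "day": [1, 2, 3] * 4,
    "year": ([2021] * 3 + [2022] * 3) * 2,
})

# Element-wise OR: True where the value is below 4 or above 15.
flagged = dfdata.assign(out_of_range=(dfdata["data"] < 4) | (dfdata["data"] > 15))

# Summing booleans counts the True values in each group.
counts = (
    flagged.groupby(["month", "day", "Site"])["out_of_range"]
    .sum()
    .unstack("Site")
)
print(counts)
```

Every (month, day, Site) group is still present here because nothing was filtered out beforehand, so no fill value is needed when unstacking.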

New column based on values from other columns in python

I have a dataframe df which looks like this
min  max  value
3    9    7
3    4    10
4    4    4
4    10   3
I want to create a new column df['Accuracy'] which tells me whether df['value'] lies between df['min'] and df['max'] (inclusive), such that the new dataframe looks like
min  max  value  Accuracy
3    9    7      Accurate
3    4    10     Not Accurate
4    4    4      Accurate
4    10   3      Not Accurate
Use the apply() method of pandas with axis=1 to evaluate each row:
def accurate(row):
    # Inclusive bounds: a value equal to min or max counts as accurate
    if row['value'] >= row['min'] and row['value'] <= row['max']:
        return 'Accurate'
    return 'Not Accurate'
df['Accuracy'] = df.apply(accurate, axis=1)
print(df)
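A row-wise apply works but calls a Python function once per row. As an alternative sketch (not part of the answer above), the same inclusive check can be written in vectorized form with Series.between and numpy.where, using the sample data from the question:

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "min": [3, 3, 4, 4],
    "max": [9, 4, 4, 10],
    "value": [7, 10, 4, 3],
})

# between() is inclusive on both ends by default and accepts Series bounds,
# so each value is compared against the min and max of its own row.
in_range = df["value"].between(df["min"], df["max"])

df["Accuracy"] = np.where(in_range, "Accurate", "Not Accurate")
print(df)
```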

Multi-indexed series into DataFrame and reformat

I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # The matrix is symmetric, so extract the upper triangle without the diagonal (k=1).
    # Note: the np.bool alias was removed in NumPy 1.24; the builtin bool is used instead.
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                   .stack()
                   .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either 2 or 3 columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me rehash the shape of the unstacked DF?
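One possible approach, sketched here under the assumption that the long three-column layout is what the grid widget needs: skip the final unstack() and instead turn the two index levels of the multi-indexed Series into ordinary columns with rename_axis and reset_index. The function name get_top_pairs, the parameter n and the seeded random generator are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random sample data in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(5, 30, size=500).reshape((50, 10)))
corr = df.corr()

def get_top_pairs(corr_matrix, n=10):
    # Keep only the upper triangle without the diagonal (k=1).
    mask = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    ranked = (
        corr_matrix.where(mask)
        .stack()
        .dropna()  # explicit, in case stack() keeps the masked NaN entries
        .sort_values(ascending=False)
    )
    # Name the two index levels, then move them into regular columns.
    return ranked.head(n).rename_axis(["ID1", "ID2"]).reset_index(name="Corr")

pairs = get_top_pairs(corr)
print(pairs)
```

For the bottom pairs, the same function could sort ascending or take ranked.tail(n) instead.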

Remove rows in pandas df with index values within a range

I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!
Compare each index value with the next one in a list comprehension and pass the resulting list to loc. I chose a list so that the final output is a DataFrame.
ind = [ a for a,b in zip(df.index,df.index[1:]) if b-a > 4]
df.loc[ind]
A B
0 1 1
You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >= 4].drop(columns='index')
In [1314]: df
Out[1314]:
A B
0 1 1
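Both answers above keep a row when the gap to the next index value is large enough. The same idea can be sketched without reset_index by working on the index directly, which also preserves the original index labels. This follows the answers' reading of the question (gap to the next row, threshold of 4), which is an assumption about the intended rule.

```python
import pandas as pd

# Sample data from the question, with a non-contiguous integer index.
df = pd.DataFrame({"A": [1, 5, 9, 10], "B": [1, 5, 9, 10]}, index=[0, 5, 8, 9])

# Gap from each index value to the next one; the last row has no next row (NaN).
gap_to_next = df.index.to_series().diff().shift(-1)

# Keep rows whose gap to the next index is at least 4; NaN compares as False.
kept = df[gap_to_next.ge(4)]
print(kept)
```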

How to get the mode of a column in pandas when several values share the same frequency

I have a data frame and i'd like to get the mode of a specific column.
i'm using:
freq_mode = df.mode()['my_col'][0]
However I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because several values are tied for the mode.
I need any one of the modes; it doesn't matter which. How can I use any() to get any one of the existing modes?
Your code works fine for me with sample data.
If necessary, select the first value of the Series returned by mode:
freq_mode = df['my_col'].mode().iat[0]
Here is what mode returns for a single column when two values are tied:
df=pd.DataFrame({"A":[14,4,5,4,1,5],
"B":[5,2,54,3,2,7],
"C":[20,20,7,3,8,7],
"train_label":[7,7,6,6,6,7]})
X=df['train_label'].mode()
print(X)
DataFrame
A B C train_label
0 14 5 20 7
1 4 2 20 7
2 5 54 7 6
3 4 3 3 6
4 1 2 8 6
5 5 7 7 7
Output
0 6
1 7
dtype: int64
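Since Series.mode returns every tied value as a Series (sorted in ascending order), picking "any" mode comes down to taking one element by position. A minimal sketch using the train_label column from the example above; the variable names are illustrative:

```python
import pandas as pd

# The train_label column from the example: 6 and 7 each occur three times.
df = pd.DataFrame({"train_label": [7, 7, 6, 6, 6, 7]})

modes = df["train_label"].mode()  # Series holding all tied modes
freq_mode = modes.iat[0]          # take the first one as a scalar

print(modes.tolist(), freq_mode)
```

Using .iat[0] selects by position, so it does not depend on the index labels of the returned Series.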