Shift happens: using pandas shift to combine rows - pandas

I am trying to add a column to my DataFrame in pandas where each entry represents the difference between another column's values across two adjacent rows (if certain conditions are met). Following this answer to "get previous row's value and calculate new column pandas python", I'm using shift to find the delta between the duration_seconds entries in the two rows (next minus current), and then returning that delta as the derived entry if both rows are from the same user_id, the next row's action is not 'login', and the delta is not negative. Here's the code:
def duration(row):
    candidate_duration = row['duration_seconds'].shift(-1) - row['duration_seconds']
    if row['user_id'] == row['user_id'].shift(-1) and row['action'].shift(-1) != 'login' and candidate_duration >= 0:
        return candidate_duration
    else:
        return np.nan
Then I test the function using
analytic_events.apply(lambda row: duration(row), axis = 1)
But that throws an error:
AttributeError: ("'int' object has no attribute 'shift'", 'occurred at index 9464384')
I wondered if this was akin to the error fixed here and so I tried passing in the whole data frame thus:
duration(analytic_events)
but that throws the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What should I be doing to achieve this combination; how should I be using shift?

Without seeing your data: you could simplify this by creating the column conditionally with np.where:
cond1 = analytic_events['user_id'] == analytic_events['user_id'].shift(-1)
cond2 = analytic_events['action'].shift(-1) != 'login'
cond3 = analytic_events['duration_seconds'].shift(-1) - analytic_events['duration_seconds'] >= 0
analytic_events['candidate_duration'] = np.where(
    (cond1) & (cond2) & (cond3),
    analytic_events['duration_seconds'].shift(-1) - analytic_events['duration_seconds'],
    np.nan)
Explanation:
np.where works as follows: np.where(condition, value if True, value if False)
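As a minimal, self-contained sketch of this answer — the analytic_events frame here is made-up toy data, not the asker's:

```python
import numpy as np
import pandas as pd

# Toy stand-in for analytic_events (hypothetical data).
analytic_events = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'action': ['view', 'click', 'login', 'view'],
    'duration_seconds': [10, 25, 5, 7],
})

cond1 = analytic_events['user_id'] == analytic_events['user_id'].shift(-1)
cond2 = analytic_events['action'].shift(-1) != 'login'
delta = analytic_events['duration_seconds'].shift(-1) - analytic_events['duration_seconds']

# Row 0 qualifies (same user, next action isn't 'login', delta 15 >= 0);
# row 1 fails cond2, row 2 crosses users, row 3 has no next row.
analytic_events['candidate_duration'] = np.where(cond1 & cond2 & (delta >= 0), delta, np.nan)
```

Note that shift is called on whole columns here, which is why it works: inside apply(axis=1) each row entry is a scalar, and a scalar has no .shift method — which is exactly what the AttributeError was saying.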

Related

Filling NaNs using apply lambda function to work with DASK dataframe

I am trying to figure out how to fill a column based on which of several other columns is non-null, falling through to the next column when a value is missing, like so:
df['NewCol'] = df.apply(lambda row: 'Three' if row['VersionThree'] == row['VersionThree']
                        else ('Two' if row['VersionTwo'] == row['VersionTwo']
                              else ('Test' if row['VS'] == row['VS'] else '')), axis=1)
So the function works as it should, but I am now trying to figure out how to get it to run when I read my dataset in as a Dask Data Frame
I tried to vectorize and see if I could use numpy where with it as so,
df['NewCol'] = np.where((df['VersionThree'] == df['VersionThree']), ['Three'],
                        np.where((df['VersionTwo'] == df['VersionTwo']), ['Two'],
                                 np.where((df['VS'] == df['VS']), ['Test'], np.nan)))
But it does not run/crashes. I would like the function to iterate through every row and check those three columns: if one of them has a value, output the corresponding label to NewCol; if it is null, check the next column in the chain; and if all are null, place np.nan in that cell.
I am trying to use a Dask DataFrame
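No answer is shown here, but as a hedged sketch: the row['x'] == row['x'] trick is a NaN test (NaN != NaN), so the nested conditionals can be expressed with notna plus np.select, which tries conditions in order just like the if/elif chain. The frame below is hypothetical toy data, and the '' default mirrors the apply version's fallback:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing version.
df = pd.DataFrame({
    'VersionThree': [1.0, np.nan, np.nan],
    'VersionTwo':   [np.nan, 2.0, np.nan],
    'VS':           [np.nan, np.nan, np.nan],
})

# np.select evaluates the conditions in order, like the nested if/else.
conditions = [df['VersionThree'].notna(), df['VersionTwo'].notna(), df['VS'].notna()]
choices = ['Three', 'Two', 'Test']
df['NewCol'] = np.select(conditions, choices, default='')
```

With a Dask DataFrame, a function built around this expression can usually be applied per partition via map_partitions, since np.select works column-wise and needs no cross-row state.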

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (new to data science). It's obviously because I don't fully get how the data structures of pandas work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value = ""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier) + len(identifier) + 1
        end_index = full_str[start_index:].find("|") + start_index
        value = full_str[start_index:end_index].astype(str)
    return value

for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns. I have around 260 in my dataframe. If a column name contains the character #, the function should try to fill its values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is completely messed up but is in the form CCC#111~VALUE|DDD#222~VALUE| <- until there are no more identifiers + values. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
12 for col in anedf.columns:
13 if col.count("#") > 0:
---> 14 anedf[col] = getSearchedValue(col)
Input In [119], in getSearchedValue(identifier)
4 full_str = anedf["Diskret data"].astype(str)
5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
7 start_index = full_str.str.find(identifier)+len(identifier)+1
8 end_index = full_str[start_index:].find("|")+start_index
I guess this is because it evaluates against all rows (a Series), which obviously produces multiple True/False values at once. But how can I make the evaluation and assignment work so that it evaluates and assigns like this:
Diskret data            | CCC#111                        | JJSDJ#1234
CCC#111~1|BBB#2323~2234 | 1 (copied from "Diskret data") | 0
JJSDJ#1234~Heart attack | 0 (or skipped, since the row does not contain a value for the identifier) | Heart attack
The plan is to drop the "Diskret data" when the assignment is done, so I have the data in a more structured way.
--- Update---
By request:
I have included a picture of how I visualize the problem, And what I seemingly can't make it do.
[Image: Problem visualisation]
With regex you could do something like:
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unpacks it into two sequences: one with only the "keys" and the other with only the "values". We then build a Series with the keys as the index and the values as the values.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of type object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)
...
print(idx, values) # first row
# output:
# ('CCC#111', 'BBB#2323') (1, 2234)
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
# CCC#111 BBB#2323 JJSDJ#1234
# 0 1 2234 NaN
# 1 NaN NaN Heart attack
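Since the asker's plan is to drop "Diskret data" once the structured columns exist, here is a hedged follow-up sketch (same toy strings as above, with the series named after that column) that joins the expanded frame back and drops the original:

```python
import pandas as pd

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack'],
    name='Diskret data',
)

# Expand each row's identifier~value pairs into columns, one per identifier.
expanded = series.str.findall(r'([^~|]+)~([^~|]+)').apply(
    lambda pairs: pd.Series(dict(pairs), dtype=object))

# Attach the new columns, then drop the raw column.
result = series.to_frame().join(expanded).drop(columns='Diskret data')
```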

Assigning value to an iloc slice with condition on a separate column?

I would like to slice my dataframe using iloc (rather than loc) + some condition based on one of the dataframe's columns and assign a value to all the items in this slice (which is effectively a subset of the main dataframe).
My simplified attempt:
df.iloc[:, 1:21][df['column1'] == 'some_value'] = 1
This is meant to take a slice of the dataframe:
All rows;
Columns 2 to 20;
Then slice it again:
Only the rows where column1 = some_value.
The slicing works fine, but assigning 1 to it doesn't work. Nothing changes in df, and I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
I really need to use iloc rather than loc if possible. It feels like there should be a way of doing this?
You can search for that warning on SO. In short, you should update through one single loc/iloc call:
df.loc[df['column1']=='some_value', df.columns[1:21]] = 1
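If iloc really is required (e.g. the column positions are what's known, not the names), one hedged workaround is to convert the boolean mask into positional row indices first, so the whole assignment happens in a single iloc call. A toy frame with assumed column names:

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame: five numeric columns plus the condition column.
df = pd.DataFrame(np.zeros((4, 5)), columns=list('abcde'))
df['column1'] = ['some_value', 'x', 'some_value', 'y']

# Turn the boolean mask into positional row indices that iloc accepts.
rows = np.flatnonzero(df['column1'] == 'some_value')
df.iloc[rows, 1:3] = 1  # columns at positions 1 and 2 ('b' and 'c')
```

Because both row and column indexers live in one .iloc[...] = ... statement, pandas assigns into the original frame rather than a temporary copy, which is what triggered the SettingWithCopyWarning.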

How to return Boolean series meeting multiple conditions in Pandas using bracket notation for column names

I am trying to extract a series meeting multiple conditions in Pandas, i.e. using a boolean operator to filter the data, based on the question/answer here, but I need to use the bracket column notation. (Python 3.7)
This works, and returns [index, Boolean]:
mySeries = data['myCol'] == 'A'
These both return errors:
mySeries = (data['rank'] == 'A' or data['rank'] == 'B')
mySeries = (data['rank'] == 'A' | data['rank'] == 'B')
The first one returns ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). The answers in this question seem to address this error for a dataframe, not a series. The second attempt returns this error: TypeError: Cannot perform 'ror_' with a dtyped [object] array and scalar of type [bool]
I am using bracket notation df['rank'] instead of dot notation df.rank because in the dot notation, Pandas confuses the column name with the rank method.
We can just use isin:
mySeries = data['rank'].isin(['A', 'B'])
Based on the answer by #unutbu here, this is the correct notation, the issue was that each condition needed to be in its own parentheses:
mySeries = (data['rank'] == 'A') | (data['rank'] == 'B')
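A quick runnable check of both answers (toy data assumed): | binds tighter than ==, so each comparison needs its own parentheses, and isin produces the same mask.

```python
import pandas as pd

# Toy frame standing in for `data`.
data = pd.DataFrame({'rank': ['A', 'B', 'C']})

# Parenthesize each comparison: | has higher precedence than ==.
mask_or = (data['rank'] == 'A') | (data['rank'] == 'B')
mask_isin = data['rank'].isin(['A', 'B'])

assert mask_or.equals(mask_isin)
```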

How do I create multiple new columns, and populate columns depending on values in 2 other columns using pandas/python?

I want to populate 10 columns with the numbers 1-16 depending on the values in 2 other columns. I can start by providing the column header or create new columns (does not matter to me).
I tried to create a function that iterates over the numbers 1-10 and then assigns a value to the z variable depending on the values of b and y.
Then I want to apply this function to each row in my dataframe.
import pandas as pd
import numpy as np
data = pd.read_csv('Nuc.csv')
def write_Pcolumns(df):
    """populates a column in the given dataframe, df, based on the values in two other columns in the same dataframe"""
    # create string of numbers for each nucleotide position
    positions = ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10')
    a = "Po "
    x = "O.Po "
    # for each position, create a variable for the nucleotide in the sequence (Po) and opposite to the sequence (O.Po)
    for each in positions:
        b = a + each
        y = x + each
        z = 'P' + each
        # assign a value to z based on the nucleotide identities in the sequence and opposite position
        if df[b] == 'A' and df[y] == 'A':
            df[z] == 1
        elif df[b] == 'A' and df[y] == 'C':
            df[z] == 2
        elif df[b] == 'A' and df[y] == 'G':
            df[z] == 3
        elif df[b] == 'A' and df[y] == 'T':
            df[z] == 4
        ...
        elif df[b] == 'T' and df[y] == 'G':
            df[z] == 15
        else:
            df[z] == 16
    return df
data.apply(write_Pcolumns(data), axis=1)
I get the following error message:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This happens because df[b] == 'value' returns a Series of booleans (one per row), not a single boolean, so Python cannot decide whether the if branch should be taken.
Check out Pandas error when using if-else to create new column: The truth value of a Series is ambiguous
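Note also that df[z] == 1 in the question is a comparison, not an assignment (df[z] = 1). One hedged, vectorized alternative for a single position: since there are only 4x4 nucleotide pairs, a lookup dict plus a row-wise zip replaces the whole if/elif chain. The two-row frame and the code ordering (A/A=1, A/C=2, ..., T/T=16) are assumptions consistent with the snippets shown:

```python
import pandas as pd

# Assumed ordering: A/A=1, A/C=2, A/G=3, A/T=4, ..., T/G=15, T/T=16.
bases = ['A', 'C', 'G', 'T']
codes = {pair: i for i, pair in enumerate(
    ((s, o) for s in bases for o in bases), start=1)}

# Hypothetical two-row frame for position 1.
df = pd.DataFrame({'Po 1': ['A', 'T'], 'O.Po 1': ['A', 'G']})

# Pair the two columns row-wise and look each pair up in the dict.
df['P1'] = [codes.get(pair, 16) for pair in zip(df['Po 1'], df['O.Po 1'])]
```

The same list comprehension repeated over each of the ten positions (or wrapped in a loop over position numbers) covers the asker's full use case without any Series-vs-scalar ambiguity.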