Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met - pandas

I have a large dataset pertaining to customer churn, where every customer has a unique identifier (encoded key). The dataset is a time series with one row per customer for every month they have been a customer, so both the date and customer-identifier columns naturally contain duplicates. What I am trying to do is add a new column (called 'churn') and set it to 1 or 0 depending on whether the row is that specific customer's last month as a customer.
I have tried numerous methods to do this, but each and every one fails, either due to tracebacks or because it just doesn't work as intended. It should be noted that I am very new to both python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives a traceback because a DataFrameGroupBy object has no attribute assign.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, but due to the attribute loc
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives tracebacks due to the dataframegroupby
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give tracebacks, but does not work as intended, i.e. it applies 1 to the churn column for the max date for all rows, instead of the max date for the specific customerid - which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!

IIUC, use GroupBy.transform with 'max' to return the maximum date per group, aligned with the original rows, compare it with the date column, and finally set the 1/0 values from the resulting mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
# or, equivalently:
df2['churn'] = mask.astype(int)
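To make the answer above concrete, here is a minimal sketch on a tiny made-up dataset. The sample rows are hypothetical; only the column names 'customerid', 'date' and 'churn' come from the question.

```python
import pandas as pd

# Hypothetical sample data: one row per customer per month.
df2 = pd.DataFrame({
    "customerid": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-01-01", "2020-02-01"]
    ),
})

# transform('max') returns a Series with the same index as df2, where every
# row holds the latest date of that row's customer.
last_date = df2.groupby("customerid")["date"].transform("max")

# A row is the customer's last month when its date equals that per-customer max.
df2["churn"] = df2["date"].eq(last_date).astype(int)
print(df2)
```

Because transform keeps the original row order and index, the comparison is row by row and no merge or loop is needed.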

Related

Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
I want to get the mean of the column weight, but the obvious outlier 10,000 is distorting the actual mean. In this situation I cannot change the value but must work around it. This is what I've got so far, but obviously it includes that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I compute the mean while working around that value?
Since you added SQL to your tags: in SQL you'd want to exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace the value with NaN, which mean skips by default:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter the values above threshold and use the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334
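The filtering and replace approaches from the answers can be compared side by side. This is a small sketch on made-up numbers, assuming 'trans_quantity' is already numeric and that 10000 is the single value to exclude.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three ordinary values and the outlier to skip.
df_cleaned = pd.DataFrame({"trans_quantity": [1.0, 2.0, 3.0, 10000.0]})

# Option 1: boolean mask that drops the outlier row before averaging.
keep = df_cleaned["trans_quantity"] != 10000
filtered_mean = df_cleaned.loc[keep, "trans_quantity"].mean()

# Option 2: turn the outlier into NaN; mean() skips NaN by default.
replaced_mean = df_cleaned["trans_quantity"].replace(10000, np.nan).mean()

print(filtered_mean, replaced_mean)
```

Neither option modifies df_cleaned itself, which matches the constraint that the value cannot be changed in the data.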

How to calculate the difference between row values based on another column value without filtering the values in between

I want to calculate the difference in seconds between consecutive rows where turn_marker == 1. When I use the following method, it filters out all the zeros, but I need to keep them because I need the entire dataset.
My dataset has a column called turn_marker that holds the values 0 and 1, and another column with seconds. I want to calculate the time between those rows where turn_marker equals 1.
dataframe = main_dataframe.query("turn_marker=='1;'")
main_dataframe["seconds_diff"] = dataframe["seconds"].diff()
main_dataframe
I would be grateful if you could help me.
You can do this:
main_dataframe['indx'] = main_dataframe.index
main_dataframe['diff'] = main_dataframe.sort_values(by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan
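An alternative sketch that avoids sorting: take the diff on the marked rows only and let pandas align the result back on the index, so every unmarked row stays in the frame with NaN. The sample values are hypothetical, and the markers are assumed to be the strings '1;' and '0;' as in the code above.

```python
import pandas as pd

# Hypothetical sample data with string markers.
main_dataframe = pd.DataFrame({
    "turn_marker": ["1;", "0;", "0;", "1;", "0;", "1;"],
    "seconds": [10, 12, 15, 20, 26, 35],
})

# Select only the marked rows, diff them, and assign back.
# Assignment aligns on the index, so rows that were not selected get NaN.
is_turn = main_dataframe["turn_marker"] == "1;"
main_dataframe["seconds_diff"] = main_dataframe.loc[is_turn, "seconds"].diff()
print(main_dataframe)
```

The first marked row has no predecessor, so its difference is NaN as well.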

pandas - numpy using np.where to calculate and construct new columns

I am trying to create a new column based on selection criteria in another column. This is at the end of a while loop, so the data frame does not have the column until this part of the first iteration. All subsequent iterations will be based on this column's total from the previous iteration and the current totals:
if 'cBeds' in sPhase.columns:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', (sPhase['cBeds'] + (sPhase[infCount] * .08)), sPhase['cBeds'])
else:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', (sPhase[infCount] * .08), sPhase['cBeds'])
However, when I run the code I get KeyError: 'cBeds'.
How can I handle updating a column in a conditional when the column doesn't exist on the first iteration?
In the else clause, you reference sPhase['cBeds'] as the third parameter to np.where even though you've already established that the column does not exist.
If you want to avoid this problem, just add the column at the beginning of the loop and give it a default value that you can conditionally change later.
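A minimal sketch of that suggestion, assuming a default of 0.0 is acceptable for counties that never match. The sample rows, the 'infected' column name, and the two-iteration loop are hypothetical stand-ins for the question's while loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data; infCount holds the name of the infection-count column.
sPhase = pd.DataFrame({"COUNTYFP": ["1", "2", "1"], "infected": [100, 200, 50]})
infCount = "infected"

# Create the column once, before the loop, so every iteration can read it.
sPhase["cBeds"] = 0.0

for _ in range(2):  # stand-in for the question's while loop
    sPhase["cBeds"] = np.where(
        sPhase["COUNTYFP"] == "1",
        sPhase["cBeds"] + sPhase[infCount] * 0.08,  # accumulate for matching rows
        sPhase["cBeds"],                            # leave other rows unchanged
    )
print(sPhase)
```

With the column initialized up front, the if/else on 'cBeds' in sPhase.columns is no longer needed.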

convert Int64Index to Int

I'm iterating through a dataframe (called hdf) and applying changes on a row by row basis. hdf is sorted by group_id and assigned a 1 through n rank on some criteria.
# Groupby function creates subset dataframes (a dataframe per distinct group_id).
grouped = hdf.groupby('group_id')
# Iterate through each subdataframe.
for name, group in grouped:
    # This grabs the top index for each subdataframe
    index1 = group[group['group_rank']==1].index
    # If criteria1 == 0, flag all rows for removal
    if(max(group['criteria1']) == 0):
        for x in range(rank1, rank1 + max(group['group_rank'])):
            hdf.loc[x,'remove_row'] = 1
I'm getting the following error:
TypeError: int() argument must be a string or a number, not 'Int64Index'
I get the same error when I try to cast rank1 explicitly:
rank1 = int(group[group['auction_rank']==1].index)
Can someone explain what is happening and provide an alternative?
The answer to your specific question is that index1 is an Int64Index (basically a list), even if it has one element. To get that one element, you can use index1[0].
But there are better ways of accomplishing your goal. If you want to remove all of the rows in the "bad" groups, you can use filter:
hdf = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
If you only want to remove certain rows within matching groups, you can write a function and then use apply:
def filter_group(group):
    if group['criteria1'].max() != 0:
        return group
    else:
        return group.loc[other criteria here]
hdf = hdf.groupby('group_id').apply(filter_group)
(If you really like your current way of doing things, you should know that loc will accept an index, not just an integer, so you could also do hdf.loc[group.index, 'remove_row'] = 1).
Call tolist() on the Int64Index object. The resulting list can then be iterated as int values.
Simply add [0] to ensure you get the first value from the index:
rank1 = int(group[group['auction_rank']==1].index[0])
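The flagging loop from the question can also be written without iterating over groups. The sketch below uses made-up rows and shows both the flag-only variant (via transform) and the filter call from the answer above; the column names come from the question.

```python
import pandas as pd

# Hypothetical data: group 1 has criteria1 == 0 everywhere, group 2 does not.
hdf = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 2],
    "group_rank": [1, 2, 1, 2, 3],
    "criteria1": [0, 0, 0, 5, 0],
})

# Per-group maximum of criteria1, broadcast back onto every row.
group_max = hdf.groupby("group_id")["criteria1"].transform("max")

# Flag all rows of groups whose maximum is 0.
hdf["remove_row"] = (group_max == 0).astype(int)

# Or drop those groups outright, as in the filter-based answer.
kept = hdf.groupby("group_id").filter(lambda g: g["criteria1"].max() != 0)
print(hdf)
print(kept)
```

Neither variant needs the row index as an integer, so the Int64Index conversion problem does not arise.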

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['phraseLength'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view, and thus actually change the dataframe, or a copy, leaving the dataframe unchanged. It took me quite some time to explain this to her, since she argued, correctly, that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified whether the assignment took place in two different ways:
By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
see docs here.
You don't want to set values this way, for exactly the reason you pointed out: since you don't know whether it's a view, you don't know that you are actually changing the data. 0.13 will raise/warn when you attempt this, but it is easiest/best to just access it like:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
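Applied to all three masks, the .loc form looks like the sketch below. The sample lengths are hypothetical, and the second mask is assumed to use the 'length' column on both sides.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering the boundaries of each range.
data = pd.DataFrame({"length": [0, 9, 10, 14, 15, 20]})
data["group"] = np.nan

# Each assignment is a single .loc call on the dataframe itself, so there is
# no intermediate object that could turn out to be a copy.
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3

print(data)
print(data["group"].isna().sum(), data["group"].value_counts().sum())
```

Here the null count and the value_counts total agree with each other, which is the consistency check the question was attempting.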