Using numpy's np.where to calculate and construct new columns in pandas

I am trying to create a new column based on selection criteria in another column. This is at the end of a while loop, so the data frame does not have the column until this part of the first iteration. All subsequent iterations will be based on the column's value from the previous iteration plus the current totals:
if 'cBeds' in sPhase.columns:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', sPhase['cBeds'] + (sPhase[infCount] * .08), sPhase['cBeds'])
else:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', sPhase[infCount] * .08, sPhase['cBeds'])
However, when I run the code I get KeyError: 'cBeds'.
How can I handle updating a column in a conditional when the column doesn't exist on the first iteration?

In the else clause, you reference sPhase['cBeds'] as the third argument to np.where even though you've already established that the column does not exist.
If you want to avoid this problem, just add the column at the beginning of the loop and give it a default value that you can conditionally change later.
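For example, a minimal sketch of that approach (sPhase, COUNTYFP and infCount come from the question; the default of 0.0 is an assumption):
import numpy as np

# Give the column a default before the conditional logic runs,
# so it exists on the first iteration too.
if 'cBeds' not in sPhase.columns:
    sPhase['cBeds'] = 0.0  # assumed default

# The same update now works on every iteration.
sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1',
                           sPhase['cBeds'] + sPhase[infCount] * .08,
                           sPhase['cBeds'])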

Related

How to calculate the difference between row values based on another column value without filtering the values in between

I want to calculate the difference between seconds for rows where turn_marker == 1, but when I use the following method it filters out all the zeros, and I need the zeros because I need the entire data set.
My data set has a column called turn_marker with the values zero and 1, and another column with seconds. Now I want to calculate the time between the rows where turn_marker equals 1.
dataframe = main_dataframe.query("turn_marker=='1;'")
main_dataframe["seconds_diff"] = dataframe["seconds"].diff()
main_dataframe
I would be grateful if you could help me.
You can do this:
# remember the original row order
main_dataframe['indx'] = main_dataframe.index
# sort the turn_marker == '1;' rows to the top (keeping original order) so diff() runs over them consecutively
main_dataframe['diff'] = main_dataframe.sort_values(by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
# blank out the diff on the rows that are not turn markers
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan
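A quick check on a toy frame (a sketch; the '1;'/'0;' string values mirror the question's query, the sample seconds are invented):
import numpy as np
import pandas as pd

main_dataframe = pd.DataFrame({
    'turn_marker': ['0;', '1;', '0;', '1;', '0;'],
    'seconds': [0, 5, 8, 12, 20],
})

main_dataframe['indx'] = main_dataframe.index
main_dataframe['diff'] = main_dataframe.sort_values(
    by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan
# The '1;' rows now hold NaN and 12 - 5 = 7.0; every '0;' row keeps NaN,
# so no rows are dropped from the frame.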

Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met

I have a large dataset pertaining to customer churn, where every customer has a unique identifier (encoded key). The dataset is a time series where every customer has one row for every month they have been a customer, so both the date and the customer-identifier columns naturally contain duplicates. What I am trying to do is add a new column (called 'churn') and set it to 0 or 1 based on whether it is that specific customer's last month as a customer.
I have tried numerous methods to do this, but each one fails, either due to tracebacks or because they just don't work as intended. It should be noted that I am very new to both Python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives a traceback because a DataFrameGroupBy object has no attribute assign.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, but due to the attribute loc
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives a traceback due to the DataFrameGroupBy object
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give a traceback, but does not work as intended: it applies 1 to the churn column for all rows matching the overall max date, instead of the max date for each specific customerid, which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
IIUC, use GroupBy.transform with 'max' to get the maximal value per group, compare it with the date column, and finally set 1/0 values from the mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
# or, equivalently:
df2['churn'] = mask.astype(int)
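End to end on toy data (a sketch; the column names follow the question, the sample values are invented):
import pandas as pd

df2 = pd.DataFrame({
    'customerid': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01',
                            '2023-01-01', '2023-02-01']),
})

mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = mask.astype(int)
# churn is 1 on 2023-03-01 for 'a' and on 2023-02-01 for 'b':
# each customer's own last month, not the global maximum date.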

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the percentage of null values each column possesses in the dataframe. My code below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but I did not understand how this code works; can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements along a given axis, either by row or by column. In this particular case, xyz.drop(..., 1) means you're dropping along axis 1, i.e., columns.
xyz.loc[:, ... ].columns
will return a list with the column names resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction counts the nulls in each column, adds them up, and normalizes by the number of rows, effectively computing the percentage of NaN in each column. The amount is then rounded to two decimal places, and finally you get True if the percentage of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you apply Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, using .columns, you recover the names of those columns, which are finally passed to .drop.
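A compact demonstration of the whole expression (a sketch with invented data):
import pandas as pd

xyz = pd.DataFrame({'abc': [1, 2, 3, None],
                    'def': [1, None, 3, None],
                    'ghi': [None, None, None, 4]})

pct_null = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)
print(pct_null > 70)  # True only for 'ghi' (75% null)
xyz = xyz.drop(xyz.loc[:, pct_null > 70].columns, axis=1)
print(list(xyz.columns))  # ['abc', 'def']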
Hopefully this clears things up!
If that code is hard to understand, you can just use dropna with thresh, since pandas already covers this case. thresh is the minimum number of non-NA values a column must have to be kept, so requiring at least 30% non-null values is the same as dropping columns that are more than 70% null:
df = df.dropna(axis=1, thresh=round(len(df) * 0.3))

Pandas giving length must be equal error when trying to replace one column with another

I am trying to fill one column with another column where the first is null. I tried two ways:
df.NAME1 = np.where(df.NAME1.isnull(), df.NAME2, df.NAME1)
df['NAME1'] = df['NAME1'].fillna(df['NAME2'])
I get:
Lengths must be equal.

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe (shown as an image in the original post) contains 1466 rows; I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then put z_pred = 10**((mag - c)/m) in another new column (c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row-wise is usually done with NumPy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: this assumes that when any of the columns contains -999, the value is not calculated and -999 is returned instead.
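The second column can be layered on in the same way (a sketch; c and m stand in for the question's hard-coded constants, and the -999 sentinel is carried through):
import numpy as np

c, m = 1.0, 2.0  # hypothetical values for the hard-coded constants

valid = ~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1)
df['mag'] = np.where(valid, (df.U - df.V) / (df.R - df.I), -999)
df['z_pred'] = np.where(valid, 10 ** ((df['mag'] - c) / m), -999)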