Index Error: Selecting the max of two rows and logging it in a separate column - pandas

I'm trying to automate billing for my boss. I have to choose the highest quantity for an invoice date and client, then print that quantity in a separate column and a 0 (or blank) for the second row associated with that client. I'm trying to recreate this example:
Billing Snippet
I'm having trouble using Pandas to do this. I used a pivot table to get the max quantity for each client, then merged that data with the original to get a "max" column. That looks like this:
Dataframe snippet
My plan is to use indexes to essentially say "if the Qty is not equal to Max, then change the value to 0"
Here's my code, but I get the error "A value is trying to be set on a copy of a slice from a DataFrame" :
ad2[ad2['Qty'] != ad2['max']]['Qtrly Billing Count']=0
Any advice on how to tackle this?
Update: Tried turning off the setting that gives me the index error, but the column I want to update isn't changing. Help!

Recreating your df:
ad2 = pd.DataFrame({'Qty':[33, 47],'max':[47,47], 'Qtrly':[47,47] })
Qtrly Qty max
0 47 33 47
1 47 47 47
using loc:
ad2.loc[ad2['Qty'] != ad2['max'], 'Qtrly']=0
result:
Qtrly Qty max
0 0 33 47
1 47 47 47
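
If you would rather not assign through .loc, a hedged alternative (same column names as above; numpy is assumed to be available as np) is to rebuild the column with np.where:

import numpy as np
import pandas as pd

ad2 = pd.DataFrame({'Qty': [33, 47], 'max': [47, 47], 'Qtrly': [47, 47]})

# Keep Qtrly where Qty equals the max, otherwise write 0.
ad2['Qtrly'] = np.where(ad2['Qty'] == ad2['max'], ad2['Qtrly'], 0)

Both approaches avoid the chained indexing that triggers the SettingWithCopyWarning in the original line.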


Collapse pandas DataFrame based on daily column value

I have a pandas DataFrame with multiple measurements per day (for example hourly measurements, but that is not necessarily the case), but I want to keep only the hour for which a certain column is the daily minimum.
One day in my data frame looks somewhat like this:
DATE Value Distance
17 1979-1-2T00:00:00.0 15.5669870447436 34.87
18 1979-1-2T01:00:00.0 81.6306803714536 31.342
19 1979-1-2T02:00:00.0 83.1854759740486 33.264
20 1979-1-2T03:00:00.0 23.8659679630303 32.34
21 1979-1-2T04:00:00.0 63.2755504429306 31.973
22 1979-1-2T05:00:00.0 91.2129044773733 34.091
23 1979-1-2T06:00:00.0 76.493130052689 36.837
24 1979-1-2T07:00:00.0 63.5443183375785 34.383
25 1979-1-2T08:00:00.0 40.9255407683688 35.275
26 1979-1-2T09:00:00.0 54.5583051827551 32.152
27 1979-1-2T10:00:00.0 26.2690011881422 35.104
28 1979-1-2T11:00:00.0 71.3059740399097 37.28
29 1979-1-2T12:00:00.0 54.0111262724049 38.963
30 1979-1-2T13:00:00.0 91.3518048568241 36.696
31 1979-1-2T14:00:00.0 81.7651763485069 34.832
32 1979-1-2T15:00:00.0 90.5695814525067 35.473
33 1979-1-2T16:00:00.0 88.4550315358515 30.998
34 1979-1-2T17:00:00.0 41.6276969038137 32.353
35 1979-1-2T18:00:00.0 79.3818377264749 30.15
36 1979-1-2T19:00:00.0 79.1672568582629 37.07
37 1979-1-2T20:00:00.0 1.48337999844262 28.525
38 1979-1-2T21:00:00.0 87.9110385474789 38.323
39 1979-1-2T22:00:00.0 38.6646421460678 23.251
40 1979-1-2T23:00:00.0 88.4920153764757 31.236
I would like to keep all rows that have the minimum "distance" per day, so for the one day shown above, one would have only one row left (the one with index value 39). I know how to collapse the data frame so that I only have the Distance column left. I can do that - if I first set the DATE as index - with
df_short = df.groupby(df.index.floor('D'))["Distance"].min()
But I also want the Value column in my final result. How do I keep all columns?
It doesn't seem to work if I do
df_short = df.groupby(df.index.floor('D')).min(["Distance"])
This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does.
Maybe this is already posted somewhere, but I have trouble finding it.
You can use aggregate
df_short = df.groupby(df.index.floor('D')).agg({'Distance': min, 'Value': max})
If you want the kept Value to come from the same row as the minimum Distance:
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
Make a datetime Index:
df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)
Resample and find the min Distance's location:
df.loc[df.resample('D')['Distance'].idxmin()]
Output:
Value Distance
DATE
1979-01-02 22:00:00 38.664642 23.251
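
For completeness, a minimal end-to-end sketch of the groupby/idxmin route (column names follow the question; the sample rows are abbreviated from the data above):

import pandas as pd

df = pd.DataFrame({
    'DATE': ['1979-01-02T00:00:00', '1979-01-02T01:00:00', '1979-01-02T22:00:00'],
    'Value': [15.57, 81.63, 38.66],
    'Distance': [34.87, 31.342, 23.251],
})
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.set_index('DATE')

# idxmin() returns the index label (a timestamp) of each day's minimum Distance,
# so .loc pulls back the full row, keeping every column.
daily_min = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin()]
print(daily_min)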

groupby 2 columns and count into separate columns based on one columns cases

I'm trying to group by 2 columns, of which the first has 5 different values and the second has 2.
My data looks like this:
and using
df_counted = (df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT'))
I was able to transform it into the cases I want:
However, I don't want a column for RESULT, just columns for the counts. It's supposed to be like this:
COUNT_TRUE COUNT_FALSE
FORWARD 21 182
BACKWARD 34 170
RIGHT 24 298
LEFT 20 242
NEUTRAL 16 82
The best I could do there was this. How do I get there?
Pandas can build a pivot table from a DataFrame, so your task can also be done with pivot_table:
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result:
Solved it and went a kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# drop duplicates for the first counts
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates for the last counts
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.
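
A hedged, more compact alternative (the sample data here is made up; TYPE and RESULT follow the question, and the COUNT_TRUE/COUNT_FALSE labels are assumptions) is pd.crosstab, which counts the (TYPE, RESULT) pairs and pivots RESULT into columns in one step:

import pandas as pd

# Hypothetical stand-in for df_analysis from the question.
df_analysis = pd.DataFrame({
    'TYPE': ['FORWARD', 'FORWARD', 'BACKWARD', 'LEFT'],
    'RESULT': [True, False, True, False],
})

counts = pd.crosstab(df_analysis['TYPE'], df_analysis['RESULT'])
counts = counts.rename(columns={True: 'COUNT_TRUE', False: 'COUNT_FALSE'})
print(counts)

This skips the intermediate df_counted entirely, so there is no need for the drop_duplicates/join step.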

Creating new column based on condition and extracting respective value from other column. Pandas Dataframe

I am relatively new to this field and am working with a data set to find meaningful insights into customer behavior. My dataset looks like:
customerId week first_trip_week rides
0 156 44 36 2
1 164 44 38 6
2 224 42 36 5
3 224 43 36 4
4 224 44 36 5
What I want to do is create new columns week 44,week 43, week 42 and get the values in the "ride" column to be filled into the rows for the respective customer id. This is in the hope that I can eventually also make the customerId my index and can get denominations for different weeks. Help would be greatly appreciated!
Thank you!!
If I'm understanding you correctly, you want to create new columns in the same dataframe for weeks 44, 43, and 42, with the correct values for each customerId and NaN for those that don't have them. If your original dataframe has all the user data, I would first filter for the rows with the correct week number:
week42DF = dataset.loc[dataset['week']==42,['customerId','rides']].rename(columns={'rides':'week42Rides'})
getting only the rides and customerId and renaming the former here to make things a little easier for us. Then left join the old dataframe and the new one on customerId
dataset = pd.merge(dataset,week42DF,how='left',on='customerId')
The users that are missing from week42DF will have NaN in the week42Rides column in the merged dataset, which you can then replace with zeros using the .fillna(0) method. Do this for each week you require.
See Pandas' documentation on merge and the more general concatenate for more info.
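
A hedged sketch of what that per-week loop might look like (the frame and column names follow the answer; the list of weeks and the sample data are assumptions):

import pandas as pd

# Hypothetical stand-in for the dataset shown in the question.
dataset = pd.DataFrame({
    'customerId': [156, 164, 224, 224, 224],
    'week': [44, 44, 42, 43, 44],
    'rides': [2, 6, 5, 4, 5],
})

for wk in [42, 43, 44]:
    weekDF = (dataset.loc[dataset['week'] == wk, ['customerId', 'rides']]
                     .rename(columns={'rides': f'week{wk}Rides'}))
    dataset = pd.merge(dataset, weekDF, how='left', on='customerId')

dataset = dataset.fillna(0)
print(dataset)

A pivot_table with customerId as the index and week as the columns would reach a similar shape in one step, if the per-customer rows are not needed.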

Python SettingWithCopyWarning, but I'm trying to set the value using .ix

I have a pandas dataframe in python and I'm trying to modify a specific value in a particular row. I found a solution to this problem Set value for particular cell in pandas DataFrame using index, but it is still generating the SettingWithCopy error.
The name of the data frame is internal_df and it has columns 'price', 'visits', and 'orders'. Specifically, I want to add the number of orders and visits to a lower price point if we don't have a sufficient number of visits (100 in this example). Note that below, the variable 'price' is a float, matching the float dtype of the 'price' column in internal_df, while 'visits' and 'orders' are ints.
if int(internal_df[internal_df['price'] == price]['visits']) < 100:
    for index, row in internal_df.iterrows():
        if float(row['price']) > price:
            internal_df.ix[internal_df['price'] == price, 'visits'] = internal_df.ix[internal_df['price'] == price, 'visits'] + row['visits']
            internal_df.ix[internal_df['price'] == price, 'orders'] = internal_df.ix[internal_df['price'] == price, 'orders'] + row['orders']
Here is a sample of the data
price visits sales
0 1399.99 2 0
1 169.99 2 0
2 99.99 1 0
3 99.99 1 0
4 139.99 1 0
5 319.99 1 0
6 198.99 1 0
7 119.99 1 0
8 39.99 1 0
9 259.98 1 0
Does anyone have any suggestions, or should I just ignore the error?
Brad
Note that .ix is deprecated because it indexes by position or by label, depending on the data type of the index. Use .loc or .iloc instead.
This SettingWithCopyWarning might originate from a "get" operation several lines of code above what you've provided. A quick fix might be to find where internal_df is first assigned, and to add .copy() to the end of the assignment statement. For example, if you have internal_df = df[df['colname'] <= value], change that to internal_df = df[df['colname'] <= value].copy() and hopefully that resolves the error.
Also, I think you can do what you're trying to do without a for loop, which would be faster and more readable!
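As a hedged sketch of that loop-free version (the column names and the 100-visit threshold come from the question; the sample data and the chosen price point are made up):

import pandas as pd

# Hypothetical stand-in for internal_df from the question.
internal_df = pd.DataFrame({
    'price':  [1399.99, 169.99, 99.99, 139.99],
    'visits': [2, 2, 1, 1],
    'orders': [0, 1, 0, 0],
})
price = 99.99  # price point being checked

at_price = internal_df['price'] == price
if internal_df.loc[at_price, 'visits'].sum() < 100:
    higher = internal_df['price'] > price
    # Fold the visits/orders from all higher-priced rows into the target price point.
    internal_df.loc[at_price, 'visits'] += internal_df.loc[higher, 'visits'].sum()
    internal_df.loc[at_price, 'orders'] += internal_df.loc[higher, 'orders'].sum()

Because everything goes through .loc on the original frame, it avoids both the iterrows loop and the SettingWithCopyWarning.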

Select minimum value from column A where column B is not in an array

I'm trying to select accesses for patients where d11.xblood is a minimum value grouped by d11.xpid - and where d11.xcaccess_type is not 288, 289, or 292. (d11.xblood is a chronological index of accesses.)
d11.xpid: Patient ID (int)
d11.xblood: Unique chronological index of patients' accesses (int)
d11.xcaccess_type: Unique identifier for accesses (int)
I want to report one row for each d11.xpid where d11.xblood is the minimum (initial access) for its respective d11.xpid . Moreover, I want to exclude the row if the initial access for a d11.xpid has a d11.xcaccess_type value of 288, 289 or 292.
I have tried several variations of this in the Select Expert:
{d11.xblood} = Minimum({d11.xblood},{d11.xpid}) and
not ({d11.xcaccess_type} in [288, 289, 292])
This correctly selects rows with the initial access but eliminates rows where the current access is not in the array. I only want to eliminate rows where the initial access is not in the array. How can I accomplish this?
Sample table:
xpid xblood xcaccess_type
---- ------ -------------
1 98 400
1 49 300
1 152 288
2 33 288
2 155 300
2 70 400
3 40 300
3 45 400
Sample desired output:
xpid xblood xcaccess_type
---- ------ -------------
1 49 300
3 40 300
See that xpid = 2 is not in the output because its minimum value of xblood had an xcaccess_type = 288 which is excluded. Also see that even though xpid = 1 has an xcaccess_type = 288, because there is a lower value of xblood for xpid = 1 where xcaccess_type not in (288,289,292) it is still included.
If you don't want to write a stored procedure or custom SQL to handle this, you could add another Group. Assuming your deepest group (the one closest to the Details section) is sorting based on xpid, you could add a group inside that one which sorts the xcaccess_type from lowest to highest.
Suppress the header and footer for the new group then add this clause to the details section:
({d11.xpid} = PREVIOUS({d11.xpid}))
OR
({d11.xcaccess_type} in [288, 289, 292])
This should modify your report to only ever display the records with the lowest access value per person. And if the lowest access value is one of the three forbidden values, no records will show for that xpid.