Pandas Sequebtial Count of members within a group and the sum - pandas

If I want to have a sequential count within a group I can do something like
df['GID'] = df.groupby(['G_COL1','G_COL2]).cumcount()
I cannot however figure out how to generate a column that contains the total number of values within the group. So if the group had three members df['GID'] would contain 0,1 & 2 and df['COUNT'] would contain the value 3 for each of the three members

df["count_zeros"] = pd.DataFrame((df["GID"]==0)).cumsum()
df["COUNT"] = df.groupby("count_zeros").transform(lambda x: len(x))["GID"]
I think the above gives what you want. The GID column starts from zero whenever a new group starts taking place and then we count how many zeros, i.e. new group "starts" we have with len.
As Scott Boston, commented,
df["COUNT"] = df.groupby("count_zeros")['GID'].transform('count')
works and looks great :)

Related

Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the column weight, but the obvious 10,000 is hindering the actual mean. In this situation I cannot change the value but must work around it, this is what I've got so far, but obviously it's including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
my dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity' so how do I go about the mean while working around that value?
Since you added SQL in your tags, In SQL you'd want to exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10,000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace your value with a NAN and skip it in the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter the values above threshold and use the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334

how to sum rows in my dataframe Pandas with specific condition?

Could anyone help me ?
I want to sum the values with the format:
print (...+....+)
for example:
a b
France 2
Italie 15
Croatie 7
I want to make the sum of France and Croatie.
Thank you for your help !
One of possible solutions:
set column a as the index,
using loc select rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note double square brackets. The outer pair is the "container" of index values
passed to loc.
The inner part, and what is inside, is a list of values.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()

Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met

I have a large dataset pertaining customer churn, where every customer has an unique identifier (encoded key). The dataset is a timeseries, where every customer has one row for every month they have been a customer, so both the date and customer-identifier column naturally contains duplicates. What I am trying to do is to add a new column (called 'churn') and set the column to 0 or 1 based on if it is that specific customer's last month as a customer or not.
I have tried numerous methods to do this, but each and every one fails, either do to tracebacks or they just don't work as intended. It should be noted that I am very new to both python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives tracebacks because dataframegroupby object has no attribute assign.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, but due to the attribute loc
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives tracebacks due to the dataframegroupby
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give tracebacks, but does not work as intended, i.e. it applies 1 to the churn column for the max date for all rows, instead of the max date for the specific customerid - which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
IIUC use GroupBy.transform with max for return maximal values per groups and compare with date column, last set 1,0 values by mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
df2['churn'] = mask.astype(int)

Groupby Get Group For Loop

I have a dataframe that I need to subset by the column measure name.
Measure_Group=measures.groupby('Measure'). I can get.group() like this CDC=Measure_Group.get_group('CDC') , but I have over 20 measures to subset. Is there a for loop or lambda function that I can use with the group by to subset all 20 column names with just one iteration instead of using the get.group multiple times

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding. I am trying to figure out how to check amongst columns and compare those columns to each other for the largest value. And then take the average of those found values.
I greatly appreciate your help in advance!
Picture of the sample data set: 1: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
Sample Data:
What I have tried so far:
df[data_science_experience]=df[["Regression","Classification","Clustering"]].values.max()
df['z']=df[['Regression','Classification','Clustering']].apply(np.max,axis=1)
df[data_science_experience]=df[["Regression","Classification","Clustering"]].apply(np.max,axis=1)
If you want to get the highest score of column 'hw1' you can get it with:
pd['hw1'].max(). this gives you a series of all the values in that column and max returns the maximum. for average use mean:
pd['hw1'].mean()
if you want to find the maximum of multiple columns, you can use:
maximum_list = list()
for col in pd.columns:
maximum_list.append(pd[col].max)
max = maximum_list.max()
avg = maximum_list.mean()
hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we should have a DataFrame where each row represents the highest score of the three categories for each student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()