groupby does not apply effectively - pandas

Following is the ranking dataframe I am working on:
   Q6              Q17
1  Consultant      NaN
2  Other           NaN
3  Data Scientist  Java
4  Not employed    Python
5  Data Analyst    SQL
I want to:
1. count how many times each programming language occurs for 'Data Scientist' rows and record the frequency in a column 'counts'
2. sort the counts in descending order
3. reset the index and rename Q17 to Language
The following code does not group each Language.
ranking_data = ranking_data[ranking_data.Q6 == 'Data Scientist']
ranking_data_summary = ranking_data.copy().rename(columns={'Q17': 'Language'})
ranking_data_summary['counts'] = ranking_data_summary.groupby('Language')['Language'].transform('count')
ranking_data_summary.sort_values('counts', ascending=False, inplace=True)
ranking_data_summary.reset_index(inplace=True)
What am I doing wrong?
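For reference: transform('count') returns one value per original row, so the frame keeps one row per respondent rather than one row per language, which is why nothing looks grouped. A minimal sketch of a reduction that produces the desired summary, assuming the same ranking_data frame (value_counts already counts and sorts descending):
ds_rows = ranking_data[ranking_data.Q6 == 'Data Scientist']

summary = (ds_rows['Q17']
           .value_counts()               # count per language, sorted descending
           .rename_axis('Language')      # name the index 'Language'
           .reset_index(name='counts'))  # columns: Language, counts
print(summary)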

Compare Values of 2 dataframes conditionally

I have the following problem. I have a dataframe which look like this.
Dataframe1
start end
0 0 2
1 3 7
2 8 9
and another dataframe which looks like this.
Dataframe2
data
1 ...
4 ...
8 ...
11 ...
What I am trying to achieve is following:
For each row in Dataframe1 I want to check if there is any index value in Dataframe2 which is in range(start, end) of Dataframe1.
If the condition is True, I want to store the outcome in a new column "condition".
Since there is the possibility of dealing with large amounts of data, I tried using numpy.select.
Like this:
range_start = df1.start
range_end = df1.end
condition = [
    df2.index.to_series().between(range_start, range_end)
]
choice = ["True"]
df1["condition"] = np.select(condition, choice, default=0)
This gives me an error:
ValueError: Can only compare identically-labeled Series objects
I also tried a list comprehension. That didn't work either. All the things I tried failed because I am dealing with Series (range_start, range_end). There has to be a way to make this work, I think.
I have already searched Stack Overflow for this particular problem, but I wasn't able to find a solution. It could be that I'm just too inexperienced with this type of problem to search for the right solution.
So maybe you can help me out here.
Thank you!
expected output:
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
Use DataFrame.drop_duplicates to remove duplicates in both columns and in the index, create all combinations with DataFrame.merge using a cross join, and finally test for at least one match per pair with GroupBy.any:
df3 = (df1.drop_duplicates(['start','end'])
          .merge(df2.index.drop_duplicates().to_frame(), how='cross'))
# column 0 holds the values of df2's (unnamed) index after the cross join
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print(df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
If all pairs in df1 are unique, it is possible to use:
df3 = df1.merge(df2.index.to_frame(), how='cross')
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print(df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
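Since the question mentions large data and np.select, a NumPy broadcasting sketch (my own alternative, not part of the answer above) can express the same check without materialising the cross join:
import numpy as np

idx = df2.index.to_numpy()                  # shape (m,)
start = df1['start'].to_numpy()[:, None]    # shape (n, 1)
end = df1['end'].to_numpy()[:, None]        # shape (n, 1)

# (n, m) boolean matrix: is each df2 index value inside row i's [start, end] range?
df1['condition'] = ((idx >= start) & (idx <= end)).any(axis=1)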

Creating a new column with an iterating sentence count whenever two row values are simultaneously null (indicating a new sentence is found)

I have a dataframe with words and entities and would like to create a third column which keeps a sentence count for every new sentence found, as shown in the linked example of the desired output.
The condition based on which I would recognize the start of a new sentence is when both the word and entity columns have null values, like at index 4.
0 word entity
1 It O
2 was O
3 fun O
4 NaN NaN
5 from O
6 vodka B-product
So far I have managed to fill the null values with a new_sentence string and have figured out how to make a new column where I can enter a value whenever a new sentence is found, using:
df.fillna("new_sentence", inplace=True)
df['Sentence #'] = np.where(df['word']=='new_sentence', 'S', False)
In the above code, instead of S I would like to fill Sentence: {count} as in the example. What would be the easiest/quickest way to do this? Also, is there a better way to keep a count of sentences in a separate column, as in the example, instead of the method I am trying?
So far I am able to get an output like this
0 word entity Sentence #
1 It O False
2 was O False
3 fun O False
4 new_sentence new_sentence S
5 from O False
6 vodka B-product False
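One possible way (a sketch of my own, not the asker's code) to turn the marker into a running Sentence: {count} label is a cumulative sum over the marker rows:
import numpy as np

# assumes df has already been filled with the "new_sentence" marker as above
is_sep = df['word'].eq('new_sentence')
sent_no = is_sep.cumsum() + 1                 # running sentence number
df['Sentence #'] = np.where(is_sep, 'Sentence: ' + sent_no.astype(str), False)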

Groupby two columns in pandas, and perform operations over totals for each group

The code below:
df = pd.read_csv('./filename.csv', header='infer').dropna()
df.groupby(['category_code','event_type']).event_type.count().head(20)
Returns the following table:
How can I obtain, for all the sub groups under event_type that have both "purchase" and "view", the ratio between the total of "purchase" and the total of "view"?
In this specific case, for instance, I need a function that returns:
1/57
1/232
3/249
Eventually, I will need to plot this result.
I have been trying for a day, without success. I am still new to pandas, and I searched across every possible forum without finding anything useful.
Next time please consider adding a sample of your data as text instead of as an image. It helps us test.
Anyway, in your case you can combine different dataframe methods, such as groupby, as you have already done, and pivot_table. I used this data just as an example:
category_code event_type
0 A purchase
1 A view
2 B view
3 B view
4 C view
5 D purchase
6 D view
7 D view
You can create a new column from your groupby:
df['event_count'] = df.groupby(['category_code', 'event_type'])\
                      .event_type.transform('count')
Then create a pivot_table
my_table = df.pivot_table(values='event_count',
                          index='category_code',
                          columns='event_type',
                          fill_value=0)
Then, finally, you can calculate the purchase_ratio directly:
my_table['purchase_ratio'] = my_table['purchase'] / my_table['view']
Which results in the following DataFrame:
event_type purchase view purchase_ratio
category_code
A 1 1 1.0
B 0 2 0.0
C 0 1 0.0
D 1 2 0.5
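If the helper column is not needed elsewhere, a pd.crosstab sketch (my own alternative, not part of the answer above) gets the counts and the ratio more directly:
import pandas as pd

# assumes df with the category_code / event_type columns shown above
counts = pd.crosstab(df['category_code'], df['event_type'])
counts['purchase_ratio'] = counts.get('purchase', 0) / counts['view']
print(counts)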

Fill nan's in dataframe after filtering column by names

Can anyone please tell me what the right approach is here to filter (and fill NaN) based on another column? Thanks a lot.
Related link: How to fill dataframe's empty/nan cell with conditional column mean
df
ID Name Industry Expenses
1 Treslam Financial Services 734545
2 Rednimdox Construction nan
3 Lamtone IT Services 567678
4 Stripfind Financial Services nan
5 Openjocon Construction 8678957
6 Villadox Construction 5675676
7 Sumzoomit Construction 231244
8 Abcd Construction nan
9 Stripfind Financial Services nan
df_mean_expenses = df.groupby(['Industry'], as_index=False)['Expenses'].mean()
df_mean_expenses
Industry Expenses
0 Construction 554433.11
1 Financial Services 2362818.48
2 IT Services 149153.46
In order to replace the Construction Expenses NaNs with the Construction row's mean (from df_mean_expenses), I tried two approaches:
1.
df.loc[df['Expenses'].isna(),['Expenses']][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. returns Error: Item wrong length 500 instead of 3!
2.
df['Expenses'][np.isnan(df['Expenses'])][df['Industry'] == 'Construction'] = df_mean_expenses.loc[df_mean_expenses['Industry'] == 'Construction',['Expenses']].values
.. this runs but does not add values to the df.
Expected output:
df
ID Name Industry Expenses
1 Treslam Financial Services 734545
2 Rednimdox Construction 554433.11
3 Lamtone IT Services 567678
4 Stripfind Financial Services nan
5 Openjocon Construction 8678957
6 Villadox Construction 5675676
7 Sumzoomit Construction 231244
8 Abcd Construction 554433.11
9 Stripfind Financial Services nan
Try with transform:
df_mean_expenses = df.groupby('Industry')['Expenses'].transform('mean')
df['Expenses'] = df['Expenses'].fillna(df_mean_expenses[df['Industry'] == 'Construction'])
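Another way to get the expected output (a sketch of my own, assuming only the Construction NaNs should be filled) is a boolean mask with .loc, which avoids the chained-indexing problems of the two attempts above:
mask = (df['Industry'] == 'Construction') & df['Expenses'].isna()
df.loc[mask, 'Expenses'] = df.loc[df['Industry'] == 'Construction', 'Expenses'].mean()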

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the count would be much less than the total number of user_ids, say one less than the total user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because they are of variable lengths. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the lambda's argument is not used in the computation.
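If the samples should come only from the same tenant's users, as the expected output suggests, one possible sketch is to build a pool per tenant first. sample_ids here is a hypothetical helper, and the length rule (at most 10, at most one less than the tenant's user count) is my reading of the question:
import numpy as np
import pandas as pd

def sample_ids(pool):
    # draw a random number of ids (at least 1, at most min(10, len(pool) - 1)) without replacement
    k = max(1, min(10, len(pool) - 1))
    return list(np.random.choice(pool, size=np.random.randint(1, k + 1), replace=False))

pools = df.groupby('tenant_id')['user_id'].agg(list)                  # user_id pool per tenant
df['new_column'] = df['tenant_id'].map(lambda t: sample_ids(pools[t]))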