groupby all results without resorting - pandas

Sort in groupby does not work the way I thought it would.
In the following example, I do not want all "USA" rows grouped together, because a "Russia" row sits between two runs of "USA".
from io import StringIO
import pandas as pd

myst = """india, 905034 , 19:44
USA, 905094 , 19:33
Russia, 905154 , 21:56
USA, 345345, 45:55
USA, 34535, 65:45
"""
u_cols = ['country', 'index', 'current_tm']
df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
When I use groupby I get the following:
df.groupby('country', sort=False).size()
country
india 1
USA 3
Russia 1
dtype: int64
Is there any way I can get results something like this...
country
india 1
USA 1
Russia 1
USA 2

You could try this bit of code instead of a direct groupby:
country = []  # initialise lists
count = []
# The groupby key below increases by 1 every time the value in the country column changes.
for i, g in df.groupby([(df.country != df.country.shift()).cumsum()]):
    country.append(g.country.tolist()[0])  # name of the country
    count.append(len(g.country.tolist()))  # number of consecutive appearances of that country
pd.DataFrame(data={'country': country, 'count': count})  # bind the lists into a dataframe
The expression df.groupby([(df.country != df.country.shift()).cumsum()]) groups on a key that assigns a new (cumulatively increasing) number at every change of value in the country column.
In the for loop, i represents the unique cumulative number assigned to each country appearance and g represents the corresponding full row(s) from your original dataframe.
g.country.tolist() outputs a list of the country names for each unique appearance (aka i) i.e.
['india']
['USA']
['Russia']
['USA', 'USA']
for your given data.
Therefore, the first item is the name of the country and the length of the list is the number of consecutive appearances. This information can be recorded in lists and then put together into a dataframe that gives the output you require.
You could also use list comprehensions rather than the for loop:
cumulative_df = df.groupby([(df.country != df.country.shift()).cumsum()]) #The cumulative count dataframe
country = [g.country.tolist()[0] for i,g in cumulative_df] #List comprehension for getting country names.
count = [len(g.country.tolist()) for i,g in cumulative_df] #List comprehension for getting count for each country.
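If you prefer to stay entirely inside pandas, the same consecutive-run key can feed a named aggregation instead of explicit loops. A minimal sketch using the question's data (the output column names country/count are illustrative):

```python
import pandas as pd
from io import StringIO

myst = """india, 905034 , 19:44
USA, 905094 , 19:33
Russia, 905154 , 21:56
USA, 345345, 45:55
USA, 34535, 65:45
"""
df = pd.read_csv(StringIO(myst), sep=',', names=['country', 'index', 'current_tm'])

# Key that increments every time the country changes between consecutive rows
key = (df.country != df.country.shift()).cumsum()

# Named aggregation: first country name of each run, and the run length
out = (df.groupby(key)
         .agg(country=('country', 'first'), count=('country', 'size'))
         .reset_index(drop=True))
print(out)
```

This gives one row per consecutive run (india 1, USA 1, Russia 1, USA 2) without building intermediate lists.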
Reference: Pandas DataFrame: How to groupby consecutive values

Using the trick given in user2285236's comment:
df['Group'] = (df.country != df.country.shift()).cumsum()
df.groupby(['country', 'Group'], sort=False).size()
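With the question's data, that two-level groupby yields one row per consecutive run. A quick sketch (the Group numbers come from the cumsum):

```python
import pandas as pd

df = pd.DataFrame({'country': ['india', 'USA', 'Russia', 'USA', 'USA']})
df['Group'] = (df.country != df.country.shift()).cumsum()

# sort=False keeps the runs in order of first appearance
sizes = df.groupby(['country', 'Group'], sort=False).size()
print(sizes)
# (india, 1) -> 1, (USA, 2) -> 1, (Russia, 3) -> 1, (USA, 4) -> 2
```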

Related

Creating Pandas Series List but don't know how to access Index values?

I am trying to create a Series with the values 1, 2, 3, ... and the corresponding country names "Qatar", "USA", etc. as the index.
Here is the code :
import pandas as pd
import numpy as np
dict_country_fifa = pd.Series([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32],index=["Qatar","Ecuador","Senegal","Netherlands","Argentina","Saudi Arabia","Mexico","Poland","Spain","Costa Rica","Germany","Japan","Brazil","Serbia","Switzerland","Cameroon","England","Iran","USA","Wales","France","Australia","Denmark","Tunisia","Belgium","Canada","Morroco","Croatia","Portugal","Ghana","Uruguay","South Korea"])
print(dict_country_fifa[1])
So if I print(dict_country_fifa[1])
the output I get is 2 (the value at position 1).
But my question is: how do I get the index value, such as Qatar or Ecuador, printed?
I tried following
print(dict_country_fifa.get(3))
output is >>> 4
But I am looking to get the country name returned instead. How do I get the name instead of 4?
You have set the names of the countries as an index of the Series.
Therefore you need to retrieve the names from the index as displayed below.
print(dict_country_fifa.index[3])
Output:
Netherlands
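A small sketch of both directions of lookup, using a shortened version of the Series above:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=["Qatar", "Ecuador", "Senegal", "Netherlands"])

print(s.index[3])          # label at position 3 -> Netherlands
print(s.index[s == 4][0])  # reverse lookup: label whose value is 4 -> Netherlands
```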

Python, pandas - Why discrepancies in totals of unique item counts pivot_table vs groupby

I have a data set with individuals participating multiple times. I would like to count the unique number of individuals by gender. I've done this with a pivot_table and groupby approach and get different values. I can't figure out why. Can you tell me the obvious element which I have overlooked?
Pivot table solution:
Groupby solution:
As you can see, both give the correct values for the specific "gender". Rather, it is the totals that are different. Groupby appears to provide the correct totals whereas pivot_table totals seem off. Why?
This could be your issue: if there are names shared between genders, pivot_table doesn't count the duplicates, but groupby does, as shown in this small example where the name 'A' appears under both the 'M' and 'F' genders.
import pandas as pd
import sidetable
df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'M', 'F', 'T', 'F', 'F'],
    'Name':   ['A', 'A', 'C', 'D', 'E', 'F', 'G', 'H'],
})
piv_df = df.pivot_table(index='Gender',values='Name',aggfunc=pd.Series.nunique,margins=True)
gb_df = df.groupby('Gender').agg({'Name':'nunique'}).stb.subtotal()
print(piv_df)
print(gb_df)
Output
        Name
Gender
F          4
M          3
T          1
All        7

             Name
Gender
F               4
M               3
T               1
grand_total     8
You can test this by running df = df.drop_duplicates('Name') before the pivot and the groupby; the counts should then match if this is the only reason for the differing totals.
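A minimal check of that claim without sidetable (comparing the pivot's All margin with the sum of the groupby counts after de-duplicating names):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'F', 'T', 'F', 'F'],
                   'Name':   ['A', 'A', 'C', 'D', 'E', 'F', 'G', 'H']})

deduped = df.drop_duplicates('Name')  # drops the second 'A' (the F-gender row)
piv = deduped.pivot_table(index='Gender', values='Name',
                          aggfunc=pd.Series.nunique, margins=True)
gb_total = deduped.groupby('Gender')['Name'].nunique().sum()

# Once no name spans two genders, both totals agree
assert piv.loc['All', 'Name'] == gb_total == 7
```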

pandas - map dataframe element with dictionary - how to access nth element of value list

suppose I have a dataframe called df
d = {'country_code':['SP','FR','US']}
df = pd.DataFrame(data = d)
Next to that I have the following dictionary. Note that each key has a list of values instead of a single value:
dictionary = {'SP': ['Spain','Europe'],'IT':['Italy','Europe']}
I know I can use the map function to map the dictionary values in my dataframe:
df['zone'] = df['country_code'].map(dictionary)
However, I would only like to map the second element of each value list instead of the complete list. So for 'SP' in the dataframe I should get 'Europe', not ['Spain','Europe']. I assumed the syntax would be
df['zone'] = df['country_code'].map(dictionary)[1]
but that doesn't work.
Can somebody help?
Regards
Use the str accessor for a hacky solution:
df['zone'] = df['country_code'].map(dictionary).str[1]
print(df)
Output
country_code zone
0 SP Europe
1 FR NaN
2 US NaN
A cleaner alternative is to create a new dictionary from the existing one:
df['zone'] = df['country_code'].map({k : vs[1] for k, vs in dictionary.items()})
print(df)
Output
country_code zone
0 SP Europe
1 FR NaN
2 US NaN
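If you also want to control what the missing codes map to (FR and US become NaN above), one option is mapping with a function and a default. A sketch; the 'unknown' fallback value is my own choice, not from the question:

```python
import pandas as pd

df = pd.DataFrame({'country_code': ['SP', 'FR', 'US']})
dictionary = {'SP': ['Spain', 'Europe'], 'IT': ['Italy', 'Europe']}

# dict.get supplies a default pair for codes missing from the dictionary
df['zone'] = df['country_code'].map(
    lambda c: dictionary.get(c, [None, 'unknown'])[1])
print(df)
```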

map vectorised terms to the original dataframe

I have a dataframe column that contains domain names, e.g. newyorktimes.com. I split on '.' and apply CountVectorizer to "newyorktimes".
The dataframe
domain split country
newyorktimes.com newyorktimes usa
newyorkreport.com newyorkreport usa
"newyorktimes" is also added as a new dataframe column called 'split'
I'm able to get the term frequencies
vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['split'])
features = vectoriser.get_feature_names()
count = X.toarray().sum(axis=0)
dic = dict(zip(features, count))
dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)
But I also need the 'country' information from the original dataframe and I don't know how to map the terms back to the original dataframe.
Expected output
term country domain count
new york usa 2
york times usa 1
york report usa 1
I cannot reproduce the example you provided; I am not sure the input shown is the actual input to the CountVectorizer. If it is just a matter of adding the count matrix back to the data frame, you can do it like this:
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'corpus': ['This is the first document.',
                              'This document is the second document.',
                              'And this is the third one.',
                              'Is this the first document?']})
vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['corpus'])
features = vectoriser.get_feature_names()
pd.concat([df, pd.DataFrame(X.toarray(), columns=features, index=df.index)], axis=1)
                                  corpus  document second  second document
0            This is the first document.                0                0
1  This document is the second document.                1                1
2             And this is the third one.                0                0
3            Is this the first document?                0                0

Pandas dataframe replace contents based on ID from another dataframe

This is what my main dataframe looks like:
Group IDs New ID
1 [N23,N1,N12] N102
2 [N134,N100] N501
I have another dataframe that has all the required ID info in an unordered manner:
ID Name Age
N1 Milo 5
N23 Mark 21
N11 Jacob 22
I would like to modify the original dataframe such that all IDs are replaced with their respective names obtained from the other file. So that the dataframe has only names and no IDs and looks like this:
Group IDs New ID
1 [Mark,Silo,Bond] Niki
2 [Troy,Fangio] Kvyat
Thanks in advance
IIUC you can .explode your lists, replace the values with .map and regroup them with .groupby:
df['IDs'] = (df['IDs'].explode()
                      .map(df1.set_index('ID')['Name'])
                      .groupby(level=0).agg(list))
If New ID column is not a list, you can use only .map()
df['New ID'] = df['New ID'].map(df1.set_index('ID')['Name'])
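Put together with small sample frames (illustrative data: the IDs are adjusted so every one exists in df1, since the question's lists reference IDs missing from the lookup table):

```python
import pandas as pd

df = pd.DataFrame({'Group': [1, 2],
                   'IDs': [['N23', 'N1', 'N11'], ['N11', 'N23']],
                   'New ID': ['N1', 'N23']})
df1 = pd.DataFrame({'ID': ['N1', 'N23', 'N11'],
                    'Name': ['Milo', 'Mark', 'Jacob'],
                    'Age': [5, 21, 22]})

name_by_id = df1.set_index('ID')['Name']  # ID -> Name lookup Series

# Explode the lists, map IDs to names, and rebuild the lists per row
df['IDs'] = (df['IDs'].explode()
                      .map(name_by_id)
                      .groupby(level=0).agg(list))
df['New ID'] = df['New ID'].map(name_by_id)
print(df)
```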
You can try making a dict from your second DF and then replacing on the first using regex patterns (no need to fully understand it, check the comments below).
PS: since you didn't provide the full df with the codes, I created one with some of them; that's why the print() won't replace all the results.
import pandas as pd
# creating dummy dfs
df1 = pd.DataFrame({"Group":[1,2], "IDs":["[N23,N1,N12]", "[N134,N100]"], "New ID":["N102", "N501"] })
df2 = pd.DataFrame({"ID":['N1', "N23", "N11", "N100"], "Name":["Milo", "Mark", "Jacob", "Silo"], "Age":[5,21,22, 44]})
# Create the unique dict we're using regex patterns to make exact match
dict_replace = df2.set_index("ID")['Name'].to_dict()
# 'f' before the string means an f-string and 'r' means a raw string (interpreted as regex)
# \b is a regex word boundary that marks the beginning and end of the match,
## so that if you're searching for N1, it won't match N11
dict_replace = {fr"\b{k}\b": v for k, v in dict_replace.items()}
# Replacing on original where you want it
df1['IDs'].replace(dict_replace, regex=True, inplace=True)
print(df1['IDs'].tolist())
# >>> ['[Mark,Milo,N12]', '[N134,Silo]']
Please note the change in my dataframes: your sample data contains IDs in df that do not exist in df1, so I altered my df to ensure only IDs present in df1 were represented. I use the following df:
print(df)
Group IDs New
0 1 [N23,N1,N11] N102
1 2 [N11,N23] N501
print(df1)
ID Name Age
0 N1 Milo 5
1 N23 Mark 21
2 N11 Jacob 22
Solution
Zip df1.ID and df1.Name into a dict, map it onto an exploded df.IDs, and aggregate the result back into lists.
df['IDs'] = df['IDs'].str.strip('[]')  # strip the square brackets
df['IDs'] = df['IDs'].str.split(',')   # rebuild an actual list so that explode works
# explode the lists, map df1 names onto df and aggregate back into lists
df.explode('IDs').groupby('Group')['IDs'].apply(lambda x: x.map(dict(zip(df1.ID, df1.Name))).tolist()).reset_index()
Group IDs
0 1 [Mark, Milo, Jacob]
1 2 [Jacob, Mark]