Pandas seems to be merging the same dataframe twice

I have two dataframes in pandas. One, 'datapanel', has country data for multiple years; the other, 'data', has country data for only one year but also includes a "Regional indicator" column for each country. I simply want to create a new column in the datapanel frame that gives the 'Regional indicator' for each country. For some reason, the number of rows just about doubles after this merge, whereas it should remain the same. What am I doing wrong?

The key (country name) you are merging on is duplicated in 'datapanel' ('Afghanistan' appears at least 5 times) and perhaps also in 'data', which causes the row multiplication.
Try a different technique (a v-lookup), something like this ("Country name" must be unique in 'data'):
for country in data["Country name"].values:
    # Look up the single region value for this country in 'data'
    indicator = data.loc[data["Country name"] == country, "Regional indicator"].item()
    # Write it into every matching row of 'datapanel'
    datapanel.loc[datapanel["Country name"] == country, "Regional indicator"] = indicator
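A vectorized alternative, as a sketch under the same assumption that "Country name" is unique in 'data': build a country-to-region lookup Series and map it, which cannot duplicate rows the way a merge on duplicated keys does.
# Lookup Series indexed by country (assumes unique country names in 'data')
lookup = data.set_index("Country name")["Regional indicator"]
# Map each datapanel row's country to its region
datapanel["Regional indicator"] = datapanel["Country name"].map(lookup)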

Related

Select rows from a dataframe where a column does not take one of the values in a list

I have a list of values that is a subset of all values that a specific column can take:
list1 = ['DHL', 'FDX', 'UPS', 'USPS', 'Others']
Based on the value of the shipping company used for a shipment, I can choose rows (for example):
df2 = df1[df1['ShippingCompany'] == 'DHL']
Now, I need to select 'Others', where rows correspond to all shipping companies other than the ones listed before 'Others'. How do I do this without having to write a long chain of conditions? Mind you, the contents of list1 can change between invocations, since customers can add other values before 'Others'.
I am thinking of the following metacode:
df2=df1[df1['ShippingCompany'] is not in list1[:-1]]
Is this possible in Python?
You can use == for a single value or isin for several.
For a single value:
df2 = df1[df1['ShippingCompany'] == list1[-1]]
For all rows whose company is not among the values before 'Others':
df2 = df1[~df1['ShippingCompany'].isin(list1[:-1])]
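A minimal runnable sketch of the ~isin pattern (the dataframe contents here are made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'ShippingCompany': ['DHL', 'XYZ', 'UPS', 'ABC'],
                    'Shipment': [1, 2, 3, 4]})
list1 = ['DHL', 'FDX', 'UPS', 'USPS', 'Others']

# ~ negates the mask, keeping companies that are NOT in list1[:-1]
df2 = df1[~df1['ShippingCompany'].isin(list1[:-1])]
print(df2)  # rows for 'XYZ' and 'ABC'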

Use value_counts with a groupby function based on a condition in pandas dataframe and insert into a new column

I have a large data set with multiple rows, shown in the attached picture. I am trying to use value_counts to get the count of Males and Females (in the Gender column) per Country. I also need a condition on whether the Hague field is Yes or No, so there is a Male/No and a Male/Yes count, and the same for Female. I am trying a groupby on the countries and Hague, with the formula below:
data_df.groupby('Country')['Gender'].apply(lambda x: x[x == 'M'].count())
Using this, I can get the count of a specific gender per country, but I cannot figure out the condition to split the Hague into 'YES' and 'NO'.
I have also figured out how to bring back only the rows with a particular gender and Hague status, but I cannot figure out how to count the number of male/female in the gender column:
data_df.loc[(data_df['Hague'] == 'NO') & (data_df['Gender'] == 'M')]
[Image of the main dataset, data_df]
Thank you to @wwnde, I have figured out the code!
data_df.loc[(data_df['Hague'] == 'NO') & (data_df['Gender'] == 'M')].groupby('Country')['Gender'].value_counts()
Not sure I understood you, but please try the following. Let us know if it is not what you meant and we will help.
Data
import pandas as pd

df = pd.DataFrame({
    'Hague': ['NO', 'YES', 'YES', 'YES', 'YES', 'YES', 'YES', 'NO', 'YES'],
    'Country': ['AFGHANISTAN', 'ALBANIA', 'ALBANIA', 'ALBANIA', 'ALBANIA',
                'ALBANIA', 'ALBANIA', 'ANTIGUA AND BARBUD', 'ARMENIA'],
    'Age': [12, 2, 4, 3, 3, 9, 3, 12, 1],
    'Gender': ['M', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
})
Filter Hague/Gender
df[(df['Hague'] == 'YES') & (df['Gender'] == 'M')].groupby('Country')['Gender'].value_counts()
Outcome
Country  Gender
ALBANIA  M         4
ARMENIA  M         1
Name: Gender, dtype: int64
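To get all four Hague/Gender combinations per country in one table, a crosstab sketch on the same sample df would also work:
# One column per (Hague, Gender) pair, one row per country
pd.crosstab(df['Country'], [df['Hague'], df['Gender']])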

How to use aggregate with condition in pandas?

I have a dataframe.
The following code works:
stat = working_data.groupby(by=['url', 'bucket_id'], as_index=False)\
                   .agg({'delta': 'max', 'id': 'count'})
Now I need to count ids with different statuses. The status can be "DOWNLOADED", "NOT_DOWNLOADED", or "DOWNLOADING".
I would like a df with columns bucket_id, max, downloaded (how many have "DOWNLOADED" status), not_downloaded (how many have "NOT_DOWNLOADED" status), and downloading (how many have "DOWNLOADING" status). How can I do this?
Input I have:
[image of the input dataframe]
Output I have:
[image of the current output]
As you can see, the count isn't divided by status. But I want to know that there are x downloaded, y not_downloaded, and z downloading for each bucket_id (they should be in separate columns, but the info for one bucket_id should be in one row).
One way is to use assign to create boolean indicator columns, then aggregate those new columns (booleans sum as integers, giving counts):
working_data.assign(downloaded=working_data['status'] == 'DOWNLOADED',
                    not_downloaded=working_data['status'] == 'NOT_DOWNLOADED',
                    downloading=working_data['status'] == 'DOWNLOADING')\
            .groupby(by=['url', 'bucket_id'], as_index=False)\
            .agg({'delta': 'max',
                  'id': 'count',
                  'downloaded': 'sum',
                  'not_downloaded': 'sum',
                  'downloading': 'sum'})
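A minimal runnable sketch of the same idea, on made-up data using the column names from the question:
import pandas as pd

working_data = pd.DataFrame({
    'url': ['a', 'a', 'a', 'b'],
    'bucket_id': [1, 1, 1, 2],
    'id': [10, 11, 12, 13],
    'delta': [5, 7, 6, 2],
    'status': ['DOWNLOADED', 'DOWNLOADED', 'DOWNLOADING', 'NOT_DOWNLOADED'],
})

stat = (working_data
        .assign(downloaded=working_data['status'] == 'DOWNLOADED',
                not_downloaded=working_data['status'] == 'NOT_DOWNLOADED',
                downloading=working_data['status'] == 'DOWNLOADING')
        .groupby(['url', 'bucket_id'], as_index=False)
        .agg({'delta': 'max', 'id': 'count',
              'downloaded': 'sum', 'not_downloaded': 'sum', 'downloading': 'sum'}))
print(stat)  # e.g. url 'a' has 2 downloaded, 1 downloading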

KeyError on pd.merge()

I am trying to merge 2 data-frames ('credit' and 'info') on the column 'id'.
My code for this is:
c.execute('SELECT * FROM "credit"')
credit = c.fetchall()
credit = pd.DataFrame(credit)
c.execute('SELECT * FROM "info"')
info = c.fetchall()
movies_df = pd.DataFrame(info)
movies_df_merge = pd.merge(credit, movies_df, on='id')
The id columns in both tables ('credit' and 'info') are integers, but I am unsure why I keep getting a KeyError on 'id'.
I have also tried:
movies_df_merge = movies_df.merge(credit, on='id')
The way you read both DataFrames is not relevant here.
Just print both DataFrames (if the number of records is big, printing df.head() is enough).
Then look at them. In particular, check whether both DataFrames contain an id column. Maybe one of them is ID, whereas the other is id? The upper/lower case of names matters here.
Check also that the id column in both DataFrames is a "normal" column (not part of the index).
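One common cause with this exact read pattern, offered as a hedged guess: pd.DataFrame(cursor.fetchall()) labels the columns 0, 1, 2, ... rather than with the table's column names, so no 'id' column exists at all. A sketch of two ways around it (assuming c is a DB-API cursor and conn is the connection it came from):
import pandas as pd

# See what the column labels actually are
print(credit.columns.tolist())

# Option 1: take the column names from the cursor description
c.execute('SELECT * FROM "credit"')
credit = pd.DataFrame(c.fetchall(), columns=[col[0] for col in c.description])

# Option 2: let pandas run the query and keep the names
credit = pd.read_sql_query('SELECT * FROM "credit"', conn)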

How to group by two columns and get a sum down a third column

I have a dataframe where each row is a prescription event and contains a drug name, a postcode, and the quantity prescribed. I need to find the total quantity of every drug prescribed in every postcode.
I need to group by postcode, then by drug name, and find the sum of the cells in the "items" column for every group.
This is my function that I want to apply to every postcode group:
def count(group):
    sums = []
    for bnf_name in group['bnf_name']:
        sum_ = group[group['bnf_name'] == bnf_name]['items'].sum()
        sums.append(sum_)
    group['sum'] = sums

merged.groupby('post_code').apply(count).head()
merged.head()
Calling merged.head() returns the original merged dataframe without the new column for sums that I would expect. I think there is something I don't get about the apply() function on a groupby object...
You do not need to define your own function; group by both columns and sum:
merged.groupby(['post_code', 'bnf_name'])['items'].sum()
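A minimal sketch on made-up data, using reset_index to get the totals back as a flat dataframe:
import pandas as pd

merged = pd.DataFrame({
    'post_code': ['AB1', 'AB1', 'AB1', 'CD2'],
    'bnf_name': ['DrugA', 'DrugA', 'DrugB', 'DrugA'],
    'items': [3, 2, 5, 7],
})

# Total items per (post_code, bnf_name) pair
totals = merged.groupby(['post_code', 'bnf_name'])['items'].sum().reset_index()
print(totals)  # AB1/DrugA -> 5, AB1/DrugB -> 5, CD2/DrugA -> 7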