Unable to compare the datasets [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I am unable to compare the column values of two different dataframes.
The first dataset has 500 rows and the second has 128 rows. A few rows of each are shown below.
First dataset:
Country_name  Weather  President
USA           16       Trump
China         19       Xi
Second dataset:
Country_name  Weather  Currency
North Korea   26       NKT
China         19       Yaun
Dataset 1 has no Currency column, so I want to compare the Country_name columns and, where the country matches, append the Currency value. My final dataframe should look like this:
Country_name  Weather  President  Currency
USA           16       Trump      Dollar
China         19       Xi         Yaun
The final dataframe should include only those countries whose Country_name is present in both datasets, with the corresponding Currency value appended as shown above.

If you only want to keep the records whose Country_name matches in both dataframes and exclude everything else, you can use the merge function with how='inner', which finds the intersection of the two dataframes on a given column:
import pandas as pd

d1 = pd.DataFrame(data=[['USA', 16, 'Trump'], ['China', 19, 'Xi']],
                  columns=['Country_name', 'Weather', 'President'])
d2 = pd.DataFrame(data=[['North Korea', 26, 'NKT'], ['China', 19, 'Yun']],
                  columns=['Country_name', 'Weather', 'Currency'])
# Both frames have a Weather column, so merge suffixes them with _x and _y;
# keep the left-hand copy and drop the right-hand one.
result = pd.merge(d1, d2, on=['Country_name'], how='inner')\
    .rename(columns={'Weather_x': 'Weather'}).drop(['Weather_y'], axis=1)
print(result)
Output:
  Country_name  Weather President Currency
0        China       19        Xi      Yun
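
As a possible variation on the merge above, the sketch below selects only the key column and Currency from the second dataframe before merging, so no Weather_x/Weather_y pair is created in the first place. The frames are the same small samples used in the answer; the 'Yun' spelling is copied from there.

```python
import pandas as pd

# Sample frames mirroring the answer above.
d1 = pd.DataFrame({'Country_name': ['USA', 'China'],
                   'Weather': [16, 19],
                   'President': ['Trump', 'Xi']})
d2 = pd.DataFrame({'Country_name': ['North Korea', 'China'],
                   'Weather': [26, 19],
                   'Currency': ['NKT', 'Yun']})

# Bring over only the key and the new column, so the shared Weather
# column is never duplicated and needs no rename/drop afterwards.
result = d1.merge(d2[['Country_name', 'Currency']], on='Country_name', how='inner')
print(result)
```

With how='left' instead of how='inner', every row of d1 would be kept and countries missing from d2 (USA here) would get NaN in Currency.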

Related

How to change index name in multiindex groupby object with condition?

I need to change the level-0 index ('Product Group') of a pandas groupby object based on a condition (the sum of the related values in the 'Sales' column).
Since the code is very long and depends on some files, I'll copy only the output.
The last line of code is:
tdk_regions = tdk[['Region', 'Sales', 'Product Group']].groupby(['Product Group', 'Region']).sum()
The output looks like this:
Product Group                    Region   Sales
ALUMINUM & FILM CAPACITORS BG    America  7.425599e+07
                                 China    2.249969e+08
                                 Europe   2.404613e+08
                                 India    6.034134e+07
                                 Japan    7.667371e+06
...                              ...      ...
TEMPERATURE&PRESSURE SENSORS BG  Europe   1.308471e+08
                                 India    3.077273e+06
                                 Japan    2.851744e+07
                                 Korea    1.309189e+06
                                 OSEAN    1.258075e+07

Try MultiIndex.rename:
df.index.rename("New Name", level=0, inplace=True)
print(df)
Prints:
                                        Sales
New Name                       Region
ALUMINUM & FILM CAPACITORS BG  America  74255990.0
                               China    224996900.0
                               Europe   240461300.0
                               India    60341340.0
                               Japan    7667371.0
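
For anyone who prefers not to modify the index in place, here is a minimal sketch of the same level rename using set_names, which returns a new index. The tiny tdk frame is a hypothetical stand-in for the question's data, which was not posted.

```python
import pandas as pd

# Hypothetical stand-in for the `tdk` frame from the question.
tdk = pd.DataFrame({
    'Product Group': ['A', 'A', 'B', 'B'],
    'Region': ['China', 'Europe', 'China', 'Japan'],
    'Sales': [10.0, 20.0, 5.0, 7.0],
})
tdk_regions = tdk[['Region', 'Sales', 'Product Group']].groupby(['Product Group', 'Region']).sum()

# set_names returns a new MultiIndex with level 0 renamed; assign it back.
tdk_regions.index = tdk_regions.index.set_names('New Name', level=0)
print(tdk_regions)
```

Note that this changes only the name of the level, not the labels inside it.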

Applying a function to list of columns of a dataframe?

I scraped this table from this URL:
"https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/"
Which looks like this:
   State       Annual Mean Wage (All Occupations)  Median Monthly Rent  Value of a Dollar
0  Alabama     $44,930                             $998                 $1.15
1  Alaska      $59,290                             $1,748               $0.95
2  Arizona     $50,930                             $1,356               $1.04
3  Arkansas    $42,690                             $953                 $1.15
4  California  $61,290                             $2,518               $0.87
And then I wrote this function to help me turn the strings into ints:
def money_string_to_int(s):
    return int(s.replace(",", "").replace("$", ""))
money_string_to_int("$1,23")
My function works when I apply it to only one column. I found this answer here about using on multiple columns: How to apply a function to multiple columns in Pandas
But my code below does not work and produces no errors:
ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent',
      'Value of a Dollar']
ppe_table[ls] = ppe_table[ls].apply(money_string_to_int)

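Let's try this (regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default):
df.set_index('State').apply(lambda x: x.str.replace('[$,]', '', regex=True).astype(float)).reset_index()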
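
As a rough, self-contained illustration of the same idea applied only to the listed columns, the sketch below rebuilds two rows of the question's table by hand. The names ppe_table and ls come from the question; converting to float rather than int is a choice made here because 'Value of a Dollar' holds decimals.

```python
import pandas as pd

# Two rows copied from the table in the question.
ppe_table = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Annual Mean Wage (All Occupations)': ['$44,930', '$59,290'],
    'Median Monthly Rent': ['$998', '$1,748'],
    'Value of a Dollar': ['$1.15', '$0.95'],
})

ls = ['Annual Mean Wage (All Occupations)', 'Median Monthly Rent', 'Value of a Dollar']

# DataFrame.apply hands each column to the lambda as a Series, so the
# vectorised .str accessor is used instead of a scalar string function.
ppe_table[ls] = ppe_table[ls].apply(
    lambda col: col.str.replace('[$,]', '', regex=True).astype(float)
)
print(ppe_table.dtypes)
```

The original money_string_to_int expects a single string, which is why it works with Series.apply on one column but not with DataFrame.apply, where each call receives a whole column.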

Divide rows in two columns with Pandas

I am using Pandas.
For each row, regardless of the County, I would like to divide "AcresBurned" by "CrewsInvolved".
For each County, I would like to sum the total AcresBurned for that County and divide by the sum of the total CrewsInvolved for that County.
I just started coding and am not able to solve this. Please help. Thank you so much.
Counties  AcresBurned  CrewsInvolved
1         400          2
2         500          3
3         600          5
1         800          9
2         850          8

This is very simple with pandas. You can create a new column for the per-row ratio:
df['Acres_per_Crew'] = df['AcresBurned'] / df['CrewsInvolved']
You can then use groupby to get the sums of AcresBurned and CrewsInvolved for each county (note the double brackets when selecting several columns):
df_gb = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum().reset_index()
df_gb.columns = ['Counties', 'AcresBurnedPerCounty', 'CrewsInvolvedPerCounty']
df = df.merge(df_gb, on='Counties')
Once you've done this, you can create another column with a similar arithmetic operation that divides AcresBurnedPerCounty by CrewsInvolvedPerCounty.
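
Put together, the two calculations might look like the sketch below, which uses the five sample rows from the question. The new column names Acres_per_Crew and Acres_per_Crew_County are only illustrative.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    'Counties': [1, 2, 3, 1, 2],
    'AcresBurned': [400, 500, 600, 800, 850],
    'CrewsInvolved': [2, 3, 5, 9, 8],
})

# Per-row ratio, regardless of county.
df['Acres_per_Crew'] = df['AcresBurned'] / df['CrewsInvolved']

# Per-county totals, then the ratio of those totals.
county = df.groupby('Counties')[['AcresBurned', 'CrewsInvolved']].sum()
county['Acres_per_Crew_County'] = county['AcresBurned'] / county['CrewsInvolved']
print(county)
```

The per-county figure is the ratio of the sums, which is generally different from the average of the per-row ratios.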

Averaging dataframes with many string columns and display back all the columns

I have struggled with this, and looking at various past answers has not helped.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display my data in the GUI together with the information in the non-numeric columns. The non-numeric columns hold information such as names, roll numbers, and stream, while the numeric columns contain students' marks for various subjects. This works well with a single dataframe but fails when I combine two or more dataframes: it returns only the averages of the numeric columns and leaves the non-numeric columns undisplayed. Below is one of the versions I've tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(df.index, level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
Working code should return the averages in something like this:
   STREAM  ADM  NAME        KCPE  ENG  KIS
0  EAGLE   663  FLOYCE ATI  250   43   5
1  EAGLE   664  VERONICA    252   32   33
2  EAGLE   665  MACREEN A   341   23   23
3  EAGLE   666  BRIDGIT     286   23   2
Rather than:
   ADM    KCPE   ENG   KIS
0  663.0  250.0  27.5  18.5
1  664.0  252.0  26.5  33.0
2  665.0  341.0  17.5  22.5
3  666.0  286.0  38.5  23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})

Your question is not entirely clear, so I am guessing at the intent from its content. Your sample dataframes were not well formed (unquoted strings and a STREAM list of only two items), so I have modified them by adding a stream called 'CENTRAL':
Df1 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439], 'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349], 'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439], 'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349], 'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23]})
I have assumed you want to combine the two dataframes and find the average. Grouping by the non-numeric columns keeps them in the result (pd.concat is used because DataFrame.append was removed in pandas 2.0):
df3 = pd.concat([Df2, Df1])
df3.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
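
To see what this produces, here is a self-contained sketch of the approach using the modified sample frames. It stacks the frames with pd.concat (DataFrame.append no longer exists in pandas 2.0 and later) and calls mean(), since the stated goal is an average; the variable names combined and averages are illustrative.

```python
import pandas as pd

Df1 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23]})

# Stack the two frames on top of each other.
combined = pd.concat([Df1, Df2], ignore_index=True)

# Grouping by the identifying columns keeps them in the output, while the
# remaining numeric columns are averaged within each group.
averages = combined.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
print(averages)
```

Because STREAM, ADM and NAME are group keys, they appear as ordinary columns next to the averaged marks and can be passed to the table widget like any other column.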

Multiplying groupby elements for each group in survey

I am working on the Stack Overflow 2019 Survey data, which has many columns.
I want to carry out this calculation: "sum of Age1stCode" / "number of people of that age".
Age1stCode is a survey column that records when the respondent first started coding. Age is the respondent's age in years.
I have grouped the data by Age.
I just want to multiply each value by its count and then sum the products. For instance, for age 11: (6x3) + (7x3) + (9x2) + ... + (8x1). I want to do this for each age group, so that in the end I get an output like the attached file: Age 11.0 -> 326 (a made-up number, just as an example), Age 12.0 -> 468.
My goal is to calculate the sum of Age1stCode for each age group.
The attached file shows the output I want to work with.

df_grouped = df.groupby('Age').agg({'Age1stCode': 'sum'}).reset_index()
new_col = df_grouped['Age1stCode'] / df_grouped['Age']
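
The question's ratio divides by the number of people in each age group rather than by the age itself. A sketch of that reading follows, on a made-up miniature of the survey. It assumes Age1stCode may be stored as text and needs converting first, and that "number of people" means the count of usable rows per age; both are assumptions, and the column names total, people and total_per_person are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the survey with only the two relevant columns.
df = pd.DataFrame({
    'Age': [11.0, 11.0, 11.0, 12.0, 12.0],
    'Age1stCode': ['6', '6', '7', '10', '8'],
})

# Assumption: the column is text, so coerce it to numbers; entries that
# cannot be parsed become NaN and are ignored by sum and count.
df['Age1stCode'] = pd.to_numeric(df['Age1stCode'], errors='coerce')

# One row per age: the sum of Age1stCode and the number of respondents.
df_grouped = (
    df.groupby('Age')['Age1stCode']
      .agg(total='sum', people='count')
      .reset_index()
)
df_grouped['total_per_person'] = df_grouped['total'] / df_grouped['people']
print(df_grouped)
```

Summing the raw values per group already gives the same total as multiplying each distinct value by its count and adding the products, so no explicit multiplication step is needed.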
