Python Pandas: add missing rows of two dataframes and keep the extra columns [duplicate]

I would like to add the rows missing from dataframe df1 and keep the extra columns' information:
In [183]: df1
Out[183]:
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
In [183]: df2
Out[183]:
City
0 Chicago
1 San Franciso
2 Sao Paulo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha
The desired result after the merge is
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Sao Paulo nan nan
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo nan nan
7 Omaha US N.America
I tried pd.merge(df2, df1, on='City', how='outer') but it returns a KeyError.
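For reference, here is a sketch that reconstructs the two frames from the printouts above (spellings copied as shown, including 'San Franciso'). With 'City' present as a column in both frames, the outer merge itself raises no KeyError, so the error usually means the key is not a column in one of the frames, for example because it is the index:
import pandas as pd

df1 = pd.DataFrame({
    'City': ['Chicago', 'San Franciso', 'Boston', 'London', 'Beijing', 'Omaha'],
    'Country': ['US', 'US', 'US', 'UK', 'China', 'US'],
    'Region': ['N.America', 'N.America', 'N.America', 'Europe', 'Asia', 'N.America'],
})
df2 = pd.DataFrame({'City': ['Chicago', 'San Franciso', 'Sao Paulo', 'Boston',
                             'London', 'Beijing', 'Tokyo', 'Omaha']})

# If 'City' ended up as the index rather than a column, reset it first:
# df2 = df2.reset_index()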

Try the code below, using pd.merge with a left join to get your desired output:
merged = pd.merge(df2, df1, how='left', on='City')
print(merged)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
If you use an outer join instead, you get this result:
merged_outer = pd.merge(df2, df1, on='City', how='outer')
print(merged_outer)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
8 San Franciso US N.America
Note row 8: this answer's df2 spells the city 'San Fransicsco', which does not match df1's 'San Franciso', so the outer join keeps both unmatched rows. df1 and df2 as used in this answer, respectively:
df1
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
df2
City
0 Chicago
1 San Fransicsco
2 Sao Paolo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha

Related

Pandas removing rows with incomplete time series in panel data

I have a dataframe along the lines of the below:
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
4 Germany Italy 2000
5 Germany Italy 2002
6 Mexico Canada 2000
7 Mexico Canada 2001
8 Mexico Canada 2002
9 US France 2000
10 US France 2001
11 Greece Italy 2000
12 Greece Italy 2001
I want to keep only the rows in which there are observations for the entire time series (2000-2002). So, the end result would be:
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
4 Mexico Canada 2000
5 Mexico Canada 2001
6 Mexico Canada 2002
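For a runnable setup, here is a sketch that rebuilds the sample frame from the printout above:
import pandas as pd

df = pd.DataFrame({
    'Country1': ['Italy', 'Italy', 'Italy', 'Germany', 'Germany',
                 'Mexico', 'Mexico', 'Mexico', 'US', 'US', 'Greece', 'Greece'],
    'Country2': ['Greece', 'Greece', 'Greece', 'Italy', 'Italy',
                 'Canada', 'Canada', 'Canada', 'France', 'France', 'Italy', 'Italy'],
    'Year': [2000, 2001, 2002, 2000, 2002, 2000, 2001, 2002, 2000, 2001, 2000, 2001],
})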
One idea is to reshape with crosstab and test whether each row has no 0 values using DataFrame.ne with DataFrame.all, convert the index to a DataFrame with MultiIndex.to_frame, and finally get the filtered rows with DataFrame.merge:
df1 = pd.crosstab([df['Country1'], df['Country2']], df['Year'])
df = df.merge(df1.index[df1.ne(0).all(axis=1)].to_frame(index=False))
print (df)
Country1 Country2 Year
0 Italy Greece 2000
1 Italy Greece 2001
2 Italy Greece 2002
3 Mexico Canada 2000
4 Mexico Canada 2001
5 Mexico Canada 2002
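To see why this works, the intermediate crosstab for the sample data looks like this (output sketched as comments):
print(pd.crosstab([df['Country1'], df['Country2']], df['Year']))
# Year               2000  2001  2002
# Country1 Country2
# Germany  Italy        1     0     1
# Greece   Italy        1     1     0
# Italy    Greece       1     1     1
# Mexico   Canada       1     1     1
# US       France       1     1     0
# Only rows with no zeros (complete 2000-2002 series) survive the filter.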
Or, if you need to test a specific range, you can compare sets in GroupBy.transform:
r = set(range(2000, 2003))
df = df[df.groupby(['Country1', 'Country2'])['Year'].transform(lambda x: set(x) == r)]
print (df)
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
6 Mexico Canada 2000
7 Mexico Canada 2001
8 Mexico Canada 2002
One option is to pivot the data, drop null rows, and reshape back; this only works if the combination of Country1/Country2 and Year is unique (in the sample data it is):
(df.assign(dummy=1)
   .pivot(index=['Country1', 'Country2'], columns='Year')
   .dropna()
   .stack()
   .drop(columns='dummy')
   .reset_index()
)
Country1 Country2 Year
0 Italy Greece 2000
1 Italy Greece 2001
2 Italy Greece 2002
3 Mexico Canada 2000
4 Mexico Canada 2001
5 Mexico Canada 2002
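A further variant, not from the answers above but common for this pattern, uses GroupBy.filter to keep only the complete groups:
required = set(range(2000, 2003))

# keep a (Country1, Country2) pair only if its years cover the whole range
out = df.groupby(['Country1', 'Country2']).filter(
    lambda g: required <= set(g['Year'])
)
print(out)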

Map column values in one dataframe to an index of another dataframe and extract values [duplicate]

I am in my first few weeks of learning pandas and need help with a problem I am stuck with.
I have 2 dataframes as listed below:
df1 = pd.DataFrame({
    'City': ['Chicago', 'Atlanta', 'Dallas', 'Atlanta', 'Chicago', 'Boston', 'Dallas', 'El Paso', 'Atlanta'],
    'State': ['IL', 'GA', 'TX', 'GA', 'IL', 'MA', 'TX', 'TX', 'GA'],
    'Population': [8865000, 523738, 6301000, 523738, 8865000, 4309000, 6301000, 951000, 523738]
}, columns=['City', 'State', 'Population'])
df1
City State Population
0 Chicago IL 8865000
1 Atlanta GA 523738
2 Dallas TX 6301000
3 Atlanta GA 523738
4 Chicago IL 8865000
5 Boston MA 4309000
6 Dallas TX 6301000
7 El Paso TX 951000
8 Atlanta GA 523738
df2 = pd.DataFrame({
    'Airport': ['Hartsfield', 'Logan', 'O Hare', 'DFW'],
    'M_Code': [78, 26, 52, 39]
}, index=['Atlanta', 'Boston', 'Chicago', 'Dallas'])
df2
Airport M_Code
Atlanta Hartsfield 78
Boston Logan 26
Chicago O Hare 52
Dallas DFW 39
Expected output is:
df1
City State Population M_Code City_indexed_in_df2
0 Chicago IL 8865000 52 True
1 Atlanta GA 523738 78 True
2 Dallas TX 6301000 39 True
3 Atlanta GA 523738 78 True
4 Chicago IL 8865000 52 True
5 Boston MA 4309000 26 True
6 Dallas TX 6301000 39 True
7 El Paso TX 951000 NaN False
8 Atlanta GA 523738 78 True
I started with:
df1.loc[df1.City.isin(df2.index),:]
City State Population
0 Chicago IL 8865000
1 Atlanta GA 523738
2 Dallas TX 6301000
3 Atlanta GA 523738
4 Chicago IL 8865000
5 Boston MA 4309000
6 Dallas TX 6301000
8 Atlanta GA 523738
As expected this filters out the row with El Paso.
But I am not able to come up with the code to do the following:
For every df1.City I need to lookup on df2.index and if found:
Extract df2.M_Code and insert the value to a new column df1.M_Code
Insert boolean result to a new column df1.City_indexed_in_df2
Can someone help me with how I can achieve this?
In addition, my thought is that creating a unique array from df1.City and then doing a lookup on df2.index may give improved performance (being a novice, I haven't figured out how to do this beyond extracting the unique array below).
arr = df1.City.unique()
array(['Chicago', 'Atlanta', 'Dallas', 'Boston', 'El Paso'], dtype=object)
Suggestions on changing the solution approach will be great too.
You can do this: merge with how='left' and then create the boolean column using notna():
df = df1.merge(df2, left_on=['City'], right_index=True, how='left')
df['City_indexed_in_df2'] = df['M_Code'].notna()
print(df)
City State Population Airport M_Code City_indexed_in_df2
0 Chicago IL 8865000 O Hare 52.0 True
1 Atlanta GA 523738 Hartsfield 78.0 True
2 Dallas TX 6301000 DFW 39.0 True
3 Atlanta GA 523738 Hartsfield 78.0 True
4 Chicago IL 8865000 O Hare 52.0 True
5 Boston MA 4309000 Logan 26.0 True
6 Dallas TX 6301000 DFW 39.0 True
7 El Paso TX 951000 NaN NaN False
8 Atlanta GA 523738 Hartsfield 78.0 True
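Since df2 is indexed by city, an index-lookup alternative (not the merge above, just another common approach) maps each city directly onto df2's columns; Series.map performs one index lookup per row, so pre-computing the unique array from the question is unnecessary:
# df2['M_Code'] is a Series indexed by city name, so .map() looks each city up;
# cities absent from df2's index come back as NaN
df1['M_Code'] = df1['City'].map(df2['M_Code'])
df1['City_indexed_in_df2'] = df1['City'].isin(df2.index)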

Subtract rows in indexed dataframe

I am currently working with this data frame. It is indexed by year and Country. What I would like to do is subtract the "military_exp" values for year 2010 from the "military_exp" values for year 2011. Is there a way of doing this?
gdp_share military_exp
year Country
2010 USA 5.0 768465792.0
China 2.0 138028416.0
Korea 3.0 31117330.0
Russia 4.0 43120560.0
2011 USA 5.0 758988352.0
China 2.0 149022400.0
Korea 3.0 31543720.0
Russia 3.0 46022120.0
IIUC
df.groupby(level=1)['military_exp'].diff()
Out[195]:
year Country
2010 USA NaN
China NaN
Korea NaN
Russia NaN
2011 USA -9477440.0
China 10993984.0
Korea 426390.0
Russia 2901560.0
Name: military_exp, dtype: float64
Update
df.loc[2011,'military_exp']-df.loc[2010,'military_exp']
Out[197]:
Country
USA -9477440.0
China 10993984.0
Korea 426390.0
Russia 2901560.0
Name: military_exp, dtype: float64
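Another way to see the same computation (a sketch, assuming the index levels are named year and Country as in the printout) is to unstack the year level so that ordinary column arithmetic applies:
exp = df['military_exp'].unstack('year')  # countries as rows, years as columns
delta = exp[2011] - exp[2010]             # same result as the .loc subtraction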

Calculated column - SQL - Football Teams

I'd like to add a new second column to a 'teams' table which is representative of premier league (UK) football rankings. At the moment the table just contains the names of each football team.
The column will be called 'Played' and it will list the number of games each team has played. I'd like to calculate this number (integer data type) from a separate table called 'games', which records a historic log of games fixtures. This would probably include using SQL's native 'COUNT' function.
I have tried to use a function to do this, but currently every value comes out as '0':
CREATE FUNCTION [dbo].[GetPlayed](@Team VARCHAR)
RETURNS INT
AS
BEGIN
    RETURN (SELECT COUNT(*)
            FROM games
            WHERE games.Home = @Team OR games.Away = @Team);
END;

ALTER TABLE teams
ADD Played AS dbo.GetPlayed(Team);
The tables:
teams:
```
Team
Arsenal
Bournemouth
Burnley
Chelsea
Crystal Palace
Everton
Hull City
Leicester City
Liverpool
Manchester City
Manchester United
Middlesbrough
Southampton
Stoke City
Sunderland
Swansea City
Tottenham Hotspur
Watford
West Bromwich Albion
West Ham United
```
games:
gameID Home HomeScore Away AwayScore GameDate
4 Arsenal 2 Chelsea 0 2018-05-26
5 Arsenal 5 Bournemouth 0 2018-04-22
6 Arsenal 1 Leicester City 1 2018-03-15
7 Bournemouth 5 Liverpool 0 2018-04-22
8 Burnley 5 Bournemouth 0 2018-04-22
9 Burnley 1 Swansea City 2 2017-11-22
10 Stoke City 0 Burnley 0 2018-01-08
11 Chelsea 1 Middlesborough 2 2017-11-22
12 Southampton 0 Chelsea 0 2018-01-01
13 Crystal Palace 1 Everton 2 2018-03-26
14 Manchester United 4 Crystal Palace 0 2018-06-01
15 Crystal Palace 0 Southampton 1 2018-04-16
16 Everton 1 Hull City 2 2017-11-20
17 Manchester City 4 Everton 0 2017-11-20
18 Hull City 0 Burnley 0 2018-06-01
19 Sunderland 2 Hull City 0 2018-06-15
20 Leicester City 3 Tottenham Hotspur 1 2017-09-20
21 Swansea City 2 Leicester City 5 2018-02-15
22 Sunderland 0 Leicester City 1 2018-01-29
23 Liverpool 3 Tottenham Hotspur 0 2018-02-28
24 Stoke City 1 Liverpool 2 2017-09-19
25 Manchester City 2 Manchester United 4 2018-05-02
26 Middlesborough 1 Southampton 1 2018-02-08
27 Stoke City 2 Middlesborough 2 2017-08-19
28 Swansea City 0 Manchester United 5 2018-06-27
29 Sunderland 1 Tottenham Hotspur 2 2017-09-01
Any help would be much appreciated!
Thanks, Rob
VARCHAR without a size defaults to 1 character in a parameter declaration, so you need to change your function declaration:
CREATE FUNCTION [dbo].[GetPlayed](@Team VARCHAR(32))
.....
Without a size, your parameter @Team receives just the first letter of the team value you pass in and, of course, the WHERE clause is then unable to find any matching rows in your games table.

Dictionary of sorted values gets unsorted when transformed into pandas dataframe

I am reading a csv file with the GDP for the 32 states in Mexico from 1940 to 2004. The columns are state names, and GDP values for each year.
Unfortunately, I can't add images just now... but, basically, the dataframe has as columns the following: state_name, 1940, 1950, etc... the values for state_name are the names of each state (as strings), and the values for the rest of the columns are the GDPs per state per year.
So, I am trying to produce a new dataframe in which there is no longer a state_name column, only the columns 1940, 1950, etc., whose values are no longer GDPs but the names of the states ranked by GDP in the given year. The 1940 column of the new dataframe would then list the states not in alphabetical order, as the current output does, but in the GDP order produced by the loop below.
I am using the following loop to sort the entire data frame by each year from 1940 to 2004 (states) and then slice out the state names of the sorted frame (names):
ranks = {}
for year in pibe.columns.values[1:]:
    states = pibe.sort_values(by=year, ascending=False)
    names = states["entidad"]
    ranks[year] = names
The output of this dictionary looks like below:
{'1940': 1 Baja California
22 Quintana Roo
8 Distrito Federal
9 Durango
21 Queretaro
0 Aguascalientes
2 Baja California Sur
...
Name: entidad, dtype: object,
'1950': 22 Quintana Roo
1 Baja California
8 Distrito Federal
2 Baja California Sur
5 Chihuahua...}
So far so good. But when I try to transform the dictionary into a data frame, it somehow overrides my previous sorting and produces an alphabetically ordered list of state names, so the new data frame has every year column populated by the same list of names.
To transform the dictionary into a data frame I am using:
pd.DataFrame(ranks)
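The reason the order disappears: pd.DataFrame aligns a dict of Series on their index labels, and each stored Series still carries the row labels of the original frame, so the constructor reassembles every column in label order. A minimal fix (a sketch against the loop above) is to drop those labels before storing each Series:
ranks = {}
for year in pibe.columns.values[1:]:
    states = pibe.sort_values(by=year, ascending=False)
    # drop the original row labels so pd.DataFrame keeps this ordering
    ranks[year] = states['entidad'].reset_index(drop=True)

ranked = pd.DataFrame(ranks)  # columns now keep the per-year GDP ranking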
Create a new dataframe based on the ordering that you need:
In [6]: ordered_df = original_df.sort_values(['Year', 'GDP'], ascending=False)
Create a new dictionary to pass into the final dataframe (this can be done more efficiently):
In [7]: unique_years = {item[1]['Year']:[] for item in ordered_df.iterrows()}
Loop through new dataframe populating the dictionary:
In [8]: for row in ordered_df.iterrows():
   ...:     unique_years[row[1]['Year']].append(row[1]['State'])
Create final dataframe:
In [9]: final_df = pd.DataFrame(unique_years)
Input:
In [11]: original_df
Out[11]:
Year State GDP
0 1945 New York 84
1 1945 Texas 38
2 1945 California 84
3 1946 New York 56
4 1946 Texas 6
5 1946 California 84
6 1947 New York 75
7 1947 Texas 95
8 1947 California 92
9 1948 New York 50
10 1948 Texas 25
11 1948 California 30
12 1949 New York 16
13 1949 Texas 33
14 1949 California 31
15 1950 New York 37
16 1950 Texas 75
17 1950 California 49
18 1951 New York 28
19 1951 Texas 74
20 1951 California 78
21 1952 New York 57
22 1952 Texas 5
23 1952 California 28
Output:
In [12]: final_df
Out[12]:
1945 1946 1947 1948 1949 1950 \
0 New York California Texas New York Texas Texas
1 California New York California California California California
2 Texas Texas New York Texas New York New York
1951 1952
0 California New York
1 Texas California
2 New York Texas
Check final dataframe against the ordered dataframe to ensure proper GDP ordering:
In [13]: ordered_df
Out[13]:
Year State GDP
21 1952 New York 57
23 1952 California 28
22 1952 Texas 5
20 1951 California 78
19 1951 Texas 74
18 1951 New York 28
16 1950 Texas 75
17 1950 California 49
15 1950 New York 37
13 1949 Texas 33
14 1949 California 31
12 1949 New York 16
9 1948 New York 50
11 1948 California 30
10 1948 Texas 25
7 1947 Texas 95
8 1947 California 92
6 1947 New York 75
5 1946 California 84
3 1946 New York 56
4 1946 Texas 6
0 1945 New York 84
2 1945 California 84
1 1945 Texas 38
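For completeness, a shorter route to the same table (a sketch, not the method above): sort within each year and discard the index before assembling, so the DataFrame constructor cannot re-align the rows:
final_df = pd.DataFrame({
    year: grp.sort_values('GDP', ascending=False)['State'].reset_index(drop=True)
    for year, grp in original_df.groupby('Year')
})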