pandas merging two multi-level Series

I have two multi-level Series and would like to merge them on both index levels. The first Series looks like this:
                          # of restaurants
BORO     CUISINE
BRONX    American                      425
         Chinese                       330
         Pizza                         206
BROOKLYN American                     1254
         Chinese                       750
         Cafe/Coffee/Tea               350
The second one has more rows and looks like this:
                          # of votes
BORO     CUISINE
BRONX    American                2425
         Caribbean                320
         Chinese                 3130
         Pizza                   3336
BROOKLYN American               21254
         Caribbean               2320
         Chinese                 7250
         Cafe/Coffee/Tea         3350
         Pizza                  13336

Setup:
import pandas as pd

s1 = pd.Series({('BRONX', 'American'): 425, ('BROOKLYN', 'Chinese'): 750, ('BROOKLYN', 'Cafe/Coffee/Tea'): 350, ('BRONX', 'Pizza'): 206, ('BROOKLYN', 'American'): 1254, ('BRONX', 'Chinese'): 330})
s2 = pd.Series({('BRONX', 'Caribbean'): 320, ('BRONX', 'American'): 2425, ('BROOKLYN', 'Chinese'): 7250, ('BROOKLYN', 'Cafe/Coffee/Tea'): 3350, ('BRONX', 'Pizza'): 3336, ('BROOKLYN', 'American'): 21254, ('BROOKLYN', 'Pizza'): 13336, ('BRONX', 'Chinese'): 3130, ('BROOKLYN', 'Caribbean'): 2320})
s1 = s1.rename_axis(['BORO','CUISINE']).rename('restaurants')
s2 = s2.rename_axis(['BORO','CUISINE']).rename('votes')
print(s1)
BORO      CUISINE
BRONX     American            425
          Chinese             330
          Pizza               206
BROOKLYN  American           1254
          Chinese             750
          Cafe/Coffee/Tea     350
Name: restaurants, dtype: int64
print(s2)
BORO      CUISINE
BRONX     American            2425
          Caribbean            320
          Chinese             3130
          Pizza               3336
BROOKLYN  American           21254
          Caribbean           2320
          Chinese             7250
          Cafe/Coffee/Tea     3350
          Pizza              13336
Name: votes, dtype: int64
Use concat with the join parameter if you need an inner join:
print(pd.concat([s1, s2], axis=1, join='inner'))
                          restaurants  votes
BORO     CUISINE
BRONX    American                 425   2425
         Chinese                  330   3130
         Pizza                    206   3336
BROOKLYN American                1254  21254
         Cafe/Coffee/Tea          350   3350
         Chinese                  750   7250
# join='outer' is the default, so it can be omitted
print(pd.concat([s1, s2], axis=1))
                          restaurants  votes
BORO     CUISINE
BRONX    American               425.0   2425
         Caribbean                NaN    320
         Chinese                330.0   3130
         Pizza                  206.0   3336
BROOKLYN American              1254.0  21254
         Cafe/Coffee/Tea        350.0   3350
         Caribbean                NaN   2320
         Chinese                750.0   7250
         Pizza                    NaN  13336
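Note that restaurants is upcast to float64 in the outer join because NaN is a float. If you would rather keep integer dtype and treat the missing combinations as zero counts (an assumption about your data, not part of the original answer), you can fill before casting:
df = pd.concat([s1, s2], axis=1).fillna(0).astype(int)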
Another solution is to use merge with reset_index:
# how='inner' is the default, so it can be omitted
print(pd.merge(s1.reset_index(), s2.reset_index(), on=['BORO','CUISINE']))
BORO CUISINE restaurants votes
0 BRONX American 425 2425
1 BRONX Chinese 330 3130
2 BRONX Pizza 206 3336
3 BROOKLYN American 1254 21254
4 BROOKLYN Chinese 750 7250
5 BROOKLYN Cafe/Coffee/Tea 350 3350
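If you want the merged result back with the original MultiIndex rather than the default 0..n-1 index, you can chain set_index (a small addition to the answer above):
print(pd.merge(s1.reset_index(), s2.reset_index(), on=['BORO','CUISINE']).set_index(['BORO','CUISINE']))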
# outer join
print(pd.merge(s1.reset_index(), s2.reset_index(), on=['BORO','CUISINE'], how='outer'))
BORO CUISINE restaurants votes
0 BRONX American 425.0 2425
1 BRONX Chinese 330.0 3130
2 BRONX Pizza 206.0 3336
3 BROOKLYN American 1254.0 21254
4 BROOKLYN Chinese 750.0 7250
5 BROOKLYN Cafe/Coffee/Tea 350.0 3350
6 BRONX Caribbean NaN 320
7 BROOKLYN Caribbean NaN 2320
8 BROOKLYN Pizza NaN 13336
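Since pandas 0.24, merge also accepts named Series directly, so you can join on the indexes and skip the reset_index round-trip entirely (a small variation, assuming a recent pandas version):
print(pd.merge(s1, s2, left_index=True, right_index=True, how='outer'))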

Related

Pandas - add row with inverted values based on condition

In a dataframe like this:
...
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
427 3b0947fb England Poland Home
...
how can I select one dataframe slice, based on a column value (df[df['team']=='England']), like this:
...
match team opponent venue
559 4a3eae2f England Poland Home
...
And add inverted rows of that slice to the original dataframe, swapping 'Home' and 'Away', ending up with:
...
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
559 3b0947fb England Poland Home
560 3b0947fb Poland England Away
...
Note: This slice should contain n rows and produce n inverted rows.
You can use:
df2 = df[df['team'].eq('England')].copy()
df2[['team', 'opponent']] = df2[['opponent', 'team']]
df2['venue'] = df2['venue'].map({'Home': 'Away', 'Away': 'Home'})
out = pd.concat([df, df2])
print(out)
Output:
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
427 3b0947fb England Poland Home
427 3b0947fb Poland England Away
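pd.concat keeps the original row labels, which is why 427 appears twice above. If you want fresh sequential labels instead (the question's 559/560 numbering depends on context not shown, so this is just one option), pass ignore_index=True:
out = pd.concat([df, df2], ignore_index=True)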
If you want to invert all rows:
df2 = df.copy()
df2[['team', 'opponent']] = df2[['opponent', 'team']]
df2['venue'] = df2['venue'].map({'Home': 'Away', 'Away': 'Home'})
out = pd.concat([df, df2])
Output:
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
427 3b0947fb England Poland Home
233 3b0345fb Argentina Brazil Away
234 3b2357fb Brazil Argentina Home
427 3b0947fb Poland England Away

Python Pandas: add missing row of two dataframe and keep the extra columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Error: pandas hashtable keyerror
(3 answers)
Closed 2 years ago.
I would like to add the missing rows to dataframe df1 and keep the information in the extra columns.
In [183]: df1
Out[183]:
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
In [183]: df2
Out[183]:
City
0 Chicago
1 San Franciso
2 Sao Paulo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha
The desired result after the merge is
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Sao Paulo nan nan
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo nan nan
7 Omaha US N.America
I am trying pd.merge(df2, df1, on='City', how='outer') but it returns a KeyError.
Try the code below, which uses pd.merge with a left join to produce your desired output:
merged = pd.merge(df2,df1,how='left',on='City')
print(merged)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
Note that San Fransicsco comes back as NaN because its spelling in this df2 differs from San Franciso in df1. If you want to use an outer join, you can get this result using the below code (the spelling mismatch now also produces an extra row 8):
merged_outer = pd.merge(df2, df1, on='City', how='outer')
print(merged_outer)
City Country Region
0 Chicago US N.America
1 San Fransicsco NaN NaN
2 Sao Paolo NaN NaN
3 Boston US N.America
4 London UK Europe
5 Beijing China Asia
6 Tokyo NaN NaN
7 Omaha US N.America
8 San Franciso US N.America
DF1 & DF2 respectively:
df1
City Country Region
0 Chicago US N.America
1 San Franciso US N.America
2 Boston US N.America
3 London UK Europe
4 Beijing China Asia
5 Omaha US N.America
df2
City
0 Chicago
1 San Fransicsco
2 Sao Paolo
3 Boston
4 London
5 Beijing
6 Tokyo
7 Omaha
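As noted above, the NaN for San Fransicsco comes from the spelling mismatch with df1. If that mismatch is a typo rather than a genuinely different city, you could normalize df2 before merging (a hypothetical mapping, shown purely for illustration):
df2['City'] = df2['City'].replace({'San Fransicsco': 'San Franciso', 'Sao Paolo': 'Sao Paulo'})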

extracting data using beautifulsoup from wiki

I'm pretty new to this.
What I am trying to accomplish is a table with districts and their various neighborhoods, but my final code just lists all the neighborhoods in a flat list without assigning them to a specific district.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
print(soup.prettify())
Toronto_table = soup.find('table', {'class': 'wikitable sortable'})
links = Toronto_table.find_all('a')
neighborhoods = []
for link in links:
    neighborhoods.append(link.get('title'))
print(neighborhoods)
df_neighborhoods = pd.DataFrame(neighborhoods)
df_neighborhoods
You can simply use read_html, which returns a list of every table on the page, and print the districts table. It was index 6 at the time of this answer; the index may shift if the article changes.
import pandas as pd

f_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto')
print(f_states[6])
Output:
District Number Neighbourhoods Included
0 C01 Downtown, Harbourfront, Little Italy, Little P...
1 C02 The Annex, Yorkville, South Hill, Summerhill, ...
2 C03 Forest Hill South, Oakwood–Vaughan, Humewood–C...
3 C04 Bedford Park, Lawrence Manor, North Toronto, F...
4 C06 North York, Clanton Park, Bathurst Manor
5 C07 Willowdale, Newtonbrook West, Westminster–Bran...
6 C08 Cabbagetown, St. Lawrence Market, Toronto wate...
7 C09 Moore Park, Rosedale
8 C10 Davisville Village, Midtown Toronto, Lawrence ...
9 C11 Leaside, Thorncliffe Park, Flemingdon Park
10 C13 Don Mills, Parkwoods–Donalda, Victoria Village
11 C14 Newtonbrook East, Willowdale East
12 C15 Hillcrest Village, Bayview Woods – Steeles, Ba...
13 E01 Riverdale, Danforth (Greektown), Leslieville
14 E02 The Beaches, Woodbine Corridor
15 E03 Danforth (Greektown), East York, Playter Estat...
16 E04 The Golden Mile, Dorset Park, Wexford, Maryval...
17 E05 Steeles, L'Amoreaux, Tam O'Shanter – Sullivan
18 E06 Birch Cliff, Oakridge, Hunt Club, Cliffside
19 E08 Scarborough Village, Cliffcrest, Guildwood, Eg...
20 E09 Scarborough City Centre, Woburn, Morningside, ...
21 E10 Rouge (South), Port Union (Centennial Scarboro...
22 E11 Rouge (West), Malvern
23 W01 High Park, South Parkdale, Swansea, Roncesvall...
24 W02 Bloor West Village, Baby Point, The Junction (...
25 W03 Keelesdale, Eglinton West, Rockcliffe–Smythe, ...
26 W04 York, Glen Park, Amesbury (Brookhaven), Pelmo ...
27 W05 Downsview, Humber Summit, Humbermede (Emery), ...
28 W06 New Toronto, Long Branch, Mimico, Alderwood
29 W07 Sunnylea (The Queensway – Humber Bay)
30 W08 The Kingsway, Central Etobicoke, Eringate – Ce...
31 W09 Kingsview Village-The Westway, Richview (Willo...
32 W10 Rexdale, Clairville, Thistletown - Beaumond He...
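If you would rather stay with BeautifulSoup and keep each district paired with its neighbourhoods, here is a minimal sketch. It assumes, as your own find call does, that the districts table is the first 'wikitable sortable' on the page, and that it keeps its two-column District Number / Neighbourhoods Included layout:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto"
soup = BeautifulSoup(urlopen(url), 'lxml')

rows = []
table = soup.find('table', {'class': 'wikitable sortable'})
for tr in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) >= 2:  # ignore malformed rows
        rows.append({'District': cells[0], 'Neighbourhoods': cells[1]})

df_districts = pd.DataFrame(rows)
print(df_districts)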

How does Awk's pattern matching for strings work?

I am trying to understand how range pattern matching works in Awk.
Here is the full data that I am practicing with
Raw Data
-----------------------------------------
USSR 8649 275 Asia
Canada 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe
If I write this code
$ awk '/Asia/, /Europe/' countries.awk
I get
USSR 8649 275 Asia
Canada 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
It doesn't output England.
And If I write this
$ awk '/Europe/, /Asia/' countries.awk
I get
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe
What is the behavior here? Why do I not get England on the first one?
Awk processes input lines one at a time. A range pattern /start/, /end/ prints every line from one matching the start pattern through the next line matching the end pattern, and then looks for the start pattern again. When you used
awk '/Asia/, /Europe/'
Asia matches at lines 3, 5, 8 and 11 (see the numbered listing below) and Europe matches at lines 10, 12 and 13. The first range opens at line 3 and closes at the first following Europe, line 10; the Asia matches at lines 5 and 8 fall inside that range and open nothing new. The second range runs from line 11 to line 12. Line 13, England, is never inside a range, which is why it is not printed in the first case.
But when you used
awk '/Europe/, /Asia/'
the first range opens at line 10 (Europe) and closes at line 11 (Asia). The next Europe, at line 12, opens another range, and since no later line matches Asia, that range runs to the end of the file. That is why England is printed in the second case.
$ cat -n file
1 Raw Data
2 -----------------------------------------
3 USSR 8649 275 Asia
4 Canada 3852 25 North America
5 China 3705 1032 Asia
6 USA 3615 237 North America
7 Brazil 3286 134 South America
8 India 1267 746 Asia
9 Mexico 762 78 North America
10 France 211 55 Europe
11 Japan 144 120 Asia
12 Germany 96 61 Europe
13 England 94 56 Europe
Never use range expressions: they make trivial tasks very slightly briefer, but then need a complete rewrite or duplicated conditions as soon as your requirements change. Always use a flag instead:
awk '/Asia/{f=1} f{print} /Europe/{f=0}' countries.awk
I bet if you started with that you wouldn't even have had to ask this question as the logic is clear and explicit.
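For example, if the requirement changed to printing the ranges without the closing Europe lines, only the placement of the clauses changes (a small illustration of the flag's flexibility, not part of the original answer):
awk '/Asia/{f=1} /Europe/{f=0} f' countries.awk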

Series with count larger than a certain number

Given this code in IPython:
df1 = df["BillingContactCountry"].value_counts()
df1
I get
United States 4138
Germany 1963
United Kingdom 732
Switzerland 528
Australia 459
Canada 369
Japan 344
France 303
Netherlands 285
I want to get a Series with counts larger than 303. What should I do?
You need boolean indexing:
print(df1[df1 > 303])
United States 4138
Germany 1963
United Kingdom 732
Switzerland 528
Australia 459
Canada 369
Japan 344
Name: BillingContactCountry, dtype: int64
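If you prefer a single chain without the intermediate df1, an equivalent variation (using .loc with a callable) is:
print(df["BillingContactCountry"].value_counts().loc[lambda s: s > 303])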