In a dataframe like this:
...
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
427 3b0947fb England Poland Home
...
how can I select a slice of the dataframe based on a column value (df[df['team'] == 'England']), like this:
...
match team opponent venue
559 4a3eae2f England Poland Home
...
and then add inverted rows of that slice to the original dataframe, swapping 'Home' and 'Away', ending up with:
...
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
559 3b0947fb England Poland Home
560 3b0947fb Poland England Away
...
Note: This slice should contain n rows and produce n inverted rows.
You can use:
import pandas as pd

# take the slice, swap team/opponent, flip the venue, then append
df2 = df[df['team'].eq('England')].copy()
df2[['team', 'opponent']] = df2[['opponent', 'team']]
df2['venue'] = df2['venue'].map({'Home': 'Away', 'Away': 'Home'})
out = pd.concat([df, df2])
print(out)
Output:
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
427 3b0947fb England Poland Home
427 3b0947fb Poland England Away
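Note that the appended rows keep their original labels, which is why 427 appears twice above. If you prefer a fresh sequential index, concat can renumber the rows (this won't reproduce the exact 559/560 labels from the question, but it avoids duplicates):
# ignore_index=True discards the old labels and renumbers from 0
out = pd.concat([df, df2], ignore_index=True)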
If you want to invert all rows:
df2 = df.copy()
df2[['team', 'opponent']] = df2[['opponent', 'team']]
df2['venue'] = df2['venue'].map({'Home': 'Away', 'Away': 'Home'})
out = pd.concat([df, df2])
Output:
match team opponent venue
233 3b0345fb Brazil Argentina Home
234 3b2357fb Argentina Brazil Away
427 3b0947fb England Poland Home
233 3b0345fb Argentina Brazil Away
234 3b2357fb Brazil Argentina Home
427 3b0947fb Poland England Away
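As a self-contained check, here is a runnable sketch that rebuilds the three example rows and applies the invert-all version. The index values and match ids are copied from the sample above; .to_numpy() is used so the swap is assigned positionally and cannot be undone by column-label alignment on older pandas versions:
import pandas as pd

df = pd.DataFrame(
    {'match': ['3b0345fb', '3b2357fb', '3b0947fb'],
     'team': ['Brazil', 'Argentina', 'England'],
     'opponent': ['Argentina', 'Brazil', 'Poland'],
     'venue': ['Home', 'Away', 'Home']},
    index=[233, 234, 427])

df2 = df.copy()
# swap via the underlying array so the values really trade places
df2[['team', 'opponent']] = df2[['opponent', 'team']].to_numpy()
df2['venue'] = df2['venue'].map({'Home': 'Away', 'Away': 'Home'})
out = pd.concat([df, df2])
print(out)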
Data in table:
wkt Partners Team Opponent Runs Balls
1 S Hope & E Lewis WEST INDIES SOUTH AFRICA 43 66
2 S Hope & S Hetmyer WEST INDIES SOUTH AFRICA 70 79
3 D Bravo & S Hetmyer WEST INDIES SOUTH AFRICA 84 97
1 J Malan & Q Kock SOUTH AFRICA WEST INDIES 3 4
2 J Malan & F Plessis SOUTH AFRICA WEST INDIES 32 44
3 J Malan & R Dussen SOUTH AFRICA WEST INDIES 100 90
1 S Dhawan & R Sharma INDIA IRELAND 3 8
2 V Kohli & R Sharma INDIA IRELAND 102 70
I want to return, for each wkt, a single row: the pair of partners, the team they belong to, and the opponent they played against, where Runs is highest for that particular wkt.
For the table above I'd like the following result:
wkt Partners Team Opponent Runs Balls
1 S Hope & E Lewis WEST INDIES SOUTH AFRICA 43 66
2 V Kohli & R Sharma INDIA IRELAND 102 70
3 J Malan & R Dussen SOUTH AFRICA WEST INDIES 100 90
Following is the code that I've used:
SELECT wkt, Partners, Team, Opponent, max(Runs), Balls
FROM Partnerships
GROUP BY wkt
But I'm stuck with the following error:
Column 'Partnerships.Partners' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
How about row_number()?
select p.*
from (select p.*, row_number() over (partition by wkt order by runs desc) as seqnum
from Partnerships p
) p
where seqnum = 1;
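As a side note, if the same data happens to live in a pandas DataFrame rather than a SQL table, the same greatest-runs-per-wkt pick can be sketched with groupby and idxmax. This is just a pandas alternative, not a fix for the SQL; a minimal sketch rebuilding a few of the sample rows:
import pandas as pd

# a few rows copied from the table above
df = pd.DataFrame({
    'wkt': [1, 2, 3, 3, 2],
    'Partners': ['S Hope & E Lewis', 'S Hope & S Hetmyer',
                 'D Bravo & S Hetmyer', 'J Malan & R Dussen',
                 'V Kohli & R Sharma'],
    'Team': ['WEST INDIES', 'WEST INDIES', 'WEST INDIES',
             'SOUTH AFRICA', 'INDIA'],
    'Opponent': ['SOUTH AFRICA', 'SOUTH AFRICA', 'SOUTH AFRICA',
                 'WEST INDIES', 'IRELAND'],
    'Runs': [43, 70, 84, 100, 102],
    'Balls': [66, 79, 97, 90, 70],
})

# keep, per wkt, the row whose Runs is the maximum in that group
top = df.loc[df.groupby('wkt')['Runs'].idxmax()].sort_values('wkt')
print(top)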
I'm pretty new to this.
What I am trying to accomplish is a table with districts and their various neighborhoods, but my final code just lists all neighborhoods without assigning them to a specific district.
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
print(soup.prettify())

Toronto_table = soup.find('table', {'class': 'wikitable sortable'})
links = Toronto_table.find_all('a')

neighborhoods = []
for link in links:
    neighborhoods.append(link.get('title'))
print(neighborhoods)

df_neighborhoods = pd.DataFrame(neighborhoods)
df_neighborhoods
You can simply use read_html and print the table.
import pandas as pd

f_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto')
print(f_states[6])
Output:
District Number Neighbourhoods Included
0 C01 Downtown, Harbourfront, Little Italy, Little P...
1 C02 The Annex, Yorkville, South Hill, Summerhill, ...
2 C03 Forest Hill South, Oakwood–Vaughan, Humewood–C...
3 C04 Bedford Park, Lawrence Manor, North Toronto, F...
4 C06 North York, Clanton Park, Bathurst Manor
5 C07 Willowdale, Newtonbrook West, Westminster–Bran...
6 C08 Cabbagetown, St. Lawrence Market, Toronto wate...
7 C09 Moore Park, Rosedale
8 C10 Davisville Village, Midtown Toronto, Lawrence ...
9 C11 Leaside, Thorncliffe Park, Flemingdon Park
10 C13 Don Mills, Parkwoods–Donalda, Victoria Village
11 C14 Newtonbrook East, Willowdale East
12 C15 Hillcrest Village, Bayview Woods – Steeles, Ba...
13 E01 Riverdale, Danforth (Greektown), Leslieville
14 E02 The Beaches, Woodbine Corridor
15 E03 Danforth (Greektown), East York, Playter Estat...
16 E04 The Golden Mile, Dorset Park, Wexford, Maryval...
17 E05 Steeles, L'Amoreaux, Tam O'Shanter – Sullivan
18 E06 Birch Cliff, Oakridge, Hunt Club, Cliffside
19 E08 Scarborough Village, Cliffcrest, Guildwood, Eg...
20 E09 Scarborough City Centre, Woburn, Morningside, ...
21 E10 Rouge (South), Port Union (Centennial Scarboro...
22 E11 Rouge (West), Malvern
23 W01 High Park, South Parkdale, Swansea, Roncesvall...
24 W02 Bloor West Village, Baby Point, The Junction (...
25 W03 Keelesdale, Eglinton West, Rockcliffe–Smythe, ...
26 W04 York, Glen Park, Amesbury (Brookhaven), Pelmo ...
27 W05 Downsview, Humber Summit, Humbermede (Emery), ...
28 W06 New Toronto, Long Branch, Mimico, Alderwood
29 W07 Sunnylea (The Queensway – Humber Bay)
30 W08 The Kingsway, Central Etobicoke, Eringate – Ce...
31 W09 Kingsview Village-The Westway, Richview (Willo...
32 W10 Rexdale, Clairville, Thistletown - Beaumond He...
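A small follow-up sketch: instead of hard-coding index 6, read_html's match parameter can select tables by their text. This assumes the page still carries the "District Number" header:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto'
# match keeps only the tables whose text matches the given string/regex
tables = pd.read_html(url, match='District Number')
print(tables[0].head())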
Look at the following data file (cou.data), which has four fields separated by tabs.
Four fields are:
country name
land area
population
continent
Where a country or continent name has two words, the words are separated by a space.
(The data are not accurate; they're just for test purposes.)
USSR 8649 275 Asia
Cananda 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe
Taiwan 55 144 Asia
North Korea 44 2134 Asia
awk 'BEGIN { FS = "\t" } { print $1, "---", $4 }' cou.data
I got the output, which exactly meets my expectation:
USSR --- Asia
Cananda --- North America
China --- Asia
USA --- North America
Brazil --- South America
India --- Asia
Mexico --- North America
France --- Europe
Japan --- Asia
Germany --- Europe
England --- Europe
Taiwan --- Asia
North Korea --- Asia
Then I replaced \t with a single space (" "), that is:
awk 'BEGIN { FS = " " } { print $1, "---", $4 }' cou.data
The output I got is not understandable to me:
USSR --- Asia
Cananda --- North
China --- Asia
USA --- North
Brazil --- South
India --- Asia
Mexico --- North
France --- Europe
Japan --- Asia
Germany --- Europe
England --- Europe
Taiwan --- Asia
North --- 2134
Lines 2, 4, 5, 7, and 13 each contain one space; the other lines contain no spaces at all.
For the lines that have no space, why can $1 and $4 still be printed?
For lines 2, 4, 5, 7, and 13, I thought $1 should be printed like this:
Cananda 3852 25 North
USA 3615 237 North
Brazil 3286 134 South
Mexico 762 78 North
North
And $4 does not exist.
Where did I go wrong?
The problem here is country names in the first field that contain a space, for example "North Korea". When you set FS to "\t", such a name is a single field; when fields can be split on spaces, it becomes two fields, which is why the field numbers shift after you change FS. There is one more subtlety: in awk, setting FS to a single space is special (it is also the default). It means fields are separated by runs of blanks, where both spaces and tabs count as blanks. That is why $1 and $4 still print for the lines with no spaces: the tabs themselves still separate the fields. For "Cananda 3852 25 North America", the fields become Cananda, 3852, 25, North, America, so $4 is "North"; for "North Korea 44 2134 Asia", $1 is "North" and $4 is "2134".
I would suggest your first attempt is good enough to get your expected values.
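If a Python analogy helps: awk's single-space FS behaves like str.split() with no argument (split on any run of whitespace), while FS = "\t" behaves like split('\t'). A quick sketch with the last data line:
line = 'North Korea\t44\t2134\tAsia'

# FS = "\t": tabs only, so the two-word name stays one field
print(line.split('\t'))   # ['North Korea', '44', '2134', 'Asia']

# FS = " " (awk's special default): any run of blanks or tabs splits,
# so the space inside 'North Korea' also separates fields
print(line.split())       # ['North', 'Korea', '44', '2134', 'Asia']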
Given this code in IPython:
df1=df["BillingContactCountry"].value_counts()
df1
I get
United States 4138
Germany 1963
United Kingdom 732
Switzerland 528
Australia 459
Canada 369
Japan 344
France 303
Netherlands 285
I want to get a series with counts larger than 303; what should I do?
You need boolean indexing:
print (df1[df1 > 303])
United States 4138
Germany 1963
United Kingdom 732
Switzerland 528
Australia 459
Canada 369
Japan 344
Name: BillingContactCountry, dtype: int64
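Equivalently, the comparison can be written with the Series.gt method, which builds the same True/False mask as df1 > 303:
print(df1[df1.gt(303)])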