Extracting data from a wiki table using BeautifulSoup

I'm pretty new to this.
What I am trying to accomplish is a table with districts and their various neighborhoods, but my final code just lists all neighborhoods in one flat list without assigning them to a specific district.
url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
print(soup.prettify())
Toronto_table = soup.find('table',{'class':'wikitable sortable'})
links = Toronto_table.find_all('a')
neighborhoods = []
for link in links:
neighborhoods.append(link.get('title'))
print(neighborhoods)
df_neighborhoods = pd.DataFrame(neighborhoods)
df_neighborhoods

You can simply use read_html and print the table.
import pandas as pd
f_states=pd.read_html('https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto')
print(f_states[6])
Output:
District Number Neighbourhoods Included
0 C01 Downtown, Harbourfront, Little Italy, Little P...
1 C02 The Annex, Yorkville, South Hill, Summerhill, ...
2 C03 Forest Hill South, Oakwood–Vaughan, Humewood–C...
3 C04 Bedford Park, Lawrence Manor, North Toronto, F...
4 C06 North York, Clanton Park, Bathurst Manor
5 C07 Willowdale, Newtonbrook West, Westminster–Bran...
6 C08 Cabbagetown, St. Lawrence Market, Toronto wate...
7 C09 Moore Park, Rosedale
8 C10 Davisville Village, Midtown Toronto, Lawrence ...
9 C11 Leaside, Thorncliffe Park, Flemingdon Park
10 C13 Don Mills, Parkwoods–Donalda, Victoria Village
11 C14 Newtonbrook East, Willowdale East
12 C15 Hillcrest Village, Bayview Woods – Steeles, Ba...
13 E01 Riverdale, Danforth (Greektown), Leslieville
14 E02 The Beaches, Woodbine Corridor
15 E03 Danforth (Greektown), East York, Playter Estat...
16 E04 The Golden Mile, Dorset Park, Wexford, Maryval...
17 E05 Steeles, L'Amoreaux, Tam O'Shanter – Sullivan
18 E06 Birch Cliff, Oakridge, Hunt Club, Cliffside
19 E08 Scarborough Village, Cliffcrest, Guildwood, Eg...
20 E09 Scarborough City Centre, Woburn, Morningside, ...
21 E10 Rouge (South), Port Union (Centennial Scarboro...
22 E11 Rouge (West), Malvern
23 W01 High Park, South Parkdale, Swansea, Roncesvall...
24 W02 Bloor West Village, Baby Point, The Junction (...
25 W03 Keelesdale, Eglinton West, Rockcliffe–Smythe, ...
26 W04 York, Glen Park, Amesbury (Brookhaven), Pelmo ...
27 W05 Downsview, Humber Summit, Humbermede (Emery), ...
28 W06 New Toronto, Long Branch, Mimico, Alderwood
29 W07 Sunnylea (The Queensway – Humber Bay)
30 W08 The Kingsway, Central Etobicoke, Eringate – Ce...
31 W09 Kingsview Village-The Westway, Richview (Willo...
32 W10 Rexdale, Clairville, Thistletown - Beaumond He...
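
Hard-coding the table's position (f_states[6]) is fragile if the page layout changes. As a sketch of a more robust selection, read_html's match parameter keeps only tables whose text matches the given string or regex; matching on the 'District Number' header shown above is an assumption about this page's markup:

import pandas as pd

# match= filters the returned list to tables containing the given text,
# so the table's index on the page no longer matters
tables = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto',
    match='District Number')
df_districts = tables[0]
print(df_districts.head())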


Cannot use GROUP BY function

Data in table:
wkt Partners Team Opponent Runs Balls
1 S Hope & E Lewis WEST INDIES SOUTH AFRICA 43 66
2 S Hope & S Hetmyer WEST INDIES SOUTH AFRICA 70 79
3 D Bravo & S Hetmyer WEST INDIES SOUTH AFRICA 84 97
1 J Malan & Q Kock SOUTH AFRICA WEST INDIES 3 4
2 J Malan & F Plessis SOUTH AFRICA WEST INDIES 32 44
3 J Malan & R Dussen SOUTH AFRICA WEST INDIES 100 90
1 S Dhawan & R Sharma INDIA IRELAND 3 8
2 V Kohli & R Sharma INDIA IRELAND 102 70
For each wkt, I want to return the pair of partners, the team they belong to, and the opponent they played against, only once per wkt, taking the row where Runs is highest for that particular wkt.
For the above table I'd like the result as follows:
wkt Partners Team Opponent Runs Balls
1 S Hope & E Lewis WEST INDIES SOUTH AFRICA 43 66
2 V Kohli & R Sharma INDIA IRELAND 102 70
3 J Malan & R Dussen SOUTH AFRICA WEST INDIES 100 90
Following is the code that I've used
SELECT wkt, Partners, Team, Opponent, max(Runs), Balls
FROM Partnerships
GROUP BY wkt
But I've been stuck with the following error:
Column 'Partnerships.Partners' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
How about row_number()?
select p.*
from (select p.*,
             row_number() over (partition by wkt order by runs desc) as seqnum
      from Partnerships p
     ) p
where seqnum = 1;
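
If the same data lives in a pandas DataFrame instead of a SQL table, here is a hedged sketch of the same keep-the-top-row-per-group idea (only the wkt, Partners, and Runs columns from the question are reproduced, for brevity):

import pandas as pd

# data from the question, trimmed to three columns for brevity
df = pd.DataFrame({
    'wkt': [1, 2, 3, 1, 2, 3, 1, 2],
    'Partners': ['S Hope & E Lewis', 'S Hope & S Hetmyer', 'D Bravo & S Hetmyer',
                 'J Malan & Q Kock', 'J Malan & F Plessis', 'J Malan & R Dussen',
                 'S Dhawan & R Sharma', 'V Kohli & R Sharma'],
    'Runs': [43, 70, 84, 3, 32, 100, 3, 102],
})

# sort so the highest Runs comes first, then keep one row per wkt --
# the same effect as row_number() ... where seqnum = 1
best = (df.sort_values('Runs', ascending=False)
          .drop_duplicates('wkt')
          .sort_values('wkt'))
print(best)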

How do you create a master data frame from appending many data frames together?

I am attempting to loop over many pages of data, with the same format, to acquire one large master data frame of all the data appended together. When I run this code however, the master_df comes up empty, as if the new_df isn't being appended to the master. Any thoughts? Thanks in advance!
import pandas as pd

alphabet_list = ['a','b','c','d','e','f']
master_df = pd.DataFrame([])
for letter in alphabet_list:
    player_df = pd.read_html('https://www.nfl.com/players/active/{}'.format(letter))
    new_df = player_df[0]
    master_df = new_df.append(new_df)
Try this change:
import pandas as pd

alphabet_list = ['a','b','c','d','e','f']
master_df = pd.DataFrame()
for letter in alphabet_list:
    player_df = pd.read_html('https://www.nfl.com/players/active/{}'.format(letter))
    master_df = master_df.append(player_df)
Output:
Player Current Team Position Status
0 Chidobe Awuzie Dallas Cowboys CB ACT
1 Josh Avery Seattle Seahawks DT ACT
2 Genard Avery Philadelphia Eagles OLB ACT
3 Anthony Averett Baltimore Ravens CB ACT
4 Lee Autry Chicago Bears DT ACT
5 Denico Autry Indianapolis Colts DT ACT
6 Tavon Austin Dallas Cowboys WR UFA
7 Blessuan Austin New York Jets DB ACT
8 Antony Auclair Tampa Bay Buccaneers TE ACT
9 Jeremiah Attaochu Denver Broncos LB ACT
10 Hunter Atkinson Atlanta Falcons OT ACT
11 John Atkins Detroit Lions DE ACT
12 Geno Atkins Cincinnati Bengals DT ACT
13 Marcell Ateman Las Vegas Raiders WR ACT
14 George Aston New York Giants RB ACT
15 Dravon Askew-Henry New York Giants DB ACT
16 Devin Asiasi New England Patriots TE ACT
17 George Asafo-Adjei New York Giants OT ACT
18 Ade Aruna Las Vegas Raiders DE ACT
19 Grayland Arnold Philadelphia Eagles SAF ACT
20 Dan Arnold Arizona Cardinals TE ACT
21 Damon Arnette Las Vegas Raiders CB UDF
22 Ray-Ray Armstrong Dallas Cowboys LB UFA
23 Ka'John Armstrong Denver Broncos OT ACT
24 Dorance Armstrong Dallas Cowboys DE ACT
25 Cornell Armstrong Houston Texans DB ACT
26 Terron Armstead New Orleans Saints OT ACT
27 Ryquell Armstead Jacksonville Jaguars RB ACT
28 Arik Armstead San Francisco 49ers DE ACT
29 Alex Armah Carolina Panthers FB ACT
.. ... ... ... ...
70 Mark Fields Minnesota Vikings CB ACT
71 Sam Ficken New York Jets K ACT
72 Jason Ferris Carolina Panthers LB ACT
73 Clelin Ferrell Las Vegas Raiders DE ACT
74 Reid Ferguson Buffalo Bills LS ACT
75 Josh Ferguson Washington Redskins RB ACT
76 Jaylon Ferguson Baltimore Ravens LB ACT
77 Blake Ferguson Miami Dolphins LS ACT
78 James Ferentz New England Patriots C UFA
79 Rashad Fenton Kansas City Chiefs DB ACT
80 Darren Fells Houston Texans TE UFA
81 Jon Feliciano Buffalo Bills OG ACT
82 Clayton Fejedelem Miami Dolphins DB ACT
83 Matt Feiler Pittsburgh Steelers OT RFA
84 Jordan Fehr Minnesota Vikings LB ACT
85 Breiden Fehoko Los Angeles Chargers DT ACT
86 Dan Feeney Los Angeles Chargers OG ACT
87 Tavien Feaster Jacksonville Jaguars RB ACT
88 Foley Fatukasi New York Jets DT ACT
89 Rojesterman Farris Atlanta Falcons DB ACT
90 Wesley Farnsworth Denver Broncos LS ACT
91 Matthias Farley New York Jets DB ACT
92 Noah Fant Denver Broncos TE ACT
93 George Fant New York Jets OT ACT
94 David Fales New York Jets QB ACT
95 Nico Falah Denver Broncos C ACT
96 Ka'imi Fairbairn Houston Texans K ACT
97 Jovahn Fair Kansas City Chiefs OG ACT
98 Brandon Facyson Los Angeles Chargers DB ACT
99 Kyler Fackrell New York Giants LB ACT
[571 rows x 4 columns]
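
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the loop above will fail. A minimal sketch of the same idea using pd.concat, collecting the frames in a list and concatenating once at the end:

import pandas as pd

alphabet_list = ['a', 'b', 'c', 'd', 'e', 'f']
frames = []
for letter in alphabet_list:
    # read_html returns a list of tables; the player table is assumed to be the first
    tables = pd.read_html('https://www.nfl.com/players/active/{}'.format(letter))
    frames.append(tables[0])

# one concat at the end; ignore_index renumbers the combined rows
master_df = pd.concat(frames, ignore_index=True)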

How does Awk's pattern matching for strings work?

I am trying to understand how range pattern matching works in Awk.
Here is the full data that I am practicing with
Raw Data
-----------------------------------------
USSR 8649 275 Asia
Canada 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe
If I write this code
$ awk '/Asia/, /Europe/' countries.awk
I get
USSR 8649 275 Asia
Canada 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
It doesn't output England.
And if I write this
$ awk '/Europe/, /Asia/' countries.awk
I get
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe
What is the behavior here? Why do I not get England on the first one?
Awk processes input lines one at a time. The range syntax you used prints every line from a match of the start pattern through the next match of the end pattern, the patterns here being the continent names. When you used
awk '/Asia/, /Europe/'
The start pattern Asia matches more than once. As you can see from the line numbers below, the first range opens at line 3 and stays open until the end pattern Europe matches at line 10; lines 5 and 8 also contain Asia, but they fall inside the already-open range. A second range opens at line 11 and closes at line 12. The last end pattern Europe for the last Asia is on line 12, which is why you are not seeing England in the first case.
But when you used
awk '/Europe/, /Asia/'
The first range opens at the start pattern Europe on line 10 and closes at the end pattern Asia on line 11. Another range opens at line 12, and since no later line matches the end pattern Asia, it stays open and prints every remaining line to the end of the file. That is why you do see England in the second case.
$ cat -n file
1 Raw Data
2 -----------------------------------------
3 USSR 8649 275 Asia
4 Canada 3852 25 North America
5 China 3705 1032 Asia
6 USA 3615 237 North America
7 Brazil 3286 134 South America
8 India 1267 746 Asia
9 Mexico 762 78 North America
10 France 211 55 Europe
11 Japan 144 120 Asia
12 Germany 96 61 Europe
13 England 94 56 Europe
Never use range expressions as they make trivial tasks very slightly briefer but then need a complete rewrite or duplicate conditions when your requirements change. Always use a flag instead:
awk '/Asia/{f=1} f{print} /Europe/{f=0}' countries.awk
I bet if you started with that you wouldn't even have had to ask this question as the logic is clear and explicit.

Merging two multi-level Series in pandas

I have two multi-level Series and would like to merge them on both index levels. The first Series looks like this:
# of restaurants
BORO CUISINE
BRONX American 425
Chinese 330
Pizza 206
BROOKLYN American 1254
Chinese 750
Cafe/Coffee/Tea 350
The second one has more rows and is like this:
# of votes
BORO CUISINE
BRONX American 2425
Caribbean 320
Chinese 3130
Pizza 3336
BROOKLYN American 21254
Caribbean 2320
Chinese 7250
Cafe/Coffee/Tea 3350
Pizza 13336
Setup:
s1 = pd.Series({('BRONX', 'American'): 425, ('BROOKLYN', 'Chinese'): 750, ('BROOKLYN', 'Cafe/Coffee/Tea'): 350, ('BRONX', 'Pizza'): 206, ('BROOKLYN', 'American'): 1254, ('BRONX', 'Chinese'): 330})
s2 = pd.Series({('BRONX', 'Caribbean'): 320, ('BRONX', 'American'): 2425, ('BROOKLYN', 'Chinese'): 7250, ('BROOKLYN', 'Cafe/Coffee/Tea'): 3350, ('BRONX', 'Pizza'): 3336, ('BROOKLYN', 'American'): 21254, ('BROOKLYN', 'Pizza'): 13336, ('BRONX', 'Chinese'): 3130, ('BROOKLYN', 'Caribbean'): 2320})
s1 = s1.rename_axis(['BORO','CUISINE']).rename('restaurants')
s2 = s2.rename_axis(['BORO','CUISINE']).rename('votes')
print (s1)
BORO CUISINE
BRONX American 425
Chinese 330
Pizza 206
BROOKLYN American 1254
Chinese 750
Cafe/Coffee/Tea 350
Name: restaurants, dtype: int64
print (s2)
BORO CUISINE
BRONX American 2425
Caribbean 320
Chinese 3130
Pizza 3336
BROOKLYN American 21254
Caribbean 2320
Chinese 7250
Cafe/Coffee/Tea 3350
Pizza 13336
Name: votes, dtype: int64
Use concat with the join parameter if you need an inner join:
print (pd.concat([s1,s2], axis=1, join='inner'))
restaurants votes
BORO CUISINE
BRONX American 425 2425
Chinese 330 3130
Pizza 206 3336
BROOKLYN American 1254 21254
Cafe/Coffee/Tea 350 3350
Chinese 750 7250
# join='outer' is the default, so it can be omitted
print (pd.concat([s1,s2], axis=1))
restaurants votes
BORO CUISINE
BRONX American 425.0 2425
Caribbean NaN 320
Chinese 330.0 3130
Pizza 206.0 3336
BROOKLYN American 1254.0 21254
Cafe/Coffee/Tea 350.0 3350
Caribbean NaN 2320
Chinese 750.0 7250
Pizza NaN 13336
Another solution is to use merge with reset_index:
# how='inner' is the default, so it can be omitted
print (pd.merge(s1.reset_index(), s2.reset_index(), on=['BORO','CUISINE']))
BORO CUISINE restaurants votes
0 BRONX American 425 2425
1 BRONX Chinese 330 3130
2 BRONX Pizza 206 3336
3 BROOKLYN American 1254 21254
4 BROOKLYN Chinese 750 7250
5 BROOKLYN Cafe/Coffee/Tea 350 3350
#outer join
print (pd.merge(s1.reset_index(), s2.reset_index(), on=['BORO','CUISINE'], how='outer'))
BORO CUISINE restaurants votes
0 BRONX American 425.0 2425
1 BRONX Chinese 330.0 3130
2 BRONX Pizza 206.0 3336
3 BROOKLYN American 1254.0 21254
4 BROOKLYN Chinese 750.0 7250
5 BROOKLYN Cafe/Coffee/Tea 350.0 3350
6 BRONX Caribbean NaN 320
7 BROOKLYN Caribbean NaN 2320
8 BROOKLYN Pizza NaN 13336
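
A third option, if you prefer to keep the MultiIndex rather than calling reset_index, is DataFrame.join, which aligns on the shared index. A short sketch using the same s1 and s2 from the setup above:

# inner join on the shared (BORO, CUISINE) MultiIndex
print(s1.to_frame().join(s2, how='inner'))

# how='outer' keeps all rows, filling the gaps with NaN
print(s1.to_frame().join(s2, how='outer'))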

Dictionary of sorted values gets unsorted when transformed into pandas dataframe

I am reading a csv file with the GDP for the 32 states in Mexico from 1940 to 2004. The columns are state names, and GDP values for each year.
Unfortunately, I can't add images just now... but, basically, the dataframe has as columns the following: state_name, 1940, 1950, etc... the values for state_name are the names of each state (as strings), and the values for the rest of the columns are the GDPs per state per year.
So, I am trying to produce a new dataframe in which there is no longer a state_names column, but only the columns 1940, 1950, etc., where the values are no longer the corresponding GDPs but the names of the states, ordered by GDP in the given year. So the 1940 column in the new dataframe would list the states not in alphabetical order, as the current output does, but by the GDP ranking (as produced in my loop below that builds a dictionary).
I am using the following loop to sort the entire data frame by each year from 1940 to 2004 (in states), and then take the names column of this sorted data frame (in names).
ranks = {}
for year in pibe.columns.values[1:]:
    states = pibe.sort(columns=year, ascending=False)
    names = states["entidad"]
    ranks[year] = names
The output of this dictionary looks like below:
{'1940': 1 Baja California
22 Quintana Roo
8 Distrito Federal
9 Durango
21 Queretaro
0 Aguascalientes
2 Baja California Sur
...
Name: entidad, dtype: object,
'1950': 22 Quintana Roo
1 Baja California
8 Distrito Federal
2 Baja California Sur
5 Chihuahua...}
So far so good. But when I try to transform the dictionary into a data frame, it somehow overrides my previous sorting and produces an alphabetically ordered list of state names. So the new data frame has a column for each year, each populated by the same list of names.
To transform the dictionary into a data frame I am using:
pd.DataFrame(ranks)
This happens because pd.DataFrame aligns a dictionary of Series on their index labels, which undoes the per-year ordering; to keep the ranking, the values must not carry an index. Create a new dataframe based on the ordering that you need:
In [6]: ordered_df = original_df.sort(['Year','GDP'],axis=0,ascending=False)
Create a new dictionary to pass into the final dataframe (this can be done more efficiently):
In [7]: unique_years = {item[1]['Year']:[] for item in ordered_df.iterrows()}
Loop through new dataframe populating the dictionary:
In [8]: for row in ordered_df.iterrows():
            unique_years[row[1]['Year']].append(row[1]['State'])
Create final dataframe:
In [9]: final_df = pd.DataFrame(unique_years)
Input:
In [11]: original_df
Out[11]:
Year State GDP
0 1945 New York 84
1 1945 Texas 38
2 1945 California 84
3 1946 New York 56
4 1946 Texas 6
5 1946 California 84
6 1947 New York 75
7 1947 Texas 95
8 1947 California 92
9 1948 New York 50
10 1948 Texas 25
11 1948 California 30
12 1949 New York 16
13 1949 Texas 33
14 1949 California 31
15 1950 New York 37
16 1950 Texas 75
17 1950 California 49
18 1951 New York 28
19 1951 Texas 74
20 1951 California 78
21 1952 New York 57
22 1952 Texas 5
23 1952 California 28
Output:
In [12]: final_df
Out[12]:
1945 1946 1947 1948 1949 1950 \
0 New York California Texas New York Texas Texas
1 California New York California California California California
2 Texas Texas New York Texas New York New York
1951 1952
0 California New York
1 Texas California
2 New York Texas
Check final dataframe against the ordered dataframe to ensure proper GDP ordering:
In [13]: ordered_df
Out[13]:
Year State GDP
21 1952 New York 57
23 1952 California 28
22 1952 Texas 5
20 1951 California 78
19 1951 Texas 74
18 1951 New York 28
16 1950 Texas 75
17 1950 California 49
15 1950 New York 37
13 1949 Texas 33
14 1949 California 31
12 1949 New York 16
9 1948 New York 50
11 1948 California 30
10 1948 Texas 25
7 1947 Texas 95
8 1947 California 92
6 1947 New York 75
5 1946 California 84
3 1946 New York 56
4 1946 Texas 6
0 1945 New York 84
2 1945 California 84
1 1945 Texas 38
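
Alternatively, the fix can stay inside the original loop: store plain Python lists instead of Series, so there is no index left for the DataFrame constructor to re-align. A sketch assuming the pibe frame and entidad column from the question (sort_values stands in for the old DataFrame.sort on current pandas):

ranks = {}
for year in pibe.columns.values[1:]:
    states = pibe.sort_values(by=year, ascending=False)
    # .tolist() drops the index, so pd.DataFrame(ranks) preserves this order
    ranks[year] = states["entidad"].tolist()

df_ranks = pd.DataFrame(ranks)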