Extracting table data using BeautifulSoup

I'm having a little trouble using BeautifulSoup to extract data (zip code and population) from a table.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
pop_source = requests.get("https://www.zip-codes.com/city/tx-austin.asp").text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table',class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code','Population'])
for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
Now I'm stuck. I don't know how to pull the data from the columns I want and append it to the columns of the empty dataframe I created.
Any help appreciated.

You just need to loop over your cols and put that data into your austin_pop dataframe.
I did that by building a list of the cell text from cols with a list comprehension:
row_list = [data.text for data in cols]
The list comprehension is equivalent to this for loop; you can use either:
row_list = []
for data in cols:
    row_list.append(data.text)
Then I created a single-row dataframe, kept the 2 columns you wanted, and added it to austin_pop:
temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
temp_df = temp_df[['Zip Code', 'Population']]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
austin_pop = pd.concat([austin_pop, temp_df]).reset_index(drop=True)
Full Code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.zip-codes.com/city/tx-austin.asp"
pop_source = requests.get(url).text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table', class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code', 'Population'])
for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
    row_list = [data.text for data in cols]
    temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
    temp_df = temp_df[['Zip Code', 'Population']]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    austin_pop = pd.concat([austin_pop, temp_df]).reset_index(drop=True)
# Drop the header row, then keep only the zip code itself from each cell
austin_pop = austin_pop.iloc[1:, :]
austin_pop['Zip Code'] = austin_pop['Zip Code'].apply(lambda x: x.split()[-1])
Output:
print (austin_pop)
Zip Code Population
1 73301 0
2 73344 0
3 78681 50,606
4 78701 6,841
5 78702 21,334
6 78703 19,690
7 78704 42,117
8 78705 31,340
9 78708 0
10 78709 0
11 78710 0
12 78711 0
13 78712 860
14 78713 0
15 78714 0
16 78715 0
17 78716 0
18 78717 22,538
19 78718 0
20 78719 1,764
21 78720 0
22 78721 11,425
23 78722 5,901
24 78723 28,330
25 78724 21,696
26 78725 6,083
27 78726 13,122
28 78727 26,689
29 78728 20,299
30 78729 27,108
.. ... ...
45 78746 26,928
46 78747 14,808
47 78748 40,651
48 78749 34,449
49 78750 26,814
50 78751 14,385
51 78752 18,064
52 78753 49,301
53 78754 15,036
54 78755 0
55 78756 7,194
56 78757 21,310
57 78758 44,072
58 78759 38,891
59 78760 0
60 78761 0
61 78762 0
62 78763 0
63 78764 0
64 78765 0
65 78766 0
66 78767 0
67 78768 0
68 78772 0
69 78773 0
70 78774 0
71 78778 0
72 78779 0
73 78783 0
74 78799 0
[74 rows x 2 columns]
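Adding one row at a time builds a new dataframe on every iteration. A minimal sketch of an alternative, assuming the table has the same five columns as above: collect the cell text in a plain list, build the dataframe once, and convert Population from strings such as '6,841' to integers. The inline HTML is a made-up stand-in for the real page so the example runs offline (the type, county, and area code cells are invented), and html.parser is used only because it ships with Python.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real markup).
html = """
<table class="statTable">
  <tr><td>ZIP Code</td><td>Type</td><td>County</td><td>Population</td><td>Area Code(s)</td></tr>
  <tr><td>ZIP Code 78701</td><td>Standard</td><td>Travis</td><td>6,841</td><td>512</td></tr>
  <tr><td>ZIP Code 78702</td><td>Standard</td><td>Travis</td><td>21,334</td><td>512</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="statTable")

# One list per table row, one string per cell.
rows = [[cell.get_text(strip=True) for cell in tr.find_all("td")]
        for tr in table.find_all("tr")]

# Skip the header row and build the dataframe in a single call.
columns = ["Zip Code", "type", "county", "Population", "area_codes"]
austin_pop = pd.DataFrame(rows[1:], columns=columns)[["Zip Code", "Population"]].copy()

# Keep only the trailing zip code and turn "6,841" into the integer 6841.
austin_pop["Zip Code"] = austin_pop["Zip Code"].str.split().str[-1]
austin_pop["Population"] = austin_pop["Population"].str.replace(",", "").astype(int)

print(austin_pop)
```

With Population stored as integers, sorting and summing behave numerically rather than alphabetically.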

Related

Checking for duplicates in a pandas data frame

import pandas as pd
from io import StringIO
import requests
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import make_interp_spline
url = 'https://m-selig.ae.illinois.edu/ads/coord/b737a.dat'
response = requests.get(url).text
lines = []
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.','').replace('-','').isdecimal() for x in line.split()]):
        lines.append(line)
lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(axis=1)]
print(df)
output...
0 1
2 0.0000 0.0177
3 0.0023 0.0309
4 0.0050 0.0372
5 0.0076 0.0415
6 0.0143 0.0499
7 0.0249 0.0582
8 0.0495 0.0730
9 0.0740 0.0814
10 0.0990 0.0866
11 0.1530 0.0907
12 0.1961 0.0905
13 0.2504 0.0887
14 0.3094 0.0858
15 0.3520 0.0833
16 0.3919 0.0804
17 0.4477 0.0756
18 0.5034 0.0696
19 0.5593 0.0626
20 0.5965 0.0575
21 0.6488 0.0498
22 0.8351 0.0224
23 0.9109 0.0132
24 1.0000 0.0003
26 0.0000 0.0177
27 0.0022 0.0038
28 0.0049 -0.0018
29 0.0072 -0.0053
30 0.0119 -0.0106
31 0.0243 -0.0204
32 0.0486 -0.0342
33 0.0716 -0.0457
34 0.0979 -0.0516
35 0.1488 -0.0607
36 0.1953 -0.0632
37 0.2501 -0.0632
38 0.2945 -0.0626
39 0.3579 -0.0610
40 0.3965 -0.0595
41 0.4543 -0.0563
42 0.5050 -0.0527
43 0.5556 -0.0482
44 0.6063 -0.0427
45 0.6485 -0.0375
46 0.8317 -0.0149
47 0.9410 -0.0053
48 1.0000 -0.0003
This is my code for scraping data from a website. The problem is that the x values start at zero, go up, and then return to zero, which draws a line through the middle of the plot that I don't need.
Notice that df[0] = 0 appears twice, on rows 2 and 26. How can I write code that detects duplicates?

Try one of the following.
Outside the loop:
df1 = df.drop_duplicates(keep='first', inplace=False, ignore_index=False)
Inside your loop:
lines = []
lines1 = []
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.','').replace('-','').isdecimal() for x in line.split()]):
        if line not in lines1:
            lines.append(line)
        lines1.append(line)
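A small sketch of the first option, using a few of the rows printed in the question rather than the full download. duplicated() flags rows that repeat an earlier row and drop_duplicates() removes them; both compare whole rows by default.

```python
import pandas as pd

# A few rows from the question's output; index labels 2 and 26 hold the same point.
df = pd.DataFrame(
    [[0.0000, 0.0177], [0.0023, 0.0309], [1.0000, 0.0003],
     [0.0000, 0.0177], [0.0022, 0.0038]],
    index=[2, 3, 24, 26, 27],
)

# True for every row that repeats an earlier one.
dupes = df.duplicated(keep="first")
print(df[dupes])

# Drop the repeats, keeping the first occurrence.
deduped = df.drop_duplicates(keep="first")
print(deduped)
```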

Converting a string to number in jupyter

Here is my code:
def value_and_wage_conversion(value):
    if isinstance(value,str):
        if 'M' in out:
            out = float(out.replace('M', ''))*1000000
        elif 'K' in value:
            out = float(out.replace('K', ''))*1000
    return float(out)
fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))
Here is the error message:
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
in <module>
      7     return float(out)
      8
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))
c:\users\brain\appdata\local\programs\python\python39\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4136             else:
   4137                 values = self.astype(object)._values
-> 4138                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4139
   4140             if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
in <lambda>(x)
      7     return float(out)
      8
----> 9 fifa_18['Value'] = fifa_18['Value'].apply(lambda x: value_and_wage_conversion(x))
     10 fifa_18['Wage'] = fifa_18['Wage'].apply(lambda x: value_and_wage_conversion(x))
in value_and_wage_conversion(value)
      1 def value_and_wage_conversion(value):
      2     if isinstance(value,str):
----> 3         if 'M' in out:
      4             out = float(out.replace('M', ''))*1000000
      5         elif 'K' in value:
UnboundLocalError: local variable 'out' referenced before assignment

You were almost there, but your function reads out before it is ever assigned, which is what raises the UnboundLocalError. Work on value itself instead.
For example:
import numpy as np
import pandas as pd
# generate a random sample
values = ['10M', '10K', 10.5, '200M', '200K', 200]
size = 100
np.random.seed(1)
df = pd.DataFrame({
    'Value': np.random.choice(values, size),
    'Wage': np.random.choice(values, size),
})
print(df)
Value Wage
0 200 200
1 200M 200M
2 200K 200
3 10M 10M
4 10K 200M
.. ... ...
95 200K 200
96 200 200M
97 10.5 200K
98 200K 10.5
99 200M 10M
[100 rows x 2 columns]
Define the function and apply it:
def value_and_wage_conversion(value):
    if isinstance(value, str):
        if 'M' in value:
            value = float(value.replace('M', ''))*1000000
        elif 'K' in value:
            value = float(value.replace('K', ''))*1000
    return float(value)
df['Value'] = df['Value'].apply(lambda x: value_and_wage_conversion(x))
df['Wage'] = df['Wage'].apply(lambda x: value_and_wage_conversion(x))
print(df)
Value Wage
0 200.0 200.0
1 200000000.0 200000000.0
2 200000.0 200.0
3 10000000.0 10000000.0
4 10000.0 200000000.0
.. ... ...
95 200000.0 200.0
96 200.0 200000000.0
97 10.5 200000.0
98 200000.0 10.5
99 200000000.0 10000000.0
[100 rows x 2 columns]
And check the resulting dtypes:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Value 100 non-null float64
1 Wage 100 non-null float64
dtypes: float64(2)
memory usage: 1.7 KB
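The same idea can be written with a lookup table of suffixes, which makes it easier to add more units later. This is only a sketch: SUFFIXES and to_number are names invented here, and it assumes that a suffix, when present, is the last character of the string.

```python
import pandas as pd

# Hypothetical lookup of suffix -> multiplier.
SUFFIXES = {"K": 1_000, "M": 1_000_000}

def to_number(value):
    """Convert values such as '10M', '200K', or 10.5 to float."""
    if isinstance(value, str):
        value = value.strip()
        multiplier = SUFFIXES.get(value[-1:], 1)
        if multiplier != 1:
            value = value[:-1]  # drop the suffix character
        return float(value) * multiplier
    return float(value)

s = pd.Series(["10M", "10K", 10.5, "200M", "200K", 200])
converted = s.apply(to_number)
print(converted.tolist())
```

Passing the function directly to apply, as here, avoids the extra lambda wrapper.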

Splitting a coordinate string into X and Y columns with a pandas data frame

I created a pandas data frame showing the coordinates of an event and the number of times those coordinates appear. The coordinates are shown as a string like this:
        Coordinates  Occurrences   x
0      (76.0, -8.0)            1   0
1    (-41.0, -24.0)            1   1
2      (69.0, -1.0)            1   2
3      (37.0, 30.0)            1   3
4      (-60.0, 1.0)            1   4
..              ...          ...  ..
63   (-45.0, -11.0)            1  63
64     (80.0, -1.0)            1  64
65     (84.0, 24.0)            1  65
66      (76.0, 7.0)            1  66
67    (-81.0, -5.0)            1  67
I want to create a new data frame that shows the x and y coordinates individually, along with their occurrences, like this:
   x  Occurrences    y  Occurrences
  76          ...   -8          ...
 -41          ...  -24          ...
  69          ...   -1          ...
  37          ...   30          ...
 -60          ...    1          ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the result to the table. I think I'd have to use something like a for loop later in my code. I scraped the data from an API; here is the code that sets up the data frame shown.
for key in contents['liveData']['plays']['allPlays']:
    # for plays in key['result']['event']:
        # print(key)
    if (key['result']['event'] == "Shot"):
        #print(key['result']['event'])
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if (key['result']['event'] == "Goal"):
        #print(key['result']['event'])
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1
#create data frame using pandas
gdf = pd.DataFrame(list(goals.items()),columns = ['Coordinates','Occurrences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()),columns = ['Coordinates','Occurrences'])
print()

Try this (it assumes the Coordinates column holds strings, and it leaves x and y as strings):
import re
df[['x', 'y']] = df.Coordinates.apply(lambda c: pd.Series(dict(zip(['x', 'y'], re.findall(r'[-]?[0-9]+\.[0-9]+', c.strip())))))

Using the built-in string methods to achieve this should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(This also converts x and y to float values, which was not requested but is probably desired. The built-in float is used because the np.float alias was removed in NumPy 1.24.)
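If the Coordinates column actually holds tuples rather than strings (the question's code builds the dict keys as tuples), no string parsing is needed. A sketch under that assumption, with three rows copied from the question; the names xy, x_counts, and y_counts are illustrative.

```python
import pandas as pd

# Rows copied from the question; Coordinates are tuples here, not strings.
df = pd.DataFrame({
    "Coordinates": [(76.0, -8.0), (-41.0, -24.0), (76.0, 7.0)],
    "Occurrences": [1, 1, 1],
})

# Unpack each tuple into its own column, keeping the original index.
xy = pd.DataFrame(df["Coordinates"].tolist(), columns=["x", "y"], index=df.index)
df = df.join(xy)

# Total occurrences per individual x value and per individual y value.
x_counts = df.groupby("x")["Occurrences"].sum()
y_counts = df.groupby("y")["Occurrences"].sum()

print(x_counts)
print(y_counts)
```

Here x = 76.0 appears in two rows, so its total is 2, while each y value appears once.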

I am sure that the type of "items_tmp_dic2" is dict, so why is this error reported?

import pandas as pd
import numpy as np
path = 'F:/datasets/kaggle/predict_future_sales/'
train_raw = pd.read_csv(path + 'sales_train.csv')
items = pd.read_csv(path + 'items.csv')
item_category_id = items['item_category_id']
item_id = train_raw.item_id
train_raw.head()
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
items.head()
                                           item_name  item_id  item_category_id
0                   ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D        0                40
1  !ABBYY FineReader 12 Professional Edition Full...        1                76
2                           ***В ЛУЧАХ СЛАВЫ (UNV) D        2                40
3                         ***ГОЛУБАЯ ВОЛНА (Univ) D        3                40
4                             ***КОРОБКА (СТЕКЛО) D        4                40
Next I want to add an "item_category_id" column to train_raw, taken from the items data, so I want to create a dict that maps item_id to item_category_id:
item_category_id = items['item_category_id']
item_id = train_raw.item_id
items_tmp = items.drop(['item_name'],axis=1)
items_tmp_dic = items_tmp.to_dict('split')
items_tmp_dic = items_tmp_dic.get('data')
items_tmp_dic2 = dict(items_tmp_dic)
ic_id = []
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(i))
print(len(ic_id))
This raises an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-be637620ea6d> in <module>
6 ic_id = []
7 for i in np.nditer(item_id.values[:10]):
----> 8 ic_id.append(items_tmp_dic2.get(i))
9 print(len(ic_id))
TypeError: unhashable type: 'numpy.ndarray'
But when I run
for i in np.nditer(item_id.values[:10]):
    print(i)
I get
22154
2552
2552
2554
2555
2564
2565
2572
2572
2573
I have confirmed that "items_tmp_dic2" is a dict, so why does this happen?

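I solved it by using int(). np.nditer yields zero-dimensional arrays rather than plain integers, and arrays are not hashable, so they cannot be used as dict keys:
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(int(i)))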
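A short sketch of why the lookup failed, plus a loop-free way to attach the category with Series.map. The five item_id and item_category_id pairs come from items.head() in the question; the small train frame is made up for illustration and stands in for train_raw.

```python
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "item_id": [0, 1, 2, 3, 4],
    "item_category_id": [40, 76, 40, 40, 40],
})
# Hypothetical stand-in for train_raw.
train = pd.DataFrame({"item_id": [4, 1, 1, 0]})

# Each element produced by np.nditer is a 0-d array, not an int.
first = next(iter(np.nditer(train["item_id"].values)))
print(type(first), first.ndim)

# Arrays cannot be hashed, so dict.get(first) raises TypeError.
try:
    hash(first)
    hashable = True
except TypeError:
    hashable = False
print(hashable)

# Map item_id -> item_category_id for the whole column at once.
lookup = items.set_index("item_id")["item_category_id"]
train["item_category_id"] = train["item_id"].map(lookup)
print(train)
```

Iterating over item_id.values directly, without np.nditer, would also give hashable NumPy scalars.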

Pandas custom file format

I have a huge Pandas DataFrame that I need to write out in a format that RankLib can understand. An example with a target, a query ID, and 3 features looks like this:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them out like this:
data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:'+str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' '+str(counter)+':'+str(row[feature])
    data_file.write(line+'\n')
data_file.close()
Since I have about 200 features and 5 million rows, this is very slow. Is there a better approach using the I/O of Pandas itself?

You can do it this way:
Data:
In [155]: df
Out[155]:
   f1   f2  f3 score  srch_id
0  12  0.6  13     5        4
1   8  0.4  11     1        4
2  11  0.7  14     2       10
In [156]: df.dtypes
Out[156]:
f1           int64
f2         float64
f3           int64
score       object
srch_id      int64
dtype: object
Solution:
feature_columns = ['f1','f2','f3']
cols2id = {col:str(i+1) for i,col in enumerate(feature_columns)}
def f(x):
    # x is one column; prefix features with their number and srch_id with 'qid:'
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x
(df.apply(lambda x: f(x))[['score','srch_id'] + feature_columns]
   .to_csv('d:/temp/out.csv', sep=' ', index=False, header=None)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
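A variation on the same idea, sketched with the sample data above: build each output field as a string Series, concatenate them column-wise, and let to_csv write them space-separated. It writes to an in-memory buffer instead of a file so it can be run as-is; the names parts, out, and buf are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "f1": [12, 8, 11],
    "f2": [0.6, 0.4, 0.7],
    "f3": [13, 11, 14],
    "score": [5, 1, 2],
    "srch_id": [4, 4, 10],
})
feature_columns = ["f1", "f2", "f3"]

# One string column per output field: target, qid, then numbered features.
parts = [df["score"].astype(str), "qid:" + df["srch_id"].astype(str)]
parts += [
    f"{i}:" + df[col].astype(str)
    for i, col in enumerate(feature_columns, start=1)
]
out = pd.concat(parts, axis=1)

buf = io.StringIO()  # swap for a file path to write to disk
out.to_csv(buf, sep=" ", index=False, header=False)
print(buf.getvalue())
```

All string building happens one column at a time, so no Python-level loop runs over the rows.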
