Scraping a forum using BeautifulSoup and displaying the results in tabular form

How do I get BeautifulSoup to display the scraped results in a tabular format, something like this:
Topic | Views | Replies
---------------------------------------
XPS 7590 problems | 557 | 8
SSD not working | 76 | 3
My code is:
import requests, re
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "lia-component-messages-column-thread-info"})

for item in g_data:
    print(item.find_all("h2", {"class": "message-subject"})[0].text)
    print(item.find_all("span", {"class": "lia-message-stats-count"})[0].text)  # replies
    print(item.find_all("span", {"class": "lia-message-stats-count"})[1].text)  # views

Just construct a DataFrame by initializing an empty one and appending each "row" to it:
import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content, "html.parser")  # explicit parser avoids the bs4 warning
g_data = soup.find_all("div", {"class": "lia-component-messages-column-thread-info"})

df = pd.DataFrame()
for item in g_data:
    topic = item.find_all("h2", {"class": "message-subject"})[0].text.strip()
    replies = item.find_all("span", {"class": "lia-message-stats-count"})[0].text.strip()  # replies
    views = item.find_all("span", {"class": "lia-message-stats-count"})[1].text.strip()    # views
    # note: DataFrame.append was removed in pandas 2.0; see the list-based variant after the output
    df = df.append(pd.DataFrame([[topic, views, replies]], columns=['Topic', 'Views', 'Replies']), sort=False).reset_index(drop=True)
Output:
print (df)
Topic Views Replies
0 FAQ Modern Standby 1057 0
1 FAQ XPS Laptops 4315 0
2 Where is the Precision Laptops Forum board? 624 0
3 XPS 15-9570, color banding issue 5880 192
4 XPS 7590 problems.. 565 9
5 XPS 13 7390 2-in-1 Display and Touchscreen issues 17 2
6 Dell XPS 9570 I7-8750H video display issues 9 0
7 XPS 9360 Fn lock for PgUp PgDn 12 0
8 Dell XPS DPC Latency Fix 1724 4
9 XPS 13 7390 2-in-1, Realtek drivers lead to fr... 253 11
10 XPS 12 9q23 Touch screen firmware update fix 36 1
11 Dell XPS 15 9570 when HDMI plugged in, screen ... 17 0
12 XPS 13 7390 2 in 1 bluetooth keyboard and mous... 259 10
13 xps15 7590 wifi problem 46 1
14 Unable to update Windows from 1803 to 1909 - X... 52 5
15 Dell XPS 9300 - Thunderbolt 3 Power Delivery I... 28 0
16 Dell XPS 15 9560, right arrow key or right of ... 26 0
17 XPS 13 2020 (9300) Ubuntu sudden shut down 24 0
18 Dell XPS 15 9750 won’t login 26 0
19 XPS 13 9360 Windows Hello Face - reconfigurati... 29 2
20 Enclosure for Dell XPS 13 9360 512 GB pcie nvm... 181 7
21 XPS 13 7390 Firmware 1.3.1 Issue - Bluetooth /... 119 2
22 SSD Onboard? 77 3
23 XPS 13 9350 only turns on when charger connected 4090 11
24 Integrated webcam not working 45 1
25 Docking station for XPS 15 9570, Dell TB16 not... 53 4
26 Dell XPS 13 9370 34 1
27 XPS 13 9380 overheat while charging 602 3
28 DELL XPS 13 (9300) REALTEK AUDIO DRIVER PROBLEM 214 2
29 XPS 15 9570 freezing Windows 10 222 6
30 XPS 13 (9300) - Speaker Vibration 40 2
31 Dell XPS 15 9570 Fingerprint reader not workin... 158 2
32 XPS 9570 Intel 9260 No Bluetooth 34 0
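On pandas 2.0+ DataFrame.append no longer exists, so here is a sketch of the same scrape (same URL and selectors as above) that collects the rows into a list and builds the frame once, which is also faster than appending inside the loop:

import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content, "html.parser")

rows = []
for item in soup.find_all("div", {"class": "lia-component-messages-column-thread-info"}):
    topic = item.find_all("h2", {"class": "message-subject"})[0].text.strip()
    stats = item.find_all("span", {"class": "lia-message-stats-count"})
    replies, views = stats[0].text.strip(), stats[1].text.strip()
    rows.append([topic, views, replies])

df = pd.DataFrame(rows, columns=['Topic', 'Views', 'Replies'])
print(df)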

Related

Scraping Lazada getting strange results

I am looking for some hints about http://lazada.co.th
I have search links I need to scrape, but the results are inconsistent. Even manually, Safari brings up a different number of items than Chrome for the same link.
For example
https://www.lazada.co.th/shop-womens-sunglasses/?service=FS&location=local&price=5000-&rating=4
shows 31 items in Safari and 0 in Chrome.
As for scraping, I get totally different results from different approaches, and even from the same approach over time: BS4 and Selenium return different quantities, and neither is consistent over even a couple of hours. I can get results similar to Safari's one time and similar to Chrome's another time.
Playing with timeouts didn't help.
Any hints would be highly appreciated.
The data is embedded within the page in JSON form; you can use the re and json modules to extract it. The caveat is that the page sometimes returns 0 results (this appears to be completely random; presumably the fault is on their side).
This script will retry the loading until the data is available:
import json
import re  # needed for re.search below
import requests
from time import sleep

url = 'https://www.lazada.co.th/shop-womens-sunglasses/?service=FS&location=local&price=5000-&rating=4'

while True:
    txt = requests.get(url).text
    # the data sits in a <script> tag as `window.pageData = {...}`
    data = json.loads(re.search(r'window\.pageData=(.*?);?<', txt).group(1))
    if 'listItems' not in data['mods']:
        sleep(1)  # empty response; wait and try again
        continue
    break

# uncomment next line to print all data:
# print(json.dumps(data, indent=4))

for i, item in enumerate(data['mods']['listItems'], 1):
    print('{:<3} {:>12} {:>15} {}'.format(i, item['price'], item['itemId'], item['name']))
Prints:
1 5390 390780925 Ray-Ban Square Sunglasses- RB1971 9150AC แว่นตากันแดด
2 5390.00 390796431 Ray-Ban Square Sunglasses- RB1971 9149AD แว่นตากันแดด
3 19290.00 100916363 GUCCI แว่นกันแดด รุ่น GG3660/K/S UD28ED/57
4 12000.00 275621098 Marc Jacobs แว่นตากันแดด รุ่น MJ457/F/S 35HEU ( Brown )
5 16900.00 7663524 Dior Mirrored Sunglasses- Rose Gold
6 9999.00 161253056 ic! berlin Malgorzata Rose Gold แว่นกันแดดกรอบสแตนเลส หรูหรา น้ำหนักเบาสุดๆ ของแท้ ฟรีแก้ว
7 38900.00 100133011 แว่นกันแดด DIOR SUN รุ่น silver matte peach/dark grey (RCM/BN)
8 7250.00 10928606 Ray-Ban Erika Polarized - RB4171F 710/T5 แว่นตากันแดด rayban
9 19990.00 100915722 GUCCI แว่นกันแดด รุ่น GG2245/S SJ5GXS/59
10 17890.00 100924756 GUCCI แว่นกันแดด รุ่น GG3754/F/S UKCLHA/58 (Black Black)
11 22790.00 101448214 GUCCI แว่นตากันแดด รุ่น GG3760/F/S U2ENHD/57
12 22790.00 100924670 GUCCI แว่นกันแดด รุ่น GG3727/F/S UILW6J/58 (Black Gold)
13 20290.00 100916203 GUCCI แว่นกันแดด รุ่น GG3635/N/F/S UZ9X0J/57
14 6990.00 100524937 GUESS แว่นกันแดด รุ่น GU7448 32B #52
15 5590.00 100525280 GUESS แว่นกันแดด รุ่น GU7460 M32B #60
16 22790.00 100924776 GUCCI แว่นกันแดด รุ่น GG3760/F/S U2EZHA/57 (Brown Black)
17 20290.00 101448302 GUCCI แว่นตากันแดด รุ่น GG4273/S S3YGMI/52
18 19990.00 100915356 GUCCI แว่นกันแดด รุ่น GG2245/S S006R3/61
19 5390.00 100516633 GUESS แว่นกันแดด รุ่น GU 6725 32C
20 18590.00 100924679 GUCCI แว่นกันแดด รุ่น GG3735/F/S UCHYUZ/57 (Red Brown)
21 26290.00 100883133 GUCCI แว่นตากันแดด GG2235/S SKJ1LG/59
22 5590.00 100525330 GUESS แว่นกันแดด รุ่น GU7460 M05B #60
23 20290.00 101443274 GUCCI แว่นตากันแดด GG3635/N/F/S UZ99HA/57
24 29790.00 101448179 GUCCI แว่นตากันแดด รุ่น GG3706/F/S U2ZXHA/58
25 7490.00 100524727 GUESS แว่นกันแดด รุ่น GU7434 02D #56
26 7200.00 100408876 Gucci GG 3504/S D28/BN(Grey Black
27 6990.00 100524871 GUESS แว่นกันแดด รุ่น GU7448 02B #52
28 5390.00 100881696 GUESS แว่นกันแดด รุ่น GU6850 49G #54
29 6990.00 100525084 GUESS แว่นกันแดด รุ่น GU7448 09C #52
30 5390.00 100881586 GUESS แว่นกันแดด รุ่น GU6850 02B #54
31 27900.00 7765890 Dior Reflected Pink M2Q
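If the page never recovers, the loop above spins forever. A hedged variation (the fetch_page_data name and the cap of 20 retries are my choices, not from the original answer) gives up after a fixed number of attempts:

import json
import re
import requests
from time import sleep

def fetch_page_data(url, max_retries=20):
    # retry until 'listItems' shows up, giving up after max_retries attempts
    for _ in range(max_retries):
        txt = requests.get(url).text
        m = re.search(r'window\.pageData=(.*?);?<', txt)
        if m:
            data = json.loads(m.group(1))
            if 'listItems' in data.get('mods', {}):
                return data
        sleep(1)
    raise RuntimeError('page kept returning 0 results after {} tries'.format(max_retries))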

Plotting web-scraped data with matplotlib

I recently managed to collect tabular data from a PDF file using camelot in Python (by "collect" I mean print it out on the terminal). Now I would like to find a way to turn those results into a bar graph in matplotlib automatically. How would I do that? Here's my code for extracting the tabular data from the PDF:
import camelot
tables = camelot.read_pdf("data_table.pdf", pages='2')
print(tables[0].df)
Which then prints out a large table in my terminal:
0 1 2 3 4
0 Country \nCase definition \nCumulative cases \...
1 Guinea Confirmed 2727 156 1683
2 Probable 374 * 374
3 Suspected 7 * ‡
4 Total 3108 156 2057
5 Liberia** Confirmed 3149 11 ‡
6 Probable 1876 * ‡
7 Suspected 3982 * ‡
8 Total 9007 11 3900
9 Sierra Leone Confirmed 8212 230 3042
10 Probable 287 * 208
11 Suspected 2604 * 158
12 Total 11103 230 3408
13 Total 23 218 397 9365
I do have a bit of experience with matplotlib and I know how to plot data manually, but not automatically from the PDF. This would save me some time, since I'm trying to automate the whole process.
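One way to do it (a sketch, not from the original thread) is to select the per-country "Total" rows from the camelot DataFrame, clean the numbers, and hand them to plt.bar. The column positions and the forward-fill of the country names are assumptions based on the printout above; adjust them to your actual parse:

import camelot
import matplotlib.pyplot as plt

tables = camelot.read_pdf("data_table.pdf", pages='2')
df = tables[0].df  # assumed layout: col 0 country, col 1 case definition, col 2 cumulative cases

# the country name only appears on the first row of each block, so carry it down
df[0] = df[0].where(df[0] != '').ffill()
totals = df[df[1] == 'Total']  # keep the "Total" rows
# depending on the parse, you may need to drop the grand-total row, e.g. totals = totals.iloc[:-1]

countries = totals[0].str.replace('*', '', regex=False)            # strip footnote markers like '**'
cases = totals[2].str.replace(r'\D', '', regex=True).astype(int)   # drop spaces and footnote symbols

plt.bar(countries, cases)
plt.ylabel('Cumulative cases')
plt.tight_layout()
plt.show()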

Need assistance with the query below

I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import webbrowser
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
And then the error above appears.
I believe you need to use read_html, which returns all parsed tables; select the DataFrame by position:
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
# select first parsed table
df1 = pd.read_html(website)[0]
print(df1.head())
Win % Wins Losses Year Team Comment
0 0.798 67 17 1882 Chicago White Stockings best pre-modern season
1 0.763 116 36 1906 Chicago Cubs best 154-game NL season
2 0.721 111 43 1954 Cleveland Indians best 154-game AL season
3 0.716 116 46 2001 Seattle Mariners best 162-game AL season
4 0.667 108 54 1975 Cincinnati Reds best 162-game NL season
# select second parsed table (note: this downloads the page again)
df2 = pd.read_html(website)[1]
print(df2)
Win % Wins Losses Season Team \
0 0.890 73 9 2015–16 Golden State Warriors
1 0.110 9 73 1972–73 Philadelphia 76ers
2 0.106 7 59 2011–12 Charlotte Bobcats
Comment
0 best 82 game season
1 worst 82-game season
2 worst season statistically
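Since every read_html call downloads and re-parses the page, a small variation (a sketch using the same URL) fetches it once and indexes into the resulting list:

import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
tables = pd.read_html(website)  # one download, all parsed tables
df1, df2 = tables[0], tables[1]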

Pandas: Error when merging two tables, Error with set_index

Thanks in advance for your help, here's my question:
I've successfully loaded my df in to ipython notebook and then I ran a group by on it:
station_count = station.groupby('landmark').count()
which produced a table of counts per landmark.
Now I'm trying to merge it with another table:
dock_count_by_station = station.groupby('landmark').sum()
that is also a simple group by on the same table, but the merge produces an error:
TypeError: cannot concatenate a non-NDFrame object
with this code:
dock_count_by_station.merge(station_count)
I think the problem is that I need to set the index of the two tables before merging them, but this code:
station_count.set_index('landmark')
keeps raising:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()
KeyError: 'landmark'
Using join
You can use join, which merges the tables on their index. You may also wish to specify the join type (e.g. 'outer', 'inner', 'left' or 'right'). You have overlapping column names (e.g. station_id), so you need to specify a suffix.
>>> dock_count_by_station.join(station_count, rsuffix='_rhs')
dockcount lat long station_id dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
landmark
Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
Using merge
Note that your landmark index was set by default when you did the groupby. You can always use as_index=False if you don't want this to occur, but then you would have to use merge instead of join.
dock_count_by_station = station.groupby('landmark', as_index=False).sum()
station_count = station.groupby('landmark', as_index=False).count()
>>> dock_count_by_station.merge(station_count, on='landmark', suffixes=['_lhs', '_rhs'])
landmark dockcount_lhs lat_lhs long_lhs station_id_lhs dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
0 Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
1 Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
2 Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
3 San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
4 San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
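As an aside (my reading of the traceback, not part of the original answer): the KeyError happens because the groupby already made landmark the index, so it is no longer a column for set_index to find. Resetting the index brings it back:

# 'landmark' is the index after the groupby, not a column, hence KeyError: 'landmark'
station_count = station_count.reset_index()
station_count = station_count.set_index('landmark')  # now succeeds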

Trouble with NaNs: set_index().reset_index() corrupts data

I read that NaNs are problematic, but the following causes an actual corruption of my data, rather than an error. Is this a bug? Have I missed something basic in the documentation?
I would like the second command to give an error or to give the same response as the first command:
ipdb> df
year PRuid QC data
18 2007 nonQC 0 8.014261
19 2008 nonQC 0 7.859152
20 2010 nonQC 0 7.468260
21 1985 10 NaN 0.861403
22 1985 11 NaN 0.878531
23 1985 12 NaN 0.842704
24 1985 13 NaN 0.785877
25 1985 24 1 0.730625
26 1985 35 NaN 0.816686
27 1985 46 NaN 0.819271
28 1985 47 NaN 0.807050
ipdb> df.set_index(['year','PRuid','QC']).reset_index()
year PRuid QC data
0 2007 nonQC 0 8.014261
1 2008 nonQC 0 7.859152
2 2010 nonQC 0 7.468260
3 1985 10 1 0.861403
4 1985 11 1 0.878531
5 1985 12 1 0.842704
6 1985 13 1 0.785877
7 1985 24 1 0.730625
8 1985 35 1 0.816686
9 1985 46 1 0.819271
10 1985 47 1 0.807050
The value of "QC" is actually changed to 1 from NaN where it should be NaN.
Btw, for symmetry I added the ".reset_index()", but the data corruption is introduced by set_index.
And in case this is interesting, the version is:
pd.version
<module 'pandas.version' from '/usr/lib/python2.6/site-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/version.pyc'>
So this was a bug. By the end of May 2013, pandas 0.11.1 should be released with the bug fix (see the comments on this question).
In the meantime, I avoided putting a column with NaNs into any MultiIndex, for instance by using some other flag value (-99) for the NaNs in the 'QC' column.
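A minimal sketch of that workaround (the -99 sentinel is from the answer above; restoring the NaNs afterwards is an addition):

import numpy as np

# swap the NaNs for a sentinel before 'QC' goes into the MultiIndex
df['QC'] = df['QC'].fillna(-99)
df = df.set_index(['year', 'PRuid', 'QC']).reset_index()

# swap the sentinel back to NaN afterwards
df['QC'] = df['QC'].replace(-99, np.nan)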