This is my first time working with Selenium and web scraping. I have been trying to get the menu items and prices for a certain restaurant in California from the following website (https://www.fastfoodmenuprices.com/baskin-robbins-prices/). I have been able to use Selenium to select California from the dropdown menu, but I keep running into the problem of not being able to scrape the menu items and prices, ending up with a blank data frame. How do I scrape the menu items and prices from this website and store them in a data frame? The code is below:
from selenium import webdriver
import time
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
from bs4 import BeautifulSoup
path = "/path/to/chromedriver"
driver = webdriver.Chrome(executable_path = path)
url = "https://www.fastfoodmenuprices.com/baskin-robbins-prices/"
driver.get(url)
Select(WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.XPATH, "//select[@class='tp-variation']")))).select_by_value("MS4yOA==")
print(driver.page_source)
driver.quit
menu = []
prices = []
content = driver.page_source
soup = BeautifulSoup (content, features = "html.parser")
for element in soup.findAll('div', attrs = {'tbody': 'row-hover'}):
    menu = element.find('td', attrs = {'class': "column-1"})
    prices = element.find('td', attrs = {'class': 'column-3'})
    menu.append(menu.text)
    prices.append(prices.text)
df = pd.DataFrame({'Menu Item':menu, 'Prices':prices})
df
Try:
import requests
import base64
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.fastfoodmenuprices.com/baskin-robbins-prices/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = []
for td in soup.select(
    "tr:has(.column-1):has(.column-2):has(.column-3):has(input)"
):
    data.append(
        {
            "Type": td.find_previous(colspan="3").get_text(strip=True),
            "Food": td.select_one(".column-1").get_text(strip=True),
            "Size": td.select_one(".column-2").get_text(strip=True),
            "Price": float(
                td.select_one(".column-3").get_text(strip=True).strip("$")
            ),
        }
    )
adjust = soup.select_one('.tp-variation option:-soup-contains("California")')
adjust = float(base64.b64decode(adjust["value"]))
df = pd.DataFrame(data)
df["Price"] = (df["Price"] * adjust).round(2)
print(df)
df.to_csv("data.csv", index=False)
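The reason no browser is needed here: the state dropdown stores each price adjustment as a base64-encoded multiplier, so the value for California can be decoded directly. A quick check of the value used above:

import base64

# "MS4yOA==" is the <option> value for California in the dropdown
print(base64.b64decode("MS4yOA=="))         # b'1.28'
print(float(base64.b64decode("MS4yOA==")))  # 1.28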
Prints:
    Type                                                                                                      Food                                          Size    Price
0   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Soft Serve Below                              Mini     2.80
1   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Soft Serve Below                              Small    4.84
2   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Soft Serve Below                              Medium   5.61
3   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Soft Serve Below                              Large    7.65
4   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Cups & Cones                                  Kids     2.02
5   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Cups & Cones                                  Regular  2.53
6   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Cups & Cones                                  Large    3.81
7   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Parfaits                                      Mini     2.80
8   Soft Serve Flavors: Reese’s, Heath, Snickers, M&M’s, Oreo, Butterfinger, and Chocolate Chip Cookie Dough  Parfaits                                      Regular  6.39
9   Sundaes                                                                                                   Banana Royale                                          7.03
10  Sundaes                                                                                                   Brownie                                                7.03
11  Sundaes                                                                                                   Banana Split                                           8.56
12  Sundaes                                                                                                   Reese’s Peanut Butter Cup Sundae                       7.67
13  Sundaes                                                                                                   Chocolate Chip Cookie Dough Sundae                     7.67
14  Sundaes                                                                                                   Oreo® Layered Sundae                                   7.67
15  Sundaes                                                                                                   Made with Snickers Sundae                              7.67
16  Sundaes                                                                                                   One Scoop Sundae                                       4.47
17  Sundaes                                                                                                   Two Scoops Sundae                                      5.75
18  Sundaes                                                                                                   Three Scoops Sundae                                    6.64
19  Sundaes                                                                                                   Candy Topping                                          1.01
20  Sundaes                                                                                                   Waffle Bowl                                            1.27
21  Ice Cream                                                                                                 Kid’s Scoop                                            2.80
22  Ice Cream                                                                                                 Single Scoop                                           3.57
23  Ice Cream                                                                                                 Double Scoop                                           5.11
24  Ice Cream                                                                                                 Regular Waffle Cone                                    1.27
25  Ice Cream                                                                                                 Chocolate Waffle Cone                                  1.91
26  Ice Cream                                                                                                 Fancy Waffle Cone                                      1.91
27  Beverages                                                                                                 Cappuccino Blast                              Mini     4.72
28  Beverages                                                                                                 Cappuccino Blast                              Small    6.00
29  Beverages                                                                                                 Cappuccino Blast                              Medium   7.28
30  Beverages                                                                                                 Cappuccino Blast                              Large    8.56
31  Beverages                                                                                                 Iced Cappy Blast                              Mini     4.72
32  Beverages                                                                                                 Iced Cappy Blast                              Small    6.00
33  Beverages                                                                                                 Iced Cappy Blast                              Medium   7.28
34  Beverages                                                                                                 Iced Cappy Blast                              Large    8.56
35  Beverages                                                                                                 Add a Boost (Cappuccino or Iced Cappy Blast)           0.64
36  Beverages                                                                                                 Smoothie                                      Mini     4.72
37  Beverages                                                                                                 Smoothie                                      Small    6.00
38  Beverages                                                                                                 Smoothie                                      Medium   7.28
39  Beverages                                                                                                 Smoothie                                      Large    8.56
40  Beverages                                                                                                 Shake                                         Mini     4.72
41  Beverages                                                                                                 Shake                                         Small    6.00
42  Beverages                                                                                                 Shake                                         Medium   7.28
43  Beverages                                                                                                 Shake                                         Large    8.56
44  Ice Cream To Go                                                                                           Pre-Packed                                    Quart    7.67
45  Ice Cream To Go                                                                                           Hand-Packed                                   Pint     6.39
46  Ice Cream To Go                                                                                           Hand-Packed                                   Quart   10.23
47  Ice Cream To Go                                                                                           Clown Cones                                            3.70
and creates data.csv.
* The website is using Cloudflare protection:
https://www.fastfoodmenuprices.com/baskin-robbins-prices/ is using Cloudflare CDN/Proxy!
https://www.fastfoodmenuprices.com/baskin-robbins-prices/ is using Cloudflare SSL!
** So I have to use the following options to evade detection:
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
*** To select the table tr and td elements, I use CSS selectors, which are more robust and flexible.
**** I have to use list and the zip function in the pandas DataFrame call because the two lists do not end up the same length (see the small example after these notes).
***** I have to use try/except because, as you will see, some menu items are missing.
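To illustrate the zip point, here is a minimal, hypothetical two-list example (not data from the site): zip pairs items only up to the length of the shorter list, which is what lets two unequal lists be combined into one DataFrame:

import pandas as pd

prices = ["$2.80", "$4.84", "$5.61"]
menus = ["Mini", "Small"]  # one size label missing

# zip stops at the shorter list, so only two complete rows are built
df = pd.DataFrame(data=list(zip(prices, menus)), columns=['price', 'menu'])
print(df)
#    price   menu
# 0  $2.80   Mini
# 1  $4.84  Small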
Script:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
url = "https://www.fastfoodmenuprices.com/baskin-robbins-prices/"
driver.get(url)
Select(WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.XPATH, "//select[@class='tp-variation']")))).select_by_value("MS4yOA==")
price=[]
menu=[]
soup = BeautifulSoup (driver.page_source,"lxml")
driver.close()
for element in soup.select('#tablepress-34 tbody tr'):
    try:
        menus = element.select_one('td:nth-child(2)').text
        menu.append(menus)
    except:
        pass
    try:
        prices = element.select_one('td:nth-child(3) span').text
        price.append(prices)
    except:
        pass
df = pd.DataFrame(data=list(zip(price,menu)),columns=['price','menu'])
print(df)
Output:
price menu
0 $2.80 Mini
1 $4.84 Small
2 $5.61 Medium
3 $7.65 Large
4 $2.02 Kids
5 $2.53 Regular
6 $3.81 Large
7 $2.80 Mini
8 $6.39 Regular
9 $7.03
10 $7.03
11 $8.56
12 $7.67
13 $7.67
14 $7.67
15 $7.67
16 $4.47
17 $5.75
18 $6.64
19 $1.01
20 $1.27
21 $2.80
22 $3.57
23 $5.11
24 $1.27
25 $1.91
26 $1.91
27 $4.72 Mini
28 $6.00 Small
29 $7.28 Medium
30 $8.56 Large
31 $4.72 Mini
32 $6.00 Small
33 $7.28 Medium
34 $8.56 Large
35 $0.64
36 $4.72 Mini
37 $6.00 Small
38 $7.28 Medium
39 $8.56 Large
40 $4.72 Mini
41 $6.00 Small
42 $7.28 Medium
43 $8.56 Large
44 $7.67 Quart
45 $6.39 Pint
46 $10.23 Quart
47 $3.70
Once you select California, to extract the table contents from the website you need to induce WebDriverWait for visibility_of_element_located(), and using DataFrame from Pandas you can use the following locator strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.fastfoodmenuprices.com/baskin-robbins-prices")
Select(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//select[@class='tp-variation']")))).select_by_value("MS4yOA==")
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@id='tablepress-34']"))).get_attribute("outerHTML")
tabledf = pd.read_html(tabledata)
print(tabledf)
Console Output:
[ Food ... Price
0 Soft Serve Flavors: Reese’s, Heath, Snickers, ... ... Soft Serve Flavors: Reese’s, Heath, Snickers, ...
1 Soft Serve Below ... $2.80
2 Soft Serve Below ... $4.84
3 Soft Serve Below ... $5.61
4 Soft Serve Below ... $7.65
5 Cups & Cones ... $2.02
6 Cups & Cones ... $2.53
7 Cups & Cones ... $3.81
8 Parfaits ... $2.80
9 Parfaits ... $6.39
10 Sundaes ... Sundaes
11 Banana Royale ... $7.03
12 Brownie ... $7.03
13 Banana Split ... $8.56
14 Reese’s Peanut Butter Cup Sundae ... $7.67
15 Chocolate Chip Cookie Dough Sundae ... $7.67
16 Oreo® Layered Sundae ... $7.67
17 Made with Snickers Sundae ... $7.67
18 One Scoop Sundae ... $4.47
19 Two Scoops Sundae ... $5.75
20 Three Scoops Sundae ... $6.64
21 Candy Topping ... $1.01
22 Waffle Bowl ... $1.27
23 Ice Cream ... Ice Cream
24 Kid’s Scoop ... $2.80
25 Single Scoop ... $3.57
26 Double Scoop ... $5.11
27 Regular Waffle Cone ... $1.27
28 Chocolate Waffle Cone ... $1.91
29 Fancy Waffle Cone ... $1.91
30 Beverages ... Beverages
31 Cappuccino Blast ... $4.72
32 Cappuccino Blast ... $6.00
33 Cappuccino Blast ... $7.28
34 Cappuccino Blast ... $8.56
35 Iced Cappy Blast ... $4.72
36 Iced Cappy Blast ... $6.00
37 Iced Cappy Blast ... $7.28
38 Iced Cappy Blast ... $8.56
39 Add a Boost (Cappuccino or Iced Cappy Blast) ... $0.64
40 Smoothie ... $4.72
41 Smoothie ... $6.00
42 Smoothie ... $7.28
43 Smoothie ... $8.56
44 Shake ... $4.72
45 Shake ... $6.00
46 Shake ... $7.28
47 Shake ... $8.56
48 Ice Cream To Go ... Ice Cream To Go
49 Pre-Packed ... $7.67
50 Hand-Packed ... $6.39
51 Hand-Packed ... $10.23
52 Clown Cones ... $3.70
[53 rows x 3 columns]]
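If you want to tidy that frame up afterwards, one optional step is to drop the category header rows (Sundaes, Ice Cream, Beverages, ...), which repeat the same text in every column. A minimal sketch, assuming pandas infers the Food and Price column names shown in the output above:

df = tabledf[0]
# category header rows repeat their label across all columns; drop them
df = df[df["Food"] != df["Price"]].reset_index(drop=True)
print(df)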
I recently managed to collect tabular data from a PDF file using camelot in Python. By collect I mean print it out on the terminal. Now I would like to find a way to automatically turn the results into a bar graph in matplotlib. How would I do that? Here's my code for extracting the tabular data from the PDF:
import camelot
tables = camelot.read_pdf("data_table.pdf", pages='2')
print(tables[0].df)
Which then prints out a large table in my terminal:
0 1 2 3 4
0 Country \nCase definition \nCumulative cases \...
1 Guinea Confirmed 2727 156 1683
2 Probable 374 * 374
3 Suspected 7 * ‡
4 Total 3108 156 2057
5 Liberia** Confirmed 3149 11 ‡
6 Probable 1876 * ‡
7 Suspected 3982 * ‡
8 Total 9007 11 3900
9 Sierra Leone Confirmed 8212 230 3042
10 Probable 287 * 208
11 Suspected 2604 * 158
12 Total 11103 230 3408
13 Total 23 218 397 9365
I do have a bit of experience with matplotlib and I know how to plot data manually, but not automatically from the PDF. This would save me some time since I'm trying to automate the whole process.
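A minimal sketch of one way to automate this, assuming the layout printed above: column 1 holds the case definition, column 2 the cumulative case count, and the three per-country "Total" rows appear in the order Guinea, Liberia, Sierra Leone (the country labels below are typed in by hand on that assumption):

import camelot
import matplotlib.pyplot as plt

tables = camelot.read_pdf("data_table.pdf", pages="2")
df = tables[0].df  # camelot returns string cells in columns numbered 0..4

# keep the per-country "Total" rows and parse the cumulative case counts,
# stripping any stray non-digit characters camelot leaves in the numbers
totals = df[df[1] == "Total"]
counts = totals[2].str.replace(r"\D", "", regex=True).astype(int)

countries = ["Guinea", "Liberia", "Sierra Leone"]
plt.bar(countries, counts)
plt.ylabel("Cumulative cases")
plt.show()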
How do I code BeautifulSoup to display the results in a tabular format?
something like this:
Topic | Views | Replies
---------------------------------------
XPS 7590 problems | 557 | 8
SSD not working | 76 | 3
My code is:
import requests, re
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "lia-component-messages-column-thread-info"})
for item in g_data:
    print(item.find_all("h2", {"class": "message-subject"})[0].text)
    print(item.find_all("span", {"class": "lia-message-stats-count"})[0].text) #replies
    print(item.find_all("span", {"class": "lia-message-stats-count"})[1].text) #views
Just construct a DataFrame by initializing an empty one and appending each "row" to it:
import requests, re
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://www.dell.com/community/XPS/bd-p/XPS")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "lia-component-messages-column-thread-info"})
df = pd.DataFrame()
for item in g_data:
    topic = item.find_all("h2", {"class": "message-subject"})[0].text.strip()
    replies = item.find_all("span", {"class": "lia-message-stats-count"})[0].text.strip() #replies
    views = item.find_all("span", {"class": "lia-message-stats-count"})[1].text.strip() #views
    df = df.append(pd.DataFrame([[topic, views, replies]], columns=['Topic','Views','Replies']), sort=False).reset_index(drop=True)
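Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea is normally written by collecting the rows in a list and constructing the frame once (which is also much faster than appending row by row):

rows = []
for item in g_data:
    topic = item.find_all("h2", {"class": "message-subject"})[0].text.strip()
    replies = item.find_all("span", {"class": "lia-message-stats-count"})[0].text.strip()
    views = item.find_all("span", {"class": "lia-message-stats-count"})[1].text.strip()
    rows.append([topic, views, replies])

# one DataFrame construction at the end instead of repeated appends
df = pd.DataFrame(rows, columns=['Topic', 'Views', 'Replies'])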
Output:
print (df)
Topic Views Replies
0 FAQ Modern Standby 1057 0
1 FAQ XPS Laptops 4315 0
2 Where is the Precision Laptops Forum board? 624 0
3 XPS 15-9570, color banding issue 5880 192
4 XPS 7590 problems.. 565 9
5 XPS 13 7390 2-in-1 Display and Touchscreen issues 17 2
6 Dell XPS 9570 I7-8750H video display issues 9 0
7 XPS 9360 Fn lock for PgUp PgDn 12 0
8 Dell XPS DPC Latency Fix 1724 4
9 XPS 13 7390 2-in-1, Realtek drivers lead to fr... 253 11
10 XPS 12 9q23 Touch screen firmware update fix 36 1
11 Dell XPS 15 9570 when HDMI plugged in, screen ... 17 0
12 XPS 13 7390 2 in 1 bluetooth keyboard and mous... 259 10
13 xps15 7590 wifi problem 46 1
14 Unable to update Windows from 1803 to 1909 - X... 52 5
15 Dell XPS 9300 - Thunderbolt 3 Power Delivery I... 28 0
16 Dell XPS 15 9560, right arrow key or right of ... 26 0
17 XPS 13 2020 (9300) Ubuntu sudden shut down 24 0
18 Dell XPS 15 9750 won’t login 26 0
19 XPS 13 9360 Windows Hello Face - reconfigurati... 29 2
20 Enclosure for Dell XPS 13 9360 512 GB pcie nvm... 181 7
21 XPS 13 7390 Firmware 1.3.1 Issue - Bluetooth /... 119 2
22 SSD Onboard? 77 3
23 XPS 13 9350 only turns on when charger connected 4090 11
24 Integrated webcam not working 45 1
25 Docking station for XPS 15 9570, Dell TB16 not... 53 4
26 Dell XPS 13 9370 34 1
27 XPS 13 9380 overheat while charging 602 3
28 DELL XPS 13 (9300) REALTEK AUDIO DRIVER PROBLEM 214 2
29 XPS 15 9570 freezing Windows 10 222 6
30 XPS 13 (9300) - Speaker Vibration 40 2
31 Dell XPS 15 9570 Fingerprint reader not workin... 158 2
32 XPS 9570 Intel 9260 No Bluetooth 34 0
I am importing EIA data which contains weekly storage data. The first column is the reported week and the second is storage.
When I import the data it shows two columns. The first column has no title and the second has the title "Weekly Lower 48 States Natural Gas Working Underground Storage, Weekly (Billion Cubic Feet)".
I would like to plot the data using matplotlib, but I need to separate the columns first. I used df.iloc[100:,:0] and this gives the first column, which is the week, but I somehow cannot separate the second column.
import eia
import pandas as pd
import os
api_key = "mykey"
api = eia.API(api_key)
series_search = api.data_by_series(series='NG.NW2_EPG0_SWO_R48_BCF.W')
df = pd.DataFrame(series_search)
df1 = df.iloc[100:,:0]
Code Output:
This output is a sample of all 486 rows. When I use the df.shape command it shows (486, 1) when it should show (486, 2).
2010 0101 01 3117
2010 0108 08 2850
2010 0115 15 2607
2010 0122 22 2521
2019 0322 22 1107
2019 0329 29 1130
2019 0405 05 1155
2019 0412 12 1247
2019 0419 19 1339
You can first cut the last 3 characters of the string and then convert it to datetime:
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')
print(df)
Date Value
0 2010-01-01 3117
1 2010-01-08 2850
2 2010-01-15 2607
3 2010-01-22 2521
4 2019-03-22 1107
5 2019-03-29 1130
6 2019-04-05 1155
7 2019-04-12 1247
8 2019-04-19 1339
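With the date parsed, the plot itself is one line; a minimal sketch, assuming the frame has the Date and Value columns shown above:

import matplotlib.pyplot as plt

df.plot(x='Date', y='Value', legend=False)
plt.ylabel('Working gas in storage (Bcf)')
plt.show()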
I'm a stranger to SQL. I have somehow managed to get this table out, but it's still far from what I need. My table looks like:
   Location  Bus    Type       Colour  Count  Capacity
1. hartford  volvo  20 Seater  Red     10     5000cc
2. hartford  ford   10 seater  blue    12     2000cc
3. hartford  Merc   20 seater  green   12     2000cc
4. kansas    lambo  16 Seater  Red     13     1000cc
5. kansas    banbo  15 Seater  blue    13     1000cc
6. kansas    babho  17 Seater  green   13     1000cc
I want to change the layout to
http://i.stack.imgur.com/MMUgf.png
Please suggest how I can get the desired layout. I'm trying to use the PIVOT function in SQL.
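The target layout exists only in the linked image, so the following is just a generic conditional-aggregation sketch, with an assumed table name (Buses) and an assumed layout of one row per Location and one column per Colour, summing Count; adjust it to the columns the image actually shows:

-- hypothetical table and column layout; Count is bracketed because
-- it is a reserved word in SQL Server
SELECT
    Location,
    SUM(CASE WHEN Colour = 'Red'   THEN [Count] ELSE 0 END) AS Red,
    SUM(CASE WHEN Colour = 'blue'  THEN [Count] ELSE 0 END) AS Blue,
    SUM(CASE WHEN Colour = 'green' THEN [Count] ELSE 0 END) AS Green
FROM Buses
GROUP BY Location;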