I cannot iterate through the list of restaurants.
Here is a quick video demonstrating my issue: https://streamable.com/vorg7
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find("div", {"class": "ui cards"})
restaurant_title = zomato_containers[1].find("a", {"class": "result-title hover_feedback zred bold ln24 fontsize0 "}).text
print("restaurant_title: ", restaurant_title)
I expect Python to report that there are 15 restaurants on the page, but I am getting 39.
I just changed the class you use to find your results and used find_all() to get all the snippet cards; that finds 15 restaurants:
CODE:
import requests
from bs4 import BeautifulSoup as soup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
bs = soup(response.text,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
print(len(zomato_containers))
for zc in zomato_containers:
    restaurant_title = zc.find("a", {"class": "result-title"})
    print("restaurant_title: ", restaurant_title.get_text())
RESULT:
15
restaurant_title: Delfina
restaurant_title: Boudin Bakery
restaurant_title: In-N-Out Burger
restaurant_title: Hollywood Cafe
restaurant_title: The Slanted Door
restaurant_title: Tartine Bakery
restaurant_title: The Original Ghirardelli Ice Cream and Chocolate...
restaurant_title: The Cheesecake Factory
restaurant_title: Scoma's
restaurant_title: Boulevard
restaurant_title: Foreign Cinema
restaurant_title: Zuni Café
restaurant_title: Brenda's French Soul Food
restaurant_title: Gary Danko
restaurant_title: Hog Island Oyster Company
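If you also want the restaurants from the following result pages, a minimal sketch (assuming the page query parameter keeps working the same way and the markup is identical on every page) is to repeat the same request and parsing in a loop:

import requests
from bs4 import BeautifulSoup as soup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

all_titles = []
for page in range(1, 4):  # first three result pages; widen the range as needed
    response = requests.get(
        "https://www.zomato.com/san-francisco/restaurants",
        params={"q": "restaurants", "page": page},
        headers=headers,
    )
    bs = soup(response.text, "html.parser")
    for zc in bs.find_all("div", {"class": "search-snippet-card"}):
        title = zc.find("a", {"class": "result-title"})
        if title:  # skip cards without a title link
            all_titles.append(title.get_text(strip=True))

print(len(all_titles), "restaurants found")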
I'm trying to make the spider crawl just the provided start URLs without following any extracted links. I've tried setting rules = (Rule (follow=False),) but it still follows links. Does anyone know how to download the start URLs only?
EDIT:
Here's some code
from scrapy.spiders import CrawlSpider, Rule


class Spider(CrawlSpider):
    name = 'spider'
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'

    def __init__(self, mode, *args, **kwargs):
        if mode == 'scan':
            self.start_urls = ['https://www.example.com/']
            self.rules = (
                Rule(callback="parse_obj", follow=False),
            )
            self.custom_settings = {
                'COMPRESSION_ENABLED': True,
                'URLLENGTH_LIMIT': 100,
                'DOWNLOAD_DELAY': 1
            }
        elif mode == 'crawl':
            # something else
            pass
        super(Spider, self).__init__(*args, **kwargs)
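If the goal is only to download the start URLs and never follow extracted links, one common approach is to use a plain scrapy.Spider instead of CrawlSpider: a plain Spider only requests the start URLs plus whatever requests you yield yourself. A minimal sketch (the class name, URL, and parse logic are illustrative placeholders, not taken from the question):

import scrapy


class StartUrlsOnlySpider(scrapy.Spider):
    name = 'start_urls_only'
    start_urls = ['https://www.example.com/']

    # custom_settings only takes effect when defined at class level;
    # assigning self.custom_settings inside __init__ is too late for Scrapy to pick it up.
    custom_settings = {
        'COMPRESSION_ENABLED': True,
        'URLLENGTH_LIMIT': 100,
        'DOWNLOAD_DELAY': 1,
    }

    def parse(self, response):
        # Only the start URLs are downloaded; no links are extracted or followed
        # unless further Requests are yielded here explicitly.
        yield {'url': response.url, 'title': response.css('title::text').get()}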
I am sending a GET request to a website using requests.request in Python. When I set User-Agent in the headers I only get script tags in the response. When I do not set User-Agent I get all the tags except script. What is the problem? Any ideas?
Code with only script tags:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
URL = 'https://www.gate-away.com/properties/lazio/rome/rome/id/567087'
response = requests.request('GET', URL, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
print(len(soup.findAll('script', {'type': "text/javascript"})))
print(len(soup.findAll('div')))
Output is:
15 (a non-zero number)
0
Code with all tags but script:
headers = {}
URL = 'https://www.gate-away.com/properties/lazio/rome/rome/id/567087'
response = requests.request('GET', URL, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
print(len(soup.findAll('script', {'type': "text/javascript"})))
print(len(soup.findAll('div')))
Output is:
0
100 (a non-zero number)
With a browser User-Agent the site returns a page whose data lives in a JavaScript variable (preloadedData) inside one of those script tags rather than in ordinary HTML elements, so you can extract that variable and parse it as JSON:
import re
import requests
from bs4 import BeautifulSoup
import json
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
url = "https://www.gate-away.com/properties/lazio/rome/rome/id/567087"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
data_script = soup.find('script', string=re.compile("var preloadedData = "))
actual_data = json.loads(data_script.contents[0].split('var preloadedData = ')[1].rsplit(';', 1)[0])
print(actual_data)
This will return a dict with pretty much all data in question.
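To see what the extracted dict actually contains before picking specific fields out of it, you can pretty-print it or dump it to a file (a small usage sketch; the key structure is whatever the site happens to embed):

import json

# Pretty-print the first part of the structure to inspect the available keys.
print(json.dumps(actual_data, indent=2, ensure_ascii=False)[:2000])

# Or write the whole thing to a file for easier browsing.
with open('preloaded_data.json', 'w', encoding='utf-8') as f:
    json.dump(actual_data, f, indent=2, ensure_ascii=False)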
When I try to get the text I get an error like:
price = item.find('span').text
AttributeError: 'NoneType' object has no attribute 'text'
code:
#___IMPORTS_____
from datetime import date
import calendar
import requests
from bs4 import BeautifulSoup
#_______________
url = 'https://www.investing.com/currencies/eur-usd'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
#print(f'Status code is: {page.status_code}')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('div', class_='first inlineblock')[0]
for item in table:
    price = item.find('span').text
    print(price)
Iterate over the list of divs returned by find_all() (rather than over the children of a single div) and pick the value span by its class. Try:
#___IMPORTS_____
from datetime import date
import calendar
import requests
from bs4 import BeautifulSoup
#_______________
url= 'https://www.investing.com/currencies/eur-usd'
page = requests.get(url, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
#print(f'Status code is: {page.status_code}')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('div', class_='first inlineblock')
for item in table:
    price = item.find('span', class_='float_lang_base_2')
    print(price.text)
1.1753
1.1752
- 0.4
Or if you require the field:
for item in table:
    field = item.find('span', class_='float_lang_base_1')
    price = item.find('span', class_='float_lang_base_2')
    print(field.text, ':', price.text)
Prev. Close : 1.1753
Open : 1.1752
1-Year Change : - 0.4
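If you would rather collect the whole panel into a dictionary instead of printing line by line, a small sketch built on the same two span classes:

quote_data = {}
for item in table:
    field = item.find('span', class_='float_lang_base_1')
    price = item.find('span', class_='float_lang_base_2')
    if field and price:  # skip any block that is missing either span
        quote_data[field.get_text(strip=True)] = price.get_text(strip=True)

print(quote_data)  # e.g. {'Prev. Close': '1.1753', 'Open': '1.1752', ...}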
Here's the problem statement: the base_site link below takes us to a job search URL.
There are small containers that show jobs on the left pane of the webpage.
The problem is that with this code I can only see 7 containers in the output.
For example, it shows the first seven job result locations, whereas I am expecting all of them to appear. I am using scrollIntoView for this, but that doesn't seem to help either.
What am I missing?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from time import sleep


def get_driver():
    options = Options()
    options.add_argument("user-data-dir=C:\\Users\\abc\\AppData\\Local\\Google\\Chrome\\User Data")
    path = 'C:\\Program Files (x86)\\Google\\chromedriver.exe'
    options.add_experimental_option("detach", True)
    driver = webdriver.Chrome(path, options=options)
    text_search = 'Product Development Engineer'
    location_search = 'california'
    # base_site = 'https://www.linkedin.com/jobs'
    base_site = 'https://www.linkedin.com/jobs/search/?currentJobId=2638809245&f_E=3%2C4&f_JT=F&f_SB2=3&f_TPR=r60' \
                '4800&geoId=102095887&keywords=product%20development%20engineer&location=California%2C%20United%20States&sortBy=R'
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/"
                             "70.0.3538.102 Safari/537.36 Edge/18.19582"}
    driver.get(base_site)
    parsing_job_data(driver, base_site, headers)


def parsing_job_data(driver, base_site, headers):
    try:
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        results = soup.find_all('div', class_="job-card-container relative job-card-list job-card-container--clickable "
                                              "job-card-list--underline-title-on-hover jobs-search-results-list__list-"
                                              "item--active jobs-search-two-pane__job-card-container--viewport-tracking"
                                              "-0")
        sleep(1)
        each_container = soup.select('[class*="occludable-update"]', limit=20)
        for container in each_container:
            element = driver.find_element_by_class_name("artdeco-entity-lockup__caption")
            driver.execute_script("arguments[0].scrollIntoView(true);", element)
            element.click()
            job_title = container.find('a', class_='disabled ember-view job-card-container__link job-card-list__title').text
            location = container.find('li', class_='job-card-container__metadata-item').text
            job_title = job_title.strip()
            location = location.strip()
            print(job_title, ', ', location)
    except Exception as e:
        print(e)


if __name__ == "__main__":
    get_driver()
Instead of scrolling the rendered page with Selenium, the code below requests LinkedIn's public jobs-guest search endpoint directly, stepping the start parameter through the result pages in batches of 25 and parsing each response, with the requests run concurrently via httpx and trio:

import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"
}


async def get_soup(content):
    return BeautifulSoup(content, 'lxml')

allin = []


async def worker(channel):
    async with channel:
        async for num in channel:
            async with httpx.AsyncClient(timeout=None) as client:
                client.headers.update(headers)
                params = {
                    "currentJobId": "2638809245",
                    "f_E": "3,4",
                    "f_JT": "F",
                    "f_SB2": "3",
                    "f_TPR": "r604800",
                    "geoId": "102095887",
                    "keywords": "product development engineer",
                    "location": "California, United States",
                    "sortBy": "R",
                    "position": "1",
                    "pageNum": "0",
                    "start": num
                }
                r = await client.get('https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search', params=params)
                soup = await get_soup(r.text)
                goal = [(x.h3.get_text(strip=True), x.select_one('.job-search-card__location').get_text(strip=True))
                        for x in soup.select('.base-search-card__info')]
                allin.extend(goal)


async def main():
    async with trio.open_nursery() as nurse:
        sender, receiver = trio.open_memory_channel(0)
        async with receiver:
            for _ in range(2):
                nurse.start_soon(worker, receiver.clone())
            async with sender:
                for num in range(0, 450, 25):
                    await sender.send(num)

    df = pd.DataFrame(allin, columns=["Title", "Location"])
    print(df)
    #df.to_csv('result.csv', index=False)


if __name__ == "__main__":
    trio.run(main)
Output:
Title Location
0 Packaging Process Engineer Fremont, CA
1 Project Engineer Oakland, CA
2 Process Engineer- Materials and Fibers Santa Clarita, CA
3 Senior Product Design Engineer Carson, CA
4 Design Engineer Sacramento, CA
.. ... ...
436 Software Development Engineer Irvine, CA
437 Software Development Engineer Sunnyvale, CA
438 Software Development Engineer San Luis Obispo, CA
439 Software Development Engineer - Luna Irvine, CA
440 Software Development Engineer Irvine, CA
[441 rows x 2 columns]
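Because the results land in an ordinary DataFrame, you can filter or summarize them afterwards, for example (a small usage sketch using the Title and Location columns shown above):

# Keep only rows whose title mentions "product", case-insensitively.
product_jobs = df[df["Title"].str.contains("product", case=False)]
print(product_jobs)

# Count postings per location.
print(df["Location"].value_counts().head(10))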
I would like to scrape the individual's website and blog links on https://lawyers.justia.com/lawyer/robin-d-gross-39828.
I have so far:
if soup.find('div', attrs={'class': "heading-3 block-title iconed-heading font-w-bold"}) is not None:
    webs = soup.find('div', attrs={'class': "heading-3 block-title iconed-heading font-w-bold"})
    print(webs.findAll("href"))
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
r = requests.get(
"https://lawyers.justia.com/lawyer/robin-d-gross-39828", headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("a", {'data-vars-action': ['ProfileWebsite', 'ProfileBlogPost']}):
    print(item.get("href"))
Output:
http://www.imaginelaw.com/
http://www.imaginelaw.com/lawyer-attorney-1181486.html
http://www.ipjustice.org/internet-governance/icann-accountability-deficits-revealed-in-panel-ruling-on-africa/
http://www.circleid.com/members/5382
http://www.circleid.com/posts/20160301_icann_accountability_proposal_power_of_governments_over_internet
http://www.circleid.com/posts/20151201_supporting_orgs_marginalized_in_icann_accountability_proposal
http://www.circleid.com/posts/20150720_icann_accountability_deficits_revealed_in_panel_ruling_on_africa
http://www.circleid.com/posts/20150401_freedom_of_expression_chilled_by_icann_addition_of_speech
http://www.circleid.com/posts/20150203_proposal_for_creation_of_community_veto_for_key_icann_decisions
http://www.circleid.com/posts/20150106_civil_society_cautions_icann_giving_governments_veto_geo_domains
http://www.circleid.com/posts/20140829_radical_shift_of_power_proposed_at_icann_govts_in_primary_role
http://www.circleid.com/posts/20140821_icanns_accountability_plan_gives_icann_board_total_control
http://www.circleid.com/posts/20140427_a_civil_society_perspective_on_netmundial_final_outcome_document
https://imaginelaw.wordpress.com
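If you need to keep the website links separate from the blog links, you can group them by the same data-vars-action attribute the selector filters on (a small sketch building on the code above):

links = {'ProfileWebsite': [], 'ProfileBlogPost': []}
for item in soup.findAll("a", {'data-vars-action': ['ProfileWebsite', 'ProfileBlogPost']}):
    action = item.get("data-vars-action")
    links[action].append(item.get("href"))

print("Websites:", links['ProfileWebsite'])
print("Blog posts:", links['ProfileBlogPost'])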