I am creating a web scraping tool using BeautifulSoup and Selenium. I am scraping a community forum where I am able to scrape the first web page of a particular thread. Take, for example, the following thread: https://www.dell.com/community/Optiplex-Desktops/dell-optiplex-7000MT-DDR5-Ram-campatibility/m-p/8224888#M61514
I can scrape only the first page. I want to scrape all of the pages (in this case 3) and display their content.
The following code scrapes the first page:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException
url = "https://www.dell.com/community/Optiplex-Desktops/dell-optiplex-7000MT-DDR5-Ram-campatibility/m-p/8224888#M61514"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
date = '01-19-2023'
comments = []
comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
for comment in comments_body:
    if date in comment.find('span', {'class': 'local-date'}).text:
        comments.append({
            'Date': comment.find('span', {'class': 'local-date'}).text.strip('\u200e'),
            'Board': soup.find_all('li', {'class': 'lia-breadcrumb-node crumb'})[1].text.strip(),
            'Sub-board': soup.find('a', {'class': 'lia-link-navigation crumb-board lia-breadcrumb-board lia-breadcrumb-forum'}).text,
            'Title of Post': soup.find('div', {'class': 'lia-message-subject'}).text.strip(),
            'Main Message': soup.find('div', {'class': 'lia-message-body'}).text.strip(),
            'Post Comment': comment.find('div', {'class': 'lia-message-body-content'}).text.strip(),
            'Post Time': comment.find('span', {'class': 'local-time'}).text,
            'Username': comment.find('a', {'class': 'lia-user-name-link'}).text,
            'URL': str(url)
        })
df1 = pd.DataFrame(comments)
print(df1)
I have tried the following:
next_page = driver.find_element("xpath", "//li[@class='lia-link-navigation lia-js-data-pageNum-2 lia-custom-event']")
next_page.click()
page2_url = driver.current_url
print(page2_url)
This is specific to page 2 only.
However, I want this for all subsequent pages, and if there is only one page, execution should simply continue to the next statement.
With the above code I'm trying to get the URLs of the subsequent pages, which I will add to the list of URLs that need to be scraped. Is there an alternative way to achieve this?
To scrape all the pages you can add a simple while loop that breaks when the "Next Page" button disappears.
while True:
    print('current page:', soup.select_one('span[aria-current="page"]').text)

    comments_section = ...
    comments_body = ...
    for comment in comments_body:
        ...

    # next_btn is a list
    next_btn = soup.select('a[aria-label="Next Page"]')
    # if the list is not empty, follow the link to the next page
    if next_btn:
        url = next_btn[0]['href']
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
    else:
        break
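Putting it together with the code from the question, a minimal sketch could look like this (the span[aria-current="page"] and a[aria-label="Next Page"] selectors are taken from the forum markup used above; verify them against the live page):

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.dell.com/community/Optiplex-Desktops/dell-optiplex-7000MT-DDR5-Ram-campatibility/m-p/8224888#M61514"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
date = '01-19-2023'
comments = []

while True:
    comments_section = soup.find('div', {'class': 'lia-component-message-list-detail-with-inline-editors'})
    comments_body = comments_section.find_all('div', {'class': 'lia-linear-display-message-view'})
    for comment in comments_body:
        if date in comment.find('span', {'class': 'local-date'}).text:
            comments.append({
                'Date': comment.find('span', {'class': 'local-date'}).text.strip('\u200e'),
                'Post Comment': comment.find('div', {'class': 'lia-message-body-content'}).text.strip(),
                'Username': comment.find('a', {'class': 'lia-user-name-link'}).text,
                'URL': url,
            })
    # follow the "Next Page" link if present; otherwise we are on the last page
    next_btn = soup.select('a[aria-label="Next Page"]')
    if next_btn:
        url = next_btn[0]['href']
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
    else:
        break

df1 = pd.DataFrame(comments)
print(df1)

If the thread has only one page, the select returns an empty list on the first pass, the loop exits immediately, and execution continues with the next statement, which is the behaviour asked for.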
Related
I need to collect all links from a webpage, as seen below, which also has a "Load More News" button. My script collects only the links from the first page, as if it never clicks the button, even though I updated some of the Selenium calls. I don't understand why I can't get all the links by clicking the load-more button.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
import json
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
url = "..."
base_url = "..."
driver.get(url)
outlinks = []
wait = WebDriverWait(driver, 90)
load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
num_links = 0
while True:
    links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
    num_links_new = len(links)
    if num_links_new > num_links:
        num_links = num_links_new
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
        if load_more_button.is_displayed():
            load_more_button.click()
            sleep(10)
    else:
        break

new_links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
for link in new_links:
    href = link.get_attribute('href')
    full_url = base_url + href
    enurl = full_url.replace("ar-ae", "en")
    outlinks.append(enurl)

print(outlinks)
data = json.dumps(outlinks)
with open('outlinks.json', 'w') as f:
    f.write(data)
driver.close()
Although you have tagged selenium, there is a much better way to handle this.
Whenever you click on the "load more" button, it sends a POST request to:
https://www.mofaic.gov.ae/api/features/News/NewsListPartialView
So, you can just get all the data from there directly using the requests/BeautifulSoup modules. There's no need for Selenium, and the process will be much faster!
import requests
from bs4 import BeautifulSoup
data = {
    "CurrentPage": "1",
    "CurrentRenderId": "{439EC71A-4231-45C8-B075-975BD41099A7}",
    "CategoryID": "{f9048938-c577-4caa-b1d9-ae1b7a5f1b20}",
    "PageSize": "6",
}

BASE_URL = "https://www.mofaic.gov.ae"
POST_URL = "https://www.mofaic.gov.ae/api/features/News/NewsListPartialView"

# Increase the range to get more articles - each iteration simulates one
# click of the "load more" button.
for page in range(1, 10):
    data["CurrentPage"] = page
    response = requests.post(POST_URL, data=data)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a.text-truncate"):
        print(BASE_URL + link["href"])
Prints (truncated):
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-leaders
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-vatican
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-fm
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-cuba
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-sudan
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2023-uae-israel
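If you'd rather not hard-code range(1, 10), one variation (a sketch, under the assumption that an exhausted listing returns a page with no a.text-truncate links) keeps posting until a page comes back empty:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.mofaic.gov.ae"
POST_URL = "https://www.mofaic.gov.ae/api/features/News/NewsListPartialView"
data = {
    "CurrentPage": "1",
    "CurrentRenderId": "{439EC71A-4231-45C8-B075-975BD41099A7}",
    "CategoryID": "{f9048938-c577-4caa-b1d9-ae1b7a5f1b20}",
    "PageSize": "6",
}

page = 1
while True:
    data["CurrentPage"] = page
    response = requests.post(POST_URL, data=data)
    soup = BeautifulSoup(response.text, "html.parser")
    links = soup.select("a.text-truncate")
    if not links:  # an empty page means the articles have run out
        break
    for link in links:
        print(BASE_URL + link["href"])
    page += 1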
I'm trying to get the logo URL (the img src) from a page's HTML.
For some reason, when I try to print out the logo URL, I get [] as a response. My code is as follows:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://growjo.com/industry/Cannabis'
request = Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0'}
)
page = urlopen(request)
page_content_bytes = page.read()
page_html = page_content_bytes.decode("utf-8")
soup = BeautifulSoup(page_html, "html.parser")
company_rows = soup.find_all("table",{"class":"jss31"})[0].find_all("tbody")[0].find_all("tr")
for company in company_rows:
    company_data = company.find_all("td")
    logo = company_data[1].find_all("div", {"class": "lazyload-wrapper"})[0].find_all("a")
    name = company_data[1].text
    print(logo)
    break
I tried printing out the 'a' tags... I tried the 'img' tags... they all come back as []. It's as if bs4 is not reading inside the div with class="lazyload-wrapper".
Any help would be greatly appreciated.
The URLs that contain the logos are generated entirely dynamically, and bs4 can't render JavaScript. The underlying API is restricted by authentication, so use an automation tool such as Selenium. Here I use Selenium 4 together with bs4, with the driver managed by webdriver-manager.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
#chrome to stay open
#options.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=options)
url= 'https://growjo.com/industry/Cannabis'
driver.get(url)
time.sleep(2)
soup = BeautifulSoup(driver.page_source, "html.parser")
company_rows = soup.select('table.jss31 tbody tr')
for company in company_rows:
    log = company.select_one('td[class="jss38 jss40 jss46"] div + a')
    logo = 'https://growjo.com' + log.get('href') if log else None
    print(logo)
Output:
https://growjo.com/company/Dutchie
https://growjo.com/company/Ascend_Wellness_Holdings
https://growjo.com/company/Hiku_Brands
https://growjo.com/company/C3_Industries
https://growjo.com/company/Jane_Technologies
https://growjo.com/company/Headset
https://growjo.com/company/Jushi_Holdings
https://growjo.com/company/FLOWER_CO.
https://growjo.com/company/Columbia_Care
https://growjo.com/company/Cannabis_Control_Commission
https://growjo.com/company/FIGR
https://growjo.com/company/Leafly
https://growjo.com/company/Hound_Labs
https://growjo.com/company/Leaf_Trade
https://growjo.com/company/Wurk
https://growjo.com/company/Sundial_Cannabis
https://growjo.com/company/BEYOND_%2F_HELLO
https://growjo.com/company/PharmaCann
https://growjo.com/company/LeafLink
https://growjo.com/company/Connected_Cannabis_Co.
https://growjo.com/company/NATURE'S_MEDICINES
https://growjo.com/company/Althea_Group
https://growjo.com/company/CURE_Pharmaceutical
https://growjo.com/company/urban-gro
https://growjo.com/company/NABIS
None
https://growjo.com/company/Medisun
https://growjo.com/company/Mammoth_Distribution
https://growjo.com/company/Dosecann_Cannabis_Solutions
https://growjo.com/company/Vireo_Health
https://growjo.com/company/Dama_Financial
https://growjo.com/company/Caliber
https://growjo.com/company/springbig
https://growjo.com/company/Westleaf
https://growjo.com/company/INSA
https://growjo.com/company/Pure_Sunfarms
https://growjo.com/company/Sensi_Media_Group
https://growjo.com/company/Verano_Holdings
https://growjo.com/company/TILT_Holdings
https://growjo.com/company/Bloom_Medicinals
https://growjo.com/company/Planet_13_Holdings
https://growjo.com/company/Liberty_Health_Sciences
https://growjo.com/company/Calyx_Peak_Companies
https://growjo.com/company/Vangst
https://growjo.com/company/Fire_&_Flower
https://growjo.com/company/Revolution_Enterprises
https://growjo.com/company/4Front_Ventures
https://growjo.com/company/Calyx_Containers
https://growjo.com/company/GreenTech_Industries
https://growjo.com/company/BZAM_Cannabis
https://growjo.com/company/Cova_Software
None
https://growjo.com/company/Up_Cannabis
https://growjo.com/company/Cann_Group
https://growjo.com/company/Holistic_Industries
https://growjo.com/company/Treez
https://growjo.com/company/INDIVA
https://growjo.com/company/Kiva_Confections
https://growjo.com/company/MariMed
https://growjo.com/company/MCR_Labs
https://growjo.com/company/Vicente_Sederberg
https://growjo.com/company/Demetrix
https://growjo.com/company/365_Cannabis
https://growjo.com/company/LivWell_Enlightened_Health
https://growjo.com/company/High_Tide
https://growjo.com/company/The_Hawthorne_Gardening_Company
https://growjo.com/company/WYLD
https://growjo.com/company/VidaCann
https://growjo.com/company/Sira_Naturals
https://growjo.com/company/iAnthus
https://growjo.com/company/EastHORN_Clinical_Services
https://growjo.com/company/PharmaCielo
https://growjo.com/company/OCS_Ontario_Cannabis_Store
https://growjo.com/company/Hugh_Wood_Canada
https://growjo.com/company/Wana_Brands
https://growjo.com/company/Parallel
https://growjo.com/company/Weedmaps
None
https://growjo.com/company/Dark_Heart_Nursery
https://growjo.com/company/Stealth_Monitoring
https://growjo.com/company/dicentra
https://growjo.com/company/Sunday_Goods_&_The_Pharm
https://growjo.com/company/Phase_Zero_Design
https://growjo.com/company/Sava
https://growjo.com/company/Ceylon_Solutions
https://growjo.com/company/Green_Flower
https://growjo.com/company/Shryne_Group
https://growjo.com/company/MJ_Freeway
https://growjo.com/company/Theory_Wellness
https://growjo.com/company/HEXO_Corp
https://growjo.com/company/Lightshade
https://growjo.com/company/New_Frontier_Data
https://growjo.com/company/Mission_Dispensaries
https://growjo.com/company/FLUENT_Cannabis_Care
https://growjo.com/company/Superette
https://growjo.com/company/HdL_Companies
https://growjo.com/company/Helix_Technologies
https://growjo.com/company/Mary's_Medicinals
https://growjo.com/company/Indus_Holdings
https://growjo.com/company/Auxly
https://growjo.com/company/Good_Chemistry
https://growjo.com/company/Khiron_Life_Sciences_Corp
https://growjo.com/company/The_Apothecarium
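The occasional None in the output corresponds to a row whose lazyload-wrapper had not yet rendered an anchor when page_source was captured. A possible mitigation, sketched below and not guaranteed against the live site, is to scroll through the page in steps before parsing so the lazy-loaded cells enter the viewport and render:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get('https://growjo.com/industry/Cannabis')
time.sleep(2)

# scroll down in steps so each lazyload-wrapper enters the viewport and renders
height = driver.execute_script("return document.body.scrollHeight")
for y in range(0, height, 500):
    driver.execute_script(f"window.scrollTo(0, {y});")
    time.sleep(0.5)

soup = BeautifulSoup(driver.page_source, "html.parser")
for company in soup.select('table.jss31 tbody tr'):
    log = company.select_one('td[class="jss38 jss40 jss46"] div + a')
    print('https://growjo.com' + log.get('href') if log else None)
driver.quit()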
I get to a page where there are many rows of data per page.
My code reaches each row, and I am able to scrape the title of each row.
However, all the data after the title appears to use the same tag names (for example, the author and the program number).
Given this, how do I scrape all the data within each row?
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
baseURL = 'https://index.mirasmart.com/aan2022/'
for x in range(1, 3):
    driver.get(f'https://index.mirasmart.com/aan2022/SearchResults.php?pg={x}')
    time.sleep(3)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    eachRow = soup.find_all('div', class_='full search-result')
    for item in eachRow:
        title = item.find('h2').text
        print(title)
You could use the :-soup-contains CSS pseudo-class to target elements by their text:
for result in soup.select('.detail'):
    print('title: ', result.find_previous('h2').text)
    for item in ['Author:', 'Session Name:', 'Author Disclosures:', 'Topic:', 'Program Number:', 'Author Institution:']:
        try:
            print(item, result.select_one(f'.cell:-soup-contains("{item}")').find_next('span').text)
        except AttributeError:
            print(f'{item} not found')
    print()
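As a usage note, :-soup-contains requires a reasonably recent soupsieve (the library bs4 delegates CSS selecting to). If you want the fields as structured rows rather than printed text, a small sketch extending the same idea collects each result into a dict and builds a pandas DataFrame (the .detail and .cell classes and the field labels are the ones used above, so check them against the real page):

import pandas as pd

# `soup` is the BeautifulSoup object built inside the question's page loop
fields = ['Author:', 'Session Name:', 'Author Disclosures:', 'Topic:', 'Program Number:', 'Author Institution:']
rows = []
for result in soup.select('.detail'):
    row = {'Title': result.find_previous('h2').text}
    for item in fields:
        cell = result.select_one(f'.cell:-soup-contains("{item}")')
        # select_one returns None when the label is absent from this row
        row[item.rstrip(':')] = cell.find_next('span').text if cell else None
    rows.append(row)

df = pd.DataFrame(rows)
print(df)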
I'm trying to scrape information from this website: "http://vlg.film/".
I'm not only interested in the first 15 titles but in all of them. When I click the 'Show More' button a couple of times, the extra titles show up in the "inspect element" window, but the URL stays the same, i.e. "https://vlg.film/". Does anyone have any bright ideas? I am fairly new to this. Thanks!
import requests as re
from bs4 import BeautifulSoup as bs
url = ("https://vlg.film/")
page = re.get(url)
soup = bs(page.content, 'html.parser')
wrap = soup.find_all('div', class_="column column--20 column--main")
for det in wrap:
    link = det.a['href']
    print(link)
Looks like you can simply add the pagination parameter to the URL. The trick is knowing when you've reached the end. Playing around with it, it appears that once you reach the end, it repeats the first page. So all you need to do is keep appending the links to a list, and when you start to see a repeated link, stop.
import requests as re
from bs4 import BeautifulSoup as bs
next_page = True
page_num = 1
links = []

while next_page:
    url = "https://vlg.film/"
    payload = {'PAGEN_1': '%s' % page_num}
    page = re.get(url, params=payload)
    soup = bs(page.content, 'html.parser')
    wrap = soup.find_all('div', class_="column column--20 column--main")
    for det in wrap:
        link = det.a['href']
        if link in links:
            next_page = False
            break
        links.append(link)
    page_num += 1

for link in links:
    print(link)
Output:
/films/ainbo/
/films/boss-level/
/films/i-care-a-lot/
/films/fear-of-rain/
/films/extinct/
/films/reckoning/
/films/marksman/
/films/breaking-news-in-yuba-county/
/films/promising-young-woman/
/films/knuckledust/
/films/rifkins-festival/
/films/petit-pays/
/films/life-as-it-should-be/
/films/human-voice/
/films/come-away/
/films/jiu-jitsu/
/films/comeback-trail/
/films/cagefighter/
/films/kolskaya/
/films/golden-voices/
/films/bad-hair/
/films/dragon-rider/
/films/lucky/
/films/zalozhnik/
/films/findind-steve-mcqueen/
/films/black-water-abyss/
/films/bigfoot-family/
/films/alone/
/films/marionette/
/films/after-we-collided/
/films/copperfield/
/films/her-blue-sky/
/films/secret-garden/
/films/hour-of-lead/
/films/eve/
/films/happier-times-grump/
/films/palm-springs/
/films/unhinged/
/films/mermaid-in-paris/
/films/lassie/
/films/sunlit-night/
/films/hello-world/
/films/blood-machines/
/films/samsam/
/films/search-and-destroy/
/films/play/
/films/mortal/
/films/debt-collector-2/
/films/chosen-ones/
/films/inheritance/
/films/tailgate/
/films/silent-voice/
/films/roads-not-taken/
/films/jim-marshall/
/films/goya-murders/
/films/SUFD/
/films/pinocchio/
/films/swallow/
/films/come-as-you-are/
/films/kelly-gang/
/films/corpus-christi/
/films/gentlemen/
/films/vic-the-viking/
/films/perfect-nanny/
/films/farmageddon/
/films/close-to-the-horizon/
/films/disturbing-the-peace/
/films/trauma-center/
/films/benjamin/
/films/COURIER/
/films/aeronauts/
/films/la-belle-epoque/
/films/arctic-dogs/
/films/paradise-hills/
/films/ditya-pogody/
/films/selma-v-gorode-prizrakov/
/films/rainy-day-in-ny/
/films/ty-umeesh-khranit-sekrety/
/films/after-the-wedding/
/films/the-room/
/films/kuda-ty-propala-bernadett/
/films/uglydolls/
/films/smert-i-zhizn-dzhona-f-donovana/
/films/sinyaya-bezdna-2/
/films/just-a-gigolo/
/films/i-am-mother/
/films/city-hunter/
/films/lets-dance/
/films/five-feet-apart/
/films/after/
/films/100-things/
/films/greta/
/films/CORGI/
/films/destroyer/
/films/vice/
/films/ayka/
/films/van-gogh/
/films/serenity/
This is a pretty simple website to extract data from. Create a list of page URLs for however many pages you want to extract, then use a for loop to iterate over every page and extract the data.
import requests as re
from bs4 import BeautifulSoup as bs
urls = ["http://vlg.film/ajax/index_films.php?PAGEN_1={}".format(x) for x in range(1, 11)]

for url in urls:
    page = re.get(url)
    soup = bs(page.content, 'html.parser')
    wrap = soup.find_all('div', class_="column column--20 column--main")
    print(url)
    for det in wrap:
        link = det.a['href']
        print(link)
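If you don't know in advance how many pages the ajax endpoint serves, here is a sketch of a stopping condition, assuming an exhausted endpoint returns no film links or starts repeating earlier pages, as the other answer observed:

import requests as re
from bs4 import BeautifulSoup as bs

links = []
page_num = 1
while True:
    url = "http://vlg.film/ajax/index_films.php?PAGEN_1={}".format(page_num)
    soup = bs(re.get(url).content, 'html.parser')
    wrap = soup.find_all('div', class_="column column--20 column--main")
    new_links = [det.a['href'] for det in wrap]
    # stop when the endpoint returns nothing or starts repeating earlier pages
    if not new_links or new_links[0] in links:
        break
    links.extend(new_links)
    page_num += 1

for link in links:
    print(link)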
I have a program to download photos from various websites. Each URL is formed by appending codes to the end of the address; the codes come from a dataframe of 8,583 rows.
The sites use JavaScript, so I use Selenium to access the src of the photos, and I download them with urllib.request.urlretrieve.
Example of a photo site: http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/PB/150000608817
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time
import urllib.request, urllib.parse, urllib.error
# Root URL of the site that is accessed to fetch the photo link
url_raiz = 'http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/'
# Accesses the dataframe that has the "sequencial" type codes
candidatos = pd.read_excel('candidatos_2018.xlsx',sheet_name='Sheet1', converters={'sequencial': lambda x: str(x), 'cpf': lambda x: str(x),'numero_urna': lambda x: str(x)})
# Function that opens each page and takes the link from the photo
def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)
    browser.get(url)
    time.sleep(10)
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    browser.close()
    link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']
    return link

# Function that downloads the photo and saves it with the code name "cpf"
def baixa_foto(nome, url):
    urllib.request.urlretrieve(url, nome)

# Iteration over the dataframe
for num, row in candidatos.iterrows():
    cpf = (row['cpf']).strip()
    uf = (row['uf']).strip()
    print(cpf)
    print("-/-")
    sequencial = (row['sequencial']).strip()
    # Creates the full page address
    url = url_raiz + uf + '/' + sequencial
    link_foto = pegalink(url)
    baixa_foto(cpf, link_foto)
I'm looking for guidance on two things: a try/except pattern that waits for the page to load (I'm getting errors reading the src; after many hits, the site takes more than ten seconds to load), and a way to record all errors, in a file or dataframe, noting the "sequencial" code that failed so the program can continue.
Would anyone know how to do this? The guidelines below were very useful, but I was unable to move forward.
I put part of the data I use and the program in a folder, if you want to take a look: https://drive.google.com/drive/folders/1lAnODBgC5ZUDINzGWMcvXKTzU7tVZXsj?usp=sharing
Put your code within:

try:
    WebDriverWait(browser, 30).until(lambda driver: page_has_loaded())
    # here goes your code
except Exception:
    print("This is an unexpected condition!")

To wait for the page to load, define page_has_loaded as:

def page_has_loaded():
    page_state = browser.execute_script('return document.readyState;')
    return page_state == 'complete'

The 30 above is the timeout in seconds; adjust it as per your need.
Approach 2:

class wait_for_page_load(object):
    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        # poll until the old <html> element has been replaced by a new one
        WebDriverWait(self.browser, 30).until(lambda driver: self.page_has_loaded())
def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)
    browser.get(url)
    try:
        with wait_for_page_load(browser):
            html = browser.page_source
            soup = BeautifulSoup(html, "html.parser")
            link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']
    except Exception:
        print("This is an unexpected condition!")
        print("Erro em:", url)
        link = "Erro"
    finally:
        browser.close()
    return link
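To cover the second request, recording every "sequencial" code that failed so the run can continue, one possible approach (a sketch; erros.csv is a hypothetical output file, and the column names follow the question's dataframe) is to collect failures in a list inside the main loop and dump them at the end:

import pandas as pd

erros = []
for num, row in candidatos.iterrows():
    cpf = row['cpf'].strip()
    uf = row['uf'].strip()
    sequencial = row['sequencial'].strip()
    url = url_raiz + uf + '/' + sequencial
    link_foto = pegalink(url)
    if link_foto == "Erro":
        # remember the failing code and keep going
        erros.append({'sequencial': sequencial, 'cpf': cpf, 'url': url})
        continue
    baixa_foto(cpf, link_foto)

# write all failures to a CSV so the run can be audited or retried later
pd.DataFrame(erros).to_csv('erros.csv', index=False)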