how to get href value from web page using xpath - beautifulsoup

i am trying to get href values from a web page, executed without error but not getting href value. please any one help me on this
url = "https://journals.sagepub.com/action/doSearch?field1=Keyword&text1=cancer&field2=AllField&text2=&Ppub=&Ppub=&AfterYear=&BeforeYear=&earlycite=on&access="
red = requests.get(url)
page_source = red.text
soup = BeautifulSoup(page_source, 'html.parser')
elms = soup.select('div.ol.li[1].article.h2.span a')
for i in elms:
print(i.attrs['data-item-name'])

Your Css selector is wrong.Your css selector should be like div ol li article h2 span a
url = "https://journals.sagepub.com/action/doSearch?field1=Keyword&text1=cancer&field2=AllField&text2=&Ppub=&Ppub=&AfterYear=&BeforeYear=&earlycite=on&access="
red = requests.get(url)
page_source = red.text
soup = BeautifulSoup(page_source, 'html.parser')
elms = soup.select('div ol li article h2 span a')
for i in elms:
print(i['href'])
Output:
/doi/pdf/10.2182/cjot.2012.79.1.5
/doi/full/10.1177/1073274818775360
/doi/full/10.1177/0024363918811637
/doi/full/10.2217/whe.15.57
/doi/pdf/10.1177/1077558709335536
/doi/full/10.1177/1757975914537094
/doi/full/10.1177/1073274819846603
/doi/pdf/10.1191/0748233701th098oa
/doi/pdf/10.2190/PM.40.2.d
/doi/pdf/10.1177/070674371005501203
/doi/pdf/10.1177/030089161109700213
/doi/pdf/10.1177/1721727X1100900215
/doi/pdf/10.1177/1049732310387798
/doi/full/10.1177/2051415818755626
/doi/pdf/10.1136/jms.7.4.177
/doi/pdf/10.1177/1066896909333778
/doi/full/10.4137/CIN.S5460
/doi/full/10.4137/CMO.S603
/doi/full/10.1177/0194599814551718
/doi/full/10.4137/CIN.S13788

Related

Selenium: click next till the last page

I am creating a web scraping tool using BeautifulSoup and Selenium. I am scraping a community forum where I am able to scrap the first web page of a particular thread. Say, for example, for the following thread: https://www.dell.com/community/Optiplex-Desktops/dell-optiplex-7000MT-DDR5-Ram-campatibility/m-p/8224888#M61514
i can scrap only the first page. I want to scrap all of the pages (in this case 3) and display the content.
The following code scraps the first page:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException
url = "https://www.dell.com/community/Optiplex-Desktops/dell-optiplex-7000MT-DDR5-Ram-campatibility/m-p/8224888#M61514"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
date = '01-19-2023'
comments = []
comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
for comment in comments_body:
if date in comment.find('span',{'class':'local-date'}).text :
comments.append({
'Date': comment.find('span',{'class':'local-date'}).text.strip('\u200e'),
'Board': soup.find_all('li', {'class': 'lia-breadcrumb-node crumb'})[1].text.strip(),
'Sub-board':soup.find('a', {'class': 'lia-link-navigation crumb-board lia-breadcrumb-board lia-breadcrumb-forum'}).text,
'Title of Post': soup.find('div', {'class':'lia-message-subject'}).text.strip(),
'Main Message': soup.find('div', {'class':'lia-message-body'}).text.strip(),
'Post Comment': comment.find('div',{'class':'lia-message-body-content'}).text.strip(),
'Post Time' : comment.find('span',{'class':'local-time'}).text,
'Username': comment.find('a',{'class':'lia-user-name-link'}).text,
'URL' : str(url)
})
df1 = pd.DataFrame(comments)
print(df1)
I have tried the following:
next_page = driver.find_element("xpath","//li[#class='lia-link-navigation lia-js-data-pageNum-2 lia-custom-event']")
next_page.click ()
page2_url = driver.current_url
print(page2_url)
this is specific just for page 2.
However, i want this for all subsequent pages. And if there is only one page continue to execute next statement.
By using the above code i'm trying to get the URLs for the subsequent pages which i will add to list of urls that need to be scraped. Is there any alternative way to acheive this?
To scrape all the pages you can add a simple while 1 loop which is broken when the button Next Page disappear.
while 1:
print('current page:', soup.select_one('span[aria-current="page"]').text)
comments_section = ...
comments_body = ...
for comment in comments_body:
...
# next_btn is a list
next_btn = soup.select('a[aria-label="Next Page"]')
# if the list is not empty...
if next_btn:
url = next_btn[0]['href']
soup = BeautifulSoup(requests.get(url).text, "html.parser")
else:
break

Collecting links from a Webpage using Selenium-Load More Problem

I need to collect all links from a webpage as seen below, which also has a load more news button. I wrote my script, but my script gives only the links from the first page, as if it does not click on the load more news button. I updated some of Selenium attributes. I really don't know why I could not get all the links, clicking on load_more button.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
import json
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
url = "..."
base_url = "..."
driver.get(url)
outlinks = []
wait = WebDriverWait(driver, 90)
load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
num_links = 0
while True:
links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
num_links_new = len(links)
if num_links_new > num_links:
num_links = num_links_new
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
load_more_button = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.listing-load-more-btn[title="Load More News"]')))
if load_more_button.is_displayed():
load_more_button.click()
sleep(10)
else:
break
new_links = driver.find_elements(By.CSS_SELECTOR, 'a.text-truncate')
for link in new_links:
href = link.get_attribute('href')
full_url = base_url + href
enurl=full_url.replace("ar-ae", "en")
outlinks.append(enurl)
print(outlinks)
data = json.dumps(outlinks)
with open('outlinks.json', 'w') as f:
f.write(data)
driver.close()
Although you have tagged selenium, this is a much better way to handle it.
Whenever you click on the "load more" button, it sends a POST request to:
https://www.mofaic.gov.ae/api/features/News/NewsListPartialView
So, you can just get all the data from there directly using the requests/BeautifulSoup modules. There's no need for Selenium, and the process will be much faster!
import requests
from bs4 import BeautifulSoup
data = {
"CurrentPage": "1",
"CurrentRenderId": "{439EC71A-4231-45C8-B075-975BD41099A7}",
"CategoryID": "{f9048938-c577-4caa-b1d9-ae1b7a5f1b20}",
"PageSize": "6",
}
BASE_URL = "https://www.mofaic.gov.ae"
POST_URL = "https://www.mofaic.gov.ae/api/features/News/NewsListPartialView"
response = requests.post(
POST_URL,
data=data,
)
for page in range(
1, 10
): # <-- Increase this number to get more Articles - simulates the "load more" button.
data["CurrentPage"] = page
response = requests.post(
POST_URL,
data=data,
)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a.text-truncate"):
print(BASE_URL + link["href"])
Prints (truncated):
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-leaders
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-vatican
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/2/02-01-2023-uae-fm
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-cuba
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2022-uae-sudan
https://www.mofaic.gov.ae/ar-ae/mediahub/news/2023/1/1/01-01-2023-uae-israel

How to get data from IMDb using BeautifulSoup

I'm trying to get the names of top 250 IMDb movies using BeautifulSoup. The code does not execute properly and shows no errors.
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top"
response = requests.get(url)
rc = response.content
soup = BeautifulSoup(rc,"html.parser")
for i in soup.find_all("td",{"class:":"titleColumn"}):
print(i)
I'm expecting it show me all of the td tags with titleColumn classes but it is not working. Am I missing something? Thanks in advance!
Remove the : after the class:
{"class:":"titleColumn"}
to
{"class":"titleColumn"}
Example ++
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top"
response = requests.get(url)
rc = response.content
soup = BeautifulSoup(rc,"html.parser")
data = []
for i in soup.find_all("td",{"class":"titleColumn"}):
data.append({
'people':i.a['title'],
'title':i.a.get_text(),
'info':i.span.get_text()
})
data

Scraping a web page with a "more" button...with beautifulsoup

I'm trying to scrape information from this website: "http://vlg.film/"
I'm not only interested in the first 15 titles, but in all of them. When clicking on the 'Show More' button a couple of times, the extra titles show up in the "inspect element" window, but the url stays the same, i.e. "https://vlg.film/". Does anyone have a or some bright ideas? I am fairly new to this..Thanks
`
import requests as re
from bs4 import BeautifulSoup as bs
url = ("https://vlg.film/")
page = re.get(url)
soup = bs(page.content, 'html.parser')
wrap = soup.find_all('div', class_="column column--20 column--main")
for det in wrap:
link = det.a['href']
print(link)
`
Looks like you can simply add the pagination to the url. The trick is to know when you reached the end. Playing around with it, it appears once you reach the end, it repeats the first page. So all you need to do is keep appending the links into a list, and when you start to repeat a link, have it stop.
import requests as re
from bs4 import BeautifulSoup as bs
next_page = True
page_num = 1
links = []
while next_page == True:
url = ("https://vlg.film/")
payload = {'PAGEN_1': '%s' %page_num}
page = re.get(url, params=payload)
soup = bs(page.content, 'html.parser')
wrap = soup.find_all('div', class_="column column--20 column--main")
for det in wrap:
link = det.a['href']
if link in links:
next_page = False
break
links.append(link)
page_num += 1
for link in links:
print(link)
Output:
/films/ainbo/
/films/boss-level/
/films/i-care-a-lot/
/films/fear-of-rain/
/films/extinct/
/films/reckoning/
/films/marksman/
/films/breaking-news-in-yuba-county/
/films/promising-young-woman/
/films/knuckledust/
/films/rifkins-festival/
/films/petit-pays/
/films/life-as-it-should-be/
/films/human-voice/
/films/come-away/
/films/jiu-jitsu/
/films/comeback-trail/
/films/cagefighter/
/films/kolskaya/
/films/golden-voices/
/films/bad-hair/
/films/dragon-rider/
/films/lucky/
/films/zalozhnik/
/films/findind-steve-mcqueen/
/films/black-water-abyss/
/films/bigfoot-family/
/films/alone/
/films/marionette/
/films/after-we-collided/
/films/copperfield/
/films/her-blue-sky/
/films/secret-garden/
/films/hour-of-lead/
/films/eve/
/films/happier-times-grump/
/films/palm-springs/
/films/unhinged/
/films/mermaid-in-paris/
/films/lassie/
/films/sunlit-night/
/films/hello-world/
/films/blood-machines/
/films/samsam/
/films/search-and-destroy/
/films/play/
/films/mortal/
/films/debt-collector-2/
/films/chosen-ones/
/films/inheritance/
/films/tailgate/
/films/silent-voice/
/films/roads-not-taken/
/films/jim-marshall/
/films/goya-murders/
/films/SUFD/
/films/pinocchio/
/films/swallow/
/films/come-as-you-are/
/films/kelly-gang/
/films/corpus-christi/
/films/gentlemen/
/films/vic-the-viking/
/films/perfect-nanny/
/films/farmageddon/
/films/close-to-the-horizon/
/films/disturbing-the-peace/
/films/trauma-center/
/films/benjamin/
/films/COURIER/
/films/aeronauts/
/films/la-belle-epoque/
/films/arctic-dogs/
/films/paradise-hills/
/films/ditya-pogody/
/films/selma-v-gorode-prizrakov/
/films/rainy-day-in-ny/
/films/ty-umeesh-khranit-sekrety/
/films/after-the-wedding/
/films/the-room/
/films/kuda-ty-propala-bernadett/
/films/uglydolls/
/films/smert-i-zhizn-dzhona-f-donovana/
/films/sinyaya-bezdna-2/
/films/just-a-gigolo/
/films/i-am-mother/
/films/city-hunter/
/films/lets-dance/
/films/five-feet-apart/
/films/after/
/films/100-things/
/films/greta/
/films/CORGI/
/films/destroyer/
/films/vice/
/films/ayka/
/films/van-gogh/
/films/serenity/
This is pretty simple web site to extract data. Create a urls list of web page , how many page do you want to extract. Then use for loop to iterate all page extract the data.
import requests as re
from bs4 import BeautifulSoup as bs
urls = ["http://vlg.film/ajax/index_films.php?PAGEN_1={}".format(x) for x in range(1,11)]
for url in urls:
page = re.get(url)
soup = bs(page.content, 'html.parser')
wrap = soup.find_all('div', class_="column column--20 column--main")
print(url)
for det in wrap:
link = det.a['href']
print(link)

BeautifulSoup code Error

I cannot seem to find the problem in this code.
Help will be appreciated.
import requests
from bs4 import BeautifulSoup
url = 'http://nytimes.com'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html)
title = soup.find('span','articletitle').string
Code & Error Screenshot
The problem is http://nytimes.com does not have any articletitle span. To be safe, just check if soup.find('span','articletitle') is not None: before accessing it. Also, you don't need to access string property here. For example, the following would work fine.
import requests
from bs4 import BeautifulSoup
url = 'http://nytimes.com'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, 'html.parser')
if soup.find('div', 'suggestions') is not None:
title = soup.find('div','suggestions')
print(title)
Put your code inside try and catch & then print exception which is occurring. using the exception occurred you can rectify the problem.
Hi use a parser as your second argument for the get method,
Ex:-
page_content = BeautifulSoup(r.content, "html.parser")