beautifulsoup4 present in Anaconda3 package list but cannot use it

import beautifulsoup
import requests
pageurl = "https://learning.edx.org/course/course-v1:TUMx+iLabx+2T2020/block-v1:TUMx+iLabx+2T2020+type@sequential+block@d7110bd0bcf4448eb3b170be28f7dfe4/block-v1:TUMx+iLabx+2T2020+type@vertical+block@141d0b4db33649b7bbffd7c4ec8a465c"
r = requests.get(pageurl)
soup = beautifulsoup(r.content,"html5lib")
links = soup.findALL("a")
I have Anaconda3 installed on my machine. When I do pip list in cmd I can see beautifulsoup4 there along with the other packages, but when I import it in the Spyder IDE it throws ModuleNotFoundError. I have also tried pip installing bs4, but it doesn't work.

Make sure you're importing bs4 and not beautifulsoup4.

You have to use from bs4 import BeautifulSoup instead of import beautifulsoup, and soup = BeautifulSoup(r.content, "html5lib") instead of soup = beautifulsoup(r.content, "html5lib"):
from bs4 import BeautifulSoup
import requests
pageurl = "https://learning.edx.org/course/course-v1:TUMx+iLabx+2T2020/block-v1:TUMx+iLabx+2T2020+type@sequential+block@d7110bd0bcf4448eb3b170be28f7dfe4/block-v1:TUMx+iLabx+2T2020+type@vertical+block@141d0b4db33649b7bbffd7c4ec8a465c"
r = requests.get(pageurl)
soup = BeautifulSoup(r.content, "html5lib")
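Note also that the method for collecting links is find_all (older code also uses findAll); findALL will fail. Continuing from the corrected snippet above:

links = soup.find_all("a")
for link in links:
    print(link.get("href"))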

Related

How to scrape company names from inc5000?

I am trying to scrape all the company names from the inc5000 site ("https://www.inc.com/inc5000/2021"). The problem is that the company names are rendered with JavaScript. I have tried using both selenium and requests_html to render the site, but when I fetch the page source I still get the raw JavaScript. This is what I tried; I am new to web scraping, so it is possible that I am making some foolish mistake. Please guide me.
Here is my code.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.get("https://www.inc.com/inc5000/2021")
data=driver.page_source
print(data)
You could give it some time to render, or use Selenium's waits:
import time
driver.get('https://www.inc.com/inc5000/2021')
time.sleep(5)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')
for e in soup.select('.company'):
    print(e.text)
Why do you need BeautifulSoup at all? You could just use Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.inc.com/inc5000/2021")
companies = [e.text for e in driver.find_elements(By.CLASS_NAME, "company")]
This will only give you the elements currently in the viewport, though; you need to improve on that by scrolling, as in the sketch below.
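A minimal sketch of such a scroll-and-collect loop (assuming the page keeps rendering rows with the company class as you scroll; untested against the live site):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.inc.com/inc5000/2021")

seen = set()
last_count = -1
# Keep scrolling until a full pass adds no new company names
while len(seen) != last_count:
    last_count = len(seen)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page a moment to render the next rows
    for e in driver.find_elements(By.CLASS_NAME, "company"):
        seen.add(e.text)
print(len(seen), "companies collected")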

Unable to parse element Selenium

I am trying to parse the date element ("3 February 2022") on the following webpage. However, I am unable to find it, even when using selenium to load the page. Any suggestions as to what I am doing wrong? I am currently trying the following code:
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://www.londonstockexchange.com/news-article/SAIN/net-asset-value-s/15316710"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
soup = str(BeautifulSoup(driver.page_source, 'html.parser'))
date = re.findall(r"[0-9]{1,2}\s[A-Z][a-z]+\s[0-9]{4}", soup)
print(f'Taking {date[-1]} out of the possible dates: {date}')
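If the page builds its content with JavaScript, a fixed time.sleep(5) can fire before the date has rendered. A minimal sketch using an explicit wait instead (assuming the date does eventually appear in the rendered body text):

import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

url = "http://www.londonstockexchange.com/news-article/SAIN/net-asset-value-s/15316710"
pattern = re.compile(r"[0-9]{1,2}\s[A-Z][a-z]+\s[0-9]{4}")

driver = webdriver.Chrome()
driver.get(url)
# Wait up to 30 s for a date-like string to show up in the rendered text
WebDriverWait(driver, 30).until(
    lambda d: pattern.search(d.find_element(By.TAG_NAME, "body").text)
)
dates = pattern.findall(driver.find_element(By.TAG_NAME, "body").text)
print(dates)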

building a web scraping using the selenium command

I'm building a web scraper using Selenium. I was able to read the data from the table on the first and second pages; however, I cannot read the data on the following pages. Can anybody help me?
Below is the code I am using.
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//table[1]/tbody[1]/tr[@class='painel' and 1]/td[2]/a[1 and @href='javascript:pesquisar(2);']"} (Session info: headless chrome=86.0.4240.75)
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import os
import json
url = 'https://www.desaparecidos.pr.gov.br/desaparecidos/desaparecidos.do?action=iniciarProcesso&m=false'
option = Options()
driver = webdriver.Chrome('chromedriver', options=option)
driver.get(url)
time.sleep(5)
lista = driver.find_element_by_xpath('//*[@id="list_tabela"]/tbody')
lista_text = lista.text
print(lista_text)
driver.implicitly_wait(5)
driver.find_element_by_xpath("//table[1]/tbody[1]/tr[@class='painel' and 1]/td[2]/a[1 and @href='javascript:pesquisar(2);']").click()
time.sleep(5)
lista = driver.find_element_by_xpath('//*[@id="list_tabela"]/tbody')
lista_text = lista.text
print(lista_text)
driver.implicitly_wait(10)
driver.find_element_by_xpath("//table[1]/tbody[1]/tr[@class='painel' and 1]/td[2]/a[3 and @href='javascript:pesquisar(3);']").click()
time.sleep(10)
lista = driver.find_element_by_xpath('//*[@id="list_tabela"]/tbody')
lista_text = lista.text
print(lista_text)
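A more general version of the page-2/page-3 clicks would loop until no next pager link is found. A sketch that continues from the code above (assuming every pager link keeps an href of the form javascript:pesquisar(n); untested against the live site):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

page = 2
while True:
    try:
        # Wait for the link to the next page rather than a fixed sleep
        link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(
                (By.XPATH, f"//a[@href='javascript:pesquisar({page});']")
            )
        )
    except TimeoutException:
        break  # no link for the next page, so we have reached the end
    link.click()
    time.sleep(5)
    lista = driver.find_element_by_xpath('//*[@id="list_tabela"]/tbody')
    print(lista.text)
    page += 1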

How to add Proxy to Scrapy and Selenium Script

I would like to add a proxy to my script.
How do I have to do it? Do I have to use Selenium or Scrapy for it?
I think Scrapy makes the initial request, so it would make sense to use Scrapy for it. But what exactly do I have to do?
Can you recommend a proxy list that works reliably?
This is my current script:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import re
import csv
from time import sleep
class PostsSpider(Spider):
    name = 'posts'
    allowed_domains = ['xyz']
    start_urls = ('xyz',)

    def parse(self, response):
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
            self.driver.get(url)
            try:
                self.driver.find_element_by_id('currentTab').click()
                sleep(3)
                self.logger.info('Sleeping for 5 sec.')
                self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
                sleep(7)
                self.logger.info('Sleeping for 7 sec.')
            except NoSuchElementException:
                self.logger.info('Blog does not exist anymore')
            while True:
                try:
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    self.logger.info('No more tips')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    self.driver.close()
                    break
One simple way is to set the http_proxy and https_proxy environment variables.
You could set them in your environment before starting your script, or maybe add this at the beginning of your script:
import os
os.environ['http_proxy'] = 'http://my/proxy'
os.environ['https_proxy'] = 'http://my/proxy'
For lists of publicly available proxies, you will find plenty if you just search Google.
You should also read up on Scrapy's HttpProxyMiddleware to explore this properly; the ways of using such proxies are covered there as well.
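If you would rather not touch the whole environment, Scrapy's built-in HttpProxyMiddleware also honours a per-request meta key. A minimal sketch (both the URL and the proxy address are placeholders):

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxied'

    def start_requests(self):
        # HttpProxyMiddleware is enabled by default and reads request.meta['proxy']
        yield scrapy.Request(
            'http://example.com',  # placeholder URL
            meta={'proxy': 'http://my.proxy.host:8080'},  # placeholder proxy
        )

    def parse(self, response):
        self.logger.info('Fetched %s through the proxy', response.url)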

Not able to get hidden contents of a website

I am trying to scrape a website with the help of BeautifulSoup. I am not able to get the contents of the website, even though they appear in the source code when I inspect the site.
import requests
import urllib
from bs4 import BeautifulSoup
url1 = 'https://recruiting.ultipro.com/usg1006/JobBoard/dfc53730-57d1-3460-336f-ddafabd108f3/?q=&o=postedDateDesc'
response1 = requests.get(url1)
print(response1.text[:500])
html_soup1 = BeautifulSoup(response1.text, 'html.parser')
type(html_soup1)
all_info1 = html_soup1.find("div", {"data-bind": "foreach: opportunities"})
all_info1
all_automation1 = all_info1.find_all("div",{"data-automation":"opportunity"})
all_automation1
In the source code there are "job-title", "location", "description" and other details, but I am not able to see the same details in the HTML contents.
You could try something like this, or anything similar, to fetch the titles from that page:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://recruiting.ultipro.com/usg1006/JobBoard/dfc53730-57d1-3460-336f-ddafabd108f3/?q=&o=postedDateDesc')
time.sleep(3)  # let the browser load its content
soup = BeautifulSoup(driver.page_source,'lxml')
for item in soup.select("h3 .opportunity-link"):
    print(item.text)
driver.quit()
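If the fixed three-second sleep turns out to be flaky, an explicit wait on the same selector is a sturdier variant (a sketch under the same assumptions about the page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://recruiting.ultipro.com/usg1006/JobBoard/dfc53730-57d1-3460-336f-ddafabd108f3/?q=&o=postedDateDesc')
# Wait up to 15 s for the opportunity links instead of sleeping a fixed time
links = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'h3 .opportunity-link'))
)
for link in links:
    print(link.text)
driver.quit()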