Splash not rendering a webpage completely - scrapy

I am trying to use scrapy + splash to scrape this site https://www.teammitsubishihartford.com/new-inventory/index.htm?compositeType=new, but I am unable to extract any data from it. When I render the page through the Splash API in a browser, I can see that the site is not fully loaded (Splash returns a screenshot of a partially loaded page). How can I render the site completely?

@Vinu Abraham, if your requirement is not specific to scrapy + splash, you can use Selenium. This issue occurs when we try to scrape a dynamic site.
Below is a code snippet for reference:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# URL of the page we want to scrape
url = 'https://www.*******/drugs-all-medicines'

driver = webdriver.Chrome('./chromedriver')
driver.get(url)
time.sleep(5)  # wait for the JavaScript-rendered content to load

# Hand the fully rendered HTML to BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find('div', {'class': 'style__container___1i8GI'})
Also let me know if you get any solution for the same using scrapy.
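For the original scrapy + splash route, the usual fix is to tell Splash to wait (or run a Lua script) before returning the page, so the JavaScript has time to render the inventory. Below is a minimal sketch, assuming the scrapy-splash plugin is installed and Splash is running on localhost:8050; the wait time and the commented request are illustrative:

```python
# Lua script that makes Splash wait before returning the HTML.
lua_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(5)  -- give the page time to finish rendering
    return {html = splash:html()}
end
"""

# Settings scrapy-splash expects in settings.py (per the scrapy-splash README).
SPLASH_SETTINGS = {
    "SPLASH_URL": "http://localhost:8050",
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    },
    "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
}

# In the spider, the request would then look like:
#   yield SplashRequest(url, self.parse,
#                       endpoint="execute",
#                       args={"lua_source": lua_script, "timeout": 90})
```

If a fixed wait is not enough, the Lua script can instead poll for a specific element before returning.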

Related

Scraping data from CrowdTangle using API return expired image links

I wanted to download images from the CrowdTangle Dashboard. I wrote code to fetch data using its API. However, historical posts scraped using the API return expired media links. While downloading the images, I got an "URL expired" error. How can I generate new links?
After talking with people, I figured out that I needed to scroll in the CrowdTangle dashboard to generate new image links. However, scrolling manually through thousands of posts will be a tedious task. Hence I decided to code a bot that scrolls. This solved my problem and I was able to generate new links.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

link = {insert_link}  # dashboard URL copied from the browser
browser.get(link)
browser.maximize_window()

# Log in to CrowdTangle via Facebook
fb_button = browser.find_element(by=By.LINK_TEXT, value="click here.")
fb_button.click()
time.sleep(7)
phone = browser.find_element(by=By.ID, value="email")
password = browser.find_element(by=By.ID, value="pass")
submit = browser.find_element(by=By.ID, value="loginbutton")
phone.send_keys({phone number})
password.send_keys({password})
submit.click()
time.sleep(6)

# Scroll the posts container repeatedly so fresh image links are generated
element = browser.find_element(by=By.XPATH, value="/html/body/div[1]/div/div/div[3]/div")
while True:
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", element)
    time.sleep(3)
Go to the CrowdTangle dashboard, enter your filters, and run the query. Copy the link from the browser into the code. I would recommend running the scroll bot for each month separately. Sometimes more posts won't load; this is an issue with CrowdTangle. Just close the browser and move on to the next month.

How can I get the link to YouTube Channel from Video Page?

I've been trying to get the link to a YouTube channel from its video page. However, I couldn't find the element of the link, even though the Inspector makes it obvious that the link is right there, as shown in the following picture.
With the selector 'a.yt-simple-endpoint.style-scope.yt-formatted-string', I tried to get the link through the following code.
! pip install selenium
! pip install beautifulsoup4
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(r'D:\chromedrive\chromedriver.exe')
driver.get('https://www.youtube.com/watch?v=P6Cc2R2jK6s')
soup = BeautifulSoup(driver.page_source, 'lxml')
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
for link in links:
    print(link.get("href"))  # bs4 tags use .get(), not Selenium's get_attribute()
However, no matter whether I used links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string') or links = soup.find('a', class_='yt-simple-endpoint style-scope ytd-video-owner-renderer'), it did not print anything. Can someone please help me solve this?
Instead of this:
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
In Selenium I would do:
links = driver.find_elements_by_css_selector('a.yt-simple-endpoint.style-scope.yt-formatted-string')
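The distinction matters because get_attribute() is a Selenium WebElement method; BeautifulSoup tags expose their attributes via .get() or dictionary-style indexing instead. A small self-contained illustration (the HTML snippet below is made up to mimic the channel link's classes):

```python
from bs4 import BeautifulSoup

# Made-up markup with the same classes as the real channel link
html = '''
<a class="yt-simple-endpoint style-scope yt-formatted-string"
   href="/channel/UC123">Some Channel</a>
'''

soup = BeautifulSoup(html, 'html.parser')
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
for link in links:
    # bs4 tags use .get('href') (or link['href']), not get_attribute('href')
    print(link.get('href'))  # /channel/UC123
```

On the real video page the selector still comes back empty if the HTML is grabbed before YouTube's JavaScript finishes rendering, which is why the Selenium find_elements route (after the page loads) works where a bare page-source parse does not.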

running beautiful soup on a browser opened using selenium (geckodriver)

Currently, I am trying to scrape a website that generates its captcha as text in the page source. I want to automate filling the form on that website using Selenium. But every time I scrape the website, the captcha on the Selenium-driven page and the captcha on the scraped page are different. Can someone help me out with this?
website: https://webkiosk.jiit.ac.in
import requests
from bs4 import BeautifulSoup

results = requests.get("https://webkiosk.jiit.ac.in/")
soup = BeautifulSoup(results.text, 'html5lib')
links = soup.find_all("td")

# finding the captcha from link text
f = 0
for i in links:
    if i.text.find("Parents") != -1:
        f = 1
    if f and i.text != '*' and i.text != '' and len(i.text) == 5:
        p = i.text
        break
# captcha is now stored in variable p

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://webkiosk.jiit.ac.in/")
driver.find_element_by_name("MemberCode").send_keys('*****')
driver.find_element_by_name("DATE1").send_keys('******')
driver.find_element_by_name("Password101117").send_keys('#******')
driver.find_element_by_name("txtcap").send_keys(p)
#driver.find_element_by_name("BTNSubmit").click()
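The two captchas differ because requests.get() and the Selenium browser are two separate sessions, and the server issues each one its own captcha. The usual fix is to parse the captcha out of the page Selenium itself loaded (driver.page_source) rather than downloading the page a second time. Below is a hypothetical helper that reuses the question's td-scanning heuristic, demonstrated against a made-up HTML snippet standing in for driver.page_source:

```python
from bs4 import BeautifulSoup

def extract_captcha(html):
    """Scan <td> cells after the one containing 'Parents' for a
    5-character captcha string (same heuristic as in the question)."""
    soup = BeautifulSoup(html, 'html.parser')
    seen_parents = False
    for td in soup.find_all('td'):
        text = td.text.strip()
        if 'Parents' in text:
            seen_parents = True
            continue
        if seen_parents and text not in ('*', '') and len(text) == 5:
            return text
    return None

# Made-up markup standing in for driver.page_source
html = "<table><tr><td>Parents</td><td>*</td><td>AB12C</td></tr></table>"
print(extract_captcha(html))  # AB12C
```

In the real script the call would be p = extract_captcha(driver.page_source), made after driver.get(), so the captcha typed into the form is the one from the very page the browser is showing.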

BeautifulSoup: Why doesn't it find all iframes?

I'm pretty new to BeautifulSoup, and I'm trying to figure out why it doesn't work as expected.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285710")
bsObj = BeautifulSoup(html.read(), features="html.parser")
print(bsObj.find_all('iframe'))
I get a list of only 2 iframes.
However, when I open this page with a browser and type:
document.getElementsByTagName("iframe")
in dev-tools I get a list of 14 elements.
Can you please help me?
This is because that site dynamically adds more iframes once the page has loaded. Additionally, iframe content is loaded dynamically by the browser and won't be downloaded via urlopen either. You may need to use Selenium so the JavaScript can load the additional iframes, and then locate each iframe and download its content via its src URL.
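To see why urlopen falls short here: find_all only counts iframes present in the static markup, while anything injected later by scripts exists only inside a real browser's DOM. A tiny illustration with made-up markup:

```python
from bs4 import BeautifulSoup

# Static markup as urlopen would see it: two iframes in the raw source.
static_html = """
<html><body>
  <iframe src="/ads/a"></iframe>
  <iframe src="/player"></iframe>
  <script>
    // In a real browser this script would inject more iframes, e.g.
    // document.body.appendChild(document.createElement('iframe'));
    // but a plain HTML parser never executes it.
  </script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
print(len(soup.find_all("iframe")))  # 2 -- the script-injected ones never appear
```

With Selenium, calling driver.find_elements(By.TAG_NAME, "iframe") after the page finishes loading would instead reflect the full, post-JavaScript count that dev-tools shows.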

Can beautiful soup also hit webpage events?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. I use it to extract webpage data, but I couldn't find any way to click the buttons or anchor tags that, in my case, drive the page navigation. Do I have to use some other tool for this, or does Beautiful Soup have a capability I'm not aware of?
Please advise me!
To answer your tags/comment, yes, you can use them together (Selenium and BeautifulSoup), and no, you can't directly use BeautifulSoup to execute events (clicking etc.). Although I myself haven't ever used them together in the same situation, a hypothetical situation could involve using Selenium to navigate to a target page via a certain path (i.e. click() these options and then click() the button to the next page), and then using BeautifulSoup to read the driver.page_source (where driver is the Selenium driver you created to 'drive' the browser). Since driver.page_source is the HTML of the page, you can use BeautifulSoup as you are used to, parsing out whatever information you need.
Simple example:
from bs4 import BeautifulSoup
from selenium import webdriver

# Create your driver
driver = webdriver.Firefox()
# Get a page
driver.get('http://news.ycombinator.com')
# Feed the source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)  # <title>Hacker News</title>
The main idea is that anytime you need to read the source of a page, you can pass driver.page_source to BeautifulSoup in order to read whatever you want.