Running Beautiful Soup on a browser opened using Selenium (geckodriver)

Currently, I am trying to scrape a website that generates its captcha as text in the page source. I want to automate filling in the form on that website using Selenium. But every time I scrape the website, the captcha on the Selenium-driven page and the captcha in the scraped page are different. Can someone help me out with this?
Website: https://webkiosk.jiit.ac.in
import requests
from bs4 import BeautifulSoup

results = requests.get("https://webkiosk.jiit.ac.in/")
soup = BeautifulSoup(results.text, 'html5lib')
links = soup.find_all("td")
# finding the captcha from the cell text
f = 0
p = None
for i in links:
    # str.find() returns -1 (truthy) when the text is absent, so test membership instead
    if "Parents" in i.text:
        f = 1
    if f and i.text != '*' and i.text != '' and len(i.text) == 5:
        p = i.text
        break
# captcha is getting stored in variable p
from selenium import webdriver
driver=webdriver.Firefox()
driver.get("https://webkiosk.jiit.ac.in/")
driver.find_element_by_name("MemberCode").send_keys('*****')
driver.find_element_by_name("DATE1").send_keys('******')
driver.find_element_by_name("Password101117").send_keys('#******')
driver.find_element_by_name("txtcap").send_keys(p)
#driver.find_element_by_name("BTNSubmit").click()
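One possible way around the mismatch, sketched under assumptions: requests.get and the Selenium browser each load the page separately, so the server presumably generates a fresh captcha for each load. Reading the captcha from driver.page_source keeps it tied to the same page load the form is submitted from. find_captcha is a hypothetical helper that mirrors the loop above, and the sample HTML is made up for illustration; it is not the real page's markup.

```python
from bs4 import BeautifulSoup


def find_captcha(html):
    """Return the first 5-character cell text found after a 'Parents' cell.

    Hypothetical helper: the 'Parents' marker and the 5-character length
    come from the question's loop, not from inspecting the real page.
    """
    soup = BeautifulSoup(html, "html.parser")
    seen_marker = False
    for cell in soup.find_all("td"):
        text = cell.get_text(strip=True)
        if "Parents" in text:
            seen_marker = True
        if seen_marker and text != "*" and len(text) == 5:
            return text
    return None


# Made-up markup, only to show the helper's behaviour.
sample = "<table><tr><td>Parents</td><td>*</td><td>a1B2c</td></tr></table>"
print(find_captcha(sample))
```

With a live driver, the call would be p = find_captcha(driver.page_source) after driver.get(...), followed by the send_keys(p) call on the txtcap field as in the question.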

Related

Splash not rendering a webpage completely

I am trying to use Scrapy + Splash to scrape this site: https://www.teammitsubishihartford.com/new-inventory/index.htm?compositeType=new. But I am unable to extract any data from it. When I render the page through the Splash API (in the browser), I can see that the site is not fully loaded (Splash returns an image of a partially loaded page). How can I render the site completely?
@Vinu Abraham, if your requirement is not specific to Scrapy + Splash, you can use Selenium. This issue occurs when scraping a dynamic site.
Below is a code snippet for reference.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import re
from csv import writer
# url of the page we want to scrape
url = 'https://www.*******/drugs-all-medicines'
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find('div', {'class': 'style__container___1i8GI'})
Also, let me know if you find a solution for this using Scrapy.
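A fixed time.sleep(5) can be too short on a slow load or needlessly long on a fast one. Below is a minimal sketch of polling the rendered source until the container appears, assuming any Selenium-style object with a page_source attribute. wait_for_element and its parameters are hypothetical names; Selenium's own WebDriverWait serves the same purpose.

```python
import time

from bs4 import BeautifulSoup


def wait_for_element(driver, name, attrs, timeout=10.0, poll=0.5):
    """Poll driver.page_source until a matching tag appears, or time out.

    Hypothetical helper; returns the matching bs4 Tag, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        soup = BeautifulSoup(driver.page_source, "html.parser")
        match = soup.find(name, attrs)
        if match is not None:
            return match
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

For the snippet above, this might be called as wait_for_element(driver, 'div', {'class': 'style__container___1i8GI'}) in place of the sleep and the soup.find call.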

How to scrape a website where details are not on the inspect page?

I have this website that I need to scrape.
https://www.dawn.com
My goal is to scrape all news content with the keyword "Pakistan"
So far, I can only scrape the content if I have the URL. For example:
from newspaper import Article
import nltk
nltk.download('punkt')
url = 'https://www.dawn.com/news/1582311/who-chief-lauds-pakistan-for-suppressing-covid-19-while-keeping-economy-afloat'
article = Article(url)
article.download()
article.parse()
article.nlp()
article.summary
With this code, I would have to copy and paste every URL, which is too much to do manually. Do you have any idea how to do this?
A better approach is to go to https://www.dawn.com/pakistan, download the page (.html), scrape all the news content, and then filter it by keyword.
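The suggestion above could be sketched roughly as follows: collect the article links from a downloaded listing page and keep those whose link text mentions the keyword. keyword_links is a hypothetical helper, the '/news/' path filter is an assumption based on the example URL in the question, and the listing markup is made up.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def keyword_links(html, base_url, keyword):
    """Return absolute URLs of article links whose text mentions keyword.

    Assumption: article links contain '/news/' in their path, as in the
    example URL from the question; the real listing page may differ.
    """
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        if "/news/" not in url:
            continue
        if keyword.lower() in a.get_text().lower() and url not in urls:
            urls.append(url)
    return urls


# Made-up listing markup for illustration.
listing = """
<a href="/news/1/pakistan-economy-update">Pakistan economy update</a>
<a href="/news/2/cricket-results">Cricket results</a>
<a href="/authors/9">Pakistan desk</a>
"""
print(keyword_links(listing, "https://www.dawn.com/pakistan", "Pakistan"))
```

Each returned URL can then be passed to Article(url) as in the question, instead of pasting URLs by hand.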

Beautifulsoup not able to extract src tag

I want to automate downloading images from imgflip.
import requests
from bs4 import BeautifulSoup as bs
url="https://imgflip.com/memegenerator/Drake-Hotline-Bling"
page=requests.get(url)
parsed=bs(page.content,'html.parser')
res=parsed.find_all('img',class_="mm-img shadow")
print(res)
When I inspect the page, I see the src attribute on the image, but the response I get does not contain it. I have also tried setting src=True, but that doesn't work either. Thank you for helping.
"On the other hand, with a dynamic website the server might not send back any HTML at all. Instead, you’ll receive JavaScript code as a response. This will look completely different from what you saw when you inspected the page with your browser’s developer tools."
Source: Real Python.
In my case, too, the website sent back JavaScript code. The solution is to use requests-html or Selenium.
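A quick way to check which case applies is to look for src in the raw response before reaching for a browser. This is a minimal sketch; static_img_sources is a hypothetical helper and both HTML strings are made up.

```python
from bs4 import BeautifulSoup


def static_img_sources(html, css_class):
    """Return the src of every <img> with css_class that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img", class_=css_class, src=True)]


# Made-up responses: the first mimics a page whose src is filled in later by
# JavaScript, the second a page that ships the src in its HTML.
js_filled = '<img class="mm-img shadow">'
static = '<img class="mm-img shadow" src="/i/example.jpg">'
print(static_img_sources(js_filled, "mm-img shadow"))
print(static_img_sources(static, "mm-img shadow"))
```

If this returns an empty list for page.content, the attribute is presumably added by JavaScript, and a rendered source (for example Selenium's driver.page_source) would be needed instead.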

BeautifulSoup: Why doesn't it find all iframes?

I'm pretty new to BeautifulSoup, and I'm trying to figure out why it doesn't work as expected.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285710")
bsObj = BeautifulSoup(html.read(), features="html.parser")
print(bsObj.find_all('iframe'))
I get a list of only 2 iframes.
However, when I open this page with a browser and type:
document.getElementsByTagName("iframe")
in dev-tools I get a list of 14 elements.
Can you please help me?
This is because that site dynamically adds more iframes once the page is loaded. Additionally, iframe content is dynamically loaded by a browser, and won't be downloaded via urlopen either. You may need to use Selenium to allow JavaScript to load the additional iframes, and then may need to search for the iframe and download the content via the src url.
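Once the rendered HTML is available, the iframe URLs can be collected and resolved against the page URL so each one can be fetched separately. A minimal sketch: iframe_urls is a hypothetical helper, and the sample markup and example.com URLs are made up.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, base_url):
    """Return absolute URLs for every <iframe> that carries a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, f["src"]) for f in soup.find_all("iframe", src=True)]


# Made-up markup: a relative src, a protocol-relative src, and an empty iframe.
sample = (
    '<iframe src="/ads/frame.html"></iframe>'
    '<iframe src="//cdn.example.com/x"></iframe>'
    '<iframe></iframe>'
)
print(iframe_urls(sample, "https://example.com/news/article.aspx"))
```

Passing driver.page_source from Selenium instead of the urlopen response would include iframes that JavaScript added after the initial load.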

Can beautiful soup also hit webpage events?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. I will use it to extract webpage data, but I couldn't find any way to click the buttons and anchor links that are used for page navigation in my case. Do I have to use another tool for this, or does Beautiful Soup have a capability I am not aware of?
Please advise!
To answer your tags/comment, yes, you can use them together (Selenium and BeautifulSoup), and no, you can't directly use BeautifulSoup to execute events (clicking etc.). Although I myself haven't ever used them together in the same situation, a hypothetical situation could involve using Selenium to navigate to a target page via a certain path (i.e. click() these options and then click() the button to the next page), and then using BeautifulSoup to read the driver.page_source (where driver is the Selenium driver you created to 'drive' the browser). Since driver.page_source is the HTML of the page, you can use BeautifulSoup as you are used to, parsing out whatever information you need.
Simple example:
from bs4 import BeautifulSoup
from selenium import webdriver
# Create your driver
driver = webdriver.Firefox()
# Get a page
driver.get('http://news.ycombinator.com')
# Feed the source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title) # <title>Hacker News</title>
The main idea is that anytime you need to read the source of a page, you can pass driver.page_source to BeautifulSoup in order to read whatever you want.