BeautifulSoup not able to extract src attribute

I want to automate downloading images from imgflip.
import requests
from bs4 import BeautifulSoup as bs
url="https://imgflip.com/memegenerator/Drake-Hotline-Bling"
page=requests.get(url)
parsed=bs(page.content,'html.parser')
res=parsed.find_all('img',class_="mm-img shadow")
print(res)
When I inspect the page in the browser, I see the src attribute on the image, but the response I get does not contain it. I have also tried passing src=True, but that doesn't work either. Thank you for helping.

On the other hand, with a dynamic website the server might not send
back any HTML at all. Instead, you’ll receive JavaScript code as a
response. This will look completely different from what you saw when
you inspected the page with your browser’s developer tools.
Source: Real Python.
In my case too, the website sent back JavaScript code. The solution is to use requests-html or Selenium.
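For example, here is a minimal sketch using requests-html, which renders the page's JavaScript before parsing (the img.mm-img.shadow selector comes from the question; the site's markup may have changed since):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://imgflip.com/memegenerator/Drake-Hotline-Bling")
r.html.render()  # executes the JavaScript (downloads Chromium on first use)

# the same selector as in the question, now run against the rendered DOM
for img in r.html.find("img.mm-img.shadow"):
    print(img.attrs.get("src"))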

Related

BeautifulSoup: Why doesn't it find all iframes?

I'm pretty new to BeautifulSoup, and I'm trying to figure out why it doesn't work as expected.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285710")
bsObj = BeautifulSoup(html.read(), features="html.parser")
print(bsObj.find_all('iframe'))
I get a list of only 2 iframes.
However, when I open this page with a browser and type:
document.getElementsByTagName("iframe")
in dev-tools I get a list of 14 elements.
Can you please help me?
This is because that site dynamically adds more iframes once the page is loaded. Additionally, iframe content is loaded dynamically by the browser and won't be downloaded via urlopen either. You may need to use Selenium to allow JavaScript to load the additional iframes, and then search for each iframe and download its content via the src URL.
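A minimal Selenium sketch of that approach, assuming Firefox and geckodriver are installed, might look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.globes.co.il/news/article.aspx?did=1001285710")

# after the page's JavaScript runs, this should match what dev-tools reports
iframes = driver.find_elements(By.TAG_NAME, "iframe")
print(len(iframes))
for frame in iframes:
    print(frame.get_attribute("src"))  # each iframe's content can then be fetched from its src URL

driver.quit()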

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy; it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks (screenshot not included here).
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
This is the XHR URL:
https://www.bahiablancapropiedades.com/buscar/resultados/0
While scrolling, the page loads 8 records per request. So get the total record count via XPath and divide it by 8; that gives the number of XHR requests. I ran into the same issue and applied the logic below, which resolved it:
pagination_count = ...  # total record count, read from the page via XPath
value = int(pagination_count) // 8
for pagination_value in range(value):
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(pagination_value)
    # pass this url to your Scrapy function
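A runnable sketch of that pagination idea, assuming the total record count has already been read from the page (the spider name and the count of 120 are placeholders, not values from the site):
import scrapy

class PropiedadesSpider(scrapy.Spider):
    name = "propiedades"   # hypothetical spider name
    total_records = 120    # placeholder: read this from the page via XPath

    def start_requests(self):
        pages = self.total_records // 8  # the site returns 8 records per XHR request
        for page in range(pages):
            url = "https://www.bahiablancapropiedades.com/buscar/resultados/{}".format(page)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)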
It is not corrupted HTML; it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data, while others, like this one, will return the actual HTML to be added to the page.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
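Inside a spider, the same idea might look like this sketch (the spider skeleton and the parse_item callback are hypothetical; only the "view" key and the URL come from the answer above):
import json
import scrapy

class BahiaSpider(scrapy.Spider):
    name = "bahia"  # hypothetical name
    start_urls = ["https://www.bahiablancapropiedades.com/buscar/resultados/3"]

    def parse(self, response):
        json_data = json.loads(response.text)
        sel = scrapy.Selector(text=json_data["view"])  # "view" contains the escaped HTML
        for href in sel.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        pass  # extract the fields you need from each listing here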

Response has nothing in it

I have been following the Scrapy tutorial, trying to create a very simple web scraper for warframe.market. I have about a year of coding experience from school, but no Python experience. I simply want to get the price of an item from the website. I used the following to scrape the page:
scrapy shell "https://warframe.market/items/hydroid_prime_set"
then I inspected the web page to find the individual elements that I am trying to scrape. I used this command to try to view the results I wanted:
response.css("div.order-row.d-flex.col-12").extract()
This did not work, so I used view(response) to see what I had scraped, and my cmd just waits endlessly at this point.
Is HTTPS stopping me from scraping? Am I selecting the wrong css in my response? Is the webpage too big? Could someone please show me where I went wrong?
Thanks
The response isn't empty, but the page is rendered using JavaScript (you can verify this by inspecting response.body). For example, try this in the shell:
import json
data = json.loads(response.css('#application-state::text').extract_first())
for order in data.get('payload', {}).get('orders', []):
    print('"{}" price: {}'.format(order.get('user', {}).get('ingame_name'),
                                  order.get('platinum')))
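Outside the Scrapy shell, the same data can be pulled with plain requests and BeautifulSoup. This is a minimal sketch, assuming the page still embeds its state in an element with id application-state:
import json
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://warframe.market/items/hydroid_prime_set")
soup = BeautifulSoup(resp.text, "html.parser")
state = soup.find(id="application-state")  # the JSON blob embedded in the page
data = json.loads(state.string)
for order in data.get("payload", {}).get("orders", []):
    print('"{}" price: {}'.format(order.get("user", {}).get("ingame_name"),
                                  order.get("platinum")))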

retrieving ad urls using scrapy and selenium

I am trying to retrieve the ad URLs for this website:
http://www.appledaily.com
The ad URLs are loaded using JavaScript, so a standard CrawlSpider does not work. The ads also change as you refresh the page.
I found this question here, and what I gathered is that we need to first use Selenium to load the page in a browser and then use Scrapy to retrieve the URL. I have some experience with Scrapy but none at all with Selenium. Can anyone show me, or point me to a resource on, how to write a script to do that?
Thank you very much!
EDIT:
I tried the following but neither works in opening the ad banner. Can anyone help?
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://appledaily.com')
adBannerElement = driver.find_element_by_id('adHeaderTop')
adBannerElement.click()
2nd try:
adBannerElement =driver.find_element_by_css_selector("div[#id='adHeaderTop']")
adBannerElement.click()
The CSS selector should not contain the # symbol; it should be div[id='adHeaderTop'], or the shorter equivalent div#adHeaderTop.
Actually, on observing and analyzing the site and the event you are trying to trigger, I find that the noscript tag is what should interest you. Just get the HTML source of this node, parse the href attribute, and fire that URL.
It will be equivalent to clicking the banner.
<noscript>
"<a href="http://adclick.g.doubleclick.net/aclk%253Fsa%...</a>"
</noscript>
(This is not the complete node information, just inspect the banner in Chrome and you will find this tag).
EDIT: Here is a working snippet that gives you the URL from the tag mentioned above, without clicking on the ad banner.
WebDriver driver = new FirefoxDriver();
driver.navigate().to("http://www.appledaily.com");
WebElement objHidden = driver.findElement(By.cssSelector("div#adHeaderTop_ad_container noscript"));
if (objHidden != null) {
    String innerHTML = objHidden.getAttribute("innerHTML");
    String adURL = innerHTML.split("\"")[1];  // the URL opened when you click on the ad
    System.out.println("** " + adURL);
} else {
    System.out.println("<noscript> element not found...");
}
Though this is written in Java, the page source won't change.
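For readers following along in Python, here is a hedged translation of the same idea (the selector is copied from the Java snippet and may have changed on the live site):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.appledaily.com")

hidden = driver.find_element(By.CSS_SELECTOR, "div#adHeaderTop_ad_container noscript")
inner_html = hidden.get_attribute("innerHTML")
ad_url = inner_html.split('"')[1]  # the href is the first quoted substring
print("**", ad_url)  # the URL you would land on by clicking the ad

driver.quit()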

Can beautiful soup also hit webpage events?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. I am using it to extract webpage data, but I haven't found any way to click the buttons and anchor links that, in my case, drive the page navigation. Do I have to use another tool for this, or does Beautiful Soup have a capability I am not aware of?
Please advise!
To answer your tags/comment: yes, you can use Selenium and BeautifulSoup together, and no, you can't use BeautifulSoup directly to execute events (clicking, etc.). Although I haven't used them together in this exact situation myself, a hypothetical workflow could involve using Selenium to navigate to a target page via a certain path (i.e. click() some options and then click() the button to the next page), and then using BeautifulSoup to read driver.page_source (where driver is the Selenium driver you created to 'drive' the browser). Since driver.page_source is the HTML of the page, you can use BeautifulSoup as you are used to, parsing out whatever information you need.
Simple example:
from bs4 import BeautifulSoup
from selenium import webdriver
# Create your driver
driver = webdriver.Firefox()
# Get a page
driver.get('http://news.ycombinator.com')
# Feed the source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)  # <title>Hacker News</title>
The main idea is that anytime you need to read the source of a page, you can pass driver.page_source to BeautifulSoup in order to read whatever you want.