Scrapy shell returns an empty list even though the XPath is correct in Chrome. Why? - scrapy

Executed in the Scrapy shell:
url = "https://www.daraz.com.np/smartphones/?spm=a2a0e.11779170.cate_1.1.287d2d2b2cP9ar"
fetch(url)
r = scrapy.Request(url=url)
fetch(r)
response.xpath("//div[@class='ant-col-20 ant-col-push-4 c1z9Ut']/div[@class='c1_t2i']/div[@class='c2prKC']/div/div/div/div[@class='c16H9d']/a/text()").getall()
NOTE: there is no tbody tag in the XPath.
Why does it output an empty list in Scrapy even though it matches 40 text nodes in Chrome?

It's because the website is heavily JavaScript-oriented, meaning the content is loaded dynamically: the page invokes extra HTTP requests as it loads, and the content is not hard-coded into the HTML. So when you use scrapy shell, Scrapy only sees the initial HTML, not the elements Chrome shows you after the JavaScript has run.
A couple of suggestions:
Re-engineer the HTTP requests. The JavaScript invokes HTTP requests, so if you can mimic those requests you can get the data you want. You will need Chrome dev tools or similar to see how the requests are made. This is the cleanest and most concise way to get the data; all the other options will slow the spider down and are more brittle. See the sketch below.
Scrapy-splash - this prerenders the DOM of the page and gives you access to the HTML you're after.
Scrapy-selenium - a downloader middleware that handles requests with Selenium. It doesn't expose the full functionality of the selenium package, but it can render the DOM and get you the data you need.
Embed Selenium in the Scrapy spider itself. This is the worst choice and should only be used as a last resort.
Please see the Scrapy docs on dynamic content for a bit more detail.
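For the first option, here is a minimal sketch of what re-engineering the request could look like for this page, assuming the listing exposes a JSON endpoint. The ajax=true parameter and the JSON key names (mods, listItems) are assumptions; confirm the real endpoint and response shape in the Network tab (filter by XHR) before relying on them.

import json
import scrapy

class SmartphonesSpider(scrapy.Spider):
    name = "smartphones"
    # Hypothetical JSON endpoint; find the real one in Chrome dev tools.
    start_urls = ["https://www.daraz.com.np/smartphones/?ajax=true&page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        # The key path is an assumption; print the JSON once and adjust.
        for item in data.get("mods", {}).get("listItems", []):
            yield {"name": item.get("name"), "price": item.get("price")}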

Related

How to access all text from a website, including the a tag?

I'm trying to extract all the article text from the following site:
https://www.phonearena.com/reviews/Samsung-Galaxy-S9-Plus-Review_id4494
I tried findAll(text=True), but it extracts a lot of useless information.
So I tried findAll(text=True, recursive=False), but it ignores the text inside certain tags like <a>. What's the most effective way of extracting the text in this case?
The website appears to be JavaScript-rendered: it loads the body content after requests has already retrieved the HTTP response, so you need to simulate a real browser page load. With the Python module Selenium WebDriver this is possible.
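A minimal sketch of that approach, assuming Chrome with a matching chromedriver on the PATH; the article container selector is a guess, so inspect the page for the real one.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.phonearena.com/reviews/Samsung-Galaxy-S9-Plus-Review_id4494")
    # .text returns the element's text and that of all its children, <a> tags included
    print(driver.find_element(By.CSS_SELECTOR, "article").text)
finally:
    driver.quit()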

Scrapy does not return results

I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However, my Scrapy code cannot find the product links listed on this website. Could anyone tell me why? The XPath I'm using is //a[@class="product-card--link"]/@href
Is this because of JS? Suspecting so, I tried Scrapy-Splash but still cannot find the product links. Could someone please help!
Thank you!
The items are generated via AJAX requests. When you connect to the page, a JavaScript script is executed that makes extra HTTP requests to retrieve some JSON data. However, Scrapy does not execute any JavaScript, so you need to find and call those AJAX requests manually.
See this related issue for how to inspect network traffic and solve such cases: Can scrapy be used to scrape dynamic content from websites that are using AJAX?
In this particular case, you can see that the first XHR request being made returns a huge JSON file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see, the URL takes some arguments. Most importantly, it takes cid, which stands for category id; the other arguments are mostly for calculating shipping prices, so if you don't care about those, this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
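A minimal sketch of a spider that calls the endpoint directly; the key path into the product list is an assumption, so print the JSON once and adjust it to the real structure.

import json
import scrapy

class ProductSearchSpider(scrapy.Spider):
    name = "productsearch"
    start_urls = [
        "http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # Hypothetical key path; verify against the actual response.
        category = data.get("productCategoryFacetedSearch", {}).get("productCategory", {})
        for product in category.get("childProducts", []):
            yield {"name": product.get("styleName")}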
An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
This can be a bit easier to implement, and your XPath expression should work fine with Splash, but the scraper will be slower since it has to render each page.
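For the Splash route, a sketch assuming a Splash instance on localhost:8050 and the scrapy-splash downloader middlewares enabled in settings.py:

import scrapy
from scrapy_splash import SplashRequest

class ProductLinksSpider(scrapy.Spider):
    name = "productlinks"

    def start_requests(self):
        yield SplashRequest(
            "http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true",
            callback=self.parse,
            args={"wait": 2},  # give the page time to run its JavaScript
        )

    def parse(self, response):
        for href in response.xpath('//a[@class="product-card--link"]/@href').getall():
            yield {"url": href}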

Script to check an entire website and figure out if any pages are taking more time to load

Can we have a script that crawls through an entire website to figure out if any pages are taking more time to load (some pages under a particular category were taking more time to load), using Selenium WebDriver or JMeter?
For JMeter you can use the HTML Link Parser pre-processor for this purpose. From the documentation:
Spidering Example
Consider a simple example: let's say you wanted JMeter to "spider" through your site, hitting link after link parsed from the HTML returned from your server (this is not actually the most useful thing to do, but it serves as a good example). You would create a Simple Controller, and add the "HTML Link Parser" to it. Then, create an HTTP Request, and set the domain to ".*", and the path likewise. This will cause your test sample to match with any link found on the returned pages. If you wanted to restrict the spidering to a particular domain, then change the domain value to the one you want. Then, only links to that domain will be followed.
More information on the above approach and a couple more options: How to Spider a Site with JMeter - A Tutorial
Remember that JMeter is not a browser, hence it doesn't execute JavaScript, so your results may not be precise enough: JMeter doesn't measure the time required to actually render the page.
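For the Selenium WebDriver side, a minimal sketch that times each page through the browser's Navigation Timing API; the URL list is a placeholder for whatever pages you crawl.

from selenium import webdriver

driver = webdriver.Chrome()
try:
    for url in ["https://example.com/", "https://example.com/slow-category"]:
        driver.get(url)
        # loadEventEnd - navigationStart = full page load time in milliseconds
        load_ms = driver.execute_script(
            "var t = window.performance.timing;"
            "return t.loadEventEnd - t.navigationStart;"
        )
        print(url, load_ms, "ms")
finally:
    driver.quit()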

Scrape a part of a website and notify on change

The website of my university unfortunately does not provide feeds, but they keep publishing information there that is important for me (deadlines, exam dates, etc.) as links to PDFs in a certain section of the site.
How can I regularly scrape that section of the site and be notified (Growl, mail, or something similar) of changes?
Normally I would use wget to mirror it, but how do I extract only parts of the website? Is there a CLI tool that can extract the XHTML via XPath or similar?
Try this:
wget --spider --server-response http://example.com
This will print the headers, which might contain a Content-Length header. If it changes, you can notify yourself.
Edit: if it changes, you can download the whole HTML file and grep for a PDF file or whatever else you want to look for (maybe for "<div id='news'>(.*?)</div>").
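If you'd rather extract just that section by XPath instead of diffing headers, here is a minimal Python sketch; the URL, XPath, and notification are placeholders, and you would run it from cron.

import hashlib
import requests
from lxml import html

URL = "https://example.edu/department/news"   # placeholder
XPATH = "//div[@id='news']"                   # placeholder
STATE = "last_hash.txt"

tree = html.fromstring(requests.get(URL).content)
section = tree.xpath(XPATH)
digest = hashlib.sha256(html.tostring(section[0])).hexdigest() if section else ""

try:
    previous = open(STATE).read()
except FileNotFoundError:
    previous = ""

if digest and digest != previous:
    print("Section changed")  # swap in a mail or growlnotify call here
    open(STATE, "w").write(digest)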
Mmm... you should take a look at QueryPath. QueryPath makes it easy to parse HTML. What if the HTML structure changes? What if you want specific elements of the page? QueryPath does the hard work for you. Do you like jQuery? QueryPath is like the jQuery of PHP.
See: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP
See: http://querypath.org/
You might be interested in looking at Pjscrape (disclaimer: this is my project). It's a web-scraping tool built on PhantomJS, giving you full jQuery access to the page in a headless WebKit browser context. It makes it very easy to pull semi-structured data from web pages via the command line, particularly if the page you're scraping has a consistent structure for new elements.
For example, you can pull all the course titles from this course catalog with the following code:
pjs.addScraper(
// the page you're scraping
'http://www.ischool.berkeley.edu/courses/catalog',
// selector for elements you want to pull text from
'.views-row .views-field-title'
);
// suppress STDOUT logging
pjs.config('log', 'none');
Running this from the command line gives you JSON to STDOUT by default:
~> phantomjs /path/to/pjscrape.js my_script.js
["W10. Introduction to Information","24. Freshman Seminar", ...]
So it would be pretty simple to run this script on a regular basis, capture the output in a file, and then alert you when the new output doesn't match the previous scrape. You can also write your own scraper functions, so there's a lot of flexibility for more complex scraping if a simple selector won't do the trick.

scraping dynamic content

I am working on a web scraping project. Does anybody have an idea about scraping dynamic content?
Dynamic content based on the query string is similar to static content, but dynamic content triggered by some event on a control within the same page is the point where I am stuck, because in that case the page URL remains the same.
I am using C#.
Thanks in advance
Your question is rather general.
I'm not sure what you mean by an event of a control, but as long as the browser generates an HTTP request you can catch it using tools like Firebug for Firefox or the tools built into Google Chrome, and see what is actually being sent to the server. So-called AJAX requests are nothing other than standard HTTP requests; it's just that the web page is not reloaded as a whole.
Based on that information and the page source, it is possible to figure out how to create a range of requests that would simulate user interaction with the dynamic elements on the page.
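A sketch of replaying such a captured request, shown in Python for brevity (the same idea ports directly to C# with HttpWebRequest or HttpClient); the endpoint, parameters, and header are placeholders for whatever the page actually sends.

import requests

response = requests.get(
    "https://example.com/api/products",              # URL seen in the network tools
    params={"page": 2, "category": "shoes"},         # query string the page's JS sends
    headers={"X-Requested-With": "XMLHttpRequest"},  # some servers check for this
)
print(response.json())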