I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However my scrapy code cannot find the product links listed on this website. Could anyone tell me why? The xpath Im using is //a[#class="product-card--link"]/#href
Is this because of js? If so, I tried using scrapy splash but still cannot find the product links listed. Could someone please help!
Thank you!
The items are generated via AJAX request. When you connect to a page a javascript script is executed that makes some extra http requests to retreive some json data. However scrapy does not execute any javascript so you need to manually find and call those AJAX requests.
See related issue: Can scrapy be used to scrape dynamic content from websites that are using AJAX?, to see how inspect network traffic and solve such cases.
In this particular case you can see first xhr requests that is being made returns a huge json file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see url takes some arguments, most importantly it takes cid which stands for category id and other arguments are mostly for calculating shipping prices, so if you don't care about those this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
Can be a bit easier to implement, your xpath expression should work fine with Splash. But the scraper will be slower as it has to render each page.
Related
Executed on Scrapy shell
url = "https://www.daraz.com.np/smartphones/?spm=a2a0e.11779170.cate_1.1.287d2d2b2cP9ar"
fetch(url)
r = scrapy.Request(url = url)
fetch(r)
response.xpath("//div[#class='ant-col-20 ant-col-push-4 c1z9Ut']/div[#class='c1_t2i']/div[#class='c2prKC']/div/div/div/div[#class='c16H9d']/a/text()").getall()
##NOTE##
There is no tbody tag in xpath
Why it outputs an empty list in scrapy thought it has 40 text in chrome?
It's because the website is heavily javascript orientated. That means content on the website is being loaded dynamically. It's invoking HTTP requests as the page loads and it's not hard coded into the HTML. So when you use scrapy shell it's not loading the HTML.
Couple of suggestions
Try to re-engineer the HTTP Requests. That is javascript envokes HTTP requests and therefore if you can mimic the requests can you get the data you want. YOu will need to use chrome dev tools or similar to see how the requests are made. This is the most clean and concise way to get data. All other options will slow the spider down and are more brittle.
Scrapy-splash - This prerenders the DOM of the page and allows you to access the HTML you desire.
Scrapy-selenium - A downloader middleware that handles requests with selenium. Not got the full function of selenium package but can render the DOM and you could get the data you require.
Embed selenium into the scrapy spider. It's the worst choice and really should be only used as last resort.
Please see the docs on dynamic content for a bit more detail here
Can we have a script which will crawl through the entire website to figure out if there are any pages which are taking more time to load (some pages under a particular category were taking more time to load) in selenium Webdriver or jmeter
For JMeter you can use HTML Link Parser configuration element for this purposes. From the documentation:
Spidering Example
Consider a simple example: let's say you wanted JMeter to "spider" through your site, hitting link after link parsed from the HTML returned from your server (this is not actually the most useful thing to do, but it serves as a good example). You would create a Simple Controller, and add the "HTML Link Parser" to it. Then, create an HTTP Request, and set the domain to ".*", and the path likewise. This will cause your test sample to match with any link found on the returned pages. If you wanted to restrict the spidering to a particular domain, then change the domain value to the one you want. Then, only links to that domain will be followed.
More information on above approach and a couple more options: How to Spider a Site with JMeter - A Tutorial
Remember that JMeter is not a browser hence it doesn't execute JavaScript so your results may not be precise enough as JMeter doesn't measure the time required to actually render the page.
I am trying to scrape a this page with scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response which I get is different than what I see in the browser. Browser response has the correct page, while scrapy response is:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
page. I have tried with urllib2 but still have the same issue. Any help is much appreciated.
I don't really understand the issue, but usually a different response for a browser and scrapy is caused by one these:
the server analyzes your User-Agent header, and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies, and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via scrapy like the browser does, but you forgot some form fields, or put wrong values
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the mentioned issues and will make it work.
I am looking for a way to utilize the FireShot API with JS to given a URL (or perhaps a list) use the FireShot API to take screenshot, upload to Imgur, then return the user the URLs or perhaps something like markdown to use quickly in forums.
Method 1: Open new window
I tried opening the URL in a new window, but found that I cant control that page with JS dues to cross domain problems. The same with iFrames.
Method 2: simple $.get()
A simple $.get() wont work because of the same cross domain issues I guess?
http://jsfiddle.net/t6aeq/
$.get($url.val(), function(data) {
console.log(data);
});
Via PHP "Proxy"
So I tried creating a simple PHP script that gets the HTML of the URL and returns it to my JS (using file_get_contents($url)). But some sites like Microsoft will detect that I am using some automated methods and give an error page of sorts. I also cant seem to find a way to use jQuery to query that returned HTML for link[rel=stylesheet], script, style and body to append to the head and a div respectively. I posted abt that on another question
A new Idea: Embed scripts on browser level
So I thought away of getting around these is using iMacros or GreeseMonkey or something to insert scripts into pages on the browser level instead? But any guidance or tips on how can I do that? Also, I'd prefer a pure JS/PHP method if available so users are not limited to using Browser plugin/scripts (tho I will be the only user for now)
It suddenly came to my mind that this may not work because the FireShot API key and Imgur is limited to the domain? Any solutions?
You might be able to inject the FireShot script using Greasemonkey. But, first use GM_xmlhttpRequest() to fetch an API key, for that page's domain, from the "Create FireShot API Key" page.
Note that GM_xmlhttpRequest() does not have the same cross-domain issues that $.get() has.
However, at this point you might be better off just writing your own Firefox add-on. Maybe start with FireShot's code for ideas. Also see the Screengrab add-on.
I have a web page that has 70000 characters. As you know when doing translation through Google API you can only send up to 5000 characters at a time. Which means I have to send data to Google 14 times (70000/5000) which takes a lot of time and then my page is displayed. Is there a way to speed up the process?
Thanks
have you tried caching the translation?
If you were using some AJAX framework (you don't mention what your web page is created with eg c#) then you can make it faster by making the API call via the AJAX framework.
It would look something like this (psuedo-code since we don't know what you are using):
Serve web page (almost instant)
Web page starts AJAX call:
Break text into chunks
Foreach chunk
Translate via API
Append to the page
This way the user will see the page immediately, and will also see the translation appear piece by piece as it is processsed instead of having to wait until the end.
My best bet would be to generate a page in one language, then ask google to translate it trough HTTP and display result as your own, to make it seamless for user. I believe that is what Google Chrome does when translating web pages.
Example of URL that makes Google translate the whole web page:
http://translate.google.com/translate?hl=en&sl=ru&tl=en&u=http%3A%2F%2Flinux.org.ru%2F
Of course, another option is to use Google Translate API and cache result if page content is not changing frequently.
go to the Javascript file in Google, it will lead you also to the CSS file, make a file or perhaps two, or you may be able to add CSS to your own, now make Javascript page on your web site in own directory. make a nip of code to update the Javascript code every so many seconds or minutes, and this will make the transition much faster, just by refreshing the content they give.. have fun :) also ultimately you can also send a request at the same time as the first one to translate after char 5000 which should be relatively easy to do.