I'm trying to scrape a site that uses JavaScript, but Scrapy keeps dropping the next-page URL as a duplicate and stopping the crawl. From my reading, Scrapy detects duplicates by computing a fingerprint (a hash) of each request, and by default it drops the fragment of the URL when doing so. This behaviour can be changed through the keep_fragments parameter of scrapy.utils.request.request_fingerprint() (see the excerpt from the Scrapy release notes below).
"A new keep_fragments parameter of scrapy.utils.request.request_fingerprint() allows to generate different fingerprints for requests with different fragments in their URL (issue 4104)"
My question is, how does one actually modify this parameter?
Changing the request fingerprinting logic of the duplicate request filter is possible by writing your own duplicate request filter. It will also get easier in a future version of Scrapy.
However, after you do that you will find that Scrapy does not evaluate JavaScript code on its own, so keeping the fragment will not by itself get you different content: the fragment is never sent to the server. In the end, it is probably not in your best interest to change the duplicate request detection logic.
The Scrapy documentation covers how to extract dynamically loaded content with Scrapy.
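To make the duplicate-detection behaviour concrete, here is a minimal standard-library sketch of the idea. It is not Scrapy's implementation: the fingerprint function and the SeenFilter class are hypothetical names, and Scrapy's own fingerprint differs in detail. It only shows why two URLs that differ in their fragment alone collapse into one request unless fragments are kept.

```python
import hashlib
from urllib.parse import urldefrag


def fingerprint(method: str, url: str, keep_fragments: bool = False) -> str:
    """Hash the method and URL, dropping the fragment unless asked to keep it.

    Illustrative only: this is a simplified stand-in, not Scrapy's fingerprint.
    """
    if not keep_fragments:
        url = urldefrag(url).url  # strip everything from '#' onwards
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SeenFilter:
    """Hypothetical duplicate filter: remembers fingerprints it has seen."""

    def __init__(self, keep_fragments: bool = False):
        self.keep_fragments = keep_fragments
        self.seen = set()

    def is_duplicate(self, method: str, url: str) -> bool:
        fp = fingerprint(method, url, self.keep_fragments)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


default_filter = SeenFilter()
first = default_filter.is_duplicate("GET", "https://example.com/list#page=1")
second = default_filter.is_duplicate("GET", "https://example.com/list#page=2")
# 'second' is flagged as a duplicate because the fragment is ignored.
```

In Scrapy itself, the corresponding hook is a custom duplicate filter class selected through the DUPEFILTER_CLASS setting; the exact fingerprinting API has changed between releases, so check the documentation for the version you run.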
Related
I am a novice with APIs, and I am aware of the two main kinds of URL in a REST API: paths such as www.example.com/cars and query parameters such as www.example.com/cars?color=blue.
I just visited an e-commerce website and I am confused about the current path. I selected the category iphone-8 and got this URL: https://www.example.fr/iphone-8.html
On the same page, I filtered all phones with a price between 250 and 300 euros. This is the new URL: https://www.example.fr/iphone-8.html#price=250&price=300
Does this URL mean that the filter is only applied on the HTML because of the # and therefore there is no API call for filtering?
No, that doesn't follow.
The experiment to try would be to load the original page into your browser, turn on the developer tools used to watch the network traffic, and then perform your search.
What you may discover is that when you manipulate the filter controls on the web page, what's really happening under the covers is that JavaScript code is running, making calls to fetch data from some back-end endpoint, and then re-rendering the web page on the client. The fragment is being updated so that, if you were to bookmark the link, or copy it to another tab in your browser, the underlying JavaScript can reproduce "the same" results (by getting the search parameters from the fragment and repeating the search).
It should be possible to repeat those same calls directly from the browser itself. You won't necessarily get the HTML rendering, of course, but you'll probably be able to look at the filtered results in their own native representation (application/json, perhaps).
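As a rough illustration of what the page's script might be doing, the sketch below reads the filter parameters out of the fragment and turns them into a query string for a back-end call. The /api/search endpoint is purely hypothetical; the real one has to be read off the network tab.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def filters_from_fragment(url: str) -> dict:
    """Parse 'key=value&key=value' pairs stored in the URL fragment."""
    return parse_qs(urlsplit(url).fragment)


page_url = "https://www.example.fr/iphone-8.html#price=250&price=300"
filters = filters_from_fragment(page_url)

# Hypothetical back-end endpoint, assumed only for this example.
api_url = "https://www.example.fr/api/search?" + urlencode(filters, doseq=True)
```

Note that the parameters move from the fragment, which the server never sees, into the query string, which it does.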
What's the advantage of using the fragment rather than the query parameters for price?
Fragments are not part of the absolute-URI, whereas the query part is.
Which is to say, the query part is still part of the identifier of a primary resource, and is part of the request-line that is sent to the server.
But fragments are used to identify secondary resources; resources embedded within some primary resource.
Consider:
https://www.rfc-editor.org/rfc/rfc3986#section-3.5
This identifies a secondary resource (specifically, section-3.5) that is included within a primary resource (an HTML representation of RFC 3986). So we "fetch" the secondary resource by first loading the primary resource (the whole RFC) and then use the fragment identifier and HTML processing rules to discover the appropriate element in the document.
The fragment part is strictly a client side concern.
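A small sketch of that split, using only the Python standard library: split_for_request is a hypothetical helper that separates what a client would put on the request-line (path and query) from what stays on the client (the fragment).

```python
from urllib.parse import urlsplit


def split_for_request(url: str) -> tuple:
    """Return (host, request target, fragment) for a URL.

    The request target (path plus query) is what goes to the server;
    the fragment is kept by the client and never transmitted.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target, parts.fragment


rfc = split_for_request("https://www.rfc-editor.org/rfc/rfc3986#section-3.5")
cars = split_for_request("https://www.example.com/cars?color=blue")
```

For the RFC link the server is asked only for /rfc/rfc3986, while for the cars link the query color=blue is part of what the server receives.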
I am trying to find the best strategy to read links with names (e.g. href=/mypage#sectionA).
If I don't do anything special, this kind of link can get skipped if I've already visited that page. If I check whether my URL has a hash (#), I can parse the result before yielding a new request, but that works only if the link points to a name on the same page.
How should I manage this kind of link? Disable duplicate check and potentially parse a page many many times?
I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However, my Scrapy code cannot find the product links listed on this website. Could anyone tell me why? The XPath I'm using is //a[@class="product-card--link"]/@href
Is this because of JS? If so, I tried using Scrapy Splash but still cannot find the product links listed. Could someone please help!
Thank you!
The items are generated via an AJAX request. When you connect to the page, a JavaScript script is executed that makes some extra HTTP requests to retrieve some JSON data. However, Scrapy does not execute any JavaScript, so you need to manually find and call those AJAX requests.
See the related issue "Can scrapy be used to scrape dynamic content from websites that are using AJAX?" to see how to inspect network traffic and solve such cases.
In this particular case you can see that the first XHR request being made returns a huge JSON file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see, the URL takes some arguments. Most importantly it takes cid, which stands for category id; the other arguments are mostly for calculating shipping prices, so if you don't care about those this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
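A minimal sketch of calling that endpoint directly, using only the Python standard library. The helper names build_search_url and fetch_products are made up for this example, the endpoint may no longer respond as it did when this was written, and the shape of the returned JSON is not shown because it has to be inspected in the actual response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the XHR request above.
SEARCH_ENDPOINT = "http://bananarepublic.gap.com/resources/productSearch/v1/search"


def build_search_url(cid: int, **extra: str) -> str:
    """Build the search URL; cid is the category id, extras are optional."""
    params = {"cid": cid, **extra}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def fetch_products(cid: int, timeout: float = 10.0):
    """Fetch and decode the JSON payload for one category (needs network)."""
    with urlopen(build_search_url(cid), timeout=timeout) as response:
        return json.load(response)


url = build_search_url(1055063)
```

In a Scrapy spider the same URL would simply be the target of a normal request, with the response body parsed as JSON instead of queried with XPath.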
An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
This can be a bit easier to implement, and your XPath expression should work fine with Splash. However, the scraper will be slower, as it has to render each page.
Can we have a script, in Selenium WebDriver or JMeter, that crawls through the entire website to find any pages that take longer to load? (Some pages under a particular category were taking more time to load.)
For JMeter you can use the HTML Link Parser configuration element for this purpose. From the documentation:
Spidering Example
Consider a simple example: let's say you wanted JMeter to "spider" through your site, hitting link after link parsed from the HTML returned from your server (this is not actually the most useful thing to do, but it serves as a good example). You would create a Simple Controller, and add the "HTML Link Parser" to it. Then, create an HTTP Request, and set the domain to ".*", and the path likewise. This will cause your test sample to match with any link found on the returned pages. If you wanted to restrict the spidering to a particular domain, then change the domain value to the one you want. Then, only links to that domain will be followed.
More information on the above approach and a couple more options: How to Spider a Site with JMeter - A Tutorial
Remember that JMeter is not a browser and doesn't execute JavaScript, so your results may not be precise enough: JMeter doesn't measure the time required to actually render the page.
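Outside JMeter, the same spidering idea can be sketched in a few lines of Python with only the standard library. This is an illustration under assumptions rather than a replacement for the tools above: timed_crawl, LinkParser and http_fetch are hypothetical names, and, like JMeter, it measures only the time to fetch the HTML, not the time to render it.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url: str) -> str:
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def timed_crawl(start_url: str, fetch=http_fetch, max_pages: int = 50) -> dict:
    """Breadth-first crawl of one domain, returning {url: seconds to fetch}."""
    domain = urlsplit(start_url).netloc
    queue, seen, timings = deque([start_url]), {start_url}, {}
    while queue and len(timings) < max_pages:
        url = queue.popleft()
        started = time.perf_counter()
        html = fetch(url)
        timings[url] = time.perf_counter() - started
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and ignore fragments when deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return timings
```

Sorting the returned dictionary by value, for example with sorted(timings.items(), key=lambda item: item[1], reverse=True), lists the slowest pages first.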
I'm trying to achieve urls in the form of http://localhost:9294/users instead of http://localhost:9294/#/users
This seems possible according to the documentation but I haven't been able to get this working for "bookmarkable" urls.
To clarify, browsing directly to http://localhost:9294/users gives a 404 "Not found: /users"
You can turn on HTML5 History support in Spine like this:
Spine.Route.setup(history: true)
Passing the history: true argument to Spine.Route.setup() enables the fancy URLs without the hash.
The documentation for this is actually buried a bit, but it's here (second to last section): http://spinejs.com/docs/routing
EDIT:
In order to have urls that can be navigated to directly, you will have to do this "server" side. For example, with Rails, you would have to build a way to take the parameter of the url (in this case "/users"), and pass it to Spine accordingly. Here is an excerpt from the Spine docs:
However, there are some things you need to be aware of when using the
History API. Firstly, every URL you send to navigate() needs to have a
real HTML representation. Although the browser won't request the new
URL at that point, it will be requested if the page is subsequently
reloaded. In other words you can't make up arbitrary URLs, like you
can with hash fragments; every URL passed to the API needs to exist.
One way of implementing this is with server side support.
When browsers request a URL (expecting a HTML response) you first make
sure on server-side that the endpoint exists and is valid. Then you
can just serve up the main application, which will read the URL,
invoking the appropriate routes. For example, let's say your user
navigates to http://example.com/users/1. On the server-side, you check
that the URL /users/1 is valid, and that the User record with an ID of
1 exists. Then you can go ahead and just serve up the JavaScript
application.
The caveat to this approach is that it doesn't give search engine
crawlers any real content. If you want your application to be
crawl-able, you'll have to detect crawler bot requests, and serve them
a 'parallel universe of content'. That is beyond the scope of this
documentation though.
It's definitely a good bit of effort to get this working properly, but it CAN be done. It's not possible to give you a specific answer without knowing the stack you're working with.
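Since the stack is unknown, here is a sketch of the server-side idea assuming, purely for illustration, a Python standard-library server. The route patterns, the INDEX_HTML page and the application.js file name are all hypothetical; the point is only that every URL the client-side router can produce must return the application page instead of a 404.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical client-side routes that the server must also recognise.
APP_ROUTES = [re.compile(r"^/users/?$"), re.compile(r"^/users/\d+/?$")]

# Hypothetical application shell; the script reads the URL and routes itself.
INDEX_HTML = (
    b"<!doctype html><html><body>"
    b"<script src='/application.js'></script></body></html>"
)


def is_app_route(path: str) -> bool:
    """True if the path (ignoring any query string) belongs to the app."""
    path = path.split("?", 1)[0]
    return path == "/" or any(pattern.match(path) for pattern in APP_ROUTES)


class SpaHandler(BaseHTTPRequestHandler):
    """Serve the application shell for known routes, 404 for the rest.

    Serving static assets such as the script itself is left out of this sketch.
    """

    def do_GET(self):
        if is_app_route(self.path):
            status, body = 200, INDEX_HTML
        else:
            status, body = 404, b"Not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9294) -> None:
    """Start the server (blocks until interrupted)."""
    HTTPServer(("localhost", port), SpaHandler).serve_forever()
```

A fuller version would also check that the record behind a URL such as /users/1 exists before serving the page, as the documentation excerpt above suggests.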
I used the following rewrites as explained in this article.
http://www.josscrowcroft.com/2012/code/htaccess-for-html5-history-pushstate-url-routing/