I'm new to Scrapy and I'm not sure how to tell it to follow only links that are subpages of the current URL. For example, if you are here:
www.test.com/abc/def
then I want scrapy to follow:
www.test.com/abc/def/ghi
www.test.com/abc/def/jkl
www.test.com/abc/def/*
but not:
www.test.com/abc/*
www.test.com/*
or any other domain for that matter.
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
Write a spider deriving from BaseSpider. In the BaseSpider parse callback you return the requests you want to follow. Just make sure each request you generate is of the form you want, i.e. the URL extracted from the response is a child of the current URL (which is response.url). Then build a Request object for it and yield it.
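A minimal sketch of that idea, using the modern scrapy.Spider name (the successor of BaseSpider); the //a/@href selector and the www.test.com URLs are placeholders for illustration:

import scrapy

class SubpageSpider(scrapy.Spider):
    name = "subpages"
    start_urls = ["http://www.test.com/abc/def"]

    def parse(self, response):
        # Only follow URLs nested under the current page's URL (response.url).
        base = response.url.rstrip("/") + "/"
        for href in response.xpath("//a/@href").getall():
            url = response.urljoin(href)
            if url.startswith(base):
                yield scrapy.Request(url, callback=self.parse)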
I'm trying to scrape a site that uses JavaScript, but Scrapy keeps dropping the next-page URL as a duplicate and stopping the crawl. From my reading, it's my understanding that Scrapy checks for duplicates by hashing the resource each request points to, and by default it drops the fragment part of the URL. This behaviour can be changed via the keep_fragments parameter of the request_fingerprint function (see the excerpt from the Scrapy release notes below):
"A new keep_fragments parameter of scrapy.utils.request.request_fingerprint() allows to generate different fingerprints for requests with different fragments in their URL (issue 4104)"
My question is, how does one actually modify this parameter?
You can change the request fingerprinting logic of the duplicate request filter by writing your own duplicate request filter. This will also get easier in a future version of Scrapy.
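For example, a minimal sketch of such a filter, assuming a Scrapy version whose request_fingerprint() already accepts keep_fragments (per the release note quoted above) and whose RFPDupeFilter still exposes a request_fingerprint() method:

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class KeepFragmentsDupeFilter(RFPDupeFilter):
    # Treat URLs that differ only in their #fragment as distinct requests.
    def request_fingerprint(self, request):
        return request_fingerprint(request, keep_fragments=True)

You would then point DUPEFILTER_CLASS in settings.py at that class; the module path (e.g. myproject.dupefilters.KeepFragmentsDupeFilter) depends on where you put it.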
However, after you do that you will find out that Scrapy does not evaluate JavaScript code on its own. In the end, it is probably not in your best interest to change the duplicate request detection logic.
The Scrapy documentation covers how to extract that type of information with Scrapy.
Executed on Scrapy shell
url = "https://www.daraz.com.np/smartphones/?spm=a2a0e.11779170.cate_1.1.287d2d2b2cP9ar"
fetch(url)
r = scrapy.Request(url = url)
fetch(r)
response.xpath("//div[@class='ant-col-20 ant-col-push-4 c1z9Ut']/div[@class='c1_t2i']/div[@class='c2prKC']/div/div/div/div[@class='c16H9d']/a/text()").getall()
##NOTE##
There is no tbody tag in xpath
Why does it output an empty list in Scrapy even though there are 40 text nodes in Chrome?
It's because the website is heavily JavaScript orientated. That means content on the website is loaded dynamically: HTTP requests are invoked as the page loads, and the content is not hard-coded into the HTML. So when you use the Scrapy shell, that dynamically loaded content isn't in the HTML you get back.
Couple of suggestions
Try to re-engineer the HTTP requests. That is, the JavaScript invokes HTTP requests, and if you can mimic those requests you can get the data you want (see the sketch after this list). You will need to use Chrome dev tools or similar to see how the requests are made. This is the cleanest and most concise way to get the data. All other options will slow the spider down and are more brittle.
Scrapy-splash - This prerenders the DOM of the page and allows you to access the HTML you desire.
Scrapy-selenium - A downloader middleware that handles requests with selenium. It doesn't have the full functionality of the selenium package, but it can render the DOM and you can get the data you require.
Embed selenium into the Scrapy spider. It's the worst choice and should really only be used as a last resort.
Please see the docs on dynamic content for a bit more detail here
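A minimal sketch of the first suggestion, assuming you have already found the JSON endpoint in the Network tab; the endpoint URL and the JSON keys below are hypothetical placeholders, not the real Daraz API:

import json
import scrapy

class SmartphonesSpider(scrapy.Spider):
    name = "smartphones"
    # Hypothetical JSON endpoint copied from the browser's Network tab.
    start_urls = ["https://www.daraz.com.np/some/json/endpoint"]

    def parse(self, response):
        data = json.loads(response.text)
        # The "items" and "name" keys are guesses; inspect the real payload.
        for item in data.get("items", []):
            yield {"name": item.get("name")}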
I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However, my Scrapy code cannot find the product links listed on this website. Could anyone tell me why? The XPath I'm using is //a[@class="product-card--link"]/@href
Is this because of js? If so, I tried using scrapy splash but still cannot find the product links listed. Could someone please help!
Thank you!
The items are generated via AJAX requests. When you connect to the page, a JavaScript script is executed that makes some extra HTTP requests to retrieve some JSON data. However, Scrapy does not execute any JavaScript, so you need to find and call those AJAX requests manually.
See related issue: Can scrapy be used to scrape dynamic content from websites that are using AJAX?, to see how to inspect network traffic and solve such cases.
In this particular case you can see that the first XHR request being made returns a huge JSON file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see, the URL takes some arguments. The most important is cid, which stands for category id; the other arguments are mostly for calculating shipping prices, so if you don't care about those this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
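A minimal sketch of a spider that calls that endpoint directly; the structure of the returned JSON (the "products" and "name" keys below) is a guess and should be checked against the actual response:

import json
import scrapy

class BananaRepublicSpider(scrapy.Spider):
    name = "bananarepublic"
    start_urls = [
        "http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # Guessed keys; inspect the real JSON to see where the product data lives.
        for product in data.get("products", []):
            yield {"name": product.get("name")}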
An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
It can be a bit easier to implement, and your XPath expression should work fine with Splash, but the scraper will be slower as it has to render each page.
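For the Splash route, a minimal sketch assuming scrapy-splash is installed and SPLASH_URL plus its downloader middlewares are configured in settings.py; the XPath is the one from the question:

from scrapy import Spider
from scrapy_splash import SplashRequest

class BananaRepublicSplashSpider(Spider):
    name = "bananarepublic_splash"

    def start_requests(self):
        url = "http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true"
        # Give the page's JavaScript some time to run before the HTML is returned.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        for href in response.xpath('//a[@class="product-card--link"]/@href').getall():
            yield {"url": response.urljoin(href)}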
I am writing a scraper for a site. However, something weird is happening: it's not visiting the URL I supply to it. Rather, it visits the base URL of the website.
I searched on the internet and came to know that Scrapy ignores the part of the URL after #, and that I need to identify the AJAX request being sent and mimic it.
However, the problem is that the response to the AJAX request comes back as JSON, not HTML content. Would someone please help me with how to deal with it?
Following is the URL:
https://www.buildersshow.com/Search/Exhibitors.aspx#showID=11&state=160&tabname=name
If you investigate the AJAX requests that the page makes, you can identify the request you need to make and get your response; it should be JSON contained in the response body. To parse it and get the data you're interested in, use the json decoder/encoder module. Something like this:
import json

# response.body holds the raw JSON returned by the AJAX request
mydata = json.loads(response.body)

# drill into the decoded data using the keys the API actually returns
info = mydata['somekey']
subinfo = mydata['somekey']['subkey']
And so forth. Make sure to handle the JSON decoder the proper way; it would be best to read the official documentation first.
I'm trying to achieve urls in the form of http://localhost:9294/users instead of http://localhost:9294/#/users
This seems possible according to the documentation but I haven't been able to get this working for "bookmarkable" urls.
To clarify, browsing directly to http://localhost:9294/users gives a 404 "Not found: /users"
You can turn on HTML5 History support in Spine like this:
Spine.Route.setup(history: true)
Passing the history: true argument to Spine.Route.setup() enables the fancy URLs without the hash.
The documentation for this is actually buried a bit, but it's here (second to last section): http://spinejs.com/docs/routing
EDIT:
In order to have URLs that can be navigated to directly, you will have to handle this on the server side. For example, with Rails, you would have to build a way to take the path of the URL (in this case "/users") and pass it to Spine accordingly. Here is an excerpt from the Spine docs:
However, there are some things you need to be aware of when using the
History API. Firstly, every URL you send to navigate() needs to have a
real HTML representation. Although the browser won't request the new
URL at that point, it will be requested if the page is subsequently
reloaded. In other words you can't make up arbitrary URLs, like you
can with hash fragments; every URL passed to the API needs to exist.
One way of implementing this is with server side support.
When browsers request a URL (expecting a HTML response) you first make
sure on server-side that the endpoint exists and is valid. Then you
can just serve up the main application, which will read the URL,
invoking the appropriate routes. For example, let's say your user
navigates to http://example.com/users/1. On the server-side, you check
that the URL /users/1 is valid, and that the User record with an ID of
1 exists. Then you can go ahead and just serve up the JavaScript
application.
The caveat to this approach is that it doesn't give search engine
crawlers any real content. If you want your application to be
crawl-able, you'll have to detect crawler bot requests, and serve them
a 'parallel universe of content'. That is beyond the scope of this
documentation though.
It's definitely a good bit of effort to get this working properly, but it CAN be done. It's not possible to give you a specific answer without knowing the stack you're working with.
I used the following rewrites as explained in this article.
http://www.josscrowcroft.com/2012/code/htaccess-for-html5-history-pushstate-url-routing/