Scrapy response different than browser response - beautifulsoup

I am trying to scrape this page with scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response I get is different from what I see in the browser. The browser shows the correct page, while the scrapy response is the
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
page. I have tried with urllib2 but still have the same issue. Any help is much appreciated.

I don't really understand the issue, but usually a different response for a browser and scrapy is caused by one of these:
the server analyzes your User-Agent header, and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies, and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via scrapy like the browser does, but you forgot some form fields, or put wrong values
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the mentioned issues and will make it work.
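The first two causes can be tested by sending browser-like headers and comparing the result. Below is a minimal sketch using only the Python 3 standard library (urllib.request, the successor to urllib2). The User-Agent string and the cookie name are made-up placeholders, and the idea that this particular server reacts to them is an assumption, not something confirmed here:

```python
import urllib.request

# Placeholder value: copy the real one from the browser's developer tools.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_browser_like_request(url, cookies=None):
    """Return a Request carrying a browser-style User-Agent and optional cookies."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", BROWSER_USER_AGENT)
    if cookies:
        # The Cookie header is a "; "-separated list of name=value pairs.
        cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
        request.add_header("Cookie", cookie_header)
    return request


request = build_browser_like_request(
    "http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391",
    cookies={"session": "placeholder"},  # hypothetical cookie name and value
)
# urllib.request.urlopen(request) would send it; the call is left out so the
# sketch does not depend on the network.
```

In Scrapy, the same values can be passed through the headers and cookies arguments of scrapy.Request, or through the USER_AGENT setting.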

Related

API Request URL returns "Invalid Access"

I'm trying to scrape data from a website but I have no experience with scraping or APIs except for making a Discord Bot once. So I followed the steps described here to find the API:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api
The Request URL in the Headers tab with the important information is this one:
https://api.amiami.com/api/v1.0/item?gcode=FIGURE-119023&lang=eng
When I try to open this page, like he does, it only returns:
{"RSuccess":false,"RValue":{"HttpStatusCode":400},"RMessage":"Invalid access."}
If you want to try getting the Request URL yourself, the original page I used was:
https://www.amiami.com/eng/detail/?gcode=FIGURE-119023
Removing the language argument doesn't seem to change anything either. So I guess there's something that detects that I'm not accessing it in a normal way. Any Ideas on how to fix this?

Scrapy does not return results

I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However, my scrapy code cannot find the product links listed on this website. Could anyone tell me why? The xpath I'm using is //a[@class="product-card--link"]/@href
Is this because of js? If so, I tried using scrapy splash but still cannot find the product links listed. Could someone please help!
Thank you!
The items are generated via an AJAX request. When you connect to the page, a JavaScript script is executed that makes some extra HTTP requests to retrieve JSON data. However, scrapy does not execute any JavaScript, so you need to find and call those AJAX requests manually.
See the related question "Can scrapy be used to scrape dynamic content from websites that are using AJAX?" to see how to inspect network traffic and solve such cases.
In this particular case, you can see that the first XHR request made returns a huge JSON file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see, the URL takes several arguments. The most important is cid, which stands for category id; the others are mostly for calculating shipping prices, so if you don't care about those, this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
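A minimal sketch of building that URL for any category id. The function name build_search_url and the constant SEARCH_ENDPOINT are illustrative names, not part of the site's API:

```python
from urllib.parse import urlencode

# Endpoint taken from the XHR request shown above.
SEARCH_ENDPOINT = "http://bananarepublic.gap.com/resources/productSearch/v1/search"


def build_search_url(category_id, **extra_params):
    """Build the product-search URL for one category id (the cid argument)."""
    params = {"cid": category_id, **extra_params}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url(1055063)
print(url)
```

A spider would request this URL instead of the category page and decode the response body with json.loads. The layout of the returned JSON is not shown here, so inspect it in the browser before picking keys.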
An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
This can be a bit easier to implement, and your xpath expression should work fine with Splash. But the scraper will be slower, as it has to render each page.

scrapy not visiting url after #

I am writing a scraper for a site. However, something weird is happening: it's not visiting the URL I supply to it. Instead, it visits the base URL of the website.
I searched on the internet and learned that scrapy ignores the part of the URL after #, so I need to identify the Ajax request being sent and mimic that.
However, the problem is that the response of the Ajax request comes back as JSON, not HTML content. Would someone please help me with how to deal with it?
Following is the url
https://www.buildersshow.com/Search/Exhibitors.aspx#showID=11&state=160&tabname=name
If you investigate the AJAX requests that the page makes, identify the request you need to make and get your response, it should be JSON contained in the response body. To parse it and get your data of interest, use the json decoder/encoder module. Something like this:
import json
mydata = json.loads(response.body)
info = mydata['somekey']
subinfo = mydata['somekey']['subkey']
And so forth. Make sure to handle decoding errors properly; it would be best to read the official documentation for the json module first.
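One way to handle the decoder defensively is sketched below. The helper name extract_subinfo is hypothetical, and the sample payload is made up to match the placeholder keys used above:

```python
import json


def extract_subinfo(body, key="somekey", subkey="subkey"):
    """Decode a JSON response body and return data[key][subkey], or None."""
    try:
        data = json.loads(body)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError
        # subclasses; either means the body is not usable JSON
        # (for example, an HTML error page).
        return None
    if not isinstance(data, dict):
        return None
    section = data.get(key)
    if not isinstance(section, dict):
        return None
    return section.get(subkey)


sample_body = b'{"somekey": {"subkey": "value"}}'  # made-up payload
print(extract_subinfo(sample_body))
```

Returning None instead of raising keeps a spider running when one response is malformed; whether that is the right trade-off depends on how the missing data should be reported.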

How do I get HTTP Headers [Links Only] using a Web Browser in VB.NET?

What I'm trying to achieve is something similar to an Add-on called Live Http Headers used with Firefox. I'm not trying to get the Headers or cookies, but the links that load on the page itself. Let us assume I visited Mail.Yahoo.com, this is pretty much what you would see when I use the add-on.
CLICK HERE
How can I achieve something similar? Only the links that load on the page itself!
I'm looking forward to reading your suggestions; please enlighten me if you know!
You can download the webpage using a WebClient instance.
Then, with the resulting string, you can get the URLs using a regular expression:
http://www.geekzilla.co.uk/view2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
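The regular-expression step is sketched below in Python rather than VB.NET, purely to illustrate the idea. The pattern and the sample HTML are assumptions, and a regular expression is only a rough approximation of real HTML parsing:

```python
import re

# Matches href="..." or src="..." with single or double quotes.
LINK_PATTERN = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def extract_links(html):
    """Return every href/src value found in the HTML string, in document order."""
    return LINK_PATTERN.findall(html)


# Made-up page content standing in for the downloaded string.
sample_html = """<a href="http://example.com/a">A</a><img src='/logo.png'>"""
print(extract_links(sample_html))
```

The same pattern can be used from .NET's regular-expression classes after the page has been downloaded.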

Do I need to send a 404?

We're in the middle of writing a lot of URL rewrite code that would basically take ourdomain.com/SomeTag and do something dynamic to figure out what to display.
Now if the Tag doesn't exist in our system, we're gonna display some information to help them find what they were looking for.
And now the question came up, do we need to send a 404 header? Should we? Are there any reasons to do it or not to do it?
Thanks
Nathan
You aren't required to, but it can be useful for automated checkers to detect the response code instead of having to parse the page.
I certainly send proper response codes in my applications, especially when I have database errors or other fatal errors. Then the search engine knows to give up and retry later instead of indexing the page. For example, I send code 503 for "Service Unavailable" together with a Retry-After: 600 header (600 seconds, i.e. 10 minutes) to tell it when to try again. Search engines won't take this badly.
404 codes are sent when the page should not be indexed or doesn't exist (e.g. a non-existent tag).
So yes, do send status codes.
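The decision described in this answer can be sketched as a small function. The name choose_response and its parameters are hypothetical, and the sketch is not tied to any particular web framework:

```python
from http import HTTPStatus


def choose_response(tag, known_tags, database_ok=True):
    """Return (status, headers) for a request to a tag page."""
    if not database_ok:
        # Temporary failure: ask clients and crawlers to come back later.
        return HTTPStatus.SERVICE_UNAVAILABLE, {"Retry-After": "600"}
    if tag not in known_tags:
        # Unknown tag: the helpful page can still be rendered, but with a 404 status.
        return HTTPStatus.NOT_FOUND, {}
    return HTTPStatus.OK, {}


status, headers = choose_response("SomeTag", known_tags={"OtherTag"})
print(int(status), status.phrase)
```

The body shown to the visitor stays the same helpful page; only the status line and headers change, which is the part automated clients read.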
I say do it. If the user is actually an application acting on behalf of the user (e.g. cURL, wget, or something custom), then a 404 would actually help quite a bit.
You have to keep in mind that the result code you return is not for the user; for the standard user, error codes are meaningless so don't display this info to the user.
However, think about what could happen if crawlers access your pages and consider them valid (with a 200 response): they will start indexing the content, and your page will be added to the index. If you tell the search engine to index the same content for all your not-found pages, it will certainly affect your ranking, and if one such page appears in the top search results, you will look like a fool.