scrapy not visiting url after # - scrapy

I am writing a scraper for a site, but something odd is happening: Scrapy does not visit the URL I give it. Instead, it visits the base URL of the website.
I searched online and learned that Scrapy ignores the part of a URL after #, so I need to identify the Ajax request being sent and mimic it.
The problem is that the response to the Ajax request comes back as JSON, not HTML. Could someone please help me deal with it?
The URL is:
https://www.buildersshow.com/Search/Exhibitors.aspx#showID=11&state=160&tabname=name

Investigate the AJAX requests that the page makes, identify the one you need, and request it directly; the data should come back as JSON in the response body. To parse it and extract the data of interest, use Python's json module. Something like this:
import json
mydata = json.loads(response.body)
info = mydata['somekey']
subinfo = mydata['somekey']['subkey']
And so forth. Make sure to handle decoding errors properly (json.loads raises ValueError on malformed input); it is best to read the official json documentation first.
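To make the two steps above a bit more concrete, here is a minimal sketch using only the standard library. It first shows that everything after # is a fragment (which is never part of the request sent to the server), then parses a JSON body defensively. The sample body and the helper name parse_json_body are made up for illustration; the real AJAX endpoint and its keys have to be read from the browser's network tab.

```python
import json
from urllib.parse import parse_qs, urldefrag

url = ("https://www.buildersshow.com/Search/Exhibitors.aspx"
       "#showID=11&state=160&tabname=name")

# Split off the fragment; only `base` is ever requested from the server.
base, fragment = urldefrag(url)

# The fragment looks like a query string, so it can be decoded the same way.
# These values are probably what the page's JavaScript passes to its AJAX call.
params = {name: values[0] for name, values in parse_qs(fragment).items()}


def parse_json_body(body):
    """Decode a JSON response body, returning None if it is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Hypothetical response body standing in for response.body in a Scrapy callback.
sample_body = b'{"somekey": {"subkey": "value"}}'
data = parse_json_body(sample_body)

print(base)
print(params)
print(data["somekey"]["subkey"])
```

In a Scrapy callback you would pass response.body to the helper instead of the sample bytes.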

Related

API Request URL returns "Invalid Access"

I'm trying to scrape data from a website but I have no experience with scraping or APIs except for making a Discord Bot once. So I followed the steps described here to find the API:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api
The Request URL in the Headers tab with the important information is this one:
https://api.amiami.com/api/v1.0/item?gcode=FIGURE-119023&lang=eng
When I try to open this page, like he does, it only returns:
{"RSuccess":false,"RValue":{"HttpStatusCode":400},"RMessage":"Invalid access."}
If you want to try getting the Request URL yourself, the original page I used was:
https://www.amiami.com/eng/detail/?gcode=FIGURE-119023
Removing the language argument doesn't seem to change anything either. So I guess something detects that I'm not accessing it in the normal way. Any ideas on how to fix this?

Add a header to a page request using GET?

I have a vb.NET App that uses System.Net.WebClient to query an API. I'm able to get the information I'm requesting just fine.
The people that supply the API are requesting that I
"set a custom User header when requesting data to determine the source application."
Am I supposed to pre-send something first, or append something to the URL for the WebClient to process? The API only accepts GET requests and doesn't have a parameter for identification.
I'm stuck in terminology here. A search for that phrase, here, came up with server-side topics so I don't know what to look for. Can someone translate?

Follow only child links using Scrapy

I'm new to Scrapy and I'm not sure how to tell it to only follow links that are subpages of the current url. For example, if you are here:
www.test.com/abc/def
then I want scrapy to follow:
www.test.com/abc/def/ghi
www.test.com/abc/def/jkl
www.test.com/abc/def/*
but not:
www.test.com/abc/*
www.test.com/*
or any other domain for that matter.
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
Write a spider deriving from BaseSpider. In the spider's parse callback, return the requests you want followed. Just make sure each request you generate has the form you want, i.e. the URL extracted from the response is a child of the current URL (which is response.url). Then create a Request object for each such URL and yield it.
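The "is a child of the current URL" check can be sketched with the standard library alone. The function name is_child_url is hypothetical; inside a parse callback you would call it with response.url and each extracted link, and only yield a Request when it returns True.

```python
from urllib.parse import urljoin, urlparse


def is_child_url(current_url, candidate_url):
    """Return True if candidate_url is strictly below current_url on the same host."""
    current = urlparse(current_url)
    # Resolve relative links against the current page, as a browser would.
    candidate = urlparse(urljoin(current_url, candidate_url))

    if candidate.netloc != current.netloc:
        return False  # different domain

    # Compare whole path segments so /abc/defghi is not treated as a child of /abc/def.
    prefix = current.path.rstrip("/") + "/"
    return candidate.path.startswith(prefix)


current = "http://www.test.com/abc/def"
for link in ("http://www.test.com/abc/def/ghi",
             "/abc/def/jkl",
             "http://www.test.com/abc/xyz",
             "http://www.example.org/abc/def/ghi"):
    print(link, is_child_url(current, link))
```

The trailing slash on the prefix is a deliberate choice: without it a sibling path that merely shares a prefix would be followed as well.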

Get request headers to validate analytics in a website

I have to test the site analytics of a website using Selenium WebDriver (Java). All the attributes and values are sent to the analytics tool via the URL of a request. I would like to capture that request alone so that I can extract the attributes and their values.
I tried the BrowserMob tool. It gives me the entire traffic in the form of a HAR file. Is there a way to extract that one request alone?
I tried server.setCaptureHeaders(true); but it didn't help much, as I still see a whole bunch of URLs in the HAR. I'm interested in only the one that is sent to the analytics website. There is a URL that's sent as a request behind the scenes. A few analytics plugins are able to get exactly that request URL and extract the attribute values, but I can't automate through those plugins.
Or is there a way to pull only certain requests out of the HAR?
BMP (BrowserMob Proxy) is a great tool. You can only create a new HAR before the request is sent and read it afterwards. You can then iterate through the dict it returns and find the request you need.
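As a rough sketch of that iteration, assuming the HAR is already available as a Python dict (for example loaded from a saved .har file with json.load): the standard HAR layout keeps requests under log, then entries, then request. The host names and the helper names requests_to_host and query_params below are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def requests_to_host(har, host):
    """Return the request objects in a HAR dict whose URL host contains `host`."""
    entries = har.get("log", {}).get("entries", [])
    return [
        entry["request"]
        for entry in entries
        if host in urlparse(entry["request"]["url"]).netloc
    ]


def query_params(request):
    """Decode the query string of a HAR request into a {name: value} dict."""
    query = urlparse(request["url"]).query
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical HAR content, trimmed to the fields used above.
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET",
                         "url": "https://www.example.com/index.html"}},
            {"request": {"method": "GET",
                         "url": "https://analytics.example.com/collect?page=home&visitor=42"}},
        ]
    }
}

analytics_requests = requests_to_host(sample_har, "analytics.example.com")
for request in analytics_requests:
    print(request["url"], query_params(request))
```

The same filtering idea carries over to Java by walking the entries of the HAR object and comparing each request URL against the analytics host.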

Scrapy response different than browser response

I am trying to scrape this page with Scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response I get differs from what I see in the browser. The browser response has the correct page, while the Scrapy response is the
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
page. I have tried with urllib2 but still have the same issue. Any help is much appreciated.
I don't really understand the issue, but usually a different response for a browser and Scrapy is caused by one of these:
the server analyzes your User-Agent header, and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies, and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via Scrapy like the browser does, but you forgot some form fields or sent wrong values;
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the mentioned issues and will make it work.
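One way to start checking the first two causes is to send the same headers the browser sends and see whether the response changes. The sketch below only builds the request with urllib.request (the Python 3 counterpart of urllib2) and does not send it. The User-Agent string is a placeholder and the helper name build_browser_like_request is made up; copy the real header values from your browser's network tab.

```python
from urllib.request import Request


def build_browser_like_request(url, user_agent, cookie=None):
    """Build a GET request carrying browser-style headers (nothing is sent here)."""
    headers = {"User-Agent": user_agent}
    if cookie:
        # Optional: replay the cookies the browser had when the page worked.
        headers["Cookie"] = cookie
    return Request(url, headers=headers)


req = build_browser_like_request(
    "http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391",
    user_agent="Mozilla/5.0 (placeholder; copy the real value from your browser)",
)

# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_method(), req.full_url)
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would actually send it; if the result still differs from the browser, the difference is probably in cookies or in server logic you cannot see.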