I'm using Scrapy to fetch data from a site without a proxy. How can I find out the IP address a response came from in Scrapy?
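One possible approach, assuming Scrapy 2.1 or later, where response objects carry an `ip_address` attribute (it may be `None` when the download handler does not supply it). The helper name `peer_ip` and the stand-in object below are hypothetical; inside a spider you would call `peer_ip(response)` from the `parse()` callback.

```python
from ipaddress import ip_address
from types import SimpleNamespace


def peer_ip(response):
    """Return the server IP of a response as a string, or None if unknown.

    Assumes a Scrapy >= 2.1 Response, whose `ip_address` attribute is an
    ipaddress.IPv4Address / IPv6Address, or None.
    """
    addr = getattr(response, "ip_address", None)
    return str(addr) if addr is not None else None


# Stand-in object so the sketch runs without a live crawl; in a spider,
# `response` would be the object passed to the parse() callback.
fake = SimpleNamespace(url="http://example.com/", ip_address=ip_address("203.0.113.7"))
print(peer_ip(fake))
```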
I'm trying to fetch articles from https://journals.sagepub.com/. The website is accessible through my browser, but I keep getting a 503 error when I try to crawl it in the Scrapy shell. When I view the response in a browser, it shows the generic Cloudflare DDoS protection page. I have tried changing user agents and the download delay, but nothing works. I am new to Scrapy and web scraping in general, so some help would be much appreciated.
Is there a way or setting in Scrapy to skip pages that require basic HTTP authentication while a crawl is in progress?
Thanks
I'm attempting to test a real-time Instagram stream using the Subscription API, but am having trouble setting up subscriptions for local testing.
I tried using localhost:8080 for the callback_url and editing my /etc/hosts file (redirecting localhost to local.machine.com).
Eventually, I was able to set up a subscription to my home's IP address to receive callbacks from Instagram.
The IP address was in the form:
xxx.xxx.xxx.xx:8080
However, this morning I tried from a different IP address, in the form xxx.xxx.x.xx:8080, and Instagram keeps returning 400: Bad Request: Invalid URL
Does anybody have any insight as to what Instagram treats as a valid URL parameter for subscriptions?
I would recommend ngrok for this.
It allows you to set up a tunnel between your local machine and the internet.
With ngrok, you can run this on the command line:
ngrok http 8080
That will give you a URL like this: http://something.ngrok.io. In your terminal window you can also inspect all traffic going through this tunnel.
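For something to sit behind the tunnel on port 8080, a minimal sketch of a local callback endpoint might look like the following. It assumes the subscription handshake is a GET request carrying `hub.mode`, `hub.challenge` and `hub.verify_token` query parameters, and that the challenge has to be echoed back; check this against the Subscription API documentation. `VERIFY_TOKEN`, `challenge_reply`, `CallbackHandler` and `serve` are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

VERIFY_TOKEN = "my-verify-token"  # hypothetical value chosen when subscribing


def challenge_reply(path, expected_token=VERIFY_TOKEN):
    """Return the challenge to echo back, or None if the request is not valid."""
    params = parse_qs(urlparse(path).query)
    mode = params.get("hub.mode", [""])[0]
    token = params.get("hub.verify_token", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    if mode == "subscribe" and token == expected_token and challenge:
        return challenge
    return None


class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        reply = challenge_reply(self.path)
        if reply is None:
            # Not a valid verification request (assumed handshake format).
            self.send_response(400)
            self.end_headers()
            return
        body = reply.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Listen on the local port that `ngrok http 8080` forwards to."""
    HTTPServer(("127.0.0.1", port), CallbackHandler).serve_forever()
```

Calling `serve()` in one terminal and running `ngrok http 8080` in another would then let the ngrok URL be used as the callback_url.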
I'm using the Selenium PHP WebDriver and a PHP wrapper for BrowserMob Proxy to fetch an access token from Facebook. Once user authentication succeeds, Facebook redirects to
'http://www.karkala.in/index.html#state=ads_management%2Cread_insights&access_token=ABCDEFZCLkK3EBAJOxrzwq0BdXzT6DCA6QDZBbwUpc8ArgdAv5ly3nNSHME9W19cF7a06pGGGyQdkpVtqc4OnZAnAQT4eKDqeaipxLlVEgZDZD&expires_in=5569'
Now I need to read this token. I use the following PHP code to fetch the response:
$har = self::$client->__get("har");
But I'm not able to see the location (the URL above) in the response headers.
My response text is available here:
http://www.karkala.in/har.txt
Selenium itself has an option to read the current URL:
selenium.GetLocation()
So this can be achieved without using BrowserMob Proxy.
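Once the URL has been read, the token still has to be pulled out of the fragment (the part after `#`). A minimal sketch of that step, written in Python for illustration; in PHP, `parse_url` and `parse_str` would play the same role. The function name `token_from_redirect` and the `EXAMPLE_TOKEN` value are hypothetical placeholders.

```python
from urllib.parse import urlparse, parse_qs


def token_from_redirect(url):
    """Extract access_token from the fragment of a redirect URL, or None."""
    fragment = urlparse(url).fragment  # everything after '#'
    params = parse_qs(fragment)        # fragment is formatted like a query string
    values = params.get("access_token")
    return values[0] if values else None


# Same shape as the redirect URL in the question, with a placeholder token.
redirect = (
    "http://www.karkala.in/index.html"
    "#state=ads_management%2Cread_insights"
    "&access_token=EXAMPLE_TOKEN&expires_in=5569"
)
print(token_from_redirect(redirect))
```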
I am using Apache 2 to serve content, and Bingbot is requesting pages over HTTP/0.9, which carries no Host header. My server does not serve requests addressed directly to its IP.
How should I handle the spider if I don't know which host they want, but still need them to index my site?
I currently return 400 Bad Request, but it makes me nervous that my sites will not be indexed for Bing or Yahoo.
Thanks
[SOLVED]: I have been returning 400 Bad Request and Bing/Yahoo have taken the hint.