Scrapy: How to know ip address of a response through scrapy spider

I am using Scrapy to fetch data from a site without a proxy. How can I find out, from inside the spider, which IP address a response came from?
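One possible approach, assuming Scrapy 2.1 or later: Response objects carry an ip_address attribute holding the address of the server that sent the response. It can be None, for example when the download handler does not populate it. The helper below is only a sketch. The name server_ip and the stand-in response object are hypothetical and not part of Scrapy; the stand-in is there so the snippet runs without starting a crawl.

```python
import ipaddress
from types import SimpleNamespace
from typing import Optional


def server_ip(response) -> Optional[str]:
    """Return the IP address a response was served from, or None if unknown."""
    # Assumption: `response` is a Scrapy Response (Scrapy 2.1+), which exposes
    # `ip_address` as an ipaddress.IPv4Address / IPv6Address, or None.
    ip = getattr(response, "ip_address", None)
    return str(ip) if ip is not None else None


# Hypothetical stand-in for a real Scrapy response, with made-up values.
fake_response = SimpleNamespace(
    url="https://example.com/",
    ip_address=ipaddress.ip_address("93.184.216.34"),
)
print(server_ip(fake_response))
```

Inside a spider callback such as parse(self, response), the same call would be server_ip(response), and the result could be logged or stored alongside the scraped item.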

Related

Scrapy 503 Service unavailable for CloudFlare protection

I'm trying to fetch articles from https://journals.sagepub.com/. The website is accessible through my browser, but I keep getting a 503 error when I try to crawl it in the Scrapy shell. When I view the response in a browser, it shows the generic Cloudflare DDoS protection page. I have tried changing the user agent and the download delay, but nothing works. I am new to Scrapy and web scraping in general, so some help would be much appreciated.

Scrapy ignore pages with http auth

Is there a way or setting in Scrapy to ignore pages that require HTTP basic authentication while a crawl is in progress?
Thanks

Invalid URL for Subscription API: Instagram

I'm attempting to test a real-time Instagram stream using the Subscription API, but am having trouble setting up subscriptions for local testing.
I attempted using localhost:8080 for the callback_url and editing my /etc/hosts file (redirecting localhost to local.machine.com)
Eventually, I was able to set up a subscription to my home's IP address to receive callbacks from Instagram.
The IP address was in the form:
xxx.xxx.xxx.xx:8080
However, this morning I tried from a different IP address, in the form xxx.xxx.x.xx:8080, and Instagram has consistently returned 400: Bad Request: Invalid URL
Does anybody have any insight as to what Instagram treats as a valid URL parameter for subscriptions?

I would recommend ngrok for this.
It allows you to set up a tunnel between your local machine and the internet.
From the command line, run:
ngrok http 8080
That will give you a URL like this: http://something.ngrok.io. In your terminal window you can also inspect all traffic passing through the tunnel.

Unable to capture Response Headers Location through browsermob proxy

I'm using the Selenium PHP WebDriver and a PHP wrapper for BrowserMob Proxy to fetch an access token from Facebook. Once user authentication succeeds, Facebook redirects to
'http://www.karkala.in/index.html#state=ads_management%2Cread_insights&access_token=ABCDEFZCLkK3EBAJOxrzwq0BdXzT6DCA6QDZBbwUpc8ArgdAv5ly3nNSHME9W19cF7a06pGGGyQdkpVtqc4OnZAnAQT4eKDqeaipxLlVEgZDZD&expires_in=5569'
Now I need to read this token. I use the following PHP code to fetch the response:
$har = self::$client->__get("har");
But I'm not able to see the location (above url) in response headers.
My response text is available here:
http://www.karkala.in/har.txt

Selenium itself has an option to read the current URL:
selenium.GetLocation()
So this can be achieved without using BrowserMob Proxy.
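Once the redirect URL has been read from the browser, the token still has to be pulled out of the part after the "#". A minimal sketch of that step in Python, using only the standard library: the function name token_from_redirect is hypothetical, and the URL is a shortened version of the one in the question with a placeholder token.

```python
from urllib.parse import parse_qs, urlsplit


def token_from_redirect(url: str) -> dict:
    """Parse the key=value pairs carried in the URL fragment (after '#')."""
    fragment = urlsplit(url).fragment
    params = parse_qs(fragment)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in params.items()}


# Placeholder URL modelled on the one in the question (token is made up).
redirect_url = (
    "http://www.karkala.in/index.html"
    "#state=ads_management%2Cread_insights"
    "&access_token=EXAMPLE_TOKEN&expires_in=5569"
)
params = token_from_redirect(redirect_url)
print(params["access_token"])
```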

How should I handle Spiders/Web Crawlers using HTTP/0.9 if I am using Apache 2?

I am using Apache 2 to serve content, and Bing Bot is using HTTP/0.9 to request pages from my server, which does not serve requests addressed directly to its IP address.
How should I handle the spider if I don't know which host they want, but still need them to index my site?
I currently return 400 Bad Request, but it makes me nervous that my sites will not be indexed for Bing or Yahoo.
Thanks

[SOLVED]: I have been returning 400 Bad Request and Bing/Yahoo have taken the hint.
