How to avoid getting banned when using Scrapy - scrapy

I'm constantly getting banned from a website. I set download_delay = 10 in Scrapy, tried the fake_user_agent package, and then tried implementing Tor and Polipo; according to this site the config is OK. But after running once or twice I got banned again! Can anyone help me here?
Note: I also want to try scrapy-proxie, but I can't get it to activate.

You should take a look at what the documentation says.
Here are some tips to keep in mind when dealing with these kinds of sites:
rotate your user agent from a pool of well-known browser user agents (google around to get a list of them) - a short sketch follows this list
disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour
use download delays (2 or higher); see the DOWNLOAD_DELAY setting
if possible, use Google cache to fetch pages instead of hitting the sites directly
use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages; one example of such a downloader is Crawlera
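For the delay, cookie and user-agent points above, here is a minimal sketch of what that can look like in a Scrapy project. The USER_AGENTS list, the project path and the middleware name are placeholders you would fill in yourself:

    # settings.py -- throttle the crawl and disable cookies
    DOWNLOAD_DELAY = 10                 # seconds between requests to the same site
    COOKIES_ENABLED = False             # some sites use cookies to spot bot behaviour
    AUTOTHROTTLE_ENABLED = True         # back off automatically when the site slows down
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RotateUserAgentMiddleware": 400,   # hypothetical path
    }

    # middlewares.py -- pick a browser user agent at random for every request
    import random

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",    # fill in real browser strings
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...",
    ]

    class RotateUserAgentMiddleware:
        def process_request(self, request, spider):
            request.headers["User-Agent"] = random.choice(USER_AGENTS)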

Use a delay between requests/clicks.
Don't rely on Tor alone - all connections coming from a single exit address looks bad; rotate proxies after several visits instead (a rough sketch follows).
And check this post on web scraping etiquette.
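A rough sketch of "rotate proxy after several visits" as a Scrapy downloader middleware. PROXY_POOL is an assumed setting name holding your own list of proxy URLs, not a built-in Scrapy setting:

    # middlewares.py -- assign a proxy from a pool, switching every few requests
    import random

    class RotatingProxyMiddleware:
        def __init__(self, proxies, per_proxy=5):
            self.proxies = proxies
            self.per_proxy = per_proxy      # requests to send before switching proxy
            self.count = 0
            self.current = random.choice(proxies) if proxies else None

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist("PROXY_POOL"))   # assumed setting name

        def process_request(self, request, spider):
            if not self.proxies:
                return
            if self.count and self.count % self.per_proxy == 0:
                self.current = random.choice(self.proxies)
            self.count += 1
            request.meta["proxy"] = self.current   # e.g. "http://user:pass@1.2.3.4:8080"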


Can ZAP be used as a DAST tool via the API without spidering?

I'm trying to use ZAP as a DAST tool via the API, and it's getting a bit annoying.
Can I use it as an attack tool instead of a proxy tool? What I mean is: currently I can't launch an active scan without the URL being in the site tree, which as far as I know only happens via the spider, right?
What I want is to provide a URL, launch an active scan based on a policy, and get results. Now that I think about it, this is similar to fuzzing, just with attack vectors. I do see the logic of asking what to do with URL X when there is no history or prior scanning, but can't it just scan that page for actions and variables? The main difference is scanning a single page/URL, as opposed to spidering, which assumes there are other URLs.
After writing this I'm not sure it can be done without a spider unless you're in my situation, so let me explain it.
Let's say, for example's sake, I just want to scan the login page for SQLi, and I'm using OWASP Juice Shop to make things easier. Can I tell ZAP to attack that one page? The only way I found in that example is via the POST request, since the URL is not a static page and isn't picked up by ZAP unless it's an action, but then I can't launch the scan without spidering, so it's a loop.
Sorry for the long post; hopefully you can provide some insights.
Update in comments
ZAP has to know about the site it's going to attack. We deliberately separate the concepts of discovery and attacking because there's no one discovery option that's best for everyone. You can use the standard spider, the AJAX spider, import URLs, import definitions like OpenAPI, proxy your browser, proxy regression tests, or even make direct requests to the target site via the ZAP API.
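If the Python API client (the python-owasp-zap-v2.4 package) is an option, the "direct request, then active scan" route looks roughly like the sketch below. The Juice Shop URL, API key and policy name are assumptions you would replace with your own values:

    # Sketch: put a single URL into ZAP's site tree without spidering, then scan it.
    import time
    from zapv2 import ZAPv2

    target = "http://localhost:3000/#/login"         # assumed Juice Shop login page
    zap = ZAPv2(apikey="changeme",                    # your ZAP API key
                proxies={"http": "http://127.0.0.1:8080",
                         "https": "http://127.0.0.1:8080"})

    # One direct GET through ZAP is enough to register the node in the tree.
    # (For POST-only endpoints you would proxy the request or import it instead.)
    zap.urlopen(target)
    time.sleep(2)

    scan_id = zap.ascan.scan(target, scanpolicyname="SQL-only")   # assumed policy name
    while int(zap.ascan.status(scan_id)) < 100:
        time.sleep(5)

    print(zap.core.alerts(baseurl=target))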
It looks like you have quite a few questions about ZAP. The ZAP User Group is probably a better forum for them: https://groups.google.com/group/zaproxy-users

Is it a bad idea to have the web browser query another API instead of my site providing it?

Here's my issue. I have a site that provides some investing services. I pay for end-of-day data, which is all I really need for my service, but it feels a bit odd when people check in during the day and it only displays yesterday's closing price. End of day is fine for my analytics, but I want to display delayed quotes on my site.
According to Yahoo's YQL FAQ, if you use IP-based authentication you are limited to 1,000 calls/day/IP. If my site grows I may exceed that, so I was thinking of pushing this request out to the people browsing my site, since it's extremely unlikely that the same IP will visit my site 1,000 times a day (my site itself has no use for this info). I would call a URL from their browser, then parse the results so they can view the data in my site's template.
I'm new to web development, so I'm wondering: is it common practice, or a bad idea, to have the user's browser make the API call itself?
It is not a bad idea at all:
You stretch the rate limits this way;
Your server will respond faster (since it does not have to contact the API);
Your page will load faster because the initial response is smaller;
You can load the remaining data from the API asynchronously while your UI is already responsive.
Generally speaking, it is a great idea to talk to APIs from the client: it's more dynamic, you spread the traffic, the UI is more responsive, and so on.
The biggest downside I can think of is that you now depend on the availability of a third-party service. On the other hand, your server(s) will be under less load because the traffic is spread out.
Hope this helped a bit! Cheers!

Apache to limit number of different IP addresses that can connect to a server

A group of friends and I are developing a server, and we want only a limited number of users to access it. We first tried the KeepAlive and MaxClients directives with a relatively small timeout, and it worked fine on a simple experimental web page.
But our real web page only loads partially. I think it's because we use an AJAX model that opens multiple connections, one per part of the web interface.
To overcome this, we thought of limiting by the number of different IPs connected rather than by the number of connections. We tried to find an Apache module/directive that does this, but we keep finding modules that limit bandwidth or connections per IP (the reverse of what we're trying to find).
Does anyone know something that can help me with this?
Thanks in advance.
This sort of thing is usually done at the routing layer, since that's where you want to handle DoS-related issues. There is, however, mod_evasive, which you can find by Googling. It isn't really maintained anymore and doesn't seem to have an official homepage either, but if you insist on using Apache for this sort of thing, I would check it out in any case...

When should I use or avoid subdomains?

Recently a user told me to avoid subdomains when I can. I remember reading that Google considers a subdomain a unique site (is this true?). What else happens when I use a subdomain, and when should I or should I not use one?
I heard cookies are not shared between subdomains? I know two images can be downloaded simultaneously from a site; would I be able to download four if I use sub1.mysite.com and sub2.mysite.com?
What else should I know?
You can share cookies between subdomains, provided you set the right parameters in the cookie. By default, they won't be shared, though.
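For reference, the parameter in question is the cookie's Domain attribute. A minimal sketch using Flask (the cookie name and mysite.com are just placeholders):

    # Setting a cookie that sub1.mysite.com and sub2.mysite.com can both read.
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/login")
    def login():
        resp = make_response("logged in")
        # Without domain=..., the cookie is host-only and stays on the exact host
        # that set it; setting the domain explicitly makes it cover all subdomains.
        resp.set_cookie("session_id", "abc123", domain=".mysite.com")
        return resp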
Yes, you can get more simultaneous downloads if you store images in different subdomains. However, the other side of the scale is that the user spends more time resolving DNS names, so it's not practical to have, say, 25 subdomains to get 50 simultaneous downloads.
Another thing that happens with subdomains is that AJAX requests won't work without some effort (you CAN make them work using document.domain tricks, but it's far from straightforward).
I can't help with the SEO part, although some people discourage having both yoursite.com and www.yoursite.com working and returning the same content, because it "dilutes your PageRank". Not sure how true that is.
You complicate quite a few things: collecting stats, controlling spiders, HTML5 storage, XSS, inter-frame communication, virtual-host setup, third-party ad serving, and interaction with remote APIs like Google Maps.
That's not to say these things can't be solved, just that the rise in complexity adds more work and may not provide suitable benefits to compensate.
I should add that I went down this path once myself for a classifieds site, adding domains like porsche.site.com and ferrari.site.com hoping to boost the rank for those keywords. In the end I did not see a noticeable improvement, and even worse, Google was walking the entire site via each subdomain, meaning that a search for Ferraris might return porsche.site.com/ferraris instead of ferrari.site.com/ferraris. In short, Google considered the subdomains duplicates of each other, but it still crawled each one every time it visited.
Again, workarounds existed but I chose simplicity and I don't regret it.
If you use subdomains to store your website's images, JavaScript, stylesheets, etc., then your pages may load quicker. Browsers limit the number of simultaneous connections to each domain name; the more subdomains you use, the more connections can be made at the same time to fetch the page's content.
Recently a user told me to avoid subdomains when I can. I remember reading that Google considers a subdomain a unique site (is this true?). What else happens when I use a subdomain, and when should I or should I not use one?
The last thing I heard about Google optimization is that domains count for more PageRank than subdomains. I also believe that PageRank is calculated per page, not per site (according to the algorithm, etc.). Though the only people who can really tell you are Google employees.
I heard cookies are not shared between subdomains?
You should be able to use a cookie for all subdomains: www.mysite.com, sub1.mysite.com and sub2.mysite.com can all share the same cookies, but a cookie set only for mysite.com (host-only, without a Domain attribute) won't be sent to them.
I know two images can be downloaded simultaneously from a site. Would I be able to download four if I use sub1.mysite.com and sub2.mysite.com?
I'm not sure what you mean by downloading simultaneously. Oftentimes a browser with a single thread will download images one at a time, even from different domains. Browsers configured with multiple threads can download multiple items from different domains at the same time.

YSlow alternatives - optimisations for small websites

I am developing a small intranet-based web application. I have YSlow installed, and it suggests I do several things, but they don't seem relevant to my situation.
E.g. I do not need a CDN.
My application is slow, so I want to reduce the bandwidth used by requests.
What rules of YSlow should I adhere to?
Are there alternative tools for smaller sites?
What checklist should I apply before rolling out my application?
I am using ASP.NET.
Bandwidth on intranet sites shouldn't be an issue at all (unless you have VPN users, that is). If you don't and it's still crawling, it's probably more to do with the backend than the front-facing structure.
If you are trying to optimise for remote users, some of the same general techniques apply to optimise the whole thing:
Don't use 30 stylesheets - concatenate them into one (see the sketch after this list)
Don't use 30 JS files - concatenate them into one
Consider compressing both JS and CSS using minifiers or the YUI Compressor
Consider using sprites (a single image containing multiple versions, e.g. button-up and button-down, one above the other)
Obviously, massive images are a no-no
Make sure you send Expires headers so that stylesheets/JS/images/etc. are cached for a sensible amount of time
Make sure your pages aren't ridiculously large. If you're in a controlled environment and you can guarantee JS availability, you might want to page data with AJAX.
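One way to do the "concatenate them into one" step from the list above is a small build script. This is only a sketch with assumed static/ paths; in a real project you would also run the output through a minifier such as the YUI Compressor mentioned above:

    # Concatenate all CSS and JS files so each page makes one request per asset type.
    from pathlib import Path

    def bundle(src_dir: str, pattern: str, out_file: str) -> None:
        parts = []
        for path in sorted(Path(src_dir).glob(pattern)):
            parts.append(f"/* {path.name} */\n" + path.read_text())
        Path(out_file).write_text("\n".join(parts))

    bundle("static/css", "*.css", "static/bundle.css")   # assumed directory layout
    bundle("static/js", "*.js", "static/bundle.js")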
To begin, limit the number of HTTP requests made for images, scripts and other resources by combining them where possible. Consider minifying them too. I would recommend Fiddler for debugging HTTP.
Be mindful of the size of ViewState: set EnableViewState = false where possible. For example, for drop-down list controls whose list of items never changes, disable ViewState and populate them in Page_Init or by overriding OnLoad. "TRULY understanding ViewState" is a must-read article on the subject.
Oli posted an answer while I was writing this, and I have to agree that bandwidth considerations should be secondary or tertiary for an intranet application.
I've discovered Page Speed since asking this question. It's not really aimed at smaller sites, but it is another great Firebug plug-in.
Update: As of June 2015, the Page Speed plugins for Firefox and Chrome are no longer maintained or available; instead, Google suggests the web version.
Pingdom Tools provides a quick test for any publicly accessible web page.