Recently a user told me to avoid subdomains when I can. I remember reading that Google considers each subdomain a separate site (is this true?). What else happens when I use a subdomain, and when should I (or shouldn't I) use one?
I heard cookies are not shared between subdomains? I know 2 images can be downloaded simultaneously from a site. Would I be able to download 4 if I use sub1.mysite.com and sub2.mysite.com?
What else should I know?
You can share cookies between subdomains, provided you set the right parameters in the cookie. By default, they won't be shared, though.
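The parameter in question is the cookie's Domain attribute. Here is a minimal sketch using Python's standard http.cookies module (the cookie name and value are made up for illustration): a host-only cookie stays with the exact host that set it, while one with Domain=.mysite.com is sent to every subdomain.

```python
from http.cookies import SimpleCookie

# Host-only cookie: no Domain attribute, so only the exact host that
# set it (e.g. www.mysite.com) gets it back.
host_only = SimpleCookie()
host_only["session"] = "abc123"

# Domain cookie: explicitly scoped to the parent domain, so it is sent
# to mysite.com and every subdomain (sub1.mysite.com, sub2.mysite.com, ...).
shared = SimpleCookie()
shared["session"] = "abc123"
shared["session"]["domain"] = ".mysite.com"
shared["session"]["path"] = "/"

print(host_only.output())  # no Domain attribute in the header
print(shared.output())     # header includes Domain=.mysite.com and Path=/
```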
Yes, you can get more simultaneous downloads if you serve images from different subdomains. The other side of the scale is that each extra hostname costs the user another DNS lookup (and connection setup), so it's not practical to have, say, 25 subdomains just to get 50 simultaneous downloads.
Another thing that happens with subdomains is that AJAX requests between them won't work without some effort, because of the browser's same-origin policy (you CAN make them work using document.domain tricks, but it's far from straightforward).
Can't help with the SEO part, although some people discourage having both yoursite.com and www.yoursite.com working and returning the same content, because it "dilutes your PageRank". Not sure how true that is.
You complicate quite a few things: collecting stats, controlling spiders, HTML5 storage, XSS, inter-frame communication, virtual-host setup, third-party ad serving, and interaction with remote APIs like Google Maps.
That's not to say these things can't be solved, just that the rise in complexity adds more work and may not provide suitable benefits to compensate.
I should add that I went down this path once myself for a classifieds site, adding subdomains like porsche.site.com and ferrari.site.com, hoping to boost rank for those keywords. In the end I did not see noticeable improvement, and even worse, Google was walking the entire site via each subdomain, meaning that a search for Ferraris might return porsche.site.com/ferraris instead of ferrari.site.com/ferraris. In short, Google considered the subdomains to be duplicates, but it still crawled each one every time it visited.
Again, workarounds existed but I chose simplicity and I don't regret it.
If you use subdomains to store your web site's images, JavaScript, stylesheets, etc., then your pages may load quicker. Browsers limit the number of simultaneous connections to each host name. The more subdomains you use, the more connections can be made at the same time to fetch the page's content.
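One rough way this is often wired up on the server side (the static1/static2 hostnames and the helper below are hypothetical, not anything from the question): hash each asset path to one of a small pool of asset subdomains, so the browser can open connections to several hosts in parallel while each individual asset keeps a single, cache-friendly URL.

```python
import zlib

# Hypothetical pool of asset subdomains, all serving the same static files
# (configured as aliases in DNS and the web server).
STATIC_HOSTS = ["static1.mysite.com", "static2.mysite.com"]

def asset_url(path: str) -> str:
    """Map an asset path to a stable subdomain.

    Using a hash of the path (rather than round-robin) means the same
    asset always gets the same URL, so browser caching still works."""
    index = zlib.crc32(path.encode("utf-8")) % len(STATIC_HOSTS)
    return f"https://{STATIC_HOSTS[index]}{path}"

print(asset_url("/img/logo.png"))   # always the same host for this path
print(asset_url("/css/site.css"))
```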
Recently a user told me to avoid subdomains when I can. I remember reading that Google considers each subdomain a separate site (is this true?). What else happens when I use a subdomain, and when should I (or shouldn't I) use one?
The last thing I heard about Google optimization is that domains count for more PageRank than subdomains. I also believe that PageRank calculations are per page, not per site (according to the algorithm, etc.). Though the only people who can really tell you are Google employees.
I heard cookies are not shared between subdomains?
You should be able to use a cookie for all subdomains: www.mysite.com, sub1.mysite.com and sub2.mysite.com can all share the same cookies if the cookie's Domain is set to .mysite.com, but a host-only cookie set for mysite.com (with no Domain attribute) won't be shared with them.
I know 2 images can be downloaded simultaneously from a site. Would I be able to download 4 if I use sub1.mysite.com and sub2.mysite.com?
I'm not sure exactly what you mean by downloading simultaneously. Browsers cap how many simultaneous connections they open to a single host name, so items from one domain are fetched only a few at a time; spreading content across different domains lets the browser open additional connections and download more items at the same time.
I am very confused as to how Safari's ITP 2.3 works in certain respects, and why sites can't easily circumvent it. I don't understand under what circumstances limits are applied, what the exact limits are, what they are applied to, and for how long.
To clarify my question I broke it down into several cases. I will be referring to Apple’s official blog post about ITP 2.3 [1] which you can quote from, but feel free to link to any other authoritative or factually correct sources in your answer.
For third-party sites loaded in iframes:
Why can’t they just use localStorage to store the values of cookies, and send this data along not as actual browser cookies, but as data in the body of the request? Similarly, they can parse the response to update localStorage. What limits does ITP actually place on localStorage in third-party iframes?
If the localStorage is frequently purged (see question 1), why can’t they simply use postMessage to tell a script on the enclosing website to store some information (perhaps encrypted) and then spit it back whenever it loads an iframe?
For sites that use link decoration:
I still don’t understand what the limits on localStorage are for third-party sites in iframes which did NOT get classified as link-decorator sites. But let’s say they are link-decorator sites. According to [1], Apple only starts limiting things further if there is a query string or fragment. But can’t a website rather trivially store this information in the URL path before the query string, i.e. /in/here without ?in=here … surely large companies like Google can trivially choose to do that?
In the case where a site has been labeled as a tracking site, does that mean all its non-cookie data is limited to 7 days? What about cookies set by the server, aren’t they exempted? If so, then simply make a request to your server to set the cookie instead of using JavaScript. After all, the operator of the site is very likely to also have access to its HTTP server and app code.
For all sites:
Why can’t a service like Google Analytics or Facebook’s widgets simply convince a site to additionally add a CNAME to their DNS and put Google’s and Facebook’s servers under a subdomain like gmail.mysite.com or analytics.mysite.com? And then, boom, they can read and set cookies again, in some cases even on the top-level domain for website owners who don’t know better. Doesn’t this completely defeat the goals of Apple’s ITP, since Google and Facebook have now become a “second party” in some sense?
Here on StackOverflow, when we log out on iOS Safari, the StackOverflow network is able to log us out of multiple sites at once … how is that even accomplished if no one can track users across websites? I have heard it said that “second-party cookies” can still be stored, but what exactly makes a second-party cookie different from a third-party one?
My question is broken down into 6 cases but the overall theme is, in each case: how does Apple’s latest ITP work in that case, and how does it actually block all cases of potentially malicious tracking (to the point where a well-funded company can’t just do the workarounds above) while at the same time allowing legitimate use cases?
[1] https://webkit.org/blog/9521/intelligent-tracking-prevention-2-3/
I'm constantly getting banned from a website. I set download_delay = 10 in Scrapy, I tried the fake_user_agent package, then I tried implementing Tor and Polipo; according to this site the config is OK. But after running once or twice I got banned again! Can anyone help me here?
Note: I also want to try scrapy-proxie, but I can't get it activated.
You should take a look at what the documentation says.
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher). See DOWNLOAD_DELAY setting.
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
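A minimal settings sketch that puts a few of those tips together (the values are illustrative, and the commented-out middleware/package names are assumptions about how you might wire in user-agent or proxy rotation, not part of the question):

```python
# settings.py -- sketch of the "be polite" knobs mentioned above.

ROBOTSTXT_OBEY = True            # respect robots.txt where possible

DOWNLOAD_DELAY = 10              # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay (0.5x-1.5x) so requests
                                 # don't arrive on a perfectly regular clock

COOKIES_ENABLED = False          # some sites use cookies to spot bot behaviour

CONCURRENT_REQUESTS_PER_DOMAIN = 1   # don't hit a single host in parallel

AUTOTHROTTLE_ENABLED = True      # back off automatically when the site slows down
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# User-agent and proxy rotation are not built-in settings; they need a
# downloader middleware (your own, or a package such as scrapy-fake-useragent
# or scrapy-rotating-proxies), enabled roughly like this (hypothetical path):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,
# }
```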
- Use a delay between clicks/requests.
- Not Tor - all connections come from one address at a time, which is bad; rotate proxies after several visits instead.
- And check this post on web scraping etiquette.
I was wondering what Google's official policy is on linking my own websites together: do they forbid it, allow it, allow it as long as it's nofollow, etc.?
For clarification I will give both a white-hat and a black-hat example:
White-hat:
I'm a web designer who also has several affiliate websites. I designed those websites, so I would like to give myself credit by linking from the affiliate websites to my professional bio website, where people can hire me as a designer.
Black-hat:
I buy 100 different domains and link each one to the other 99, sharing all the link juice between them. The content of each website abides by Google's policy and isn't spammy; the only thing that's wrong is the fact that I have 99 links to each of them and I'm the only one doing the linking.
First solution - nofollow:
Well, if they are nofollow, I don't see why Google would care.
So, you'd probably be safe with that, if what you want to achieve is indeed giving yourself credit.
But, as for SEO, as you already know, the sites wouldn't benefit much.
However, with nofollow, even if you don't increase PageRank, the number of visits to each site should increase (traffic from your other sites). This could also be beneficial.
Second solution - portfolio site:
There is one scenario which could suit your purpose:
Create your "portfolio". A site with links to all the sites you created, as an example of your skills and stuff..
Place a link on each of your sites to this portfolio.
Now, you have a page with 100 outbound links, each perfectly legitimate. And each of your sites contains just one outbound link connecting it to your other sites.
This should be fine both for your presentation and for SEO, and you avoided having a link farm.
EDIT: You can find actual info from Google here: http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf
I am trying to ban users that spam my service by logging their IP and blocking it.
Of course this isn't safe at all, because of dynamic IP addresses.
Is there a way to identify a user that's 100% safe?
I've heard about something called evercookie, but I was easily able to delete that, and I guess that anyone capable of changing their IP can also keep their PC clean.
Are there any other options? Or is it just not possible?
A cookie will let you block the same browser from visiting your site, as long as the user doesn't delete it, turn off cookies, use a different browser, reinstall their browser, use another machine, etc.
There is no such thing as 100% safe. Spam is an ongoing problem that most websites just have to learn to deal with.
There are numerous highly secure options, mostly relying on multi-factor authentication and physical key generators like the ones RSA markets. But the real question is an economic one. The more draconian the authentication mechanism, the more quickly you kill your website as you scare off all your visitors.
More practical solutions involve CAPTCHA, forum moderation, spam-reporting affordances, etc. One particularly effective technique is to block offending content from every IP address except the one that originated it. That way, the original spammer thinks their content is still there, oblivious to the fact that no one else can see it.
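A minimal sketch of that last technique (often called shadow banning); the data structure and field names here are invented for illustration - the point is simply that visibility of a flagged post depends on who is asking:

```python
# A toy in-memory model of "hide it from everyone except its author".
# Field names and the storage are made up for illustration.

posts = [
    {"id": 1, "author_ip": "203.0.113.7", "body": "Nice article!", "flagged_as_spam": False},
    {"id": 2, "author_ip": "198.51.100.9", "body": "BUY CHEAP WATCHES", "flagged_as_spam": True},
]

def visible_posts(viewer_ip: str):
    """Return the posts this viewer should see.

    Flagged posts are shown only to the IP that created them, so the
    spammer still sees their content and doesn't realise it is hidden."""
    return [
        p for p in posts
        if not p["flagged_as_spam"] or p["author_ip"] == viewer_ip
    ]

print([p["id"] for p in visible_posts("198.51.100.9")])  # spammer sees [1, 2]
print([p["id"] for p in visible_posts("192.0.2.55")])    # everyone else sees [1]
```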
Alright, I get that it's impossible to identify a unique visitor with 100% certainty.
What are the things that I could do to:
- find out whether someone (anonymous) is using lots of different proxies to see my content (the problem here is that cookies would land on the machine of the proxy, and not the actual visitor's PC?)
- identify unique (anonymous) visitors with a dynamic IP
In order to allow for multiple policies regarding content (security, cookies, sessions, etc.), I'm considering moving some content from my sites to its own domains, and was wondering what kind of dividends that would pay off (if any).
I understand cookies are domain-specific and are sent with every request (even for images), so if they grow too large they could start affecting performance; moving static content in this way makes sense (to me at least).
Since I expect that someone out there has already done something similar, I was wondering if you could provide some feedback of the pros and the cons.
I don't know of any situation that fits your reasons that can't be controlled in the settings for the HTTP server, whether it be Apache, IIS or whatever else you might be using.
I assume you mean you want to split them up into separate hosts, i.e. www1.domain.com and www2.domain.com. And you are correct that cookies are host/domain-specific. However, there aren't really any gains if www1 and www2 are still the same computer. If you are experiencing load issues and split it between two different servers, there could be some gains there.
If you actually mean different domains (www.domain1.com & www.domain2.com) I'm not sure what kind of benefits you would be looking for...