We are planning to put a large number of Business Research Reports and Articles from our intranet onto the Internet. However, we don't want others to copy the content and host it on their own sites.
I read about protection by CAPTCHA and was wondering if this is possible. Readers should be able to read 50% of the article for FREE, after which a CAPTCHA should be entered to read the rest of the article. [In this way we are making life a little harder for those copycats.]
Any pointers on how to implement this? The content is in HTML, and we have programming experience in Perl and PHP. We can hire others if required.
Additionally, search engines will crawl only half of the article, and I am wondering if they will penalize the site for not being able to crawl the rest, since they won't be able to crack the CAPTCHA.
Thanks.
There's a really good CAPTCHA service provided by reCAPTCHA - http://recaptcha.net/
There is a PHP class that you can use to do all the hard work.
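To give a feel for the server side, here's a minimal sketch in PHP. It targets Google's current reCAPTCHA v2 "siteverify" endpoint rather than the original recaptcha.net class, so treat the endpoint and the g-recaptcha-response field name as assumptions to check against whichever library version you use:

    <?php
    // Minimal sketch: verify a reCAPTCHA v2 response server-side.
    // 'g-recaptcha-response' is posted automatically by the reCAPTCHA widget.
    $secret   = 'YOUR_SECRET_KEY';
    $response = isset($_POST['g-recaptcha-response']) ? $_POST['g-recaptcha-response'] : '';

    $ch = curl_init('https://www.google.com/recaptcha/api/siteverify');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'secret'   => $secret,
        'response' => $response,
        'remoteip' => $_SERVER['REMOTE_ADDR'],
    )));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = json_decode(curl_exec($ch), true);
    curl_close($ch);

    if (!empty($result['success'])) {
        // Verified as human: release the protected content.
    } else {
        // Verification failed: show the CAPTCHA again.
    }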
It's important to bear in mind that search engines aren't able to solve a CAPTCHA, so they will only index the first half of the report. As long as this half contains most of the relevant keywords, it shouldn't cause a massive problem. Don't make the mistake of "detecting" a search engine and showing it different content than a normal user sees, as the major search engines treat this as spamming.
An alternative solution would be to use a service like Copyscape (http://www.copyscape.com/) to protect your content.
I know this is not what you're asking, but please take into account that CAPTCHAs are universally broken and will not protect your content. You said the first half is free; does that mean you intend to charge for the other half? A CAPTCHA won't help you here at all...
But even if you're just trying to prevent automated scraping, a CAPTCHA still won't do the trick. Check out my answer to another CAPTCHA question... Or you can go straight to the ppt I presented at OWASP last year.
Readers should be able to read 50% of the article for FREE, after which a CAPTCHA should be entered to read the rest of the article
Have your PHP programmer output 50% of the article. At the bottom, add a CAPTCHA. If the user types in the correct CAPTCHA, output 100% of the article.
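A minimal sketch of that flow, where load_article() and captcha_is_valid() are hypothetical placeholders for your article storage and whichever CAPTCHA library you pick:

    <?php
    session_start();

    // Hypothetical helpers: load_article() fetches the article HTML,
    // captcha_is_valid() wraps your CAPTCHA library's check.
    $id      = isset($_GET['id']) ? $_GET['id'] : '';
    $article = load_article($id);

    // Remember per-session which articles have been unlocked.
    if (isset($_POST['captcha']) && captcha_is_valid($_POST['captcha'])) {
        $_SESSION['unlocked'][$id] = true;
    }

    if (!empty($_SESSION['unlocked'][$id])) {
        echo $article; // full article
    } else {
        // Naive 50% split; in practice, cut on a paragraph boundary so
        // the HTML isn't broken mid-tag.
        echo substr($article, 0, (int) (strlen($article) / 2));
        // ... render the CAPTCHA form here ...
    }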
Any pointers on how to implement this? The content is in HTML, and we have programming experience in Perl and PHP. We can hire others if required.
As a PHP programmer, I use http://www.phpcaptcha.org (Securimage) to implement CAPTCHAs.
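Based on that library's documented usage, validation looks roughly like this; double-check the captcha_code field name against the version you download:

    <?php
    session_start();
    require_once 'securimage.php'; // from phpcaptcha.org

    $securimage = new Securimage();

    // check() compares the submitted code to the one stored in the session.
    if ($securimage->check($_POST['captcha_code']) === false) {
        exit('Incorrect CAPTCHA code, please try again.');
    }
    // CAPTCHA passed: continue processing the request.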
Additionally, search engines will crawl only half of the article, and I am wondering if they will penalize the site for not being able to crawl the rest, since they won't be able to crack the CAPTCHA.
No, it won't penalize you, but that particular section will not be shown in the search results.
As already mentioned, reCAPTCHA is a good way to go.
Have a look at Captcha::reCAPTCHA on CPAN which, according to its CPAN reviews, "Works out of the box".
If you want a CAPTCHA, there are plenty of modules that do this on CPAN ;-)
Hope that helps.
We all know that showing nonexistent stuff to Google's bots is not allowed and will hurt your search positioning, but what about the other way around: showing stuff to visitors that is not displayed to Google's bots?
I need to do this because I have photo pages, each with a short title and the photo, along with a textarea containing the embed HTML code. Googlebot is taking the embed code and using it as the page description in its search results, which is very ugly.
Please advise.
When you start playing with tricks like that, you need to consider several things.
... showing stuff to visitors that is not displayed to Google's bots.
That approach is a bit tricky.
You can certainly check User-Agent headers to see whether a visitor is Googlebot, but Google can add any number of new spiders with different User-Agents, which will index your images in the end. You will have to monitor that constantly.
Every code release of your website will have to be tested against the "images and Googlebot" scenario. That will lengthen the testing phase and increase its cost.
It can also affect future development: all changes will have to be made with the "images and Googlebot" scenario in mind, which can introduce additional constraints into your system.
Personally, I would choose a slightly different approach:
First of all, review whether you can use any of the methods recommended by Google. Google provides a few helpful pages describing the problem, e.g. Blocking Google or Block or remove pages using a robots.txt file (see the sketch after this list).
If that is not enough, maybe restructuring your HTML would help. Consider using JavaScript to build some customer-facing interfaces.
And whatever you do, try to keep it as simple as possible; otherwise a very complex solution can turn around and bite you.
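For instance, if the photo pages share a common path (the /photos/ prefix here is just an assumption), a robots.txt rule like this keeps Googlebot away from them entirely:

    User-agent: Googlebot
    Disallow: /photos/

Bear in mind this removes those pages from Google's index altogether, which is a heavier hammer than just fixing the snippet.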
It is very difficult to give good advice without knowledge of your system, constraints, and strategy, but I hope my answer helps you choose a good architecture / solution for your system.
Google does not judge you to be cheating out of hand; it reviews the case. As long as your purpose is the user experience and you avoid the common cheating tactics, Google will not consider it cheating.
Just block these pages with robots.txt and you'll be fine. It is not cheating; that's why they came up with a solution like that in the first place.
What is a good reason to choose reCAPTCHA over a well-known and tested CAPTCHA generator on the server? Is it just philanthropy (helping with digitizing texts), or are there other good reasons?
reCAPTCHA is rather neat. Not only does it stop spammers, but it helps digitize books. Each word that appears in the CAPTCHA has actually been scanned in from a book, but sometimes the character recognition is off, so the computer may save a gibberish sentence without knowing any better.
See the example image on their site.
By making people type in what they think the word is, it helps create an accurate digital copy of the scanned book or word, while at the same time checking what the user submits against others' submissions to determine whether the user is human.
For that reason I use reCAPTCHA. I'm not just selfishly protecting my site; I'm providing a service for others.
Not only that, but it's fairly simple to implement and is provided by a reliable company (Google).
The question was "why should I use it"; that question must include "why shouldn't I use it", so some criticisms:
Recaptcha volunteers your users to be OCR monkeys, without bothering to ask their opinion.
It requires that you advertise recaptcha in the captcha widget, which isn't always appropriate.
It's a web service, which means there's no hard guarantee it'll still exist a week or a year or two years from now. (Google has crippled or removed public, widely-used APIs in the past, such as their translation API.)
It only supports web pages, loading everything with scripts and iframes. It doesn't have a proper API, so if you ever want to have an iOS or Android app that logs into your system, and need to show a captcha there, you'll be out of luck.
You have no control over the complexity of the generated captcha. Captchas always have a tradeoff between how hard they are to read and how difficult they are to OCR. There are no knobs to adjust, based on how important stopping robots is to your use case. If they decide to make the captchas much harder to read (which they've done at times), and this becomes a nuisance to your users, there's nothing you can do about it.
reCAPTCHA is quite good. Most other generators are easily broken, while reCAPTCHA usually holds up well.
Another good thing is that it has an audio option for accessibility, which reads the challenge aloud.
This is an old thread, but I would just like to confirm that in my case we used reCAPTCHA on a number of Drupal 6 websites in combination with the Honeypot module. We did that to stop automated spam user registrations.
I presume these user accounts were being created automatically by desktop applications such as SEnuke XCr and XRumer with the aim of then posting spam. They create the user account but rarely do anything further, which I still found annoying. Further reading on this subject can be found here: How to prevent spam user registrations? (links to an article on Drupal.org).
I can confirm that the above reduced my spam user registrations from a little over 100 a day to none at all.
We would need to register the IP address on which the server runs, which seems somewhat risky. So we might be required to change our registration workflow if we use reCAPTCHA.
I can't seem to find any information on how Google determines whether you are cloaking your content. How, from a technical standpoint, do you think they determine this? Are they sending in agents other than Googlebot and comparing the results to Googlebot's? Do they have a team of human beings comparing? Or can they somehow tell that you checked the user agent and executed a different code path because you saw "googlebot" in the name?
It's in relation to this question on legitimate URL cloaking for SEO. If the textual content is exactly the same but the rendering is different (1995-style HTML vs. AJAX vs. Flash), is there really a problem with cloaking?
Thanks for your input on this one.
As far as I know, how Google prepares search engine results is secret and constantly changing. Spoofing different user-agents is easy, so they might do that. They also might, in the case of JavaScript, actually render partial or entire pages. "Do they have a team of human beings comparing?" This is doubtful. A lot has been written on Google's crawling strategies, including this, but if humans are involved, they're only called in for specific cases. I even doubt this: any person-power spent is probably spent tweaking the crawling engine.
Google looks at your site while presenting user-agents other than Googlebot.
See page 11 of the Google Chrome comic book, where it describes (even better than in layman's terms) how a Google tool can take a schematic of a web page. They could be using this or similar technology for Google search indexing and cloak detection; at least, that would be another good use for it.
Google does hire contractors (indirectly, through an outside agency, for very low pay) to manually review documents returned as search results and judge their relevance to the search terms, quality of translations, etc. I highly doubt that this is their only tool for detecting cloaking, but it is one of them.
In reality, many of Google's algos are trivially reversed and are far from rocket science. In the case of so-called "cloaking detection", all of the previous guesses are on the money (apart from, somewhat ironically, John K lol). If you don't believe me, set up some test sites (inputs) and some 'cloaking test cases' (further inputs), submit your sites to uncle Google (processing), and test your non-assumptions via pseudo-advanced human-based cognitive correlationary quantum perceptions (<-- btw, I made that up for entertainment value (and now I'm nesting parentheses to really mess with your mind :)) AKA "checking Google results to see if you are banned yet" (outputs). Loop until enlightenment == True (noob!) lol
A very simple test would be to compare the file size of a webpage as Googlebot saw it against the file size of the same page fetched under a Google alias that looks like a normal user.
This would flag the most suspect candidates for closer examination.
They can fetch your page with a tool like curl and construct one hash of the page served with a normal user agent, then another hash of the page served with the Googlebot user agent. The two hashes should be similar; they have algorithms to compare the hashes and decide whether it is cloaking or not.
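As a toy illustration of that idea in PHP (not Google's actual method; the URL and user-agent strings are placeholders):

    <?php
    // Fetch the same page twice with different User-Agent headers
    // and compare content hashes.
    function fetch_as($url, $userAgent) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }

    $url    = 'http://example.com/some-page';
    $normal = fetch_as($url, 'Mozilla/5.0 (Windows NT 10.0; rv:100.0) Firefox/100.0');
    $asBot  = fetch_as($url, 'Googlebot/2.1 (+http://www.google.com/bot.html)');

    // Identical hashes mean no cloaking; differing hashes need a smarter
    // similarity check, since ads and timestamps vary legitimately.
    echo md5($normal) === md5($asBot) ? "same content\n" : "content differs\n";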
I'm working on a website for which I've been asked to add, to the homepage's footer, a list of all the products that are sold on the website, along with links to the products' detail pages.
The problem is that there are about 900 items to display.
Not only does that not look good, it also makes the page render a lot slower.
I've been told that such a technique would improve the website's visibility in search engines.
I've also heard that such techniques can have the opposite effect: Google seeing it as "spam".
My question is: is listing a website's products on its homepage really effective for becoming more visible in search engines?
That technique is called keyword stuffing and Google says that it's not a good idea:
"Keyword stuffing" refers to the practice of loading a webpage with keywords in an attempt to manipulate a site's ranking in Google's search results. Filling pages with keywords results in a negative user experience, and can harm your site's ranking. Focus on creating useful, information-rich content that uses keywords appropriately and in context.
Now you might want to ask: Does their crawler really realize that the list at the bottom of the page is just keyword stuffing? Well, that's a question that only Google could answer (and I'm pretty sure they don't want to). In any case: even if you could make a keyword-stuffing block that is not recognized, they will probably improve their algorithm and -- sooner or later -- discover the truth. My recommendation: don't do it.
If you want to optimize your search engine page ranking, do it "the right way" and read the Search Engine Optimization Guide published by Google.
Google is likely to see a huge list of keywords at the bottom of each page as spam. I'd highly recommend not doing this.
When is it ever a good idea to present 900 items to a user? Good practice dictates that large lists are paginated to avoid giving the user a huge blob of stuff to wade through at once.
That's a good rule of thumb: if you're doing it to help the user, it's probably good; if you're doing it purely to help a machine (i.e. Google/Bing), it's probably a bad idea.
You can return different HTML to genuine users and to Google by inspecting the user agent of the web request.
That way you can provide the Google bot with a lot more text than you'd give a human user.
Update: People have pointed out that you shouldn't do this. I'm leaving this answer up, though, so that people know it's possible but bad.
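For completeness, the (discouraged) technique looks something like this in PHP; the content variables are placeholders, and the substring check is a simplification, since anyone can fake the header:

    <?php
    // Discouraged: user-agent cloaking, shown only for illustration.
    // Search engines treat this as spam.
    $normalContent  = '<p>Regular page for human visitors.</p>';
    $botOnlyContent = '<p>Keyword-rich text served only to the crawler.</p>';

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Naive check; real crawlers are verified by reverse-DNS lookup.
    echo (stripos($ua, 'Googlebot') !== false) ? $botOnlyContent : $normalContent;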
Of all the forms of CAPTCHA available, which one is the "least crackable" while remaining fairly human-readable?
I believe that CAPTCHA is dying. If someone really wants to break it, it will be broken. I read (somewhere, I don't remember where) about a site that gave visitors free porn in exchange for solving CAPTCHAs relayed from other sites, so even unbroken CAPTCHAs can be defeated by bots with human help. So, why bother?
Anyone who really wants to break this padlock can use a pair of bolt cutters, so why bother with the lock?
Anyone who really wants to steal this car can drive up with a tow truck, so why bother locking my car?
Anyone who really wants to open this safe can cut it open with an oxyacetylene torch, so why bother putting things in the safe?
Because using the padlock, locking your car, putting valuables in a safe, and using a CAPTCHA weeds out a large spectrum of relatively unsophisticated or unmotivated attackers. The fact that it doesn't stop sophisticated, highly motivated attackers doesn't mean that it doesn't work at all. Using a CAPTCHA isn't going to stop all spammers, but it's going to tremendously reduce the amount that requires filtering or manual intervention.
Heck, look at the lame CAPTCHA that Jeff uses on his blog. Even a wimpy barrier like that still provides a lot of protection.
I agree with Thomas. Captcha is on its way out. But if you must use it, reCAPTCHA is a pretty good provider with a simple API.
I believe that CAPTCHA is dying. If someone really wants to break it, it will be broken. I read (somewhere, I don't remember where) about a site that gave visitors free porn in exchange for solving CAPTCHAs relayed from other sites, so even unbroken CAPTCHAs can be defeated by bots with human help. So, why bother?
If you're a small enough site, no one would bother.
If you're still looking for a CAPTCHA, I like tEABAG_3D by the OCR Research Team. It's complicated to break and uses your 3D vision. Plus, it's developed by people who break CAPTCHAs for fun.
If you're just looking for a CAPTCHA to prevent spammers from bombing your blog, the best option is something simple but unique. For example, ask the user to type the word "Cat" into a box. The advantage of this is that no targeted CAPTCHA-breaker has been developed for your particular solution, and your small blog isn't important enough for someone to actually develop one. I've used such a CAPTCHA on my blog with some success for a couple of years now.
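A minimal sketch of such a check in PHP (the field name and question are placeholders to adapt):

    <?php
    // Trivial site-specific CAPTCHA: no generic bot knows your question.
    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        $answer = isset($_POST['animal']) ? trim($_POST['animal']) : '';
        if (strcasecmp($answer, 'Cat') !== 0) {
            exit('Wrong answer, please try again.');
        }
        // Answer accepted: save the comment here.
    }
    ?>
    <form method="post">
        <label>Type the word "Cat": <input name="animal"></label>
        <button type="submit">Post comment</button>
    </form>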
This is hard to really know, because I believe a CAPTCHA gets broken long before anybody knows about it. There is an economic incentive for those who break them to keep quiet.
I used to work with a guy whose job revolved mostly around breaking CAPTCHAs, and I can tell you the one currently giving them fits is reCAPTCHA.
Now, does that mean it will forever? Call me skeptical.
I wonder if a CAPTCHA mechanism that uses a collage made of pictures, and asks the human to identify what he sees in the collage image, would be much more crack-proof than the text-and-number kind. Imagine that the mechanism stitches pictures of a cat, a cup, and a car into a collage image and expects the human visitor to tick the checkboxes for cat, cup, and car. How long do you think it will take hackers and crackers to come up with an algorithm to crack the mechanism (i.e. extract the image elements from the collage and recognize the object depicted by each picture)?
If you want, you could try out the Microsoft Research project Asirra: http://research.microsoft.com/asirra/
CAPTCHAs, I believe, should be weighed heavily when designing the UX. They're slow, cumbersome, and a very poor user experience. They are useful, don't get me wrong, but perhaps you should look into designing a honeypot instead.
A honeypot is created by adding a hidden field at the bottom of the form. Because spam bots blindly fill in all the fields on the page, you can do a check like this:
<?php
// honeypotfield is hidden from humans (e.g. via CSS); only bots fill it in.
if (!empty($_POST['honeypotfield'])) {
    exit('No spam, thank you.');
} else {
    // Proceed with processing the form.
}
This works until a spambot is designed specifically for your site, at which point it can choose to fill out only the visible input fields.
For more information: http://haacked.com/archive/2007/09/11/honeypot-captcha.aspx/
As far as I know, Google's is the best there is. It hasn't been broken by computer programs yet. What I do know the crackers have been doing is copying the image and sending it to phishing websites, where humans solve it in order to enter those sites.
It doesn't matter whether CAPTCHAs are broken or not now -- there are Indian firms that do nothing but process CAPTCHAs by hand. I'm with the rest of the group in saying that CAPTCHAs are on their way out.
Here is a cool link on creating a CAPTCHA: http://www.codeproject.com/aspnet/CaptchaImage.asp
Just... don't. There are several reasons the use of CAPTCHAs is not advised.
http://www.interfacegeek.com/dont-ever-use-captchas/
I use uniqpin.com - it's easy to use and not annoying for users. Bots can recognize text, but they can't recognize an image.
Death by Captcha can solve any regular CAPTCHA (including reCAPTCHA), but not the Speedcoin Cryptocurrency Captcha.
Death by Captcha - http://deathbycaptcha.com
Speedcoin Captcha - http://speedcoin.co/info/captcha/Speedcoin_Captcha.html