Is there any way to find out which URLs of my websites are indexed by Google and which are not?
(e.g. site:http://example.com/site1.html)
What I tried:
Used the Google AJAX API -> the problem here is that the results are totally different from the ones I'm getting from the Google search.
Used the Google Custom Search API -> same problem here: the results differ from the ones Google shows (because it is actually more like a private search).
Used Jsoup to crawl Google -> since it's against their Terms, it's really hard to do. I set a timeout between every request (between 30s and 90s) and used proxies. Still, I can't crawl for long before Google blocks the IP.
What to do? :)
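For reference, a minimal sketch of the Custom Search API route described above, assuming you have an API key and a search-engine ID (YOUR_API_KEY and YOUR_CX are placeholders), and with the same caveat: its view of the index can differ from what google.com shows.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class IndexCheck {
    // Placeholders: substitute your own Custom Search API key and engine ID.
    private static final String API_KEY = "YOUR_API_KEY";
    private static final String CX = "YOUR_CX";

    public static void main(String[] args) throws Exception {
        String url = "http://example.com/site1.html";
        String query = URLEncoder.encode("site:" + url, StandardCharsets.UTF_8);
        String endpoint = "https://www.googleapis.com/customsearch/v1"
                + "?key=" + API_KEY + "&cx=" + CX + "&q=" + query;

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(endpoint)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Crude check: an indexed URL should show up in the "items" array
        // of the JSON response. A real version would parse the JSON properly.
        boolean indexed = response.body().contains("\"items\"");
        System.out.println(url + (indexed ? " appears indexed" : " not found"));
    }
}
```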
Related
I am using the Google CSE JSON API in order to receive results for generic queries from Google.
I have configured the engine to return results from all web sites.
I thought this setting would let me use Google as if I were searching from the Google web site via the regular search engine.
BUT - I don't get the same results that I would have expected to get from the site. There are major differences and I was wondering why.
From past reading, I know that the API uses certain servers around the UK, making the results differ due to locale settings.
I have read the documentation on the CSE site and saw that there are two parameters that I thought would improve the situation:
googlehost - specifies the Google domain to use. This parameter is deprecated according to the documentation, hence I used the second parameter.
gl - specifies a country code for the search.
Neither parameter affected my results at all.
I have been struggling with this for quite a long time and would appreciate a proper solution.
All I want is a CSE that acts the same as the Google web site: running the same search in either place should not return different results.
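For reference, a sketch of how these parameters are attached to a Custom Search JSON API request (the credentials are placeholders; this only illustrates the request shape, not a fix for the underlying discrepancy):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GlParamDemo {
    public static void main(String[] args) {
        // YOUR_API_KEY and YOUR_CX are placeholders for real CSE credentials.
        String endpoint = "https://www.googleapis.com/customsearch/v1"
                + "?key=YOUR_API_KEY&cx=YOUR_CX"
                + "&q=" + URLEncoder.encode("some query", StandardCharsets.UTF_8)
                + "&gl=us"; // gl: two-letter country code to geolocate the search
        System.out.println(endpoint); // request this URL with any HTTP client
    }
}
```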
We implemented Google Site Search on our company website. We need to automate Google's indexing of our website.
Suppose our customers update the forum: we need our forum search to show the up-to-date forum information.
Is there any option in a Google API, or any other API, that would help?
You can use an XML sitemap. This will tell the search engines where your content is so they can find it and crawl it. Keep in mind there is no way to make the search engines crawl your site when you want them to. They will crawl on a schedule they determine to be right for your site. (You can set a crawl rate in Google Webmaster Tools, but that rate is relative to the crawl rate Google has already set for you; setting it to the fastest option will not speed up their crawl.)
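For illustration, a minimal sitemap file (the URL and values are placeholders); you would reference it from robots.txt with a Sitemap: line or submit it in Webmaster Tools:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/forum/thread-123</loc>
    <lastmod>2012-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```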
Unfortunately, Google will only crawl your site when it feels like it. How often this occurs is based on many variables (i.e. site ranking, standards compliance, and so on). The sitemap XML is a helpful way to help Google determine what parts of your site to index; however, if you don't have one, Google will still find your content by crawling links on other parts of your site and updating its index if the page changes.
The more visitors you get, and the more often your site's links appear on other sites, the more frequently Google will index your site.
To start, I'd suggest http://validator.w3.org/ to validate your site and get it as close as possible to error-free. This makes it easier for Google to index your site, because it can find the information it expects without having to crawl over invalid markup. Also, chances are, if a site validates with a very small number of errors, it is more credible than one containing many errors. It tells the search engine that you update your site to ensure almost all browsers can use it and that it is accessible.
Also, validating your site gives you some bragging rights over those who don't meet W3C standards :)
Hope this helps!
My question is regarding the Google AJAX Search API. I have been trying to figure this out by exploring their site, with no luck. How can I use this API on my site but have the results be only the Google results from within my site (i.e. only the site:mydomain.com results, and NOT the results from a standard google.com search)? Is this even allowed per their terms of usage? Thanks.
You can make a Google Custom Search engine restricted to your site and tie the API to that.
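One way to wire that up is a sketch along these lines, assuming the Custom Search JSON API (rather than the AJAX API itself) and placeholder credentials; siteSearch restricts results to a given domain:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SiteRestrictedSearch {
    public static void main(String[] args) {
        // Placeholders: your own API key and the ID of a CSE tied to your site.
        String endpoint = "https://www.googleapis.com/customsearch/v1"
                + "?key=YOUR_API_KEY&cx=YOUR_CX"
                + "&q=" + URLEncoder.encode("search terms", StandardCharsets.UTF_8)
                + "&siteSearch=mydomain.com"  // only return results from this site
                + "&siteSearchFilter=i";      // "i" = include only, "e" = exclude
        System.out.println(endpoint); // fetch with any HTTP client; JSON comes back
    }
}
```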
Are all these types of sites just illegally scraping Google or another search engine?
As far as I can tell there is no 'legal' way to get this data for a commercial site. The Yahoo! API ( http://developer.yahoo.com/search/siteexplorer/V1/inlinkData.html ) is only for noncommercial use, Yahoo! BOSS does not allow automated queries, etc.
Any ideas?
For example, if you wanted to find all the links to Google's homepage, search for
link:http://www.google.com
So if you want to find all the inbound links, you can simply traverse your website's tree and, for each page you find, build its URL. Then query Google for:
link:URL
And you'll get a collection of all the links that Google has from other websites into your website.
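A sketch of the query-building step, using Jsoup (mentioned earlier in this thread; it requires org.jsoup:jsoup on the classpath). Crawling your own site is fine; actually automating the Google queries runs into the same Terms of Service problem described above, so this just prints the queries:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkQueryBuilder {
    public static void main(String[] args) throws Exception {
        // Fetch one page of your own site and build a link: query per URL found.
        // example.com is a placeholder for your own domain.
        Document doc = Jsoup.connect("http://example.com/").get();
        for (Element a : doc.select("a[href]")) {
            String url = a.attr("abs:href"); // resolve relative links
            if (url.startsWith("http://example.com")) {
                System.out.println("link:" + url); // run these searches manually
            }
        }
    }
}
```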
As for the legality of such harvesting, I'm sure it's not-exactly-legal to make a profit from it, but that's never stopped anyone before, has it?
(So I wouldn't bother wondering whether they did it or not. Just assume they do.)
I don't know what hubspot do, but, if you wanted to find out what sites link to your site, and you don't have the hardware to crawl the web, one thing you can do is monitor the HTTP_REFERER of visitors to your site. This is, for example, how Google Analytics (as far as I know) can tell you where your visitors are arriving from. This is not 100% reliable as not all browsers set it, particularly in "Privacy Mode", but you only need one visitor per link to know that it exists!
This is often accomplished by embedding a script into each of your webpages (often in a common header or footer). For example, if you examine the source of the page you are currently reading, you will find (right down at the bottom) a script that reports information about your visit back to Google.
Now this won't tell you if there are links out there that no one has ever used to get to your site, but let's face it, they are a lot less interesting than the ones people actually use.
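A minimal sketch of capturing the referrer server-side, assuming a Java servlet container (the javax.servlet API); the class name, path, and logging are illustrative only:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative servlet: logs the Referer header (the header name is
// misspelled that way in the HTTP spec) so you can see which external
// pages are sending you visitors.
public class ReferrerLoggingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String referrer = req.getHeader("Referer"); // null if the browser omits it
        if (referrer != null && !referrer.contains("example.com")) {
            // Hypothetical logging; swap in your own logger or database.
            System.out.println("Inbound link from: " + referrer);
        }
        resp.setContentType("text/html");
        resp.getWriter().println("<html><body>Hello</body></html>");
    }
}
```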
Technorati's got their Cosmos API, which works fairly well but limits you to noncommercial use and no more than 500 queries a day.
Yahoo's got a Site Explorer InLink Data API, but it defines the task very literally, returning links from sidebar widgets in blogs rather than just links from inside blog content.
Is there any other alternative for tracking who's linking to a given URL (think of the discussion links that run below stories on Techmeme.com)? Or will I have to roll my own?
Well, it's not an API, but if you google (for example): "link:nytimes.com", the search results that come back show inbound links to that site.
I haven't tried to implement what you want yet, but the Google search API almost certainly has that functionality built in.
Is this for links to URLs under your control?
If so, you could whip up something quick that logs entries in the Referrer HTTP header.
If you wanted to do this for an entire web site without altering application code, you could implement it as an ISAPI filter or equivalent for your web server of choice.
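In the Java world, a servlet Filter plays the same role as an ISAPI filter: it sees every request without any change to the application itself. A sketch, under the same javax.servlet assumption as the servlet example above:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletRequest;

// Illustrative filter: records the Referer header for every request to the
// site, with no changes to application code.
@WebFilter("/*")
public class ReferrerFilter implements Filter {
    @Override public void init(FilterConfig cfg) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        String referrer = ((HttpServletRequest) request).getHeader("Referer");
        if (referrer != null) {
            System.out.println("Referrer: " + referrer); // swap in real logging
        }
        chain.doFilter(request, response); // pass the request along unchanged
    }
}
```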
Information available publicly from web crawlers is always going to be incomplete and unreliable (not that my solution isn't...).