Crawling recursively with import.io

I want to crawl all the links, sub-links and so on, that live inside a page (recursively).
Is there a recursive option in import.io? If so, how do I use it?

Can you tell us more about the specific use case? What site / sub section are you trying to extract data from?
Based on your question, you may want to check out the "Chain APIs" feature.
Essentially, it lets one API extract a set of links and feed that set into a second API that extracts the sub-links.
http://support.import.io/knowledgebase/articles/629686-chain-apis-combine-two-apis
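
For readers who want to see the general idea of "links feeding into sub-links" outside of import.io, here is a minimal sketch in plain Python. It does not use or represent import.io's API. The `crawl` and `extract_links` functions, the `fetch` callback, and the in-memory `PAGES` site are all assumptions made up for illustration. Each level's extracted links become the input for the next level, up to a depth limit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments stripped) linked from the page."""
    parser = LinkCollector()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl: each level's links feed the next level.

    `fetch` is a caller-supplied function mapping a URL to its HTML,
    or None when the page is unavailable.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            if html is None:
                continue
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen


# Hypothetical in-memory site standing in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/c">C</a> <a href="/">home</a>',
    "http://example.com/b": "",
    "http://example.com/c": '<a href="/d">D</a>',
}

found = crawl("http://example.com/", PAGES.get, max_depth=2)
print(sorted(found))
```

The `seen` set keeps pages that link back to each other from being visited twice, and `max_depth` bounds how far the crawl follows sub-links. A real crawler would replace `PAGES.get` with an HTTP fetch and would typically restrict links to the starting domain.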

Related

custom google search autosuggestion

I have created a custom Google search application for my website using
https://www.google.com/cse/
Under the Autocomplete section I have enabled autocomplete, but it still doesn't suggest any options when I start typing in the search box.
It suggests only the keywords that we define under custom autocompletions.
So my question is: do we need to provide all the custom keywords that we want autocompleted in the search box, or does Google create its own autosuggestions from the website?

It creates autosuggestions from the website, but it may take a few days to collect all the data. It also takes into account users' queries on your website.

As the previous answer suggests, autocomplete can take a few days to begin working. In addition, you must specify one or more sites or pages to which the custom search engine is restricted. On occasion, you may also have to specify that the CSE searches only those sites or pages instead of simply emphasizing them in the results.

How to add dynamic links to Google and its bots

I have a website which, when you first visit it, just displays the normal domain, so /. When users submit the form, they are forwarded to, let's say, /question/DYNAMIC (the question ID).
So Google has no way to see these links.
Is there a way to tell Google about all of these links without entering them manually and without having to keep the list up to date, since some questions might be removed at a later date?

Submit an XML sitemap
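
One way to keep such a sitemap current is to generate it from whatever store holds the questions. The sketch below is only an illustration under assumptions: the `build_sitemap` function, the example domain, and the question IDs are hypothetical, and the `/question/<id>` pattern is taken from the question above. It uses the standard sitemap namespace and Python's built-in XML library.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, question_ids):
    """Return sitemap XML listing one /question/<id> URL per live question."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for qid in question_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = f"{base_url}/question/{qid}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical IDs; in practice these would come from the database,
# so removed questions drop out the next time the sitemap is built.
sitemap_xml = build_sitemap("https://example.com", [101, 102, 250])
print(sitemap_xml)
```

Regenerating the file on a schedule, or whenever a question is added or removed, avoids maintaining the list by hand.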

How to force google to show my first page from a page set with pagination?

I have a website that contains, for example, a list of Audi models. Using Google Webmaster Tools, I saw that my website appears in Google search results for the word "audi", but the target page was the 22nd page of my result set, not the first. I need my first page to appear, not my last (or a middle one), but I cannot tell Google that the page number is a parameter, because my URLs are rewritten using mod_rewrite. Any ideas?
BTW, I have read in an SEO forum that it's a bad idea to use a canonical tag. Is it really a bad idea in my case?

You can't force Google to do anything; however, they have made it easier to deal with pagination issues with a recent post on rel="next" and rel="prev".
But the primary problem you face is signalling to Google that your first (main) page is the starting point. This is achieved by focusing internal link and back-link "juice" on that page. You need to ensure that the first page of results is linked to properly from higher-value pages (like the home page).
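
As a rough illustration of the rel="next" and rel="prev" markup mentioned above, the following sketch builds the `<link>` tags for the `<head>` of one page in a paginated list. The `pagination_links` function and the URL scheme (page 1 at the bare URL, later pages at `/page/<n>`) are assumptions for the example, not something taken from the asker's site.

```python
from html import escape


def pagination_links(base_url, page, last_page):
    """Return <link> tags for the <head> of page `page` in a paginated list."""

    def page_url(n):
        # Assumption: page 1 lives at the bare URL, later pages at /page/<n>.
        return base_url if n == 1 else f"{base_url}/page/{n}"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{escape(page_url(page - 1))}">')
    if page < last_page:
        tags.append(f'<link rel="next" href="{escape(page_url(page + 1))}">')
    return tags


# Hypothetical 22-page listing, matching the page count in the question.
for tag in pagination_links("https://example.com/audi", 2, 22):
    print(tag)
```

The first page gets only a "next" link and the last page only a "prev" link, which is what marks the two ends of the series.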

Google recently announced that you can use View All, which will allow them to find and index entire articles that are normally broken up by pagination and display them all as one result.

Normal Google Custom Search

I'm writing an application that analyses search engine results.
With the Google Search API now being deprecated and limited to 1000 queries/day, they are forcing developers to move to the AJAX APIs and to use the Custom Search API to do a Google search.
The thing is, I don't need a Custom Search. I need a general search, not one that is filtered by site (OK, maybe filtered by USA/UK: Google.com/Google.co.uk).
Does anyone know how to just do a regular Google search using the AJAX APIs? Is the Custom Search the right thing to be using?
I don't want to hit the 1000/day limit using the old service but this is exactly what I need.
I did find: How do I create a CSE that searches the entire web?
http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=1210656
But by the sounds of it this will distort the search results.
Thank you.

OK. Here's how I think it is done.
Create a Custom Search Engine.
Add a site such as *.com
When this is created, go to the Advanced tab and download the context XML.
Remove the Background Label associated with the site.
Upload the XML to replace the previous context.
This seems to work just fine and is returning the same values as far as I can see.

Yes, you are right *in theory, and this should let you get 100 results a day on the fly. Just this Saturday though, Google confirmed how here -
(* so far though, we can't get it working...)

Is there a way to prevent Googlebot from indexing certain parts of a page?

Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details (phone number etc.) who want them visible on the site but would rather they not be Google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.

What you're asking for can't really be done; Google either takes the entire page or none of it.
You could do some sneaky tricks, though, like putting the part of the page you don't want indexed in an iframe and using robots.txt to ask Google not to index that iframe.
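
To make the iframe idea concrete, here is a small sketch that checks a robots.txt rule with Python's standard `urllib.robotparser`. The `/frames/` directory, the file names, and the domain are hypothetical. The assumption is that the sensitive fragment is served from its own URL under `/frames/` and embedded in the main page via an iframe. Note that a Disallow rule tells compliant crawlers not to fetch the URL, so this only illustrates how the rule is evaluated.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything under /frames/ is off-limits to crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /frames/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The main page stays crawlable; the iframe's source URL does not.
main_ok = parser.can_fetch("Googlebot", "https://example.com/contact.html")
frame_ok = parser.can_fetch("Googlebot", "https://example.com/frames/phone.html")

print("main page crawlable:", main_ok)
print("iframe source crawlable:", frame_ok)
```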

In short, NO, unless you use cloaking, which is discouraged by Google.

Please check out the official documentation here:
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->

I found a useful resource on keeping certain duplicate content on a page while preventing search engines from indexing it.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->

On your server, detect the search bot by IP address using PHP or ASP. Then serve the IP addresses that fall into that list a version of the page that you wish to be indexed. In that search-engine-friendly version of your page, use the canonical link tag to tell the search engine which page version you do not want indexed.
This way, the page with the content that you do not want indexed is indexed by address only, while only the content you wish to be indexed is indexed. This method will not get you blocked by the search engines and is completely safe.

Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt that lists the portions you don't want indexed, such as wp-admin or a particular post or page. Before creating one, check your site's existing robots.txt, for example www.yoursite.com/robots.txt.

All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.

There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.
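
As an illustration of the meta-tag option, the sketch below reads the robots meta tag from a page and reports its directives. The `robots_directives` helper and the sample page are made up for the example. Keep in mind that this tag applies to the whole page, not to part of it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks bots not to index it but to follow its links.
PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Hi</body></html>"
)
directives = robots_directives(PAGE)
print(sorted(directives))
```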
