How can I extract data from multiple webpages using one import.io connector? - import.io

I need to extract data from multiple pages on a website. Can I do it with one connector, or do I have to create multiple connectors and group the data later? As an example, I am trying to collect data for stocks, but it is spread across multiple pages.
Here are the different pages I am trying to collect the information from:
http://www.moneycontrol.com/india/stockpricequote/powergenerationdistribution/suzlonenergy/SE17
http://www.moneycontrol.com/financials/suzlonenergy/balance-sheet/SE17#SE17
http://www.moneycontrol.com/financials/suzlonenergy/ratios/SE17#SE17
How do I write one extractor to fetch data from these different pages?

If the URLs of the pages are different, then it would be best to use an Extractor and paste the URLs into it; that way you can get live data at the click of a button. It would be really cool to get a few of the URLs so we can take a look at them.
Thanks!

Related

Search in multiple sites using Google Custom Search JSON

Trying to figure out how I can search multiple sites using the Google Custom Search JSON API, meaning that the search will only return results from a specific list of sites.
I was playing with the API explorer - https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list?apix_params=%7B%22cx%22%3A%22011602274690322925368%3Atkz2zvvpmk0%22%2C%22siteSearch%22%3A%22www.walla.co.il%22%7D
and noticed the siteSearch query parameter, but it can only accept a single string, not a list of sites.
What is the way to search only in specific sites?
Thanks
There are a couple of things you can do.
If you know the specific sites you want to search, you can add them as refinements to your engine, then query for that refinement by adding 'more:<REFINEMENT_LABEL>' to the query.
Or, add 'site:' operators to the query itself. For example: cats site:cnn.com OR site:bbc.com
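The second option is easy to script. Below is a minimal sketch in Python; `build_site_query` is a hypothetical helper (not part of any Google library), and the API key and engine ID you would pass to the actual CSE endpoint are placeholders:

```python
# Build a Custom Search query restricted to specific sites by
# combining "site:" operators with OR.
def build_site_query(terms, sites):
    site_clause = " OR ".join(f"site:{s}" for s in sites)
    return f"{terms} {site_clause}"

# The resulting string goes into the q parameter of the CSE API call:
# https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=ENGINE_ID&q=...
print(build_site_query("cats", ["cnn.com", "bbc.com"]))
# → cats site:cnn.com OR site:bbc.com
```

Note that the query string still counts against the normal query-length limits, so this approach works best with a handful of sites; for longer site lists, refinements scale better.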

How to download data from a series of pages?

Example:
I want to download all data from https://www.example.com/api.php?id=X (fictitious URL), where X is 1 to 1000 and each page is a JSON document containing one data row. (I don't want to add 1000 URLs manually, and OpenRefine does not seem to allow pasting a list of URLs.)
I want to download the information in the pages under
https://en.wikipedia.org/wiki/Category:Lists_of_horror_films_by_year, each of which contains one HTML table.
I want to download the data in all tables from all pages under https://en.wikipedia.org/wiki/Template:Earthquakes_by_year, each of which contains multiple HTML tables.
OpenRefine is not a web scraping tool. It has a feature to fetch web pages, but you will quickly hit a lot of limitations.
Example 1: you can prepare your list of URLs in spreadsheet software such as Excel or OpenOffice Calc, import it into an OpenRefine project, and use the 'Add column by fetching URLs' feature.
Examples 2 and 3: OpenRefine cannot crawl or follow links. You will need to:
extract the list of links from each page using OpenRefine
create a separate OpenRefine project with one link per row
fetch each page using the 'Add column by fetching URLs' feature
parse the HTML of each page
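For the first example, the URL list itself can be generated with a short script instead of a spreadsheet. This sketch (using the fictitious base URL from the question) writes the 1000 URLs to a CSV file that OpenRefine can import directly:

```python
import csv

# Generate the 1000 API URLs from the question.
base = "https://www.example.com/api.php?id={}"
urls = [base.format(i) for i in range(1, 1001)]

# Write them to a one-column CSV; the header row becomes the
# column name when the file is imported into OpenRefine.
with open("urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)
```

After importing, run 'Add column by fetching URLs' on the url column to pull down each JSON response.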

How to dynamically extract data from dropdown lists or multiple textboxes using import.io

I am making an API in which I want to dynamically get data from the site http://transportformumbai.com/mumbai_local_train.php
Depending on the start and end stations and the timings, I want to get the list of all available trains, along with the table shown by clicking on the view route column.
I am using the import.io connector, but while it works well with a single textbox, it does not work with multiple textboxes (refer to this link) or dropdown lists.
Can anyone guide me on what I should do next?
Apart from import.io, is there any other alternative?
I am a newbie working with crawlers, so please justify your answer.
What is web scraping? Do I have to use a web scraper?
Thank you.
Actually, if you look in the URL bar, the parameters for destination and time are defined there, so you don't need to worry about dropdown menus or using a Connector.
Use an Extractor on this page:
http://transportformumbai.com/get_schedule_new.php?user_route=western&start_station=khar_road&end_station=malad&start_time=00&end_time=18
Train it to get every column; note that the view route column contains links.
You can create a separate Extractor for the "view route" page:
http://transportformumbai.com/view_route_new.php?trainno=BYR1097&user_route=western&train_origin=Churchgate&train_end=Bhayandar&train_speed=S
Now "Chain" the second Extractor to the first one, and it will pull that information from every link found by the first.
If you want to choose different destinations and times, just change the URL parameters of the original link.
http://support.import.io/knowledgebase/articles/613374-how-do-i-get-data-behind-dropdown-menus
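Since the whole query lives in the URL, you can generate those URLs programmatically for any station/time combination. A minimal Python sketch, where `build_schedule_url` is a hypothetical helper and the parameter names are taken verbatim from the example link above:

```python
from urllib.parse import urlencode

# Assemble a schedule URL for transportformumbai.com by filling in
# the query parameters seen in the example link.
def build_schedule_url(route, start, end, start_time, end_time):
    params = {
        "user_route": route,
        "start_station": start,
        "end_station": end,
        "start_time": start_time,
        "end_time": end_time,
    }
    return "http://transportformumbai.com/get_schedule_new.php?" + urlencode(params)

print(build_schedule_url("western", "khar_road", "malad", "00", "18"))
```

Feeding a list of such URLs to an Extractor covers every destination/time combination you care about.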
Your best bet here seems to be an API call for every URL combination. You have to analyze the URL structure.

Google query for a mass of related websites

Is there a way to load a bunch of URLs, say a hundred of them, and query Google to find others related to those?
To be more specific, the parameter as_rq=www.example.com in a Google query searches for sites related to that URL. If I want to do this for a vast number of URLs, is there an option, or will I have to traverse all the URLs one by one?
Unfortunately, it is not possible to do multiple URL queries. I've tried to do this myself before with no luck, after searching multiple online forums.
Yep, it is possible via the Google CSE (Custom Search Engine) API: in the required parameter q=exampleQuery you insert q=as_rq=www.example.com, and by using annotations you can parametrize your search results.
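Each as_rq lookup targets a single URL, so you still issue one CSE request per site, but that loop is trivial to script. A sketch of building the per-site request URLs in Python; `related_query_urls` is a hypothetical helper, and API_KEY / ENGINE_ID are placeholders for your own credentials:

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def related_query_urls(sites, key="API_KEY", cx="ENGINE_ID"):
    # One CSE request per site: the related-sites operator goes
    # into the q parameter, which urlencode percent-escapes.
    return [
        CSE_ENDPOINT + "?" + urlencode({"key": key, "cx": cx, "q": f"as_rq={site}"})
        for site in sites
    ]

for url in related_query_urls(["www.example.com", "www.wikipedia.org"]):
    print(url)
```

You would then fetch each URL (e.g. with urllib.request) and merge the JSON results yourself, keeping an eye on the API's daily quota.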

How does the archive.org API work?

As you surely know, web.archive.org lets you inspect the history of a domain, e.g. http://web.archive.org/web/*/besttatoo.com
It also has an API: http://archive.org/help/json.php
I need to get data from the API, but I can't find much info on how to use it. Has anyone used it and can paste some examples of use?
This link provides details about the item LovingU on archive.org:
http://archive.org/details/LovingU&output=json
To create an API query to your liking, use this page:
https://archive.org/advancedsearch.php#raw
That page allows you to choose your output format (JSON, XML, HTML, CSV or RSS) and the parameters you want to see. You can limit the number of results, too.
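A query against the advancedsearch endpoint can be assembled in a few lines. This Python sketch builds a JSON request for a search term, limited to 10 rows; `archive_search_url` is a hypothetical helper, and the parameter names match what the advanced-search form itself generates:

```python
from urllib.parse import urlencode

def archive_search_url(query, rows=10):
    # Build an advancedsearch.php request that returns JSON.
    params = {
        "q": query,
        "rows": rows,
        "output": "json",
    }
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

# Fetching this URL (e.g. with urllib.request.urlopen) returns a
# Solr-style JSON document whose "response" object holds the
# matching docs.
print(archive_search_url("LovingU"))
```

Adding fl[] parameters to the dict would restrict which fields come back, mirroring the checkboxes on the advanced-search page.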