How to download data from a series of pages with OpenRefine?

Examples:
1. I want to download all the data from https://www.example.com/api.php?id=X (fictitious URL), where X is 1 to 1000 and each page is a JSON document containing one data row. (I don't want to add 1000 URLs manually, and OpenRefine does not seem to allow pasting a list of URLs.)
2. I want to download the information in the pages listed at https://en.wikipedia.org/wiki/Category:Lists_of_horror_films_by_year, each of which contains one HTML table.
3. I want to download the data in all tables from all pages listed at https://en.wikipedia.org/wiki/Template:Earthquakes_by_year, each of which contains multiple HTML tables.

OpenRefine is not a web scraping tool. It can fetch web pages, but you will quickly run into its limitations.
For example 1, you can prepare your list of URLs in spreadsheet software such as Excel or OpenOffice Calc, import that file as a new OpenRefine project, and use the Add column by fetching URLs feature.
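If you would rather not type the URLs by hand, a short script can generate the list for you. Below is a minimal sketch, assuming the fictitious URL pattern from example 1; it writes a one-column CSV that can then be imported as a new OpenRefine project before using the fetch feature.

```python
# Generate the 1000 example URLs as a CSV with a single "url" column.
# The URL pattern is the fictitious one from example 1.
import csv

with open("urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    for i in range(1, 1001):
        writer.writerow([f"https://www.example.com/api.php?id={i}"])
```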
For examples 2 and 3: OpenRefine cannot crawl or follow links. You will need to (a scripted sketch of the link-extraction step follows the list):
extract the list of links from each page using OpenRefine
create a separate OpenRefine project with one link per row
fetch each page using the Add column by fetching URLs feature
parse the HTML of each page
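If you prefer to script the link-extraction step instead of doing it inside OpenRefine, here is a minimal sketch, assuming Python with the requests and BeautifulSoup libraries; it collects the member-page links from the example 2 category page and writes one URL per row, ready to import as the OpenRefine project used in the remaining steps.

```python
# Collect the member-page links of the Wikipedia category from example 2
# and write them to a one-column CSV (one link per row).
import csv
import requests
from bs4 import BeautifulSoup

CATEGORY_URL = "https://en.wikipedia.org/wiki/Category:Lists_of_horror_films_by_year"

resp = requests.get(CATEGORY_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# On Wikipedia category pages, the member pages are listed inside div#mw-pages.
links = [
    "https://en.wikipedia.org" + a["href"]
    for a in soup.select("#mw-pages a[href^='/wiki/']")
]
links = list(dict.fromkeys(links))  # drop duplicates while keeping order

with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)
```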

Related

Exporting WebCenter Content to XML

I am attempting to migrate content from Oracle's WebCenter CMS into our organization's primary CMS. All the different parts of the page templates are separate XML snippets that get pulled together and converted into HTML for production deployments. I am trying to find a way to export a page as XML so I can get at just the content; I don't need styles, JS, or images.
There are some built-in web services and the ability to create custom ones. Is there any way to get the system to output XML, or to get a mapping of all the XML files so I can merge them myself?
Not sure if this will do what you want, but if you add &IsSoap=1 to the end of the URL, the response is returned in XML format. You can view the page data by using the following settings:
• IsJava
• IsSoap
• IsJson
• IsPageDebug
These may help as well. Here and here.
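For instance, a request with IsSoap=1 appended can be made from a script and the XML saved for later processing. This is only a rough sketch, assuming a typical Content Server URL; the server address, service name, and content ID below are placeholders, not values from the question.

```python
# Fetch a WebCenter Content page with IsSoap=1 so the response comes back as XML.
# The endpoint, service, and content ID are hypothetical placeholders.
import requests

base_url = "https://example-server/cs/idcplg"   # placeholder Content Server endpoint
params = {
    "IdcService": "GET_FILE",                   # placeholder service call
    "dDocName": "MY_CONTENT_ITEM",              # placeholder content item
    "IsSoap": "1",                              # ask for XML instead of rendered HTML
}

resp = requests.get(base_url, params=params, timeout=30)
resp.raise_for_status()
print(resp.text)  # raw XML you can parse or archive
```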
If this is for a Site Studio website, you should be able to turn on Site Studio section tracing, clear the server output, view the website page, refresh the output, and it should show you details about the content items it retrieved.

How to dynamically extract data from dropdown lists or multiple textboxes using import.io

I am making an API for which I want to dynamically get data from the site http://transportformumbai.com/mumbai_local_train.php
Depending on the start and end stations and the timings, I want to get the list of all available trains, along with the table shown by clicking an entry in the view route column.
I am using an import.io connector, but it works well with a single textbox, not with multiple textboxes (refer to this link) or dropdown lists.
Can anyone guide me on what I should do next?
Apart from import.io, is there any other alternative?
I am a newbie working with crawlers, so please explain your answer.
What is web scraping? Do I have to use a web scraper?
Thank you.
Actually, if you look in the URL bar, the parameters for destination and time are defined right there, so you don't need to worry about drop-down menus or using a Connector.
Use an Extractor on this page:
http://transportformumbai.com/get_schedule_new.php?user_route=western&start_station=khar_road&end_station=malad&start_time=00&end_time=18
Train it to get every column - note that the view route column contains links.
You can create a separate Extractor for the "view route" page:
http://transportformumbai.com/view_route_new.php?trainno=BYR1097&user_route=western&train_origin=Churchgate&train_end=Bhayandar&train_speed=S
Now you should "Chain" the second Extractor to the first one and it will pull that information from every link on the first one.
If you want to choose different destinations and times, just change the URL parameters of the original link.
http://support.import.io/knowledgebase/articles/613374-how-do-i-get-data-behind-dropdown-menus
Your best bet here seems to be having an API call for every URL combination; you have to analyze the URL structure.
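Since every combination is just a different set of query parameters, the URLs themselves are easy to generate. Here is a minimal sketch, assuming the parameter names visible in the URL above; the station and time values are only examples.

```python
# Build schedule URLs by varying the query parameters seen in the URL above.
from urllib.parse import urlencode

BASE = "http://transportformumbai.com/get_schedule_new.php"

def schedule_url(user_route, start_station, end_station, start_time, end_time):
    params = {
        "user_route": user_route,
        "start_station": start_station,
        "end_station": end_station,
        "start_time": start_time,
        "end_time": end_time,
    }
    return f"{BASE}?{urlencode(params)}"

# Example: the same combination as the URL quoted above.
print(schedule_url("western", "khar_road", "malad", "00", "18"))
```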

Which is the better way to use Scrapy to crawl 1000 sites?

I'd like to hear the differences between three approaches to using Scrapy to crawl 1000 sites.
For example, I want to scrape 1000 photo sites that almost all have the same structure: one kind of photo list page and another kind of large photo page, but the HTML of these list and photo description pages will not all be the same.
Another example: I want to scrape 1000 WordPress blogs, only the blogs' articles.
The first approach is crawling all 1000 sites using one Scrapy project.
The second is having all 1000 sites under the same Scrapy project, with all items in items.py and each site having its own spider.
The third is similar to the second, but with one spider for all the sites instead of separating them.
What are the differences, and which do you think is the right approach? Is there any other, better approach I've missed?
I had 90 sites to pull from, so it wasn't a great option to create one crawler per site. The idea was to be able to run them in parallel, and I also split the work so that similar page formats were handled in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL Extractor. This would extract all detail-page URLs from the top-level listing pages into a file (or files).
Crawler 2 - Fetch Details. This would read from the URL file and extract item details.
This allowed me to fetch the URLs first and estimate the number of threads I might need for the second crawler.
Since each crawler was working on a specific page format, there were quite a few functions I could reuse.
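As an illustration of that split, here is a minimal sketch of the second crawler, assuming the first one has already written one detail-page URL per line to detail_urls.txt; the field names and CSS selectors are placeholders that would differ per page format.

```python
# Sketch of "Crawler 2 - Fetch Details": read URLs produced by the URL extractor
# and scrape item details from each detail page.
import scrapy


class DetailSpider(scrapy.Spider):
    name = "fetch_details"

    def start_requests(self):
        # detail_urls.txt is assumed to hold one URL per line (output of crawler 1).
        with open("detail_urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        # Placeholder selectors; each page format gets its own parsing function.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "image": response.css("img::attr(src)").get(),
        }
```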

Can someone post a "multiple tables" example for yadcf?

I'm trying to get the "multiple tables" example from https://github.com/vedmack/yadcf working, and I can't seem to get it to work.
I was wondering if anyone could post a zip file of a working example that I could just tweak.
I have a specific outcome I'm trying to test for with multiple tables, where the second table gets filtered by the contents of the first table.
Example: http://yadcf-showcase.appspot.com/dom_multi_columns_tables_1.10.html
You can grab all the needed files from the yadcf-showcase repo; here is the link to the zip of the showcase, this is the relevant HTML, and here it is in action in the showcase.
You can grab the war folder, place it in the Public folder of your Dropbox, and access it via "Copy public link"; that way there is no need for a web server.

How can I extract data from multiple webpage using one import.io connector?

I need to extract data from multiple pages on a website. Can I do it with one connector, or do I have to create multiple connectors to get the data and group it later? As an example, I am trying to collect data for stocks, but it is spread across multiple pages.
Here are the different pages I am trying to collect the information from:
http://www.moneycontrol.com/india/stockpricequote/powergenerationdistribution/suzlonenergy/SE17
http://www.moneycontrol.com/financials/suzlonenergy/balance-sheet/SE17#SE17
http://www.moneycontrol.com/financials/suzlonenergy/ratios/SE17#SE17
How do I write one extractor to fetch data from these different pages?
If the URLs of the pages are different, then it would be best to use an extractor and paste the URLs into it; this way you can get live data at the click of a button. It would be really cool to get a few of the URLs so we can take a look at them.
Thanks!
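If you end up needing the raw pages outside import.io, a small script can fetch the three URLs in one go so you can merge the data yourself. This is just a sketch; the parsing of each page into fields is left out because it depends on which figures you need.

```python
# Download the three Moneycontrol pages listed above for later parsing/merging.
import requests

URLS = [
    "http://www.moneycontrol.com/india/stockpricequote/powergenerationdistribution/suzlonenergy/SE17",
    "http://www.moneycontrol.com/financials/suzlonenergy/balance-sheet/SE17#SE17",
    "http://www.moneycontrol.com/financials/suzlonenergy/ratios/SE17#SE17",
]

pages = {}
for url in URLS:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    pages[url] = resp.text  # raw HTML for whatever parsing you choose

print({url: len(html) for url, html in pages.items()})
```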