Extracting Multiple Documents from one Page using Nutch - api

I'm using Nutch to crawl APIs and index the data.
Using APIs, I can get multiple "pages" of data in one go. For example, let's say I was indexing movies.
I could query the top level and get a list of categories like Action, Drama, Comedy, etc. Then, I could query each category and get a list of movies. At this point, I can insert each movie as an outlink and have Nutch crawl the details of each movie.
However, the category call already gives me the details of, say, 10 movies at a time.
I want to be able to create the 10 entries in Nutch without having to crawl each of them. Can this be done?

Related

Conditional scraping of already scraped items from an index page

I'm trying to scrape a movie reviews site using python scrapy.
On the index page of the movie reviews, items are ordered by relevance, so they cannot be sorted in a way that puts new items at the top of the list.
Thus, I want to be able to browse the index daily and skip those reviews I've already scraped.
I was thinking of exporting each review into a single file.
Is it possible to check whether a file already exists from the spider?
Is this the best way to do it?
I'm new to web scraping and I don't know if this is good practice.
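A minimal sketch of the file-per-review idea, assuming each review has a stable, filesystem-safe ID and is stored as `<id>.json` in one output directory. The helper names (`review_path`, `is_already_scraped`, `new_review_ids`) are hypothetical and not part of Scrapy; a spider could call something like this before requesting a review page.

```python
from pathlib import Path


def review_path(output_dir, review_id):
    """Return the file path used for one review (hypothetical <id>.json layout)."""
    return Path(output_dir) / f"{review_id}.json"


def is_already_scraped(output_dir, review_id):
    """True if a file for this review already exists on disk."""
    return review_path(output_dir, review_id).exists()


def new_review_ids(output_dir, review_ids):
    """Keep only the IDs that have no file yet, preserving the input order."""
    return [rid for rid in review_ids if not is_already_scraped(output_dir, rid)]
```

Whether files are the right store is a separate question; with many reviews, a set of seen IDs kept in a single file or a small database may be easier to manage.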

Scrape more than 5,000 reviews off amazon

When scraping Amazon reviews I always hit a wall at 5,000 reviews, even when the product has 40,000. Is there any way to get past this barrier and scrape more?
To obtain more pages when scraping directly from search results, you can use filters to divide the search into smaller parts. For example, reviews with star ratings may be searched first for one star, then two, etc. This of course does not promise all the results, but would increase your chances. Some filters, such as tags or usernames, have more possible values and so split the results more finely, but are harder to implement.
Alternatively, access the data directly through their API and/or become an affiliate through their Amazon Associates program.
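The filter-splitting idea can be sketched roughly as follows. This assumes a hypothetical `fetch_page(filter_value, page)` callable that you write yourself and that returns a list of review dicts with an `"id"` key (an empty list when there are no more pages). Nothing here is a real Amazon interface.

```python
def collect_by_filter(fetch_page, filter_values, max_pages):
    """Run the same search once per filter value and merge the results.

    fetch_page(filter_value, page) is a hypothetical callable supplied by
    the caller. Reviews are de-duplicated by their "id" key, since the
    filtered result sets may overlap.
    """
    seen = {}
    for value in filter_values:
        for page in range(1, max_pages + 1):
            reviews = fetch_page(value, page)
            if not reviews:
                break  # no more pages for this filter value
            for review in reviews:
                seen.setdefault(review["id"], review)
    return list(seen.values())
```

For star ratings, `filter_values` would be `[1, 2, 3, 4, 5]`, so each partial search stays under the page limit more often than the unfiltered one.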

Creating a SOLR index for activity stream or newsfeed

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other and get updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume a post consists of just two fields.
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating an index for all the posts and indexing the text_field & user_field. At scale, there can be 1,000,000+ posts. A user may follow 100s if not 1000s of users. What will be the best way to create an index for this scenario?
Should I also index a person's followers, so that they can be looked up quickly and then passed to a second query that gets the posts of all those users sorted by date?
What is the best way to query the index consisting of all these posts, by passing the UIDs of all the users that are followed? Consider that there may be 100s or more.
Update:
The motivation for using Solr for the news feed was mainly inspired by this detailed slide and my brief discussion with the OpenSocial team.
When starting off with a social portal, fan-out on write seems like overkill and is more expensive, so fan-out on read is the better fit. Both the slide and the OpenSocial team suggested using a search backend for fan-out on read. The slide mentioned above also has data on how it helped them.
At present, the feed is going to be flat and the only sort criterion will be the date (recency). We won't be considering relevance or giving priority to posts from closer groups.
It's kind of abstract, but I will do my best here. Based on what you mentioned, I am not sure Solr is really the right tool for the job. You can still use Solr for full-text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. For the rest of this post I will assume that you stick with Solr, but keep in mind that we are trying to put a square peg in a round hole.
Here are a few additional questions you should think about.
You will probably want to add a timestamp of the post to the data element.
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep up. I am assuming followers are going to change often, and re-indexing in this scenario would not really be practical.
2) That sounds more on par, but again, you need to figure out the sorting. You can get a list of connections for the user, then run a search for top posts from all of them.
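To make point 2 concrete, here is a rough sketch of the second step: given the list of followed UIDs, build the parameters for one Solr select request sorted by recency. It assumes a `timestamp` field has been added to each post (as suggested above) and uses Solr's terms query parser in a filter query; check that your Solr version supports it, otherwise a plain `user_field:(B OR C OR D)` clause is the fallback. The function names and the core URL are hypothetical.

```python
from urllib.parse import urlencode


def build_feed_params(followed_uids, rows=20):
    """Build Solr select parameters for a flat, recency-sorted feed.

    Assumes UIDs contain no commas (the terms parser's default separator)
    and that posts carry a sortable "timestamp" field (an assumption).
    """
    if not followed_uids:
        raise ValueError("followed_uids must not be empty")
    return {
        "q": "*:*",
        # Match posts whose user_field is any of the followed UIDs.
        "fq": "{!terms f=user_field}" + ",".join(followed_uids),
        "sort": "timestamp desc",
        "rows": rows,
    }


def build_feed_url(base_url, followed_uids, rows=20):
    """Return the full request URL for a hypothetical Solr select handler."""
    return base_url + "?" + urlencode(build_feed_params(followed_uids, rows))
```

For example, `build_feed_url("http://localhost:8983/solr/posts/select", ["B", "C", "D"])` gives a URL you can fetch with any HTTP client; the list of followed users itself would come from your primary data store rather than from the index.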

How to search Amazon for 5-star-rated products with at least 10 ratings?

Oftentimes, when searching products on Amazon, I want to see the products with the best ratings. More often than not, when sorting by Customer Ratings, Amazon will display products that have a 5-star rating from just 1 rating, which is not enough of a sample size to tell me much about those products.
So how can I search Amazon for 5-star-rated products that have received a minimum of X ratings (e.g. 10) and only then sort by ratings?
Can this only be done via their API or can this be done on their website directly via some kind of advanced search?
You can't do this with the API - Amazon removed star ratings (and review count) from it back in 2010, and the API now only shows them as part of a block of reviews in an iframe. So the only way to extract this information is to scrape Amazon - either the site itself, or inside those frames - which of course they do not take kindly to!

Using APIs to Filter Albums by Years

So I'm working on an application that has a feature that generates a list of 100 or so artists that are similar to those in the user's music catalog using the Echo Nest API. Then, a user can supply a certain year, and, based on the similar artists, the application will return a list of albums that were released in that year.
The only problem is that I have no idea how to filter albums based on year. The Echo Nest API doesn't really do much with albums. The Discogs and Last.fm APIs work with albums, and the Discogs API has data about albums' release dates, but there is no way to filter an initial query by release date. For example, if I have the artist Fleet Foxes and I want only the albums released in 2011, there is no option to search for albums by Fleet Foxes confined to release dates in 2011.
The only option I can really see at this point is iterating over EVERY album an artist has and only adding those albums that meet my specifications. However, this is obviously very heavy on both the APIs and my server, especially considering that many of the artists in the list of 100 similar artists will have no albums that match my criteria and that many artists have well within the range of 100 albums when you take into consideration singles, remixes, etc.
Does anyone see a better way of doing this?
If an API really doesn't have any way to filter by year, then yes, of course you will have to pull down all of the releases and filter them after the fact.
If you think this is a burden on your code and/or their server, you should file a feature request to add the filtering.
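The "pull everything and filter afterwards" fallback might look like the sketch below. It assumes a hypothetical `fetch_releases(artist_id)` callable that returns a list of dicts with a `"year"` key; the real field names depend on whichever API you end up using.

```python
def releases_in_year(releases, year):
    """Keep only the releases whose "year" field equals the requested year."""
    return [release for release in releases if release.get("year") == year]


def albums_by_artist_for_year(fetch_releases, artist_ids, year):
    """Fetch every release per artist, then filter client-side.

    fetch_releases(artist_id) is a hypothetical callable supplied by the
    caller. Artists with no matching release are left out of the result.
    """
    matches = {}
    for artist_id in artist_ids:
        hits = releases_in_year(fetch_releases(artist_id), year)
        if hits:
            matches[artist_id] = hits
    return matches
```

Caching the per-artist release lists would keep repeated year lookups from hitting the API again.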
However, you should make sure first that they really don't provide such a thing. Most REST APIs separate "fetch" and "search". For example, http://api.example.com/artists/12345/releases may not have any way to filter it, but http://api.example.com/search?type=releases&artist=12345&year=2011 may exist.
Without looking into all of the APIs in detail, a quick check of Discogs' "Run a search query" docs shows that you can include a year criterion in the search (although it looks like maybe you can't actually search by artist ID, just by artist name?).
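If a search endpoint like the one above does exist, building the request is straightforward. This sketch uses the made-up `api.example.com` URL from earlier; the parameter names (`type`, `artist`, `year`) are taken from that example and are not the confirmed parameters of Discogs or any other real API.

```python
from urllib.parse import urlencode


def build_release_search_url(base_url, artist, year):
    """Build a search URL that asks the server to filter by artist and year.

    The "/search" path and the parameter names are assumptions based on
    the hypothetical example URL, not a documented API.
    """
    query = urlencode({"type": "releases", "artist": artist, "year": year})
    return f"{base_url}/search?{query}"
```

With this approach the server does the filtering, so you make one request per artist instead of paging through every release.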