Scrape more than 5,000 reviews off Amazon - Scrapy

When scraping Amazon reviews I always hit a wall at 5,000 reviews, even when the product has 40,000. Is there any way to get past this barrier and scrape more?

To obtain more pages when scraping directly from search results, you can use filters to divide the search into smaller parts. For example, filter the reviews by star rating: scrape the one-star reviews first, then the two-star reviews, and so on. This does not guarantee that you get every result, but it improves your chances. Other filters, such as tags or usernames, split the results into more parts but are harder to implement.
Alternatively, access the data directly through their API and/or become an affiliate through their Amazon Associates program.
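As a rough illustration of the filter-splitting idea, here is a minimal Python sketch. The URL layout and the `filterByStar` / `pageNumber` query parameters are assumptions about how the review pages are addressed and should be checked against the live site; `review_urls` and `merge_unique` are hypothetical helper names, and the `id` key stands in for whatever unique identifier your spider extracts per review.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed filter values; verify them against the live site before relying on them.
STAR_FILTERS = ["one_star", "two_star", "three_star", "four_star", "five_star"]


def review_urls(asin, max_pages, base="https://www.amazon.com/product-reviews"):
    """Yield one review-listing URL per (star filter, page) pair."""
    for star, page in product(STAR_FILTERS, range(1, max_pages + 1)):
        query = urlencode({"filterByStar": star, "pageNumber": page})
        yield f"{base}/{asin}?{query}"


def merge_unique(batches):
    """Merge review dicts from several filtered crawls, dropping repeated ids."""
    seen = {}
    for batch in batches:
        for review in batch:
            # Keep the first copy of each review id and ignore later duplicates.
            seen.setdefault(review["id"], review)
    return list(seen.values())
```

In a Scrapy spider, the generated URLs could be fed to `start_urls` or yielded as requests. Each filtered crawl is still subject to the same per-listing cap, which is why splitting only raises the ceiling rather than removing it.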

Related

How to search Amazon for 5-star-rated products with at least 10 ratings?

Oftentimes, when searching products on Amazon, I want to see the products with the best ratings. More often than not, when sorting by Customer Ratings, Amazon will display products that have a 5-star rating from just 1 rating, which is not enough of a sample size to tell me much about those products.
So how can I search Amazon for 5-star-rated products that have received a minimum of X ratings (e.g. 10) and only then sort by ratings?
Can this only be done via their API or can this be done on their website directly via some kind of advanced search?
You can't do this with the API - Amazon removed star ratings (and review count) from it back in 2010, and the API now only shows them as part of a block of reviews in an iframe. So the only way to extract this information is to scrape Amazon - either the site itself, or inside those frames - which of course they do not take kindly to!

Using APIs to Filter Albums by Years

So I'm working on an application that has a feature that generates a list of 100 or so artists that are similar to those in the user's music catalog using the Echo Nest API. Then, a user can supply a certain year, and, based on the similar artists, the application will return a list of albums that were released in that year.
The only problem is that I have no idea how to filter albums based on year. The Echo Nest API doesn't really do much with albums. The Discogs and Last.fm APIs work with albums, and the Discogs API has data about albums' release dates, but there is no way to filter an initial query by release date. For example, if I have the artist Fleet Foxes and I want to filter it by albums released in 2011, there is no option to search for albums by the Fleet Foxes confined to release dates of 2011.
The only option I can really see at this point is iterating over EVERY album an artist has and only adding those albums that meet my specifications. However, this is obviously very heavy on both the APIs and my server, especially considering that many of the artists in the list of 100 similar artists will have no albums that match my criteria and that many artists have on the order of 100 releases once you take singles, remixes, etc. into consideration.
Does anyone see a better way of doing this?
If an API really doesn't have any way to filter by year, then yes, of course you will have to pull down all of the releases and filter them after the fact.
If you think this is a burden on your code and/or their server, you should file a feature request to add the filtering.
However, you should make sure first that they really don't provide such a thing. Most REST APIs separate "fetch" and "search". For example, http://api.example.com/artists/12345/releases may not have any way to filter it, but http://api.example.com/search?type=releases&artist=12345&year=2011 may exist.
Without looking into all of the APIs in detail, a quick check of Discogs' "Run a search query" docs shows that you can include a year criterion in the search (although it looks like maybe you can't actually search by artist ID, just by artist name?).
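To make the "search" versus "fetch and filter" distinction concrete, here is a small Python sketch. The endpoint and the `type`, `artist`, and `year` parameter names reflect my reading of the Discogs search docs and should be treated as assumptions; authentication and the actual HTTP request are left out, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def search_url(artist, year, base="https://api.discogs.com/database/search"):
    """Build a search URL that asks the server to filter by artist name and year."""
    # Parameter names are assumed from the search docs; confirm before use.
    params = {"type": "release", "artist": artist, "year": year}
    return f"{base}?{urlencode(params)}"


def releases_in_year(releases, year):
    """Fallback: filter already-fetched release dicts locally by their 'year' field."""
    # Compare as strings so that 2011 and "2011" are treated the same.
    return [r for r in releases if str(r.get("year")) == str(year)]
```

For example, `search_url("Fleet Foxes", 2011)` builds a single request per artist, whereas `releases_in_year` is what you would fall back to if an API only offers the unfiltered "all releases for this artist" listing.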

Yelp, Google's API for restaurants help

Ok, I have looked into this, and I'm not sure if anyone else has experience with it. I'm having tremendous difficulties with Yelp's and Google's APIs.
To help explain what I am trying to do, here is the concept of the website. We would have to pull restaurants based on the user's distance, and then randomize them based on restaurant quality, using feedback from review websites (Yelp, Google, urbanspoon, zagat, opentable, kudzu, yahoo - doesn't have to be from all) and feedback from our own users (on the results page for the random restaurant, users can select good recommendation/bad recommendation).
There’s a lot we could calculate for our formula. Your results will depend on whether you’re at home or at work. If you’re at home you will have more time to drive out to the city to grab some dinner or lunch. If you’re at work we would have to recommend restaurants nearby, as lunch is typically 30 minutes to an hour. A 30-minute lunch would most likely require takeout or quick service. With an hour lunch break you could dine in at a local fine dining restaurant.
So in a nutshell: the user comes to the website, selects whether they're at home or at work, clicks submit, and we select a random restaurant for them to go to. If they don't like it they can click retry and a new restaurant is shown.
The issue I am having is using the API to gather all the restaurants in the US. I know it can be done because there are similar websites/apps that pull the restaurants closest to you, such as Ness and Alfred, and I believe there are two more but I can't remember the names.
Does anyone know if this can be accomplished? I desperately need some help. Thanks in advance!
Yelp's API can provide you with a list of restaurants matching your search; the search can be specific to an area, to a latitude/longitude, etc. A number of APIs also let you see the reviews of different restaurants, and you can build some logic on top of those.
I think the home/work logic is something that you have to implement in your own application; the Yelp API can only provide you with the results for your search.
Go through their documentation for further information.
http://www.yelp.com/developers/documentation/v2/search_api
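The application-side logic could look something like the Python sketch below. It assumes you have already fetched search results and reduced each one to a dict with `distance_km` and `rating` keys; those key names, the `pick_restaurant` function, and the distance limits are all hypothetical and not part of any API.

```python
import random


def pick_restaurant(restaurants, max_km, rng=random):
    """Pick one restaurant within max_km, favouring higher ratings."""
    nearby = [r for r in restaurants if r["distance_km"] <= max_km]
    if not nearby:
        return None
    # Ratings are used directly as weights, so they are assumed to be positive.
    weights = [r["rating"] for r in nearby]
    return rng.choices(nearby, weights=weights, k=1)[0]
```

A "work" request might call this with a small `max_km` and a "home" request with a larger one. Feedback from your own users could later be folded into the weights, and "retry" is simply another call with the previous pick removed from the list.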

Twitter Search API - Unusable?

After many tests, I've been unable to get the Twitter Search API to return more than 80% of tweets containing a specific keyword or hashtag. This is not related to the maximum number of results: one test involved a hashtag which had been tweeted 50 times, and only 15 of those tweets were returned by the Twitter Search API. The same results were returned when using Twitter's own search tool.
Is the Twitter Search API simply a tool for getting estimates and trends, rather than accurate data?
Has anyone found a way to capture 100% of tweets containing a specific keyword or hashtag?
Twitter filters the Search API for better results. Here is a quote from the developer site:
"Both the Streaming API and the Search API filter, and on some end-points, discard, statuses created by a small proportion of accounts based upon status quality metrics. For example, frequent and repetitious status updates may, in some instances, and in combination with other metrics, result in a different status quality score for a given account."
In other words, the Search API simply returns a subset of the matching tweets.

Is there a way to get details of the products from a single industry?

I want to maintain a database of all the products or the brands with respect to industry.
For example I need to get information about all the food supplements. How can I get them?
I am not sure all the companies have an API for their products.
Please advise
Uhm... what kind of information? If you need prices, you can probably get information from government sources. At least you can here in Argentina. Other than that, I don't think it's possible, unless you somehow manage to scrape the websites of all the brands you want to track.
Speaking as someone who has worked for two data-aggregation companies, aggregating data involves a lot of manual work. You find the sources, you automate the acquisition of data as best you can (APIs, file downloads and imports, even screen scraping from HTML pages), and you stay on top of it constantly. You're always looking for additional sources, updating code for sources that have changed, minding legal implications of sources who don't want you to harvest their data, etc.
Sometimes you have to buy the data, or weigh that cost against not having data from that source or scraping it manually. Sometimes a source will block you in some way and you need to either try to get around that or negotiate some terms with them. It's a viable business model, but it's not cheap.
For some products, Retailigence ( http://www.retailigence.com ) may have data in API form. They basically keep track of local stores' inventory and pricing for certain categories of products.
You should definitely check out Good Guide - an API that gives you access to details on over 60,000 household products.
http://developer.goodguide.com
DailyMed is a good service to check out if you're interested in products in the medical space.
http://dailymed.nlm.nih.gov/dailymed