Conditional scraping of already scraped items from an index page - scrapy

I'm trying to scrape a movie reviews site using python scrapy.
On the index page of the movie reviews, items are ordered by relevance, so there is no way to sort them so that new items appear at the top of the list.
Thus, I want to be able to browse the index daily and skip those reviews I've already scraped.
I was thinking of exporting each review into its own file.
Is it possible to check from within the spider whether a file already exists?
Is this the best way to do it?
I'm new to web scraping and I don't know whether this is good practice.
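A minimal sketch of that idea, assuming each review URL ends in a stable ID that can double as a filename; the spider name, URLs, selectors, and file layout below are illustrative placeholders, not code for any particular site:

```python
import json
import os

import scrapy

class ReviewSpider(scrapy.Spider):
    """Sketch of a spider that skips reviews already saved to disk."""
    name = "reviews"
    start_urls = ["http://example.com/reviews"]  # placeholder index URL
    output_dir = "scraped_reviews"

    def parse(self, response):
        os.makedirs(self.output_dir, exist_ok=True)
        # Placeholder selector for the real index-page markup.
        for href in response.css("a.review-link::attr(href)").getall():
            review_id = href.rstrip("/").split("/")[-1]
            path = os.path.join(self.output_dir, review_id + ".json")
            if os.path.exists(path):
                continue  # already scraped on a previous run
            yield response.follow(href, self.parse_review,
                                  cb_kwargs={"path": path})

    def parse_review(self, response, path):
        # Placeholder fields; extract whatever the real page offers.
        item = {"url": response.url, "title": response.css("h1::text").get()}
        with open(path, "w") as f:
            json.dump(item, f)
        yield item
```

A single set of seen IDs persisted between runs (or a duplicate-dropping item pipeline) scales better than thousands of small files, but a file-per-review check like the one above is perfectly workable for a daily crawl.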

Related

Google AdWords Campaign doesn't show list of Products

I've been searching this issue for about 2 days, so I'd really appreciate your help.
First, I created an account at Google Merchant Center and linked it with my site.
The product feed also works well (I'm using PrestaShop, so the input method is "E-Commerce platform imports"); its status shows as successful.
After that, I created a campaign (max CPC and budget are already set), but the products still don't show up in it.
Is there any step I missed that caused the products not to be listed?
First, the items, images, and website must adhere to all of Google's rules and policies; individual items being searchable does not necessarily indicate all is well -- always check the main dashboard, the entire account, and email for any messages from Google, and check the site's log files to help verify that Google has crawled the website and images.
Yes, a feed may take 24-72 hours or so to be processed, and all items and images crawled, before the items are seen within the (linked) AdWords campaign. Be certain the Merchant Center account or sub-account is properly linked.
Also, why is an inventory filter being used? Generally, a filter is created to exclude items -- any items that do not match the filter (exactly) will be excluded from the campaign. Check the feed file within a browser window and individual items within the Products tab to be certain that the items (attributes) will be able to match the filter exactly.
Also, verify that the remaining products that do fit through the filter are being included in the defined product group within the (linked) AdWords account, by clicking on 'view the full list of products' -- generally, only one campaign is needed, but check any other Shopping campaigns and all other product groups to be certain that the items are not being filtered or matched elsewhere in the account.
Otherwise, Google should likely be contacted so that a person can look directly into the data feed, website, images, and both accounts.
After days, I finally found it.
I just became aware that I had made a mistake in "Attributes": the value of my site's category wasn't being recognized by Google, so I had to create a new one based on the template from Google itself.
After I re-fetched the feed, the status changed accordingly.
And finally... the products show up.
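For reference, a minimal sketch of what a feed item with a correctly templated category might look like, assuming the standard Google Shopping XML namespace; all IDs and values here are placeholders, and the google_product_category string must be copied verbatim from Google's published product taxonomy:

```xml
<item>
  <g:id>SKU-001</g:id>
  <title>Example Product</title>
  <link>http://www.example.com/product/sku-001</link>
  <g:price>9.99 USD</g:price>
  <!-- Must match an entry in Google's product taxonomy exactly -->
  <g:google_product_category>Media &gt; DVDs &amp; Videos</g:google_product_category>
  <!-- product_type is the site's own category and is free-form -->
  <g:product_type>Movies &gt; Drama</g:product_type>
</item>
```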

How to search Amazon for 5-star-rated products with at least 10 ratings?

Oftentimes, when searching products on Amazon, I want to see the products with the best ratings. More often than not, when sorting by Customer Ratings, Amazon will display products that have a 5-star rating from just 1 rating, which is not enough of a sample size to tell me much about those products.
So how can I search Amazon for 5-star-rated products that have received a minimum of X ratings (e.g. 10) and only then sort by ratings?
Can this only be done via their API or can this be done on their website directly via some kind of advanced search?
You can't do this with the API - Amazon removed star ratings (and review count) from it back in 2010, and the API now only shows them as part of a block of reviews in an iframe. So the only way to extract this information is to scrape Amazon - either the site itself, or inside those frames - which of course they do not take kindly to!
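If you do decide to scrape despite those caveats, a minimal sketch might look like the following; the URL pattern and CSS selectors are assumptions based on common Amazon markup, change frequently, and would need to be verified against the live pages:

```python
import re

import requests
from bs4 import BeautifulSoup

def get_rating_summary(asin):
    """Fetch a product page and extract the average rating and rating count.

    Selectors are guesses; Amazon also actively blocks automated clients.
    """
    url = "https://www.amazon.com/dp/" + asin
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    rating_el = soup.select_one("span.a-icon-alt")        # e.g. "4.7 out of 5 stars"
    count_el = soup.select_one("#acrCustomerReviewText")  # e.g. "1,234 ratings"
    if rating_el is None or count_el is None:
        return None

    rating = float(rating_el.get_text().split()[0])
    count = int(re.sub(r"[^\d]", "", count_el.get_text()))
    return rating, count

def is_candidate(asin, min_ratings=10):
    """True for products rated 5.0 stars with at least min_ratings ratings."""
    summary = get_rating_summary(asin)
    return summary is not None and summary[0] == 5.0 and summary[1] >= min_ratings
```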

Extracting Multiple Documents from one Page using Nutch

I'm using Nutch to crawl APIs and index the data.
Using the APIs, I can fetch multiple "pages" of data in one go. For example, let's say I was indexing movies.
I could query the top level and get a list of categories like Action, Drama, Comedy, etc. Then I could query each category and get a list of movies. At this point, I can insert each movie as an outlink and have Nutch crawl the details of each movie.
However, the category call already gives me the details of, say, 10 movies at a time.
I want to be able to create the 10 entries in Nutch without having to crawl each of them. Can this be done?

How to get random page of specific "Portal:" using WikiMedia API

Is there any way to get a random article from a specific Wikimedia portal using the Wikimedia API?
For example, I need a random page from Portal:Science.
Does anyone know how to do this?
What you're asking for doesn't make much sense, because a portal doesn't have a list of pages associated with it.
The closest thing you can do is to get a random page from e.g. Category:Science or one of its subcategories. There is no way to do that directly using the API; you would need to traverse all the subcategories and choose a random page from them yourself.
There is a tool that already does this (with a limit on the depth of the category tree): erwin85's random article. There is also a template for it on the English Wikipedia.
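A rough sketch of that traversal, assuming the standard MediaWiki API endpoint and a depth cap to keep the (large and cyclic) category graph manageable:

```python
import random

import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, cmtype):
    """Yield members of a category via the MediaWiki categorymembers API."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,   # "page" or "subcat"
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params, timeout=10).json()
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # standard API continuation

def collect_pages(category, max_depth=2, _seen=None):
    """Collect page titles from a category tree, capped at max_depth."""
    _seen = _seen if _seen is not None else set()
    if category in _seen:
        return []
    _seen.add(category)
    pages = [m["title"] for m in category_members(category, "page")]
    if max_depth > 0:
        for sub in category_members(category, "subcat"):
            pages += collect_pages(sub["title"], max_depth - 1, _seen)
    return pages

print(random.choice(collect_pages("Category:Science")))
```

Note that picking uniformly from the collected list biases toward large categories' pages no more than the tool does; for a truly uniform draw over a weighted tree you would need to track per-category counts first.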

Specify items per page

I'm trying to query Picasa Web Albums to pull album/photo data (obviously!); however, initially I only need to pull the first 4 photos. Is there any way to limit a particular field or to specify the items per page?
I've already accomplished something similar with Facebook's Graph API, but I'm unable to find anything similar for Picasa. The only items I can find related to limiting the response are about specifying which fields to return; nothing relates to the number of rows.
Use the max-results parameter. See the documentation here:
https://developers.google.com/picasa-web/docs/2.0/reference#Parameters
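For example, a feed request limited to the first 4 photos of an album might look like this; the user ID and album ID are placeholders, and max-results (together with start-index) is the standard GData paging mechanism:

```python
import requests

# Placeholder user ID and album ID; substitute real values.
url = "https://picasaweb.google.com/data/feed/api/user/someuser/albumid/12345"
params = {
    "kind": "photo",
    "max-results": 4,  # only return the first 4 photos
    "alt": "json",     # JSON instead of the default Atom XML
}
resp = requests.get(url, params=params, timeout=10)
resp.raise_for_status()
for entry in resp.json()["feed"]["entry"]:
    print(entry["title"]["$t"])  # GData JSON puts text content under "$t"
```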