Incrementally crawl a website with Scrapy - scrapy

I am new to crawling and would like to know whether it's possible to use Scrapy to crawl a site, like CNBC.com, incrementally? For example, if today I crawled all pages from a site, then from tomorrow I only want to collect pages that are newly posted to this site, to avoid crawling all the old pages.
Thank you for any info. or input on this.

Yes you can and it's actually quite easy. Every news website has a few very important index pages like the homepage and the categories (eg politics, entertainment etc.) There is no article that doesn't go through these pages for at least a few minutes. Scan those pages every minute or so and save just the links. Then do a diff with what you already have in your databases and a few times a day issue a crawl to scrape all the missing links. Very standard practice.

Please try the scrapy plugin scrapy-deltafetch , which would make your life easier.

Short answer: no.
Longer answer: What you could do is write the article id or the article url to a file and during the scraping, you would match the id or url with the records in the file.
Remember to load your file only once and assign it to a variable. Don't load it during your iteration when scraping.

Related

Which is the better way to use Scrapy to crawl 1000 sites?

I'd like to hear the diffrences between 3 different approaches for using Scrapy in order to crawl 1000 sites.
For example, I want to scrape 1000 photo sites, they all most has the same structure.Like have one kind of photo list page,and other kind of big photo page; but these list or photo desc page's HTML code will not all the same.
Another example,I want to scrape 1000 wordpress blog,Only bolg's article.
The first, is exploring the entire 1000 sites using one scrapy project.
The second, is having all these 1000 sites under the same scrapy project, all items in items.py, each site having it's own spider.
The third is similar to the second, but having one spider for all the sites instead of seperating them.
What are the diffrences, and which do you think is the right approach? Is there any other, better approach I've missed?
I had 90 sites to pull from so it wasn't great option to create one crawler per site. The idea was to be able to run in parallel. Also i had split this to pack similar page formats in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL Extractor. This would extract all detail page URLs from top level listing page in a file(s).
Crawler 2 - Fetch Details.
This would read from the URL file and extract item details.
This allowed me to fetch URLs first and estimate number of threads that i might need for second crawler.
Since each crawler was working on specific page format, there were quite a few functions I could reuse.

scrapy CrawlSpider: crawl policy / queue questions

I started with scrapy some days ago, learned about scraping particular sites, ie the dmoz.org example; so far it's fine and i like it. As I want to learn about search engine development I aim to build a crawler (and storage, indexer etc) for large amount of websites of any "color" and content.
So far I also tried the depth-first-order and bredth-first-order crawling.
I use at the moment just one Rule, I set some path to skip and some domains.
Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
callback='save_page', follow=True),
I have one pipeline, a mysql storage to store url, body and headers of the downloaded pages, done via a PageItem with these fields.
My questions for now are:
Is it fine to use item for simple storing of pages ?
How does it work that the spider checks the database if a page is already crawled (in the last six months ie), it's builtin somehow?
Is there something like a blacklist for useless domains, ie. placeholder domains, link farms etc.?
There are many other issues like storage but I guess I stop here, just one more general search engine question
Is there a way to obtain crawl result data from other professional crawlers, of course it must be done by sending harddisks otherwise the data volume would be the same if I crawl them myself, (compressing left aside).
I will try to answer only two of your questions:
Is it fine to use item for simple storing of pages ?
AFAIK, scrapy doesn't care what you put into an Item's Field. Only your pipeline will dealing be with them.
How does it work that the spider checks the database if a page is already crawled (in the last six months ie), it's builtin somehow?
Scrapy has duplicates middleware, but it filters duplicates only in current session. You have to manually prevent scrapy to not crawl sites you've crawled six months ago.
As for question 3 and 4 - you don't understand them.

Count the number of pages in a site

I'd like to know how many public pages there are in a site, say for example, smashingmagzine.com. Is there are way to count the number of pages?
You can query Google's index using the site operator. e.g:
site:domain-to-query.com
This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality but I don't know the syntax off hand.
Of course not all pages may be indexed, and the index may contain pages which no longer exist.
You need to basically crawl the site. Your process would be something like:
Start at root domain / homepage
Look for all links that point within the same domain
For each of those links, repeat the steps
Your loop terminates when there are no more links to crawl that are pointing in the same domain. Remember to stay in the site otherwise you'll start crawling external sites.
You can also try parsing the sitemap if they provide one.
One tool that might prove useful if using Java is JSpider or Sphider in PHP.
You'll need to recursively scan the markup of each page, starting with your top level page, looking for any kind of links to other pages, and recursively crawl through them. You'll also need to keep track of what has been scanned as to not get caught in an infinate loop.

How do I get Google to index changes made to my website's keywords?

I have made changes to my website's keywords, description, and title, but Google is not indexing the new keyword. Instead, I have found that Google is indexing the older one.
How can I get Google to index my site using the new keywords that I have added?
Periods between crawling a page vary a lot across pages. A post to SO will be crawled and indexed by Google in seconds. Your personal page that hasn't changed content in 20 years might not even be crawled as much as once a year.
Submitting a sitemap to the webmaster tools will likely re-crawl your website to validate your sitemap. You could use this to speed up the re-crawling.
However, as #Charles noted, the keywords meta-tag is mostly ignored by Google. So it sounds like you're wasting your time.

Use of sitemaps

I've recently been involved in the redevelopment of a website (a search engine for health professionals: http://www.tripdatabase.com), and one of the goals was to make it more search engine "friendly", not through any black magic, but through better xhtml compliance, more keyword-rich urls, and a comprehensive sitemap (>500k documents).
Unfortunately, shortly after launching the new version of the site in October 2009, we saw site visits (primarily via organic searches from Google) drop substantially to 30% of their former glory, which wasn't the intention :)
We've brought in a number of SEO experts to help, but none have been able to satisfactorily explain the immediate drop in traffic, and we've heard conflicting advice on various aspects, which I'm hoping someone can help us with.
My question are thus:
do pages present in sitemaps also need to be spiderable from other pages? We had thought the point of a sitemap was specifically to help spiders get to content not already "visible". But now we're getting the advice to make sure every page is also linked to from another page. Which prompts the question... why bother with sitemaps?
some months on, and only 1% of the sitemap (well-formatted, according to webmaster tools) seems to have been spidered - is this usual?
Thanks in advance,
Phil Murphy
The XML sitemap helps search engine spider to indexing of all web pages of your site.
The sitemap is very usefull if you publish frequently many pages, but does not replace the correct system of linking of the site: all documents must be linke from an other related page.
Your site is very large, you must attention at the number of URLs published in the Sitemap because there are the limit of 50.000 URLs for each XML file.
The full documentation is available at Sitemaps.org
re: do pages present in sitemaps also need to be spiderable from other pages?
Yes, in fact this should be one of the first things you do. Make your website more usable to users before the search engines and the search engines will love you for it. Heavy internal linking between pages is a must first step. Most of the time you can do this with internal sitemap pages or category pages ect..
re: why bother with sitemaps?
Yes!, Site map help you set priorities for certain content on your site (like homepage), Tell the search engines what to look at more often. NOTE: Do not set all your pages with the highest priority, it confuses Google and doesn't help you.
re: some months on, and only 1% of the sitemap seems to have been spidered - is this usual?
YES!, I have a webpage with 100k+ pages. Google has never indexed them all in a single month, it takes small chunks of about 20k at a time each month. If you use the priority settings property you can tell the spider what pages they should re index each visit.
As Rinzi mentioned more documentation is available at Sitemaps.org
Try build more backlinks and "trust" (links from quality sources)
May help speed indexing further :)