how to collect urls with connection refused error in scrapy without using logfile - scrapy

I'm writing a Scrapy crawler. At the end of the log file, the Scrapy stats show 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2. I want to collect those two URLs in a list and print them out. Is there a way to do this without saving and parsing the log file after the crawl of all URLs is done?
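One way to do this without touching the log file (a minimal sketch, not from this thread; names are illustrative) is to attach an errback to each Request, check the failure for ConnectionRefusedError, and collect the offending URLs on the spider so they can be printed when the crawl finishes:

```python
import scrapy
from twisted.internet.error import ConnectionRefusedError


class RefusedCollectorSpider(scrapy.Spider):
    name = "refused_collector"  # illustrative name
    start_urls = ["http://example.com"]  # replace with your own URLs

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.refused_urls = []  # URLs whose connection was refused

    def start_requests(self):
        for url in self.start_urls:
            # the errback receives download failures, including refused connections
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        if failure.check(ConnectionRefusedError):
            self.refused_urls.append(failure.request.url)

    def parse(self, response):
        pass  # your normal parsing logic

    def closed(self, reason):
        # called once the crawl is done, so no log parsing is needed
        print("Connection refused for:", self.refused_urls)
```

Note that with the retry middleware enabled, the errback only fires once the retries for a URL have been exhausted, which matches the retry/reason_count stat you are seeing.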

Related

increase scrapy crawlera crawling speed

CONCURRENT_REQUESTS = 50
CONCURRENT_REQUESTS_PER_DOMAIN = 50
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
After checking How to increase Scrapy crawling speed?, my scraper is still slow and takes about 25 hours to scrape 12000 pages (Google, Amazon). I use Crawlera. Is there more I can do to increase speed, and when CONCURRENT_REQUESTS = 50, does this mean I have 50 thread-like requests?
#How to run several instances of a spider
Your spider can take arguments in the terminal as follows: scrapy crawl spider -a arg=value.
Let's imagine you want to start 10 instances, because I guess you start with 10 URLs (quote: input is usually 10 urls). The commands could look like this:
scrapy crawl spider -a arg=url1 &
scrapy crawl spider -a arg=url2 &
...
scrapy crawl spider -a arg=url10
Where & indicates that the next command is launched without waiting for the previous one to finish. As far as I know, the syntax is the same on Windows and Ubuntu for this particular need.
##Spider source code
To be able to launch it as I showed you, the spider can look like this:
class SpiderExample(scrapy.Spider):
    name = "spider"

    def __init__(self, arg, *args, **kwargs):  # every argument here can be passed in the terminal with -a
        super().__init__(*args, **kwargs)
        self.arg = arg  # or self.start_urls = [arg], since that answers your problem
        # ...any other initialisation you need; the variables are then available
        # as self.correspondingVariable in any method of the spider

    def parse(self, response):  # called for the responses to start_urls
        pass  # ...any instructions you want in the parsing method
#To avoid being banned
As far as I read, you use Crawlera. Personally, I have never used it; I have never needed a paid service for this.
##One IP for each spider
The goal here is clear. As I told you in a comment, I use Tor and Polipo. Tor needs an HTTP proxy like Polipo or Privoxy to work correctly with a Scrapy spider. Tor is tunneled through the HTTP proxy, so in the end the proxy exposes a Tor IP. Where Crawlera can be interesting is that Tor's exit IPs are well known to some websites with a lot of traffic (and therefore a lot of robots going through them too...). Those websites can ban Tor IPs because they detect robot behaviour coming from the same IP.
Well, I don't know how Crawlera works, so I don't know how you can open several ports and use several IPs with it. Look into it yourself. In my case with Polipo, I can run several tunneled instances (Polipo listens on the corresponding Tor SOCKS port) over several Tor circuits that I launch myself. Each Polipo instance has its own listening port. Then for each spider I can run the following:
scrapy crawl spider -a arg=url1 -s HTTP_PROXY=127.0.0.1:30001 &
scrapy crawl spider -a arg=url2 -s HTTP_PROXY=127.0.0.1:30002 &
...
scrapy crawl spider -a arg=url10 -s HTTP_PROXY=127.0.0.1:30010 &
Here, each port goes out with a different IP, so for the website these are different users. Your spider can then be more polite (look at the settings options) and your whole project will be faster. So there is no need to go through the roof by setting CONCURRENT_REQUESTS or CONCURRENT_REQUESTS_PER_DOMAIN to 300; that just makes the website spin its wheels and generates unnecessary events like DEBUG: Retrying <GET https://www.website.com/page3000> (failed 5 times): 500 Internal Server Error.
As a personal preference, I like to set a different log file for each spider. It avoids an explosion of lines in the terminal and lets me read the events of each process in a more comfortable text file. It is easy to add to the command: -s LOG_FILE=thingy1.log. It will quickly show you whether some URLs were not scraped as you wanted.
##Random user agent.
When I read that Crawlera is a smart solution because it uses the right user agent to avoid being banned... I was surprised, because you can actually do it yourself, like here. The most important aspect when doing it yourself is to choose popular user agents, so you are overlooked among the large number of users of that same agent. Lists are available on some websites. Also, be careful to use desktop user agents and not those of other devices like mobiles, because the rendered page, I mean the source code, is not necessarily the same and you can lose the information you want to scrape.
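For illustration, a minimal sketch of such a middleware (the list and class name are my own, not from this answer): a downloader middleware that picks a random popular desktop user agent for every outgoing request.

```python
import random

# A short, illustrative list of popular desktop user agents; in practice you
# would paste a longer list taken from one of those websites.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random desktop user agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue through the middleware chain
```

Enable it in DOWNLOADER_MIDDLEWARES with a priority after the built-in UserAgentMiddleware (which runs at 400) so your header is not overwritten.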
The main con of my solution is that it consumes your computer's resources. So your choice of the number of instances will depend on your computer's capacity (RAM, CPU...) and your router's capacity too. Personally I am still on ADSL and, as I told you, 6000 requests were done in 20-30 minutes... But my solution does not consume more bandwidth than setting a crazy value for CONCURRENT_REQUESTS.
There are many things that can affect your crawler's speed, but the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN settings are the main ones. Try setting them to some absurdly high number like 300 and go from there.
see: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
Other than that ensure that you have AUTOTHROTTLE_ENABLED set to False and DOWNLOAD_DELAY set to 0.
Even if Scrapy is limited by its internal behaviour, you can always fire up multiple instances of your crawler and scale your speed that way. A common atomic approach is to put URLs/IDs into a Redis or RabbitMQ queue and pop->crawl them in multiple Scrapy instances, as in the sketch below.
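As a sketch of that approach (using the plain redis client rather than the scrapy-redis package; the queue key and names are illustrative), every Scrapy instance pops URLs from the same Redis list until it is empty:

```python
import redis
import scrapy


class QueueSpider(scrapy.Spider):
    name = "queue_spider"  # illustrative name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # every running instance connects to the same Redis server
        self.queue = redis.Redis(host="localhost", port=6379, db=0)

    def start_requests(self):
        # keep popping URLs until the shared queue is empty
        while True:
            url = self.queue.lpop("crawl:start_urls")  # assumed queue key
            if url is None:
                break
            yield scrapy.Request(url.decode(), callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Fill the queue beforehand (for example with RPUSH crawl:start_urls <url> ...) and then launch as many scrapy crawl queue_spider processes as your machine can handle.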

How to ignore download timeout errors in Scrapy

I'm using scrapy 0.24.4
I have set the download timeout to 15 seconds. Some URLs take more than 15 seconds to download, and my log file records these errors. I have many of these exceptions/errors in the log file simply because of the volume of URLs I'm crawling, and it's becoming hard to see genuine errors. Is there a way to catch these specific exceptions and ignore them?
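One way to do this (a sketch written against a recent Scrapy version, not 0.24 specifically, and not from the original thread) is to route download failures through a Request errback and silently drop the timeout failures while still reporting everything else:

```python
import scrapy
from twisted.internet.error import TimeoutError, TCPTimedOutError


class TimeoutTolerantSpider(scrapy.Spider):
    name = "timeout_tolerant"  # illustrative name
    start_urls = ["http://example.com"]  # replace with your own URLs

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        if failure.check(TimeoutError, TCPTimedOutError):
            return  # ignore download timeouts silently
        # anything else is still reported as a genuine error
        self.logger.error(repr(failure))

    def parse(self, response):
        pass  # your normal parsing logic
```

With an errback attached, the failure is handed to your code instead of being logged as an error, so the timeouts stay out of the log while genuine errors remain visible.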

Apache nutch and solr : queries

I have just started using Nutch 1.9 and Solr 4.10.
After browsing through certain pages, I see that the syntax for running this version has changed, and I have to update certain XML files to configure Nutch and Solr.
This version of the package doesn't require Tomcat to run. I started Solr:
java -jar start.jar
and checked localhost:8983/solr/admin; it's working.
I planted a seed in bin/url/seed.txt, and the seed is "simpleweb.org".
Ran Command in Nutch: ./crawl urls -dir crawl -depth 3 -topN 5
I got a few IO exceptions in the middle, so to avoid them I downloaded
patch-hadoop_7682-1.0.x-win.jar, made an entry in nutch-site.xml, and placed the jar file in Nutch's lib folder.
After running Nutch, the following folders were created:
apache-nutch-1.9\bin\-dir\crawldb\current\part-00000
I can see the following files in that path:
data
index
.data.crc
.index.crc
I want to know what to do with these files. What are the next steps? Can we view these files? If yes, how?
I indexed the crawled data from Nutch into Solr to link Solr with Nutch (the command completed successfully):
./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5
Why do we need to index the data crawled by Nutch into Solr?
After crawling with Nutch (command used: ./crawl urls -dir crawl -depth 3 -topN 5), can we view the crawled data, and if yes, where?
Or can we view the crawled data entries only after indexing the data crawled by Nutch into Solr?
How do we view the crawled data in the Solr web UI?
(Command used for this: ./crawl urls solr localhost:8983/solr/ -depth 3 -topN 5)
Although Nutch was built to be a web-scale search engine, this is no longer the case. Currently, the main purpose of Nutch is to do large-scale crawling. What you do with the crawled data is then up to your requirements. By default, Nutch allows you to send the data into Solr. That is why you can run
crawl url crawl solraddress depth level
You can also omit the Solr URL parameter. In that case, Nutch will not send the crawled data into Solr. Without sending the crawled data to Solr, you will not be able to search it. Crawling data and searching data are two different, but closely related, things.
Generally, you will find the crawled data in crawl/segments, not crawl/crawldb. The crawldb folder stores information about the crawled URLs: their fetch status, the time of the next fetch, plus some other information useful for crawling. Nutch stores the actual crawled data in crawl/segments.
If you want an easy way to view the crawled data, you might try Nutch 2.x, as it can store its crawled data in several back ends like MySQL, HBase, Cassandra, etc. through the Gora component.
To view the data in Solr, you simply issue a query to Solr like this:
curl "http://127.0.0.1:8983/solr/collection1/select/?q=*:*"
Otherwise, you can always push your data into different stores by adding indexer plugins. Currently, Nutch supports sending data to Solr and Elasticsearch. These indexer plugins send structured data like title, text, author and other metadata.
The following summarizes what happens in Nutch:
seed list -> crawldb -> fetching the raw data (downloading the site contents)
-> parsing the raw data -> structuring the parsed data into fields (title, text, anchor text, metadata and so on) ->
sending the structured data to storage for use (like Elasticsearch and Solr).
Each of these stages is extensible and lets you add your own logic to suit your requirements.
I hope that clears up your confusion.
You can run Nutch on Windows. I am also a beginner; yes, it is a bit tough to install on Windows, but it does work! The "input path doesn't exist" problem can be solved by:
replacing the hadoop-core-1.2.0.jar file in apache-nutch-1.9/lib with hadoop-core-0.20.2.jar (from Maven),
then renaming this new file to hadoop-core-1.2.0.jar.

Is there anyway to log the list of urls 'ignored' in Nutch crawl?

I am using Nutch to crawl a list of URLs specified in the seed file with depth 100 and topN 10,000 to ensure a full crawl. Also, I am trying to ignore URLs with repeated strings in their path using this regex-urlfilter: http://rubular.com/r/oSkwqGHrri
However, I am curious to know which URLs were ignored during crawling. Is there any way I can log the list of URLs "ignored" by Nutch while it crawls?
The links can be found by using the following command:
bin/nutch readdb PATH_TO_CRAWL_DB -stats -sort -dump DUMP_FOLDER -format csv
This will generate a part-00000 file in DUMP_FOLDER that contains the URL list and the status of each URL.
Those with the status db_unfetched have been ignored by the crawler.
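If you only want the ignored URLs out of that dump, a small sketch (it simply matches on the status string, since the exact CSV column layout can vary between Nutch versions) would be:

```python
# Print every URL from the crawldb dump whose record is marked db_unfetched.
import glob

for path in glob.glob("DUMP_FOLDER/part-*"):  # DUMP_FOLDER as used in the readdb command
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "db_unfetched" in line:
                # in the CSV dump the URL is the first field of the record
                print(line.split(",", 1)[0])
```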

Scrapy shutdown crawler from pipline

Is there a way to shut down a Scrapy crawl from the pipeline? My pipeline processes URLs and adds them to a list. When the list reaches a specified length, I want to shut down the crawler. I know of raise CloseSpider(), but it only seems to work when called from the spider.
Yes, it is possible. You can call crawler.engine.close_spider(spider, reason) from the pipeline.
There is an example in the CloseSpider extension
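For example, a minimal pipeline sketch (the limit, list and item field are illustrative) that stops the crawl once enough URLs have been collected:

```python
class UrlLimitPipeline:
    """Collects URLs from items and closes the spider after a given amount."""

    MAX_URLS = 100  # illustrative limit

    def __init__(self):
        self.urls = []

    def process_item(self, item, spider):
        self.urls.append(item["url"])  # assumes the item carries a 'url' field
        if len(self.urls) >= self.MAX_URLS:
            # same effect as CloseSpider, but callable from outside the spider
            spider.crawler.engine.close_spider(spider, reason="url limit reached")
        return item
```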