Scrapy - Change intervals between logging stats

Scrapy logs its stats (number of crawled pages, crawling speed, etc.) once every 60 seconds. Is there a way to change this interval?
2016-12-19 15:09:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-19 15:10:01 [scrapy] INFO: Crawled 82 pages (at 82 pages/min), scraped 28290 items (at 28290 items/min)
2016-12-19 15:11:01 [scrapy] INFO: Crawled 167 pages (at 85 pages/min), scraped 57615 items (at 29325 items/min)

There's an (undocumented?) setting called LOGSTATS_INTERVAL, which defaults to 60 (seconds).
You can change it to whatever interval (in seconds) you prefer in your settings.py, e.g.
LOGSTATS_INTERVAL = 5 * 60.0 # log stats every 5 minutes
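If you only want the change for a particular spider, a per-spider override via custom_settings should also work; a minimal sketch (the spider name is a placeholder):
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"  # placeholder name
    # Override the project-wide setting just for this spider
    custom_settings = {
        "LOGSTATS_INTERVAL": 300.0,  # log stats every 5 minutes
    }

    def parse(self, response):
        pass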

Related

Multithreading pdf workload slows the program down

I tried building a program that scans PDF files for certain elements and then outputs a new PDF file with all the pages that contain them. It was originally single-threaded and a bit slow to run: about 36 seconds on a 6-core 5600X. So I tried parallelizing it with concurrent.futures:
def process_pdf(filename):
    # Open the PDF file
    f = open(filename, "rb")
    print("Searching: " + filename)
    # Create a PDF object
    pdf = PyPDF2.PdfReader(f)
    # Extract the text from each page in the PDF
    extracted_text = [page.extract_text() for page in pdf.pages]
    # Initialize a list
    matching_pages = []
    # Iterate through the extracted text
    for j, text in enumerate(extracted_text):
        # Search for the symbol in the text
        if symbol in text:
            # If the symbol is found, get a new PageObject instance for the page
            page = pdf.pages[j]
            # Add the page to the list
            matching_pages.append(page)
    return matching_pages
Multiprocessing Block:
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Get a list of the file paths for all PDF files in the directory
    file_paths = [
        os.path.join(directory, filename)
        for filename in os.listdir(directory)
        if filename.endswith(".pdf")
    ]
    futures = [executor.submit(process_pdf, file_path) for file_path in file_paths]
    # Initialize a list to store the results
    results = []
    # Iterate through the futures as they complete
    for future in concurrent.futures.as_completed(futures):
        # Get the result of the completed future
        result = future.result()
        # Add the result to the list
        results.extend(result)
# Add each page to the new PDF file
for page in results:
    output_pdf.add_page(page)
The parallel version works, as is evident from the printed text, but it somehow doesn't scale at all: 1 thread ~ 35 seconds, 12 threads ~ 38 seconds.
Why? Where's my bottleneck?
Tried using other libraries, but most were broken or slower.
Tried using re instead of in to search for the symbol, no improvement.
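For comparison, here is roughly how the same pattern could be spelled with processes instead of threads. This is only a sketch: the directory, symbol and output file name are placeholders, and the workers return page indices rather than PageObject instances so the results pickle cleanly. As the answer below discusses, whether it actually helps depends on what the machine itself can sustain.
import os
import PyPDF2
from concurrent.futures import ProcessPoolExecutor, as_completed

def find_matching_pages(path, symbol):
    # Return the indices of pages whose extracted text contains the symbol
    pdf = PyPDF2.PdfReader(path)
    return path, [i for i, page in enumerate(pdf.pages)
                  if symbol in (page.extract_text() or "")]

if __name__ == "__main__":
    directory, symbol = "pdfs", "$"  # placeholders
    output_pdf = PyPDF2.PdfWriter()
    file_paths = [os.path.join(directory, name)
                  for name in os.listdir(directory) if name.endswith(".pdf")]
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(find_matching_pages, p, symbol) for p in file_paths]
        for future in as_completed(futures):
            path, indices = future.result()
            reader = PyPDF2.PdfReader(path)  # reopen the file in the parent process
            for i in indices:
                output_pdf.add_page(reader.pages[i])
    with open("matching_pages.pdf", "wb") as out:  # placeholder output name
        output_pdf.write(out)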
In general, PDF processing consumes much of a device's primary resources. Each device and system will be different, so by way of illustration here is one simple PDF task:
searching for text with a common Python library component, poppler's pdftotext (others may be faster, but the aim here is to show the normal attempts and issues).
As a ballpark yardstick, in 1 minute I scanned roughly 2,500 pages of one file for the word "AMENDMENTS" and found 900 occurrences, such as
page 82
[1] SHORT TITLE OF 1971 AMENDMENTS
[18] SHORT TITLE OF 1970 AMENDMENTS
[44] SHORT TITLE OF 1968 AMENDMENTS
[54] SHORT TITLE OF 1967 AMENDMENTS
page 83
[11] SHORT TITLE OF 1966 AMENDMENTS
[23] SHORT TITLE OF 1965 AMENDMENTS
[42] SHORT TITLE OF 1964 AMENDMENTS
page 84
[16] SHORT TITLE OF 1956 AMENDMENTS
[26] SHORT TITLE OF 1948 AMENDMENTS
page 85
[43] DRUG ABUSE, AND MENTAL HEALTH AMENDMENTS ACT OF 1988
To scan the whole file (13,234 pages) would take about 5 minutes 20 seconds,
and I know from testing that 4 CPUs could process a quarter of that file (3,308 of the 13,234 pages) in under 1 minute (there is a gain from using smaller files).
So a 4-core device should handle, say, 3,000 pages per core in a short time. Well, let's see.
If I thread that as 3 x 1,000 pages, one thread finishes after about 50 seconds, another after about 60 seconds and a third after about 70 seconds;
overall there is little or only a slight gain, because the one application is spending a third of its time in each thread.
Overall: about 3,000 pages per minute.
Let's try being clever and use 3 applications on the one file. Surely they can each take less time? Guess what:
one finishes after about 50 seconds, another after about 60 seconds and a third after about 70 seconds. No gain from using 3 applications.
Overall: about 3,000 pages per minute.
Let's try being even more clever and use 3 applications on 3 similar but different files. Surely they can each take less time? Guess what:
one finishes after about 50 seconds, another after about 60 seconds and a third after about 70 seconds. No gain from using 3 applications with 3 tasks.
Overall: about 3,000 pages per minute.
Whichever way I try it, the resources on this device are constrained to about 3,000 pages per minute overall.
I may just as well let 1 thread run unfettered.
So the basic answer is to use multiple devices, the same way graphics render farming is done.
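As a rough sketch of the kind of page-range scan described above (this assumes poppler's pdftotext is on the PATH; the file name is a placeholder, while the word and range mirror the example):
import subprocess

def count_hits(pdf_path, word, first, last):
    # Ask pdftotext to dump pages first..last as plain text to stdout ("-")
    result = subprocess.run(
        ["pdftotext", "-f", str(first), "-l", str(last), pdf_path, "-"],
        capture_output=True, text=True, check=True)
    return result.stdout.count(word)

# One 1,000-page slice of the large file, as in the 3 x 1,000-page test above
print(count_hits("big.pdf", "AMENDMENTS", 1, 1000))  # "big.pdf" is a placeholder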

how to inject urls found during crawl into nutch seed list

I have integrated Nutch 1.13 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I gave about 10 URLs in the seed list, which is at /usr/local/apache-nutch-1.13/urls/seed.txt, and I followed the tutorial.
The command I used is
/usr/local/apache-nutch-1.13/bin/crawl -i -D solr.server.url=httpxxx:8983/solr/nutch/ /usr/local/apache-nutch-1.13/urls/ crawl 100
1. It seems to run for one or two hours and I get corresponding results in Solr, but during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal. Why aren't they being added to the seed list?
2. How do I know whether my crawldb is growing? It's been about a month and the only results I get in Solr are from the seed list and its links.
3. I have set the above command in crontab -e and in Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?
I'm a total newbie and any additional info would be helpful.
1. It seems to run for one or two hours and I get corresponding results in Solr, but during the crawling phase a lot of URLs seem to be fetched and parsed in the terminal. Why aren't they being added to the seed list?
The seed file is never modified by Nutch; it serves a read-only purpose for the injection phase.
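For reference, injection is the explicit step that reads the seed list into the crawldb; with the paths from the question it would look something like this (here crawl is the crawl directory created by the crawl command above):
/usr/local/apache-nutch-1.13/bin/nutch inject crawl/crawldb /usr/local/apache-nutch-1.13/urls/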
2. How do I know whether my crawldb is growing?
You should take a look at the readdb -stats option of the crawldb reader.
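With the layout from the question, the command would be along these lines (crawl being the crawl directory passed to the crawl command above):
/usr/local/apache-nutch-1.13/bin/nutch readdb crawl/crawldb -stats
and you should get something like this: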
crawl.CrawlDbReader - Statistics for CrawlDb: test/crawldb
crawl.CrawlDbReader - TOTAL urls: 5584
crawl.CrawlDbReader - shortest fetch interval: 30 days, 00:00:00
crawl.CrawlDbReader - avg fetch interval: 30 days, 01:14:16
crawl.CrawlDbReader - longest fetch interval: 42 days, 00:00:00
crawl.CrawlDbReader - earliest fetch time: Tue Nov 07 09:50:00 CET 2017
crawl.CrawlDbReader - avg of fetch times: Tue Nov 14 11:26:00 CET 2017
crawl.CrawlDbReader - latest fetch time: Tue Dec 19 09:45:00 CET 2017
crawl.CrawlDbReader - retry 0: 5584
crawl.CrawlDbReader - min score: 0.0
crawl.CrawlDbReader - avg score: 5.463825E-4
crawl.CrawlDbReader - max score: 1.013
crawl.CrawlDbReader - status 1 (db_unfetched): 4278
crawl.CrawlDbReader - status 2 (db_fetched): 1014
crawl.CrawlDbReader - status 4 (db_redir_temp): 116
crawl.CrawlDbReader - status 5 (db_redir_perm): 19
crawl.CrawlDbReader - status 6 (db_notmodified): 24
A good trick I always use is to put this command inside the crawl script provided by Nutch (bin/crawl), inside the loop:
for ((a=1; ; a++))
do
  ...
  echo "stats"
  __bin_nutch readdb "$CRAWL_PATH"/crawldb -stats
done
It's been about a month and the only results I get in Solr are from the seed list and its links.
There can be multiple causes; you should check the output of each phase and see how the funnel goes.
3. I have set the above command in crontab -e and in Plesk scheduled tasks. Now I get the same links many times in return for a search query. How do I avoid duplicate results in Solr?
I guess you've used the default Nutch Solr schema; check the url vs. id fields.
As far as I've worked with it, id is the unique identifier of a URL (which may contain redirects).

Xcode Device Logs: both Type and Process are Unknown

What does it mean when, in the Xcode device logs, both Process and Type are Unknown? My app is crashing but the only report I get is:
Incident Identifier: 33E3CEF3-C2A2-4F21-BEB4-56FC9A20AB62
CrashReporter Key: c51879d0d5337cafa9240e2d296c1239048a4713
Hardware Model: iPad3,1
OS Version: iPhone OS 5.1.1 (9B206)
Kernel Version: Darwin Kernel Version 11.0.0: Sun Apr 8 21:52:26 PDT 2012; root:xnu-1878.11.10~1/RELEASE_ARM_S5L8945X
Date: 2012-12-27 12:11:39 -0300
Time since snapshot: 338 ms
Free pages: 7710
Active pages: 32586
Inactive pages: 19459
Throttled pages: 138248
Purgeable pages: 710
Wired pages: 51955
Largest process: ukingdom
Processes
Name UUID Count resident pages
pasteboardd <6b71e15d2fe432639d1c8e127a8ede10> 256
ukingdom <253fa43b568539b49cae65641094284f> 131217 (jettisoned) (active)
debugserver <2408bf4540f63c55b656243d522df7b2> 316
gputoolsd <8cd5bb538e623da98b5e85c85bd02a91> 705
MobileMail <eed7992f4c1d3050a7fb5d04f1534030> 2094 (jettisoned)
MobilePhone <8f3f3e982d9235acbff1e33881b0eb13> 1661 (jettisoned)
springboardservi <b74f5f58317031e9aef7e95744c816ca> 614
atc <1e5f2a595709376b97f7f0fa29368ef1> 1849
notification_pro <373a488638c436b48ef0801b212593c4> 173
notification_pro <373a488638c436b48ef0801b212593c4> 157
notification_pro <373a488638c436b48ef0801b212593c4> 177
syslog_relay <b07876a121a432d39d89daf531e8f2bd> 134
notification_pro <373a488638c436b48ef0801b212593c4> 173
afcd <c3cc9d594b523fd1902fb69add11c25d> 241
ptpd <62bc5573db7a352ab68409e87dc9abb9> 1153
networkd <80ba40030462385085b5b7e47601d48d> 268
aosnotifyd <8cf4ef51f0c635dc920be1d4ad81b322> 829
BTServer <31e82dfa7ccd364fb8fcc650f6194790> 561
aggregated <a12fa71e6997362c83e0c23d8b4eb5b7> 547
apsd <e7a29f2034083510b5439c0fb5de7ef1> 458
awdd <67774945965531e98d98c2e23a230526> 398
dataaccessd <473ff40f3bfd3f71b5e3b4335b2011ee> 1341
fairplayd.J1 <3884d48fa4393c73aa8b3febf95ca258> 1065
fseventsd <914b28fa8f8a362fabcc47294380c81c> 571
iapd <0a747292a113307abb17216274976be5> 741
imagent <9c3a4f75d1303349a53fc6555ea25cd7> 544
locationd <cf31b0cddd2d3791a2bfcd6033c99045> 897
mDNSResponder <86ccd4633a6c3c7caf44f51ce4aca96d> 388
mediaremoted <327f00bfc10b3820b4a74b9666b0c758> 370
mediaserverd <f03b746f09293fd39a6079c135e7ed00> 1398
wifid <3001cd0a61fe357d95f170247e5458f5> 512
powerd <133b7397f5603cf8bef209d4172d6c39> 268
lockdownd <b06de06b9f6939d3afc607b968841ab9> 542
CommCenterClassi <041d4491826e3c6b911943eddf6aaac9> 473
SpringBoard <c74dc89dec1c3392b3f7ac891869644a> 13198 (active)
configd <ee72b01d85c33a24b3548fa40fbe519c> 540
syslogd <7153b590e0353520a19b74a14654eaaa> 289
notifyd <f6a9aa19d33c3962aad3a77571017958> 255
UserEventAgent <dc32e6824fd33bf189b266102751314f> 605
launchd <5fec01c378a030a8bd23062689abb07f> 274
**End**
I'm using Xcode 4.6 and it happens with many different devices on different iOS versions.
Thanks
The report you added doesn't tell that much. It just shows the processes that were running at the time of the crash.
What you could do is try these two things:
Try to replicate the crash while debugging on the device and, as user dashdom told you, enable zombie objects.
Enable "logging" under Settings > Developer and connect the device to Xcode while running the app normally. Then use the Organizer (⌘ + ⇧ + 2), go to Devices > (Your device) > Console and see if anything pops up in the console. That could give you more of an idea of what is happening.
Hope that helped.

Apache benchmark: what does the total mean milliseconds represent?

I am benchmarking a PHP application with Apache Bench (ab). I have the server on my local machine. I run the following:
ab -n 100 -c 10 http://my-domain.local/
And get this:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 3.7 2 8
Processing: 311 734 276.1 756 1333
Waiting: 310 722 273.6 750 1330
Total: 311 737 278.9 764 1341
However, if I refresh the page http://my-domain.local/ in my browser, I find that it takes a lot longer than the 737 ms mean that ab reports to load the page (around 3000-4000 ms). I can repeat this many times and loading the page in the browser always takes at least 3000 ms.
I tested another, heavier page (page load in browser takes 8-10 seconds). I used a concurrency of 1 to simulate one user loading the page:
ab -n 100 -c 1 http://my-domain.local/heavy-page/
And the results are here:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 17 20 4.7 18 46
Waiting: 16 20 4.6 18 46
Total: 17 20 4.7 18 46
So what does the Total line in the ab results actually tell me? Clearly it's not the number of milliseconds the browser spends loading the page. Is the number of milliseconds it takes the browser to load the page (X) linearly dependent on the total mean milliseconds ab reports (Y)? So if I manage to halve Y, have I also halved X?
(Also, I'm not really sure what Processing, Waiting and Total mean.)
I'll reopen this question since I'm facing the problem again.
Recently I installed Varnish.
I run ab like this:
ab -n 100 http://my-domain.local/
Apache bench reports very fast response times:
Requests per second: 462.92 [#/sec] (mean)
Time per request: 2.160 [ms] (mean)
Time per request: 2.160 [ms] (mean, across all concurrent requests)
Transfer rate: 6131.37 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 1 2 2.3 1 13
Waiting: 0 1 2.0 1 12
Total: 1 2 2.3 1 13
So the time per request is about 2.2 ms. When I browse the site (as an anonymous user) the page load time is about 1.5 seconds.
Here is a picture from the Firebug Net tab. As you can see, my browser waits 1.68 seconds for my site to respond. Why is this number so much bigger than the request times ab reports?
Are you running ab on the server? Don't forget that your browser is local to you, on a remote network link. An ab run on the webserver itself will have almost zero network overhead and report basically the time it takes for Apache to serve up the page. Your home browser link will have however many milliseconds of network transit time added in, on top of the basic page-serving overhead.
OK, I think I know what the problem is. While I have been measuring the page load time in the browser I have been logged in, so my requests skip the cache and all the heavy stuff still happens; for ab's anonymous requests none of it does. The page load times in the browser as an anonymous user are closer to the ones ab is reporting.
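If you also want ab to exercise the logged-in path, ab can send a session cookie with its -C option; something along these lines, where the cookie name and value are placeholders for your real PHP session cookie:
ab -n 100 -c 1 -C PHPSESSID=your-session-id http://my-domain.local/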

redis config question?

I am using Redis for caching, but recently I ran into a problem with the amount of memory used: I had to restart my server since all RAM had been consumed.
It's not the biggest machine, but how should I configure Redis to avoid the same problem again?
free -m
total used free shared buffers cached
Mem: 240 222 17 0 6 38
-/+ buffers/cache: 177 62
Swap: 255 46 209
I have changed the following settings:
timeout 60
databases 1
save 300 1
save 60 100
maxmemory 104857600
top
top - 14:15:28 up 1:19, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 49 total, 1 running, 48 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 245956k total, 228420k used, 17536k free, 6916k buffers
Swap: 262136k total, 47628k used, 214508k free, 39540k cached
You can use the "maxmemory" directive in the config file: when that amount of memory is exceeded, Redis will evict keys that already have an expire set (the keys that would expire soonest are the first to be removed).
Unlike memcached, Redis is meant to be a database, so it won't automatically remove old values to make room for new ones.
You have to explicitly set an expire time for each key/value, and even then you could overflow if you create key/values faster than they expire.
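For example, with the redis-py client you can attach a TTL when writing a key; a minimal sketch (the key, value and 300-second TTL are placeholders):
import redis

r = redis.Redis(host="localhost", port=6379)
# Keep the cached value for 5 minutes; keys with a TTL can also be
# evicted early when the maxmemory limit described above is hit.
r.set("cache:front-page", "<html>...</html>", ex=300)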
Use Redis virtual memory in Redis 2.0:
http://antirez.com/post/redis-virtual-memory-story.html