Running the same crontab twice in a minute - scrapy

When I run my Scrapy spider manually, the first run executes the code but gives me 0 results; when I run it a second time, it crawls perfectly. This is fine when I do it manually, but when I run it from crontab it does not produce any results. I get this (I deleted the time data):
{'downloader/request_bytes': 221,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 116972,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(xxx, x, xxx, xx, xx, xx, xxxx),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
When I run it manually I receive 9 results:
{'downloader/request_bytes': 4696,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 202734,
'downloader/response_count': 10,
'downloader/response_status_count/200': 10,
'dupefilter/filtered': 9,
'finish_reason': 'finished',
'finish_time': datetime.datetime(xxx, x, xx, xx, xx, xx, xxxxxx),
'item_scraped_count': 9,
'log_count/DEBUG': 21,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'request_depth_max': 2,
'response_received_count': 10,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
What am I doing wrong?
Also, if I run the same crontab job a second time within a minute, will it produce the results? If so, how do I do that?

Can you show the full command you use to run the spider in cron?
Also try adding -L DEBUG to the crawl command to see more detail.
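For reference, a crontab entry for a Scrapy spider usually needs to change into the project directory and use absolute paths, since cron runs with a minimal environment. A minimal sketch, assuming the project lives at /home/user/myproject and Scrapy is installed in a virtualenv at /home/user/venv (the paths, spider name, schedule, and output file are all placeholders for illustration):

# hypothetical paths; adjust to your own project, virtualenv, and schedule
0 * * * * cd /home/user/myproject && /home/user/venv/bin/scrapy crawl myspider -o /home/user/results.json >> /home/user/scrapy-cron.log 2>&1

Redirecting stdout and stderr to a log file makes it easier to compare the cron run with your manual run.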

Related

Outliers in data

I have a dataset like so -
15643, 14087, 12020, 8402, 7875, 3250, 2688, 2654, 2501, 2482, 1246, 1214, 1171, 1165, 1048, 897, 849, 579, 382, 285, 222, 168, 115, 92, 71, 57, 56, 51, 47, 43, 40, 31, 29, 29, 29, 29, 28, 22, 20, 19, 18, 18, 17, 15, 14, 14, 12, 12, 11, 11, 10, 9, 9, 8, 8, 8, 8, 7, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Based on domain knowledge, I know that the larger values are the only ones we want to include in our analysis. How do I determine where to cut off the analysis? Should the cutoff be "exclude 15 and lower", "exclude 50 and lower", etc.?
You can check the distribution with the quantile function and then remove values below the 1st or 2nd percentile. Here is an example:
import numpy as np

data = np.array(data)  # data is the list of values from the question
print(np.quantile(data, (.01, .02)))
Another method is to calculate the interquartile range (IQR) and set the lower bound for the analysis at Q1 - 1.5*IQR:
Q1, Q3 = np.quantile(data, (0.25, 0.75))
data_floor = Q1 - 1.5 * (Q3 - Q1)
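With the floor computed above, one way to apply the cutoff (a minimal sketch, assuming data is the NumPy array from the snippet) is:

filtered = data[data >= data_floor]  # keep only the larger values for the analysis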

Increasing the label size in a matplotlib pie chart

I have the following dictionary
{'Electronic Arts': 66,
'GT Interactive': 1,
'Palcom': 1,
'Fox Interactive': 1,
'LucasArts': 5,
'Bethesda Softworks': 9,
'SquareSoft': 3,
'Nintendo': 142,
'Virgin Interactive': 4,
'Atari': 7,
'Ubisoft': 28,
'Konami Digital Entertainment': 11,
'Hasbro Interactive': 1,
'MTV Games': 1,
'Sega': 11,
'Enix Corporation': 4,
'Capcom': 13,
'Warner Bros. Interactive Entertainment': 7,
'Acclaim Entertainment': 1,
'Universal Interactive': 1,
'Namco Bandai Games': 7,
'Eidos Interactive': 9,
'THQ': 7,
'RedOctane': 1,
'Sony Computer Entertainment Europe': 3,
'Take-Two Interactive': 24,
'Square Enix': 5,
'Microsoft Game Studios': 22,
'Disney Interactive Studios': 2,
'Vivendi Games': 2,
'Sony Computer Entertainment': 52,
'Activision': 45,
'505 Games': 4}
The problem I am facing is viewing the labels: they are extremely small and barely visible.
Can anyone suggest how to increase the label size?
I have tried the code below:
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(),labels=vg_dict.keys())
plt.show()
Add the textprops argument to the plt.pie call:
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 30})
plt.show()
You can check all the properties of the Text object in the matplotlib documentation.
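Since textprops is simply passed through to the label Text objects, any Text property can go in that dict; as an illustrative sketch (the specific values are just examples), this also bolds the labels:

plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 30, 'fontweight': 'bold'})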
Update
I don't know whether the order of your labels matters. To avoid overlapping labels, you can try modifying the start angle (matplotlib starts drawing the pie counterclockwise from the x-axis) and re-ordering the "crowded" labels:
vg_dict = {
'Palcom': 1,
'Electronic Arts': 66,
'GT Interactive': 1,
'LucasArts': 5,
'Bethesda Softworks': 9,
'SquareSoft': 3,
'Nintendo': 142,
'Virgin Interactive': 4,
'Atari': 7,
'Ubisoft': 28,
'Hasbro Interactive': 1,
'Konami Digital Entertainment': 11,
'MTV Games': 1,
'Sega': 11,
'Enix Corporation': 4,
'Capcom': 13,
'Acclaim Entertainment': 1,
'Warner Bros. Interactive Entertainment': 7,
'Universal Interactive': 1,
'Namco Bandai Games': 7,
'Eidos Interactive': 9,
'THQ': 7,
'RedOctane': 1,
'Sony Computer Entertainment Europe': 3,
'Take-Two Interactive': 24,
'Vivendi Games': 2,
'Square Enix': 5,
'Microsoft Game Studios': 22,
'Disney Interactive Studios': 2,
'Sony Computer Entertainment': 52,
'Fox Interactive': 1,
'Activision': 45,
'505 Games': 4}
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 35}, startangle=-35)
plt.show()
Result:

Scrapy Pausing and resuming crawls, results directory

I have finished a scraping project using resume mode, but I don't know where the results are.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
I looked at https://docs.scrapy.org/en/latest/topics/jobs.html, but it does not say anything about this.
Where is the file with the results?
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-10 23:31:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'bans/error/twisted.internet.error.ConnectionRefusedError': 2,
'bans/error/twisted.internet.error.TimeoutError': 6891,
'bans/error/twisted.web._newclient.ResponseNeverReceived': 8424,
'bans/status/500': 9598,
'bans/status/503': 56,
'downloader/exception_count': 15339,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6891,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8424,
'downloader/request_bytes': 9530,
'downloader/request_count': 172,
'downloader/request_method_count/GET': 172,
'downloader/response_bytes': 1848,
'downloader/response_count': 170,
'downloader/response_status_count/200': 169,
'downloader/response_status_count/500': 9,
'downloader/response_status_count/503': 56,
'elapsed_time_seconds': 1717,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 9, 11, 2, 31, 31, 32),
'httperror/response_ignored_count': 67,
'httperror/response_ignored_status_count/500': 67,
'item_scraped_count': 120,
'log_count/DEBUG': 357,
'log_count/ERROR': 119,
'log_count/INFO': 1764,
'log_count/WARNING': 240,
'proxies/dead': 1,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 169,
'retry/count': 1019,
'retry/max_reached': 93,
'retry/reason_count/500 Internal Server Error': 867,
'retry/reason_count/twisted.internet.error.TimeoutError': 80,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 72,
'scheduler/dequeued': 1722,
'scheduler/dequeued/disk': 1722,
'scheduler/enqueued': 1722,
'scheduler/enqueued/disk': 1722,
'start_time': datetime.datetime(2015, 9, 9, 2, 48, 56, 908)}
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Spider closed (finished)
(Face python 3.8) D:\Selenium\Face python 3.8\TORBUSCADORDELINKS\TORBUSCADORDELINKS\spiders>
Your command,
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
does not indicate an output file path.
Because of that, your results were not written anywhere.
Use the -o command-line switch to specify an output path.
See also the Scrapy tutorial, which covers this. Or run scrapy crawl --help.
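For example, resuming the same job while also writing the scraped items to a feed file could look like this (results.json is just a placeholder name; a .csv or .jl extension works the same way):

scrapy crawl somespider -s JOBDIR=crawls/somespider-1 -o results.json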

How to print more than 32 values?

Does anyone know how to print more than 32 values? My output looks like this, and I'm trying to make it show the rest of the array:
Value of: model.GetOutput(0)
Expected: contains 64 values, where each value and its corresponding value in { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, ... } are an almost-equal pair
Actual: { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... }, where the value pair (1, 2) at index #1 don't match, which is 1 from 1
The limit is hard-coded in the Google Test sources (kMaxCount = 32). To change it, you have to modify the code and rebuild Google Test. You might be able to define your own printer if the type is specific enough.

MultiPoint crossover using Numpy

I am trying to do crossover on a genetic algorithm population using NumPy.
I have sliced the population into parent 1 and parent 2:
import numpy as np

population = np.random.randint(2, size=(4, 8))
p1 = population[::2]   # even-indexed rows as parent 1
p2 = population[1::2]  # odd-indexed rows as parent 2
But I am not able to figure out any lambda or NumPy command to do a multi-point crossover on the parents.
The idea is to take the ith row of p1 and randomly swap some bits with the ith row of p2.
I think you want to select from p1 and p2 at random, cell by cell.
To make it easier to follow, I've changed p1 to hold values from 10 to 15 and p2 to hold values from 20 to 25; both were generated at random in those ranges.
In [66]: p1
Out[66]:
array([[15, 15, 13, 14, 12, 13, 12, 12],
[14, 11, 11, 10, 12, 12, 10, 12],
[12, 11, 14, 15, 14, 10, 13, 10],
[11, 12, 10, 13, 14, 13, 12, 13]])
In [67]: p2
Out[67]:
array([[23, 25, 24, 21, 24, 20, 24, 25],
[21, 21, 20, 20, 25, 22, 24, 22],
[24, 22, 25, 20, 21, 22, 21, 22],
[22, 20, 21, 22, 25, 23, 22, 21]])
In [68]: sieve=np.random.randint(2, size=(4,8))
In [69]: sieve
Out[69]:
array([[0, 1, 0, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 0, 1, 1, 1],
[0, 1, 1, 0, 0, 1, 1, 0],
[0, 0, 0, 1, 1, 1, 1, 1]])
In [70]: not_sieve = sieve ^ 1  # complement of sieve
In [71]: pn = p1*sieve + p2*not_sieve
In [72]: pn
Out[72]:
array([[23, 15, 24, 14, 12, 20, 12, 25],
[14, 11, 11, 20, 25, 12, 10, 12],
[24, 11, 14, 20, 21, 10, 13, 22],
[22, 20, 21, 13, 14, 13, 12, 13]])
The numbers in the teens come from p1 where sieve is 1, and the numbers in the twenties come from p2 where sieve is 0.
This could probably be made more efficient, but is this the output you expect?
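As a side note, the same cell-by-cell selection can be written with np.where; a minimal sketch, assuming sieve is the 0/1 mask built above:

pn = np.where(sieve == 1, p1, p2)  # take from p1 where the mask is 1, otherwise from p2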