Scrapy Pausing and resuming crawls, results directory - scrapy

I have finished a scraping project using resume mode, but I don't know where the results are.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
I looked at https://docs.scrapy.org/en/latest/topics/jobs.html, but it does not say anything about this.
Where is the file with the results?
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-10 23:31:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'bans/error/twisted.internet.error.ConnectionRefusedError': 2,
'bans/error/twisted.internet.error.TimeoutError': 6891,
'bans/error/twisted.web._newclient.ResponseNeverReceived': 8424,
'bans/status/500': 9598,
'bans/status/503': 56,
'downloader/exception_count': 15339,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6891,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8424,
'downloader/request_bytes': 9530,
'downloader/request_count': 172,
'downloader/request_method_count/GET': 172,
'downloader/response_bytes': 1848,
'downloader/response_count': 170,
'downloader/response_status_count/200': 169,
'downloader/response_status_count/500': 9,
'downloader/response_status_count/503': 56,
'elapsed_time_seconds': 1717,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 11, 2, 31, 31, 32),
'httperror/response_ignored_count': 67,
'httperror/response_ignored_status_count/500': 67,
'item_scraped_count': 120,
'log_count/DEBUG': 357,
'log_count/ERROR': 119,
'log_count/INFO': 1764,
'log_count/WARNING': 240,
'proxies/dead': 1,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 169,
'retry/count': 1019,
'retry/max_reached': 93,
'retry/reason_count/500 Internal Server Error': 867,
'retry/reason_count/twisted.internet.error.TimeoutError': 80,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 72,
'scheduler/dequeued': 1722,
'scheduler/dequeued/disk': 1722,
'scheduler/enqueued': 1722,
'scheduler/enqueued/disk': 1722,
'start_time': datetime.datetime(2020, 9, 9, 2, 48, 56, 908)}
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Spider closed (finished)
(Face python 3.8) D:\Selenium\Face python 3.8\TORBUSCADORDELINKS\TORBUSCADORDELINKS\spiders>

Your command,
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
does not specify an output file, so the scraped items were not written anywhere.
Use the -o command-line option to specify an output path.
See also the Scrapy tutorial, which covers this, or run scrapy crawl --help.
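For example (results.json is an arbitrary file name; Scrapy infers the feed format from the extension, so .csv or .jsonl work as well):
scrapy crawl somespider -s JOBDIR=crawls/somespider-1 -o results.json
The JOBDIR directory only stores the scheduler state used for pausing and resuming a crawl; it never contains the scraped items.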

Related

Scrapy Pagination - Works for 2 pages but not after that

I'm crawling the cdw.com website. For a given URL, there are around 17 pages. The script I have written is able to fetch data from page 1 and page 2. The spider closes on its own after giving the results of the first 2 pages. Please let me know how I can fetch the data for the remaining 15 pages.
TIA.
import scrapy
from cdwfinal.items import CdwfinalItem
from scrapy.selector import Selector
import datetime
import pandas as pd
import time

class CdwSpider(scrapy.Spider):
    name = 'cdw'
    allowed_domains = ['www.cdw.com']
    start_urls = ['http://www.cdw.com/']
    base_url = 'http://www.cdw.com'

    def start_requests(self):
        yield scrapy.Request(url='https://www.cdw.com/search/?key=axiom', callback=self.parse)

    def parse(self, response):
        item = []
        hxs = Selector(response)
        item = CdwfinalItem()
        abc = hxs.xpath('//*[@id="main"]//*[@class="grid-row"]')
        for i in range(len(abc)):
            try:
                item['mpn'] = hxs.xpath("//div[contains(@class,'search-results')]/div[contains(@class,'search-result')][" + str(i + 1) + "]//*[@class='mfg-code']/text()").extract()
            except:
                item['mpn'] = 'NA'
            try:
                item['part_no'] = hxs.xpath("//div[contains(@class,'search-results')]/div[contains(@class,'search-result')][" + str(i + 1) + "]//*[@class='cdw-code']/text()").extract()
            except:
                item['part_no'] = 'NA'
            yield item
        next_page = hxs.xpath('//*[@id="main"]//*[@class="no-hover" and @aria-label="Next Page"]').extract()
        if next_page:
            new_page_href = hxs.xpath('//*[@id="main"]//*[@class="no-hover" and @aria-label="Next Page"]/@href').extract_first()
            new_page_url = response.urljoin(new_page_href)
            yield scrapy.Request(new_page_url, callback=self.parse, meta={"searchword": '123'})
LOG:
2023-02-11 15:39:55 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36
2023-02-11 15:39:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdw.com/search/?key=axiom&pcurrent=3> (referer: https://www.cdw.com/search/?key=axiom&pcurrent=2) ['cached']
2023-02-11 15:39:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-11 15:39:55 [scrapy.extensions.feedexport] INFO: Stored csv feed (48 items) in: Test5.csv
2023-02-11 15:39:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2178,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 68059,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'elapsed_time_seconds': 1.30903,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 11, 10, 9, 55, 327740),
'httpcache/hit': 3,
'httpcompression/response_bytes': 384267,
'httpcompression/response_count': 3,
'item_scraped_count': 48,
'log_count/DEBUG': 62,
'log_count/INFO': 11,
'log_count/WARNING': 45,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2023, 2, 11, 10, 9, 54, 18710)}
Your next_page selector is failing to extract the link for the next page. In general your selectors are more complicated than they need to be; for example, you should be using relative XPath expressions in your for loop.
Here is an example that replicates the same behaviour as your spider, except using much simpler selectors, and it successfully extracts the results from all of the pages.
import scrapy

class CdwSpider(scrapy.Spider):
    name = 'cdw'
    allowed_domains = ['www.cdw.com']
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(url='https://www.cdw.com/search/?key=axiom', callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//div[@class="grid-row"]'):
            mpn = row.xpath(".//span[@class='mfg-code']/text()").get()
            cdw = row.xpath('.//span[@class="cdw-code"]/text()').get()
            yield {"mpn": mpn, "part_no": cdw}
        current = response.css("div.search-pagination-active")
        next_page = current.xpath('./following-sibling::a/@href').get()
        if next_page:
            new_page_url = response.urljoin(next_page)
            yield scrapy.Request(new_page_url, callback=self.parse)
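To write the scraped items to a file, the spider can be run with Scrapy's feed export, e.g. (the output file name is arbitrary):
scrapy crawl cdw -o results.csv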
EDIT
The only non-default setting I am using is the user agent.
I have made adjustments in the example above to reflect that.
Partial output:
2023-02-11 22:10:58 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-11 22:10:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 106555,
'downloader/request_count': 41,
'downloader/request_method_count/GET': 41,
'downloader/response_bytes': 1099256,
'downloader/response_count': 41,
'downloader/response_status_count/200': 41,
'elapsed_time_seconds': 22.968986,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 12, 6, 10, 58, 700080),
'httpcompression/response_bytes': 7962149,
'httpcompression/response_count': 41,
'item_scraped_count': 984,
'log_count/DEBUG': 1028,
'log_count/INFO': 10,
'request_depth_max': 40,
'response_received_count': 41,
'scheduler/dequeued': 41,
'scheduler/dequeued/memory': 41,
'scheduler/enqueued': 41,
'scheduler/enqueued/memory': 41,
'start_time': datetime.datetime(2023, 2, 12, 6, 10, 35, 731094)}
2023-02-11 22:10:58 [scrapy.core.engine] INFO: Spider closed (finished)

Null values at the end of rows after INSERT INTO

I am currently trying to INSERT a row of 144 columns into my SQL database.
The problem is that the last 10 values of the new row are NULL, while they are supposed to be float and int.
That's an example of what I have in my DB after the INSERT INTO:
First column | Before last column | Last column
1            | NULL               | NULL
This is the SQL statement I am using:
INSERT INTO "historic_data2"
VALUES (28438, 163, 156, 1, 'FIST 2', 91, 81, 82, 84, 90, 6, '2 Pts Int M', 'Offensive', 0, '91_81_82_84_90', 86, 85, 0, 36, 62, 24, 0, 132, 86, 0, 83, 0, 0, 0, 0, 42, 77, 24, 0, 173, 107, 0, 204, 0, 0, 0, 0, 42, 77, 24, 0, 173, 107, 0, 204, 0, 0, 0, 81, 62, 34, 23, 19, 45, 32, 18, 9, 19, 0.5555555555555556, 0.5161290322580645, 0.5294117647058824, 0.391304347826087, 1.0, 82, 54, 34, 18, 28, 49, 27, 17, 8, 28, 0.5975609756097561, 0.5, 0.5, 0.4444444444444444, 1.0, 302, 233, 132, 89, 69, 168, 116, 69, 35, 69, 0.5562913907284768, 0.4978540772532189, 0.5227272727272727, 0.39325842696629215, 1.0, 214, 161, 84, 73, 53, 119, 79, 39, 36, 53, 0.5560747663551402, 0.4906832298136646, 0.4642857142857143, 0.4931506849315068, 1.0, 717, 544, 298, 233, 173, 416, 285, 175, 97, 173, 0.5801952580195258, 0.5238970588235294, 0.587248322147651, 0.41630901287553645, 1.0, 466, 315, 183, 128, 151, 357, 233, 138, 91, 151, 0.7660944206008584, 0.7396825396825397, 0.7540983606557377, 0.7109375,1.0,112)
I can't figure out how to solve this issue. My guess would be that there is a hard limit on how many columns you can insert at once, but I don't know how to work around it.
Thank you in advance for your help.
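For what it's worth, naming the target columns explicitly is a quick way to surface a count mismatch: some engines accept a bare VALUES list that is shorter than the table and silently fill the remaining columns with NULL, whereas an explicit column list whose length disagrees with the VALUES list is rejected outright. A minimal sketch with a hypothetical three-column table (not your actual schema):
CREATE TABLE example_data (id INT, score FLOAT, attempts INT);
-- with an explicit column list, a wrong value count raises an error instead of inserting NULLs
INSERT INTO example_data (id, score, attempts)
VALUES (1, 0.5555555555555556, 112);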

Ignite exception with Too many open files even though ulimit shows "open files (-n) 1048576"

I stopped one Ignite server and restarted it again, and it throws an exception with "Too many open files". I changed the ulimit for open files with
ulimit -n 1048576
and checked that the number changed, but Ignite still could not start.
# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 15083
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 15083
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The error log:
>>> VM name: 7493#test-server-node1
>>> Local node [ID=07F093E4-BBEF-471C-9046-4D1A50B84087, order=9, clientMode=false]
>>> Local node addresses: [san011.fr.alcatel-lucent.com/0:0:0:0:0:0:0:1%lo, 10.0.2.15/10.0.2.15, /127.0.0.1, /192.168.100.11]
>>> Local ports: TCP:10800 TCP:11090 TCP:11211 TCP:47100 UDP:47400 TCP:47500
[19:13:57,052][INFO][main][GridDiscoveryManager] Topology snapshot [ver=9, servers=2, clients=1, CPUs=5, offheap=1.5GB, heap=4.0GB]
[19:13:57,052][INFO][main][GridDiscoveryManager] Data Regions Configured:
[19:13:57,052][INFO][main][GridDiscoveryManager] ^-- default [initSize=256.0 MiB, maxSize=758.3 MiB, persistenceEnabled=true]
[19:13:57,255][INFO][sys-#59][GridDhtPartitionDemander] Completed (final) rebalancing [fromNode=918b2b4e-f98e-4faf-bffd-8f9d1dd97bf3, cacheOrGroup=TxCoinLatestInfoCache, topology=AffinityTopologyVersion [topVer=9, minorTopVer=0], time=307 ms]
[19:13:57,256][INFO][sys-#59][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=918b2b4e-f98e-4faf-bffd-8f9d1dd97bf3, partitionsCount=512, topology=AffinityTopologyVersion [topVer=9, minorTopVer=0], updateSeq=1]
[19:13:57,490][INFO][sys-#45][GridDhtPartitionDemander] Completed (final) rebalancing [fromNode=918b2b4e-f98e-4faf-bffd-8f9d1dd97bf3, cacheOrGroup=TxCoinMinInfoCache, topology=AffinityTopologyVersion [topVer=9, minorTopVer=0], time=232 ms]
[19:13:57,491][INFO][sys-#45][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=918b2b4e-f98e-4faf-bffd-8f9d1dd97bf3, partitionsCount=512, topology=AffinityTopologyVersion [topVer=9, minorTopVer=0], updateSeq=1]
[19:13:57,828][SEVERE][sys-#57][NodeInvalidator] Critical error with null is happened. All further operations will be failed and local node will be stopped.
class org.apache.ignite.internal.processors.cache.persistence.file.PersistentStorageIOException: Could not initialize file: part-347.bin
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:445)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.read(FilePageStore.java:332)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:322)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:306)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:655)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:575)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.getOrAllocatePartitionMetas(GridCacheOffheapManager.java:1132)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1030)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.updateCounter(GridCacheOffheapManager.java:1265)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.updateCounter(GridDhtLocalPartition.java:849)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:697)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:375)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:354)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1060)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1609)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1555)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:126)
at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2751)
at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1515)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:126)
at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1484)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.FileSystemException: /usr/share/apache-ignite/work/db/node00-d2cb44e3-b649-4e2e-b6f9-f08f9ae1b3af/cache-TxCoinMinInfoToDbCache/part-347.bin: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:196)
at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:248)
at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:301)
at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.<init>(AsyncFileIO.java:57)
at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:53)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:428)
... 26 more
[19:13:57,856][SEVERE][sys-#57][GridCacheIoManager] Failed processing message [senderId=918b2b4e-f98e-4faf-bffd-8f9d1dd97bf3, msg=GridDhtPartitionSupplyMessage [updateSeq=1, topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0], missed=null, clean=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99... and 412 more], msgSize=16500, estimatedKeysCnt=1, size=512, parts=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99... and 412 more], super=GridCacheGroupIdMessage [grpId=-607232546]]]
class org.apache.ignite.IgniteException: Could not initialize file: part-347.bin
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.updateCounter(GridCacheOffheapManager.java:1271)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.updateCounter(GridDhtLocalPartition.java:849)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:697)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:375)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:354)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1060)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1609)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1555)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:126)
at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2751)
at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1515)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:126)
at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1484)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.processors.cache.persistence.file.PersistentStorageIOException: Could not initialize file: part-347.bin
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:445)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.read(FilePageStore.java:332)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:322)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:306)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:655)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:575)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.getOrAllocatePartitionMetas(GridCacheOffheapManager.java:1132)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1030)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.updateCounter(GridCacheOffheapManager.java:1265)
... 18 more
Caused by: java.nio.file.FileSystemException: /usr/share/apache-ignite/work/db/node00-d2cb44e3-b649-4e2e-b6f9-f08f9ae1b3af/cache-TxCoinMinInfoToDbCache/part-347.bin: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:196)
at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:248)
at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:301)
at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.<init>(AsyncFileIO.java:57)
at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:53)
at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:428)
... 26 more
[19:13:57,874][SEVERE][upd-ver-checker][GridUpdateNotifier] Runtime error caught during grid runnable execution: GridWorker [name=grid-version-checker, igniteInstanceName=null, finished=false, hashCode=73805044, interrupted=false, runner=upd-ver-checker]
java.lang.ExceptionInInitializerError
at javax.crypto.JceSecurityManager.<clinit>(JceSecurityManager.java:65)
at javax.crypto.Cipher.getConfiguredPermission(Cipher.java:2586)
at javax.crypto.Cipher.getMaxAllowedKeyLength(Cipher.java:2610)
at sun.security.ssl.CipherSuite$BulkCipher.isUnlimited(CipherSuite.java:535)
at sun.security.ssl.CipherSuite$BulkCipher.<init>(CipherSuite.java:507)
at sun.security.ssl.CipherSuite.<clinit>(CipherSuite.java:614)
at sun.security.ssl.SSLContextImpl.getApplicableCipherSuiteList(SSLContextImpl.java:294)
at sun.security.ssl.SSLContextImpl.access$100(SSLContextImpl.java:42)
at sun.security.ssl.SSLContextImpl$AbstractTLSContext.<clinit>(SSLContextImpl.java:425)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at java.security.Provider$Service.getImplClass(Provider.java:1634)
at java.security.Provider$Service.newInstance(Provider.java:1592)
at sun.security.jca.GetInstance.getInstance(GetInstance.java:236)
at sun.security.jca.GetInstance.getInstance(GetInstance.java:164)
at javax.net.ssl.SSLContext.getInstance(SSLContext.java:156)
at javax.net.ssl.SSLContext.getDefault(SSLContext.java:96)
at javax.net.ssl.SSLSocketFactory.getDefault(SSLSocketFactory.java:122)
at javax.net.ssl.HttpsURLConnection.getDefaultSSLSocketFactory(HttpsURLConnection.java:332)
at javax.net.ssl.HttpsURLConnection.<init>(HttpsURLConnection.java:289)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.<init>(HttpsURLConnectionImpl.java:94)
at sun.net.www.protocol.https.Handler.openConnection(Handler.java:62)
at sun.net.www.protocol.https.Handler.openConnection(Handler.java:57)
at java.net.URL.openConnection(URL.java:979)
at org.apache.ignite.internal.processors.cluster.HttpIgniteUpdatesChecker.getUpdates(HttpIgniteUpdatesChecker.java:59)
at org.apache.ignite.internal.processors.cluster.GridUpdateNotifier$UpdateChecker.body(GridUpdateNotifier.java:268)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at org.apache.ignite.internal.processors.cluster.GridUpdateNotifier$1.run(GridUpdateNotifier.java:113)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.SecurityException: Can not initialize cryptographic mechanism
at javax.crypto.JceSecurity.<clinit>(JceSecurity.java:93)
... 29 more
Caused by: java.security.PrivilegedActionException: java.io.FileNotFoundException: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/security/policy/unlimited/US_export_policy.jar (Too many open files)
at java.security.AccessController.doPrivileged(Native Method)
at javax.crypto.JceSecurity.<clinit>(JceSecurity.java:82)
... 29 more
Caused by: java.io.FileNotFoundException: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/security/policy/unlimited/US_export_policy.jar (Too many open files)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:225)
at java.util.zip.ZipFile.<init>(ZipFile.java:155)
at java.util.jar.JarFile.<init>(JarFile.java:166)
at java.util.jar.JarFile.<init>(JarFile.java:130)
at javax.crypto.JceSecurity.loadPolicies(JceSecurity.java:353)
at javax.crypto.JceSecurity.setupJurisdictionPolicies(JceSecurity.java:323)
at javax.crypto.JceSecurity.access$000(JceSecurity.java:50)
at javax.crypto.JceSecurity$1.run(JceSecurity.java:85)
... 31 more
[19:13:57,897][INFO][node-stopper][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[19:13:57,929][INFO][node-stopper][GridJettyRestProtocol] Command protocol successfully stopped: Jetty REST
[19:13:57,935][INFO][node-stopper][GridDhtPartitionDemander] Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion [topVer=9, minorTopVer=0]]
[19:13:57,939][SEVERE][db-checkpoint-thread-#40][GridCacheDatabaseSharedManager] Runtime error caught during grid runnable execution: GridWorker [name=db-checkpoint-thread, igniteInstanceName=null, finished=false, hashCode=1713594100, interrupted=false, runner=db-checkpoint-thread-#40]
class org.apache.ignite.IgniteException: Failed to perform WAL operation (environment was invalidated by a previous error)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.beforeReleaseWrite(PageMemoryImpl.java:1490)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1349)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:415)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:409)
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeUnlock(PageHandler.java:377)
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:198)
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.releaseAndClose(PagesList.java:359)
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.saveMetadata(PagesList.java:318)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:190)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onCheckpointBegin(GridCacheOffheapManager.java:167)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:2986)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:2754)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2679)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.pagemem.wal.StorageException: Failed to perform WAL operation (environment was invalidated by a previous error)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.checkNode(FileWriteAheadLogManager.java:1354)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.access$7700(FileWriteAheadLogManager.java:130)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.addRecord(FileWriteAheadLogManager.java:2509)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$1900(FileWriteAheadLogManager.java:2419)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:700)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.beforeReleaseWrite(PageMemoryImpl.java:1486)
... 14 more
[19:13:57,952][INFO][tcp-disco-sock-reader-#8][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.100.13:32777, rmtPort=32777
[19:13:57,978][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
[19:13:57,979][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=TxCoinMinInfoToDbCache]
[19:13:57,980][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=TxCoinMinInfoCache]
[19:13:57,981][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=TxCoinLatestInfoCache]
[19:13:57,983][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=datastructures_ATOMIC_PARTITIONED_1#default-ds-group, group=default-ds-group]
[19:13:57,985][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-atomic-cache#default-ds-group, group=default-ds-group]
[19:13:57,986][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=TradeCoinInfoCache]
[19:13:57,986][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=LvOneTxCache]
[19:13:57,986][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=CoinTypeListCache]
[19:13:57,987][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=MatchResultRecordCache]
[19:14:02,540][INFO][node-stopper][GridDeploymentLocalStore] Removed undeployed class: GridDeployment [ts=1525259633995, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader#764c12b6, clsLdrId=fa6be802361-07f093e4-bbef-471c-9046-4d1a50b84087, userVer=0, loc=true, sampleClsName=org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionFullMap, pendingUndeploy=false, undeployed=true, usage=0]
[19:14:02,550][INFO][node-stopper][IgniteKernal]
>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.4.0#20180305-sha1:aa342270b13cc1f4713382a8eb23b2eb7edaa3a5 stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Grid uptime: 00:00:05.510
But if I manually change it to
ulimit -n 65535
the node can be restarted normally and comes back to the cluster.
I have tried several times; it can always be reproduced.
Please get the count of open handles with the command sudo lsof -u user | wc -l, where user is the user name.
Check the system configuration for file descriptors with sudo sysctl fs.file-nr. You can increase the limit in the file /etc/sysctl.conf.
Also check your application for proper closing of file resources, and work out which process is consuming the file descriptors.
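For example (the user name ignite and the limit values below are assumptions; substitute your own):
# count the file descriptors held by the user running the Ignite node
sudo lsof -u ignite | wc -l
# show allocated, unused, and maximum file handles system-wide
sudo sysctl fs.file-nr
# make the per-user limit persistent by adding lines like these to /etc/security/limits.conf
ignite soft nofile 1048576
ignite hard nofile 1048576
Note that ulimit -n only affects the current shell and its children, so a limit raised in one terminal does not apply to a node started from another session or by a service manager.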

Running the same crontab twice in a minute

When I run my spider in Scrapy manually, the first time it executes the code but gives me 0 results. Yet when I run it a second time, it crawls perfectly. This is fine when I do it manually, but when I run it in crontab it does not produce any result. I get this (I deleted the time data):
{'downloader/request_bytes': 221,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 116972,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(xxx, x, xxx, xx, xx, xx, xxxx),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
When I run it manually I receive 9 results:
{'downloader/request_bytes': 4696,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 202734,
'downloader/response_count': 10,
'downloader/response_status_count/200': 10,
'dupefilter/filtered': 9,
'finish_reason': 'finished',
'finish_time': datetime.datetime(xxx, x, xx, xx, xx, xx, xxxxxx),
'item_scraped_count': 9,
'log_count/DEBUG': 21,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'request_depth_max': 2,
'response_received_count': 10,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
What am I doing wrong?
And if I run the same crontab job a second time within a minute, will it produce the results? If so, how do I do that?
Can you show the full command you use to run it in cron?
Also try adding -L DEBUG to the crawl command to see more.
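For illustration, a crontab can run the same job twice within a minute by staggering a second entry with sleep; the project path, scrapy path, and spider name below are assumptions, so substitute your own:
* * * * * cd /path/to/project && /usr/local/bin/scrapy crawl myspider -o output.json -L DEBUG >> /tmp/myspider.log 2>&1
* * * * * sleep 30; cd /path/to/project && /usr/local/bin/scrapy crawl myspider -o output.json -L DEBUG >> /tmp/myspider.log 2>&1
Changing into the project directory matters: Scrapy locates scrapy.cfg from the working directory, which is a common reason a spider works in an interactive shell but yields nothing under cron.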

Convert ECC PKCS#8 public and private keys to traditional format

I have an ECC public and private key pair generated with BouncyCastle:
Security.addProvider(new org.bouncycastle.jce.provider.BouncyCastleProvider());
ECNamedCurveParameterSpec ecSpec = ECNamedCurveTable
.getParameterSpec("secp192r1");
KeyPairGenerator g = KeyPairGenerator.getInstance("ECDSA", "BC");
g.initialize(ecSpec, new SecureRandom());
KeyPair pair = g.generateKeyPair();
System.out.println(Arrays.toString(pair.getPrivate().getEncoded()));
System.out.println(Arrays.toString(pair.getPublic().getEncoded()));
byte[] privateKey = new byte[]{48, 123, 2, 1, 0, 48, 19, 6, 7, 42, -122, 72, -50, 61, 2, 1, 6, 8, 42, -122, 72, -50, 61, 3, 1, 1, 4, 97, 48, 95, 2, 1, 1, 4, 24, 14, 117, 7, -120, 15, 109, -59, -35, 72, -91, 99, -2, 51, -120, 112, -47, -1, -115, 25, 48, -104, -93, 78, -7, -96, 10, 6, 8, 42, -122, 72, -50, 61, 3, 1, 1, -95, 52, 3, 50, 0, 4, 64, 48, -104, 32, 41, 13, 1, -75, -12, -51, -24, -13, 56, 75, 19, 74, -13, 75, -82, 35, 1, -50, -93, -115, -115, -34, -81, 119, -109, -50, -39, -57, -20, -67, 65, -50, 66, -122, 96, 84, 117, -49, -101, 54, -30, 77, -110, -122}
byte[] publicKey = new byte[]{48, 73, 48, 19, 6, 7, 42, -122, 72, -50, 61, 2, 1, 6, 8, 42, -122, 72, -50, 61, 3, 1, 1, 3, 50, 0, 4, 64, 48, -104, 32, 41, 13, 1, -75, -12, -51, -24, -13, 56, 75, 19, 74, -13, 75, -82, 35, 1, -50, -93, -115, -115, -34, -81, 119, -109, -50, -39, -57, -20, -67, 65, -50, 66, -122, 96, 84, 117, -49, -101, 54, -30, 77, -110, -122}
How can I convert them into the traditional raw format that can be reused later with https://github.com/kmackay/micro-ecc/blob/master/uECC.h? I need a 24-byte private key and a 48-byte public key, while now they are 125 and 75 bytes.
This gives 24 and 48, but sometimes 25 or 49 when a 0 is added at the beginning:
ECPrivateKey ecPrivateKey = (ECPrivateKey)privateKey;
System.out.println(ecPrivateKey.getS().toByteArray().length);
ECPublicKey ecPublicKey = (ECPublicKey)publicKey;
System.out.println(ecPublicKey.getW().getAffineX().toByteArray().length + ecPublicKey.getW().getAffineY().toByteArray().length);
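A minimal sketch of the remaining conversion, assuming secp192r1 (24-byte scalars and coordinates): BigInteger.toByteArray() returns a two's-complement encoding, so it may carry an extra leading 0x00 sign byte (hence the occasional 25 or 49) or fewer bytes than the curve size. The helper below (toFixedLength is a hypothetical name, not a BouncyCastle API) strips or pads to the exact length, and the raw public key expected by uECC is the X coordinate followed by the Y coordinate:
import java.math.BigInteger;

// Returns `value` as an unsigned, big-endian byte array of exactly `length` bytes.
static byte[] toFixedLength(BigInteger value, int length) {
    byte[] bytes = value.toByteArray(); // two's-complement; may include a leading 0x00 sign byte
    byte[] fixed = new byte[length];
    if (bytes.length >= length) {
        // drop the leading sign/extra bytes
        System.arraycopy(bytes, bytes.length - length, fixed, 0, length);
    } else {
        // left-pad with zeros for values whose high-order bytes are zero
        System.arraycopy(bytes, 0, fixed, length - bytes.length, bytes.length);
    }
    return fixed;
}

byte[] rawPrivate = toFixedLength(ecPrivateKey.getS(), 24);  // 24 bytes
byte[] rawPublic = new byte[48];                             // X || Y, 24 bytes each
System.arraycopy(toFixedLength(ecPublicKey.getW().getAffineX(), 24), 0, rawPublic, 0, 24);
System.arraycopy(toFixedLength(ecPublicKey.getW().getAffineY(), 24), 0, rawPublic, 24, 24);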