What is the size limit for JsonItemExporter in Scrapy?

The following warning is mentioned in the Feed Exports section of the Scrapy docs, regarding JsonItemExporter:
JSON is very simple and flexible serialization format, but it doesn’t scale well for large amounts of data since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (on any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output in multiple chunks.
Does this mean that JsonItemExporter is not suitable for incremental (aka stream) data, or does it also imply a size limit for JSON?
If this exporter is also not suitable for large files, does anyone have a clue about the upper limit for JSON items / file size (e.g. 10 MB or 50 MB)?

JsonItemExporter does not have a size limit. Its only limitation remains the lack of support for streaming: the output is a single JSON object, so it cannot be parsed incrementally.
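For illustration, here is a minimal sketch of the two feed formats side by side, assuming Scrapy's FEEDS setting (the output file names are made up):

```python
# settings.py -- illustrative feed configuration (file names are made up).
FEEDS = {
    # "json" uses JsonItemExporter: the whole output is one JSON array,
    # so most consumers end up parsing the entire file in memory.
    "items.json": {"format": "json"},
    # "jsonlines" uses JsonLinesItemExporter: one JSON object per line,
    # which can be produced and consumed incrementally.
    "items.jsonl": {"format": "jsonlines"},
}
```

The jsonlines file can be written and read one record at a time, which is what the docs mean by a more stream-friendly format.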

Is it possible to store PDF files in a CQL blob type in Cassandra?

To avoid questions about why we use Cassandra instead of another database: we have to, because our customer decided that, which in my opinion is a completely wrong decision.
In our application we have to deal with PDF documents, i.e. read them and populate them with data.
So my intention was to hold the documents (templates) in the database, read them, and then do what we need to do with them.
I noticed that Cassandra provides a blob column type.
However, it seems to me that this type has nothing to do with a blob in an Oracle or other relational database.
From what I understand, Cassandra is not meant for storing documents, so is it simply not possible?
Or is the only way to turn the document into a byte array?
What is the intention of the blob column type?
The blob type in Cassandra is used to store raw bytes, so it could "theoretically" be used to store PDF files as well (as bytes). But there is one thing that should be taken into consideration: Cassandra doesn't work well with big payloads. The usual recommendation is to store tens or hundreds of KB, and not more than 1 MB. With bigger payloads, operations such as repair or the addition/removal of nodes could lead to increased overhead and performance degradation. On older versions of Cassandra (2.x/3.0) I have seen situations where people couldn't add new nodes because the join operation failed. The situation is a bit better with newer versions, but it should still be evaluated before jumping into implementation. It's recommended to do performance testing plus some maintenance operations at scale to understand whether it will work for your load. NoSQLBench is a great tool for such things.
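For what it's worth, here is a minimal sketch of the "store the bytes in a blob" approach with the Python cassandra-driver; the keyspace, table and file names are invented for illustration, and the size caveats above still apply:

```python
# Minimal sketch: store a small PDF as raw bytes in a blob column.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("docs")  # assumes the "docs" keyspace already exists

session.execute("""
    CREATE TABLE IF NOT EXISTS pdf_templates (
        name text PRIMARY KEY,
        content blob
    )
""")

with open("template.pdf", "rb") as f:
    pdf_bytes = f.read()  # blob columns are bound from plain bytes

session.execute(
    "INSERT INTO pdf_templates (name, content) VALUES (%s, %s)",
    ("invoice_template", pdf_bytes),
)
```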
It is possible to store binary files in a CQL blob column; however, the general recommendation is to only store a small amount of data in blobs, preferably 1 MB or less, for optimum performance.
For larger files, it is better to place them in an object store and only save the metadata in Cassandra.
Most large enterprises whose applications hold large amounts of media files (music, video, photos, etc.) typically store them in Amazon S3, Google Cloud Storage or Azure Blob Storage, then store the metadata (such as URLs) of the files in Cassandra. These enterprises are household names in streaming services and social media apps. Cheers!
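A rough sketch of that object-store-plus-metadata pattern could look like the following; the bucket, keyspace, table and key names are all illustrative, and the pdf_metadata table is assumed to already exist:

```python
# Hypothetical flow: the PDF itself goes to S3, Cassandra only stores a
# small metadata row pointing at it.
import os
import boto3
from cassandra.cluster import Cluster

s3 = boto3.client("s3")
s3.upload_file("template.pdf", "my-doc-bucket", "templates/invoice.pdf")

session = Cluster(["127.0.0.1"]).connect("docs")
session.execute(
    "INSERT INTO pdf_metadata (name, s3_url, size_bytes) VALUES (%s, %s, %s)",
    (
        "invoice_template",
        "s3://my-doc-bucket/templates/invoice.pdf",
        os.path.getsize("template.pdf"),
    ),
)
```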

Why can't we just use an ArrayBuffer and convert it to an int array to upload a file?

I have this silly question, which originates from a college assignment.
Basically, what I was trying to do at the time was upload an image to a Flask backend in a RESTful way, and the backend would use OpenCV to do image recognition. Because JSON does not support binary data, I followed some online instructions to use base64, which is of course feasible (it seems to be used a lot for file uploads over REST, though I'm not sure of the underlying reason). But later I realized that I could read the image into an ArrayBuffer, convert it to an int array, and then post that to the backend. I tried it today and it worked. That way the encoding overhead is avoided on both sides, and the payload size is also reduced, since base64 increases the size by around 33%.
I want to ask: since we can avoid base64, why do we still use it? Is it just because it avoids issues with line-ending encodings across systems? That seems unrelated to uploading binary data.
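For comparison, here is a small sketch of how two payload styles could look on the Flask side; the endpoint names are made up, and both paths hand the same bytes to OpenCV:

```python
import base64
import cv2
import numpy as np
from flask import Flask, request

app = Flask(__name__)

def decode_image(raw):
    # cv2.imdecode expects a 1-D uint8 array holding the encoded file's bytes.
    return cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)

@app.route("/upload-base64", methods=["POST"])
def upload_base64():
    # JSON body like {"image": "<base64 string>"}: roughly 33% larger than
    # the original file because of the base64 encoding.
    raw = base64.b64decode(request.get_json()["image"])
    return {"shape": list(decode_image(raw).shape)}

@app.route("/upload-raw", methods=["POST"])
def upload_raw():
    # The request body is the image bytes themselves (e.g. sent with
    # Content-Type: application/octet-stream), so there is no encoding overhead.
    return {"shape": list(decode_image(request.get_data()).shape)}
```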

What makes RecordIO attractive

I have been reading about RecordIO here and there and checking different implementations on GitHub here and there.
I'm simply trying to wrap my head around the pros of such a file format.
The pros I see are the following:
Block compression: if you need to read only a few records, it will be faster because there is less to decompress.
Because of the somewhat indexed structure, you can look up a specific record in acceptable time (assuming keys are sorted). This can be useful to quickly locate a record in an ad hoc fashion.
I can also imagine that with such a file format you can have finer sharding strategies: instead of sharding per file, you can shard per block.
But I fail to see how such a file format is faster to read than plain protobuf with compression.
Essentially I fail to see a big pro in this format.
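For concreteness, here is a deliberately minimal length-prefixed record layout in the spirit of the RecordIO family. It is not any particular implementation's on-disk format (real ones add magic numbers, checksums and block compression), but it shows the core property that records can be located or skipped without parsing them:

```python
import struct

def write_records(path, records):
    # Each record: 4-byte little-endian length prefix, then the raw payload.
    with open(path, "wb") as f:
        for payload in records:
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)

def read_records(path):
    # A reader can walk record to record using only the length prefixes.
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack("<I", header)
            yield f.read(length)

if __name__ == "__main__":
    write_records("demo.records", [b"first record", b"second record"])
    print(list(read_records("demo.records")))
```

Block compression then amounts to compressing a group of such records as one unit, so reading a few records only requires decompressing their block rather than the whole file.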

Is it possible to memory map a compressed file?

We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, then after we're done with our operations compress it again?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that maps periodic nominal offsets to real offsets (this is just an optimization; you could "discover" this table of contents as you fetched data). Then the algorithm would look something like:
Given nominal offset n, look up greatest real offset m that maps to less than n.
Read m-32k into buffer (32k is the largest allowed distance in DEFLATE).
Begin DEFLATE algorithm at m. Count decompressed bytes until you get to n.
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.
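To make that a bit more concrete, here is a rough Python sketch of the scan-once / table-of-contents idea. Instead of re-entering DEFLATE 32k before the target as outlined above, it snapshots the zlib decompressor state (Decompress.copy()) at periodic checkpoints during the one-time scan and resumes from the nearest checkpoint on later reads; the checkpoint spacing and chunk size are arbitrary, and `compressed` could just as well be an mmap.mmap of the file:

```python
import bisect
import zlib

CHECKPOINT_EVERY = 1 << 20      # decompressed bytes between checkpoints
CHUNK = 64 * 1024               # compressed bytes fed to zlib per step

def build_index(compressed):
    """One-time scan: record (decompressed offset, compressed offset,
    decompressor snapshot) roughly every CHECKPOINT_EVERY output bytes."""
    d = zlib.decompressobj()
    index = []
    out_pos = in_pos = next_mark = 0
    while in_pos < len(compressed):
        if out_pos >= next_mark:
            index.append((out_pos, in_pos, d.copy()))
            next_mark += CHECKPOINT_EVERY
        chunk = compressed[in_pos:in_pos + CHUNK]
        out_pos += len(d.decompress(chunk))
        in_pos += len(chunk)
    return index

def read_at(compressed, index, n, length):
    """Return `length` decompressed bytes starting at nominal offset `n`."""
    # Greatest checkpoint whose decompressed offset is <= n.
    i = bisect.bisect_right([entry[0] for entry in index], n) - 1
    out_pos, in_pos, snapshot = index[i]
    d = snapshot.copy()
    buf = bytearray()
    while len(buf) < (n - out_pos) + length and in_pos < len(compressed):
        chunk = compressed[in_pos:in_pos + CHUNK]
        buf += d.decompress(chunk)
        in_pos += len(chunk)
    buf += d.flush()                       # pick up any trailing output
    start = n - out_pos
    return bytes(buf[start:start + length])
```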

Write multiple streams to a single file without knowing the length of the streams?

For performance of reading and writing a large dataset, we have multiple threads compressing and writing out separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of data as a subset.
Since each subset will be an unknown size after compression there is no way to know what byte offset to write to. Without compression each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
I'll write an example here of how I would expect the result to be on disk. Although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now in case there are any special features of certain file-systems.
If I make the subsets small enough to fit into ram, I will know the size after compression, but keeping them smaller has other performance drawbacks.
Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).
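Here is a rough sketch of the first (sparse-file) approach in Python; the slot size, header layout and file handling are invented for illustration. Each writer seeks to its own fixed slot, and the header written at the end records where each subset actually landed. On NTFS you would also want to mark the file as sparse (e.g. with fsutil sparse setflag) so the untouched gaps between slots don't consume disk space; that step is not shown here:

```python
import struct

HEADER_SIZE = 512
SLOT_SIZE = 256 * 1024 * 1024      # generous upper bound per compressed subset

def create_dataset(path):
    # Reserve room for the header up front; the rest of the file stays sparse.
    with open(path, "wb") as f:
        f.write(b"\x00" * HEADER_SIZE)

def write_subset(path, slot_index, data):
    # Each writer jumps to its own slot, so no writer depends on the
    # (unknown) compressed size of the other subsets.
    offset = HEADER_SIZE + slot_index * SLOT_SIZE
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return offset, len(data)

def write_header(path, entries):
    # entries: [(offset, length), ...] in subset order; 512 bytes holds a
    # count plus up to 31 (offset, length) pairs in this toy layout.
    with open(path, "r+b") as f:
        f.seek(0)
        f.write(struct.pack("<I", len(entries)))
        for offset, length in entries:
            f.write(struct.pack("<QQ", offset, length))
```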