Aerospike Store Data in Compressed Format

Does Aerospike have any built in support for data compression?
If not, are there any negative side effects of storing bin values in a compressed format?

There is no such built-in support at this point. I don't think there are any specific side effects of storing bins in a compressed format.
If this is a single bin holding a single blob, consider configuring the namespace to be single-bin true.
Update
As of December 13, 2018, there is a built-in compression feature in Aerospike Enterprise Edition 4.5:
https://www.aerospike.com/blog/aerospike-4-5-persistent-memory-compression/
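In the meantime (or on Community Edition), compressing on the client before the write is straightforward. Here is a minimal sketch with the Aerospike Python client, using gzip client-side; the namespace, set, and bin names are just placeholders:

```python
import gzip
import aerospike

# Placeholder cluster address, namespace, set, and bin names.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

key = ("test", "documents", "doc-1")
with open("template.pdf", "rb") as f:
    raw = f.read()

# Store the gzip-compressed bytes as a single blob bin.
client.put(key, {"payload": bytearray(gzip.compress(raw))})

# Read back and decompress on the client.
_, _, record = client.get(key)
original = gzip.decompress(bytes(record["payload"]))

client.close()
```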

Related

Is it possible to store PDF files in a CQL blob type in Cassandra?

To head off questions about why we use Cassandra rather than another database: we have to, because our customer decided that, and in my opinion it is a completely wrong decision.
In our application we have to deal with PDF documents, i.e. read them and populate them with data.
So my intention was to hold the documents (templates) in the database, read them, and then do what we need to do with them.
I noticed that Cassandra provides a blob column type.
However, it seems to me that this type has nothing to do with a blob in an Oracle or other relational database.
From what I understand, Cassandra is not meant for storing documents, so is this not possible?
Or is the only way to make a byte array out of the document?
What is the intention of the blob column type?
The blob type in Cassandra is used to store raw bytes, so it could "theoretically" be used to store PDF files as well (as bytes). But there is one thing that should be taken into consideration: Cassandra doesn't work well with big payloads. The usual recommendation is to store tens or hundreds of KB per value, and not more than 1 MB. With bigger payloads, operations such as repair or the addition/removal of nodes could lead to increased overhead and performance degradation. On older versions of Cassandra (2.x/3.0) I have seen situations where people couldn't add new nodes because the join operation failed. The situation is a bit better with newer versions, but it should still be evaluated before jumping into implementation. It's recommended to do performance testing plus some maintenance operations at scale to understand if it will work for your load. NoSQLBench is a great tool for such things.
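For a small template (well under the ~1 MB guidance above), the insert itself is unremarkable. A minimal sketch with the Python driver, assuming a hypothetical keyspace and table:

```python
import uuid
from cassandra.cluster import Cluster

# Assumed schema (hypothetical names):
#   CREATE TABLE templates (id uuid PRIMARY KEY, name text, content blob);
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

with open("template.pdf", "rb") as f:
    pdf_bytes = f.read()   # keep this well under ~1 MB per the guidance above

# The Python driver maps bytes directly onto the CQL blob type.
session.execute(
    "INSERT INTO templates (id, name, content) VALUES (%s, %s, %s)",
    (uuid.uuid4(), "invoice_template", pdf_bytes),
)
```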
It is possible to store binary files in a CQL blob column; however, the general recommendation is to only store a small amount of data in blobs, preferably 1 MB or less, for optimum performance.
For larger files, it is better to place them in an object store and only save the metadata in Cassandra.
Most large enterprises whose applications hold large amounts of media files (music, video, photos, etc.) typically store them in Amazon S3, Google Cloud Storage or Azure Blob Storage and then store the metadata of the files (such as URLs) in Cassandra. These enterprises are household names in streaming services and social media apps. Cheers!
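As a rough sketch of that pattern (the bucket, keyspace, and table names here are made up), the file goes to the object store and only a reference lands in Cassandra:

```python
import uuid
import boto3
from cassandra.cluster import Cluster

# Hypothetical bucket, keyspace, and table names.
s3 = boto3.client("s3")
session = Cluster(["127.0.0.1"]).connect("mykeyspace")

doc_id = uuid.uuid4()
s3_key = f"templates/{doc_id}.pdf"
s3.upload_file("template.pdf", "my-doc-bucket", s3_key)

# Only lightweight metadata (not the file itself) goes into Cassandra.
session.execute(
    "INSERT INTO template_meta (id, name, s3_url) VALUES (%s, %s, %s)",
    (doc_id, "invoice_template", f"s3://my-doc-bucket/{s3_key}"),
)
```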

What is the size limit for JsonItemExporter in Scrapy?

The following warning appears in the Feed Exports section of the Scrapy docs.
From the docs for JsonItemExporter:
JSON is very simple and flexible serialization format, but it doesn’t scale well for large amounts of data since incremental (aka. stream-mode) parsing is not well supported (if at all) among JSON parsers (on any language), and most of them just parse the entire object in memory. If you want the power and simplicity of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or splitting the output in multiple chunks.
Does this mean that JsonItemExporter is not suitable for incremental (aka stream-mode) parsing, or does it also imply a size limit for JSON?
If this means that this exporter is also not suitable for large files, does anyone have a clue about the upper limit for JSON items / file size (e.g. 10 MB or 50 MB)?
JsonItemExporter does not have a size limit. Its only limitation remains the lack of support for streaming: the output is one large JSON document rather than a stream-friendly format.
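If large crawls are the concern, the stream-friendly alternative the docs point to can be dropped into an item pipeline. A minimal sketch (the pipeline class and output file name are illustrative):

```python
from scrapy.exporters import JsonLinesItemExporter

class JsonLinesExportPipeline:
    """Writes one JSON object per line, so memory use stays flat
    regardless of how many items the crawl produces."""

    def open_spider(self, spider):
        self.file = open("items.jl", "wb")
        self.exporter = JsonLinesItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
```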

Apache Ignite 2.x BinaryObject deserialize performance

I'm observing two orders of magnitude performance difference scanning a local off-heap cache between binary and deserialized mode (200k/sec vs 2k/sec). Have not profiled it with tools yet.
Is the default reflection based binary codec a recommended one for production or there's a better one?
What's the best source to read for a description of the binary layout (the official documentation doesn't cover it)?
Or in the most generic form - what's the expected data retrieval performance with Ignite scanning query and how to achieve it?
Since version 2.0.0, Ignite stores all data in off-heap memory, so it's expected that BinaryObjects work faster: a BinaryObject doesn't deserialize your objects into classes but works directly with the underlying bytes.
So yes, it's recommended to use BinaryObjects where possible for the sake of performance.
Read the following doc:
https://apacheignite.readme.io/docs/binary-marshaller
It explains how to use BinaryObjects.

Is it possible to query GZIP document stored as Binary data in SQL Server?

I have about thirty thousand binary records, all compressed using GZIP, and I need to search the contents of each document for a specified keyword. Currently, I am downloading and extracting all documents at startup. This works well enough, but I expect to be adding another ten thousand each year. Ideally, I would like to perform a SELECT statement on the binary column itself, but I have no idea how to go about it, or if this is even possible. I would like to perform this transaction with the least possible amount of data leaving the server. Any help would be appreciated.
EDIT: SQL Server itself is not compressing the records. What I mean to say is that I'm compressing the data locally and uploading the compressed files into a SQL Server column of the binary data type. I'm looking for a way to query that compressed data without downloading and decompressing each and every document. The data was stored this way to minimize overhead and reduce transfer cost, but the data must also be queried. It looks like I may have to store two versions of the data on the server, one compressed to be downloaded by the user, and one decompressed to allow search operations to be performed. Is there a more efficient approach?
SQL Server has a Full-Text Search feature. It will not work on the data that you compressed in your application, of course. You have to store it in plain-text in the database. But, it is designed specifically for this kind of search, so performance should be good.
SQL Server can also compress the data in rows or in pages, but this feature is not available in every edition of SQL Server. For more information, see Features Supported by the Editions of SQL Server. You have to measure the impact of compression on your queries.
Another possibility is to write your own CLR functions that would work on the server - load the compressed binary column, decompress it and do the search. Most likely performance would be worse than using the built-in features.
Taking your updated question into account.
I think your idea to store two versions of the data is good.
Store compressed binary data for efficient transfer to and from the server.
Store a secondary copy of the data in an uncompressed format with proper indexes (consider full-text indexes) for efficient search by keywords.
Consider using a CLR function to help during inserts. You can transfer only compressed data to the server, then call a CLR function that decompresses it on the server and populates the secondary table with uncompressed data and indexes.
Thus you'll have both efficient storage/retrieval and efficient searches, at the expense of the extra storage on the server. You can think of that extra storage as an extra index structure that helps with searches.
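A rough sketch of that two-column approach from Python with pyodbc; the table, column, and file names are made up, and a full-text index on the plain-text column is assumed:

```python
import gzip
import pyodbc

# Assumed table (hypothetical names):
#   CREATE TABLE Documents (
#       Id INT IDENTITY PRIMARY KEY,
#       Name NVARCHAR(260),
#       CompressedBody VARBINARY(MAX),  -- gzip blob, cheap to transfer
#       SearchText NVARCHAR(MAX)        -- plain text, full-text indexed
#   );
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=docs;Trusted_Connection=yes;"
)
cur = conn.cursor()

with open("report_2024.txt", "r", encoding="utf-8") as f:
    text = f.read()

# One row keeps both representations: the compressed blob for cheap
# download and the plain text for server-side keyword searches.
cur.execute(
    "INSERT INTO Documents (Name, CompressedBody, SearchText) VALUES (?, ?, ?)",
    ("report_2024.txt", gzip.compress(text.encode("utf-8")), text),
)
conn.commit()

# The keyword search runs entirely on the server against the
# uncompressed column (requires a full-text index on SearchText).
for (name,) in cur.execute(
    "SELECT Name FROM Documents WHERE CONTAINS(SearchText, ?)", ("keyword",)
):
    print(name)
```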
Why compress 30,000 or 40,000 records? That does not sound like a whole lot of data, depending of course on the average size of a record.
For keyword searching, you should not compress the database records. But to save on disk space, in most operating systems, it is possible to compress data on the file level, without the SQL Server even noticing.
Update:
As Vladimir pointed out, SQL Server does not run on a compressed file system. In that case you could store the data in TWO columns: one uncompressed, for keyword searching, and one compressed, for improved data transfer.
Storing data in a separate searchable column is not uncommon. For example, if you want to search on a combination of fields, you might as well store that combination in a search column, so that you can index that column to accelerate searching. In your case, you might store the data in the search column all lower-cased and with accented characters converted to ASCII, and add an index, to accelerate case-insensitive searching on ASCII keywords.
In fact, Vladimir already suggested this.

What are the methods to migrate millions of nodes and edges from 0.4.4 to 0.5?

I'm migrating an entire Titan graph database from 0.4.4 to 0.5. There are about 120 million nodes and 90 million edges, amounting to gigabytes of data. I tried the GraphML format, but it didn't work.
Can you suggest methods to do the migration?
At the size you are describing you would probably execute the most efficient migration by using Titan-Hadoop/Faunus. The general process would be to:
Use Faunus 0.4.x to extract the data from your graph as GraphSON and store that in HDFS
Use Titan-Hadoop 0.5.x to read the GraphSON and write back to your storage backend.
Make sure that you've created your schema in your target backend prior to executing step 2.
As an aside, GraphML is not a good format for a graph of this size: it will take too long and require a lot of resources, if it works at all. You might wonder why you wouldn't use SequenceFiles if you are using Faunus/Titan-Hadoop... the reason you can't in this case is that, I believe, there were version differences between 0.4.x and 0.5.x with respect to the SequenceFile format. In other words, 0.5.x can't read 0.4.x sequence files. GraphSON is readable by both versions, so it makes for an ideal migration format.