combining open(Filename, {delayed_write, Size, Delay}) with index - file-io

How can I combine open(Filename, {delayed_write, Size, Delay}) with an index telling me where in the file to write the data?
I want to wait until I receive a certain amount of data and then write it to a position in the file.
Also, is {read_ahead, Size} the opposite of {delayed_write, Size, Delay}? I would like to read a certain amount of data and then send it.
Thanks

read_ahead is kind of the opposite of delayed_write in the sense that reading is the opposite of writing.
If you want to read and send bigger chunks of memory you don't need read_ahead; just read big chunks and send them (there are not many OS calls to save here).
From the file:open/2 manpage on read_ahead:
If read/2 calls are for sizes not significantly less than, or even greater than Size
bytes, no performance gain can be expected.
You don't need to specify an index when opening. Just use file:pwrite/3 or a combination of file:position/2 and file:write/2.
But writing in different positions of the file might just reduce the gain of delayed_write since (also manpage of file:open/2):
The buffered data is also flushed before some other file operation than write/2 is executed.
If you have chunks of data for several positions, collect them in a list of {Location, Bytes} and from time to time write them with file:pwrite/2 all in one go. This can map to a very efficient writev(2) system call that writes several chunks in one sweep.
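For illustration, here is a minimal sketch of that collect-then-write pattern. The answer above is about Erlang's file:pwrite/2, which takes the whole {Location, Bytes} list at once; the sketch below expresses the same idea in Python with os.pwrite (POSIX only), and the class name and 64 KiB threshold are made up for the example:
import os

# Sketch of "collect (offset, bytes) pairs, then flush them with positional writes".
# PositionalBuffer and the 64 KiB threshold are illustrative choices only.
class PositionalBuffer:
    def __init__(self, path, flush_threshold=64 * 1024):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT)
        self.flush_threshold = flush_threshold
        self.pending = []        # list of (offset, data) pairs
        self.pending_size = 0

    def write_at(self, offset, data):
        self.pending.append((offset, data))
        self.pending_size += len(data)
        if self.pending_size >= self.flush_threshold:
            self.flush()

    def flush(self):
        for offset, data in self.pending:
            os.pwrite(self.fd, data, offset)   # one positional write per chunk
        self.pending.clear()
        self.pending_size = 0

    def close(self):
        self.flush()
        os.close(self.fd)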

Related

2D Chunked Grid Fast Access

I am working on a program that stores some data in cells (small structs) and processes each one individually. The processing step accesses the 4 neighbors of the cell (2D). I also need them partitioned in chunks because the cells might be distributed randomly through a very large surface, and having a large grid with mostly empty cells would be a waste. I also use the chunks for some other optimizations (skipping processing of chunks based on some conditions).
I currently have a hashmap of "chunk positions" to chunks (which are the actual fixed size grids). The position is calculated based on the chunk size (like Minecraft). The issue is that, when processing the cells in every chunk, I lose a lot of time doing a lookup to get the chunk of the neighbor. Most of the time, the neighbor is in the same chunk we are processing, so I did a check to prevent looking up a chunk if the neighbor is in the same chunk we are processing.
Is there a better solution to this?
This lacks some details, but hopefully you can employ a solution such as this:
Process the interior of a chunk (i.e. excluding the edges) separately. During this phase, the neighbours are guaranteed to be in the same chunk, so you can do this with zero chunk-lookups (see the sketch below). The difference between this and checking whether a chunk-lookup is necessary is that there is not even a check: it is implicit in the loop bounds.
For edges, you can do a few chunk lookups and reuse the result across the edge.
This approach gets worse with smaller chunk sizes, or if you need access to neighbours further than 1 step away. It breaks down entirely in case of random access to cells. If you need to maintain a strict ordering for the processing of cells, this approach can still be used with minor modifications by rearranging it (there wouldn't be a strict "process the interior" phase, but you would still have a nice inner loop with zero chunk-lookups).
Such techniques are common in general in cases where the boundary has different behaviour than the interior.
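A minimal sketch of the interior/edge split, assuming square chunks stored as CHUNK x CHUNK grids in a dict keyed by chunk coordinates; all the names here (CHUNK, chunks, process) are hypothetical stand-ins for the question's structures:
CHUNK = 32
chunks = {}   # (cx, cy) -> 2D list of cells, indexed as chunks[cx, cy][x][y]

def process(cell, west, east, north, south):
    ...   # whatever per-cell work needs the 4 neighbours

def process_chunk(cx, cy):
    grid = chunks[(cx, cy)]

    # Interior: the loop bounds guarantee all 4 neighbours are in `grid`,
    # so there are zero chunk lookups and zero per-cell checks here.
    for x in range(1, CHUNK - 1):
        for y in range(1, CHUNK - 1):
            process(grid[x][y],
                    grid[x - 1][y], grid[x + 1][y],
                    grid[x][y - 1], grid[x][y + 1])

    # Edges: one neighbouring-chunk lookup is reused for a whole edge.
    west_chunk = chunks.get((cx - 1, cy))
    if west_chunk is not None:
        for y in range(1, CHUNK - 1):
            process(grid[0][y],
                    west_chunk[CHUNK - 1][y], grid[1][y],
                    grid[0][y - 1], grid[0][y + 1])
    # ... the east, north and south edges and the 4 corners are handled similarly.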

Is there a faster way to concat huge data frames (40GB) using pandas

I have 3 huge data frames of 40 GB each; I opened them using chunks. Then I want to concatenate them together. Here is what I tried:
import os
import pandas as pd

path = 'path/to/myfiles'
files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
for file in files:
    cols = ['col1', 'col2', 'col3']
    chunks = pd.read_table(file, sep='\t', names=cols, chunksize=10000000)
However, when I try to concatenate all the files, it is taking forever.
I would like to have some suggestions to concatenate all the data frames quicker/faster.
CSV/TSV is a very slow, unoptimized file format.
You probably don't need to keep the entire dataset in-memory. Your use-case probably doesn't need full random column- and row-access across the entire combined (120GB) dataset.
Can you process each row/chunk/group (e.g. zipcode, user_id, etc.) serially, e.g. to compute aggregates, summary statistics or features? Or do you need to apply arbitrary filters across columns (which columns?) or rows (which rows?), e.g. "Get all user ids who used service X within the last N days"? You can choose a higher-performance file format based on your use-case. There are alternative file formats (HDF5, Parquet, etc.); some are optimized for columnar access, some for row access, some for sequential or random access. There is also PySpark.
You don't necessarily need to combine your dataset into one huge monolithic 120GB file.
You're saying the runtime is slow, but likely you're blowing out memory (in which case runtime goes out the window), so first check your memory usage.
Your code is trying to read in and store all chunks of each file, not process them individually chunk-by-chunk across the three files: for file in files: ... chunks = pd.read_table(file, ..., chunksize=10000000). See "Iterating through files chunk by chunk, in pandas".
After you fix that: the chunksize=1e7 parameter is not the size of a memory chunk; it's only the number of rows in the chunk, and that value is insanely large. If one row of the combined dataframes were to take, say, 10 KB, then a chunk of 1e7 such rows would take 100 GB(!), which will not fit in most machines.
If you must stick with CSV, process a single chunk at a time from each of the three files, write its output to file, and don't leave all the chunks hanging around in memory. Also reduce your chunksize (try e.g. 1e5 or less, and measure the memory and runtime improvement); don't hardcode it, figure out a sane value per machine, and/or make it a command-line parameter. Monitor your memory usage.
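For example, a minimal sketch of that chunk-at-a-time flow (the output file name and chunksize here are arbitrary illustrations; the paths and columns are taken from the question):
import os
import pandas as pd

path = 'path/to/myfiles'
files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
cols = ['col1', 'col2', 'col3']

first = True
for file in files:
    for chunk in pd.read_csv(file, sep='\t', names=cols, chunksize=100_000):
        # optionally filter/aggregate the chunk here, then append it to the output
        chunk.to_csv('combined.tsv', sep='\t', index=False,
                     header=first, mode='w' if first else 'a')
        first = False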
.tsv and .csv are fairly slow formats to read/write. I've found parquet works best for most of the stuff I end up doing. It's quite fast on reads and writes, and also allows you to read back a chunked folder of files as a single table easily. It does require string column names, however:
In [102]: df = pd.DataFrame(np.random.random((100000, 100)), columns=[str(i) for i in range(100)])
In [103]: %time df.to_parquet("out.parquet")
Wall time: 980 ms
In [104]: %time df.to_csv("out.csv")
Wall time: 14 s
In [105]: %time df = pd.read_parquet("out.parquet")
Wall time: 195 ms
In [106]: %time df = pd.read_csv("out.csv")
Wall time: 1.53 s
If you don't have control over the format those chunked files are in, you'll obviously need to pay the read cost of them at least once, but converting them could still save you some time in the long run if you do a lot of other read/writes.
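If you do decide to convert, a one-time conversion along these lines (assuming pyarrow is installed; the folder name and chunksize are made up) lets you keep the data as a folder of parquet parts and still read it back as one table:
import os
import pandas as pd

os.makedirs('parquet_out', exist_ok=True)
cols = ['col1', 'col2', 'col3']
for n, file in enumerate(files):                  # `files` as in the question
    for m, chunk in enumerate(pd.read_csv(file, sep='\t', names=cols,
                                          chunksize=100_000)):
        chunk.to_parquet(f'parquet_out/part-{n:03d}-{m:05d}.parquet')

df = pd.read_parquet('parquet_out/')              # pyarrow reads the folder as one dataset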

Storing 30M records in redis

I'm wondering about the most efficient way to store this data.
I need to track 30-50 million data points per day. It needs to be extremely fast read/write, so I'm using redis.
The data only needs to last for 24 hours, at which point it will EXPIRE.
The data looks like this as a key/value hash
{
"statistics:a5ded391ce974a1b9a86aa5322ea9e90": {
xbi: 1,
bid: 0.24024,
xpl: 25.0,
acc: 40,
pid: 43,
cos: 0.025,
xmp: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
clu: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
}
I've replaced the actual string with a lot of x but that IS the proper length of the string.
So far, according to my calculations.... this will use hundreds of GB of memory. Does that seem correct?
This is mostly ephemeral logging data that's important, but not important enough to try to support writing to disk or failovers. I am comfortable keeping it on 1 machine, if that helps make this easier.
What would be the best way to reduce memory space in this scenario? Is there a better way I can do this? Does redis support 300GB on a single instance?
In redis.conf - set hash-max-ziplist-value to 1 more than the length of the field 'xmp'. Then restart redis, and watch your memory go down significantly.
The default value is 64. Increasing it increases cpu utilization when you modify or add new fields in the hash. But your use case seems to be create-only, and in that case there shouldn't be any drawbacks of increasing the setting.
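If you'd rather not edit redis.conf, the same parameter can also be changed at runtime with CONFIG SET. A small redis-py sketch (the value 1024 is an assumption based on the ~1 KB 'xmp' field; use a number just above your real field length):
import redis

r = redis.Redis()
r.config_set('hash-max-ziplist-value', 1024)        # must exceed len(xmp)
print(r.config_get('hash-max-ziplist-value'))

# For hashes written after the change, OBJECT ENCODING should report
# 'ziplist' (or 'listpack' on newer Redis) instead of 'hashtable'.
print(r.object('encoding', 'statistics:a5ded391ce974a1b9a86aa5322ea9e90'))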
this will use hundreds of GB of memory. Does that seem correct?
YES
Does redis support 300GB on a single instance?
YES
Is there a better way I can do this?
You can try the following methods:
Avoid Using Hash
Since you always get all fields of the log with HGETALL, there's NO need to save the log as HASH. HASH consumes more memory than STRING.
You can serialize all fields into a string, and save the log as a key-value pair:
SET 'statistics:a5ded391ce974a1b9a86aa5322ea9e90' '{xbi: 1, bid: 0.24024, and other fields}'
@Sripathi Krishnan's answer gives another way to avoid HASH, i.e. configuring Redis to encode the HASH into a ZIPLIST. It's a good idea if you don't share your Redis with other applications; otherwise, this modification might cause problems for them.
Compress The Data
In order to reduce memory usage, you can try to compress your data. Redis can store binary strings, so you can use gzip, snappy or another compression algorithm to compress the log text into a binary string, and save that into Redis.
Normally, you get better compression when the input is bigger, so you'd better compress the whole log instead of compressing each field one by one.
The side-effect is that the producer and consumer of the log need to spend some CPU compressing and decompressing the data. However, normally that's NOT a problem, and it also saves some network bandwidth.
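A minimal redis-py sketch of this, using gzip and JSON from the standard library (key prefix from the question; the 24-hour TTL matches the retention described above):
import gzip
import json
import redis

r = redis.Redis()
ONE_DAY = 24 * 3600

def save_log(log_id, fields):
    # Serialize the whole log and compress it as one blob, stored as a plain
    # string value with a 24 h expiry.
    blob = gzip.compress(json.dumps(fields).encode())
    r.set(f'statistics:{log_id}', blob, ex=ONE_DAY)

def load_log(log_id):
    blob = r.get(f'statistics:{log_id}')
    return None if blob is None else json.loads(gzip.decompress(blob))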
Batch Write and Batch Read
As I mentioned above, if you want to get better compression, you should get a bigger input. So if you can write multiple logs in a batch, you can compress the batch of logs to get better compression.
Compress multiple logs into a batch: compress(log1, log2, log3) -> batch1: batch-result
Put the batch result into Redis as a key-value pair: SET batch1 batch-result
Build an index for the batch: MSET log1 batch1 log2 batch1 log3 batch1
When you need to get the log:
Search the index to get the batch key: GET log1 -> batch1
Get the batch result: GET batch1 -> batch-result
Decompress the batch result and look up the log from the result
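A sketch of this batch scheme with redis-py (the batch keys, TTL handling and gzip/JSON serialization are illustrative choices; note that the index entries need their own expiry, since MSET does not set one):
import gzip
import json
import redis

r = redis.Redis()
ONE_DAY = 24 * 3600

def save_batch(batch_key, logs):
    # logs: {log_id: fields}. Compress the whole batch as one value and point
    # every log id at the batch key, as in the steps above.
    blob = gzip.compress(json.dumps(logs).encode())
    pipe = r.pipeline()
    pipe.set(batch_key, blob, ex=ONE_DAY)
    pipe.mset({log_id: batch_key for log_id in logs})
    for log_id in logs:
        pipe.expire(log_id, ONE_DAY)
    pipe.execute()

def load_log(log_id):
    batch_key = r.get(log_id)                   # 1. index lookup
    if batch_key is None:
        return None
    blob = r.get(batch_key)                     # 2. fetch the batch
    logs = json.loads(gzip.decompress(blob))    # 3. decompress and look up
    return logs.get(log_id)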
The last method is the most complicated one, and the extra index costs some extra memory. However, it can greatly reduce the size of your data.
How much these methods achieve largely depends on your log data. You should do lots of benchmarking :)

Erasure Code for chunked file

Is there an erasure code that can be applied to multiple chunks (maybe 100 or 200, each a few hundred kB) by (somehow) adding redundancy chunks?
I heard about Reed-Solomon, but it doesn't look like it can be used for huge data sets and multiple chunks. Am I wrong?
Thanks!
Of course Reed-Solomon can be used for any data size.
Just think of your data as a set of multiple RS-sized blocks (e.g. 255 bytes for a byte-based RS code) and do the calculation for each block independently. All checksums together are the checksum of the whole big data set.
If your data length is not a multiple of the RS block size, i.e. the last block is too short, just add some 0 bytes to fill it up before encoding, and remove the 0s again after decoding. You'll have to save the original data length somewhere, but that should be no problem.
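A sketch of that block-splitting and padding, where rs_encode_block/rs_decode_block are hypothetical placeholders for whatever RS implementation you use, and the RS(255, 223) block geometry is just an assumed example:
DATA_LEN = 223          # data bytes per RS block
ECC_LEN = 32            # parity bytes per block (DATA_LEN + ECC_LEN = 255)

def encode(data: bytes):
    original_len = len(data)              # stored so the padding can be removed later
    codewords = []
    for i in range(0, len(data), DATA_LEN):
        block = data[i:i + DATA_LEN].ljust(DATA_LEN, b'\0')    # zero-pad the last block
        codewords.append(rs_encode_block(block))               # hypothetical: 255-byte codeword
    return original_len, codewords

def decode(original_len, codewords):
    data = b''.join(rs_decode_block(cw) for cw in codewords)   # hypothetical decoder
    return data[:original_len]                                 # strip the padding again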
Erasure codes encode $N$ original data chunks into $M$ parity chunks for redundancy, and these $N$ data chunks plus $M$ parity chunks form just one stripe of the whole storage. Theoretically, $N$ can be arbitrarily large for Reed-Solomon (RS) codes, as long as the Galois field $GF(2^w)$ the code is constructed over is large enough. Based on the above, your question is really:
Why is the number of (original data) chunks in a stripe rarely very large, for example $N = 100$ or $200$?
The reasons are the update problem and the repair problem. If you encode a great number of data chunks into parity chunks with an erasure code, a lot of data/parity chunks become co-related: as soon as you update one data chunk, all parity chunks must be updated as well, which causes heavy I/O for the parity part. The repair problem is the situation where one data/parity chunk fails and lots of data/parity chunks must be accessed and transferred for the repair, causing enormous disk I/O or network traffic.
Take RAID5 with $3$ data chunks (A, B, C) and parity chunk P = A + B + C as an example: the repair of any failed chunk requires all three other chunks to participate.
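A tiny illustration of that point in Python (parity here is plain XOR, as in RAID5; the chunk contents are made up):
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

A, B, C = b'aaaa', b'bbbb', b'cccc'
P = xor(xor(A, B), C)             # parity chunk

B_recovered = xor(xor(A, C), P)   # losing B: rebuilding it needs A, C and P
assert B_recovered == B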
The more chunks that are encoded together, the more serious the update and repair problems a storage system may face, which in turn greatly affects the performance of the system.
BTW, the decoding speed (the process of recovering the original data) also falls off greatly as $N$ grows.

Which data structure should I use for storing hash values?

I have a hash table that I want to store to disk. The list looks like this:
<16-byte key > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...
There are 1-5 million entries. Currently I'm just storing them in one file, 17-bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on the disk and then for lookup time. Insertion time is unimportant.
What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?
Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.
You could just sort entries by key and do a binary search.
Fixed size keys and data entries means you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on meta data.
I don't think you'll do any better on disk space, and lookup times are O(log(n)). Insertion times are crazy long, but you said that didn't matter.
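A minimal sketch of that lookup, assuming a file of 17-byte records (16-byte key + 1-byte result) sorted ascending by key, with no header (the file name in the usage lines is made up):
RECORD = 17

def lookup(f, key: bytes):
    f.seek(0, 2)                         # derive the record count from the file size
    lo, hi = 0, f.tell() // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD)
        rec = f.read(RECORD)
        if rec[:16] < key:
            lo = mid + 1
        elif rec[:16] > key:
            hi = mid
        else:
            return rec[16]               # the 1-byte result
    return None

# usage:
# with open('table.bin', 'rb') as f:
#     lookup(f, bytes.fromhex('a7b4903def8764941bac7485d97e4f76'))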
If you're really willing to tolerate long access times, do sort the table, but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section of the file at the start. Using this scheme, you can find the block containing the key you need in linear time and then perform a binary search within the decompressed block. Choose the block size based on how much of the file you're willing to load into memory at once.
Using an off the shelf compression scheme (like GZIP) you can tune the compression ratio as needed; larger files will presumably have quicker lookup times.
I have doubts that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.
*Use the header to look up the offset of a block to decompress and use.
5 million records is about 81 MB, which is acceptable for working with an array in memory.
As you describe the problem, these are more like unique keys than hash values.
Try to use a hash table for accessing the values (look at this link).
If I have misunderstood and these are real hashes, try to build a second hash level above this one.
A hash table can be successfully organized on disk too (e.g. as a separate file).
Addition
A solution with good search performance and little overhead is:
1. Define a hash function which produces integer values from keys.
2. Sort the records in the file according to the values produced by this function.
3. Store the file offset where each hash value starts.
4. To locate a value:
4.1. compute its hash with the function
4.2. look up the offset in the file
4.3. read records from the file starting at that position until the key is found, the offset of the next hash value is reached, or End-Of-File.
There are some additional things which must be pointed out:
The hash function must be fast to be effective.
The hash function must produce linearly distributed values, or close to that.
The table of hash value offsets can be placed in a separate file.
The table of hash value offsets can be produced dynamically, with a sequential read of the whole sorted file at application start, and kept in memory.
At step 4.3, records must be read in blocks, not one-by-one, to be efficient. Ideally, read all values with the computed hash into memory at once.
You can find some examples of hash functions here.
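A compact sketch of this scheme, assuming the 17-byte records from the question and using the first 2 bytes of the (already hash-like) key as the "hash"; the bucket count and file handling are illustrative choices:
RECORD, BUCKETS = 17, 1 << 16

def bucket(key: bytes) -> int:
    # The keys are already hashes, so their first 2 bytes are near-uniform and
    # sorting the file by key also groups it by bucket.
    return int.from_bytes(key[:2], 'big')

def build_offsets(f):
    # One sequential pass: record where each bucket starts (the "produced
    # dynamically at start of application" variant above).
    offsets = [None] * (BUCKETS + 1)
    pos = 0
    while (rec := f.read(RECORD)):
        b = bucket(rec[:16])
        if offsets[b] is None:
            offsets[b] = pos
        pos += RECORD
    offsets[BUCKETS] = pos
    for i in range(BUCKETS - 1, -1, -1):          # empty buckets get zero length
        if offsets[i] is None:
            offsets[i] = offsets[i + 1]
    return offsets

def lookup(f, offsets, key: bytes):
    b = bucket(key)
    f.seek(offsets[b])
    blob = f.read(offsets[b + 1] - offsets[b])    # step 4.3: read the whole block
    for i in range(0, len(blob), RECORD):
        if blob[i:i + 16] == key:
            return blob[i + 16]
    return None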
Would the simple approach of storing them in an sqlite database work? I don't suppose it'll get any smaller, but you should get very good lookup performance, and it's very easy to implement.
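For example (a minimal sketch; the file and table names are made up):
import sqlite3

db = sqlite3.connect('hashes.db')
db.execute('CREATE TABLE IF NOT EXISTS t (key BLOB PRIMARY KEY, result INTEGER)')

def insert(key: bytes, result: int):
    db.execute('INSERT OR REPLACE INTO t VALUES (?, ?)', (key, result))
    db.commit()

def lookup(key: bytes):
    row = db.execute('SELECT result FROM t WHERE key = ?', (key,)).fetchone()
    return None if row is None else row[0]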
First of all, multiple files are not OK if you want to optimize for disk space, because of cluster size: when you create a file of ~100 bytes, the disk space used still grows by the cluster size, 2 kB for example.
Secondly, in your case I would store the whole table in a single binary file, ordered simply ascending by the byte values of the keys. That gives you a file whose length is exactly entriesNumber*17, which is minimal if you do not want to use compression, and you get a very quick search taking about log2(entriesNumber) steps: to search for a key, divide the file into two parts and compare the key at their border with the needed key. If the "border key" is bigger, take the first part of the file; if smaller, the second part. Then divide the taken part into two parts again, and so on.
So you will need about log2(entriesNumber) read operations to search for a single key.
Your key is 128 bits, but if you have max 10^7 entries, it only takes 24 bits to index it.
You could make a hash table, or
use a Bentley-style unrolled binary search (at most 24 comparisons).
Here's the unrolled loop (with 32-bit ints):
int key[4];
int a[1<<24][4];   /* sorted ascending by key */

/* lexicographic compare of two 4-int keys: negative, zero or positive */
int compare(const int *x, const int *y) {
    for (int k = 0; k < 4; k++)
        if (x[k] != y[k]) return x[k] < y[k] ? -1 : 1;
    return 0;
}

int i = 0;
if (compare(key, a[i+(1<<23)]) >= 0) i += (1<<23);
if (compare(key, a[i+(1<<22)]) >= 0) i += (1<<22);
if (compare(key, a[i+(1<<21)]) >= 0) i += (1<<21);
...
if (compare(key, a[i+(1<<3)]) >= 0) i += (1<<3);
if (compare(key, a[i+(1<<2)]) >= 0) i += (1<<2);
if (compare(key, a[i+(1<<1)]) >= 0) i += (1<<1);
if (compare(key, a[i+(1<<0)]) >= 0) i += (1<<0);
/* i is now the candidate index; finally check compare(key, a[i]) == 0 */
As always with file design, the more you know (and tell us) about the distribution of data the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys -- which should be true if you are storing a hash table -- I suggest a combination of what others have already suggested:
binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values are as strings of hexadecimal digits fool you into thinking that this is string data;
file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;
the leading 2 bytes (4 hex digits) of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:
at offset 0 start writing all the entries whose keys begin with 0x0000; pad to a total of 100 entries (1700 bytes) with 0s;
at offset 1700 start writing all the entries whose keys begin with 0x0001, pad,
repeat until you've written all the data.
Now your lookup becomes a calculation to figure out the offset into the file followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough then use 16^5 subsets, allowing about 6 entries per subset (6x16^5 = 6291456). I guess that this will be faster than binary search -- but it is only a guess.
Insertion is a bit of a problem; it's up to you, with your knowledge of your data, to decide whether new entries (a) necessitate re-sorting a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).
If space is very important you can, of course, drop the leading 2 bytes from your entries, since they are implied by the calculation of the offset into the file.
What I'm describing, not terribly well, is a hash table.
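A minimal sketch of the lookup side of that layout (16^4 subsets keyed on the first 2 bytes, 100 fixed 17-byte slots per subset, empty slots zero-filled; the file name is made up):
SLOTS, RECORD = 100, 17
SUBSET_BYTES = SLOTS * RECORD                 # 1700 bytes per subset

def lookup(f, key: bytes):
    subset = int.from_bytes(key[:2], 'big')   # which of the 65536 subsets
    f.seek(subset * SUBSET_BYTES)
    block = f.read(SUBSET_BYTES)
    for i in range(0, SUBSET_BYTES, RECORD):  # scan at most 100 entries
        if block[i:i + 16] == key:
            return block[i + 16]
    return None

with open('table.bin', 'rb') as f:
    print(lookup(f, bytes.fromhex('a7b4903def8764941bac7485d97e4f76')))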