Write multiple streams to a single file without knowing the length of the streams? - file-io

For performance of reading and writing a large dataset, we have multiple threads compressing and writing out separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of data as a subset.
Since each subset will be an unknown size after compression, there is no way to know which byte offset to write to. Without compression, each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
I'll write an example here of how I would expect the result to look on disk, although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now in case there are any special features of certain file-systems.
If I make the subsets small enough to fit into RAM, I will know the size after compression, but keeping them smaller has other performance drawbacks.

Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
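A minimal sketch of the sparse-file approach, assuming a fixed 256 MiB upper bound per compressed subset (`SLOT_SIZE` and the function name are illustrative assumptions; pick any bound your subsets cannot exceed). On NTFS you would additionally mark the file sparse via `DeviceIoControl` with `FSCTL_SET_SPARSE` so the untouched gaps between slots cost no disk space:

```python
HEADER_SIZE = 512
SLOT_SIZE = 1 << 28  # 256 MiB per slot: assumed upper bound on a compressed subset

def write_subset(path, slot_index, payload):
    """Write one compressed subset at a fixed slot offset. On a
    sparse-capable filesystem the gap between slots stays unallocated
    (on NTFS, first mark the file sparse with FSCTL_SET_SPARSE)."""
    offset = HEADER_SIZE + slot_index * SLOT_SIZE
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)
    return offset  # record this in the 512-byte header
```

Each writer thread can call `write_subset` independently, since the slots never overlap; the returned offsets go into the header so readers can locate each subset without scanning.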
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).

Related

Pandas: Large dataframes in the same HDF?

I have several different data frames that are related (and there are IDs to join them if needed). However, I don't always need them at the same time.
Since they are quite large, does it make sense to store them in separate HDF stores? Or is the cost of carrying around the "unused" frames negligible when I'm working on the other frames in the same file?
Theoretically, if you can separate your HDF files in terms of I/O subsystem (different spindles, different storage systems, etc.), you can try to read your DFs in parallel. Practically, I would test it in your particular case, on your hardware, with your data.
Another advantage of separate files: if you remove or dramatically decrease the size of a huge DF in an HDF store containing multiple DFs, the store's size will remain unchanged. With a separate file, you can simply drop it and free the unused space.
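The parallel-read idea can be sketched with a thread pool; in practice `loader` would be `pandas.read_hdf` and the paths would be your separate stores (both are assumptions here):

```python
from concurrent.futures import ThreadPoolExecutor

def load_frames(paths, loader):
    """Read several HDF files concurrently; if they sit on separate
    spindles or storage systems, the I/O can actually overlap."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return list(pool.map(loader, paths))
```

Whether this helps at all depends on the storage layout, which is why testing on your own hardware is the right call.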
The cost of carrying unused frames is the same whether they are in another file or the same file. Ask yourself if it's better to store this SQL table in another database or the same database. If they are related, keep them in the same store.

For linearized PDF how to determine the length of the cross-reference stream in advance?

When generating a linearized PDF, a cross-reference table should be stored in the very beginning of the file. If it is a cross-reference stream, this means the content of the table will be compressed and the actual size of the cross-reference stream after compression is unpredictable.
So my question is:
How to determine the actual size of this cross-reference stream in advance?
If the actual size of the stream is unpredictable, then after the offsets of objects are written into the stream and the stream is written into the file, the actual offsets of the following objects will change again, won't they? Am I missing something here?
Any hints are appreciated.
How to determine the actual size of this cross-reference stream in advance?
First of all you don't. At least not exactly. You described why.
But it suffices to have an estimate. Just add some bytes to the estimate and later on pad with whitespace. @VadimR pointed out that such padding can regularly be observed in linearized PDFs.
You can either use a rough estimate, as in the QPDF source @VadimR referenced, or you can try for a better one.
You could, e.g. make use of predictors:
At the time you eventually have to create the cross reference streams, all PDF objects can already be serialized in the order you need with the exception of the cross reference streams and the linearization dictionary (which contains the final size of the PDF and some object offsets). Thus, you already know the differences between consecutive xref entry values for most of the entries.
If you use predictors, you essentially only store those differences. So you already know most of the data to compress, and changes in a few entries won't change the compressed result too much. This probably gives you a better estimate.
Furthermore, as the first cross reference stream does not contain too many entries in general, you can try compressing that stream multiple times for different numbers of reserved bytes.
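The reserve-and-retry idea can be sketched as follows; the 4-byte entry format and the padding budget are simplifying assumptions, not the real PDF xref-stream layout:

```python
import struct
import zlib

PAD_BUDGET = 16  # extra bytes kept free for whitespace padding (assumption)

def pack_xref(offsets):
    # Simplified entry format: one 4-byte big-endian offset per object.
    return b"".join(struct.pack(">I", o) for o in offsets)

def reserve_xref_stream(base_offsets):
    """base_offsets: object offsets computed as if the xref stream were
    zero bytes long. Returns (reserved, padding) such that the compressed
    stream plus `padding` whitespace bytes exactly fills the reservation,
    so the offsets of the following objects never shift again."""
    reserved = len(zlib.compress(pack_xref(base_offsets))) + PAD_BUDGET
    while True:
        shifted = [o + reserved for o in base_offsets]
        size = len(zlib.compress(pack_xref(shifted)))
        if size <= reserved:
            return reserved, reserved - size
        reserved = size + PAD_BUDGET
```

Because shifting every offset by a constant barely changes the compressed size, the loop typically settles on the first or second attempt.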
PS: I have no idea what Adobe uses in its linearization code. And I don't know whether it makes sense to fight for a few bytes here; after all, linearization matters most for big documents, for which a few bytes more or less hardly count.

Database or other method of storing and dynamically accessing HUGE binary objects

I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be to store the file on disk, but create a DB-based index. Most filesystems get cranky or slow when you have > 10k files in a folder/directory/etc. Your application can generate the filename and store metadata in the DB, then organize by the generated name on disk. The downside is that file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins and a sophisticated partitioning, incremental backup scheme. Also, seeks within the file become much simpler operations (skip ahead, rewind, etc.). There is generally better support for these operations in a file system than in a DB.
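For the flat-file route, the column-range loads the question asks about (e.g. samples 1,000–2,000 of every recording) can be sketched with a NumPy memory map; the function name and parameters are illustrative:

```python
import numpy as np

def load_columns(path, n_rows, n_cols, start, stop):
    """Map the flat file read-only and copy only the requested sample
    range into RAM; the OS pages in just the regions actually touched."""
    data = np.memmap(path, dtype=np.uint8, mode="r", shape=(n_rows, n_cols))
    return np.array(data[:, start:stop])  # force a real in-memory copy
```

Note that a column slice of row-major data is a strided read (one small read per recording); if column access dominates, storing the data transposed, or chunked as HDF5 does, would serve it much better.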
I wonder what makes you think that RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data, without loading the full LOB, if you know the offset and length you want. So you could have a table with some searchable metadata and then the LOB column, and if needed, an additional metadata table containing metadata on the LOB contents, so that you'd have some kind of keyword->(offset,length) relation available for partial loading of LOBs.
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (ok, can be possible, but at least in my experience tend to have a nasty price tag attached).
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples making up 200 GB, then each sample is ~100 KB, and each recording is ~2 MB (to have 100,000 per file, at 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
I think that Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: (http://technet.microsoft.com/en-us/library/bb933993.aspx).
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)

Is it possible to memory map a compressed file?

We have large files with zlib-compressed binary data that we would like to memory map.
Is it even possible to memory map such a compressed binary file and access those bytes in an effective manner?
Are we better off just decompressing the data, memory mapping it, then after we're done with our operations compress it again?
EDIT
I think I should probably mention that these files can be appended to at regular intervals.
Currently, this data on disk gets loaded via NSMutableData and decompressed. We then have some arbitrary read/write operations on this data. Finally, at some point we compress and write the data back to disk.
Memory mapping is all about the 1:1 mapping of memory to disk. That's not compatible with automatic decompression, since it breaks the 1:1 mapping.
I assume these files are read-only, since random-access writing to a compressed file is generally impractical. I would therefore assume that the files are somewhat static.
I believe this is a solvable problem, but it's not trivial, and you will need to understand the compression format. I don't know of any easily reusable software to solve it (though I'm sure many people have solved something like it in the past).
You could memory map the file and then provide a front-end adapter interface to fetch bytes at a given offset and length. You would scan the file once, decompressing as you went, and create a "table of contents" file that maps periodic nominal offsets to real offsets (this is just an optimization; you could "discover" this table of contents as you fetch data). The algorithm would then look something like:
Given nominal offset n, look up the greatest real offset m that maps to less than n.
Read from m−32k into a buffer (32k is the largest allowed back-reference distance in DEFLATE).
Begin the DEFLATE algorithm at m. Count decompressed bytes until you get to n.
Obviously you'd want to cache your solutions. NSCache and NSPurgeableData are ideal for this. Doing this really well and maintaining good performance would be challenging, but if it's a key part of your application it could be very valuable.
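A sketch of the table-of-contents idea in Python. Instead of restarting DEFLATE mid-stream (which requires bit-level positioning), this snapshots the decompressor state with `zlib.decompressobj().copy()` at regular input intervals, which gives the same random-access effect; the chunk size is an arbitrary assumption:

```python
import zlib

CHUNK = 16 * 1024  # compressed bytes fed per step (assumption)

def build_toc(comp):
    """Single pass over the compressed data: before each input chunk,
    snapshot the decompressor state keyed by uncompressed offset."""
    d = zlib.decompressobj()
    toc = []     # (uncompressed_offset, compressed_offset, snapshot)
    uncomp = 0
    for pos in range(0, len(comp), CHUNK):
        toc.append((uncomp, pos, d.copy()))
        uncomp += len(d.decompress(comp[pos:pos + CHUNK]))
    return toc

def read_at(comp, toc, n, length):
    """Return `length` uncompressed bytes starting at nominal offset `n`,
    resuming from the greatest checkpoint at or before n."""
    uncomp, pos, snap = max(t for t in toc if t[0] <= n)
    d = snap.copy()  # keep the stored snapshot reusable
    out = b""
    while len(out) < (n - uncomp) + length and pos < len(comp):
        out += d.decompress(comp[pos:pos + CHUNK])
        pos += CHUNK
    return out[n - uncomp : n - uncomp + length]
```

The snapshots cost memory (each holds a 32 KB history window), so spacing them out and caching hot ranges, as suggested above, is where the real tuning lies.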

considerations for saving data to ONE file or MULTIPLE?

I am going to be saving data with DPAPI encryption. I am not sure whether I should have one big file with all the data, or break the data up into separate files where every file is its own record. I suspect the entire dataset would be less than 10 MB, so I am not sure whether it's worth breaking it down into a few hundred separate files or whether I should just keep it as one file.
will it take a long time to decrypt 10mb of data?
For 10 megabytes, I wouldn't worry about splitting it up. The cost of encrypting/decrypting a given volume of data will be pretty much the same whether it's one big file or a group of small files. If you needed the ability to selectively decrypt individual records, as opposed to all at once, splitting the file might be useful.
If you can't predict the hardware your app is going to run on, make it scalable. It can then run from 10 parallel floppy drives if reading from one is too slow.
If your scope is limited to high-performance computers, and the file size is not likely to grow within the next 10 years, put it in one file.