Processing potentially large STDIN data, more than once - objective-c

I'd like to provide an accessor on a class that provides an NSInputStream for STDIN, which may be several hundred megabytes (or gigabytes, though unlikely, perhaps) of data.
When a caller gets this NSInputStream it should be able to read from it without worrying about exhausting the data it contains. In other words, another block of code may request the NSInputStream and will expect to be able to read from it.
Without first copying all of the data into an NSData object which (I assume) would cause memory exhaustion, what are my options for handling this? The returned NSInputStream does not have to be the same instance, it simply needs to provide the same data.
The best I can come up with right now is to copy STDIN to a temporary file and then return NSInputStream instances using that file. Is this pretty much the only way to handle it? Is there anything I should be cautious of if I go the temporary file route?
EDIT | I should mention, it's not actually STDIN; this is in a multithreaded FastCGI application, and it's the FCGX_Request.in stream, which ultimately comes from STDIN.

When reading data from a pipe or socket, you have three options:
Process it and forget it.
Add it to a complete record in memory and process it before or after doing so.
Add it to a complete file and process it before or after doing so.
That's the complete list. There's nowhere else to record it but short-term or long-term storage, so the only other thing you can do with data you read is to not record it at all.
The only other way to get the data again is for whatever sent it to you to send it again.
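For illustration, here's a minimal sketch of the temporary-file route in Python (the class name and chunk size are made up; the real code would wrap the same idea in NSInputStream/NSOutputStream): spool the non-seekable input to disk once, then hand each caller its own independent reader over that file.

```python
import io
import shutil
import tempfile

class StreamReplayer:
    """Hypothetical helper: spools a non-seekable stream to a temp file
    once, then hands out independent readers over that file."""

    def __init__(self, source):
        # Copy the stream to disk in bounded chunks so memory stays flat
        # no matter how large the input is.
        spool = tempfile.NamedTemporaryFile(delete=False)
        shutil.copyfileobj(source, spool, 64 * 1024)
        spool.close()
        self._path = spool.name

    def open_stream(self):
        # Each caller gets its own file object with its own read position,
        # so one reader cannot exhaust the data for another.
        return open(self._path, "rb")

# Usage: two callers each read the full data independently.
replayer = StreamReplayer(io.BytesIO(b"pretend this came from FCGX_Request.in"))
first = replayer.open_stream()
second = replayer.open_stream()
assert first.read() == second.read()
```

The main caveats with this route are cleaning up the temp file when you're done and making sure the spool is fully written before the first reader is handed out.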

Related

Event Hub, Stream Analytics and Data Lake pipe questions

After reading this article I decided to take a shot at building a data ingestion pipeline. Everything works well. I was able to send data to Event Hub, have it ingested by Stream Analytics, and sent to Data Lake. But I have a few questions regarding some things that seem odd to me. I would appreciate it if someone more experienced than me could answer them.
Here is the SQL inside my Stream Analytics job:
SELECT
    *
INTO
    [my-data-lake]
FROM
    [my-event-hub]
Now, for the questions:
Should I store 100% of my data in a single file, try to split it into multiple files, or try to achieve one file per object? Stream Analytics is storing all the data inside a single file, as a huge JSON array. I tried setting {date} and {time} as variables, but it is still one huge file every day.
Is there a way to force Stream Analytics to write every entry from Event Hub to its own file? Or maybe limit the size of the file?
Is there a way to set the file name from Stream Analytics? If so, is there a way to overwrite a file if the name already exists?
I also noticed the file is available as soon as it is created, and it is written in real time, so I can see truncated data inside it when I download/display the file. Also, before it finishes, it is not valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is being written? Is it smart enough to ignore the last entry, or does it treat it as an incomplete array of objects?
Is it better to store the JSON data as an array or with each object on a new line?
Maybe I am taking a bad approach to my issue, but I have a huge dataset in Google Datastore (Google's NoSQL solution). I only have access to the Datastore, with an account with limited permissions. I need to store this data in a Data Lake. So I made an application that streams the data from Datastore to Event Hub, which is ingested by Stream Analytics, which writes the files into the Data Lake. It is my first time using these three technologies, but it seems to be the best solution. It is my go-to alternative to ETL chaos.
I am sorry for asking so many questions. I hope someone can help me out.
Thanks in advance.
I am only going to answer the file aspect:
It is normally better to produce larger files for later processing than many very small files. Given that you are using JSON, I would suggest limiting the files to a size that your JSON extractor will be able to manage without running out of memory (if you decide to use a DOM-based parser).
I will leave that to an ASA expert.
ditto.
The answer here depends on how ASA writes the JSON. Clients can append to files, and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should only see a valid JSON document. If it does not, your query may fail.
That depends on how you plan on processing the data. Note that if you write it as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization and more flexibility, I would probably go with one JSON document per line.
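To illustrate why one document per line is more forgiving, here's a small Python sketch (the sample records are made up): a truncated JSON array fails to parse entirely, while truncated line-delimited JSON still yields every complete record.

```python
import json

# A truncated JSON array is unreadable as a whole document...
truncated_array = '[{"id": 1}, {"id": 2}, {"id": 3'
try:
    json.loads(truncated_array)
except json.JSONDecodeError:
    pass  # the parser fails on the incomplete array

# ...while line-delimited JSON degrades gracefully: complete lines
# still parse, and only the trailing half-written record is lost.
truncated_ndjson = '{"id": 1}\n{"id": 2}\n{"id": 3'
records = []
for line in truncated_ndjson.splitlines():
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        continue  # skip the incomplete last line
assert records == [{"id": 1}, {"id": 2}]
```

Line-delimited output also parallelizes naturally, since any byte range can be snapped to line boundaries and parsed independently.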

Chronicle Map v3.9.0 returning different sizes

I am using v3.9.0 of Chronicle Map, where I have two processes: Process A writes to a ChronicleMap and Process B just initializes with the same persistent file that A uses. After loading, I print Map.size() in Process A and Process B, but I get different map sizes. I am expecting both sizes to be the same. In what cases can I see this behaviour?
How can I troubleshoot this problem? Is there any kind of flush operation required?
One thing I tried was dumping the file using the getAll method, but it dumps everything as JSON into a single file, which pretty much kills any of the editors I have. I tried to use MapEntryOperations in Process B to see if anything interesting was happening, but it seems to be invoked mainly when something is written into the map, not when the map is initialized directly from the persistent store.
I was using createOrRecoverPersistedTo instead of the createPersistedTo method. Because of this, my other process was not seeing the entire data.
As the Recovery section of the tutorial explains:
.recoverPersistedTo() needs to access the Chronicle Map exclusively. If a concurrent process is accessing the Chronicle Map while another process is attempting to perform recovery, result of operations on the accessing process side, and results of recovery are unspecified. The data could be corrupted further. You must ensure no other process is accessing the Chronicle Map store when calling for .recoverPersistedTo() on this store.
This seems pretty strange. The size in Chronicle Map is not stored in a single memory location. ChronicleMap.size() walks through and sums each segment's size. So size is "weakly consistent", and if one process constantly writes to a map, size() calls from multiple threads/processes could return slightly different values. But if nobody writes to the map (e.g. in your case just after loading, when process A hasn't yet started writing), all callers should see the same value.
Instead of using getAll() and analyzing the output manually, you can try to "count" entries with something like:
int entries = 0;
for (K k : map.keySet()) {
    entries++;
}

Has anybody ever stored an mp3 in a redis cache?

I'm new to redis, and I think I have a good use case for it. What I'm trying to do is cache an mp3 file for a short time. These MP3s are >2MB in size, but I'm also only talking about maybe 5-10 stored at any moment in time. The TTL on them would be fairly short too: minutes, not hours.
(disk persistence isn't an option).
So, what I'm wondering: do I need to get fancy and Base64-encode the mp3 to store it? Or can I simply set the key's value to the raw byte array?
This redis hit will be from a web service, which in turn gets its data from a downstream service with disk hits. So what I'm trying to do is cache the mp3 file for a short time in my middleware, if you will. I won't need to do it for every file, just the ones >2MB, so I don't have to keep going back to the downstream servers and requesting the file from disk again.
Thanks!
Nick
You can certainly store them, and 2MB is nothing for redis to store.
Redis is binary safe and you don't need to base64 your data, just store via byte array in your favorite client.
One thing I'd consider doing (it might not be worth it with 2MB of data, but it would be if I were to store video files, for example) is to store the file as a sequence of chunks, and not load everything at once. If your app won't hold many files in memory at once, and the files are not that big, it might not be worth it. But if you're expecting high concurrency, do consider this, as it will save application memory, not redis memory.
You can do this in a number of ways:
You can store each block as an element of a sorted set with the sequence number as the score, and read them one by one, so you won't have to load everything into memory at once.
You can store the file as a complete string, but read chunks with GETRANGE (both offsets are inclusive).
e.g.
GETRANGE myfile.mp3 0 99999
GETRANGE myfile.mp3 100000 199999
... #until you read nothing, of course
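As a sketch of that read loop (pure Python, with a stand-in for the actual GETRANGE call; the helper names are made up, and a real client would call something like getrange(key, start, end)):

```python
def read_in_chunks(get_range, chunk_size=100_000):
    """Iterate over a value in fixed-size chunks via a GETRANGE-style
    accessor; get_range(start, end) returns bytes for the inclusive range."""
    offset = 0
    while True:
        chunk = get_range(offset, offset + chunk_size - 1)
        if not chunk:  # GETRANGE returns an empty string past the end
            break
        yield chunk
        offset += chunk_size

# Stand-in for the redis command (inclusive end, like the real GETRANGE):
blob = b"x" * 250_000
fake_get_range = lambda start, end: blob[start:end + 1]

assert b"".join(read_in_chunks(fake_get_range)) == blob
```

This keeps only one chunk in application memory at a time, which is the point of the approach.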

archiving some redis data to disk

I have been using redis a lot lately, and really am loving it. I am mostly familiar with persistence (rdb and aof). I do have one concern. I would like to be able to selectively "archive" some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it because it might be valuable at some point.
All of my keys are named id_<id>_<someattribute>. So when I am done with id 4, I want to "archive" all keys that match id_4_*. I can view them quite easily with the command line, but I can't do anything with them, per se. I have quite a bit of data (very large bitmaps) associated with this data set, and frankly I can't afford the space once the id is no longer relevant or important.
If this were mysql, I would have my different tables and could very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it. Or maybe I have two mysql databases and I want to move one table to another database. Are there redis corollaries to these processes? Is there some way to make an rdb or aof file that is a subset of the data?
Any help or input on this matter would be appreciated! Thanks!
@Hoseong Hwang recently asked what I did, so I'm posting what I ended up doing.
It was really quite simple, actually. I was benefited by the fact that my key space is segmented out by different users. All of my keys were of the structure user_<USERID>_<OTHERVALUES>. My archival needs were on a user basis, some user's data was no longer needed to be kept in redis.
So, I started up another instance of redis-server on another local port (6380?) or another machine; it makes no difference. Then, I wrote a short script that basically just called KEYS user_<USERID>_* (I understand the blocking nature of KEYS; my key space is so small it didn't matter. You can use SCAN if that is an issue for you.) Then, for each key, I ran MIGRATE to move it to that new redis-server instance. After they were all done, I did a SAVE to ensure that the rdb file for that instance was up to date. And now I have that rdb, which contains just the content I wanted to archive. I then terminated that temporary redis-server and the memory was reclaimed.
Now, keep that rdb file somewhere cheap for safekeeping. And if you ever need it again, reversing the process above to get those keys back into your main redis-server would be fairly straightforward.
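For illustration, here's a rough Python sketch of the selection/migration logic, with plain dicts standing in for the two redis instances (the function and key names are made up; a real script would SCAN the source and MIGRATE each key via redis-cli or a client library):

```python
import fnmatch

def archive_user_keys(source, archive, user_id):
    """Move every key matching user_<USERID>_* from one store to another.
    Dicts stand in for the two redis instances here; MIGRATE likewise
    removes the key from the source once it lands in the destination."""
    pattern = f"user_{user_id}_*"
    # Snapshot the matching keys first, since we mutate while moving.
    for key in [k for k in source if fnmatch.fnmatch(k, pattern)]:
        archive[key] = source.pop(key)

src = {"user_4_bitmap": b"\x01", "user_4_name": b"n", "user_5_name": b"m"}
dst = {}
archive_user_keys(src, dst, 4)
assert src == {"user_5_name": b"m"}
assert sorted(dst) == ["user_4_bitmap", "user_4_name"]
```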
Instead of trying to extract data from a live Redis instance for archiving purposes, my suggestion would be to extract the data from a dump file.
Run a bgsave command to generate a dump, and then use redis-rdb-tools to extract the keys you are interested in - you can easily get the result as a json file.
See https://github.com/sripathikrishnan/redis-rdb-tools
You can keep the json data in flat files, or try to store it in a relational database or a document store if you need it indexed for retrieval.
A few suggestions for you...
I would like to be able to selectively "archive" some of my data to
disk (or cheaper storage) once it is no longer important. I don't
really want to delete it because it might be valuable at some point.
If such data is that valuable, use a traditional database for storage. Despite redis supporting snapshotting to disk and AOF logs, you should view it as mostly volatile storage. The primary use case for redis is reducing latency, not persistence of valuable data.
So when I am done with id 4, I want to "archive" all keys that
match id_4_*
What constitutes done? You need to ask yourself this question: does it mean that after 1 day the data can fall out of redis? If so, just use TTL and expiration to let redis remove the object from memory. If you need it again, fall back to the database and pull the object back into redis. The first client will take the hit of pulling from the db, but subsequent requests will be cached. If done means something not associated with a specific duration, then you'll have to remove items from redis manually to conserve memory space.
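As a sketch of that cache-aside pattern (plain Python, no redis; the class and parameter names are made up):

```python
import time

class TTLCache:
    """Minimal cache-aside sketch: entries expire after ttl seconds,
    and misses fall back to a loader (standing in for the database)."""

    def __init__(self, loader, ttl):
        self._loader = loader
        self._ttl = ttl
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        hit = self._store.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]  # cached and still fresh
        value = self._loader(key)  # miss or expired: hit the "db"
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

db_calls = []
cache = TTLCache(loader=lambda k: db_calls.append(k) or k.upper(), ttl=60)
assert cache.get("id_4") == "ID_4"   # first request hits the loader
assert cache.get("id_4") == "ID_4"   # second request is served from cache
assert db_calls == ["id_4"]          # the db was only hit once
```

In redis itself this is just SET with EX (or SETEX) plus a db fallback on a nil GET; the expiry bookkeeping above is what redis does for you.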
If this were mysql, I would have my different tables and would very
easily just dump it to a .sql file and then drop the table. The actual
.sql file isn't directly useful to me, but I could reimport the data
if/when I need it.
We do the same at my firm. Important data is imported into redis from the rdbms as an on-demand job. We don't drop tables; we just selectively import data from the database into redis. Nothing wrong with that.
Is there someway to make an rdb or aof file that is a subset of the
data?
I don't believe there is a way to do selective archiving; it's either all or none.
IMO, spend more time playing with redis. I highly recommend leveraging out-of-the-box features instead of reinventing and/or over-engineering solutions to suit your needs.
Hope that helps!

Does writeToFile:atomically: block asynchronous reading?

A few times while using my application, I process some large data in the background. (To be ready when the user needs it; something like indexing.) When this background process finishes, it needs to save the data to a cache file, but since the data is really large, this takes some seconds.
But at the same time the user may open a dialog which displays images and text loaded from the disk. If this happens while the background process is saving its data, the user interface needs to wait until the saving process is completed. (This is not wanted, since the user then has to wait 3-4 seconds until the images and texts are loaded from disk!)
So I am looking for a way to throttle the writing to disk. I thought of splitting the data up into chunks and inserting a short delay between saving the different chunks. During this delay, the user interface would be able to load the needed texts and images, so the user would not notice a delay.
At the moment I am using [[array componentsJoinedByString:@"\n"] writeToFile:@"some name.dic" atomically:YES]. This is a very high-level solution which doesn't allow any customization. How can I write the large data into one file without saving all of it in one shot?
Does writeToFile:atomically: block asynchronous reading?
No. An atomic write is like writing to a temporary file and, once that completes successfully, renaming the temporary file to the destination (replacing the pre-existing file at the destination, if it exists).
You should consider how you can break your data up so it is not so slow. If it's all divided by strings/lines and it takes seconds, an easy approach to dividing the database would be by first character. Of course, a better solution could likely be imagined, based on how you access, search, and update the index/database.
…inserting a short delay between saving the different chunks. In this delay, the user interface will be able to load the needed texts and images, so the user will not recognize a delay.
Don't. Just implement the move/replace of the atomic write yourself (writing to a temporary file during indexing, then moving it into place). Then your app can serialize read and write commands explicitly for fast, consistent, and correct access to these shared resources.
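A minimal Python sketch of that temp-file-plus-rename pattern (the function name is made up; it mirrors what atomically:YES does under the hood, assuming atomic rename semantics on the target filesystem):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to a temp file in the destination's directory, then
    atomically swap it into place; readers see either the old file or
    the new one, never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes hit the disk
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        os.unlink(tmp_path)         # clean up the temp file on failure
        raise
```

The temp file lives in the same directory as the destination so the rename stays on one filesystem, which is what makes it atomic.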
You should look at the NSFileHandle class.
Using a combination of seekToEndOfFile and writeData: you can do the work you wish.
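Here's a rough Python sketch of that chunked-append idea (names, chunk size, and the delay value are illustrative, not recommendations):

```python
import time

def append_in_chunks(path, data, chunk_size=1 << 20, delay=0.0):
    """Append data piece by piece, the same pattern as seeking to the end
    of a file with NSFileHandle and calling writeData: repeatedly. The
    optional delay between chunks lets other disk I/O interleave."""
    with open(path, "ab") as fh:  # append mode: every write lands at EOF
        for offset in range(0, len(data), chunk_size):
            fh.write(data[offset:offset + chunk_size])
            fh.flush()
            if delay:
                time.sleep(delay)
```

Note that unlike the atomic write, readers can observe a partially written file with this approach, so it suits the throttling goal only if you coordinate reads yourself (e.g. by writing to a temp name and renaming at the end).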