Does anyone know how to read a gzip file (gzip in the spool source directory) in a Flume process? - gzip

If we want to get data from a spoolDir source whose directory contains gzip files, what should I change on the source side of the Flume process? Is a customized EventDeserializer enough, or do I also need a new source type (e.g., a customized GzipSpoolDirectorySource instead of the default spooldir)?

OK, so if you don't want to unpack your GZIP files at the Flume level, that's actually quite easy. You can configure your Spool Dir source to use a BlobDeserializer:
https://flume.apache.org/FlumeUserGuide.html#event-deserializers
This will parse the entire file as one event and spool that. If you want to store it to HDFS, for instance, make sure that you activate the fileHeader property on your spool dir source. You can then use the %{file} variable in your path, which effectively allows you to use Flume as a one-to-one file copy mechanism.
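For reference, a minimal agent configuration along those lines might look like the following (the agent, channel, and path names are made up for illustration; the BlobDeserializer class and maxBlobLength come from the user guide linked above):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = hdfs1

# Spool dir source that treats each file as one event
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /data/incoming
agent1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.src1.deserializer.maxBlobLength = 100000000
agent1.sources.src1.fileHeader = true
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = file

# HDFS sink that reuses the original file path via the "file" header
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/%{file}
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.channel = ch1

Note that maxBlobLength caps the event size, so it has to be at least as large as your biggest gzip file.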

Related

Splunk: Configure inputs.conf to parse only JSON files

How do I set up inputs.conf in Splunk to parse only JSON files found in multiple directories? I could define a single sourcetype (KV_MODE=json) in props.conf, but I'm not sure about the configuration in inputs.conf.
Currently, my inputs.conf has multiple stanzas, each specifying an application log path that contains JSON files. Each stanza has its own sourcetype defined in props.conf pointing to JSON KV_MODE. I would like to minimize the steps and consolidate into a single stanza if possible.
Each monitor stanza in Splunk monitors a single file path, although that path can contain wildcards. You could do something like [monitor:///.../*.json] to monitor any file anywhere with a .json extension, but that would consume a crazy amount of resources.
You're better off with a separate stanza for each directory that contains JSON data. Maybe you can use wildcards to condense to a few entries.
All of them, however, can use the same sourcetype so there's no need to touch props.conf to monitor a new file path.
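As a sketch (the directory paths and sourcetype name are made up for illustration), the consolidated configuration could look like this in inputs.conf:

[monitor:///var/log/app1/.../*.json]
sourcetype = app_json

[monitor:///var/log/app2/.../*.json]
sourcetype = app_json

with the sourcetype defined once in props.conf:

[app_json]
KV_MODE = json

The ... wildcard recurses into subdirectories, so one stanza per application root directory is usually enough.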

How to delete large file in Grails using Apache camel

I am using Grails 2.5 and Apache Camel. I have a folder called GateIn, and the poll delay is 3 minutes, so every 3 minutes the route looks in the folder for a file. If a file exists, it starts processing it. If the file is processed within 3 minutes, it gets deleted automatically; but if my file takes, say, 10 minutes, the file is not deleted, and the route processes the same file again and again. How can I make sure the file gets deleted once it has been processed, whether it is a small or a bulk file? I have used noop=true to stop reuse of the file, but I also want the file deleted once it is processed. Please give me some suggestions.
You can check the file size using the Camel File language and decide what to do next.
Usually, when a small polling interval has to cope with large files, it is better to have a separate processing zone (a physical directory) and move the file into that zone immediately after consuming it.
You can then have separate logic or another Camel route to process the file from there. After successful processing, you can delete it or take whatever step your requirement calls for. Hope it helps!
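As a rough sketch (the directory and timings are placeholders), the consuming endpoint could combine the 3-minute poll with a read lock and delete-on-completion, so the file is only removed after the route has fully processed it:

file:/opt/GateIn?delay=180000&delete=true&readLock=changed&readLockCheckInterval=10000

or, to park processed files in another directory instead of deleting them:

file:/opt/GateIn?delay=180000&move=.done&moveFailed=.error&readLock=changed

delete (or move) is applied only once the exchange completes, and readLock=changed makes the consumer wait until the file has stopped growing before picking it up, which helps with the bulk-file case.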

Generate A Large File Inside s3 with .NET

I would like to generate a big file (several TB) in a special format using my C# logic and persist it to S3. What is the best way to do this? I can launch a node in EC2, write the big file to EBS, and then upload the file from EBS to S3 using the S3 .NET client library.
Can I stream the file content as I am generating it in my code and push it directly to S3 until the generation is done, especially for such a large file, without running into out-of-memory issues? I can see that the code below helps with a stream, but it sounds like the stream has to be filled up already. I obviously cannot hold such an amount of data in memory, and I also do not want to save it to disk as a file first.
// Old-style AWS SDK for .NET upload: the whole payload is handed over as a single
// in-memory stream (ms), which is exactly what won't work for a multi-TB object.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file and stream it to S3 as I am generating it?
You can certainly upload any file up to 5 TB; that's the limit. I recommend using streaming and the multipart put operations. Uploading a 1 TB file in one go could easily fail partway through and you'd have to do it all over, so break it up into parts as you store it. Also be aware that if you need to modify the file, you would have to download it, modify it, and re-upload it. If you plan on modifying the file at all, I recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
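The question is about .NET, but to make the part-by-part flow concrete, here is a rough sketch of the same idea in Python with boto3 (the bucket, key, and chunk generator are placeholders); the low-level multipart calls have direct equivalents in the AWS SDK for .NET:

import boto3

BUCKET = 'my-bucket'                  # placeholder
KEY = 'generated/big-file.bin'        # placeholder
PART_SIZE = 8 * 1024 * 1024           # every part except the last must be at least 5 MB

def generate_chunks(part_size, total_parts=4):
    # stand-in for the real generation logic; yields one part's worth of bytes at a time
    for _ in range(total_parts):
        yield b'\x00' * part_size

s3 = boto3.client('s3')
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
try:
    for number, chunk in enumerate(generate_chunks(PART_SIZE), start=1):
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'],
                              PartNumber=number, Body=chunk)
        parts.append({'PartNumber': number, 'ETag': resp['ETag']})
    s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'],
                                 MultipartUpload={'Parts': parts})
except Exception:
    # abort so already-uploaded parts don't keep accruing storage charges
    s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload['UploadId'])
    raise

Only one part ever has to sit in memory at a time, and a failed part can be retried without redoing the whole upload.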

Linking Redis database with a dump.rdb or dump.json file

Given a snapshot of an existing Redis database in a dump.rdb (or .json format) file, I want to restore this data on my own machine to run some tests on it.
Any pointers on how to do this would be greatly appreciated.
I have resorted to trying to parse the data in the dump.rdb and then save it in a redis DB manually. I feel like there is/should be a cleaner way.
If you want to restore the entire file, simply copy it to the right directory specified in redis.conf and restart the Redis server. But if you want to load a subset of keys/databases, you'd have to parse the dump file.
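Concretely (the path here is just an example), the two redis.conf settings that matter are the working directory and the dump file name; drop your snapshot there under that name and restart the server:

# redis.conf
dir /var/lib/redis
dbfilename dump.rdb

One caveat: if AOF persistence (appendonly yes) is enabled, Redis loads the AOF instead of the RDB file on startup, so disable it or rewrite the AOF afterwards.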
So: I continued doing it the "hacky" way, and found that the parser code at https://github.com/sripathikrishnan/redis-rdb-tools was a great help.
Using the parser sample code I was able to (roughly as sketched below):
1) set up a redis client
2) use the parser to parse the data
3) use the client to "set" the parsed data into a new redis database.
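A rough sketch of those three steps, assuming redis-rdb-tools' RdbParser/RdbCallback and the redis-py client (callback constructor signatures vary a bit between versions, and only plain string keys are handled here):

import redis
from rdbtools import RdbParser, RdbCallback

class RestoreCallback(RdbCallback):
    # writes every plain string key from the dump into a local Redis instance
    def __init__(self, client):
        super(RestoreCallback, self).__init__(string_escape=None)
        self.client = client

    def set(self, key, value, expiry, info):
        # called once per string key found in the dump; expiry/info are ignored in this sketch
        self.client.set(key, value)

r = redis.StrictRedis(host='localhost', port=6379, db=0)  # 1) set up a redis client
parser = RdbParser(RestoreCallback(r))                    # 2) use the parser to parse the data
parser.parse('dump.rdb')                                  # 3) "set" the parsed data into Redis

Hash, list, set, and sorted-set keys would need the corresponding callback methods (hset, rpush, sadd, zadd) implemented the same way.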
The rdd tool can also do that. It works independently of .rdb files and can dump/restore working Redis instances, and it can apply merge, split, rename, search, filter, insert, and delete operations on dumps and/or live Redis instances.

Comparing uncompressed local files to compressed files stored on Amazon S3?

We put hundreds of image files on Amazon S3 that our users need to synchronize to their local directories. In order to save storage space and bandwidth, we zip the files stored on S3.
On the users' end they have a Python script that runs every 5 minutes to get a current list of files and download new/updated files.
My question is: what is the best way to determine what is new or changed and needs to be downloaded?
Currently we add an additional metadata header to the compressed file which contains the MD5 value of the uncompressed file...
We start with a file like this:
image_file_1.tif 17MB MD5 = xxxx1234
We compress it (with 7zip) and put it to S3 (with Python/Boto):
image_file_1.tif.z 9MB MD5 = yyy3456 x-amz-meta-uncompressedmd5 = xxxx1234
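For illustration, the upload with that custom header could look roughly like this with boto3 (the original setup may use the older boto library; the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')
with open('image_file_1.tif.z', 'rb') as f:
    s3.put_object(
        Bucket='my-image-bucket',                  # placeholder
        Key='image_file_1.tif.z',
        Body=f,
        Metadata={'uncompressedmd5': 'xxxx1234'})  # stored as x-amz-meta-uncompressedmd5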
The problem is we can't get a large list of files from S3 that includes the x-amz-meta-uncompressedmd5 header without an additional API call for EACH one (slow for hundreds/thousands of files).
Our most practical solution is to have users get a full list of files (without the extra headers) and download the files that do not exist locally. If a file does exist locally, we do an additional API call to get the full headers and compare the local MD5 checksum against x-amz-meta-uncompressedmd5.
I'm thinking there must be a better way.
You could include the MD5 hash of the uncompressed image in the compressed filename.
So image_file_1.tif could become image_file_1.xxxx1234.tif.z
Your users' Python script that does the synchronising would therefore have the information needed to determine whether it needs to fetch the file again from S3, and could either strip out the MD5 part of the filename or keep it, depending on what you want to do.
Or, you could maintain on S3 a single file containing the full file list including the MD5 metadata. The Python script would then just need to fetch that single file, parse it, and then decide what to do.
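A rough sketch of that second approach with boto3 (the manifest name, bucket, and JSON layout are all assumptions: a flat {"key": "uncompressed md5"} mapping maintained on the server side):

import hashlib
import json
import os

import boto3

BUCKET = 'my-image-bucket'        # placeholder
MANIFEST_KEY = 'manifest.json'    # placeholder, e.g. {"image_file_1.tif.z": "xxxx1234", ...}
LOCAL_DIR = '/data/images'        # placeholder

s3 = boto3.client('s3')
manifest = json.loads(s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)['Body'].read())

for key, remote_md5 in manifest.items():
    local_path = os.path.join(LOCAL_DIR, key[:-2])   # strip the ".z" suffix
    if os.path.exists(local_path):
        with open(local_path, 'rb') as f:
            local_md5 = hashlib.md5(f.read()).hexdigest()
        if local_md5 == remote_md5:
            continue                                  # unchanged, nothing to do
    # new or changed: fetch the compressed object (decompression not shown)
    s3.download_file(BUCKET, key, local_path + '.z')

One GET for the manifest replaces the per-file HEAD requests, which is what makes the 5-minute polling cheap.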