Splunk: Configure inputs.conf to parse only JSON files

How do I set up inputs.conf in Splunk to parse only JSON files found in multiple directories? I could define a single sourcetype (KV_MODE=json) in props.conf, but I'm not sure about the code in inputs.conf.
Currently the file has multiple stanzas, each specifying an application log path that contains JSON files, and each stanza has its own sourcetype defined in props.conf to point at JSON KV_MODE. I would like to minimize the steps and consolidate into a single stanza if possible.

Each monitor stanza in Splunk monitors a single file path, although that path can contain wildcards. You could do something like [monitor:///.../*.json] to monitor any file anywhere with a .json extension, but that would consume a crazy amount of resources.
You're better off with a separate stanza for each directory that contains JSON data; you may be able to use wildcards to condense them into a few entries.
All of them, however, can use the same sourcetype, so there's no need to touch props.conf to monitor a new file path.
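For example, something along these lines (a sketch only - the directory paths, index and sourcetype name are placeholders to replace with your own):

    # inputs.conf -- one monitor stanza per directory that holds JSON logs,
    # all sharing a single sourcetype
    [monitor:///var/log/app1/*.json]
    sourcetype = app_json
    index = main

    [monitor:///var/log/app2/*.json]
    sourcetype = app_json
    index = main

    # props.conf -- the one shared sourcetype
    [app_json]
    KV_MODE = json

Adding another JSON directory is then just one more monitor stanza that reuses the same sourcetype.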

Related

Getting some extra files without any extension on Azure Data Lake Store

I am using Azure Data Lake Store for file storage. I am using operations like:
Creating a main file
Creating part files
Appending these part files to the main file (for Concurrent append)
Example:
There is a main log file (which will eventually contain logs from all programs)
There are part log files that each program creates on its own and then appends to the main log file
The workflow runs fine, but I have noticed some unknown files getting uploaded to the store directory. Each of these files is named with a GUID, has no extension, and is empty.
Does anyone know what might be the reason for these extra files?
Thanks for reformatting your question. These look like processing artefacts that will probably disappear shortly. How did you upload/create your files?

Regular expression to include files in multiple folders

I have multiple folders (4000+), with files in every folder.
What is the regular expression to select the files in all the folders in the Pentaho "Get a file with SFTP" step?
I am afraid you cannot put a wildcard in the directory name, nor make a recursive search (in which case you could have written the wildcard into the path name of the files).
You may find an alternative in this other question.
Note that PDI uses Apache's VFS almost everywhere, so you probably do not need the SFTP job entry. You can Copy File or Move File using the files' URI (sftp://path/to/file) instead of their name. With the same trick you can also point the transformation's Input File at the remote URI without copying the file first.

Executing Abaqus Model in Taverna

I'm pretty new to both Taverna and Abaqus, but I am trying to run an Abaqus model using a "Tool" in Taverna remotely on an HPC. This works fine if I already have my model file and inputs on the HPC, but I need a way of uploading the files dynamically in Taverna (I am trying to wrap Abaqus models generically).
I've tried adding an input port that takes a file list, but I don't know how I can copy it to the "location" that I've set for the tool. Could a Beanshell service be the answer, or can I iterate through the file list and copy the files up before executing the Abaqus model?
Thanks
When you say that you created an input port that takes a file list, I guess you mean an input to the tool service.
Assuming the input port is called my_file_list, when the tool service is run it will take a list of data values on port my_file_list. As an example, say "hello", "hi" and "hola" are the three values in the list.
At the location where the tool service is run, it executes in a temporary directory - a different directory for each execution of the service. It is normally something like /tmp/usecase-2029778474741087696.
Three files will be created in that temporary directory; they contain the three values (in this example) that the tool service received on port my_file_list. The files could be called:
/tmp/usecase-2029778474741087696/tempfile.0.tmp containing hello
/tmp/usecase-2029778474741087696/tempfile.1.tmp containing hi
/tmp/usecase-2029778474741087696/tempfile.2.tmp containing hola
There will also be a file called my_input_list. That file will contain
/tmp/usecase-2029778474741087696/tempfile.0.tmp
/tmp/usecase-2029778474741087696/tempfile.1.tmp
/tmp/usecase-2029778474741087696/tempfile.2.tmp
The script of your tool service would normally read the contents of my_input_list line by line and do something with the contents of the listed file(s).
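A minimal sketch of such a tool script (written as shell, since the tool box takes shell commands; the per-file command is only a placeholder, and my_input_list is the list file from the example above):

    # loop over the staged temporary files listed in my_input_list
    while read -r f; do
        echo "processing $f"
        cat "$f"    # placeholder: replace with whatever should happen per file
    done < my_input_list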
I have also seen some scripts that 'cheat' and iterate directly over tempfile*.tmp, but that would be "a bad thing". The problem with that trick is that if you add a second list of files to the tool service, then the file my_input_list could contain
/tmp/usecase7932018053449784034/tempfile.4.tmp
/tmp/usecase7932018053449784034/tempfile.5.tmp
/tmp/usecase7932018053449784034/tempfile.6.tmp
as other temporary files were used for the other file list port.
I hope that helps
The tool service allows you to upload files - but if you are using the HPC through a job submission node, then you would have to modify your command-line tool to use the job's file-staging command to push the files further as part of the job. The files would be available in the current (temporary) directory of the specified tool script.
I would try to do it through the Tool service and not involve the beanshell - then you can keep your workflow simpler.
A good thing to remember is that you can write multiple shell commands in the box.
Similarly, you would probably want to retrieve the results so that you can process them further in the workflow (unless they are massive, in which case you should just output their remote filenames and pass them in again to the next HPC job).
The exact commands to use for staging files and retrieving them depends on the HPC job submission system. Which one are you using?
Thanks for the input guys.
It was my misunderstanding of how Taverna uses the file list. All the files in the list are copied to the temporary "sandbox" and are therefore available for use.
Another nice, easy way is to zip the directory and pass the zipped file into an input port for the service, then just unzip it inside the command.
Thanks again

CocoaLumberjack: how to write to the same folder with different file names

With the help of this CocoaLumberjack FileLogger logging to multiple files, I am able to create multiple log files (with the same name in multiple directories).
But I need to use DDLog in one of my projects where it is required to write multiple log files to the same directory with different names.
Is there any way to achieve this?
DDFileLogger uses logFileManager for managing log files. By default it uses DDLogFileManagerDefault. You can create your own file manager that conforms to the DDLogFileManager protocol and provides whatever behavior you need.
The easiest way to do this is to copy DDLogFileManagerDefault and change it to fit your needs.
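As a rough sketch in Objective-C (this assumes the newLogFileName / isLogFile: hooks that DDLogFileManagerDefault exposes for subclassing in current CocoaLumberjack releases; the class name and the "NetworkLog" prefix are made up):

    #import <CocoaLumberjack/CocoaLumberjack.h>

    // A file manager that gives its log files a fixed, distinctive prefix.
    @interface PrefixedLogFileManager : DDLogFileManagerDefault
    @end

    @implementation PrefixedLogFileManager

    // Name used whenever a new log file is created
    - (NSString *)newLogFileName {
        NSDateFormatter *formatter = [NSDateFormatter new];
        formatter.dateFormat = @"yyyy-MM-dd-HH-mm-ss";
        return [NSString stringWithFormat:@"NetworkLog-%@.log",
                [formatter stringFromDate:[NSDate date]]];
    }

    // Tells the logger which files in the directory belong to it
    - (BOOL)isLogFile:(NSString *)fileName {
        return [fileName hasPrefix:@"NetworkLog-"] && [fileName hasSuffix:@".log"];
    }

    @end

Two DDFileLogger instances created with two such managers (each initialised with the same logsDirectory via initWithLogsDirectory:, but generating different prefixes) should then write differently named log files into one directory.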

Ways to achieve de-duplicated file storage within Amazon S3?

I am wondering about the best way to achieve de-duplicated (single-instance) file storage within Amazon S3. For example, if I have 3 identical files, I would like to store the file only once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have used to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
Hash the file first, using SHA-1 or stronger.
Use the hash to name the file. Do not use the actual file name.
Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
To upload subsequent files:
Calculate the hash, and only upload the data blob file if it doesn't already exist.
Save the directory entry with the hash as the content, like for all files.
To read a file:
Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
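A minimal sketch of this scheme in Python with boto3 (the bucket names, the choice of SHA-256 and the key layout are all assumptions made for illustration):

    import hashlib

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BLOB_BUCKET = "my-dedup-blobs"  # content-addressed data blobs (assumed name)
    FS_BUCKET = "my-dedup-fs"       # 'virtual file system' of pointer objects (assumed name)

    def file_sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def blob_exists(digest):
        try:
            s3.head_object(Bucket=BLOB_BUCKET, Key=digest)
            return True
        except ClientError:
            return False

    def upload(path, logical_name):
        digest = file_sha256(path)
        if not blob_exists(digest):      # store the content only once
            s3.upload_file(path, BLOB_BUCKET, digest)
        # the 'directory entry': a tiny object whose body is just the hash
        s3.put_object(Bucket=FS_BUCKET, Key=logical_name, Body=digest.encode())

    def download(logical_name, dest_path):
        digest = s3.get_object(Bucket=FS_BUCKET, Key=logical_name)["Body"].read().decode()
        s3.download_file(BLOB_BUCKET, digest, dest_path)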
You could also make this technique more efficient by uploading files in fixed-size blocks - and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the block chain for that file. That would also have the advantage that uploading a large file which is only slightly different from another previously uploaded file would involve a lot less storage and data transfer.
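For the block-level variant, the only structural change is that the directory entry stores an ordered manifest of block hashes instead of a single hash. Reusing the helpers from the sketch above (the block size and manifest format are arbitrary choices):

    BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, arbitrary

    def upload_blocks(path, logical_name):
        manifest = []
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(BLOCK_SIZE), b""):
                digest = hashlib.sha256(block).hexdigest()
                if not blob_exists(digest):               # de-duplicate per block
                    s3.put_object(Bucket=BLOB_BUCKET, Key=digest, Body=block)
                manifest.append(digest)
        # the directory entry is now the ordered list of block hashes
        s3.put_object(Bucket=FS_BUCKET, Key=logical_name,
                      Body="\n".join(manifest).encode())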