How can we compute a checksum for the entire file data in Pentaho?
I know we can calculate a checksum using the "Add a checksum" step (but it returns a checksum value per row for my CSV file input) and the "Calculator" step (but it returns null or zero as the per-row checksum for my CSV file input).
Instead, I want a checksum for the entire file data and NOT per individual row. How can we achieve this?
Thank you
You can use Java functions via the "Modified JavaScript Value" step, like this ("filename" is a column containing the path to the file):
var md5_hash = '';
// Open the file named in the "filename" field and hash its contents
// with Apache Commons Codec's DigestUtils.
var file = new Packages.java.io.File(filename);
var fileInputStream = new Packages.java.io.FileInputStream(file);
md5_hash = Packages.org.apache.commons.codec.digest.DigestUtils.md5Hex(fileInputStream);
fileInputStream.close();
Alternatively, load the entire file into a single row using "load file data in memory", apply a checksum to that, and then do a Cartesian join or a stream lookup on the filename against your regular data flow.
Finally, I was able to compute the checksum of the entire file.
I used the "User Defined Java Class" step and Java's java.security.MessageDigest class to compute and return the checksum of a file read with a FileInputStream.
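For anyone looking for a starting point, here is a rough sketch of what the processRow body of such a "User Defined Java Class" step could look like. The input field "filename", the output field "checksum", and the choice of MD5 are illustrative assumptions, not the exact code used, and the "checksum" field would have to be declared in the step's output fields tab.

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }
    r = createOutputRow(r, data.outputRowMeta.size());

    String filename = get(Fields.In, "filename").getString(r);
    try {
        // Stream the whole file through MessageDigest so large files are not loaded into memory.
        java.security.MessageDigest md = java.security.MessageDigest.getInstance("MD5");
        java.io.FileInputStream in = new java.io.FileInputStream(filename);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        in.close();

        // Convert the digest bytes to a hex string and put it in the output field.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        get(Fields.Out, "checksum").setValue(r, hex.toString());
    } catch (Exception e) {
        throw new KettleException(e);
    }

    putRow(data.outputRowMeta, r);
    return true;
}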
Thanks
I have a custom Extractor with AtomicFileProcessing set to false. It extracts a large number of JSON files (each line in a file is a JSON document) and outputs two files, one with successful and one with failed requests, both containing the JSON rows (more than one AU is allocated to extract the files). The problem is that when I use the same extractor, with more than one AU, to extract the files output by the first step, it fails with the error: Unexpected character encountered while parsing value: e. Path '', line 0, position 0.
If I assign 1 AU on Azure, or run this locally with the AU count set to more than 1, it successfully processes the data. Is this behavior caused by more than one AU being used to process a single JSON file, which cannot be parallelized because the file is in a non-splittable format?
You can solve this problem by converting your JSON file to JSON Lines:
http://jsonlines.org/examples/
Then read the file using the text extractor and use the JsonFunctions available in Microsoft.Analytics.Samples.Formats to parse the JSON.
That transformation makes your file splittable, so you can parallelize it!
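For illustration, a sketch of what that could look like in U-SQL. The input/output paths and the "id"/"name" properties are placeholders, and it assumes the Newtonsoft.Json and Microsoft.Analytics.Samples.Formats assemblies are already registered in your database.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

// Read each JSON Lines row as one plain string column; the '\b' delimiter
// keeps the whole line in a single column so the file stays splittable by row.
@lines =
    EXTRACT jsonLine string
    FROM "/input/data.jsonl"
    USING Extractors.Text(delimiter : '\b', quoting : false);

// Parse each line into a property map and project the fields you need.
@parsed =
    SELECT JsonFunctions.JsonTuple(jsonLine) AS json
    FROM @lines;

@result =
    SELECT json["id"] AS id,
           json["name"] AS name
    FROM @parsed;

OUTPUT @result
TO "/output/result.csv"
USING Outputters.Csv();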
New to PDI here. I need to output data from a view in a PostgreSQL database to a file daily. The output file will be named like xxxx_20160427.txt, so I need to append the date to the file name dynamically. How do I do it?
EDIT-----------------
I was not clear here: I am trying to add not just the date but other optional parts to the file name, e.g. a serial number (01) at the end: xxxx_2016042701.txt. So my real question is how to build a dynamic file name. In other ETL tools, e.g. SSIS, this would be a simple expression. How is it done in PDI?
In your Text file output step, simply check "Include date in filename?" under the files tab.
You can create a dynamic filename field with a Modified Java Script Value step,
and then in the Text file output step check "Accept file name from field" and select the field declared in the previous step (filename_var in this example), as in the sketch below.
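A minimal sketch of that script step, assuming you want the date plus a fixed serial number; the output path, prefix, and serial value are placeholders:

// Build a dynamic file name such as /output/xxxx_2016042701.txt
var fmt = new Packages.java.text.SimpleDateFormat("yyyyMMdd");
var dateStr = fmt.format(new Packages.java.util.Date());
var serial = "01"; // hypothetical serial number; it could come from another field instead
var filename_var = "/output/xxxx_" + dateStr + serial + ".txt";

Add filename_var to the step's output fields, then select it as the file name field in the Text file output step.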
This is a continuation of this question:
Pentaho Spoon - Output to multiple files based on field content
How can I do this for the "XML output" step, since "XML output" doesn't have an "Accept file name from field" input in its form?
Since the step doesn't accept the filename from previous fields, you can use variables to set the value.
It's a longer path, but at least it allows you to resolve it:
First, create a transformation that sets the filenames in the result (using the "Copy rows to result" step).
Then create another transformation with the "Execute for every input row" option checked. This transformation has to read from the result (the "Get rows from result" step) and set the values with the "Set Variables" step.
Finally, a third transformation creates the XML file, using the previously created variable as the filename (variables are referenced with this syntax: ${variable_name}).
Connect those three transformations with a job.
How can I print the f[r] and f[k] values to an output file along with the r and k values using Mathematica?
Is there any way I can automate the export of this output to a .txt file without having to re-write the Print[] commands?
The simplest approach to saving Mathematica expressions is to use the Save function. You could write
Save[filename,x]
and Mathematica will save the definitions associated with variable x into the file you've named. Note
Save appends to an already existing file;
expressions are written in InputForm;
you can load the expressions back into your workspace using the << (aka Get) function, which reads and evaluates the expressions stored in a file.
How you actually use Save to store your data is up to you. You might, perhaps, assign the results of a call such as Table[{k,f[k]},{k,min,max,step}] to a variable and save that result variable, which will appear in the file as a table of k,f[k] pairs.
Since Save appends to an existing file you could, if you are using loops, save a k, f[k] pair at each iteration. But why would you be using loops in Mathematica?
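For instance, a minimal sketch along those lines; f, min, max, and step stand in for your own definitions, and the file name is just an example:

(* Collect the {k, f[k]} pairs in one expression and save it in InputForm. *)
results = Table[{k, f[k]}, {k, min, max, step}];
Save["results.txt", results]

(* Later, read the definitions back into a session with Get. *)
<< "results.txt"

Remember that running Save again appends to results.txt rather than overwriting it.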
I ran into this problem when uploading a file with a super long name - my database field was only set to 50 characters. Since then I have increased the database field's length, but I'd like a way to check the length of the filename before uploading. Below is my code. The validation returns '85' as the character length, and it returns the same count for every file I upload (none of which has a file name 85 characters long).
<cfscript>
missing_info = "<p>There was a slight problem with your submission. The following are required or invalid:</p><ul>";
// Check the length of the file name for our database field
if ( len(Form["ResumeFile1"]) gt 100 )
{
    missing_info = missing_info & "<li>'Resume File 1' is invalid. Character length must be less than 100. Current count is " & len(Form["ResumeFile1"]) & ".</li>";
    validation_error = true;
    ResumeFileInvalidMarker = true;
}
</cfscript>
Anyone see anything wrong with this?
Thanks!
http://www.cfquickdocs.com/cf9/#cffile.upload
After you upload the file, the variable "clientFileName" will give you the name of the uploaded file, without a file extension.
The only way to read the filename before you upload it would be to use JavaScript to read and parse the value (file path) in the file field.
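A minimal sketch of that client-side check, assuming the file input has the id "resumeFile1" (the id and the 100-character limit are placeholders taken from the question):

// Hypothetical pre-submit check; browsers may prefix the value with a fake path, so strip it.
var input = document.getElementById("resumeFile1");
var name = input.value.replace(/^.*[\\\/]/, "");
if (name.length > 100) {
    alert("File name must be under 100 characters.");
}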
A quick clarification in the wording of your question. By the time your code executes the file upload has already happened. The file resides in a temporary directory on the ColdFusion server and the form field related to the file upload contains the temporary filename for that file. Aside from checking to see if a file has been specified, do not do anything directly with that file or you'll be circumventing some built in security.
You want to use the cffile tag with the upload action (or equivalent udf) to move the temp file into a folder of your choosing. At that point you get access to a structure containing lots of information. Usually I "upload" into a temporary directory for the application, which should be outside of the webroot for security.
At this point you'll then want to do any validation against the file, such as filename length, file type, file size, etc., and delete the file if it fails any checks. If it passes all checks then you move it into its final destination, which may be inside the webroot.
In your case you'll want to check the cffile structure element clientFile which is the original filename including extension (which you'll need to check, since an extension doesn't need to be present and can be any length).
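A rough sketch of that flow in tag syntax; the destination setting, field names, and the 100-character limit are assumptions lifted from the question, not tested code:

<!--- Move the file out of ColdFusion's temp location into an application upload directory
      (application.uploadDir is a hypothetical setting pointing outside the webroot). --->
<cffile action="upload"
        filefield="ResumeFile1"
        destination="#application.uploadDir#"
        nameconflict="makeunique"
        result="uploadResult">

<!--- Validate the original client file name (clientFile includes the extension). --->
<cfif len(uploadResult.clientFile) gt 100>
    <cffile action="delete"
            file="#uploadResult.serverDirectory#/#uploadResult.serverFile#">
    <cfset missing_info = missing_info & "<li>'Resume File 1' is invalid. Character length must be less than 100.</li>">
    <cfset validation_error = true>
    <cfset ResumeFileInvalidMarker = true>
</cfif>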