GridFS read PDF - pymongo

I am trying to build a financial dashboard with Flask and pymongo. The starting point is a Flask form which saves data in a MongoDB database. One of the fields in the form is a FileField (WTForms) that allows the upload of a PDF, which is then stored in MongoDB with GridFS.
I manage to save the PDF and can see the resulting entries in the .files and .chunks collections. Now I would like to build a function that retrieves the PDFs and analyses them with some basic NLP; however, I struggle with getting meaningful data.
When I do:
storage = gridfs.GridFS(db, collection)
data = storage.get('some id')
a = data.read()
The result is a binary file. If I continue with:
with open(data, 'rb') as f:
    b = f.read()
The result is "ValueError: embedded null byte", or sometimes an empty byte string.
Any help on this?

To follow up on the above, I found a solution for myself that consists of 2 separate functions:
(1) Upon submission of the form and before uploading the files to MongoDB, I apply a function based on pdfminer that extracts the string content of the PDF and transforms it into a list of sentences using NLTK. I then store this list in the .files document via storage.put(file, sent_list=sent_list), where sent_list is the variable holding the list of sentences.
Whenever I wish to run NLP operations on the file, I just read the "sent_list" field back from MongoDB.
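For reference, a minimal sketch of that flow, assuming pdfminer.six (which provides extract_text) and NLTK's punkt tokenizer are installed; the database name, collection prefix and function names here are illustrative:
import gridfs
from pymongo import MongoClient
from pdfminer.high_level import extract_text
from nltk.tokenize import sent_tokenize  # needs a one-time nltk.download('punkt')

client = MongoClient()
db = client['dashboard']              # hypothetical database name
storage = gridfs.GridFS(db, 'forms')  # hypothetical collection prefix

def store_pdf_with_sentences(path, filename):
    # extract the text before upload and store the sentence list
    # as extra metadata on the .files document
    sent_list = sent_tokenize(extract_text(path))
    with open(path, 'rb') as f:
        return storage.put(f, filename=filename, sent_list=sent_list)

def get_sentences(filename):
    # GridOut exposes custom metadata fields as attributes, so the stored
    # list can be read back without touching the PDF bytes at all
    return storage.get_last_version(filename).sent_list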
(2) If I wish to display the stored PDF in its original format, however, I included the following function as a separate route:
from flask import make_response
from gridfs import GridFS

@app.route('/pdf/<filename>')  # hypothetical route path
def show_pdf(filename):
    storage = GridFS(db, collection)
    data = storage.get_last_version(filename)
    response = make_response(data.read())
    extension = data.filename.split('.')[-1]
    response.headers['Content-Type'] = f'application/{extension}'
    response.headers['Content-Disposition'] = f'inline; filename={data.filename}'
    return response
(2) will open a new tab in my Flask app showing the PDF file in its original format.
I hope this helps anyone coming across a similar problem in the future.


Writing text data from PCollection to GCS with custom file name

In a Dataflow job written in Kotlin, using a Pub/Sub subscription as input, I receive a Proto object (Event) and map it to Strings.
My pipeline has type:
PCollection<KV<Event, String>>
These strings are the lines of a file that must be written to GCS.
The Event object has an "Id" that must be used to set the filename, and a "name" to set the folder.
Is it possible using FileIO?
pipeline.apply(
    FileIO.writeDynamic<String, String>()
        .to("gs://my-bucket")
        // withNaming?
)
My goal is to write the right lines to the right files, based on the information in the Event object.
File names can be customized by providing a FileNaming implementation to the withNaming() API.
However, this currently does not support mapping input elements directly to final file names. Input elements can be mapped to groups using the dynamic destinations API, and for each group you can provide a file-naming strategy.
To fully customize naming using input element values, you might need to implement a new sink transform.
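For illustration, a rough Kotlin sketch of that grouping approach; it assumes the Event proto exposes name and id accessors, writes each element's String payload via TextIO.sink(), and uses placeholder paths, so treat it as a starting point rather than a drop-in solution:
import org.apache.beam.sdk.coders.StringUtf8Coder
import org.apache.beam.sdk.io.FileIO
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.transforms.Contextful
import org.apache.beam.sdk.transforms.SerializableFunction
import org.apache.beam.sdk.values.KV
import org.apache.beam.sdk.values.PCollection

fun writePerEvent(lines: PCollection<KV<Event, String>>) {
    lines.apply(
        FileIO.writeDynamic<String, KV<Event, String>>()
            // the destination key encodes folder ("name") and file prefix ("Id")
            .by(SerializableFunction { kv -> "${kv.key.name}/${kv.key.id}" })
            .withDestinationCoder(StringUtf8Coder.of())
            // write only the String payload of each element
            .via(
                Contextful.fn(SerializableFunction<KV<Event, String>, String> { it.value }),
                TextIO.sink()
            )
            .to("gs://my-bucket")
            // one FileNaming per destination group; Beam still appends shard
            // and pane information, so element values cannot fully fix the name
            .withNaming(SerializableFunction { dest -> FileIO.Write.defaultNaming(dest, ".txt") })
    )
}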

How to get all the data from a DICOM file with Imebra

I am working on a project that integrates Imebra into an Android application. The application is supposed to extract all the data from a given DICOM file and put it into an .xml file. I need a little bit of help with it. For example, I don't know how to get all the VR tags that the given DICOM file has, instead of getting them one by one using tag ids.
Thank you for your help.
Load the file using CodecFactory.load(filename).
Then you can use DataSet.getTags() to retrieve a list of the tags stored in the DICOM structure.
The returned TagsIds class is a list containing all the TagIds: scan each tag ID and retrieve its value as a string via DataSet.getString() and its VR via DataSet.getDataType().
If DataSet.getString() fails, you are dealing with a sequence (an embedded DICOM structure), which can be retrieved with DataSet.getSequenceItem().
You can use the static method DicomDictionary.getTagName() to get a description of a particular tag.
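Putting those pieces together, a rough Java sketch; the method names follow the answer, but exact signatures vary between Imebra versions, so treat this as a starting point:
import com.imebra.CodecFactory;
import com.imebra.DataSet;
import com.imebra.DicomDictionary;
import com.imebra.TagId;
import com.imebra.TagsIds;

public static void dumpTags(String fileName) {
    DataSet dataSet = CodecFactory.load(fileName);
    TagsIds tags = dataSet.getTags();
    for (int i = 0; i < tags.size(); i++) {
        TagId tagId = tags.get(i);
        String name = DicomDictionary.getTagName(tagId); // human-readable description
        try {
            // the tag's first element as a string, plus its VR
            String value = dataSet.getString(tagId, 0);
            System.out.println(name + " [" + dataSet.getDataType(tagId) + "] = " + value);
        } catch (Exception e) {
            // getString() fails for sequences: descend into the embedded data set
            DataSet sequence = dataSet.getSequenceItem(tagId, 0);
            System.out.println(name + " = <sequence>");
        }
    }
}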

ColdFusion/CFWheels Merge Multiple PDFs in different Controllers

I'm using ColdFusion 10 and CFWheels for my site.
Basically my site has a bunch of different types of forms, each with its own controller and views. For each form, the user has the option to dynamically generate a PDF of the form and download it. It basically loads the controller data, but when it hits the view with a parameter of "pdf", it runs the following, which generates the PDF and opens the document in the browser:
<cfdocument format="PDF" saveAsName="#formtype#_#id#.pdf">
    #includePartial("/printView")#
</cfdocument>
Each of these PDFs can have multiple pages depending on how many line items are added. As I said at the beginning, there are multiple types of forms, so each has its own controller, views, and PDF generation with its print view. These forms are all customized and associated together by an ID, such as shipmentID. So I can have one shipment that contains 2 forms of type A, 1 form of type B, 3 of type C, etc. What I need to do is generate 1 PDF with all the forms merged together based on the shipment. Taking my example, the merged PDF for the shipment would contain the 2 forms of type A, the 1 form of type B, and the 3 of type C.
Currently what I'm doing is making an HTTP "GET" call to each of the dynamically generated PDF pages, saving each one to a temp directory, then merging them at the end.
I load the shipment, and for each different type of form I do the following, where urlPath is the path to the view that generates the dynamic PDF:
var httpService = new http();
httpService.setMethod("GET");
httpService.setUrl(urlPath);
invoice = httpService.send().getPrefix().filecontent.toByteArray();
var fullPath = "#filePath##arguments.type#_#id#.pdf";
//write files in temp directory
FileWrite(fullPath, invoice);
After I get the PDF and write it to a file, I save the path in an array for reference so I can loop through and merge all the referenced files in the array, then delete the temp directory where the files were saved.
The reason I'm doing it this way is that the controllers and views are already set up and generate the individual PDFs on the fly as it is.
If I try to load all associated forms and put everything in one file, I'll have to duplicate all the form-specific controller logic and the associated views, but these already exist for the individual page view.
Is there a better way to do this?
It works fine if there are only a few PDFs, but if there are a lot of different forms in the shipment, like 20, then it's very slow; and since we don't have CF Enterprise, I believe cfdocument is single-threaded. The forms have to be generated dynamically so they contain the most current data.
UPDATE for Chris
I've added some code to show what the various forms might look like. I validate and load a bunch of other things, but I stripped it down to convey the general idea:
controllers/Invoices.cfc
The path might be something like: /shipments/[shipmentkey]/invoices/[key]
public void function show(){
    // load shipment to display header details on form
    shipment = model("Shipment").findOne(where="id = #params.shipmentkey#");
    // load invoice details to display on form
    invoice = model("Invoice").findOne(where="id = #params.key#");
    // load associated invoice line items to display on form
    invoiceLines = model("InvoiceLine").findAll(where="invoiceId = #params.key#");
    // load associated containers to display on form
    containers = model("Container").findAll(where="invoiceid = #params.key#");
    // load associated scnumbers to display on form
    scnumbers = model("Scnumber").findAll(where="invoiceid = #params.key#");
}
controllers/Permits.cfc
The path might be something like: /shipments/[shipmentkey]/permits/[key]
public void function show(){
    // load shipment to display header details on form
    shipment = model("Shipment").findOne(where="id = #params.shipmentkey#");
    // load permit details to display on form
    permit = model("Permit").findOne(where="id = #params.key#");
    // load associated permit line items to display on form
    permitLines = model("PermitLine").findAll(where="permitId = #params.key#");
}
controllers/Nafta.cfc
The path might be something like: /shipments/[shipmentkey]/naftas/[key]
public void function show(){
    // load shipment to display header details on form
    shipment = model("Shipment").findOne(where="id = #params.shipmentkey#");
    // load NAFTA details to display on form
    nafta = model("NAFTA").findOne(where="id = #params.key#");
    // load associated NAFTA line items to display on form
    naftaLines = model("NaftaLine").findAll(where="naftaId = #params.key#");
}
Currently my view is based on a URL parameter called "view" where the values can be either "print" or "pdf".
print - displays the print view, which is pretty much a stripped-down version of the form without the webpage headers/footers etc.
pdf - calls the cfdocument code I pasted at the top of the question, which uses the printView to generate the PDF.
I don't think I need to post the "show.cfm" code as it would just be a bunch of divs and tables displaying the specific information for each particular form in question.
Keep in mind that these are only 3 example form types; there are 10+ types that may be associated with 1 shipment, and the PDFs would need to be merged. Each type may repeat several times within a shipment as well. For example, a shipment may contain 10 different invoices with 5 permits and 3 NAFTAs.
To make things slightly more complicated, a shipment can have 2 types, US Bound or Canada Bound, and based on this, different form types can be associated with the shipment. So an invoice for Canada will have totally different fields than an invoice for the US, and thus the models/tables are different.
Currently, to do the merging, I have a controller that does something like the following (note that I stripped a lot of validation and loading of other objects to simplify):
public any function displayAllShipmentPdf(shipmentId){
    // variable to hold the list of full paths of individual form PDFs
    formList = "";
    shipment = model("shipment").findOne(where="id = #arguments.shipmentId#");
    // path to temporarily store individual form PDFs for later merging
    filePath = "#getTempDirectory()##shipment.clientId#/";
    if(shipment.bound eq 'CA'){
        // load all invoices associated to the shipment
        invoices = model("Invoice").findAll(where="shipmentId = #shipment.id#");
        // go through all associated invoices
        for(invoice in invoices){
            httpService = new http();
            httpService.setMethod("get");
            // the following URL loads the invoice details in the Invoice controller and,
            // since I'm passing in "view=pdf", the view will display the PDF inline in the browser.
            httpService.setUrl("http://mysite/shipments/#shipment.id#/invoices/#invoice.id#?view=pdf");
            invoicePdf = httpService.send().getPrefix().fileContent.toByteArray();
            fullPath = "#filePath#invoice_#invoice.id#.pdf";
            // write the file so we can merge later
            FileWrite(fullPath, invoicePdf);
            // append the fullPath to the formList as reference for later merging
            formList = ListAppend(formList, fullPath);
        }
        // the above code would be similarly repeated for every other form type (ex. Permits,
        // NAFTA, etc.). So it would call the path with "view=pdf", which loads the specific form
        // controller and displays the PDF inline, which we capture, write to a temporary PDF
        // file, and add to the formList for later merging. You can see how this can be a long
        // process: there are several types of forms associated to a shipment, there can be
        // numerous forms of each type, and I don't want to repeat each form controller's
        // data-loading logic.
    } else if(shipment.bound eq 'US'){
        // does similar stuff to the CA except with different forms
    }
    // merge the PDFs in the formList
    pdfService = new pdf();
    // formList contains all the paths to the different form PDFs to be merged
    pdfService.setSource(formList);
    pdfService.merge(destination="#filePath#shipment_#shipment.id#.pdf");
    // read the merged PDF
    readPdfService = new pdf();
    mergedPdf = readPdfService.read(source="#filePath#shipment_#shipment.id#.pdf");
    // delete the temporarily created PDF files and directory
    DirectoryDelete(filePath, "true");
    // convert to binary to display inline in browser
    shipmentPdf = toBinary(mergedPdf);
    // set the response to display the merged PDF
    response = getPageContext().getFusionContext().getResponse();
    response.setContentType('application/pdf');
    response.setHeader("Content-Disposition","filename=shipment_#shipment.id#_#dateFormat(now(),'yyyymmdd')#T#timeFormat(now(),'hhmmss')#.pdf");
    response.getOutputStream().writeThrough(shipmentPdf);
}
See https://forums.adobe.com/thread/1121909 ... "...Standard Edition Adobe throttles the PDF functions to a single thread,...Developer runs like Enterprise". So your development environment will whip out the PDFs, but your CF Standard production server will be choking.
Also, it seems you are not having trouble with just one or two PDFs. I have CF Enterprise and it was generating PDFs just fine, in a few seconds, and then out of nowhere PDFs started taking 4 minutes. Another comment in the Adobe post referenced above suggested checking /etc/hosts to verify that CF is contacting itself. Some digging revealed that Windows\system32\drivers\etc\hosts had been updated a day before users discovered PDFs were timing out: the IP had been changed to some other intranet IP and the server name was the DNS server name. I changed the value back to 127.0.0.1 localhost and voila, PDFs started rendering in normal amounts of time.
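For reference, the restored entry is just the standard loopback mapping in the hosts file:
127.0.0.1    localhost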

Jmeter - read parameters from csv and write back updated parameters

I'm using JMeter for testing. I need to use some keys in order to perform a login, and then change the keys.
I understood that the best way to do this is to create a CSV file that contains two variables.
I understand how to read the parameters (using 'CSV Data Set Config'), but I still don't know how to extract specific parameters from the result (the new keys) and save them in the file in place of the old ones.
You can use a Regular Expression Extractor to extract the values from the response. This site will give you an idea of how it works.
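For illustration, if the login response contained something like name="key" value="abc123", a hypothetical extractor configuration could be:
Reference Name: newKey
Regular Expression: name="key" value="(.+?)"
Template: $1$
The captured value is then available to later samplers and scripts as ${newKey}.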
It is NOT a good idea to write to the same file that is read by the CSV Data Set Config. Instead, you can use a Beanshell post processor to create a separate CSV file and write to it as you want:
import org.apache.jmeter.services.FileServer;

// open the file in append mode (the "true" argument) so each iteration adds a line
f = new FileOutputStream("/your/file/path/filename.csv", true);
p = new PrintStream(f);
// write whatever you need, e.g. values extracted into JMeter variables via vars.get("newKey")
p.println("content,to be,written,in,csv,file");
p.close();
f.close();

docx4j Differencer Showing More Differences Than Expected

I have two documents:
Document 1 (input)
Document 2 (output)
Document 2 is the result of passing Document 1 through a transformation process which leaves all content and formatting intact (verified by side-by-side comparison in Word).
However, the process removes many id numbers from the .docx files.
For example,
<w:p w:rsidP="00B600D6" w:rsidR="00F55D78" w:rsidRDefault="00B600D6">
becomes
<w:p>
according to a dump of each document via the following code:
Body body = ((Document)newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
Node node = org.docx4j.XmlUtils.marshaltoW3CDomDocument(body).getDocumentElement();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(new DOMSource(node),
new StreamResult(new OutputStreamWriter(System.out, "UTF-8")));
Using the docx4j Differencer comparison method recommended here, everything (except the first line, which has no formatting applied) is shown as a modification.
Question is: are the diffs a result of the missing ids, the formatting, or something else?
In case it's important, we're using docx4j in this context to perform automated sanity/regression tests on our round-tripping process (i.e. apply the "loss-less" process and expect no differences).
Disclosure: I work on docx4j
If the only difference between paragraphs is the rsid attributes, they will still be detected as different.
You could "clean" the documents before performing the comparison, so that neither docx has rsid attributes. See the Filter sample.
By the way, an easier way to see the XML for an object (e.g. a single paragraph, or the entire body) is to use XmlUtils.marshaltoString.
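For example, reusing the body object from the dump code above (the boolean arguments suppress the XML declaration and turn on pretty printing):
System.out.println(XmlUtils.marshaltoString(body, true, true));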