Using Python to analyze PDF files

I wrote code that detects malicious PDF files.
What I need to do is the following:
For every PDF I scan, I want to save its hash value in a hash database and save the scan output in an output container.
When I scan another PDF, I will check whether its hash value already exists in the hash database; if it does, I will print the stored output from the output container.
If the hash value does not exist, it is added to the hash database and the new output is added to the output container.
How can I do that, and how do I link each hash value to its output in the output container?

What kind of malicious documents are you worried about? Corrupted files, or PDFs with a virus in them?
To work with PDFs in Python you can use
pyPdf
Then you can open the file like this:
from pyPdf import PdfFileReader
my_doc = PdfFileReader(open("myfile.pdf", "rb"))
This way you can check whether it is a valid PDF file.
As for the link, could it be made in the database itself, for example by storing each hash and its output in the same record?
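One way to link each hash to its output is to keep both in a single table keyed by the hash. The following is a minimal sketch, not the asker's actual code: it assumes SHA-256 as the hash, SQLite as the store, and a hypothetical scan_pdf callable standing in for the existing detector. The table name scans and the helper names are illustrative only.

```python
import hashlib
import sqlite3


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_cache(db_path=":memory:"):
    """Open (or create) the store; one row links a hash to its scan output."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans ("
        "hash TEXT PRIMARY KEY, output TEXT NOT NULL)"
    )
    return conn


def scan_with_cache(conn, path, scan_pdf):
    """Return (output, was_cached) for the file at path.

    scan_pdf is a hypothetical stand-in for the existing detector: it takes
    a path and returns the scan output as a string.
    """
    file_hash = file_sha256(path)
    row = conn.execute(
        "SELECT output FROM scans WHERE hash = ?", (file_hash,)
    ).fetchone()
    if row is not None:
        return row[0], True  # seen before: reuse the stored output
    output = scan_pdf(path)  # new file: scan it and remember the result
    conn.execute(
        "INSERT INTO scans (hash, output) VALUES (?, ?)", (file_hash, output)
    )
    conn.commit()
    return output, False
```

Passing a file name instead of ":memory:" to open_cache would keep the results between runs. Because the hash is the primary key, the "hash database" and the "output container" are the same table, so no separate link is needed.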

Related

How to add disk in zip using DotNetZip

I use DotNetZip for creating zips. It has many options, but I couldn't find whether it is possible to store the drive where the file is located in the archive, e.g. like the Absolute mode in 7-Zip. As far as I can see, I can only do this:
zip.AddFile(cFileFull, cPath);
When cFileFull is e.g. "c:\temp\SomeFile.txt" and cPath is "c:\temp", opening the zip file shows
temp
while I would like to see
C
and then, when I click on C
temp
This would allow storing the same path/file found on different drives. Is this possible?

how to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path, not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df=sc.wholeTextFiles("path/*")
but I can't even display the data, and my main goal is to perform queries in different ways on the data.

Instead of wholeTextFiles (which gives key-value pairs with the filename as the key and the file content as the value),
try read.json and give it your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From the docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.

Delete Files Which MD5's listed In Text File - VB.net

I have a list of MD5 hashes of files stored in a text file, and I want to delete every file whose hash is on that list when it is found on the system or under a given path. I'm having trouble coding it: my attempt only scans one file from the listed MD5s, which is not what I need. Is there any way to find and delete the files under a path whose MD5 hashes are listed? Thanks.

pidgin pseudocode:
put md5s in array
cycle through a filesystem
    for each file, put into variable, compute md5hash of variable
    if md5hash is in array, delete file
Maybe you should skip swap files and system folders.
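The question asks about VB.net, but the logic of the pseudocode above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list file holds one MD5 hex digest per line, and the function names and the dry_run flag are made up for this example. Unreadable files (locked, swap, or system files) are simply skipped.

```python
import hashlib
import os


def load_hashes(list_path):
    """Read one MD5 hex digest per line into a set (lowercased, blanks skipped)."""
    with open(list_path, "r", encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_listed_files(root, hashes, dry_run=True):
    """Walk root and delete files whose MD5 is in hashes.

    Returns the matching paths. With dry_run=True (the default) nothing is
    deleted, so the matches can be reviewed first.
    """
    matched = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                if md5_of_file(path) in hashes:
                    matched.append(path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                continue  # unreadable or locked file: skip it
    return matched
```

Using a set for the hashes keeps each lookup cheap even for long lists, and checking every file against the whole set avoids the "only one file is scanned" problem described in the question.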

CFSCRIPT - How to check the length of a filename before uploading

I ran into this problem when uploading a file with a super long name - my database field was only set to 50 characters. Since then, I have increased my database field length, but I'd like to have a way to check the length of the filename before uploading. Below is my code. The validation returns '85' as the character length. And it returns the same count for every different file I upload (none of which have a file name length of 85).
<cfscript>
    missing_info = "<p>There was a slight problem with your submission. The following are required or invalid:</p><ul>";
    // Check the length of the file name for our database field
    if ( len(Form["ResumeFile1"]) gt 100 )
    {
        missing_info = missing_info & "<li>'Resume File 1' is invalid. Character length must be less than 100. Current count is " & len(Form["ResumeFile1"]) & ".</li>";
        validation_error = true;
        ResumeFileInvalidMarker = true;
    }
</cfscript>
Anyone see anything wrong with this?
Thanks!

http://www.cfquickdocs.com/cf9/#cffile.upload
After you upload the file, the variable "clientFileName" will give you the name of the uploaded file, without a file extension.
The only way to read the filename before you upload it would be to use JavaScript to read and parse the value (file path) in the file field.

A quick clarification on the wording of your question: by the time your code executes, the file upload has already happened. The file resides in a temporary directory on the ColdFusion server, and the form field related to the file upload contains the temporary filename for that file. Aside from checking whether a file has been specified, do not do anything directly with that file or you'll be circumventing some built-in security.
You want to use the cffile tag with the upload action (or equivalent udf) to move the temp file into a folder of your choosing. At that point you get access to a structure containing lots of information. Usually I "upload" into a temporary directory for the application, which should be outside of the webroot for security.
At this point you'll want to do any validation against the file, such as filename length, file type, file size, etc., and delete the file if it fails any checks. If it passes all checks, you then move it to its final destination, which may be inside the webroot.
In your case you'll want to check the cffile structure element clientFile which is the original filename including extension (which you'll need to check, since an extension doesn't need to be present and can be any length).

How to do mail merge on top of a PDF?

I often get a PDF from our designer (built in Adobe InDesign) which is supposed to be sent out to thousands of people.
I've got the list with all the people, and it's easy doing a mail merge in OpenOffice.org. However, OpenOffice.org doesn't support the advanced PDF. I just want to output some text onto each page and print it out.
Here's how I do it now: print out 6,000 copies of the PDF, then put all of them into the printer again and just print the name, address and other information on top. But that's expensive.
Sadly, I can't convert the PDF to an image and use that in OpenOffice.org because it grinds the computer to a halt. It also takes an extremely long time to send this job to the printer.
So, is there an easy way to do this mail merge (preferably in Python) without paying for third party closed solutions?

Now I've made an account. I fixed it by using the ingenious pdftk.
In my quest I had totally overlooked the "background" and "overlay" features. My solution was this:
pdftk names.pdf background boat_background.pdf output out.pdf
You can easily create names.pdf with Python's reportlab or a similar PDF-creation script. It's best to do that in code: creating 6k pages took several hours in LibreOffice/OpenOffice, while it took just a few seconds using Python.

You could probably look at a PDF library like iText. If you have some programming knowledge and a bit of time, you could write some code that adds the contact information to the PDFs.

There are two much simpler and cheaper solutions.
First, you can do your mail merge directly in InDesign using DataMerge. This is a utility added to InDesign way back in CS. You export or save your names in CSV format. Import the data into an InDesign template and then drop in your name, address and such fields in the layout. Press Go. It will create a new document with all the finished letters or you can go right to the printer.
OR, you can export your data to an XML file and create a dynamic layout using XML placeholders in InDesign.
The book A Designer's Guide to Adobe InDesign and XML will teach you how to do this, or you can check out the Lynda.com videos for Dynamic workflows with InDesign and XML.
Very easy to do.
If you want to create separate PDF files for the mail merge, you can output one long PDF with all the names in one file and then do an Extract to Separate PDF files in Acrobat Pro itself.

If you cannot get the template in a format other than PDF, a simple ad-hoc solution would be to:
convert the PDF into an image
put the image in the background of your (OpenOffice.org) document
position mail merge fields on top of the image
do the mail merge and print

Probably the best way would be to generate another PDF with the missing text, and overlay one PDF over the other. A quick Google found this link showing how to do it in Acrobat, and I'm sure there are other methods as well.
http://forums.macrumors.com/showthread.php?t=508226

For a no-mess, no-fuss solution, use iText to simply add the text to the pdf. For example, you can do the following to add text to a pdf document once loaded:
PdfContentByte cb= ...;
cb.BeginText();
cb.SetFontAndSize(font, fontSize);
float x = ...;
float y = ...;
cb.SetTextMatrix(x, y);
cb.ShowText(fieldValue);
cb.EndText();
From there on, save it as a different file, and print it.
However, I've found that form fields are the way to go with pdf document generation from templates.
If you have a template with form fields (added with Adobe Acrobat), you have one of two choices:
Create an FDF file, which is essentially a list of values for the fields on the form. An FDF is a simple text document which references the original document, so that when you open the PDF, the document loads with the field values supplied by the FDF.
Alternatively, load the template with a library like iText / iTextSharp, fill the form fields manually, and save it as a separate PDF.
A sample FDF file looks like this (stolen from Planet PDF):
%FDF-1.2
%âãÏÓ
1 0 obj
<<
/FDF
<<
/F(Example PDF Form.pdf)
/Fields[
<<
/T(myTextField)
/V(myTextField default value)
>>
]
>>
>>
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF
Because of the simple format and the small size of the FDF, this is the preferred approach, and the approach should work well in any language.
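Since an FDF file is plain text, it can be produced with a few lines of code in most languages. The following Python sketch writes the standard FDF layout (an /FDF dictionary with /F and /Fields, plus a trailer pointing at the root object). The function names are made up for this example, and it assumes ASCII field names and values; other characters would need extra encoding handling that is not shown here.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def build_fdf(pdf_name, fields):
    """Return FDF text that supplies fields (name -> value) for pdf_name."""
    lines = [
        "%FDF-1.2",
        "1 0 obj",
        "<<",
        "/FDF",
        "<<",
        "/F(" + escape_pdf_string(pdf_name) + ")",
        "/Fields[",
    ]
    for name, value in fields.items():
        # One small dictionary per form field: /T is the name, /V the value
        lines += [
            "<<",
            "/T(" + escape_pdf_string(name) + ")",
            "/V(" + escape_pdf_string(value) + ")",
            ">>",
        ]
    lines += ["]", ">>", ">>", "endobj", "trailer", "<<", "/Root 1 0 R", ">>", "%%EOF"]
    return "\n".join(lines) + "\n"
```

For example, build_fdf("Example PDF Form.pdf", {"myTextField": "Jane Doe"}) returns text that can be saved with an .fdf extension next to the form. Escaping the parentheses matters because an unescaped, unbalanced ")" inside a value would end the PDF string early.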
As for filling the fields programmatically, you can use iText in the following way:
PdfAcroForm acroForm = writer.AcroForm;
acroForm.Put(new PdfName(fieldInfo.Name), new PdfString(fieldInfo.Value));

What about using a variable data program such as XMPie for Adobe InDesign? It's a plug-in that should be able to reference your list of people (I think it might have to be a list in Excel, though).

One easy way would be to create a fillable pdf form from the original document in Acrobat and do a mail merge with the form and a csv.
PDF mail merges are relatively easy to do with Python and pdftk. fdfgen (pip install fdfgen) is a Python library that will create an FDF from a Python list, so you can save the Excel grid to a CSV, make sure that each CSV header matches the name of the PDF form field you want to fill with that column, and do something like
import csv
import os
import subprocess
from fdfgen import forge_fdf
PDF_FORM = 'path/to/form.pdf'
CSV_DATA = 'path/to/data.csv'
# Read every CSV row as a dict keyed by the header names
with open(CSV_DATA, newline='') as infile:
    rows = list(csv.DictReader(infile))
for row in rows:
    # Create fdf
    filename = row['filename']  # Construct filename
    fdf_data = [(k, v) for k, v in row.items()]
    fdf = forge_fdf(fdf_data_strings=fdf_data)
    with open(filename + '.fdf', 'wb') as fdf_file:
        fdf_file.write(fdf)
    # Use PDFTK to create filled, flattened, pdf file
    cmds = ['pdftk', PDF_FORM, 'fill_form', filename + '.fdf',
            'output', filename + '.pdf', 'flatten', 'dont_ask']
    process = subprocess.Popen(cmds, stdout=subprocess.PIPE)
    stdout, stderr = process.communicate()
    returncode = process.poll()
    # Remove the temporary fdf once the filled pdf has been written
    os.remove(filename + '.fdf')
I've encountered this problem often enough to write my own free solution, PdfZero. PdfZero has a mail merge feature to merge spreadsheets with PDF forms. You will still need to create a PDF form, but you can upload the form and CSV to PdfZero, select which form fields you want filled with which columns, create a naming convention for each filled PDF using the CSV data if needed, and batch generate the filled PDFs.
DISCLAIMER: I wrote PdfZero

Someone asked for specifics. I didn't want to sully my top answer with it, because you can do it how you like (and just knowing pdftk is up to it should give people the idea).
But here are some scripts I used ages ago:
csv_to_pdf.py
#!/usr/bin/env python3
# This makes one PDF page per name in the CSV file
# csv_to_pdf.py <CSV_FILE>
import csv
import sys
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib.units import cm, mm
in_db = csv.reader(open(sys.argv[1], newline=""))
outname = sys.argv[1].replace("csv", "pdf")
pdf = Canvas(outname)
next(in_db)  # skip the header row
i = 0
for rad in in_db:
    pdf.setFontSize(11)
    adr = rad[1]  # the address is in the second column
    tekst = pdf.beginText(2*cm, 26*cm)
    for a in adr.split('\n'):
        if not a.strip():
            continue  # skip empty address lines
        if a[-1] == ',':
            a = a[:-1]  # drop a trailing comma
        tekst.textLine(a)
    pdf.drawText(tekst)
    pdf.showPage()  # one page per CSV row
    i += 1
    if i % 1000 == 0:
        print(i)  # progress report every 1000 pages
pdf.save()
Once you've run this, you have a file with thousands of pages, each with only a name and address on it. This is when you can put the fancy PDF in the background under all of them:
pdftk <YOUR_NEW_PDF_FILE.pdf> background <DESIGNED_FILE.pdf> output <MERGED.pdf>

You can use InDesign's data merge function, or you can do what you've been doing with printing a portion of the job, and then printing the mail merge atop that with Word or Open Office.
But also look into finding a company that can do variable data offset printing or dynamic publishing. Might be a little more expensive up front but can save a bundle when it comes to time, testing, even packaging and mailing.

Disclaimer: I'm the author of this tool.
I ran into this issue enough times that I built a free online tool for it: https://pdfbatchfill.com/
It assumes a PDF form as a template and uses that along with CSV form data to generate a single PDF or individual PDFs in a zip file.
