Analyze PDF files to detect malicious ones - pdf

I wrote Python code that detects malicious PDFs.
For every file I analyze, I calculate its hash value and save it in a hash database, and I also save the analysis output in a text file.
When I want to scan another file, I calculate its hash value and search for it in the hash database; if it is found, I print the output from the text file that already exists.
If the hash value does not exist, it is saved and the output is saved in a text file.
How can I link each hash value to the text file that contains its output?

As Kyle mentioned, you can use a hash table. In Python, hash tables are called dictionaries. For more on that, look here: http://www.tutorialspoint.com/python/python_dictionary.htm
As far as your question is concerned, you have a variety of options. You will have to save your 'database' at some point and you could save it in many different formats. You could save it as a JSON file (a very popular style). It could be an XML file (very popular as well). You could even save it as a CSV (not nearly as popular, but it gets the job done). For the sake of this, let's say you save this 'database' in a text file which looks like this:
5a4730fc12097346fdf5589166c373d3{C:\PdfsOutput\FileName.txt}662ad9b45e0f30333a433566cee8988d{C:\PdfsOutput\SomeOtherFile.txt}
Essentially you're formatting it as HashValue{PathToFileOnDisk}... You could then parse this with a regex such as [0-9a-f]{32}\{[^\}]+ On startup, scan your database with this regex, iterate over all the matches, split each match at '{', and put ValueSplit[0] into a dictionary as the key, with the path to the text file (ValueSplit[1]) as the value for that key.
So, inside the loop that iterates over the regex matches, write something like:
ValueSplit = RegexMatch.split('{')
HashAndFileDict[ValueSplit[0]] = ValueSplit[1]
This code assumes the regex match in the loop is a string simply called 'RegexMatch'. It also assumes that your dictionary you're storing hash values and paths in is called 'HashAndFileDict'
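Put together, the loading step might look like the sketch below. It assumes the whole database has already been read into one string; the names ENTRY_PATTERN, load_hash_database and sample are made up for illustration and are not part of the original code.

```python
import re

# Pattern from the answer: 32 hex characters, '{', then everything up to '}'.
ENTRY_PATTERN = re.compile(r"[0-9a-f]{32}\{[^\}]+")

def load_hash_database(text):
    """Build a {hash: output_path} dictionary from the database text."""
    hash_and_file = {}
    for match in ENTRY_PATTERN.findall(text):
        # Split only at the first '{' so the path is kept whole.
        hash_value, path = match.split("{", 1)
        hash_and_file[hash_value] = path
    return hash_and_file

# Sample database contents in the HashValue{PathToFileOnDisk} format.
sample = (
    r"5a4730fc12097346fdf5589166c373d3{C:\PdfsOutput\FileName.txt}"
    r"662ad9b45e0f30333a433566cee8988d{C:\PdfsOutput\SomeOtherFile.txt}"
)
database = load_hash_database(sample)
print(database["5a4730fc12097346fdf5589166c373d3"])
```

In a real program the sample string would be replaced by the contents of the database file read at startup.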
Later in your code you can check a PDF hash value in question by saying:
if PDFHashValue not in HashAndFileDict:
    # Not seen before: analyze the file and remember where its output went.
    TextFilePath = savePDFOutputText(ThePDFFile)
    HashAndFileDict[PDFHashValue] = TextFilePath
else:
    print("File already processed. Text is at: " + HashAndFileDict[PDFHashValue])
If I may, it might be wise to use 2 hashing algorithms and combine their hexadecimal digests into 1 string in order to prevent a collision when processing many PDF files.
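One way to build such a combined key is sketched here. The choice of MD5 plus SHA-256 and the function name combined_digest are assumptions for illustration; any two algorithms from hashlib would follow the same pattern.

```python
import hashlib

def combined_digest(path, chunk_size=65536):
    """Return the MD5 and SHA-256 hex digests of a file, joined into one string."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large PDFs do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest() + sha256.hexdigest()
```

The combined key is 96 hex characters long (32 + 64), so the {32} in the regex above would have to become {96} if this key is stored in the same database format.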

Full Text Search for extracting a snippet of the text (returning the intended text and its surroundings)

I'm using a SQL FileTable and, for instance, I have a saved text file named "SOS.txt" which contains the following text:
For god's sake, save us right now please. We can't survive.
Now or never!
Now I want to find all files that contain the word save, so I execute the following query:
SELECT * FROM FileTableExample
WHERE CONTAINS(file_stream, 'save')
and here's the result:
stream file => 0x616C692053617665207573207269676874206E6F772E0D0A4E6F77206F72206E6576657221
As you can see, I got the correct result: the third column of the result indicates the file named SOS.txt. I have the stream_id and file_stream, but what I'm trying to find is a way to show the intended text together with its surroundings in a human-readable format.
Something like this:
Name    | Excerpt
--------+------------------
SOS.txt | ..sake, save us..
Is there any way?
Update:
After searching on the net I found this article, which is useful, but it doesn't mention full-text search on a FileTable.
Based on this article, I converted the file stream to a string:
SELECT CONVERT(varchar(MAX), file_stream) AS Excerpt, *
from FileTableExample
where contains(file_stream, 'save')
It works if the file is plain text like SOS.txt, but if it's a .docx or .pptx file, you are not going to get a useful conversion.
Use this, CAST(file_Stream as varchar(max))
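Once the converted text has been fetched by a client program, the excerpt itself can be cut out there. The sketch below is one possible client-side approach in Python, not a feature of the full-text engine; the function name excerpt and the one-word context window are assumptions.

```python
def excerpt(text, term, context_words=1):
    """Return the first word containing term plus its neighbouring words."""
    words = text.split()
    for i, word in enumerate(words):
        if term.lower() in word.lower():
            start = max(0, i - context_words)
            end = min(len(words), i + context_words + 1)
            # Mark truncation on either side with two dots.
            prefix = ".." if start > 0 else ""
            suffix = ".." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None

sos_text = "For god's sake, save us right now please. We can't survive.\nNow or never!"
print(excerpt(sos_text, "save"))  # prints "..sake, save us.."
```

As with the CONVERT approach, this only helps for files whose stream is plain text.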

How to write results into an NSArray and save it as a CSV file using Objective-C

I'm trying to store my results in an NSArray and save it as a CSV file using Objective-C, but I can't find any relevant solution. Please see the sample code below:
int a=5,b=10;
int c=b-a;
double d=4.5,e=3.0;
double h=d-e;
NSLog(@"host_port:%f", c);
NSLog(@"host_size:%d", h;
I would like to store my values c and h in an array and write that to a CSV file. Any advice on this would be helpful.
Thanks in advance.
When you ask a question on SO you need to show effort: code you've tried, details of what you've read. If you don't, you'll get downvotes and close votes (you have one of each as I write this). The code you have included has nothing to do with CSV or arrays, and is not even valid code as pasted (the formats are wrong).
That said, let's see if we can give you something to get you going.
A CSV file is just plain text. You don't need any packages to write one; standard I/O routines will do the job. You also don't need to store all the values in an array and then output the array, or build up a string version of the whole CSV file and output that. You can output items as they are generated if you wish, and it may be more efficient to do so. Your code fragment has only two values; maybe you intend this to be the core of a loop. Given those values, we assume you want this CSV file:
host_port,host_size
5,1.5
Your values have basic types, int and double; they are not Objective-C object types. Given this, you can use the standard C I/O operations to produce your file.
First you may need to obtain the destination file name from the user; assuming this is a GUI app, look up NSSavePanel for this. That will give you an NSURL from which you can obtain the file path as an NSString, and you can convert that into a C string using NSString methods.
Now you can enter the C I/O world. To find the documentation on the following functions, open the Terminal and use the man command, e.g. man fopen.
To create the file and open it for writing, use fopen(), passing it the C string pathname you obtained above.
To write the headers and each row of data use fprintf(). This takes a format string just like NSLog(), but you must remember to explicitly include the line breaks by using \n in the format.
When you've finished close the file with fclose().
Now go read the documentation and write your CSV file!
HTH
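The open, write, close sequence described above has the same shape in any language. Here it is sketched in Python rather than C, purely as an illustration; the function name write_csv and the temporary output path are assumptions, and the values are the ones from the question.

```python
import os
import tempfile

def write_csv(path, rows):
    """Write a header line and one comma-separated line per row using plain file I/O."""
    with open(path, "w", newline="") as out:
        out.write("host_port,host_size\n")
        for host_port, host_size in rows:
            # Each row needs its own explicit line break.
            out.write(f"{host_port},{host_size}\n")

a, b = 5, 10
d, e = 4.5, 3.0
csv_path = os.path.join(tempfile.gettempdir(), "results.csv")
write_csv(csv_path, [(b - a, d - e)])
```

Plain writes like this are enough for numbers. Values that can contain commas or quotes need quoting, which a dedicated CSV writer handles.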

How to PREPEND text to a file in Swift or Objective C?

Please note that I'm not asking how to append text at the end of the file. I'm asking how to prepend text to the beginning of the file.
let handle = try FileHandle(forWritingTo: someFile)
//handle.seekToEndOfFile() // This is for appending
handle.seek(toFileOffset: 0) // Me trying to seek to the beginning of file
handle.write(content)
handle.closeFile()
It seems like my content is being written at the beginning of the file, but it also overwrites the existing content... Thanks!
One reasonable solution is to write the new content to a temporary file, then append the existing contents to the end of the temporary file. Then move the temporary file over the old file.
When you seek to a point in an existing file and then perform a write, the existing contents are overwritten from that point. This is why your current approach fails.
In general, most file systems don't have built-in support for prepending data to files. Likewise, most file I/O APIs don't either.
In order to prepend data, you first have to shift all of the existing data further along the file to make room for the new data at the beginning. You typically do this by starting near the end, reading a chunk of data, writing that data to the original position plus the length of data you hope to eventually prepend, and then repeating with the next chunk closer to the beginning of the file. In this way, you gradually shift everything down. Only after you've done all of that can you safely write the new data at the beginning of the file.
Frankly, if there's any way to avoid this, you should try to. The performance is likely to be terrible if the file is large and/or you're doing it frequently.
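The temporary-file approach from the first paragraph can be sketched as follows. This is shown in Python for illustration only; the function name prepend_to_file is an assumption, and the same steps map onto FileHandle and FileManager calls in Swift.

```python
import os
import shutil
import tempfile

def prepend_to_file(path, content):
    """Prepend bytes to a file by building a new file and swapping it in."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final move
    # stays on one file system.
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as temp, open(path, "rb") as original:
            temp.write(content)                 # new data first
            shutil.copyfileobj(original, temp)  # then the existing contents
        os.replace(temp_path, path)             # move the temp file over the old file
    except BaseException:
        os.remove(temp_path)
        raise
```

The replacement file gets the temporary file's permissions, so a caller that cares about the original mode or ownership would need to copy those over before the swap.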

Fortran: How to skip many lines of data file efficiently

I have a formatted data file which is typically billions of lines long, with several lines of headers of variable length. The data file takes the form:
# header 1
# header 2
# headers are of variable length.
# data begins from next line.
1.23 4.56 7.89 0.12
2.34 5.67 8.90 1.23
:
:
# billions of lines of data, each row the same length, same format.
-- end of file --
I would like to extract a portion of data from this file, and my current code looks like:
do j=1,jmax !Suppose I want to extract jmax lines of data from the file.
   [algorithm to determine number of lines to skip, "N(j)"]
   !This determines the number of lines to skip from the previous file
   !position, when the data was read on the (j-1)th iteration.
   !Skip N(j)-1 lines to go to the next data line to read off:
   do i=1,N(j)-1
      read(unit=unit,fmt='(A)')
   end do
   !Now read off the line of data I want:
   read(unit=unit,fmt='(data_format)') data1,data2,etc.
   !Data is stored in some arrays.
end do
The problem is, N(j) can be anywhere between 1 and several billion, so it takes some time to run the code.
My question is, is there a more efficient way of skipping millions of lines of data? The only way I can think of, while sticking to Fortran, is to open the file with direct access and jump to the desired line upon opening the file.
As you suggest, direct access seems like the best option. But that requires the records to all have the same length, which your headers violate. Also, why use formatted output? With a file of this length, it's hard to imagine a person reading the file. If you use unformatted I/O, the file will be smaller and I/O will be faster. Perhaps create two files, one with the headers (metadata) in human-readable form, and the other with the data in native form. Native / binary representation means a loss of portability, which is something to consider if you want to move the files to different computer architectures or have them be usable for decades. Otherwise it's probably worth the convenience. Another option would be to use a more sophisticated file format that combines metadata and data, such as HDF5 or FITS, but for communication between two programs of one person, that's probably excessive.
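The idea behind direct access is that with fixed-length records the byte offset of any line can be computed instead of read past. The sketch below shows that arithmetic in Python for illustration only. It assumes every data line occupies exactly the same number of bytes, including the line break, and it measures the variable-length header once so it can be skipped. The function names are made up.

```python
import io

def header_size(handle):
    """Count the bytes taken up by the leading '#' lines."""
    handle.seek(0)
    size = 0
    for line in iter(handle.readline, b""):
        if not line.startswith(b"#"):
            break
        size += len(line)
    return size

def read_record(handle, line_number, header_bytes, record_length):
    """Return data line line_number (1-based) without reading the lines before it."""
    handle.seek(header_bytes + (line_number - 1) * record_length)
    return handle.read(record_length)

# A tiny in-memory stand-in for the data file (opened in binary mode).
demo = io.BytesIO(
    b"# header 1\n"
    b"# header 2\n"
    b"1.23 4.56 7.89 0.12\n"
    b"2.34 5.67 8.90 1.23\n"
)
offset = header_size(demo)
second = read_record(demo, 2, offset, record_length=20)
print(second.decode().split())
```

Each jump costs one seek regardless of how many lines are skipped, which is the property the sequential read loop lacks.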

SAS : read in PDF file

I am looking for ways to read in a PDF file with SAS. Apparently this is not basic functionality, and there is very little to be found on the internet. (It does not help that searching Google with "PDF" in the query also returns links to PDF documents about other things.)
The only things that can be found are people looking for ways to import data into datasets from a PDF. For me, that is not even necessary. I would like to be able to read the contents of the PDF file into one big character variable. If possible, it would be even better to be able to read in the file's binary data.
Is this possible with SAS and how? (I got it to work in Access VBA, but can't find any similar ways in SAS.)
(In the end, the purpose is to convert this to base64 and put that base64-string into an XML document.)
You probably will not be able to read the entire file into one character variable, since the maximum length of a character variable is 32,767 bytes. A simple way to read in one line at a time, though, is something like the following:
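%let pdfFileName = Test.pdf;
%let lineSize = 2000;
data base;
  format text_line $&lineSize..;
  /* TRUNCOVER keeps short lines from pulling in the next record. */
  infile "&pdfFileName" lrecl=&lineSize truncover;
  /* $CHARw. reads the whole line, including blanks, not just the first word. */
  input text_line $char&lineSize..;
run;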
This requires that you have a general idea of the maximum record length ahead of time, but you could write additional code to determine the maximum record size prior to reading in the file. In this example each line of text is read into one character variable named "text_line." From there, you could use a RETAIN statement or a double trailing at sign (@@) in the INPUT statement to process multiple lines at a time. The SAS website has plenty of documentation on how to read and process text from various types of input files.