I am using the MD5 and SHA256 algorithms to calculate hashes. I know the procedure for calculating a hash, but I do not know which parameters (such as the file's contents or its size) are taken into account when hashing a file. I searched on Google but did not find an answer. Also, how can I optimize hashing a file larger than 10 GB?
Hashing has no parameters: the algorithm takes an input and generates a fixed-size output.
You can perform an incremental hash instead of loading the complete file: read it in chunks and add each chunk to the calculation. For example (pseudocode):
SHA256.init()
SHA256.update(chunk 1)
SHA256.update(chunk 2)
...
SHA256.update(chunk n)
SHA256.digest()
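In Python, for instance, the hashlib module supports exactly this pattern. A minimal sketch (the 1 MiB chunk size is an arbitrary choice, not something prescribed by the algorithm):

import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    # Read the file in fixed-size chunks so memory use stays constant,
    # no matter how large the file is.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

Because only one chunk is in memory at a time, a file larger than 10 GB is handled the same way as a small one; the run time is dominated by reading the data from disk.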
I have the task of calculating the hash from multiple files.
I also already know the hash from each individual file.
There are two approaches:
hash(f1 + f2 + f3)
hash(hash(f1) + hash(f2) + hash(f3))
In the second approach, there will be less computation since I know the hash of each file individually.
Is the security level of these two approaches different?
Which of these approaches is more secure?
I am not strong in cryptography, so I cannot objectively assess the security level of each approach.
TL;DR: use hash(hash(f1) + hash(f2) + hash(f3))
Note: in this answer, + means concatenation. It is never any kind of numerical addition. If you have numerical data, apply my answer after converting the data to byte strings.
There is a problem with hash(f1 + f2 + f3): you can (for example) move some data from the end of f1 to the beginning of f2, and that won't change the hash. Whether this is a problem depends on what constraints there are, if any, on the file formats and on how the files are used.
It's usually hard to make sure in a system design that this isn't a problem. So whenever you combine strings or files for hashing, you should always make sure the combination is unambiguous. There are a few different ways to do it, such as:
Use some existing format that handles the packing of the strings or files for you. For example zip, ASN.1 DER, etc.
Encode each part in a way that doesn't contain a certain byte, and use that byte as a separator. For example encode each part in Base64 and use line breaks as separators.
Define a maximum length for each part. Before each part, encode the length using a fixed-width encoding. For example, if the maximum length of a part is 2^64-1 bytes, encode the unambiguous concatenation of (f1, f2, f3) as follows (a short Python sketch of this option appears after the layout):
8 bytes: length(f1)
length(f1) bytes: f1
8 bytes: length(f2)
length(f2) bytes: f2
8 bytes: length(f3)
length(f3) bytes: f3
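As an illustration of the length-prefix option, here is a minimal Python sketch (the function name and the choice of SHA-256 are mine, purely for the example):

import hashlib
import struct

def hash_unambiguous(parts):
    # Prefix each part with its 8-byte big-endian length, so the
    # combined input can only be parsed back one way.
    h = hashlib.sha256()
    for part in parts:
        h.update(struct.pack(">Q", len(part)))
        h.update(part)
    return h.hexdigest()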
If you instead take hashes of hashes, you don't run into this problem, because here you do have a very strong constraint on the strings you're concatenating: they have a well-defined length (whatever the length of the hash algorithm is).
Taking hashes of hashes does not degrade security. It's part of a well-known technique: hash trees. If hash(hash(f1) + hash(f2) + hash(f3)) = hash(hash(g1) + hash(g2) + hash(g3)) then f1 = g1 and f2 = g2 and f3 = g3.
In addition to making the construction and verification easier, this approach lets you save computation if the set of files changes. If you've already stored hash(f1) and hash(f2) and you want to add f3 to the list, you just need to calculate hash(f3), and then the hash of the new list of hashes. This is also very useful for synchronization of data sets. If Alice wants to transmit files to Bob, she can send the hashes first, then Bob verifies which hashes he already knows and tells Alice, and Alice only needs to transmit the files whose hashes Bob doesn't already have.
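A minimal Python sketch of the recommended construction (the helper name is mine; if you have already stored the per-file digests, you would feed those in directly instead of re-reading the files):

import hashlib

def hash_of_hashes(file_paths):
    outer = hashlib.sha256()
    for path in file_paths:
        # Hash each file individually; every digest has a fixed length
        # (32 bytes for SHA-256), so concatenating them is unambiguous.
        inner = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                inner.update(chunk)
        outer.update(inner.digest())
    return outer.hexdigest()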
Suppose I have a 1 KB file called data.bin. Is it possible to construct a gzip of it, data.bin.gz, that is much larger, and if so, how?
How much larger could we theoretically get in GZIP format?
You can make it arbitrarily large. Take any gzip file and insert as many repetitions as you like of the five bytes: 00 00 00 ff ff after the gzip header and before the deflate data.
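A quick way to try this out in Python, as a sketch under the assumption that the gzip header is the minimal 10 bytes with no optional fields (which is what gzip.compress writes; a file produced by GNU gzip usually also stores the original file name, so its header is longer):

import gzip

original = gzip.compress(b"hello world")

# An empty, non-final stored block: one header byte (BFINAL=0, BTYPE=00,
# plus padding bits), then LEN=0x0000 and NLEN=0xFFFF.
empty_block = b"\x00\x00\x00\xff\xff"

# Insert as many copies as you like between the 10-byte header and the
# deflate data; the stream stays valid and decompresses to the same bytes.
inflated = original[:10] + empty_block * 1000 + original[10:]
assert gzip.decompress(inflated) == b"hello world"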
Summary:
With header fields/general structure: effect is unlimited unless it runs into software limitations
Empty blocks: unlimited effect by format specification
Uncompressed blocks: effect is limited to 6x
Compressed blocks: with apparent means, the maximum effect is estimated at 1.125x and is very hard to achieve
Take the gzip format (RFC1952 (metadata), RFC1951 (deflate format), additional notes for GNU gzip) and play with it as much as you like.
Header
There are a whole bunch of places to exploit:
use optional fields (original file name, file comment, extra fields)
bluntly append garbage (GNU gzip will issue a warning when decompressing)
concatenate multiple gzip archives (the format allows that; the resulting uncompressed data is, likewise, the concatenation of all chunks).
An interesting side effect (a bug in GNU gzip, apparently): gzip -l takes the reported uncompressed size from the last chunk only (even if it's garbage) rather than adding up the values from all of them. So you can make it look like the archive is (absurdly) larger or smaller than the raw data.
These are the ones that are immediately apparent; you may be able to find yet other ways.
Data
The general layout of "deflate" format is (RFC1951):
A compressed data set consists of a series of blocks, corresponding to
successive blocks of input data. The block sizes are arbitrary,
except that non-compressible blocks are limited to 65,535 bytes.
<...>
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
Empty blocks
The 00 00 00 ff ff that Mark Adler suggests is essentially an empty, non-final block (RFC1951 section 3.2.3. for the 1st byte, 3.2.4. for the uncompressed block itself).
Btw, according to gzip overview at the official site and the source code, Mark is the author of the decompression part...
Uncompressed blocks
Using non-empty uncompressed blocks (see the previous section for references), the best you can do is put a single byte of data into each block; every such block then costs 5 bytes of header for 1 byte of payload, so the effect is limited to 6x.
Compressed blocks
In a nutshell: some inflation is achievable, but it's very hard and the effect is limited. Don't waste your time on these unless you have a very good reason.
Inside compressed blocks (section 3.2.5.), each chunk is [<encoded character (8-9 bits)> | <encoded chunk length (7-11 bits)><distance back to data (5-18 bits)>], with lengths starting at 3. A 7-9-bit code unambiguously resolves to a literal character or a specific range of lengths; longer codes correspond to larger lengths/distances. No space or meaningless filler is allowed between chunks.
So the maximum for raw byte chunks is 9/8 (1.125x), reached only if all the raw bytes have values 144-255 (the ones that get 9-bit codes).
Playing with reference chunks isn't going to do any good for you: even a reference to a 3-byte sequence gives 25/24 (1.04x) at most.
That's it for static Huffman tables. Looking through the docs on dynamic ones: they tailor the aforementioned encoding to the specific data, so they should allow bringing the ratio for the given data closer to the achievable maximum, but that's it.
I have a formatted data file which is typically billions of lines long, with several lines of headers of variable length. The data file takes the form:
# header 1
# header 2
# headers are of variable length.
# data begins from next line.
1.23 4.56 7.89 0.12
2.34 5.67 8.90 1.23
:
:
# billions of lines of data, each row the same length, same format.
-- end of file --
I would like to extract a portion of data from this file, and my current code looks like:
do j=1,jmax  ! Suppose I want to extract jmax lines of data from the file.

   [algorithm to determine the number of lines to skip, "N(j)"]
   ! This determines the number of lines to skip from the previous file
   ! position, where the data was read on the (j-1)th iteration.

   ! Skip N-1 lines to get to the next data line to read off:
   do i=1,N-1
      read(unit=unit,fmt='(A)')
   end do

   ! Now read off the line of data I want:
   read(unit=unit,fmt='(data_format)') data1, data2, etc.
   ! Data is stored in some arrays.
end do
The problem is, N(j) can be anywhere between 1 and several billion, so it takes some time to run the code.
My question is, is there a more efficient way of skipping millions of lines of data? The only way I can think of, while sticking to Fortran, is to open the file with direct access and jump to the desired line upon opening the file.
As you suggest, direct access seems like the best option, but that requires the records to all have the same length, which your headers violate. Also, why use formatted output? With a file of this length, it's hard to imagine a person reading it. If you use unformatted I/O, the file will be smaller and the I/O will be faster. Perhaps create two files: one with the headers (metadata) in human-readable form, and the other with the data in native form. A native/binary representation means a loss of portability, which is something to consider if you want to move the files to different computer architectures or have them be usable for decades; otherwise it's probably worth the convenience. Another option would be to use a more sophisticated file format that combines metadata and data, such as HDF5 or FITS, but for communication between two programs of one person, that's probably excessive.
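For illustration, here is the same idea sketched in Python (a Fortran version would open the file with access='direct' and a fixed recl): with fixed-length records you can compute a byte offset and jump straight to record j instead of reading and discarding every line in between. The file name, record length, and header size below are invented for the example.

RECORD_LEN = 40        # bytes per data record, newline included
HEADER_BYTES = 120     # total size of the header block, known in advance

def read_record(f, j):
    # Seek directly to the j-th data record (1-based) of an open binary file.
    f.seek(HEADER_BYTES + (j - 1) * RECORD_LEN)
    return f.read(RECORD_LEN)

with open("data.txt", "rb") as f:
    line = read_record(f, 1000000)   # record one million, without looping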
I am writing code in Python that detects malicious PDFs.
For every file I analyze, I calculate its hash value and save it in a hash database, in addition to saving the output in a text file.
If I want to scan another file, I calculate its hash value and search for it in the hash database; if it is found, I print the output from the text file that already exists.
But if the hash value does not exist, it is saved and the output is saved in a text file.
I need help with how to link the hash value to the text file that contains the output.
As Kyle mentioned, you can use a hash table. A hash table is similar to a dictionary; in Python they are in fact called dictionaries. For more on that, look here: http://www.tutorialspoint.com/python/python_dictionary.htm
As far as your question is concerned, you have a variety of options. You will have to save your 'database' at some point, and you could save it in many different formats: as a JSON file (a very popular style), as an XML file (also very popular), or even as a CSV (not nearly as popular, but it gets the job done). For the sake of this answer, let's say you save this 'database' in a text file which looks like this:
5a4730fc12097346fdf5589166c373d3{C:\PdfsOutput\FileName.txt}662ad9b45e0f30333a433566cee8988d{C:\PdfsOutput\SomeOtherFile.txt}
Essentially you're formatting it as HashValue{PathToFileOnDisk}... You could then parse this with a regex such as [0-9a-f]{32}\{[^\}]+ : scan your database on startup using this regex, iterate over all the matches, split each match at '{', and put ValueSplit[0] into a dictionary as the key, with the path to that text file as the value for that key.
So, after you do the regex search, get your matches and are iterating them, within the iteration loop say something like:
ValueSplit = RegexMatch.split('{')
HashAndFileDict[ValueSplit[0]] = ValueSplit[1]
This code assumes the regex match in the loop is a string simply called 'RegexMatch'. It also assumes that the dictionary in which you're storing hash values and paths is called 'HashAndFileDict'.
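Putting it together, here is a minimal sketch of loading that 'database' into the dictionary at startup (using capture groups in the regex instead of splitting on '{'; the file name hash_db.txt is hypothetical):

import re

HashAndFileDict = {}
with open("hash_db.txt", "r") as db:
    contents = db.read()

# Each entry is <32 hex chars>{<path>}; capture the hash and the path
# separately, then store them as key and value.
for hash_value, text_path in re.findall(r"([0-9a-f]{32})\{([^}]*)\}", contents):
    HashAndFileDict[hash_value] = text_path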
Later in your code you can check a PDF hash value in question by saying:
if PDFHashValue not in HashAndFileDict:
    TextFilePath = savePDFOutputText(ThePDFFile)
    HashAndFileDict[PDFHashValue] = TextFilePath
else:
    print("File already processed. Text is at: " + HashAndFileDict[PDFHashValue])
If I may, it might be wise to use two hashing algorithms and combine their hexadecimal digests into one string in order to prevent a collision when processing many PDF files.
I am looking for ways to read a PDF file into SAS. Apparently this is not basic functionality, and there is very little to be found on the internet. (Not to mention that searching is not easy with 'PDF' in your query, since Google also gives you links to PDF documents about entirely different things.)
The only things that can be found are from people looking for ways to import data from a PDF into datasets. For me, that is not even necessary. I would like to be able to read the contents of the PDF file into one big character variable. If possible, it would be even better to be able to read in the file's binary data.
Is this possible with SAS, and how? (I got it to work in Access VBA, but can't find any similar way in SAS.)
(In the end, the purpose is to convert this to Base64 and put that Base64 string into an XML document.)
You probably will not be able to read the entire file into one character variable, since the maximum size of a character variable is 32,767 bytes (about 32 KB). A simple way to read in one line at a time, though, is something like the following:
%let pdfFileName = Test.pdf;
%let lineSize = 2000;

data base;
    format text_line $&lineSize..;
    infile "&pdfFileName" lrecl=&lineSize;
    input text_line $;
run;
This requires that you have a general idea of the maximum record length ahead of time, but you could write additional code to determine the maximum record size prior to reading in the file. In this example, each line of text is read into one character variable named "text_line". From there, you could use a RETAIN statement or double trailers (##) on the INPUT line to process multiple lines at a time. The SAS website has plenty of documentation on how to read and process text from various types of input files.