When I tar and gzip a folder, I get a different file size each time.
The contents of the directory are the same and have not changed, yet the compressed file size varies by 20 to 100 bytes. Is this normal behavior?
Will my data be affected by this?
Thanks
tar generates a different file each time (check with md5sum); the reason is that the ordering of entries differs between runs. gzip is then affected by that ordering, since its dictionary is much smaller than the file size.
And the compressed output should differ every time; after all, each tar file is different (even with the same contents)!
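The effect can be reproduced in miniature. Below is a small Python sketch (the file names and contents are invented for illustration) showing that member order alone changes the compressed bytes, while the extracted data stays identical:

```python
import gzip, io, tarfile

# Two tiny "files" standing in for the directory contents (hypothetical data).
files = {"a.txt": b"hello", "b.txt": b"world"}

def make_targz(order):
    """Build a .tar.gz in memory, adding members in the given order."""
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    # mtime=0 keeps the gzip header fixed, so only member order differs.
    return gzip.compress(tar_buf.getvalue(), mtime=0)

ab = make_targz(["a.txt", "b.txt"])
ba = make_targz(["b.txt", "a.txt"])
print(ab != ba)  # True: same contents, different archive bytes

def extract(blob):
    with tarfile.open(fileobj=io.BytesIO(gzip.decompress(blob))) as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}

print(extract(ab) == extract(ba))  # True: the data itself is unaffected
```

So the varying size is cosmetic: decompressing either archive recovers exactly the same files.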
I tried a Hive process which generates a word frequency ranking from sentences, and I would like the output to be a single file rather than multiple files.
I searched for similar questions on this site and found mapred.reduce.tasks=1, but it generated 50 files instead of one.
The process I tried has 50 input files, all of them gzip files.
How do I get one merged file?
The total size of the 50 input files is quite large, so I suppose the reason may be some kind of limit.
In your job, use an ORDER BY clause on some field.
Hive will then be forced to run only one reducer, and as a result you will end up with a single file created in HDFS.
hive> Insert into default.target
Select * from default.source
order by id;
For more details on the ORDER BY clause, refer to this and this.
Thank you for your kind answers; you are really saving me.
I am trying ORDER BY, but it is taking a long time, so I am waiting for it.
All I need is a single file, so that I can feed the output into the next step as its input.
I am also going to try simply catting all the files from the reducer outputs, as advised.
If I do that, I am worried about whether each file is unique (with no word shared between files), and whether the result of catting multiple gzip files together is a normal gzip file.
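On the second worry: concatenated gzip files do form a valid gzip stream (the format allows multiple members back to back), so catting the reducer outputs is safe at the format level. A small Python sketch with made-up contents:

```python
import gzip

# Stand-ins for two reducer output files (contents invented for illustration).
part1 = gzip.compress(b"apple\t3\nbanana\t1\n", mtime=0)
part2 = gzip.compress(b"cherry\t2\n", mtime=0)

# Byte-for-byte concatenation, exactly what `cat part1.gz part2.gz` does.
combined = part1 + part2

# gzip readers decode each member in turn, so the result decompresses
# to the concatenation of the original data.
print(gzip.decompress(combined))  # b'apple\t3\nbanana\t1\ncherry\t2\n'
```

The first worry is separate: cat only joins the files, it does not merge counts, so if the same word can appear in more than one reducer output you would still need another aggregation pass.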
Assume Parquet files on AWS S3 (used for querying by AWS Athena).
I need to anonymize a record with a specific numeric field by changing the numeric value (changing one digit is enough).
1. Can I scan a Parquet file as binary and find a numeric value, or will the compression make it impossible to find such a string?
2. Assuming I can do #1, can I anonymize the record by changing a digit of this number at the binary level without corrupting the Parquet file?
Thanks
No, this will not be possible. Parquet has two layers in its format that make this impossible: encoding and compression. Both reorder the data to fit into less space; the difference between them is CPU usage and generality. Sometimes data can be compressed so that we need less than a byte per value, if all values are the same or very similar. Changing a single value would then lead to more space usage, which in turn makes an in-place edit impossible.
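Parquet's specific encodings aside, the general effect is easy to demonstrate with any compressor. Here is a Python sketch using zlib as a stand-in (not Parquet's actual codec):

```python
import zlib

# 10,000 identical values compress to a handful of bytes, i.e. far
# less than one byte per value.
uniform = bytes([7]) * 10_000
packed = zlib.compress(uniform)
print(len(packed) < 100)  # True

# No value occupies a fixed, patchable position in the compressed
# stream: changing one value changes the compressed bytes wholesale,
# so there is no single byte you could flip in place.
edited = bytearray(uniform)
edited[5_000] = 8
print(zlib.compress(bytes(edited)) != packed)  # True
```

The practical route is to rewrite the file: read it with a Parquet library, modify the value, and write a new file.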
Currently we have an application that picks files out of a folder and processes them. It's simple enough but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is that once a file has been processed it never needs processing again, and for performance reasons we don't want to process it a second time.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every processed image file Image0001.ext, once processed create a second file Image0001.ext.done. When looking for files to process, use a filter on the extension type of your images, and as each filename is found check for the existence of a .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, so it cannot be a huge number of images: I don't need to handle hundreds of files per hour, every hour of every day (the first run might be different).
Writing a console application to convert one file (passed on the command line) is easy. Left as an exercise.
There is no indication of any urgency to the conversion: it can be scheduled to run every 15 minutes (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
Get-ChildItem -Filter *.png |
Where-Object { -not (Test-Path -Path ($_.FullName + '.done')) } |
Foreach-Object { ProcessFile $_.FullName; New-Item ($_.FullName + '.done') -ItemType File }
In a table, store the file name, file size, (and file hash if you need to be more sure about the file), for each file processed. Now, when you're taking a new file to process, you can compare it with your table entries (a simple query would do). Using hashes might degrade your performance, but you can be a bit more certain about an already processed file.
I have 2 zip files. Inside each zip file there are multiple text and binary files. However, not all files are the same: some differ because of timestamps and other data, while others are identical.
Can I use CRC to definitively prove that specific files are identical?
Example: I have files A, B, C in both archives. Can I use CRC to prove that files A, B, C are identical in both archives?
Thank you.
Definitively? No - CRC collisions are perfectly possible, just very improbable.
If you need absolute proof then you're going to need to compare the files byte for byte. If you just mean within the expectations of everyday use, sure: if the file size is the same and the CRC is the same, then it's very likely the files are the same.
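The necessary-but-not-sufficient nature of the check can be sketched in Python with `zlib.crc32` (the sample byte strings are invented):

```python
import zlib

a  = b"report 2014-01-01 rows=100\n"
b2 = b"report 2014-01-01 rows=100\n"   # identical to a
c  = b"report 2014-01-02 rows=100\n"   # one byte differs

# A differing CRC proves the files differ...
print(zlib.crc32(a) != zlib.crc32(c))  # True
# ...while matching CRCs make identity very likely; only a
# byte-for-byte comparison proves it outright.
print(zlib.crc32(a) == zlib.crc32(b2) and a == b2)  # True
```

In practice: use the CRCs already stored in the zip central directory to rule files out cheaply, then byte-compare the few candidates that match.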
Is there a difference between having say n files with 1 line each in the input folder and having 1 file with n lines in the input folder when running hadoop?
If there are n files, does the "InputFormat" just see it all as 1 continuous file?
There's a big difference. It's frequently referred to as "the small files problem", and has to do with the fact that Hadoop expects to split giant inputs into smaller tasks, but not to collect small inputs into larger tasks.
Take a look at this blog post from Cloudera:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
If you can avoid creating lots of files, do so. Concatenate when possible. Large splittable files are MUCH better for Hadoop.
I once ran Pig on the netflix dataset. It took hours to process just a few gigs. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file -- had my result in minutes.