Is there a difference between having, say, n files with 1 line each in the input folder and having 1 file with n lines, when running Hadoop?
If there are n files, does the "InputFormat" just see it all as 1 continuous file?
There's a big difference. It's frequently referred to as "the small files problem", and has to do with the fact that Hadoop expects to split giant inputs into smaller tasks, but not to collect small inputs into larger tasks.
Take a look at this blog post from Cloudera:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
If you can avoid creating lots of files, do so. Concatenate when possible. Large splittable files are MUCH better for Hadoop.
I once ran Pig on the Netflix dataset. It took hours to process just a few gigs. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file, and had my result in minutes.
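If concatenation is an option, a minimal sketch of doing it directly in HDFS with the Java FileSystem API might look like the following (the paths are hypothetical and the inputs are assumed to be plain text; hadoop fs -getmerge does something similar, but writes the result to the local filesystem):

// Concatenate every file in one HDFS directory into a single large file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path inputDir = new Path("/data/small-files");    // hypothetical
        Path merged = new Path("/data/merged/input.txt"); // hypothetical

        try (FSDataOutputStream out = fs.create(merged)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) continue;
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    // Copy this file's bytes; 'false' keeps the output open.
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
    }
}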
Besides using Cloudera Navigator, how can I find the least frequently accessed files in HDFS?
I assume that you are looking for the time a file was last accessed (opened, read, etc.), because the further in the past that time lies, the less frequently the file has been accessed.
While you can do this in Linux quite simply via ls -l -someMoreOptions, in HDFS more work is necessary.
Maybe you could monitor the /hdfs-audit.log for cmd=open of the mentioned file. Or you could implement a small function to read out FileStatus.getAccessTime(), as mentioned under Is there anyway to get last access time of HDFS files? or How to get last access time of any files in HDFS? in the Cloudera Community.
In other words, it will be necessary to create a small program which scans all the files and reads out the properties
...
status = fs.getFileStatus(new Path(line));
...
long lastAccessTimeLong = status.getAccessTime();
Date lastAccessTimeDate = new Date(lastAccessTimeLong);
...
and orders them. That way you will be able to find files which have not been accessed for a long time.
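For illustration, a minimal self-contained sketch along those lines (the class name and path handling are made up; note that HDFS only records access times if dfs.namenode.accesstime.precision is greater than zero):

// Walk HDFS recursively and sort by access time, least recently accessed first.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

public class LeastAccessedFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path root = new Path(args.length > 0 ? args[0] : "/");

        List<LocatedFileStatus> files = new ArrayList<>();
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true); // recursive
        while (it.hasNext()) {
            files.add(it.next());
        }

        // Oldest access time first = least recently accessed files on top.
        files.sort(Comparator.comparingLong(LocatedFileStatus::getAccessTime));

        for (LocatedFileStatus status : files) {
            System.out.println(new Date(status.getAccessTime()) + "\t" + status.getPath());
        }
    }
}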
I tried a Hive process which generates a word-frequency ranking from sentences, and I would like it to output a single file rather than multiple files.
I searched this site for similar questions and found mapred.reduce.tasks=1, but it didn't generate one file; it generated 50 files.
The process I tried has 50 input files, and they are all gzip files.
How do I get one merged file?
The combined size of the 50 input files is so large that I suppose the reason may be some kind of limit.
In your job, use an ORDER BY clause on some field. That forces Hive to run only one reducer, so as a result you will end up with a single file created in HDFS.
hive> INSERT INTO default.target
SELECT * FROM default.source
ORDER BY id;
For more details regarding the ORDER BY clause, refer to the Hive documentation.
Thank you for your kind answers, you are really saving me.
I am trying ORDER BY, but it is taking a lot of time, so I am waiting for it.
All I have to do is get one file, so that the output can be used as the input of the next step.
I am also going to try simply cat-ing all the files from the reducer outputs, according to the advice. If I do that, I am worried about two things: whether the files are disjoint, i.e. the same word never appears in more than one file, and whether a file made by cat-ing multiple gzip files is a normal gzip file.
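As a quick sanity check on the second worry (not from this thread; the file names are invented): the gzip format allows multiple members in one file, so byte-level concatenation of gzip files yields a valid gzip file, and readers such as gunzip, java.util.zip.GZIPInputStream on any recent JDK, and Hadoop's gzip codec decompress all members in sequence. A small Java demonstration:

// Concatenate two gzip files and read the result back as one stream.
import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipConcatCheck {
    public static void main(String[] args) throws IOException {
        write("part-0.gz", "hello 3\n");
        write("part-1.gz", "world 2\n");

        // Byte-level concatenation, equivalent to: cat part-*.gz > merged.gz
        try (OutputStream out = Files.newOutputStream(Paths.get("merged.gz"))) {
            out.write(Files.readAllBytes(Paths.get("part-0.gz")));
            out.write(Files.readAllBytes(Paths.get("part-1.gz")));
        }

        // Prints "hello 3" then "world 2": both members are decompressed.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get("merged.gz")))))) {
            r.lines().forEach(System.out::println);
        }
    }

    static void write(String name, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(Paths.get(name))))) {
            w.write(text);
        }
    }
}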
I have a Hive table with following properties
TextFile Format
Unpartitioned
Unbucketed
Having 50 files of 3.5 MB each
Here are the table parameters from the "DESCRIBE FORMATTED" command:
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 50
totalSize 170774650
I am performing a count(*) operation on this table and it is running with
4 mappers and 1 reducer on an AWS cluster
1 mapper and 1 reducer on my standalone cluster [pseudo-distributed installation].
The max split size for both the Hive sessions is 256MB
I wanted to know how the combine input format works.
On a single machine, the data is clubbed together since all the files/blocks are on the same machine, and since the total size of the files combined is less than the max split size, a single split is formed and hence a single mapper is called for.
In the other case, the AWS cluster resulted in 4 mappers. I read that CombineInputFormat employs rack/machine locality, but precisely how?
Thanks for all your answers in advance.
Ok! No reply!!! I figured it out over time, and when visiting my Stack Overflow account today I found this unlucky question sitting unanswered. So here follow the details.
Splits are constructed from the files under the input paths. A split cannot have files from different pools. Each split returned may contain blocks from different files. If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop: each block is a locally processed split. Subclasses implement InputFormat.createRecordReader(InputSplit, TaskAttemptContext) to construct RecordReader's for CombineFileSplit's.
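To make that concrete, here is a minimal job skeleton (new MapReduce API, default identity mapper/reducer, class name invented) that plugs in CombineTextInputFormat with a 256 MB cap, matching the max split size from the question:

// Pack many small input files into combined splits of at most 256 MB.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesJob.class);

        // Blocks on the same node are combined first; leftovers are then
        // combined with blocks from the same rack, as described above.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}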
Hope it helps someone having a similar question!
Just wanted to follow up on this.
A split cannot have files from different pools.
There may be other factors, but there is only one pool per partition. If two small files exist in the same partition, they will be combined and only a single mapper is required; if the same files exist in two different partitions, two mappers are required to process them.
Background: I have 30 days data in 30 separate compressed files stored in google storage. I have to write them to a BigQuery table in 30 different partitions in the same table. Each compressed file size was around 750MB.
I did 2 experiments on the same data set on Google Dataflow today.
Experiment 1: I read each day's compressed file using TextIO, applied a simple ParDo transform to prepare TableRow objects, and wrote them directly to BigQuery using BigQueryIO. So basically 30 pairs of parallel, unconnected sources and sinks got created. But I found that at any point in time, only 3 files were being read, transformed, and written to BigQuery. The ParDo transformation and BigQuery writing speed of Google Dataflow was around 6000-8000 elements/sec at any point in time.
So only 3 sources and sinks out of 30 were being processed at any time, which significantly slowed the process. In over 90 minutes, only 7 out of 30 files were written to separate BigQuery partitions of the table.
Experiment 2: Here I first read each day's data from the same compressed files for all 30 days, then applied a ParDo transformation to the 30 resulting PCollections and stored them in a PCollectionList object. All 30 TextIO sources were being read in parallel.
Now I wrote each PCollection in the PCollectionList, corresponding to each day's data, to BigQuery using BigQueryIO directly. So 30 sinks were again being written to in parallel.
I found that out of the 30 parallel sources, again only 3 were being read and having the ParDo transformation applied, at a speed of around 20000 elements/sec. At the time of writing this question, after 1 hour had already elapsed, reading from all the compressed files had not even covered 50% of them, and writing to the BigQuery table partitions had not even started.
These problems seem to occur only when Google Dataflow reads compressed files. I had asked a question about its slow reading from compressed files (Relatively poor performance when reading compressed files vis a vis normal text files kept in google storage using google dataflow) and was told that parallelizing the work would make reading faster, since only 1 worker reads a compressed file and multiple sources would mean multiple workers being given a chance to read multiple files. But this also does not seem to be working.
Is there any way to speed up this whole process of reading from multiple compressed files and writing to separate partitions of the same BigQuery table in a Dataflow job at the same time?
Each compressed file will be read by a single worker. The initial number of workers for a job can be increased with the numWorkers pipeline option, and the maximum number that can be scaled up to can be set with the maxNumWorkers pipeline option.
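For reference, a minimal sketch of setting those options (Dataflow Java SDK 1.x class names assumed; the worker counts are illustrative only):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class WorkerCountExample {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        options.setNumWorkers(10);     // initial workers
        options.setMaxNumWorkers(30);  // autoscaling ceiling, e.g. one per file
        Pipeline p = Pipeline.create(options);
        // ... build the 30 read/transform/write branches here, then p.run();
    }
}

The same effect can be had from the command line with --numWorkers=10 --maxNumWorkers=30.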
Currently we have an application that picks files out of a folder and processes them. It's simple enough, but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is that after a file has been processed it doesn't need processing again, and for performance reasons we really don't want it to be picked up a second time.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every processed image file Image0001.ext, once processed create a second file Image0001.ext.done. When looking for files to process, use a filter on the extension type of your images, and as each filename is found check for the existence of a .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, so there cannot be a big number of images: I don't need to handle hundreds of files per hour, every hour of every day (the first run might be different).
Writing a console application to convert one file (passed on the command line) is easy. Left as an exercise.
There is no indication of any urgency to the conversion: it can be scheduled to run every 15 min (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
Get-ChildItem -Filter *.png |
Where-Object { -not (Test-Path -Path ($_.FullName + '.done')) } |
ForEach-Object { ProcessFile $_.FullName; New-Item ($_.FullName + '.done') -ItemType File }
In a table, store the file name, file size (and file hash, if you need to be more sure about the file) for each file processed. Now, when you pick up a new file to process, you can compare it with your table entries (a simple query will do). Using hashes might degrade your performance, but you can be a bit more certain about whether a file has already been processed.
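A sketch of that ledger idea, in Java rather than VB.NET since the idea translates directly (class and method names are invented, and an in-memory set stands in for the database table):

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ProcessedLedger {
    // Stands in for the database table; load it from the DB at startup.
    private final Set<String> processed = new HashSet<>();

    // Key a file by name, size, and SHA-256 hash of its contents.
    static String key(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return file.getFileName() + "|" + Files.size(file) + "|" + hex;
    }

    boolean needsProcessing(Path file) throws Exception {
        return !processed.contains(key(file));
    }

    void markProcessed(Path file) throws Exception {
        processed.add(key(file)); // in practice, also INSERT into the table
    }
}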