I ran a Hive job that generates a word frequency ranking from sentences, and I would like the output to be a single file instead of multiple files.
I searched this site for similar questions and found mapred.reduce.tasks=1, but that still produced 50 files rather than one.
The job I ran has 50 input files, and they are all gzip files.
How do I get one merged file?
The 50 input files are so large that I suspect some kind of limit may be the reason.
In your job, use an ORDER BY clause on some column. Hive will then be forced to run only one reducer, so you end up with a single file created in HDFS:
hive> INSERT INTO default.target
SELECT * FROM default.source
ORDER BY id;
For more details on the ORDER BY clause, refer to this and this.
Thank you for your kind answers, you are really saving me.
I am trying ORDER BY now, but it is taking a long time, so I am waiting for it to finish.
All I need is one file, because the output has to become the input of the next step.
Following the advice, I am also going to try simply catting all of the files from the reducer outputs.
If I do that, my worries are whether the files are disjoint (no word appears in more than one file) and whether the result of catting multiple gzip files is still a normal gzip file.
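On the gzip question: the gzip format allows multiple compressed members to be concatenated into one stream, so simply catting the reducer output files gives a file that gunzip/zcat will decompress as a single file (on HDFS, hadoop fs -getmerge does the same byte-level concatenation for you). As a minimal sketch in Python, assuming the reducer outputs have already been copied to a hypothetical local directory:

import glob
import gzip
import shutil

# Hypothetical location of the reducer output files (000000_0.gz, ...).
parts = sorted(glob.glob("reducer_output/*.gz"))

# Byte-level concatenation: the result is a valid multi-member gzip stream.
with open("merged.gz", "wb") as merged:
    for part in parts:
        with open(part, "rb") as handle:
            shutil.copyfileobj(handle, merged)

# Sanity check: Python's gzip module also reads multi-member streams,
# so the merged file decompresses as one continuous text file.
with gzip.open("merged.gz", "rt") as handle:
    print(sum(1 for _ in handle), "lines after merging")

As for duplicate words across files: if the final stage of the query groups by word, the shuffle should send each word to a single reducer, so the part files should not repeat words, but it is worth verifying on a small sample.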
I am wondering if there is a way to set the order in which files are loaded into Redshift from S3 through Data Pipeline. I know we can use a manifest to specify the files, but I haven't found anything about the order in which they are loaded.
For instance, my S3 folder1 has 10 files. In the Data Pipeline I point it at this folder, but how can I set the order in which these files are loaded, if that is possible at all?
In short, as far as I understand, there is no way to load files in a predefined order while they are being consumed by a Data Pipeline; please correct me if I am wrong.
I am thinking of a case where there are multiple source files that can contain duplicate rows with different values. In such a case, the order in which the files are consumed is important.
For example, File1 and File2 are part of a Data Pipeline schedule and both contain an entry for a customer named xyz: File1 has xyz Cost_owed 1000 and File2 has xyz Cost_owed 500. In reality the customer xyz owes just 500, but since I use delete-and-insert mode, the order of the files matters here; my Redshift table might end up with either 1000 or 500 for xyz. In this specific case, or other cases like it, the order of the files matters. Or should this be handled in some other way? If so,
can you give me some ideas.
Thanks
The order of the files doesn't (and can't) matter to the COPY command in Redshift, since it is an MPP system that loads files in parallel.
Redshift relies on the SORTKEY of the target table to enforce ordering.
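As a rough sketch of what that means in practice (the table, columns, bucket, and credentials below are all hypothetical, and psycopg2 is only one way to talk to the cluster): declare the ordering you care about as the table's SORTKEY and let COPY pull the manifest's files in whatever order it likes.

import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="etl_user",
    password="...",
)

with conn, conn.cursor() as cur:
    # Physical ordering lives on the table, not in the load order.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS cost_owed (
            customer_id VARCHAR(64),
            cost_owed   DECIMAL(12, 2)
        )
        SORTKEY (customer_id);
    """)

    # COPY reads the manifest's files in parallel; their order is undefined.
    # The file-format options (CSV here) are an assumption.
    cur.execute("""
        COPY cost_owed
        FROM 's3://my-bucket/folder1/manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        MANIFEST CSV;
    """)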
We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We then tried to take advantage of the fact that our first column of SKUs is already ordered, so we would not have to apply this ordering inside our application.
But we found that the files we get from GCS change the order in which the SKUs appear.
It looks like the files are created by several processes reading the results and saving them to different files, which breaks the ordering we wanted to maintain.
As an example, if two files are created, the first file would look something like this:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.023
sku_1,sku_3,0.736
This is what it looks like when two processes read the results and each saves its current rows to its own file, which breaks the ordering of the first column.
We looked for a flag we could use to force the ordering to be preserved, but couldn't find one so far.
Is there some way to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far as I know, there is no flag to maintain order.
As a workaround, you can rethink your output to use a NESTED type: make sure the rows you want grouped together are converted into nested records, and then export to JSON.
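A hedged sketch of that idea with the google-cloud-bigquery Python client; the dataset, table, column, and bucket names are made up, and it assumes the correlations sit in a table with columns sku_a, sku_b, correlation:

from google.cloud import bigquery

client = bigquery.Client()

# Collapse each sku_a into a single row whose correlations are a nested,
# ordered array, so one group can never be split across export files.
client.query("""
    CREATE OR REPLACE TABLE my_dataset.sku_correlations_nested AS
    SELECT
      sku_a,
      ARRAY_AGG(STRUCT(sku_b, correlation) ORDER BY sku_b) AS correlations
    FROM my_dataset.sku_correlations
    GROUP BY sku_a
""").result()

# Export the nested table to GCS as compressed newline-delimited JSON.
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
job_config.compression = bigquery.Compression.GZIP

client.extract_table(
    "my_dataset.sku_correlations_nested",
    "gs://my-bucket/correlations/nested-*.json.gz",
    job_config=job_config,
).result()

Each exported line then carries one sku_a together with all of its correlations, so the grouping survives even if the rows land in different files in an arbitrary order.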
is there some workaround?
As an option, you can move your processing logic from Python into BigQuery itself, which eliminates moving the data out of BigQuery to GCS at all.
I have an ETL that produces text file output, and I have to check whether the text content contains the word "error" or "bad", using Pentaho.
Is there any simple way to find this?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (RegEx, e.g.) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which it likely is if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
If you're writing these files out yourself, then don't you already know the content? In that case, scan the fields at the point where you already have them in memory.
If you're trying to see whether Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, your best bet is probably good old grep.
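For reference, a minimal sketch of that grep-style scan in Python (the directory, file pattern, and word list are assumptions); it prints the files whose contents contain either word:

import glob
import re

# Words to flag in the ETL output; adjust as needed.
PATTERN = re.compile(r"\b(error|bad)\b", re.IGNORECASE)

def files_with_flagged_words(path_glob):
    """Return the filenames whose contents contain 'error' or 'bad'."""
    flagged = []
    for path in sorted(glob.glob(path_glob)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            if any(PATTERN.search(line) for line in handle):
                flagged.append(path)
    return flagged

if __name__ == "__main__":
    # Hypothetical output directory of the ETL run.
    for name in files_with_flagged_words("/data/etl/output/*.txt"):
        print(name)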
I wish to get some expert advice on this problem.
I have two text files, one very large (~ GB) and the other small (~ MB). These files essentially contain one piece of information per line; the bigger file contains a subset of the information in the smaller file. Each line is organized as a tuple of fields separated by spaces, and the diff is computed by comparing one or more of the columns in the two files. Both files are sorted on one such column (the document id).
I implemented this by keeping an index from document id to line number and doing a random access to that line in the larger file to start the diff, but this method is slow. I would like to know of a better mechanism for this scenario.
Thanks in advance.
If the files are known to be sorted in the same order by the same key, and the lines that share a common key are expected to match exactly, then comm is probably what you want - it has flags to allow you to show only the lines that are common between two files, or the lines that are in one file but not the other.
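If comm is not a direct fit because only some columns take part in the comparison, the same idea can be written as a single sorted-merge pass. A minimal sketch, assuming space-separated fields, both files sorted lexicographically on the document id in column 0, and that you want the lines of the small file whose id is missing from the big file:

# Stream both files once instead of doing random access into the large file.
def keyed_lines(path, key_col=0):
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield line.split()[key_col], line

def in_small_but_not_big(big_path, small_path):
    """Yield lines of the small file whose document id is absent from the big file."""
    big = keyed_lines(big_path)
    big_key, _ = next(big, (None, None))
    for key, line in keyed_lines(small_path):
        # Advance through the big file until its key catches up.
        while big_key is not None and big_key < key:
            big_key, _ = next(big, (None, None))
        if big_key != key:
            yield line

for line in in_small_but_not_big("big.txt", "small.txt"):
    print(line, end="")

Comparing on more columns just means building the key from more than one field; the point is that each file is read sequentially exactly once.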
Is there a difference between having, say, n files with one line each in the input folder and having one file with n lines in the input folder when running Hadoop?
If there are n files, does the InputFormat just see them all as one continuous file?
There's a big difference. It's frequently referred to as "the small files problem", and has to do with the fact that Hadoop expects to split giant inputs into smaller tasks, but not to collect small inputs into larger tasks.
Take a look at this blog post from Cloudera:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
If you can avoid creating lots of files, do so. Concatenate when possible. Large splittable files are MUCH better for Hadoop.
I once ran Pig on the Netflix dataset. It took hours to process just a few gigs. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file, and had my result in minutes.
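As a trivial illustration of the concatenation advice above (the paths are hypothetical), merging the per-movie or per-user files into one large file before loading it into HDFS can be as simple as:

import glob
import shutil

def concatenate(input_glob, output_path):
    """Merge many small text files into one large, splittable file."""
    with open(output_path, "wb") as merged:
        for path in sorted(glob.glob(input_glob)):
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)

# Hypothetical local staging area for the per-movie files.
concatenate("/data/netflix/per_movie/*.txt", "/data/netflix/ratings_merged.txt")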