Hadoop put command doing nothing! - apache

I am running Cloudera's distribution of Hadoop and everything is working perfectly. HDFS contains a large number of .seq files. I need to merge the contents of all the .seq files into one large .seq file. However, the getmerge command did nothing for me. I then used cat and piped the data of some .seq files into a local file. When I try to "put" this file into HDFS, it does nothing. No error message shows up, and no file is created.
I am able to "touchz" files in HDFS, and user permissions are not a problem here. The put command simply does not work. What am I doing wrong?

Write a job that merges all the sequence files into a single one. It's just the standard mapper and reducer with only one reduce task.

If the "hadoop" command fails silently, you should have a look at the script itself.
Just type 'which hadoop'; this will give you the location of the "hadoop" executable. It is a shell script, so you can edit it and add logging to see what's going on.
If the hadoop bash script fails at the beginning, it is no surprise that the hadoop dfs -put command does not work.
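If you'd rather not edit the script, one alternative is to run it under bash's -x trace, which prints every command the script executes. This is a substitute for the manual logging described above, not the same technique. The sketch below assumes the command really is a shell script on your PATH; trace_script is a made-up helper name.

```shell
# Run a shell-script command under `bash -x` so each step it executes is
# printed to stderr, then report the exit status explicitly.
# Assumption: the named command is a shell script found on PATH.
trace_script() {
    name=$1
    shift
    script=$(command -v "$name") || {
        echo "not found on PATH: $name" >&2
        return 127
    }
    bash -x "$script" "$@"
    status=$?
    echo "exit status: $status" >&2
    return "$status"
}
```

For example, trace_script hadoop fs -put merged.seq /user/me/merged.seq (both paths are placeholders) would show where the launcher script stops and what status it exits with.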

Related

How to delete remote file using Kettle Pentaho

I have a directory on a remote Linux machine where files are archived and kept for a certain period of time. I want to delete a file from the remote (Linux) machine using a Kettle transformation, based on some condition.
If the file does not exist, the job should not throw any error. If the file exists at the remote location, the job should delete it, or raise an error if the deletion fails for some other reason, e.g. a permission issue.
Here, the file name will be retrieved as a variable from previous steps of the transformation, and the directory path of the archived files is fixed.
How can I achieve this in a Pentaho Kettle transformation?
Make use of the "Run SSH commands" utility to pass commands to your remote server.
Assuming you do a rm -f /path/file, it won't error for a non-existent file.
You can capture the output and perform error handling as well (use Filter rows to trigger the appropriate course of action).
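To make the rm -f behaviour concrete, here is a minimal sketch of the command you might send over SSH. It relies on rm -f returning success for a missing file and a non-zero status for real failures. delete_archived is a hypothetical helper name, and the directory and file name arguments stand in for your fixed archive path and the Kettle variable.

```shell
# Delete one archived file. A missing file is not an error (rm -f returns 0),
# but any real failure (permissions, target is a directory, ...) is reported
# on stderr and passed back through the exit status.
delete_archived() {
    dir=$1
    name=$2
    rm -f -- "$dir/$name"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "delete failed for $dir/$name (status $status)" >&2
    fi
    return "$status"
}
```

The exit status (or the stderr text) is what you would route through Filter rows to decide whether the job should fail.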
Or you can mount the remote directory on the machine where Kettle runs, and delete the file as you would a regular local file.
Using SSH is, I think, non-trivial. It takes a lot of experimenting to find out the error types and a way to distinguish them: the failure might be an error with the SSH connection or an error deleting the file.

how do I split a large csv.gz file in Google Cloud Storage?

I get this error when trying to load a table in Google BQ:
Input CSV files are not splittable and at least one of the files is
larger than the maximum allowed size. Size is: 56659381010. Max
allowed size is: 4294967296.
Is there a way to split the file using gsutil or something like that without having to upload everything again?
The largest compressed CSV file you can load into BigQuery is 4 gigabytes. GCS unfortunately does not provide a way to decompress a compressed file, nor does it provide a way to split a compressed file. GZip'd files can't be arbitrarily split up and reassembled in the way you could a tar file.
I imagine your best bet would likely be to spin up a GCE instance in the same region as your GCS bucket, download your object to that instance (which should be pretty fast, given that it's only a few dozen gigabytes), decompress the object (which will be slower), break that CSV file into a bunch of smaller ones (the linux split command is useful for this), and then upload the objects back up to GCS.
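As a rough sketch of the splitting step, the helper below wraps split so that chunks break on line boundaries. It assumes the CSV has no header row and no quoted fields containing newlines; split_csv, the chunk size, and the part_ prefix are illustrative names, not anything from the question.

```shell
# Split an already-decompressed CSV into chunks of at most N lines each.
# Splitting by line count (-l) rather than by bytes keeps every row intact.
# Assumption: no header row and no newlines embedded inside quoted fields.
split_csv() {
    input=$1
    lines_per_chunk=$2
    prefix=$3
    [ -f "$input" ] || { echo "no such file: $input" >&2; return 1; }
    split -l "$lines_per_chunk" "$input" "$prefix"
}
```

After something like split_csv 20171130 50000000 part_, the resulting part_aa, part_ab, ... files can be copied back to the bucket with gsutil cp.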
I ran into the same issue and this is how I dealt with it:
First, spin up a Google Compute Engine VM instance.
https://console.cloud.google.com/compute/instances
Then install gsutil and go through the authentication process.
https://cloud.google.com/storage/docs/gsutil_install
Once you have verified that the gcloud, gsutil, and bq commands are working, save a snapshot of the disk as snapshot-1 and then delete this VM.
On your local machine, run this command to create a new disk. This disk is used for the VM so that you have enough space to download and unzip the large file.
gcloud compute disks create disk-2017-11-30 --source-snapshot snapshot-1 --size=100GB
Again on your local machine, run this command to create a new VM instance that uses this disk. I use the --preemptible flag to save some cost.
gcloud compute instances create loader-2017-11-30 --disk name=disk-2017-11-30,boot=yes --preemptible
Now you can SSH into your instance and then run these commands on the remote machine.
First, copy the file from cloud storage to the VM
gsutil cp gs://my-bucket/2017/11/20171130.gz .
Then unzip the file. In my case, for a ~4 GB file, it took about 17 minutes to complete this step:
gunzip 20171130.gz
Once the file is unzipped, you can run the bq load command to load it into BigQuery directly, but I found that for my file size (~70 GB unzipped) that operation would take about 4 hours. Instead, I uploaded the unzipped file back to Cloud Storage:
gsutil cp 20171130 gs://am-alphahat-regional/unzipped/20171130.csv
Now that the file is back on cloud storage, you can run this command to delete the VM.
gcloud compute instances delete loader-2017-11-30
Theoretically, the associated disk should also have been deleted, but I found that the disk was still there and I needed to delete it with an additional command:
gcloud compute disks delete disk-2017-11-30
Now finally, you should be able to run the bq load command or you can load the data from the console.

What is the path for a bootstrapped file for a Pig job running in Amazon EMR

I bootstrap a data file in my EMR job. The bootstrapping succeeds and the file is copied to the /home/hadoop/contents/ folder with the right permissions.
However, when I try to access it in the Pig script like this:
userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray);
I get an error that the input path does not exist:
hdfs://10.183.166.176:9000/home/hadoop/contents/UserIdsToPick.txt
When running Ruby jobs the bootstrapped file was always accessible under /home/hadoop/contents/ folder and everything worked for me.
Is it different for Pig?
By default, Pig on EMR is configured to resolve paths against HDFS instead of the local filesystem. The error shows the HDFS location.
There are two ways to solve this:
Either copy the file to S3 and load it directly from there:
userdidstopick = load 's3_bucket_location/UserIdsToPick.txt' AS (uid:chararray);
Or first copy the file to HDFS (instead of the local filesystem), and then keep using the same path you use today.
I would prefer the first option.
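For the second option, a minimal sketch of the copy step could look like this. It only uses hadoop fs -put and hadoop fs -ls; stage_to_hdfs is a hypothetical helper name, and depending on your Hadoop version the target directory may need to exist in HDFS before the put succeeds.

```shell
# Copy a bootstrapped local file into HDFS at the path the Pig script loads,
# then list it to confirm that it arrived.
# Assumption: the `hadoop` launcher is on PATH on the master node.
stage_to_hdfs() {
    local_file=$1
    hdfs_path=$2
    [ -f "$local_file" ] || { echo "missing local file: $local_file" >&2; return 1; }
    hadoop fs -put "$local_file" "$hdfs_path" || return 1
    hadoop fs -ls "$hdfs_path"
}
```

With the paths from the question, that would be stage_to_hdfs /home/hadoop/contents/UserIdsToPick.txt /home/hadoop/contents/UserIdsToPick.txt, run before the Pig step.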

How can I use boto or boto-rsync to make a full backup of 1000+ files to an S3-compatible cloud?

I'm trying to back up my entire collection of over 1000 work files, mainly text but also pictures and a few large (0.5-1 GB) audio recordings, to an S3 cloud (Dreamhost DreamObjects). I have tried to use boto-rsync to perform the first full 'put' with this:
$ boto-rsync --endpoint objects.dreamhost.com /media/Storage/Work/ \
> s3:/work.personalsite.net/ > output.txt
where '/media/Storage/Work/' is on a local hard disk, 's3:/work.personalsite.net/' is a bucket named after my personal web site for uniqueness, and output.txt is where I wanted a list of the files uploaded and error messages to go.
Boto-rsync grinds its way through the whole directory tree, but its self-refreshing per-file progress output doesn't look good when it's written to a file. Still, while the upload is running, I 'tail output.txt' and see that most files are uploaded, but some are only uploaded to less than 100%, and some are skipped altogether. My questions are:
Is there any way to confirm that a transfer is 100% complete and correct?
Is there a good way to log the results and errors of a transfer?
Is there a good way to transfer a large number of files in a big directory hierarchy to one or more buckets for the first time, as opposed to an incremental backup?
I am on Ubuntu 12.04 running Python 2.7.3. Thank you for your help.
You can wrap the command in a script and start it with nohup:
nohup script.sh
nohup automatically generates a nohup.out file where all the output of the script/command is captured.
To choose the log file yourself, you can do:
nohup script.sh > /path/to/log
br
Eddi
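If you also want the errors and the final result in the same log, a small wrapper along these lines might help. It sends stdout and stderr to one file and appends the exit status, so a failed run is visible afterwards. run_logged is a made-up name, and this is only a sketch of what could go inside script.sh.

```shell
# Run a command with stdout and stderr both captured in one log file,
# then append the command's exit status to that log.
run_logged() {
    log=$1
    shift
    "$@" >"$log" 2>&1
    status=$?
    echo "exit status: $status" >>"$log"
    return "$status"
}
```

Inside script.sh you could call run_logged with a log path followed by the boto-rsync command from the question, and then start script.sh under nohup as described above.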

When using LZO on Hadoop output on AWS EMR, does it index the files (stored on S3) for future automatic splitting?

I want to use LZO compression on my Elastic Map Reduce job's output that is being stored on S3, but it is not clear if the files are automatically indexed so that future jobs run on this data will split the files into multiple tasks.
For example, if my output is a bunch of lines of TSV data in a 1 GB LZO file, will a future map job create only 1 task, or something like (1 GB / blockSize) tasks (i.e. the behavior when files are not compressed, or when there is an LZO index file in the directory)?
Edit: If this is not done automatically, what is recommended for getting my output to be LZO-indexed? Do the indexing before uploading the file to S3?
Short answer to my first question: AWS does not do automatic indexing. I've confirmed this with my own job, and also read the same from Andrew@AWS on their forum.
Here's how you can do the indexing:
To index some LZO files, you'll need to use your own Jar built from the Twitter hadoop-lzo project. You'll need to build the Jar somewhere, then upload it to Amazon S3 if you want to index directly with EMR.
On a side note, Cloudera has good instructions on all the steps for setting this up on your own cluster. I did this on my local cluster, which allowed me to build the Jar and upload it to S3. You can probably find a pre-built Jar on the net if you don't want to build it yourself.
When outputting your data from your Hadoop job, make sure you use the LzopCodec and not the LzoCodec, otherwise the files are not indexable (at least based on my experience). Example Java code (same idea carries over to Streaming API):
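import com.hadoop.compression.lzo.LzopCodec;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
// Compress the job output with LzopCodec so the resulting .lzo files can be indexed
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);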
Once your hadoop-lzo Jar is on S3 and your Hadoop job has output .lzo files, run your indexer on the output directory (the instructions below assume you have an EMR job/cluster running):
elastic-mapreduce -j <existingJobId> \
--jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
--args com.hadoop.compression.lzo.DistributedLzoIndexer \
--args s3://<yourBucketName>/output/myLzoJobResults \
--step-name "Lzo file indexer Jar"
Then when you're using the data in a future job, be sure to specify that the input is in LZO format, otherwise the splitting won't occur. Example Java code:
import com.hadoop.mapreduce.LzoTextInputFormat;
job.setInputFormatClass(LzoTextInputFormat.class);