I have a MapReduce job that reads files from Amazon S3 and processes the data on local HDFS. The files are gzipped text files (.gz). I tried to set up the job as below, but it won't work. Does anyone know what might be wrong? Do I need to add an extra step to unzip the files first?
Thanks!
String S3_LOCATION = "s3n://access_key:private_key#bucket_name";

protected void prepareHadoopJob() throws Exception {
    this.getHadoopJob().setMapperClass(Mapper1.class);
    this.getHadoopJob().setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(this.getHadoopJob(), new Path(S3_LOCATION));
    this.getHadoopJob().setNumReduceTasks(0);
    this.getHadoopJob().setOutputFormatClass(TableOutputFormat.class);
    this.getHadoopJob().getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, myTable.getTableName());
    this.getHadoopJob().setOutputKeyClass(ImmutableBytesWritable.class);
    this.getHadoopJob().setOutputValueClass(Put.class);
}
Normally, you should not need to unzip the files first, but it is tough to identify what went wrong without any details on the error message.
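If it helps to narrow things down, one quick check is whether the objects really are gzip-compressed. A minimal diagnostic sketch (boto3 and the bucket/key names here are assumptions for illustration, not part of the original setup):

import boto3

# Placeholders -- substitute the real bucket and one of the input keys.
BUCKET = "bucket_name"
KEY = "path/to/input/part-00000.gz"

s3 = boto3.client("s3")

# Fetch only the first two bytes of the object.
header = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=0-1")["Body"].read()

# Gzip files always start with the magic bytes 0x1f 0x8b.
if header == b"\x1f\x8b":
    print(KEY, "looks like a valid gzip file")
else:
    print(KEY, "does not start with the gzip magic bytes:", header)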
Related
I need to restore a lot of objects from AWS S3 Glacier Deep Archive, so I am trying to use an S3 Batch Operations job. For that I use Python code to create a manifest as a CSV with two columns: Bucket,Key.
But my first issue: some keys contain a comma, so the job failed.
To (partially) work around this I just cut the CSV file down to the first two columns, hoping that not many files are involved.
But now I have another issue:
ErrorMessage: Task target couldn't be URL decoded
Any idea?
As mentioned on https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the manifest CSV file must be URL-encoded. The , character in a key name gets converted to %2C with URL encoding, so the resulting file will be valid CSV even with commas in the key names.
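For instance, a minimal sketch of writing such a manifest with the keys URL-encoded (the object list and file names below are made up for illustration; urllib.parse.quote leaves '/' alone but turns ',' into %2C):

import csv
from urllib.parse import quote

# Hypothetical (bucket, key) pairs collected elsewhere.
objects = [
    ("my-archive-bucket", "reports/2023, final version.csv"),
    ("my-archive-bucket", "logs/app.log"),
]

with open("manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for bucket, key in objects:
        # URL-encode the key so that ',' becomes %2C and the manifest
        # stays a valid two-column CSV.
        writer.writerow([bucket, quote(key)])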
I am using FileChooser, and I checked with print that the correct file is loaded, up to the data path.
The text file name is A.txt.
self.data_name = str(self.list_data_load.selection[0])
self.fdata = self.data_name
check_error = 'Path' + self.fdata
f = open(self.fdata, "r")

The printed path is /emulated/0/data/A.txt.

# F = f.readline()
But when I try to use readlines, the app crashes and shuts down on Android.
I'm not sure if I need to grant additional permissions to read text files in external storage.
I need some help please.
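For what it's worth, on Android 6 and later reading from external storage also requires the READ_EXTERNAL_STORAGE permission to be granted at runtime, not just declared. A minimal sketch, assuming a python-for-android / buildozer build (the permission must also be listed in buildozer.spec, and the helper function name is illustrative):

from kivy.utils import platform

# The 'android' module only exists on-device, hence the guarded import.
if platform == "android":
    from android.permissions import request_permissions, Permission
    request_permissions([Permission.READ_EXTERNAL_STORAGE])

def read_selected_file(path):
    # Reading inside try/except avoids a hard crash if the permission
    # is still missing or the path is wrong.
    try:
        with open(path, "r") as f:
            return f.readlines()
    except (PermissionError, FileNotFoundError) as e:
        print("Could not read", path, ":", e)
        return []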
I am trying to process a CSV file stored in an S3 bucket with Apache NiFi.
For this I am using the following flow:
The thing is that I need to replace some text in the CSV file, but what I get as the output of FetchS3Object is a file, not text.
How can I access the text of the S3Object?
Thanks!
Finally I found a way to fix the problem.
Using a SplitText processor after FetchS3Object, it is possible to split the content line by line (count = 1) and process each line after that processor.
I ran this command to load 11 files into a BigQuery table:
bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
I got this error:
Waiting on bqjob_r46f38146351d545_00000147ef890755_1 ... (11s) Current status: DONE
BigQuery error in load operation: Error processing job 'ardent-course-601:bqjob_r46f38146351d545_00000147ef890755_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 5: Unexpected. Please try again.
I tried many times after that and still got the same error.
To debug what went wrong, I instead loaded each file one by one into the BigQuery table. For example:
/usr/local/bin/bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part-m-00011.gz /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
There are 11 files total and each ran fine.
Could someone please help? Is this a bug on the BigQuery side?
Thank you.
There was an error reading one of the files: gs://...part-m-00005.gz
Looking at the import logs, it appears that the gzip reader encountered an error decompressing the file.
It looks like that file may not actually be compressed. BigQuery samples the header of the first file in the list to determine whether it is dealing with compressed or uncompressed files and to determine the compression type. When you import all of the files at once, it only samples the first file.
When you run the files individually, BigQuery reads the header of that file and determines that it isn't actually compressed (despite having the suffix '.gz'), so it imports it as a normal flat file.
If you run a load that doesn't mix compressed and uncompressed files, it should work successfully.
Please let me know if you think this is not the case and I'll dig in some more.
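If it is useful, here is a minimal sketch for checking which of the part files actually begin with the gzip magic bytes (it assumes gsutil is installed and authenticated; the path pattern is the one from the question):

import subprocess

# Path pattern taken from the original load command.
PATTERN = ("gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/"
           "20140819/23af7218-617d-42e8-884e-f213a583094a/part*")

# List every object matching the pattern.
files = subprocess.check_output(["gsutil", "ls", PATTERN]).decode().split()

for uri in files:
    # Read only the first two bytes of each object.
    header = subprocess.check_output(["gsutil", "cat", "-r", "0-1", uri])
    # Real gzip files start with the magic bytes 0x1f 0x8b.
    print(uri, "gzip" if header == b"\x1f\x8b" else "NOT gzip")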
I'm trying to load a simple file:
log = load 'file_1.gz' using TextLoader AS (line:chararray);
dump log;
And I get an error:
2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
Is it possible to handle such a situation before the error appears?
Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
The error means the input file doesn't exist at the given HDFS path.
log = load 'file_1.gz' using TextLoader AS (line:chararray);
As you haven't specified the absolute path of file_1.gz, it will be resolved relative to the HDFS home directory of the user running your Pig script.
Unfortunately, in the current version of Pig (0.15.0) it is impossible to manage these errors without using UDFs.
I suggest creating a Java or Python script using try and catch to take care of this.
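For example, a minimal wrapper sketch along those lines (the glob and script name are illustrative, and it assumes the hadoop and pig command-line tools are on the PATH):

import subprocess
import sys

# Illustrative glob -- the same pattern the Pig script would read.
INPUT_GLOB = "hdfs://hadoop1:8020/pko/file*gz"

# 'hadoop fs -ls' exits non-zero when the pattern matches no files,
# so we can bail out before Pig ever starts.
if subprocess.call(["hadoop", "fs", "-ls", INPUT_GLOB]) != 0:
    print("No input files match", INPUT_GLOB, "- skipping the Pig job")
    sys.exit(0)

try:
    # Run the Pig script only when input exists (script name is illustrative).
    subprocess.check_call(["pig", "-f", "load_file.pig"])
except subprocess.CalledProcessError as e:
    print("Pig job failed:", e)
    sys.exit(1)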
Here's a good website that might be of some use to you: https://wiki.apache.org/pig/PigErrorHandlingInScripts
Good luck learning Pig!
I'm facing this issue as well. My load command is:
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (...);
I want to load all files from the data subfolders, but the data folder is empty and I got the same error as you. What I did, in my particular case, was create an empty folder in the data directory. That way the LOAD returned an empty dataset and the script did not fail.
By the way, I'm using an Oozie workflow to run the scripts, and in the prepare step I create the empty folders.