Pig Latin load from S3 (folder expansion) - amazon-s3

I am trying to use LOAD with an S3 bucket as the data source.
load s3n://hourly-logprocessing/{2013090100,2013100501}/??????_0.gz' using some loader()
does not work.
load s3n://hourly-logprocessing/{201309????}/??????_0.gz using some loader()
does not work.
I get this exception.
Caused by: java.lang.IllegalArgumentException: Can not create a Path
from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:91)
at org.apache.hadoop.fs.Path.<init>(Path.java:99)
at org.apache.hadoop.fs.Path.<init>(Path.java:58)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:498)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1341)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1418)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1602)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1539)
It only works when I use a single folder.
load s3n://some-folder/2013090100/??????_0.gz
How does Pig expand these paths? Any help would be appreciated.

First of all, I didn't try your examples (lazy me), but this works for my LOAD statements:
's3n://SOME_BUCKET/20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-23-*.mystuff_v14*'
Don't forget the single quotes around the path after the LOAD keyword (they are missing in your examples).
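If you want a quick, local sanity check of what a bracket-class pattern like the one above will match, here is a minimal Python sketch using fnmatch. The key names are made up, and POSIX-style globbing is only an approximation of Hadoop's path globbing (Hadoop also documents {a,b} alternation), but ?, * and [0-9] behave roughly the same way for plain file names:
# Minimal sketch: test which (made-up) S3 key names a glob pattern matches.
# fnmatch approximates Hadoop's glob semantics for ?, * and [0-9] classes;
# it does not handle {a,b} alternation, which Hadoop's globbing documents.
from fnmatch import fnmatch

pattern = "20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-23-*.mystuff_v14*"

candidate_keys = [
    "2013-09-01-23-000123.mystuff_v14.gz",   # should match
    "2013-09-01-11-000123.mystuff_v14.gz",   # wrong hour, should not match
    "2013-10-05-23-000456.mystuff_v14.gz",   # should match
]

for key in candidate_keys:
    print(key, "->", fnmatch(key, pattern))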

Related

pyspark.sql.utils.IllegalArgumentException: requirement failed: Temporary GCS path has not been set

On Google Cloud Platform, I am trying to submit a pyspark job that writes a dataframe to BigQuery.
The code that executes the write is the following:
finalDF.write.format("bigquery") \
    .mode('overwrite') \
    .option("table", "[PROJECT_ID].dataset.table") \
    .save()
And I get the mentioned error in the title. How can I set the GCS temporary path?
As the GitHub repository of the spark-bigquery-connector states, one can specify it when writing:
df.write
  .format("bigquery")
  .option("temporaryGcsBucket", "some-bucket")
  .save("dataset.table")
Or in a global manner:
spark.conf.set("temporaryGcsBucket","some-bucket")
The property "temporaryGcsBucket" needs to be set either at the time of writing the DataFrame or when creating the SparkSession:
.option("temporaryGcsBucket","some-bucket")
or like .option("temporaryGcsBucket","some-bucket/optional_path")
finalDF.write.format("bigquery") \
    .mode('overwrite') \
    .option("temporaryGcsBucket", "some-bucket") \
    .option("table", "[PROJECT_ID].dataset.table") \
    .save()
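Putting the two variants together, here is a minimal, self-contained PySpark sketch; the bucket, project, dataset and table names are placeholders, and it assumes the spark-bigquery-connector is already available on the cluster (e.g. via --packages or a Dataproc connector jar):
# Minimal sketch (placeholder names; assumes the spark-bigquery-connector
# is on the classpath and GCS/BigQuery credentials are configured).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write-demo").getOrCreate()

# Variant 1: set the temporary GCS bucket once, globally, on the session.
spark.conf.set("temporaryGcsBucket", "some-bucket")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Variant 2: set it per write; either variant is enough on its own.
(df.write.format("bigquery")
    .mode("overwrite")
    .option("temporaryGcsBucket", "some-bucket")
    .option("table", "my_project.my_dataset.my_table")
    .save())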

ImageFileError: Cannot work out file type of ".nii"

When I try to load my .NII file as a 4D Niimg-like object (I've tried both nilearn and nibabel),
I get the below error
Error: ImageFileError: Cannot work out file type of
"/Users/audreyphan/Documents/Spring2020/DESPO/res4d/1/res4d_anat.nii"
Here is my code:
import nibabel as nib
from nilearn import image

ds_name = '/Users/audreyphan/Documents/Spring2020/DESPO/res4d/1/res4d_anat.nii'
block = nib.load(ds_name)        # nibabel
block = image.load_img(ds_name)  # nilearn
Both attempts result in the same error.
I'm not sure what's causing this error.
Thanks!
It looks like the libraries are not able to work out the file type of your file.
So first of all we have to be sure that the file is not corrupt. Can you load the data correctly with a tool such as ITK-SNAP (http://www.itksnap.org)?
If yes, you can try to specify the file type yourself in the nibabel package by using a specific loader function, e.g. try each of the following:
img_nifti1 = nib.Nifti1Image.from_filename(file)
img_nifti2 = nib.Nifti2Image.from_filename(file)
Oddly enough, this error also occurs when the access permissions are not set appropriately for the file you are trying to load. Try using chmod to fix the access permissions and then load the *.nii file again.
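A small diagnostic sketch that combines both suggestions, assuming the path from the question; it checks readability first and then tries the explicit NIfTI loaders instead of letting nibabel guess the format:
# Small diagnostic sketch (path is the one from the question; adjust as needed).
import os
import nibabel as nib

path = "/Users/audreyphan/Documents/Spring2020/DESPO/res4d/1/res4d_anat.nii"

# 1) Rule out a permissions problem before blaming the file format.
print("exists:", os.path.exists(path), "readable:", os.access(path, os.R_OK))

# 2) Try the explicit NIfTI loaders instead of letting nibabel guess the type.
for loader in (nib.Nifti1Image, nib.Nifti2Image):
    try:
        img = loader.from_filename(path)
        print(loader.__name__, "loaded, shape:", img.shape)
        break
    except Exception as exc:  # keep going and report what each loader says
        print(loader.__name__, "failed:", exc)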

Saving RDD to file results in _temporary path for parts

I have data in Spark which I want to save to S3. The recommended method is to use the saveAsTextFile method on the RDD, which runs successfully. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data, it has been saved in a folder named _temporary, with a subfolder 0, and then each part or task saved in its own folder.
For example,
data.saveAsTextFile("s3://kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect, and have previously seen, something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?
I had the same problem with Spark Streaming; that was because my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark can't execute saveAsTextFile during the stream.
Try using coalesce() to reduce the RDD to one partition before you export.
Good luck!
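For reference, a minimal PySpark sketch of both suggestions; the output path and the S3 URI scheme (s3://, s3n://, s3a://) depend on your Hadoop version and credentials, so treat them as placeholders:
# Minimal sketch (placeholder output path; assumes S3 credentials are configured).
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("save-example").setMaster("local[*]")  # not plain "local"
sc = SparkContext(conf=conf)

data = sc.parallelize(["line one", "line two", "line three"])

# coalesce(1) collapses the output to a single part file before writing.
data.coalesce(1).saveAsTextFile("s3a://kirk/data")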

Hadoop: wrong classpath in map reduce job

I'm running a Cloudera cluster in 3 virtual machines and am trying to execute an HBase bulk load via a MapReduce job, but I always get the error:
error: Class org.apache.hadoop.hbase.mapreduce.HFileOutputFormat not found
It seems that the map process doesn't find the class, so I tried this:
1) adding hbase.jar to the HADOOP_CLASSPATH on every node
2) adding TableMapReduceUtil.addDependencyJars(job) / TableMapReduceUtil.addDependencyJars(myConf, HFileOutputFormat.class) to my source code
Nothing worked. I have absolutely no idea why the class is not found, because the jar/class is definitely available on the classpath.
If I take a look into the job.xml I see the following entry:
name=tmpjars value=file:/C:/Users/Thomas/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.3.0/zookeeper-3.4.5-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hbase/hbase/0.94.6-cdh4.3.0/hbase-0.94.6-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.3.0/hadoop-core-2.0.0-mr1-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,file:/C:/Users/Thomas/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar
This seems a little odd to me: these are my local jars on the Windows system. Shouldn't these be jars on HDFS? If yes, how can I change the values for "tmpjars"?
Here is the Java code I try to execute:
// Point the job at the remote cluster.
configuration = new Configuration(false);
configuration.set("mapred.job.tracker", "192.168.2.41:8021");
configuration.set("fs.defaultFS", "hdfs://192.168.2.41:8020/");
configuration.set("hbase.zookeeper.quorum", "192.168.2.41");
configuration.set("hbase.zookeeper.property.clientPort", "2181");

Job job = new Job(configuration, "HBase Bulk Import for " + tablename);
job.setJarByClass(HBaseKVMapper.class);

job.setMapperClass(HBaseKVMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);

job.setOutputFormatClass(HFileOutputFormat.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setInputFormatClass(TextInputFormat.class);

// Configure the job to write HFiles matching the table's region boundaries.
HFileOutputFormat.configureIncrementalLoad(job, hTable);

FileInputFormat.addInputPath(job, new Path("myfile1"));
FileOutputFormat.setOutputPath(job, new Path("myfile2"));

job.waitForCompletion(true);

// Load the generated HFiles into the table.
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(configuration);
loader.doBulkLoad(new Path("myFile3"), hTable);
EDIT:
I tried a little bit more and it's totally strange. I added the following line to the Java code:
job.setJarByClass(HFileOutputFormat.class);
After I executed this, the first error was gone, but another class-not-found exception appeared:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mypackage.bulkLoad.HBaseKVMapper not found
HBaseKVMapper is the custom Mapper class I want to execute. I tried to add it with job.setJarByClass(HBaseKVMapper.class), but that doesn't work since it is only a class file and not a jar. So I generated a jar file including HBaseKVMapper.class. After that I executed the job again and got the HFileOutputFormat class-not-found exception once more.
After debugging a little, I found out that the setJarByClass method only copies the local jar file to .staging/job_#number/job.jar on HDFS. So setJarByClass() only works for one jar file, because calling it again with another jar overwrites job.jar.
While searching for the error I looked at the job staging directory, and inside its libjars directory I saw the relevant jar files.
So the HBase jar is inside the libjars directory, but the JobTracker doesn't use it for executing the job. Why?
I would try using Cloudera Manager (free version) as it takes care of these issues for you. Otherwise note the following:
Both your own classes and the HBase class HFileOutputFormat need to be available on the classpath, locally and remotely.
Submitting the job
Meaning getting the classpath right locally for when your driver runs:
$ env HADOOP_CLASSPATH=$(hbase classpath) hadoop jar path/to/jar class....
On the server
In your hadoop-env.sh
export HADOOP_CLASSPATH=$(hbase classpath)
or use
TableMapReduceUtil.addDependencyJars
I found a "hacked" solution which worked for me, but I'm not happy with it because it's not really practical.
My "hacked" solution:
create one big jar with all necessary class files; I called it "big.jar" and added it to the local (Eclipse) classpath
add the line job.setJarByClass(MyMapperClass.class) ... MyMapperClass has to be in big.jar
When I execute this, big.jar is copied to the filesystem for every job. No more errors. The problem is that the jar is 80 MB in size and has to be copied every time.
If anyone knows a better way, I would be thankful if they could tell me how.
EDIT:
Now I am trying to execute jobs with Apache Pig and have exactly the same problem. My hacked solution doesn't work in this case because Pig creates the jobs automatically. Here is the Pig error:
java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.mapreduce.TableSplit not found

google big query: export table to own bucket results in unexpected error

I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried the Cloud Storage target with a lot of different variations; all reveal the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel
It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*
Specify the file extension along with the pattern, for example:
gs://foo/bar/baz*.gz in case of GZIP (compressed)
gs://foo/bar/baz*.csv in case of CSV (uncompressed)
Here, foo is the bucket name and bar can be your date in string format, which could be generated on the fly.
I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json
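The same sharded export can also be done from Python with the google-cloud-bigquery client; a minimal sketch, with placeholder project, dataset, table and bucket names:
# Minimal sketch (placeholder names; requires the google-cloud-bigquery package
# and credentials that can read the table and write to the bucket).
from google.cloud import bigquery

client = bigquery.Client(project="my_project")

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

# The * in the destination URI lets BigQuery shard a large table across files.
extract_job = client.extract_table(
    "my_project.my_dataset.my_table",
    "gs://my_bucket/my_dataset/my_table/part-*.json",
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish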