We know that
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
is only effective during the current session.
Is there a way to add a jar to Hive permanently and globally, so that the jar is available for the whole lifetime of the cluster?
UPDATE:
I figured out a way: download the jar using the AWS CLI (aws s3 cp s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar .), then copy the jar to /usr/lib/hive/lib on every node of the EMR cluster.
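In shell terms, that workaround amounts to roughly the following on each node (the use of sudo is an assumption; the target path is the one mentioned above):
# download the jar, then copy it into Hive's lib directory
aws s3 cp s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar .
sudo cp jsonserde.jar /usr/lib/hive/lib/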
Is there a better way to do this?
Insert your ADD JAR commands in your .hiverc file and start Hive:
add jar yourjarName.jar
1. What is the .hiverc file?
It is a file that is executed when you launch the Hive shell, making it an ideal place for any Hive configuration/customization you want set at the start of the shell (a sample file is shown further below). This could be:
Setting column headers to be visible in query results
Making the current database name part of the hive prompt
Adding any jars or files
Registering UDFs
2. .hiverc file location
The file is loaded from the hive conf directory.
I have the CDH4.2 distribution and the location is:
/etc/hive/conf.cloudera.hive1
If the file does not exist, you can create it. It needs to be deployed to every node from which you might launch the Hive shell.
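For example, a minimal .hiverc covering the items above might look like this (the jar path matches the question above; the UDF class name is a placeholder):
-- show column headers and the current database in the CLI
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
-- add jars and register UDFs for every new session
ADD JAR /usr/lib/hive/lib/jsonserde.jar;
CREATE TEMPORARY FUNCTION parse_json AS 'com.example.hive.udf.ParseJsonUDF';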
Ref: http://hadooped.blogspot.in/2013/08/hive-hiverc-file.html
Related
I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to point at just the prefix: s3://bucket/my-data.parquet
Also, what if the schema differs between row groups across the different partition files?
Following the instructions here, I downloaded a jar file and ran:
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
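As a side note, that error usually means no S3 filesystem connector is on Hadoop's classpath. A hedged sketch of one workaround, assuming the hadoop-aws and AWS SDK jars are available locally (the jar paths are placeholders, and this still points at a single file):
export HADOOP_CLASSPATH=/path/to/hadoop-aws.jar:/path/to/aws-java-sdk-bundle.jar
hadoop jar parquet-tools-1.9.0.jar schema s3a://bucket/my-data.parquet/my-data-0000.snappy.parquet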
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
I am working on Pentaho Kettle version 5.0.1. In one of my transformations I use a JavaScript component that calls a method located in a JAR which I copied to the lib folder of data-integration, and everything works fine locally. But in my dev environment (where I run it using Kitchen) I don't have permission to copy my JAR file to the lib folder because of restrictions on the server. Is there any other way to supply the path of my custom JAR at run time so that the Kettle job/transformation can use it while being executed? Is there a way Kettle can pick up the JAR from a location other than data-integration/lib? Any help will be appreciated.
Take a look into kitchen.sh (and pan.sh). At some point the script starts adding stuff to the classpath. You can add more folders to the classpath there.
You still need permission to edit the kitchen.sh file, though. If you can't do that, I suggest creating a writable copy of kitchen.sh in a separate location and changing its $BASEDIR folder to point to the actual PDI installation, so that kitchen.sh itself can live elsewhere.
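A rough sketch of the kind of edit meant here; the exact variable name and the right place in the script vary between PDI versions, and the jar path is a placeholder:
# in your writable copy of kitchen.sh, after the script has assembled its classpath
CLASSPATH=$CLASSPATH:/export/home/my-custom.jar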
If you have permission, you can put your jar in another directory and then specify this directory in launcher.properties, which you will find in data-integration/launcher.
For example, if you put your jar in the directory /export/home,
then in launcher.properties you add that path to the libraries entry, specifically: libraries=../test:../lib:../libswt:../export/home
I want to write a Pig script that loads a jar file. The following is my code:
Register /aa/bb/cc/ex.jar
I run Pig through Hue and the jar file does exist in HDFS. However, it always reports that the file doesn't exist.
I am not sure whether I am using the correct method to register a jar file from HDFS. Could you please give me some ideas?
Thanks in advance.
According to
http://pig.apache.org/docs/r0.12.0/basic.html#register, you have to specify a full location URI for the jar file. For example:
register hdfs://namenode:port/aa/bb/cc/ex.jar;
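If you don't know the namenode host and port, the default filesystem URI can usually be printed with the HDFS client:
# prints something like hdfs://namenode-host:8020
hdfs getconf -confKey fs.defaultFS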
I'd like to sync a single file from my filesystem to S3.
Is this possible, or can only directories be synced?
Use the --include/--exclude options with the sync command on the parent directory.
For example, to sync just /var/local/path/filename.xyz to S3, use:
s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'
cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp
Just to comment on pythonjsgeo's answer: that seems to be the right solution, but make sure to execute the command without the = symbol after the --include and --exclude options. I was including the = symbol and getting weird behavior with the sync command.
s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
You can mount the S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize files or directories.
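Once the bucket is mounted (see the RioFS documentation for the exact mount command), say at /mnt/s3, syncing a single file becomes a plain local copy; the mount point and paths here are placeholders:
cp /var/local/path/filename.xyz /mnt/s3/path/filename.xyz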
I am new to Oozie and want to add a Hive job to my workflow. Could you please tell me where I can find, or how I can create, the hive-default.xml file? I installed everything via Cloudera Manager and I am not sure where to find this file. I have looked for it in /etc/hive/conf, which seems to be its usual directory, but it is not in that folder. I also ran a find command in the terminal and it didn't turn up any file. Please help.
In /etc/hive/conf you should have hive-site.xml. You can copy this file to your HDFS workflow directory and rename it to hive-default.xml; that should work.
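A minimal sketch of that copy, assuming the workflow lives at /user/me/workflows/hive-job on HDFS (the path is a placeholder):
# rename locally, then upload next to workflow.xml
cp /etc/hive/conf/hive-site.xml ./hive-default.xml
hdfs dfs -put hive-default.xml /user/me/workflows/hive-job/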