Enabling Hive Lineage

I thought Hive lineage was not available, but after some research I found that it can be enabled. Most of what I found while searching was about enabling lineage via either Cloudera Manager or IBM InfoSphere, which I am not interested in. I finally found a property that is supposed to enable it:
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.hadoop.hive.ql.hooks.LineageLogger</value>
</property>
This property has to be added to the hive-site.xml file and is supposed to write the lineage to the directory /var/log/hive/lineage. I say "supposed" because I haven't found anything in that directory; in fact, it wasn't even created.
I am currently using Hive server 3.12.
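One way to check whether the hook is picked up at all is to enable it for a single session instead of editing hive-site.xml. The sketch below assumes the property is not listed in hive.conf.restricted.list on your cluster; the JDBC URL, credentials and query are placeholders. Also, as far as I can tell, /var/log/hive/lineage is a path set up by vendor-managed logging configuration rather than by the hook itself, so on a plain Apache Hive install the lineage JSON ends up in the regular Hive/HiveServer2 logs.
# Enable the LineageLogger hook for this session only (placeholder URL and query):
beeline -u jdbc:hive2://localhost:10000/default \
  --hiveconf hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.LineageLogger \
  -e "SELECT COUNT(*) FROM some_table"
# Then grep the HiveServer2 / hive.log output for lines mentioning LineageLogger.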

Related

Spring Cloud Config server with multiple property sources

I have a Spring Cloud Config server reading properties from multiple sources (Git and Vault). For a given path, even if it finds the resource in Git, it still queries Vault and reports a failure because the resource is not available in both sources. My requirement is to look up a resource and, if it is found, not query the other source. Please suggest whether this is possible. Thanks.
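For reference, this is roughly how a multi-backend setup is declared with the composite profile in the config server's own application.properties. It is only a sketch (the URI and Vault connection details are placeholders, and exact keys can differ between Spring Cloud versions), and it does not by itself change the query-every-source behaviour described above.
spring.profiles.active=composite
spring.cloud.config.server.composite[0].type=git
spring.cloud.config.server.composite[0].uri=https://example.com/config-repo.git
spring.cloud.config.server.composite[1].type=vault
spring.cloud.config.server.composite[1].host=vault.example.com
spring.cloud.config.server.composite[1].port=8200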

Turning off embedded databases with Hibernate

I have a Vaadin Flow app that accesses an Azure SQL database that keeps shutting down and restarting. Upon further investigation, I discovered that Hibernate is trying to drop and recreate database tables because it is using an embedded H2 database, which I do not want. After reading the documentation, I determined that I can turn this feature off by setting ddl-auto to none in either hibernate.cfg.xml, application.properties, or application.yml. The problem is that none of these files exist on my local machine, and I can't find them on my cloud drive either. How can I stop Hibernate from dropping and recreating my database tables?
The file was located in src/main/resources. It doesn't show up in a Windows search, not even if I search for system files.
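A minimal sketch of the setting, assuming a Spring Boot app with JPA (the hibernate.cfg.xml equivalent would be hibernate.hbm2ddl.auto); if the file does not exist yet, it can simply be created at this path:
# src/main/resources/application.properties
# Stop Hibernate from dropping and recreating the schema on startup.
spring.jpa.hibernate.ddl-auto=none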

Mule - Copy the directory from HDFS

I need to copy a directory (/tmp/xxx_files/xxx/Output), including its sub-folders and files, from HDFS (Hadoop Distributed File System). I'm using the HDFS connector, but it seems it does not support this.
It always fails with an error like:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /tmp/xxx_files/xxx/Output/
I don't see any option in the HDFS connector for copying the files/directories inside the specified path. It always expects file names to copy.
Is it possible to copy a directory, including sub-folders and files, from HDFS using the MuleSoft HDFS connector?
As the technical documentation of the HDFS connector on the official MuleSoft website states, the code is hosted at the GitHub site of the connector:
The Anypoint Connector for the Hadoop Distributed File System (HDFS)
is used as a bi-directional gateway between applications. Its source
is stored at the HDFS Connector GitHub site.
What it does not state is that more detailed technical documentation is also available on the GitHub site.
There you can also find various examples of how to use the connector for basic file-system operations.
The links seem to be broken in the official MuleSoft documentation.
You can find the repository here:
https://github.com/mulesoft/mule-hadoop-connector
The operations are implemented in the HdfsOperations Java class (see also the FileSystemApiService class).
As you can see, the functionality you expect is not implemented; it is not supported out of the box. You can't copy a directory with its sub-folders and files from HDFS using the HDFS connector without some further effort.
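For completeness, a sketch of what that further effort could look like outside the connector, for example invoked from Mule as an external process (the local target path and the second cluster address are placeholders):
# Copy the whole directory tree (sub-folders and files) out of HDFS with the stock CLI:
hadoop fs -get /tmp/xxx_files/xxx/Output /local/target/dir
# Or copy the tree to another HDFS location/cluster in parallel:
hadoop distcp hdfs:///tmp/xxx_files/xxx/Output hdfs://other-cluster:8020/tmp/Output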

Configure Apache Hive for LLAP without using slider

There's a new feature in Hive called LLAP. During my investigation I found out that LLAP is quite difficult to configure, so there's a component called Apache Slider that will configure it for you. Still, I couldn't find any documentation for manual configuration without Slider. https://cwiki.apache.org/confluence/display/Hive/LLAP
Take a look at this documentation.
https://hortonworks.com/hadoop-tutorial/interactive-sql-hadoop-hive-llap/
[Update] It seems the above page has been removed by Hortonworks.
The only option I can suggest now is
https://www.google.com/search?q=hadoop+interactive+sql+hadoop+hive+llap&oq=hadoop+interactive+sql+hadoop+hive+llap&gs_l=serp.3..35i39k1.5338.10878.0.11135.4.4.0.0.0.0.199.655.0j4.4.0....0...1c.1.64.serp..0.3.499.N2KWHY3UFi8
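For what it's worth, the client-side switches for LLAP live in hive-site.xml. The sketch below only points Hive at LLAP daemons that are already running (bringing the daemons up is exactly the part Slider automates), and the service name and ZooKeeper quorum values are placeholders:
<property>
<name>hive.execution.mode</name>
<value>llap</value>
</property>
<property>
<name>hive.llap.execution.mode</name>
<value>all</value>
</property>
<property>
<name>hive.llap.daemon.service.hosts</name>
<value>@llap0</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>zk1:2181,zk2:2181,zk3:2181</value>
</property>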

Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3

I have set up a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when I log in to the master node and submit the following command:
bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3>
It throws one of the following errors (not both at the same time). The first error is thrown when I don't replace the slashes with '%2F' and the second is thrown when I do replace them with '%2F':
1) java.lang.IllegalArgumentException: Invalid hostname in URI S3://<ID>:<SECRETKEY>#<BUCKET>/<path-to-inputfile>
2) org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/' XML Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
Note:
1) When I submitted jps to see what tasks were running on the master, it just showed
1116 NameNode
1699 Jps
1180 JobTracker
leaving out DataNode and TaskTracker.
2) My secret key contains two '/' (forward slashes), and I replace them with '%2F' in the S3 URI.
PS: The program runs fine on EC2 when run on a single node. It's only when I launch a cluster that I run into issues related to copying data to/from S3 from/to HDFS. And what does distcp do? Do I need to distribute the data even after I copy it from S3 to HDFS? (I thought HDFS took care of that internally.)
If you could direct me to a link that explains running Map/Reduce programs on a Hadoop cluster using Amazon EC2/S3, that would be great.
Regards,
Deepak.
You probably want to use s3n:// URLs, not s3:// URLs. s3n:// means "a regular file, readable from the outside world, at this S3 URL". s3:// refers to an HDFS file system mapped into an S3 bucket.
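As a quick illustration (the bucket name is a placeholder), the same bucket addressed both ways:
hadoop fs -ls s3n://myhappybucket/    # native S3 objects, readable by other S3 tools
hadoop fs -ls s3://myhappybucket/     # bucket used as a block store for an HDFS-style file system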
To avoid the URL escaping issue for the access key (and to make life much easier), put them into the /etc/hadoop/conf/core-site.xml file:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>0123458712355</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>hi/momasgasfglskfghaslkfjg</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>0123458712355</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>hi/momasgasfglskfghaslkfjg</value>
</property>
There was at one point an outstanding issue with secret keys that had a slash -- the URL was decoded in some contexts but not in others. I don't know if it's been fixed, but I do know that with the keys in the .conf this goes away.
Other quickies:
You can most quickly debug your problem using the hadoop filesystem commands, which work just fine on s3n:// (and s3://) urls. Try hadoop fs -cp s3n://myhappybucket/ or hadoop fs -cp s3n://myhappybucket/happyfile.txt /tmp/dest1 and even hadoop fs -cp /tmp/some_hdfs_file s3n://myhappybucket/will_be_put_into_s3
The distcp command runs a mapper-only job to copy a tree from there to here. Use it if you want to copy a very large number of files into HDFS. (For everyday use, hadoop fs -cp src dest works just fine.)
You don't have to move the data onto HDFS if you don't want to. You can pull all the source data straight from S3 and do all further manipulations targeting either HDFS or S3 as you see fit.
Hadoop can become confused if there is a file s3n://myhappybucket/foo/bar and a "directory" (many files with keys s3n://myhappybucket/foo/bar/something). Some old versions of the s3sync command would leave just such 38-byte turds in the S3 tree.
If you start seeing SocketTimeoutExceptions, apply the patch for HADOOP-6254. We were, and we did, and they went away.
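A sketch of the distcp form mentioned above (bucket and paths are placeholders; the fs.s3n.aws* keys from the core-site.xml snippet earlier must be in place):
hadoop distcp s3n://myhappybucket/input/ hdfs:///user/hadoop/input/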
You can also use Apache Whirr for this workflow. Check the Quick Start Guide and the 5 minutes guide for more info.
Disclaimer: I'm one of the committers.
Try using Amazon Elastic MapReduce. It removes the need for configuring the hadoop nodes, and you can just access objects in your s3 account in the way you expect.
Use
-Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key>
e.g.
hadoop distcp -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> -<subsubcommand> <args>
or
hadoop fs -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> -<subsubcommand> <args>
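For example, with a concrete fs subcommand filled in (the keys stay as placeholders):
hadoop fs -Dfs.s3n.awsAccessKeyId=<your-key> -Dfs.s3n.awsSecretAccessKey=<your-secret-key> -ls s3n://myhappybucket/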