Zeppelin - output format - Hive

I have a problem with Zeppelin. Zeppelin has given me the salary output with commas. Is there any way to print the result without the commas? I am using the Hortonworks 2.4 distribution.
Thank you guys.

I guess you are using Zeppelin 0.5.x or 0.6.x.
In the latest version, you can change the number display format.
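If upgrading is not an option, a common workaround on older versions is to cast or format the numeric column in the query itself so that the table display renders it as plain text. A minimal HiveQL sketch (employees and salary are placeholder names):
%sql
-- cast the numeric column to a string so the table shows it without thousands separators
select name, cast(salary as string) as salary
from employees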

Related

How can I use HiveQL on Jupyter Notebook?

Is there any way I can write HiveQL queries directly in a Jupyter Notebook, without using a Python or R kernel?
I understand that Jupyter provides a HiveQL kernel, but it was last updated about 2 years ago, and they say "Some hive functions are extended," meaning not all Hive functions are implemented.
Please advise if you have any suggestions.
Thanks in advance!

CDH to HDP Hive

I have written a Hive UDF in Cloudera and we're migrating it to Hortonworks. When I try to apply the same UDF in the Hortonworks cluster, it throws me the error below.
Use the right dependencies with the correct versions; sit with the admin team regarding the versions and try to run it. LIMIT always scans a few records and applies the operation on that data instead of the whole dataset, so it worked for me when I applied the UDF with LIMIT. Any version you use, even the CDH version, will work if you use LIMIT. The problem comes when you apply it on the whole dataset: as my sample data is around 5 million records, it has to run a MapReduce job.
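For reference, a minimal HiveQL sketch of registering the UDF and testing it on a limited slice first (the jar path, class name, function name, and table/column names are placeholders):
ADD JAR /path/to/my-udf.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF';
-- sanity-check on a small slice; this avoids a full MapReduce pass over all ~5 million rows
SELECT my_udf(some_column) FROM my_table LIMIT 100;
-- once the dependency versions line up, drop the LIMIT to run over the whole dataset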

Can't access external Hive metastore with Pyspark

I am trying to run a simple piece of code that just shows the databases I previously created on my hive2-server. (Note: I tried this in both Python and Scala, with the same results.)
If I log in into a hive shell and list my databases I see a total of 3 databases.
When I start the Spark (2.3) shell via pyspark, I do the usual and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
and restart the SparkContext within my session.
If I run either of the following to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see that the parameter has been added. I then start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But if I list my databases:
hiveContext.sql("SHOW DATABASES").show()
it does not show the same results as the hive shell.
I'm a bit lost. For some reason it looks like it is ignoring the config parameter, even though I am sure the URI I'm using is my metastore: the address I get from running
hive -e "SET" | grep metastore.uris
is the same address I also get if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permission issue? Like some tables are not set to be seen outside the hive shell/user.
Thanks
Managed to solve the issue: due to a communication issue, Hive was not hosted on that machine. Corrected the code and everything works fine.
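For anyone who hits the same symptom for a different reason: configuration set after the SparkContext is already running is often ignored for the metastore connection, so a safer pattern is to pass hive.metastore.uris when the session is built and enable Hive support. A minimal PySpark sketch (the thrift address is a placeholder):
from pyspark.sql import SparkSession

# set the metastore URI before the session is created, not after
spark = SparkSession.builder \
    .master("local") \
    .appName("Hive_Test") \
    .config("hive.metastore.uris", "thrift://metastore-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()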

ExecuteSQL processor returns corrupted data

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of all the sub-partitions named dt from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I got back from the ExecuteSQL processor is corrupted data, including rows that have dt=NULL, while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use HiveJDBC4 jar.
Later I tried using the jar compatible with the CDH release; that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently my returned data includes extra rows... a few thousand of them, which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData, they have the table name prepended with a period (.) separator. This is some of the custom code that is in HiveJdbcCommon (used by SelectHiveQL) vs JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
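Depending on the setup, that property can be set per session before running the query, for example:
SET hive.query.result.fileformat=SequenceFile;
or configured cluster-wide in hive-site.xml.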

Saving/Exporting the results of a Spark SQL Zeppelin query

We're using Apache Zeppelin to analyse our datasets. We have some queries we would like to run that return a large number of results; we'd like to run them in Zeppelin but save the results, since the display is limited to 1000 rows. Is there an easy way to get Zeppelin to save all the results of a query to an S3 bucket, maybe?
I managed to whip up a notebook that effectively does what I want, using the Scala interpreter.
z.load("com.databricks:spark-csv_2.10:1.4.0")
val df= sqlContext.sql("""
select * from table
""")
df.repartition(1).write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("s3://amazon.bucket.com/csv_output/")
It's worth mentioning that the z.load call seemed to work for me one day, but then I tried it again and for some reason I had to declare it in its own paragraph with the %dep interpreter, and then run the remaining code in the standard Scala interpreter.
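In other words, what ended up working reliably was a dedicated dependency paragraph along these lines, run before the Spark interpreter is first used (restarting the interpreter if needed), followed by the query/write code above in a regular Scala paragraph:
%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")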