Is there any way I can write HiveQL queries directly in a Jupyter Notebook, without using the Python or R kernel?
I understand that Jupyter provides a HiveQL kernel, but it was last updated about two years ago, and they say "Some hive functions are extended.", meaning not all of the hive functions are implemented.
Please advise if you know of any.
Thanks in advance!
I'm working on my job computer and I'm forced to use MPython 2.2.3 for MineSight (Python 2.7.6). I need to work with Teradata; I was able to install the Teradata module, but I couldn't install the pandas module. I am able to connect to Teradata and run my query, but I don't know how to export my results to a CSV file without pandas. Is there a way to do this on Python 2.7?
There are basically two straightforward options:
Let Python write the CSV. You don't need pandas; the csv module is in the standard library (https://docs.python.org/2/library/csv.html). A sketch of this option follows below.
Let Teradata write the CSV. Teradata can export data as CSV itself (https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/ypCTxtJqcKR7Qh58vU7XpQ)
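Here is a minimal sketch of the first option, using only the standard-library csv module on Python 2.7. The host, credentials, query, and file name are placeholders, and it assumes the teradata module's UdaExec/ODBC connection that you say is already working:
# Minimal sketch: export a Teradata query to CSV on Python 2.7 using only
# the standard-library csv module. Host, credentials, query, and file name
# are placeholders -- adjust them for your environment.
import csv
import teradata

udaExec = teradata.UdaExec(appName="CsvExport", version="1.0", logConsole=False)
with udaExec.connect(method="odbc", system="tdhost", username="user", password="pass") as session:
    cursor = session.execute("SELECT * FROM my_database.my_table")  # your working query
    with open("results.csv", "wb") as f:  # binary mode for csv on Python 2
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        for row in cursor:
            writer.writerow(list(row))  # one CSV line per result row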
I have some datasets in BigQuery, and I wonder if there is a way to use the same datasets in Data Lab? As the datasets are big, I can't download them and reload them in Data Lab.
Thank you very much.
The BigQuery Python client library supports querying data stored in BigQuery. To load the commands from the client library, paste the following code into the first cell of the notebook:
%load_ext google.cloud.bigquery
%load_ext is one of the many Jupyter built-in magic commands.
The BigQuery client library provides a %%bigquery cell magic, which runs a SQL query and returns the results as a pandas DataFrame.
You can query data from a public dataset or from the datasets in your project:
%%bigquery
SELECT *
FROM `MY_PROJECT.MY_DATASET.MY_TABLE`
LIMIT 50
I was able to successfully get data from the dataset without any issues.
You can follow this tutorial. I hope it helps.
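If you prefer calling the client library directly instead of the cell magic, here is a minimal sketch (the project, dataset, and table names are placeholders, and it assumes the notebook environment already has credentials and a default project configured):
# Minimal sketch: run the same query with the client library instead of
# the %%bigquery cell magic. Project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses the environment's default project and credentials
query = """
    SELECT *
    FROM `MY_PROJECT.MY_DATASET.MY_TABLE`
    LIMIT 50
"""
df = client.query(query).to_dataframe()  # results as a pandas DataFrame
df.head()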
I am trying to run some simple code that just shows the databases I created previously on my hive2 server (note: in this example I have both Python and Scala versions, with the same results).
If I log in to a hive shell and list my databases, I see a total of 3 databases.
When I start the Spark shell (2.3) with pyspark, I do the usual and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
And restart the SparkContext within my session.
If I run either of the following lines to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see that the parameter has been added. I then start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But if I list my databases:
hiveContext.sql("SHOW DATABASES").show()
It does not show the same results as the hive shell.
I'm a bit lost; for some reason it looks like it is ignoring the config parameter. I'm sure the URI I'm using points to my metastore, because the address I get from running:
hive -e "SET" | grep metastore.uris
is the same address I also get if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permissions issue, e.g. some tables are not set to be visible outside the hive shell/user?
Thanks
Managed to solve the issue: because of a communication issue, Hive was not hosted on that machine. I corrected the code and everything works fine.
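For reference, here is a minimal sketch of pointing PySpark at an external Hive metastore when the session is created, rather than via setConf afterwards; the thrift host below is a placeholder for the real metastore address:
# Minimal sketch: set the metastore URI when the session is built.
# "metastore-host" is a placeholder for the real thrift address.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Hive_Test")
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         .enableHiveSupport()        # use the Hive catalog instead of the in-memory one
         .getOrCreate())

spark.sql("SHOW DATABASES").show()   # should now match what the hive shell lists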
I have a problem with Zeppelin. Zeppelin gives me the salary output with commas. Is there any possible way to print the result without the commas? I am using the Hortonworks 2.4 distribution.
Thank you guys.
I guess you are using Zeppelin 0.5.x or 0.6.x.
In the latest version, you can change the number display format.
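If upgrading is not an option, one possible workaround (not part of the answer above, just a sketch) is to format the value in the query itself so the display layer has nothing to reformat, for example by casting the numeric column to a string in pyspark; the table and column names here are placeholders:
# Sketch of a query-side workaround: cast the numeric column to a string
# so the table display applies no thousands separators.
# "employees" and "salary" are placeholder names.
from pyspark.sql import functions as F

df = sqlContext.table("employees")
df.select(F.col("salary").cast("string").alias("salary")).show()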
We're using Apache Zeppelin to analyse our datasets. We have some queries we would like to run that return a large number of results, and we would like to run the query in Zeppelin but save all of the results (the display is limited to 1000 rows). Is there an easy way to get Zeppelin to save all the results of a query, maybe to an S3 bucket?
I managed to whip up a notebook that effectively does what I want using the Scala interpreter.
// Load the spark-csv package (see the note below about using %dep instead)
z.load("com.databricks:spark-csv_2.10:1.4.0")

// Run the query whose full results we want to export
val df = sqlContext.sql("""
select * from table
""")

// Collapse to a single partition and write the result as one CSV file to S3
df.repartition(1).write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("s3://amazon.bucket.com/csv_output/")
It's worth mentioning that the z.load call seemed to work for me one day, but then I tried it again and for some reason I had to declare the dependency in its own paragraph with the %dep interpreter, and then run the remaining code in the standard Scala interpreter.