How can I connect to hive using pyspark? - hive

I'm trying to create a table in Hive, but Spark creates a folder such as testdb.db inside the spark-warehouse folder instead. How can I store the table directly in Hive, the way we store data in MySQL/MongoDB databases?
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf().setAppName("data_import")
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "2")
sqlContext.sql("CREATE DATABASE testdb")
sqlContext.sql("use testdb")
sqlContext.sql("create table daily_revenue(order_date string, daily_revenue float)")

When you create a table in Hive, the metadata is stored in a relational database (whichever one is configured as the metastore for your environment), and the actual data is stored in the warehouse directory on HDFS if it is a managed table.
Similarly, when you create a Hive table from Spark, it first creates a folder named <database>.db and, inside it, another folder named after the table, which in turn holds the data on HDFS.
So in your case you should have a <warehouse_dir>/testdb.db/table folder, and any data you load into this table will be stored inside that table folder.
Hope it helps.
Regards,
Neeraj
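The folder layout described in this answer can be sketched as a small path helper. This is only an illustration, not part of Spark or Hive: table_location is a hypothetical name, the warehouse directory is whatever your environment is configured with, and the sketch assumes a managed table in a non-default database whose names are stored in lower case.

```python
import posixpath


def table_location(warehouse_dir: str, database: str, table: str) -> str:
    """Return the directory where a managed table's data files are expected.

    Hypothetical helper: it only mirrors the <warehouse_dir>/<db>.db/<table>
    layout described above and does not talk to Spark or Hive.
    """
    # Assumption: database and table names are stored in lower case,
    # so the directory names are lower case as well.
    return posixpath.join(warehouse_dir, f"{database.lower()}.db", table.lower())


print(table_location("spark-warehouse", "testdb", "daily_revenue"))
# prints: spark-warehouse/testdb.db/daily_revenue
```

Listing that directory after loading data is a quick way to check that the table is stored where you expect.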

>>> sqlContext.sql("create database if not exists demo")
>>> sqlContext.sql("show tables in demo").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+
>>> sqlContext.sql("create table demo.dummy (id int, name string)")
>>> sqlContext.sql("show tables in demo").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    dummy|      false|
+---------+-----------+
>>> sqlContext.sql("desc demo.dummy").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|      id|      int|   null|
|    name|   string|   null|
+--------+---------+-------+

Related

Databricks - alter complex data type in hive table

I have a delta table which is partitioned with below structure in Databricks
|--Name
|--Group
|  |--Computer
|  |  |--C
|  |  |--Java
|  |--Biology
|  |--Commerce
|--Date
I am trying to add a new struct field under Group.Computer.
Expected Schema:
|--Name
|--Group
|  |--Computer
|  |  |--C
|  |  |--Java
|  |  |--Python
|  |--Biology
|  |--Commerce
|--Date
I am trying to do this using below alter command
ALTER TABLE <TABLE_NAME> CHANGE COLUMN GROUP GROUP STRUCT<COMPUTER:STRUCT<C:STRING, JAVA:STRING, PYTHON STRING>,BIOLOGY:STRING, COMMERCE:STRING>
But I get the following error:
cannot update spark_catalog.table_name field computer type:update a struct by updating its fields;
How to alter complex data types of existing table in databricks?
You need to add it as a column, using the fully qualified name in dotted notation. The Databricks SQL documentation says so explicitly:
alter table <TABLE_NAME> add column group.computer.cobol string
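If you need to add several nested fields, the statement above can be generated from the field path. The following is a minimal sketch: add_nested_column_sql is a hypothetical helper, my_table is a placeholder table name, and the function only builds the SQL string in the same form as the statement above. It does not validate the path against the table schema.

```python
from typing import Sequence


def add_nested_column_sql(table: str, column_path: Sequence[str], data_type: str) -> str:
    """Build an ALTER TABLE ... ADD COLUMN statement for a nested struct field.

    Hypothetical helper: column_path lists the struct names from the top-level
    column down to the new field, e.g. ["group", "computer", "python"].
    """
    if not column_path:
        raise ValueError("column_path must contain at least one name")
    # Join the path with dots to get the fully qualified field name.
    dotted = ".".join(column_path)
    return f"ALTER TABLE {table} ADD COLUMN {dotted} {data_type}"


stmt = add_nested_column_sql("my_table", ["group", "computer", "python"], "string")
print(stmt)
# prints: ALTER TABLE my_table ADD COLUMN group.computer.python string
```

The resulting string would then be passed to spark.sql(stmt) in a notebook.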

synapse pipeline copied data to azure synapse data lake table - serverless sql select shows rows, but spark sql select does not

I made a new Azure synapse data lake database and table:
CREATE DATABASE IF NOT EXISTS db1 LOCATION '/db1';
CREATE TABLE IF NOT EXISTS db1.tbl1(id int, name string) USING CSV (header=true);
I then ran a synapse pipeline to copy data from a source to the ADLS sink for this table. I see the expected csv file in the ADLS container and folder - synapse-workspace-container/db1/tbl1/employee.csv:
id,name
1,adam
2,bob
3,charles
Running a serverless SQL select statement I see my rows:
SELECT TOP (100) [id]
,[name]
FROM [db1].[dbo].[tbl1]
+---+-------+
| id|   name|
+---+-------+
|  1|   adam|
|  2|    bob|
|  3|charles|
+---+-------+
Running a pyspark sql select I see no rows:
sdf=spark.sql("SELECT * FROM db1.tbl1 ORDER BY id ASC")
sdf.show()
+---+----+
| id|name|
+---+----+
+---+----+
Why are no rows showing for spark sql?
I suggest creating a global temp view. You can then use this view from any notebook, as long as your cluster is not terminated. You can create a global temp view as follows:
df.createOrReplaceGlobalTempView("temp_view")
See the documentation below:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-view.html

Impala ACID table select ERROR: Operation not supported on transactional (ACID) table:

I'm using impala 3.4 directly with hive 3.1.
The problem is that when I create an ordinary table in Hive and then select from it in Impala, an error occurs.
The error message is as follows:
Query: show tables
+----------+
| name     |
+----------+
| customer |
| lineitem |
| nation   |
| orders   |
| part     |
| partsupp |
| region   |
| supplier |
| t        |
+----------+
Fetched 9 row(s) in 0.02s
[host.cluster.com] default> select * from customer;
Query: select * from customer
Query submitted at: 2020-11-20 09:56:12 (Coordinator: http://host.cluster.com:25000)
ERROR: AnalysisException: Operation not supported on transactional (ACID) table: default.customer
As I understand it, in Hive the difference between an ACID table and a plain ORC table only matters for DELETE and UPDATE, and SELECT works on both.
In fact, the same SELECT statement runs normally through Hive JDBC. It fails only in Impala, and I would like help understanding why this error occurs.
I solved this problem and confirmed that a table created through Hive now works normally in Impala.
There are two possible causes:
An Impala build made for Hive 2 was connected to Hive 3 databases.
When the Hive table was created, a default ACID-related flag was set that I was not aware of.
This version can't read ACID tables that were created by Hive, and Hive creates ACID tables by default.

How to drop a column from a Databricks Delta table?

I have recently started discovering Databricks and faced a situation where I need to drop a certain column of a delta table. When I worked with PostgreSQL it was as easy as
ALTER TABLE main.metrics_table
DROP COLUMN metric_1;
I was looking through the Databricks documentation on DELETE, but it only covers deleting the rows that match a predicate.
I've also found docs on DROP database, DROP function and DROP table but absolutely nothing on how to delete a column from a delta table. What am I missing here? Is there a standard way to drop a column from a delta table?
There is no drop column option on Databricks tables: https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#delta-schema-constructs
Remember that, unlike in a relational database, there are physical Parquet files in your storage, and your "table" is just a schema that has been applied to them.
In the relational world you can easily update the table metadata to remove a column. In the big data world you have to rewrite the underlying files.
Technically, Parquet can handle schema evolution (see Schema evolution in parquet format), but the Databricks implementation of Delta does not. It is probably just too complicated to be worth it.
Therefore the solution in this case is to create a new table and insert the columns you want to keep from the old table.
Use the code below:
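# Read the table and drop the unwanted column from the DataFrame.
df = spark.sql("SELECT * FROM <DB Name>.<Table Name>")
df1 = df.drop("<Column Name>")
# Keep the original table under an _OLD name as a backup.
spark.sql("DROP TABLE IF EXISTS <DB Name>.<Table Name>_OLD")
spark.sql("ALTER TABLE <DB Name>.<Table Name> RENAME TO <DB Name>.<Table Name>_OLD")
# Write the reduced DataFrame back under the original table name.
df1.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("<DB Name>.<Table Name>")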
One way I found to make this work is to first drop the table and then recreate it from the dataframe with the overwriteSchema option set to true. You also need mode = overwrite, so that the physical files are recreated using the new schema that the dataframe contains.
Breakdown of the steps:
1. Read the table into a dataframe.
2. Drop the columns that you don't want in your final table.
3. Drop the actual table from which you read the data.
4. Save the dataframe, with the columns dropped, under the same table name. Make sure you use both options when saving the dataframe as a table: .mode("overwrite").option("overwriteSchema", "true")
These steps recreate the same table with the extra column(s) removed.
Hope it helps someone facing a similar issue.
Databricks Runtime 10.2+ supports dropping columns if you enable column mapping mode:
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
)
After that, drops will work:
ALTER TABLE table_name DROP COLUMN col_name
ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2, ...)
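The two statements above can be generated together for a given table. This is a minimal sketch under the assumptions of this answer (column mapping mode with reader version 2 and writer version 5). drop_column_statements is a hypothetical helper that only builds the SQL strings and does not run them.

```python
from typing import List, Sequence


def drop_column_statements(table: str, columns: Sequence[str]) -> List[str]:
    """Return the SQL needed to enable column mapping and then drop columns.

    Hypothetical helper: it builds strings only, so the caller decides how to
    execute them (for example with spark.sql in a notebook).
    """
    if not columns:
        raise ValueError("at least one column name is required")
    # Table properties taken from the statement shown above.
    enable_mapping = (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.minReaderVersion' = '2', "
        "'delta.minWriterVersion' = '5', "
        "'delta.columnMapping.mode' = 'name')"
    )
    if len(columns) == 1:
        drop = f"ALTER TABLE {table} DROP COLUMN {columns[0]}"
    else:
        drop = f"ALTER TABLE {table} DROP COLUMNS ({', '.join(columns)})"
    return [enable_mapping, drop]


for statement in drop_column_statements("my_table", ["col_a", "col_b"]):
    print(statement)
```

In a notebook each returned string would be passed to spark.sql in order. Note that raising the reader and writer versions is a table protocol upgrade, so it is worth checking that every client reading the table supports it first.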
You can overwrite the table without the column if the table isn't too large.
df = spark.read.table('table')
df = df.drop('col')
df.write.format('delta')\
    .option("overwriteSchema", "true")\
    .mode('overwrite')\
    .saveAsTable('table')
As of Delta Lake 1.2, you can drop columns, see the latest ALTER TABLE docs.
Here's a fully working example if you're interested in a snippet you can run locally:
# create a Delta Lake table
columns = ["language","speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.write.format("delta").saveAsTable("default.my_cool_table")
spark.sql("select * from `my_cool_table`").show()
+--------+--------+
|language|speakers|
+--------+--------+
|Mandarin|     1.1|
| English|     1.5|
|   Hindi|     0.6|
+--------+--------+
Here's how to drop the language column:
spark.sql("""ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')""")
spark.sql("alter table `my_cool_table` drop column language")
Verify that the language column isn't included in the table anymore:
spark.sql("select * from `my_cool_table`").show()
+--------+
|speakers|
+--------+
|     1.1|
|     1.5|
|     0.6|
+--------+
This approach works only if the column was added after the table was created.
If so, and if you can recover the data inserted after the table was altered, you can use the table history to restore the table to a previous version.
With
DESCRIBE HISTORY <TABLE_NAME>
you can check all the available versions of your table (the 'ADD COLUMN' operation creates a new table version).
Afterwards, RESTORE lets you return the table to any available version.
RESTORE <TABLE_NAME> VERSION AS OF <VERSION_NUMBER>
Here you have more information about TIME TRAVEL
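Picking the version to restore can be sketched in plain Python once the history rows have been collected. Everything here is illustrative: version_before and restore_statement are hypothetical helpers, the sample history is made up, and the operation string has to be copied from your own DESCRIBE HISTORY output, since its exact spelling may differ from the one used in the example.

```python
from typing import Dict, List, Optional


def version_before(history: List[Dict], operation: str) -> Optional[int]:
    """Return the version just before the most recent entry with this operation.

    history is a list of rows with "version" and "operation" keys, as you
    would get by collecting the output of DESCRIBE HISTORY.
    """
    matching = [row["version"] for row in history if row["operation"] == operation]
    if not matching:
        return None
    target = max(matching) - 1
    # Version 0 is the first version, so there is nothing before it.
    return target if target >= 0 else None


def restore_statement(table: str, version: int) -> str:
    """Build the RESTORE statement for the chosen version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


# Made-up history for illustration only.
sample_history = [
    {"version": 2, "operation": "WRITE"},
    {"version": 1, "operation": "ADD COLUMNS"},
    {"version": 0, "operation": "CREATE TABLE"},
]
version = version_before(sample_history, "ADD COLUMNS")
print(restore_statement("my_table", version))
# prints: RESTORE TABLE my_table TO VERSION AS OF 0
```

Keep in mind that restoring also discards any writes made after the chosen version, which is why this only helps if that data can be recovered.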

HIVE ORC returns NULLs

I am creating a Hive external table in ORC format (the ORC file is located on S3).
Command:
CREATE EXTERNAL TABLE Table1 (Id INT, Name STRING) STORED AS ORC LOCATION 's3://bucket_name'
After running the query:
Select * from Table1;
Result is:
+-------------------------------------+---------------------------------------+
| Table1.id | Table1.name |
+-------------------------------------+---------------------------------------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+-------------------------------------+---------------------------------------+
Interestingly, the number of returned records (10) is correct, but every value is NULL.
What is wrong? Why does the query return only NULLs?
I am using EMR instances on AWS. Is there anything I should configure or check for Hive to support the ORC format?
I used your sample ORC file to create an external table in Hive, and I was able to see the data in the output.
You can also use the ORC dump utility to inspect the metadata of the ORC file in JSON format.
hive --orcfiledump -j -p <Location of Orc File>
Try loading the data with the LOAD statement or creating a managed table. For what it's worth, I tried both and got the data back in each case, so I really don't see anything wrong with your statements.
For more information, see the ORC Dump documentation.
I ran into the same issue with EMR Hive and ORC files located in S3.
The problem was a mismatch between the field names in the ORC schema and the Hive field names.
In my case the names had to match exactly, including case. Note also that Hive converts camelCase field names to lower case.
In your case it would be better to create the table like this:
CREATE EXTERNAL TABLE Table1 (id INT, name STRING) STORED AS ORC LOCATION 's3://bucket_name'
And when creating .orc files use schema like:
private final TypeDescription SCHEMA = TypeDescription.createStruct()
    .addField("id", TypeDescription.createInt())
    .addField("name", TypeDescription.createString());
In this case the Hive field names match the field names in the ORC schema, and EMR Hive was able to read the values from those files.
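A quick way to spot this kind of mismatch before creating the table is to compare the two name lists. This is a rough sketch based only on the behaviour described in this answer: find_name_mismatches is a hypothetical helper, and it assumes Hive stores column names in lower case and that ORC field names must match them exactly.

```python
from typing import List, Sequence


def find_name_mismatches(orc_fields: Sequence[str], hive_columns: Sequence[str]) -> List[str]:
    """Return the ORC field names that Hive would not be able to match.

    Assumption: Hive lower-cases the column names from the DDL, and an ORC
    field is matched only if its name is exactly equal to one of them.
    """
    hive_names = {name.lower() for name in hive_columns}
    return [field for field in orc_fields if field not in hive_names]


# ORC file written with capitalised field names: both fields would read as NULL.
print(find_name_mismatches(["Id", "Name"], ["Id", "Name"]))
# prints: ['Id', 'Name']

# ORC file written with lower-case field names: nothing to fix.
print(find_name_mismatches(["id", "name"], ["Id", "Name"]))
# prints: []
```

The ORC field names themselves can be read from the output of the ORC dump utility.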
The issue I had was with the case of the column names in the Hive table: if your ORC file has column names in upper case, the Hive table must use the same case. I used a Spark DataFrame to convert the column names to lower case:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read the ORC file whose column names do not match the Hive table.
orc_df = sqlContext.read.orc("hdfs://data/part-00000.snappy.orc")
# Alias every column to its lower-case name.
new_orc_df = orc_df.select([F.col(x).alias(x.lower()) for x in orc_df.columns])
new_orc_df.printSchema()
# The original os.path.join(tempfile.mkdtemp(), '/home/...') call resolved to this absolute path.
new_orc_df.write.orc('/home/hadoop/vishrant/data', mode='overwrite')