I am creating a Hive external table stored as ORC (the ORC files are located on S3).
Command:
CREATE EXTERNAL TABLE Table1 (Id INT, Name STRING) STORED AS ORC LOCATION 's3://bucket_name'
After running the query:
Select * from Table1;
Result is:
+-------------------------------------+---------------------------------------+
| Table1.id | Table1.name |
+-------------------------------------+---------------------------------------+
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+-------------------------------------+---------------------------------------+
Interestingly, the number of returned records is 10, which is correct, but all of the values are NULL.
What is wrong? Why does the query return only NULLs?
I am using EMR instances on AWS. Is there something I should configure or check so that Hive supports the ORC format?
I used your sample ORC file and tried to CREATE an external table in Hive, and I was able to see the data output.
You can also use the ORC dump utility to inspect the metadata of the ORC file in JSON format:
hive --orcfiledump -j -p <Location of Orc File>
Try loading the data using the LOAD statement or by creating a managed table; JFYI, I tried them all and was able to get the data. :) I really don't find anything wrong with your statements.
You can also check this link for more information: ORC Dump.
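If Spark is available on the EMR cluster, another quick check (just a sketch, assuming a SparkSession named spark and the s3://bucket_name location from the question) is to read the ORC files directly and compare the printed schema with the columns declared in the CREATE EXTERNAL TABLE statement:
# Read the ORC files backing the external table and show their schema, so the
# field names and types can be compared against the Hive DDL (Id INT, Name STRING)
orc_df = spark.read.orc("s3://bucket_name/")
orc_df.printSchema()
orc_df.show(10)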
I ran into the same issue with EMR Hive and ORC files located in S3.
The problem was a mismatch between the field names in the ORC schema and the Hive column names.
In my case the names had to match 100% (they are case-sensitive); also note that Hive translates camelCase field names to lowercase.
In your case it would be better to create the table like:
CREATE EXTERNAL TABLE Table1 (id INT, name STRING) STORED AS ORC LOCATION 's3://bucket_name'
And when creating the .orc files, use a schema like:
private final TypeDescription SCHEMA = TypeDescription.createStruct()
.addField("id", TypeDescription.createInt())
.addField("name", TypeDescription.createString());
In this case the Hive column names match the field names in the ORC schema, and EMR Hive is able to read values from those files.
The issue I had was with the case of the column names in the Hive table: if your ORC file has column names in upper case, then the Hive table should use the same case. I used a Spark DataFrame to convert the columns to lower case:
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the ORC file and rename every column to its lower-case form
orc_df = sqlContext.read.orc("hdfs://data/part-00000.snappy.orc")
new_orc_df = orc_df.select([F.col(x).alias(x.lower()) for x in orc_df.columns])
new_orc_df.printSchema()

# Write the lower-cased data back out as ORC
new_orc_df.write.orc('/home/hadoop/vishrant/data', mode='overwrite')
I have a partitioned Delta table in Databricks with the below structure:
|--Name
|--Group
|--Computer
| |--C
| |--Java
|--Biology
|--Commerce
|--Date
I am trying to add a new struct field under Group.Computer.
Expected Schema:
|--Name
|--Group
|--Computer
| |--C
| |--Java
| |--Python
|--Biology
|--Commerce
|--Date
I am trying to do this using the below ALTER command:
ALTER TABLE <TABLE_NAME> CHANGE COLUMN GROUP GROUP STRUCT<COMPUTER:STRUCT<C:STRING, JAVA:STRING, PYTHON:STRING>, BIOLOGY:STRING, COMMERCE:STRING>
But I get an error like below:
cannot update spark_catalog.table_name field computer type:update a struct by updating its fields;
How do I alter complex data types of an existing table in Databricks?
You need to add it as a column, using the fully qualified name in dotted notation; the Databricks SQL documentation explicitly mentions this:
alter table <TABLE_NAME> add column group.computer.cobol string
I made a new Azure Synapse lake database and table:
CREATE DATABASE IF NOT EXISTS db1 LOCATION '/db1';
CREATE TABLE IF NOT EXISTS db1.tbl1(id int, name string) USING CSV (header=true);
I then ran a Synapse pipeline to copy data from a source to the ADLS sink for this table. I see the expected CSV file in the ADLS container and folder (synapse-workspace-container/db1/tbl1/employee.csv):
id,name
1,adam
2,bob
3,charles
Running a serverless SQL SELECT statement, I see my rows:
SELECT TOP (100) [id]
,[name]
FROM [db1].[dbo].[tbl1]
+---+-------+
| id| name|
+---+-------+
| 1| adam|
| 2| bob|
| 3|charles|
+---+-------+
Running a PySpark SQL SELECT, I see no rows:
sdf=spark.sql("SELECT * FROM db1.tbl1 ORDER BY id ASC")
sdf.show()
+---+----+
| id|name|
+---+----+
+---+----+
Why are no rows showing for Spark SQL?
I would suggest creating a global temp view. You can then use this view from any notebook you want, as long as your cluster is not terminated. Having said that, you could create the global temp view as below:
df.createOrReplaceGlobalTempView("temp_view")
Kindly refer to the below document:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-view.html
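For completeness, here is a minimal sketch (assuming a SparkSession named spark and the db1.tbl1 table from the question) of creating the view and reading it back; global temp views live in the reserved global_temp database, so that prefix is required when querying:
# Register the table's rows as a global temp view
df = spark.sql("SELECT * FROM db1.tbl1")
df.createOrReplaceGlobalTempView("temp_view")

# From any other notebook attached to the same running cluster:
spark.sql("SELECT * FROM global_temp.temp_view ORDER BY id ASC").show()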
I'm using Impala 3.4 directly with Hive 3.1.
The problem is that if I create an ordinary table in Hive and then select from it in Impala, an error occurs.
The error message is as follows:
Query: show tables
+----------+
| name |
+----------+
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
| t |
+----------+
Fetched 9 row(s) in 0.02s
[host.cluster.com] default> select * from customer;
Query: select * from customer
Query submitted at: 2020-11-20 09:56:12 (Coordinator: http://host.cluster.com:25000)
ERROR: AnalysisException: Operation not supported on transactional (ACID) table: default.customer
I thought that for Hive ACID and ORC tables the distinction only mattered for deletes and updates, and that SELECT worked the same everywhere. In fact, the SELECT statement executes normally through Hive JDBC; only Impala fails. I would appreciate help understanding why this error occurs.
I solved this problem; tables created through Hive now work normally in Impala.
There were two possible causes:
1. Impala built against Hive 2 was connected to Hive 3 databases.
2. A default flag I had not recognized made Hive create ACID tables when creating a table.
This Impala version can't read ACID tables created by Hive, and Hive 3 creates ACID tables by default.
I'm trying to create a table in Hive, but it is creating a folder like testdb.db inside the spark-warehouse folder. How can I store the data directly in Hive, the way we store it in MySQL/MongoDB databases?
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("data_import")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "2")

sqlContext.sql("CREATE DATABASE testdb")
sqlContext.sql("use testdb")
sqlContext.sql("create table daily_revenue(order_date string, daily_revenue float)")
When you create a table in Hive, what happens behind the scenes is that Hive stores the metadata in a relational database (whichever one is configured for your environment), and the actual data is stored in the HDFS warehouse directory if it is a managed table.
Similarly, when you create a table in Hive from Spark, it first creates the <database>.db folder, and inside this folder it creates another folder with the table name, which in turn stores the data on HDFS.
So in your case, you should have a <warehouse_dir>/testdb.db/<table> folder, and
if you load any data into this table, it will be present inside that table folder.
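As a quick check (not a fix), you can ask the metastore where the table actually lives, using the same sqlContext as in the question; the Location field should point at <warehouse_dir>/testdb.db/daily_revenue:
# DESCRIBE FORMATTED prints the table metadata, including its Location
sqlContext.sql("DESCRIBE FORMATTED testdb.daily_revenue").show(100, truncate=False)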
Hope it helps.
Regards,
Neeraj
sqlContext.sql("create database if not exists demo")
>>> sqlContext.sql("show tables in demo").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
+---------+-----------+
sqlContext.sql("create table demo.dummy (id int, name string)")
>>> sqlContext.sql("show tables in demo").show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| dummy| false|
+---------+-----------+
>>> sqlContext.sql("desc demo.dummy").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| int| null|
| name| string| null|
+--------+---------+-------+
alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>)
I have altered my table with the above query, but afterwards, when checking the data, I got NULLs for both of the extra columns. I'm not getting the data.
CASCADE is the solution.
Query:
ALTER TABLE dbname.table_name ADD columns (column1 string,column2 string) CASCADE;
This changes the columns of a table's metadata and cascades the same change to all the partition metadata.
RESTRICT is the default, limiting column change only to table metadata.
As others have noted, CASCADE will change the metadata for all partitions. Without CASCADE, if you want old partitions to include the new columns, you'll need to DROP the old partitions first and then fill them; INSERT OVERWRITE without the DROP won't work, because the metadata won't be updated to the new default metadata.
Let's say you have already run alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>) without CASCADE by accident, and then you INSERT OVERWRITE an old partition without DROPPING first. The data will be stored in the underlying files, but if you query that table from Hive for that partition, it won't show, because the metadata wasn't updated. This can be fixed without having to rerun the INSERT OVERWRITE, using the following:
Run SHOW CREATE TABLE dbname.tblname and copy all the column definitions that existed before adding new columns
Run ALTER TABLE dbname.tblname REPLACE COLUMNS ({paste in col defs besides columns to add here}) CASCADE
Run ALTER TABLE dbname.tblname ADD COLUMNS (newcol1 int COMMENT "new col") CASCADE
be happy that the metadata has been changed for all partitions =)
As an example of steps 2-3:
DROP TABLE IF EXISTS junk.testcascade ;
CREATE TABLE junk.testcascade (
startcol INT
)
partitioned by (d int)
stored as parquet
;
INSERT INTO TABLE junk.testcascade PARTITION(d=1)
VALUES
(1),
(2)
;
INSERT INTO TABLE junk.testcascade PARTITION(d=2)
VALUES
(1),
(2)
;
SELECT * FROM junk.testcascade ;
+-----------------------+----------------+--+
| testcascade.startcol | testcascade.d |
+-----------------------+----------------+--+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+-----------------------+----------------+--+
--no cascade! oops
ALTER TABLE junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) ;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=3)
VALUES
(1,1,1),
(2,1,1)
;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=2)
VALUES
(1,1,1),
(2,1,1)
;
--okay! because we created this partition after altering the metadata
select * FROM junk.testcascade where d=3;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol | testcascade.testcol1 | testcascade.testcol2 | testcascade.d |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1 | 1 | 1 | 3 |
| 2 | 1 | 1 | 3 |
+-----------------------+-----------------------+-----------------------+----------------+--+
--not okay even though we inserted =( because the metadata isn't changed
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol | testcascade.testcol1 | testcascade.testcol2 | testcascade.d |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1 | NULL | NULL | 2 |
| 2 | NULL | NULL | 2 |
+-----------------------+-----------------------+-----------------------+----------------+--+
--cut back to original columns
ALTER TABLE junk.testcascade REPLACE COLUMNS( startcol int) CASCADE;
--add
ALTER table junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) CASCADE;
--it works!
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol | testcascade.testcol1 | testcascade.testcol2 | testcascade.d |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 2 |
+-----------------------+-----------------------+-----------------------+----------------+--+
To add columns to a partitioned table, you need to recreate the partitions.
Suppose the table is external and the data files already contain the new columns; do the following:
1. Alter table add columns...
2. Recreate the partitions. For each partition, do a DROP and then a CREATE. The newly created partition schema will inherit the table schema.
Alternatively, you can drop the table, then recreate the table and all of its partitions, or restore them simply by running the MSCK REPAIR TABLE abc command. The equivalent command in Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS.
See the manual here: RECOVER PARTITIONS
Also, in Hive 1.1.0 and later you can use the CASCADE option of ALTER TABLE ADD|REPLACE COLUMNS. See the manual here: ADD COLUMN
These suggestions work for external tables.
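If you happen to drive these statements through Spark SQL, as some of the other answers here do, a rough sketch of the drop-and-rediscover flow for an external table could look like this (dt is a hypothetical partition column used only for illustration, and spark is an existing Hive-enabled SparkSession):
# Add the new columns to the table metadata
spark.sql("ALTER TABLE abc ADD COLUMNS (stats1 MAP<STRING,STRING>, stats2 MAP<STRING,STRING>)")
# Drop only the partition metadata; an external table keeps its data files
spark.sql("ALTER TABLE abc DROP IF EXISTS PARTITION (dt='2020-01-01')")
# Rediscover partitions; newly registered partitions inherit the updated table schema
spark.sql("MSCK REPAIR TABLE abc")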
This solution only works if your data is partitioned and you know the location of the latest partition. In that case, instead of doing a recover-partitions or a repair, which is a costly operation, you can do something like the following:
Read the partitioned data and get its schema details.
Read the table you want to update.
Find which columns differ and run an ALTER TABLE for each of them.
Posting Scala code for reference:
import org.apache.spark.sql.SparkSession

def updateMetastoreColumns(spark: SparkSession, partitionedTablePath: String, toUpdateTableName: String): Unit = {
  // fetch all column names along with their corresponding datatypes from the latest partition
  val partitionedTable = spark.read.orc(partitionedTablePath)
  val partitionedTableColumns = partitionedTable.columns zip partitionedTable.schema.map(_.dataType.catalogString)

  // fetch all column names along with their corresponding datatypes from the current table
  val toUpdateTable = spark.read.table(toUpdateTableName)
  val toUpdateTableColumns = toUpdateTable.columns zip toUpdateTable.schema.map(_.dataType.catalogString)

  // check which columns are present only in the newer partition
  val diffColumns = partitionedTableColumns.diff(toUpdateTableColumns)

  // update the metastore with the new column info
  diffColumns.foreach { column: (String, String) =>
    spark.sql(s"ALTER TABLE ${toUpdateTableName} ADD COLUMNS (${column._1} ${column._2})")
  }
}
This will help you dynamically find the latest columns added to a newer partition and update your metastore on the fly.