Spark shell and Spark DataFrame give different results for Parquet files - Hive

I have data in the HDFS location /data/published/r6/omega, which is full of Parquet files. There is a column etl_cre_tmst that contains data. Reading the Parquet files directly shows the values:
val loc = "/data/published/r6/omega"
val df = sqlContext.read.parquet(loc)
df.select("etl_cre_tmst").show(10,false)
+---------------------+
|etl_cre_tmst         |
+---------------------+
|2019-03-08 04:41:10.0|
|2019-03-08 04:41:10.0|
|2019-03-08 04:41:10.0|
|2019-03-08 04:41:10.0|
|2019-03-08 04:41:10.0|
|2019-03-08 04:41:10.0|
+---------------------+
But when I try to access the same data via the Hive table, it shows only nulls:
val df = hc.sql("select etl_cre_tmst from db_r6.omega ")
df.show(10,false)
+---------------------+
|etl_cre_tmst         |
+---------------------+
|null                 |
|null                 |
|null                 |
|null                 |
|null                 |
+---------------------+
The Parquet file schema and data type for etl_cre_tmst match the Hive table schema and data type; the column is a timestamp in both the Parquet files and the Hive table.
Why do I get null values when I access the same data via spark-shell? When I query the same table via the Hive shell, it works; the issue is with spark-shell alone.
Can someone help?
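A minimal cross-check sketch (assuming the same spark-shell session, with both statements run through hc.sql(...); the direct parquet.`...` form needs Spark 1.6 or later) to compare what the metastore reports with what Spark infers from the files:
-- what the Hive metastore reports for the table
DESCRIBE FORMATTED db_r6.omega;
-- what Spark reads when going straight to the files
SELECT etl_cre_tmst FROM parquet.`/data/published/r6/omega` LIMIT 10;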

Related

Querying struct within array - Databricks SQL

I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.
I have the following table:
id | array
1  | [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]
2  | [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}]
In a different SQL editor, I was able to achieve this by doing the following:
SELECT
  id,
  struct.firstName
FROM
  table
  CROSS JOIN UNNEST(array) as t(struct)
With a resulting table of:
id | firstName
1  | John
1  | Jane
2  | Bob
2  | Betty
Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error.
[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.
I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.
1. SQL API
The first solution uses the SQL API. The first code snippet prepares the test case, so you can ignore it if you already have it in place.
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField("people", ArrayType(StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('age', StringType(), True)
    ])), True)
])

sql_df = spark.createDataFrame([
    (1, [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]),
    (2, [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}])
], schema)

sql_df.createOrReplaceTempView("sql_df")
What you need to use is the LATERAL VIEW clause (docs), which allows you to explode the nested structures, like this:
SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) sql_df AS exploded;
+---+---------+
| id|firstName|
+---+---------+
|  1|     John|
|  1|     Jane|
|  2|      Bob|
|  2|    Betty|
+---+---------+
2. DataFrame API
The alternative approach is to use the explode method (docs), which gives you the same results, like this:
from pyspark.sql.functions import explode, col
sql_df.select("id", explode(col("people.firstName"))).show()
+---+-----+
| id|  col|
+---+-----+
|  1| John|
|  1| Jane|
|  2|  Bob|
|  2|Betty|
+---+-----+
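If you prefer to stay in SQL, Spark SQL also accepts the generator directly in the select list, which yields the same rows with a named column (a minimal sketch against the sql_df temp view defined above):
SELECT id, explode(people.firstName) AS firstName
FROM sql_df;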

Create Hive table from nested JSON data with flattened fields

I want to create an external Hive table from nested JSON data, but the fields should be flattened out of the nested JSON.
For example:
{
  "key1": "value1",
  "key2": {
    "nestedKey1": 1,
    "nestedKey2": 2
  }
}
The Hive table should have the fields flattened out, like:
key1: String, key2.nestedKey1: Int, key2.nestedKey2: Int
Thanks in advance.
Use JsonSerDe and create the table with the syntax below:
hive> create table sample(key1 string,key2 struct<nestedKey1:int,nestedKey2:int>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
hive> select key1,key2.nestedkey1,key2.nestedkey2 from sample;
+---------+-------------+-------------+--+
|  key1   | nestedkey1  | nestedkey2  |
+---------+-------------+-------------+--+
| value1  | 1           | 2           |
+---------+-------------+-------------+--+
hive> select * from sample;
+--------------+----------------------------------+--+
| sample.key1  |           sample.key2            |
+--------------+----------------------------------+--+
| value1       | {"nestedkey1":1,"nestedkey2":2}  |
+--------------+----------------------------------+--+
(or)
If you want to create the table with the JSON fields flattened out, then use RegexSerDe with a matching regex to extract the nested keys from the data.
Refer to this link for more details regarding RegexSerDe.
UPDATE:
Input data:
{"key1":"value1","key2":{"nestedKey1":1,"nestedKey2":2}}
Hive table:
hive> CREATE TABLE dd (key1 string, nestedKey1 string, nestedKey2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
('input.regex'=".*:\"(.*?)\",\"key2\":\\{\"nestedKey1\":(\\d),\"nestedKey2\":(\\d).*$");
Select data from the table:
hive> select * from dd;
+---------+-------------+-------------+--+
|  key1   | nestedkey1  | nestedkey2  |
+---------+-------------+-------------+--+
| value1  | 1           | 2           |
+---------+-------------+-------------+--+
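If flattened storage is not strictly required, another lightweight option is to keep the JsonSerDe table from the first approach and expose the nested keys as plain columns through a view (a minimal sketch reusing the sample table defined above):
hive> create view sample_flat as
select key1, key2.nestedKey1 as nestedKey1, key2.nestedKey2 as nestedKey2
from sample;
hive> select * from sample_flat;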

How to analyze the contents of an fsimage via Hive queries

Help needed, please
I have downloaded the fsimage and converted it into a delimited CSV file via the OIV tool.
I also created a Hive table and loaded the CSV file into it.
I am not very familiar with SQL, hence querying the data is difficult.
For example, each record in the file looks something like this:
/tmp/hive/ltonakanyan/9c01cc22-55ef-4410-9f55-614726869f6d/hive_2017-05-08_08-44-39_680_3710282255695385702-113/-mr-10000/.hive-staging_hive_2017-05-08_08-44-39_680_3710282255695385702-113/-ext-10001/000044_0.deflate|3|2017-05-0808:45|2017-05-0808:45|134217728|1|176|0|0|-rw-r-----|ltonakanyan|hdfs
/data/lz/cpi/ofz/zd/cbt_ca_verint/new_data/2017-09-27/253018001769667.xml | 3| 2017-09-2723:41| 2017-09-2817:09| 134217728| 1| 14549| 0| 0| -rw-r----- | bc55_ah_appid| hdfs
Table description is:
| hdfspath | string
| replication | int
| modificationtime | string
| accesstime | string
| preferredblocksize | int
| blockscount | int
| filesize | bigint
| nsquota | bigint
| dsquota | bigint
| permissionx | string
| userx | string
| groupx | string
I need to know how to query only /tmp and /data with their file sizes, then go down to the second level (/tmp/hive, /data/lz), and then to subsequent levels, again with file sizes.
I created something like this:
select substr(hdfspath, 2, instr(substr(hdfspath,2), '/')-1) zone,
sum(filesize)
from example
group by substr(hdfspath, 2, instr(substr(hdfspath,2), '/')-1);
But it's not giving the data; the file sizes are all in bytes.
select joinedpath, sumsize
from
(
select joinedpath,round(sum(filesize)/1024/1024/1024,2) as sumsize
from
(select concat('/',split(hdfspath,'\/')[1]) as joinedpath,accesstime,filesize, userx
from default.hdfs_meta_d
)t
where joinedpath != 'null'
group by joinedpath
)h
Please check the query above; it can help you!
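To go one level deeper (e.g. /tmp/hive, /data/lz), the same pattern can be extended by concatenating the second path component as well; a hedged sketch against the same default.hdfs_meta_d table:
select joinedpath, round(sum(filesize)/1024/1024/1024,2) as sumsize_gb
from
(select concat('/',split(hdfspath,'\/')[1],'/',split(hdfspath,'\/')[2]) as joinedpath, filesize
 from default.hdfs_meta_d
)t
where joinedpath is not null
group by joinedpath
order by sumsize_gb desc;
Paths with only one component produce a null joinedpath here (concat returns null when the second component is missing), which the where clause filters out.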
This job is failing due to a heap memory error. Try increasing the heap size before executing the hdfs oiv command.
export HADOOP_OPTS="-Xmx4096m"
If the command is still failing, you might need to move the fsimage to a different machine/server that has more memory, and increase the heap memory using the environment variable above.

Not able to query records from Hive when data is stored in Avro format; returns "error_error..." exception

We have followed the steps below.
Imported a table from MySQL to the HDFS location user/hive/warehouse/orders/; the table schema is:
mysql> describe orders;
+-------------------+-------------+------+-----+---------+-------+
| Field             | Type        | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------+-------+
| order_id          | int(11)     | YES  |     | NULL    |       |
| order_date        | varchar(30) | YES  |     | NULL    |       |
| order_customer_id | int(11)     | YES  |     | NULL    |       |
| order_items       | varchar(30) | YES  |     | NULL    |       |
+-------------------+-------------+------+-----+---------+-------+
Created an External Table in Hive using the same data from (1).
CREATE EXTERNAL TABLE orders
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/retail_stage.db/orders'
TBLPROPERTIES ('avro.schema.url'='hdfs://host_name//tmp/sqoop-cloudera/compile/bb8e849c53ab9ceb0ddec7441115125d/orders.avsc');
Sqoop command:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=root \
--password=cloudera \
--table orders \
--target-dir /user/hive/warehouse/retail_stage.db/orders \
--as-avrodatafile \
--split-by order_id
describe formatted orders returns an error; I tried many combinations but failed.
hive> describe orders;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
Time taken: 1.15 seconds, Fetched: 7 row(s)
The same thing worked for --as-textfile, whereas it throws an error in the case of --as-avrodatafile.
I referred to some Stack Overflow posts but was not able to resolve it. Any ideas?
I think the reference to the Avro schema file in TBLPROPERTIES should be checked.
Does the following resolve?
hdfs dfs -cat hdfs://host_name//tmp/sqoop-cloudera/compile/bb8e849c53ab9ceb0ddec7441115125d/orders.avsc
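If that cat fails (the sqoop compile directory is temporary and often gets cleaned up), one possible fix, sketched here with a hypothetical target path, is to copy orders.avsc to a permanent HDFS location and repoint the table at it:
-- /user/hive/schemas/orders.avsc is a hypothetical path; copy the generated
-- orders.avsc there first (e.g. with hdfs dfs -cp), then update the table
ALTER TABLE orders SET TBLPROPERTIES (
  'avro.schema.url'='hdfs:///user/hive/schemas/orders.avsc'
);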
I was able to create the exact scenario and select from the Hive table.
hive> CREATE EXTERNAL TABLE sqoop_test
> COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION '/user/cloudera/categories/'
> TBLPROPERTIES
> ('avro.schema.url'='hdfs:///user/cloudera/categories.avsc')
> ;
OK
Time taken: 1.471 seconds
hive> select * from sqoop_test;
OK
1 2 Football
2 2 Soccer
3 2 Baseball & Softball

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH 5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id  |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM clause with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id  | item  |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
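The same result can be written with explicit aliases and the item pseudo-column that Impala exposes for arrays of scalars (a minimal sketch):
SELECT p.id, m.item AS member_of
FROM id_member_of_parquet p, p.member_of m;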