PySpark cannot create a Parquet table in Hive

Many searches point to PySpark code that creates tables in the Hive metastore with something like:
hivecx.sql("...create table syntax that matches the dataframe...")
df.write.mode("overwrite").partitionBy('partition_colname').insertInto("national_dev.xh_claimline")
I've tried many variations of write/save/insertInto and modes, but I always get:
Caused by: java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/national_dev.db/xh_claimline/000000_0
The table directory exists in Hadoop, but the 000000_0 subdirectory does not. I assumed this was because the table is empty and I haven't written to it yet.
hadoop fs -ls /user/hive/warehouse/national_dev.db/xh_claimline
Found 2 items
drwxrwxrwt - mryan hive 0 2017-03-20 12:26 /user/hive/warehouse/national_dev.db/xh_claimline/.hive-staging_hive_2017-03-20_12-26-35_382_2703713921168172595-1
drwxrwxrwt - mryan hive 0 2017-03-20 12:29 /user/hive/warehouse/national_dev.db/xh_claimline/.hive-staging_hive_2017-03-20_12-29-40_775_73045420253990110-1
On Cloudera, Spark version:
17/03/20 11:45:21 INFO spark.SparkContext: Running Spark version 1.6.0

Looking at the insertInto statement: since write mode overwrite is used, there is no need to use insertInto at all. Use saveAsTable directly with the Parquet format. Here is the modified statement:
df = hivecx.sql("...query that produces the dataframe to write...")
df.write.mode("overwrite").format("parquet").partitionBy('partition_colname').saveAsTable("national_dev.xh_claimline")

Related

Export external hive table from one VM to another

I have two environments, Dev and Stage. Both have Hive installed (the same version, 2.1). On Dev I have external Hive tables pointing to an HBase table. I have to export this Hive table to Stage. There is no requirement that the HBase table also be migrated; creating a managed Hive table on Stage with the data in it would be sufficient. Can anyone suggest how to do this? Below is a diagrammatic representation of the scenario. A solution to any of the expected scenarios would be useful.
I tried:
Dump the Hive table's data into a CSV file and load it into a managed Hive table on Stage. But the data contains Japanese characters (non-UTF-8), which causes a higher row count on Stage than on Dev.
I think this is largely a conceptual problem, so I'm not adding queries. Please let me know if you wish to see them.
Dev Hive table -> Dev HDFS location -> Distcp -> Stage HDFS location -> Import -> Stage Hive table
You can export the hive table data to an HDFS location using the command below.
INSERT OVERWRITE DIRECTORY 'hdfs_exports_location/department' SELECT * FROM department;
Copy the HDFS data to the stage environment HDFS location using distcp
hadoop distcp <hdfs_export_location>/department hdfs://<stage name node>/<import location>
Import the table from the copied HDFS files
IMPORT FROM '<import location>';
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
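If Spark is available on both clusters, a hedged alternative that sidesteps the CSV encoding problem is to copy the data as Parquet instead of delimited text, so the Japanese characters survive intact and the row counts match. A rough PySpark sketch (database, table, and path names are placeholders, and reading the HBase-backed table through a HiveContext assumes the HBase storage handler jars are on the Spark classpath):
# on Dev: materialize the external (HBase-backed) Hive table as Parquet on HDFS
src = hivecx.table("dev_db.department")
src.write.mode("overwrite").parquet("hdfs_exports_location/department_parquet")

# after copying the directory to Stage with distcp as shown above, on Stage:
dst = hivecx.read.parquet("/<import location>/department_parquet")
dst.write.mode("overwrite").saveAsTable("stage_db.department")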

PySpark: java.lang.ClassCastException

I have PySpark code that builds a query and runs an INSERT INTO command on another Hive table, which is internally mapped to an HBase table.
When I run the INSERT INTO command on the Hive table using Spark SQL, I get the following exception:
java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
I checked the data types and TBLPROPERTIES but have been unable to get past this exception.
The versions I am using are:
PySpark -- 1.6.0
Hive -- 1.1.0-cdh5.8.2
The table properties are:
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,colf:a")
TBLPROPERTIES ("hbase.table.name" = "abc", 'hbase.mapred.output.outputtable' = 'abc');
I tried removing the ROW FORMAT SERDE, but I still get the same issue.
Am I getting the issue because the versions don't match, or am I going wrong somewhere else?
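For context, the failing pattern looks roughly like this minimal sketch (the table and column names are invented; the real query is built dynamically by the PySpark code):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hivecx = HiveContext(sc)

# hive_hbase_table is the Hive table declared with HBaseStorageHandler above;
# source_table is an ordinary Hive table with matching columns
hivecx.sql("INSERT INTO TABLE hive_hbase_table SELECT key, value FROM source_table")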
This is a Spark bug; see this Apache Spark pull request: https://github.com/apache/spark/pull/17989

Query fails on presto-cli for a table created in hive in orc format with data residing in s3

I set up an Amazon EMR instance which includes 1 Master & 1 Core (m4 Large) with the following version details:
EMR : 5.5.0
Presto: Presto 0.170
Hadoop 2.7.3 HDFS
Hive 2.1.1 Metastore
My Spark app wrote out the data in ORC to Amazon S3. Then I created the table in Hive (CREATE EXTERNAL TABLE TABLE ... PARTITIONED BY (...) STORED AS ORC LOCATION 's3a://...') and tried to query it from presto-cli. I get the following error for the query SELECT * FROM TABLE:
Query 20170615_033508_00016_dbhsn failed: com.facebook.presto.spi.type.DoubleType
The only query that works is:
SELECT COUNT(*) from TABLE
Any ideas?
Found the problem: the column order when the data was stored as ORC did not match the column order used when the table was created in Hive.
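One way to catch this kind of mismatch before querying from Presto is to compare the schema Spark actually wrote with what Hive exposes; a small PySpark check, assuming a HiveContext named sqlContext (the S3 path and table name are placeholders):
# schema as recorded in the ORC files on S3
orc_df = sqlContext.read.orc("s3a://my-bucket/path/to/table/")
orc_df.printSchema()

# schema as Hive (and therefore Presto) sees it; names and order should line up
sqlContext.sql("DESCRIBE my_table").show(100, truncate=False)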

Cloudera 5.6: Parquet does not support date. See HIVE-6384

I am currently using Cloudera 5.6 and trying to create a Parquet-format table in Hive based on another table, but I am running into an error.
CREATE TABLE sfdc_opportunities_sandbox_parquet LIKE
sfdc_opportunities_sandbox STORED AS PARQUET
Error Message
Parquet does not support date. See HIVE-6384
I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?
Apart from using another data type such as TIMESTAMP, or another storage format such as ORC, there may be no way around this if you are tied to the Hive version and the Parquet storage format in use.
According to Cloudera's CDH 5 Packaging and Tarball Information, the whole 5.x branch ships with Apache Parquet 1.5.0 and Apache Hive 1.1.0.
DATE support was implemented in the Parquet SerDe with HIVE-8119, as of Hive 1.2.
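If the copy can be driven from Spark instead of the Hive CLI, one hedged workaround is to cast the DATE column(s) to TIMESTAMP (or STRING) while writing the Parquet copy, since CREATE TABLE ... LIKE keeps the unsupported DATE type. A sketch assuming a HiveContext named hivecx; close_date stands in for whatever DATE columns the table actually has:
from pyspark.sql import functions as F

# cast any DATE columns before writing the Parquet-backed copy
src = hivecx.table("sfdc_opportunities_sandbox")
parquet_ready = src.withColumn("close_date", F.col("close_date").cast("timestamp"))
parquet_ready.write.mode("overwrite").format("parquet") \
    .saveAsTable("sfdc_opportunities_sandbox_parquet")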

How to make a table that is automatically updated in Hive

I have created an external table in Hive that uses data from a Parquet store in HDFS.
When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.
How I create the table in Hive:
CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";
The relevant error:
Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet
I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.
I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.
The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the database.
Relevant information:
Hive 0.13.0
Spark Streaming 1.4.1
PostgreSQL 9.3
CentOS 7
EDIT:
After examining the Parquet files, I found that the part files have seemingly incompatible file names.
-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
These are the files that cause Hive to fail with the FileNotFoundException described above when it can't find them. This means that the external table is not acting dynamically, accepting whatever files are in the directory (if you can call it that in HDFS), but instead is probably just keeping track of the list of Parquet files that were inside the directory when it was created.
Sample Spark code:
nodes.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    // build a DataFrame from the RDD of Rows and append it as Parquet
    sqlContext.createDataFrame(rdd.map(n => Row(n.stuff)), ParquetStore.nodeSchema)
      .write.mode(SaveMode.Append).parquet(node_name)
  }
})
Where the nodeSchema is the schema and node_name is "/hdfs/nodes"
See my other question about getting Hive external tables to detect new files.
In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:
/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
I realize this isn't automatic, but it at least works.
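A rough PySpark version of that flow, assuming a DataFrame df for the current batch, a HiveContext named hiveContext, and a nodes table declared with a matching string partition column:
import time

# one new partition directory per run, following the layout shown above
run_ts = str(int(time.time()))
df.write.mode("append").parquet("/hdfs/nodes/timestamp=%s" % run_ts)

# ask Hive to discover the new partition directory
hiveContext.sql("MSCK REPAIR TABLE nodes")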
OK, so you probably need to put the file inside a folder. A Hive external table must be mapped to a folder, which can contain more than one file.
try to write the file to: /path/to/hdfs/nodes/file
and then map the external table to /path/to/hdfs/nodes
so the nodes folder will contain only the Parquet file(s), and it should work.