Write schema to an existing ORC file - hive

We are facing the following issue: we use Hive 1.2.x to write ORC files, and it is a known problem that Hive before version 2.x does not write the real column names into the ORC file (it writes only col_0, col_1, etc.).
We would like to use another application that reads the schema from the ORC file and cannot connect to the HCat metastore for the correct column names. Unfortunately, we do not have the option of upgrading to Hive 2.x.
Is there any solution to "append" or replace the correct column names in these existing ORC files? Thanks in advance for your help.

Related

Is it possible to merge two Parquet directories on HDFS?

I have two Parquet directories on my HDFS with the same schema. I want to merge these two directories into one Parquet directory so that I can create an external Hive table from it.
I have googled my problem, but almost all results are about merging small Parquet files into larger Parquet files.
As long as the parquet files have the same schema, you can simply put them in the same directory. Hive will process all files that it finds in an external table's directory (except a few special files with specific names), so you can simply put your data there and Hive will find it. (In older Hive versions this was true for non-external tables as well. In newer Hive versions, however, it is only true for external tables thus you should not tamper with the contents of so-called managed tables.)
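A minimal HiveQL sketch of that approach, assuming hypothetical directories /data/parquet_a and /data/parquet_b, made-up column names, and a Hive version that allows LOAD DATA into external tables:

-- Point an external table at the first directory.
CREATE EXTERNAL TABLE merged_events (
  id BIGINT,
  event_name STRING
)
STORED AS PARQUET
LOCATION '/data/parquet_a';

-- Move the files of the second directory into the table's directory
-- (LOAD DATA INPATH moves the HDFS files rather than copying them).
LOAD DATA INPATH '/data/parquet_b' INTO TABLE merged_events;

Alternatively, a plain hdfs dfs -mv of the files into a single directory achieves the same result, since Hive only cares about what it finds under the table's location.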

How can we load ORC data into Hive using the NiFi Hive streaming processor?

I have ORC files and their schema. I have tried loading these ORC files into a local Hive instance and it works fine. Now I will generate multiple ORC files and need to load them into a Hive table using the NiFi PutHiveStreaming processor. How can I do this?
PutHiveStreaming expects incoming flow files to be in Avro format. If you are using PutHive3Streaming you have more flexibility, but it doesn't accept flow files in ORC format either; instead, both of those processors convert the input into ORC and write it into a managed table in Hive.
If your files are already in ORC format, you can use PutHDFS to place them directly into HDFS. If you don't have permission to write directly into a managed table location, you could write to a temporary location, create an external table on top of it, and then load from there into the managed table using something like INSERT INTO myTable SELECT * FROM externalTable, or similar.
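A rough HiveQL sketch of that second path, assuming a hypothetical staging location /tmp/orc_staging and made-up table and column names:

-- External table on top of the ORC files that PutHDFS dropped into the staging directory.
CREATE EXTERNAL TABLE staging_orc (
  id BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION '/tmp/orc_staging';

-- Copy the rows into the managed target table.
INSERT INTO TABLE my_table SELECT * FROM staging_orc;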

Cloudera 5.6: Parquet does not support date. See HIVE-6384

I am currently using Cloudera 5.6 and trying to create a Parquet-format table in Hive based on another table, but I am running into an error.
create table sfdc_opportunities_sandbox_parquet like
sfdc_opportunities_sandbox STORED AS PARQUET
Error Message
Parquet does not support date. See HIVE-6384
I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?
Apart from using another data type like TIMESTAMP, or another storage format like ORC, there may be no way around this if you are tied to the Hive version and the Parquet storage format in use.
According to Cloudera's CDH 5 Packaging and Tarball Information, the whole CDH 5 branch comes packaged with Apache Parquet 1.5.0 and Apache Hive 1.1.0.
DATE support was implemented in the ParquetSerDe with HIVE-8119, as of Hive 1.2.
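A minimal sketch of the TIMESTAMP workaround; since CREATE TABLE ... LIKE cannot change column types, it uses CREATE TABLE ... AS SELECT instead and assumes hypothetical columns id and close_date (list your real columns accordingly):

-- Recreate the table as Parquet, casting the unsupported DATE column to TIMESTAMP.
CREATE TABLE sfdc_opportunities_sandbox_parquet
STORED AS PARQUET
AS
SELECT id, CAST(close_date AS TIMESTAMP) AS close_date
FROM sfdc_opportunities_sandbox;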

How to load ORC table data into PIG variables

I'm pretty new to Pig programming. Can we read or write data to a Hive ORC table? Our requirement is to dump data from one Hive ORC table to another ORC table, doing some massaging of the data in a Pig script. Can anyone share samples of ORC reading or writing? Please also share any good sites for learning Pig quickly.
-- Load the Hive table analysis.bic_orc (stored as ORC) through HCatalog
A = LOAD 'analysis.bic_orc' USING org.apache.hcatalog.pig.HCatLoader();
I found this on the Hortonworks website; I haven't personally used Pig to load ORC files.
http://hortonworks.com/community/forums/topic/pig-is-much-slower-than-hive-when-reading-orc-files-using-hcatalog/
Hope this helps.

Sqoop, Avro and Hive

I'm currently importing from MySQL into HDFS using Sqoop in Avro format, and this works great. However, what's the best way to load these files into Hive?
Since Avro files contain the schema, I could pull the files down to the local file system, use avro-tools to extract the schema, and create the table from it, but that seems excessive.
Also, if a column is dropped from a table in MySQL, can I still load the old files into a new Hive table created with the new Avro schema (with the dropped column missing)?
Since version 0.9.1, Hive has come packaged with an Avro SerDe. This allows Hive to read from Avro files directly, while Avro still "owns" the schema.
For your second question, you can define the Avro schema with column defaults. When you add a new column, just make sure to specify a default, and all your old Avro files will work just fine in a new Hive table.
To get started, you can find the documentation here, and the book Programming Hive (available on Safari Books Online) has a section on the AvroSerDe which you might find more readable.
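As a minimal sketch of the first point, an Avro-backed external table can be declared on top of the Sqoop output directory; the table name, paths, and schema file below are assumptions:

-- No column list is needed: Hive derives the columns from the Avro schema file.
CREATE EXTERNAL TABLE my_mysql_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/sqoop/my_mysql_table'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/sqoop/schemas/my_mysql_table.avsc');

On newer Hive versions (0.14 and later) the shorter STORED AS AVRO form can be used instead of spelling out the SerDe and input/output formats.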