Hive - how to set Parquet/ORC as the default output format

Hive uses TextFile as the default format; an extra "STORED AS PARQUET/ORC" clause has to be added whenever the Parquet/ORC file format is needed.
How can Parquet/ORC be set as the default output format?

hive.default.fileformat
Default Value: TextFile
Added In: Hive 0.2.0
Default file format for CREATE TABLE statement. Options are TextFile,
SequenceFile, RCfile, ORC, and Parquet. Users can explicitly say
CREATE TABLE ... STORED AS
TEXTFILE|SEQUENCEFILE|RCFILE|ORC|AVRO|INPUTFORMAT...OUTPUTFORMAT... to
override. (RCFILE was added in Hive 0.6.0, ORC in 0.11.0, AVRO in
0.14.0, and Parquet in 2.3.0) See Row Format, Storage Format, and SerDe for details.
hive.default.fileformat.managed
Default Value: none
Added In: Hive 1.2.0 with HIVE-9915
Default file format for CREATE TABLE statement applied to managed tables only. External tables will be created with
format specified by hive.default.fileformat. Options are none,
TextFile, SequenceFile, RCfile, ORC, and Parquet (as of Hive 2.3.0).
Leaving this null will result in using hive.default.fileformat for all
native tables. For non-native tables the file format is determined by
the storage handler, as shown below (see the StorageHandlers section
for more information on managed/external and native/non-native
terminology).
+----------+---------------------------------------------------------------------------+-------------------------------------+
| | Native | Non-Native |
+----------+---------------------------------------------------------------------------+-------------------------------------+
| Managed | hive.default.fileformat.managed (or fall back to hive.default.fileformat) | Not covered by default file-formats |
| External | hive.default.fileformat | Not covered by default file-formats |
+----------+---------------------------------------------------------------------------+-------------------------------------+
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-FileFormats

For external tables, execute the following:
set hive.default.fileformat=Parquet
For managed tables, execute the following:
set hive.default.fileformat.managed=Parquet
These settings apply only to the current session. If you want them for your entire Hive configuration, set the properties in your hive-site.xml and restart your Hive service.
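As a quick illustration (the table name here is hypothetical), once the session property is set, a plain CREATE TABLE produces a Parquet table without any STORED AS clause:

set hive.default.fileformat=Parquet;
-- No STORED AS clause needed; the session default makes this a Parquet table
CREATE TABLE sales_parquet (id int, amount double);
-- Check the InputFormat/OutputFormat lines in the output to confirm Parquet is used
DESCRIBE FORMATTED sales_parquet;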

Related

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery with the schema auto-generated from the file in GCS. I'm using Apache Airflow to do this; the problem is that when I use schema auto-detection, BigQuery builds the schema from only the first ~100 values.
For example, in my case there is a column, say X, whose values are mostly of Integer type, but some values are of String type, so bq load will fail with a schema mismatch; in such a scenario we need to change the data type to STRING.
What I could do is manually create a new table by generating the schema on my own. Or I could set the max_bad_records value to some 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS into BigQuery; if the table is created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script.
Use one line that best suits your case, with the optimized field types.
This will create the table with the correct schema and one row, and from there you can load the rest of the data.
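If you instead go the manual-schema route, a minimal BigQuery standard SQL sketch might look like the following (the project, dataset, table, and column names are hypothetical):

-- Pinning X to STRING up front avoids the auto-detect mismatch;
-- subsequent loads from GCS then append to this fixed schema.
CREATE TABLE `myproject.mydataset.mytable` (
  X STRING,          -- mixed integer/string values are stored safely as strings
  other_field INT64
);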

Hive External Table with Azure Blob Storage

Is there a way to create a Hive external table using a SerDe with the location pointing to Azure Storage, organized in such a way that the data uses the fewest number of blobs? For example, if I insert 10000 records, I would like it to create just 100 page blobs with 100 records each instead of maybe 10000 blobs with 1 record each. I am deserializing from the blobs, so fewer blobs will require less time. What would be the most optimal format in Hive?
First, there is a way to create a Hive external table using a SerDe with the location pointing to Azure Blob Storage, but not directly; please see the section Create Hive database and tables and the HiveQL below.
create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>' TBLPROPERTIES("skip.header.line.count"="1");
The explanation of <storage location> below is the part to focus on.
<storage location>: the Azure storage location to save the data of Hive tables. If you do not specify LOCATION , the database and the tables are stored in hive/warehouse/ directory in the default container of the Hive cluster by default. If you want to specify the storage location, the storage location has to be within the default container for the database and tables. This location has to be referred as location relative to the default container of the cluster in the format of 'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the query is executed, the relative directories are created within the default container.
So you can access an Azure Blob Storage location from Hive via the wasb protocol, which requires the hadoop-azure library that lets Hadoop use Azure Storage as a file system. If your Hive on Hadoop is not deployed on Azure, you need to refer to the official Hadoop document Hadoop Azure Support: Azure Blob Storage to configure it.
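As a minimal sketch (the container, storage account, database, and directory names here are hypothetical), an external table can also point at a blob container with the full wasb URI form wasb://<container>@<account>.blob.core.windows.net/<path>:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable
(
field1 string,
field2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- the relative form 'wasb:///hive/mytable/' also works for the cluster's default container
LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/hive/mytable/';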
As for using a SerDe, it depends on the file format you use; for example, for the ORC file format the HQL uses OrcSerde, as below.
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>'
For your second question, the most optimal format in Hive is the ORC file format.
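To keep the number of files (and hence blobs) per insert low, one option not covered in the answer above is Hive's small-file merge settings; a sketch, assuming MapReduce execution (the size values are illustrative, not tuned):

set hive.merge.mapfiles=true;                  -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;               -- merge small files produced by map-reduce jobs
set hive.merge.smallfiles.avgsize=128000000;   -- trigger a merge pass when the average output file is smaller than this
set hive.merge.size.per.task=256000000;        -- target size of the merged files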

Hue on Cloudera - NULL values (importing file)

Yesterday I installed the Cloudera QuickStart VM 5.8. After importing files from the database via Hue, some tables ended up with NULL values (the entire column). In the previous steps the data was displayed properly, as it should have been imported.
Can you run the command describe formatted table_name in the Hive shell to see what the field delimiter is, and then go to the warehouse directory and check whether the delimiter in the data and in the table definition is the same? I am sure it is not the same; that is why you see NULLs.
I am assuming you have imported the data into the default warehouse directory.
Then you can do one of the following:
1) delete your Hive table and create it again with the correct delimiter as it appears in the actual data (row format delimited fields terminated by "your delimiter") and give the location of your data file (see the sketch below),
or
2) delete the data that was imported and run the sqoop import again with --fields-terminated-by "the delimiter in the Hive table definition".
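A minimal sketch of option 1 (the table name, columns, delimiter, and path are hypothetical):

-- Recreate the table so its delimiter matches the one actually present in the data files
-- (if the original table was managed, keep a copy of the data before dropping it)
DROP TABLE IF EXISTS mytable;
CREATE EXTERNAL TABLE mytable
(
col_0 string,
col_1 int,
col_2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- use the delimiter found in the data
LOCATION '/user/hive/warehouse/mytable';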
Also check the datatypes of the second (col_1) and third (col_2) columns in the original database you are exporting from.
This cannot be a case of a wrong delimiter, otherwise the fourth column (col_3) would not have been populated correctly, which it is.

Does DROP PARTITION delete data from external table in HIVE?

An external table in HIVE is partitioned on year, month and day.
So does the following query delete data from the external table for the specific partition referenced in the query?
ALTER TABLE MyTable DROP IF EXISTS PARTITION(year=2016,month=7,day=11);
The partitioning scheme is not data. The partitioning scheme is part of the table DDL stored in metadata (simply put: the partition key value plus the location where the data files are stored).
The data itself is stored in files in the partition location (folder). If you drop a partition of an external table, the location remains untouched, but it is unmounted as a partition (the metadata about this partition is deleted). You can have several partition locations unmounted (for example, previous versions).
You can drop a partition and mount another location as the partition (ALTER TABLE ... ADD PARTITION, sketched below), or change an existing partition's location. Also, dropping an external table does not delete the table/partition folders with the files in them, and later you can create a table on top of this location.
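A short sketch of drop-then-remount, using the table from the question (the HDFS path is hypothetical):

ALTER TABLE MyTable DROP IF EXISTS PARTITION (year=2016, month=7, day=11);
-- the data files are still on disk; remount that (or another) location as the same partition
ALTER TABLE MyTable ADD PARTITION (year=2016, month=7, day=11)
LOCATION '/data/mytable/year=2016/month=7/day=11';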
Have a look at this answer for a better understanding of the external table/partition concept: it is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.
No. An external table holds only references, and those references are what get deleted; the actual files still persist at the location.
External table data files are not owned by the table, nor are they moved to the Hive warehouse directory.
Only the partition metadata will be deleted from the Hive metastore tables.
Difference between internal & external tables:
For External Tables -
An external table stores its files on the HDFS server, but the table is not completely linked to the source files.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/files.
For Internal Tables-
Stored in a directory based on the hive.metastore.warehouse.dir setting; by default internal tables are stored in the directory "/user/hive/warehouse". You can change this by updating the location in the config file.
Deleting the table deletes the metadata and data from the master node and HDFS respectively.
Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (this differs from organisation to organisation).
Hive can have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Hive should not own the data and control settings, dirs, etc.; you may have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
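A minimal DDL sketch of the difference (the table names and HDFS path are hypothetical):

-- External: Hive only tracks metadata; DROP TABLE leaves /data/events untouched
CREATE EXTERNAL TABLE events_ext (id int, payload string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events';

-- Internal (managed): data lives under hive.metastore.warehouse.dir; DROP TABLE deletes it
CREATE TABLE events_managed (id int, payload string);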
Note: these are the metastore tables you will see if you look into the configured metastore database:
BUCKETING_COLS
COLUMNS
DBS
NUCLEUS_TABLES
PARTITIONS
PARTITION_KEYS
PARTITION_KEY_VALS
PARTITION_PARAMS
SDS
SD_PARAMS
SEQUENCE_TABLE
SERDES
SERDE_PARAMS
SORT_COLS
TABLE_PARAMS
TBLS
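As a sketch of how table and partition metadata hang together in these tables (the column names follow the standard metastore schema, which can vary slightly between Hive versions; the table name is hypothetical), a query run against the metastore RDBMS itself might look like:

-- Run against the metastore database (e.g. MySQL/Derby), not inside Hive
SELECT t.TBL_NAME, p.PART_NAME, s.LOCATION
FROM TBLS t
JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
JOIN SDS s ON s.SD_ID = p.SD_ID
WHERE t.TBL_NAME = 'mytable';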

Import data flat files in hive without defining hive table structure

Can I import CSV or other flat files into Hive without first creating and defining the table structure in Hive? Say my CSV file has 200 columns that need to be imported into a Hive table. Currently I have to first create a table in Hive, define all the column names and datatypes within that Hive table, and then import. Is there any way to import directly into Hive so that it automatically creates the table structure, say from the first line, similar to a sqoop import?
Use sqoop with the --hive-import switch and it will create the table for you: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_importing_data_into_hive
Check your hive-site.xml for the value of the property javax.jdo.option.ConnectionURL. If you do not define this explicitly, the default value will use a relative path for creation of the Hive metastore (jdbc:derby:;databaseName=metastore_db;create=true), which will be different depending upon where you launch the process from. This would explain why you cannot see the table via show tables. The way to overcome it would be to define this property value in your hive-site.xml using an absolute path.