Why does my Hive ORC table have so many stripes?
I created a tmp table; test_orc_1 has 400 columns.
CREATE TABLE tmp.test_orc_2 stored as orc AS
SELECT *
FROM tmp.test_orc_1
;
Table test_orc_2 has 3928 stripes in a single file, and this file's size is 300 MB.
How can I decrease the number of stripes?
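One possible fix, sketched below under the assumption that the small stripes come from the ORC writer flushing under memory pressure (common with very wide tables like this 400-column one), is to request a larger stripe size; the 256 MB value here is only an example:
SET hive.exec.orc.default.stripe.size=268435456; -- session-level target for new ORC writers (256 MB, assumed)
CREATE TABLE tmp.test_orc_2
STORED AS ORC
TBLPROPERTIES ('orc.stripe.size'='268435456') -- per-table stripe size request
AS
SELECT *
FROM tmp.test_orc_1;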
I have a text file with Snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script for creating the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know there is some data). Any help?
Thanks!
As you are creating an external table on top of an HDFS directory, you need to run one of the commands below to register the partitions with the Hive table.
If a partition is added to HDFS directly (instead of through insert queries), Hive doesn't know about it, so we need to run either msck repair or add partition to add the newly added partitions to the Hive table.
To add all partitions to the Hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to the Hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
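After running either command, you can verify that the partitions are now registered:
hive> show partitions <db_name>.<table_name>;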
For more details, refer to this link.
When you create an ORC table in Hive, you are changing the file format to ORC. This means you can't read the individual files directly outside of the ORC table.
Here's an example ORC create table statement:
CREATE TABLE IF NOT EXISTS table_orc_v1
(
col1 int,
col2 int
)
PARTITIONED BY (odate date)
CLUSTERED BY (col1) INTO 10 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');
If I add CSV-style row formatting to this (like you would on a non-ORC table), will it:
1) not affect table performance
2) slow down performance as it converts things to a CSV file that you can never read
3) give me some benefit that I'm not aware of
4) do something else
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
If you are using any binary format (ORC, AVRO, Parquet) to store your data, then ROW FORMAT DELIMITED FIELDS TERMINATED BY is simply ignored. You can include it in your table syntax and it might not give you any error, but it is not used.
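As a sketch (demo_orc is a made-up table name), the delimiter clause below parses without error but has no effect on how the ORC files are written:
CREATE TABLE demo_orc (
col1 int,
col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' -- parsed, then ignored by the ORC SerDe
STORED AS ORC;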
I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these two lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet files; instead I find files with names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The insert statement doesn't create files with an extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional note: You can also create a new table from the source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
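One way to verify that those extension-less files really are Parquet (the warehouse path below is an assumption): every Parquet file begins with the 4-byte magic string PAR1, which you can read off the first file directly:
hadoop fs -cat /user/hive/warehouse/parquet_region/0000_0 | head -c 4   # a Parquet file prints PAR1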
What is an efficient way to export data from a Hive/Impala table with conditions into a file (the data would be huge, close to 10 GB)? The format of the Hive table is Parquet with Snappy compression, and the output file is CSV.
The table is partitioned daily and data needs to be extracted on a daily basis. I would like to know whether
1) the Impala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'
or 2) the Hive approach
Insert overwrite directory '/path' select * from table where year_month_date=$$$$$$$$
would be more efficient.
Assume tbl is your Hive Parquet table and condition is your filter condition.
CTAS command:
CREATE TABLE tbl_text
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/data'
AS SELECT * FROM tbl WHERE condition;
You will find your CSV text file (delimited by ',') at /tmp/data in HDFS.
You can get this file to your local file system if needed using:
hadoop fs -get /tmp/data
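If you would rather have a single local CSV instead of one file per reducer, getmerge concatenates them (the local file name here is just an example):
hadoop fs -getmerge /tmp/data /tmp/data_local.csv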
Please try using dynamic partitioning for your Hive/Impala table to efficiently export the data conditionally.
Partition your table on the columns of interest, based on your query patterns, for best results.
Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
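A sketch of this step, reusing the column names from the employee example below (the input path is a placeholder):
CREATE TABLE TmpTable (
emp_id int,
emp_name string,
location string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/path/to/raw_data.csv' INTO TABLE TmpTable;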
Step 2: Set hive parameters to support Dynamic partition
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
Step 3: Create your main Hive table with partition columns, for example:
CREATE TABLE employee (
emp_id int,
emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from the temporary table into your employee table (main table)
insert overwrite table employee partition(location)
select emp_id,emp_name, location from TmpTable;
Step 5: Export the data from Hive with a condition
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
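Since the goal is a CSV file, note that Hive 0.11+ also accepts a row format on the export itself, so the directory contents come out comma-delimited rather than the default ^A-delimited:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee WHERE location='CALIFORNIA';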
Please refer to this link:
Dynamic Partition Concept
Hope this is useful.
I am creating a Hive table from a file with 8k rows, but the table created has 78k rows. The command line is the following:
bin/hive_executable < my_script.hql
my_script.hql:
create table my_table(k1 t1, k2 t2....);
load data local inpath 'path/to/table_file.txt' INTO TABLE my_table;
table_file.txt:
v1 v2 v3...
I've tried both space- and tab-delimited fields, and explicitly declaring the structure in the create table statement. When I use the example code to create a table from $HIVE_HOME/example/file/kv1.txt, the table and file both have 500 lines/rows.
Any ideas?
Thanks
Strip the text fields of all newline characters. Hive's text format treats every newline as a row terminator, so newlines embedded inside field values are read as extra rows, which is how 8k source rows become 78k table rows.
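A quick way to confirm that diagnosis (path as in the question): since Hive reads one row per physical line of a text file, the file's line count is exactly what the table will report:
wc -l path/to/table_file.txt   # ~78k lines here would explain the inflated row count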