I'm using Hive 0.8.0. I want to insert the system timestamp into a timestamp field while loading data into a Hive table.
In detail:
I have a file with 2 fields like below:
id name
1 John
2 Merry
3 Sam
Now I want to load this file into a Hive table along with an extra column, "created_date". So I created the Hive table with the extra field, like below:
CREATE table mytable(id int,name string, created_date timestamp) row format delimited fields terminated by ',' stored as textfile;
To load the data file I used the query below:
LOAD DATA INPATH '/user/user/data/' INTO TABLE mytable;
When I run the above query, the "created_date" field is NULL. I want that field to be filled with the system timestamp instead of NULL while loading the data into the Hive table. Is this possible in Hive? How can I do it?
You can do this in two steps. First load the data from the file into a temporary table without the timestamp column. Then insert from the temporary table into the actual table, generating the timestamp with the unix_timestamp() UDF. Because unix_timestamp() returns the epoch time in seconds as a bigint, wrapping it in from_unixtime() gives a 'yyyy-MM-dd HH:mm:ss' string that converts cleanly to the timestamp column:
create table temptable(id int, name string)
row format delimited fields terminated by ','
stored as textfile;
create table mytable(id int, name string, created_date timestamp)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/user/data/' into table temptable;
insert into table mytable
select id, name, from_unixtime(unix_timestamp())
from temptable;
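To illustrate the difference between the two values involved: unix_timestamp() yields epoch seconds as an integer, while a timestamp column's text form is 'yyyy-MM-dd HH:mm:ss', which is what Hive's from_unixtime() UDF produces. The Python sketch below only mimics that conversion outside Hive. The function name is hypothetical, and UTC is an assumption made here for reproducibility (Hive uses its own time zone setting).

```python
from datetime import datetime, timezone


def from_unixtime_utc(epoch_seconds: int) -> str:
    """Format epoch seconds as 'yyyy-MM-dd HH:mm:ss'.

    Assumption: UTC is used so the result is reproducible; this is an
    illustration, not Hive's own implementation.
    """
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Epoch second 0 is the start of 1970-01-01 in UTC.
print(from_unixtime_utc(0))      # 1970-01-01 00:00:00
# One day (86400 seconds) later.
print(from_unixtime_utc(86400))  # 1970-01-02 00:00:00
```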
Related
I have the following query in Hive:
CREATE EXTERNAL TABLE shop.id_store (
person_id INT,
shop_category STRING
)
row format delimited fields terminated by ',' stored as textfile
LOCATION "user/schema/table"
tblproperties('skip.header.line.count'='1', 'external.table.purge'='true');
LOAD DATA INPATH 'tmp/ids.csv' OVERWRITE INTO TABLE shop.id_store;
INSERT OVERWRITE TABLE shop.id_store
SELECT
*
FROM
shop.id_store
My CSV file, ids.csv, does contain a header row. However, I have noticed that the above code actually removes the first row of my actual data. What is going on?
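One plausible explanation, offered as an assumption rather than a confirmed diagnosis: skip.header.line.count is applied every time the table's files are read, and INSERT OVERWRITE ... SELECT * rewrites the files with only the rows that were read, so the rewritten files no longer begin with a header. The next read then skips a real data row. The Python sketch below simulates that sequence; the sample rows and the read_rows helper are made up for illustration.

```python
def read_rows(file_lines, skip_header_count=1):
    """Mimic a reader that drops the first N lines of a file on every read."""
    return file_lines[skip_header_count:]


# File as moved into the table location by LOAD DATA:
# one header line plus three data rows (hypothetical values).
stored = ["person_id,shop_category", "1,food", "2,toys", "3,books"]

# First read: the header is skipped and all three data rows are visible.
first_read = read_rows(stored)

# INSERT OVERWRITE ... SELECT * writes back only what was read,
# so the stored file now starts with a data row, not a header.
stored = list(first_read)

# Second read: one line is still skipped, and it is now real data.
second_read = read_rows(stored)

print(first_read)   # ['1,food', '2,toys', '3,books']
print(second_read)  # ['2,toys', '3,books']
```

If this is what is happening, dropping the self-overwriting INSERT (the data is already in place after LOAD DATA) would avoid the loss.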
I need to migrate 2 tables (table A and B) to a new cluster.
I applied the same queries to both tables. Table A works fine, but Table B has mismatched counts: there are more rows in the new cluster. After some investigation, I found that the extra rows are NULL rows, but I can't find the cause of this extra-count issue.
My procedure is as below:
1. Export the Hive table:
INSERT OVERWRITE LOCAL DIRECTORY
'/path/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0007' null defined as '' stored as textfile
SELECT * FROM export_table_name
WHERE file_date between '2021-01-01' and '2022-01-31'
LIMIT 2100000000;
*One difference between Table A and B: Table B is a lot bigger than A. When I exported Table B, I split it in half and exported twice. The queries used WHERE date between '2021-01-01' and '2021-06-30' and WHERE date between '2021-07-01' and '2021-12-31'.
2. SCP the exported files to the new cluster.
3. Create the table schema with:
CREATE TABLE myTable_temp(
columns
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
stored as textfile;
4. Import the files into the temp table (non-partitioned):
load data inpath 'myPath' overwrite into table myTable_temp;
*For Table B, I imported twice. The query for the second import was load data inpath 'myPath' into table myTable_temp;
5. Create the table schema plus one extra column, "partition_key", for the actual table.
6. Insert the data from the temp table into the actual table (partitioned):
insert into table myTable partition(partition_key) select *, concat(year(file_date)) partition_key from myTable_temp;
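One way to narrow this down is to check the exported text files before importing them. The Python sketch below counts how many lines have each number of '\u0007'-separated fields; lines with too few fields would load as rows that are mostly or entirely NULL. This assumes one possible cause (for example, a string column containing an embedded newline that splits a row across two lines), which the question does not confirm. The helper name and sample lines are hypothetical.

```python
from collections import Counter


def field_count_histogram(lines, delimiter="\u0007"):
    """Count how many lines have each number of delimiter-separated fields."""
    return Counter(len(line.split(delimiter)) for line in lines)


# Hypothetical export of a 3-column table. The second logical row contained
# a newline inside a string value, so it was written as two physical lines.
sample = [
    "a\u0007b\u0007c",
    "d\u0007e",
    "f",
    "g\u0007h\u0007i",
]

hist = field_count_histogram(sample)
# Expect two well-formed lines (3 fields) and two fragments (2 and 1 fields).
print(sorted(hist.items()))  # [(1, 1), (2, 1), (3, 2)]
```

Running the same count over the real export files for Table B, and comparing against the expected column count, would show whether malformed lines account for the extra rows.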
I am trying to create an Impala table with an array column, and I have to use a custom delimiter for the array column.
I tried the query below, but it throws an error.
Create table array_demo( arra_col ARRAY<string>) row format delimited fields terminated by ','
collection items terminated by '|' stored as parquet
You should omit the ROW FORMAT clause and the subclauses specifying the terminators, and include a STORED AS clause (Parquet is the only format Impala supports with complex data).
The data files loaded into the table have to be in Parquet format too.
If you don't have the data file in Parquet format, you can create the table in Hive,
then create a copy using CREATE TABLE … AS SELECT (CTAS statement), with STORED AS PARQUET.
You then can query the table in Impala.
As an example:
-- Create table in Hive
CREATE TABLE array_demo( arra_col ARRAY<STRING>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;
-- Copy the table in Parquet format (STORED AS must come before AS SELECT)
CREATE TABLE array_demo_impala
STORED AS PARQUET
AS
SELECT *
FROM array_demo;
Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming every file follows the same schema, I would suggest storing the files with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/:
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write alter table yourtable add partition ... queries for each one, you can simply use the repair command that will automatically add partitions.
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
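The renaming from the original layout to the dt= layout can be scripted. Below is a minimal Python sketch that only computes the new path for each old path; it does not move anything. The function name and the data.csv default are illustrative assumptions, and the actual move would still be done with whatever filesystem tool you use.

```python
import re

# Matches paths such as /path/to/files/2016/07/31.csv
_DAILY_FILE = re.compile(
    r"^(?P<root>.*)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})\.csv$"
)


def to_partition_path(path, filename="data.csv"):
    """Map <root>/YYYY/MM/DD.csv to <root>/dt=YYYY-MM-DD/<filename>."""
    match = _DAILY_FILE.match(path)
    if match is None:
        raise ValueError(f"not a daily file path: {path!r}")
    dt = f"{match['year']}-{match['month']}-{match['day']}"
    return f"{match['root']}/dt={dt}/{filename}"


for old in [
    "/path/to/files/2016/07/31.csv",
    "/path/to/files/2016/08/01.csv",
    "/path/to/files/2016/08/02.csv",
]:
    print(old, "->", to_partition_path(old))
```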
Without moving your files:
Design your table schema. In the Hive shell, create the table (partitioned by date).
Load the files into the table.
Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03').
Moving your files:
Design your table schema. In the Hive shell, create the table (partitioned by date).
Move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, and likewise for the other days. You will then have:
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
Load each partition with (the string partition value must be quoted):
alter table tableName add partition (dt='2016-07-31');
See Add partitions
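With one partition per day, writing the ALTER TABLE statements by hand gets tedious. As a rough sketch, the Python helper below generates one statement per day in an inclusive date range, with the string partition value quoted. The helper name is hypothetical, and IF NOT EXISTS is added here as an assumption so that rerunning the statements does not fail on partitions that already exist.

```python
from datetime import date, timedelta


def add_partition_statements(table, start, end):
    """Build one ALTER TABLE ... ADD PARTITION statement per day, inclusive."""
    day_count = (end - start).days + 1
    statements = []
    for offset in range(day_count):
        dt = (start + timedelta(days=offset)).isoformat()  # e.g. 2016-07-31
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{dt}');"
        )
    return statements


stmts = add_partition_statements("tableName", date(2016, 7, 31), date(2016, 8, 2))
for stmt in stmts:
    print(stmt)
```

The printed statements can be saved to a file and run from the Hive shell.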
In spark-shell, you can read the Hive table. Suppose the data is stored at:
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the SQL statement:
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it:
spark.sql(sql)
3. Load the data by adding the partition (the string partition value must be quoted):
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table:
val df = spark.sql("select * from user_info")
I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this:
CREATE TABLE csvimport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;
I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance.
Does anyone know how to do this?
Yes, you have to export and import your data at the start and end of your Hive session.
To do this you need to create a table that is mapped onto an S3 bucket and directory:
CREATE TABLE csvexport (
id BIGINT, time STRING, log STRING
)
row format delimited fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';
Insert the data into the S3 table; when the insert is complete, the directory will contain a CSV file:
INSERT OVERWRITE TABLE csvexport
select id, time, log
from csvimport;
Your table is now preserved, and when you create a new Hive instance you can reimport your data.
Your table can be stored in a few different formats depending on where you want to use it.
The above query needs to use the EXTERNAL keyword, i.e.:
CREATE EXTERNAL TABLE csvexport ( id BIGINT, time STRING, log STRING )
row format delimited fields terminated by ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 's3n://bucket/directory/';
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;
Another alternative is to use the query:
INSERT OVERWRITE DIRECTORY 's3n://bucket/directory/' select id, time, log from csvimport;
The data is then stored in the S3 directory with Hive's default delimiters.
If you can access the AWS console and have the "Access Key Id" and "Secret Access Key" for your account, you can try this too:
CREATE TABLE csvexport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://"access id":"secret key"@bucket/folder/path';
Now insert the data as others stated above:
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;