To create a table and load data into it from a .tbl file, do we first need to create the schema and then load the data?
Is it not possible to do both operations in a single command, like the one below?
create external table customer (
C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/user/hadoopadmin/tables/customer.tbl';
I tried the command above first: the schema was created, but when I ran "select count(*) from customer" I got 0 rows.
However, creating the schema with the first command and then loading the data with the "LOAD DATA INPATH" command works; after that, select count(*) returns the number of rows in the table.
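For reference, the two-step flow that works is roughly this (a sketch: first run the CREATE EXTERNAL TABLE statement above without the LOCATION clause, then load the file in a second command; the path is the one from the CREATE statement above):

LOAD DATA INPATH '/user/hadoopadmin/tables/customer.tbl' INTO TABLE customer;
select count(*) from customer;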
So is it necessary to execute two commands? Is it not possible with my first code example using the "LOCATION" option, or does that first statement have some issue that kept it from working?
Related
I want to create a table in Hive using a select statement which takes a subset of the data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there were no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example, I am trying to do something like:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working though. I know the alternative is to create the table structure with field names and the "FIELDS TERMINATED BY '|'" clause, and then load the data.
But is there any other way to combine the two into a single query that lets me create a table with filtered data from another table and also with a field separator?
Put ROW FORMAT DELIMITED ... in front of AS SELECT.
Do it like this (and adapt the query to your own table):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *, count(1) from t1 group by id, name;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
here is the result
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.
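Applied to the query in the question, that would look roughly like this (a sketch; the database, table names, and filter are taken from the question above):

create table sample_db.out_table
row format delimited
fields terminated by '|'
as
select * from sample_db.in_table where country = 'Canada';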
I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file should be in this sequence:
int,string
Here your file contents are in the sequence below:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a csv file that matches the structure of the table, the data is picked up automatically and you can use select queries to see it.
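For reference, placing the csv in the partition subfolder could look roughly like this (a sketch; the local file name ga_2016-09-06.csv is made up for illustration):

hadoop fs -mkdir -p /flumania/google_analytics/date_string=2016-09-06
hadoop fs -put ga_2016-09-06.csv /flumania/google_analytics/date_string=2016-09-06/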
Solved!
I have a use case where I build a Hive table from a bunch of csv files. While writing csv information into the Hive table, I assign part of the INPUT__FILE__NAME to one of the columns. When I want to update the records for the same filename, I need to delete the records of that csv file before writing it again.
I used the query below, but it failed:
CREATE EXTERNAL TABLE T_TEMP_CSV(
F_FRAME_RANK BIGINT,
F_FRAME_RATE BIGINT,
F_SOURCE STRING,
F_PARAMETER STRING,
F_RECORDEDVALUE STRING,
F_VALIDITY INT,
F_VALIDITY_INTERPRETATION STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
location '/user/baamarna5617/HUMS/csv'
TBLPROPERTIES ("skip.header.line.count"="2");
DELETE FROM T_RECORD
WHERE T_RECORD.F_SESSION = split(reverse(split(reverse(T_TEMP_CSV.INPUT__FILE__NAME),"/")[0]), "[.]")[0]
from T_TEMP_CSV;
The T_RECORD table has a column called F_SESSION which was assigned part of the INPUT__FILE__NAME using the split expression shown above. I want to use the same expression while removing those records. Can someone please point out where I am going wrong in this query?
I could successfully delete the records using the syntax below:
DELETE FROM T_RECORD
WHERE F_SESSION = 68;
I need to get that 68 from the INPUT__FILE__NAME.
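One way to bridge that gap, as a sketch rather than a verified solution: run the same split/reverse expression in a plain SELECT against T_TEMP_CSV to list the session values carried by the staged csv files, then feed those values into the DELETE form that already works (whether DELETE accepts an IN subquery directly depends on the Hive version, so the two-step form is shown):

-- Step 1: list the session ids derived from the csv file names
SELECT DISTINCT
       split(reverse(split(reverse(INPUT__FILE__NAME), "/")[0]), "[.]")[0] AS f_session
FROM T_TEMP_CSV;

-- Step 2: delete those sessions from T_RECORD, e.g. for the value 68 returned above
DELETE FROM T_RECORD
WHERE F_SESSION = 68;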
I am creating an external table in my own database:
create external table test1 (
event_uid string,
event_type_id int,
event_category_id int,
event_date string,
log_date string,
server string,
server_type string,
process_id int,
device_id string,
account_id string,
ip_address string,
category_id string,
content_id string,
entitlement_id string,
product_id string,
sku string,
title_id string,
service_id string,
order_id bigint,
transaction_id bigint,
company_code string,
product_code string,
key_value_pairs map<string,string>,
process_run_id string)
partitioned by (A string, B string, C string)
location '/data/a1/pnt/lte/formatted/evt'
When I try SHOW PARTITIONS TEST, I just get OK as output.
However, there is a table with the same DDL and the same location in another database which does give results when I do SHOW PARTITIONS TEST. I have also tried MSCK REPAIR TABLE TEST, which displays partitions.
Please suggest.
When you use partitions, no actual partitions are created when you execute your DDL. The partitions are created when you load data into the table. So you need to load data first, and then you will be able to see the partitions with the SHOW PARTITIONS statement.
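For example, one way to do that for the table above (a sketch; the file path and partition values are made up for illustration):

LOAD DATA INPATH '/tmp/evt_20160719.txt'
INTO TABLE test1 PARTITION (A='2016', B='07', C='19');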
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE test1 ADD PARTITION (A='2016', B='07', C='19')
    > LOCATION '/data/a1/pnt/lte/formatted/evt/somedatafor_20160719';
When we specify LOCATION '/data/a1/pnt/lte/formatted/evt' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when we want to copy files into that directory through some process like ETL, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement to create each new partition.
If we already know the directory structure of the partition that Hive would create for the next data set (say, here for C=20), we can simply place the data file in that location, like '/data/a1/pnt/lte/formatted/evt/A=2016/B=07/C=20/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE test1;
The above statement will sync up the partitions with the Hive metastore for the table "test1".
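For reference, placing the file for the next data set could look roughly like this (a sketch; data.txt is the hypothetical file from the path above):

hadoop fs -mkdir -p /data/a1/pnt/lte/formatted/evt/A=2016/B=07/C=20
hadoop fs -put data.txt /data/a1/pnt/lte/formatted/evt/A=2016/B=07/C=20/

After the repair statement runs, SHOW PARTITIONS test1 should list the new partition.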
I'm trying to create a bucketed table in Hive using the following command:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
Command is executing successfully: when I load data into this table, it executes successfully and all data is shown when using select * from emp.
However, on HDFS it is only creating one table and only one file is there with all data. That is, there is no folder for specific country records.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp (id int, name string, country string)
clustered by (country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile;
In the above statement I have mentioned 2 buckets; similarly, you can mention any number you want.
Still, you are not done!
After that, while loading data into the table, you also have to give Hive the hint below.
set hive.enforce.bucketing = true;
That should do it.
After this you should be able to see that the number of files created under the table directory is the same as the number of buckets mentioned in the DDL statement.
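A minimal sketch of that load step, assuming the raw rows sit in a plain staging table called emp_stage (that table name is an assumption, not from the question); inserting through a query is what lets Hive distribute rows into the bucket files, whereas a plain LOAD DATA just copies the file as-is:

set hive.enforce.bucketing = true;
-- emp_stage is a hypothetical staging table with the same columns as emp
insert overwrite table emp
select id, name, country from emp_stage;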
Bucketing doesn't create HDFS folders. If you want a separate folder to be created for each country, you should PARTITION the table instead.
Please go through hive partitioning and bucketing in detail.
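For comparison, a partitioned variant would look roughly like this (a sketch, not from the question; each distinct country value gets its own country=<value> subfolder under the table directory):

create table emp_part (id int, name string)
partitioned by (country string)
row format delimited
fields terminated by ','
stored as textfile;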