This may be trivial, but I am new to Hive and don't know how to do this.
I am trying to create an external table in Hive in which one of the columns has an array data type. My sample data looks like this:
QuRYeRnAuXM EvilSquirrelPictures 1135 Pets&Animals 252 1075 4.96 46 86 gFa1YMEJFag nRcovJn9xHg 3TYqkBJ9YRk
Here is the problem: my columns are separated by '\t', and I want the last three values (gFa1YMEJFag nRcovJn9xHg 3TYqkBJ9YRk) placed in a single column named related_video_ids. I came up with the following DDL for my table:
create external table youtube(video_id char(12), uploader string, time_of_est int,
category string, len int, views int, ratings float, num_of_rat int, comments int,
related_video_ids array<struct<col1:string,col2:string,col3:string>>) row format delimited
fields terminated by '\t'
collection items terminated by '\t'
location '/youtube/';
But every time I create the table with the above DDL and query it in Hive, I get results like:
QuRYeRnAuXM EvilSquirrelPictures 1135 Pets & Animals 252 1075 4.96000003815 46 86 [{"col1":"gFa1YMEJFag","col2":null,"col3":null}]
Please help me to get correct DDL for my table.
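One way to see why only the first ID arrives: when the field delimiter and the collection delimiter are the same character, the row is split into fields on every tab first, so the tenth column receives a single token and the trailing tokens have no declared column left. The Python sketch below imitates that splitting order. It is an illustration under that assumption, not Hive's actual SerDe code; `split_like_hive` is a hypothetical helper name, and the sample row is assumed to be tab-separated as described in the question.

```python
# Sample row from the question, assumed to be tab-separated.
ROW = ("QuRYeRnAuXM\tEvilSquirrelPictures\t1135\tPets&Animals\t252\t1075\t4.96"
       "\t46\t86\tgFa1YMEJFag\tnRcovJn9xHg\t3TYqkBJ9YRk")

NUM_COLUMNS = 10  # number of columns declared in the DDL above


def split_like_hive(row, num_columns, field_sep="\t", item_sep="\t"):
    """Split on the field delimiter first, then split the last declared
    column on the collection delimiter. Tokens beyond the declared
    columns are set aside as dropped."""
    fields = row.split(field_sep)
    kept = fields[:num_columns]         # tokens that map to a declared column
    dropped = fields[num_columns:]      # tokens with no column left
    related = kept[-1].split(item_sep)  # the array column sees one token only
    return kept[:-1], related, dropped


scalars, related_video_ids, dropped = split_like_hive(ROW, NUM_COLUMNS)
print(related_video_ids)  # ['gFa1YMEJFag']
print(dropped)            # ['nRcovJn9xHg', '3TYqkBJ9YRk']
```

If the related IDs could be written with a separator other than the field delimiter (a comma, for example), the same two-step split would keep all three IDs in the last column.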
Related
I have three files in the location '/user/hive/warehouse/dig.db/'. Let the files be:
text1.csv
text2.csv
text3.csv
How do I create a table in Impala from these three files (which all have the same header and schema)?
I have tried the following, but it works for only one file, not all three CSV files. The data from the other two files is stored under a single field.
create external table dig.Tunnel (
tbm string,
year smallint,
month tinyint,
day tinyint,
hour tinyint,
dist decimal(8,2),
lon decimal(9,6),
lat decimal(9,6))
row format delimited fields terminated by ","
location '/user/hive/warehouse/dig.db/'
Yes, to your last comment. All files in the table location must share the same format, and that format must match the one you have defined in the Impala table definition; otherwise you won't see the data correctly.
I have a bunch of gzipped files in HDFS under directories of the form /home/myuser/salesdata/some_date/ALL/<country>.gz , for instance /home/myuser/salesdata/20180925/ALL/us.gz
The data is of the form
<country> \t count1,count2,count3
So essentially each line is tab-separated first, and then I need to extract the comma-separated values into separate columns.
I'd like to create an external table, partitioning this by country, year, month and day. The size of the data is pretty huge, potentially 100s of TB and so I'd like to have an external table itself, rather than having to duplicate the data by importing it into a standard table.
Is it possible to achieve this by using only an external table?
Assuming the country is separated from the rest of the line by a tab ('\t') and the other fields are separated by commas, this is what you can do.
You can create a temporary table whose first column is a string and whose remaining values form an array.
create external table temp.test_csv (country string, count array<int>)
row format delimited
fields terminated by "\t"
collection items terminated by ','
stored as textfile
location '/apps/temp/table';
Now if you drop your files into the /apps/temp/table location, you should be able to select the data as shown below.
select country, count[0] as count_1, count[1] count_2, count[2] count_3 from temp.test_csv
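To make the two-level split concrete, here is a small Python sketch of what the table definition and the select above do to one line: the tab separates the country from the rest, and the commas separate the array items. The sample line `us\t10,20,30` and the helper name `parse_line` are made up for illustration.

```python
def parse_line(line):
    """Return (country, counts) for a line shaped '<country>\\tc1,c2,c3'."""
    # First level: the tab separates the country from the counts.
    country, _, rest = line.rstrip("\n").partition("\t")
    # Second level: commas separate the items of the array column.
    counts = [int(value) for value in rest.split(",")]
    return country, counts


country, count = parse_line("us\t10,20,30\n")  # hypothetical sample line
row = {
    "country": country,
    "count_1": count[0],
    "count_2": count[1],
    "count_3": count[2],
}
print(row)
```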
Now, to create partitions, create another table as shown below.
drop table temp.test_csv_orc;
create table temp.test_csv_orc ( count_1 int, count_2 int, count_3 int)
partitioned by(year string, month string, day string, country string)
stored as orc;
Then load the data from the temporary table into this one.
insert into temp.test_csv_orc partition(year="2018", month="09", day="28", country)
select count[0] as count_1, count[1] count_2, count[2] count_3, country from temp.test_csv
I have taken country as a dynamic partition because it comes from the file; the other partition columns do not, so they are static.
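The static values in the partition clause have to come from somewhere. Given the directory layout in the question (/home/myuser/salesdata/<date>/ALL/<country>.gz), they could be derived from the path. The Python sketch below shows one way to do that, assuming the date folder is always eight digits in YYYYMMDD form; `partition_values` and `static_partition_clause` are hypothetical helper names.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Derive year, month, day and country from a path shaped
    .../<YYYYMMDD>/ALL/<country>.gz (layout assumed from the question)."""
    p = PurePosixPath(path)
    date = p.parts[-3]  # the folder two levels above the file
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"unexpected date folder: {date!r}")
    return {
        "year": date[:4],
        "month": date[4:6],
        "day": date[6:],
        "country": p.stem,  # file name without the .gz suffix
    }


def static_partition_clause(path):
    """Build the partition clause with static year/month/day and a
    dynamic country, in the same shape as the insert statement above."""
    v = partition_values(path)
    return 'partition(year="{year}", month="{month}", day="{day}", country)'.format(**v)


values = partition_values("/home/myuser/salesdata/20180925/ALL/us.gz")
clause = static_partition_clause("/home/myuser/salesdata/20180925/ALL/us.gz")
print(values)
print(clause)
```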
I am trying to create a Hive table, but due to the folder structure it is going to take hours just to partition.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the example below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, to avoid the need to read through all folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Qubole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
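One possible alternative to recovering every partition is to register only the wanted partitions explicitly with alter table ... add partition ... location. The Python sketch below generates such statements for one year, every month, and one report type. It rests on several assumptions that are not in the question: the list of child companies is known up front (the names `company_a` and `company_b` are made up), the folders follow the <child_company>/<year>/<month>/<report> layout without zero-padded months, and `add_partition_statements` is a hypothetical helper. Whether this is actually faster in a given environment has not been verified here.

```python
BASE = "s3://mycompany-myreports/parent/partner_company-12345"


def add_partition_statements(child_companies, year, months, report,
                             table="table_1", base=BASE):
    """Build one 'alter table ... add partition' statement per wanted
    partition, pointing each at its assumed folder."""
    statements = []
    for company in child_companies:
        for month in months:
            # Assumed layout: <base>/<child_company>/<year>/<month>/<report>
            location = f"{base}/{company}/{year}/{month}/{report}"
            statements.append(
                f"alter table {table} add if not exists "
                f"partition (child_company='{company}', year={year}, "
                f"month={month}, report='{report}') "
                f"location '{location}';"
            )
    return statements


stmts = add_partition_statements(
    ["company_a", "company_b"],  # hypothetical child companies
    2016,
    range(1, 13),
    "inventory",
)
print(len(stmts))
print(stmts[0])
```

The generated statements would replace the recover partitions step, so only the listed folders are registered in the metastore.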
My input data is as follows:
1,srinivas,courtthomas,memphis
2,vindhya,courtthomas,memphis
3,srinivas,courtthomas,kolkata
4,vindhya,courtthomas,memphis
And I have created the following queries:
create EXTERNAL table seesaw (id int,name string,location string) partitioned by (address string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile LOCATION '/seesaw';
LOAD DATA INPATH '/sampledoc' OVERWRITE INTO TABLE seesaw PARTITION (address = 'Memphis');
When I run a select query, the result is as follows:
Select * from seesaw;
OK
1 srinivas courtthomas Memphis
2 vindhya courtthomas Memphis
3 srinivas courtthomas Memphis
4 vindhya courtthomas Memphis
I really don't understand why all the rows show Memphis at the end.
Read your code closely:
create EXTERNAL table seesaw (id int,name string,location string)
Notice that there are only three columns, id, name and location.
Your data, however,
1,srinivas,courtthomas,memphis
2,vindhya,courtthomas,memphis
3,srinivas,courtthomas,kolkata
4,vindhya,courtthomas,memphis
has four columns. Something's fishy here.
LOAD DATA INPATH '/sampledoc'
OVERWRITE INTO TABLE seesaw
PARTITION (address = 'Memphis');
Here you're asking Hive to load the whole file into the single partition address = 'Memphis'. Since the table declares only three columns, the fourth field of each line is ignored and the partition value Memphis is shown for every row, including the one whose data says kolkata. Unsurprisingly, the result is not what you want.
If you are using an external table, you will need to manually create a folder for each partition. In your case, create two folders, [address=Memphis] and [address=kolkata], copy the corresponding input data files into each folder, and then add the partitions to the metadata as follows:
ALTER TABLE seesaw ADD PARTITION(address='Memphis');
ALTER TABLE seesaw ADD PARTITION(address='kolkata');
Refer to this article for a simple example of how to do this: hive-external-table-with-partitions
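The manual step of sorting the input into one folder per partition can be sketched in Python as follows. The sketch only groups the lines in memory, using the sample data from the question; it strips the address from each line because the table declares three data columns and the address becomes the partition value. `group_by_partition` is a hypothetical helper, and the folder names use the address exactly as it appears in the data (lowercase), which would have to agree with the values used in the add partition statements.

```python
from collections import defaultdict

# Sample input from the question.
INPUT = """1,srinivas,courtthomas,memphis
2,vindhya,courtthomas,memphis
3,srinivas,courtthomas,kolkata
4,vindhya,courtthomas,memphis
"""


def group_by_partition(text):
    """Map each partition folder name to the three-column lines that
    belong in it, using the last field as the partition value."""
    folders = defaultdict(list)
    for line in text.splitlines():
        if not line:
            continue
        # Everything before the last comma is the row; the rest is the address.
        record, _, address = line.rpartition(",")
        folders[f"address={address}"].append(record)
    return dict(folders)


folders = group_by_partition(INPUT)
for folder, lines in folders.items():
    print(folder, lines)
```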
I've been working on Hive partitioning for the past few days. Here is an example I created:
Table - Transactions (Non - Partitioned Managed Table):
CREATE TABLE TRANSACTIONS (
txnno INT,
txndate STRING,
custid INT,
amount DOUBLE,
category STRING,
product STRING,
city STRING,
state STRING,
spendby STRING)
row format delimited fields terminated by ',';
I loaded the data into this table using the load command.
I then created another table as follows:
Table - Txnrecordsbycat (Partitioned Managed Table):
CREATE TABLE TXNRECORDSBYCAT(txnno INT, txndate STRING, custno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING)
partitioned by (category STRING)
clustered by (state) INTO 10 buckets
row format delimited fields terminated by ',';
I used the following query to load the data from the Transactions table into the Txnrecordsbycat table:
FROM TRANSACTIONS txn
INSERT OVERWRITE TABLE TXNRECORDSBYCAT PARTITION(category)
select txn.txnno, txn.txndate, txn.custid, txn.amount, txn.product, txn.city, txn.state, txn.spendby, txn.category
DISTRIBUTE BY CATEGORY;
As long as I run simple queries such as select * from Transactions and select * from Txnrecordsbycat, the queries are more efficient (i.e. take less execution time) on the partitioned table than on the non-partitioned table.
However, as soon as my queries become a little more complex, for example select count(*) from table, the query becomes less efficient (i.e. takes more time) on the partitioned table.
Any idea why this might be happening?
Many thanks for the help.