Multi column bucketing in Hive

Multi column bucketing:
**Table 1:**
create table fact_c (clainno int,uplodateid int,claim_amnt int,svcid int,prodid int) clustered by(prodid,svcid) into 10 buckets row format delimited fields terminated by ',' stored as textfile;
**Table 2:**
create table product(prodid int,prodname string ) clustered by(prodid) into 10 buckets row format delimited fields terminated by ',' stored as textfile;
**Table 3:**
create table svc(svcid int,svcname string ) clustered by(svcid) into 10 buckets row format delimited fields terminated by ',' stored as textfile;
Since Table 1 uses multi-column bucketing on int columns, the keys are summed up and divided by the number of buckets to get the bucket id.
For the query below, how will a bucket join work, since the keys will not land in the corresponding buckets across tables f, p, and s?
Hive Query:
select p.prodid, svcname from fact_c f inner join product p on (f.prodid = p.prodid) inner join svc s on (f.svcid = s.svcid);
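For reference: with the classic (pre-Hive 3) bucketing scheme Hive does not sum the key values; each column's hash is combined (previous hash * 31 + next column's hash) and the result is taken modulo the bucket count. A minimal sketch using the built-in hash() UDF to approximate the bucket ids (an assumption about the hash version in use, so treat it as illustrative only):
-- Approximate bucket ids: fact_c hashes (prodid, svcid) together, while product
-- and svc each hash a single key, so the same prodid generally lands in a
-- different bucket number in each table.
select f.prodid,
       f.svcid,
       pmod(hash(f.prodid, f.svcid), 10) as fact_c_bucket,
       pmod(hash(f.prodid), 10)          as product_bucket,
       pmod(hash(f.svcid), 10)           as svc_bucket
from fact_c f;
Because the join to product is on prodid alone while fact_c is bucketed on (prodid, svcid), the buckets do not line up, and Hive will typically fall back to a regular (non-bucketed) join for that edge.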

Related

Hive mismatched counts after table migration

I need to migrate 2 tables (table A and B) to a new cluster.
I applied the same queries to the 2 tables. Table A works fine, but Table B has mismatched counts: there are more rows in the new cluster. After some investigation, I found the extra rows are NULL rows. But I can't find the cause of this extra-count issue.
My procedure is as below:
Export Hive table
INSERT OVERWRITE LOCAL DIRECTORY
'/path/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0007' null defined as '' stored as textfile
SELECT * FROM export_table_name
WHERE file_date between '2021-01-01' and '2022-01-31'
LIMIT 2100000000;
*One difference between Table A and B: Table B is a lot bigger than A. When I exported Table B, I sliced it in half and exported twice. The queries were WHERE date between '2021-01-01' and '2021-06-30' and WHERE date between '2021-07-01' and '2021-12-31'.
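Before copying anything, it can help to record the source-side counts for exactly the predicates used in the exports, so they can be compared after each later step. A small sketch using the names above (the date column is assumed to be file_date, as in the main export query):
-- Row counts on the source cluster for the two slices of Table B.
select count(*) from export_table_name where file_date between '2021-01-01' and '2021-06-30';
select count(*) from export_table_name where file_date between '2021-07-01' and '2021-12-31';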
SCP the exported files to the new cluster
Create table schema with
CREATE TABLE myTable_temp(
columns
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
stored as textfile;
Import the files to the temp table (non-partitioned)
load data inpath 'myPath' overwrite into table myTable_temp;
*For table B, I imported twice. The query for the second import was load data inpath 'myPath' into table myTable_temp;
Create table schema + one extra column "partition_key" for the actual table
Inject data from the temp table to the actual table (partitioned)
insert into table myTable partition(partition_key) select *, concat(year(file_date)) partition_key from myTable_temp;
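A dynamic-partition insert like the one above normally needs dynamic partitioning switched on first, and a quick NULL check on the temp table shows whether the extra rows already exist before the partitioned insert. A sketch (property names as in stock Hive, table and column names from the steps above):
-- Settings usually required for insert ... partition(partition_key) select ...
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- If these rows already exist in the temp table, the problem is in the
-- export/load steps (delimiters, embedded newlines), not in the final insert.
select count(*) from myTable_temp where file_date is null;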

Hive - partition external table by content of data

I have a bunch of gzipped files in HDFS under directories of the form /home/myuser/salesdata/some_date/ALL/<country>.gz, for instance /home/myuser/salesdata/20180925/ALL/us.gz
The data is of the form
<country> \t count1,count2,count3
So essentially it's first tab separated and then I need to extract the comma separated values into separate columns
I'd like to create an external table, partitioning this by country, year, month and day. The size of the data is pretty huge, potentially 100s of TB and so I'd like to have an external table itself, rather than having to duplicate the data by importing it into a standard table.
Is it possible to achieve this by using only an external table?
Considering that country is separated by a tab '\t' and the other fields are separated by ',', this is what you can do.
You can create a temporary table which has the first column as a string and the rest as an array.
create external table temp.test_csv (country string, count array<int>)
row format delimited
fields terminated by "\t"
collection items terminated by ','
stored as textfile
location '/apps/temp/table';
Now if you drop your files into the /apps/temp/table location you should be able to select the data as mentioned below.
select country, count[0] as count_1, count[1] count_2, count[2] count_3 from temp.test_csv
Now, to create partitions, create another table as mentioned below.
drop table temp.test_csv_orc;
create table temp.test_csv_orc ( count_1 int, count_2 int, count_3 int)
partitioned by(year string, month string, day string, country string)
stored as orc;
And load the data from temporary table into this one.
insert into temp.test_csv_orc partition(year="2018", month="09", day="28", country)
select count[0] as count_1, count[1] count_2, count[2] count_3, country from temp.test_csv
I have taken country as a dynamic partition since it comes from the file; year, month and day do not, so they are static.
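As a small follow-up (a sketch; property name as in stock Hive): the mixed static/dynamic insert above needs dynamic partitioning enabled, and the created country partitions can then be listed per day.
-- Enable dynamic partitions for the country column (year/month/day stay static).
set hive.exec.dynamic.partition=true;

-- After the insert, list which country partitions exist for that day.
show partitions temp.test_csv_orc partition(year='2018', month='09', day='28');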

Hive - Create Table statement with 'select query' and 'fields terminated by' commands

I want to create a table in Hive using a select statement which takes a subset of the data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there were no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example, I am trying to do something like:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working though. I know the alternate way is to create a table structure with field names and the "FIELDS TERMINATED BY '|'" command and then load the data.
But is there any other way to combine the two into a single query that enables me to create a table with filtered data from another table and also with a field separator ?
Put row format delimited ... in front of AS select.
Do it like this (adapt the query to yours):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *, count(1) from t1 group by id, name;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
Here is the result:
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.
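Applied to the original example, a sketch of the same idea (every table setting goes before AS SELECT):
CREATE TABLE sample_db.out_table
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
AS
SELECT * FROM sample_db.in_table
WHERE country = 'Canada';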

Hive: How to create multiple tables from multiple files or count entries per file

My goal is to combine entries from multiple files into 1 table, but I am having some trouble getting there.
So I understand that you can add all entries into a table by doing:
CREATE EXTERNAL TABLE tablename
(
teams STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data';
Where each text file's data looks something like:
student#A18645
student#V86541
student#X78543
However, the code above combines all the data from the N files in my directory, which makes it hard to line the entries up. What I want is for the first entries from all files to be concatenated together into a single string and entered into a new table, and so forth.
I have tried to number each entry using ROW_NUMBER(), but that gives each entry's position in the table rather than its position in the file.
Therefore, is there a way I can create a table per file, number the entries, and join all the tables together so that in the end, I can get a table that looks like:
number students
1 student#A18645,student#D94655,...student#S45892
2 student#V86541,student#D45645,...student#F46444
3 student#X78543,student#T78722,...student#M99846
Or rather, a way to number each entry as the line number of the file it came from so I can do an inner join on my table.
Note: the number of files can vary so I do not have a set number of files to loop through
You can use this approach to build the final table.
Let's say that these are two files for two teams.
-- team1.txt
student#A18645
student#V86541
student#X78543
-- team2.txt
student#P20045
student#Q30041
student#R40043
Load them in HDFS, each file into a separate directory
hadoop fs -mkdir /hive-data/team1
hadoop fs -mkdir /hive-data/team2
hadoop fs -put team1.txt /hive-data/team1
hadoop fs -put team2.txt /hive-data/team2
Create two tables in Hive, one for each team
CREATE EXTERNAL TABLE team1
(
teams STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data/team1';
CREATE EXTERNAL TABLE team2
(
teams STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data/team2';
Create the final table in Hive to hold combined data
CREATE TABLE teams
(
team_number INT,
students STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Populate the final table
(Since the comma is the field delimiter, a colon is used to join the list of students into a single string.)
INSERT INTO teams (team_number, students)
SELECT 1, CONCAT_WS(":", COLLECT_LIST(teams)) FROM team1;
INSERT INTO teams (team_number, students)
SELECT 2, CONCAT_WS(":", COLLECT_LIST(teams)) FROM team2;
Verify the final table
SELECT * FROM teams;
teams.team_number teams.students
1 student#A18645:student#V86541:student#X78543
2 student#P20045:student#Q30041:student#R40043
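A hedged alternative that keeps the question's single external table: Hive exposes each row's source file and byte offset through the virtual columns INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE, so entries can be numbered per file and then collected by line number. A sketch (assumes plain, uncompressed text files so the offset order matches the line order; tablename and teams are the names from the question):
-- Number each student by its position within its own file, then build one
-- comma-separated string per line number across all files.
SELECT line_no AS number,
       concat_ws(',', collect_list(teams)) AS students
FROM (
  SELECT teams,
         row_number() OVER (PARTITION BY INPUT__FILE__NAME
                            ORDER BY BLOCK__OFFSET__INSIDE__FILE) AS line_no
  FROM tablename
) numbered
GROUP BY line_no;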

Dynamic partition cannot be the parent of a static partition

I'm trying to aggregate data from 1 table (whose data is re-calculated monthly) into another table (holding the same data but for all time) in Hive. However, whenever I try to combine the data, I get the following error:
FAILED: SemanticException [Error 10094]: Line 3:74 Dynamic partition cannot be the parent of a static partition 'category'
The code I'm using to create the tables is below:
create table my_data_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
location '${hiveconf:OUTPUT}/${hiveconf:DATE_DIR}/my_data_by_category';
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
The code I'm using to populate the tables is below:
insert overwrite table my_data_by_category partition(category)
select mdcc.col1, mdcc2.col2, pcc.category
from my_data_col1_counts_by_category mdcc
left outer join my_data_col2_counts_by_category mdcc2 where mdcc.category = mdcc2.category
group by mdcc.category, mdcc.col1, mdcc2.col2;
insert overwrite table my_data_lifetime_total_by_category partition(category)
select mdltc.col1 + mdc.col1 as col1, mdltc.col2 + mdc.col2, mdc.category
from my_data_lifetime_total_by_category mdltc
full outer join my_data_by_category mdc on mdltc.category = mdc.category
where mdltc.col1 is not null and mdltc.col2 is not null;
The frustrating part is that I have this data partitioned on another column, and repeating this same process with that partition works without a problem. I've tried Googling the "Dynamic partition cannot be the parent of a static partition" error message, but I can't find any guidance on what causes this or how it can be fixed. I'm pretty sure that there's an issue with the way 1 or more of my tables is set up, but I can't see what. What's causing this error and what can I do to resolve it?
There is no partitioned by clause in this create statement. Since you are trying to insert into a non-partitioned table using a partition clause in the insert statement, it fails.
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
No, you don't need to add a partition clause.
You are doing group by mdcc.category in insert overwrite table my_data_by_category partition(category) ... but you are not using any UDAF.
Are you sure you can do this?
I think that if you change your second create statement to:
create table if not exists my_data_lifetime_total_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
you should then be free of errors.
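For completeness, a sketch of the question's second insert with dynamic partitioning explicitly enabled (the partition column category must stay last in the select list; property names as in stock Hive):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table my_data_lifetime_total_by_category partition(category)
select mdltc.col1 + mdc.col1 as col1, mdltc.col2 + mdc.col2, mdc.category
from my_data_lifetime_total_by_category mdltc
full outer join my_data_by_category mdc on mdltc.category = mdc.category
where mdltc.col1 is not null and mdltc.col2 is not null;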