Max function on Hive buckets

I have a table structure in HIVE as below -
create table if not exists cdp_compl_status
(
EmpNo INT,
RoleCapability STRING,
EmpPUCode STRING,
SBUCode STRING,
CertificationCode STRING,
CertificationTitle STRING,
Competency STRING,
Certification_Type STRING,
Certification_Group STRING,
Contact_Based_Program_Y_N STRING,
ExamDate DATE,
Onsite_Offshore STRING,
AttendedStatus STRING,
Marks INT,
Result STRING,
Status STRING,
txtPlanCategory STRING,
SkillID1 INT,
Complexity STRING
)
CLUSTERED BY (Marks) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES('created on' = '12 Aug');
Now I want to query MAX(MARKS) from every bucket in the table. If I do
SELECT MAX(MARKS) FROM cdp_compl_status;
it shows the maximum Marks for the whole table. Is there any way I can find out MAX(MARKS) for every bucket?

Since you have divided the table into 5 buckets, Hive assigns each row to a bucket by hashing the clustering column modulo the number of buckets; for an INT column such as Marks the hash is the value itself, so:

marks % 5 == 0 goes into the 1st bucket
marks % 5 == 1 goes into the 2nd bucket
marks % 5 == 2 goes into the 3rd bucket
marks % 5 == 3 goes into the 4th bucket
marks % 5 == 4 goes into the 5th bucket

So you need to write 5 queries like this one:
SELECT MAX(marks) FROM cdp_compl_status WHERE marks % 5 = 0; -- for max in the first bucket
I guess this should do it.
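If you would rather not run five separate queries, the same per-bucket maxima can be computed in one pass by grouping on the bucketing expression. A sketch, assuming the default bucketing hash for an INT column (the value itself) and non-negative marks:

SELECT marks % 5 AS bucket,
       MAX(marks) AS max_marks
FROM cdp_compl_status
GROUP BY marks % 5;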

Use TABLESAMPLE (shown here against a similarly bucketed table, cert_comp_status_buck; substitute your own table name):
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 1 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 2 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 3 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 4 out of 5 on marks);
select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 5 out of 5 on marks);
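These five statements can also be combined into a single labelled result set with UNION ALL; a sketch, again against the cert_comp_status_buck table:

SELECT 1 AS bucket, MAX(marks), MIN(marks), AVG(marks)
FROM cert_comp_status_buck TABLESAMPLE(BUCKET 1 OUT OF 5 ON marks)
UNION ALL
SELECT 2, MAX(marks), MIN(marks), AVG(marks)
FROM cert_comp_status_buck TABLESAMPLE(BUCKET 2 OUT OF 5 ON marks)
UNION ALL
SELECT 3, MAX(marks), MIN(marks), AVG(marks)
FROM cert_comp_status_buck TABLESAMPLE(BUCKET 3 OUT OF 5 ON marks)
UNION ALL
SELECT 4, MAX(marks), MIN(marks), AVG(marks)
FROM cert_comp_status_buck TABLESAMPLE(BUCKET 4 OUT OF 5 ON marks)
UNION ALL
SELECT 5, MAX(marks), MIN(marks), AVG(marks)
FROM cert_comp_status_buck TABLESAMPLE(BUCKET 5 OUT OF 5 ON marks);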

Related

How can the SUBSTR function help me remove values from my column

25779101724|GTG1105-Kibimba: these are telephone numbers, and there are more like this. After the number, the remaining characters are the location of that telephone number. I want to remove everything after the number (the bar, the location, and the space) so that I can query the DB for the numbers which are active among these. I don't want to lose the locations, either, because I will need them to report the active numbers AND their locations. How can I remove the locations, query the active numbers, and then put the locations back?
I am hoping for a response.
From my point of view, by far your best option is to normalize that table and store each piece of information in its own column. Even though you can easily split that string into several parts, joining it to another table will suffer as the number of rows grows.
Anyway, here you are.
Sample data:
with test (msisdn) as
  (select '25779101724|GTG1105-Kibimba' from dual union all
   select '25776030896|BRR1351-Kaberenge2' from dual
  )
-- query begins here:
select
  substr(msisdn, 1, instr(msisdn, '|') - 1) phone_number,
  substr(msisdn, instr(msisdn, '|') + 1) the_rest
from test;

PHONE_NUMBER                   THE_REST
------------------------------ ------------------------------
25779101724                    GTG1105-Kibimba
25776030896                    BRR1351-Kaberenge2
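For the normalization route suggested above, a minimal sketch; msisdn_raw is a hypothetical name for the existing one-column table:

-- split the raw string once, into dedicated columns (msisdn_raw is hypothetical)
create table subscribers as
  select substr(msisdn, 1, instr(msisdn, '|') - 1) as phone_number,
         substr(msisdn, instr(msisdn, '|') + 1)    as location
  from msisdn_raw;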

How to convert column values to single row value in postgresql

Data
id cust_name
1 walmart_ca
2 ikea_mo
2 ikea_ca
2 ikea_in
1 walmart_in
When I do
select id, cust_name from test where id = 2
the query returns the output below:
id cust_name
2 ikea_mo
2 ikea_ca
2 ikea_in
How can I get or store the result as a single column value, as shown below?
id cust_name
2 {ikea_mo,ikea_ca,ikea_in}
You should use the string_agg function; here is an example:
select string_agg(field_1, ',') from db.schema.table;
You have to pass the separator; in your case it's a comma, which is why the call is string_agg(field_1, ',').
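Applied to the sample table above (assuming it is named test, as in your query), with a GROUP BY so each id gets one row:

select id, string_agg(cust_name, ',') as cust_name
from test
where id = 2
group by id;

If you literally want the curly-brace form shown in the desired output, array_agg(cust_name) returns a PostgreSQL array, which displays as {ikea_mo,ikea_ca,ikea_in}.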

How to get first n elements in an array in Hive

I use the split function to create an array in Hive. How can I get the first n elements of the array, and then iterate over that sub-array?
Code example:
select col1 from table
where split(col2, ',')[0:5]
'[0:5]' looks like Python-style slicing, but it doesn't work here.
This is a much simpler way of doing it. There is a UDF here called TruncateArrayUDF.java that can do what you are asking. Just clone the repo from the main page and build the jar with Maven.
Example Data:
| col1 |
----------------------
1,2,3,4,5,6,7
11,12,13,14,15,16,17
Query:
add jar /complete/path/to/jar/brickhouse-0.7.0-SNAPSHOT.jar;
create temporary function trunc as 'brickhouse.udf.collect.TruncateArrayUDF';
select pos
,newcol
from (
select trunc(split(col1, '\\,'), 5) as p
from table
) x
lateral view posexplode(p) explodetable as pos, newcol
Output:
pos | newcol |
-------------------
0 1
1 2
2 3
3 4
4 5
0 11
1 12
2 13
3 14
4 15
This is a tricky one.
First grab the brickhouse jar from here.
Then add it to Hive: add jar /path/to/jars/brickhouse-0.7.0-SNAPSHOT.jar;
Now create the two functions we will be using:
CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';
The query will be:
select a,
n as array_index,
array_index(split(a,','),n) as value_from_Array
from ( select "abc#1,def#2,hij#3" a from dual union all
select "abc#1,def#2,hij#3,zzz#4" a from dual) t1
lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n
Explained:
select "abc#1,def#2,hij#3" a from dual union all
select "abc#1,def#2,hij#3,zzz#4" a from dual
is just selecting some test data; in your case, replace this with your table name.
lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n
numeric_range is a UDTF that returns a table for a given range. In this case I asked for a range between 0 (the default) and the number of elements in the string (calculated as the number of commas + 1).
This way, each row is multiplied by the number of elements in the given column.
array_index(split(a,','),n)
This is exactly like using split(a,',')[n], except that Hive doesn't support a non-constant index expression like this, hence the UDF.
So we get the n-th element for each duplicated row of the initial string, resulting in:
abc#1,def#2,hij#3,zzz#4 0 abc#1
abc#1,def#2,hij#3,zzz#4 1 def#2
abc#1,def#2,hij#3,zzz#4 2 hij#3
abc#1,def#2,hij#3,zzz#4 3 zzz#4
abc#1,def#2,hij#3 0 abc#1
abc#1,def#2,hij#3 1 def#2
abc#1,def#2,hij#3 2 hij#3
If you really want a specific number of elements (say 5), then just use:
lateral view numeric_range(5) n1 as n
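If adding an external jar is not an option, recent Hive versions can get a similar result with the built-in posexplode UDTF by filtering on the returned position. A sketch, assuming your data lives in a column col2 of a hypothetical table t:

select pos, newcol
from (select split(col2, ',') as arr from t) x
lateral view posexplode(arr) e as pos, newcol
where pos < 5;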

hive count and count distinct not correct

I have a table in Hive that has 20 columns and I want to count unique records and all records per hour.
Table looks like:
CREATE EXTERNAL TABLE test1(
log_date string,
advertiser_creatives_id string,
cookieID string
)
STORED AS ORC
LOCATION "/day1orc"
tblproperties ("orc.compress"="ZLIB");
And my query looks like this:
SELECT Hour(log_date),
Count(DISTINCT cookieid) AS UNIQUE,
Count(1) AS impressions
FROM test1
GROUP BY Hour(log_date);
But the results are not correct. I have about 70 million entries, and when I sum the impressions I only get about 8 million, so I suspect the DISTINCT takes too many columns into account.
So how can I fix this so that I get the correct amount of impressions?
Extra information:
hive.vectorized.execution.enabled is undefined so it is not active.
The same query in TEXT format returns even fewer rows (about 2.7 million).
Result of COUNT(*): 70643229
Result of COUNT(cookieID): 70643229
Result of COUNT(DISTINCT cookieID): 1440195
Cheers
I have an example that may be useful for you. I think your "row format delimited fields terminated by" clause has some problems.
I have a text file separated by "\t", like below:
id date value
1 01-01-2014 10
1 03-01-2014 05
1 07-01-2014 40
1 05-01-2014 20
2 05-01-2014 10
but I created a table with only 2 columns, like below:
use tmp;
create table sw_test(id string, td string) row format delimited fields terminated by '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/b.txt' INTO TABLE sw_test;
What do you think the result of "select td from sw_test;" is?
NOT
td
01-01-2014 10
03-01-2014 05
07-01-2014 40
05-01-2014 20
05-01-2014 10
BUT
td
01-01-2014
03-01-2014
07-01-2014
05-01-2014
05-01-2014
So I think your cookieID column contains some special characters, including your defined separator, which shifts the columns.
I hope this can help you.
Good luck!
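To check that theory on your own data, you can count the delimiters per raw line and compare the count against the 20 columns you expect. A sketch, assuming the source files are tab-delimited and sit in a hypothetical location /day1text:

-- map each raw line to a single column (the default field delimiter '\001'
-- leaves tabs intact), then histogram the field counts per line
CREATE EXTERNAL TABLE test1_raw (line STRING)
STORED AS TEXTFILE
LOCATION '/day1text';

SELECT size(split(line, '\t')) AS field_count,
       COUNT(*) AS rows
FROM test1_raw
GROUP BY size(split(line, '\t'));

Any line whose field_count is not 20 is a candidate for the corruption described above.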

List category/subcategory tree and display its sub-categories in the same row

I have a hierarchical table of regions and sub-regions, and I need to list a tree of regions and sub-regions (which is easy), but I also need a column that displays, for each region, all the ids of its sub-regions.
For example:
id name superiorId
-------------------------------
1 RJ NULL
2 Tijuca 1
3 Leblon 1
4 Gavea 2
5 Humaita 2
6 Barra 4
I need the result to be something like:
id name superiorId sub-regions
-----------------------------------------
1 RJ NULL 2,3,4,5,6
2 Tijuca 1 4,5,6
3 Leblon 1 null
4 Gavea 2 6
5 Humaita 2 null
6 Barra 4 null
I have done that by creating a function that returns all the sub-region ids of a region as a comma-separated string (built with STUFF()). But when I select all the regions of a country, for example, the query becomes really, really slow, since the function is executed once for every region.
Does anybody know how to get that in an optimized way?
The function is:
CREATE FUNCTION getSubRegions (@RegionId int)
RETURNS TABLE
AS
RETURN (
    select stuff((SELECT CAST(wine_reg.wine_reg_id as varchar) + ','
                  from (select wine_reg_id
                             , wine_reg_name
                             , wine_region_superior
                        from wine_region as t1
                        where wine_region_superior = @RegionId
                           or exists (select *
                                      from wine_region as t2
                                      where wine_reg_id = t1.wine_region_superior
                                        and wine_region_superior = @RegionId)
                       ) wine_reg
                  ORDER BY wine_reg.wine_reg_name ASC
                  for XML path('')), 1, 0, '') as Sons
)
GO
When we used to build these concatenated lists in the database, we started with a similar approach to yours. Then, when we went looking for speed, we turned them into CLR functions (http://msdn.microsoft.com/en-US/library/a8s4s5dz(v=VS.90).aspx). Nowadays our database is only responsible for storing and retrieving data; this sort of thing lives in the data layer of the application.
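Before moving the work out of the database entirely, it may be worth trying a set-based rewrite that computes every region's descendant list in a single statement instead of calling the function per row. A sketch using a recursive CTE, written against the simplified regions(id, name, superiorId) table from the question rather than the wine_region table:

;WITH descendants AS (
    -- start with each region's direct children
    SELECT superiorId AS ancestor, id AS descendant
    FROM regions
    WHERE superiorId IS NOT NULL
    UNION ALL
    -- then walk down one level at a time
    SELECT d.ancestor, r.id
    FROM descendants d
    JOIN regions r ON r.superiorId = d.descendant
)
SELECT r.id, r.name, r.superiorId,
       STUFF((SELECT ',' + CAST(d.descendant AS varchar(10))
              FROM descendants d
              WHERE d.ancestor = r.id
              ORDER BY d.descendant
              FOR XML PATH('')), 1, 1, '') AS [sub-regions]
FROM regions r;

Regions without children simply get NULL in the sub-regions column, matching the desired output.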