I have a Hive table like:
col1 col2
1 ["apple", "orange"]
1 ["orange", "banana"]
1 ["mango"]
2 ["apple"]
2 ["apple", "orange"]
Their data types are:
col1 int
col2 array<string>
I want to query something like:
select col1, concat(col2) from table group by col1;
Output should be :
1 ["apple", "orange", "banana", "mango"]
2 ["apple", "orange"]
Is there any function in Hive to do this?
Also, I write this data to CSV, and when I read it back as a dataframe I get the col2 dtype as object. Is there a way to output it as an array?
Try exploding the array, then use the collect_set function while grouping by col1.
Example:
Input:
select * from table;
OK
dd.col1 dd.col2
1 ["apple","orange"]
1 ["mango"]
1 ["orange","banana"]
select col1, collect_set(tt1) as col2 from (
select * from table lateral view explode(col2) tt as tt1
) cc
group by col1;
Output:
col1 col2
1 ["apple","orange","mango","banana"]
I'm trying to summarize data in a table:
counting total rows
counting values on specific fields
getting the distinct values on specific fields
and, more importantly, I'm struggling with:
getting the count for each field nested in an object
Given this data:
COL1  COL2
A     0
null  1
B     null
B     null
the expected result from this query would be:
with dummy as (
select 'A' as col1, 0 as col2
union all
select null, 1
union all
select 'B', null
union all
select 'B', null
)
select
count(1) as total
,count(col1) as col1
,array_agg(distinct col1) as dist_col1
--,object_construct(???) as col1_object_count
,count(col2) as col2
,array_agg(distinct col2) as dist_col2
--,object_construct(???) as col2_object_count
from
dummy
TOTAL  COL1  DIST_COL1   COL1_OBJECT_COUNT          COL2  DIST_COL2  COL2_OBJECT_COUNT
4      3     ["A", "B"]  {"A": 1, "B": 2, null: 1}  2     [0, 1]     {0: 1, 1: 1, null: 2}
I've tried several functions inside OBJECT_CONSTRUCT mixed with ARRAY_AGG, but they all failed.
OBJECT_CONSTRUCT can work with several columns, but only when given all of them (*); if you try a SELECT statement inside it, it fails.
Another issue is that analytic functions are not easily accepted by the object and array functions in Snowflake.
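For reference, the two shapes OBJECT_CONSTRUCT does accept are explicit key/value pairs and the all-columns star (a minimal sketch; the dummy view is created below):
-- explicit key/value pairs
select object_construct('A', 1, 'B', 2);
-- all columns of each row at once
select object_construct(*) from dummy;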
You could use Snowflake Scripting or Snowpark for this, but here's a solution that is somewhat flexible, so you can apply it to different tables and column sets.
Create test table/view:
Create or Replace View dummy as (
select 'A' as col1, 0 as col2
union all
select null, 1
union all
select 'B', null
union all
select 'B', null
);
Set session variables for the table and column names:
set tbname = 'DUMMY';
set colnames = '["COL1", "COL2"]';
Create a view that generates the required table_column_summary data:
Create or replace View table_column_summary as
with
-- Create table of required column names
cn as (
select VALUE::VARCHAR CNAME
from table(flatten(input => parse_json($colnames)))
)
-- Convert rows into objects
,ro as (
select
object_construct_keep_null(*) row_object
-- using identifier on session variable to dynamically supply table/view name
from identifier($tbname) )
-- Flatten row objects into key/values
,rof as (
select
key col_name,
ifnull(value,'null')::VARCHAR col_value
from ro, lateral flatten(input => row_object), cn
-- You will only need this filter if you need a subset
-- of columns from the source table/query summarised
where col_name = cn.cname)
-- Get the column value distinct value counts
,cdv as (
select col_name,
col_value,
sum(1) col_value_count
from rof
group by 1,2
)
-- and derive required column level stats and combine with cdv
,cv as (
select
(select count(1) from identifier($tbname)) total,
col_name,
object_construct('COL_COUNT', count(col_value) ,
'COL_DIST', array_agg(distinct col_value),
'COL_OBJECT_COUNT', object_agg(col_value,col_value_count)) col_values
from cdv
group by 1,2)
-- Return result
Select * from cv;
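To sanity-check the view before running the final queries below, you can select from it directly; it should return one row per summarised column, carrying the total row count, the column name, and its col_values object:
Select * from table_column_summary;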
Use this final query if you want a solution that works flexibly with any table/columns provided as input...
Select total, object_agg(col_name, col_values) col_values_obj
From table_column_summary
Group by 1;
Or use this final query if you want the fixed-column output described in your question (ARRAY_AGG in the PIVOT produces a one-element array per cell, hence the [0] indexing)...
Select total,
COL1[0]:COL_COUNT COL1,
COL1[0]:COL_DIST DIST_COL1,
COL1[0]:COL_OBJECT_COUNT COL1_OBJECT_COUNT,
COL2[0]:COL_COUNT COL2,
COL2[0]:COL_DIST DIST_COL2,
COL2[0]:COL_OBJECT_COUNT COL2_OBJECT_COUNT
from table_column_summary
PIVOT ( ARRAY_AGG ( col_values )
FOR col_name IN ( 'COL1', 'COL2' ) ) as pt (total, col1, col2);
I have a row of data and I want to turn this row into a column so I can use a cursor to run through the data one by one. I have tried to use
SELECT * FROM TABLE(PIVOT(TEMPROW))
but I get a 'PIVOT' Invalid Identifier error.
I have also tried that same syntax but with
('select * from TEMPROW')
Everything I see using pivot always uses count or sum, but I just want this one single row of all VARCHAR2 values to turn into a column.
My row would look something like this:
ABC | 123 | aaa | bbb | 111 | 222 |
And I need it to turn into this:
ABC
123
aaa
bbb
111
222
My code is similar to this:
BEGIN
OPEN C_1 FOR SELECT * FROM TABLE(PIVOT( 'SELECT * FROM TEMPROW'));
LOOP
FETCH C_1 INTO TEMPDATA;
EXIT WHEN C_1%NOTFOUND;
DBMS_OUTPUT.PUT_LINE(1);
END LOOP;
CLOSE C_1;
END;
You have to unpivot to convert a whole row into a single column:
select val from temprow
UNPIVOT
(val for col in (col1, col2, col3, col4, col5, col6));
or use union all (plain union would collapse duplicate values and reorder them), but for that you need to add the column names manually, like:
Select * from (
Select col1 from table
union all
Select col2 from table
union all ...
Select coln from table
)
One option for unpivoting would be numbering the columns through decode() and cross joining with a query that generates the column numbers:
select decode(myId, 1, col1,
2, col2,
3, col3,
4, col4,
5, col5,
6, col6 ) as result_col
from temprow
cross join (select level AS myId FROM dual CONNECT BY level <= 6 );
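To see what the row-generator side of the cross join produces on its own, you can run it in isolation; CONNECT BY LEVEL simply emits the numbers 1 through 6, one per row, which decode() then maps to the corresponding column:
select level as myId
from dual
connect by level <= 6;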
or use a query with the unpivot keyword, keeping in mind that the common expression for the column (namely col in this case) must have the same datatype as each corresponding expression:
select result_col from
(
select col1, to_char(col2) as col2, col3, col4,
to_char(col5) as col5, to_char(col6) as col6
from temprow
)
unpivot (result_col for col in (col1,col2,col3,col4,col5,col6));
I have a table in Hive with 2 columns, col1 array<int> and col2 array<double>. The data looks like this:
col1 col2
[1,2,3,4,5] [0.43,0.01,0.45,0.22,0.001]
I want to sort col2 in ascending order, and col1 should be reordered to match, e.g.:
col1 col2
[5,2,4,1,3] [0.001,0.01,0.22,0.43,0.45]
Explode both arrays, sort, then aggregate the arrays again. Use sort by in the subquery before collect_list to order the elements:
with your_data as(
select array(1,2,3,4,5) as col1, array(0.43,0.01,0.45,0.22,0.001) as col2
)
select original_col1,original_col2, collect_list(c1_x) as new_col1, collect_list(c2_x) as new_col2
from
(
select d.col1 as original_col1,d.col2 as original_col2, c1.x as c1_x, c2.x as c2_x, c1.i as c1_i
from your_data d
lateral view posexplode(col1) c1 as i,x
lateral view posexplode(col2) c2 as i,x
where c1.i=c2.i
distribute by original_col1,original_col2
sort by c2_x
)s
group by original_col1,original_col2;
Result:
OK
original_col1 original_col2 new_col1 new_col2
[1,2,3,4,5] [0.43,0.01,0.45,0.22,0.001] [5,2,4,1,3] [0.001,0.01,0.22,0.43,0.45]
Time taken: 34.642 seconds, Fetched: 1 row(s)
Edit: here is a simplified version of the same script. You can do without the second posexplode and reference the element directly by position instead: d.col2[c1.i] as c2_x
with your_data as(
select array(1,2,3,4,5) as col1, array(0.43,0.01,0.45,0.22,0.001) as col2
)
select original_col1,original_col2, collect_list(c1_x) as new_col1, collect_list(c2_x) as new_col2
from
(
select d.col1 as original_col1,d.col2 as original_col2, c1.x as c1_x, d.col2[c1.i] as c2_x, c1.i as c1_i
from your_data d
lateral view posexplode(col1) c1 as i,x
distribute by original_col1,original_col2
sort by c2_x
)s
group by original_col1,original_col2;
I have a SQL table (actually a BigQuery table) that has a huge number of columns (over a thousand). I want to quickly find the min and max value of each column. Is there a way to do that?
It is impossible for me to list all the columns. I'm looking for a way to do something like
SELECT MAX(*) FROM mytable;
and then running
SELECT MIN(*) FROM mytable;
I have been unable to Google a way of doing that. Not sure that's even possible.
For example, if my table has the following schema:
col1 col2 col3 .... col1000
the (say, max) query should return
Row col1 col2 col3 ... col1000
1 3 18 0.6 ... 45
and the min query should return (say)
Row col1 col2 col3 ... col1000
1 -5 4 0.1 ... -5
The numbers are just for illustration. The column names could be different strings and not easily scriptable.
See the example below for BigQuery Standard SQL - it converts each row to a JSON string with TO_JSON_STRING, extracts all the values with a regex, and aggregates over them, so it works for any number of columns and does not require explicitly referencing column names:
#standardSQL
WITH `project.dataset.mytable` AS (
SELECT 1 AS col1, 2 AS col2, 3 AS col3, 4 AS col4 UNION ALL
SELECT 7,6,5,4 UNION ALL
SELECT -1, 11, 5, 8
)
SELECT
MIN(CAST(value AS INT64)) AS min_value,
MAX(CAST(value AS INT64)) AS max_value
FROM `project.dataset.mytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'":(.*?)(?:,"|})')) value
with result
Row min_value max_value
1 -1 11
Note: if your columns are of STRING data type, you should remove the CAST ... AS INT64.
Or, if they are FLOAT64, replace INT64 with FLOAT64 in the CAST function.
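For example, the FLOAT64 variant would be (identical query, only the CAST target changes):
#standardSQL
SELECT
MIN(CAST(value AS FLOAT64)) AS min_value,
MAX(CAST(value AS FLOAT64)) AS max_value
FROM `project.dataset.mytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'":(.*?)(?:,"|})')) value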
Update
Below is an option to get the MIN/MAX for each column and present the result as an array of the respective values, in the order of the columns:
#standardSQL
WITH `project.dataset.mytable` AS (
SELECT 1 AS col1, 2 AS col2, 3 AS col3, 14 AS col4 UNION ALL
SELECT 7,6,5,4 UNION ALL
SELECT -1, 11, 5, 8
), temp AS (
SELECT pos, MIN(CAST(value AS INT64)) min_value, MAX(CAST(value AS INT64)) max_value
FROM `project.dataset.mytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'":(.*?)(?:,"|})')) value WITH OFFSET pos
GROUP BY pos
)
SELECT 'min_values' stats, TO_JSON_STRING(ARRAY_AGG(min_value ORDER BY pos)) vals FROM temp UNION ALL
SELECT 'max_values', TO_JSON_STRING(ARRAY_AGG(max_value ORDER BY pos)) FROM temp
with result
Row stats vals
1 min_values [-1,2,3,4]
2 max_values [7,11,5,14]
Hope this is something you can still apply to whatever your final goal is.
Let's say I have a pivoted, sorted dataset like this:
ID Col1 Col2
1 a 11
2 b 22
3 c 33
4 d 44
5 e 55
When I make a paging call returning two records at a time, I would get the first two rows.
Let's say I want to return the same data but without pivoting it, so my data set looks like:
ID Col Val
1 Col1 a
2 Col1 b
3 Col1 c
4 Col1 d
5 Col1 e
1 Col2 11
2 Col2 22
3 Col2 33
4 Col2 44
5 Col2 55
I would like to write an SQL statement that returns the same data as in the first example, but without pivoting the data first.
Some additional challenges:
1) There could be n columns, not just two.
2) It should also support a filter on all the columns. This part I have solved already; see below.
Filter on pivoted data
WHERE Col1 in ('a', 'b', 'c')
AND Col2 in ('11', '22')
Filter on unpivoted data
WHERE ((Col = 'Col1' and Val in ('a', 'b', 'c')) or Col != 'Col1')
AND ((Col = 'Col2' and Val in ('11', '22')) or Col != 'Col2')
Both filters return the same results.
The filter part I have figured out already; I am stuck on the sorting and paging.
SQL, as a standard, doesn't support such operations. If you want to handle arbitrarily many columns when reformatting the data, use something like Perl's DBI interface, which can tell you the names of the columns for any table. From there you can generate your table-creation statement.
To create your second table, the insert will take the form:
INSERT INTO newtable (id, col, val)
SELECT id, 'Col1', Col1 from oldtable
UNION
SELECT id, 'Col2', Col2 from oldtable;
Just create an additional UNION SELECT... for each column you want to include.
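If there are many columns, here's a rough sketch of automating that step (assuming an engine that exposes information_schema; oldtable and the VARCHAR cast are just illustrative): generate the UNION branches from the catalog, then run the generated text yourself.
-- emits one 'SELECT id, <name>, <col> FROM oldtable UNION ALL' line per column;
-- strip the trailing UNION ALL before executing the generated statement
SELECT 'SELECT id, ''' || column_name || ''' AS col, CAST('
       || column_name || ' AS VARCHAR(255)) AS val FROM oldtable UNION ALL'
FROM information_schema.columns
WHERE table_name = 'oldtable'
  AND column_name <> 'id';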
As for your filter query, you're making it unnecessarily complicated. Your query of:
SELECT * FROM newtable
WHERE ((Col = 'Col1' and Val in ('a', 'b', 'c')) or Col != 'Col1')
AND ((Col = 'Col2' and Val in ('11', '22')) or Col != 'Col2')
Can be rewritten as
SELECT * from newtable
WHERE ( Col = 'Col1' and Val in ('a','b','c') )
OR ( Col = 'Col2' and Val in ('11','22') )
Each separate OR'd clause doesn't interfere with the others.
I also don't understand why people try to work such travesties in SQL. It appears that you're trying to turn a reasonable schema into something akin to a key/value store, which may currently be all the rage with the kids nowadays, but you should really learn how to use the full power of SQL with good data modeling.