How to derive a new column based on comparisons of other columns in Hive?

I have a huge table in Hive with two columns, A and B.
Rows are considered identical when A, B, or both have the same value.
I would like to build a new column and assign a value based on this comparison:
A B
-- --
a b
a c
d b
p q
Result :
A B New_Col
-- -- -----
a b id1
a c id1
d b id1
p q id2
Any efficient solution?

You can achieve this by using conditional functions in Hive in your SELECT statement:
SELECT A, B, IF(A == 'a' OR B == 'b', 'id1', 'id2') AS New_Col FROM huge_table;
Here's how to create a new_huge_table from your huge_table with the new, derived column New_Col:
CREATE TABLE my_database.new_huge_table (A STRING, B STRING, New_Col STRING);
INSERT OVERWRITE TABLE my_database.new_huge_table
SELECT A, B, IF(A == 'a' OR B == 'b', 'id1', 'id2') AS New_Col FROM huge_table;
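Note that the IF expression above hard-codes the values 'a' and 'b' from the sample. If the intent is that rows linked through any shared A or B value (directly or through a chain of other rows) get the same id, the task looks like finding connected components. Below is a minimal sketch of that idea in plain Python using union-find; it runs outside Hive, the function name assign_group_ids is hypothetical, and it assumes the data fits in memory, which may not hold for a huge table.

```python
def assign_group_ids(rows):
    """Give the same id to rows that share an A value or a B value,
    directly or through a chain of other rows (illustrative helper)."""
    parent = {}

    def find(x):
        # Return the representative of x's set, registering x on first use.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Tag each value with its column so an 'a' in A never merges with an 'a' in B.
    for a, b in rows:
        union(("A", a), ("B", b))

    labels = {}
    result = []
    for a, b in rows:
        root = find(("A", a))
        if root not in labels:
            labels[root] = "id%d" % (len(labels) + 1)
        result.append((a, b, labels[root]))
    return result


rows = [("a", "b"), ("a", "c"), ("d", "b"), ("p", "q")]
labelled = assign_group_ids(rows)
for row in labelled:
    print(row)
```

On the sample rows from the question this labels the first three rows id1 and the last row id2. Doing the same at scale inside Hive would need an iterative or graph-based job, which is not shown here.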

Related

Is there something like Spark's unionByName in BigQuery?

I'd like to concatenate tables with different schemas, filling unknown values with null.
Simply using UNION ALL of course does not work, as this example shows:
WITH
x AS (SELECT 1 AS a, 2 AS b ),
y AS (SELECT 3 AS b, 4 AS c )
SELECT * FROM x
UNION ALL
SELECT * FROM y
a b
1 2
3 4
(unwanted result)
In Spark, I'd use unionByName to get the following result:
a    b c
1    2 null
null 3 4
(wanted result)
Of course, I can manually create the needed query (adding nulls) in BigQuery like so:
SELECT a, b, NULL c FROM x
UNION ALL
SELECT NULL a, b, c FROM y
But I'd prefer to have a generic solution, not requiring me to generate something like that.
So, is there something like unionByName in BigQuery? Or can one come up with a generic SQL function for this?
Consider the approach below (I think it is as generic as one can get):
create temp function json_extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function json_extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp table temp_table as (
select json, key, value
from (
select to_json_string(t) json from table_x as t
union all
select to_json_string(t) from table_y as t
) t, unnest(json_extract_keys(json)) key with offset
join unnest(json_extract_values(json)) value with offset
using(offset)
order by key
);
execute immediate(select '''
select * except(json) from temp_table
pivot (any_value(value) for key in ("''' || string_agg(distinct key, '","') || '"))'
from temp_table
)
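If applied to the sample data in your question, the output has one column per distinct key (a, b, c), with null where a source row has no value for that key.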
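To make the target semantics concrete, here is a minimal Python sketch of a union by name over rows represented as dictionaries. It is not BigQuery code; the helper name union_by_name is hypothetical, and rows are assumed to be plain dicts keyed by column name.

```python
def union_by_name(*tables):
    """Concatenate tables (lists of dict rows), filling columns that a row
    lacks with None. Column order follows first appearance."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


x = [{"a": 1, "b": 2}]
y = [{"b": 3, "c": 4}]
combined = union_by_name(x, y)
for row in combined:
    print(row)
```

The JSON-based query above follows the same two steps: collect every key that occurs, then emit one column per key.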

How can I construct the following query in SQL Server?

CREATE TABLE (
A INT NOT NULL,
B INT NOT NULL
)
A is an enumerated value: 1, 2, 3, 4, or 5.
B can be any value.
I would like to count the number of B groups whose A values include a specific subset, e.g. {1, 2}.
Example:
A B
1 7 *
2 7 *
3 7
1 8 *
2 8 *
1 9
3 9
When B = 7, A = 1, 2, 3. Good
When B = 8, A = 1, 2. Good
When B = 9, A = 1, 3. Not satisfied: 2 is missing
So the count will be 2 (when B = 7 and 8)
If I've understood you correctly, we want to find B values for which we have both a 1 and a 2 in A, and then we want to know how many of those we have.
This query does this:
declare @t table (A int not null, B int not null)
insert into @t(A,B) values
(1,7),
(2,7),
(3,7),
(1,8),
(2,8),
(1,9),
(3,9)
select COUNT(DISTINCT B) from (
select B
from @t
where A in (1,2)
group by B
having COUNT(DISTINCT A) = 2
) t
One or both of the DISTINCTs may be unnecessary - it depends on whether your data can contain repeating values.
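The same logic can be sketched in Python to check it against the sample data: collect the set of A values per B, then count the groups that contain every required value. The function name count_groups_with_all is hypothetical and only serves this illustration.

```python
from collections import defaultdict


def count_groups_with_all(rows, required):
    """Count the B groups whose A values include every value in `required`."""
    seen = defaultdict(set)
    for a, b in rows:
        seen[b].add(a)
    # `required <= values` is the subset test on sets.
    return sum(1 for values in seen.values() if required <= values)


rows = [(1, 7), (2, 7), (3, 7), (1, 8), (2, 8), (1, 9), (3, 9)]
matching = count_groups_with_all(rows, {1, 2})
print(matching)
```

Using sets per group plays the role of COUNT(DISTINCT A) in the query, so repeated rows do not affect the result.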
If I understand correctly and the requirement is to find Bs with a series of As that doesn't have any "gaps", you could compare the difference between the minimal and maximal A with number of records (per B, of course):
SELECT b
FROM mytable
GROUP BY b
HAVING COUNT(*) - 1 = MAX(a) - MIN(a)
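As a sanity check on the arithmetic: a set of distinct integers has no gaps exactly when its size equals max - min + 1. A small Python sketch of that test on the sample data follows; the name gapless_groups is hypothetical, and duplicate (A, B) rows are collapsed by the set.

```python
from collections import defaultdict


def gapless_groups(rows):
    """Return the sorted B values whose distinct A values form a run without gaps."""
    seen = defaultdict(set)
    for a, b in rows:
        seen[b].add(a)
    return sorted(
        b for b, values in seen.items()
        if len(values) - 1 == max(values) - min(values)
    )


rows = [(1, 7), (2, 7), (3, 7), (1, 8), (2, 8), (1, 9), (3, 9)]
gapless = gapless_groups(rows)
print(gapless)
```

Here B = 9 is excluded because its two values span a range of 2, which leaves room for a missing value.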
SELECT COUNT(DISTINCT B) FROM TEMP T WHERE T.B NOT IN
(SELECT B FROM
(SELECT B,A,
LAG (A,1) OVER (PARTITION BY B ORDER BY A) AS PRE_A
FROM Temp) K
WHERE K.PRE_A IS NOT NULL AND K.A<>K.PRE_A+1);

How to update a column for all rows after each time one row is processed by a UDF in BigQuery?

I'm trying to update a column for all rows after each time one row is processed by a UDF.
The example has 3 rows and 6 columns. Column "A" has the same value in all 3 rows; columns "A" and "B" together identify each row; column "C" holds arrays of letters drawn from a, b, c, d, e; column "D" is the target array to be filled in; column "E" is an integer; column "abcde" is an array of 5 integers giving the counts for the letters a, b, c, d, e.
Each row is passed to a UDF that updates columns "D" and "abcde" based on columns "C" and "E". The rule is: randomly select "E" items from "C" and put them into "D"; after the selection for a row is done, column "abcde" is updated across all rows.
For example, to process the first row, we randomly select one item from ['a','b','c'] to put into "D". Say the system picks 'c' from column "C": the value in "D" for this row becomes ['c'] and "abcde" is updated to [1,3,1,1,1] (previously [1,3,2,1,1]) for all three rows.
Example data:
#StandardSQL in BigQuery
#code to generate the example table
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, [] as D, 1 as E, [1,3,2,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,2,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[1,3,2,1,1])
select * from sample order by B
After the first row is processed:
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [1,3,1,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,1,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[1,3,1,1,1])
select * from sample order by B
After the second row is processed:
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [0,2,1,1,1] as abcde union all
select 'y1','x2',['a','b'],['a','b'],2,[0,2,1,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[0,2,1,1,1])
select * from sample order by B
After the third row is processed:
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [0,2,0,0,0] as abcde union all
select 'y1','x2',['a','b'],['a','b'],2,[0,2,0,0,0] union all
select 'y1','x3',['c','d','e'],['c','d','e'],3,[0,2,0,0,0])
select * from sample order by B
Don't worry about how the UDF will do the random selection. I'm just wondering, if it's possible in BigQuery to do the task to update the column 'abcde' in the way I want?
I've tried using UDFs, but I'm struggling to get it working because my understanding of a UDF is that it can only take one row in and produce multiple rows out. So, I can't update the other rows. Is it possible just using SQL?
Expected output:
After the first row is processed:
After the third row is processed:
Additional information:
create temporary function selection(A string, B string, C ARRAY<STRING>, D ARRAY<STRING>, E INT64, abcde ARRAY<INT64>)
returns STRUCT< A stRING, B string, C array<string>, D array<string>, E int64, abcde array<int64>>
language js AS """
/*
for the row i in the data:
select the number i.E of items (randomly) from i.C where the number associated with the item in i.abcde is bigger than 0 (i.e. only the items with numbers in abcde bigger than 0 can be candidates for the random selection);
put the selected items in i.D and deduct the amount of selected items from the number for the corresponding item in the column 'abcde' FOR ALL ROWS;
proceed to the next row i+1 until every row is processed;
*/
return {A,B,C,D,E,abcde}
""";
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, CAST([] AS ARRAY<STRING>) as D, 1 as E, [1,3,2,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,2,1,1] union all
select 'y1','x3',['c','d','e'],[],2,[1,3,2,1,1])
select selection(A,B,C,D,E,abcde) from sample order by B
Below is for BigQuery Standard SQL
#StandardSQL
WITH sample AS (
SELECT 'y1' AS A, 'x1' AS B, ['a','b','c'] AS C, ['c'] AS D, 1 AS E, [1,3,2,1,1] AS abcde UNION ALL
SELECT 'y1','x2',['a','b'],['a','b'],2,[1,3,2,1,1] UNION ALL
SELECT 'y1','x3',['c','d','e'],['c','d','e'],3,[1,3,2,1,1] UNION ALL
SELECT 'y2' AS A, 'x1' AS B, ['a','b','c'] AS C, ['a','b'] AS D, 2 AS E, [1,3,2,1,1] AS abcde UNION ALL
SELECT 'y2','x2',['a','b'],['b'],1,[1,3,2,1,1] UNION ALL
SELECT 'y2','x3',['c','d','e'],['d','e'],2,[1,3,2,1,1]
),
counts AS (
SELECT A AS AA, dd, COUNT(1) AS cnt
FROM sample, UNNEST(D) AS dd
GROUP BY AA, dd
),
processed AS (
SELECT A, B, ARRAY_AGG(aa - IFNULL(cnt, 0) ORDER BY pos) AS abcde
FROM sample, UNNEST(abcde) AS aa WITH OFFSET AS pos
LEFT JOIN counts ON A = counts.AA
AND CASE dd
WHEN 'a' THEN 0
WHEN 'b' THEN 1
WHEN 'c' THEN 2
WHEN 'd' THEN 3
WHEN 'e' THEN 4
END = pos
GROUP BY A, B
)
SELECT s.A, s.B, s.C, s.D, s.E, p.abcde
FROM sample AS s
JOIN processed AS p
USING (A, B)
-- ORDER BY A, B
You said "Don't worry about how the UDF will do the random selection", so, as you can see, I just put "random" values into the sample data to mimic D.
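The set-based idea in this query (count how often each letter was selected within a group A, then subtract those counts from abcde for every row of that group) can be sketched in Python as follows. The function name remaining_counts and the dict-based row layout are assumptions made for this illustration only.

```python
from collections import Counter

LETTERS = "abcde"  # position i of the abcde array belongs to LETTERS[i]


def remaining_counts(rows):
    """For each group A, subtract the letters selected in D (summed over the
    whole group) from that group's abcde counts, and apply it to every row."""
    used = {}
    for row in rows:
        used.setdefault(row["A"], Counter()).update(row["D"])
    result = []
    for row in rows:
        taken = used[row["A"]]  # a Counter returns 0 for letters never selected
        updated = [n - taken[ch] for n, ch in zip(row["abcde"], LETTERS)]
        result.append({**row, "abcde": updated})
    return result


sample = [
    {"A": "y1", "B": "x1", "D": ["c"], "abcde": [1, 3, 2, 1, 1]},
    {"A": "y1", "B": "x2", "D": ["a", "b"], "abcde": [1, 3, 2, 1, 1]},
    {"A": "y1", "B": "x3", "D": ["c", "d", "e"], "abcde": [1, 3, 2, 1, 1]},
    {"A": "y2", "B": "x1", "D": ["a", "b"], "abcde": [1, 3, 2, 1, 1]},
    {"A": "y2", "B": "x2", "D": ["b"], "abcde": [1, 3, 2, 1, 1]},
    {"A": "y2", "B": "x3", "D": ["d", "e"], "abcde": [1, 3, 2, 1, 1]},
]
processed = remaining_counts(sample)
for row in processed:
    print(row["A"], row["B"], row["abcde"])
```

This computes only the final state after all rows are processed; it does not model the row-by-row dependency where earlier selections restrict later ones.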

Can I add multiple columns to Totals

Using MS SQL 2012
I want to do something like
select a, b, c, a+b+c d
However, a, b, c are complex computed columns. Let's take a simple example:
select case when x > 4 then 4 else x end a,
( select count(*) somethingElse) b,
a + b c
order by c
I hope that makes sense
You can use a nested query or a common table expression (CTE) for that. The CTE syntax is slightly cleaner - here it is:
WITH CTE (a, b)
AS
(
select
case when x > 4 then 4 else x end a,
(select count(*) somethingElse) b
from my_table
)
SELECT
a, b, (a+b) as c
FROM CTE
ORDER BY c
I would probably do this:
SELECT
sub.a,
sub.b,
(sub.a + sub.b) as c
FROM
(
select
case when x > 4 then 4 else x end a,
(select count(*) somethingElse) b
FROM MyTable
) sub
ORDER BY c
The easiest way is to do this:
select a,b,c,a+b+c d
from (select <whatever your calcs are for a,b,c>) x
order by c
That just creates a derived table consisting of your calculations for a, b, and c, and allows you to easily reference and sum them up!

In R, How Do I Create a data.frame with Unique Values from One Column of another data.frame?

I'm trying to learn R, but I'm stuck on something that seems simple. I know SQL, and the easiest way for me to communicate my question is with that language. Can someone help me with a translation from SQL to R?
I've figured out that this:
SELECT col1, sum(col2) FROM table1 GROUP BY col1
translates into this:
aggregate(x=table1$col2, by=list(table1$col1), FUN=sum)
And I've figured out that this:
SELECT col1, col2 FROM table1 GROUP BY col1, col2
translates into this:
unique(table1[,c("col1","col2")])
But what is the translation for this?
SELECT col1 FROM table1 GROUP BY col1
For some reason, the "unique" function seems to switch to a different return type when working on only one column, so it doesn't work as I would expect.
I'm guessing that you are referring to the fact that calling unique on a vector will return a vector, rather than a data frame. Here are a couple of examples that may help:
#Some example data
dat <- data.frame(x = rep(letters[1:2],times = 5),
y = rep(letters[3:4],each = 5))
> dat
x y
1 a c
2 b c
3 a c
4 b c
5 a c
6 b d
7 a d
8 b d
9 a d
10 b d
> unique(dat)
x y
1 a c
2 b c
6 b d
7 a d
#Unique => vector
> unique(dat$x)
[1] "a" "b"
#Same thing
> unique(dat[,'x'])
[1] "a" "b"
#drop = FALSE preserves the data frame structure
> unique(dat[,'x',drop = FALSE])
x
1 a
2 b
#Or you can just convert it back (although the default column name is ugly)
> data.frame(unique(dat$x))
unique.dat.x.
1 a
2 b
If you know SQL then try packages sqldf and data.table.