Spark SQL: create a boolean column based on a condition

I have this table:
Column1 | Column2
A       | black
B       | red
C       | green
D       | yellow
E       | white
I am interested in the C and E values.
I need to write a Spark SQL query that creates a new boolean column flagging the rows I am interested in as 1 and the others as 0, like this:
Column1 | Column2 | NewColumn
A       | black   | 0
B       | red     | 0
C       | green   | 1
D       | yellow  | 0
E       | white   | 1
Any help is appreciated! Thanks!

You may use a CASE expression in Spark SQL:
SELECT
  *,
  CASE
    WHEN Column1 IN ('C', 'E') THEN 1
    ELSE 0
  END AS NewColumn
FROM
  my_table
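
Equivalently, since Spark SQL has a built-in IF function, the flag can be written without CASE (a minimal sketch, assuming the same table name my_table as above):

SELECT
  *,
  -- IF(condition, value_if_true, value_if_false)
  IF(Column1 IN ('C', 'E'), 1, 0) AS NewColumn
FROM
  my_table

Both forms produce the same 1/0 column.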

Related

How can I extract multiple rows from a single one having 4 columns in PostgreSQL?

I'm having an issue with a query in Postgres. I need to extract rows from a table in a specific pattern using a pure SELECT. For example, I have:
Column A | Column B | Column C | Column D
1        | 2        | 3        | 4
From this table I want to select something like this:
Column A | Column B | Column C | Column D
1        | null     | null     | null
1        | 2        | null     | null
1        | 2        | 3        | null
1        | 2        | 3        | 4
I don't really have a clue how to do this efficiently. Can anyone help? Many thanks.
You can cross join with a list of integers and use a conditional expression. Compare each column against its ordinal position (first column against 1, second against 2, and so on), so that the n-th generated row reveals the first n columns:
with n as (select * from (values (1), (2), (3), (4)) x(n))
select
  case when n >= 1 then cola end as cola,
  case when n >= 2 then colb end as colb,
  case when n >= 3 then colc end as colc,
  case when n >= 4 then cold end as cold
from n
cross join t
order by n;
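As a side note, in Postgres the list of integers could also come from generate_series rather than a VALUES list; only the CTE changes (a small sketch, assuming the same table t as above):

with n as (select n from generate_series(1, 4) as n)
select
  case when n >= 1 then cola end as cola,
  case when n >= 2 then colb end as colb,
  case when n >= 3 then colc end as colc,
  case when n >= 4 then cold end as cold
from n
cross join t
order by n;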

1 or 0 Per Group

Re-edited to make it clearer:
I would like my results to appear as they do in column b, based on the column a groupings: a 1 for one row per group and a 0 for the rest. Column b does not exist currently; I am trying to code it in. I tried row_number and rank, but neither appeared to work for me. How do I write my SQL so the results mirror column b? Any help is appreciated.
Thank you
column a | column b
aaa      | 1
aaa      | 0
ddd      | 1
ddd      | 0
ddd      | 0
yyy      | 1
yyy      | 0
yyy      | 0
You just need to wrap your row_number() in a case, something like this:
select
  column_a,
  -- row_number needs an order by; use whichever column should decide which row gets the 1
  case row_number() over (partition by column_a order by column_a)
    when 1 then 1
    else 0
  end as column_b
from
  your_table
/
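
To sanity-check against the sample data, you can run it over an inline rowset (a quick sketch, assuming Oracle given the trailing slash; swap in your real table in practice):

select
  column_a,
  case row_number() over (partition by column_a order by column_a)
    when 1 then 1
    else 0
  end as column_b
from (
  select 'aaa' as column_a from dual union all
  select 'aaa' from dual union all
  select 'ddd' from dual union all
  select 'ddd' from dual union all
  select 'ddd' from dual union all
  select 'yyy' from dual union all
  select 'yyy' from dual union all
  select 'yyy' from dual
)
/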

Successively down the rows, assign a value to a column based on conditions in other columns

I want to populate the column New based on the values in column A and the comparison between columns B and C. The column New should start with the initial value 1, carry its value down the rows, and reset to 1 under certain conditions.
New is set to 1 as initial value.
Let's look at row 3: Since A=40>30 and B=C, then New=New+1=2, since New=1 in row 2 above.
Let's look at row 6: Since A=40>30 and B<>C, then New=1 (counting starting over).
Creating initial table where New will be manipulated later on:
CREATE TABLE table_20220112
(
  Ordered_by int,
  A float,
  B nvarchar(100),
  C nvarchar(100),
  New int
);
INSERT INTO table_20220112
VALUES
(1,10,'Apples','Apples',0),
(2,5,'Apples','Apples',0),
(3,40,'Apples','Apples',0),
(4,10,'Apples','Apples',0),
(5,50,'Apples','Apples',0),
(6,40,'Oranges','Apples',0),
(7,10,'Oranges','Apples',0),
(8,25,'Oranges','Bananas',0);
select * from table_20220112
--drop table table_20220112
Code logic would be something like this (I do not know the corresponding SQL syntax):
New = 1 (initial value before looping down all rows)
If A <= 30 Then
    If B = C Then New = New
    Else If B <> C Then New = 1
Else If A > 30 Then
    If B = C Then New = New + 1
    Else If B <> C Then New = 1
End If
Desired outcome:
Ordered_by | A  | B       | C       | New
1          | 10 | Apples  | Apples  | 1
2          | 5  | Apples  | Apples  | 1
3          | 40 | Apples  | Apples  | 2
4          | 10 | Apples  | Apples  | 2
5          | 50 | Apples  | Apples  | 3
6          | 40 | Oranges | Apples  | 1
7          | 10 | Oranges | Apples  | 1
8          | 25 | Oranges | Bananas | 1
Use a recursive CTE and implement that logic with a CASE expression:
with rcte as
(
  select Ordered_by, A, B, C, New = 1
  from table_20220112
  where Ordered_by = 1

  union all

  select t.Ordered_by, t.A, t.B, t.C,
         New = case when t.A <= 30
                    then case when t.B = t.C then r.New else 1 end
                    when t.A > 30
                    then case when t.B = t.C then r.New + 1 else 1 end
               end
  from rcte r
  inner join table_20220112 t on r.Ordered_by = t.Ordered_by - 1
)
select *
from rcte
order by Ordered_by
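
One caveat: SQL Server stops recursive CTEs at 100 recursion levels by default, so this runs fine on the 8 sample rows, but on a longer table the cap would likely need lifting with a query hint on the outer statement:

select *
from rcte
order by Ordered_by
option (maxrecursion 0)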

How to SELECT only rows with non-zero and non-null columns efficiently in BigQuery?

I have a table with a large number of columns in BigQuery.
The table has a lot of rows where some column values are 0/0.0 or NULL.
For example:
Row | A     | B | C    | D     | E | F
1   | "abc" | 0 | null | "xyz" | 0 | 0.0
2   | "bcd" | 1 | 5    | "wed" | 4 | 65.5
I need to select only those rows which have non-zero integer and float values and no NULLs. Basically, I need just Row 2 in the table above.
I know I can do this by using this query for each of the columns
SELECT * FROM table WHERE (B IS NOT NULL AND B != 0) AND
.
.
.
But I have a lot of columns, and writing a query like this for each of them would be difficult. Is there any better approach?
Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT "abc" a, 0 b, NULL c, "xyz" d, 0 e, 0.0 f UNION ALL
  SELECT "bcd", 1, 5, "wed", 4, 65.5
)
SELECT *
FROM `project.dataset.table` t
WHERE NOT REGEXP_CONTAINS(TO_JSON_STRING(t), r':0[,}]|null[,}]')
with output
Row | a   | b | c | d   | e | f
1   | bcd | 1 | 5 | wed | 4 | 65.5
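
To see what the regex is actually matched against, you can inspect the JSON encoding of a row (a quick sketch; the exact float formatting BigQuery emits may differ):

#standardSQL
WITH `project.dataset.table` AS (
  SELECT "abc" a, 0 b, NULL c, "xyz" d, 0 e, 0.0 f
)
-- each row becomes one JSON object such as {"a":"abc","b":0,"c":null,...},
-- so any zero or NULL column shows up as :0 or null followed by , or }
SELECT TO_JSON_STRING(t) AS row_json
FROM `project.dataset.table` t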

Pull 1 record out of multiple records having the same data in a field, based on other fields

A | B | C | D  | E
a | y | 6 | 12 | 21
b | n | 3 | 10 | 5
c | n | 4 | 12 | 12
c | n | 7 | 12 | 2
c | y | 1 | 12 | 22
d | n | 6 | 10 | 32
d | n | 7 | 10 | 32
OUTPUT TABLE:
A | B | C | F
a | y | 6 | 21
b | n | 3 | 12
c | y | 1 | 22
d | n | 6 | 10
I have a table that contains certain fields. From that table I want to remove the duplicate records in A and produce the output table.
The field F is calculated from field C when there are no duplicates in A: if there is only one record for a given value of A and C > 5, then F (in the output table) pulls the value from column E; if C < 5 (as for record b), F pulls the value from column D. I have been able to achieve this using a CASE statement.
However, when there are duplicate records in column A, I want only one of them, chosen based on column B: pull the record that has the value 'y' in column B, with column F taking its value from column E. If none of the duplicate records in A has 'y' in the B column, then pull any one record, with column D as column F in the output table. I am not able to figure out this part.
Please let me know if anything is not clear.
Code I am using:
SELECT A, B, C,
  CASE
    WHEN (SELECT COUNT(*) FROM MyTable t2 WHERE t1.A = t2.A) > 1
      THEN (SELECT TOP 1 CASE WHEN B = 'y' THEN E ELSE D END
            FROM MyTable t3
            WHERE t3.A = t1.A
            ORDER BY CASE WHEN B = 'y' THEN 0 ELSE 1 END)
    ELSE
      CASE WHEN CAST(C AS float) >= 5.00
             THEN CASE WHEN E = '0.00' THEN D ELSE E END
           WHEN CAST(C AS float) < 5.00 THEN D
      END
  END AS F
FROM MyTable t1
You might want to encapsulate this logic in a Function to make it look cleaner, but the logic would go like this:
IF the record count of rows in the table with the same value for A as the current row is greater than 1, THEN SELECT the TOP 1 record with this value for A ORDER BY CASE WHEN b='y' THEN 0 ELSE 1 END
Use another CASE WHEN b='y' to determine if you will use column E or D for output column F.
And ELSE (the record count is not greater than 1), use your existing CASE expression.
EDIT: Here is a more pseudo-code explanation:
WITH cte AS (
  SELECT A, B, C, D, E,
         ROW_NUMBER() OVER (PARTITION BY A ORDER BY CASE WHEN B = 'y' THEN 0 ELSE 1 END) AS rn
  FROM MyTable
)
SELECT A, B, C,
  CASE
    WHEN (SELECT COUNT(*) FROM MyTable t2 WHERE t1.A = t2.A) > 1
      THEN CASE WHEN B = 'y' THEN E ELSE D END
    ELSE {use your existing CASE expression}
  END AS F
FROM cte t1
WHERE rn = 1
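
For completeness, here is one way the pieces could be assembled without the correlated COUNT subquery, using a window count instead (a sketch under the assumptions above, assuming SQL Server and substituting the asker's own CASE expression for the placeholder; not the answerer's exact code):

WITH cte AS (
  SELECT A, B, C, D, E,
         -- 'y' rows sort first, so rn = 1 picks the 'y' record when one exists
         ROW_NUMBER() OVER (PARTITION BY A ORDER BY CASE WHEN B = 'y' THEN 0 ELSE 1 END) AS rn,
         -- duplicate count per value of A, no correlated subquery needed
         COUNT(*) OVER (PARTITION BY A) AS cnt
  FROM MyTable
)
SELECT A, B, C,
  CASE
    WHEN cnt > 1 THEN CASE WHEN B = 'y' THEN E ELSE D END
    ELSE CASE WHEN CAST(C AS float) >= 5.00
                THEN CASE WHEN E = '0.00' THEN D ELSE E END
              ELSE D
         END
  END AS F
FROM cte
WHERE rn = 1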