Azure Databricks - Write to parquet file using spark.sql with union and subqueries - apache-spark-sql

Issue:
I'm trying to write to a Parquet file using spark.sql, but I run into issues when the query contains unions or subqueries. I suspect there's some syntax I'm getting wrong.
Ex.
%python
df = spark.sql("SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY)
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
mode="overwrite")
When I run a simple query I have no issues at all, for example:
%python
df = spark.sql("select * from raw.EXTR_COUNTRY")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country/",
mode="overwrite")

There are a few problems with your code that need to be fixed:
You're using a single double-quote character (") for a multi-line string. Python needs triple quotes (""" or ''') for that.
Your SQL syntax is incorrect for the second part of the query (after UNION ALL): you didn't specify FROM which table to pull that data. See the docs for details of the SQL syntax.
I really recommend debugging each subquery separately, maybe first using %sql, and only after it works putting it into the spark.sql string.
Also, because you're overwriting the data, it could be easier to use create or replace table syntax to perform everything in SQL (docs), something like this:
create or replace table delta.`/mnt/devstorage/landing/companyx/country/` AS (
SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM ....
)

The quotes solved the issue; the SQL script itself wasn't the problem. Using triple quotes (""" or ''') fixed it.
%python
df = spark.sql("""SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE""")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
mode="overwrite")
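For anyone who wants to see why the quoting matters without a cluster at hand, here is a minimal sketch in plain Python. It only checks whether the source text parses. The SELECT 1 / SELECT 2 query and the compiles helper are made up for illustration and are not part of the original notebook.

```python
# Source text of two tiny cells; the only difference is the quoting style.
# The \n escapes become real line breaks inside the text being compiled.
single_quoted = 'query = "SELECT 1\nUNION ALL\nSELECT 2"\n'
triple_quoted = 'query = """SELECT 1\nUNION ALL\nSELECT 2"""\n'


def compiles(source: str) -> bool:
    """Return True if Python can parse the given source text."""
    try:
        compile(source, "<cell>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(single_quoted))  # False: a "..." string cannot span lines
print(compiles(triple_quoted))  # True: triple quotes may span lines
```

The triple-quoted form is the one you would then hand to spark.sql(...).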

Related

How to Pass list of words into SQL 'LIKE' operator

I am trying to pass a list of words into the SQL LIKE operator.
The query should return a column called Customer Issue wherever Customer Issue matches any phrase in the list below.
my_list =['air con','no cold air','hot air','blowing hot air']
SELECT customer_comments
FROM table
where customer_comments like ('%air con%') -- for single search
How do I pass my_list above?
Regular expressions can help here. Another solution is UNNEST, which has already been given.
SELECT customer_comments
FROM table
where REGEXP_CONTAINS(lower(customer_comments), r'air con|no cold air|hot air|blowing hot air');
A similar question, answered for SQL Server, is here:
Combining "LIKE" and "IN" for SQL Server
Basically you'll have to chain a bunch of 'OR' conditions.
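The chain of OR conditions can be generated from the list rather than typed by hand. A rough sketch, assuming a DB-API driver that uses ? placeholders. sqlite3 is used here only as a stand-in for your real database, and the comments table and its rows are invented sample data.

```python
import sqlite3

my_list = ['air con', 'no cold air', 'hot air', 'blowing hot air']

# In-memory stand-in table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (customer_comments TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?)",
    [("air conditioner",), ("cold air",), ("too hot air",)],
)

# One "LIKE ?" per list item, joined with OR; the values travel as parameters.
where = " OR ".join("customer_comments LIKE ?" for _ in my_list)
params = [f"%{word}%" for word in my_list]

sql = (
    "SELECT customer_comments FROM comments "
    f"WHERE {where} ORDER BY customer_comments"
)
matches = [row[0] for row in conn.execute(sql, params)]
print(matches)  # ['air conditioner', 'too hot air']
```

Only the placeholders are spliced into the SQL text, so the search phrases themselves never need quoting.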
Based on the post @Jordi shared, I think the query below can be an option in BigQuery.
query:
SELECT DISTINCT customer_comments
FROM sample,
UNNEST(['air con','no cold air','hot air','blowing hot air']) keyword
WHERE INSTR(customer_comments, keyword) <> 0;
Sample table used for the query above:
CREATE TEMP TABLE sample AS
SELECT * FROM UNNEST(['air conditioner', 'cold air', 'too hot air']) customer_comments;
Consider below
with temp as (
select ['air con','no cold air','hot air','blowing hot air'] my_list
)
select customer_comments
from your_table, (
select string_agg(item, '|') list
from temp t, t.my_list item
)
where regexp_contains(customer_comments, r'' || list)
There are myriad ways to refactor the above based on your specific use case, for example:
select customer_comments
from your_table
where regexp_contains(customer_comments, r'' ||
array_to_string(['air con','no cold air','hot air','blowing hot air'], '|')
)
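If the list lives in application code, the alternation pattern that string_agg / array_to_string build above can also be assembled on the client side. A small sketch in Python. The sample comments are invented, and re.escape is my addition (not in the answers above) to keep list items with regex metacharacters from changing the pattern.

```python
import re

my_list = ['air con', 'no cold air', 'hot air', 'blowing hot air']

# Join the escaped phrases with '|' to get one alternation pattern.
pattern = re.compile("|".join(re.escape(word) for word in my_list))

# Made-up sample rows standing in for customer_comments.
comments = ["Air conditioner is noisy", "cold air", "too hot air"]

# Lower-case the text before matching, as the REGEXP_CONTAINS answer does.
matches = [c for c in comments if pattern.search(c.lower())]
print(matches)  # ['Air conditioner is noisy', 'too hot air']
```

The pattern source (pattern.pattern) could then be passed to the query as a string parameter.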

ORACLE TO_CHAR SPECIFY OUTPUT DATA TYPE

I have a column with data such as '123456789012'.
I want to split the data into groups of 3 characters with a '/' in between, so that the output looks like "123/456/789/012".
I tried "SELECT TO_CHAR(DATA, '999/999/999/999') FROM TABLE_1" but it does not print the output I wanted. Previously I ran "SELECT TO_CHAR(DATA, '$999,999,999,999.99') FROM TABLE_1" and it printed "$123,456,789,012.00", so I thought I could do the same here, but I guess that's not the case.
There is also a case where I want to put '#' in front of the data, so the output will be something like #12345678901234. Can I use TO_CHAR for this too?
Is this possible? When I went through the Oracle documentation for TO_CHAR, it listed a few formats that can be used, and the format I want is not among them.
Thank you in advance. :D
Here is one option with varchar2 datatype:
with test as (
select '123456789012' a from dual
)
select listagg(substr(a,(level-1)*3+1,3),'/') within group (order by rownum) num
from test
connect by level <=length(a)
or
with test as (
select '123456789012.23' a from dual
)
select '$'||listagg(substr((regexp_substr(a,'[0-9]{1,}')),(level-1)*3+1,3),',') within group (order by rownum)||regexp_substr(a,'[.][0-9]{1,}') num
from test
connect by level <=length(a)
output:
1st query
123/456/789/012
2nd query
$123,456,789,012.23
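The substr/level arithmetic in the first query is easier to follow outside SQL. A small illustrative sketch in Python of the same chunking idea. The group_chars helper is hypothetical, not an Oracle feature.

```python
def group_chars(value: str, size: int = 3, sep: str = "/") -> str:
    """Split value into chunks of `size` characters and join them with `sep`.

    Mirrors substr(a, (level - 1) * 3 + 1, 3): chunk k starts at offset k * size.
    """
    chunks = [value[i:i + size] for i in range(0, len(value), size)]
    return sep.join(chunks)


print(group_chars("123456789012"))   # 123/456/789/012
print("#" + "12345678901234")        # leading '#': plain concatenation
```

A leading '#' needs no formatting at all, only string concatenation ('#' || DATA in Oracle).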
If you want groups of three then you can use the group separator G, and specify the character to use:
SELECT TO_CHAR(DATA, 'FM999G999G999G999', 'NLS_NUMERIC_CHARACTERS=./') FROM TABLE_1
123/456/789/012
If you want a leading # then you can use the currency indicator L, and again specify the character to use:
SELECT TO_CHAR(DATA, 'FML999999999999', 'NLS_CURRENCY=#') FROM TABLE_1
#123456789012
Or combine both:
SELECT TO_CHAR(DATA, 'FML999G999G999G999', 'NLS_CURRENCY=# NLS_NUMERIC_CHARACTERS=./') FROM TABLE_1
#123/456/789/012
The data type is always a string; only the format changes.

How to easily remove count=1 on aliased field in SQL?

I have the following data in a table:
GROUP1|FIELD
Z_12TXT|111
Z_2TXT|222
Z_31TBT|333
Z_4TXT|444
Z_52TNT|555
Z_6TNT|666
And I engineer in a field that removes the leading numbers after the '_'
GROUP1|GROUP_ALIAS|FIELD
Z_12TXT|Z_TXT|111
Z_2TXT|Z_TXT|222
Z_31TBT|Z_TBT|333 <- to be removed
Z_4TXT|Z_TXT|444
Z_52TNT|Z_TNT|555
Z_6TNT|Z_TNT|666
How can I easily query the original table and drop the GROUP1 rows whose GROUP_ALIAS has only one distinct FIELD in it?
Desired result:
GROUP1|GROUP_ALIAS|FIELD
Z_12TXT|Z_TXT|111
Z_2TXT|Z_TXT|222
Z_4TXT|Z_TXT|444
Z_52TNT|Z_TNT|555
Z_6TNT|Z_TNT|666
This is how I get all the GROUP_ALIAS values I don't want:
SELECT GROUP_ALIAS
FROM
(SELECT
GROUP1,FIELD,
case when instr(GROUP1, '_') = 2
then
substr(GROUP1, 1, 2) ||
ltrim(substr(GROUP1, 3), '0123456789')
else
substr(GROUP1 , 1, 1) ||
ltrim(substr(GROUP1, 2), '0123456789')
end GROUP_ALIAS
FROM MY_TABLE)
GROUP BY GROUP_ALIAS
HAVING COUNT(FIELD)=1
I could probably compute the engineered field a second time on the original table and check that it isn't in the result of the query above, but I want to avoid that much nesting. I don't know how to partition or do anything more sophisticated with the case expression that builds this engineered field, though.
UPDATE
Thanks for all the great replies below. Something about the SQL used must differ from what I thought because I'm getting info like:
GROUP1|GROUP_ALIAS|FIELD
111,222|,111|111
111,222|,222|222
etc.
Not sure why since the solutions work on my unabstracted data in db-fiddle. If anyone can spot what db it's actually using that would help but I'll also check on my end.
Here is one way, using analytic count. If you are not familiar with the with clause, read up on it - it's a very neat way to make your code readable. The way I declare column names in the with clause works since Oracle 11.2; if your version is older than that, the code needs to be re-written just slightly.
I also computed the "engineered field" in a more compact way. Use whatever you need to.
I used sample_data for the table name; adapt as needed.
with
add_alias (group1, group_alias, field) as (
select group1,
substr(group1, 1, instr(group1, '_')) ||
ltrim(substr(group1, instr(group1, '_') + 1), '0123456789'),
field
from sample_data
)
, add_counts (group1, group_alias, field, ct) as (
select group1, group_alias, field, count(*) over (partition by group_alias)
from add_alias
)
select group1, group_alias, field
from add_counts
where ct > 1
;
With Oracle you can use REGEXP_REPLACE and analytic functions:
select Group1, group_alias, field
from (select group1, REGEXP_REPLACE(group1,'_\d+','_') group_alias, field,
count(*) over (PARTITION BY REGEXP_REPLACE(group1,'_\d+','_')) as count from test) a
where count > 1
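To sanity-check what either query should return for the sample data in the question, the same logic can be sketched in a few lines of Python. This is only an illustration of the idea (derive the alias, count rows per alias, keep aliases seen more than once), not a replacement for the SQL.

```python
import re
from collections import Counter

# Sample rows from the question: (GROUP1, FIELD).
rows = [
    ("Z_12TXT", 111), ("Z_2TXT", 222), ("Z_31TBT", 333),
    ("Z_4TXT", 444), ("Z_52TNT", 555), ("Z_6TNT", 666),
]


def alias(group1: str) -> str:
    """Drop the digits that follow the underscore, like REGEXP_REPLACE(group1, '_\\d+', '_')."""
    return re.sub(r"_\d+", "_", group1)


# Equivalent of count(*) over (partition by group_alias).
counts = Counter(alias(group1) for group1, _ in rows)

kept = [(g, alias(g), f) for g, f in rows if counts[alias(g)] > 1]
for row in kept:
    print(row)
```

Z_31TBT is the only row whose alias (Z_TBT) occurs once, so it is the only one filtered out.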

Select with IF statement on postgresql

I have code like this:
select
tbl.person
,COUNT(distinct tbl.project)
,if (tbl.stage like '%SIGNED%') then sum(tbl.value) else '0' end if as test
from
my_table tbl
group by
1
And it returns this error message:
SQL Error [42601]: ERROR: syntax error at or near "then"
I don't get it. As far as I can see in the documentation, the IF statement syntax appears to be used correctly.
IF is to be used in procedures, not in queries. Use a case expression instead:
select
tbl.person
,COUNT(distinct tbl.project)
,sum(case when tbl.stage like '%SIGNED%' then tbl.value else 0 end) as test
from
my_table tbl
group by
1
Notes:
tbl.stage is not part of the group by, so it should most probably be enclosed within the aggregate expression, not outside of it
all values returned by a case expression need to have the same datatype. Since sum(tbl.value) is numeric, the else part of the case should return 0 (number), not '0' (string).
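The conditional aggregate can be tried out quickly from Python. A minimal sketch, using sqlite3 purely as a stand-in for Postgres (the CASE expression is standard SQL and should behave the same way there). The table contents are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (person TEXT, project TEXT, stage TEXT, value INTEGER)"
)
# Made-up rows: ann has one signed project, bob has none.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("ann", "p1", "CONTRACT SIGNED", 100),
        ("ann", "p2", "DRAFT", 50),
        ("bob", "p3", "DRAFT", 70),
    ],
)

query = """
SELECT tbl.person,
       COUNT(DISTINCT tbl.project),
       SUM(CASE WHEN tbl.stage LIKE '%SIGNED%' THEN tbl.value ELSE 0 END) AS test
FROM my_table tbl
GROUP BY tbl.person
ORDER BY tbl.person
"""
result = conn.execute(query).fetchall()
print(result)  # [('ann', 2, 100), ('bob', 1, 0)]
```

Because the CASE sits inside SUM, rows that are not signed contribute 0 instead of being dropped from the group.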
In Postgres, I would recommend using filter:
select tbl.person, COUNT(distinct tbl.project),
sum(tbl.value) filter (where tbl.stage like '%SIGNED%') as test
from my_table tbl
group by 1;
if is control flow logic. When working with queries, you want to learn how to think more as sets. So the idea is to filter the rows and add up the values after filtering.
replace
if (tbl.stage like '%SIGNED%') then sum(tbl.value) else '0' end if as test
with
sum(case when tbl.stage like '%SIGNED%' then tbl.value end) as test

Replace LIKE by SUBSTR

I tried to select the name of students that end with 'a'. I wrote this code:
Select name from students where name like '%a' ;
How can I get the same results using SUBSTR?
I actually think using RIGHT() would make the most sense here:
SELECT name
FROM students
WHERE RIGHT(name, 1) = 'a'
The above query would work on MySQL, SQL Server, and Postgres, but not Oracle, where you would have to use SUBSTR():
SELECT name
FROM students
WHERE SUBSTR(name , -1) = 'a'
Not all platforms accept negative start integers or length integers for SUBSTR()
Can you try if your DBMS supports the RIGHT() string function?
Works like this:
SQL>SELECT RIGHT('abcd',1) AS rightmost_char;
rightmost_char
--------------
d
You can use:
Select name from students where SUBSTR(name, -1, 1) = 'a' ;
Using SUBSTR().
SELECT name
FROM students
WHERE SUBSTR(name , -1) = 'a'
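A quick way to convince yourself that the two predicates agree is to run both against the same rows. A small sketch using sqlite3 as a stand-in. SQLite happens to accept a negative start for substr(), but as noted above not every DBMS does, so check yours. The student names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
# Made-up sample names.
conn.executemany(
    "INSERT INTO students VALUES (?)",
    [("Anna",), ("Bob",), ("Carla",), ("Dmitri",)],
)


def names(where_clause: str) -> list:
    """Run the SELECT with the given WHERE clause and return the names, sorted."""
    sql = f"SELECT name FROM students WHERE {where_clause} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


with_like = names("name LIKE '%a'")
with_substr = names("substr(name, -1) = 'a'")

print(with_like)    # ['Anna', 'Carla']
print(with_substr)  # ['Anna', 'Carla']
```

One caveat: depending on the DBMS and collation, LIKE may be case-insensitive while the = comparison is not, so names ending in an uppercase 'A' can be treated differently by the two forms.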