SQL - expand dataset into lookup table? - sql

I currently have a legacy table that looks like the one below.
This is a set of rules that our business has stored over the years. The issue is that the "all" and "both" values really should be separated out into rows so they can be queried more efficiently.
For example, the contract length column can only ever be between 1 and 5, the type column can only ever be "gas" or "water" and the sales channel "internal" or "external". Instead of saying all or both, another row should exist with the specific rule and the table should look like the below.
So this will have a row for every variation in the first table.
I didn't think it would be a long task to do manually, but I was wrong :)
Does anyone have any idea on how to achieve this quickly in SQL? I would say what I have tried so far...but I am completely stumped on this one so am wondering if it can even be done at all?

This could be done in a single SQL statement, but working in stages, so that you can check interim result sets before you get to the final output, is probably a lot healthier for your mental health and less risky.
I would approach this with a UNION query, one set of UNIONs for each column that should be split out to more granular rows.
For instance for contractlength:
SELECT Supplier, 1, Type, SalesChannel FROM yourtable WHERE contractLength in ('1', 'All')
UNION ALL
SELECT Supplier, 2, Type, SalesChannel FROM yourtable WHERE contractLength in ('2', 'All')
UNION ALL
SELECT Supplier, 3, Type, SalesChannel FROM yourtable WHERE contractLength in ('3', 'All')
UNION ALL
SELECT Supplier, 4, Type, SalesChannel FROM yourtable WHERE contractLength in ('4', 'All')
UNION ALL
SELECT Supplier, 5, Type, SalesChannel FROM yourtable WHERE contractLength in ('5', 'All')
You can write those results out to a temp table, and then build your query for type on top of it writing to a new temp table.
SELECT Supplier, contractLength, 'Gas', SalesChannel FROM previousTempTable WHERE type in ('Gas','Both')
UNION ALL
SELECT Supplier, contractLength, 'Water', SalesChannel FROM previousTempTable WHERE type in ('Water','Both')
Rinse and repeat for SalesChannel.
There are other, more elegant ways to solve this with some SELECT DISTINCT and cross joins, but your list of values for each column is limited, and the solution I'm proposing feels like a quick, easy way to get your data in shape. It's also easy to understand if this is auditable data or the process needs to be repeated.

You don't need to query your table multiple times or use temp tables. You can do this pretty elegantly with conditional unpivots, by using CROSS APPLY:
SELECT
t.Supplier,
c1.ContractLength,
c2.Type,
c3.SalesChannel
FROM YourTable t
CROSS APPLY (
SELECT t.ContractLength
WHERE t.ContractLength <> 'All'
UNION ALL
SELECT *
FROM (VALUES
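('1'),('2'),('3'),('4'),('5') -- strings, so the UNION ALL with the text column t.ContractLength needs no implicit conversion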
) v(ContractLength)
WHERE t.ContractLength = 'All'
) c1
CROSS APPLY (
SELECT t.Type
WHERE t.Type <> 'Both'
UNION ALL
SELECT *
FROM (VALUES
('Gas'),('Water')
) v(Type)
WHERE t.Type = 'Both'
) c2
CROSS APPLY (
SELECT t.SalesChannel
WHERE t.SalesChannel <> 'Both'
UNION ALL
SELECT *
FROM (VALUES
('Internal'),('External')
) v(SalesChannel)
WHERE t.SalesChannel = 'Both'
) c3;
A somewhat less efficient, but more compact, version of the same, is to use normal joins against the VALUES clauses
SELECT
t.Supplier,
c1.ContractLength,
c2.Type,
c3.SalesChannel
FROM YourTable t
JOIN (VALUES
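('1'),('2'),('3'),('4'),('5') -- strings, so comparing against t.ContractLength never tries to convert 'All' to a number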
) c1(ContractLength)
ON c1.ContractLength = t.ContractLength OR t.ContractLength = 'All'
JOIN (VALUES
('Gas'),('Water')
) c2(Type)
ON c2.Type = t.Type OR t.Type = 'Both'
JOIN (VALUES
('Internal'),('External')
) c3(SalesChannel)
ON c3.SalesChannel = t.SalesChannel OR t.SalesChannel = 'Both';
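The join-against-a-value-list idea can be tried out locally. Below is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server, so the value lists are written as CTEs instead of `(VALUES ...) alias(col)`. The table name `rules` and the two sample rows are assumptions made up for illustration, not data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rules (Supplier TEXT, ContractLength TEXT, Type TEXT, SalesChannel TEXT);
    INSERT INTO rules VALUES
        ('SupplierA', 'All', 'Both',  'Internal'),  -- expands to 5 * 2 * 1 = 10 rows
        ('SupplierB', '3',   'Water', 'Both');      -- expands to 1 * 1 * 2 = 2 rows
""")

# Each value list holds every concrete value a column may take. A rule row
# joins to the one matching value, or to all of them when it holds the wildcard.
expanded = conn.execute("""
    WITH lengths(ContractLength) AS (VALUES ('1'), ('2'), ('3'), ('4'), ('5')),
         types(Type)             AS (VALUES ('Gas'), ('Water')),
         channels(SalesChannel)  AS (VALUES ('Internal'), ('External'))
    SELECT r.Supplier, l.ContractLength, t.Type, c.SalesChannel
    FROM rules r
    JOIN lengths  l ON l.ContractLength = r.ContractLength OR r.ContractLength = 'All'
    JOIN types    t ON t.Type = r.Type OR r.Type = 'Both'
    JOIN channels c ON c.SalesChannel = r.SalesChannel OR r.SalesChannel = 'Both'
    ORDER BY r.Supplier, l.ContractLength, t.Type, c.SalesChannel
""").fetchall()

for row in expanded:
    print(row)
```

Keeping the contract lengths as text in the value list avoids comparing a number with the literal 'All', which is a design choice of this sketch.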

Related

Join type (inner, left) and data type casting influences query plan, and order of operations

create or replace table test.bugs.table_one as (
select *, random(1337) as cost
from (
values
('', '2010-01-01', 'one')
, ('10', '2010-01-01', 'two')
, ('11', '2010-01-01', 'three')
, ('12', '2010-01-01', 'four')
)
);
create or replace table test.bugs.table_two as (
select *, random(1337) as budget
from (
values
(9, '2010-01-01', 'one')
, (10, '2010-01-01', 'two')
)
);
with
t1 as (
select
column1::int as column1
, column2
, column3
, cost
from table_one
where column1 !=''
),
t2 as (
select
column1
, column2
, column3
, budget
from table_two
)
select *
from t1
left join t2
on t1.column1 = t2.column1
and t1.column2 = t2.column2
and t1.column3 = t2.column3;
Returns: 3 rows
Changing the join type to INNER results in error: Numeric value '' is not recognized. Instead of ::int I ended up using try_to_number() function, but it took a bit of trial and error to figure out (query above is simplified, mine was more convoluted).
Is this a bug, or am I doing something odd?
Databases do not guarantee the order of evaluation of expressions. In some databases, your code would always work. In others, it might work sometimes and fail other times.
Is this a bug? I consider it a bug, but clearly some database vendors do not. You have found the workaround. Another method would be a case expression:
select (case when column1 regexp '^[0-9]+$' then column1::int end)
This should work, because case should guarantee the order of evaluation of its arguments.
When the join becomes an inner join, operations done before or after the join are equivalent, so things like the cast can get hoisted.
The WHERE clause is supposed to be evaluated before the SELECT section of the t1 CTE.
I just retested my bug submission code, and now the broken case works, but the working case (with the correct TRY_TO_NUMBER) fails.
I have had queries like yours that worked, and then, once an extra layer of select with an aggregation over the results was wrapped around the outside, the cast was hoisted back into the error state.
But yes, it's a bug, so I would report it.
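The CASE-guarded cast can be illustrated outside Snowflake. The sketch below uses Python's sqlite3 module, where a plain CAST never raises, so a hypothetical user-defined function `strict_int` stands in for a cast that fails on bad input. The GLOB test replaces the regular expression because SQLite ships no regexp function by default. It only shows the pattern; it says nothing about how Snowflake orders evaluation.

```python
import sqlite3

def strict_int(text):
    # Stand-in for a failing cast: int('') raises ValueError,
    # which sqlite3 surfaces as an OperationalError.
    return int(text)

conn = sqlite3.connect(":memory:")
conn.create_function("strict_int", 1, strict_int)
conn.execute("CREATE TABLE table_one (column1 TEXT)")
conn.executemany("INSERT INTO table_one VALUES (?)",
                 [("",), ("10",), ("11",), ("12",)])

# Guarded: the THEN branch is only evaluated for rows whose WHEN test passed,
# so the empty string never reaches strict_int.
guarded = conn.execute("""
    SELECT CASE
             WHEN column1 <> '' AND column1 NOT GLOB '*[^0-9]*'
             THEN strict_int(column1)
           END
    FROM table_one
    ORDER BY rowid
""").fetchall()

# Unguarded: the empty string reaches strict_int and the query fails.
try:
    conn.execute("SELECT strict_int(column1) FROM table_one").fetchall()
    unguarded_failed = False
except sqlite3.OperationalError:
    unguarded_failed = True

print(guarded, unguarded_failed)
```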

BigQuery use the where clause to filter on a column that not always exists in the table

I need to create some kind of a uniform query for multiple tables. Some tables contain a certain column with a type. If this is the case, I need to apply filtering to it. I don't know how to do this.
I have for example two tables
table_customer_1
CustomerId, CustomerType
1, 1
2, 1
3, 2
Table_customer_2
Customerid
4
5
6
The query needs to be something like the one below and should work for both tables (the table name will be replaced by the customer that uses the query):
With input1 as(
SELECT
(CASE WHEN exists(customerType) THEN customerType ELSE "0" END) as customerType, *
FROM table_customer_1)
SELECT * from input1
WHERE customerType != 2
Below is for BigQuery Standard SQL
#standardSQL
SELECT *
FROM `project.dataset.table` t
WHERE SAFE_CAST(IFNULL(JSON_EXTRACT_SCALAR(TO_JSON_STRING(t), '$.CustomerType'), '0') AS INT64) != 2
or as a simplification you can ignore casting to INT64 and use comparison to STRING
#standardSQL
SELECT *
FROM `project.dataset.table` t
WHERE IFNULL(JSON_EXTRACT_SCALAR(TO_JSON_STRING(t), '$.CustomerType'), '0') != '2'
The above will work for whatever table you put in place of project.dataset.table: either project.dataset.table_customer_1 or project.dataset.table_customer_2, so it is quite generic, I think.
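The idea behind the JSON approach is to serialize the whole row and look the column up by name, so a missing column simply yields no value and falls back to a default. Here is a rough Python analogue of that logic, not BigQuery code; the helper names `customer_type` and `filter_rows` are made up for illustration, and the rows mirror the two sample tables from the question.

```python
import json

def customer_type(row, default="0"):
    # Serialize the row and read the key back from the JSON text, mirroring
    # TO_JSON_STRING + JSON_EXTRACT_SCALAR; an absent key gives the default.
    as_json = json.dumps(row)
    value = json.loads(as_json).get("CustomerType")
    return default if value is None else str(value)

def filter_rows(rows):
    # Keep every row whose (possibly defaulted) customer type is not '2'.
    return [row for row in rows if customer_type(row) != "2"]

table_customer_1 = [
    {"CustomerId": 1, "CustomerType": 1},
    {"CustomerId": 2, "CustomerType": 1},
    {"CustomerId": 3, "CustomerType": 2},
]
table_customer_2 = [{"CustomerId": 4}, {"CustomerId": 5}, {"CustomerId": 6}]

print(filter_rows(table_customer_1))
print(filter_rows(table_customer_2))
```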
I can think of no good reason for doing this. However, it is possible by playing with the scoping rules for subqueries:
SELECT t.*
FROM (SELECT t.*,
(SELECT customerType -- will choose from tt if available, otherwise x
FROM table_customer_1 tt
WHERE tt.Customerid = t.Customerid
) as customerType
FROM (SELECT t.* EXCEPT (Customerid)
FROM table_customer_1 t
) t CROSS JOIN
(SELECT 0 as customerType) x
) t
WHERE customerType <> 2

sql group by ignoring case and suffix or final letter

I have a table like this:
I am going to count the number of categories and how many rows are in each category.
I used this query:
But unfortunately apples is counted as separate category because it has "s" at the end.
I would recommend you look at some of the comments and restructure your data, as I think it will cause you issues going forward. That said, this query will do what you want, although it isn't a nice one.
CTE for testing:
WITH fruit_table(Fruit, No_Fruit)
AS (
SELECT 'Apple', 3
UNION ALL
SELECT 'Apples', 2
UNION ALL
SELECT 'Orange', 1
UNION ALL
SELECT 'oranges', 2)
Query:
SELECT DISTINCT
LOWER(CASE
WHEN right(fruit, 1) = 's'
THEN left(fruit, -1)
ELSE fruit
END),
SUM(No_Fruit)
FROM fruit_table
GROUP BY LOWER(CASE
WHEN right(fruit, 1) = 's'
THEN left(fruit, -1)
ELSE Fruit
END);
There will be a more elegant way to get the results you want, and a better solution would be to fix your schema, but... it works.
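To check the grouping logic without a database server, here is a minimal sketch using Python's sqlite3 module with the same four test rows. SQLite has no LEFT/RIGHT functions, so `LIKE '%s'` and `substr` are swapped in for them. Note that SQLite's LIKE is case-insensitive for ASCII, so a trailing capital 'S' would be stripped as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit_table (Fruit TEXT, No_Fruit INTEGER)")
conn.executemany(
    "INSERT INTO fruit_table VALUES (?, ?)",
    [("Apple", 3), ("Apples", 2), ("Orange", 1), ("oranges", 2)],
)

# Normalize each name (drop one trailing 's', then lowercase) and group on
# the normalized value, so 'Apple' and 'Apples' land in the same category.
totals = conn.execute("""
    SELECT LOWER(CASE
                   WHEN Fruit LIKE '%s' THEN substr(Fruit, 1, length(Fruit) - 1)
                   ELSE Fruit
                 END) AS category,
           SUM(No_Fruit) AS total
    FROM fruit_table
    GROUP BY category
    ORDER BY category
""").fetchall()

print(totals)
```

The same caveat as in the answer applies: stripping a final 's' also mangles names that legitimately end in 's', which is one more reason to fix the data itself.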

Using a case statement as an if statement

I am attempting to create an IF statement in BigQuery. I have built a concept that will work but it does not select the data from a table, I can only get it to display 1 or 0
Example:
SELECT --AS STRUCT
CASE
WHEN (
Select Count(1) FROM ( -- If the records are the same, then return = 0, if the records are not the same then > 1
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
Except Distinct
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
)
)= 0
THEN
(Select * from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest`) -- This does not work: "Scalar subquery cannot have more than one column unless using SELECT AS STRUCT to build STRUCT values at [16:4]"
END
SELECT --AS STRUCT
CASE
WHEN (
Select Count(1) FROM ( -- If the records are the same, then return = 0, if the records are not the same then > 1
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Prior_Filtered`
Except Distinct
Select Distinct ESCO, SOURCE, LDCTEXT, STATUS,DDR_DATE, TempF, HeatingDegreeDays, DecaTherms
from `gas-ddr.gas_ddr_outbound.LexingtonDDRsOutbound_onchange_Latest_Filtered`
)
)= 0
THEN 1 --- This does work
Else
0
END
How can I Get this query to return results from an existing table?
Your question is still a little generic, so my answer is generic as well: it just mimics your use case to the extent I can reverse engineer it from your comments.
So, in below code - project.dataset.yourtable mimics your table ; whereas
project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered mimic your respective views
#standardSQL
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'bbb' cols, 'latest' filter
), `project.dataset.yourtable_Prior_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'prior'
), `project.dataset.yourtable_Latest_Filtered` AS (
SELECT cols FROM `project.dataset.yourtable` WHERE filter = 'latest'
), check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
the result is
Row cols filter
1 aaa prior
2 bbb latest
if you changed your table to
WITH `project.dataset.yourtable` AS (
SELECT 'aaa' cols, 'prior' filter UNION ALL
SELECT 'aaa' cols, 'latest' filter
) ......
the result will be
Row cols filter
Query returned zero records.
I hope this gives you right direction
Added more explanations:
I could be wrong, but per your question it looks like you have one table, project.dataset.yourtable, and two views, project.dataset.yourtable_Prior_Filtered and project.dataset.yourtable_Latest_Filtered, which present the state of your table before and after some event.
So the first three CTEs in the answer above just mimic the table and views that you described in your question.
They are there so you can see the concept and play with it without any extra work before adjusting it to your real use case.
For your real use case you should omit them and use your real table and view names and whatever columns they have.
So the query for you to play with is:
#standardSQL
WITH check AS (
SELECT COUNT(1) > 0 changed FROM (
SELECT DISTINCT cols FROM `project.dataset.yourtable_Latest_Filtered`
EXCEPT DISTINCT
SELECT DISTINCT cols FROM `project.dataset.yourtable_Prior_Filtered`
)
)
SELECT t.* FROM `project.dataset.yourtable` t
CROSS JOIN check WHERE check.changed
It should be a very simple IF statement in any language.
Unfortunately, no! It cannot be done with just a simple IF. If you see fit, you can submit a feature request to the BigQuery team for whatever you think makes sense.
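The "return the table only if something changed" pattern can be tried locally. The sketch below uses Python's sqlite3 module instead of BigQuery, under a few assumptions: the two views are replaced by inline filters on one table, the `filter` column is renamed to `state`, and the CTE is called `chk` because CHECK is an SQL keyword. The helper name `rows_if_changed` is made up for illustration.

```python
import sqlite3

def rows_if_changed(rows):
    """Return all rows of the table, but only when 'latest' holds values missing from 'prior'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE yourtable (cols TEXT, state TEXT)")
    conn.executemany("INSERT INTO yourtable VALUES (?, ?)", rows)
    result = conn.execute("""
        WITH chk AS (
            -- One row, one flag: 1 when latest has values that prior lacks.
            SELECT COUNT(*) > 0 AS changed FROM (
                SELECT cols FROM yourtable WHERE state = 'latest'
                EXCEPT
                SELECT cols FROM yourtable WHERE state = 'prior'
            )
        )
        -- Cross joining the single flag row keeps every row or none.
        SELECT t.cols, t.state
        FROM yourtable t CROSS JOIN chk
        WHERE chk.changed
        ORDER BY t.cols, t.state
    """).fetchall()
    conn.close()
    return result

print(rows_if_changed([("aaa", "prior"), ("bbb", "latest")]))
print(rows_if_changed([("aaa", "prior"), ("aaa", "latest")]))
```

Because `chk` always produces exactly one row, the cross join never multiplies the output; it only switches it on or off.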

Combining Columns from different tables

I've written some SQL code to combine several columns from different tables.
SELECT *
FROM
(
SELECT PD_BARCODE
FROM docsadm.PD_BARCODE
WHERE SYSTEM_ID = 11660081
) t,
(
SELECT A_JAHRE
FROM docsadm.A_PD_DATENSCHUTZ
WHERE system_ID = 2066
) t2,
(
SELECT PD_PART_NAME
FROM docsadm.PD_FILE_PART
WHERE system_id = 11660082
) t3;
The code works fine, but if one of my WHERE clauses matches nothing in its table, the result is null even though the other columns have values. How can I solve this problem?
It looks like you are doing a cross join between the three subquery tables. This would probably only yield output which makes sense if each subquery returns a single value. I might suggest instead that you use a UNION ALL here:
SELECT ISNULL(PD_BARCODE, 'NA') AS value
FROM docsadm.PD_BARCODE WHERE SYSTEM_ID = 11660081
UNION ALL
SELECT ISNULL(A_JAHRE, 'NA')
FROM docsadm.A_PD_DATENSCHUTZ WHERE system_ID = 2066
UNION ALL
SELECT ISNULL(PD_PART_NAME, 'NA')
FROM docsadm.PD_FILE_PART WHERE system_id = 11660082
The above union query might require a slight modification if the three columns being selected don't all have the same type (which I assume to be varchar in my query).
If you really need these three points of data as separate columns, then you can just include the three subqueries as items in an outer select:
SELECT
(SELECT ISNULL(PD_BARCODE, 'NA')
FROM docsadm.PD_BARCODE WHERE SYSTEM_ID = 11660081) AS PD_BARCODE,
(SELECT ISNULL(A_JAHRE, 'NA')
FROM docsadm.A_PD_DATENSCHUTZ WHERE system_ID = 2066) AS A_JAHRE,
(SELECT ISNULL(PD_PART_NAME, 'NA')
FROM docsadm.PD_FILE_PART WHERE system_id = 11660082) AS PD_PART_NAME;
Note that, as the above is written, we are simply including the subqueries as values in the select statement, whereas in your original query you were joining the subqueries as separate tables.
Here is a query.
You can replace the word 'Empty' with your required word or value.
SELECT isnull(
(
SELECT PD_BARCODE
FROM docsadm.PD_BARCODE
WHERE SYSTEM_ID = 11660081
), 'Empty') AS PD_BARCODE,
isnull(
(
SELECT A_JAHRE
FROM docsadm.A_PD_DATENSCHUTZ
WHERE system_ID = 2066
), 'Empty') AS A_JAHRE,
isnull(
(
SELECT PD_PART_NAME
FROM docsadm.PD_FILE_PART
WHERE system_id = 11660082
), 'Empty') AS PD_PART_NAME;
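The difference between the cross join and the scalar-subquery form shows up as soon as one lookup finds no row. Here is a minimal sketch with Python's sqlite3 module, using IFNULL as SQLite's counterpart to ISNULL. The table contents ('BC-001', 'part-a') are invented sample values, and the `docsadm` schema prefix is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PD_BARCODE       (SYSTEM_ID INTEGER, PD_BARCODE TEXT);
    CREATE TABLE A_PD_DATENSCHUTZ (SYSTEM_ID INTEGER, A_JAHRE TEXT);
    CREATE TABLE PD_FILE_PART     (SYSTEM_ID INTEGER, PD_PART_NAME TEXT);
    INSERT INTO PD_BARCODE   VALUES (11660081, 'BC-001');
    INSERT INTO PD_FILE_PART VALUES (11660082, 'part-a');
    -- A_PD_DATENSCHUTZ is left empty on purpose: that lookup finds nothing.
""")

# Cross join of the three lookups: one empty input empties the whole result.
cross_joined = conn.execute("""
    SELECT *
    FROM (SELECT PD_BARCODE   FROM PD_BARCODE       WHERE SYSTEM_ID = 11660081) t,
         (SELECT A_JAHRE      FROM A_PD_DATENSCHUTZ WHERE SYSTEM_ID = 2066) t2,
         (SELECT PD_PART_NAME FROM PD_FILE_PART     WHERE SYSTEM_ID = 11660082) t3
""").fetchall()

# Scalar subqueries: a lookup with no row yields NULL, which IFNULL replaces.
with_defaults = conn.execute("""
    SELECT IFNULL((SELECT PD_BARCODE   FROM PD_BARCODE       WHERE SYSTEM_ID = 11660081), 'Empty'),
           IFNULL((SELECT A_JAHRE      FROM A_PD_DATENSCHUTZ WHERE SYSTEM_ID = 2066), 'Empty'),
           IFNULL((SELECT PD_PART_NAME FROM PD_FILE_PART     WHERE SYSTEM_ID = 11660082), 'Empty')
""").fetchone()

print(cross_joined, with_defaults)
```

The default has to wrap the whole subquery: placed inside it, the function would never run when no row matches.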
This is a special case since you would be getting only one value per query of yours which you are trying to put up as column. In this case, you can use SQL PIVOT clause with UNION ALL of your queries like shown below. MIN can be used for aggregation in this case.
This would mean, you can get your data row-wise as you like, even for multiple different fields and then pivot it into columns in one go.
SELECT * FROM
(
SELECT 'PD_BARCODE' as KeyItem, PD_BARCODE Name from docsadm.PD_BARCODE
where SYSTEM_ID=11660081
UNION ALL
SELECT 'A_JAHRE', A_JAHRE from docsadm.A_PD_DATENSCHUTZ
where system_ID=2066
UNION ALL
select 'PD_PART_NAME', PD_PART_NAME from docsadm.PD_FILE_PART
where system_id=11660082
) VTABLE
PIVOT(MIN(Name)
FOR KeyItem IN ([PD_BARCODE], [A_JAHRE], [PD_PART_NAME])) as pivotData;
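SQLite has no PIVOT clause, but the same rows-to-columns step can be sketched with conditional aggregation, which is roughly what a pivot with MIN does. The example below uses Python's sqlite3 module; the table `key_values` stands in for the UNION ALL subquery, and its two rows are invented sample data with the A_JAHRE lookup deliberately missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE key_values (KeyItem TEXT, Name TEXT)")
conn.executemany(
    "INSERT INTO key_values VALUES (?, ?)",
    [("PD_BARCODE", "BC-001"), ("PD_PART_NAME", "part-a")],  # no A_JAHRE row
)

# One MIN(CASE ...) per target column: each picks out the value for its key,
# and a key with no row comes back as NULL instead of removing the result row.
pivoted = conn.execute("""
    SELECT MIN(CASE WHEN KeyItem = 'PD_BARCODE'   THEN Name END) AS PD_BARCODE,
           MIN(CASE WHEN KeyItem = 'A_JAHRE'      THEN Name END) AS A_JAHRE,
           MIN(CASE WHEN KeyItem = 'PD_PART_NAME' THEN Name END) AS PD_PART_NAME
    FROM key_values
""").fetchone()

print(pivoted)
```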