Return a SQL query where field doesn't contain specific text - sql

I will set up a quick scenario and then ask my question. Let's say I have a DB for my warehouse with the following fields: StorageBinID, StorageReceivedDT, StorageItem, and StorageLocation.
Any single storage bin could have multiple records because of the multiple items in it. What I am trying to do is create a query that returns only the storage bins that don't contain a certain item, BUT without the rest of their contents. For example, let's say I have 5000 storage bins in my warehouse and I know that there are a handful of bins that do not have "ItemX" listed in the StorageItem field. I would like to return that short list of StorageBinIDs without getting a full list of all of the bins without ItemX and their full contents. (I think that rules out IN, LIKE, and CONTAINS and their NOT variants.)
My workaround right now is running two queries, usually filtered by a StorageReceivedDT. The first returns the bins received on that date, and the second returns the bins containing ItemX. Then I import both .csv files into Excel and use an ISNA(MATCH) formula to compare the two columns.
Is this possible through a query? Thank you very much in advance for any advice.

You can do this as an aggregation query with a HAVING clause. Just count the number of rows where "ItemX" appears in each bin, and choose the bins where the count is 0:
select StorageBinID
from your_table t
group by StorageBinID
having sum(case when StorageItem = 'ItemX' then 1 else 0 end) = 0;
Note that this only returns bins that have some items in them. If you have completely empty bins, they will not appear in the results. You do not provide enough information to handle that situation (although I can speculate that you have a StorageBins table that would be needed to solve it).
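If such a bin-level table does exist, a LEFT JOIN from it keeps the empty bins in play. This is only a sketch, since both StorageBins and your_table are assumed names:
select b.StorageBinID
from StorageBins b                       -- hypothetical master list of all bins
left join your_table t
    on t.StorageBinID = b.StorageBinID
group by b.StorageBinID
having sum(case when t.StorageItem = 'ItemX' then 1 else 0 end) = 0;
An empty bin contributes a single all-NULL joined row, so its conditional sum is still 0 and the bin is returned.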

What flavour of SQL do you use?
From the info that you gave, you could use:
select distinct StorageBinID
from table_name
where StorageBinID not in (
select StorageBinID
from table_name
where StorageItem like '%ItemX%'
)
You'll have to replace table_name with the name of your table.
If you want only exact matches (the StorageItem to be exactly "ItemX"), you should replace the condition
where StorageItem like '%ItemX%'
with
where StorageItem = 'ItemX'
Another option (which should be faster); note that MINUS is Oracle syntax, and most other databases spell it EXCEPT:
select StorageBinID
from table_name
minus
select StorageBinID
from table_name
where StorageItem like '%ItemX%'
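If your database supports neither MINUS nor EXCEPT, NOT EXISTS should behave the same; a sketch against the same assumed table_name:
select distinct t.StorageBinID
from table_name t
where not exists (
    select 1
    from table_name x
    where x.StorageBinID = t.StorageBinID
      and x.StorageItem = 'ItemX'
);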

Related

Write SQL from SAS

I have this code in SAS and I'm trying to write the SQL equivalent; I have no experience in SAS.
data Fulls Fulls_Dupes;
    set Fulls;
    by name coeff week;
    if rid = 0 and ^last.week then output Fulls_Dupes;
    else output Fulls;
run;
I tried the following, but didn't produce the same output:
select * from Fulls where rid = 0 group by name, coeff, week
Is my SQL query correct?
SQL does not have a concept of observation order, so there is no direct equivalent of the LAST. concept. If you have some variable that increases monotonically within the groups defined by distinct values of name, coeff, and week, then you can select the observation that has the maximum value of that variable to find the one that is the LAST.
So, for example, if you also had a variable named DAY that uniquely identified and ordered the observations in the same way as they exist in the FULLS dataset now, then you could use the test DAY=MAX(DAY) to find the last observation. In PROC SQL you can use that test directly, because SAS will automatically remerge the aggregate value back onto all of the detailed observations. In other SQL implementations you might need to add an extra query to get the max.
create table new_FULLS as
select * from FULLS
group by name, coeff, week
having day = max(day) or rid ne 0
;
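In an implementation with window functions, that extra query can compute the group maximum explicitly. A sketch, assuming the same DAY variable and listing only the columns named so far (the real table may carry more):
create table new_FULLS as
select name, coeff, week, rid, day
from (
    select name, coeff, week, rid, day,
           -- latest DAY within each name/coeff/week group
           max(day) over (partition by name, coeff, week) as max_day
    from FULLS
) t
where day = max_day or rid <> 0;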
SQL also does not have any concept of writing two datasets at once. But for this example since the two generated datasets are distinct and include all of the original observations you could generate the second from the first using EXCEPT.
So if you can build the new FULLS, you can get FULLS_DUPES from the new FULLS and the old FULLS.
create table FULLS_DUPES as
select * from FULLS
except
select * from new_FULLS
;

Exclude strings starting with a certain letter in a SQL query in Google BigQuery

I'm trying to exclude values from a larger public table in Google's BigQuery using the following SQL lines.
The last line is meant to exclude entries that start with a certain letter, e.g. 'C'.
For some reason, when I add that line, the count increases. Logically the number of selected rows should decrease, and I can't figure out why.
How can I make the exclusion work?
SELECT Count(*)
FROM `patents-public-data.patents.publications`,
unnest(description_localized) as description_unnest,
unnest(claims_localized) as claims_unnest,
unnest(cpc) as cpc_unnest
where description_unnest.language in ('en','EN')
and claims_unnest.language in ('en','EN')
and publication_date >19900101
and (SUBSTRING(cpc_unnest.code,1,1) <> 'C');
OK, I think I found the mistakes I made.
I compared the row counts with and without line #7. That extra UNNEST produces one row per CPC code of each publication, and that is what increased the count.
#7 unnest(cpc) as cpc_unnest
THIS IS MOST IMPORTANT: I did not want the number of rows, but the number of unique entries. As the table is built up according to the publication numbers, I can use that column to search for unique entries with COUNT(DISTINCT attribute):
SELECT Count(DISTINCT publication_number)
The whole solution is this
SELECT Count(DISTINCT publication_number)
FROM `patents-public-data.patents.publications`,
unnest(description_localized) as description_unnest,
unnest(claims_localized) as claims_unnest,
unnest(cpc) as cpc_unnest
where description_unnest.language in ('en','EN')
and claims_unnest.language in ('en','EN')
and publication_date >19900101
and (SUBSTRING(cpc_unnest.code,1,1) <> 'C')

Best practices for dealing with duplicate rows caused by unnested records in BigQuery?

Working with data coming from Facebook more often than not means working with nested records, which, in my case, is where all the “spicy” data is. There is a downside, however: a huge number of duplicate rows, which when not handled properly can cause over-reporting and/or data discrepancies.
Below is a use case which when joined with my primary data (coming from tables which do not involve any unnesting) causes a slight discrepancy in the final numbers.
Technologies used - Facebook Data -> Stitch -> BigQuery -> dbt -> Google Data Studio
I would usually create separate models where I'd unnest a record, transform the data, and then join it into the rest of my models. An example of this is getting all website purchase conversions from the ads_insights table's actions record.
Here is the difference though:

Query:
SELECT count(*) AS row_count
FROM ads_insights
Result:
 row_count - 316

Query:
SELECT count(*) AS row_count
FROM ads_insights,
UNNEST(actions) AS actions
Result:
 row_count - 5612

After unnesting, I’d use the row data to create columns for each conversion like so:
CASE WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase' THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0) ELSE 0 END AS website_purchase

And finally I would join this model to the rest of my models. The only problem is that those 5600 rows cause a slight discrepancy when joined with the rest. Since I've already used the row data to create the columns, I don't care about the unnested record data anymore and can go back to my original 316 rows. The only question is: how? What techniques are out there that will help me clean up my model?
Solution:
Even though at some point I'd aggregate and group all the fields in my query, as dylanbaker suggested in his answer, the discrepancy would still persist. After doing a deep dive into my data I found that the unnested query returns 279 rows, whereas the nested one returns 314. That focused my attention on the unnesting step, which was removing 35 rows, and those 35 rows happened to have null record values. After some searching I found a StackOverflow answer which suggests using LEFT JOIN UNNEST to preserve all rows that have null record values, instead of CROSS JOIN UNNEST, which removes them.
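A sketch of the row-preserving variant, mirroring the value.* field access used above (some_column is a hypothetical pass-through column):
SELECT
    i.some_column,
    SUM(CASE
            WHEN a.value.action_type = 'offsite_conversion.fb_pixel_purchase'
            THEN COALESCE(a.value._28d_click, 0) + COALESCE(a.value._1d_view, 0)
            ELSE 0
        END) AS website_purchase
FROM ads_insights AS i
LEFT JOIN UNNEST(actions) AS a    -- rows whose actions array is NULL or empty survive the join
GROUP BY 1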
You would typically want to do a 'pivot' here. You're most of the way there; you just need to sum and group by the relevant columns in order to get this back to the grain that you originally had and want.
I believe you'll want something like this:
select
    ads_insights.some_column,
    ads_insights.some_other_column,
    sum(case
            when value.action_type = 'offsite_conversion.fb_pixel_purchase'
            then coalesce(value._28d_click, 0) + coalesce(value._1d_view, 0)
            else 0
        end) as website_purchase
from ads_insights,
unnest(actions) as actions
group by 1, 2
The initial columns would be whatever you want from the original table. The 'sum case whens' would be to pivot and aggregate the unnested data.
You can actually do some magic with unnests inside the select statement.
Does this work for you?
SELECT
some_column,
(SELECT coalesce(_28d_click, 0) + coalesce(_1d_view, 0) from unnest(actions) WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights

BigQuery can use wildcard table names and _TABLE_SUFFIX, but I am looking for a similar solution with wildcard datasets and a dataset suffix

If you process data daily and put the results into the same dataset, such as results, and each day's table has the same name prefix with the date as a suffix, such as result1_20190101, result1_20190102, etc., then you can query the result tables using wildcard table names and _TABLE_SUFFIX.
So your dataset/tables look like
results/result1_20190101
results/result1_20190102
results/result2_20190101
results/result2_20190102
So I can query all the result1 tables:
select * from `xxxx.results.result1*`
But I arrange the result tables differently. Because I have dozens of tables processed each day, I use the date as the dataset name, so I can easily check and manage each day's results.
So my dataset/tables look like
20190101/result1
20190101/result2
...
20190102/result1
20190102/result2
...
My daily data processing usually does not query across dates (datasets); the daily results are pushed to the next steps of the data pipeline, etc.
But once in a while I need to do a quick check, and then I need to query across the dates (in my case, across the datasets).
So when I try to query result1, I have to hard-code the dataset names:
select * from `xxxxxx.20190101.result1`
union all
select * from `xxxxxx.20190102.result1`
union all
...
1) First question: is there any way I could use wildcards and suffixes on datasets, as we can with tables?
2) Second question: how could I use a date function, such as DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY), to get the date value and use it in the query below
select * from `xxxxxx.20190101.result1`
union all
select * from `xxxxxx.20190102.result1`
union all
...
to replace the hard-coded values 20190101, 20190102, etc.?
There are no wildcards and/or suffixes available for BigQuery datasets (at least as of today).
In the meantime, you can check the feature request for INFORMATION_SCHEMA, which is in alpha now. You can apply for it by submitting the form that is available there.
In short: you will be able to query the list of datasets in the project and then use it to construct your query. Please note: you still need to use some sort of client to script all this properly.
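For the second question, if BigQuery scripting is available in your project, one way to avoid the hard-coded dates is to build the table reference from a computed date. This is only a sketch (xxxxxx stands in for the real project ID, and the INFORMATION_SCHEMA part depends on the alpha mentioned above):
-- Question 1 (once INFORMATION_SCHEMA is available): list the date-named datasets.
-- SELECT schema_name FROM `xxxxxx`.INFORMATION_SCHEMA.SCHEMATA;

-- Question 2: compute the date, then query the matching dataset dynamically.
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
EXECUTE IMMEDIATE FORMAT(
    "SELECT * FROM `xxxxxx.%s.result1`",
    FORMAT_DATE('%Y%m%d', d)
);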

SQL MIN() returns multiple values?

I am using SQL Server 2005, querying with Web Developer 2010, and the MIN function appears to be returning more than one value per ID (see below). Ideally I would like it to return just one row for each ID.
SELECT Production.WorksOrderOperations.WorksOrderNumber,
MIN(Production.WorksOrderOperations.OperationNumber) AS Expr1,
Production.Resources.ResourceCode,
Production.Resources.ResourceDescription,
Production.WorksOrderExcel_ExcelExport_View.PartNumber,
Production.WorksOrderOperations.PlannedQuantity,
Production.WorksOrderOperations.PlannedSetTime,
Production.WorksOrderOperations.PlannedRunTime
FROM Production.WorksOrderOperations
INNER JOIN Production.Resources
ON Production.WorksOrderOperations.ResourceID = Production.Resources.ResourceID
INNER JOIN Production.WorksOrderExcel_ExcelExport_View
ON Production.WorksOrderOperations.WorksOrderNumber = Production.WorksOrderExcel_ExcelExport_View.WorksOrderNumber
WHERE Production.WorksOrderOperations.WorksOrderNumber IN
( SELECT WorksOrderNumber
FROM Production.WorksOrderExcel_ExcelExport_View AS WorksOrderExcel_ExcelExport_View_1
WHERE (WorksOrderSuffixStatus = 'Proposed'))
AND Production.Resources.ResourceCode IN ('1303', '1604')
GROUP BY Production.WorksOrderOperations.WorksOrderNumber,
Production.Resources.ResourceCode,
Production.Resources.ResourceDescription,
Production.WorksOrderExcel_ExcelExport_View.PartNumber,
Production.WorksOrderOperations.PlannedQuantity,
Production.WorksOrderOperations.PlannedSetTime,
Production.WorksOrderOperations.PlannedRunTime
If you can get your head around it: I am selecting certain columns from multiple tables where the WorksOrderNumber is also contained within a subquery, plus numerous other conditions.
The result set looks a little like this (I have blurred out irrelevant data): http://i.stack.imgur.com/5UFIp.png (the site wouldn't let me embed the image).
The highlighted rows are NOT supposed to be there, I cannot explicitly filter them out, as this result set will be updated daily and it is likely to happen with a different record.
I have tried casting and converting the OperationNumber to numerous other data types; as varchar it returns '100' instead of '30', because string comparison sorts '100' before '30'. I have also tried searching, but no one seems to have the same problem.
I did not structure the tables (they're horribly normalised), and it is not possible to restructure them.
Any ideas appreciated, many thanks.
The MIN function returns the minimum within each group.
If you want the minimum for each ID, you need to group on just the ID.
I assume that by "ID" you are referring to Production.WorksOrderOperations.WorksOrderNumber.
You can add this as a derived "table" in your SQL:
(SELECT Production.WorksOrderOperations.WorksOrderNumber,
        MIN(Production.WorksOrderOperations.OperationNumber) AS MinOperationNumber
 FROM Production.WorksOrderOperations
 GROUP BY Production.WorksOrderOperations.WorksOrderNumber) AS MinOps
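For instance, a sketch of wiring that derived table back in (the MinOps alias, the MinOperationNumber column, and the trimmed column list are illustrative, not part of the original query):
SELECT wo.WorksOrderNumber,
       wo.OperationNumber,
       r.ResourceCode,
       r.ResourceDescription
FROM Production.WorksOrderOperations AS wo
INNER JOIN (SELECT WorksOrderNumber,
                   MIN(OperationNumber) AS MinOperationNumber
            FROM Production.WorksOrderOperations
            GROUP BY WorksOrderNumber) AS MinOps
    ON wo.WorksOrderNumber = MinOps.WorksOrderNumber
   AND wo.OperationNumber = MinOps.MinOperationNumber
INNER JOIN Production.Resources AS r
    ON wo.ResourceID = r.ResourceID
Because the join keeps only the row carrying each works order's minimum OperationNumber, the remaining columns can be selected without grouping on them, which is what was splitting the groups in the original query.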