Extracting rows from fact table which have missing data, using metadata table - sql

I have a situation where I have:
A fact table with an id column which is NOT unique but is never null. This fact table also has many other dimensions (columns), any of which may hold the default value -1 (which logically means null).
Example:
id | Dimension1 | Dimension2 | Dimension3
1  | Value      | -1         | Value
1  | -1         | -1         | Value
2  | -1         | Value      | Value
A metadata table that has the same dimensions as the fact table. Each row in this table represents a unique id from the fact table. The rest of the columns are populated with either null or 1, where 1 means that this dimension is required in the fact table for this id.
Example:
id | Dimension1 | Dimension2 | Dimension3
1  | 1          | NULL       | 1
2  | NULL       | 1          | 1
My goal is to get ONLY the rows from the fact table that are missing required information according to the metadata table. So from the examples above I would get only the row with id = 1 where Dimension1 = -1, since the metadata table says that for id = 1, dimensions 1 and 3 are required.
Is there an easy way of doing this?
I have written a very complicated query that joins these two tables and runs a CASE check for every dimension (I have more than 100 of them). Each check yields -1 if the dimension is required but missing in the fact table, and an outer query sums these values for each row and picks only the rows with a negative sum.
It does not work 100% and I think it's far too complicated to run on a really big fact table, so I'm open to ideas.
edit: Dynamic SQL is not allowed :(
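For reference, the join-plus-CASE idea described above can be sketched on the three-dimension example. This is only an illustration, not the actual query from the question: it runs on SQLite through Python's sqlite3 module so that it can be executed end to end, the table names fact and meta and the column missing_score are made up, and the sample values are borrowed from the first answer below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (id INT, Dimension1 INT, Dimension2 INT, Dimension3 INT);
CREATE TABLE meta (id INT, Dimension1 INT, Dimension2 INT, Dimension3 INT);
INSERT INTO fact VALUES (1, 123, -1, 345), (1, -1, -1, 246), (2, -1, 567, 987);
INSERT INTO meta VALUES (1, 1, NULL, 1), (2, NULL, 1, 1);
""")

# Inner query: one CASE per dimension yields -1 when the dimension is
# required (meta = 1) but missing (fact = -1), otherwise 0.
# Outer query: keep only the rows whose flags sum to a negative number.
missing_rows = conn.execute("""
SELECT id, Dimension1, Dimension2, Dimension3
FROM (
    SELECT f.id, f.Dimension1, f.Dimension2, f.Dimension3,
           CASE WHEN m.Dimension1 = 1 AND f.Dimension1 = -1 THEN -1 ELSE 0 END
         + CASE WHEN m.Dimension2 = 1 AND f.Dimension2 = -1 THEN -1 ELSE 0 END
         + CASE WHEN m.Dimension3 = 1 AND f.Dimension3 = -1 THEN -1 ELSE 0 END
           AS missing_score
    FROM fact f
    JOIN meta m ON m.id = f.id
) AS scored
WHERE missing_score < 0
""").fetchall()

print(missing_rows)
```

With 100+ dimensions the CASE list has to be written out by hand (or generated once, outside the database), since dynamic SQL is not an option here.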

I would suggest using a CTE and an EXCEPT query. With this solution you still have to write the CASE expressions, but the join seems far simpler to me, and you don't need to sum up any dummy values.
DECLARE @t TABLE(
    id int, Dimension1 int, Dimension2 int, Dimension3 int
)
DECLARE @tMeta TABLE(
    id int, Dimension1 int, Dimension2 int, Dimension3 int
)
INSERT INTO @t VALUES (1, 123, -1, 345), (1, -1, -1, 246), (2, -1, 567, 987)
INSERT INTO @tMeta VALUES (1, 1, NULL, 1), (2, NULL, 1, 1)
;WITH cte AS(
    SELECT id,
        CASE WHEN Dimension1 = -1 THEN NULL ELSE 1 END Dimension1,
        CASE WHEN Dimension2 = -1 THEN NULL ELSE 1 END Dimension2,
        CASE WHEN Dimension3 = -1 THEN NULL ELSE 1 END Dimension3
    FROM @t
    EXCEPT
    SELECT *
    FROM @tMeta
    EXCEPT
    SELECT id, ISNULL(Dimension1,1), ISNULL(Dimension2,1), ISNULL(Dimension3,1)
    FROM @tMeta
)
SELECT t.*
FROM @t t
JOIN cte c ON t.id = c.id
    AND CASE WHEN t.Dimension1 = -1 THEN -1 ELSE 1 END = ISNULL(c.Dimension1, -1)
    AND CASE WHEN t.Dimension2 = -1 THEN -1 ELSE 1 END = ISNULL(c.Dimension2, -1)
    AND CASE WHEN t.Dimension3 = -1 THEN -1 ELSE 1 END = ISNULL(c.Dimension3, -1)

You can use UNPIVOT to simplify the query. Your fact table has no row id, so the first CTE uses ROW_NUMBER() to create one (TableRowID). Then we unpivot both tables (the fact table TF and the template table TTemplate) and join them:
WITH TFBase AS
(
SELECT TF.*, ROW_NUMBER() OVER (ORDER BY ID) as TableRowID FROM TF
),
TFU AS
(
select id,TableRowID,dim,val
from TFBase
unpivot
(
val for dim in (Dimension1, Dimension2, Dimension3)
) u
WHERE U.Val <>-1
)
,
TFT AS
(
select id,dim,val
from TTemplate
unpivot
(
val for dim in (Dimension1, Dimension2, Dimension3)
) u
WHERE Val is NOT NULL
)
SELECT * FROM TFBase WHERE
TableRowID IN
(
SELECT TableRowID FROM TFU
LEFT JOIN TFT ON
(TFU.id=TFT.id) AND (TFU.dim = TFT.dim)
GROUP BY TableRowID, TFU.ID
HAVING COUNT(TFT.Val) <> (SELECT COUNT(*) FROM TFT WHERE ID = TFU.ID)
)

Related

SQL Server - Get column who have specific value

I have a SQL query which returns :
id | value
1 a
1 a
1 b
2 a
2 a
I want to get only the ids that have only the value a, so here the id 2.
How can I do this?
You can use aggregation and having clause to check if all the rows have value 'a' for a given id:
Using Count:
select id
from t
group by id
having count(*) = count(case when value = 'a' then 1 end);
Or using Sum
select id
from t
group by id
having SUM(case when value = 'a' then 0 else 1 end) = 0;
Use the following code:
Select id
from #test
group by id
having sum (case when value = 'a' then 0 else 1 end) = 0
The trick is to count 0 for 'a' and 1 for any other value, and then keep only the ids whose sum equals 0.
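To make the sum trick visible, here is a small sketch that first shows the per-id counts and then applies the HAVING filter. It is an illustration only: it runs on SQLite through Python's sqlite3 module, and the table name t and the aliases total_rows and non_a_rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, value TEXT);
INSERT INTO t VALUES (1,'a'),(1,'a'),(1,'b'),(2,'a'),(2,'a');
""")

# Step 1: per id, count all rows and the rows whose value is not 'a'.
per_id = conn.execute("""
SELECT id,
       COUNT(*) AS total_rows,
       SUM(CASE WHEN value = 'a' THEN 0 ELSE 1 END) AS non_a_rows
FROM t
GROUP BY id
ORDER BY id
""").fetchall()

# Step 2: keep the ids that have no non-'a' rows at all.
only_a_ids = [row[0] for row in conn.execute("""
SELECT id
FROM t
GROUP BY id
HAVING SUM(CASE WHEN value = 'a' THEN 0 ELSE 1 END) = 0
ORDER BY id
""")]

print(per_id)
print(only_a_ids)
```

Id 1 has one row that is not 'a', so its sum is 1 and it is filtered out; id 2 has a sum of 0 and is kept.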
This is slightly slower than @Gurwinder Singh's answer but can be more readable if performance is not your top priority.
CREATE TABLE tmp (id int, [value] char(1))
INSERT INTO tmp values (1,'a'),(1,'a'),(1,'b'),(2,'a'),(2,'a')
SELECT DISTINCT id
FROM tmp a
WHERE [value] = 'a'
AND id NOT IN (
SELECT id FROM tmp
WHERE [value] <> 'a')

How to get data based on Case condition and MAX Date

I have some data:
Declare @table table (RID VARCHAR(10),
    CommType INT,
    CommunicationType INT,
    VALUE VARCHAR(20),
    lastDate Datetime)
INSERT INTO @table (RID, CommType, CommunicationType, VALUE, lastDate)
VALUES
('00WAAS', 3, 0, 'mohan@gmail', '2012-06-15 15:23:49.653'),
('00WAAS', 3, 1, 'manasa@gmail', '2015-08-15 15:23:49.653'),
('00WAAS', 3, 2, 'mother@gmail', '2014-09-15 15:23:49.653'),
('00WAAS', 3, 2, 'father@gmail', '2016-01-15 15:23:49.653'),
('00WAAS', 3, 0, 'hello@gmail', '2013-01-15 15:23:49.653')
My query:
SELECT
    TT.RID,
    COALESCE(Homemail, BusinessMail, OtherMail) Mail
FROM
    (SELECT
        RID, MAX(Homemail) Homemail,
        MAX(BusinessMail) BusinessMail,
        MAX(OtherMail) OtherMail
    FROM
        (SELECT
            RID,
            CASE
                WHEN CommType = 3 AND CommunicationType = 0 THEN VALUE
            END AS Homemail,
            CASE
                WHEN CommType = 3 AND CommunicationType = 1 THEN VALUE
            END AS BusinessMail,
            CASE
                WHEN CommType = 3 AND CommunicationType = 2 THEN VALUE
            END AS OtherMail,
            lastDate
        FROM
            @table) T
    GROUP BY RID) TT
What I'm expecting
For each RID I need a single VALUE, chosen among the rows with CommType = 3 as follows:
If data is present for CommunicationType = 0, return the CommunicationType = 0 value with the latest lastDate.
If no data is present for CommunicationType = 0, return the CommunicationType = 1 value with the latest lastDate.
If no data is present for CommunicationType = 1 either, return the CommunicationType = 2 value with the latest lastDate.
So far I have tried a CASE expression, MAX and COALESCE, as shown in the query above.
I'm not entirely sure I've understood the requirement. But I think you want:
One record returned for each RID.
The returned record should have a CommType of 3.
If there is more than one record with a CommType 3 you want the record with the lowest CommunicationType.
If there is still more than one record you want the one with the most recent lastDate.
This query uses the window function ROW_NUMBER to rank the available records within a subquery. PARTITION BY ensures each RID is ranked separately. The outer query returns all records with a rank of 1.
Query
SELECT
r.*
FROM
(
/* For each RID We want the lowest communication type with
* the most recent last date.
*/
SELECT
ROW_NUMBER() OVER (PARTITION BY RID ORDER BY CommunicationType, lastDate DESC) AS rn,
*
FROM
@table
WHERE
CommType = 3
) AS r
WHERE
r.rn = 1
;
Next Steps
This query is ok but could be better. For example what would happen if two records had a matching CommType, CommunicationType and lastDate? Reading up on the differences between ROW_NUMBER, RANK, DENSE_RANK and NTILE will help you figure out your options here.
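As a rough illustration of the tie question, the sketch below ranks two rows that share the same CommType, CommunicationType and lastDate with both ROW_NUMBER and RANK. It runs on SQLite through Python's sqlite3 module; the table name comm and the sample addresses are made up for the example and are not the data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comm (RID TEXT, CommType INT, CommunicationType INT,
                   VALUE TEXT, lastDate TEXT);
INSERT INTO comm VALUES
  ('00WAAS', 3, 0, 'first@example.com',  '2013-01-15 15:23:49'),
  ('00WAAS', 3, 0, 'second@example.com', '2013-01-15 15:23:49'),
  ('00WAAS', 3, 1, 'third@example.com',  '2015-08-15 15:23:49');
""")

# Same ordering for both functions: lowest CommunicationType first,
# then most recent lastDate. The first two rows tie on both keys.
ranked = conn.execute("""
SELECT VALUE,
       ROW_NUMBER() OVER (PARTITION BY RID
                          ORDER BY CommunicationType, lastDate DESC) AS rn,
       RANK()       OVER (PARTITION BY RID
                          ORDER BY CommunicationType, lastDate DESC) AS rnk
FROM comm
WHERE CommType = 3
""").fetchall()

# ROW_NUMBER numbers the tied rows 1 and 2 in an arbitrary order.
row_number_winners = [value for value, rn, rnk in ranked if rn == 1]
# RANK gives both tied rows the value 1.
rank_winners = sorted(value for value, rn, rnk in ranked if rnk == 1)

print(row_number_winners)
print(rank_winners)
```

Filtering on rn = 1 always returns exactly one row per RID, but which of the tied rows it is depends on the engine. Filtering on rnk = 1 returns both. Adding another column to the ORDER BY as a tie-breaker is one way to make the ROW_NUMBER result deterministic.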
If I understood you correctly, use ROW_NUMBER():
SELECT tt.RID, COALESCE(tt.Homemail, tt.BusinessMail, tt.OtherMail)
FROM (
    SELECT s.RID,
        MAX(CASE WHEN s.CommType = 3 AND s.CommunicationType = 0 THEN s.VALUE END) AS Homemail,
        MAX(CASE WHEN s.CommType = 3 AND s.CommunicationType = 1 THEN s.VALUE END) AS BusinessMail,
        MAX(CASE WHEN s.CommType = 3 AND s.CommunicationType = 2 THEN s.VALUE END) AS OtherMail
    FROM (SELECT t.*,
              -- rnk = 1 marks the most recent row per RID and CommunicationType
              ROW_NUMBER() OVER (PARTITION BY t.RID, t.CommunicationType ORDER BY t.lastDate DESC) AS rnk
          FROM @table t
          WHERE t.CommType = 3) s
    WHERE s.rnk = 1
    GROUP BY s.RID) tt

How do I determine if a group of data exists in a table, given the data that should appear in the group's rows?

I am writing data to a table and allocating a "group-id" for each batch of data that is written. To illustrate, consider the following table.
GroupId Value
------- -----
1 a
1 b
1 c
2 a
2 b
3 a
3 b
3 c
3 d
In this example, there are three groups of data, each with similar but varying values.
How do I query this table to find a group that contains a given set of values? For instance, if I query for (a,b,c) the result should be group 1. Similarly, a query for (b,a) should result in group 2, and a query for (a, b, c, e) should result in the empty set.
I can write a stored procedure that performs the following steps:
select distinct GroupId from Groups -- and store locally
for each distinct GroupId: perform a set-difference (except) between the input and table values (for the group), and vice versa
return the GroupId if both set-difference operations produced empty sets
This seems a bit excessive, and I am hoping to leverage some other commands in SQL to simplify it. Is there a simpler way to perform a set comparison in this context, or to select the group ID that contains the exact input values for the query?
This is a set-within-sets query. I like to solve it using group by and having:
select groupid
from GroupValues gv
group by groupid
having sum(case when value = 'a' then 1 else 0 end) > 0 and
sum(case when value = 'b' then 1 else 0 end) > 0 and
sum(case when value = 'c' then 1 else 0 end) > 0 and
sum(case when value not in ('a', 'b', 'c') then 1 else 0 end) = 0;
The first three conditions in the having clause check that each element exists. The last condition checks that there are no other values. This method is quite flexible and handles various inclusion and exclusion conditions on the values you are looking for.
EDIT:
If you want to pass in a list, you can use:
with thelist as (
select 'a' as value union all
select 'b' union all
select 'c'
)
select groupid
from GroupValues gv left outer join
thelist
on gv.value = thelist.value
group by groupid
having count(distinct gv.value) = (select count(*) from thelist) and
count(distinct (case when gv.value = thelist.value then gv.value end)) = count(distinct gv.value);
Here the having clause counts the number of matching values and makes sure that this is the same size as the list.
EDIT:
The query failed to compile because a table alias was missing. It has been updated with the right table alias.
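The list-based variant can be tried against the sample groups from the question with a sketch like the following. It runs on SQLite through Python's sqlite3 module; the helper name find_groups is hypothetical, the second HAVING condition is written as a count of matched list values (equivalent to the CASE form above), and the sketch assumes a non-empty list of wanted values (duplicates are dropped before querying).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GroupValues (groupid INT, value TEXT);
INSERT INTO GroupValues VALUES
  (1,'a'),(1,'b'),(1,'c'),
  (2,'a'),(2,'b'),
  (3,'a'),(3,'b'),(3,'c'),(3,'d');
""")

def find_groups(conn, wanted):
    """Return the ids of groups whose values are exactly the wanted set."""
    wanted = list(dict.fromkeys(wanted))           # drop duplicates, keep order
    placeholders = ", ".join("(?)" for _ in wanted)
    sql = f"""
    WITH thelist(value) AS (VALUES {placeholders})
    SELECT gv.groupid
    FROM GroupValues gv
    LEFT JOIN thelist ON gv.value = thelist.value
    GROUP BY gv.groupid
    -- the group has as many distinct values as the list ...
    HAVING COUNT(DISTINCT gv.value) = (SELECT COUNT(*) FROM thelist)
    -- ... and every one of them matched a list entry
       AND COUNT(DISTINCT thelist.value) = COUNT(DISTINCT gv.value)
    ORDER BY gv.groupid
    """
    return [row[0] for row in conn.execute(sql, wanted)]

print(find_groups(conn, ["a", "b", "c"]))
print(find_groups(conn, ["b", "a"]))
print(find_groups(conn, ["a", "b", "c", "e"]))
```

Only the number of placeholders is interpolated into the SQL text; the values themselves are bound as parameters.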
This is kind of ugly, but it works. On larger datasets I'm not sure what performance would look like, but the nested instances of #GroupValues key off GroupID in the main table so I think as long as you have a good index on GroupID it probably wouldn't be too horrible.
If Object_ID('tempdb..#GroupValues') Is Not Null Drop Table #GroupValues
Create Table #GroupValues (GroupID Int, Val Varchar(10));
Insert #GroupValues (GroupID, Val)
Values (1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(3,'a'),(3,'b'),(3,'c'),(3,'d');
If Object_ID('tempdb..#FindValues') Is Not Null Drop Table #FindValues
Create Table #FindValues (Val Varchar(10));
Insert #FindValues (Val)
Values ('a'),('b'),('c');
Select Distinct gv.GroupID
From (Select Distinct GroupID
From #GroupValues) gv
Where Not Exists (Select 1
From #FindValues fv2
Where Not Exists (Select 1
From #GroupValues gv2
Where gv.GroupID = gv2.GroupID
And fv2.Val = gv2.Val))
And Not Exists (Select 1
From #GroupValues gv3
Where gv3.GroupID = gv.GroupID
And Not Exists (Select 1
From #FindValues fv3
Where gv3.Val = fv3.Val))

SQL - Combining incomplete

I'm using Oracle 10g. I have a table with a number of fields of varying types. The fields contain observations that have been made about a particular thing on a particular date by a particular site.
So:
ItemID, Date, Observation1, Observation2, Observation3...
There are about 40 Observations in each record. The table structure cannot be changed at this point in time.
Unfortunately not all the Observations have been populated (either accidentally or because the site is incapable of making that recording). I need to combine all the records about a particular item into a single record in a query, making it as complete as possible.
A simple way to do this would be something like
SELECT
ItemID,
MAX(Date),
MAX(Observation1),
MAX(Observation2)
etc.
FROM
Table
GROUP BY
ItemID
But ideally I would like it to pick the most recent observation available, not the max/min value. I could do this by writing sub queries in the form
SELECT
ItemID,
ObservationX,
ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY Date DESC) ROWNUMBER
FROM
Table
WHERE
ObservationX IS NOT NULL
And joining all the ROWNUMBER 1s together for an ItemID but because of the number of fields this would require 40 subqueries.
My question is whether there's a more concise way of doing this that I'm missing.
Create the table and the sample data
create table observation(
  item_id number,
  dt date,
  val1 number,
  val2 number );
insert into observation values( 1, date '2011-12-01', 1, null );
insert into observation values( 1, date '2011-12-02', null, 2 );
insert into observation values( 1, date '2011-12-03', 3, null );
insert into observation values( 2, date '2011-12-01', 4, null );
insert into observation values( 2, date '2011-12-02', 5, 6 );
Then use the KEEP clause on the MAX aggregate function, with an ORDER BY that sorts the rows with NULL observations first, so that DENSE_RANK LAST picks the most recent non-NULL observation. Whatever date you use in the ORDER BY needs to be earlier than the earliest real observation in the table.
select item_id,
       max(val1) keep( dense_rank last
                       order by (case when val1 is not null
                                      then dt
                                      else date '1900-01-01'
                                  end) ) val1,
       max(val2) keep( dense_rank last
                       order by (case when val2 is not null
                                      then dt
                                      else date '1900-01-01'
                                  end) ) val2
  from observation
 group by item_id;
ITEM_ID VAL1 VAL2
---------- ---------- ----------
1 3 2
2 5 6
I suspect that there is a more elegant solution to ignore the NULL values than adding the CASE statement to the ORDER BY but the CASE gets the job done.
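For comparison, on an engine without the KEEP clause the same "latest non-NULL value per column" result can be reproduced with one correlated subquery per column. This is a different technique from the KEEP query above and is shown only as a sketch on SQLite through Python's sqlite3 module, with dates stored as ISO text (an assumption of this example). It still needs one subquery per observation column, so it does not reduce the amount of SQL to write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (item_id INT, dt TEXT, val1 INT, val2 INT);
INSERT INTO observation VALUES
  (1, '2011-12-01', 1,    NULL),
  (1, '2011-12-02', NULL, 2),
  (1, '2011-12-03', 3,    NULL),
  (2, '2011-12-01', 4,    NULL),
  (2, '2011-12-02', 5,    6);
""")

# For every item, each subquery picks the newest row in which the
# column is populated and returns that column's value.
latest = conn.execute("""
SELECT i.item_id,
       (SELECT o.val1 FROM observation o
         WHERE o.item_id = i.item_id AND o.val1 IS NOT NULL
         ORDER BY o.dt DESC LIMIT 1) AS val1,
       (SELECT o.val2 FROM observation o
         WHERE o.item_id = i.item_id AND o.val2 IS NOT NULL
         ORDER BY o.dt DESC LIMIT 1) AS val2
FROM (SELECT DISTINCT item_id FROM observation) i
ORDER BY i.item_id
""").fetchall()

print(latest)
```

The result matches the output of the KEEP query for the same sample data.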
I don't know the Oracle-specific commands, but in generic SQL you could do something like the following.
First, create a pivot table that contains the consecutive numbers 0, 1, 2, ... (described further down).
Note that the Oracle equivalent of the function "isnull" is "NVL".
select itemgroup.ItemId,
max(case when itemsimplified.i = 0 then itemsimplified.observation1 else '' end) as observation1,
max(case when itemsimplified.i = 1 then itemsimplified.observation2 else '' end) as observation2,
max(case when itemsimplified.i = 2 then itemsimplified.observation3 else '' end) as observation3,
...
max(case when itemsimplified.i = 39 then itemsimplified.observation40 else '' end) as observation40
from (
select items.ItemId
from table as items
where items.item = _paramerter_for_retrive_only_one_item /* filter for one or more items here */
group by items.ItemId) itemgroup
left join
(
select
items.ItemId,
p.i,
isnull( max ( case when p.i = 0 then itemcombinations.observation1 else '' end ), '' ) as observation1,
isnull( max ( case when p.i = 1 then itemcombinations.observation2 else '' end ), '' ) as observation2,
isnull( max ( case when p.i = 2 then itemcombinations.observation3 else '' end ), '' ) as observation3,
...
isnull( max ( case when p.i = 39 then itemcombinations.observation40 else '' end ), '' ) as observation40
from
(select i from pivot where i < 40 /* your number of observation columns; each column gets one index */
)
as p
cross join table as items
left join table as itemcombinations
on items.ItemId = itemcombinations.ItemId
where items.item = _paramerter_for_retrive_only_one_item /* filter for one or more items here */
and (
(p.i = 0 and not itemcombinations.observation1 is null) /* column 1 */
or (p.i = 1 and not itemcombinations.observation2 is null) /* column 2 */
or (p.i = 2 and not itemcombinations.observation3 is null) /* column 3 */
....
or (p.i = 39 and not itemcombinations.observation40 is null) /* column 40 */
)
group by p.i, items.ItemId
) as itemsimplified
on itemsimplified.ItemId = itemgroup.ItemId
group by itemgroup.ItemId
About the pivot table
The pivot table is a helper table of consecutive integers.
Pivot table schema:
name: pivot, columns: {i: int}
How to populate it
First create a helper table foo.
Schema of foo:
name: foo, column: value, datatype: varchar
insert into foo
values ('0'), ('1'), ('2'), ('3'), ('4'),
       ('5'), ('6'), ('7'), ('8'), ('9');
/* insert 100 values */
insert into pivot
select concat(a.value, b.value) /* MySQL */
/* SQL Server: a.value + b.value */
/* Oracle: a.value || b.value */
from foo a, foo b
/* or insert 1000 values */
insert into pivot
select concat(a.value, b.value, c.value) /* MySQL */
/* SQL Server: a.value + b.value + c.value */
/* Oracle: a.value || b.value || c.value */
from foo a, foo b, foo c
The idea of the pivot table is described in "Transact-SQL Cookbook" by Jonathan Gennick and Ales Spetic.
I have to admit that the above solution (by Justin Cave) is simpler and easier to understand, but this is another good option.
In the end, as you said, you have already solved it.

SQL retrieval from tables

I have a table something like
EMPLOYEE_ID DTL_ID COLUMN_A COLUMN_B
---------------------------
JOHN 0 1 1
JOHN 1 3 1
LINN 0 1 12
SMITH 0 9 1
SMITH 1 11 12
It means for each person there will be one or more records with different DTL_ID's value (0, 1, 2 .. etc).
Now I'd like to create a T-SQL statement to retrieve the record for a given EMPLOYEE_ID and DTL_ID.
If the specified DTL_ID is NOT found, the record with DTL_ID=0 should be returned.
I know that I can achieve this in various ways, such as first checking whether a row exists via EXISTS or COUNT(*) and then retrieving the row.
However, I'd like to know other possible ways, because this retrieval statement is very common in my application and my table has hundreds of thousands of rows.
With the above approach, I have to query twice even if the record with the specified DTL_ID exists, and I want to avoid this.
Like this:
SELECT *
FROM table
WHERE EMPLOYEE_ID = ?? AND DTL_ID = ??
UNION
SELECT *
FROM table
WHERE EMPLOYEE_ID = ?? AND DTL_ID = 0
AND NOT EXISTS (SELECT *
FROM table
WHERE EMPLOYEE_ID = ?? AND DTL_ID = ??)
You will of course have to fill in the ?? with the proper number.
If DTL_ID is always 0 or positive:
SELECT TOP 1 * FROM table
WHERE EMPLOYEE_ID = @EmployeeID AND DTL_ID IN (@DTL_ID, 0)
ORDER BY DTL_ID DESC
If you're working across multiple employees in a single query, etc, then you might want to use ROW_NUMBER() if your version of SQL supports it.
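A sketch of that ROW_NUMBER() variant, covering all employees in one query, might look like the following. It runs on SQLite through Python's sqlite3 module; the table name emp_detail and the helper rows_with_fallback are hypothetical, and it relies on the same assumption as above that DTL_ID is never negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp_detail (EMPLOYEE_ID TEXT, DTL_ID INT, COLUMN_A INT, COLUMN_B INT);
INSERT INTO emp_detail VALUES
  ('JOHN',  0, 1,  1),
  ('JOHN',  1, 3,  1),
  ('LINN',  0, 1,  12),
  ('SMITH', 0, 9,  1),
  ('SMITH', 1, 11, 12);
""")

def rows_with_fallback(conn, dtl_id):
    """One row per employee: the requested DTL_ID if present, else DTL_ID 0."""
    return conn.execute("""
    SELECT EMPLOYEE_ID, DTL_ID, COLUMN_A, COLUMN_B
    FROM (
        SELECT d.*,
               -- the requested DTL_ID sorts ahead of 0 because it is larger
               ROW_NUMBER() OVER (PARTITION BY EMPLOYEE_ID
                                  ORDER BY DTL_ID DESC) AS rn
        FROM emp_detail d
        WHERE DTL_ID IN (?, 0)
    ) AS ranked
    WHERE rn = 1
    ORDER BY EMPLOYEE_ID
    """, (dtl_id,)).fetchall()

print(rows_with_fallback(conn, 1))
print(rows_with_fallback(conn, 2))
```

For DTL_ID = 1, JOHN and SMITH return their DTL_ID 1 rows and LINN falls back to DTL_ID 0. For DTL_ID = 2, every employee falls back to DTL_ID 0.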
Use ISNULL(DTL_ID, 0) in your final SELECT query
SELECT E1.EMPLOYEE_ID, ISNULL(E2.DTL_ID, 0), E1.COLUMN_A, E1.COLUMN_B
FROM EMPLOYEES AS E1
LEFT JOIN EMPLOYEES AS E2
ON E1.EMPLOYEE_ID = E2.EMPLOYEE_ID AND E2.DTL_ID = 42
You can use top and union, e.g.:
declare @t table(id int, value int, c char)
insert @t values (1,0,'a'), (1,1,'b'), (1,2,'c')
declare @id int = 1;
declare @value int = 2;
select top(1) *
from
(
    -- the requested row, if it exists
    select *
    from @t t
    where t.value = @value and t.id = @id
    union all
    -- the fallback row with value 0 for the same id
    select *
    from @t t
    where t.value = 0 and t.id = @id
) a
order by a.value desc
If @value = 2, then the query returns 1 2 c. If @value = 3, then the query returns 1 0 a.
SELECT MAX(DTL_ID) ...
WHERE DTL_ID IN (@DTL_ID, 0)