SQL query on AB-->C

Consider a relation S with attributes A, B, C, and D. Write an SQL query that returns an empty answer if and only if the functional dependency AB-->C holds on relation S. (It is not important what your query returns when the functional dependency does not hold on S, as long as the query result is not empty in that case.) Assume that no NULL values are present.
My question is how to return an empty answer and how to correct my part if it's wrong.
SELECT S1.*
FROM S AS S1, S AS S2
WHERE (S1.C != S2.C) AND (S1.A = S2.A) AND (S1.B = S2.B)

The functional dependency AB-->C holds iff each value of (a,b) is associated with exactly one value of (c).
So the check that is needed is whether any value of the tuple (a,b) is related to more than one value of c.
To demonstrate that the functional dependency does not hold, we would need to demonstrate a counterexample.
Here are a couple of simple examples.
Functional dependency (a,b)->(c) holds
a b c d
-- -- -- --
2 3 5 42
2 3 5 42
Functional dependency does not hold
a b c d
-- -- -- --
2 3 7 42
2 3 11 42
If the functional dependency does not hold, then some value of (a,b) corresponds to different values of (c).
Several queries are possible. Here is one example:
SELECT s.a
     , s.b
  FROM s
 GROUP BY s.a, s.b
HAVING NOT ( MIN(s.c) <=> MAX(s.c) )
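That GROUP BY / HAVING check can be demonstrated end to end. Here is a sketch using SQLite through Python's sqlite3 (the table names s_holds and s_violates are invented for the demo); note that <=> is MySQL's null-safe equality, and since no NULLs are present, MIN(c) <> MAX(c) is equivalent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s_holds (a INT, b INT, c INT, d INT);
    INSERT INTO s_holds VALUES (2, 3, 5, 42), (2, 3, 5, 42);

    CREATE TABLE s_violates (a INT, b INT, c INT, d INT);
    INSERT INTO s_violates VALUES (2, 3, 7, 42), (2, 3, 11, 42);
""")

# The FD-check query: a (a, b) group violates AB->C exactly when it
# spans more than one distinct value of c.
check = """
    SELECT a, b
      FROM {table}
     GROUP BY a, b
    HAVING MIN(c) <> MAX(c)
"""

print(conn.execute(check.format(table="s_holds")).fetchall())     # [] -- FD holds
print(conn.execute(check.format(table="s_violates")).fetchall())  # [(2, 3)] -- FD violated
```

The empty result on the first table and the non-empty result on the second are exactly what the exercise asks for.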

Fetch big tree without overhead in left tables

My question is more theoretical: why do RDBMSs and drivers return data the way they all do, not how they find the correct result set, nor how to find it. I'm pretty familiar with SQL, but there is one thing that has always annoyed my sense of economy.
Consider following "class" graph:
A {
field1, ..., field9
b_items = [ b1, ..., bN ]
}
B {
field1, ..., field6
c_items = [ c1, ..., cM ]
}
C {
field1, field2
}
We have a few A objects, each A object has many B objects, and each B object has lots of C objects. count(A) < count(B) << count(C).
Now I would like to use an RDBMS to store it, because relations are cool and optimizers are smart, so I can get virtually anything in milliseconds, provided there is a good plan and index set.
I'll skip table creation code, which should be obvious, and go straight to the select:
SELECT *
FROM A
LEFT JOIN B ON B.a_id = A.id
LEFT JOIN C ON C.b_id = B.id
WHERE whatever
The database server returns a result set combining all columns from all tables, properly joined into a sort-of tree:
A.f1 .... A.f9 B.f1 .... B.f6 C.f1 C.f2
---------------------------------------------------
1 1 1 1 1 1 1 1
1 1 1 1 1 1 2 2
1 1 1 1 1 1 3 3
... more rows...
1 1 1 1 1 1 999 999
↓
1 1 1 2 2 2 1 1
1 1 1 2 2 2 2 2
... more rows...
1 1 1 2 2 2 999 999
... lots of rows ...
1 1 1 99 99 99 999 999
↓
2 2 2 -- oh there it is, A[2]
...
5 5 5 NULL NULL NULL NULL NULL -- A[5] has no b_items
...
9 9 9 ...
The problem is that if A has many columns, especially with text, json, or other heavy data, it is duplicated thousands of times to match each product of the +B+C join. Why don't SQL servers at least skip sending the same {A,B}-rows after the first one in a join group? Ideally, I would like to see something like this as a result:
[
{
<A-fields>,
B = [
{
<B-fields>,
C = [
{
<C-fields>
},
... more C rows
]
},
... more B rows
]
},
... more A rows
]
which pretty much resembles what I actually need to get in memory on the client side. I know I can make more queries to fetch less data, e.g. via A.id IN (ids...) or a stored proc returning NULLs on parasite rows, but isn't the relational model intended for one-shot access? Roundtrips are heavy, and so are planner guesses. And real data graphs are rarely only 3 levels deep (consider 5-10). So why not do it all in a single pass, but without the excessive traffic?
I'm fine with duplicate cells in the A and B columns, because usually there are not too many, but maybe I'm missing something mainstream, SQL-based and non-hacky that Google has been hiding from me for so many years.
Thanks!
The only way to avoid transferring duplicated data is to use aggregate functions like string_agg() or array_agg(). You can also aggregate the data using jsonb functions. You can even get a single JSON object instead of tabular data. Example:
select jsonb_agg(taba)
from (
select to_jsonb(taba) || jsonb_build_object('tabb', jsonb_agg(tabb)) taba
from taba
left join (
select to_jsonb(tabb) || jsonb_build_object('tabc', jsonb_agg(to_jsonb(tabc))) tabb
from tabb
join tabc on tabc.bid = tabb.id
group by tabb.id
) tabb
on (tabb->>'aid')::int = taba.id
group by taba.id
) taba
Complete working example.
json_agg() may not be the fastest thing. Also, I wonder if your ORM will digest it properly and instantiate the right objects.
The usual way is to simply do:
SELECT ... FROM a WHERE ...
Then you recover the ids, and do:
SELECT ... FROM b WHERE a_id IN (the list you just got)
SELECT ... FROM c WHERE a_id IN (the list you just got)
These are usually autogenerated by an ORM. If the ORM is smart, you get one query per table. If it is dumb, you get one query per object... However, this forces three queries, with network roundtrips, plus some processing. Fortunately, Postgres will let you have your cake and eat it, although that takes a little bit of extra work.
Thus, you can create a function in plpgsql which returns SETOF refcursor. Since each refcursor is a cursor over its own query, such a function can effectively return several result sets.
Example.
Back in the day when I was doing SQL for websites, I used that a few times, mostly when you just want to fetch one object and a few dependencies, so the actual query parsing and planning takes longer than the queries themselves, which return one row or a few. Since it uses a function, everything is already compiled. It's very efficient.
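The one-query-per-table approach described above, with client-side assembly into the nested shape from the question, can be sketched in plain Python with SQLite (the schema and payload columns are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INT, payload TEXT);
    CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INT, payload TEXT);
    INSERT INTO a VALUES (1, 'A1'), (2, 'A2');
    INSERT INTO b VALUES (10, 1, 'B10'), (11, 1, 'B11'), (20, 2, 'B20');
    INSERT INTO c VALUES (100, 10, 'C100'), (101, 10, 'C101'), (200, 20, 'C200');
""")

# One query per table: the heavy A and B payloads cross the wire exactly once.
a_rows = conn.execute("SELECT id, payload FROM a ORDER BY id").fetchall()
a_ids = [r[0] for r in a_rows]
ph = ",".join("?" * len(a_ids))
b_rows = conn.execute(
    f"SELECT id, a_id, payload FROM b WHERE a_id IN ({ph}) ORDER BY id", a_ids
).fetchall()
b_ids = [r[0] for r in b_rows]
ph = ",".join("?" * len(b_ids))
c_rows = conn.execute(
    f"SELECT id, b_id, payload FROM c WHERE b_id IN ({ph}) ORDER BY id", b_ids
).fetchall()

# Client-side assembly into the nested A -> B -> C shape from the question.
tree = {aid: {"payload": p, "B": {}} for aid, p in a_rows}
for bid, aid, p in b_rows:
    tree[aid]["B"][bid] = {"payload": p, "C": []}
b_index = {bid: aid for bid, aid, _ in b_rows}
for cid, bid, p in c_rows:
    tree[b_index[bid]]["B"][bid]["C"].append(p)

print(tree[1]["B"][10]["C"])  # ['C100', 'C101']
```

Three roundtrips instead of one, but no duplicated A or B cells in the transfer.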

Difference in NA/NULL treatment using dplyr::left_join (R lang) vs. SQL LEFT JOIN

I want to left join two dataframes where there might be NAs in the join column on both sides (i.e. in both code columns):
a <- data.frame(code=c(1,2,NA))
b <- data.frame(code=c(1,2,NA, NA), name=LETTERS[1:4])
Using dplyr, we get:
left_join(a, b, by="code")
code name
1 1 A
2 2 B
3 NA C
4 NA D
Using SQL, we get:
CREATE TABLE a (code INT);
INSERT INTO a VALUES (1),(2),(NULL);
CREATE TABLE b (code INT, name VARCHAR);
INSERT INTO b VALUES (1, 'A'),(2, 'B'),(NULL, 'C'), (NULL, 'D');
SELECT * FROM a LEFT JOIN b USING (code);
It seems that dplyr joins do not treat NAs like SQL NULL values.
Is there a way to get dplyr to behave in the same way as SQL?
What is rationale behind this type of NA treatment?
PS. Of course, I could remove the NAs first, e.g. left_join(a, na.omit(b), by="code"), but that is not my question.
In SQL, NULL matches nothing, because SQL has no information on what it should join to -- hence the resulting NULLs in your joined data set, just as they would appear in a left outer join without a match in the right data set.
In R, however, the default behaviour for NA in joins is to treat it like an ordinary data point, so NA matches NA. For example,
> match(NA, NA)
[1] 1
One way you can circumvent this would be to use the base merge method,
> merge(a, b, by="code", all.x=TRUE, incomparables=NA)
code name
1 1 A
2 2 B
3 NA <NA>
The incomparables parameter here allows you to define values that cannot be matched, essentially forcing R to treat NA the way SQL treats NULL. It doesn't look like the incomparables feature is implemented in left_join, but it may simply be named differently.
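For reference, the SQL side of the comparison can be reproduced with SQLite through Python's sqlite3, confirming that NULL matches nothing in the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (code INT);
    INSERT INTO a VALUES (1), (2), (NULL);
    CREATE TABLE b (code INT, name TEXT);
    INSERT INTO b VALUES (1, 'A'), (2, 'B'), (NULL, 'C'), (NULL, 'D');
""")

rows = conn.execute("SELECT a.code, b.name FROM a LEFT JOIN b USING (code)").fetchall()
# Three rows come back: the NULL code in a matches neither 'C' nor 'D',
# so it surfaces as an unmatched left-join row (None, None).
print(rows)
```

This is the opposite of the four-row dplyr result, where the single NA in a matched both NA rows in b.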
By default, the code column would have a primary key and would therefore not accept NULL values.

INFORMATICA Using transformation to get desired target from a single flat file (see pictures)

I just started out using Informatica, and currently I am figuring out how to get to this target output (flat file to Microsoft SSIS):
ID Letter Parent_ID
---- ------ ---------
1 A NULL
2 B 1
3 C 1
4 D 2
5 E 2
6 F 3
7 G 3
8 H 4
9 I 4
From (assuming that this is a comma-delimited flat file):
c1,c2,c3,c4
A,B,D,H
A,B,D,I
A,B,E
A,C,F
A,C,G
EDIT: where c1, c2, c3, and c4 are the header row.
EDIT: A more descriptive representation of what I want to achieve:
EDIT: Here is what I have so far (Normalizer for achieving the letter column and Sequence Generator for ID)
Thanks in advance.
I'd go with a two-phase approach. Here's the general idea (not a full, step-by-step solution).
Perform a pivot to get all values into separate rows (e.g. from "A,B,D,H" do substrings and union the data to get four rows)
Perform a sort with distinct and insert into the target to get IDs assigned. End of mapping one.
In mapping two, add a Sequence to add row numbers
Do the pivot again
Use an expression variable to refer to the previous row and previous RowID (How do I get previous row?)
If the current RowID doesn't match the previous RowID, this is a top node and has no parent.
If the previous row exists and the RowID matches, the previous row is the parent. Perform a lookup to get its ID from the DB and use it as Parent_ID. Send the update to the DB.
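Outside Informatica, the pivot-and-assign logic in those steps can be illustrated with a rough plain-Python sketch (not an Informatica mapping) that reproduces the target table from the sample flat file:

```python
# Sketch of the pivot + sequence + parent-assignment steps above,
# in plain Python rather than Informatica transformations.
rows = [line.split(",") for line in [
    "A,B,D,H",
    "A,B,D,I",
    "A,B,E",
    "A,C,F",
    "A,C,G",
]]

ids, parents = {}, {}
# Walk column by column (the "pivot"): each new letter gets the next
# sequence number, and its parent is the letter one column to the left.
for col in range(max(len(r) for r in rows)):
    for row in rows:
        if col < len(row) and row[col] not in ids:
            ids[row[col]] = len(ids) + 1
            parents[row[col]] = row[col - 1] if col > 0 else None

# Prints the target table: ID, Letter, Parent_ID (None for the root).
for letter, letter_id in ids.items():
    parent_id = ids[parents[letter]] if parents[letter] else None
    print(letter_id, letter, parent_id)
```

With this input the walk assigns A=1, B=2, C=3, D=4, E=5, F=6, G=7, H=8, I=9, matching the target output shown in the question.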

Delete duplicates when the duplicates are not in the same column

Here is a sample of my data (n > 3000), which ties two numbers together:
id a b
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
5 7030344 7030342
6 7030364 7008059
7 7030659 7066051
8 7030345 7030343
9 7031815 7045692
10 7032644 7102337
Now, the problem is that id=2 is a duplicate of id=5 and id=4 is a duplicate of id=8. When I tried to write if-then statements to map column a to column b, the numbers basically just got swapped. There are many cases like this in my full data.
So, my question is how to identify the duplicate(s) and delete one row of each duplicate pair (either id=2 or id=5). I would prefer to do this in Excel, but I could work with SQL Server or SAS, too.
Thank you in advance. Please comment if my question is not clear.
What I want:
id a b
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
6 7030364 7008059
7 7030659 7066051
9 7031815 7045692
10 7032644 7102337
All sorts of ways to do this.
In SAS or SQL, this is simple (for SQL Server, the SQL portion should be identical or nearly so):
data have;
input id a b;
datalines;
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
5 7030344 7030342
6 7030364 7008059
7 7030659 7066051
8 7030345 7030343
9 7031815 7045692
10 7032644 7102337
;;;;
run;
proc sql undopolicy=none;
delete from have H where exists (
select 1 from have V where V.id < H.id
and ((V.a=H.a and V.b=H.b) or (V.a=H.b and V.b=H.a))
);
quit;
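For readers without SAS, the SQL portion of that delete runs essentially unchanged on other engines; here it is verified on SQLite through Python's sqlite3. Note the parentheses around the OR: without them, AND binds tighter than OR, so the V.id < H.id guard would not apply to the swapped-pair branch and both rows of each pair would be deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE have (id INT, a INT, b INT)")
conn.executemany("INSERT INTO have VALUES (?, ?, ?)", [
    (1, 7028344, 7181310), (2, 7030342, 7030344), (3, 7030354, 7030353),
    (4, 7030343, 7030345), (5, 7030344, 7030342), (6, 7030364, 7008059),
    (7, 7030659, 7066051), (8, 7030345, 7030343), (9, 7031815, 7045692),
    (10, 7032644, 7102337),
])

# Delete the later row of each duplicate pair, whether the values
# match directly or swapped between columns a and b.
conn.execute("""
    DELETE FROM have
     WHERE EXISTS (
           SELECT 1 FROM have v
            WHERE v.id < have.id
              AND ((v.a = have.a AND v.b = have.b)
                OR (v.a = have.b AND v.b = have.a))
     )
""")

print([r[0] for r in conn.execute("SELECT id FROM have ORDER BY id")])
# [1, 2, 3, 4, 6, 7, 9, 10]
```

Rows 5 and 8 are removed, matching the desired output in the question.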
The Excel solution would, I believe, require creating an additional column with the concatenation of the two values, in order (any order will do), and then a lookup to see whether that is the first row with that value. I don't think you can do it without creating an additional column (or using VBA, which, if you can use it, allows a fairly simple solution as well).
Edit:
Actually, the Excel solution IS possible without creating a new column (well, you need to put this formula somewhere, but without ANOTHER additional column).
=IF(OR(AND(COUNTIF(B$1:B1,B2),COUNTIF(C$1:C1,C2)),AND(COUNTIF(B$1:B1,C2),COUNTIF(C$1:C1,B2))),"DUPLICATE","")
Assuming ID is in A, and B and C contain the values (and there is no header row). The formula goes in the second row (i.e., next to the B2/C2 values) and is then extended to further rows (so row 36 will have the ranges B1:B35 and C1:C35, etc.). It puts DUPLICATE in rows that are duplicates of something above and leaves unique rows blank.
I haven't tested this out, but here is some food for thought: you could join the table against itself to flag the IDs that duplicate an earlier row
SELECT t1.id, t1.a, t1.b
FROM [myTable] t1
INNER JOIN [myTable] t2
    ON ((t1.a = t2.a AND t1.b = t2.b)
     OR (t1.a = t2.b AND t1.b = t2.a))
   AND t1.id > t2.id

SQL Recursive Tables

I have the following tables: the groups table, which contains hierarchically ordered groups, and group_member, which stores which groups a user belongs to.
groups
---------
id
parent_id
name
group_member
---------
id
group_id
user_id
ID PARENT_ID NAME
---------------------------
1 NULL Cerebra
2 1 CATS
3 2 CATS 2.0
4 1 Cerepedia
5 4 Cerepedia 2.0
6 1 CMS
ID GROUP_ID USER_ID
---------------------------
1 1 3
2 1 4
3 1 5
4 2 7
5 2 6
6 4 6
7 5 12
8 4 9
9 1 10
I want to retrieve the visible groups for a given user, that is to say the groups a user belongs to and the children of those groups. For example, with the above data:
USER VISIBLE_GROUPS
9 4, 5
3 1,2,4,5,6
12 5
I am getting these values using recursion and several database queries, but I would like to know if it is possible to do this with a single SQL query, to improve my app's performance. I am using MySQL.
Two things come to mind:
1 - You can repeatedly outer-join the table to itself to recursively walk up your tree, as in:
SELECT *
FROM
MY_GROUPS MG1
,MY_GROUPS MG2
,MY_GROUPS MG3
,MY_GROUPS MG4
,MY_GROUPS MG5
,MY_GROUP_MEMBERS MGM
WHERE MG1.PARENT_ID = MG2.UNIQID (+)
AND MG1.UNIQID = MGM.GROUP_ID (+)
AND MG2.PARENT_ID = MG3.UNIQID (+)
AND MG3.PARENT_ID = MG4.UNIQID (+)
AND MG4.PARENT_ID = MG5.UNIQID (+)
AND MGM.USER_ID = 9
That's gonna give you results like this:
UNIQID PARENT_ID NAME UNIQID_1 PARENT_ID_1 NAME_1 UNIQID_2 PARENT_ID_2 NAME_2 UNIQID_3 PARENT_ID_3 NAME_3 UNIQID_4 PARENT_ID_4 NAME_4 UNIQID_5 GROUP_ID USER_ID
4 2 Cerepedia 2 1 CATS 1 null Cerebra null null null null null null 8 4 9
The limit here is that you must add a new join for each "level" you want to walk up the tree. If your tree has fewer than, say, 20 levels, you could probably get away with it by creating a view that showed 20 levels from every user.
2 - The only other approach that I know of is to create a recursive database function and call it from code. You'll still have some lookup overhead that way (i.e., your number of queries will still equal the number of levels you are walking up the tree), but overall it should be faster since it all takes place within the database.
I'm not sure about MySql, but in Oracle, such a function would be similar to this one (you'll have to change the table and field names; I'm just copying something I did in the past):
CREATE OR REPLACE FUNCTION GoUpLevel(WO_ID INTEGER, UPLEVEL INTEGER) RETURN INTEGER
IS
BEGIN
DECLARE
iResult INTEGER;
iParent INTEGER;
BEGIN
IF UPLEVEL <= 0 THEN
iResult := WO_ID;
ELSE
SELECT PARENT_ID
INTO iParent
FROM WOTREE
WHERE ID = WO_ID;
iResult := GoUpLevel(iParent,UPLEVEL-1); --recursive
END IF;
RETURN iResult;
EXCEPTION WHEN NO_DATA_FOUND THEN
RETURN NULL;
END;
END GoUpLevel;
/
Joe Celko's books "SQL for Smarties" and "Trees and Hierarchies in SQL for Smarties" describe methods that avoid recursion entirely by using nested sets. That complicates updating, but makes other queries (which would normally need recursion) comparatively straightforward. There are some examples in this article, written by Joe back in 1996.
I don't think this can be accomplished without using recursion. You can accomplish it with a single stored procedure in MySQL, but recursion is not allowed in stored procedures by default. This article has information about how to enable recursion. I'm not certain how much impact this would have on performance versus the multiple-query approach. MySQL may do some optimization of stored procedures, but otherwise I would expect the performance to be similar.
I didn't know if you had a Users table, so I get the list via the User_IDs stored in the Group_Member table...
SELECT GroupUsers.User_ID,
(
SELECT
STUFF((SELECT ',' +
Cast(Group_ID As Varchar(10))
FROM Group_Member Member (nolock)
WHERE Member.User_ID=GroupUsers.User_ID
FOR XML PATH('')),1,1,'')
) As Groups
FROM (SELECT User_ID FROM Group_Member GROUP BY User_ID) GroupUsers
That returns:
User_ID Groups
3 1
4 1
5 1
6 2,4
7 2
9 4
10 1
12 5
That seems right according to the data in your table, but it doesn't match up with your expected value list (e.g. user 9 is only in one group in your table data, but you show it in the results as belonging to two).
EDIT: Dang. Just noticed that you're using MySQL. My solution was for SQL Server. Sorry.
-- Kevin Fairchild
A similar question has been raised before.
Here is my answer (a bit edited):
I am not sure I understand your question correctly, but this could work: My take on trees in SQL.
The linked post describes a method of storing a tree in a database -- PostgreSQL in that case -- but the method is clear enough that it can be adapted easily to any database.
With this method you can easily update all the nodes that depend on a modified node K with about N simple SELECT queries, where N is the distance of K from the root node.
Good Luck!
I don't remember which SO question I found the link under, but this article on sitepoint.com (second page) shows another way of storing hierarchical trees in a table that makes it easy to find all child nodes, the path to the top, and things like that. Good explanation with example code.
PS. Newish to StackOverflow, is the above ok as an answer, or should it really have been a comment on the question since it's just a pointer to a different solution (not exactly answering the question itself)?
There was no way to do this in the original SQL standard, but you can usually find vendor-specific extensions, e.g., CONNECT BY in Oracle.
UPDATE: As the comments point out, recursive queries were added in SQL:1999.
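Since this thread, SQL:1999 recursive CTEs (WITH RECURSIVE) have landed in MySQL 8.0, PostgreSQL, and SQLite, so the visible-groups question can now be answered in one query. A sketch on SQLite through Python's sqlite3, using the question's sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups" (id INT, parent_id INT, name TEXT);
    INSERT INTO "groups" VALUES
        (1, NULL, 'Cerebra'), (2, 1, 'CATS'), (3, 2, 'CATS 2.0'),
        (4, 1, 'Cerepedia'), (5, 4, 'Cerepedia 2.0'), (6, 1, 'CMS');
    CREATE TABLE group_member (id INT, group_id INT, user_id INT);
    INSERT INTO group_member VALUES
        (1, 1, 3), (2, 1, 4), (3, 1, 5), (4, 2, 7), (5, 2, 6),
        (6, 4, 6), (7, 5, 12), (8, 4, 9), (9, 1, 10);
""")

# Visible groups = the user's own groups plus all of their descendants,
# walked down the tree by the recursive member of the CTE.
visible = """
    WITH RECURSIVE vis(id) AS (
        SELECT group_id FROM group_member WHERE user_id = ?
        UNION
        SELECT g.id FROM "groups" g JOIN vis ON g.parent_id = vis.id
    )
    SELECT DISTINCT id FROM vis ORDER BY id
"""

print([r[0] for r in conn.execute(visible, (9,))])   # [4, 5]
print([r[0] for r in conn.execute(visible, (12,))])  # [5]
```

The same query text works on MySQL 8.0+ with the question's original tables; the quoting of "groups" is just caution against keyword clashes.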