Difference in NA/NULL treatment using dplyr::left_join (R lang) vs. SQL LEFT JOIN

I want to left join two data frames where there might be NAs in the join column on both sides (i.e. in both code columns):
a <- data.frame(code=c(1,2,NA))
b <- data.frame(code=c(1,2,NA, NA), name=LETTERS[1:4])
Using dplyr, we get:
left_join(a, b, by="code")
code name
1 1 A
2 2 B
3 NA C
4 NA D
Using SQL, we get:
CREATE TABLE a (code INT);
INSERT INTO a VALUES (1),(2),(NULL);
CREATE TABLE b (code INT, name VARCHAR);
INSERT INTO b VALUES (1, 'A'),(2, 'B'),(NULL, 'C'), (NULL, 'D');
SELECT * FROM a LEFT JOIN b USING (code);
It seems that dplyr joins do not treat NAs like SQL NULL values.
Is there a way to get dplyr to behave in the same way as SQL?
What is the rationale behind this type of NA treatment?
PS. Of course, I could remove the NAs first with left_join(a, na.omit(b), by="code"), but that is not my question.

In SQL, NULL matches nothing, not even another NULL, because a NULL carries no information about what it should join to. That is why the joined result contains NULLs for that row, exactly as for any left outer join row that has no match in the right table.
In R, however, joins by default treat NA as an ordinary value, so NA matches NA. For example,
> match(NA, NA)
[1] 1
One way around this is to use base R's merge function:
> merge(a, b, by="code", all.x=TRUE, incomparables=NA)
code name
1 1 A
2 2 B
3 NA <NA>
The incomparables argument lets you list values that must never be matched, which forces R to treat NA the way SQL treats NULL. left_join has no incomparables argument, but dplyr offers the same behaviour under a different name: left_join(a, b, by="code", na_matches="never").
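The two behaviours can be reproduced side by side without R. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in SQL engine (the question does not name a database). The plain equality join leaves the NULL row unmatched, while SQLite's null-safe IS operator reproduces the dplyr default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (code INT);
    INSERT INTO a VALUES (1), (2), (NULL);
    CREATE TABLE b (code INT, name VARCHAR);
    INSERT INTO b VALUES (1, 'A'), (2, 'B'), (NULL, 'C'), (NULL, 'D');
""")

# Standard SQL semantics: NULL = NULL is not true, so the NULL row of a finds no match.
sql_style = conn.execute(
    "SELECT a.code, b.name FROM a LEFT JOIN b ON a.code = b.code"
).fetchall()

# Null-safe comparison (SQLite's IS operator): NULL matches NULL, like dplyr's default.
dplyr_style = conn.execute(
    "SELECT a.code, b.name FROM a LEFT JOIN b ON a.code IS b.code"
).fetchall()

print(sql_style)
print(dplyr_style)
```

Other engines spell the null-safe comparison differently, so the second query is specific to SQLite.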

If the code column is declared as a primary key, it will not accept NULL values in most databases.

Related

Pattern Matching or Fuzzy Matching of two tables based on one column

Assuming I have the right terminology, what I am trying to write is a function or stored procedure that compares names and finds out whether they refer to the same value.
I think it's called fuzzy matching.
For example, table A has 2 columns and table B has 3 columns:
Table A:
Name Number
Hello 24
Evening 56
Table B:
Name Num F
Heello 23 some value
GoodEvening 15 some value
I want a table like:
A D
Hello Heello
Morning GoodMorning
Currently, I'm using
Select A.Name, B.Name
from table A
left join table B
on A.Name like B.Name
or (LTRIM(RTRIM(REPLACE(REPLACE(REPLACE( A.Name,' ',''),'-',''),'''',''))) = LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(B.Name,' ',''),'-',''),'''',''))))
OR (A.Name LIKE '%'+B.Name+'%')
OR (B.Name LIKE '%'+A.Name+'%')
This returns results, but they are not very accurate and the query is very slow. Is there another way to compare these values?

SQLite: Matching a column containing a single string to another column containing comma-separated values [duplicate]

I have a table with a column that has concatenated values like this
Table CHILD:
ChildId Values
2 x123,j455
3 f456,z789
4 m333,y567
5 x123,h888
And I have a master table MASTER that has
Table MASTER:
MainValues
x123
f456
y567
I need to get a query that'll select the following data
ChildId MainValues
2 x123
3 f456
4 y567
5 x123
Basically, match each value from MASTER against the child values and return only the master value. How can I do this? I have tried matching against the second table with IN and LIKE, but that doesn't help much since the values are comma-separated. Is there a way to split and match in SQLite?
EDIT: Table and column names are fictional and intended just to explain this question better
Use LIKE on delimiter-wrapped values (wrapping both sides in commas prevents partial matches):
SELECT ChildId, MainValues FROM CHILD INNER JOIN MASTER ON ','||[Values]||',' LIKE '%,'||MainValues||',%'
Also, please refrain from using keywords like values for column names...
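The delimiter-wrapped LIKE match can be tried end to end on the question's sample data. This is a minimal sketch using Python's built-in sqlite3 module; the column is quoted as "Values" because it collides with a keyword, and the in-memory database is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CHILD (ChildId INTEGER, "Values" TEXT);
    INSERT INTO CHILD VALUES (2, 'x123,j455'), (3, 'f456,z789'),
                             (4, 'm333,y567'), (5, 'x123,h888');
    CREATE TABLE MASTER (MainValues TEXT);
    INSERT INTO MASTER VALUES ('x123'), ('f456'), ('y567');
""")

# Wrap both the list and the pattern in commas so that only whole
# list entries match, wherever they sit in the comma-separated string.
rows = conn.execute("""
    SELECT ChildId, MainValues
    FROM CHILD
    INNER JOIN MASTER
        ON ',' || "Values" || ',' LIKE '%,' || MainValues || ',%'
    ORDER BY ChildId
""").fetchall()

print(rows)  # [(2, 'x123'), (3, 'f456'), (4, 'y567'), (5, 'x123')]
```

Note that LIKE treats % and _ inside MainValues as wildcards, so this assumes the master values contain neither character.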
Older versions of SQLite have no function to find the index of a character in a string (instr() was only added in version 3.7.15), so you may have to rely on something else. Idan's method is good too but can be slower. You may try this:
SELECT c.childID, m.mainvalues
FROM CHILD c
JOIN MASTER m
WHERE m.mainvalues = substr(c.ivalues, -length(c.ivalues), 4)
OR m.mainvalues = substr(c.ivalues, 6);
I have used 4 and 6 assuming a fixed number of characters before and after the comma (ivalues stands for your Values column). If that's not fixed, you can try:
SELECT c.childID, m.mainvalues
FROM CHILD c
JOIN MASTER m
WHERE m.mainvalues = substr(c.ivalues, -length(c.ivalues), length(m.mainvalues))
OR m.mainvalues = substr(c.ivalues, length(m.mainvalues) + 2);

Pulling "All but X" in a select statement, testing multiple fields at once

So after writing a very large select statement, I wanted to check whether there is a slick way to pull all the fields from multiple joined tables into a report while testing a large set of fields for null so that empty records are removed. Say, for example, I have table a joined to table b joined to table c. I want almost all the fields except for a.something, b.something, and a couple of c.somethings.
I also want to make sure if all the fields in c are empty, exclude the record. (Well... all but the index)
Is there a good way to do this? I ended up building a largish report field by field but it was A: Mostly tedious and B: Would not scale if I ever ran into a bigger project.
SELECT * <except for c.4, c.5. c.6, a.3, a.4, b.2>
FROM a,b,c
LEFT JOIN b ON a.indexA = b.indexA
LEFT JOIN c ON b.indexB = c.indexB
WHERE a.1 is not null
AND b.1 is not null
and c.1 is not null
and c.2 is not null
and c.3 is not null
and a.2 > 0
and b.2 = 'Test'
Feel free not to use my example.
You can actually do multiple join conditions:
SELECT *
FROM a
LEFT JOIN b ON a.indexA = b.indexA
and b.1 is not null
and b.2 = 'Test'
LEFT JOIN c ON b.indexB = c.indexB
and c.1 is not null
and c.2 is not null
and c.3 is not null
WHERE a.1 is not null
and a.2 > 0
Also, with the explicit LEFT JOIN syntax you only need the first table after FROM; listing a, b and c there as well is unnecessary (it would reference b and c twice).
I'm not sure whether this changes performance at all, however.
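One caveat when moving conditions from WHERE into ON: with a LEFT JOIN the two placements do not return the same rows. The sketch below illustrates this with Python's built-in sqlite3 module; the tables are hypothetical and the status column stands in for the question's b.2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; status stands in for the question's b.2 column.
    CREATE TABLE a (indexA INTEGER);
    CREATE TABLE b (indexA INTEGER, status TEXT);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1, 'Test'), (2, 'Other');
""")

# Condition in ON: every row of a survives; b columns are NULL where nothing matches.
in_on = conn.execute("""
    SELECT a.indexA, b.status
    FROM a LEFT JOIN b ON a.indexA = b.indexA AND b.status = 'Test'
    ORDER BY a.indexA
""").fetchall()

# Condition in WHERE: rows of a without a matching b are filtered out after the join.
in_where = conn.execute("""
    SELECT a.indexA, b.status
    FROM a LEFT JOIN b ON a.indexA = b.indexA
    WHERE b.status = 'Test'
    ORDER BY a.indexA
""").fetchall()

print(in_on)     # [(1, 'Test'), (2, None)]
print(in_where)  # [(1, 'Test')]
```

If the goal is to drop records whose b or c fields are empty, the WHERE form (or an INNER JOIN) is the one that actually removes them.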

Find rows that contain all words in any order

My application is built in vb.net with SQL Server Compact as the database so I'm unable to use a full-text index.
Here's my data...
MainTable field1
A B C
B G C
X Y Z
C P B
Search term = B C
Expected Results = any combination of the search term = Rows 1, 2, 4
Here's what I'm currently doing...
I'm permuting the search term B C into an array containing %B%C% and %C%B% and inserting those values into field1 of tempTable.
So my SQL looks like this:
SELECT * FROM MainTable INNER JOIN tempTable ON MainTable.field1 LIKE tempTable.field1
In this simple example, it returns the expected results. However, my search term can contain more values. For example, the 6 search terms B C D E F G have 720 permutations, and the count grows factorially as more terms are added, which is not good.
Is there a better way to do this?
The following will work for your example above:
Select * from table where field1 like '%[BC]%'
But it will also return strings that contain ONLY "B" or "C". Do you need both characters in any order or one or more?
EDIT: Then the following would work:
Select * from test_data where col1 LIKE '%Apple%' and col1 like '%Dog%'
See the demo here: http://rextester.com/edit/LNDQ49764
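The one-LIKE-per-term idea extends to any number of search terms without permuting them. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for SQL Server Compact; rows_with_all_terms is a hypothetical helper name, and the same query-building approach should carry over to vb.net.

```python
import sqlite3

def rows_with_all_terms(conn, terms):
    """Return field1 values of MainTable that contain every term, in any order."""
    if not terms:
        raise ValueError("at least one search term is required")
    # One "field1 LIKE ?" clause per term, combined with AND.
    # The terms themselves are passed as bound parameters.
    where = " AND ".join("field1 LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    sql = f"SELECT field1 FROM MainTable WHERE {where} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MainTable (field1 TEXT)")
conn.executemany(
    "INSERT INTO MainTable VALUES (?)",
    [("A B C",), ("B G C",), ("X Y Z",), ("C P B",)],
)

result = rows_with_all_terms(conn, ["B", "C"])
print(result)  # ['A B C', 'B G C', 'C P B']
```

The query grows by one clause per term, so 6 terms need 6 LIKE conditions rather than 720 patterns. Because each clause is a substring match, a term also matches inside longer words.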

SQL: Populating Column B where Column A has a match elsewhere in Column B

I’m somewhat of a newbie to SQL queries, especially anything containing logic, and although I've searched for hours finding the exact terms to search for is not easy in this case! I have a relatively simple one, I’m sure:
A table has 2 columns, and each row contains data about a function in a program. Some functions have a parent function associated (for grouping). Column A is the unique function ID. Column B indicates, when applicable, the parent function’s ID. All parent function IDs are independent and valid function IDs that exist elsewhere in column A.
For reporting purposes I need to list the functions grouped by their parent ID, listing the parent function with the child functions. I can easily report by parent function ID, but the problem is that a parent function does not know that it is a parent function because its column B is empty!
What I need to do is complete the value in Column B if it is empty and the function is referenced elsewhere as a parent function.
Otherwise stated, for each row that has a null value in Column B:
Take the value from column A
Check for the existence of that value in ANY row on column B
If there is a match, inject the value into column B (so that Column A and B have the same value)
What I have: (Query: SELECT function_id, parent_function FROM functions)
FUNCTION_ID PARENT_FUNCTION
4
13 4
79
138 4
195
314 345
345
What I need to have:
FUNCTION_ID PARENT_FUNCTION
4 4
13 4
79
138 4
195
314 345
345 345
Any Ideas? I can't wait to get more familiar with SQL! Thanks ahead of time.
This should work for you:
UPDATE functions
SET parent_function = function_id
WHERE parent_function IS NULL
AND function_id IN (SELECT parent_function FROM functions)
This will set parent_function equal to function_id where it has not yet been set, and where it appears somewhere in the parent_function column.
If you don't actually want to modify the table data but still return values that you need, you can use similar logic like this:
SELECT f.function_id, COALESCE(f.parent_function, f2.function_id) as parent_function
FROM functions f
LEFT JOIN functions f2
ON f.function_id = f2.function_id
AND f2.function_id IN (SELECT parent_function FROM functions)
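Both variants can be checked against the sample data from the question. This is a minimal sketch using Python's built-in sqlite3 module; declaring function_id as the primary key is an assumption based on the description of it as a unique ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE functions (function_id INTEGER PRIMARY KEY, parent_function INTEGER);
    INSERT INTO functions VALUES (4, NULL), (13, 4), (79, NULL), (138, 4),
                                 (195, NULL), (314, 345), (345, NULL);
""")

# Read-only variant: fill in the parent only in the result set.
selected = conn.execute("""
    SELECT f.function_id, COALESCE(f.parent_function, f2.function_id) AS parent_function
    FROM functions f
    LEFT JOIN functions f2
        ON f.function_id = f2.function_id
        AND f2.function_id IN (SELECT parent_function FROM functions)
    ORDER BY f.function_id
""").fetchall()

# Modifying variant: write the parent back into the table.
conn.execute("""
    UPDATE functions
    SET parent_function = function_id
    WHERE parent_function IS NULL
      AND function_id IN (SELECT parent_function FROM functions)
""")
updated = conn.execute(
    "SELECT function_id, parent_function FROM functions ORDER BY function_id"
).fetchall()

print(updated)
print(selected == updated)
```

Functions 4 and 345 end up as their own parents, while 79 and 195 keep an empty parent because no other row references them.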
Maybe you can compare the two tables using EXCEPT or INTERSECT:
http://msdn.microsoft.com/en-us/library/ms188055.aspx
More tutorials:
http://www.mssqltips.com/sqlservertip/1327/compare-sql-server-datasets-with-intersect-and-except/
How's this look?
select distinct
t1.funx, t1.parent,
case when t2.parent is null then t1.parent
else t2.parent end as newparent
from
tbl t1 left outer join
tbl t2 on
t1.funx = t2.parent