Why does SQL Server not behave consistently when dealing with half-emojis? - sql

When an NVARCHAR attribute in a SQL Server database contains an emoji, string functions and operators behave in different ways (default database collation is SQL_Latin1_General_CP1_CI_AS).
Behavior of string functions
Functions like LEFT() and LEN() see the emoji as multiple, separate characters (LEFT will cut the emoji in half and return a partial value).
DECLARE @YourString NVARCHAR(12) = N'Thank you 😃'
SELECT @YourString, LEFT(@YourString, 11), LEN(@YourString)
The return values are "Thank you 😃", "Thank you �", and 12.
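These results are what one would expect if the NVARCHAR value is stored as UTF-16 code units and the collation has no supplementary-character (_SC) support: 😃 (U+1F603) is encoded as a surrogate pair, so it counts as two units. A minimal Python sketch of that arithmetic (not T-SQL; the variable names are illustrative only):

```python
# Model the NVARCHAR value as a list of UTF-16 code units.
text = "Thank you \U0001F603"  # same content as N'Thank you 😃'
raw = text.encode("utf-16-le")
code_units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# 10 ordinary characters plus 2 code units for the emoji.
print(len(code_units))

# The emoji is a surrogate pair: a high surrogate followed by a low surrogate.
print([hex(unit) for unit in code_units[-2:]])

# Keeping the first 11 code units cuts the pair and leaves a lone high surrogate,
# which is displayed as the replacement character.
left_11 = code_units[:11]
print(hex(left_11[-1]))
```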
Behavior of operators
Set operators such as UNION, INTERSECT, and EXCEPT treat the emoji and the half-emoji as a single, identical value, though which of the two values is returned depends on the order in which they arrive.
Behavior of the UNION operator
UNION treats these as identical records (since it only returns one value), but will nevertheless yield a different result depending on the order, with the bottom record being returned.
SELECT @YourString UNION SELECT LEFT(@YourString, 11) returns "Thank you �" (the value from the bottom half)
SELECT LEFT(@YourString, 11) UNION SELECT @YourString returns "Thank you 😃" (the value from the bottom half)
Behavior of the INTERSECT and EXCEPT operators
INTERSECT and EXCEPT also treat these as identical values. INTERSECT returns the value from the top record (which makes sense given the purpose of the operator, but nonetheless feels weird after seeing UNION do the opposite), and EXCEPT returns no rows.
SELECT @YourString INTERSECT SELECT LEFT(@YourString, 11) returns "Thank you 😃".
SELECT LEFT(@YourString, 11) INTERSECT SELECT @YourString returns "Thank you �".
SELECT @YourString EXCEPT SELECT LEFT(@YourString, 11) returns no result.
Summary
String functions such as LEFT() and LEN() treat these characters as separate values.
The UNION operator treats these as identical values, but will nevertheless return varying results (with preference given to the value from the bottom half of the operator).
The INTERSECT and EXCEPT operators treat these as identical values; INTERSECT returns the value from the top half.
Question
Why would two different values be treated as a single, identical character by some operators (UNION, INTERSECT, EXCEPT), but be interpreted as two different values by string functions (LEFT, LEN)?
Bonus Question
Why would the UNION operator be returning the second value it sees?

I'm afraid I don't see any inconsistency here: Using SQL_Latin1_General_CP1_CI_AS, all set and comparison operations treat the two values as equal (fiddle):
CREATE TABLE t1 (s1 NVARCHAR(12) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE TABLE t2 (s2 NVARCHAR(12) COLLATE SQL_Latin1_General_CP1_CI_AS);
INSERT INTO t1 VALUES (N'Thank you 😃');
INSERT INTO t2 SELECT LEFT(s1, 11) FROM t1;
SELECT * FROM t1;
SELECT * FROM t2;
SELECT s1 FROM t1 UNION SELECT s2 FROM t2; -- only s1
SELECT s1 FROM t1 EXCEPT SELECT s2 FROM t2; -- none
SELECT s2 FROM t2 EXCEPT SELECT s1 FROM t1; -- none
SELECT s1 FROM t1 INTERSECT SELECT s2 FROM t2; -- only s1
SELECT * FROM t1 INNER JOIN t2 ON s1 = s2; -- one row with s1 and s2
As soon as we change the collation to Latin1_General_100_CI_AS, all operations treat the two values as not equal (fiddle):
CREATE TABLE t1 (s1 NVARCHAR(12) COLLATE Latin1_General_100_CI_AS);
CREATE TABLE t2 (s2 NVARCHAR(12) COLLATE Latin1_General_100_CI_AS);
INSERT INTO t1 VALUES (N'Thank you 😃');
INSERT INTO t2 SELECT LEFT(s1, 11) FROM t1;
SELECT * FROM t1;
SELECT * FROM t2;
SELECT s1 FROM t1 UNION SELECT s2 FROM t2; -- both
SELECT s1 FROM t1 EXCEPT SELECT s2 FROM t2; -- only s1
SELECT s2 FROM t2 EXCEPT SELECT s1 FROM t1; -- only s2
SELECT s1 FROM t1 INTERSECT SELECT s2 FROM t2; -- none
SELECT * FROM t1 INNER JOIN t2 ON s1 = s2; -- no result
The reason why the legacy collation considers those two values equal can be found in this question:
SQL Query Where Column = '' returning Emoji characters 🎃 and 🍰
You probably already know this, but I'd like to explicitly point out that the fact that a = b evaluates to true for two values with different byte contents and different lengths is in itself not exceptional: All case-insensitive collations treat 'A' and 'a' as equal, and even case-sensitive collations treat 'A' (length 1) and 'A ' (length 2) as equal, since SQL Server ignores trailing spaces.

Related

Select where record does not exists

I am trying out Oracle 11g. I have a list of ids and want to fetch those ids from the list that do not exist in a table.
For example:
SELECT * FROM STOCK
where item_id in ('1','2'); -- Return those records where result is null
I mean if item_id '1' is not present in db then the query should return me 1.
How can I achieve this?
You need to store the values in some sort of "table". Then you can use left join or not exists or something similar:
with ids as (
select 1 as id from dual union all
select 2 from dual
)
select ids.id
from ids
where not exists (select 1 from stock s where s.item_id = ids.id);
You can use a LEFT JOIN to an in-line table that contains the values to be searched:
SELECT t1.val
FROM (
SELECT '1' val FROM dual UNION ALL SELECT '2' FROM dual
) t1
LEFT JOIN STOCK t2 ON t1.val = t2.item_id
WHERE t2.item_id IS NULL
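The anti-join above can be tried out without an Oracle instance. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in (so there is no dual table), with made-up stock rows; only the shape of the query carries over to Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (item_id TEXT PRIMARY KEY)")
# Assumed sample data: items '2' and '3' exist, item '1' does not.
conn.executemany("INSERT INTO stock VALUES (?)", [("2",), ("3",)])

candidates = ["1", "2"]  # the values from the IN list in the question

# Build an inline table of candidate values, then keep those without a match.
inline_table = " UNION ALL ".join("SELECT ? AS val" for _ in candidates)
query = f"""
    SELECT c.val
    FROM ({inline_table}) AS c
    LEFT JOIN stock s ON s.item_id = c.val
    WHERE s.item_id IS NULL
"""
missing = [row[0] for row in conn.execute(query, candidates)]
print(missing)
```

The candidate values are bound as parameters rather than pasted into the SQL text, which keeps the inline table safe for arbitrary input.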
First create the list of possible IDs (e.g. 0 to 99 in below query). You can use a recursive cte for this. Then select these IDs and remove the IDs already present in the table from the result:
with possible_ids(id) as
(
select 0 as id from dual
union all
select id + 1 as id from possible_ids where id < 99
)
select id from possible_ids
minus
select item_id from stock;
A primary concern of the OP seems to be a terse notation of the query, notably the set of values to test for. The straightforward recommendation would be to retrieve these values by another query or to generate them as a union of queries from the dual table (see the other answers for this).
The following alternative solution allows for a verbatim specification of the test values under the following conditions:
There is a character that does not occur in any of the test values provided (in the example that will be -).
The number of values to test stays well below 2000 (to be precise, the list of values plus separators must be written as a varchar2 literal, which imposes the length limit). However, this should not be an actual concern: if the test involves lists of hundreds of ids, these lists should definitely be retrieved from a table/view.
Caveat
Whether this method is worth the hassle (not to mention potential performance impacts) is questionable, imho.
Solution
The test values will be provided as a single varchar2 literal with - separating the values which is as terse as the specification as a list argument to the IN operator. The string starts and ends with -.
'-1-2-3-156-489-4654648-'
The number of items is computed as follows:
select cond, regexp_count ( cond, '[-]' ) - 1 cnt_items from (select '-1-2-3-156-489-4654648-' cond from dual)
A list of integers up to the number of items starting with 1 can be generated using the LEVEL pseudocolumn from hierarchical queries:
select level from dual connect by level < 42;
The n-th integer from that list will serve to extract the n-th value from the string (exemplified for the 4th value) :
select substr ( cond, instr(cond,'-', 1, 4 )+1, instr(cond,'-', 1, 4+1 ) - instr(cond,'-', 1, 4 ) - 1 ) si from (select cond, regexp_count ( cond, '[-]' ) - 1 cnt_items from (select '-1-2-3-156-489-4654648-' cond from dual) );
The non-existent stock ids are generated by subtracting the set of stock ids from the set of values. Putting it all together:
select substr ( cond, instr(cond,'-',1,level )+1, instr(cond,'-',1,level+1 ) - instr(cond,'-',1,level ) - 1 ) si
from (
select cond
, regexp_count ( cond, '[-]' ) - 1 cnt_items
from (
select '-1-2-3-156-489-4654648-' cond from dual
)
)
connect by level <= cnt_items
minus
select item_id from stock
;
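To make the INSTR/SUBSTR arithmetic easier to follow, here is a small Python sketch of the same idea. The helpers find_nth and nth_value are hypothetical names that mimic INSTR(cond, '-', 1, n) and the SUBSTR call, and the stock ids are made-up sample data.

```python
def find_nth(text, sep, n):
    """1-based position of the n-th occurrence of sep, or 0 if there is none."""
    pos = -1
    for _ in range(n):
        pos = text.find(sep, pos + 1)
        if pos == -1:
            return 0
    return pos + 1


def nth_value(cond, n, sep="-"):
    """Text between the n-th and (n+1)-th separator."""
    start = find_nth(cond, sep, n) + 1
    length = find_nth(cond, sep, n + 1) - find_nth(cond, sep, n) - 1
    return cond[start - 1:start - 1 + length]


cond = "-1-2-3-156-489-4654648-"
cnt_items = cond.count("-") - 1  # separators minus one

# Levels 1..cnt_items play the role of the LEVEL pseudocolumn.
values = [nth_value(cond, level) for level in range(1, cnt_items + 1)]

stock_ids = {"2", "156"}  # assumed contents of stock.item_id
missing = set(values) - stock_ids  # the MINUS step
print(values)
print(sorted(missing))
```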

SQL Customized search with special characters

I am creating a keywording module in which I want to search data using comma-separated words. Search terms are separated by commas (,), and a leading minus (-) marks a word to exclude.
I know a relational database engine is designed on the principle that a cell holds a single value, and that obeying this rule helps performance. But in this case the table is already in production with millions of rows, and I can't change the table structure.
Here is an example of exactly what I want to do.
I have a main table named tbl_main in SQL:
AS_ID KWD
1 Man,Businessman,Business,Office,confidence,arms crossed
2 Man,Businessman,Business,Office,laptop,corridor,waiting
3 man,business,mobile phone,mobile,phone
4 Welcome,girl,Greeting,beautiful,bride,celebration,wedding,woman,happiness
5 beautiful,bride,wedding,woman,girl,happiness,mobile phone,talking
6 woman,girl,Digital Tablet,working,sitting,online
7 woman,girl,Digital Tablet,working,smiling,happiness,hand on chin
If search text is = Man,Businessman then result AS_ID is =1,2
If search text is = Man,-Businessman then result AS_ID is =3
If search text is = woman,girl,-Working then result AS_ID is =4,5
If search text is = woman,girl then result AS_ID is =4,5,6,7
What is the best way to do this? Help is much appreciated. Thanks in advance.
I think you can easily solve this by creating a FULL TEXT INDEX on your KWD column. Then you can use the CONTAINS query to search for phrases. The FULL TEXT index takes care of the punctuation and ignores the commas automatically.
-- If search text is = Man,Businessman then the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"Man" AND "Businessman"')
-- If search text is = Man,-Businessman then the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"Man" AND NOT "Businessman"')
-- If search text is = woman,girl,-Working the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"woman" AND "girl" AND NOT "working"')
To search the multiple words (like the mobile phone in your case) use the quoted phrases:
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"woman" AND "mobile phone"')
As commented below, the quoted phrases are important in all searches to avoid false matches, e.g. when a search term is "tablet working" and the KWD value is woman,girl,Digital Tablet,working,sitting,online
There is a special case for a single - search term. The NOT cannot be used as the first term in the CONTAINS. Therefore, the query like this should be used:
-- If search text is = -Working the query will be
SELECT AS_ID FROM tbl_main
WHERE NOT CONTAINS(KWD, '"working"')
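The translation from the user's search text to a CONTAINS search condition could be done in application code. Below is a minimal Python sketch; build_contains_condition is a hypothetical helper, and it only covers the cases shown above (AND, AND NOT, and the special case where every term is negated).

```python
def build_contains_condition(search_text):
    """Translate 'woman,girl,-Working' into a CONTAINS search condition.

    Returns (condition, negate). When negate is True, the caller should use
    NOT CONTAINS(KWD, condition), which covers searches with only '-' terms.
    """
    include, exclude = [], []
    for raw in search_text.split(","):
        term = raw.strip()
        is_excluded = term.startswith("-")
        # Drop the leading '-' and any embedded double quotes so that a term
        # cannot break out of its quoted phrase.
        word = (term[1:] if is_excluded else term).strip().replace('"', "")
        if word:
            (exclude if is_excluded else include).append(f'"{word}"')
    if include:
        return " AND ".join(include + [f"NOT {w}" for w in exclude]), False
    if exclude:
        # None of the excluded words may occur: NOT ("a" OR "b").
        return " OR ".join(exclude), True
    raise ValueError("search text contains no keywords")


for text in ("Man,Businessman", "Man,-Businessman", "woman,girl,-Working", "-Working"):
    print(text, "->", build_contains_condition(text))
```

The resulting string is best passed to the query as a parameter rather than concatenated into the SQL text.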
Here is my attempt using Jeff Moden's DelimitedSplit8k to split the comma-separated values.
First, here is the splitter function (check the article for updates of the script):
CREATE FUNCTION [dbo].[DelimitedSplit8K](
@pString VARCHAR(8000), @pDelimiter CHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
,E2(N) AS (SELECT 1 FROM E1 a, E1 b)
,E4(N) AS (SELECT 1 FROM E2 a, E2 b)
,cteTally(N) AS(
SELECT TOP (ISNULL(DATALENGTH(@pString), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
,cteStart(N1) AS(
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
),
cteLen(N1, L1) AS(
SELECT
s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1),0) - s.N1, 8000)
FROM cteStart s
)
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
Here is the complete solution:
-- search parameter
DECLARE @search_text VARCHAR(8000) = 'woman,girl,-working'
-- split comma-separated search parameters
-- items starting in '-' will have a value of 1 for exclude
DECLARE @search_values TABLE(ItemNumber INT, Item VARCHAR(8000), Exclude BIT)
INSERT INTO @search_values
SELECT
ItemNumber,
CASE WHEN LTRIM(RTRIM(Item)) LIKE '-%' THEN LTRIM(RTRIM(STUFF(Item, 1, 1 ,''))) ELSE LTRIM(RTRIM(Item)) END,
CASE WHEN LTRIM(RTRIM(Item)) LIKE '-%' THEN 1 ELSE 0 END
FROM dbo.DelimitedSplit8K(@search_text, ',') s
;WITH CteSplitted AS( -- split each KWD to separate rows
SELECT *
FROM tbl_main t
CROSS APPLY(
SELECT
ItemNumber, Item = LTRIM(RTRIM(Item))
FROM dbo.DelimitedSplit8K(t.KWD, ',')
)x
)
SELECT
cs.AS_ID
FROM CteSplitted cs
INNER JOIN @search_values sv
ON sv.Item = cs.Item
GROUP BY cs.AS_ID
HAVING
-- all parameters should be included (Relational Division with no Remainder)
COUNT(DISTINCT cs.Item) = (SELECT COUNT(DISTINCT Item) FROM @search_values WHERE Exclude = 0)
-- no exclude parameters
AND SUM(CASE WHEN sv.Exclude = 1 THEN 1 ELSE 0 END) = 0
This one uses a solution from the Relational Division with no Remainder problem discussed in this article by Dwain Camps.
From what you've described, you want the keywords that are included in the search text to be a match in the KWD column, and those that are prefixed with a - to be excluded.
Despite the data existing in this format, it still makes most sense to normalize the data, and then query based on the existence or non existence of the keywords.
To do this, in very rough terms:-
Create two additional tables - Keyword and tbl_Main_Keyword. Keyword contains a distinct list of each of the possible keywords and tbl_Main_Keyword contains a link between each record in tbl_Main to each Keyword record where there's a match. Ensure to create an index on the text field for the keyword (e.g. the Keyword.KeywordText column, or whatever you call it), as well as the KeywordID field in the tbl_Main_Keyword table. Create Foreign Keys between tables.
Write some DML (or use a separate program, such as a C# program) to iterate through each record, parsing the text, and inserting each distinct keyword encountered into the Keyword table. Create a relationship to the row for each keyword in the tbl_main record.
Now, for searching, parse out the search text into keywords, and compose a query against the tbl_Main_Keyword table containing both a WHERE KeywordID IN and WHERE KeywordID NOT IN clause, depending on whether there is a match.
Take note to consider whether the case of each keyword is important to your business case, and consider the collation (case sensitive or insensitive) accordingly.
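The steps above can be sketched end to end. The following uses Python's built-in sqlite3 module as a stand-in for SQL Server, with the sample rows from the question; the table and column names follow the description, while the index name, the search helper and the NOCASE collation (chosen to make matching case-insensitive) are assumptions of this sketch.

```python
import sqlite3

SAMPLE = {
    1: "Man,Businessman,Business,Office,confidence,arms crossed",
    2: "Man,Businessman,Business,Office,laptop,corridor,waiting",
    3: "man,business,mobile phone,mobile,phone",
    4: "Welcome,girl,Greeting,beautiful,bride,celebration,wedding,woman,happiness",
    5: "beautiful,bride,wedding,woman,girl,happiness,mobile phone,talking",
    6: "woman,girl,Digital Tablet,working,sitting,online",
    7: "woman,girl,Digital Tablet,working,smiling,happiness,hand on chin",
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Keyword (
        KeywordID   INTEGER PRIMARY KEY,
        KeywordText TEXT NOT NULL COLLATE NOCASE UNIQUE
    );
    CREATE TABLE tbl_Main_Keyword (
        AS_ID     INTEGER NOT NULL,
        KeywordID INTEGER NOT NULL REFERENCES Keyword (KeywordID),
        PRIMARY KEY (AS_ID, KeywordID)
    );
    CREATE INDEX IX_Main_Keyword_KeywordID ON tbl_Main_Keyword (KeywordID);
""")

# One-off migration: split each KWD value and link the row to its keywords.
for as_id, kwd in SAMPLE.items():
    for word in (w.strip() for w in kwd.split(",")):
        if not word:
            continue
        conn.execute("INSERT OR IGNORE INTO Keyword (KeywordText) VALUES (?)", (word,))
        conn.execute(
            "INSERT OR IGNORE INTO tbl_Main_Keyword (AS_ID, KeywordID) "
            "SELECT ?, KeywordID FROM Keyword WHERE KeywordText = ?",
            (as_id, word),
        )


def search(search_text):
    """Return the AS_IDs that have every plain term and none of the '-' terms."""
    terms = [t.strip() for t in search_text.split(",") if t.strip()]
    include = sorted({t.lower() for t in terms if not t.startswith("-")})
    exclude = sorted({t[1:].strip().lower() for t in terms if t.startswith("-")})
    clauses, params = [], []
    if include:
        marks = ",".join("?" for _ in include)
        clauses.append(f"SUM(k.KeywordText IN ({marks})) = {len(include)}")
        params += include
    if exclude:
        marks = ",".join("?" for _ in exclude)
        clauses.append(f"SUM(k.KeywordText IN ({marks})) = 0")
        params += exclude
    having = "HAVING " + " AND ".join(clauses) if clauses else ""
    sql = f"""
        SELECT mk.AS_ID
        FROM tbl_Main_Keyword mk
        JOIN Keyword k ON k.KeywordID = mk.KeywordID
        GROUP BY mk.AS_ID
        {having}
        ORDER BY mk.AS_ID
    """
    return [row[0] for row in conn.execute(sql, params)]


for text in ("Man,Businessman", "Man,-Businessman", "woman,girl,-Working", "woman,girl"):
    print(text, "->", search(text))
```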
I would prefer cha's solution, but here's another solution:
declare @QueryParts table (q varchar(1000))
insert into @QueryParts values
('woman'),
('girl'),
('-Working')
select AS_ID
from tbl_main
inner join @QueryParts on
(q not like '-%' and ',' + KWD + ',' like '%,' + q + ',%') or
(q like '-%' and ',' + KWD + ',' not like '%,' + substring(q, 2, 1000) + ',%')
group by AS_ID
having COUNT(*) = (select COUNT(*) from @QueryParts)
With such a design, you would have two tables. One that defines the IDs and a subtable that holds the set of keywords per search string.
Likewise, you would transform the search strings into two tables, one for strings that should match and one for negated strings. Assuming that you put this in a stored procedure, these tables would be table-valued parameters.
Once you have this set up, the query is simple to write:
SELECT M.AS_ID
FROM tbl_main M
WHERE (SELECT COUNT(*)
FROM tbl_keywords K
WHERE K.AS_ID = M.AS_ID
AND K.KWD IN (SELECT word FROM @searchwords)) =
(SELECT COUNT(*) FROM @searchwords)
AND NOT EXISTS (SELECT *
FROM tbl_keywords K
WHERE K.AS_ID = M.AS_ID
AND K.KWD IN (SELECT word FROM @minuswords))

Yield Return equivalent in SQL Server

I am writing down a view in SQL server (DWH) and the use case pseudo code is:
-- Do some calculation and generate #Temp1
-- ... contains other selects
-- Select statement 1
SELECT * FROM Foo
JOIN #Temp1 tmp on tmp.ID = Foo.ID
WHERE Foo.Deleted = 1
-- Do some calculation and generate #Temp2
-- ... contains other selects
-- Select statement 2
SELECT * FROM Foo
JOIN #Temp2 tmp on tmp.ID = Foo.ID
WHERE Foo.Deleted = 1
The result of the view should be:
Select Statement 1
UNION
Select Statement 2
The intended behavior is the same as yield return in C#. Is there a way to tell the view which SELECT statements are actually part of the result and which are not? The small calculations preceding what I need also contain selects.
Thank you!
Yield return in C# returns rows one at a time as they appear in some underlying function. This concept does not exist in SQL statements. SQL is set-based, returning the entire result set, conceptually as a unit. (That said, sometimes queries run slowly and you will see rows returned slowly or in batches.)
You can control the number of rows being returned using TOP (in SQL Server). You can select particular rows to be returned using WHERE clauses. However, you cannot specify a UNION statement that conditionally returns rows from some components but not others.
The closest you may be able to come is something like:
-- assumes @UseTable1Only and @UseTable2Only are declared variables or procedure parameters
if @UseTable1Only = 'Y'
select *
from Table1
else if @UseTable2Only = 'Y'
select *
from Table2
else
select *
from Table1
union
select *
from Table2
You can do something similar using dynamic SQL, by constructing the statement as a string and then executing it.
I found a better workaround. It might be helpful for someone else. The idea is to put all the calculations inside WITH clauses instead of doing them in the body of the view:
WITH Temp1 (ID)
AS
(
-- Do some calculation and generate #Temp1
-- ... contains other selects
)
, Temp2 (ID)
AS
(
-- Do some calculation and generate #Temp2
-- ... contains other selects
)
-- Select statement 1
SELECT * FROM Foo
JOIN Temp1 tmp on tmp.ID = Foo.ID
WHERE Foo.Deleted = 1
UNION
-- Select statement 2
SELECT * FROM Foo
JOIN Temp2 tmp on tmp.ID = Foo.ID
WHERE Foo.Deleted = 1
The result will, of course, be the UNION of all the outer SELECT statements.

Detect (find) string in another string ( nvarchar (MAX) )

I've got an nvarchar(max) column with values like 'A2',
and another column in another table with values like '(A2 AND A3) OR A4'.
I need to detect whether the string in the second column contains the string from the first column.
So I need to select all rows of the second table whose column contains a string from the first column of the first table.
Something like the following, but it is wrong:
SELECT * Cols FROM T2
WHERE (SELECT T1.StringCol FROM T1) IN T2.StringCol
but I think of it more like this (in F# syntax)
for t1.date, t1.StringCol from t1
for t2.StringCol from t2
if t2.StringCol.Contains( t1.StringCol )
yield t2.StringCol, t1.date
This should get what you want...
select t2.*
from t1 cross join t2
where patindex('%' + t1.StringCol + '%', t2.StringCol) > 0
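One caveat with the pattern approach: any %, _ or [ inside t1.StringCol is treated as a wildcard. A plain substring test avoids that (in T-SQL, CHARINDEX(t1.StringCol, t2.StringCol) > 0 would be the counterpart). The sketch below demonstrates the cross join with a substring test using Python's built-in sqlite3 module and its instr function as a stand-in; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (StringCol TEXT, date TEXT);
    CREATE TABLE T2 (StringCol TEXT);
    -- Assumed sample data for illustration only.
    INSERT INTO T1 VALUES ('A2', '2020-01-01'), ('A9', '2020-01-02');
    INSERT INTO T2 VALUES ('(A2 AND A3) OR A4'), ('A5 OR A6');
""")

# Pair every T1 row with every T2 row and keep the pairs where the
# T2 string contains the T1 string.
rows = conn.execute("""
    SELECT t2.StringCol, t1.date
    FROM T1 AS t1 CROSS JOIN T2 AS t2
    WHERE instr(t2.StringCol, t1.StringCol) > 0
""").fetchall()
print(rows)
```

Note that a substring test also matches 'A2' inside a longer token such as 'A20', so token boundaries may need extra handling.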

Stored Procedure return multiple result sets

I need a SP to return multiple sets of results. The second set of results would be based on a column of the first set of results.
So:
declare @myTable1 table(field0 int,field1 varchar(255))
insert into @myTable1 select top 1 field0, field1 from table1
declare @myTable2 table(field0 int,field3 varchar(255))
insert into @myTable2
select field0, field3 from table2
where @myTable1.field0 = @myTable2.field0
How do I return @myTable1 and @myTable2 from my SP? Is this syntax even right at all?
My apologies, I'm still a newbie at SQL...
EDIT:
So, I'm getting an error on the last line of the code below that says: "Must declare the scalar variable "@myTable1""
declare @myTable1 table(field0 int,field1 dateTime)
insert into @myTable1
select top 1 field0, field1
from someTable1 m
where m.field4 > '20090629' -- quoted date literal; unquoted 6/29/2009 is integer division
select * from @myTable1
select *
from someTable2 m2
where m2.field0 = @myTable1.field0
If I highlight and run the code up until the second select * it works fine...
when I highlight the rest it acts like the first variable doesn't exist...
EDIT2:
Figured that problem out. Thanks guys.
declare @myTable1 table(field0 int,field1 dateTime)
insert into @myTable1
select top 1 field0, field1
from someTable1 m
where m.field4 > '20090629' -- quoted date literal; unquoted 6/29/2009 is integer division
select * from @myTable1
select *
from someTable2 m2
where m2.field0 = (select field0 from @myTable1)
You pretty much just select two result sets
SELECT * FROM @myTable1
SELECT * FROM @myTable2
However, some tools will hide some results (e.g. pgAdmin will only show the last), and some tools require an explicit step to get to the next result set (e.g. .NET's IDataReader will not let you Read() from the second result set until you call NextResult()).
Edit:
An alternative in this case, since the types of the two results match, is to combine them into a single resultset:
SELECT field0, field1 from @myTable1
UNION
SELECT field0, field3 from @myTable2
You can also choose between UNION ALL and plain UNION (which is distinct by default); the latter only sends rows that aren't repeats.
At the end of the Stored Proc, put:
SELECT * FROM @myTable1
SELECT * FROM @myTable2
This will return 2 result sets.