SQL: trying to change letter case and group similar nvarchar values

I am using SQL Server 2008 and I'm trying to build a query that displays some overall results from a single SQL table.
I want to display count(fieldname) for each date. For example, I want to know how often the name "izla" is repeated in the table for each date, but it could also be "IZLA" or "Izla", so I must find a way to group these values together as one and get a single count for the three of them.
The problem is that if I try using uppercase or lowercase so that they are automatically considered the same, I run into this: when izla is converted to upper case it becomes İZLA, and when IZLA is converted to lower case it is displayed as ızla.
The big question is: how can I group this data together? Maybe the problem comes from using nvarchar, but I need the column type to stay like that (I can't change it).

When you group, you should use an Accent Insensitive collation. You can add this directly to your group by clause. The following is an example:
Declare @Temp Table(Data nvarchar(100))
Insert Into @Temp Values(N'izla')
Insert Into @Temp Values(N'İZLA')
Insert Into @Temp Values(N'IZLA')
Insert Into @Temp Values(N'Izla')
-- Grouping with the column's own collation
Select Data,
Count(*)
From @Temp
Group By Data
-- Grouping with a case- and accent-insensitive collation
Select Data Collate Latin1_General_CI_AI,
Count(*)
From @Temp
Group By Data Collate Latin1_General_CI_AI
When you run this example, you will see that the first query creates two rows (with count 3 and count 1). The second query uses an accent-insensitive collation for the grouping, so all 4 items are grouped together.
I used Latin1_General_CI_AI in my example. I suggest you examine the collation of the column you are using and then use a collation that most closely matches by changing the AS on the end to AI.
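For anyone who wants to try the grouping idea without a SQL Server instance, here is a minimal Python sketch using the standard-library sqlite3 module. It is not the collation mechanism itself: the helper `fold_key` and the table name `temp_data` are made up for illustration, and the helper only approximates a case- and accent-insensitive comparison by decomposing each string, dropping combining marks and case-folding the rest.

```python
import sqlite3
import unicodedata


def fold_key(text):
    """Approximate a CI_AI comparison key (illustrative helper, not a SQL Server API)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks such as the dot above in 'İ' (U+0130).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def count_folded(values):
    """Group the given strings by their folded key and count each group."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python helper to SQL as a one-argument function.
    conn.create_function("fold_key", 1, fold_key)
    conn.execute("CREATE TABLE temp_data (data TEXT)")
    conn.executemany("INSERT INTO temp_data VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        "SELECT fold_key(data) AS k, COUNT(*) FROM temp_data GROUP BY k ORDER BY k"
    ).fetchall()
    conn.close()
    return rows


print(count_folded(["izla", "İZLA", "IZLA", "Izla"]))
```

With the four sample values this yields a single group of 4. Note that a dotless ı (U+0131) has no Unicode decomposition, so this particular helper would leave it in its own group unless you map it explicitly.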

Try replacing ı and similar characters with their English equivalents after lowercasing.

This all comes down to collation, which is the way that the system sorts and compares string data.
You could say something like:
SELECT *, COUNT(*) OVER (PARTITION BY fieldname COLLATE Latin1_General_CI_AI), COUNT(*) OVER (PARTITION BY fieldname COLLATE Latin1_General_CI_AS)
FROM yourtable
This will provide some nice figures for you around how many times each name appeared in the various formats. There are many collations, and you can search in Books Online for a complete list. You may also be interested in Latin1_General_BIN for example.
Rob

Related

Reading Unicode strings from SQL Server

I know strings need to be prefixed with N' in SQL Server (2012) INSERT statements to store them as UNICODE but do they have to be retrieved (SELECT statement) in a certain way as well so they are in UNICODE?
I am able to store international strings correctly with N notation but when I run SELECT query to fetch the records back, it comes as question marks. My query is very simple.
SELECT COLUMN1, COLUMN2 FROM TABLE1
I am looking at other possible reasons that may have caused this issue but at least I want to eliminate the SQL statement above. Should it read COLUMN1 and COLUMN2 columns correctly when they both store UNICODE strings using N notation? Do I have to do anything to the statement to tell it they are UNICODE?
Within Management Studio you should not need to do anything special to display the correct values. Make sure that the columns in your table are defined as Unicode strings (NVARCHAR) instead of ANSI strings (VARCHAR).
The following example demonstrates the concept:
CREATE TABLE UnicodeExample
(
MyUnicodeColumn NVARCHAR(100)
,MYANSIColumn VARCHAR(100)
)
INSERT INTO UnicodeExample
(
MyUnicodeColumn
,MYANSIColumn
)
VALUES
(
N'איש'
,N'איש'
)
SELECT *
FROM UnicodeExample
DROP TABLE UnicodeExample
In the above example the column MyUnicodeColumn is defined as NVARCHAR(100) and MYANSIColumn is defined as VARCHAR(100). The query will correctly return the result for MyUnicodeColumn but will return ??? for MYANSIColumn.
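The question marks can be reproduced outside the database. The sketch below is only an analogy in Python, under the assumption that the VARCHAR column uses a single-byte Windows-1252 code page (the real code page depends on the column's collation); `to_single_byte` is a hypothetical helper name.

```python
def to_single_byte(text, code_page="cp1252"):
    """Round-trip text through a single-byte code page, replacing unmappable characters."""
    # errors="replace" substitutes '?' for every character the code page cannot represent.
    return text.encode(code_page, errors="replace").decode(code_page)


print(to_single_byte("איש"))   # Hebrew letters do not exist in cp1252
print(to_single_byte("izla"))  # plain ASCII survives unchanged
```

Each Hebrew letter has no slot in that code page, so it is replaced by a question mark, while the ASCII string comes back untouched.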

Need to UPPER SQL statement with INNER JOIN SELECT

I'm using Pervasive SQL 10.3 (let's just call it MS SQL since almost everything is the same regarding syntax) and I have a query to find duplicate customers using their email address as the duplicate key:
SELECT arcus.idcust, arcus.email2
FROM arcus
INNER JOIN (
SELECT arcus.email2, COUNT(*)
FROM arcus WHERE RTRIM(arcus.email2) != ''
GROUP BY arcus.email2 HAVING COUNT(*)>1
) dt
ON arcus.email2=dt.email2
ORDER BY arcus.email2
My problem is that I need to do a case insensitive search on the email2 field. I'm required to have UPPER() for the conversion of those fields.
I'm a little stuck on how to do an UPPER() in this query. I've tried all sorts of combinations including one that I thought for sure would work:
... ON UPPER(arcus.email2)=UPPER(dt.email2) ...
... but that didn't work. It took it as a valid query, but it ran for so long I eventually gave up and stopped it.
Any idea of how to do the UPPER conversion on the email2 field?
Thanks!
If your database is set up to be case sensitive, then your inner query will have to take account of this to perform the grouping as you intended. If it is not case sensitive, then you won't require UPPER functions.
Assuming your database IS case sensitive, you could try the query below. Maybe this will run faster...
SELECT arcus.idcust, arcus.email2
FROM arcus
INNER JOIN (
SELECT UPPER(arcus.email2) as upperEmail2, COUNT(*)
FROM arcus WHERE RTRIM(arcus.email2) != ''
GROUP BY UPPER(arcus.email2) HAVING COUNT(*)>1
) dt
ON UPPER(arcus.email2) = dt.upperEmail2
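If you want to check the logic of that query on a few rows first, here is a small sketch that runs the same statement shape against an in-memory SQLite database from Python. SQLite is only a stand-in here (Pervasive may behave differently), the sample customers are invented, and the `cnt` alias and the final ORDER BY are additions for the sketch.

```python
import sqlite3


def find_duplicate_emails(customers):
    """Return (idcust, email2) rows whose email occurs more than once, ignoring case."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE arcus (idcust TEXT, email2 TEXT)")
    conn.executemany("INSERT INTO arcus VALUES (?, ?)", customers)
    rows = conn.execute(
        """
        SELECT arcus.idcust, arcus.email2
        FROM arcus
        INNER JOIN (
            SELECT UPPER(email2) AS upperEmail2, COUNT(*) AS cnt
            FROM arcus
            WHERE RTRIM(email2) != ''
            GROUP BY UPPER(email2)
            HAVING COUNT(*) > 1
        ) dt ON UPPER(arcus.email2) = dt.upperEmail2
        ORDER BY UPPER(arcus.email2), arcus.idcust
        """
    ).fetchall()
    conn.close()
    return rows


# Invented sample data: C1 and C2 share an address that differs only by case.
sample = [("C1", "a@x.com"), ("C2", "A@X.COM"), ("C3", "b@x.com"), ("C4", "")]
print(find_duplicate_emails(sample))
```

Only C1 and C2 come back: the unique address and the empty one are filtered out by the HAVING and WHERE clauses.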
Check out this blog post which discusses case insensitive searches in SQL. In essence, the reason why it was so slow was that most likely none of the current table indexes could be used in the query, so the database engine had to perform a full table scan, likely multiple times.
An index on arcus.email2 is completely useless when wanting to compare between the uppercased versions (UPPER(arcus.email2)), because the database engine cannot look up the values in the index (because they're different values!).
To improve the performance, you can create an index specifically on the result of applying UPPER to the field.
CREATE INDEX IX_arcus_UPPER_email2
ON arcus (UPPER(email2));
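Whether your engine accepts an index on an expression is something to verify in its documentation. As a rough illustration of the effect, the sketch below uses SQLite (which does support expression indexes) from Python and asks for the query plan with and without such an index. The function name `plan_for_upper_lookup` is made up, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def plan_for_upper_lookup(create_index):
    """Return SQLite's query-plan text for a lookup on UPPER(email2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE arcus (idcust TEXT, email2 TEXT)")
    if create_index:
        # Index the uppercased value, not the raw column.
        conn.execute("CREATE INDEX ix_arcus_upper_email2 ON arcus (UPPER(email2))")
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT idcust FROM arcus WHERE UPPER(email2) = ?",
        ("A@X.COM",),
    ).fetchall()
    conn.close()
    # The human-readable description is the last column of each plan row.
    return " ".join(row[-1] for row in plan)


print(plan_for_upper_lookup(create_index=False))
print(plan_for_upper_lookup(create_index=True))
```

Without the index the plan is a scan of the whole table; with it, the plan names ix_arcus_upper_email2, which is the behaviour the answer above is aiming for.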
The collation of a character string will determine how SQL Server compares character strings. If you store your data using a case-insensitive format then when comparing the character string “AAAA” and “aaaa” they will be equal. You can place a collate Latin1_General_CI_AS for your email column in the where clause.
Check the link below for how to implement collation in a sql query.
How to do a case sensitive search in WHERE clause

Matching sub string in a column

First I apologize for the poor formatting here.
Second I should say up front that changing the table schema is not an option.
So I have a table defined as follows:
Pin varchar
OfferCode varchar
Pin will contain data such as:
abc,
abc123
OfferCode will contain data such as:
123
123~124~125
I need a query to check for a count of a Pin/OfferCode combination and when I say OfferCode, I mean an individual item delimited by the tilde.
For example, if there is one row that looks like abc, 123 and another that looks like abc, 123~124, and I search for a count of Pin=abc, OfferCode=123, I want to get a count of 2.
Obviously I can do a similar query to this:
SELECT count(1) from MyTable (nolock) where OfferCode like '%' + @OfferCode + '%' and Pin = @Pin
using like here is very expensive and I'm hoping there may be a more efficient way.
I'm also looking into using a split string solution. I have a table-valued function SplitString(string,delim) that will return table OutParam, but I'm not quite sure how to apply this to a table column vs a string. Would this even be worthwhile pursuing? It seems like it would be much more expensive, but I'm unable to get a working solution to compare to the like solution.
Your like/% solution is open to a bug if you had offer codes other than 3 digits (if there was offer code 123 and 1234, searching for like '%123%' would return both, which is wrong). You can use your string function this way:
SELECT Pin, count(1)
FROM MyTable (nolock)
CROSS APPLY SplitString(OfferCode,'~') OutParam
WHERE OutParam.Value = @OfferCode and Pin = @Pin
GROUP BY Pin
If you have a relatively small table you can probably get away with this. If you are working with a large number of rows or encountering performance problems, it would be more effective to normalize it as RedFilter suggested.
using like here is very expensive and I'm hoping there may be a more efficient way
The efficient way is to normalize the schema and put each OfferCode in its own row.
Then your query is more like (although you may need to use an intersection table depending on your schema):
select count(*)
from MyTable
where OfferCode = @OfferCode
and Pin = @Pin
Here is one way to use like for this problem, which is standard for getting exact matches when searching delimited strings while avoiding the '%123%' matches '123' and '1234' problem:
-- Create some test data
declare @table table (
Pin varchar(10) not null
, OfferCode varchar(100) not null
)
insert into @table select 'abc', '123'
insert into @table select 'abc', '123~124'
-- Mock some proc params
declare @Pin varchar(10) = 'abc'
declare @OfferCode varchar(10) = '123'
-- Run the actual query
select count(*) as Matches
from @table
where Pin = @Pin
-- Append delimiters to find exact matches
and '~' + OfferCode + '~' like '%~' + @OfferCode + '~%'
As you can see, we're adding the delimiters to the searched string, and also the search string in order to find matches, thus avoiding the bugs mentioned by other answers.
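To compare the naive pattern with the delimiter-wrapped one side by side, here is a small sketch in Python with an in-memory SQLite table. SQLite concatenates with || where T-SQL uses +, the third sample row (1234~125) is invented to trigger the false match, and `count_matches` is a hypothetical helper.

```python
import sqlite3


def count_matches(rows, pin, offer_code):
    """Return (naive_count, exact_count) for one Pin/OfferCode pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (Pin TEXT, OfferCode TEXT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?)", rows)
    # Naive substring match: also hits longer codes that contain the digits.
    naive = conn.execute(
        "SELECT COUNT(*) FROM my_table "
        "WHERE Pin = ? AND OfferCode LIKE '%' || ? || '%'",
        (pin, offer_code),
    ).fetchone()[0]
    # Wrap both sides in delimiters so only whole codes match.
    exact = conn.execute(
        "SELECT COUNT(*) FROM my_table "
        "WHERE Pin = ? AND '~' || OfferCode || '~' LIKE '%~' || ? || '~%'",
        (pin, offer_code),
    ).fetchone()[0]
    conn.close()
    return naive, exact


rows = [("abc", "123"), ("abc", "123~124"), ("abc", "1234~125")]
print(count_matches(rows, "abc", "123"))
```

The naive pattern counts all three rows, while the delimiter-wrapped pattern counts only the two rows that really contain offer code 123.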
I highly doubt that a string splitting function will yield better performance over like, but it may be worth a test or two using some of the more recently suggested methods. If you still have unacceptable performance, you have a few options:
Updated:
Try an index on OfferCode (or on a computed persisted column of '~' + OfferCode + '~'). Contrary to the myth that SQL Server won't use an index with like and wildcards, this might actually help.
Check out full text search.
Create a normalized version of this table using a string splitter. Use this table to run your counts. Update this table according to some schedule or event (trigger, etc.).
If you have some standard search terms, pre-calculate the counts for these and store them on some regular basis.
Actually, the LIKE condition is going to have much less cost than doing any sort of string manipulation and comparison.
http://www.simple-talk.com/sql/performance/the-seven-sins-against-tsql-performance/

Sqlite query optimisation needed

I'm using SQLite for a small validation application. I have a simple one-table database with 4 varchar columns and one integer primary key. There are close to 1 million rows in the table. I have optimised it and done a vacuum on it.
I am using the following query to retrieve a presence count from the table. I have changed the fields and names for privacy.
SELECT
count(*) as 'test'
FROM
my_table
WHERE
LOWER(surname) = LOWER('Oliver')
AND
UPPER(address_line2) = UPPER('Somewhere over the rainbow')
AND
house_number IN ('3','4','5');
This query takes about 1.5-1.9 seconds to run. I have tried indexes and they make no real difference. This time may not sound bad, but I have to run this test about 40,000 times on a CSV file that is read in, so as you may imagine it adds up pretty quickly. Any ideas on how to reduce the execution time? I normally develop in MSSQL or MySQL, so if there are some tricks I am missing in SQLite I would be happy to hear them.
All the best.
When you use a function over an indexed column, SQLite cannot use the index, because the function may not preserve the ordering -- i.e. there can be functions F such that 1 < 2 but F(1) > F(2). There are some ways to solve this situation, though:
If you want to use indexes to make your query faster, you must save the value in a fixed case (upper or lower) and then convert only the query parameter to the same case:
SELECT count(*) as 'test'
FROM my_table
WHERE surname = LOWER('Oliver')
You can use the case-insensitive LIKE operator (I don't know how indexes are affected!):
SELECT count(*) as 'test'
FROM my_table
WHERE surname LIKE 'Oliver';
Or you can create each column as text collate nocase and don't worry about case differences regarding this column anymore:
CREATE TABLE my_table (surname text collate nocase, <... other fields here ...>);
SELECT count(*) as 'test'
FROM my_table
WHERE surname ='Oliver';
You can find more information about the = and LIKE operators here.
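Here is a quick way to try the collate nocase option from Python's built-in sqlite3 module. It is only a sketch with invented sample surnames and a hypothetical helper name, `count_surname`; keep in mind that SQLite's built-in NOCASE folds ASCII letters only.

```python
import sqlite3


def count_surname(surnames, wanted):
    """Count rows whose surname equals `wanted` under SQLite's NOCASE collation."""
    conn = sqlite3.connect(":memory:")
    # The column collation makes plain '=' comparisons case-insensitive.
    conn.execute("CREATE TABLE my_table (surname TEXT COLLATE NOCASE)")
    # An index on the column inherits the NOCASE collation.
    conn.execute("CREATE INDEX ix_my_table_surname ON my_table (surname)")
    conn.executemany("INSERT INTO my_table VALUES (?)", [(s,) for s in surnames])
    count = conn.execute(
        "SELECT COUNT(*) FROM my_table WHERE surname = ?", (wanted,)
    ).fetchone()[0]
    conn.close()
    return count


print(count_surname(["oliver", "OLIVER", "Oliver", "Olivier"], "Oliver"))
```

The three spellings of Oliver are counted together and Olivier is left out, without calling LOWER() or UPPER() in the query.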
SELECT
count(1) as 'test'
FROM
my_table
WHERE
-- COLLATE applies to a single comparison, so attach it to each text comparison
surname = 'Oliver' COLLATE NOCASE
AND
address_line2 = 'Somewhere over the rainbow' COLLATE NOCASE
AND
house_number IN ('3','4','5');

SQL Server 2008 - different sort orders on VARCHAR vs NVARCHAR values

In SQL Server 2008, I am seeing some strange behavior when ordering NVARCHAR columns; here are a few quick use cases to demonstrate:
Case 1: ORDER on VARCHAR values:
SELECT t.Name
FROM
(
SELECT CAST('A' AS VARCHAR(500)) As Name
UNION SELECT CAST('-A' AS VARCHAR(500)) AS NAME
) As t
ORDER BY t.Name ASC
Which produces (my desired) output of:
-A
A
(The one with the leading dash is displayed first)
Contrast this with the ORDER on NVARCHAR values:
SELECT t.Name
FROM
(
SELECT CAST('A' AS NVARCHAR(500)) As Name
UNION SELECT CAST('-A' AS NVARCHAR(500)) AS NAME
) As t
ORDER BY t.Name ASC
Which produces this output:
A
-A
Assuming I want to sort on NVARCHAR fields (I can't change the db design) using a standard ORDER BY clause (I'm using linq2nhib, which prevents me from doing any casting here) - how do I get the sorting to work in the desired fashion (item with the leading non-alphanumeric value displays first)?
I'm hoping there is some sort of database/server-level collation setting for this...any ideas?
You need to use binary collation to achieve consistent ordering.
ORDER BY t.Name COLLATE Latin1_General_BIN ASC
Edit:
Since you can't apply the collation in the query, you will need to do it at the database level. Set it on the column(s) that you are comparing, and it needs to be a binary collation.
Here's an example of that.
Either change the collation in the database or change the collation of individual columns in the tables you need consistent ordering on.
The collations Latin1_General_BIN or Latin1_General_BIN2 work fine with your example.
You can also order the set with CAST(VARCHAR) on a CTE that returns the primary key and do a join on the table to get the NVARCHAR value.
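As a rough illustration of what a binary collation does, the Python sketch below sorts strings by raw code point, which is how Python compares str values. It is an analogy rather than SQL Server's implementation, and `binary_order` is a made-up name.

```python
def binary_order(names):
    """Sort strings by raw code point, roughly what a binary collation does."""
    return sorted(names)


print(binary_order(["A", "-A"]))  # '-' (U+002D) sorts before 'A' (U+0041)
print(binary_order(["a", "B"]))   # uppercase letters sort before lowercase ones
```

The first call gives the ordering asked for in the question, with -A first. The second call shows the trade-off: a binary ordering is also case-sensitive, so B sorts ahead of a.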