SQL Server - Like/Pattern Matching - sql

I'm trying to implement a check constraint on a key field. The key field is composed of a 3 character prefix, and then appended with numeric characters (which can be provided manually, but the default is to get an integer value from a sequence, which is then cast as nvarchar). The key field is defined as nvarhcar(9).
I'm doing this for multiple tables, but here is a specific example below to demonstrate:
Table name: Company
Key field: IDCompany
Key field prefix: CMP
Examples of valid keys -
CMP1
CMP01
CMP10000
CMP999999
Examples of invalid keys -
CMPdog1
steve
1CMP1
1
999999999
The check constraint I came up with was:
IDCompany LIKE 'CMP%[0-9]'
However, this is beaten by CMPdog1 etc.
What should I be using as a check constraint to enforce an unknown number of numeric characters?
I could do the following:
IDCompany LIKE 'CMP[0-9]' OR IDCompany LIKE 'CMP[0-9][0-9]' OR .... through to 6 characters
But, this seems like a clunky way of doing it, is there something smarter?
EDIT 2: This actually doesn't work, it does not exclude negative numbers:
EDIT 1:
This solution ended up working for me:
IDCompany nvarchar(9) NOT NULL CONSTRAINT DEF_Company_IDCompany DEFAULT 'CMP' + CAST((NEXT VALUE FOR dbo.sq_Company) AS nvarchar) CONSTRAINT CHK_Company_IDCompany CHECK (IDCompany LIKE 'CMP%[0-9]' AND ISNUMERIC(SUBSTRING(IDCompany,4,LEN(IDCompany)-3))=1)
EDIT 3: Solution -
As proposed in Szymon's post below.
Thanks all!

You could do something like that:
where LEFT(IDCompany, 3) = 'CMP'
and isnumeric(RIGHT(IDCompany, len(IDCompany) - 3)) = 1
and IDCompany not like '%[.,-]%'
The first part checks that it starts with CMP
The next part is to make sure that the rest is numeric but excluding negative and decimal numbers.

Well, I would reconsider the design of your table and create 3 columns:
prefix, CHAR(3), with a default as 'CMP' and a constraint to allow only 'CMP' combination
id, INTEGER
companyid, NVARCHAR(9), a computed, persisted column as sum of the first 2 columns. Most probably with an index on.

Unfortunately, SQL Server doesn't suppport regular expressions.
So there is only 2 ways to solve your problem:
Use CLR function for using regular expression. You may find more information here
Or whrite long WHERE clause like you suggested:
IDCompany LIKE 'CMP[0-9]' OR IDCompany LIKE 'CMP[0-9][0-9]' OR ....

Try this:
isnumeric(substring(IDCompany,4,len(IDCompany)))=1 and IDCompany not like '%[.,-]%'
How this works: The first three characters are fixed, so we only need to check from the 4th character onwards. So we get the required substring. Then, we use isNumeric to check if the substring is entirely numeric. Example here
EDIT: As pointed out in comments by Allan, we need an extra check to ensure that characters used in numeric strings like commas or dots are not part of the input string.

Related

Query to ignore rows which have non hex values within field

Initial situation
I have a relatively large table (ca. 0.7 Mio records) where an nvarchar field "MediaID" contains largely media IDs in proper hexadecimal notation (as they should).
Within my "sequential" query (each query depends on the output of the query before, this is all in pure T-SQL) I have to convert these hexadecimal values into decimal bigint values in order to do further calculations and filtering on these calculated values for the subsequent queries.
--> So far, no problem. The "sequential" query works fine.
Problem
Unfortunately, some of these Media IDs do contain non-hex characters - most probably because there was some typing errors by the people which have added them or through import errors from the previous business system.
Because of these non-hex chars, the whole query fails (of course) because the conversion hits an error.
For my current purpose, such rows must be skipped/ignored as they are clearly wrong and cannot be used (there are no medias / data carriers in use with the current business system which can have non-hex character IDs).
Manual editing of the data is not an option as there are too many errors and it is not clear with what the data must be replaced.
Challenge
To create a query which only returns records which have valid hex values within the media ID field.
(Unfortunately, my SQL skills are not enough to create the above query. Your help is highly appreciated.)
The relevant section of the larger query looks like this (xxxx is where your help comes in :-))
select
pureMediaID
, mediaID
, CUSTOMERID
,CONTRACT_CUSTOMERID
from
(
select concat('0x', Replace(Ltrim(Replace(mediaID, '0', ' ')), ' ', '0')) AS pureMediaID
--, CUSTOMERID
, *
from M_T_CONTRACT_CUSTOMERS
where mediaID is not null
and mediaID like '0%'
and xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
) as inner1
EDIT: As per request I have added here some good and some bad data:
Good:
4335463357
4335459809
1426427996
4335463509
4335515039
4335465134
4427370396
4335415661
4427369036
4335419089
004BB03433
004e7cf9c6
00BD23133
00EE13D8C1
00CCB5522C
00C46522C
00dbbe3433
Bad:
4564589+
AB6B8BFC.8
7B498DFCnm
DB218DFChb
d<tgfh8CFC
CB9E8AFCzj
B458DFCjhl
rytzju8DFC
BFCtdsjshj
DB9888FCgf
9BC08CFCyx
EB198DFCzj
4B628CFChj
7B2B8DFCgg
After I did upgrade the compatibility level of the SQL instance to SQL2016 (it was below 2012 before) I could use try_convert with same syntax as the original convert function as donPablo has pointed out. With that the query could run fully through and every MediaID which is not a correct hex value gets nicely converted into a null value - really, really nice.
Exactly what I needed.
Unfortunately, the solution of ALICE... didn't work out for me as this was also (strangely) returning records which had the "+" character within them.
Edit: The added comment of Alice... where you create a calculated field like this:
CASE WHEN "KEY" LIKE '%[^0-9A-F]%' THEN 0 ELSE 1 end as xyz
and then filter in the next query like this:
where xyz = 1
works also with SQL Instances with compatibility level < SQL 2012.
Great addition for people which still have to work with older SQL instances.
An option (although not ideal in terms of performance) is to check the characters in the MediaID through a case statement and regular expression
Hexadecimals cannot contain characters other than A-F and numbers between 0 and 9
CASE WHEN MediaID LIKE '%[0-9A-F]%' THEN 1 ELSE 0 END
I would recommend writing a function that can be used to evaluate MediaID first and checks if it is hexadecimal and then running the query for conversion

Exclude rows when column contains a 1 in position 2 without using function

I have a column that will always be 5 digits long, and each digit will always be a 1 or a 0. I need to put in my where clause to exclude when the second position is equal to 1. For example 01000 is to be excluded but 10010 is to be kept. I currently have:
WHERE (SUBSTRING(field, 2, 1) <> '1') or field IS NULL
How do do this without using the Substring function?
Edit:Also, the column is a varchar(10) in the database. Does this matter?
You could use the like operator to check that character directly:
WHERE field LIKE '_1%' OR field IS NULL
Use LEFT and RIGHT and then check that is 1 or not as below-
WHERE RIGHT(LEFT(field,2),1) <> '1' OR field IS NULL
No.
If 'field' is of a string type, you need to use string functions to manipulate it. SUBSTRING or some other flavor of it.
You can also convert it to binary and use bitwise AND operator but that won't solve the root issue here.
You are facing the consequences of someone ignoring 1NF.
There is a reason why Codd insisted that every "cell" must be atomic. Your's is not.
Can you separate this bitmap into atomic attribute columns?

PostgreSQL ORDER BY issue - natural sort

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record to the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip alphabets from em_code usiging reg exp and store in ec_alpha
Cast the remating part to integer ec_num
Increment by one ec_num++
Pad with sufficient zeors and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
First step will return EM999 instead EM1000 and it will again generate EM1000 as new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 9.6, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
you can use just this line
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought about another way of doing this that uses less db storage than padding and saves time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

How to find MAX() value of character column?

We have legacy table where one of the columns part of composite key was manually filled with values:
code
------
'001'
'002'
'099'
etc.
Now, we have feature request in which we must know MAX(code) in order to give user next possible value, in example case form above next value is '100'.
We tried to experiment with this but we still can't find any reasonable explanation how DB2 engine calculates that
MAX('001', '099', '576') is '576'
MAX('099', '99', 'www') is '99' and so on.
Any help or suggestion would be much appreciated!
You already have the answer to getting the maximum numeric value, but to answer the other part with regard to 'www','099','99'.
The AS/400 uses EBCDIC to store values, this is different to ASCII in several ways, the most important for your purposes is that Alpha characters come before numbers, which is the opposite of Ascii.
So on your Max() your 3 strings will be sorted and the highest EBCDIC value used so
'www'
'099'
'99 '
As you can see your '99' string is really '99 ' so it is higher that the one with the leading zero.
Cast it to int before applying max()
For the numeric maximum -- filter out the non-numeric values and cast to a numeric for aggregation:
SELECT MAX(INT(FLD1))
WHERE FLD1 <> ' '
AND TRANSLATE(FLD1, '0123456789', '0123456789') = FLD1
SQL Reference: TRANSLATE
And the reasonable explanation:
SQL Reference: MAX
This max working well in your type definition, when you want do max on integer values then convert values to integer before calling MAX, but i see you mixing max with string 'www' how you imagine this works?
Filter integer only values, cast it to int and call max. This is not good designed solution but looking at your problem i think is enough.
Sharing the solution for postgresql
which worked for me.
Suppose here temporary_id is of type character in database. Then above query will directly convert char type to int type when it gives response.
SELECT MAX(CAST (temporary_id AS Integer)) FROM temporary
WHERE temporary_id IS NOT NULL
As per my requirement I've applied MAX() aggregate function. One can remove that also and it will work the same way.

which datatype to use to store a mobile number

Which datatype shall I use to store mobile numbers of 10 digits (Ex.:9932234242). Shall I go for varchar(10) or for the big one- the "bigint".
If the number is of type- '0021-23141231' , then which datatype to use?
varchar/char long enough for all expected (eg UK numbers are 11 long)
check constraint to allow only digits (expression = NOT LIKE '%[^0-9]%')
format in the client per locale (UK = 07123 456 789 , Switzerland = 071 234 56 78)
As others have answered, use varchar for data that happens to be composed of numeric digits, but for which mathematical operations make no sense.
In addition, in your example number, did you consider what would happen if you stored 002123141231 into a bigint column? Upon retrieval, it would be 2123141231, i.e. there's no way for a numeric column to store leading 0 digits...
Use varchar with check constraint to make sure that only digits are allowed.
Something like this:
create table MyTable
(
PhoneNumber varchar(10)
constraint CK_MyTable_PhoneNumber check (PhoneNumber like '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]')
)
if it is always the same length you might want to use char instead.
varchar(50) is good for mobile number data type . because it may sometimes contain country code for example +91 or spaces also. For comparison purpose we can remove all special characters from both side in the expresion.