I have a table column numbers containing strings like:
1, 2, 2A, 14, 14A, 20
Listed in the desired ascending sort order.
How can I formulate an ORDER BY clause to achieve this order?
By default, Postgres has to resort to alphabetical (string) order, which would be:
1, 14, 14A, 2, 20, 2A
Can this be done using only the string-manipulation features that come with Postgres (replace(), regexp_replace(), etc.)?
My first idea was:
cut the letter, if present
number * 100
add ascii of letter, if present
This would yield the desired result as the mapped values would be:
100, 200, 265, 1400, 1465, 2000
I could also index this manipulated value to speed up sorting.
Additional restrictions:
I cannot use casts to hex numbers because, e.g., 14Z is valid too.
Ideally, the result is a single expression. I'd need to use this transformation for filtering and sorting like:
SELECT * FROM table WHERE transform(numbers) < 15 ORDER BY transform(numbers)
RESULT:
1, 2, 2A, 14, 14A
I tried to implement my idea, using what I learned from #klin's answer:
Cut the letter and multiply number by 100:
substring('12A' from '(\d+).*')::int*100
Cut the numbers and get ASCII of letter:
ascii(substring('12A' from '\d+([A-Z])'))
Add the two.
This works fine with 12A, but does not work with 12, as the second expression returns NULL and not 0 (numeric zero). Any ideas?
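For completeness, the missing letter part can be defaulted to 0 with coalesce() -- a minimal sketch of the combined expression:
select substring('12' from '(\d+).*')::int * 100
     + coalesce(ascii(substring('12' from '\d+([A-Z])')), 0);  -- 1200 instead of NULL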
Based on these assumptions:
Numbers consist of digits and optionally one trailing letter, and nothing else.
There is always at least one leading digit.
All letters are either upper case [A-Z] or lower case [a-z], but not mixed.
I would enforce that with a CHECK constraint on the table column to be absolutely reliable.
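For example, such a constraint could look like this (a sketch using the tbl / numbers names from the query below; the constraint name is made up):
ALTER TABLE tbl
  ADD CONSTRAINT numbers_format_chk
  CHECK (numbers ~ '^\d+[A-Za-z]?$');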
Create a tiny IMMUTABLE SQL function:
CREATE OR REPLACE FUNCTION f_nr2sort(text)
RETURNS int AS
$func$
SELECT CASE WHEN right($1, 1) > '9' COLLATE "C" -- no collation
THEN left($1, -1)::int * 100 + ascii(right($1, 1))
ELSE $1::int * 100 END -- only digits
$func$ LANGUAGE SQL IMMUTABLE;
Optimized for performance based on the above assumptions. I replaced all regular expressions with the much cheaper left() and right().
I disabled collation rules with COLLATE "C" for the CASE expression (it's cheaper, too) to ensure the default byte order of ASCII letters. Letters in [a-zA-Z] sort above '9', and if that's the case for the last character, we proceed accordingly.
This way we avoid NULL values and don't need to fix them with COALESCE.
Then your query can be:
SELECT *
FROM tbl
WHERE f_nr2sort(numbers) < f_nr2sort('15C')
ORDER BY f_nr2sort(numbers);
Since the function is IMMUTABLE, you can even create a simple functional index to support this class of queries:
CREATE INDEX tbl_foo_id ON tbl (f_nr2sort(numbers));
I am new to PostgreSQL, but I found this very useful post:
Alphanumeric sorting with PostgreSQL
So what about something like this:
select val
from test
order by (substring(val, '^[0-9]+'))::int, substring(val, '[^0-9_].*$') desc
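A quick test against the question's sample values (a sketch; the CTE just stands in for the real table):
WITH test(val) AS (
    VALUES ('1'), ('2'), ('2A'), ('14'), ('14A'), ('20')
)
SELECT val
FROM test
ORDER BY (substring(val, '^[0-9]+'))::int, substring(val, '[^0-9_].*$') DESC;
-- returns 1, 2, 2A, 14, 14A, 20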
Hope it helps
Related
I have a requirement where I have to find the number of records matching a special pattern in the field ref_id in a table. It's a varchar column. I need to find all the records where the 8th, 9th and 10th characters are a digit followed by XX, i.e. something like 2XX or 8XX. I tried using regexp with [:digit:] but had no luck. Essentially I am looking for all records where the 8th-10th characters are 1XX, 2XX, 3XX, etc.
Using REGEXP_LIKE (replace table with your table name):
SELECT COUNT(*)
FROM table
WHERE REGEXP_LIKE(ref_id,'^.{7}[0-9]XX');
.{7} - any seven characters
[0-9] - the 8th character is a digit
XX - literal XX as the 9th and 10th characters
Or, with the [:digit:] class you mentioned, you may use:
SELECT COUNT(*)
FROM table
WHERE REGEXP_LIKE(ref_id,'^.{7}[[:digit:]]XX');
This can also be achieved using standard non-regex SQL functions
select * from t where s like '________XX%' -- any 8 characters and then XX
AND translate( substr(s,8,1),'?0123456789','?') is null; --8th one is numeric
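To see how the translate() test behaves (a sketch, Oracle assumed): translate() keeps the '?' mapping and deletes every digit, so a digit character yields NULL while anything else survives:
SELECT translate('7', '?0123456789', '?') FROM dual;  -- NULL: '7' is a digit
SELECT translate('A', '?0123456789', '?') FROM dual;  -- 'A': not a digit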
No need for a regexp:
select * from mytable where substr(ref_id, 8, 3) in ('0XX','1XX','2XX','3XX','4XX','5XX','6XX','7XX','8XX','9XX')
or
select * from mytable where substr(ref_id, 8, 3) in ('1XX','2XX','3XX','4XX','5XX','6XX','7XX','8XX','9XX')
I don't know if '0XX' is a valid match or not.
Regexps tend to be slow.
I have a table, and one of the columns contains a string with items separated by semicolons (;).
I want to selectively transfer the data to a new table based on the pattern of the String.
For example, it may look like
16;;14;30;24;11;13;14;14;10;13;18;15;18;24;13/18;11;;23;12;;19;10;;11;26;;;42;26;38/39;12;;;;;;;11;;;;;;;;;;;;;;;
or
11;;11;11;11;11;11;11;11;11;11;11;11;11;11;11;11;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
I don't care about what's between the semicolons, but I care about which positions contain items. For example, if I only want the 1st, 3rd, 4th position to contain items, I would allow the following...
32;;14;18/12;;;;;;;;; or 32;;14;18/12;;;;55;;;;11;;;;;;;
This one down below is not okay because the 3rd position does not hold any value.
32;;;18/12;;;;;;;;;
If regexp works for this, then I can use merge into to move the desired records to the target table. If this cannot be done, I'll have to process each record in Java, and selectively insert the records to the new table.
source table:
id | StringValue | count
target table:
id | StringValue | count
The SQL that I have in mind:
merge into your_target_table tt
using ( select StringValue, count
        from source_table
        where REGEXP_LIKE(StringValue, 'some pattern')
      ) st
on ( st.StringValue = tt.StringValue and st.count = tt.count )
when not matched then
    insert (id, StringValue, count)
    values (someseq.nextval, st.StringValue, st.count)
when matched then
    update set tt.count = tt.count + st.count;
Also, I'm certain that all StringValue values in the source table are unique, so what comes after when matched then is not important, but due to the syntax I think I must have something there.
For each position where you want a value, put [^;]+;, which matches one or more characters that are not ;, followed by a ;. If you don't care about a position, put [^;]*;. That's almost the same as the first one, but there may also be zero characters before the ;. Anchor the whole thing to the beginning of the string with ^.
So for your 1st, 3rd and 4th position example you'd get:
^[^;]+;[^;]*;[^;]+;[^;]+;
In a query that'd look like:
SELECT *
FROM elbat
WHERE regexp_like(nmuloc, '^[^;]+;[^;]*;[^;]+;[^;]+;');
It may be further shortened by putting the subexpression in a group, that is, putting parentheses around it, and using a quantifier -- a number in curly braces after the group. For example, ([^;]+;){2} matches two positions that are not empty. Your example would then shorten to:
^[^;]+;[^;]*;([^;]+;){2}
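Plugged into the query, the shortened pattern would look like this (same table and column names as above):
SELECT *
FROM elbat
WHERE regexp_like(nmuloc, '^[^;]+;[^;]*;([^;]+;){2}');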
While #stiky bit's answer is totally correct, there is another similar but perhaps more readable solution:
SELECT *
FROM elbat
WHERE regexp_substr(nmuloc, '(.*?)(;|$)', 1, 1, '', 1) is not null
AND regexp_substr(nmuloc, '(.*?)(;|$)', 1, 3, '', 1) is not null
AND regexp_substr(nmuloc, '(.*?)(;|$)', 1, 4, '', 1) is not null;
Pros:
clearly states the position number that should not be null
has a universal pattern for any condition, so no need to change the regex
can use any regex as the delimiter, not only a single character
actually extracts the item, so you can further test it with any function
Cons:
rather verbose
n times slower, where n is the condition count
even slower (up to 2 times) because of backtracking on each non-delimiter symbol
However, in my experience this efficiency difference is minor unless the query runs against billions of rows, and even then disk reads would consume most of the time.
How it's made:
(.*?)(;|$) - lazily searches for any character sequence (possibly zero-length) ending with a delimiter or the end of the string
1 - position to start the search. 1 is the default; needed only to get to the next parameter
1, 3 or 4 - occurrence of the pattern
'' - match_parameter. Can be used to set the matching mode, but here it is also only needed to get to the last parameter
1 - sub-expression number; makes regexp_substr return only the first capturing group, that is (.*?), i.e. the item itself without the delimiter
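As a usage sketch (Oracle assumed), the same call can also pull out an item for further checks, e.g. the 3rd item of the question's first sample:
SELECT regexp_substr('16;;14;30;24', '(.*?)(;|$)', 1, 3, '', 1) AS third_item
FROM dual;  -- returns '14'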
Assuming I have a table that looks like this:
Id | Name | Age
=====================
1 | Jose | 19
2 | Yolly | 26
20 | Abby | 3
29 | Tara | 4
And my query statement is:
1) Select * from thisTable where Name <= '*Abby';
it returns 0 rows
2) Select * from thisTable where Name <= 'Abby';
returns row with Abby
3) Select * from thisTable where Name >= 'Abby';
returns all rows // row 1-4
4) Select * from thisTable where Name >= '*Abby';
returns all rows; // row 1-4
5) Select * from thisTable where Name >= '*Abby' and Name <= '*Abby';
returns 0 rows.
6) Select * from thisTable where Name >= 'Abby' and Name <= 'Abby';
returns row with Abby;
My question: why did I get these results? How does the wildcard affect the result of the query? Why don't I get any result when the condition is Name <= '*Abby'?
Wildcards are only interpreted when you use the LIKE operator.
So when you compare against a string with other operators, it is treated literally, and lexicographical order is used in your comparisons.
1) There are no letters before *, so you don't have any rows returned.
2) A is the first letter in the alphabet, so the rest of the names are bigger than Abby; only Abby is equal to itself.
3) Opposite of 2)
4) See 1)
5) See 1)
6) This condition is equivalent to Name = 'Abby'.
When working with strings in SQL Server, ordering is done character by character, and the order those characters are sorted in depends on the collation. For some characters the sorting method is easy to understand: it's alphabetical or numerical order, for example 'a' < 'b' and '4' > '2'. Depending on the collation this might be done by letter and then case ('AaBbCc...') or by case and then letter ('ABC...Zabc...').
Let's take a string like 'Abby'; this would be sorted in the order of the letters A, b, b, y (the exact order depends on your collation, and I don't know what it is, so I'm going to assume an 'AaBbCc...' collation, as they are more common). Any string starting with something like 'Aba' would have a lower value than 'Abby', as the third character (the first that differs) has a lower value. So would a value like 'Abbie' ('i' has a lower value than 'y'). Similarly, a string like 'Abc' would have a greater value, as 'c' has a higher value than 'b' (which is the first character that differs).
If we throw numbers into the mix, then you might be surprised. For example, the string (important, I didn't say number) '123456789' has a lower value than the string '9'. This is because the first character that differs is the first character: '9' is greater than '1', and so '9' has the higher value. This is one reason why it's so important to store numbers as numerical datatypes, as the behaviour is unlikely to be what you expect or want otherwise.
As to what you are asking, however: the wildcards for SQL Server are '%' and '_' (there is also '^', but I won't cover that here). A '%' represents any number of characters, while '_' represents a single character. If you want to specifically look for one of those characters, you have to quote them in brackets ([]).
Using the equals (=) operator or the other comparison operators won't parse wildcards; you need to use an operator that does, like LIKE. Thus, if you want words that start with 'A' you would use the expression WHERE ColumnName LIKE 'A%'. If you wanted to search for values that consist of 6 characters and end with 'ed' you would use WHERE ColumnName LIKE '____ed'.
Like I said before, if you want to search for one of those specific characters, you quote them. So, if you wanted to search for a string that contains an underscore, the syntax would be WHERE ColumnName LIKE '%[_]%'.
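A short sketch of those patterns in SQL Server syntax (table and column names taken from the question):
SELECT * FROM thisTable WHERE Name LIKE 'A%';      -- names starting with 'A'
SELECT * FROM thisTable WHERE Name LIKE '____ed';  -- exactly 6 characters ending in 'ed'
SELECT * FROM thisTable WHERE Name LIKE '%[_]%';   -- names containing a literal underscore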
Edit: it's also worth noting that things like LIKE are affected by the collation's sensitivity, for example case and accent sensitivity. If you're using a case sensitive collation, then the statement WHERE 'Abby' LIKE 'abb%' is not true, as 'A' and 'a' are not the same case. Likewise, the statement WHERE 'Covea' = 'Covéa' would be false in an accent sensitive collation ('e' and 'é' are not treated as the same character).
A wildcard character is used to substitute for any other characters in a string. Wildcards are used in conjunction with the SQL LIKE operator in the WHERE clause. For example:
Select * from thisTable WHERE name LIKE '%Abby%'
This will return any values with Abby anywhere within the string.
Have a look at this link for an explanation of all wildcards https://www.w3schools.com/sql/sql_wildcards.asp
It is because >= and <= are comparison operators. They compare strings on the basis of their ASCII values.
Since the ASCII value of * is 42 and the ASCII values of capital letters start at 65, when you tried Name <= '*Abby', SQL Server picked the ASCII value of the first character in your string (that is, 42). Since no value in your data has a first character with an ASCII value less than or equal to 42, no data got selected.
You can refer to an ASCII table for more understanding:
http://www.asciitable.com/
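A quick way to check those code points yourself (a sketch, SQL Server syntax assumed):
SELECT ASCII('*') AS star_code, ASCII('A') AS a_code;  -- 42 and 65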
There are a few answers, and a few comments - I'll try to summarize.
Firstly, the wildcard in SQL is %, not * (for multiple matches). So your queries including an * ask for a comparison with that literal string.
Secondly, comparing strings with greater/less than operators probably does not do what you want - it uses the collation order to see which other strings are "earlier" or "later" in the ordering sequence. Collation order is a moderately complex concept, and varies between machine installations.
The SQL operator for string pattern matching is LIKE.
I'm not sure I understand your intent with the >= or <= statements - do you mean that you want to return rows where the name's first letter is after 'A' in the alphabet?
Column xy of type 'nvarchar2(40)' in table ABC.
The column consists mainly of numerical strings.
How can I make a
select to_number(trim(xy)) from ABC
query that ignores non-numerical strings?
In general in relational databases, the order of evaluation is not defined, so it is possible that the select functions are called before the where clause filters the data. I know this is the case in SQL Server. Here is a post that suggests that the same can happen in Oracle.
The case statement, however, does cascade, so it is evaluated in order. For that reason, I prefer:
select (case when NOT regexp_like(xy,'[^[:digit:]]') then to_number(xy)
end)
from ABC;
This will return NULL for values that are not numbers.
You could use regexp_like to find out if it is a number (with/without plus/minus sign, decimal separator followed by at least one digit, thousand separators in the correct places if any) and use it like this:
SELECT TO_NUMBER( CASE WHEN regexp_like(xy,'.....') THEN xy ELSE NULL END )
FROM ABC;
However, as the built-in function TO_NUMBER is not able to deal with all numbers (it fails at least when a number contains thousand separators), I would suggest writing a PL/SQL function TO_NUMBER_OR_DEFAULT(numberstring, defaultnumber) to do what you want.
EDIT: You may want to read my answer on using regexp_like to determine if a string contains a number here: https://stackoverflow.com/a/21235443/2270762.
You can add a WHERE clause:
SELECT TO_NUMBER(TRIM(xy)) FROM ABC WHERE REGEXP_INSTR(xy, '[A-Za-z]') = 0
The WHERE clause skips rows containing letters. See the documentation.
I have the query below;
Select count(*) as poor
from records where deviceId='00019' and type='Poor' and timestamp between #14-Sep-2012 01:01:01# and #24-Sep-2012 01:01:01#
table is like:
id. deviceId, type, timestamp
data is like:
1, '00019', 'Poor', '19-Sep-2012 01:01:01'
2, '00019', 'Poor', '19-Sep-2012 01:01:01'
3, '00019', 'Poor', '19-Sep-2012 01:01:01'
4, '00019', 'Poor', '19-Sep-2012 01:01:01'
I am trying to count the devices with a specific type.
Please help. Access always returns wrong data: it is returning 1 while 00019 has 4 entries for Poor.
Type and timestamp are both reserved words, so enclose them in square brackets in your query like this: [type] and [timestamp]. I doubt those reserved words are the cause of your problem, but it's hard to predict exactly when reserved words will cause query problems, so just rule out this possibility by using the square brackets.
Beyond that, stored text values sometimes contain extra non-visible characters. Check the lengths of the stored text values to see whether any are longer than expected.
SELECT
Len(deviceId) AS LenOfDeviceId,
Len([type]) AS LenOfType,
Len([timestamp]) AS LenOfTimestamp
FROM records;
In comments you mentioned spaces (ASCII value 32) in your stored values. I had been thinking we were dealing with other non-printable/invisible characters. If you have one or more actual space characters at the beginning and/or end of a stored deviceId value, the Trim() function will discard them. So this query will give you different length numbers in the two columns:
SELECT
Len(deviceId) AS LenOfDeviceId,
Len(Trim(deviceId)) AS LenOfDeviceId_NoSpaces
FROM records;
If the stored values can also include spaces within the string (not just at the beginning and/or end), Trim() will not remove those. In that case, you could use the Replace() function to discard all the spaces. Note however a query which uses Replace() must be run from inside an Access application session --- you can't use it from Java code.
SELECT
Len(deviceId) AS LenOfDeviceId,
Len(Replace(deviceId, ' ', '')) AS LenOfDeviceId_NoSpaces
FROM records;
If that query returns the same length numbers in both columns, then we are not dealing with actual space characters (ASCII value 32) ... but some other type of character(s) which look "space-like".
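If trailing spaces turn out to be the culprit, the original count can be rewritten with Trim() -- a sketch, bracketing the reserved words as suggested above:
SELECT Count(*) AS poor
FROM records
WHERE Trim([deviceId]) = '00019'
  AND [type] = 'Poor'
  AND [timestamp] BETWEEN #14-Sep-2012 01:01:01# AND #24-Sep-2012 01:01:01#;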
If you want to count devices with a specific type irrespective of deviceId then use this:
Select count(*) as excellent
from records where type='Poor'
If you want to count devices with a specific deviceId irrespective of type then use this:
Select count(*) as excellent
from records where deviceId='00019'