Is it possible to find dependency between PostgreSQL functions? - sql

I'm using PostgreSQL 9.2.10
Suppose you have two PostgreSQL functions, 'called_function' and 'caller_function', where the second one calls the first one:
CREATE FUNCTION called_function () RETURNS varchar AS
$BODY$
BEGIN
    RETURN 'something';
END;
$BODY$ LANGUAGE plpgsql;
CREATE FUNCTION caller_function () RETURNS varchar AS
$BODY$
BEGIN
    RETURN called_function ();
END;
$BODY$ LANGUAGE plpgsql;
Now, using SQL and knowing only function name, I would like to find out if 'caller_function' calls some other function. Or if 'called_function' is called by some other function. Is it possible?
I tried to get function's OID (let's say it is '123') and then I looked into pg_depend table:
SELECT * FROM pg_catalog.pg_depend dep WHERE dep.objid = 123 OR dep.objsubid = 123 OR dep.refobjid = 123 OR dep.refobjsubid = 123;
But it finds only pg_language and pg_namespace dependency. Nothing more.

I had the same problem when defining a function: it failed because of a dependency. I solved it by running this command before the CREATE FUNCTION commands:
SET check_function_bodies = false;
Hope this helps someone else.

Look at the pg_proc table, for example:
select nspname,proname,prosrc from pg_proc join pg_namespace nsp on (pronamespace=nsp.oid) where prosrc like '%called_function%'

Impossible in the general case; but a limited (restricted-domain) solution is perfectly doable --- and might prove adequate for your needs.
(The Most Obvious of the Many) Limitations
Fails (false negative) if name of callee (a function to be invoked) is specified as a quoted identifier.
Fails (false negative) if name of callee is passed as argument.
Fails (false negative) if name of callee is read from a relation at runtime.
Fails (false negative) if name of callee is assembled from tokens.
Fails (false positive) if name of callee is present just as literal.
Fails (false positive) if name of callee is present in a multi-line comment.
Does not account for function overloading.
Does not account for functions invoked via triggers.
Does not account for functions invoked per query-rewrite rules.
Does not account for effects of query rewriting rules.
Knows nothing about functions written in non-interpreted PLs like C.
Sample Output
 Your routine...                 | ...calls these routines:
---------------------------------+-------------------------------------------------
 create_silo_indexes             | {get_config__f_l__ea_silo,subst_silo_id}
 demux__id_creat_thread          | {}
 grow__sensor_thhourly           | {containing_hhour_t_begin}
SQL
WITH routine_names AS (
    SELECT DISTINCT(Lower(proname)) AS name --#0
    FROM pg_proc
    WHERE proowner = To_Regrole(current_role)
)
SELECT
    name AS "Your routine...",
    Array_Remove( --#8
        Array( --#7
            SELECT Unnest( --#5
                String_To_Array( --#4
                    Regexp_Replace( --#3
                        Regexp_Replace( --#2
                            Lower(PG_Get_Functiondef(To_Regproc(name))) --#1
                            , '--.*?\n', '', 'g'
                        )
                        , '\W+', ' ', 'g'
                    )
                    , ' '
                )
            )
            INTERSECT --#6
            SELECT name FROM routine_names
            ORDER BY 1
        )
        , name
    ) AS "...calls these routines:"
FROM
    routine_names;
How It Works
#0 Collect names of all the routines which could be callers and/or callees. We cannot handle overloaded functions correctly anyway, so just DISTINCT to save trouble later on; SQL is case-insensitive apart from quoted identifiers which we are not bothering with anyway, so we just Lower() to simplify comparison later.
#1 PG_Get_Functiondef() fetches complete text of the CREATE FUNCTION or CREATE PROCEDURE command. Again, Lower().
#2 Strip single-line comments. Note the lazy (non-greedy) *? quantifier: the usual * quantifier, if used here, would remove the first single-line comment plus all subsequent lines!
#3 Replace all characters other than letters and digits and _, with a space. Note the + quantifier: it ensures that 2+ contiguous removed characters are replaced by just 1 space.
#4 Split by spaces into an array; this array contains bits of SQL syntax, literals, numbers, and identifiers including routine names.
#5 Unnest the array into a rowset.
#6 INTERSECT with routine names; result will consist of routine names only.
#7 Convert rowset into an array.
#8 Since input was complete text of a CREATE FUNCTION f ... command, extracted routine names will obviously contain f itself; so we remove it with Array_Remove().
(SQL tested with PostgreSQL 12.1)
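The same token-intersection idea can be tried outside the database. The following is only a rough Python sketch of steps #1 to #8, not a translation of the query: the function name called_routines, the sample definition and the set of routine names are made up for illustration, and it inherits the limitations listed above (for instance, a name that appears only in a string literal is still reported).

```python
import re

def called_routines(definition, caller, routine_names):
    """Return the known routine names mentioned in one function definition."""
    text = definition.lower()                        # 1: compare in lower case
    text = re.sub(r"--.*?\n", "", text)              # 2: strip single-line comments
    text = re.sub(r"\W+", " ", text)                 # 3: runs of non-word characters become one space
    tokens = set(text.split())                       # 4, 5: split into individual tokens
    found = tokens & {n.lower() for n in routine_names}  # 6: keep only known routine names
    found.discard(caller.lower())                    # 8: the definition always mentions itself
    return sorted(found)                             # 7: ordered list of callees

# Hypothetical input: the text PG_Get_Functiondef() might return for caller_function.
definition = """CREATE FUNCTION caller_function() RETURNS varchar AS $BODY$
BEGIN
    -- helper_function is only mentioned in this comment
    RETURN called_function();
END;
$BODY$ LANGUAGE plpgsql;
"""
names = {"caller_function", "called_function", "helper_function"}
print(called_routines(definition, "caller_function", names))
```

Here helper_function is not reported because it occurs only inside a single-line comment, which step #2 removes before the tokens are compared.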


Unordered pattern/rule matching(?) with 'OR' capability on a PostgreSQL array field

Abstract
I'm writing code for a data analysis tool that interfaces with a PostgreSQL database and constructs an SQL query to filter to a set of rows based on user input. In broad terms, each row is a record containing a set of input data and an associated output/result. The utility I'm developing allows users to see different views of this data by applying filters to the input and output values.
There's a field in this table which contains an array of integers which represent the 'classes' of a set of entities, which is part of the 'input'. These classes have the most direct impact upon the output, so the particular assortment of values in this field is of particular importance to users of the system. There are twenty unique 'class' values, and the array typically has no more than six elements. There can, in certain circumstances, be two such arrays in a single record, and they may be queried either separately, or combined together into a single set of up to 12 values.
My system provides a freeform input where users can write filter criteria specifically to filter results based on the contents of this field. It allows the user to specify a list of class designations they wish to include in the filter clause, as well as any they wish to explicitly exclude. The grammar of this freeform input is based upon a preexisting community-defined syntax used outside this system to represent the data in question, and adapted here for the purpose of filtering.
Multiple entities in a given record may have the same 'class', so the same values can appear multiple times in the array, and the user can specify a constraint on the number of instances of each class value. The length of this array can also vary, but the user may only be interested in specific items, so the user may specify wildcards and place constraints upon the length of the array.
The arrays are unsorted, as the particular order (most notably, the value in the first position) can occasionally be of importance.
Examples
The data as stored in the database column is an array of integers, but for demonstration purposes, I will use textual class designations in the following example. Users input these textual designations in their queries, which are then translated by the system to numeric IDs.
Example field data: [A, B, B, E, B, D]
Example user inputs which would successfully match the above:
A B B B D E // Explicitly written, filters to rows matching this exact list of items. Order doesn't matter unless the user also selects an option to match the first entry explicitly.
6* // Array wildcard with length constraint; filters to any rows with an array length of 6.
2-3B * // Filters to any rows containing between two and three (inclusive) instances of B, and zero or more other non-B items (unconstrained array wildcard *).
A 2B 3XX // Filters to any rows containing at least one A, two B, and exactly three other items (class wildcard XX) of any class (which may also be A and/or B)
All of this currently works. My current method is to determine the potential upper/lower bounds of the instance counts (or lack thereof) of all specified classes, as well as that of the array length itself, and construct a query that checks those instance counts and array lengths and returns rows which successfully meet those criteria.
The problem...
All of the current syntax works great at the moment. It is purely of "AND" fashion, however -- and the #1 requested feature for this system is the introduction of an "OR" syntax, which is commonly used within the community to denote when certain sets of classes are considered interchangeable.
For example:
A B|C would match both [A,B] and [A,C].
3(B|C) would match [B,B,B], [C,C,C], [B,C,B], etc..
These kinds of queries are often more complex, with things like 2(A|B) 2(B|C|D) 2E not being uncommon. This potential for increasing complexity is where my brain starts to break down when trying to find a solution.
I believe that my current solution of tracking expected instance counts for each value is not inherently compatible with this (unless I'm simply overcomplicating things or overlooking something), but I have been at a loss for how better to approach it, made worse by the fact that I don't know what this type of issue is even called. I believe it would be considered a form of unordered pattern/rule matching, but that's quite a broad umbrella and my searches thus far have been fruitless.
I'm not really looking to be spoonfed a solution, but if there's anyone who recognizes the sort of problem I'm dealing with and has an idea of what topics I could research to figure it out on my own (particularly in the context of SQL queries), it would be immensely helpful.
Database notes
The data pool that a typical query is performed upon is a 30-day period with a subset of data spanning, on average, about 300,000 rows. This window can be increased, and it's not especially uncommon for users to perform long-term queries spanning many millions of rows. Performance is pretty important.
The SQL database in question is a replica of an external partner's database. It is replicated periodically via a binary copy operation, and thus the original format of the tables is largely maintained. Additional fields may be added to optimize access to certain types of data, but this must be done in a separate step during the replication process, and I'd prefer to avoid that if possible.
The problem as stated is very similar to regular expression matching, even though the unordered nature of the queries makes regular expressions alone not fully suitable. It can be solved by defining an aggregate function which relies on regular expressions.
Considering that:
Your arrays of integers can be converted to text starting with '{', ending with '}', and with ',' as separator.
Your queries can be converted to text consisting of a set of elements with a space as separator. Each element is a regular expression of any kind. In particular, an element may be a simple numeric string which represents an integer, or an element like '(A|B|C)' where A, B, C are numeric strings, which implements the 'OR' operator between these integers, etc.
Your queries may be either ordered or non-ordered: ordered means that the array of integers is evaluated according to the order of the elements in the query; non-ordered means that the array is evaluated against every element of the query without any order consideration between these elements.
Your queries may be strict or non-strict: strict means that the array of integers exactly matches the set of elements in the query, i.e. no additional integer exists in the array which doesn't match a query element; non-strict means that the array of integers may include some integers which do not match any element of the query.
The ordered and strict parameters of the query are independent of each other, i.e. users may need ordered and non-strict queries, or non-ordered and strict queries, etc.
The function check_function defined below should cover most of your use cases, including the 'OR' syntax:
CREATE OR REPLACE FUNCTION deduct
( str1 text
, str2 text
, reg text
) RETURNS text LANGUAGE sql IMMUTABLE AS
$$
SELECT CASE
WHEN res = COALESCE(str1,str2)
THEN NULL
ELSE res
END
FROM regexp_replace( COALESCE(str1,str2)
, reg
, ','
) AS res
$$ ;
DROP AGGREGATE IF EXISTS deduct_agg(text, text);
CREATE AGGREGATE deduct_agg
( str text
, reg text
)
( sfunc = deduct
, stype = text
) ;
-- this function returns true when init_string matches with the reg_string expression according to the parameters ordered_match and strict_match
CREATE OR REPLACE FUNCTION check_function
( init_string text -- string to be checked against the reg_string; in case of an array of integer, it must be converted into text before being passed to the function
, reg_string text -- set of elements separated by a space and individually used for checking the init_string iteratively
, ordered_match boolean -- true = the order of the elements in reg_string must be respected in init_string, false = every element in reg_string is individually checked in init_string without any matching order in init_string
, strict_match boolean -- true = the init_string must exactly match the reg_string, false = the init_string must match all the elements of the reg_string but may contain extra substrings which don't match
) RETURNS boolean LANGUAGE plpgsql IMMUTABLE AS
$$
DECLARE res boolean ;
BEGIN
CASE
WHEN ordered_match AND strict_match
THEN SELECT deduct_agg(init_string, '(,|{)' || r.reg || '(,|})$' ORDER BY r.id DESC) IS NOT DISTINCT FROM ','
INTO res
FROM regexp_split_to_table(reg_string,' ') WITH ORDINALITY AS r(reg,id) ;
WHEN NOT ordered_match AND strict_match
THEN SELECT deduct_agg(init_string, '(,|{)' || r.reg || '(,|})') IS NOT DISTINCT FROM ','
INTO res
FROM regexp_split_to_table(reg_string,' ') AS r(reg) ;
WHEN ordered_match AND NOT strict_match
THEN SELECT deduct_agg(init_string, '(,|{)' || r.reg || '(,|})') IS DISTINCT FROM NULL
INTO res
FROM regexp_replace(reg_string,' ', '.*','g') AS r(reg) ;
ELSE SELECT deduct_agg(init_string, '(,|{)' || r.reg || '(,|})') IS DISTINCT FROM NULL
INTO res
FROM regexp_split_to_table(reg_string,' ') AS r(reg) ;
END CASE ;
RETURN res ;
END ;
$$ ;
The following use cases should be supported :
"A B B B D E // Explicitly written, filters to rows matching this exact list of items. Order doesn't matter" ==> implemented as SELECT check_function(your_array_of_integers :: text, 'A B B B D E', false, true)
"6* // Array wildcard with length constraint; filters to any rows with an array length of 6." ==> implemented as SELECT check_function(your_array_of_integers :: text,'([0-9]+,){5}([0-9]+)',true,true). This use case can be generalized by replacing "6*" by "n*" and '{5}' by '{' || n-1 || '}' in the reg_string, where n is any integer > 1
"A 3B" with any order and strict ==> implemented as SELECT check_function(your_array_of_integers :: text, 'A B B B', false, true)
"A (B|C)" with no order and not strict ==> implemented as SELECT check_function(your_array_of_integers :: text, 'A (B|C)', false, false)
"3(B|C)" with no order and strict ==> implemented as SELECT check_function(your_array_of_integers :: text, '(B|C) (B|C) (B|C)', false, true)
"2(A|B) 2(B|C|D) 2E" with no order and not strict ==> implemented as SELECT check_function(your_array_of_integers :: text, '(A|B) (A|B) (B|C|D) (B|C|D) E E', false, false)
etc
The use cases which are not yet implemented:
"2-3B", but some additional work could make it happen; I don't see any blocking point. One idea would be to call the function check_function twice: SELECT check_function (..., 'B B', ..., ...) AND NOT check_function (..., 'B B B B', ..., ...)
"2-3B *" and "A 2B 3XX", because the wildcards * and XX are not clear to me in those cases.
PS: I'm a basic user of regular expressions and I don't use all the capabilities presented in the manual. Advice from an experienced regular expression user could bring a lot of value in your context.
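To make the non-ordered modes easier to reason about, here is a small Python sketch of the same "deduct one matching item per query element" idea. It is an illustration only, not a port of check_function: the name check_unordered is hypothetical, the classes are assumed to be already translated to numeric IDs, and each query element is treated as a regular expression that must match one whole array item.

```python
import re

def check_unordered(items, patterns, strict):
    """Greedy check: every pattern must consume one distinct, not yet used item."""
    remaining = [str(i) for i in items]
    for pattern in patterns:
        for pos, item in enumerate(remaining):
            if re.fullmatch(pattern, item):
                del remaining[pos]      # consume the first item this pattern matches
                break
        else:
            return False                # no unused item matches this pattern
    # strict: nothing may be left over; non-strict: leftovers are allowed
    return not remaining if strict else True

# "A 2(B|C)" against [A, B, B, E, B, D], with A=1, B=2, C=3, D=4, E=5 (assumed IDs)
print(check_unordered([1, 2, 2, 5, 2, 4], ["1", "(2|3)", "(2|3)"], strict=False))
# "3(B|C)" strict against [B, C, B]
print(check_unordered([2, 3, 2], ["(2|3)"] * 3, strict=True))
```

Because the sketch is greedy, it can reject an array that has a valid assignment: with patterns "(1|2)" then "1" against [1, 2], the first pattern consumes 1 and the second then finds nothing. Putting the most specific elements first avoids some of these cases; a fully general answer amounts to finding a matching between query elements and array items.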

Can 2 character length variables cause SQL injection vulnerability?

I am taking a text input from the user, then converting it into 2 character length strings (2-Grams)
For example
RX480 becomes
"rx","x4","48","80"
Now if I directly query server like below can they somehow make SQL injection?
select *
from myTable
where myVariable in ('rx', 'x4', '48', '80')
SQL injection is not a matter of length of anything.
It happens when someone adds code to your existing query. They do this by sending in the malicious extra code as a form submission (or something similar). When your SQL code executes, the database doesn't realize that there is now more than one thing to do. It just executes what it's told.
You could start with a simple query like:
select *
from thisTable
where something=$something
If the submitted value is ; DROP TABLE employees; and it is pasted into the query as-is, you end up with a query that looks like:
select *
from thisTable
where something=; DROP TABLE employees;
This is an odd example. But it does more or less show why it's dangerous. The first query will fail, but who cares? The second one will actually work. And if you have a table named "employees", well, you don't anymore.
Two characters in this case are sufficient to cause an error in the query and possibly reveal some information about it. For example, try the string ')480 and watch how your application behaves.
Although not much of an answer, this really doesn't fit in a comment.
Your code scans a table checking to see if a column value matches any pair of consecutive characters from a user supplied string. Expressed in another way:
declare @SearchString as VarChar(10) = 'Voot';
select Buffer, case
    when DataLength( Buffer ) != 2 then 0 -- NB: Len() right trims.
    when PatIndex( '%' + Buffer + '%', @SearchString ) != 0 then 1
    else 0 end as Match
from ( values
    ( 'vo' ), ( 'go' ), ( 'n ' ), ( 'po' ), ( 'et' ), ( 'ry' ),
    ( 'oo' ) ) as Samples( Buffer );
In this case you could simply pass the value of @SearchString as a parameter and avoid the issue of the IN clause.
Alternatively, the character pairs could be passed as a table parameter and used with IN: where Buffer in ( select CharacterPair from @CharacterPairs ).
As far as SQL injection goes, limiting the text to character pairs does preclude adding complete statements. It does, as others have noted, allow for corrupting the query and causing it to fail. That, in my mind, constitutes a problem.
I'm still trying to imagine a use-case for this rather odd pattern matching. It won't match a column value longer (or shorter) than two characters against a search string.
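As a rough illustration of passing the pairs as parameters instead of pasting them into the query text, here is a minimal Python sketch using the standard sqlite3 module. The table and column names are taken from the question; the helper names two_grams and find_matches, the in-memory database and the sample rows are assumptions made for the example.

```python
import sqlite3

def two_grams(text):
    """Split the input into lower-case pairs of consecutive characters."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def find_matches(conn, user_input):
    grams = two_grams(user_input)
    if not grams:
        return []
    # Only the placeholders are put into the SQL text; the values travel separately.
    placeholders = ", ".join("?" for _ in grams)
    sql = (f"SELECT myVariable FROM myTable "
           f"WHERE myVariable IN ({placeholders}) ORDER BY myVariable")
    return [row[0] for row in conn.execute(sql, grams)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (myVariable TEXT)")
conn.executemany("INSERT INTO myTable VALUES (?)",
                 [("rx",), ("48",), ("zz",), ("')",)])

print(find_matches(conn, "RX480"))   # ordinary input
print(find_matches(conn, "')480"))   # the quote and parenthesis are treated as plain data
```

With placeholders, the input ')480 neither breaks the statement nor changes its structure; the pair ') is simply compared against the column like any other value.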
There definitely should be a canonical answer to all these innumerable "if I have [some special kind of data treatment] will be my query still vulnerable?" questions.
First of all, you should ask yourself: why are you looking to buy yourself such an indulgence? What is the reason? Why do you want to add an exception to your data processing? Why separate your data into the sheep and the goats, telling yourself "this data is safe, so I won't process it properly, and that data is unsafe, so I'll have to do something"?
The only reason such a question could even appear is your application architecture. Or, rather, the lack of architecture. Only in spaghetti code, where user input is added directly to the query, can such a question ever occur. Otherwise, your database layer should be able to process any kind of data, being totally ignorant of its nature, origin or alleged "safety".

SQL not finding results

This query is currently returning no results, and it should. Can you see anything wrong with this query?
The field names are NEED_2_TARGET, ID, and CARD:
NEED_2_TARGET = integer
CARD = string
ID = integer
value of name is 'Ash Imp'
{this will check if a second target is needed}
//**************************************************************************
function TFGame.checkIf2ndTargetIsNeeded(name: string):integer;
//**************************************************************************
var
  targetType : integer; //1 is TCard , 2 is TMana , 0 is no second target needed.
begin
  TargetType := 0;
  Result := targetType;
  with adoquery2 do
  begin
    close;
    sql.Clear;
    sql.Add('SELECT * FROM Spells WHERE CARD = '''+name+''' and NEED_2_TARGET = 1');
    open;
  end;
  if adoquery2.RecordCount < 1 then
    Result := 0
  else
  begin
    Adoquery2.First;
    TargetType := adoquery2.FieldByName(FIELD_TARGET_TYPE).AsInteger;
    result := TargetType;
  end;
end;
sql db looks like below
ID CARD TRIGGER_NUMBER CATEGORY_NUMBER QUANTITY TARGET_NUMBER TYPE_NUMBER PLUS_NUMBER PERCENT STAT_TARGET_NUMBER REPLACEMENT_CARD_NUMBER MAX_RANDOM LIFE_TO_ADD REPLACED_DAMAGE NEED_2_TARGET TYPE_OF_TARGET
27 Ash Imp 2 2 15 14 1 1
There are a number of things that could be going wrong.
First and most important in your trouble-shooting is to take your query and run it directly against your database. I.e. first confirm your query is correct by eliminating possibilities of other things going wrong. More things confirmed working, the less "noise" to distract you from solving the problem.
As others have pointed out, if you're not clearing your SQL statement, you could be returning zero rows in your first result set.
Yes I know, you've since commented that you are clearing your previous query. The point is: if you're having trouble solving your problem, how can you be sure where the problem lies? So, don't leave out potentially relevant information!
Which brings us neatly to the second possibility. I can't see the rest of your code, so I have to ask: are you refreshing your data after changing your query? If you don't Close and Open your query, you may be looking at a previous execution's result set.
I'm unsure whether you're even allowed to change your query text while the component is Active, or even whether that depends on exactly which data access component you're using. The point is, it's worth checking.
Is your application connecting to the correct database? Since you're using Access, it's very easy to be connected to a different database file without realising it.
You can check this by changing your query to return all rows (i.e. delete the WHERE clause).
You may want to change the quotes used in your SQL query. Instead of: ...CARD = "'+name+'" ORDER... rather use ...CARD = '''+name+''' ORDER...
As far as I'm aware, single quotes are the ANSI standard for string literals. Even if some databases permit double quotes, using them limits portability, and may produce unexpected results when passed through certain data access drivers.
Check the datatype of your CARD column. If it's a fixed length string, then the data values will be padded. E.g. if CARD is char(10), then you might actually need to look for 'Ash Imp '.
Similarly, the actual value may contain spaces before / after the words. Use select without WHERE and check the actual value of the column. You could also check whether SELECT * FROM Spells WHERE CARD LIKE '%Ash Imp%' works.
Finally, as others have suggested, you're better off using a parameterised query rather than dynamically building the query up yourself.
Your code will be more readable and flexible.
You can make your code strongly typed; and so avoid converting things like numbers and dates into strings.
You won't need to worry about the peculiarities of date formatting.
You eliminate some security concerns.
@GordonLinoff all fields in db are all caps
If that is true, then that may be your problem. Depending on the database and its collation, SQL comparisons of character/string values can be case sensitive unless you tell it otherwise, such as with STRCMP() (MySQL 4+), LOWER() or UPPER() (SQLServer, Firebird), etc. I would also go as far as wrapping the conditions in parentheses:
sql.Text := 'SELECT * FROM Spells WHERE (NEED_2_TARGET = 1) AND (STRCMP(CARD, "'+name+'") = 0) ORDER by ID';
sql.Text := 'SELECT * FROM Spells WHERE (NEED_2_TARGET = 1) AND (LOWER(CARD) = "'+LowerCase(name)+'") ORDER by ID';
sql.Text := 'SELECT * FROM Spells WHERE (NEED_2_TARGET = 1) AND (UPPER(CARD) = "'+UpperCase(name)+'") ORDER by ID';
The problem turned out to be the with block:
With Adoquery2 do
begin
...
end
Inside the with block, name in the SQL string resolved to adoquery2.Name rather than the function parameter name. I fixed this by renaming the parameter to Cname and had no more issues after that.

Convert numeric to string inside a user-defined function

I am trying to convert a numeric variable into a string inside a user-defined function. I was thinking about using to_char, but it didn't work.
My function is like this:
create or replace function ntile_loop(x numeric)
returns setof numeric as
$$
select
max("billed") as _____(to_char($1,'99')||"%"???) from
(select "billed", "id","cm",ntile(100)
over (partition by "id","cm" order by "billed")
as "percentile" from "table_all") where "percentile"=$1
group by "id","cm","percentile";
$$
language sql;
My purpose is to define a new variable "x%" as its name, with x varying as the function input. In context, x is numeric and will be called again later in the function as a numeric (this part of code wasn't included in the sample above).
What I want to return:
I simply want to return a block of code so that every time I change the percentile number, I don't have to run this block of code again and again. I'd like to calculate 5, 10, 20, 30, ....90th percentile and display all of them in the same table for each id+cm group.
That's why I was thinking about macro or function, but didn't find any solutions I like.
Thank you for your answers. Yes, I will definitely read basics while I am learning. Today's my second day to use SQL, but have to generate some results immediately.
Converting numeric to text is the least of your problems.
My purpose is to define a new variable "x%" as its name, with x varying as the function input.
First of all: there are no variables in an SQL function. SQL functions are just wrappers for valid SQL statements. Input and output parameters can be named, but names are static, not dynamic.
You may be thinking of a PL/pgSQL function, where you have procedural elements including variables. Parameter names are still static, though. There are no dynamic variable names in plpgsql. You can execute dynamic SQL with EXECUTE but that's something different entirely.
While it is possible to declare a static variable with a name like "123%" it is really exceptionally uncommon to do so. Maybe for deliberately obfuscating code? Other than that: Don't. Use proper, simple, legal, lower case variable names without the need to double-quote and without the potential to do something unexpected after a typo.
Since the window function ntile() returns integer and you run an equality check on the result, the input parameter should be integer, not numeric.
To assign a variable in plpgsql you can use the assignment operator := for a single variable or SELECT INTO for any number of variables. Either way, you want the query to return a single row or you have to loop.
If you want the maximum billed from the chosen percentile, you don't GROUP BY x, y. That might return multiple rows and does not do what you seem to want. Use plain max(billed) without GROUP BY to get a single row.
You don't need to double quote perfectly legal column names.
A valid function might look like this. It's not exactly what you were trying to do, which cannot be done. But it may get you closer to what you actually need.
CREATE OR REPLACE FUNCTION ntile_loop(x integer)
RETURNS SETOF numeric as
$func$
DECLARE
myvar text;
BEGIN
SELECT INTO myvar max(billed)
FROM (
SELECT billed, id, cm
,ntile(100) OVER (PARTITION BY id, cm ORDER BY billed) AS tile
FROM table_all
) sub
WHERE sub.tile = $1;
-- do something with myvar, depending on the value of $1 ...
END
$func$ LANGUAGE plpgsql;
Long story short, you need to study the basics before you try to create sophisticated functions.
Plain SQL
After Q update:
I'd like to calculate 5, 10, 20, 30, ....90th percentile and display all of them in the same table for each id+cm group.
This simple query should do it all:
SELECT id, cm, tile, max(billed) AS max_billed
FROM (
SELECT billed, id, cm
,ntile(100) OVER (PARTITION BY id, cm ORDER BY billed) AS tile
FROM table_all
) sub
WHERE (tile%10 = 0 OR tile = 5)
AND tile <= 90
GROUP BY 1,2,3
ORDER BY 1,2,3;
% .. modulo operator
GROUP BY 1,2,3 .. positional references to the first three output columns (id, cm, tile)
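To see which rows the WHERE clause keeps, the following Python sketch imitates the query for a single id+cm group. It is a simplified model, not the database's implementation: the helper names are hypothetical, and ntile here follows the usual rule that buckets are as equal in size as possible, with the larger buckets first.

```python
def ntile(n_rows, buckets):
    """Bucket number (1..buckets) for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, buckets)
    tiles = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)  # the first 'extra' buckets get one more row
        tiles.extend([bucket] * size)
    return tiles

def max_per_selected_tile(billed, buckets=100):
    """max(billed) for tiles 5, 10, 20, ..., 90 within one group."""
    ordered = sorted(billed)                          # ORDER BY billed
    result = {}
    for value, tile in zip(ordered, ntile(len(ordered), buckets)):
        if (tile % 10 == 0 or tile == 5) and tile <= 90:   # same filter as the WHERE clause
            result[tile] = max(value, result.get(tile, value))
    return result

# Made-up data: 200 rows billed 1..200, so every tile holds exactly two rows.
print(max_per_selected_tile(range(1, 201)))
```

With this sample data tile 5 covers the values 9 and 10, so its maximum is 10, and tile 90 covers 179 and 180.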
It looks like you're looking for return query execute, returning the result from a dynamic SQL statement:
http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html
http://www.postgresql.org/docs/current/static/plpgsql-statements.html

PostgreSQL ORDER BY issue - natural sort

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record into the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip the letters from em_code using a regular expression and store them in ec_alpha
Cast the remaining part to an integer, ec_num
Increment by one: ec_num++
Pad with sufficient zeros and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
The first step will return EM999 instead of EM1000, and it will again generate EM1000 as the new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 10, which added ICU collation support, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
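For comparison, the insert steps from the question can be sketched in application code once the numeric part is compared as a number rather than as text. This is only an illustration in Python: next_em_code is a hypothetical helper, and it assumes every code is a non-digit prefix followed by digits, padded to at least three digits.

```python
import re

def next_em_code(codes, width=3):
    """Return the code that follows the numerically largest existing code."""
    def split(code):
        match = re.fullmatch(r"(\D*)(\d+)", code)    # prefix, then digits
        return match.group(1), int(match.group(2))
    # Compare by the integer value, so 1000 ranks above 999.
    prefix, number = max((split(c) for c in codes), key=lambda pair: pair[1])
    return f"{prefix}{number + 1:0{width}d}"          # pad with zeros, re-attach the prefix

print(next_em_code(["EM001", "EM999", "EM1000"]))
print(next_em_code(["EM001", "EM002"]))
```

Note that a read-then-insert sequence like this can still produce duplicates when two sessions run it at the same time, so the unique constraint remains the final safeguard.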
This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
You can use just this line:
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought of another way of doing this that uses less database storage than padding and takes less time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.
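The padding idea can be checked quickly outside the database. The Python sketch below mirrors the natsort function above under the same assumption of at most 20 digits per number; natsort_key is a name made up for this example, and it is meant as an illustration rather than a replacement for the SQL function.

```python
import re

def natsort_key(s, width=20):
    """Build a plain string key in which every run of digits is zero-padded."""
    parts = re.findall(r"(\D*)(\d*)", s)             # (non-digits, digits) pairs
    return "".join(text + "\x01" + digits.rjust(width, "0")
                   for text, digits in parts)

codes = ["EM1000", "EM001", "EM999"]
print(sorted(codes, key=natsort_key))
print(max(codes, key=natsort_key))
```

Since the key is an ordinary string, comparing keys with plain string ordering is enough to rank EM1000 after EM999, which is what makes the SQL version easy to index.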