How does the SQL length function handle unicode graphemes?


Consider the following scenario where I have the string É defined by \U00000045\U00000301.
1) https://www.fileformat.info/info/unicode/char/0045/index.htm
2) https://www.fileformat.info/info/unicode/char/0301/index.htm
Would a column constrained by varchar(1) treat it as a valid 1-character input, or would it be rejected because it is considered a 2-character input?
How does SQL treat the length of strings with graphemes in them generally?

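I probably look silly with this query, but still. In PostgreSQL with a UTF8 server encoding, char_length counts code points, not graphemes, so this string has length 2 (and occupies 3 bytes), and an explicit cast to varchar(1) truncates it to a bare E: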
t=# with c(u) as (values( e'\U00000045\U00000301'))
select u, u::varchar(1), u::varchar(2),char_length(u), octet_length(u) from c;
u | u | u | char_length | octet_length
---+---+---+-------------+--------------
É | E | É | 2 | 3
(1 row)
Edit: the server encoding and database settings used above:
t=# show server_encoding ;
server_encoding
-----------------
UTF8
(1 row)
t=# \l+ t
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+-------+----------+---------+-------+-------------------+-------+------------+-------------
t | vao | UTF8 | C | UTF-8 | | 51 MB | pg_default |
(1 row)
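
The same distinction between code points, bytes, and graphemes can be reproduced outside the database. The following is a minimal Python sketch, not part of the original answer; the grapheme count is only an approximation (it treats every non-combining code point as the start of a new cluster, which is not the full Unicode segmentation algorithm), and the variable names are illustrative.

```python
import unicodedata

# "E" followed by U+0301 COMBINING ACUTE ACCENT, as in the question.
decomposed = "\U00000045\U00000301"

# Number of code points: the quantity char_length() reported above.
code_points = len(decomposed)

# Number of bytes in UTF-8: the quantity octet_length() reported above.
utf8_bytes = len(decomposed.encode("utf-8"))

# Rough grapheme count: each code point that is not a combining mark
# starts a new cluster. Approximation only, not full UAX #29 segmentation.
approx_graphemes = sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)

# NFC normalization composes the pair into the single code point U+00C9.
composed = unicodedata.normalize("NFC", decomposed)

print(code_points, utf8_bytes, approx_graphemes, len(composed))
```

If a one-character limit has to hold for such input, normalizing to NFC before inserting is one option, although not every grapheme has a precomposed single-code-point form.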

Related

Cast VARCHAR columns to int, bigint, time, etc (PL/pgSQL)

Problem
(This is for an open source, analytics library.)
Here's our query results from events_view:
id | visit_id | name | prop0 | prop1 | url
------+----------+--------+----------------------------+-------+------------
2004 | 4 | Magnus | 2021-10-26 02:25:55.790999 | 142 | cnn.com
2007 | 4 | Hartis | 2021-10-26 02:26:37.773999 | 25 | fox.com
Currently all columns except id are VARCHAR.
Column | Type | Collation | Nullable | Default
----------+-------------------+-----------+----------+---------
id | bigint | | |
visit_id | character varying | | |
name | character varying | | |
prop0 | character varying | | |
prop1 | character varying | | |
url | character varying | | |
They should be something like
Column | Type | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------
id | bigint | | |
visit_id | bigint | | |
name | character varying | | |
prop0 | time without time zone | | |
prop1 | bigint | | |
url | character varying | | |
Desired result
Hardcoding these casts, as in SELECT visit_id::bigint, name::varchar, prop0::time, prop1::integer, url::varchar FROM tbl, won't do: column names are known only at run time.
To simplify things we could cast each column into only three types: boolean, numeric, or varchar. Use regexps below for matching types:
boolean: ^(true|false|t|f)$
numeric: ^(|-)[0-9]+(|\.[0-9]+)$
varchar: every result that does not match boolean and numeric above
What SQL would discover the type of each column and cast the columns dynamically?

These are a few ideas rather than a true solution for this tricky job. A slow but very reliable function can be used instead of regular expressions.
create or replace function can_cast(s text, vtype text)
returns boolean language plpgsql immutable as
$body$
begin
execute format('select %L::%s', s, vtype);
return true;
exception when others then
return false;
end;
$body$;
Data may be presented like this (partial list of columns from your example)
create or replace temporary view tv(id, visit_id, prop0, prop1) as
values
(
2004::bigint,
4::bigint,
case when can_cast('2021-10-26 02:25:55.790999', 'time') then '2021-10-26 02:25:55.790999'::time end,
case when can_cast('142', 'bigint') then '142'::bigint end
), -- determine the types
(2007, 4, '2021-10-26 02:26:37.773999', 25)
-- the rest of the data here;
I believe that it is possible to generate the temporary view DDL dynamically as a select from events_view too.
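
As a rough illustration of the simplified three-type rule from the question, here is a Python sketch that classifies each column from its values and builds the SELECT list. It is not the answerer's method; the helper name infer_type and the sample dictionary are hypothetical, and the numeric pattern assumes the intended regex means "optional minus sign, digits, optional decimal part".

```python
import re

# Patterns from the question; the numeric one is an assumed reading
# (optional leading minus, optional fractional part).
BOOLEAN_RE = re.compile(r"(true|false|t|f)")
NUMERIC_RE = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def infer_type(values):
    """Return 'boolean', 'numeric', or 'varchar' for a column's values."""
    non_null = [v for v in values if v is not None]
    if non_null and all(BOOLEAN_RE.fullmatch(v) for v in non_null):
        return "boolean"
    if non_null and all(NUMERIC_RE.fullmatch(v) for v in non_null):
        return "numeric"
    return "varchar"


# Sample values copied from the events_view rows shown in the question.
columns = {
    "visit_id": ["4", "4"],
    "name": ["Magnus", "Hartis"],
    "prop0": ["2021-10-26 02:25:55.790999", "2021-10-26 02:26:37.773999"],
    "prop1": ["142", "25"],
    "url": ["cnn.com", "fox.com"],
}

inferred = {name: infer_type(vals) for name, vals in columns.items()}

# Build the cast list; identifiers are not quoted here, so this assumes
# plain lowercase column names.
select_list = ", ".join(f"{name}::{typ}" for name, typ in inferred.items())
query = f"SELECT {select_list} FROM events_view"
print(query)
```

Under these simplified rules prop0 falls back to varchar, because timestamps match neither pattern; a can_cast-style check would be needed to recognise it as a time.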

Length of SQL CHAR column is always at maximum regardless of content [duplicate]

I'm looking for the FlameRobin equivalent of Select LEN(1234) from x, which returns 4, for my Char field.
All I can find is char_length, which returns the field's maximum length, not the length of the field's contents.

Because that is the difference between SQL datatypes CHAR (fixed length, always right-padded with spaces, like in DBF and other tabular formats of that age) and VARCHAR (variable-length, may be shorter than max length).
And your query is NOT a query you are really using!
The query you suggest DOES return exactly 4 in Firebird.
The following db<>fiddle session demonstrates this:
select rdb$get_context('SYSTEM', 'ENGINE_VERSION') as version
, rdb$character_set_name
from rdb$database;
VERSION | RDB$CHARACTER_SET_NAME
:------ | :---------------------------------------------------------------------------------------------------------------------------
3.0.5 | UTF8
Select char_LENgth(1234) from rdb$database
| CHAR_LENGTH |
| ----------: |
| 4 |
create table T (
i integer,
c char(20),
v varchar(20)
)
✓
insert into T values (1234, 1234, 1234)
1 rows affected
select * from T
I | C | V
---: | :------------------------------------------------------------------------------- | :---
1234 | 1234 | 1234
Select
char_length(1234) as const
, char_length(i) as int_to_char
, char_length(c) as fixed_char
, char_length(v) as var_char
, char_length(trim(c)) as char_t
, char_length(cast(trim(c) as varchar(20))) as char_t_v
, char_length(trim(cast(c as varchar(20)))) as char_v_t
from T
CONST | INT_TO_CHAR | FIXED_CHAR | VAR_CHAR | CHAR_T | CHAR_T_V | CHAR_V_T
----: | ----------: | ---------: | -------: | -----: | -------: | -------:
4 | 4 | 20 | 4 | 4 | 4 | 4

This is exactly what should happen. If you store "HELLO" in a CHAR(20) field, you will get a 20 character string on output (it might be trimmed somewhere along the path, so you don't realize that the initial size is always padded to, or truncated to, 20).
Either use VARCHAR type, or you'll have to do something like CHAR_LENGTH(TRIM(FieldName)) to get the "perceived length" of the string.
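
The padding behaviour is easy to mimic outside the database. The Python sketch below is only an analogy for CHAR(20) versus the trimmed value; the helper char_store is hypothetical and models nothing beyond right-padding with spaces.

```python
def char_store(value: str, n: int) -> str:
    """Mimic CHAR(n) storage by right-padding the value with spaces to n characters."""
    return value.ljust(n)


fixed = char_store("1234", 20)

# Length of the padded value, analogous to CHAR_LENGTH(c) on a CHAR(20) column.
padded_length = len(fixed)

# Length after trimming, analogous to CHAR_LENGTH(TRIM(c)).
perceived_length = len(fixed.strip())

print(padded_length, perceived_length)
```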

Filtering records not containing numbers

I have a table that stores numbers in string format. Ideally each value should be a 10-digit number in string format, but the table contains many junk values. I want to select the records that are not in the expected format.
Below is the sample table that I have:
+---------------+--------+----------------------------------+
| ID_UID | Length | ##Comment |
+---------------+--------+----------------------------------+
| +112323456705 | 13 | Contains special character |
| 4323456432 | 11 | Contains blank |
| 3423122334 | 10 | As expected, 10 character number |
| 6758439239 | 10 | As expected, 10 character number |
| 58_4323129 | 10 | Contains special character |
| 4567$%6790 | 10 | Contains special character |
| 45684938901 | 11 | Is 11 characters |
| 4568 38901 | 10 | Contains blank |
+---------------+--------+----------------------------------+
Expected Output:
+---------------+--------+----------------------------+
| ID_UID | Length | ##Comment |
+---------------+--------+----------------------------+
| +112323456705 | 13 | Contains special character |
| 4323456432 | 11 | Contains blank |
| 58_4323129 | 10 | Contains special character |
| 4567$%6790 | 10 | Contains special character |
| 45684938901 | 11 | Is 11 characters |
| 4568 38901 | 10 | Contains blank |
+---------------+--------+----------------------------+
Basically, I want all the records that are not exactly 10-digit numbers.
I have tried out below query:
SELECT *
FROM t1
WHERE ID_UID LIKE '%[^0-9]%'
But this does not return any records.
I have created a fiddle for this.
P.S. The columns length and ##Comment are illustrative in nature.

You want RLIKE not LIKE:
SELECT *
FROM t1
WHERE ID_UID RLIKE '[^0-9]'
Note that % is a LIKE wildcard, not a regular expression wildcard. Also, regular expressions match the pattern anywhere it occurs, so no wildcards are needed for the beginning and end of the string.
If you want to find values that are not ten digits, then be explicit:
SELECT *
FROM t1
WHERE ID_UID NOT RLIKE '^[0-9]{10}$'
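
To check the anchored pattern against the sample data without a database, the same test can be sketched in Python. This is an illustration only; the value list is copied from the question's table, with the trailing blank in the second value restored as an assumption based on its stated length of 11.

```python
import re

# Sample values from the question; the second one is assumed to end in a blank.
values = [
    "+112323456705",
    "4323456432 ",
    "3423122334",
    "6758439239",
    "58_4323129",
    "4567$%6790",
    "45684938901",
    "4568 38901",
]

# fullmatch anchors the pattern at both ends, like ^...$ in the RLIKE version.
TEN_DIGITS = re.compile(r"[0-9]{10}")

not_ten_digits = [v for v in values if TEN_DIGITS.fullmatch(v) is None]

for v in not_ten_digits:
    print(repr(v))
```

Only the two clean 10-digit values are excluded, which matches the expected output in the question.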

PostgreSQL: restoring table from sql using psql return 'ERROR: invalid input syntax for integer'

I am trying to restore an SQL dump that looks like this:
COPY table_name (id, oauth_id, foo, bar) FROM stdin;
1 142 \N xxxxxxx
2 142 \N yyyyyyy
<dozen similar lines>
last line in this dump: \.
command to restore:
psql < table.sql
or
psql --file=dump.sql
\d+ table_name:
Table "public.table_name"
Column | Type | Modifiers | Storage | Stats target | Description
---------------------+-----------------------+-------------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('connected_table_name_id_seq'::regclass) | plain | |
oauth_id | integer | not null | plain | |
foo | character varying | | extended | |
bar | character varying | | extended | |
Sadly, it looks like the standard method for backup and restore does not work :(
psql version: 9.5.4; server version: 9.5.2

Calculating the size of a column type in Postgresql

I am trying to figure out how to determine the size of a specific column in a database. For instance, I have two columns called sourceip and destinationip that are both 16-byte fields.
I thought this would be somewhere in the information_schema or in \d+, but I cannot find a specific command to isolate the size of each column type.
Can you calculate the size of a column type in the database, or do you just have to look up the byte size of each type in the PostgreSQL documentation?

Only a few types in PostgreSQL have a fixed length; almost all types are varlena types, which have a dynamic length. You can check with queries like these:
postgres=# select typlen from pg_type where oid = 'int'::regtype::oid;
typlen
--------
4
(1 row)
postgres=# select attlen from pg_attribute where attrelid = 'x'::regclass and attname = 'a';
attlen
--------
4
(1 row)
When the result is -1, the type does not have a fixed length (it is a varlena type).
For varlena types, use the pg_column_size function:
postgres=# \df *size*
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+------------------------+------------------+---------------------+--------
pg_catalog | pg_column_size | integer | "any" | normal
pg_catalog | pg_database_size | bigint | name | normal
pg_catalog | pg_database_size | bigint | oid | normal
pg_catalog | pg_indexes_size | bigint | regclass | normal
pg_catalog | pg_relation_size | bigint | regclass | normal
pg_catalog | pg_relation_size | bigint | regclass, text | normal
pg_catalog | pg_size_pretty | text | bigint | normal
pg_catalog | pg_size_pretty | text | numeric | normal
pg_catalog | pg_table_size | bigint | regclass | normal
pg_catalog | pg_tablespace_size | bigint | name | normal
pg_catalog | pg_tablespace_size | bigint | oid | normal
pg_catalog | pg_total_relation_size | bigint | regclass | normal
(12 rows)
postgres=# select pg_column_size('Hello');
pg_column_size
----------------
6
(1 row)
postgres=# select pg_column_size(10);
pg_column_size
----------------
4
(1 row)
postgres=# select pg_column_size(now());
pg_column_size
----------------
8
