I have a column that holds data in the following format:
\d my_table
Column | Type | Modifiers
-------------+--------------------------+---------------
message | character varying |
The message column contains strings such as:
select message from my_table;
message
-----------------------
\u6771
How can I print its human-readable counterpart, 東? I don't know how to use the E'' escape-string syntax when selecting a column, and I don't have write access to the database, so I can't create functions.
I've got a bunch of fields whose values are double-quoted and preceded by delimiters, but for the life of me I can't get any regex to pull out what I need.
In short - the delimiters can be in any order and I just need the value that's between the double quotes after each delimiter. Some sample data is below, can anyone help with what regex might extract each value? I've tried
'delimiter_1=\\W+\\w+'
but I only seem to get the first word after the delimiter (unfortunately, the values do contain spaces).
some content delimiter_1="some value" delimiter_2="some other value" delimiter_4="another value" delimiter_3="the last value"
The problem is returning a varying number of values from the regex function. For example, if you know that there will be 4 delimiters, you can use REGEXP_SUBSTR for each match, but if the number of delimiters in the text varies, this approach doesn't work.
I think the best solution is to write a function to parse the text:
create or replace function superparser( SRC varchar )
returns array
language javascript
as
$$
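// Capture each key and the double-quoted value that follows it.
const regexp = /([^ =]*)="([^"]*)"/gm;
// Each match is [whole match, key, value]; spread the iterator into an array.
const array = [...SRC.matchAll(regexp)];
return array;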
$$;
Then you can use LATERAL FLATTEN to process the returning values from the function:
select f.VALUE[1]::STRING key, f.VALUE[2]::STRING value
from values ('some content delimiter_1="some value" delimiter_2="some other value" delimiter_4="another value" delimiter_3="the last value"') tmp(x),
lateral flatten( superparser(x) ) f;
+-------------+------------------+
| KEY | VALUE |
+-------------+------------------+
| delimiter_1 | some value |
| delimiter_2 | some other value |
| delimiter_4 | another value |
| delimiter_3 | the last value |
+-------------+------------------+
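To check what the regular expression captures before deploying the function, the same pattern can be tried outside Snowflake. The following is only a minimal Python sketch of that idea, not part of the answer's function; the name parse_pairs is made up for illustration.

```python
import re

# Same pattern as in superparser: a key made of characters other than
# space or '=', followed by ="...", capturing the key and the quoted text.
PAIR_RE = re.compile(r'([^ =]*)="([^"]*)"')


def parse_pairs(src):
    """Return (key, value) tuples in the order they appear in src."""
    return PAIR_RE.findall(src)


sample = (
    'some content delimiter_1="some value" delimiter_2="some other value" '
    'delimiter_4="another value" delimiter_3="the last value"'
)

for key, value in parse_pairs(sample):
    print(key, "->", value)
```

Because the pattern does not depend on the order or the number of delimiters, it returns however many pairs the text contains.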
It seemed that Athena was including CSV column headers in my query results. I recreated the tables with the DDL included below using TBLPROPERTIES ("skip.header.line.count"="1") to remove the headers.
I'm running the following queries to validate that the CREATE TABLE DDL worked. The only difference between the two queries is the use of single vs. double quotes in the WHERE clause, yet they return different results.
Query 1:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
The query above returns the actual data (see sample table below), rather than only rows where the file_name field is "file_name".
+-------+--------------------+
| Row # | file_name |
+-------+--------------------+
| 1 | |
| 2 | 1586786323.8194735 |
| 3 | |
| 4 | 1586858857.3117666 |
| 5 | 1586858857.3117666 |
| 6 | 1586858857.3117666 |
| ... | |
+-------+--------------------+
Query 2:
SELECT
file_name
FROM table
WHERE file_name = 'file_name'
The query above returns no results, as expected if the CSV column headers are not being included in the results.
I'm quite confused by the first query returning any results at all. I've scoured the AWS documentation, and it doesn't seem that I did anything wrong with the DDL. I also assumed SQL would not care whether I use single or double quotes. What am I missing here?
DDL:
CREATE EXTERNAL TABLE `table` (
`file_name` string,
`ticker` string,
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION
's3://{bucket_name}/{folder}/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Single quotes are the SQL standard for delimiting strings.
Double quotes are used for delimiting identifiers. So "file_name" refers to the column of that name. Some databases also accept double quotes for strings. That is just confusing. Don't do that.
Take the tags on your original question, for instance: Hive uses backticks to escape identifiers and accepts double quotes for strings, whereas Presto, the engine Athena uses to run queries, uses double quotes to delimit identifiers (which is the standard).
Just to expand on Gordon's answer a little. Your first query:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
In this case, the double quotes are causing the query engine to treat "file_name" as a column identifier, not a value, so that query is functionally the same as:
SELECT
file_name
FROM table
WHERE file_name = file_name
Written that way, the condition is obviously true for every row (as long as file_name is not NULL), so the full table is returned.
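The effect is easy to reproduce in any engine that reads a double-quoted name as an identifier. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Athena. That is an assumption for illustration only: SQLite is not Presto, but it also resolves a double-quoted name to a column when one exists. The table name files and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("",), ("1586786323.8194735",), ("1586858857.3117666",)],
)

# Double quotes: "file_name" is the column, so the column is compared to itself.
double_quoted = conn.execute(
    'SELECT file_name FROM files WHERE file_name = "file_name"'
).fetchall()

# Single quotes: 'file_name' is a string literal, and no row holds that text.
single_quoted = conn.execute(
    "SELECT file_name FROM files WHERE file_name = 'file_name'"
).fetchall()

print(len(double_quoted))  # 3
print(len(single_quoted))  # 0
```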
In BigQuery, how do we format a number in the result set so that it has no commas, for example turning 1,000,000 into 1000000?
I am assuming that your data type is string here.
You can use the REGEXP_REPLACE function to remove certain symbols from strings.
SELECT REGEXP_REPLACE("1,000,000", r',', '') AS Output
Returns:
+-----+---------+
| Row | Output |
+-----+---------+
| 1 | 1000000 |
+-----+---------+
If your data contains strings both with and without commas, the function returns the ones without commas unchanged, so you don't need to filter the input.
Documentation for this function can be found in the BigQuery reference for string functions.
I would like to extract particular characters from strings using Standard SQL.
I would like to extract the characters after limit=.
For instance, from the strings below I would like to extract 10, 3 and null. Wherever the result is null, I would also like to replace it with 1.
partner=&limit=10
partner=aex&limit=3&filters%5Bpartner%5D
partner=aex&limit=&filters%5Bpartner%5D
I only know how to use the substring function, but the problem here is that the position of limit= is not always the same.
You can use REGEXP_EXTRACT. For example:
SELECT REGEXP_EXTRACT('partner=aex&limit=3&filters%5Bpartner%5D', 'limit=(\\d+)');
+-------+
| $col1 |
+-------+
| 3 |
+-------+
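The pattern does not depend on where limit= sits in the string, and it matches nothing when no digits follow, which is the case the question wants mapped to 1. The following is a minimal Python sketch of that extract-then-default logic, not BigQuery code; the name extract_limit and its default parameter are made up for illustration. In the query itself, wrapping REGEXP_EXTRACT in COALESCE with a fallback of '1' would be the likely analogue.

```python
import re

# Same pattern as in the query above: the digits that follow "limit=".
LIMIT_RE = re.compile(r"limit=(\d+)")


def extract_limit(query_string, default=1):
    """Return the number after limit=, or default when there is none."""
    match = LIMIT_RE.search(query_string)
    return int(match.group(1)) if match else default


samples = [
    "partner=&limit=10",
    "partner=aex&limit=3&filters%5Bpartner%5D",
    "partner=aex&limit=&filters%5Bpartner%5D",
]

for s in samples:
    print(extract_limit(s))  # prints 10, then 3, then 1
```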
From Database System Concepts
When comparing two values of type char, if they are of different lengths extra spaces are automatically added to the shorter one to make them the same size, before comparison.
When comparing a char type with a varchar type, one may expect extra spaces to be added to the varchar type to make the lengths equal, before comparison; however, this may or may not be done, depending on the database system. As a result, even if the same value “Avi” is stored in the attributes A and B above, a comparison A=B may return false.
We recommend you always use the varchar type instead of the char type to avoid these problems.
Could you give some examples that explain comparing two values of type char, and comparing two values of a varchar type? Which operator(s) are used in the comparison? Is it =?
What problems can using the varchar type instead of the char type avoid? Why?
This is about SQL in general, and I guess it also applies to PostgreSQL, since it follows the SQL standard closely.
Thanks.
The basic issue is that char will pad the value with spaces and this can lead to some surprising and inconsistent results.
Here we see Postgres retains the trailing space.
test=> create table foo ( c char(10), v varchar(10) );
CREATE TABLE
test=> insert into foo values ('foo', 'foo');
INSERT 0 1
test=> select * from foo;
     c      |  v
------------+-----
 foo        | foo
test=> select concat(c, '>'), concat(v, '>') from foo where c = 'foo ';
   concat    | concat
-------------+--------
 foo       > | foo>
MySQL, by contrast, strips the trailing spaces on retrieval unless the PAD_CHAR_TO_FULL_LENGTH SQL mode is set.
mysql> create table foo ( c char(10), v varchar(10) );
mysql> insert into foo values ('foo', 'foo');
mysql> select * from foo;
+------+------+
| c    | v    |
+------+------+
| foo  | foo  |
+------+------+
mysql> select concat(c, '>'), concat(v, '>') from foo where c = 'foo ';
+----------------+----------------+
| concat(c, '>') | concat(v, '>') |
+----------------+----------------+
| foo>           | foo>           |
+----------------+----------------+
mysql> set sql_mode = 'PAD_CHAR_TO_FULL_LENGTH';
mysql> select concat(c, '>'), concat(v, '>') from foo where c = 'foo ';
+----------------+----------------+
| concat(c, '>') | concat(v, '>') |
+----------------+----------------+
| foo       >    | foo>           |
+----------------+----------------+
The PostgreSQL documentation outlines several issues.
Values of type character are physically padded with spaces to the specified width n, and are stored and displayed that way.
...trailing spaces are treated as semantically insignificant and disregarded when comparing two values of type character. In collations where whitespace is significant, this behavior can produce unexpected results; for example SELECT 'a '::CHAR(2) collate "C" < E'a\n'::CHAR(2) returns true, even though C locale would consider a space to be greater than a newline
Trailing spaces are removed when converting a character value to one of the other string types.
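The three rules quoted above (pad on storage, ignore trailing spaces when comparing two character values, strip them on conversion) can be mimicked in a few lines. The following is only a toy Python model of those rules, not how any database implements them; the function names to_char, char_equal and char_to_text are made up for illustration.

```python
def to_char(value, n):
    """Model storing into character(n): pad with spaces to width n."""
    if len(value) > n:
        raise ValueError("value too long for character(%d)" % n)
    return value.ljust(n)


def char_equal(a, b):
    """Model comparing two character values: trailing spaces are ignored."""
    return a.rstrip(" ") == b.rstrip(" ")


def char_to_text(value):
    """Model converting character to another string type: padding is dropped."""
    return value.rstrip(" ")


c = to_char("foo", 10)

print(repr(c))                 # 'foo       '
print(char_equal(c, "foo "))   # True
print(char_to_text(c) + ">")   # foo>
```

The surprises come from the padding being visible in some contexts, such as the stored and displayed value, and invisible in others, such as comparison and conversion.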
Storage engines have improved such that there's little reason to use char anymore.
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
One case where char might be justified is storing very small, fixed-size strings. For example, two-character ISO country codes might be stored as char(2). Even then, the performance difference is unlikely to be noticeable on such small strings.
char is a headache best avoided.