single vs double quotes in WHERE clause returning different results - sql

It seemed that Athena was including CSV column headers in my query results. I recreated the tables with the DDL included below using TBLPROPERTIES ("skip.header.line.count"="1") to remove the headers.
I'm running the following queries to validate that the CREATE TABLE DDL worked. The only difference between the queries below is the use of single vs double quotes in the WHERE clause. The issue is that I'm getting different result when running them.
Query 1:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
The query above returns the actual data (see sample table below), rather than only rows where the file_name field is "file_name".
+-------+--------------------+
| Row # | file_name |
+-------+--------------------+
| 1 | |
| 2 | 1586786323.8194735 |
| 3 | |
| 4 | 1586858857.3117666 |
| 5 | 1586858857.3117666 |
| 6 | 1586858857.3117666 |
| ... | |
+-------+--------------------+
Query 2:
SELECT
file_name
FROM table
WHERE file_name = 'file_name'
The query above returns no results, as expected if the CSV column headers are not being included in the results.
I'm quite confused by the first query returning any results at all. I've scoured the AWS documentation at this point and doesn't seem I did anything wrong with the DDL and SQL should not care whether I use single vs. double quotes. What am I missing here?
DDL:
CREATE EXTERNAL TABLE `table` (
`file_name` string,
`ticker` string,
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION
's3://{bucket_name}/{folder}/'
TBLPROPERTIES (
"skip.header.line.count"="1")

Single quotes are the SQL standard for delimiting strings.
Double quotes are used for escaping delimiters. So "file_name" refers to the column of that name. Some databases also accept double quotes for strings. That is just confusing. Don't do that.
In your original tags, for instance, Hive uses backticks to escape identifiers and double quotes for strings. Presto uses double quotes (which is the standard) to delimit identifiers.

Just to expand on Gordon's answer a little. Your first query:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
In this case, the double quotes are causing the query engine to treat "file_name" as a column identifier, not a value, so that query is functionally the same as:
SELECT
file_name
FROM table
WHERE file_name = file_name
Obviously (when written that way) the condition is always true, so the full table is returned.

Related

How to extract string between quotes, with a delimiter in Snowflake

I've got a bunch of fields which are double quoted with delimiters but for the life of me, I'm unable to get any regex to pull out what I need.
In short - the delimiters can be in any order and I just need the value that's between the double quotes after each delimiter. Some sample data is below, can anyone help with what regex might extract each value? I've tried
'delimiter_1=\\W+\\w+'
but I only seem to get the first word after the delimiter (unfortunately - they do have spaces in the value)
some content delimiter_1="some value" delimiter_2="some other value" delimiter_4="another value" delimiter_3="the last value"
The problem is returning a varying numbers of values from the regex function. For example, if you know that there will 4 delimiters, then you can use REGEXP_SUBSTR for each match, but if the text will have varying delimiters, this approach doesn't work.
I think the best solution is to write a function to parse the text:
create or replace function superparser( SRC varchar )
returns array
language javascript
as
$$
const regexp = /([^ =]*)="([^"]*)"/gm;
const array = [...SRC.matchAll(regexp)]
return array;
$$;
Then you can use LATERAL FLATTEN to process the returning values from the function:
select f.VALUE[1]::STRING key, f.VALUE[2]::STRING value
from values ('some content delimiter_1="some value" delimiter_2="some other value" delimiter_4="another value" delimiter_3="the last value"') tmp(x),
lateral flatten( superparser(x) ) f;
+-------------+------------------+
| KEY | VALUE |
+-------------+------------------+
| delimiter_1 | some value |
| delimiter_2 | some other value |
| delimiter_4 | another value |
| delimiter_3 | the last value |
+-------------+------------------+

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?
For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution which will work for dynamic length string. Split string by the empty string that has the last match (\\G) followed by two digits (\\d{2}) before it ((?<= )), concatenate array and remove delimiter at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by another character, with themselves + ':'
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0 |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+

Format a number to NOT have commas (1,000,000 -> 1000000) in Google BigQuery

In Bigquery: How do we format a number that will be part of the result set that should be not having commas: like 1,000,000 to 1000000 ?
I am assuming that your data type is string here.
You can use the REGEXP_REPLACE function to remove certain symbols from strings.
SELECT REGEXP_REPLACE("1,000,000", r',', '') AS Output
Returns:
+-----+---------+
| Row | Output |
+-----+---------+
| 1 | 1000000 |
+-----+---------+
If your data contains strings with and without commas, this function will return the ones without as they are so you don't need to worry about filtering the input.
Documentation for this function can be found here.

PostgreSQL COPY FROM csv - csv formating issues

I have a csv file that I'm trying to import into my PostgreSQL database (v.10). I'm using the following basic SQL syntax:
COPY table (col_1, col_2, col_3)
FROM '/filename.csv'
DELIMITER ',' CSV HEADER
QUOTE '"'
ESCAPE '\';
First 30,000 lines or so are imported without any problem. But then I start bumping into formatting issues in the csv file that break the import:
Double quotes in double quotes: "value_1",""value_2"","value_3" or "value_1","val"ue_2","value_3"
The typical error I get is
ERROR: extra data after last expected column
So I started editing the csv file manually using Vim (the csv file has close to 7 million lines so can't really think of another desktop tool to use)
Is there anything I can do with my SQL syntax to handle those malformed strings? Using alternative ESCAPE clauses? Using regex?
Can you think of a way to handle those formatting issues in Vim or using another tool or function?
Thanks a lot!
Note that the file does not meet the CSV specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.
You should specify a quote sign other than double-quote, for example '|':
create table test(a text, b text, c text);
copy test from '/data/example.csv' (format csv, quote '|');
select * from test;
a | b | c
-----------+-------------+-----------
"value_1" | ""value_2"" | "value_3"
"value_1" | "val"ue_2" | "value_3"
(2 rows)
You can get rid of the unwanted double-quotes using the trim() or replace() functions, e.g.:
update test
set a = trim(a, '"'), b = trim(b, '"'), c = trim(c, '"');
select * from test;
a | b | c
---------+----------+---------
value_1 | value_2 | value_3
value_1 | val"ue_2 | value_3
(2 rows)

replacing multiple units of specific character with one unit in hive

I have a dataset in which values are same except the number of semicolons in it resulting to different records.
For example if in a column one records has a;b;c and another record has a;;b;c, this is disabling the use of distinct function in my code. I want this to be treated as duplicate record for which ;; needs to be replaced with ;
How can we replace multiple ; with single ; in strings in a dataset in hive?
You can use regexp_replace as defined in Hive UDFs
The first argument is the string that needs to get changed. So you can call it on your table like :
with t as
(SELECT "a\;\;\;b\;\;c\;d" as col )
SELECT regexp_replace(t.col, "\;+", "\;") as col from t
This should give you the output
+-------+
| col|
+-------+
|a;b;c;d|
+-------+