PostgreSQL COPY FROM csv - csv formatting issues - sql

I have a csv file that I'm trying to import into my PostgreSQL database (v.10). I'm using the following basic SQL syntax:
COPY table (col_1, col_2, col_3)
FROM '/filename.csv'
DELIMITER ',' CSV HEADER
QUOTE '"'
ESCAPE '\';
The first 30,000 lines or so are imported without any problem, but then I start bumping into formatting issues in the csv file that break the import:
Double quotes in double quotes: "value_1",""value_2"","value_3" or "value_1","val"ue_2","value_3"
The typical error I get is
ERROR: extra data after last expected column
So I started editing the csv file manually using Vim (the csv file has close to 7 million lines, so I can't really think of another desktop tool that could handle it).
Is there anything I can do with my SQL syntax to handle those malformed strings? Using alternative ESCAPE clauses? Using regex?
Can you think of a way to handle those formatting issues in Vim or using another tool or function?
Thanks a lot!

Note that the file does not meet the CSV specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.
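Under that rule, the second malformed line from the question would have to be written with the inner quote doubled:
"value_1","val""ue_2","value_3"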
You should specify a quote character other than the double quote, for example '|':
create table test(a text, b text, c text);
copy test from '/data/example.csv' (format csv, quote '|');
select * from test;
     a     |      b      |     c
-----------+-------------+-----------
 "value_1" | ""value_2"" | "value_3"
 "value_1" | "val"ue_2"  | "value_3"
(2 rows)
You can get rid of the unwanted double-quotes using the trim() or replace() functions, e.g.:
update test
set a = trim(a, '"'), b = trim(b, '"'), c = trim(c, '"');
select * from test;
    a    |    b     |    c
---------+----------+---------
 value_1 | value_2  | value_3
 value_1 | val"ue_2 | value_3
(2 rows)
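If a field carries spec-style doubled quotes in its interior (e.g. "val""ue_2" for val"ue_2), trim() alone leaves the doubled pair behind; a minimal sketch collapsing them with replace() after the trim (shown for column b only):
update test
set b = replace(trim(b, '"'), '""', '"');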

Related

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?
For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution, which will work for strings of dynamic length: split the string on the empty string that has the last match (\\G) followed by two digits (\\d{2}) before it (the lookbehind (?<=...)), concatenate the array, and remove the delimiter at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by at least one more character with themselves + ':':
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
|        _c0        |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+

SQL: Extract from messy JSON nested field with backslashes

I have a table that has some rows with normal JSON and some with escaped values in the JSON field (backslashes)
id | obj
---+------------------------------------------------------------------
 1 | {"is_from_shopping_bag":true,"products":[{"price":{"amount":"18.00","currency":"USD","offset":100,"amount_with_offset":"1800"},"product_id":"1234","quantity":1}],"source":"cart"}
 2 | {"is_from_shopping_bag":"","products":"[{\"product_id\":\"2345\",\"price\":{\"currency\":\"USD\",\"amount\":\"140.00\",\"offset\":100},\"quantity\":1}]"}
I am running a SQL query in Hive to get the 'currency' field.
Currently I can run
SELECT
id,
JSON_EXTRACT(obj, '$.products[0].price.currency')
FROM my_table
Which will give me the correct output for the first row, but gives me a NULL in the second row
id | obj
---+-------
 1 | "USD"
 2 | NULL
What is the best way to get currency field from the second row? Is there a way to clean up the field and remove the backslashes before trying to JSON_EXTRACT the relevant data?
I could use REPLACE to swap the '\ ' for '', but is that the most efficient method?
Replace \" with " using regexp_replace like this:
regexp_replace(obj,'\\\\"','"')
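Combining that with the extraction from the question gives a sketch like the following (an assumption, not a tested query: it requires that the engine accepts nesting the call and that the cleaned string parses as JSON; the quoted products array in row 2 may still need an extra cleanup pass):
SELECT
id,
JSON_EXTRACT(regexp_replace(obj, '\\\\"', '"'), '$.products[0].price.currency')
FROM my_table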

Parsing Escape character with delimiter in csv into same field in bigquery

I have a following text in csv file with delimiter as ','
Vliesbehang_0\,52&1\,04,103
I want the below output
Vliesbehang_0\,52&1\,04 | 103
but when I do the bq load it ignores the escape character, and the output I get is
Vliesbehang_0\ | 52&1\ | 04 | 103
I think you should replace the last delimiter with another symbol, like a semicolon (;) or a tab (\t). After that, you can use the option --field_delimiter to specify the new delimiter.
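One hedged sketch of that preprocessing (the sed expression swaps only the last comma on each line, which is the real field delimiter here; the dataset, table, and file names are placeholders):
# replace the last comma on each line with ';'
sed 's/,\([^,]*\)$/;\1/' input.csv > cleaned.csv
# load with the new delimiter
bq load --source_format=CSV --field_delimiter=';' mydataset.mytable cleaned.csv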

single vs double quotes in WHERE clause returning different results

It seemed that Athena was including CSV column headers in my query results. I recreated the tables with the DDL included below using TBLPROPERTIES ("skip.header.line.count"="1") to remove the headers.
I'm running the following queries to validate that the CREATE TABLE DDL worked. The only difference between the queries below is the use of single vs double quotes in the WHERE clause. The issue is that I'm getting different result when running them.
Query 1:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
The query above returns the actual data (see sample table below), rather than only rows where the file_name field is "file_name".
+-------+--------------------+
| Row # | file_name |
+-------+--------------------+
| 1 | |
| 2 | 1586786323.8194735 |
| 3 | |
| 4 | 1586858857.3117666 |
| 5 | 1586858857.3117666 |
| 6 | 1586858857.3117666 |
| ... | |
+-------+--------------------+
Query 2:
SELECT
file_name
FROM table
WHERE file_name = 'file_name'
The query above returns no results, as expected if the CSV column headers are not being included in the results.
I'm quite confused by the first query returning any results at all. I've scoured the AWS documentation at this point, and it doesn't seem I did anything wrong with the DDL; SQL should not care whether I use single vs. double quotes. What am I missing here?
DDL:
CREATE EXTERNAL TABLE `table` (
`file_name` string,
`ticker` string,
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION
's3://{bucket_name}/{folder}/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Single quotes are the SQL standard for delimiting strings.
Double quotes are used for escaping identifiers. So "file_name" refers to the column of that name. Some databases also accept double quotes for strings. That is just confusing. Don't do that.
In your original tags, for instance, Hive uses backticks to escape identifiers and double quotes for strings. Presto uses double quotes (which is the standard) to delimit identifiers.
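To make that concrete, a small sketch of the same predicate in each dialect (some_table is a placeholder):
-- Hive: backticks escape identifiers, double quotes delimit strings
SELECT `file_name` FROM some_table WHERE `file_name` = "file_name";
-- Presto/Athena: double quotes delimit identifiers, single quotes delimit strings
SELECT "file_name" FROM some_table WHERE "file_name" = 'file_name';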
Just to expand on Gordon's answer a little. Your first query:
SELECT
file_name
FROM table
WHERE file_name = "file_name"
In this case, the double quotes are causing the query engine to treat "file_name" as a column identifier, not a value, so that query is functionally the same as:
SELECT
file_name
FROM table
WHERE file_name = file_name
Obviously (when written that way) the condition is always true, so the full table is returned.

How can I specify the record delimiter to be used in SQLite's output?

I am using the following command to output the result of an SQL query to a text file:
$sqlite3 my_db.sqlite "SELECT text FROM message;" > out.txt
This gives me output like this:
text for entry 1
text for entry 2
Unfortunately, this breaks down when the text contains a newline:
text for entry 1
text for
entry 2
How can I specify an output delimiter (which I know doesn't exist in the text) for SQLite to use when outputting the data so I can more easily parse the result? E.g.:
text for entry 1
=%=
text for
entry 2
=%=
text for entry 3
Try the -separator option for this.
$sqlite3 -separator '=%=' my_db.sqlite "SELECT text FROM message;" > out.txt
Update 1
I guess this is because of the '-list' default option. In order to turn this option off you need to change the current mode.
This is a list of modes
.mode MODE ?TABLE? Set output mode where MODE is one of:
csv Comma-separated values
column Left-aligned columns. (See .width)
html HTML <table> code
insert SQL insert statements for TABLE
line One value per line
list Values delimited by .separator string
tabs Tab-separated values
tcl TCL list elements
-list Query results will be displayed with the separator (|, by
default) character between each field value. The default.
-separator separator
Set output field separator. Default is '|'.
Found this info here
I had the same question and there is a simpler solution. I found this at https://sqlite.org/cli.html :
.separator COL ?ROW? Change the column and row separators
For example:
sqlite> .separator | ,
sqlite> select * from example_table;
1|3,1|4,1|15,1|21,1|33,2|13,2|16,2|32,
Or with no column separator:
sqlite> .separator '' ,
sqlite> select * from example_table;
13,14,115,121,133,213,216,232,
Or, to answer the specific question posed above, this is all that is needed:
sqlite> .separator '' \r\n=%=\r\n
sqlite> select * from message;
text for entry 1
=%=
text for
entry 2
=%=
text for entry 3
=%=
In order to separate the records, you would have to work with group_concat and a separator.
Query evolution:
SELECT text FROM messages;
SELECT GROUP_CONCAT(text, "=%=") FROM messages;
SELECT GROUP_CONCAT(text, "\r\n=%=\r\n") FROM messages;
-- to get rid of the concat comma, use replace OR change the separator
SELECT REPLACE(GROUP_CONCAT(text, "\r\n=%="), ',', '\r\n') FROM messages;
Alternative: SQLite to CSV export (with a custom separator), then work with that.
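A sketch of that route (the -csv switch turns on CSV quoting, so fields containing newlines are wrapped in double quotes and survive a CSV parser; the output file name is a placeholder):
$sqlite3 -csv my_db.sqlite "SELECT text FROM message;" > out.csv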