How to make Postgres Copy ignore first line of large txt file - sql

I have a fairly large .txt file (~9 GB) and I would like to load it into Postgres. The first row is the header, followed by all the data. If I COPY the data directly, the header row causes an error because its values do not match the data types of my Postgres table, so I need to remove it somehow.
Sample data:
ProjectId,MailId,MailCodeId,prospectid,listid,datemailed,amount,donated,zip,zip4,VectorMajor,VectorMinor,packageid,phase,databaseid,amount2
15,53568419,89734,219906,15,2011-05-11 00:00:00,0,0,90720,2915,NonProfit,POLICY,230,3,1,0
16,84141863,87936,164657,243,2011-03-10 00:00:00,0,0,48362,2523,NonProfit,POLICY,1507,5,1,0
16,81442028,86632,15181625,243,2011-01-19 00:00:00,0,0,11501,2115,NonProfit,POLICY,1508,2,1,0
While Postgres COPY has a HEADER option that can skip the first row, it only works in CSV format:
copy training from 'C:/testCSV.csv' DELIMITER ',' csv header;
When I try to run the command above on my txt file, I get an error:
copy training from 'C:/testTXTFile.txt' DELIMITER ',' csv header
ERROR: unquoted newline found in data
HINT: Use quoted CSV field to represent newline.
I have tried adding QUOTE and ESCAPE options, but the command just won't work for the txt file:
copy training from 'C:/testTXTFile.txt' DELIMITER ',' csv header quote as E'"' escape as E'\\N';
ERROR: COPY escape must be a single one-byte character
Alternatively, I thought about running Java or creating a separate staging table to remove the first row, but these solutions are expensive and time consuming. I would need to load 9 GB of data just to remove the first row of headers. Are there other solutions out there to remove the first row of a txt file easily so that I can load the data into my Postgres database?

Use the HEADER option together with the CSV option:
\copy <table_name> from '/source_file.csv' delimiter ',' CSV HEADER ;
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
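Applied to the file from the question, that would look like the sketch below (table name and path taken from the question). The QUOTE E'\b' part is an optional, commonly used trick for plain delimited text: it sets the quote character to a byte that should never occur in the data, so CSV mode is used only for its HEADER behaviour and any stray double quotes are loaded literally; whether it is needed depends on the file.
copy training from 'C:/testTXTFile.txt' with (format csv, header true, delimiter ',', quote E'\b');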

I've looked up the docs at https://www.postgresql.org/docs/10/sql-copy.html; what is written there about HEADER holds not only for CSV but for TSV as well!
My solution was this in psql
\COPY mytable FROM 'mydata.tsv' DELIMITER E'\t' CSV HEADER;
(In addition, mydata.tsv contained a header row, which this command excludes from being copied into the database table.)
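Another option, when a Unix-like shell (or Git Bash on Windows) is available, is to drop the header line before COPY ever sees it and stream the rest straight into psql, which avoids both a staging table and a second on-disk copy of the 9 GB file. A sketch only, assuming the training table from the question and a placeholder database name mydb:
# skip the first line and pipe the remainder into COPY via standard input
tail -n +2 testTXTFile.txt | psql -d mydb -c "\copy training FROM STDIN WITH DELIMITER ','"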

Related

" replaced by ""

The Redshift UNLOAD command is replacing " with "".
Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output has no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format and CSV format says that if you want a double quote in your data you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly as you requested. Now if you just want a comma delimited file then you don't want to use "CSV" as this will add all the necessary characters to make the file fully compliant with the CSV specification.
This choice will come down to what tool or tools are reading the file and if they expect an rfc compliant CSV or just a simple file where fields are separated by commas.
This is a gripe of mine: tools that say they read CSV but don't follow the spec. If you say CSV, then follow the format, or call what you read something different, like CDV (comma delimited values).
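So if the goal is simply a comma-separated file where the value comes out as "Jane" without CSV-style doubling, one option is to drop the CSV keyword and unload plain delimited text. This is only a sketch with the same placeholder role ARN as above; check it against the Redshift UNLOAD documentation before relying on it:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
DELIMITER ','
ALLOWOVERWRITE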

skipLeadingRows=1 in external table definition

In the example below, how can I set the skip-leading-rows option?
bq --location=US query --external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER#CSV=gs://mybucket/sales.csv 'SELECT Region,Total_sales FROM sales;'
Regards,
Sreekanth
The flag options can be found under the installation home folder; the one you are looking for is --skip_leading_rows (see the sketch after the flag list below):
/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_jagged_rows: Whether to allow missing trailing optional columns in
CSV import data.
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import
data.
-E,--encoding: The character encoding used by the input
file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between
columns in the input file. "\t" and "tab" are accepted names for tab.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized
values in CSV or JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job
fails.
(default: '0')
(an integer)
--quote: Quote character to use to enclose records. Default is ". To indicate
no quote character at all, use an empty string.
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form
name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to
skip.
(an integer)
--source_format: Format of
source data. Options include:
CSV
NEWLINE_DELIMITED_JSON
DATASTORE_BACKUP
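With an external table definition, the equivalent of --skip_leading_rows is the skipLeadingRows field inside csvOptions of a table-definition JSON file. A sketch of what that could look like for the sales example (the file name sales_def.json is just a placeholder):
{
  "sourceFormat": "CSV",
  "sourceUris": ["gs://mybucket/sales.csv"],
  "csvOptions": { "skipLeadingRows": 1 },
  "schema": {
    "fields": [
      {"name": "Region", "type": "STRING"},
      {"name": "Quarter", "type": "STRING"},
      {"name": "Total_sales", "type": "INTEGER"}
    ]
  }
}
The query can then reference the definition file instead of the inline schema:
bq --location=US query --external_table_definition=sales::sales_def.json 'SELECT Region,Total_sales FROM sales;'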

Control character issues while uploading data

I am trying to load the data into PostgreSQL using these queries:
COPY "nad_data" FROM 'D:\NAD_Files\NAD.tsv' DELIMITER E'\t' CSV header
encoding 'utf8';
COPY "nad_data" FROM 'D:\NAD_Files\NAD.tsv' DELIMITER E'\t' quote e'\b' CSV
header encoding 'utf8';
COPY "nad_data" FROM 'D:\NAD_Files\NAD.tsv' DELIMITER E'\t' quote e'\b' CSV
header encoding 'win-1252';
Every time I came up with this error:
ERROR: extra data after last expected column
CONTEXT: COPY nad_data, line 10314533: "14879314 Tennessee McMinn
Athens 37303 County Rd 3051
579
This line basically contains some control characters, and these special characters are causing the problem.
Any suggestions on how to handle them or remove them from the data?
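One common way to handle this (a sketch, assuming a Unix-like shell and that tabs, newlines, and carriage returns should be kept) is to strip the remaining ASCII control characters before running COPY:
# delete ASCII control characters except tab (\011), newline (\012) and carriage return (\015)
tr -d '\000-\010\013\014\016-\037' < NAD.tsv > NAD_clean.tsv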

Load raw data from a file without dropping backslash characters

I have a file that contains the following content (simplified version that demonstrates the problem):
"abc\"def"
I would like to load the literal content of the file into a table without any mangling of the data. Here is what I am currently doing:
CREATE TABLE file_content (content text);
COPY file_content FROM '/path/to/test.txt';
The resulting line in the table is:
"abc"def"
In other words, the backslash was silently dropped/ignored. I've tried the copy with different encodings (UTF8, LATIN1, SQL_ASCII) without any change in behavior.
Also, the ESCAPE and QUOTE options seemed promising at first, but they are only for COPY ... TO.
Is there a way to load raw data from a file without the mangling? I'm using PostgreSQL version 9.4.6.
You need to change \ to \\. You can use sed for that:
sed -i -- 's/\\/\\\\/g' import.file
Please make sure you have reviewed your data and backed it up before performing the operation above.
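An alternative that avoids rewriting the file first is to load it in CSV mode with the quote and delimiter set to bytes that never occur in the data; CSV mode does not treat backslashes specially, so each line is stored literally. A sketch only; the delimiter byte chosen here (0x01) is an assumption about the data:
COPY file_content FROM '/path/to/test.txt' WITH (FORMAT csv, QUOTE E'\b', DELIMITER E'\x01');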

PIG LOAD filename

I am just trying to load an unstructured input file and add the filename. So what I want to get is two fields:
filename:chararray, inputrow:chararray.
I can load the filename if I have a field delimiter, using PigStorage(';','-tagFile'), but I do not want to delimit fields at this point; I just want the string and the filename. How can I do this?
B
The way to load files without applying a delimiter is to choose a delimiter that does not (and cannot) occur in the file.
For example, if your file is separated by ; and cannot contain tabs (\t), you could do:
PigStorage('\t','-tagFile')
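Putting that together, a complete LOAD could look like the sketch below (the input path is a placeholder). With '-tagFile', PigStorage prepends the source file name as the first field, and the tab delimiter never splits the line as long as the data contains no tabs:
-- load each raw line plus its source file name; assumes no tabs occur in the data
data = LOAD '/path/to/input' USING PigStorage('\t', '-tagFile')
       AS (filename:chararray, inputrow:chararray);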