Skip bad records in Redshift data load - SQL

I am trying to load data into AWS Redshift using the following command:
copy venue from 's3://mybucket/venue'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '\t';
but the data load is failing. When I checked the Queries section for that specific load, I noticed it failed because of "Bad UTF8 hex sequence: a4 (error 3)".
Is there a way to skip bad records when loading data into Redshift?

Yes, you can use the maxerror parameter. This example will allow up to 250 bad records to be skipped (the errors are written to stl_load_errors):
copy venue
from 's3://mybucket/venue'
credentials 'aws_access_key_id=;aws_secret_access_key='
delimiter '\t'
maxerror as 250;

Related

Is there any way to ignore a record that isn't correct and continue with the next record when using the COPY command to upload data from S3 to Redshift?

I have a .csv file in S3 that contains a lot of text data. I am trying to upload the data from S3 to a Redshift table, but my data is not consistent; it has a lot of special characters. Some records may be rejected by Redshift. I want to ignore those records and move ahead with the next one. Is it possible to ignore such records using the COPY command?
I am expecting some kind of exception-handling capability when using the COPY command to upload data from S3 to Redshift.
Redshift has several ways to attack this kind of situation. First there is the MAXERROR option, which sets how many unreadable rows will be allowed before the COPY fails. There is also the IGNOREALLERRORS option to COPY, which will load every row it can.
If you want to accept the rows that contain the odd characters, you can use the ACCEPTINVCHARS option to COPY, which lets you specify a replacement character for every character Redshift cannot parse. It is typical to use '?' but you can make it any character.
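For example, combining those options might look like the sketch below (the bucket path, IAM role ARN, and error threshold are placeholders, not taken from the question):
copy my_table
from 's3://mybucket/prefix/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
csv
acceptinvchars as '?'
maxerror as 100;
Characters Redshift cannot interpret as valid UTF-8 are replaced with '?', and up to 100 otherwise unloadable rows are skipped and logged to stl_load_errors.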

BigQuery load - NULL is treated as a string instead of empty

My requirement is to pull data from different sources (Facebook, YouTube, DoubleClick Search, etc.) and load it into BigQuery. When I pull the data, some of the sources return "NULL" when a column is empty.
When I load the same data into BigQuery, BigQuery treats it as the string "NULL" instead of a NULL (empty) value.
Right now I am replacing "NULL" with "" (an empty string) before loading into BigQuery. Instead of doing this, is there any way to load the file directly without any manipulation (replacing)?
Thanks,
What is the file format of the source file, e.g. CSV, newline-delimited JSON, Avro, etc.?
The reason I ask is that CSV treats an empty string as a null, while the text NULL is treated as a string value. So, if you don't want to manipulate the data before loading, you should save the files in newline-delimited JSON format.
As you mentioned that you are pulling data from social media platforms, I assume you are using their REST APIs, and as a result it should be possible for you to save that data as newline-delimited JSON instead of CSV.
Answer to your question "is there a way we can load this from the web console?":
Yes. Go to your BigQuery project console at https://bigquery.cloud.google.com/ and create a table in a dataset; there you can specify the source file and the table schema details.
From the comment section (for the convenience of other viewers):
Is there any option in bq commands for this?
Try this:
bq load --source_format=CSV --skip_leading_rows=1 --null_marker="NULL" yourProject:yourDataset.yourTable ~/path/to/file/x.csv Col1:string,Col2:string,Col3:integer,Col4:string
You may consider running a command similar to:
bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json
More details can be gathered from the replies to the "Best Practice to migrate data from MySQL to BigQuery" question.
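If you prefer staying in SQL rather than the bq CLI, BigQuery's LOAD DATA statement exposes similar load options; the following is only a sketch (the dataset, table, and bucket names are placeholders, and the exact option names should be checked against the LOAD DATA documentation):
LOAD DATA INTO yourDataset.yourTable
FROM FILES (
  format = 'CSV',
  uris = ['gs://your-bucket/path/file.csv'],
  skip_leading_rows = 1,
  null_marker = 'NULL'
);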

Specify multiple delimiters for Redshift copy command

Is there a way to specify multiple delimiters for the Redshift COPY command while loading data?
I have a data file with the following format:
1 | ab | cd | ef
2 | gh | ij | kl
I am using a command like this:
COPY MY_TBL
FROM 's3://s3-file-path'
iam_role 'arn:aws:iam::ddfjhgkjdfk'
manifest
IGNOREHEADER 1
gzip delimiter '|';
Fields are separated by | and records are separated by newlines. How do I copy this data into Redshift? The query above gives me a "delimiter not found" error.
No, delimiters are single characters.
From Data Format Parameters:
Specifies the single ASCII character that is used to separate fields in the input file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ).
You could import it with a pipe delimiter, then run an UPDATE that uses TRIM() or BTRIM() to strip off the surrounding spaces, as sketched below.
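A minimal sketch of that two-step approach, reusing the COPY from the question (the column names in the UPDATE are placeholders for your actual columns):
COPY MY_TBL
FROM 's3://s3-file-path'
iam_role 'arn:aws:iam::ddfjhgkjdfk'
manifest
IGNOREHEADER 1
gzip delimiter '|';

UPDATE MY_TBL
SET col_a = BTRIM(col_a),
    col_b = BTRIM(col_b),
    col_c = BTRIM(col_c);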
Your error above suggests that something in your data is causing the COPY command to fail. This could be a number of things, from the file encoding to some funky data in there. I've struggled with the "delimiter not found" error recently; it turned out to be the ESCAPE parameter combined with trailing backslashes in my data, which prevented my delimiter (\t) from being picked up.
Fortunately, there are a few steps you can take to help you narrow down the issue:
stl_load_errors - This system table contains details on any error logged by Redshift during the COPY operation. This should be able to identify the row number in your data file that is causing the problem.
NOLOAD - will allow you to run your copy command without actually loading any data to Redshift. This performs the COPY ANALYZE operation and will highlight any errors in the stl_load_errors table.
FILLRECORD - This allows Redshift to "fill" any columns that it sees as missing in the input data. This is essentially to deal with any ragged-right data files, but it can be useful in helping to diagnose issues that can lead to the "delimiter not found" error. It will let you load your data into Redshift and then query the table to see where your columns start being out of place.
From the sample you've posted, your setup looks good, but obviously this isn't the entire picture. The options above should help you narrow down the offending row(s) to help resolve the issue.
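For instance, after re-running the COPY from the question with NOLOAD added, the details of whatever went wrong can be pulled with a query along these lines (the column list below is the standard stl_load_errors layout):
SELECT starttime, filename, line_number, colname, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;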

Google Bigquery - Bulk Load

We have a CSV file with 300 columns; the size is approximately 250 MB. We are trying to upload it to BigQuery through the web UI, but specifying the schema is hard work. I was anticipating that BigQuery would identify the file headers, but it doesn't seem to recognise them unless I am missing something. Is there a way forward?
Yes, you have to write the schema yourself; BigQuery is not able to auto-infer it. If you have 300 columns, I suggest writing a script to automatically create the schema.
With the command-line tool (cf. here), if some lines have a wrong/different schema, you can use the following option to continue loading the other records:
--max_bad_records: the maximum number of bad rows to skip before the load job is aborted
In your case, if you also want to skip the first line of headers, the command can be the following:
bq load --skip_leading_rows=1 --max_bad_records=10000 <destination_table> <data_source_uri> [<table_schema>]

Redshift copy command from S3 works, but no data uploaded

I am using the copy command to copy a file (.csv.gz) from AWS S3 to Redshift
copy sales_inventory from
's3://[redacted].csv.gz'
CREDENTIALS '[redacted]'
COMPUPDATE ON
DELIMITER ','
GZIP
IGNOREHEADER 1
REMOVEQUOTES
MAXERROR 30
NULL 'NULL'
TIMEFORMAT 'YYYY-MM-DD HH:MI:SS'
;
I don't receive any errors, just '0 rows loaded successfully'. I checked the easy things: I double-checked the file's contents and made sure I was targeting the right file with the copy command. Then I created a simple one-row example file to try, and that didn't work either. I've been using a copy command template I made a long time ago, and it has worked very recently.
Any common mistakes I might have overlooked? Any way other than the example file that I could try?
Thanks.
With the IGNOREHEADER 1 option, Redshift will regard the first line as a header and skip it. If there is just one line in the file, you should take this option off.
If your file contains multiple records, you might have a data load error. Since you're specifying MAXERROR 30, Redshift will skip up to 30 invalid records and still return a success result. The load error information from the copy is stored in the STL_LOAD_ERRORS table. Try SELECT * FROM STL_LOAD_ERRORS ORDER BY starttime DESC LIMIT 10; to check whether you had load errors.
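As a quick check for the one-row test file, the question's COPY with IGNOREHEADER 1 removed would be the following sketch (everything else is left as in the original, with the S3 path and credentials still redacted):
copy sales_inventory from
's3://[redacted].csv.gz'
CREDENTIALS '[redacted]'
COMPUPDATE ON
DELIMITER ','
GZIP
REMOVEQUOTES
MAXERROR 30
NULL 'NULL'
TIMEFORMAT 'YYYY-MM-DD HH:MI:SS'
;
If that loads the single row, the original failure most likely comes from load errors being silently skipped under MAXERROR 30, which the STL_LOAD_ERRORS query above will surface.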