I am getting errors 1206 and 1205 when ingesting data from Firehose to Redshift using a COPY command.
Below is the raw data on Firehose:
{
"Name": "yoyo"
"a_timestamp": "2021-05-11T15:02:02.426729Z",
"a_date": "2021-05-11T00:00:00Z"
}
Below is the COPY command:
COPY pqr_table
FROM 's3://xyz/<manifest>'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
MANIFEST
json 's3://xyz/abc.json'
DATEFORMAT 'YYYY-MM-DD';
Below is the DDL:
create table events (
Name varchar(8),
a_timestamp timestamp,
a_date date)
It would be great if anyone could help me with this.
Those errors indicate bad timestamp and date formats. You need to specify "timeformat" because your timestamp string is not in Redshift's default format. I'd first try 'auto' for both of these and see if Redshift can work things out:
dateformat as 'auto'
timeformat as 'auto'
Also, having a time component in your date values may create some confusion, and may require you to specify the format manually or to ingest the value as a timestamp and then cast it to a date. I'd first see if 'auto' does the trick.
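For example, based on the COPY command in the question, a minimal sketch with both formats set to 'auto' might look like this (the bucket, manifest, role, and JSONPaths file are the placeholders from the question):
-- Same COPY as in the question, but let Redshift infer the date and time formats
COPY pqr_table
FROM 's3://xyz/<manifest>'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
MANIFEST
json 's3://xyz/abc.json'
DATEFORMAT 'auto'
TIMEFORMAT 'auto';
With 'auto', Redshift should recognize the ISO 8601 strings (including the trailing Z) in the sample record, but it's worth verifying on a small test load.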
I'm trying to change a column's data type from string to DATETIME (for example '04/12/2016 02:47:30') with the format 'YY/MM/DD HH24:MI:SS', but it shows an error like:
Failed to parse input timestamp string at 8 with format element ' '
The initial file was a CSV which I uploaded from my drive. I tried to convert the column's data type from Google Sheets and then re-upload it, but the column type still remains a string.
I think that when you load your CSV file into the BigQuery table, you use autodetect mode.
Unfortunately, with this mode BigQuery will consider your date as a STRING even if you changed it in Google Sheets.
Instead of using autodetect, I propose using a JSON schema for your BigQuery table.
In your schema you will indicate that the column type for your date field is TIMESTAMP.
The format you indicated, 04/12/2016 02:47:30, is compatible with a timestamp and BigQuery will convert it for you.
To load the file into BigQuery, you can use the console directly or the gcloud CLI with the bq command:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
For the BigQuery JSON schema, the timestamp type looks like this:
[
{
"name": "yourDate",
"type": "TIMESTAMP",
"mode": "NULLABLE",
"description": "Your date"
}
]
Is there an option for bq load to specify the datetime format to parse? I'm getting an error when using bq load due to a datetime with milliseconds in it.
Sample file below:
ID|Card|Status|ExpiryDate|IssuedDate
1105|9902|Expired|2015-12-31 00:00:00|2014-07-04 14:43:41.963000000
Command used below:
bq load --source_format=CSV --skip_leading_rows 1 --field_delimiter "|" --replace mytable $GSPATH
It is not possible to control/change date or datetime formatting when loading data into BigQuery.
As a workaround, I would load the datetime field as a STRING and then use the PARSE_DATETIME function (or something similar) to post-process it and convert the string to a datetime.
An example of the code to parse the string to datetime:
select PARSE_DATETIME('%Y-%m-%d %H:%M:%E*S','2014-07-04 14:43:41.963000000');
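Putting it together, here is a sketch of the post-processing step, assuming the CSV was first loaded into a hypothetical staging table mydataset.cards_staging with ExpiryDate and IssuedDate as STRING columns:
#standardSQL
-- mydataset.cards_staging is a hypothetical staging table loaded from the pipe-delimited file,
-- with the two datetime columns kept as STRING
SELECT
  ID,
  Card,
  Status,
  PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ExpiryDate) AS ExpiryDate,
  PARSE_DATETIME('%Y-%m-%d %H:%M:%E*S', IssuedDate) AS IssuedDate
FROM mydataset.cards_staging;
The result can then be written to the final table, for example by setting a destination table for the query.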
I was trying to use the S3 COPY command in my project.
In my scenario, there is a chance of getting a lot of errors.
Here is a sample snippet which I found in the docs.
Doc link: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
copy event from 's3://mybucket/data/allevents_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
removequotes emptyasnull blanksasnull maxerror 5
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' manifest;
I was trying to figure out the maximum and default values for maxerror. I read the S3 docs but I couldn't find anything about this.
Can anyone please let me know?
Any help would be appreciated.
You can always set data-load parameters in COPY commands. In your case you can set the MAXERROR parameter up to the maximum limit it supports (100,000); if it isn't specified, the default is 0. Here is the Redshift documentation for the parameter: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html#copy-maxerror
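For instance, a variation of the docs snippet above with MAXERROR raised to that limit (the bucket, file, and role are just the example values from the docs):
-- Same example as above, but with maxerror raised to the documented limit
copy event from 's3://mybucket/data/allevents_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
removequotes emptyasnull blanksasnull
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS'
maxerror 100000 manifest;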
I tried to load CSV files into a BigQuery table. There are columns whose type is INTEGER, but some missing values are 'null'. So when I use the bq load command, I get the following error:
Could not parse 'null' as int for field
So I am wondering what the best way to deal with this is. Do I have to reprocess the data first for bq load?
You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to use long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
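If you'd rather skip the UI destination-table option, roughly the same thing can be done in a single statement; mydataset and YourFinalTable are hypothetical names, and x stands for the integer column from the question:
#standardSQL
-- Write the converted rows to a new table in one step (names are placeholders)
CREATE OR REPLACE TABLE mydataset.YourFinalTable AS
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM mydataset.YourTempTable;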
Please try with the job config setting:
job_config.null_marker = 'NULL'
configuration.load.nullMarker (string)
[Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
The BigQuery Console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, it can easily be done using the BigQuery command-line tool's bq load command. We can use the --null_marker flag to specify the marker, which is simply null in this case.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker as null does the trick here. You can omit the schema.json part if the table is already present with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command, however, lets you create and load a table in a single go. The schema needs to be specified in a JSON file in the format below:
[
{
"description": "[DESCRIPTION]",
"name": "[NAME]",
"type": "[TYPE]",
"mode": "[MODE]"
},
{
"description": "[DESCRIPTION]",
"name": "[NAME]",
"type": "[TYPE]",
"mode": "[MODE]"
}
]
Can someone give a full example of the datetime functions, including the REGISTER for the jar? I have been trying to get CurrentTime() and ToDate() running without much success. I have the piggybank jar in the classpath and registered it, but it always says the function has to be defined before usage.
I read this question comparing datetimes in Pig before this.
Datetime functions can easily be used with native Pig; you don't need the piggybank jar for this.
Example:
In this example I will read a set of dates from the input file, get the current datetime, and calculate the total number of days between each previous date and the current date.
input.txt
2014-10-12T10:20:47
2014-08-12T10:20:47
2014-07-12T10:20:47
PigScript:
A = LOAD 'input.txt' AS (mydate:chararray);
B = FOREACH A GENERATE ToDate(mydate) AS prevDate,CurrentTime() AS currentDate,DaysBetween(CurrentTime(),ToDate(mydate)) AS diffDays;
DUMP B;
Output:
(2014-10-12T10:20:47.000+05:30, 2014-12-12T10:39:15.455+05:30, 61)
(2014-08-12T10:20:47.000+05:30, 2014-12-12T10:39:15.455+05:30, 122)
(2014-07-12T10:20:47.000+05:30, 2014-12-12T10:39:15.455+05:30, 153)
You can refer to a few examples from my old posts:
Human readable String date converted to date using Pig?
Storing Date and Time In PIG
how to convert UTC time to IST using pig