In PostgreSQL, what data type do you pass to a create table call when dealing with timestamp values? - sql

When creating a table, how do you deal with a timestamp in a CSV file that has the following syntax - MM/DD/YY HH:MI? Here's an example: 1/1/16 19:00
I have tried the following script in PostgreSQL:
create table timetable (
time timestamp
);
copy table from '<path>' delimiter ',' CSV;
But I receive an error message saying:
ERROR: ERROR: invalid input syntax for type timestamp: "visit_datetime"
Where: COPY air_reserve, line 16, column visit_datetime: "visit_datetime"
One solution I have considered is to first create the timestamp column as char and then run a separate query that converts it to the appropriate timestamp data type using a function call such as to_timestamp(time, 'MM/DD/YY HH24:MI'). But I'm looking for a solution that would let me load the data in the correct data type in a single query.

You may find a datestyle that enables you to load the data you have, but sooner or later someone will deliver to you something that doesn't fit.
The solution you have considered is probably the best.
We use this as a standard pattern for loading data warehouses. We take today's data and load it into a staging table, using varchar columns for any data that will not load directly into its target data type. We then run whatever scripts we need to get the data into a good state, raising warnings for anything that is broken in a way we haven't seen before. Then we add the cleaned version of today's data to the table containing cleaned data for all previous days.
We don't mind if this takes several steps; we put them all in a script and run it as an automated job.
I'm working on documenting the techniques we use. You can see the beginnings of this at http://www.thedatastudio.net.
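To make the staging idea concrete, here is a minimal sketch in Python rather than SQL. It assumes the values have already been loaded as plain text (as they would be in a varchar staging column) and uses the MM/DD/YY HH:MI layout from the question. The names staged_rows and clean are made up for illustration, and the sample values are hypothetical.

```python
from datetime import datetime

# Hypothetical staged values: everything arrives as text, as in a varchar column.
staged_rows = ["1/1/16 19:00", "12/31/16 7:05", "visit_datetime", ""]

def clean(rows, fmt="%m/%d/%y %H:%M"):
    """Split staged text values into parsed timestamps and rejected values."""
    good, bad = [], []
    for raw in rows:
        try:
            good.append(datetime.strptime(raw.strip(), fmt))
        except ValueError:
            # Keep the broken value so it can be reported instead of
            # failing the whole load.
            bad.append(raw)
    return good, bad

good, bad = clean(staged_rows)
print("loaded:", good)
print("rejected:", bad)
```

The rejected list plays the role of the warnings mentioned above: a stray header value or an empty field is set aside for inspection while the clean rows continue on.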

Related

Importing CSV file but getting timestamp error

I'm trying to import CSV files into BigQuery, and on any of the hourly reports I attempt to upload it gives the error
Error while reading data, error message: Could not parse 4/12/2016 12:00:00 AM as TIMESTAMP for field SleepDay (position 1) starting at location 65 with message Invalid time zone: AM
I get that the format is trying to use AM as a timezone and causing an error but I'm not sure how best to work around it. All of the hourly entries will have AM or PM after the date-time and that will be thousands of entries.
I'm using autodetect for my schema and I believe that's where the issue is coming up, but I'm not sure what to put in the "edit as text" schema option to fix it.
To successfully parse an imported string as a timestamp in BigQuery, the string must be in the ISO 8601 format:
YYYY-MM-DDThh:mm:ss.sss
If your source data is not available in this format, then try the below approach.
1. Import the CSV into a temporary table by providing an explicit schema in which the timestamp fields are strings.
2. Select the data from the temporary table, convert it with the BigQuery PARSE_TIMESTAMP function as shown below, and write it to the permanent table.
INSERT INTO `example_project.example_dataset.permanent_table`
SELECT
  -- %I is the 12-hour clock hour; it pairs with %p (AM/PM)
  PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', time_stamp) AS time_stamp,
  value
FROM `example_project.example_dataset.temporary_table`;
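If you would rather fix the file before uploading it, a small pre-processing step can rewrite the column into the ISO 8601 layout mentioned above. This is only a sketch: it assumes the 12-hour "4/12/2016 12:00:00 AM" layout from the error message, the SleepDay column name from the question, and made-up Id and TotalMinutesAsleep columns in the sample data.

```python
import csv
import io
from datetime import datetime

# Assumed source layout, e.g. "4/12/2016 12:00:00 AM" (12-hour clock with AM/PM).
SOURCE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def to_iso(value):
    """Convert one source timestamp string to YYYY-MM-DDThh:mm:ss."""
    return datetime.strptime(value, SOURCE_FORMAT).isoformat()

def rewrite_csv(text, column="SleepDay"):
    """Return the CSV text with the given column rewritten to ISO 8601."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = to_iso(row[column])
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sample; only the SleepDay column name comes from the question.
sample = "Id,SleepDay,TotalMinutesAsleep\n1503960366,4/12/2016 12:00:00 AM,327\n"
converted = rewrite_csv(sample)
print(converted)
```

Note that 12:00:00 AM becomes 00:00:00, which is the reason for using the 12-hour directive together with the AM/PM marker.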

What does this error mean: Required column value for column index: 8 is missing in row starting at position: 0

I'm attempting to upload a CSV file (which is an output from a BCP command) to BigQuery using the gcloud CLI BQ Load command. I have already uploaded a custom schema file (I was having major issues with autodetect).
One resource suggested this could be a datatype mismatch. However, the table from the SQL DB lists the column as a decimal, so in my schema file I have listed it as FLOAT since decimal is not a supported data type.
I couldn't find any documentation for what the error means and what I can do to resolve it.
What does this error mean? It means, in this context, a value is REQUIRED for a given column index and one was not found. (By the way, columns are usually 0 indexed, meaning a fault at column index 8 is most likely referring to column number 9)
This can be caused by myriad of different issues, of which I experienced two.
Incorrectly categorizing NULL columns as NOT NULL. After exporting the schema, in JSON, from SSMS, I needed to clean it up for BQ, and in doing so I assigned IS_NULLABLE:NO to MODE:NULLABLE and IS_NULLABLE:YES to MODE:REQUIRED. These values should've been reversed. This caused the error because there were NULL columns where BQ expected a REQUIRED value.
Using the wrong delimiter. The file I was outputting was not only comma-delimited but also tab-delimited. I was only able to validate this by using the Get Data tool in Excel and importing the data that way, after which I saw the error for tabs inside the cells.
After outputting with a pipe ( | ) delimiter, I was finally able to successfully load the file into BigQuery without any errors.
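Both checks are easy to script before running the load. The sketch below is an illustration, not part of the bq tooling: bq_mode and rows_with_wrong_width are hypothetical helper names, and the sample text is made up. The mapping itself follows the correction described above (IS_NULLABLE YES means NULLABLE, NO means REQUIRED).

```python
import csv
import io

def bq_mode(is_nullable):
    """Map an IS_NULLABLE value ('YES' or 'NO') to a BigQuery column mode."""
    return "NULLABLE" if is_nullable.strip().upper() == "YES" else "REQUIRED"

def rows_with_wrong_width(text, delimiter, expected):
    """Return 1-based line numbers whose field count differs from expected."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [n for n, row in enumerate(reader, start=1) if len(row) != expected]

# Hypothetical pipe-delimited sample with one short row.
sample = "a|b|c\n1|2|3\n4|5\n"
bad_lines = rows_with_wrong_width(sample, "|", 3)
print(bq_mode("YES"), bq_mode("NO"), bad_lines)
```

A short row like line 3 here is exactly the situation in which a REQUIRED column at a later index is reported as missing.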

what are the data types allowed with the sqoop option "--map-column-java"?

I want to use sqoop import to import data from SQL Server, however I am facing some data type conversion issues, and I want to use "--map-column-java" to solve that.
Just in case anybody wants to suggest "--map-column-hive": I can't use it because I am importing with "--as-parquetfile", so I have to cast the column data types before they are written to the file.
So, what are the data types allowed with the sqoop option "--map-column-java"?
P.S.
Especially I want to know the "datetime" data type that works with "--map-column-java"
It's pretty tough to load from a database into Parquet through sqoop while keeping the source data types. For example, you can't load timestamp because it's not supported.
I suggest the following workaround:
Load with sqoop with all the data types mapped to String;
Insert from table 1 (with all the data types as string) into table 2, using cast (as timestamp, as decimal ... etc);
Example:
--map-column-java "ID=String,NR_CARD=String,TIP_CARD_ID=String,CONT_CURENT_ID=String,AUTORIZ_CONTURI_ID=String,TIP_STARE_ID=String,DATA_STARE=String,COMIS=String,BUGETARI_ID=String,DATA_SOLICITARII=String,DATA_EMITERII=String,DATA_VALABILITATII=String,TIP_DESCOPERIT_ID=String,BRANCH_CODE_EMIT=String,ORG_ID=String,DATA_REGEN=String,FIRMA_ID=String,VOUCHER_BLOC=String,CANAL_CERERE=String,CODE_BUG_OPER=String,CREATED_BY=String,CREATION_DATE=String,LAST_UPDATED_BY=String,LAST_UPDATE_DATE=String,LAST_UPDATE_LOGIN=String,IDPAN=String,MOTIV_STARE_ID=String,DATA_ACTIVARII=String" \
This way you will end up with all the data types correctly loaded from the source.
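Typing that argument by hand gets error-prone for wide tables, so it can be generated from a column list. A minimal sketch, assuming you already have the column names; map_all_to_string is a hypothetical helper and the three columns are taken from the example above.

```python
def map_all_to_string(columns):
    """Build the value for --map-column-java with every column mapped to String."""
    return ",".join(f"{name}=String" for name in columns)

# A few column names from the example above; use your own table's columns.
columns = ["ID", "NR_CARD", "DATA_STARE"]
arg = map_all_to_string(columns)
print(f'--map-column-java "{arg}"')
```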

Issues loading CSV into BigQuery table

I'm trying to create a BigQuery table using a pretty simple CSV file I have stored in GCS.
I keep getting the same error over and over again:
Could not parse '1/1/2008' as datetime for field XXX
I've checked that the csv file isn't corrupted, and I've managed to upload everything into one column so the file is readable by BigQuery.
I've added the word NULL to any empty fields thinking consecutive delimiters may be causing the issues but I am still facing the same issue.
I know data, I understand data and CSV files.
BigQuery cannot cast '1/1/2008' as DATETIME and would rather expect something like '2008-1-1'.
So you can either modify your CSV file or just use STRING for that XXX field and then translate it into DATETIME in your queries, like below:
#standardSQL
SELECT PARSE_DATETIME('%d/%m/%Y', '1/1/2008')
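One thing worth checking before settling on a format string: '1/1/2008' parses the same way whether the file is day-first or month-first, so it cannot tell you which order the data uses. The Python sketch below illustrates the difference with hypothetical values; parses_as is a made-up helper name.

```python
from datetime import datetime

def parses_as(value, fmt):
    """Return the parsed datetime, or None if the value does not fit the format."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None

value = "2/1/2008"
day_first = parses_as(value, "%d/%m/%Y")    # read as 2 January 2008
month_first = parses_as(value, "%m/%d/%Y")  # read as 1 February 2008

# A value with a first part above 12 only fits the day-first format.
only_day_first = parses_as("13/1/2008", "%m/%d/%Y")
print(day_first, month_first, only_day_first)
```

Looking for a row whose first or second number is greater than 12 is a quick way to confirm which order your file actually uses.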

Talend ETL tool

I am developing a migration tool and using Talend ETL tool (Free edition).
Challenges faced:-
Is it possible to create a Talend job that uses a dynamic schema every time it runs, i.e. no hard-coded mappings in the tMap component?
I want the user to provide an input CSV/Excel file, and the job should create the mappings on the basis of that input file. Is that possible in Talend?
Any other free or open-source ETL tool would also be helpful, as would any sample job.
Yes, this can be done in Talend, but if you do not wish to use a tMap then your table and file must match exactly. We have implemented it for stage tables in which every column is varchar. This works when you are loading raw data into a stage table and your validation is done after the load, prior to loading the stage data into a data warehouse.
Here is a summary of our method:
The filenames contain the table name, so the process starts with a tFileList and parses the table name out of the file name.
Using tMSSQLColumnList, obtain each column name, type, and length for the table (one way is to store it as an inline table in tFixedFlowInput).
Run this through a tSetDynamicSchema to produce your dynamic schema for that table.
Use a file input that references the dynamic schema.
Load that into an MSSQLOutput, again referencing the dynamic schema.
One more note on data types. It may work with data types other than varchar, but our stage tables only have varchar and datetime. We had issues with datetime, so we filtered out those column types with a tMap.
Keep in mind, this is a summary to point you in the right direction, not a precise tutorial. But with this info in your hands, it can save you many hours of work while building your solution.
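For readers who want to see the shape of the idea outside Talend, here is a rough Python sketch using an in-memory SQLite database. It is not the Talend job described above and makes its own assumptions: the table name is taken from the file name, the columns are taken from the CSV header (rather than from the target table's metadata), every column is text, and load_stage plus the customers.csv sample are made up for illustration.

```python
import csv
import io
import sqlite3
from pathlib import Path

def load_stage(conn, filename, text):
    """Create an all-text stage table named after the file and load the rows."""
    table = Path(filename).stem            # table name parsed from the file name
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                  # column names discovered at run time
    # Names are quoted but assumed to be trusted; this is a sketch only.
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return table

conn = sqlite3.connect(":memory:")
sample = "id,name,joined\n1,Ann,1/1/16 19:00\n2,Bob,\n"   # hypothetical input file
table = load_stage(conn, "customers.csv", sample)
rows = conn.execute(f'SELECT * FROM "{table}" ORDER BY id').fetchall()
print(table, rows)
```

As with the stage tables described above, nothing is validated at load time; type checks and conversions happen afterwards, before the data moves on to the warehouse.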