Trying to upload a CSV into Snowflake and the date column isn't recognized - sql

The column contains dates in the following format:
dd/mm/yyyy
I've tried DATE, TIMESTAMP, etc., but whatever I do, I can't seem to upload the file.

In the classic UI you can click on the Load Table button and follow the dialogs to upload your file. It is a bit hard to find: click on Databases to the right of the big Snowflake icon, then select a database and a table, and you should see the button. In the wizard there is a step for defining the 'File Format'. There, you have to scroll down to define the date format. (see Classic snowflake UI)
Without the classic UI you have to install SnowSQL on your device first (https://docs.snowflake.com/en/user-guide/snowsql-install-config.html).
Start SnowSQL and apply the following steps:
Use the database you want to upload the file to. You need various privileges for creating a stage, a file format, and a table. E.g. USE TEST_DB
Create the file format you want to use for uploading your CSV file. E.g.
CREATE FILE FORMAT "TEST_DB"."PUBLIC".MY_FILE_FORMAT TYPE = 'CSV' DATE_FORMAT = 'dd/mm/yyyy';
Create a stage using this file format
CREATE STAGE MY_STAGE file_format = "TEST_DB"."PUBLIC".MY_FILE_FORMAT;
Now you can put your file to this stage
PUT file://<file_path>/file.csv @MY_STAGE;
You can check the upload with
SELECT d.$1, ..., d.$N FROM @MY_STAGE/file.csv d;
Then, create your table.
Copy the content from your stage to your table. If you want to transform your data at this point, you have to use an inner select (a sketch follows below). If not, then the following command is enough.
COPY INTO mycsvtable FROM @MY_STAGE/file.csv;
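A hedged sketch of such a transforming copy (the target table, column positions, and column names are illustrative; the stage's file format is assumed to be the one created above):
-- COPY with a transforming inner SELECT over the staged file.
-- $1, $2, $3 refer to the CSV columns by position (illustrative).
COPY INTO mycsvtable (id, my_date, amount)
FROM (
  SELECT d.$1,
         TO_DATE(d.$2, 'DD/MM/YYYY'),  -- parse the dd/mm/yyyy column explicitly
         d.$3
  FROM @MY_STAGE/file.csv d
);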
You can find documentation for configuring the file format at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
You can find documentation for configuring the stage at https://docs.snowflake.com/en/sql-reference/sql/create-stage.html
You can find documentation for copying the staged file into a table at https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
I recommend that you upload your file with automatic date detection disabled, OR that your initial table has a string column instead of a date column. IMHO, it is easier to transform your upload afterwards using the try_to_date function; with this function it is much easier to handle possible parsing errors.
e.g. SELECT try_to_date(date_column, 'dd/mm/yyyy') as MY_DATE, IFF(MY_DATE IS NULL, date_column, NULL) as MY_DATE_NOT_PARSABLE FROM upload;
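A hedged sketch of that approach (table and column names are illustrative): land the raw value in a string column first, then derive the typed date while keeping unparseable values visible.
-- Landing table keeps the date as a string.
CREATE TABLE upload_raw (id VARCHAR, date_column VARCHAR, amount VARCHAR);

COPY INTO upload_raw FROM @MY_STAGE/file.csv;

-- Derive the typed date; rows that fail to parse keep their raw text.
CREATE TABLE upload_typed AS
SELECT id,
       TRY_TO_DATE(date_column, 'dd/mm/yyyy') AS my_date,
       IFF(TRY_TO_DATE(date_column, 'dd/mm/yyyy') IS NULL, date_column, NULL) AS my_date_not_parsable,
       amount
FROM upload_raw;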
As you can see, there is quite a lot to do just to load a simple CSV file into Snowflake. It becomes even more complicated when you take into account that every step can cause specific failures and that your file might contain erroneous lines. This is why my team and I are working at Datameer to make these types of tasks easier. We aim for a simple drag and drop solution that does most of the work for you. We would be happy if you would try it out here: https://www.datameer.com/upload-csv-to-snowflake/

Related

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while writing the data to the staging area, the data is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000 and that causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I have the same error; a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
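For example, a hedged sketch of the source-side query (the SLA columns come from the question; the table name is illustrative):
-- Cast the TIME columns to DATETIME on the Azure SQL side so the copy
-- activity writes a value Snowflake can parse; convert back to TIME
-- on the Snowflake side afterwards.
SELECT
    CAST(SLA_Processing_start_time AS datetime) AS SLA_Processing_start_time,
    CAST(SLA_Processing_end_time   AS datetime) AS SLA_Processing_end_time
    -- plus the remaining columns unchanged
FROM dbo.SourceTable;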
Recommended solution:
Use the Copy data activity to land your data on BlobStorage / ADLS (the built-in connector does this anyway), preferably in the Parquet file format and in a structure you design yourself (see Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS (see the sketch after this list).
Add a Lookup activity and load the data from those files into a table; you can use a regular query or write a stored procedure and call it.
This way you will have more control over what is happening, and you will be building a data lake solution for your organization.
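A hedged sketch of the Snowflake side of that setup (stage name, container URL, credential, and column names are all illustrative):
-- Permanent external stage over the BlobStorage / ADLS container.
CREATE STAGE my_adls_stage
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/landing/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas_token>')
  FILE_FORMAT = (TYPE = PARQUET);

-- Load the Parquet files, casting the SLA columns back to TIME.
COPY INTO my_table (sla_processing_start_time, sla_processing_end_time)
FROM (
  SELECT $1:SLA_Processing_start_time::TIME,
         $1:SLA_Processing_end_time::TIME
  FROM @my_adls_stage
)
PATTERN = '.*\\.parquet';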
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created by the direct-to-Snowflake copy, I ended up writing a plain file to blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it into two activities:
One activity takes the data from Azure SQL and saves it as a plain text file on blob storage.
The second activity reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

Trying to create a table and load data into same table using Databricks and SQL

I googled for a solution to create a table using Databricks and Azure SQL Server, and to load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like this to recursively loop through files in a folder and its sub-folders, and load data from files whose names match a string pattern, like 'ABC*', into a table. The blocker here is that I need the file name loaded into a field as well. So, I want to load data from MANY files into 4 fields of actual data and 1 field that captures the file name. The only way I can distinguish the different data sets is by the file name. Is this possible? Or is this an exercise in futility?
My suggestion is to use the Azure SQL Spark library, as also mentioned in the documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' is what you want to use to get good performance. Just load your file into a DataFrame and bulk copy it to Azure SQL.
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
// Read all pipe-delimited, gzipped ABC* files from the mounted folder
val myDFCsv = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!
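For the file-name-per-row requirement from the question, Spark exposes the input_file_name() function; a hedged Spark SQL sketch (the view name is illustrative, the path and options mirror the snippet above):
-- Register the wildcard path as a CSV-backed view, then capture the source file per row.
CREATE OR REPLACE TEMPORARY VIEW abc_files
USING csv
OPTIONS (
  path "mnt/rawdata/2019/01/01/client/ABC*.gz",
  sep "|",
  header "false",
  inferSchema "true"
);

SELECT *, input_file_name() AS source_file
FROM abc_files;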

Getting the path of the CSV file from which data is being loaded in SQL*Loader

I am trying to import data into a table in Oracle from a CSV file using SQL*Loader. However, I want to add two additional attributes, namely the date of upload and the file path from which the data is being imported. I can add the date using SYSDATE; is there a similar method of obtaining the file path?
The trouble with using SYSDATE is that it will not be the same for all rows. This can make it difficult if you do more than one load in a day and need to back out a particular load. Consider using a batch_id also using the method in this post: Insert rows with batch id using sqlldr
I suspect it could be adapted to capture SYSDATE as well so it would be the same for all rows. Give it a try and let us know. At any rate, using a batch_id from a sequence makes it much easier to deal with problems should you need to delete a load based on its batch_id.
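A hedged sketch of that idea (all names are illustrative): record one row per load in a small batch table, so every data row from that run can share the same batch_id, load time, and source file path.
-- Illustrative objects; adapt to your schema.
CREATE SEQUENCE load_batch_seq;

CREATE TABLE load_batch (
  batch_id    NUMBER        PRIMARY KEY,
  load_date   DATE,
  source_file VARCHAR2(400)
);

-- Run once per sqlldr invocation, before the load, and pass the same
-- batch_id into the load (e.g. as a constant in a generated control file).
INSERT INTO load_batch (batch_id, load_date, source_file)
VALUES (load_batch_seq.NEXTVAL, SYSDATE, '/data/incoming/my_file.csv');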

Dynamically populate external tables location

I'm trying to use oracle external tables to load flat files into a database but I'm having a bit of an issue with the location clause. The files we receive are appended with several pieces of information including the date so I was hoping to use wildcards in the location clause but it doesn't look like I'm able to.
I think I'm right in assuming I'm unable to use wildcards, does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify: input directory, file mask, external table, etc. When a file is found in the directory, the shell script generates a list of files matching the file mask. For each file found, issue an alter table command to change the location on the given external table to that file and launch the rest of the PL/SQL associated with that file. This can be repeated for each file matching the mask. I guess the benefit of this is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end which appears to be the only way.
I have a file watcher that looks for files in a given input dir with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an alter table on the external table with the list of new file names.
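A hedged sketch of the generated statement (table and file names are illustrative):
-- Point the external table at the files found in this run.
ALTER TABLE my_external_table
  LOCATION ('sales_20240101.csv', 'sales_20240102.csv');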
For me this wasn't much of an issue as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone searching for ages for a solution.

How to truncate then append to a RAW file in SSIS

I'm attempting to pull data from several spreadsheets that reside in a single folder, then put all the data into a single csv file along with column headings.
I have a foreach loop container set up to iterate through each of the filenames in the folder, which then appends the data to a RAW file. However, as many seem to have run into, there does not appear to be a built-in option that will simply truncate the RAW file before entering the loop container.
Jamie Thompson described a similar situation in his blog here, but the links to the examples do not seem to work. Does anyone have an easy way to truncate the RAW file in a stand alone step before entering the foreach loop?
The approach I always use is to create a data flow with the appropriate metadata format but no actual rows and route that to a RAW file set to Create new.
In my existing data flow, I look at the metadata that populates the RAW file and then craft a select statement that mimics it.
e.g.
SELECT
CAST(NULL AS varchar(70)) AS AddressLine1
, CAST(NULL AS bigint) AS SomeBigInt
, CAST(NULL AS nvarchar(max)) AS PerformanceLOL
Here's what I did:
Make your initial raw file
Make a copy of that raw file
Use a file task to replace the staging file at the beginning of your package/job every time.
In my use case I have 20 foreach threads writing to their own files all at the same time. No thread can create and then append, so you just "recreate" by copying over an 'empty' raw file that already has the metadata assigned, before calling the threads.