How to truncate then append to a RAW file in SSIS

I'm attempting to pull data from several spreadsheets that reside in a single folder, then put all the data into a single CSV file along with column headings.
I have a Foreach Loop Container set up to iterate through each of the filenames in the folder and append the data to a RAW file. However, as many others seem to have run into, there does not appear to be a built-in option that allows one to simply truncate the RAW file before entering the loop container.
Jamie Thomson described a similar situation in his blog here, but the links to the examples do not seem to work. Does anyone have an easy way to truncate the RAW file in a standalone step before entering the foreach loop?

The approach I always use is to create a data flow with the appropriate metadata format but no actual rows, and route that to a Raw File Destination set to "Create always".
In my existing data flow, I look at the metadata that populates the RAW file and then craft a SELECT statement that mimics it.
e.g.
SELECT
    CAST(NULL AS varchar(70)) AS AddressLine1
,   CAST(NULL AS bigint) AS SomeBigInt
,   CAST(NULL AS nvarchar(max)) AS PerformanceLOL
-- without this filter, the query would still emit one all-NULL row
WHERE 1 = 0;

Here's what I did:
Make your initial raw file
Make a copy of that raw file
Use a File System Task to copy the empty file over the staging file at the beginning of your package/job every time.
In my use case I have 20 foreach threads writing to their own files all at the same time. A thread can't both create and then append to the same RAW file, so you just "recreate" by copying an 'empty' raw file that already has the metadata assigned over the staging file, before calling the threads.

Related

Trying to upload a CSV into Snowflake and the date column isn't recognized

The column contains dates in the following format:
dd/mm/yyyy
I've tried DATE, TIMESTAMP, etc., but whatever I do, I can't seem to upload the file.
In the classic UI you can click on the Load Table button and follow the dialogs to upload your file. It is a bit hard to find: click on Databases to the right of the big Snowflake icon, then select a database and a table, and you should see the button. In the wizard there is a step for defining the 'File Format'; there, you have to scroll down to define the date format.
Without the classic UI you have to install SnowSQL on your device first (https://docs.snowflake.com/en/user-guide/snowsql-install-config.html).
Start SnowSQL and apply the following steps:
Use the database you want to upload the file to. You need various privileges for creating a stage, a file format, and a table. E.g. USE TEST_DB;
Create the file format you want to use for uploading your CSV file. E.g.
CREATE FILE FORMAT "TEST_DB"."PUBLIC".MY_FILE_FORMAT TYPE = 'CSV' DATE_FORMAT = 'dd/mm/yyyy';
Create a stage using this file format
CREATE STAGE MY_STAGE file_format = "TEST_DB"."PUBLIC".MY_FILE_FORMAT;
Now you can put your file into this stage
PUT file://<file_path>/file.csv @MY_STAGE;
You can check the upload with
SELECT d.$1, ..., d.$N FROM @MY_STAGE/file.csv d;
Then, create your table.
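For example, a minimal sketch with hypothetical columns, keeping the date as a string so it can be parsed explicitly afterwards (per the advice further down):
CREATE TABLE mycsvtable (
    id          NUMBER,    -- hypothetical column
    date_column VARCHAR,   -- kept as a string; parse with TRY_TO_DATE later
    amount      NUMBER     -- hypothetical column
);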
Copy the content from your stage to your table. If you want to transform your data at this point, you have to use an inner SELECT (a sketch follows the next command). If not, the following command is enough.
COPY INTO mycsvtable FROM @MY_STAGE/file.csv;
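And if you do want to transform during the load instead (say, parsing the date into a proper DATE column), a sketch using hypothetical positional columns that match the three-column layout assumed above:
COPY INTO mycsvtable
FROM (
    SELECT t.$1,
           TRY_TO_DATE(t.$2, 'DD/MM/YYYY'),  -- hypothetical: date in the second field
           t.$3
    FROM @MY_STAGE/file.csv t
);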
You can find documentation for configuring the fileupload at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
You can find documentation for configuring the stage at https://docs.snowflake.com/en/sql-reference/sql/create-stage.html
You can find documentation for copying the staged file into a table at https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
I recommend that you upload your file with automatic date detection disabled, OR that your initial table has a string column instead of a date column. IMHO, it is easier to transform your upload afterwards using the TRY_TO_DATE function; that function makes it much easier to handle possible parsing errors.
e.g. SELECT TRY_TO_DATE(date_column, 'DD/MM/YYYY') AS my_date, IFNULL(my_date, date_column) AS my_date_or_raw_value FROM upload;
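A small follow-up sketch along the same lines, assuming the staging table is named upload as above: isolate the rows whose dates fail to parse so they can be inspected or corrected.
-- rows whose date string could not be parsed with the expected format
SELECT *
FROM upload
WHERE date_column IS NOT NULL
  AND TRY_TO_DATE(date_column, 'DD/MM/YYYY') IS NULL;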
You can see that there is quite a lot to do just to load a simple CSV file into Snowflake. It becomes even more complicated when you take into account that every step can cause specific failures and that your file might contain erroneous lines. This is why my team and I are working at Datameer to make these types of tasks easier. We aim for a simple drag-and-drop solution that does most of the work for you. We would be happy if you would try it out here: https://www.datameer.com/upload-csv-to-snowflake/

Azure Data Factory 2: How to split a file into multiple output files

I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
Within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use a Data Flow: use the Derived Column transformation to create a filename column, then use that filename column in the sink. Details on how to implement dynamic filenames in ADF with Mapping Data Flows are described here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line.
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I were tackling this problem, I would create a logic app with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.

Trying to create a table and load data into the same table using Databricks and SQL

I Googled for a solution to create a table, using Databricks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
  user "u_name",
  password "p_wd",
  dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and its sub-folders, and load data from all files matching a string pattern, like 'ABC*', into a table. The blocker here is that I need the file name loaded into a field as well. So, I want to load data from MANY files into 4 fields of actual data and 1 field that captures the file name. The only way I can distinguish the different data sets is by the file name. Is this possible? Or is this an exercise in futility?
My suggestion is to use the Azure SQL Spark library, as also mentioned in the documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
'Bulk Copy' is what you want to use for good performance. Just load your file into a DataFrame and bulk copy it to Azure SQL:
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, the answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
// read every pipe-delimited ABC* file in the folder in a single pass
val myDFCsv = spark.read.format("csv")
  .option("sep","|")
  .option("inferSchema","true")
  .option("header","false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.count()
Thanks for the point in the right direction, mauridb!
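A note on the file-name requirement from the question: Spark exposes the source file per row via the built-in input_file_name() function (in the Scala reader above, .withColumn("source_file", input_file_name()) does the same job). A minimal Spark SQL sketch; it assumes the same mount path, and the csv.`path` shorthand reads with default comma-delimited options, so it is illustrative rather than a drop-in for these pipe-delimited files:
-- tag every row with the file it came from
SELECT *, input_file_name() AS source_file
FROM csv.`/mnt/rawdata/2019/01/01/client/ABC*.gz`;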

How to load multiple CSV files into multiple tables

I have multiple CSV files in a folder.
Example:
Member.csv
Leader.csv
I need to load them into database tables.
I have worked on this using a Foreach Loop Container, Data Flow Task, Excel Source and OLE DB Destination.
We can do it by using Expressions and Precedence Constraints, but how can I do it using a Script Task if I have more than 10 files? I got stuck on this one.
We have a similar issue; our solution is a mixture of the suggestions above.
We have a number of file types sent from our client on a daily basis.
These have a specific filename pattern (e.g. SalesTransaction20160218.csv, Product20160218.csv).
Each of these file types has a staging "landing" table of the structure you expect.
We then have a .NET Script Task that takes the filename pattern and loads that data into the landing table.
There are also various checks done within the CSV parser - matching the number of columns, some basic data validation - before loading into the landing table.
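As an illustration, a sketch of what such a landing table might look like, with hypothetical column names (everything lands as text, so the parser's validation, not the table, decides what is well-formed):
-- hypothetical landing table for the SalesTransactionYYYYMMDD.csv files
CREATE TABLE landing.SalesTransaction (
    TransactionId   varchar(50),
    TransactionDate varchar(50),
    Amount          varchar(50),
    SourceFileName  varchar(260),            -- which file the row came from
    LoadedAt        datetime DEFAULT GETDATE()
);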
We are not good enough .NET programmers to be able to dynamically parse an unknown file structure, create the SQL table and then load the data in. I expect it is feasible; after all, that is what the SSIS Import/Export Wizard does (with some manual intervention).
As an alternative to this (the process is quite delicate), we are experimenting with an HDFS data landing area, which allows us to use analytic tools like R to parse the data within HDFS, and then Pig to load the data into SQL.

Dynamically populate external tables location

I'm trying to use Oracle external tables to load flat files into a database, but I'm having a bit of an issue with the LOCATION clause. The names of the files we receive are appended with several pieces of information, including the date, so I was hoping to use wildcards in the LOCATION clause, but it doesn't look like I'm able to.
I think I'm right in assuming I'm unable to use wildcards; does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify: input directory, file mask, external table, etc. Then, when a file is found in the directory, the shell script generates a list of files matching the file mask. For each file found, it issues an ALTER TABLE command to change the location on the given external table to that file and launches the rest of the PL/SQL associated with that file. This can be repeated for each file matching the mask. I guess the benefit of this approach is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end, which appears to be the only way.
I have a file watcher that looks for files in a given input directory with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an ALTER TABLE on the external table with the list of new file names.
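For example, a minimal sketch with hypothetical table and file names; the LOCATION clause replaces the external table's entire file list in one statement:
-- point the external table at the files the watcher just found
ALTER TABLE sales_ext
  LOCATION ('SalesExtract_20160218.dat', 'SalesExtract_20160219.dat');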
For me this wasn't much of an issue, as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone from searching for ages for a solution.