My Azure SQL Table contains 6 months' data. I need to split data monthly wise stored into blob storage (Like, Sales_01-01-2021.csv) that file need only Jan month data only.
You can use Conditional split actives in Data Flow:
The conditional split transformation routes data rows to different
streams based on matching conditions. The conditional split
transformation is similar to a CASE decision structure in a
programming language. The transformation evaluates expressions, and
based on the results, directs the data row to the specified stream.
Build the condition can achieve your request. For example:
This is my source dataset, date format is "yyyy-mm-dd":
I use the condition to split the source to get the data in Mar 2021:
Then just create the sink for the output.
Related
My issue is similar to this one Multiple data types in a Power BI matrix but I've got a bit of a different setup that's throwing everything off.
What I'm trying to do is create a matrix table with several metrics that are categorized as Current (raw data values) and Prior (year over year percent growth/decline). I've created some dummy data in Excel to get the format the way I want it in PowerBI (see below):
Desired Format
As you can see the Current values are coming in as integers and the Prior % numbers as percentages which is exactly what I want; however, I was able to accomplish this through a custom column with the following formula:
Revenue2 = IF(Scorecard2[Current_Prior] = "Current", FORMAT(FIXED(Scorecard2[Revenue],0), "$#,###"), FORMAT(Scorecard2[Revenue], "Percent"))
The problem is that the data comes from a SQL query and you can't use the FORMAT() function in DirectQuery. Is there a way I can have two different datatypes in the same column of data? See below for how the SQL data comes into PowerBI (I can change this if need be):
SQL
Create 2 separate measures, one for the Current second for Prior, and format these measures.
Probably you can also use a case in SQL query to format your data to bring it as STRING.
What I wound up doing was reformatting the SQL code to look like this:
Solution
That way Current/Prior are have two separate values and the "metric" is categorical.
I got the idea from this post:
Simple way to transpose columns and rows in SQL?
I'm creating a pipeline to ingest a series of csvs into an Azure SQL Database
The CSVs come from a piece of medical software named SystmOne, the CSVs are either a full dataset or a delta dataset. The only difference in their schemas is the presence of an additional column: RemovedData.
The presence of this column will require an additional step in the pipeline (deleting any row from the database with RemovedData == true).
Is there a way in ADF or (ADF with Data Flow Preview) to query the file for the presence of a column and split the pipeline based on the result?
I have no control over the intial output of the file.
You can check the number of columns in your source dataset, with getMetadataActivty columnCount property, and then with If Activity do what you want.
Expression in IF activity: #equals(activity('YourGet Metadata').output.columnCount,numberOfColumns)
And then based on true or false you choose your dataset with propper schema.
I work in a sales based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 & PostCode(ZIP). These rows a seeded with a unique ID in the schema.
The leads come in from various sources and are compiled onto a spread sheet and then imported into SQL 2012 using SSIS.
After a validation check to see if a file exists we then use a simple data flow which consists of an Excel source, Derived Column, Data Conversion and finally an OLE DB Destination.
My requirement I'm sure has a relatively simple solution. I understand what I need to achieve is the first step. I need to take a sample of data from the last rolling two months, if 2 or more fields in the source excel file match the corresponding field in the destination sql table then I want to redirect to another table.
I am unsure of which combination of components I could use to achieve this. I believe that Fuzzy lookup may not be what I am looking for as I am looking to find exact field matches, I have looked at the lookup component but I am unsure if this is the way to go.
Could anyone please provide some advice on how I can best achieve this as simply as possible.
You can use the Lookup to check for matches in your existing table. However, it will be fairly complicated to implement the requirement of checking for any two or more fields matching. Your expression would be long and complex basically consisting of:
(using pseudo code for readability)
IIF((a=a AND b=b) OR (a=a AND c=c) OR (b=b AND c=c) OR ...and so on
for every combination of two columns you want to test
I would do this by importing the entire spreadsheet to a staging table, and doing the existing rows check in a SQL stored proc that moves the data to the desired destination table.
I am currently entering data into a SQL Server database using SSIS. The plan is for it to do this each week but the day that it happens may differ depending on when the data will be pushed through.
I use SSIS to grab data from an Excel worksheet and enter each row into the database (about 150 rows per week). The only common denominator is the date between all the rows. I want to add a date to each of the rows on the day that it gets pushed through. Because the push date may differ I can't use the current date I want to use a week from the previous date entered for that row.
But because there are about 150 rows I don't know how to achieve this. It would be nice if I could set this up in SQL Server where every time a new set of rows are entered it adds 7 days from the previous set of rows. But I would also be happy to do this in SSIS.
Does anyone have any clue how to achieve this? Alternatively, I don't mind doing this in C# either.
Here's one way to do what you want:
Create a column for tracking the data entry date in your target table.
Add an Execute SQL Task before the Data Flow Task. This task will retrieve the latest data entry date + 7 days. The query should be something like:
select dateadd(day,7,max(trackdate)) from targettable
Assign the SQL result to a package variable.
Add a Derived Column Transformation between your Source and Destination components in the Data Flow Task. Create a dummy column to hold the tracking date and assign the variable to it.
When you map the Excel to table in a Data Flow task, map the dummy column created earlier to the tracking date column. Now when you write the data to DB, your tracking column will have the desired date.
Derived Column Transformation
I'm trying to split a table into multiple tables based on the value of a given column using Talend Open Studio. Let's say this column can contain any of the integer values of 1, 2, 3, etc. then according to this value, these rows should go to table_1, table_2, table_3 etc.
It would be best if I could solve this when the number of different values in that column is not known in advance, but for now we can assume that all these output tables exists already. The bottom line is that the number of different values and therefore the number of different tables are high enough that setting up the individual filters manually is not an option.
Is this possible to solve this using Talend Open Studio or any similiary open source ETL tools like Pentaho Keetle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
In PDI or Pentaho Kettle you could do this with partitioning. (A right click option on the step IIRC) Partitioning in PDI is designed for exactly this sort of problem.
Yes that's Possible to do and split the data on the basis of single column to different table, but for that you need to create table dynamically :-
tFileInputDelimited->tFlowtoIterate ->tFixedFlowInput->and the can use
globalMap() to get the column values and use the same to seperate the
data to different tables. -> And the can use globalMap(Columnused to
seperate data) in table name.
The first solution that came to my mind was using the replicator to transport the current row to three filters which act as guard and only let rows through with either 1 2 or 3 in the given column. pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want, pic: http://i.imgur.com/8LR7Q.png