How to load grouped data with SSIS - sql

I have a tricky flat file data source. The data is grouped, like this:
Country  City
U.S.     New York
         Washington
         Baltimore
Canada   Toronto
         Vancouver
But I want it to be this format when it's loaded in to the database:
Country  City
U.S.     New York
U.S.     Washington
U.S.     Baltimore
Canada   Toronto
Canada   Vancouver
Has anyone met such a problem before? Any idea how to deal with it?
The only idea I have now is to use a cursor, but that is just too slow.
Thank you!

The answer by cha will work, but here is another in case you need to do it in SSIS without temporary/staging tables:
You can run your dataflow through a Script Transformation that uses a DataFlow-level variable. As each row comes in the script checks the value of the Country column.
If it has a non-blank value, then populate the variable with that value, and pass it along in the dataflow.
If Country has a blank value, then overwrite it with the value of the variable, which will be the last non-blank Country value seen.
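To make the fill-down rule concrete, here is a minimal sketch of the same logic in Python (the real Script Component would be C# or VB.NET inside SSIS; the function name and sample data are illustrative):

```python
def fill_down(rows):
    """Carry the last non-blank Country forward onto rows where it is blank."""
    last_country = ""
    filled = []
    for country, city in rows:
        if country.strip():                   # non-blank: remember it
            last_country = country.strip()
        filled.append((last_country, city))   # blank: reuse the remembered value
    return filled

sample = [("U.S.", "New York"), ("", "Washington"), ("", "Baltimore"),
          ("Canada", "Toronto"), ("", "Vancouver")]
result = fill_down(sample)
```

Each output row then carries the country of the nearest preceding non-blank row, which is exactly what the Script Transformation does one row at a time.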
EDIT: I looked up your error message and learned something new about Script Components (the Data Flow tool, as opposed to Script Tasks, the Control Flow tool):
The collection of ReadWriteVariables is only available in the
PostExecute method to maximize performance and minimize the risk of
locking conflicts. Therefore you cannot directly increment the value
of a package variable as you process each row of data. Increment the
value of a local variable instead, and set the value of the package
variable to the value of the local variable in the PostExecute method
after all data has been processed. You can also use the
VariableDispenser property to work around this limitation, as
described later in this topic. However, writing directly to a package
variable as each row is processed will negatively impact performance
and increase the risk of locking conflicts.
That comes from this MSDN article, which also has more information about the VariableDispenser work-around if you want to go that route. Apparently I misled you above when I said you can set the value of the package variable in the script: you have to use a variable that is local to the script, and then assign it to the package variable in the PostExecute method. I can't tell from the article whether that means you will also be unable to read the package variable inside the script; if that's the case, the VariableDispenser would be the only option. Or I suppose you could create another variable that the script has read-only access to, and set its value to an expression so that it always mirrors the read-write variable. That might work.

Yes, it is possible. First you need to load the data to a table with an IDENTITY column:
-- drop table #t
CREATE TABLE #t (id INTEGER IDENTITY PRIMARY KEY,
Country VARCHAR(20),
City VARCHAR(20))
INSERT INTO #t(Country, City)
SELECT a.Country, a.City
FROM OPENROWSET( BULK 'c:\import.txt',
FORMATFILE = 'c:\format.fmt',
FIRSTROW = 2) AS a;
select * from #t
The result will be:
id          Country              City
----------- -------------------- --------------------
1           U.S.                 New York
2                                Washington
3                                Baltimore
4           Canada               Toronto
5                                Vancouver
And now with a bit of recursive CTE magic you can populate the missing details:
;WITH a as(
SELECT Country
,City
,ID
FROM #t WHERE ID = 1
UNION ALL
SELECT COALESCE(NULLIF(LTrim(#t.Country), ''),a.Country)
,#t.City
,#t.ID
FROM a INNER JOIN #t ON a.ID+1 = #t.ID
)
SELECT * FROM a
OPTION (MAXRECURSION 0)
Result:
Country              City                 ID
-------------------- -------------------- -----------
U.S.                 New York             1
U.S.                 Washington           2
U.S.                 Baltimore            3
Canada               Toronto              4
Canada               Vancouver            5
Update:
As Tab Alleman suggested below, the same result can be achieved without the recursive query:
SELECT ID
, COALESCE(NULLIF(LTrim(a.Country), ''), (SELECT TOP 1 Country FROM #t t WHERE t.ID < a.ID AND LTrim(t.Country) <> '' ORDER BY t.ID DESC))
, City
FROM #t a
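If you want to sanity-check the correlated-subquery version without SQL Server, the same idea runs against an in-memory SQLite database (T-SQL's TOP 1 becomes LIMIT 1 there; the temp table is shortened to t):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, Country TEXT, City TEXT);
INSERT INTO t VALUES
 (1, 'U.S.',   'New York'),
 (2, '',       'Washington'),
 (3, '',       'Baltimore'),
 (4, 'Canada', 'Toronto'),
 (5, '',       'Vancouver');
""")
rows = con.execute("""
SELECT id,
       COALESCE(NULLIF(LTRIM(a.Country), ''),
                (SELECT t.Country FROM t
                 WHERE t.id < a.id AND LTRIM(t.Country) <> ''
                 ORDER BY t.id DESC LIMIT 1)) AS Country,
       City
FROM t a
ORDER BY id
""").fetchall()
```

Blank countries come back filled with the nearest preceding non-blank value, matching the recursive-CTE result above.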
BTW, the format file for your input data is this (if you want to try the scripts save the input data as c:\import.txt and the format file below as c:\format.fmt):
9.0
2
1 SQLCHAR 0 11 "" 1 Country SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "\r\n" 2 City SQL_Latin1_General_CP1_CI_AS

Related

Set based approach to SQL Server insert where one column is calculated as Max from same column

I'm wondering if this is one of those situations where I'm forced to use a cursor or if I can use a set based approach. I've searched for several hours and also tried to come up with a solution myself to no avail.
I've got a table, SuperSupplierCodes, that contains two columns: SuperSupplierCode INT, and SupplierName NVARCHAR(50).
SuperSupplierID  SupplierName
1                21ST CENTURY GRAPHIC TECHNOLOGIES LLC
2                3D SYSTEMS
3                3G
4                A A ABRASIVOS ARGENTINOS SAIC
5                A AND F DRUCKLUFTTECHNIK GMBH
6                A BAY STATIONERS
7                A C T TOOL AND ENGINEERING LLC
8                A HERZOG AG
9                A LI T DI MONTANARI MARCO AND CO SAS
11               A RAYMOND GMBH AND CO KG
I've got a second table with millions of rows in it containing financial data as well as the SupplierName column.
LocalSupplierName
23 JAN HOFMEYER ROAD
303 TAXICAB, LLC
3D MECA SARL
3D SYSTEMS
3E CO ENVIRONMENTAL, ECO. & EN
3E COMPANY
What I need to do is insert into the SuperSupplierCodes table such that each row gets the MAX(SuperSupplierCode) from the previous row, increments it by one, and inserts that into the SuperSupplierCode column along with the SupplierName from the second table.
I've tried the following, just as a test, that I might be able to use for the insert, but of course it will only do the increment once and try to use that same value for SuperSupplierCode for every row:
SELECT s.SuperSupplierID,
s.SupplierName,
s.SupplierAddress,
s.DateCreated,
s.DateModified,
s.SupplierCode,
s.PlantName,
s.id,
x.MaxSSC
FROM SuperSupplierCodes AS s
CROSS APPLY (SELECT MAX(SuperSupplierID)+1 AS MaxSSC FROM dbo.SuperSupplierCodes) x;
I don't like using cursors unless I absolutely have to. Is there a way to do this with T-SQL in a set based manner versus using a cursor?
Create the column as an identity, and insert the existing records once with the SET IDENTITY_INSERT ON option. Then switch it off; new rows will get their ids incremented automatically.
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-identity-insert-transact-sql?view=sql-server-2017
Why not something like this?
SELECT (SELECT MAX(SuperSupplierID) FROM dbo.SuperSupplierCodes) + ROW_NUMBER() OVER (ORDER BY s.DateCreated) AS SuperSupplierID,
s.SupplierName,
s.SupplierAddress,
s.DateCreated,
s.DateModified,
s.SupplierCode,
s.PlantName,
s.id
FROM SuperSupplierCodes AS s;
We use the above technique at my work all the time when inserting rows. If some have existing values, you can insert them all into the table and then change the above to only update values that are currently null.
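The arithmetic behind that technique (take the current MAX once, then add a per-row sequence number) can be sketched in a few lines of Python; the helper name is made up:

```python
def assign_super_supplier_ids(current_max, new_names):
    """Mimic (SELECT MAX(SuperSupplierID) ...) + ROW_NUMBER():
    each new row gets the next sequential id after the existing maximum."""
    return [(current_max + n, name) for n, name in enumerate(new_names, start=1)]

# e.g. the table currently tops out at 11, and two new suppliers arrive
new_rows = assign_super_supplier_ids(11, ["3D MECA SARL", "3E COMPANY"])
```

Because the MAX is read once and ROW_NUMBER supplies the offsets, every row in the same insert gets a distinct id, which is what the CROSS APPLY attempt could not do.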

columns manipulation in fast load

Hello, I am new to Teradata. I am loading a flat file into my TD DB using FastLoad.
My data set (a CSV file) has some issues: in some rows the City column contains proper data, but in others it contains NULL, and the value that belongs in City is shifted into the next column (Zipcode), and so on. As a result, some rows end up with an extra column due to the extra NULL. An example is given below. How can I resolve this kind of issue in FastLoad? Can someone answer with a SQL example?
City  Zipcode  Country
xyz   12       Esp
abc   11       Ger
NULL  def (City's data)  12 (Zipcode's data)  Por (Country's data)
What about a different approach? Instead of solving this in FastLoad, load your data into a temporary table like DATABASENAME.CITIES_TMP with a structure like below:
City | zip_code | country | column4
xyz  | 12       | Esp     |
NULL | abc      | 12      | Por
In the next step create the target table DATABASENAME.CITY with the structure:
City | zip_code | country
As a final step you need to run 2 INSERT queries:
INSERT INTO DATABASENAME.CITY (City, zip_code, country)
SELECT City, zip_code, country FROM DATABASENAME.CITIES_TMP
WHERE CITY not like 'NULL'/* WHERE CITY is not null - depends if null is a text value or just empty cell*/;
INSERT INTO DATABASENAME.CITY (City, zip_code, country)
SELECT Zip_code, country, column4 FROM DATABASENAME.CITIES_TMP
WHERE CITY like 'NULL' /* WHERE CITY is null - depends if null is a text value or just empty cell*/
Of course this will only work if all your data looks exactly like the sample you provided.
It also only suits a job you run once in a while. If you need to load data a few times a day it will be a little cumbersome, and you should instead build some kind of ETL process with, for example, a tool like Talend.
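The routing rule the two INSERTs implement can be sketched in Python; this assumes, as in the answer, that a shifted row is detected by its City cell (either empty or the literal text 'NULL'):

```python
def repair_shifted_rows(rows):
    """Split staging rows into clean (city, zip_code, country) triples.
    A row whose City cell is missing is shifted one column to the right,
    so its real values start in the zip_code column."""
    fixed = []
    for city, zip_code, country, column4 in rows:
        if city and city != "NULL":
            fixed.append((city, zip_code, country))     # first INSERT's rows
        else:
            fixed.append((zip_code, country, column4))  # second INSERT's rows
    return fixed

staged = [("xyz", "12", "Esp", None),
          ("NULL", "abc", "12", "Por")]
clean = repair_shifted_rows(staged)
```

The two branches correspond exactly to the two INSERT ... SELECT statements above.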

SQL Server : set all column aliases in a dynamic query

It's a bit of a long and convoluted story why I need to do this, but I will be getting a query string which I will then be executing with this code
EXECUTE sp_ExecuteSQL
I need to set the aliases of all the columns to "value". There could be a variable number of columns in the queries that are being passed in, and they could be all sorts of data types, for example
SELECT
Company, AddressNo, Address1, Town, County, Postcode
FROM Customers
SELECT
OrderNo, OrderType, CustomerNo, DeliveryNo, OrderDate
FROM Orders
Is this possible and relatively simple to do, or will I need to get the aliases included in the SQL queries? (It would be easier to avoid that and do it when we process the query.)
---Edit---
As an example, the output from the first query would be
Company   AddressNo  Address1     Town    County  Postcode
--------  ---------  -----------  ------  ------  --------
Dave Inc  12345      1 Main Road  Harlow  Essex   HA1 1AA
AA Tyres  12234      5 Main Road  Epping  Essex   EP1 1PP
I want it to be
value     value      value        value   value   value
--------  ---------  -----------  ------  ------  --------
Dave Inc  12345      1 Main Road  Harlow  Essex   HA1 1AA
AA Tyres  12234      5 Main Road  Epping  Essex   EP1 1PP
So each of the column has an alias of "value"
I could do this with
SELECT
Company AS 'value', AddressNo AS 'value', Address1 AS 'value', Town AS 'value', County AS 'value', Postcode AS 'value'
FROM Customers
but it would be better (it would save additional complexity in other steps in the process chain) if we didn't have to manually alias each column in the SQL we're feeding in to this section of the process.
Regarding the XY problem, this is a tiny section in a very large process chain, it would take pages to explain the whole process in detail - in essence, we're taking code out of our database triggers and putting it into a dynamic procedure; then we will have frontends that users will access to "edit" the SQL statements that are called by the triggers and these will then dynamically feed the results out into other systems. It works if we manually alias the SQL going in, but it would be neater if there was a way we could feed clean SQL into the process and then apply the aliases when the SQL is processed - it would keep us DRY, to start with.
I do not understand at all what you are trying to accomplish, but I believe the answer is no: there is no built-in way to globally predefine or override column aliases for ad hoc queries. You will need to code it yourself.
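One way to "code it yourself" is to probe the query's column list once and then wrap the original statement, aliasing every column. Here is a sketch in Python against SQLite (in SQL Server the probing step could use sp_describe_first_result_set instead; the helper name here is made up, and probing by executing the query is wasteful for big result sets):

```python
import sqlite3

def alias_all_as_value(con, sql):
    """Return a wrapped query in which every output column is aliased 'value'."""
    cols = [d[0] for d in con.execute(sql).description]   # probe column names
    alias_list = ", ".join('"{}" AS value'.format(c) for c in cols)
    return "SELECT {} FROM ({}) AS q".format(alias_list, sql)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customers (Company TEXT, Town TEXT)")
con.execute("INSERT INTO Customers VALUES ('Dave Inc', 'Harlow')")
wrapped = alias_all_as_value(con, "SELECT Company, Town FROM Customers")
names = [d[0] for d in con.execute(wrapped).description]
```

The fed-in SQL stays clean; the aliasing happens in the processing layer, which is the DRY property the asker wanted.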

SQL Server - copy data across tables , but copy the data only when it match with a specific column name

For example, I have these two tables:
dbo.fc_states
StateId  Name
6316     Alberta
6317     British Columbia
and dbo.fc_Query
Name            StatesName        StateId
Abbotsford      Quebec            NULL
Abee            Alberta           NULL
100 Mile House  British Columbia  NULL
OK, pretty straightforward: how do I copy the StateId over from fc_states to fc_Query, matching on StatesName? The result would be:
Name            StatesName        StateId
Abee            Alberta           6316
100 Mile House  British Columbia  6317
Thanks. Both state-name columns are of type text.
How about:
update fc_Query set StateId =
(select StateId from fc_states where fc_states.Name = fc_Query.StatesName)
That should give you the result you're looking for.
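The correlated-subquery UPDATE ports directly to SQLite, which makes its behavior easy to check, including the caveat that names with no match (Quebec here) get set to NULL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fc_states (StateId INTEGER, Name TEXT);
INSERT INTO fc_states VALUES (6316, 'Alberta'), (6317, 'British Columbia');
CREATE TABLE fc_Query (Name TEXT, StatesName TEXT, StateId INTEGER);
INSERT INTO fc_Query VALUES
 ('Abbotsford', 'Quebec', NULL),
 ('Abee', 'Alberta', NULL),
 ('100 Mile House', 'British Columbia', NULL);
""")
con.execute("""
UPDATE fc_Query SET StateId =
  (SELECT StateId FROM fc_states WHERE fc_states.Name = fc_Query.StatesName)
""")
result = con.execute("SELECT Name, StatesName, StateId FROM fc_Query").fetchall()
```

Rows with a matching state name get the id; the unmatched row keeps NULL because the subquery returns no row for it.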
This is a different way from what Eddie did. I like MERGE for updates whenever they're not dead simple (and I wouldn't consider yours dead simple), so if you're bored/curious, also try:
WITH stateIds as
(SELECT name, MAX(stateID) as stID
FROM fc_states
GROUP BY name)
MERGE fc_Query
on stateids.name = fc_query.statesname
WHEN MATCHED THEN UPDATE
SET fc_query.stateid = convert(int, stid)
;
The first part, from WITH to GROUP BY name, is a CTE: it creates a table-like object named stateIds that the immediately following statement can use as a table, and it guarantees only one row per state name. The MERGE then looks for anything in fc_Query with a matching name, and if there's a match, sets the StateId as you want. You can make a small edit if you don't want to overwrite existing StateIds in fc_Query:
WITH stateIds AS
 (SELECT name, MAX(stateID) AS stID
  FROM fc_states
  GROUP BY name)
MERGE fc_Query
USING stateIds
  ON stateIds.name = fc_Query.StatesName
WHEN MATCHED AND fc_Query.StateId IS NULL THEN UPDATE
  SET fc_Query.StateId = CONVERT(INT, stID)
;
And you can have it do something different to rows that don't match, so I think MERGE is good for a lot of applications. You need a semicolon at the end of a MERGE statement, and you have to guarantee that there will be only one match (or zero matches) in the source (that is, stateIds, my CTE) for each row in the target; if there's more than one match, the statement fails at runtime with an error about updating the same row more than once, so just never let it happen.

Efficient SQL to merge results or leave to client browser javascript?

I was wondering, what is the most efficient way of combining results into a single result.
I want to turn
Num   Ani    Country
----  -----  -------
22    cows   Canada
20    pigs   Canada
40    cows   USA
34    pigs   USA
into:
cows   pigs   Country
-----  -----  -------
22     20     Canada
40     34     USA
I want to know whether it would be better to do this in SQL alone, or to feed the whole query result set to the user and then use JavaScript to parse it into the desired format.
Also, I do not know exactly how I would express this as a SQL query. The only approach I can think of is a roundabout one that dynamically creates a temporary table.
The operation you're after is called "pivoting" - the PIVOT info page has a little more detail:
SELECT MAX(CASE WHEN t.ani = 'cows' THEN t.num ELSE NULL END) AS cows,
MAX(CASE WHEN t.ani = 'pigs' THEN t.num ELSE NULL END) AS pigs,
t.country
FROM YOUR_TABLE t
GROUP BY t.country
ORDER BY t.country
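The conditional-aggregation pivot runs essentially unchanged on SQLite, so it's easy to verify against the sample data (the table name herd stands in for YOUR_TABLE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE herd (Num INTEGER, Ani TEXT, Country TEXT);
INSERT INTO herd VALUES
 (22, 'cows', 'Canada'), (20, 'pigs', 'Canada'),
 (40, 'cows', 'USA'),    (34, 'pigs', 'USA');
""")
pivoted = con.execute("""
SELECT MAX(CASE WHEN t.Ani = 'cows' THEN t.Num ELSE NULL END) AS cows,
       MAX(CASE WHEN t.Ani = 'pigs' THEN t.Num ELSE NULL END) AS pigs,
       t.Country
FROM herd t
GROUP BY t.Country
ORDER BY t.Country
""").fetchall()
```

Each MAX(CASE ...) keeps only the rows for one animal per country group, which is what turns rows into columns.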
There should be an efficient way of doing the pivot in application code using a 2-D array (in PHP on the server, or JavaScript in the browser). To address Ken Downs' concern about byte pushing: ragged raw pivot data consumes fewer bytes than a fully materialized 2-D pivot table. A simple case is
cows | pigs | sheep | goats | country
1    | null | null  | null  | Canada
null | 2    | null  | null  | USA
null | null | 3     | null  | Egypt
null | null | null  | 4     | England
which is only 4 rows of raw data (each being 3 columns).
Doing it in the front end also solves the issue of dynamic-pivoting. If your number of pivot columns is unknown, then you would require a MySQL procedure to build up a dynamic sql statement of the pattern "MAX(CASE....)" for each column.
There are advantages to doing this on the client side:
- it can be done (at least considered as an alternative)
- the result can be rendered earlier, if the savings in network traffic are significant, despite requiring either (1) PHP pivot-table construction or (2) client-side JavaScript
- it does not require a MySQL procedure for dynamic pivoting
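A sketch of that front-end pivot in Python (standing in for the PHP or JavaScript the answer mentions); it discovers the pivot columns dynamically, which is exactly the case the MySQL procedure would otherwise have to handle:

```python
def pivot(rows):
    """Pivot (num, animal, country) triples into one row per country."""
    animals = sorted({ani for _, ani, _ in rows})   # dynamic column set
    table, order = {}, []
    for num, ani, country in rows:
        if country not in table:
            table[country] = {}
            order.append(country)
        table[country][ani] = num
    # missing combinations come out as None, like the nulls in the table above
    return animals, [[table[c].get(a) for a in animals] + [c] for c in order]

header, data = pivot([(22, 'cows', 'Canada'), (20, 'pigs', 'Canada'),
                      (40, 'cows', 'USA'), (34, 'pigs', 'USA')])
```

One pass over the raw rows builds the whole table, so the server can ship the compact ragged data and leave the widening to the client.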