Rank/Rownumber function in SSIS Dataflow - sql

In my data flow, after some lookups I get duplicate customer records (they are not exact duplicates; only the customer ID is the same). Based on some attributes of the customer, such as city or location, I need to choose one record among them.
How can I achieve this in an SSIS data flow?
Here is the sample data:
;with cust (CustomerID,Cutomer_Name,score)
as
(Select 1 as CustomerID, 'abd' as Cutomer_Name, 100 as Score
union
select 1,'abd',null
union select 1,'abd',20
)
select * from cust
From here I need to choose the record with the lowest score and send only that row to the final table.
It's easy to achieve with the ROW_NUMBER function in SQL, but this case occurs during the data flow in SSIS.

Set the source's data access mode to SQL command and do the ranking in the query.

Use a Multicast to split the flow into two outputs, say Output1 and Output2. Connect Output1 to an Aggregate transformation that groups by CustomerId and takes the Minimum of Score. Then join the Aggregate output back to Output2 with a Merge Join, mapping Output2.CustomerId = Aggregate.CustomerId and Output2.Score = Aggregate.Score (the minimum). Merge Join requires both inputs to be sorted on the join keys. This would do the trick, but if a CustomerId has several rows with the same minimum score, you might need a Sort after this step to remove the duplicates. Hope this helps.
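
The Aggregate-then-Merge-Join idea amounts to a group-wise minimum joined back to the detail rows. Here is a minimal sketch of that logic, using Python's built-in sqlite3 module as a stand-in for the data flow. The table and sample rows come from the question; the in-memory database is only an assumption for illustration and is not an SSIS component.

```python
import sqlite3

# In-memory stand-in for the rows flowing through the data flow (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (CustomerID INTEGER, Cutomer_Name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO cust VALUES (?, ?, ?)",
    [(1, "abd", 100), (1, "abd", None), (1, "abd", 20)],
)

# The inner query plays the role of the Aggregate transform (SQL MIN skips NULLs);
# the join plays the role of the Merge Join on CustomerID and score.
lowest = conn.execute(
    """
    SELECT c.CustomerID, c.Cutomer_Name, c.score
    FROM cust AS c
    JOIN (SELECT CustomerID, MIN(score) AS min_score
          FROM cust
          GROUP BY CustomerID) AS m
      ON c.CustomerID = m.CustomerID
     AND c.score = m.min_score
    """
).fetchall()

print(lowest)
```

If two rows of the same customer tie on the minimum score, both come back, which is the case where the extra de-duplication step is needed.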

This is the solution which helped me to solve my issue
http://paultebraak.wordpress.com/2013/02/25/rank-partitioning-in-etl-using-ssis/

Related

SQL/HANA query assigning incremental number per X unique values without using stored procedure

Currently I'm struggling trying to write a select query to achieve the following result:
I have documents, each document has an address.
Now I want a third column that assigns a number to every X unique addresses found in the result list. In the example I have used X = 3.
There are a total of 7 unique addresses, which means we need 3 numbers:
1 = (address A, B, C), 2 = (address D, E, F), 3 = (address G).
PS. I have already worked out this logic in a stored procedure, but because of technical limitations that I can't go into detail about, this has to be done using a SELECT query if possible. If this is not possible we will have to find another workaround.
I was hoping you could point me in the right direction for which HANA SQL syntax to use to achieve this. I've been looking into ROW_NUMBER and DENSE_RANK but without success.
In order for your question to make sense, you need an ordering column. This answer assumes that document orders the rows.
You want to assign the group based on the minimum document of each address (and then do arithmetic). You can get this with two levels of window functions:
select t.*,
       ceil(dense_rank() over (order by min_document) / 3.0) as incremental_number
from (select t.*,
             min(document) over (partition by address) as min_document
      from t
     ) t;
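
To make the arithmetic concrete, here is a small Python sketch of the same two steps: take the minimum document per address, then dense-rank those minimums and divide by the group size, rounding up. The function name assign_groups and the sample rows are hypothetical, and it assumes document values are unique and sortable.

```python
def assign_groups(rows, group_size=3):
    """rows: list of (document, address). Returns (document, address, group_number)."""
    # Step 1: minimum document per address (min(document) over (partition by address)).
    min_doc = {}
    for doc, addr in rows:
        if addr not in min_doc or doc < min_doc[addr]:
            min_doc[addr] = doc

    # Step 2: dense rank of each distinct minimum (dense_rank() over (order by min_document)).
    rank = {d: i + 1 for i, d in enumerate(sorted(set(min_doc.values())))}

    # Step 3: ceil(rank / group_size), written with integer arithmetic.
    return [
        (doc, addr, (rank[min_doc[addr]] + group_size - 1) // group_size)
        for doc, addr in rows
    ]


# Hypothetical sample: 7 unique addresses, some repeated across documents.
sample = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "D"),
          (6, "E"), (7, "F"), (8, "G"), (9, "B")]
result = assign_groups(sample)
for row in result:
    print(row)
```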

Any easier way to group by individual columns in Hive/Impala?

I need to output a report of users by their age, gender, education, income, etc. from our database. However, there are about 40 variables. It seems silly to group by each variable one by one, but I'm not aware of other ways and I don't know how to write a UDF to solve it yet. I'd appreciate your help.
It's not that complicated but it does come up a lot in daily work. My work environment is Hive/Impala.
A 'Group By' over input rows cannot be implemented inside a UDF, UDAF or UDTF:
A UDF takes a single input row and produces a single output row.
A UDAF aggregates values across rows, but it does not do the grouping itself.
A UDTF transforms a single input row into multiple output rows.
A straightforward solution is to write one query per column, combine them using UNION ALL, and display the result or insert it into a table.
Sample Query:
SELECT *
FROM
(
SELECT COUNT(column1),column1 FROM table GROUP BY column1
UNION ALL
SELECT COUNT(column2),column2 FROM table GROUP BY column2
UNION ALL
SELECT COUNT(column3),column3 FROM table GROUP BY column3
) s
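
With about 40 variables, writing the UNION ALL branches by hand gets tedious, so the query text can be generated instead. Below is a minimal Python sketch. The helper name build_count_query, the table name users and the column names are hypothetical. The extra variable label and the CAST to STRING are additions for illustration, so that branches with different column types can be combined and told apart.

```python
def build_count_query(table, columns):
    """Return one UNION ALL query that counts rows per value for each column.

    Identifiers are interpolated as-is, so only pass trusted column names.
    """
    parts = [
        # 'variable' labels the branch; CAST keeps all branches type-compatible.
        f"SELECT '{col}' AS variable, CAST({col} AS STRING) AS value, COUNT(*) AS cnt "
        f"FROM {table} GROUP BY {col}"
        for col in columns
    ]
    return "\nUNION ALL\n".join(parts)


query = build_count_query("users", ["age", "gender", "education"])
print(query)
```

The generated text can then be pasted into Hive or Impala, or submitted through whatever client is already in use.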

Summarizing a table result in SQL

Given the below table as a SQL Result:
I want to use the above generated table and produce a table which clubs the given information into:
I have multiple areaName and multiple functionNames and multiple users. Please let me know if this is possible and how?
I have tried a couple of things, but I am just drained now and need a direction. Any help is appreciated.
Even if you can provide a pseudo code, I can try and make use of it. Start from the SQL result as a given table.
Use correlated sub-queries to achieve the desired result. I've provided an example below. Test it for the first summary column, and if it works, add in your other summary columns. Hopefully this makes sense, and helps. Alternatively you could use a CTE (common table expression) to achieve similar results.
SELECT a.areaName, a.functionName
, (SELECT count(DISTINCT b.UserKey)
from AREAS b
where a.areaName = b.areaName
and a.functionName = b.functionName
and b.[1-add] = 1) as UsersinAdd
-- Lather/rinse/repeat for other summary columns
FROM AREAS a
group by a.areaName, a.functionName
Your problem stems from the de-normalised structure of your table. Columns [1-add],...,[8-correction] should be values in a column, not columns. This leads to more complex queries, as you have discovered.
The unpivot command allows you to correct this mistake.
select areaname, functionname, rights, count(distinct userkey)
from
(
select * from yourtable
unpivot (permission for rights in ([1-add], [2-update/display],[4-update/display all] , [8-correction] )) u
) v
group by areaname, functionname, rights
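
The effect of unpivoting can be sketched in a few lines of Python: each flag column becomes a value, and distinct users are counted per (area, function, right). This is only an illustration under assumptions. The sample rows are made up, since the original table is not shown, and the sketch keeps only flags equal to 1, as the correlated sub-query answer does.

```python
from collections import defaultdict

RIGHTS = ["1-add", "2-update/display", "4-update/display all", "8-correction"]


def users_per_right(rows, rights=RIGHTS):
    """Count distinct UserKey values per (areaName, functionName, right)."""
    users = defaultdict(set)
    for row in rows:
        for right in rights:
            # "Unpivot": one entry per flag column that is set to 1 on this row.
            if row.get(right) == 1:
                users[(row["areaName"], row["functionName"], right)].add(row["UserKey"])
    return {key: len(keys) for key, keys in users.items()}


# Hypothetical rows; user 10 appears twice and is counted once.
sample = [
    {"areaName": "A1", "functionName": "F1", "UserKey": 10, "1-add": 1, "8-correction": 0},
    {"areaName": "A1", "functionName": "F1", "UserKey": 11, "1-add": 1, "8-correction": 1},
    {"areaName": "A1", "functionName": "F1", "UserKey": 10, "1-add": 1, "8-correction": 0},
]
summary = users_per_right(sample)
print(summary)
```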

SQL select first records of rows for specific column

I realize my title probably doesn't explain my situation very well, but I honestly have no idea how to word this.
I am using SQL to access a DB2 database.
Using my screenshot image 1 below as a reference:
column 1 has three instances of "U11124", with three different descriptions (column 2)
I would like this query to return the first instance of "U11124" and its description, but then also unique records for the other rows. image 2 shows my desired result.
image 1
image 2
----- EDIT ----
to answer some of the questions / posts:
Technically, it does not need to be the first, just any single one of those records. The problem is that we have three descriptions and only one needs to be shown; I am now told it does not matter which one.
SELECT STVNST, MAX(STDESC) FROM MY_TABLE GROUP BY STVNST;
In SQL Server:
select stvnst, stdesc
from (
    select
        stvnst, stdesc,
        row_number() over (partition by stvnst order by stdesc) as rn
    from MY_TABLE
) a
where rn = 1
This method has an advantage over a simple group by, in that it will also work when there are more than two columns in the table.
SELECT STVNST,FIRST(STDESC) from table group by STVNST ORDER BY what_you_want_first
All you need to do is use GROUP BY.
You say you want the first instance of the STDESC column? Well, you can't guarantee the order of the rows without another column; however, if taking the highest STDESC value per group is acceptable, the following will suffice:
SELECT STVNST, MAX(STDESC) FROM MY_TABLE GROUP BY STVNST;
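
A quick way to see what the GROUP BY query returns is to run it on a few rows. The sketch below uses Python's built-in sqlite3 module rather than DB2, with made-up descriptions. The ORDER BY is added only to make the printed output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_TABLE (STVNST TEXT, STDESC TEXT)")
conn.executemany(
    "INSERT INTO MY_TABLE VALUES (?, ?)",
    [
        ("U11124", "desc a"),
        ("U11124", "desc c"),
        ("U11124", "desc b"),
        ("U20000", "other"),  # hypothetical second key with a single description
    ],
)

# One row per STVNST; MAX picks the alphabetically last description in each group.
rows = conn.execute(
    "SELECT STVNST, MAX(STDESC) FROM MY_TABLE GROUP BY STVNST ORDER BY STVNST"
).fetchall()
print(rows)
```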

How Do I Combine Multiple SQL Queries?

I'm having some trouble figuring out any way to combine two SQL queries into a single one that expresses some greater idea.
For example, let's say that I have query A, and query B. Query A returns the total number of hours worked. Query B returns the total number of hours that were available for workers to work. Each one of these queries returns a single column with a single row.
What I really want, though, is essentially query A over query B. I want to know the percentage of capacity that was worked.
I know how to write query A and B independently, but my problem comes when I try to figure out how to use those prewritten queries to come up with a new SQL query that uses them together. I know that, on a higher level, like say in a report, I could just call both queries and then divide them, but I'd rather encompass it all into a single SQL query.
What I'm looking for is a general idea on how to combine these queries using SQL.
Thanks!
Use an unconstrained JOIN: the Cartesian product of 1 row by 1 row is a single row.
SELECT worked/available AS PercentageCapacity
FROM ( SELECT worked FROM A ) qa,
     ( SELECT available FROM B ) qb
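
As a runnable illustration of the one-row-by-one-row join, here is a sketch using Python's built-in sqlite3 module. The tables work_log and capacity, their hours column and the sample figures are all hypothetical stand-ins for query A and query B. Multiplying by 100.0 avoids integer division and gives a percentage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_log (hours REAL)")   # hypothetical source of query A
conn.execute("CREATE TABLE capacity (hours REAL)")   # hypothetical source of query B
conn.executemany("INSERT INTO work_log VALUES (?)", [(30.0,), (45.0,)])
conn.executemany("INSERT INTO capacity VALUES (?)", [(40.0,), (60.0,)])

# Each derived table returns exactly one row, so the cross join yields one row.
(pct,) = conn.execute(
    """
    SELECT 100.0 * a.worked / b.available AS PercentageCapacity
    FROM (SELECT SUM(hours) AS worked FROM work_log) AS a
    CROSS JOIN (SELECT SUM(hours) AS available FROM capacity) AS b
    """
).fetchone()
print(pct)
```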
You can declare variables to store the results of each query and return the difference:
DECLARE @first INT
DECLARE @second INT
SET @first = (SELECT val FROM Table...)
SET @second = (SELECT val FROM Table...)
SELECT @first - @second
The answer depends on where the data is coming from.
If it's coming from a single table, it could be something as easy as:
select totalHours, availableHours, (totalHours - availableHours) as difference
from hoursTable
But if the data is coming from separate tables, you need to add some identifying column so that the rows can be joined together to provide some useful view of the data.
You may want to post examples of your queries so we know better how to answer your question.
You can query the queries:
SELECT
    a.ID,
    a.HoursWorked/b.HoursAvailable AS UsedWork
FROM
    ( SELECT ID, HoursWorked FROM Somewhere ) a
INNER JOIN
    ( SELECT ID, HoursAvailable FROM SomewhereElse ) b
ON
    a.ID = b.ID