How to deal with ports that are neither grouped by nor aggregated in Informatica PowerCenter - sql

We are converting Informatica mappings to Google BigQuery SQL. In one of the mappings there are a couple of ports/columns, say A and B, that are not grouped by in the Aggregator transformation and have no aggregation function such as sum or avg applied to them either.
According to senior devs in my org, Informatica returns the last values of these ports/columns after the Aggregator. My question is: how do we reproduce this behaviour in BigQuery SQL? We cannot use columns in the select list that are not present in the GROUP BY clause, and we don't want to group by these columns.
To get the last value of a column, BigQuery has the LAST_VALUE() analytic function, but even then we cannot combine the GROUP BY and the analytic function in the same select statement.
I would really appreciate some help!

Use some aggregation function.
In Informatica you will get the LAST value. This is not deterministic. It basically means that either
you have the same value in every row of the group,
you don't care which one you get, or
you have a specific order in which the last value is taken.
The first two cases mean you can use MIN, MAX, or any other aggregate. The result will be the same, or you don't care.
If the last one is your case, ARRAY_AGG should help you, as per this answer.

To convert an Informatica mapping with an Aggregator to BigQuery SQL, I would use row_number() over (partition by id order by id desc) as rn and then filter on rn = 1 in the outer query.
In the Informatica Aggregator, id is the group-by column.
The equivalent SQL should look like this:
select a, b, id
from
(select a, b, id, row_number() over (partition by id order by id desc) as rn -- this mimics the Informatica Aggregator. id is the group-by port. Ordering by id alone leaves the pick within a group arbitrary, so if there is a Sorter before the Aggregator, list all of its ports in the order by, in the same sequence but with the direction reversed (asc/desc)
from mytable) rs
where rs.rn = 1 -- this picks the latest row.
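The ROW_NUMBER approach above can be tried locally. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for BigQuery and a hypothetical seq column that records the order in which rows reach the Aggregator (neither is part of the original mapping). The window syntax is standard SQL, so the query itself should carry over, but that is an assumption to verify against your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- seq is a hypothetical column standing in for the Sorter order
    CREATE TABLE mytable (id INTEGER, a TEXT, b TEXT, seq INTEGER);
    INSERT INTO mytable VALUES
        (1, 'a1', 'b1', 1),
        (1, 'a2', 'b2', 2),
        (1, 'a3', 'b3', 3),
        (2, 'x1', 'y1', 1),
        (2, 'x2', 'y2', 2);
""")

# Number the rows of each id from last to first, then keep row 1 only.
last_rows = conn.execute("""
    SELECT id, a, b
    FROM (
        SELECT id, a, b,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
        FROM mytable
    ) AS rs
    WHERE rs.rn = 1
    ORDER BY id
""").fetchall()

print(last_rows)  # [(1, 'a3', 'b3'), (2, 'x2', 'y2')]
```

Ordering by a column that actually differs within the group is what makes the "last" row well defined; without one, any row of the group may come back.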

Related

How does the window SUM function work with OVER internally?

I'm trying to understand, how window function works internally.
ID,Amt
A,1
B,2
C,3
D,4
E,5
If I run this, it gives the sum of all amounts in the total column against every record.
Select ID, SUM (AMT) OVER () total from table
But when I run this, it gives me a cumulative sum:
Select ID, SUM (AMT) OVER (order by ID) total from table
I'm trying to understand what happens with OVER() versus OVER(order by ID).
What I've understood is that when no partition is defined in OVER, everything is treated as a single partition. But I can't see why adding order by ID inside OVER() makes it start doing a cumulative sum.
Can anyone share what's happening behind the scenes here?
That is an interesting case. Based on the documentation, here is the explanation and an example.
If PARTITION BY is not specified, the function treats all rows of the
query result set as a single partition. Function will be applied on
all rows in the partition if you don't specify ORDER BY clause.
So if you specify ORDER BY, then:
If it is specified, and a ROWS/RANGE is not specified, then default
RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as default for
window frame by the functions that can accept optional ROWS/RANGE
specification (for example min or max).
So technically these two commands are the same:
SELECT ID, SUM(AMT) OVER (ORDER BY ID) total FROM table
SELECT ID, SUM(AMT) OVER (ORDER BY ID RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) total FROM table
You can read more in the documentation: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15
This is not related to Oracle itself, but it's part of the SQL Standard and behaves the same way in many databases including Oracle, DB2, PostgreSQL, SQL Server, MySQL, MariaDB, H2, etc.
By definition, when you include the ORDER BY clause the engine will produce "running values" (cumulative aggregation) inside each partition; without the ORDER BY clause it produces the same, single value that aggregates the whole partition.
Now, the partition itself is mainly defined by the PARTITION BY clause. In its absence, the whole result set is considered as a single partition.
Finally, as a more advanced topic, the partition can be further tweaked using a "frame" clause (ROWS and RANGE) and by a "frame exclusion" clause (EXCLUDE).
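The default-frame behaviour described in these answers can be checked with a small experiment. This is a sketch only, assuming SQLite (3.25 or later, for window functions) as the engine and a hypothetical table name amounts, since table is a reserved word; the data is the A to E sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amounts (ID TEXT, AMT INTEGER);
    INSERT INTO amounts VALUES ('A', 1), ('B', 2), ('C', 3), ('D', 4), ('E', 5);
""")

def totals(over_clause):
    """Return the windowed SUM for each row, in ID order."""
    sql = (
        f"SELECT ID, SUM(AMT) OVER ({over_clause}) AS total "
        "FROM amounts ORDER BY ID"
    )
    return [total for _id, total in conn.execute(sql)]

# No ORDER BY: the frame is the whole partition.
whole_partition = totals("")
# ORDER BY only: the default frame runs from the start up to the current row.
running = totals("ORDER BY ID")
# The same frame, spelled out explicitly.
explicit_frame = totals(
    "ORDER BY ID RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"
)

print(whole_partition)  # [15, 15, 15, 15, 15]
print(running)          # [1, 3, 6, 10, 15]
print(explicit_frame)   # [1, 3, 6, 10, 15]
```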

Is it possible to apply the SQL LAG function to its own result?

I need to find a solution for an issue in my SQL report.
The issue is how to apply the lag function to a column that was itself produced by the lag function.
Here is a picture to clarify my problem
That looks like a simple running total. You don't need lag() for that. When using an aggregate like sum() together with an order by in the window definition, it will give you just that.
But you have to specify an order by, because rows in a table don't have any implied ordering.
select column1,
sum(column1) over (order by ???) as column1_calculation
from the_table
order by ???
You need to replace the ??? in the above statement with the column that defines the sort order of your rows. Very often a date/time column is used for that to get the running total over time.
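To see the difference between lag() and a windowed sum side by side, here is a minimal sketch assuming SQLite and a hypothetical created_on date column in place of the ???; the sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- created_on is a hypothetical ordering column
    CREATE TABLE the_table (created_on TEXT, column1 INTEGER);
    INSERT INTO the_table VALUES
        ('2021-01-01', 10),
        ('2021-01-02', 20),
        ('2021-01-03', 5),
        ('2021-01-04', 15);
""")

rows = conn.execute("""
    SELECT column1,
           LAG(column1) OVER (ORDER BY created_on) AS previous_value,
           SUM(column1) OVER (ORDER BY created_on) AS column1_calculation
    FROM the_table
    ORDER BY created_on
""").fetchall()

for row in rows:
    print(row)
# (10, None, 10)
# (20, 10, 30)
# (5, 20, 35)
# (15, 5, 50)
```

LAG only exposes the previous row's stored value, whereas the ordered SUM accumulates over all earlier rows, which is what a running total needs.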

behavior of select when MAX() is present

Get other columns that correspond with MAX value of one column?
I know that a SELECT without a WHERE condition usually returns all rows from the table.
But why does SQL this time return data from only the first row for the other columns (video_category, url, ...)?
Does MAX() change the behavior of SELECT?
If so, why are the rest of the columns not taken from the row with MAX(id)?
If you want all columns from the record having the max id, then you will have to use a subquery:
SELECT *
FROM yourTable
WHERE id = (SELECT MAX(id) FROM yourTable);
Your current output is only one record because when you take the max over the entire table, you are no longer speaking of individual records. I am guessing that you are using MySQL, in which case the values you see for the other columns were chosen by the database. But there is no guarantee about which record was chosen.
Some SQL dialects (e.g. MySQL, SQL Server) support LIMIT/TOP functionality, which might simplify things. For example, on SQL Server we can just write:
SELECT TOP 1 *
FROM yourTable
ORDER BY id DESC;
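Both forms can be compared on a toy table. The sketch below assumes SQLite, which uses LIMIT rather than TOP, and invented sample rows; only the column names video_category and url come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourTable (id INTEGER PRIMARY KEY, video_category TEXT, url TEXT);
    INSERT INTO yourTable VALUES
        (1, 'music',  'u1'),
        (2, 'news',   'u2'),
        (3, 'sports', 'u3');
""")

# Subquery form: find the max id first, then fetch that whole row.
by_subquery = conn.execute(
    "SELECT * FROM yourTable WHERE id = (SELECT MAX(id) FROM yourTable)"
).fetchall()

# Sort-and-limit form (LIMIT 1 here plays the role of TOP 1).
by_limit = conn.execute(
    "SELECT * FROM yourTable ORDER BY id DESC LIMIT 1"
).fetchall()

print(by_subquery)  # [(3, 'sports', 'u3')]
print(by_limit)     # [(3, 'sports', 'u3')]
```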

Postgres - Group by without having to aggregate?

This is hard to explain, but say I have this query:
SELECT *
FROM "late_fee_tiers"
And it returns this:
I have a validation in code set up to prevent duplicate days from being saved (notice there are 2 rows of days = 2).
I want my query to double-check there are only unique rows of day, and if there are multiple, select the first one (so it should return 3 rows with 2,3,5).
My first thought is to use GROUP BY day, while selecting a MIN("id").
The problem is, I don't understand SQL enough, because it forces me to add different aggregator functions to every single column... but what if I don't want to do that? I want THAT row to be "chosen" according to the single aggregator function I define, I don't need multiple aggregators creating some weird hybrid row. I just want the MIN() function to choose that 1 row and fill in all the rest of the values for that row.
What function do I use to do this, or how would I do it?
Thanks
You want to use DISTINCT ON:
select distinct on (day) *
from "late_fee_tiers"
order by day, id;
Why day is also required in the order by:
From the official documentation:
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
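DISTINCT ON is specific to Postgres. For engines without it, the asker's original idea of GROUP BY day with MIN(id) can be made to work by joining the grouped ids back to the table. The following is a sketch of that alternative technique, not of DISTINCT ON itself, assuming SQLite and a hypothetical fee column with invented values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- fee is a hypothetical payload column
    CREATE TABLE late_fee_tiers (id INTEGER PRIMARY KEY, day INTEGER, fee INTEGER);
    INSERT INTO late_fee_tiers VALUES
        (1, 2, 10),
        (2, 2, 99),
        (3, 3, 15),
        (4, 5, 25);
""")

# The inner query picks the smallest id per day; the join then pulls
# the complete row for each of those ids, so no hybrid rows appear.
first_per_day = conn.execute("""
    SELECT t.id, t.day, t.fee
    FROM late_fee_tiers AS t
    JOIN (
        SELECT day, MIN(id) AS min_id
        FROM late_fee_tiers
        GROUP BY day
    ) AS firsts ON t.id = firsts.min_id
    ORDER BY t.day
""").fetchall()

print(first_per_day)  # [(1, 2, 10), (3, 3, 15), (4, 5, 25)]
```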

Any reason for GROUP BY clause without aggregation function?

I'm (thoroughly) learning SQL at the moment and came across the GROUP BY clause.
GROUP BY aggregates or groups the resultset according to the argument(s) you give it. If you use this clause in a query you can then perform aggregate functions on the resultset to find statistical information on the resultset like finding averages (AVG()) or frequency (COUNT()).
My question is: is the GROUP BY statement in any way useful without an accompanying aggregate function?
Update
Using GROUP BY as a synonym for DISTINCT is (probably) a bad idea because I suspect it is slower.
is the GROUP BY statement in any way useful without an accompanying aggregate function?
Using DISTINCT would be a synonym in such a situation, but the reason you'd want/have to define a GROUP BY clause would be in order to be able to define HAVING clause details.
If you need to define a HAVING clause, you have to define a GROUP BY - you can't do it in conjunction with DISTINCT.
You can perform a DISTINCT select by using a GROUP BY without any AGGREGATES.
GROUP BY can be used in two main ways:
1) in conjunction with SQL aggregation functions
2) to eliminate duplicate rows from a result set
So the answer to your question lies in the second of the uses described above.
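The two points made so far, that GROUP BY without aggregates in the select list behaves like DISTINCT and that it is what makes a HAVING filter possible, can be illustrated as follows. This is a sketch assuming SQLite and invented rows in a products table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_name TEXT, last_purchased TEXT);
    INSERT INTO products VALUES
        ('apple', '2021-01-05'),
        ('apple', '2021-02-01'),
        ('pear',  '2021-01-20');
""")

# Sorted in Python because neither query guarantees an output order.
distinct_names = sorted(
    name for (name,) in conn.execute("SELECT DISTINCT product_name FROM products")
)
grouped_names = sorted(
    name for (name,) in conn.execute(
        "SELECT product_name FROM products GROUP BY product_name"
    )
)

# No aggregate in the select list, but HAVING can still filter the groups.
repeated = conn.execute("""
    SELECT product_name
    FROM products
    GROUP BY product_name
    HAVING COUNT(*) > 1
""").fetchall()

print(distinct_names)  # ['apple', 'pear']
print(grouped_names)   # ['apple', 'pear']
print(repeated)        # [('apple',)]
```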
Note: everything below only applies to MySQL
GROUP BY is guaranteed to return results in order, DISTINCT is not. (This holds for MySQL versions before 8.0, which sorted GROUP BY results implicitly; from 8.0 on, add an explicit ORDER BY.)
GROUP BY along with ORDER BY NULL is of the same efficiency as DISTINCT (and implemented in the same way). If there is an index on the field being aggregated (or distinctified), both clauses use a loose index scan over this field.
In GROUP BY, you can return non-grouped and non-aggregated expressions (when the ONLY_FULL_GROUP_BY SQL mode is disabled). MySQL will pick an arbitrary value from the corresponding group to calculate the expression.
With GROUP BY, you can omit the GROUP BY expressions from the SELECT clause. With DISTINCT, you can't. Every row returned by a DISTINCT is guaranteed to be unique.
It is used for more than just aggregate functions.
For example, consider the following code:
SELECT product_name, MAX(last_purchased) FROM products GROUP BY product_name
This returns only one row per product, with the most recent last_purchased value among that product's records.
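A quick check of the per-product MAX query, as a sketch assuming SQLite, ISO-formatted date strings, and invented rows. It also shows why the column name should not be wrapped in single quotes: a single-quoted name is a string literal, so MAX simply returns that literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_name TEXT, last_purchased TEXT);
    INSERT INTO products VALUES
        ('apple', '2021-01-05'),
        ('apple', '2021-02-01'),
        ('pear',  '2021-01-20');
""")

# Column reference: the latest purchase date per product.
latest = conn.execute("""
    SELECT product_name, MAX(last_purchased)
    FROM products
    GROUP BY product_name
    ORDER BY product_name
""").fetchall()

# Single-quoted name: MAX over a constant string, not over the column.
quoted = conn.execute("""
    SELECT product_name, MAX('last_purchased')
    FROM products
    GROUP BY product_name
    ORDER BY product_name
""").fetchall()

print(latest)  # [('apple', '2021-02-01'), ('pear', '2021-01-20')]
print(quoted)  # [('apple', 'last_purchased'), ('pear', 'last_purchased')]
```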