I have some timestamps in a table and I want to create a dummy variable (0 or 1) that tests if the row above is equal to the current row, after the timestamps are sorted. I need to do this in different partitions. Is there a window function that can do this in SQL Server?
So I know my PARTITION BY column and ORDER BY column. From my knowledge of window functions I might need a ranking function, but is there a way to write this with nested functions, using IIF with LEAD or LAG to check some condition between the rows in a partition?
SQL tables represent unordered sets. If you have an ordering column separate from the timestamps, you can use:
select t.*,
       (case when lag(timestamp) over (partition by <partition col> order by <order col>) = timestamp
             then 1 else 0
        end) as flag  -- lag() returns NULL on the first row of each partition, so that row gets 0
from t;
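For illustration, with a hypothetical partition column grp and ordering column seq, the output would look like this (the first row of each partition gets 0 because lag() returns NULL there):

grp  seq  timestamp            flag
A    1    2020-01-01 10:00:00  0    -- lag() is NULL, no match
A    2    2020-01-01 10:00:00  1    -- equals the row above
A    3    2020-01-02 09:30:00  0
B    1    2020-01-02 09:30:00  0    -- new partition, lag() is NULL again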
SQL is a set language, so strictly speaking there is no "above" or "below"; window functions, though of limited portability at first, were added by popular demand to approximate exactly that. You could write a procedure that processes the ordered rows while remembering prior values, priming that memory with something for the first row. That still gives no access to the next row! Back to spreadsheets? Or save the info for the current and prior rows until the next row arrives, and output each row one row late, but then what about the last row? Even in spreadsheets, you have a top and a bottom row!
We are working on converting Informatica mappings to Google BigQuery SQL. In one of the mappings, there are a couple of ports/columns, say A and B, which are neither grouped by in the Aggregator transformation nor have any aggregation function such as SUM or AVG applied to them.
According to senior devs in my org, in Informatica we will get the last values of these ports/columns as a result after the Aggregator. My question is: how do we convert this behaviour to BigQuery SQL? We cannot use columns in the SELECT statement that are not present in the GROUP BY clause, and we don't want to group by these columns.
For getting the last value of a column, we have the LAST_VALUE() analytic function in BigQuery, but even then we cannot use GROUP BY and an analytic function in the same SELECT statement.
I would really appreciate some help!
Use some aggregation function.
In Informatica you will get the LAST value. This is not deterministic. It basically means that either:
you have the same values across the whole column,
you don't care which one you get, or
you have a specific order by which the last value is taken.
The first two cases mean you can use MIN / MAX / whatever; the result will be the same, or you don't care.
If the last one is your case, ARRAY_AGG should help you, as per this answer.
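A minimal BigQuery sketch of that third case, assuming a hypothetical table mytable with group-by column id, an ordering column ts, and pass-through ports a and b:

SELECT id,
       ARRAY_AGG(a ORDER BY ts DESC LIMIT 1)[OFFSET(0)] AS a,  -- last a by ts
       ARRAY_AGG(b ORDER BY ts DESC LIMIT 1)[OFFSET(0)] AS b   -- last b by ts
FROM mytable
GROUP BY id;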
To convert an Informatica mapping with an Aggregator to BigQuery SQL, I would use row_number() over (partition by id order by id) as rn and then apply a filter rn = 1 on the outside.
In the Informatica Aggregator, id is the group-by column.
The equivalent SQL should look like this:
select a, b, id
from
  (select a, b, id,
          row_number() over (partition by id order by id desc) as rn
          -- this mimics the Informatica Aggregator; id is the group-by port.
          -- if you have a sorter before the Aggregator, add all its ports to the
          -- ORDER BY in the same sequence but with the order reversed (asc/desc)
   from mytable) rs
where rs.rn = 1 -- this ensures we pick the latest row
I'm trying to understand how window functions work internally. Suppose this data:
ID,Amt
A,1
B,2
C,3
D,4
E,5
If I run this, it will give the sum of all amounts in the total column against every record:
SELECT ID, SUM(AMT) OVER () AS total FROM table
but when I run this, it gives me a cumulative sum:
SELECT ID, SUM(AMT) OVER (ORDER BY ID) AS total FROM table
I'm trying to understand what is happening with OVER() versus OVER(ORDER BY ID).
What I've understood is that when no partition is defined in OVER, it treats everything as a single partition. But I can't understand why adding ORDER BY ID inside OVER() makes it start doing a cumulative sum.
Can anyone share what's happening behind the scenes?
That is an interesting case. Based on the documentation, here is the explanation and an example.
If PARTITION BY is not specified, the function treats all rows of the query result set as a single partition. The function will be applied to all rows in the partition if you don't specify an ORDER BY clause.
So if you specify ORDER BY, then:
If it is specified, and a ROWS/RANGE is not specified, then RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as the default for the window frame by the functions that can accept an optional ROWS/RANGE specification (for example MIN or MAX).
So technically these two commands are the same:
SELECT ID, SUM(AMT) OVER (ORDER BY ID) total FROM table
SELECT ID, SUM(AMT) OVER (ORDER BY ID RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) total FROM table
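Against the five sample rows above, the two variants produce the following (a quick worked example):

ID   SUM(AMT) OVER ()   SUM(AMT) OVER (ORDER BY ID)
A    15                 1
B    15                 3
C    15                 6
D    15                 10
E    15                 15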
You can read more in the documentation: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15
This is not specific to Oracle itself; it's part of the SQL standard and behaves the same way in many databases including Oracle, DB2, PostgreSQL, SQL Server, MySQL, MariaDB, H2, etc.
By definition, when you include the ORDER BY clause the engine produces "running values" (a cumulative aggregation) inside each partition; without the ORDER BY clause it produces a single value that aggregates the whole partition.
Now, the partition itself is mainly defined by the PARTITION BY clause. In its absence, the whole result set is considered as a single partition.
Finally, as a more advanced topic, the window can be further tweaked within each partition using a "frame" clause (ROWS and RANGE) and a "frame exclusion" clause (EXCLUDE).
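As a sketch of that frame tweaking, here is a three-row moving sum over the sample data above, assuming the table is named t (standard SQL; note that SQL Server, for one, does not implement EXCLUDE):

SELECT ID,
       SUM(AMT) OVER (ORDER BY ID
                      ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_sum
FROM t;
-- for rows A..E this yields 3, 6, 9, 12, 9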
I would like to ask you how to return the difference between the value of the current row and the value of the following row.
I know that I could use lead, but I would like to try to use lag function to do the same.
What I thought was lag(value) over (partition by id order by time desc), but I am not completely sure that this works.
Any suggestions?
Your query should work with the desc on the sort.
You should understand that SQL tables represent unordered sets; there is no ordering unless a column specifies it. With the sort direction reversed, lead() and lag() do the same thing: what is the previous row for one is the next row for the other.
There are some small caveats. NULL values might be treated differently. And identical values result in an unstable ordering, so any matching value might be returned.
In general, lag() and lead() are used without desc, but that is just a convention to choose between the two functions.
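A minimal sketch of that equivalence, assuming a hypothetical table t with columns id, time, and value; both expressions compute the current row's value minus the following row's (by ascending time):

SELECT id, time, value,
       value - LEAD(value) OVER (PARTITION BY id ORDER BY time)      AS diff_lead,
       value - LAG(value)  OVER (PARTITION BY id ORDER BY time DESC) AS diff_lag
FROM t;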
I have seen a few posts detailing fast ways to "estimate" the number of rows in a given SQL table without using COUNT(*). However, none of them seem to really solve the problem if you need to estimate the number of rows which satisfy given criteria. I am trying to find a way of estimating the number of rows which satisfy given criteria, but the information for these criteria is scattered around two or three tables. Of course a SELECT COUNT(*) with the NOLOCK hint and a few joins will do, and I can afford under- or over-estimating the total records. The problem is that this kind of query will be running every 5-10 minutes or so, and since I don't need the actual number, only an estimate, I would like to trade off accuracy for speed.
The solution, if any, may be "SQL Server"-specific. In fact, it must be compatible with SQL Server 2005. Any hints?
There is no easy way to do this. You can get an estimate for the total number of rows in a table, e.g. from system catalog views.
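For that whole-table estimate, a sketch that works from SQL Server 2005 onwards reads the (possibly slightly stale) row counts from sys.partitions:

SELECT SUM(p.rows) AS approx_rows
FROM sys.partitions p
WHERE p.object_id = OBJECT_ID('dbo.YourTable')  -- hypothetical table name
  AND p.index_id IN (0, 1);                     -- heap or clustered index only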
But there's no way to do this for a given set of criteria in a WHERE clause: either you would have to keep counts for each set of criteria and their values, or you'd have to use black magic to find that out. The only place SQL Server keeps something that goes in that direction is the statistics it maintains on indices. Those hold certain information about what kinds of values occur how frequently in an index, but I quite honestly have no idea if (and how) you could leverage that information in your own queries.
If you really must know the number of rows matching a certain criteria, you need to do a count of some sort - either a SELECT COUNT(*) FROM dbo.YourTable WHERE (yourcriteria) or something else.
That "something else" could work like this:
wrap your SELECT statement into a CTE (Common Table Expression)
define a ROW_NUMBER() in that CTE ordering your data by some column (or set of columns)
add a second ROW_NUMBER() to that CTE that orders your data by the same column (or columns) - but in the opposite direction (DESC vs. ASC)
Something like this:
;WITH YourDataCTE AS
(
    SELECT (list of columns you need),
           ROW_NUMBER() OVER (ORDER BY <your column>) AS RowNum,
           ROW_NUMBER() OVER (ORDER BY <your column> DESC) AS RowNum2
    FROM dbo.YourTable
    WHERE <your conditions here>
)
SELECT *
FROM YourDataCTE
Doing this, you would get the following effect:
your first row in your result set will contain your usual data columns
the first ROW_NUMBER() will contain the value 1
the second ROW_NUMBER() will contain the total number of rows that match that criteria set
This works surprisingly well for small to mid-size result sets; I haven't tested how it holds up with really large result sets, but it might be something to investigate and see if it works.
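For instance, to read only the estimated total instead of all the rows, you could replace the final SELECT * with something like:

SELECT TOP (1) RowNum2 AS TotalRows
FROM YourDataCTE
ORDER BY RowNum;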
Possible solutions:
If the matching count is small in comparison to the total number of rows in the table, then adding indexes that cover the WHERE condition will help and the query will be very fast.
If the result count is close to the total number of rows in the table, indexes will not help much. You could instead implement a trigger that maintains a "conditional count table": whenever a row matching the condition is added you increment the value in that table, and when one is deleted you decrement it. Then you query this small summary count table, as sketched below.
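A minimal sketch of such a trigger, assuming a hypothetical dbo.Orders table whose condition is Status = 'Pending' and a one-row summary table dbo.OrderCounts(PendingCount int); an AFTER UPDATE branch would need similar handling:

CREATE TRIGGER trg_Orders_CountPending ON dbo.Orders
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- add newly inserted matching rows, subtract deleted ones
    UPDATE dbo.OrderCounts
    SET PendingCount = PendingCount
        + (SELECT COUNT(*) FROM inserted WHERE Status = 'Pending')
        - (SELECT COUNT(*) FROM deleted  WHERE Status = 'Pending');
END;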
Quick question on SQLite3 (it may as well apply to SQLite in general).
How can one retrieve the n-th row of a query result?
row_id (or whichever index) won't work in my case, given that the tables contain a column with a number. Depending on the data, the query needs the results unsorted or sorted by some asc/desc criteria.
But I may need to quickly retrieve, say, rows 2 & 5 of the results.
So other than running a sqlite3_step() == SQLITE_ROW loop with a counter, right now I have no idea how to proceed with this.
And I don't like this solution very much because of performance issues.
So, if anyone can drop a hint that'd be highly appreciated.
Add LIMIT 1 OFFSET <n> to the query. Note that OFFSET is zero-based, so OFFSET <n> skips the first n rows and returns row n+1. For example:
SELECT * FROM users LIMIT 1 OFFSET 5132;
The general approach is that, if you want only the nth row of m rows, use an appropriate where condition to only get that row.
If you need to get to a row and can't because no where criteria can get you there, your database has a serious design issue. It fails the first normal form, which states that "There's no top-to-bottom ordering to the rows."
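In other words, address rows by key rather than by position; a minimal sketch, assuming a hypothetical users table with a unique id column:

SELECT * FROM users WHERE id = 42;  -- one specific row, regardless of ordering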
But I may need to quickly retrieve, say, rows 2 & 5 of the results.
In scenarios where you need non-contiguous rows you could use ROW_NUMBER():
WITH cte AS (
  SELECT *, ROW_NUMBER() OVER () AS rn  -- use OVER (ORDER BY ...) if a specific order is required
  FROM t
)
SELECT *
FROM cte
WHERE rn IN (2, 5);  -- the row numbers you need