Having issue susing the MAX command in Databricks

Having issue susing the MAX command in Databricks - sql

Just running a simple query
select COL1
from (
select *
, monotonically_increasing_id() as row_id
from db00sparkmigration_landingzone_template.tbl_Ingestion_note_load_errors_pipes
) aa
where row_id > 1 and row_id < max(row_id)
but getting the following error, any ideas?
Error in SQL statement: UnsupportedOperationException: Cannot evaluate expression: max(input[1, bigint, false])

I would recommend row_number() here - however you need a column that defines the ordering of the rows:
select col1
from (
select
t.*,
row_number() over(order by <ordering_col> asc) rn_asc,
row_number() over(order by <ordering_col> desc) rn_desc
from db00sparkmigration_landingzone_template.tbl_Ingestion_note_load_errors_pipes t
) aa
where 1 not in (rn_asc, rn_desc)

Related

Creating column and filtering it in one select statement

Wondering if it is possible to creating a new column and filter on that column. The following is an example:
SELECT row_number() over (partition by ID order by date asc) row# FROM table1 where row# = 1
Thanks!

Some databases support a QUALIFY clause which you might be able to use:
SELECT *
FROM table1
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date) = 1;
On SQL Server, you may use a TOP 1 WITH TIES trick:
SELECT TOP 1 WITH TIES *
FROM table1
ORDER BY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date);
More generally, you would have to use a subquery:
WITH cte AS (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date) rn
FROM table1 t
)
SELECT *
FROM cte
WHERE rn = 1;

The WHERE clause is evaluated before the SELECT so your column has to exist before you can use a WHERE clause. You could achieve this by making a subquery of the original query.
SELECT *
FROM
(
SELECT row_number() over (partition by ID order by date asc) row#
FROM table1
) a
WHERE a.row# = 1

Selecting records that appear several times in a row

My problem is that I would like to select some records which appears in a row.
For example we have table like this:
x
x
x
y
y
x
x
y
Query should give answer like this:
x 3
y 2
x 2
y 1

SQL tables represent unordered sets. Your question only makes sense if there is a column that specifies the ordering. If so, you can use the difference-of-row-numbers to determine the groups and then aggregate:
select col1, count(*)
from (select t.*,
row_number() over (order by <ordering col>) as seqnum,
row_number() over (partition by col1 order by <ordering col>) as seqnum_2
from t
) t
group by col1, (seqnum - seqnum_2)

I made a SQL Fiddle
http://sqlfiddle.com/#!18/f8900/5
CREATE TABLE [dbo].[SomeTable](
[data] [nchar](1) NULL,
[id] [int] IDENTITY(1,1) NOT NULL
);
INSERT INTO SomeTable
([data])
VALUES
('x'),
('x'),
('x'),
('y'),
('y'),
('x'),
('x'),
('y')
;
select * from SomeTable;
WITH SomeTable_CTE (Data, total, BaseId, NextId)
AS
(
SELECT
Data,
1 as total,
Id as BaseId,
Id+1 as NextId
FROM SomeTable
where not exists(
Select * from SomeTable Previous
where Previous.Id+1 = SomeTable.Id
and Previous.Data = SomeTable.Data)
UNION ALL
select SomeTable_CTE.Data, SomeTable_CTE.total+1, SomeTable_CTE.BaseId as BaseId, SomeTable.Id+1 as NextId
from SomeTable_CTE inner join SomeTable on
SomeTable.Data = SomeTable_CTE.Data
and
SomeTable.Id = SomeTable_CTE.NextId
)
SELECT Data, max(total) as total
FROM SomeTable_CTE
group by Data, BaseId
order by BaseId

The elephant in the room is the missing column(s) to establish the order of rows.
SELECT col1, count(*)
FROM (
SELECT col1, order_column
, row_number() OVER (ORDER BY order_column)
- row_number() OVER (PARTITION BY col1 ORDER BY order_column) AS grp
FROM tbl
) t
GROUP BY col1, grp
ORDER BY min(order_column);
To exclude partitions with only a single row, add a HAVING clause:
SELECT col1, count(*)
FROM (
SELECT col1, order_column
, row_number() OVER (ORDER BY order_column)
- row_number() OVER (PARTITION BY col1 ORDER BY order_column) AS grp
FROM tbl
) t
GROUP BY col1, grp
HAVING count(*) > 1
ORDER BY min(order_column);
db<>fiddle here
Add a final ORDER BY to maintain original order (and a meaningful result). You may want to add a column like min(order_column) as well.
Related:
Find the longest streak of perfect scores per player
Select longest continuous sequence
Group by repeating attribute

Random records in Oracle table based on conditions

I have a Oracle table with the following columns
Table Structure
In a query I need to return all the records with CPER>=40 which is trivial. However, apart from CPER>=40 I need to list 5 random records for each CPID.
I have attached a sample list of records. However, in my table I have around 50,000 records.
Appreciate if you can help.

Oracle solution:
with CTE as
(
select t1.*,
row_number() over(order by DBMS_RANDOM.VALUE) as rn -- random order assigned
from MyTable t1
where CPID <40
)
select *
from CTE
where rn <=5 -- pick 5 at random
union all
select t2.*, null
from my_table t2
where CPID >= 40
SQL Server:
with CTE as
(
select t1.*,
row_number() over(order by newid()) as rn -- random order assigned
from MyTable t1
where CPID <40
)
select *
from CTE
where rn <=5 -- pick 5 at random
union all
select t2.*, null
from my_table t2
where CPID >= 40

How about something like this...
SELECT *
FROM (SELECT CID,
CVAL,
CPID,
CPER,
Row_number() OVER (partition BY CPID ORDER BY CPID ASC ) AS RN
FROM Table) tmp
WHERE CPER>=40 OR pids <= 5
However, this is not random.

Assuming that you want five additional random records, you can do:
select t.*
from (select t.*,
row_number() over (partition by cpid,
(case when cper >= 40 then 1 else 2 end)
order by dbms_random.value
) as seqnum
from t
) t
where seqnum <= 5 or cper >= 40;
The row_number() is enumerating the rows for each cpid in two groups -- based on the cper value. The outer where is taking all cper values in the range you want as well as five from the other group.

SQL Server ROW_NUMBER() Issue

I'm using row_number() expression but I don't get result as I expected. I have a sql table and some rows are duplicate. They have same 'BATCHID' and I want to get second row number for these, for others I use first row number. How can I do it?
SELECT * FROM (SELECT * , ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY) Rn FROM SAYIMDCPC ) t
WHERE Rn=1
This code returns to me only first rows, but I want to get second rows for duplicated items.

ROW_NUMBER() gives every row a unique counter. You'd want to use RANK(), which is similar, but gives rows with identical values the same score:
SELECT *
FROM (SELECT * , RANK() OVER (PARTITION BY batchid ORDER BY scaqry) rk
FROM sayimdcpc) t
WHERE rk = 1

If some values are only shown once, but some twice (and perhaps more than twice), you don't want the "first" row, you want the "max" row. Try reversing your order condition:
SELECT *
FROM (SELECT * ,
ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY DESC) Rn
FROM SAYIMDCPC ) t
WHERE Rn=1
As a side note, it's still better to explicitly list out all columns; for instance, you probably don't need Rn outside of this query...

Simply try this,
SELECT * FROM (SELECT * , ROW_NUMBER() OVER (ORDER BY SCAQTY) Rn FROM SAYIMDCPC ) t
WHERE Rn=1

To rephrase it another way, it sounds like you're saying, "When there's a single row for the Batch ID, return the single row. When there are 2 or more rows, return the second row." That's going to require inspecting your Rn value to see what its max is. I don't think you can do it in a single query.
So I'd try something like this:
WITH NumberedRows AS (
SELECT * ,
ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY) Rn
FROM SAYIMDCPC
)
, MaxNumber AS (
SELECT Max(RN) as MaxRn,
BATCHID
FROM NumberedRows
GROUP BY BATCHID
)
, NonDupes AS (
SELECT *
FROM NumberedRows
WHERE BATCHID NOT IN (SELECT BATCHID FROM MaxNumber WHERE MaxNumber = 1)
)
, SecondRows AS (
SELECT (
FROM NumberedRows
WHERE BATCHID NOT IN (SELECT BATCHID FROM MaxNumber WHERE MaxNumber > 1)
AND Rn = 2
)
SELECT
FROM NonDupes
UNION ALL
SELECT *
FROM SecondRows

Please try this. It will select max row num, if no duplicate then it should be first one otherwise second
select * from (SELECT * , ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY) Rn FROM SAYIMDCPC ) d,
(SELECT batchid,max(Rn) maxRn FROM (SELECT * , ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY) Rn FROM SAYIMDCPC ) t
group by batchid) q
where d.batchid = q.batchid and d.rn = q.maxrn

correct me if i am wrong
e.g sample data
BatchID, SCAQTY
1 , 10
2 , 10
2 , 20
2 , 30
is your expectation result like below?
**Expectation Result 1**
BatchID , SCAQty
1 , 10
2 , 30
or
**Expectation Result 2**
BatchID , SCAQty
1 , 10
2 , 20
2 , 30
based on my understanding what you want to perform is Expectation Result 1, so i guess Query below should able to help u, you just need to add desc for SCAQTY in your query
SELECT * FROM (SELECT * ,
ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY DESC) Rn FROM SAYIMDCPC ) t
WHERE Rn=1

Total result set with duplicates and non duplicates. The first column "IsDuplicate" indicates if the column is a duplicate or not.
;WITH d1 AS (
SELECT Seq = ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY)
,*
FROM SAYIMDCPC
)
SELECT IsDuplicate = CONVERT(BIT, Seq)
,*
FROM d1
This will give you only the duplicates:
;WITH d1 AS (
SELECT Seq = ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY)
,*
FROM SAYIMDCPC
)
SELECT IsDuplicate = CONVERT(BIT, Seq)
,*
FROM d1
WHERE Seq > 1
This will give you only the non duplicates (as in your first query)
;WITH d1 AS (
SELECT Seq = ROW_NUMBER() OVER (PARTITION BY BATCHID ORDER BY SCAQTY)
,*
FROM SAYIMDCPC
)
SELECT IsDuplicate = CONVERT(BIT, Seq)
,*
FROM d1
WHERE Seq = 1

Get entries which are mostly available

For example I have the following database entries:
timestamp | value1 | value 2
----------
1452|5|7
1452|1|6
1452|2|7
1623|1|2
1623|5|6
1623|4|5
1623|4|7
1855|1|2
Now I want to have a sql query which returns me value1 only for the timestamp which is availble the most. Therefore it should return only the timestamp 1623 and it's values.
I was first thinking of count, but that will return only the number of the availability and not the entries.

select *
from T
inner join (select timestamp
from T
group by timestamp
order by count(*) desc
limit 1) t2
on T.timestamp = t2.timestamp
see it's working live in a sqlfiddle

WITH CTE AS (
SELECT *, COUNT(timestamps) OVER (PARTITION BY value1, timestamps) AS cnt
FROM mytable
), cte2 as (select *, row_number() over (partition by value1 order by cnt DESC, timestamps) as Rn FROM cte)
SELECT value1, timestamps , cnt FROM CTE2 WHERE Rn = 1;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Having issue susing the MAX command in Databricks - sql

Related

Creating column and filtering it in one select statement

Selecting records that appear several times in a row

Random records in Oracle table based on conditions

SQL Server ROW_NUMBER() Issue

Get entries which are mostly available

Categories

Resources