Cloudera / Impala / SQL: finding all rows with unique value in specific column - sql

Hopefully a simple question for some of you: I have a table adsb_table as follows (apologies for the formatting of the table):
callsign | time | speed|
A | 23421 | 431 |
A | 23422 | 426 |
A | 23423 | 459 |
B | 23424 | 521 |
B | 23425 | 601 |
B | 23426 | 401 |
C | 23427 | 454 |
C | 23428 | 499 |
C | 23429 | 621 |
I want the resulting output to be the first row for each unique value of callsign:
A 23421 431
B 23424 521
C 23427 454
I have tried the following without success:
SELECT callsign, time, speed FROM adsb_table WHERE speed>400 ORDER BY callsign GROUP by callsign
I don't know if the fact that I am using Impala makes a difference in the query. No output is generated; if I remove the "GROUP BY" clause, all ordered records are listed... so I guess I am using GROUP BY incorrectly. Help.

If you always want the first row per callsign, you can use ROW_NUMBER(). Give it an ORDER BY inside the OVER clause (here, time) so that "first" is well defined:
WITH cte AS (
    SELECT
        callsign,
        time,
        speed,
        ROW_NUMBER() OVER (PARTITION BY callsign ORDER BY time) AS row_no
    FROM adsb_table
    WHERE speed > 400
)
SELECT *
FROM cte
WHERE row_no = 1
ORDER BY callsign
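A quick way to sanity-check the pattern outside Impala: the same CTE runs unchanged on SQLite (3.25+), which also supports window functions. A minimal sketch using Python's built-in sqlite3 with the question's sample data:

```python
import sqlite3

# In-memory database seeded with the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE adsb_table (callsign TEXT, time INTEGER, speed INTEGER);
    INSERT INTO adsb_table VALUES
        ('A', 23421, 431), ('A', 23422, 426), ('A', 23423, 459),
        ('B', 23424, 521), ('B', 23425, 601), ('B', 23426, 401),
        ('C', 23427, 454), ('C', 23428, 499), ('C', 23429, 621);
""")

# ROW_NUMBER per callsign, ordered by time, then keep only row 1.
rows = conn.execute("""
    WITH cte AS (
        SELECT callsign, time, speed,
               ROW_NUMBER() OVER (PARTITION BY callsign ORDER BY time) AS row_no
        FROM adsb_table
        WHERE speed > 400
    )
    SELECT callsign, time, speed FROM cte WHERE row_no = 1 ORDER BY callsign
""").fetchall()
print(rows)  # [('A', 23421, 431), ('B', 23424, 521), ('C', 23427, 454)]
```

This matches the expected output in the question, one first-by-time row per callsign.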

Related

What's the best way to optimize a query searching 1000 rows in 50 million by date? (Oracle)

I have table
CREATE TABLE ard_signals
(id, val, str, date_val, dt);
This table records the values of all IDs' signals; there are around 950 unique IDs. At the moment, the table contains about 50 million rows with different values of these signals. Each ID can have only numeric values, string values, or date values.
I get the last value of each ID whose dt is, by condition, not later than an input date:
select id,
       max(val) keep (dense_rank last order by dt) as val,
       max(str) keep (dense_rank last order by dt) as str,
       max(date_val) keep (dense_rank last order by dt) as date_val,
       max(dt)
from ard_signals
where dt <= to_date(any_date)
group by id;
I have an index on ID. At the moment, the query takes about 30 seconds. Please help: what optimizations are possible for this query?
EXPLAIN PLAN: with dt index
Example data (there are about 950-1000 rows like this):
ID  | VAL | STR | DATE_VAL | DT
920 | 0   |     |          | 20.07.2022 9:59:11
490 |     | yes |          | 20.07.2022 9:40:01
565 | 233 |     |          | 20.07.2022 9:32:03
32  | 1   |     |          | 20.07.2022 9:50:01
TL;DR You need your application to maintain a table of distinct id values.
So, you want the last record for each group (distinct id value) in your table, without doing a full table scan. It seems like it should be easy to tell Oracle: iterate through the distinct values for id and then do an index lookup to get the last dt value for each id and then give me that row. But looks are deceiving here -- it's not easy at all, if it is even possible.
Think about what an index on (id) (or (id, dt)) contains. It has a bunch of leaf blocks and a structure to find the highest value of id in each block. But Oracle can only get all the distinct id values by reading every leaf block in the index. So we might find a way to trade our TABLE FULL SCAN for an INDEX FAST FULL SCAN, for a marginal benefit, but it's not really what we're hoping for.
What about partitioning? Can't we create ARD_SIGNALS with PARTITION BY LIST (id) AUTOMATIC and use that? Now the data dictionary is guaranteed to have a separate partition for each distinct id value.
But again - think about what Oracle knows (from DBA_TAB_PARTITIONS) -- it knows what the highest partition key value is in each partition. It is true: for a list partitioned table, that highest value is guaranteed to be the only distinct value in the partition. But I think using that guarantee requires special optimizations that Oracle's CBO does not seem to make (yet).
So, unfortunately, you are going to need to modify your application to keep a parent table for ARD_SIGNALS that has a (single) row for each distinct id.
Even then, it's kind of difficult to get what we want. Because, again, we want Oracle to iterate through the distinct id values, then use the index to find the row with the highest dt for that id... and then stop. So we're looking for an execution plan that makes use of the INDEX RANGE SCAN (MIN/MAX) operation.
I find that tricky with joins, but not so hard with scalar subqueries. So, assuming we named our parent table ARD_IDS, we can start with this:
SELECT /*+ NO_UNNEST(#ssq) */ i.id,
( SELECT /*+ QB_NAME(ssq) */ max(dt)
FROM ard_signals s
WHERE s.id = i.id
AND s.dt <= to_date(trunc(SYSDATE) + 2+ 10/86400) -- replace with your date variable
) max_dt
FROM ard_ids i;
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 22 |
| 1 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.02 | 3021 |
| 2 | FIRST ROW | | 1000 | 1 | 1000 |00:00:00.01 | 3021 |
|* 3 | INDEX RANGE SCAN (MIN/MAX)| ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.01 | 3021 |
| 4 | TABLE ACCESS FULL | ARD_IDS | 1 | 1000 | 1000 |00:00:00.01 | 22 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("S"."ID"=:B1 AND "S"."DT"<=TO_DATE(TO_CHAR(TRUNC(SYSDATE#!)+2+.000115740740740740740740740740740740740741)))
Note the use of hints to keep Oracle from merging the scalar subquery into the rest of the query and losing our desired access path.
Next, it is a matter of using those (id, max(dt)) combinations to look up the rows from the table to get the other column values. I came up with this; improvements may be possible (especially if (id, dt) is not as selective as I am assuming it is):
with k AS (
select /*+ NO_UNNEST(#ssq) */ i.id, ( SELECT /*+ QB_NAME(ssq) */ max(dt) FROM ard_signals s WHERE s.id = i.id AND s.dt <= to_date(trunc(SYSDATE) + 2+ 10/86400) ) max_dt
from ard_ids i
)
SELECT k.id,
max(val) keep ( dense_rank first order by dt desc, s.rowid ) val,
max(str) keep ( dense_rank first order by dt desc, s.rowid ) str,
max(date_val) keep ( dense_rank first order by dt desc, s.rowid ) date_val,
max(dt) keep ( dense_rank first order by dt desc, s.rowid ) dt
FROM k
INNER JOIN ard_signals s ON s.id = k.id AND s.dt = k.max_dt
GROUP BY k.id;
--------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.04 | 7009 | | | |
| 1 | SORT GROUP BY | | 1 | 1000 | 1000 |00:00:00.04 | 7009 | 302K| 302K| 268K (0)|
| 2 | NESTED LOOPS | | 1 | 1005 | 1000 |00:00:00.04 | 7009 | | | |
| 3 | NESTED LOOPS | | 1 | 1005 | 1000 |00:00:00.03 | 6009 | | | |
| 4 | TABLE ACCESS FULL | ARD_IDS | 1 | 1000 | 1000 |00:00:00.01 | 3 | | | |
|* 5 | INDEX RANGE SCAN | ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.02 | 6006 | | | |
| 6 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.02 | 3002 | | | |
| 7 | FIRST ROW | | 1000 | 1 | 1000 |00:00:00.01 | 3002 | | | |
|* 8 | INDEX RANGE SCAN (MIN/MAX) | ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.01 | 3002 | | | |
| 9 | TABLE ACCESS BY GLOBAL INDEX ROWID| ARD_SIGNALS | 1000 | 1 | 1000 |00:00:00.01 | 1000 | | | |
--------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("S"."ID"="I"."ID" AND "S"."DT"=)
8 - access("S"."ID"=:B1 AND "S"."DT"<=TO_DATE(TO_CHAR(TRUNC(SYSDATE#!)+2+.000115740740740740740740740740740740740741)))
... 7000 gets and 4/100ths of a second.
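The two-step shape of that query (parent id table, scalar subquery for max(dt), join back for the payload columns) is portable, even though the MIN/MAX index optimization itself is Oracle-specific. A small sketch in Python's sqlite3, with invented sample rows and a hard-coded cutoff in place of the date variable, just to show the logic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ard_signals (id INT, val REAL, str TEXT, date_val TEXT, dt TEXT);
    CREATE TABLE ard_ids (id INT PRIMARY KEY);   -- parent table: one row per id
    INSERT INTO ard_signals VALUES
        (920, 0,    NULL,  NULL, '2022-07-20 09:59:11'),
        (920, 7,    NULL,  NULL, '2022-07-20 09:10:00'),
        (490, NULL, 'yes', NULL, '2022-07-20 09:40:01');
    INSERT INTO ard_ids VALUES (920), (490);
""")

# Step 1: per id, find max(dt) under the cutoff via a scalar subquery.
# Step 2: join back to ard_signals to fetch the rest of the columns.
rows = conn.execute("""
    WITH k AS (
        SELECT i.id,
               (SELECT MAX(dt) FROM ard_signals s
                WHERE s.id = i.id AND s.dt <= '2022-07-20 23:59:59') AS max_dt
        FROM ard_ids i
    )
    SELECT k.id, s.val, s.str, s.dt
    FROM k JOIN ard_signals s ON s.id = k.id AND s.dt = k.max_dt
    ORDER BY k.id
""").fetchall()
print(rows)
# [(490, None, 'yes', '2022-07-20 09:40:01'), (920, 0.0, None, '2022-07-20 09:59:11')]
```

SQLite will not reproduce Oracle's FIRST ROW / MIN-MAX plan, of course; the sketch only demonstrates that the two-step query returns the latest row per id under the cutoff.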
You want to see the latest entry per ID at a given time.
If you were usually interested in times near the beginning of the recordings, an index on the time could help limit the rows to work with. But I consider this highly unlikely.
It is much more likely that you are interested in the situation at more recent times. This means, for instance, that when looking for the latest entries until today, for one ID the latest entry may be found yesterday, while for another ID the latest entry may be from two years ago.
In my opinion, Oracle already chooses the best approach to deal with this: read the whole table sequentially.
If you have several CPUs at hand, parallel execution might speed up things:
select /*+ parallel(4) */ ...
If you really need this to be much faster, then you may want to consider an additional table with one row per ID, which gets a copy of the latest date and value via a trigger. I.e., you'd introduce redundancy for a gain in speed, as is done in data warehouses.
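That trigger-maintained "latest value" table can be sketched as follows. The example uses Python's sqlite3 and SQLite trigger syntax rather than Oracle's, and the table and column names (ard_latest, a reduced ard_signals) are simplified assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ard_signals (id INTEGER, val REAL, dt TEXT);
    -- One row per id holding the most recent value.
    CREATE TABLE ard_latest (id INTEGER PRIMARY KEY, val REAL, dt TEXT);
    -- On every insert, overwrite the latest row if the new dt is newer.
    CREATE TRIGGER trg_latest AFTER INSERT ON ard_signals
    WHEN NEW.dt >= COALESCE((SELECT dt FROM ard_latest WHERE id = NEW.id), '')
    BEGIN
        INSERT OR REPLACE INTO ard_latest (id, val, dt)
        VALUES (NEW.id, NEW.val, NEW.dt);
    END;
""")
conn.executemany("INSERT INTO ard_signals VALUES (?, ?, ?)", [
    (920, 0, "2022-07-20 09:59:11"),
    (490, 1, "2022-07-20 09:40:01"),
    (920, 5, "2022-07-20 10:00:00"),   # newer value for id 920
])
rows = conn.execute("SELECT id, val FROM ard_latest ORDER BY id").fetchall()
print(rows)  # [(490, 1.0), (920, 5.0)]
```

Reading the latest values then becomes a primary-key lookup on the small table instead of an aggregation over 50 million rows; the cost is the extra write on every insert.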

How to select oldest date row from each product using SQL

I would like to get only the oldest one of every product type, plus the sum of the prices, listed in the listofproduct table. Another thing is to search only among products that have at least one piece in stock.
With the SQL below I managed to get all the products that have at least one on the stock. But for the rest I am stuck...
So the sum could be done later, that was my plan, but if you have a better idea feel free to write.
Here is my data:
+-------------+----------------+---------------+----------+
| IDProizvoda | NazivProizvoda | DatumKupovine | NaLageru |
+-------------+----------------+---------------+----------+
| 77 | Cokolada | 25-Feb-20 | 2 |
| 44 | fgyhufrthr | 06-Aug-20 | 5 |
| 55 | Auto | 06-Aug-23 | 0 |
| 55 | Auto | 11-Aug-20 | 200 |
| 77 | Cokolada | 06-Aug-27 | 0 |
| 77 | Cokolada | 25-Feb-20 | 10 |
| 77 | Cokolada | 25-Jan-20 | 555 |
| 77 | Cokolada | 25-Mar-20 | 40 |
+-------------+----------------+---------------+----------+
Access.ExeQuery("SELECT * FROM Products " &
"WHERE IDProizvoda IN (SELECT value FROM STRING_SPLIT(#listofproduct, ',')) " &
"AND NaLageru > 0 ")
I tried to add GROUP BY and HAVING but it did not work because I select the whole table. But I need the Product ID and Stock fields in the result so I can edit them later, to subtract one from the stock for those products.
I would like to get the result:
+-------------+----------------+---------------+----------+
| IDProizvoda | NazivProizvoda | DatumKupovine | NaLageru |
+-------------+----------------+---------------+----------+
| 44 | fgyhufrthr | 06-Aug-20 | 5 |
| 55 | Auto | 11-Aug-20 | 200 |
| 77 | Cokolada | 25-Jan-20 | 555 |
+-------------+----------------+---------------+----------+
Thank you for all the help.
You can do it with a Cross Apply, this would be your SQL query:
Select P.IDProizvoda,
P.NazivProizvoda,
N.DatumKupovine,
N.NaLageru,
N.IDKupovine,
N.CenaPoKomadu
From
products P
Cross Apply
(
Select top 1 DatumKupovine,
NaLageru,
IDKupovine,
CenaPoKomadu
From products P2
where P2.IDProizvoda = P.IDProizvoda
and P2.NaLageru > 0
order by DatumKupovine
) N
group by P.IDProizvoda, P.NazivProizvoda, N.DatumKupovine, N.NaLageru, N.IDKupovine, N.CenaPoKomadu
And this your ExeQuery:
Access.ExeQuery("Select P.IDProizvoda, P.NazivProizvoda, N.DatumKupovine, N.NaLageru, N.IDKupovine, N.CenaPoKomadu From products P " &
" Cross Apply( Select top 1 DatumKupovine, NaLageru, IDKupovine, CenaPoKomadu From products P2 where P2.IDProizvoda = P.IDProizvoda and P2.NaLageru > 0 order by DatumKupovine) N " &
" where P.IDProizvoda in (Select value From STRING_SPLIT(#listofproduct, ',')) " &
" group by P.IDProizvoda, P.NazivProizvoda, N.DatumKupovine, N.NaLageru, N.IDKupovine, N.CenaPoKomadu " )
I think this is just aggregation with a filter (MIN, because you want the oldest date):
SELECT IDProizvoda, NazivProizvoda, MIN(DatumKupovine),
       SUM(NaLageru)
FROM Products p
WHERE NaLageru > 0
GROUP BY IDProizvoda, NazivProizvoda;
This should do it:
with cte as (
SELECT *, row_number() over (
partition by NazivProizvoda
order by DatumKupovine
) as rn
FROM Products
WHERE IDProizvoda IN (
SELECT value
FROM STRING_SPLIT(#listofproduct, ',')
)
AND NaLageru > 0
)
select *
from cte
where rn = 1;
By way of explanation, I'm using a common table expression to select the superset of the data you want by your criteria, adding a column that enumerates each row within a group (a group being defined here as rows having the same NazivProizvoda) in order of DatumKupovine. With that done, anything with a value of 1 for that enumeration is the oldest in its group. If your data is such that more than one row can be the oldest, use rank() instead of row_number().
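The row_number() vs rank() distinction on ties can be seen directly. A small sketch in Python's sqlite3 (ISO dates are assumed so they sort correctly as text, and the tied 'Auto' rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (IDProizvoda INT, NazivProizvoda TEXT,
                           DatumKupovine TEXT, NaLageru INT);
    INSERT INTO Products VALUES
        (77, 'Cokolada', '2020-01-25', 555),
        (77, 'Cokolada', '2020-02-25', 10),
        (55, 'Auto',     '2020-08-11', 200),
        (55, 'Auto',     '2020-08-11', 30);   -- tie on the oldest date
""")

# Same query shape as the answer; {fn} is swapped between the two functions.
query = """
    WITH cte AS (
        SELECT *, {fn}() OVER (
            PARTITION BY NazivProizvoda ORDER BY DatumKupovine
        ) AS rn
        FROM Products WHERE NaLageru > 0
    )
    SELECT IDProizvoda, DatumKupovine, NaLageru FROM cte WHERE rn = 1
    ORDER BY IDProizvoda, NaLageru
"""
rows_rn = conn.execute(query.format(fn="row_number")).fetchall()
rows_rank = conn.execute(query.format(fn="rank")).fetchall()
print(len(rows_rn), len(rows_rank))  # 2 3
```

row_number() returns exactly one row per product (picking one of the tied 'Auto' rows arbitrarily), while rank() returns both tied rows, which is why the answer suggests it when duplicates on the oldest date should all be kept.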

SQL GROUPING with conditional

I am sure this is easy to accomplish but after spending the whole day trying I had to give up and ask for your help.
I have a table that looks like this
| PatientID | VisitId | DateOfVisit | FollowUp(Y/N) | FollowUpWks |
----------------------------------------------------------------------
| 123456789 | 2222222 | 20180802 | Y | 2 |
| 123456789 | 3333333 | 20180902 | Y | 4 |
| 234453656 | 4443232 | 20180506 | N | NULL |
| 455344243 | 2446364 | 20180618 | Y | 12 |
----------------------------------------------------------------------
Basically I have a list of PatientIDs, each patient can have multiple visits (VisitID and DateOfVisit). FollowUp(Y/N) specifies whether the patients has to be seen again and in how many weeks (FollowUpWks).
Now, what I need is a query that extracts PatientsID, DateOfVisit (the most recent one and only if FollowUp is YES) and the FollowUpWks field.
Final result should look like this
| PatientID | VisitId | DateOfVisit | FollowUp(Y/N) | FollowUpWks |
----------------------------------------------------------------------
| 123456789 | 3333333 | 20180902 | Y | 4 |
| 455344243 | 2446364 | 20180618 | Y | 12 |
----------------------------------------------------------------------
The closest I could get was with this code
SELECT PatientID,
Max(DateOfVisit) AS LastVisit
FROM mytable
WHERE FollowUp = True
GROUP BY PatientID;
The problem is that when I try adding the FollowUpWks field to the SELECT I get the following error: "The query does not include the specified expression as part of an aggregate function." However, if I add FollowUpWks to the GROUP BY statement than I get all visits, not just the most recent ones.
You need to match back to the most recent visit. One method uses a correlated subquery:
SELECT t.*
FROM mytable as t
WHERE t.FollowUp = True AND
t.DateOfVisit = (SELECT MAX(t2.DateOfVisit)
FROM mytable as t2
WHERE t2.PatientID = t.PatientID
);
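A runnable sketch of that correlated subquery, using Python's sqlite3 with the question's sample data (FollowUp is stored as 1/0 since SQLite has no boolean type):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (PatientID INT, VisitId INT, DateOfVisit TEXT,
                          FollowUp INT, FollowUpWks INT);
    INSERT INTO mytable VALUES
        (123456789, 2222222, '20180802', 1, 2),
        (123456789, 3333333, '20180902', 1, 4),
        (234453656, 4443232, '20180506', 0, NULL),
        (455344243, 2446364, '20180618', 1, 12);
""")

# Keep a row only if it is that patient's most recent visit AND a follow-up.
rows = conn.execute("""
    SELECT t.PatientID, t.VisitId, t.DateOfVisit, t.FollowUpWks
    FROM mytable AS t
    WHERE t.FollowUp = 1
      AND t.DateOfVisit = (SELECT MAX(t2.DateOfVisit)
                           FROM mytable AS t2
                           WHERE t2.PatientID = t.PatientID)
    ORDER BY t.PatientID
""").fetchall()
print(rows)
# [(123456789, 3333333, '20180902', 4), (455344243, 2446364, '20180618', 12)]
```

Patient 234453656 drops out because their only visit has FollowUp = 0, matching the expected result in the question.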

RETURN comma list with NULLs if item not found IN list

I am running
Microsoft SQL Server 2012 - 11.0.5058.0 (X64)
Standard Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor). We are inputting sensor data into this database at 1 Hz. Each test has a variable number of sensors that could be of any type. The channel name (ChName) describes to the user what they are.
The user will select the TestID and which ChName's they want from a web interface.
I have a table that is setup like this but contains approximately 15 million rows with ~50 TestIDs:
Timestamp datetime, TestID int, ChName varchar(100), Value real
Timestamp | TestID | ChName | Value
13:52:12 | 1000 | A | 23
13:52:12 | 1000 | B | 2
13:52:12 | 1000 | C | 150
13:52:13 | 1000 | A | 25
13:52:13 | 1000 | C | 147
13:52:13 | 1000 | B | 1
13:52:14 | 1000 | A | 24
13:52:14 | 1000 | B | 4
13:52:14 | 1000 | C | 151
13:52:15 | 1000 | B | 8
13:52:15 | 1000 | C | 153
13:52:16 | 1000 | B | 3
13:52:16 | 1000 | C | 149
13:52:17 | 1000 | C | 152
13:52:17 | 1000 | A | 27
I am looking for a query that, when searching for a specific TestID and specific ChNames, will return a comma-separated result in the order searched, with NULLs for those not found.
For example searching for TestID 1000 and ChNames ('A','B','C') would return:
Timestamp | Data
13:52:12 | 23,2,150
13:52:13 | 25,1,147
13:52:14 | 24,4,151
13:52:15 | NULL,8,153
13:52:16 | NULL,3,149
13:52:17 | 27,NULL,152
Searching for TestID 1000 and ChNames ('B','C') would return:
Timestamp | Data
13:52:12 | 2,150
13:52:13 | 1,147
13:52:14 | 4,151
13:52:15 | 8,153
13:52:16 | 3,149
13:52:17 | NULL,152
I've implemented this in PHP by returning all rows that contain the TestID and ChNames, but it is slow (returning ~503,000 rows and doing the grouping in PHP takes approximately 2 minutes). I do believe the table could be structured better, but unfortunately I inherited the design, so I am trying to get a more efficient query.
Purpose of this data is to pull it and export to excel or the user can graph it via a webapp. User has the ability to select all data or a certain time period.
The query when requesting all data looks like this and then in PHP I group them and add the NULL if not found.
SELECT Timestamp, ChName, Value FROM data_table WHERE TestID=1000 AND ChName IN ('A','B','C') ORDER BY Timestamp, ChName
I would suggest to use PIVOT and build the query dynamically based on the list of values for ChName you want to produce the result for:
SELECT TimeStamp,
       COALESCE(CONVERT(varchar(30), [A]), 'NULL') + ',' +
       COALESCE(CONVERT(varchar(30), [B]), 'NULL') + ',' +
       COALESCE(CONVERT(varchar(30), [C]), 'NULL') AS Data
FROM
( SELECT TimeStamp, ChName, Value
  FROM data_table
  WHERE TestID = 1000
    AND ChName IN ('A', 'B', 'C')
) AS SourceTable
PIVOT
(
  MAX(Value)
  FOR ChName IN ([A], [B], [C])
) AS PivotTable;
Note that Value is real, so it has to be converted to varchar before concatenating it with the 'NULL' literal.
I have not tested this, so it may have some syntax issues.
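Where PIVOT is unavailable, the same result can be produced with conditional aggregation, which is also a handy way to test the NULL handling. A sketch in Python's sqlite3 with a subset of the sample data (string concatenation is || in SQLite rather than +, and CAST stands in for CONVERT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_table (Timestamp TEXT, TestID INT, ChName TEXT, Value REAL);
    INSERT INTO data_table VALUES
        ('13:52:12', 1000, 'A', 23), ('13:52:12', 1000, 'B', 2),
        ('13:52:12', 1000, 'C', 150),
        ('13:52:15', 1000, 'B', 8),  ('13:52:15', 1000, 'C', 153);
""")

# One MAX(CASE ...) per channel plays the role of the pivoted column;
# COALESCE turns a missing channel into the literal string 'NULL'.
rows = conn.execute("""
    SELECT Timestamp,
           COALESCE(CAST(MAX(CASE WHEN ChName = 'A' THEN Value END) AS TEXT), 'NULL')
           || ',' ||
           COALESCE(CAST(MAX(CASE WHEN ChName = 'B' THEN Value END) AS TEXT), 'NULL')
           || ',' ||
           COALESCE(CAST(MAX(CASE WHEN ChName = 'C' THEN Value END) AS TEXT), 'NULL')
           AS Data
    FROM data_table
    WHERE TestID = 1000 AND ChName IN ('A', 'B', 'C')
    GROUP BY Timestamp
    ORDER BY Timestamp
""").fetchall()
print(rows)  # [('13:52:12', '23.0,2.0,150.0'), ('13:52:15', 'NULL,8.0,153.0')]
```

The '13:52:15' row shows the desired behavior: channel A was never recorded at that second, so its slot becomes NULL in the comma list.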

Pick a record based on a given value in postgres

I have a table in postgres like below,
alg_campaignid | alg_score | cp | sum
----------------+-----------+---------+----------
9829 | 30.44056 | 12.4000 | 12.4000
9880 | 29.59280 | 12.0600 | 24.4600
9882 | 29.59280 | 12.0600 | 36.5200
9827 | 29.27504 | 11.9300 | 48.4500
9821 | 29.14840 | 11.8800 | 60.3300
9881 | 29.14840 | 11.8800 | 72.2100
9883 | 29.14840 | 11.8800 | 84.0900
10026 | 28.79280 | 11.7300 | 95.8200
10680 | 10.31504 | 4.1800 | 100.0000
From which I have to select a record based on a randomly generated number from 0 to 100: i.e., the first record should be returned if the random number picked is between 0 and 12.4000, the second if the random number is between 12.4000 and 24.4600, and likewise the last if the random number is between 95.8200 and 100.0000.
For Example
if the random number picked is 8 then the first record should be returned
or
if the random number picked is 48 then the fourth record should be returned
Is it possible to do this in Postgres? If so, kindly recommend a solution.
Yes, you can do this in Postgres. If you want to generate the number in the database, pick the first row whose cumulative sum reaches it:
with r as (
      select random() * 100 as r
     )
select t.*
from mytable t cross join r
where t.sum >= r.r
order by t.sum asc
limit 1;
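The predicate direction can be verified deterministically by parameterizing the random number. A sketch in Python's sqlite3 (the sum column is renamed cum here to avoid clashing with the aggregate name, and the campaign rows are abbreviated from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaigns (alg_campaignid INT, cp REAL, cum REAL);
    INSERT INTO campaigns VALUES
        (9829, 12.40, 12.40), (9880, 12.06, 24.46), (9882, 12.06, 36.52),
        (9827, 11.93, 48.45), (9821, 11.88, 60.33);
""")

def pick(r):
    # First row whose running total covers r: cum >= r, smallest cum wins.
    row = conn.execute("""
        SELECT alg_campaignid FROM campaigns
        WHERE cum >= ? ORDER BY cum LIMIT 1
    """, (r,)).fetchone()
    return row[0]

print(pick(8))   # 9829, the first record (0 <= 8 < 12.40)
print(pick(48))  # 9827, the fourth record (36.52 <= 48 < 48.45)
```

Both of the question's worked examples come out as expected; in production, r would be supplied by random() * 100 exactly as in the query above.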