How could I remove duplicate rows in Spark SQL?

My data is like this:
Code    Time                  Total Value   Model Type   First Status   Second Status
--------------------------------------------------------------------------------------
11111   07/06/2022 06:45:42   23456         MXJ          Turn On        Turn Off
11111   07/06/2022 06:45:42   23456         MXJ          Turn On        Turn Off
11111   03/02/2022 08:01:11   78231         MXJ          Turn On        Turn Off
22222   04/03/2022 13:23:54   20134         MXJ          Turn On        Turn Off
22222   04/03/2022 13:23:54   20134         MXJ          Turn On        Turn Off
The result I want:
Code    Time                  Total Value   Model Type   First Status   Second Status
--------------------------------------------------------------------------------------
11111   07/06/2022 06:45:42   23456         MXJ          Turn On        Turn Off
11111   03/02/2022 08:01:11   78231         MXJ          Turn On        Turn Off
22222   04/03/2022 13:23:54   20134         MXJ          Turn On        Turn Off
My code is like this:
select * from
(
  select
    code,
    `Time`,
    `Model Type`,
    `Total Value`,
    `First Status`,
    lead(`First Status`, 1, null) over (partition by code order by `Time` asc) as `Second Status`
  from file
  where `Model Type` = 'MXJ'
) t
where `First Status` = 'Turn On' and `Second Status` = 'Turn Off'
limit 5

The data in your question is not very clear. However, there are two methods that come to mind for de-duplicating data.
The first is to use DISTINCT. So, if you want to remove duplicates based on all of your columns, you can do,
SELECT DISTINCT *
FROM <your_table>
If you want it to be based on a few columns,
SELECT DISTINCT <column_1>, <column_2> ..
FROM <your_table>
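For example, applied to the sample data above (a sketch; note that column names containing spaces have to be back-quoted in Spark SQL), de-duplicating on a few of the columns would look like:
SELECT DISTINCT code, `Time`, `Total Value`, `Model Type`, `First Status`
FROM file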
The other option is to use GROUP BY. Grouping by the columns you want to de-duplicate on already returns one row per distinct combination; adding a HAVING COUNT(*) > 1 on top instead shows you which combinations were duplicated,
SELECT <column_1>, <column_2> ..
FROM <your_table>
GROUP BY <column_1>, <column_2> ..
HAVING COUNT(*) > 1
So, for your situation, I would suggest creating a TEMP VIEW using the query you have already and then applying one of the methods given above,
CREATE OR REPLACE TEMP VIEW tmp AS
select * from
(
  select
    code,
    `Time`,
    `Model Type`,
    `Total Value`,
    `First Status`,
    lead(`First Status`, 1, null) over (partition by code order by `Time` asc) as `Second Status`
  from file
  where `Model Type` = 'MXJ'
) t
where `First Status` = 'Turn On' and `Second Status` = 'Turn Off';
SELECT DISTINCT *
FROM tmp

Related

How to get the set size, first and last record in a db2 ordered set with one call

I have a very big transaction table on DB2 v11, and I need to query a subset of it as efficiently as possible. All I need is the total count of the set (not known in advance; it's based on criteria, let's say one day), the ID of the first record, and the ID of the last record.
The old code fetched the entire table, then used only the first record's ID, the last record's ID, and the size, discarding the rest. Now this code is timing out. It's a complex query with several joins.
Is there a way to fetch the size of the set, the first record, and the last record all in one select query?
I've read that reordering the list to fetch the first record (fetching with DESC, then changing to ASC) is not efficient.
sample table 1 TRANSACTION_RECORDS:
tdID TIMESTAMP name
-------------------------------
123 2020-03-31 john
234 2020-03-31 dan
456 2020-03-01 Eve
675 2020-04-01 joy
sample table 2 TRANSACTION_TYPE:
invoiceId tdID account
------------------------------
897 123 abc
898 123 def
877 234 mnc
899 456 opp
Sample query
select Min(tr.transaction_id), Max(tr.transaction_id)
from TRANSACTION_RECORDS TR
join TRANSACTION_TYPE TT
on TR.tdID=tt.tdID
WHERE Date(TR.TIMESTAMP) = '2020-03-31'
group by tr.tdID
order by TR.tdID ASC
This results in multiple rows (but it requires the group by):
123,123
234,234
456,456
What I want is:
123,456
As I mentioned in the comments, for this query you need neither GROUP BY nor ORDER BY; just do:
select Min(tr.transaction_id), Max(tr.transaction_id)
from TRANSACTION_RECORDS TR
join TRANSACTION_TYPE TT
on TR.tdID=tt.tdID
WHERE Date(TR.TIMESTAMP) = '2020-03-31'
It should work as expected
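If you also need the total count of the set in the same single pass, you could add it to that query. A sketch, assuming the ID column is the tdID shown in the sample tables and counting distinct IDs so the join to TRANSACTION_TYPE does not inflate the count:
select count(distinct TR.tdID) as set_size,
       min(TR.tdID) as first_id,
       max(TR.tdID) as last_id
from TRANSACTION_RECORDS TR
join TRANSACTION_TYPE TT
  on TR.tdID = TT.tdID
where Date(TR.TIMESTAMP) = '2020-03-31'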

SELECT from 50 columns

I have a table with many columns, around 50, holding datetime data that represent the steps a user takes when performing a procedure:
SELECT UserID, Intro_Req_DateTime, Intro_Onset_DateTime, Intro_Comp_DateTime, Info_Req_DateTime, Info_Onset_DateTime, Info_Comp_DateTime,
Start_Req_DateTime, Start_Onset_DateTime, Start_Comp_DateTime,
Check_Req_DateTime, Check_Onset_DateTime, Check_Comp_DateTime,
Validate_Req_DateTime, Validate_Onset_DateTime, Validate_Comp_DateTime,
....
FROM MyTable
I want to find the step the user performed after a certain datetime.
For example, for user ABC I want to find the first step he did after 2 May 2019 17:25:36.
I cannot use CASE to check this; it would take ages to code.
Is there an easier way to do that?
P.S. Thanks to everyone who suggested redesigning the database, but not all databases can be redesigned. This database belongs to one of the big systems we have and has been in use for more than 20 years; redesigning is out of the question.
You can use CROSS APPLY to unpivot the values. The syntax for UNPIVOT is rather cumbersome.
The actual query text should be rather manageable. No need for complicated CASE statements. Yes, you will have to explicitly list all 50 column names in the query text, you can't avoid that, but it will be only once.
SELECT TOP(1)
    A.StepName
   ,A.dt
FROM
    MyTable
CROSS APPLY
(
    VALUES
     ('Intro_Req', Intro_Req_DateTime)
    ,('Intro_Onset', Intro_Onset_DateTime)
    ,('Intro_Comp', Intro_Comp_DateTime)
    .........
) AS A (StepName, dt)
WHERE
    MyTable.UserID = 'ABC'
    AND A.dt > '2019-05-02T17:25:36'
ORDER BY A.dt ASC; -- ASC, so the earliest step after the cutoff comes first
See also How to unpivot columns using CROSS APPLY in SQL Server 2012
The best way would be to design your table with an action type and the datetime the action was done. Then you can use a simple WHERE clause to find what you want. The table should look like the one below:
ID ActionType ActionDatetime
----------- ----------- -------------------
1492 1 2019-05-13 10:10:10
1494 2 2019-05-13 11:10:10
1496 3 2019-05-13 12:10:10
1498 4 2019-05-13 13:10:10
1500 5 2019-05-13 14:10:10
But with your current structure, you should use UNPIVOT to get what you want. You can find more information in this link.
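For reference, a minimal sketch of the UNPIVOT form, using only the three Intro_* columns shown earlier (the remaining step columns would simply be added to the IN list; note that StepName carries the full column names here):
SELECT TOP(1)
    U.StepName
   ,U.dt
FROM MyTable
UNPIVOT
(
    dt FOR StepName IN (Intro_Req_DateTime, Intro_Onset_DateTime, Intro_Comp_DateTime)
) AS U
WHERE
    U.UserID = 'ABC'
    AND U.dt > '2019-05-02T17:25:36'
ORDER BY U.dt ASC;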

Display related rows in same row in MSaccess

I have a set of related rows which I need to display in a single line. For example, the data I have is in different rows.
"ID" RecordDate "ExpType" "OrigBudget" "ActualCost"
1001 1-5-2017 Hardware $ 5000
1001 2-6-2017 Hardware $ 5200
The Original budget is approved at an earlier time for the same record but the Actual cost often differs and is recorded at a later date. I want the output as
ProjectID   YearofEntry   ExpenseType   OrgBudget   ActualCost
--------------------------------------------------------------
1001        2017          Hardware      $ 5000      $ 5200
I have tried a group query to aggregate it based on ExpenseType and ProjectID, but have not been successful in getting it into a single row so far.
If you always have just two rows for each ExpType - one with the original budget and one with the actual cost - you could simply use a GROUP BY:
SELECT ID AS ProjectID
,YEAR(RecordDate) AS YearofEntry
,ExpType AS ExpenseType
,MAX(OrigBudget) AS OrgBudget
,MAX(ActualCost) AS ActualCost
FROM yourtable
GROUP BY ID
,YEAR(RecordDate)
,ExpType
Try This:
SELECT ID,
Year([RecordDate]) AS YEARofEntry,
ExpType,
Sum(OrigBudget) AS SumOfOrigBudget,
Sum(ActualCost) AS SumOfActualCost
FROM yourtable
GROUP BY ID,
Year([RecordDate]),
ExpType;

Create column based on grouping other values

I have difficulties formulating my issue.
I have a view which returns these results. I need to add a column to the view which will pair up round-trip flights with an identical number.
Flt_No   From_Airport   To_Airport   Dep_Date         RequiredResult
---------------------------------------------------------------------
124      LCA            CDG          10/19/14 5:00    1
125      CDG            LCA          10/19/14 10:00   1
197      LCA            BCN          10/4/12 5:00     2
198      BCN            LCA          10/4/12 11:00    2
501      LCA            HER          15/8/12 12:05    3
502      HER            LCA          15/8/12 15:15    3
I.e. flight 124 is going from Larnaca to CDG, and flight 125 is going back from CDG to Larnaca - they both have to have the same identifier.
Round-trip flights will always have consecutive flight numbers.
I have a bunch of conditions which I won't write now.
Omitting hours is not an option, they're important.
I was thinking dense_rank() but I don't know how to create one identifier for 2 flights with different numbers, please help.
If your data is similar to the sample data posted, then the following query should give the required result:
SELECT *,
       DENSE_RANK() OVER (ORDER BY CASE
                                       WHEN From_Airport < To_Airport THEN From_Airport
                                       ELSE To_Airport
                                   END) AS RequiredResult
FROM mytable
Join conditions are not limited to simple equality. Assuming {Flight No, Departure, Destination} is unique on any one day, then a self join should do it:
select whatever
from flights outbound
inner join flights inbound
    on outbound.flt_no + 1 = inbound.flt_no
    and cast(outbound.dep_date as date) = cast(inbound.dep_date as date)
    and outbound.From_Airport = inbound.To_Airport
    and outbound.To_Airport = inbound.From_Airport

The best way to keep count data in postgres

I need to create a statistic for some aggregate data split by days.
For example:
select
(select count(*) from bananas) as bananas_count,
(select count(*) from apples) as apples_count,
(select count(*) from bananas where color = 'yellow') as yellow_bananas_count;
obviously I will get:
bananas_count | apples_count | yellow_bananas_count
--------------+--------------+----------------------
          123 |          321 |                   15
But I need that data grouped by day; we need to know how many bananas we had yesterday.
My first thought was to create a view, but in that case I will not be able to split by dates (or I don't know how to do it).
I need a performant, database-side implementation of this task.
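For what it's worth, a minimal sketch of one way to get such counts per day in a single pass, assuming each table has a created_at timestamp column (the question does not show the schema, so that column name is an assumption):
-- daily banana counts; the created_at column is assumed to exist
select created_at::date as day,
       count(*) as bananas_count,
       count(*) filter (where color = 'yellow') as yellow_bananas_count
from bananas
group by created_at::date
order by day;
The apples count could be produced the same way and the two results joined on day.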