Selecting batches of rows - sql

I have a table widget_events that records event_what events occurring to
widget widget_id on date event_when. It's possible for the same event to
occur multiple times to the same widget on the same day. For this reason,
column event_id is used as primary key to distinguish such rows. Here is
the table declaration:
CREATE TABLE widget_events
(
event_id int4 UNIQUE NOT NULL,
event_when date NOT NULL,
event_what text NOT NULL,
widget_id int4 REFERENCES widgets (widget_id) NOT NULL,
PRIMARY KEY (event_id)
);
The client application processes events in batches, where each batch consists
of all events for one widget on one date. However, the application has no
previous knowledge of which widgets and dates are stored in widget_events.
One possible solution is to start by selecting one arbitrary row from
widget_events (using SQL's LIMIT), and then do another query for all
rows with the same widget_id and event_when. After this batch is
processed, those rows can be deleted from widget_events, and we go back
to the first step. The algorithm stops when the first step returns no
row.
My question is whether there is a faster, more elegant way to do this.
Is it possible in SQL (in particular the SQL understood by PostgreSQL)
to return each distinct batch in a single query?

To select the distinct batches (one per widget and date):
select distinct widget_id
, event_when
from widget_events
Or you could pick up a single batch in one query, like:
select batch.*
from widget_events batch
join (
select widget_id
, event_when
from widget_events
limit 1
) filter
on filter.widget_id = batch.widget_id
and filter.event_when = batch.event_when

Why don't you just return the rows, ordered by event_when:
select *
from widget_events we
order by event_when, event_what, event_id
I threw in event_what as well, so all similar events will be on consecutive rows.
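Your logic can then just look for when the date changes to determine whether something is the last event. You could even put this into the select, if you wanted. The window has to be ordered the same way as the result (partitioning by event_id would put every row in its own partition), and is distinct from also handles the very first and very last rows, where lag and lead return NULL:
select *,
(case when lag(event_when) over w is distinct from event_when then 1
else 0
end) as isFirst,
(case when lead(event_when) over w is distinct from event_when then 1
else 0
end) as isLast
from widget_events we
window w as (order by event_when, event_what, event_id)
order by event_when, event_what, event_id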
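As a rough sketch of the "one ordered query, split on the client" idea, the snippet below reads every row once and groups adjacent rows by (widget_id, event_when), which is the batch definition from the question. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the sample rows and the helper name read_batches are made up for illustration.

```python
import sqlite3
from itertools import groupby


def read_batches(conn):
    """Yield ((widget_id, event_when), [rows]) for each batch, using a single query."""
    rows = conn.execute(
        "SELECT widget_id, event_when, event_id, event_what "
        "FROM widget_events "
        "ORDER BY widget_id, event_when, event_id"
    )
    # The ORDER BY makes the rows of one batch adjacent, so groupby can split them.
    for key, group in groupby(rows, key=lambda r: (r[0], r[1])):
        yield key, list(group)


# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widget_events ("
    "event_id INTEGER PRIMARY KEY, event_when TEXT NOT NULL, "
    "event_what TEXT NOT NULL, widget_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO widget_events VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01", "created", 10),
        (2, "2024-01-01", "painted", 10),
        (3, "2024-01-02", "shipped", 10),
        (4, "2024-01-01", "created", 20),
    ],
)

batches = list(read_batches(conn))
for (widget_id, event_when), events in batches:
    print(widget_id, event_when, [e[3] for e in events])
```

With this shape the client never needs the select-one-row, select-batch, delete loop; whether it is actually faster depends on table size and indexes.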

Related

Select 1 record from each of 2 duplicate records

I have a messaging application which regularly inserts duplicate messages in BigQuery. The table name is 'metrics' and it has the following fields:
The Row column is a bigquery ROW_NUMBER() which is not part of the metrics table. All the other columns except batch_id form 2 duplicate rows for each message_id. You can see that message_id is repeated twice, and for each insertion 1 different batch_id is created.
I want the output to contain only 3 rows, one per message_id, instead of the 6 rows I get here. Ideally, the row that was inserted first among the duplicates for each message_id would be selected (since start_time and end_time are the same for the duplicates, I am not sure how to find that). I am new to BigQuery; I have seen some examples in SQL but not in BigQuery, so any help is appreciated.
Thanks for your help.
This deduping process becomes part of your business logic, so pick one method and stay consistent. I would do something like this:
with data as (
select
*,
row_number() over(partition by message_id order by batch_id asc) as rn
from `project.dataset.table`
)
select * from data where rn = 1
This query selects the row that has the "minimum" batch_id for each message_id. Your batch_id values seem random/hashed (and not necessarily in a specific order), so this might or might not work for your purposes, but it should reproduce the same results every time (unless a 3rd record shows up, in which case the selected row could change).
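To see the row_number() pattern in action without a BigQuery project, here is a small sketch that runs the same query shape against SQLite through Python's sqlite3 module. The two-column metrics table and its values are invented for the example; the real table has more columns.

```python
import sqlite3

# Same idea as the query above: number the duplicates per message_id, keep number 1.
DEDUP_SQL = """
WITH data AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY batch_id ASC) AS rn
    FROM metrics
)
SELECT message_id, batch_id
FROM data
WHERE rn = 1
ORDER BY message_id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (message_id TEXT, batch_id TEXT)")
# Hypothetical data: every message_id was inserted twice with different batch_ids.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("m1", "b7"), ("m1", "b2"), ("m2", "b5"), ("m2", "b9"), ("m3", "b1"), ("m3", "b4")],
)

deduped = conn.execute(DEDUP_SQL).fetchall()
print(deduped)
```

If an insertion timestamp were available, ordering by it instead of batch_id would pick the row that was really inserted first.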

if-then-else construction in complex stored procedure

I am relatively new to sql queries and I was wondering how to create a complex stored procedure. My database runs on SQL server.
I have a table customer (id, name) and a table customer_events (id, customer_id, timestamp, action_type). I want to add a calculated field customer_status to table customer which is
0: (if there is no event for this customer in customer_events) or (the most recent event is > 5 minutes ago)
1: if the most recent event is < 5 minutes ago and action_type=0
2: if the most recent event is < 5 minutes ago and action_type=1
Can I use if-then-else constructions or should I solve this challenge differently?
As you mentioned in comments, you actually want to add a field to a select query, and in a general sense what you want is a CASE statement. They work like this:
SELECT field1,
field2,
CASE
WHEN some_condition THEN some_result
WHEN another_condition THEN another_result
END AS field_alias
FROM table
Applied to your specific scenario, it's not totally straightforward. You're certainly going to need to left join your events table, and you also need to find the most recent event for each customer, along with that event's action type. Once you have that information, the case expression is straightforward.
It's always hard to write SQL without access to your data, but something like:
SELECT c.id,
c.name,
CASE
WHEN e.id IS NULL OR DATEDIFF(minute,e.timestamp,getDate())>=5 THEN 0
WHEN DATEDIFF(minute,e.timestamp,getDate())<5 AND e.action_type=0 THEN 1
WHEN DATEDIFF(minute,e.timestamp,getDate())<5 AND e.action_type=1 THEN 2
END as customer_status
FROM customer c
LEFT JOIN (
SELECT id, customer_id, timestamp, action_type,
rank() OVER(partition by customer_id order by timestamp desc) AS r
FROM customer_events
) e
ON c.id=e.customer_id AND e.r=1
The core of this is the subquery in the middle: it uses a rank function to give a number to each event by customer_id, ordered by the timestamp descending. Therefore every record with a rank of 1 will be the most recent (for that customer). Thereafter, you simply join it on to the customer table, and use it to determine the right value for customer_status.
Presuming you get the event info into "Most_Recent_Event_Mins_Ago". If there is no event, it will be NULL.
SELECT Id, Name,
CASE
WHEN Most_Recent_Event_Mins_Ago IS NULL THEN 0
WHEN Most_Recent_Event_Mins_Ago <5 AND Action_type = 0 THEN 1
WHEN Most_Recent_Event_Mins_Ago <5 AND Action_type = 1 THEN 2
..other scenarios
ELSE yourDefaultValueForStatus
END as Status
FROM customer
WHERE
...
...
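The status rules from the question can be tried out end to end with a small sketch like the one below. It runs on SQLite via Python's sqlite3 rather than SQL Server, so two things are assumptions made for portability: event times are stored as integer seconds in a column called event_ts (instead of a datetime compared with DATEDIFF), and the current time is passed in as the :now parameter. All sample rows are invented.

```python
import sqlite3

STATUS_SQL = """
SELECT c.id, c.name,
       CASE
           WHEN e.id IS NULL OR :now - e.event_ts >= 300 THEN 0  -- no event, or 5+ minutes old
           WHEN e.action_type = 0 THEN 1
           WHEN e.action_type = 1 THEN 2
       END AS customer_status
FROM customer c
LEFT JOIN (
    SELECT id, customer_id, event_ts, action_type,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_ts DESC) AS r
    FROM customer_events
) e ON e.customer_id = c.id AND e.r = 1   -- keep only the most recent event
ORDER BY c.id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE customer_events ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, event_ts INTEGER, action_type INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ann"), (2, "bob"), (3, "cy"), (4, "dee")],
)
conn.executemany(
    "INSERT INTO customer_events VALUES (?, ?, ?, ?)",
    [(1, 1, 900, 0), (2, 1, 500, 1), (3, 2, 950, 1), (4, 3, 100, 0)],
)

# Pretend "now" is second 1000, so events after second 700 count as recent.
statuses = conn.execute(STATUS_SQL, {"now": 1000}).fetchall()
print(statuses)
```

ROW_NUMBER() is used here instead of rank() so that two events with the same timestamp cannot produce two rows for one customer.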

How to pull the most recent values from a T-SQL table

I have a database table that I need to process with either a view or a stored procedure or something else that gives me a result based on the live data.
The table holds records of people with data associated with each one. The thing is that people can be in the table more than once. Each record shows a time when one or more pieces of information was recorded for an individual.
The identifier field for the people is cardholder_index. I need to take a DISTINCT list of that field. There is also a date field called bio_complete_date. What I need to do is, for all the other fields in the table, take the most recent non-null (or possibly non-zero) value.
For instance, there is a bmi field. For each distinct cardholder index, I need to take the most recent (by the bio_complete_date field) non-null bmi for that cardholder_index. But there's also a body_fat field, and I need to take the most recent non-null value in that field, which might not necessarily be the same row as the most recent non-null bmi value.
For the record, the table itself does have its own unique identifier column, bio_id, if that helps.
I don't need to show when the most recent piece of information was taken. I just need to show the data itself.
I figure I need to do a distinct on the cardholder_index, and then join to it the result sets of queries for each other field. It's writing the subqueries that is giving me problems.
From your description I guess your table looks something like this:
create table people (
bio_id int identity(1,1),
cardholder_index int,
bio_complete_date date,
bmi int,
body_fat int
)
If so, one way (of many) to do the query would be to use correlated queries to pull the latest non-null value for the cardholder_index, either using subqueries like this:
select
cardholder_index,
(
select top 1 bmi
from people
where cardholder_index = p.cardholder_index and bmi is not null
order by bio_complete_date desc
) as latest_bmi,
(
select top 1 body_fat
from people
where cardholder_index = p.cardholder_index and body_fat is not null
order by bio_complete_date desc
) as latest_body_fat
from people p
group by cardholder_index
or to use the apply operator like this:
select cardholder_index, latest_bmi.bmi, latest_body_fat.body_fat
from people p
outer apply (
select top 1 bmi
from people
where cardholder_index = p.cardholder_index and bmi is not null
order by bio_complete_date desc
) as latest_bmi
outer apply (
select top 1 body_fat
from people
where cardholder_index = p.cardholder_index and body_fat is not null
order by bio_complete_date desc
) as latest_body_fat
group by cardholder_index, latest_bmi.bmi, latest_body_fat.body_fat
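For experimenting with the correlated-subquery approach outside SQL Server, a minimal sketch using Python's sqlite3 follows. SQLite has no TOP, so LIMIT 1 plays that role here; the table layout mirrors the guessed people table above, dates are stored as ISO text, and the sample rows are made up.

```python
import sqlite3

LATEST_SQL = """
SELECT p.cardholder_index,
       (SELECT bmi FROM people
         WHERE cardholder_index = p.cardholder_index AND bmi IS NOT NULL
         ORDER BY bio_complete_date DESC LIMIT 1) AS latest_bmi,
       (SELECT body_fat FROM people
         WHERE cardholder_index = p.cardholder_index AND body_fat IS NOT NULL
         ORDER BY bio_complete_date DESC LIMIT 1) AS latest_body_fat
FROM (SELECT DISTINCT cardholder_index FROM people) p
ORDER BY p.cardholder_index
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "bio_id INTEGER PRIMARY KEY AUTOINCREMENT, cardholder_index INTEGER, "
    "bio_complete_date TEXT, bmi INTEGER, body_fat INTEGER)"
)
# Hypothetical rows: the latest non-null bmi and body_fat sit on different dates.
conn.executemany(
    "INSERT INTO people (cardholder_index, bio_complete_date, bmi, body_fat) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01", 22, 30),
        (1, "2024-02-01", None, 28),
        (1, "2024-03-01", 24, None),
        (2, "2024-01-15", None, 18),
        (2, "2024-02-15", None, None),
    ],
)

latest = conn.execute(LATEST_SQL).fetchall()
print(latest)
```

A cardholder with no non-null value for a column simply gets NULL (None in Python) for it.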

Write Oracle SQL query to fetch from Tasks table top Approval Statuses that appear after some first null value

I need to write an Oracle SQL query that fetches the top Approval Statuses from the Tasks table: the first sequence of non-null values in the Approval_Status column, which appears after some initial null values and is followed by more null values.
Facts
I only need the top Approval Statuses sequence
The Serial Number for each task ID starts from 1 and then increases in sequence, like 1, 2, 3, ... and so on
There are thousands of tasks in the table, from T1 to Tn
See the Query Result below; I need to write a query that returns data in that format
I have heard that analytic functions (i.e. the "PARTITION BY" clause) can be used for this, but I don't know how to use them
Tasks
Query Result
I really appreciate experts help in this regard
Thanks
You can do this with analytic functions, but there is a trick. The idea is to look only at rows where approval_status is not null. You want the first group of sequential serial numbers in this group.
The group is identified by the difference between a sequence that enumerates the non-null rows and the existing serial number. This difference is constant within a run of consecutive serial numbers and drops at every gap, so the first run has the largest value. To rank the groups, use dense_rank() ordered by that difference descending. Finally, choose the first group by looking for the rows with a rank equal to 1:
select t.*
from (select t.*, dense_rank() over (partition by taskid order by diff desc) as grpnum
from (select t.*,
(row_number() over (partition by taskid order by serial_number) -
serial_number
) as diff
from tasks t
where approval_status is not null
) t
) t
where grpnum = 1;
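The row_number-minus-serial_number trick is easier to trust after watching it run. The sketch below applies it to a tiny invented tasks table using Python's sqlite3 (SQLite supports the same window functions as the Oracle query); the task IDs, serial numbers and statuses are sample values only.

```python
import sqlite3

FIRST_RUN_SQL = """
SELECT taskid, serial_number, approval_status
FROM (
    SELECT t.*,
           DENSE_RANK() OVER (PARTITION BY taskid ORDER BY diff DESC) AS grpnum
    FROM (
        SELECT taskid, serial_number, approval_status,
               ROW_NUMBER() OVER (PARTITION BY taskid ORDER BY serial_number)
                   - serial_number AS diff   -- constant within one consecutive run
        FROM tasks
        WHERE approval_status IS NOT NULL
    ) t
) ranked
WHERE grpnum = 1
ORDER BY taskid, serial_number
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (taskid TEXT, serial_number INTEGER, approval_status TEXT)")
# Hypothetical data: leading NULLs, a first run of statuses, a NULL gap, then a later status.
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        ("T1", 1, None), ("T1", 2, "A"), ("T1", 3, "B"), ("T1", 4, None), ("T1", 5, "C"),
        ("T2", 1, None), ("T2", 2, None), ("T2", 3, "X"), ("T2", 4, "Y"),
        ("T2", 5, None), ("T2", 6, "Z"),
    ],
)

first_runs = conn.execute(FIRST_RUN_SQL).fetchall()
print(first_runs)
```

For T1 the non-null rows have serial numbers 2, 3, 5 and row numbers 1, 2, 3, so diff is -1, -1, -2; the largest diff marks the first run, which is why the ranking is descending.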

Getting the last record in SQL in WHERE condition

I have a table loanTable that contains two fields, loan_id and status:
loan_id status
==============
1 0
2 9
1 6
5 3
4 5
1 4 <-- How do I select this??
4 6
In this situation I need to show the last status of loan_id 1, i.e. status 4. Can you please help me with this query?
Since the 'last' row for ID 1 is neither the minimum nor the maximum, you are living in a state of mild confusion. Rows in a table have no order. So, you should be providing another column, possibly the date/time when each row is inserted, to provide the sequencing of the data. Another option could be a separate, automatically incremented column which records the sequence in which the rows are inserted. Then the query can be written.
If the extra column is called status_id, then you could write:
SELECT L1.*
FROM LoanTable AS L1
WHERE L1.Status_ID = (SELECT MAX(Status_ID)
FROM LoanTable AS L2
WHERE L2.Loan_ID = 1);
(The table aliases L1 and L2 could be omitted without confusing the DBMS or experienced SQL programmers.)
As it stands, there is no reliable way of knowing which is the last row, so your query is unanswerable.
Does your table happen to have a primary id or a timestamp? If not then what you want is not really possible.
If yes then:
SELECT TOP 1 status
FROM loanTable
WHERE loan_id = 1
ORDER BY primaryId DESC
-- or
-- ORDER BY yourTimestamp DESC
I assume that with "last status" you mean the record that was inserted most recently? AFAIK there is no way to make such a query unless you add timestamp into your table where you store the date and time when the record was added. RDBMS don't keep any internal order of the records.
If "last" means last inserted, that's not possible with the current schema until you add a primary key (or another column that records insertion order). With one in place:
select top 1 status, loan_id
from loanTable
where loan_id = 1
order by id desc -- PK
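Here is a minimal sketch of the "add a sequencing column, then order by it" advice, run against SQLite with Python's sqlite3. The auto-incrementing id column is the assumed addition to the schema, LIMIT 1 stands in for TOP 1, and last_status is a hypothetical helper name.

```python
import sqlite3


def last_status(conn, loan_id):
    """Return the most recently inserted status for loan_id, or None if there is none."""
    row = conn.execute(
        "SELECT status FROM loanTable WHERE loan_id = ? ORDER BY id DESC LIMIT 1",
        (loan_id,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
# The id column records insertion order; without it "last" has no meaning.
conn.execute(
    "CREATE TABLE loanTable ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, loan_id INTEGER, status INTEGER)"
)
# Rows from the question, inserted in the order shown there.
conn.executemany(
    "INSERT INTO loanTable (loan_id, status) VALUES (?, ?)",
    [(1, 0), (2, 9), (1, 6), (5, 3), (4, 5), (1, 4), (4, 6)],
)

print(last_status(conn, 1))
```

The same function works for any loan_id, and returns None for a loan that has no rows.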
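Use a data reader and remember the value from each row as you read it; when the while loop exits, the variable holds the value from the last row returned. (After Read() returns false the reader is no longer positioned on a row, so the value cannot be read at that point.) As the other posters stated, unless you put a sort on the query, the row order could change. Even if there is a clustered index on the table it might not return the rows in that order (without a sort on the clustered index).
string lastVal = null;
using (SqlDataReader rdr = SQLcmd.ExecuteReader())
{
while (rdr.Read())
{
// Keep the first column of the current row; the final assignment wins.
lastVal = rdr[0].ToString();
}
}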
You could also use a ROW_NUMBER() but that requires a sort and you cannot use ROW_NUMBER() directly in the Where. But you can fool it by creating a derived table. The rdr solution above is faster.
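In Oracle you can reverse the order in which the rows happen to be retrieved and take the first one, although that retrieval order is not guaranteed:
select * from (select * from loanTable where loan_id = 1 order by rownum desc) where rownum=1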
Hi if this has not been solved yet.
To get the last record for any field from a table, the easiest way is to add an ID to each record, say pID. Say that in your table you would like to get the last record for each 'Name'; run this simple query:
SELECT Name, MAX(pID) as LastID
INTO [TableName]
FROM [YourTableName]
GROUP BY [Name]/[Any other field you would like your last records to appear by]
You should now have a table containing the Names in one column and the last available ID for that Name.
Now you can use a join to get the other details from your primary table, say this is some price or date then run the following:
SELECT a.*,b.Price/b.date/b.[Whatever other field you want]
FROM [TableName] a LEFT JOIN [YourTableName] b
ON a.Name = b.Name and a.LastID = b.pID
This should then give you the last records for each Name, for the first record run the same queries as above just replace the Max by Min above.
This should be easy to follow and should run quicker as well
If you don't have any identifying columns you could use to get the insert order, you can always do it like this. But it's hacky, and not very pretty.
select
t.row1,
t.row2,
ROW_NUMBER() OVER (ORDER BY t.[count]) AS rownum from (
select
tab.row1,
tab.row2,
1 as [count]
from table tab) t
So basically you get the 'natural order' if you can call it that, and add some column with all the same data. This can be used to sort by the 'natural order', giving you an opportunity to place a row number column on the next query.
Personally, if the system you are using hasn't got a time stamp/identity column, and the current users are using the 'natural order', I would quickly add a column and use this query to create some sort of time stamp/incremental key. Rather than risking having some automation mechanism change the 'natural order', breaking the data needed.
I think this code may help you:
WITH cte_Loans
AS
(
SELECT LoanID
,[Status]
,ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RN
FROM LoanTable
)
SELECT LoanID
,[Status]
FROM cte_Loans L1
WHERE RN = ( SELECT max(RN)
FROM cte_Loans L2
WHERE L2.LoanID = L1.LoanID)