Firebird select distinct with count - sql

In Firebird 2.5 I have a table of hardware device events; each row contains a timestamp, a device ID and an integer status of the event. I need to retrieve a rowset of the subset of IDs with non-0 statuses and the number of instances of the non-0 events for each ID, within a specified date range. I can get the subset of IDs with non-0 statuses in the specified date range, but I can't figure out how to get the count of non-0-status rows associated with each ID in the same rowset. I'd prefer to do this in a query rather than a stored proc, if possible.
The table is:
RPR_HISTORY
TSTAMP timestamp
RPRID integer
PARID integer
LASTRES integer
LASTCUR float
The rowset I want is like
RPRID ERRORCOUNT
-------------------
18 4
19 2
66 7
The query
select distinct RPRID from RPR_HISTORY
where (LASTRES <> 0)
and (TSTAMP >= :STARTSTAMP);
gives me the IDs I'm looking for, but obviously not the count of non-0-status rows for each ID. I've tried a bunch of combinations of nested queries derived from the above; all generate errors, usually on grouping or aggregation errors. It seems like a straightforward thing to do but is just escaping me.

Got it! The query
select rh.RPRID, count(rh.RPRID) from RPR_HISTORY rh
where (rh.LASTRES <> 0)
and (rh.TSTAMP >= :STARTSTAMP)
and rh.RPRID in
(select distinct rd.RPRID from RPR_HISTORY rd where rd.LASTRES <> 0)
group by rh.RPRID;
returns the rowset I need.

Related

Calculating the mode/median/most frequent observation in categorical variables in SQL impala

I would like to calculate the mode/median or better, most frequent observation of a categorical variable within my query.
E.g, if the variable has the following string values:
dog, dog, dog, cat, cat and I want to get dog since its 3 vs 2.
Is there any function that does that? I tried APPX_MEDIAN() but it only returns the first 10 characters as median and I do not want that.
Also, I would like to get the most frequent observation with respect to date if there is a tie-break.
Thank you!
the most frequent observation is mode and you can calculate it like this.
Single value mode can be calculated like this on a value column. Get the count and pick up row with max count.
select count(*),value from mytable group by value order by 1 desc limit 1
now, in case you have multiple modes, you need to join back to the main table to find all matches.
select orig.value from
(select count(*) c, value v from mytable) orig
join (select count(*) cmode from mytable group by value order by 1 desc limit 1) cmode
ON orig.c= cmode.cmode
This will get all count of values and then match them based on count. Now, if one value of count matches to max count, you will get 1 row, if you have two value counts matches to max count, you will get 2 rows and so on.
Calculation of median is little tricky - and it will give you middle value. And its not most frequent one.

How can I retrieve data from a Hive table from two columns with non null values and top 500 records in one query?

I have a Hive table (my_table) which is in ORC format and has 30 columns. Two of the columns (col_us, col_ds) store numeric values which can be 0 or null or some integer. The table is partitioned on the bases of day and hourly.
The table has approx. 8 Million x 96 records in a days partition and I am referring to 15 daily partitions
Currently I am running separate queries to retrieve top 500 records with value greater than 0 using a rank function. One query to retrieve col_us and other for col_ds
It is possible that clo_US may have a numeric value while col_DS is 0 or null
Question:
I want to retrieve top 500 non null and non 0 records from each of these columns from one query.
My Query:
From(
SELECT D.COL_US, D.DATESTAMP,
ROW_NUMBER() OVER (PARTITION BY D.ID,D.SUB_ID ORDER BY CONCAT (D.DATESTAMP,D.HOURSTAMP,D.TIMESTAMP) DESC) AS RNK
FROM ${wf_table_name} D
WHERE DATESTAMP >= '${datestamp_15}' AND DATESTAMP < '${datestamp}'
AND COL_US > 0)T
INSERT OVERWRITE TABLE ${wf_us_table}
SELECT T.COL_US, T.DATESTAMP, T.RNK WHERE T.RNK < 500;
As per your query I can guess that you are trying to get top 500 rows from your table based on date/time that means latest 500 rows where col_us, col_ds both have a value which is >0 but not top 500 from each of these columns.
As per your question your table may have 2 type of value. for example.
col_us
0
NULL
10
5
col_ds
5
10
0
NULL
or both column may have >0 value.
So instead of 'AND COL_US > 0' under WHERE clause use 'AND (COL_US > 0 and col_ds > 0)'
But with this condition you will not get any value from above stated 4 rows.
So if you want to get 10,5 from col_us along with 5,10 col_ds then I should say it's not possible using a single query.
Again, as per your question stated "I want to retrieve top 500 non null and non 0 records from each of these columns from one query." ,
I can guess that you want to get top 500 records from col_us, col_ds depends on the value of col_us/col_ds then you must have to use these columns within rank clause instead of date/time.
What you want to retrieve you may get by UPDATE query depending on other available columns but before that I want to request you to share exactly what you want (top 500 based on col_us/col_ds or latest 500) along with your base and target table structure.

Efficiently identify all FK items with n>3 dates within any 8 week period from a SQL table?

I have a ~400,000 row table containing the dates at which a collection of ~30,000 people had appointments. Each row has the patient ID number and an appointment date. I want to efficiently select people who had at least 4 appointments in an 8 week span. Ideally, I would also flag the appointments that were within this 8 week span as I did so. I am working in a server environment that does not allow CLR aggregate functions. Is this possible to do in SQL server? If so, how?
What I've thought about:
If I could write my own aggregate function to do this via GROUP BY that would obviously be best - but I can't seem to find any way to do it with the built in aggregate functions.
I can add a column to my original table giving a date 8 weeks out from any given appointment, but can't come up with any way that doesn't involve a for loop to then ask the question row by row whether there are at least 3 other appointments within that window.
Finally, I've even though that perhaps I could just do GROUP BY but somehow create 100 new columns (as there are up to that many appointments for some patients) to create a table that contains every appointment indexed by patient, but even as a SQL newbie I'm pretty sure that as soon as I get to the point of imagining adding 100 new columns I'm going down the wrong road....
For clarity of discussion, here is some notation:
MyTable:
ApptID PatientID ApptDate (in smalldatetime)
--------------------------------------------------
Apt1 Pt1 Datetime1
Apt2 Pt1 Datetime2
Apt3 Pt2 Datetime3
... ... ...
Desired output (one option):
PatientID 4aptsIn8weeks? (Boolean) InitialApptDateForWin
Pt1 1 Datetime1
Pt2 0 NULL
Pt3 1 Datetime3
...
Desired output (another option):
ApptID PatientID ApptDate InAn8wkWindow? InitialApptDateForWin
Apt1 Pt1 Datetime1 1 Datetime1
Apt2 Pt1 Datetime2 1 Datetime1
Apt3 Pt2 Datetime3 0 NULL
... ... ...
But really, any output format that will in the end let me select patients and appointments that meet this criterion would be dandy....
Thanks for any ideas!
EDIT: Here's a slightly decompressed outline of my implementation of the selected answer below, just in case the details are helpful for anyone else (being new to SQL, it took me a couple stabs to get it working):
WITH MyTableAlias AS (
SELECT * FROM MyTable
)
SELECT MyTableAlias.PatientID, MyTable.Apptdate AS V1,
MyTableAlias.Apptdate AS V2
INTO temp1
FROM MyTable INNER JOIN MyTableAlias
ON (
MyTable.PatientID = MyTableAlia.PatientID
AND (DATEDIFF(Wk,MyTable.Apptdate,MyTableAlias.Apptdate) <=8 )
);
-- Since this gives for any given two visit dates 3 hits
-- (V1-V1, V1-V2, V2-V2), delete the ones where the second visit is being
-- selected as V1:
DELETE FROM temp1
WHERE V2<V1;
-- So far we have just selected pairs of visits within an 8 week
-- span of each other, including an entry for each visit being
-- within 8 weeks of itself, but for the rest only including the item
-- where the second visit is after the first. Now we want to look
-- for examples of first visits where there are at least 4 hits:
SELECT PatientID, V1, MAX(V2) AS lastvisitinspan, DATEDIFF(Wk,V1,MAX(V2))
AS nWeeksInSpan, COUNT(*) AS nWeeksInSpan
INTO MyOutputTable
FROM temp
GROUP BY PatientID, V1
HAVING COUNT(*)>3;
-- From here on it's just a matter of how I want to handle patients with two
-- separate V1 examples meeting criteria...
Rough outline of the query:
INNER JOIN the table ("table") with itself ("alias"), the ON clause would be:
table.patientid = alias.patientid
table.appointment_date < alias.appointment_date
datediff(table.appointment_date, alias.appointment_date) <= 8 week
Then GROUP BY table.patientid, table.appointment_date
Output table.patientid, table.appointment_date, MAX(alias.appointment_date), COUNT(*)
Add a HAVING COUNT(*) > n clause
There are some issues though:
With 400,000 rows the JOIN could produce a very large result set
It will count some date ranges twice. E.g. if there were 4 visits in 9 week period then it will return two rows (#1, #2, #3 and #2, #3, #4).

How to get three count values from same column using SQL in Access?

I have a table that has an integer column from which I am trying to get a few counts from. Basically I need four separate counts from the same column. The first value I need returned is the count of how many records have an integer value stored in this column between two values such as 213 and 9999, including the min and max values. The other three count values I need returned are just the count of records between different values of this column. I've tried doing queries like...
SELECT (SELECT Count(ID) FROM view1 WHERE ((MyIntColumn BETWEEN 213 AND 9999));)
AS Value1, (SELECT Count(ID) FROM FROM view1 WHERE ((MyIntColumn BETWEEN 500 AND 600));) AS Value2 FROM view1;
So there are for example, ten records with this column value between 213 and 9999. The result returned from this query gives me 10, but it gives me the same value of 10, 618 times which is the number of total records in the table. How would it be possible for me to only have it return one record of 10 instead?
Use the Iif() function instead of CASE WHEN
select Condition1: iif( ), condition2: iif( ), etc
P.S. : What I used to do when working with Access was have the iif() resolve to 1 or 0 and then do a SUM() to get the counts. Roundabout but it worked better with aggregation since it avoided nulls.
SELECT
COUNT(CASE
WHEN MyIntColumn >= 213 AND MyIntColumn <= 9999
THEN MyIntColumn
ELSE NULL
END) AS FirstValue
, ??? AS SecondValue
, ??? AS ThirdValue
, ??? AS FourthValue
FROM Table
This doesn't need nesting or CTE or anything. Just define via CASE your condition within COUNTs argument.
I dont really understand what You want in the second, third an fourth column. Sounds to me, its very similar to the first one.
Reformatted, your query looks like:
SELECT (
SELECT Count(ID)
FROM view1
WHERE MyIntColumn BETWEEN 213 AND 9999
) AS Value1
FROM view1;
So you are selecting a subquery expression that is not related to the outer query. For each row in view1, you calculate the number of rows in view1.
Instead, try to do the calculation once. You just have to remove your outer query:
SELECT Count(ID)
FROM view1
WHERE MyIntColumn BETWEEN 213 AND 9999;
OLEDB Connection in MS Access does not support key words CASE and WHEN .
You can only use iif() function to count two, three.. values in same columns
SELECT Attendance.StudentName, Count(IIf([Attendance]![Yes_No]='Yes',1,Null)) AS Yes, Count(IIf([Attendance]![Yes_No]='No',1,Null)) AS [No], Count(IIf([Attendance]![Yes_No]='Not',1,Null)) AS [Not], Count(IIf([Attendance]![Yes_No],1,Null)) AS Total
FROM Attendance
GROUP BY Attendance.StudentName;

SQL Server SQL Select: How do I select rows where sum of a column is within a specified multiple?

I have a process that needs to select rows from a Table (queued items) each row has a quantity column and I need to select rows where the quantities add to a specific multiple. The mulitple is the order of between around 4, 8, 10 (but could in theory be any multiple. (odd or even)
Any suggestions on how to select rows where the sum of a field is of a specified multiple?
My first thought would be to use some kind of MOD function which I believe in SQL server is the % sign. So the criteria would be something like this
WHERE MyField % 4 = 0 OR MyField % 8 = 0
It might not be that fast so another way might be to make a temp table containing say 100 values of the X times table (where X is the multiple you are looking for) and join on that