Find irregular entries with SQL

My table contains some human-error entries: some are missing a zero, some list more material than they should, and so on. I am trying to scan the table to find errors within groups of entries.
The table looks like this:
| Work Order | Product | Material Qty
---------------------------------
| 1 | Item A | 10
| 2 | Item A | 25
| 3 | Item A | 12
| 4 | Item A | 9
| 5 | Item X | 52
| 6 | Item X | 20
| 7 | Item X | 23
| 8 | Item X | 24
| 9 | Item X | 2
| 10 | Item Z | 20
| 11 | Item Z | 5
---------------------------------
The work order (WO) numbers and items are not actually sequential; I wrote them sequentially here only as an example.
As you can see, Item A should have a quantity of around 10, give or take, and Item X should be around 22, give or take. The query should also tag every Item Z entry as suspicious, since there is not enough data to correlate. So I need to isolate WO numbers 2, 5, 9, 10 and 11 for people to audit. Any idea how?
I have tried computing an average per product and using a percentage threshold to filter entries, but the percentages vary too much. In the case of Item Z, there is not enough data to tell which number is normal and which is irregular, so I need to tag both for verification, and a percentage does not help there.
Also, if I express each entry as a percentage deviation from the average, the spread is still too wide to single out any one of them.
Any ideas? Because I am really stuck this time.

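From a statistical standpoint, you probably want to start with the standard deviation function, STDEV. The query below flags a row when its quantity is more than one standard deviation away from its product's average, or when the product has fewer than three rows.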
select *
from
(
    select *,
        AVG(qty) OVER (Partition by product) av,
        STDEV(qty) OVER (Partition by product) sd,
        COUNT(*) OVER (Partition by product) c
    from yourtable
) v
where ABS(qty - av) > sd or c < 3
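To check the rule against the sample data from the question, here is a minimal Python sketch of the same logic. It assumes STDEV means the sample standard deviation (which is what `statistics.stdev` computes); the function name `suspicious_work_orders` and the `min_count` parameter are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# (work order, product, material qty) rows copied from the question
rows = [
    (1, "Item A", 10), (2, "Item A", 25), (3, "Item A", 12), (4, "Item A", 9),
    (5, "Item X", 52), (6, "Item X", 20), (7, "Item X", 23), (8, "Item X", 24),
    (9, "Item X", 2), (10, "Item Z", 20), (11, "Item Z", 5),
]

def suspicious_work_orders(rows, min_count=3):
    """Flag rows more than one sample standard deviation from the
    product average, plus every row of a product with too few entries."""
    by_product = defaultdict(list)
    for wo, product, qty in rows:
        by_product[product].append((wo, qty))

    flagged = []
    for entries in by_product.values():
        qtys = [qty for _, qty in entries]
        if len(qtys) < min_count:
            # Not enough data to judge: flag the whole group (c < 3).
            flagged.extend(wo for wo, _ in entries)
            continue
        av, sd = mean(qtys), stdev(qtys)
        flagged.extend(wo for wo, qty in entries if abs(qty - av) > sd)
    return sorted(flagged)

flagged = suspicious_work_orders(rows)
print(flagged)  # [2, 5, 9, 10, 11]
```

On this data the one-standard-deviation cutoff happens to isolate exactly the work orders the question asks for; on other data the multiplier would probably need tuning.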

Related

sql index same column two directions for traversing window functions

I'm trying to use window functions to group records that are close to each other (within the same partition) into sequential groups. There's probably a better way to solve the problem, but what I'm trying right now runs too slowly to be useful. It involves an order by on the select:
order by person_id, rollup_class, rollup_concept_id, exp_num
and another order by in the window function:
lead(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num DESC)
Because I have that last column (exp_num) ordered in opposite directions, the query takes forever. I even have two indexes on the table to handle the two directions:
create index deeIdx on results.drug_exposure_extra (person_id,rollup_class, rollup_concept_id,
exp_num);
create index deeIdx2 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num desc);
But that doesn't help. So I'm trying one that orders exp_num in both directions:
create index deeIdx3 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num, exp_num desc);
Does that even make sense? When the index finally finishes building, if it solves the problem, I'll answer my own question...
Nope.
Even with all three indexes, the query runs super fast if the two order bys (in the select and in the over clause) go the same direction, and super slow if they go opposite directions. So, at this point I guess I should explain my use case better and ask for ideas for a better approach.
I've got drug exposure records (this is for a cool open-source project http://www.ohdsi.org/, btw), and when a person has a drug exposure that begins less than N days from the end of any previous exposure, it should be combined with the earlier ones into a single 'era'. Whenever there is a gap of more than N days, a new era begins.
Over the course of composing this question, it turns out I solved it. It raises some interesting issues, though, so I'll post it and answer it below.
Like asking a doctor, "It hurts when I move my arm like this, what should I do?" the answer is obviously, "Don't move your arm like that." So -- don't try to make windowing functions proceed in a different order from the main query (or probably from each other) -- there's probably a better solution.
Early in working on this I had somehow convinced myself that it would be easier to aggregate eras relative to their ending records rather than their starting records, but that was where I went wrong.
So the expression that gives me the era number I want looks like this:
sum(case when exp_num = 1 or days_from_latest > 30 then 1 else 0 end)
over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num)
as era_num
Explanation: if it's the patient's first exposure to the drug (well, the combination of rollup_class and rollup_concept_id in this case), then that's the beginning of a drug era. It's also the beginning of a drug era if the exposure starts more than N days after the end of every earlier exposure. (This point is what makes it a little complicated: say exposure 1 starts at day 1 and is 60 days, exposure 2 starts at day 20 and is 10 days, exposure 3 starts at day 70: it's 40 days after the end of the most recent exposure, 2, which would put it in a new era, but it's only 10 days after exposure 1, which puts it in the same era with 1 and 2.) So, for each record that starts an era the case expression gives us a 1, and the rest get 0s. Then we sum that, partitioning over the same partition we used in an earlier query to establish the exp_num, and order by exp_num. I could have specified the frame explicitly by adding rows between unbounded preceding and current row, but the default frame (range between unbounded preceding and current row) gives the same running sum as long as exp_num is unique within each partition. So the era number increments only at the beginning of new eras.
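For anyone who wants to trace the era rule by hand, here is a rough Python sketch of the same idea for a single partition. It assumes days_from_latest means "start day minus the latest end day among all earlier exposures"; the function name `era_numbers`, the end days, and the fourth exposure are hypothetical values picked to match the arithmetic in the explanation above.

```python
def era_numbers(exposures, max_gap_days=30):
    """exposures: list of (start_day, end_day) tuples sorted by start_day.
    Returns one era number per exposure. A new era starts at the first
    exposure, or when an exposure begins more than max_gap_days after the
    latest end day seen so far."""
    eras = []
    era = 0
    latest_end = None
    for start, end in exposures:
        if latest_end is None or start - latest_end > max_gap_days:
            era += 1  # this record starts a new era
        eras.append(era)
        # Track the furthest end day, not just the previous row's end day.
        latest_end = end if latest_end is None else max(latest_end, end)
    return eras

# Exposure 3 starts 40 days after exposure 2 ends but only 10 days after
# exposure 1 ends, so it stays in era 1. Exposure 4 is an added example.
exposures = [(1, 60), (20, 30), (70, 80), (150, 160)]
eras = era_numbers(exposures)
print(eras)  # [1, 1, 1, 2]
```

The running `max` over end days is what keeps exposure 3 in the first era; comparing only against the previous row's end day would split it off.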
Here is a much simplified example in response to gordon-linoff's comment below.
create table junk_numbers (x int);
insert into junk_numbers values (1),(2),(3),(5),(7),(9),(10),(15),(20),(25),(26),(28),(30);
-- start a new series whenever the gap to the previous value is greater than 1
select x, gap, 1+sum(case when gap > 1 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
 x  | gap | series_num
----+-----+------------
  1 |     |          1
  2 |   1 |          1
  3 |   1 |          1
  5 |   2 |          2
  7 |   2 |          3
  9 |   2 |          4
 10 |   1 |          4
 15 |   5 |          5
 20 |   5 |          6
 25 |   5 |          7
 26 |   1 |          7
 28 |   2 |          8
 30 |   2 |          9
-- same query but bigger gaps:
select x, gap, 1+sum(case when gap > 4 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
 x  | gap | series_num
----+-----+------------
  1 |     |          1
  2 |   1 |          1
  3 |   1 |          1
  5 |   2 |          1
  7 |   2 |          1
  9 |   2 |          1
 10 |   1 |          1
 15 |   5 |          2
 20 |   5 |          3
 25 |   5 |          4
 26 |   1 |          4
 28 |   2 |          4
 30 |   2 |          4

Select the difference of two consecutive columns

I have a table car that looks like this:
| mileage | carid |
------------------
| 30 | 1 |
| 50 | 1 |
| 100 | 1 |
| 0 | 2 |
| 70 | 2 |
I would like to get the average difference between consecutive mileage values for each car. So for example, for car 1 I would like to get ((50-30)+(100-50))/2 = 35. So I created the following query:
SELECT AVG(diff),carid FROM (
SELECT (mileage-
(SELECT Max(mileage) FROM car Where mileage<mileage AND carid=carid GROUP BY carid))
AS diff,carid
FROM car GROUP BY carid)
But this doesn't work, because inside the subquery I can't refer to the current row's values from the outer query. And I'm quite clueless on how to actually solve this in a different way.
So how would I be able to obtain the value of the next row somehow?
The average difference is the maximum minus the minimum, divided by one less than the count. The consecutive differences telescope, so their sum is just the last value minus the first (you can do the arithmetic to convince yourself this is true).
Hence:
select carid,
       ( (max(mileage) - min(mileage)) / nullif(count(*) - 1, 0)) as avg_diff
from car
group by carid;
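If you'd rather not take the arithmetic on faith, the following small Python sketch compares the pairwise average with the max/min shortcut on the sample rows from the question. The function names are made up for the example, and it assumes the consecutive differences are taken in ascending mileage order.

```python
from collections import defaultdict

# (mileage, carid) rows copied from the question
car = [(30, 1), (50, 1), (100, 1), (0, 2), (70, 2)]

def avg_diff_pairwise(mileages):
    """Average of the differences between consecutive sorted values."""
    ordered = sorted(mileages)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(diffs) / len(diffs) if diffs else None

def avg_diff_shortcut(mileages):
    """(max - min) / (count - 1), the form used in the SQL answer."""
    n = len(mileages)
    return (max(mileages) - min(mileages)) / (n - 1) if n > 1 else None

by_car = defaultdict(list)
for mileage, carid in car:
    by_car[carid].append(mileage)

for carid, mileages in sorted(by_car.items()):
    print(carid, avg_diff_pairwise(mileages), avg_diff_shortcut(mileages))
# 1 35.0 35.0
# 2 70.0 70.0
```

Both functions return None for a car with a single row, mirroring the nullif guard in the query.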

PostgreSQL: Distribute rows evenly and according to frequency

I'm having trouble with a complex ordering problem. I have the following example data:
table "categories"
id | frequency
1 | 0
2 | 4
3 | 0
table "entries"
id | category_id | type
1 | 1 | a
2 | 1 | a
3 | 1 | a
4 | 2 | b
5 | 2 | c
6 | 3 | d
I want to put entries rows in an order so that category_id,
and type are distributed evenly.
More precisely, I want to order entries in a way that:
category_ids that refer to a category that has frequency=0 are
distributed evenly - so that a row is followed by a different category_id
whenever possible. e.g. category_ids of rows: 1,2,1,3,1,2.
Rows with category_ids of categories with frequency<>0 should
be inserted from ca. the beginning with a minimum of frequency rows between them
(the gaps should vary). In my example these are rows with category_id=2.
So the result could start with row id #1, then #4, then a minimum of 4 rows of other
categories, then #5.
In the end result, rows with the same type should not be next to each other.
Example result:
id | category_id | type
1 | 1 | a
4 | 2 | b
2 | 1 | a
6 | 3 | d
.. some other row ..
.. some other row ..
.. some other row ..
5 | 2 | c
entries are like a stream of things the user gets (one at a time).
The whole ordering should give users some variation. It's just there to not
present them similar entries all the time, so it doesn't have to be perfect.
The query also does not have to give the same result on each call - using
random() is totally fine.
frequencies are there to give entries of certain categories a higher
priority so that they are not distributed across the whole range, but are placed more
at the beginning of the result list. Even if there are a lot of these entries, they
should not completely crowd out the frequency=0 entries at the beginning, though.
I'm not sure how to start this. I think I can use window functions and
ntile() to distribute rows by category_id and type.
But I have no idea how to insert the non-0-category-entries afterwards.

SQL Query X Days back excluding date ranges (Confusing!)

Ok, I have a tough SQL query, and I'm not sure how to go about writing it.
I am summing the number of "bananas collected" by an employee within the last X days, but what I could really use help on is determining X.
The "last X days" value is defined to be the last 100 days that the employee was NOT out due to Purple Fever, starting from some ChosenDate (we'll say today, 6/24/14). That is to say, if the person was sick with Purple Fever for 3 days, then I want to look back over the last 103 days from ChosenDate rather than the last 100 days. Any other reason the employee may have been out does not affect our calculation.
Table PersonOutIncident
+----------------------+----------+-------------+
| PersonOutIncidentID | PersonID | ReasonOut |
+----------------------+----------+-------------+
| 1 | Sarah | PurpleFever |
| 2 | Sarah | PaperCut |
| 3 | Jon | PurpleFever |
| 4 | Sarah | PurpleFever |
+----------------------+----------+-------------+
Table PersonOutDetail
+-------------------+----------------------+-----------+-----------+
| PersonOutDetailID | PersonOutIncidentID | BeginDate | EndDate |
+-------------------+----------------------+-----------+-----------+
| 1 | 1 | 1/1/2014 | 1/3/2014 |
| 2 | 1 | 1/7/2014 | 1/13/2014 |
| 3 | 2 | 2/1/2014 | 2/3/2014 |
| 4 | 3 | 1/15/2014 | 1/20/2014 |
| 5 | 4 | 5/1/2014 | 5/15/2014 |
+-------------------+----------------------+-----------+-----------+
The tables are established. Many PersonOutDetail records can be associated with one PersonOutIncident record and there may be multiple PersonOutIncident records for a single employee (That is to say, there could be two or three PersonOutIncident records with an identical ReasonOut column, because they represent a particular incident or event and the not-necessarily-continuous days lost due to that particular incident)
The nature of this requirement complicates things, even conceptually to me.
The best I can think of is to check for a BeginDate/EndDate pair within the 100 day base period, then determine the number of days from BeginDate to EndDate and add that to the base 100 days. But then I would have to check again whether this new range overlaps or contains additional BeginDate/EndDate pairs and, if so, add those days as well. I can tell already that this isn't the method I want to use, but I can't quite wrap my mind around how to start or structure this query. Does anyone have an idea that might steer me in the correct direction? I realize this might not be clear and I apologize if I'm just confusing things.
One way to do this is to work with a table or WITH clause that contains a list of days. Let's say days is a table with one column that contains the last 200 days. (This means the query will break if the employee had more than 100 sick days in the last 200 days).
Now you can get a list of the employee's 100 most recent days that were not lost to Purple Fever like this (replace ? with the employee id):
WITH t1 AS
(
    SELECT day,
           ROW_NUMBER() OVER (ORDER BY day DESC) AS RowNumber
    FROM days d
    -- keep only days that do not fall inside a Purple Fever absence
    WHERE NOT EXISTS (SELECT * FROM PersonOutDetail pd
                      INNER JOIN PersonOutIncident po ON po.PersonOutIncidentID = pd.PersonOutIncidentID
                      WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
                      AND po.ReasonOut = 'PurpleFever'
                      AND po.PersonID = ?)
)
SELECT * FROM t1
WHERE RowNumber <= 100;
Alternatively, you can obtain the '100th day' by replacing RowNumber <= 100 with RowNumber = 100.
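To sanity-check the '100th day' against the sample data, here is a minimal Python sketch of the same walk-back. It assumes ChosenDate itself counts as day 1 and that the dates in the tables are month/day/year; the dictionary `fever_ranges` and the function `nth_non_fever_day` are hypothetical names, with only the PurpleFever rows copied from the tables above.

```python
from datetime import date, timedelta

# PurpleFever absences per person, joined by hand from
# PersonOutIncident and PersonOutDetail in the question.
fever_ranges = {
    "Sarah": [
        (date(2014, 1, 1), date(2014, 1, 3)),
        (date(2014, 1, 7), date(2014, 1, 13)),
        (date(2014, 5, 1), date(2014, 5, 15)),
    ],
    "Jon": [(date(2014, 1, 15), date(2014, 1, 20))],
}

def nth_non_fever_day(person, chosen_date, n=100):
    """Walk backwards from chosen_date and return the n-th day
    that is not inside any PurpleFever absence for this person."""
    ranges = fever_ranges.get(person, [])
    day = chosen_date
    counted = 0
    while True:
        if not any(begin <= day <= end for begin, end in ranges):
            counted += 1
            if counted == n:
                return day
        day -= timedelta(days=1)

chosen = date(2014, 6, 24)
print(nth_non_fever_day("Sarah", chosen))  # 2014-03-02
print(nth_non_fever_day("Jon", chosen))    # 2014-03-17
```

Sarah's 15 fever days in May push her window back by 15 calendar days compared with Jon's, whose only absence falls before the window starts. Unlike the fixed 200-day table, the loop simply keeps walking back until it has counted enough days.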

How to print page-wise totals at the end of every page using table

I'm trying to make a report which contains some prices on every row, and I want to print the sum of the prices printed on each page at the bottom of that page. I don't find it smart to print the grand totals on every page, for my situation at least.
Example:
First Page
Name | Price1 | Price2 | Price3 | Price4 -table header
Record 1 | 10 | 15 | 15 | 20
Record 2 | 15 | 15 | 15 | 15
Total 2 records | 25 | 30 | 30 | 35 -table footer for page 1
Second Page
Name | Price1 | Price2 | Price3 | Price4 -starting 2. page, table header
Record 3 | 20 | 30 | 30 | 30
Total 1 records | 20 | 30 | 30 | 30 -end of the table
Grand T. (3 rec) | 45 | 60 | 60 | 65 -end of the table
I put 2 records on the first page and 1 on the second page; it's just to demonstrate what I want. I did my best to make it look clear.
I can see two possible ways this could be done.
Since you say that there's a predictable number of rows per page (let's say, for ease of constructing an example, that there are 25), you could do the following:
In your query, assign each row a consecutive sequence number. You could do this like this:
WITH Cte
AS (
    SELECT Name,
           Price1,
           Price2,
           Price3,
           Price4,
           RecordNumber = Row_Number() OVER (ORDER BY [whatever])
    FROM [tables]
)
SELECT Name,
       Price1,
       [etc],
       -- Row_Number() starts at 1, so subtract 1 to put rows 1-25 in group 0
       RowGroup = Floor((RecordNumber - 1) / 25.0)
FROM Cte
ORDER BY RecordNumber
Then, in your report, group by RowGroup, and subtotal your rows.
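As a quick illustration of the grouping arithmetic, here is a small Python sketch that applies the same RowGroup idea to the three records from the question, with 2 rows per page as in the example layout. The function name `page_subtotals` is made up, and the report tool itself is not modelled; this only shows how the subtotals fall out of the row number.

```python
# (name, price1, price2, price3, price4) rows copied from the question
records = [
    ("Record 1", 10, 15, 15, 20),
    ("Record 2", 15, 15, 15, 15),
    ("Record 3", 20, 30, 30, 30),
]

def page_subtotals(records, rows_per_page):
    """Return {row_group: [subtotal per price column]}.
    row_group is (record_number - 1) // rows_per_page, so the first
    rows_per_page records land in group 0, the next ones in group 1, etc."""
    pages = {}
    for record_number, (name, *prices) in enumerate(records, start=1):
        row_group = (record_number - 1) // rows_per_page
        totals = pages.setdefault(row_group, [0] * len(prices))
        for i, price in enumerate(prices):
            totals[i] += price
    return pages

pages = page_subtotals(records, rows_per_page=2)
grand_total = [sum(column) for column in zip(*pages.values())]
print(pages)        # {0: [25, 30, 30, 35], 1: [20, 30, 30, 30]}
print(grand_total)  # [45, 60, 60, 65]
```

The per-page and grand totals match the footer rows in the example from the question.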
If you have other, higher-level groupings in the report, you will probably have to start a new page after each higher-level group, because otherwise it will throw the placement off for the page subtotals.
(You might also be able to do this with the RowNumber() function in SSRS without modifying the underlying query, but there are a lot of rules about what functions you can use when, and I don't know if that would be allowed.)
The other option I can think of is more complicated and involves using the report Code section and setting some module-level functions. Let me know if you think that's something you need to see.