This question is posted following a suggestion in this thread.
I'm using SQLite/Database browser and my data looks like this:
data.csv
company year value
A 2000 15
A 2001 12
A 2002 20
B 2000 25
B 2001 20
B 2002 10
C 2000 18
C 2001 14
C 2002 22
etc..............
I want to get all companies that have a value <= 20 for every year in the data set. With the data above, the query should return:
result.csv
company year value
A 2000 15
A 2001 12
A 2002 20
This excludes company C (value > 20 in 2002) and company B (value > 20 in 2000).
You want all companies whose maximum value is no larger than 20:
SELECT *
FROM Data
WHERE company IN (SELECT company
FROM Data
GROUP BY company
HAVING max(value) <= 20)
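As a quick check of the query above, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's `sqlite3` module and runs the same `HAVING max(value) <= 20` filter. The table layout (`company`, `year`, `value`) is taken from the question; the column types are an assumption.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (company TEXT, year INTEGER, value INTEGER)")
rows = [
    ("A", 2000, 15), ("A", 2001, 12), ("A", 2002, 20),
    ("B", 2000, 25), ("B", 2001, 20), ("B", 2002, 10),
    ("C", 2000, 18), ("C", 2001, 14), ("C", 2002, 22),
]
conn.executemany("INSERT INTO Data VALUES (?, ?, ?)", rows)

# Keep only companies whose largest value never exceeds 20.
query = """
SELECT company, year, value
FROM Data
WHERE company IN (SELECT company
                  FROM Data
                  GROUP BY company
                  HAVING max(value) <= 20)
ORDER BY company, year
"""
result = conn.execute(query).fetchall()
print(result)
```

Only the three rows of company A should come back, since B and C each exceed 20 in one year.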
Not sure if there are better solutions, but I think this will work:
select company
, sum(case when value <= 20 then 1 else 0 end) s
, count(*) c
from data
where year in (2000, 2001, 2002)
group
by company
having s = c
It checks whether the number of years in which the value is at most 20 equals the total number of rows for the company.
I have a Postgres query that is supposed to calculate an average over a set of values. This set of values should be based on distinct IDs.
The query is the following:
#{context.answers_base}
SELECT
stores.name as store_name,
answers_base.question_name as question_name,
answers_base.question_id as question_id,
(sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)) as score, -- <--- this line is calculated incorrectly
sum(answers_base.answer_value) as score_sum,
count(answers_base.answer_id) as question_answer_count,
count(DISTINCT answers_base.answer_id) as answer_count
FROM answers_base
INNER JOIN stores ON stores.id = answers_base.store_id
WHERE answers_base.answer_value IS NOT NULL AND answers_base.question_type_id = :question_type_id
AND answers_base.scale = TRUE
#{context.filter_answers}
GROUP BY stores.name, answers_base.question_name, answers_base.question_id, answers_base.sort_order
ORDER BY stores.name, answers_base.sort_order
The problem is that on the indicated line, (sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)), some values are counted more than once.
Part of the solution is making it DISTINCT based on ID, like so:
(sum(answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
This divides by the right number, but the sum being divided is still wrong.
Making sum() DISTINCT as well (see below) does not work, because the values are not unique. Each value is 0, 25, 50, 75, or 100, so different IDs can hold the same value.
(sum(DISTINCT answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
How would I go about making this work?
Here are simplified versions of the table structures.
Table Answer
ID | answer_date
1 | Feb 01, 2022
2 | Mar 02, 2022
3 | Mar 13, 2022
4 | Mar 21, 2022
Table AnswerRow
ID | answer_id | answer_value
1 | 1 | 25
2 | 1 | 50
3 | 1 | 50
4 | 2 | 75
5 | 2 | 100
6 | 2 | 0
7 | 3 | 25
8 | 4 | 25
9 | 4 | 100
10 | 4 | 50
Answer 1's answer_rows:
25 + 50 + 50 -> average = 125 / 3
Answer 2's answer_rows:
75 + 100 + 0 -> average = 175 / 3
Answer 3's answer_rows:
25 -> average = 25 / 1
Answer 4's answer_rows:
25 + 100 + 50 -> average = 175 / 3
For some reason, we get duplicate answer_rows in the calculation.
Example of the problem; for answer_id=1 we have the following answer_rows in the calculation, giving us a different average:
ID | answer_id | answer_value
1 | 1 | 25
2 | 1 | 50
3 | 1 | 50
3 | 1 | 50
3 | 1 | 50
3 | 1 | 50
Result: 25 + 50 + 50 + 50 + 50 + 50 -> 275 / 6
Desired result: 25 + 50 + 50 -> 125 / 3
Making answer_row_id distinct (see beginning of post) makes it possible for me to get:
25 + 50 + 50 + **50 + 50 + 50** -> 275 / **3**
But not:
25 + 50 + 50 -> 125 / 3
What I would like is a calculation that selects each answer_row once, based on its ID, and uses those same answer_rows for both the sum x and the count y in the average x / y.
answers_base is the following (simplified):
WITH answers_base as (
SELECT
answers.id as answer_id,
answers.store_id as store_id,
answer_rows.id as answer_row_id,
question_options.answer_value as answer_value
FROM answers
INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
INNER JOIN stores ON stores.id = answers.store_id
WHERE answers.status = 0
)
I think this would be best solved with a window function. Something along the lines of
SELECT *
FROM (
SELECT
...,
ROW_NUMBER() OVER (PARTITION BY answer_rows.id ORDER BY answer_rows.created_at DESC) AS duplicate_answers
FROM ...
) AS numbered
WHERE numbered.duplicate_answers = 1
The row number has to be computed in a subquery, because the result of a window function cannot be referenced in the WHERE clause of the same SELECT.
This would filter out multiple rows with the same id and keep only one entry. (I chose the most recent by created_at, but you could change this to whatever logic suits you best.)
A benefit to this approach is that it makes the rationale behind the logic clear, contained and re-usable.
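To illustrate the deduplication idea, here is a minimal sketch using Python's `sqlite3` module (it needs SQLite 3.25 or later for window functions). The table `joined_rows` is hypothetical: it stands in for the duplicated join output shown in the question for answer_id = 1. Because the duplicates here are identical, the window is ordered by `answer_row_id` rather than `created_at`, which is an assumption made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the joined result that contains duplicate rows.
conn.execute(
    "CREATE TABLE joined_rows (answer_row_id INTEGER, answer_id INTEGER, answer_value INTEGER)"
)
duplicated = [(1, 1, 25), (2, 1, 50), (3, 1, 50), (3, 1, 50), (3, 1, 50), (3, 1, 50)]
conn.executemany("INSERT INTO joined_rows VALUES (?, ?, ?)", duplicated)

# Number the copies of each answer_row_id, keep copy 1, then aggregate.
query = """
SELECT answer_id,
       SUM(answer_value)                  AS score_sum,
       COUNT(*)                           AS row_count,
       SUM(answer_value) * 1.0 / COUNT(*) AS score
FROM (SELECT answer_row_id, answer_id, answer_value,
             ROW_NUMBER() OVER (PARTITION BY answer_row_id
                                ORDER BY answer_row_id) AS rn
      FROM joined_rows) AS numbered
WHERE rn = 1
GROUP BY answer_id
"""
answer_id, score_sum, row_count, score = conn.execute(query).fetchone()
print(answer_id, score_sum, row_count, score)
```

With the duplicates removed before aggregation, both the sum and the count are taken over the same three rows, giving 125 / 3 as desired.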
I have a table that looks like this:
firm id year profit
1 2000 10
1 2001 20
1 2002 15
2 1999 40
2 2000 55
2 2001 35
2 2002 65
3 2001 5
3 2002 20
3 2003 10
And I want to estimate the variance of each firm's profits over the past years.
Assuming that your table is already sorted by "Firm ID" and by "Year", and that there's only one entry for each year, you can loop through each firm:
profitVar = [];
ids = unique(yourTable.firmId); % id of each firm
% Loop through the firms
for i = 1:length(ids)
    id = ids(i);
    subData = find(yourTable.firmId == id); % get the data from the given firm only
    % Loop through the years
    for j = 1:length(subData)
        profitVar = [profitVar; var(yourTable.profit(subData(1:j-1)))];
    end
end
yourTable = addvars(yourTable, profitVar);
Note that this returns NaN only for the first year. For the second year it returns 0 rather than NaN, because it computes the variance of a single previous value, which is always zero. If this is a problem, you could just insert an exception in the inner loop, something like if (j == 2) profitVar = [profitVar; NaN];.
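The same running calculation can be sketched in Python with the standard `statistics` module. This is not a translation of the MATLAB code: the function name `past_profit_variance` is hypothetical, and unlike MATLAB's `var`, it returns NaN whenever fewer than two earlier years exist (so both the first and the second year are NaN, as in the exception suggested above).

```python
import math
from collections import defaultdict
from statistics import variance

# (firm id, year, profit) rows from the question.
records = [
    (1, 2000, 10), (1, 2001, 20), (1, 2002, 15),
    (2, 1999, 40), (2, 2000, 55), (2, 2001, 35), (2, 2002, 65),
    (3, 2001, 5), (3, 2002, 20), (3, 2003, 10),
]

def past_profit_variance(rows):
    """Return (firm, year, sample variance of that firm's earlier profits)."""
    history = defaultdict(list)  # firm id -> profits seen so far
    out = []
    for firm_id, year, profit in sorted(rows):  # sorted by firm, then year
        past = history[firm_id]
        # The sample variance needs at least two earlier observations.
        value = variance(past) if len(past) >= 2 else math.nan
        out.append((firm_id, year, value))
        past.append(profit)
    return out

results = past_profit_variance(records)
for row in results:
    print(row)
```

For firm 1 in 2002, for example, the earlier profits are 10 and 20, so the sample variance is 50.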
I am using SQL Server Management Studio. My data set is as below.
ID Days Value Threshold
A 1 10 30
A 2 20 30
A 3 34 30
A 4 25 30
A 5 20 30
B 1 5 15
B 2 10 15
B 3 12 15
B 4 17 15
B 5 20 15
I want to run a query so only rows after the threshold has been reached are selected for each ID. Also, I want to create a new days column starting at 1 from where the rows are selected. The expected output for the above dataset will look like
ID Days Value Threshold NewDayColumn
A 3 34 30 1
A 4 25 30 2
A 5 20 30 3
B 4 17 15 1
B 5 20 15 2
It doesn't matter if the value drops below the threshold in later rows; I want to number the first row where the threshold is crossed as 1 and keep counting rows for that ID.
Thank you!
You can use window functions for this. Here is one method:
select t.*, row_number() over (partition by id order by days) as newDayColumn
from (select t.*,
min(case when value > threshold then days end) over (partition by id) as threshold_days
from t
) t
where days >= threshold_days;
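The query above can be tried against the sample data with a minimal sketch like the following, which uses Python's `sqlite3` module (SQLite 3.25 or later) instead of SQL Server. The table name `t` comes from the answer; the subquery alias `flagged` and the explicit column list are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, days INTEGER, value INTEGER, threshold INTEGER)")
sample = [
    ("A", 1, 10, 30), ("A", 2, 20, 30), ("A", 3, 34, 30), ("A", 4, 25, 30), ("A", 5, 20, 30),
    ("B", 1, 5, 15), ("B", 2, 10, 15), ("B", 3, 12, 15), ("B", 4, 17, 15), ("B", 5, 20, 15),
]
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", sample)

# Inner query: first day on which the value exceeds the threshold, per id.
# Outer query: keep rows from that day onward and renumber them from 1.
query = """
SELECT id, days, value, threshold,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY days) AS newDayColumn
FROM (SELECT id, days, value, threshold,
             MIN(CASE WHEN value > threshold THEN days END)
                 OVER (PARTITION BY id) AS threshold_days
      FROM t) AS flagged
WHERE days >= threshold_days
ORDER BY id, days
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The renumbering works because the WHERE filter is applied before the outer ROW_NUMBER() is evaluated, so counting starts at the first row that crossed the threshold.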
When there is no exact match for a column value, I would like to retrieve the pay for the closest lower value.
For instance, 10 yearsOfService should return the value 650.00, and 14 yearsOfService should return the value 840.00, in the incentive table below:
ID Pay yearsOfService
1 125.00 0
2 156.00 2
3 188.00 3
4 206.00 4
5 650.00 6
6 840.00 14
7 585.00 22
8 495.00 23
9 385.00 24
10 250.00 25
I have tried several different approaches; including:
SELECT TOP 1 (pay) as incentivePay
FROM incentive
WHERE yearsOfService = '10'
This works but only for yearsOfService that match.
With 10 yearsOfService:
RESULTSET = [1 650.00]
Any ideas?
Please try:
SELECT TOP 1 (pay) as incentivePay
FROM incentive
WHERE yearsOfService <= 10
ORDER BY yearsOfService desc
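Here is a minimal sketch of the same lookup run through Python's `sqlite3` module. SQLite has no `TOP`, so this example swaps in `LIMIT 1`, which plays the same role; the helper name `incentive_pay` is hypothetical, and the column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incentive (id INTEGER, pay REAL, yearsOfService INTEGER)")
table = [
    (1, 125.00, 0), (2, 156.00, 2), (3, 188.00, 3), (4, 206.00, 4), (5, 650.00, 6),
    (6, 840.00, 14), (7, 585.00, 22), (8, 495.00, 23), (9, 385.00, 24), (10, 250.00, 25),
]
conn.executemany("INSERT INTO incentive VALUES (?, ?, ?)", table)

def incentive_pay(connection, years):
    """Pay for the largest yearsOfService that does not exceed `years`, or None."""
    row = connection.execute(
        """
        SELECT pay
        FROM incentive
        WHERE yearsOfService <= ?
        ORDER BY yearsOfService DESC
        LIMIT 1
        """,
        (years,),
    ).fetchone()
    return row[0] if row else None

print(incentive_pay(conn, 10))
print(incentive_pay(conn, 14))
```

Sorting the qualifying rows in descending order and taking the first one is what selects the closest lower (or equal) yearsOfService.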
I have 2 sets.
The first one is big (~1,000k rows). It contains patient observation data grouped by observation year, from, let's say, 2000 to 2005. Some patients in this set have observations for every year in the sequence, and some have observations only for, say, 2002-2003.
The second set contains only the sequence of years from 2000 to 2005: 6 rows.
What I want is a table with the data from set 1 for each patient, extended so that each patient has a row for every year in set 2. If set 1 has no observation for a particular year, an empty row should be added, with the data column left empty (or, better, set to "-").
For example set 1 could be:
patient_id | obs_year | data
a 2000 10
a 2001 12
a 2002 13
a 2003 9
a 2004 1
a 2005 6
bb 2002 100
bb 2003 110
Set 2 is like:
year |
2000
2001
2002
2003
2004
2005
So what I want in result ideally would be like this:
patient_id | obs_year | data
a 2000 10
a 2001 12
a 2002 13
a 2003 9
a 2004 1
a 2005 6
bb 2000 -
bb 2001 -
bb 2002 100
bb 2003 110
bb 2004 -
bb 2005 -
I should also mention that I do this work in SAS, so SQL queries or SAS scripts (or both) are welcome.
Dedup the patient_id values from set 1 in a sort. Merge these onto set 2 to pair every patient_id with every year, then merge the result back onto set 1 by patient_id and year to get your output. Wherever a patient_id and year combination has no match, the data will be blank, as in your desired output.
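The dedup-and-merge steps described above map onto a cross join followed by a left join in SQL. Here is a minimal sketch using Python's `sqlite3` module rather than SAS; the table names `obs` and `years` are hypothetical stand-ins for set 1 and set 2, and missing values are shown as "-" as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (patient_id TEXT, obs_year INTEGER, data INTEGER)")
conn.execute("CREATE TABLE years (year INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", [
    ("a", 2000, 10), ("a", 2001, 12), ("a", 2002, 13),
    ("a", 2003, 9), ("a", 2004, 1), ("a", 2005, 6),
    ("bb", 2002, 100), ("bb", 2003, 110),
])
conn.executemany("INSERT INTO years VALUES (?)", [(y,) for y in range(2000, 2006)])

# 1. Distinct patients, 2. paired with every year, 3. matched back to the data.
query = """
SELECT p.patient_id,
       y.year AS obs_year,
       COALESCE(CAST(o.data AS TEXT), '-') AS data
FROM (SELECT DISTINCT patient_id FROM obs) AS p
CROSS JOIN years AS y
LEFT JOIN obs AS o
       ON o.patient_id = p.patient_id AND o.obs_year = y.year
ORDER BY p.patient_id, y.year
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The data column is cast to text here only so that "-" can sit alongside the numbers; keeping it numeric with NULL for missing years would usually be preferable for further processing.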
Another option is PROC FREQ with sparse, which produces a line for every possible combination whether they appear or not. This works if you don't have any legitimate zeroes in the data; if you do and care that they're different from missing, this won't work.
proc freq data=have noprint;
weight data;
tables patient_id*obs_year/missing sparse out=want(rename=count=data keep=count patient_id obs_year);
run;
Then you need to convert 0 back to missing, if you care about the difference (presumably in the next step, if there is one).
A similar approach that is closer to the desired results is proc tabulate with printmiss, which works similarly to sparse:
proc tabulate data=have out=want(keep=patient_id obs_year data_sum rename=data_sum=data);
class patient_id obs_year;
var data;
tables patient_id,obs_year*data*sum='data'/printmiss misstext='.';
run;
That actually does get you missing values properly.