I work for a small company dealing with herbal ingredients. We regularly measure the effectiveness of our products based on the "product mix" (how much of ingredients A, B and C each contains). I have a table with thousands of rows, like the following:
PRODUCT   Ingredient A   Ingredient B   Ingredient C   EFFECTIVENESS
A         28             94             550            4,1
B         50             105            400            4,3
C         30             104            312            3,5
...       ...            ...            ...            ...
What I want as a result is the table below. I have been using Excel for the last several years, but it is difficult to handle millions of rows, so I would now like to do something similar in SQL. I made several attempts with PIVOT and subqueries but did not manage to get the result I needed.
In particular, the first three columns contain various ranges/criteria. The 'Average effectiveness' column holds the average effectiveness of the 'total products' that meet those criteria. Because the ranges number in the hundreds (for Ingredient A alone I have more than 100 different ranges, and similarly for Ingredients B and C), I would like a way to generate all combinations of the A, B, C ranges automatically.
Ingr. A Ingr. B Ingr. C Total products Average Effectiveness
1-10 50-60 90-110 ??? ???
1-10 50-60 110-130 ??? ??
1-10 50-60 130-150 ???? ??
1-10 60-70 150-170 ??? ??
10-20 60-70 90-110 ??? ??
10-20 60-70 110-130 ??? ??
10-20 60-70 130-150 ?? ??
Etc etc
I'm unable to give a more specific answer, but I think what you need to do is:
Use the CUBE to get all of the combinations and to aggregate the SUM and AVG values
Summarizing Data Using CUBE
The CUBE query will take its data from a nested query that has your data stored by the range of a value rather than the actual value. You can refer to SQL's CASE expression for more information on transforming the data so that it stores the range of a value rather than the value.
So, in other words, first you transform your data so that you're storing which range each value falls in. Then you summarize that transformed data using the CUBE to get all the combinations. The CUBE summary is the outer query and the range transformation is the inner query.
Here is a very rough idea of what the query might look like, just to give you an idea:
Select Ingr_A, Ingr_B, Ingr_C, COUNT(*), AVG(Effectiveness)
FROM
    (SELECT
        Product,
        Effectiveness,
        "Ingr_A" =
            CASE
                WHEN Ingredient_A >= 10 and Ingredient_A < 20 THEN '[10, 20)'
                WHEN Ingredient_A >= 20 and Ingredient_A < 30 THEN '[20, 30)'
                ...
            END,
        "Ingr_B" =
            CASE
                (like above)
            END,
        "Ingr_C" =
            CASE
                (etc.)
            END
    FROM ProductsTable) AS Ranges
GROUP BY Ingr_A, Ingr_B, Ingr_C WITH CUBE
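Since hundreds of CASE branches are tedious to write, the bucket's lower bound can also be computed arithmetically. A minimal runnable sketch (assumed table/column names, fixed-width buckets of 10 for A and B and 20 for C; shown here with SQLite via Python, which lacks WITH CUBE, but on SQL Server the same bucket expressions combine with GROUP BY ... WITH CUBE):

```python
import sqlite3

# Hypothetical ProductsTable matching the question's shape.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ProductsTable (
    Product TEXT, Ingredient_A REAL, Ingredient_B REAL,
    Ingredient_C REAL, Effectiveness REAL)""")
con.executemany("INSERT INTO ProductsTable VALUES (?, ?, ?, ?, ?)", [
    ("A", 28, 94, 550, 4.1),
    ("B", 50, 105, 400, 4.3),
    ("C", 30, 104, 312, 3.5),
])

# The bucket's lower bound is computed arithmetically instead of with
# hundreds of CASE branches.  SQLite has no WITH CUBE, so this produces
# only the full (A, B, C) combinations.
rows = con.execute("""
    SELECT CAST(Ingredient_A / 10 AS INT) * 10 AS Ingr_A_bucket,
           CAST(Ingredient_B / 10 AS INT) * 10 AS Ingr_B_bucket,
           CAST(Ingredient_C / 20 AS INT) * 20 AS Ingr_C_bucket,
           COUNT(*)           AS total_products,
           AVG(Effectiveness) AS avg_effectiveness
    FROM ProductsTable
    GROUP BY Ingr_A_bucket, Ingr_B_bucket, Ingr_C_bucket
""").fetchall()
for r in rows:
    print(r)
```

Each row then reads as "products with A in [20, 30), B in [90, 100), ..."; formatting the bucket as a '[lo, hi)' label is a cosmetic step on top.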
SELECT distinct
A.PROPOLN, C.LIFCLNTNO, A.PROSASORG, sum (A.PROSASORG) as sum
FROM [FPRODUCTPF] A
join [FNBREQCPF] B on (B.IQCPLN=A.PROPOLN)
join [FLIFERATPF] C on (C.LIFPOLN=A.PROPOLN and C.LIFPRDCNT=A.PROPRDCNT and C.LIFBNFCNT=A.PROBNFCNT)
where C.LIFCLNTNO='2012042830507' and A.PROSASORG>0 and A.PROPRDSTS='10' and
A.PRORECSTS='1' and A.PROBNFLVL='M' and B.IQCODE='B10000' and B.IQAPDAT>20180101
group by C.LIFCLNTNO, A.PROPOLN, A.PROSASORG
This does not sum correctly, it returns two lines instead of one:
PROPOLN LIFCLNTNO PROSASORG sum
1 209814572 2012042830507 3881236 147486968
2 209814572 2012042830507 15461074 463832220
You are seeing two rows because A.PROSASORG has two different values for the "C.LIFCLNTNO, A.PROPOLN" grouping.
i.e.
C.LIFCLNTNO, A.PROPOLN, A.PROSASORG together give you two unique rows.
If you want a single row for C.LIFCLNTNO, A.PROPOLN, then you may want to use an aggregate on A.PROSASORG as well.
Your entire query is being filtered on your "C" table by the one LifClntNo,
so you can leave that out of your group by and just have it as a MAX() value
in your select since it will always be the same value.
As for summing the PROSASORG column (per the comment on the other answer): just sum it. Your column names do not make their purpose clear, so I don't know if it is just a number, a quantity, or something else. You might want to pull that column out of your query completely if you are basing the query on a single product id.
For performance, I would suggest the following indexes:
Table Index
FPRODUCTPF ( PROPRDSTS, PRORECSTS, PROBNFLVL, PROPOLN )
FNBREQCPF ( IQCODE, IQCPLN, IQAPDAT )
FLIFERATPF ( LIFPOLN, LIFPRDCNT, LIFBNFCNT, LIFCLNTNO )
I have rewritten your query to put the corresponding JOIN conditions with the tables they are based on, rather than all in the WHERE clause.
SELECT
P.PROPOLN,
max( L.LIFCLNTNO ) LIFCLNTNO,
sum (P.PROSASORG) as sum
FROM
[FPRODUCTPF] P
join [FNBREQCPF] N
on N.IQCODE = 'B10000'
and P.PROPOLN = N.IQCPLN
and N.IQAPDAT > 20180101
join [FLIFERATPF] L
on L.LIFCLNTNO='2012042830507'
and P.PROPOLN = L.LIFPOLN
and P.PROPRDCNT = L.LIFPRDCNT
and P.PROBNFCNT = L.LIFBNFCNT
where
P.PROPRDSTS = '10'
and P.PRORECSTS = '1'
and P.PROBNFLVL = 'M'
and P.PROSASORG > 0
group by
P.PROPOLN
Now, one additional issue you will PROBABLY run into. You are doing a query with multiple joins, and it appears that there may be multiple records in EACH of your FNBREQCPF and FLIFERATPF tables for the same FPRODUCTPF entry. If so, you will get a Cartesian result, as the PROSASORG value will be counted once for each combination of matching rows in the two other tables.
Ex: FProductPF has ID = X with a Prosasorg value of 3
FNBreQCPF has matching records of Y1 and Y2
FLIFERATPF has matching records of Z1, Z2 and Z3.
So now your total will be equal to 3 times 6 = 18.
If you look at the combinations (Y1:Z1, Y1:Z2, Y1:Z3 and Y2:Z1, Y2:Z2, Y2:Z3), you get 6 qualifying combinations, times the original value of 3, thus bloating your numbers -- IF such multiple records exist in each respective table. Now imagine your tables have 30 and 40 matching instances respectively: you have just bloated your totals by 1200 times.
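The bloat described above is easy to reproduce. A minimal sketch with hypothetical one-column tables (not the question's schema), including one common fix, testing for matches with EXISTS instead of joining:

```python
import sqlite3

# Minimal sketch of the join fan-out problem: one product row with value 3,
# joined to 2 and 3 matching rows in two other tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE p (id INT, val INT);
    CREATE TABLE n (pid INT);
    CREATE TABLE l (pid INT);
    INSERT INTO p VALUES (1, 3);
    INSERT INTO n VALUES (1), (1);       -- 2 matching rows
    INSERT INTO l VALUES (1), (1), (1);  -- 3 matching rows
""")

# Joining both tables multiplies the row: SUM sees val 2 * 3 = 6 times.
bloated = con.execute("""
    SELECT SUM(p.val) FROM p
    JOIN n ON n.pid = p.id
    JOIN l ON l.pid = p.id
""").fetchone()[0]
print(bloated)  # 18, not 3

# One common fix: test for matches with EXISTS instead of joining.
correct = con.execute("""
    SELECT SUM(p.val) FROM p
    WHERE EXISTS (SELECT 1 FROM n WHERE n.pid = p.id)
      AND EXISTS (SELECT 1 FROM l WHERE l.pid = p.id)
""").fetchone()[0]
print(correct)  # 3
```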
I am trying to come up with some arithmetic calculations for some survey data. I want to do these calculations for a number of segments and want to figure out how to do it without writing numerous SELECT statements.
This is what I have so far:
FACT table. This table holds survey data at a respondent level - for example, if a survey had 10 questions, this table will have 11 columns: a column to identify the respondent_ID and 10 columns for the responses to those questions.
DIMENSION table. This table holds the segments we want to view the survey data by, at a respondent level - for example, if we want to view survey responses by membership_status and age_bracket, this table will have 3 columns: a column to identify the respondent_ID, and two columns to identify membership_status and age_bracket.
OUTPUT.
I want aggregate calculations that summarize the responses to the survey overall and to each question. I also want to be able to get this information for every segment that exists in the DIMENSION table.
I can do the query below, however I'll need to do this for every segment:
SELECT
COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IN ('8', '9', '10') THEN f.RESPONDENT_ID END))*1.0 / COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IS NOT NULL THEN f.RESPONDENT_ID END))*1.0 AS CSAT_1
FROM FACT f
JOIN DIMENSION d ON f.RESPONDENT_ID = d.RESPONDENT_ID
WHERE d.MEMBERSHIP_STATUS = 'ACTIVE'
The calculation above gives us something called a top 3 box. That is just one calculation; I will need to do many of them. Additionally, every calculation will need to be done for each segment. To get a calculation for nonactive members, I would need to run another query with d.MEMBERSHIP_STATUS = 'INACTIVE', and another query with no filter to get the overall calculation.
Is there a way I could store all the arithmetic calculations needed for my output as a function (maybe in a temp table or something)? My thought is that it would be better to define the calculations somewhere once, and then, when I need the output, call the function to run all of them and give me every calculation for every segment I have.
I can't fully envision how to get there, or if this is even a good solution, so guidance and detailed SQL code would be extremely helpful. Examples please!
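One common approach, sketched here with a hypothetical mini-schema (not from the thread), is to GROUP BY the segment columns so every segment's metric is computed in a single pass instead of one query per WHERE filter:

```python
import sqlite3

# Hypothetical mini-schema: one question, one segment column.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (respondent_id INT, question_1 INT);
    CREATE TABLE dimension (respondent_id INT, membership_status TEXT);
    INSERT INTO fact VALUES (1, 9), (2, 4), (3, 10), (4, NULL);
    INSERT INTO dimension VALUES
        (1, 'ACTIVE'), (2, 'ACTIVE'), (3, 'INACTIVE'), (4, 'INACTIVE');
""")

# Top-3-box metric per segment in one query: GROUP BY replaces the
# per-segment WHERE filters.
rows = con.execute("""
    SELECT d.membership_status,
           COUNT(DISTINCT CASE WHEN f.question_1 IN (8, 9, 10)
                               THEN f.respondent_id END) * 1.0
         / COUNT(DISTINCT CASE WHEN f.question_1 IS NOT NULL
                               THEN f.respondent_id END) AS csat_1
    FROM fact f
    JOIN dimension d ON f.respondent_id = d.respondent_id
    GROUP BY d.membership_status
""").fetchall()
print(rows)
```

The overall (unfiltered) figure can be appended with a UNION ALL; on engines that support them, GROUPING SETS or ROLLUP can produce the per-segment and overall rows in one statement.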
I have data on five experiments, which vary slightly, and I would like to find a way of standardizing results between experiments to develop a “standard result.” This will be a value I can multiply each result by to make them comparable.
The way I have gone about this relies on there being individuals in more than one experimental group (assuming that being in more than one group doesn't affect results).
I am assuming that the results of individuals in more than one experiment differ only because of the slight differences between the experiments. So, by calculating a conversion factor for each experiment that makes the averages of those shared individuals' results the same, I can translate results into a standard result.
The problem I am experiencing is at the end of the process most of my conversion factors are less than 1. However, I was expecting to get roughly an even number of values greater than 1 and less than 1 as I am sort of averaging out the results.
My procedure and MS SQL code are below:
Get data by experiment and by individual, excluding all individuals whose data appears in only one experiment.
SELECT ExperimentID,
       IndividualID
INTO Z1_IndivByExperiment
FROM Results
GROUP BY ExperimentID,
         IndividualID
SELECT IndividualID,
COUNT(ExperimentID) AS ExperimentCount
INTO Z2_MultiExperIndiv
FROM Z1_IndivByExperiment
GROUP BY IndividualID
HAVING COUNT(ExperimentID) > 1
ORDER BY ExperimentCount DESC
Create two tables:
i) Summarize results by individual and experiment (an individual can have multiple results per experiment).
ii) Summarize results by individual.
SELECT ExperimentID,
IndividualID,
SUM(Results.Result) AS Result_sum,
SUM(Results.ResultsCount) AS ResultCount_sum
INTO Z3_MultiExperIndiv_Results
FROM Results INNER JOIN
Z2_MultiExperIndiv ON Results.IndividualID =
Z2_MultiExperIndiv.IndividualID
GROUP BY ExperimentID,
IndividualID
SELECT 'Standard' AS Experiment,
       IndividualID,
       SUM(Result_sum) AS ResultIndiv_sum,
       SUM(ResultCount_sum) AS ResultCountIndiv_sum
INTO Z4_MultiExperIndiv_Stand
FROM Z3_MultiExperIndiv_Results
GROUP BY IndividualID
Link the two tables created in step 2 by individual, summing results from table i) and table ii) by experiment. I am hoping this provides two sets of results per experiment: results for individuals on the experiment in question, and those same individuals' results from the other experiments.
SELECT Z4_MultiExperIndiv_Stand.Experiment AS ExperimentID1,
       Z3_MultiExperIndiv_Results.ExperimentID AS ExperimentID2,
       SUM(Z4_MultiExperIndiv_Stand.ResultIndiv_sum -
           Z3_MultiExperIndiv_Results.Result_sum) AS Results1,
       SUM(Z4_MultiExperIndiv_Stand.ResultCountIndiv_sum -
           Z3_MultiExperIndiv_Results.ResultCount_sum) AS ResultCount1,
       SUM(Z3_MultiExperIndiv_Results.Result_sum) AS Results2,
       SUM(Z3_MultiExperIndiv_Results.ResultCount_sum) AS ResultCount2
INTO Z5_StandardConversion_data
FROM Z3_MultiExperIndiv_Results INNER JOIN
     Z4_MultiExperIndiv_Stand ON Z3_MultiExperIndiv_Results.IndividualID =
                                 Z4_MultiExperIndiv_Stand.IndividualID
GROUP BY Z4_MultiExperIndiv_Stand.Experiment,
         Z3_MultiExperIndiv_Results.ExperimentID
Then, for each set of results, I divide the result sum by the number of results, and divide one average by the other to get my conversion-to-standard number for each experiment.
SELECT ExperimentID1,
ExperimentID2,
Results1,
ResultCount1,
Results2,
ResultCount2,
Results1 / ResultCount1 AS Result1_avg,
Results2 / ResultCount2 AS Result2_avg,
(ResultCount2 * Results1) / (ResultCount1 * Results2) AS Conversion
FROM Z5_StandardConversion_data
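As a sanity check on the final formula (illustrative numbers, not real data): the Conversion expression is algebraically just the ratio of the two averages.

```python
# The conversion formula from the final query,
#   (ResultCount2 * Results1) / (ResultCount1 * Results2),
# equals avg1 / avg2 exactly (illustrative numbers, not real data).
results1, count1 = 120.0, 30   # other-experiment results for shared individuals
results2, count2 = 100.0, 20   # this experiment's results for the same people

avg1 = results1 / count1       # 4.0
avg2 = results2 / count2       # 5.0
conversion = (count2 * results1) / (count1 * results2)

assert abs(conversion - avg1 / avg2) < 1e-12
print(conversion)  # 0.8
```

A factor below 1 therefore just means the experiment's own average (Results2 / ResultCount2) is higher than the same individuals' average elsewhere; if that happens for almost every experiment, it is worth re-checking the grouping and subtraction steps above.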
I am looking for a way to search for a certain number of rows as a quality check. For example, we have tables that have a certain set of results that are needed.
Here is a quick table for an example:
ID           Name   Result   Reportable
ONE          A      10       X
TWO          B      12       X
THREE        C      1
FOUR         D      18       X
FOUR(redo)   D      11       X
So we are looking to double-check results, as there are people who accidentally report results multiple times (as in the case of ID FOUR). We have used HAVING counts, but we need the numbers to be specific and need a query to verify that the expected number is satisfied.
In the table above we only want IDs ONE, TWO, and FOUR, but we have 4 results (one extra). Currently our check shows the count needed (i.e. 3) and the current result count (4) to expose the mismatch, but we want a query that easily shows only the results needed. We usually need the redo result, so we take the one with the latest date, but that doesn't help filter how many rows or results we get. I apologize if anything is confusing; I am not able to share the SQL query we currently have. It's my first time posting, so if I need to clarify anything please let me know, as this seems to be very complicated. Thank you for your time.
EDIT: The details
We have one table (Table A) letting us know which results are reportable. The ones that are reportable go into another table (Table B). We have had issues in which people have marked too many results reportable, which overpopulates Table B. Our old query counted rows in Table B, but because people placed multiple reportables by mistake, samples with many redos appeared to be finished, since all of them were placed and met the count in Table B.
So now, using Table A, which tells us how many results are reportable, we want to double-check that the samples are indeed ready.
As I understand the question, you want ids that have multiple reportables. Assuming you really mean name, then:
select name
from t
where reportable = 'X'
group by name
having count(*) >= 2;
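For anyone wanting to verify against the sample data, here is the answer's query run as-is on the question's table (the table name t is the answer's placeholder), via Python's sqlite3:

```python
import sqlite3

# Sample table from the question; `t` is the answer's placeholder name.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id TEXT, name TEXT, result INT, reportable TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    ("ONE", "A", 10, "X"),
    ("TWO", "B", 12, "X"),
    ("THREE", "C", 1, None),
    ("FOUR", "D", 18, "X"),
    ("FOUR(redo)", "D", 11, "X"),
])

# Names with two or more reportable results -- the duplicates to flag.
dupes = con.execute("""
    SELECT name
    FROM t
    WHERE reportable = 'X'
    GROUP BY name
    HAVING COUNT(*) >= 2
""").fetchall()
print(dupes)  # [('D',)]
```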
I have an expense table designed using SQLite. I would like to construct a query that picks out some combination of rows using the SUM function on the amount column of the table.
Sample Expense table
Clients Amounts
A 1000
B 3000
C 5000
D 2000
E 6000
Assuming I would like the total sum to be 10,000, I would like to construct a query that returns any number of random rows adding up to 10,000.
So far I tried:
SELECT *
FROM Expense
GROUP BY Clients
HAVING SUM(Amounts) = 10000
but I got nothing back.
I have also had a go with the random() function, but I'm assuming I would need to specify a LIMIT.
Older versions of SQLite do not support recursive CTEs (they were added in SQLite 3.8.3), so I can't think of an easy way of doing this. Perhaps you would be better off doing this in your presentation logic.
One option via SQL would be to string together a number of UNION statements. Using your sample data above, you would need to string together 3 UNIONs to get your results:
select clients
from expense
where amounts = 10000
union
select e.clients || e2.clients
from expense e
inner join expense e2 on e2.rowid > e.rowid
where e.amounts + e2.amounts = 10000
union
select e.clients || e2.clients || e3.clients
from expense e
inner join expense e2 on e2.rowid > e.rowid
inner join expense e3 on e3.rowid > e2.rowid
where e.amounts + e2.amounts + e3.amounts = 10000
Resulting in ABE and BCD. This would work for any group of clients, 1 to 3, whose sum is 10000. You could string more unions to get more clients -- this is just an example.
SQL Fiddle Demo
(Here's a sample with up to 4 clients - http://sqlfiddle.com/#!7/b01cf/2).
You can probably use dynamic sql to construct your endless query if needed, however, I do think this is better suited on the presentation side.
What you are describing is the subset sum problem, a special case of the knapsack problem (in your case, the value is equal to the weight).
This can be solved in SQL (see sgeddes's answer), but due to SQL's set-oriented design, the computation is rather complex and very slow.
You would be better off by reading the amounts into your program and solving the problem there (see the pseudocode on the Wikipedia page).
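A brute-force sketch of that advice over the question's sample data (fine for small tables; the dynamic-programming solution on the Wikipedia page scales better):

```python
from itertools import combinations

# Brute-force subset sum over the question's sample data: try every
# combination of clients and keep those whose amounts total the target.
amounts = {"A": 1000, "B": 3000, "C": 5000, "D": 2000, "E": 6000}
target = 10000

matches = [
    "".join(sorted(combo))
    for r in range(1, len(amounts) + 1)
    for combo in combinations(amounts, r)
    if sum(amounts[c] for c in combo) == target
]
print(matches)  # ['ABE', 'BCD']
```

This finds the same two groups, ABE and BCD, as the UNION query in the other answer, without a fixed cap on group size.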