I have data on five experiments, which vary slightly, and I would like to find a way of standardizing results between experiments to develop a “standard result.” This will be a value I can multiply each result by to make them comparable.
My approach relies on the fact that some individuals appear in more than one experimental group (assuming that being in more than one group doesn't affect results).
I am assuming that the results of those individuals differ between experiments only because of the slight differences between the experiments themselves. So, by calculating a conversion factor for each experiment that makes the averages of those shared individuals' results equal, I can translate any result into a standard result.
The problem I am experiencing is that at the end of the process most of my conversion factors are less than 1, whereas I was expecting roughly an even split of values above and below 1, since I am in effect averaging out the results.
My procedure and SQL Server code are below:
Get the data by experiment and individual, and exclude all individuals whose data appears in only one experiment.
SELECT ExperimentID,
IndividualID
INTO Z1_IndivByExperiment
FROM Results
GROUP BY ExperimentID,
IndividualID
SELECT IndividualID,
COUNT(ExperimentID) AS ExperimentCount
INTO Z2_MultiExperIndiv
FROM Z1_IndivByExperiment
GROUP BY IndividualID
HAVING COUNT(ExperimentID) > 1
ORDER BY ExperimentCount DESC
Create two tables:
i) Summarise results by individual (an individual can have multiple results per experiment) and experiment.
ii) Summarise results by individual.
SELECT ExperimentID,
IndividualID,
SUM(Results.Result) AS Result_sum,
SUM(Results.ResultsCount) AS ResultCount_sum
INTO Z3_MultiExperIndiv_Results
FROM Results INNER JOIN
Z2_MultiExperIndiv ON Results.IndividualID =
Z2_MultiExperIndiv.IndividualID
GROUP BY ExperimentID,
IndividualID
SELECT 'Standard' AS Experiment,
IndividualID,
SUM(Result_sum) AS ResultIndiv_sum,
SUM(ResultCount_sum) AS ResultCountIndiv_sum
INTO Z4_MultiExperIndiv_Stand
FROM Z3_MultiExperIndiv_Results
GROUP BY IndividualID
Link the two tables created in step 2 by individual, summing the results from table 1 and table 2 by experiment. I am hoping this provides two sets of results: the results of individuals on the experiment in question, and those same individuals' results from the other experiments they were part of.
SELECT Z4_MultiExperIndiv_Stand.Experiment AS ExperimentID1,
Z3_MultiExperIndiv_Results.ExperimentID AS ExperimentID2,
SUM(Z4_MultiExperIndiv_Stand.ResultIndiv_sum -
Z3_MultiExperIndiv_Results.Result_sum) AS Results1,
SUM(Z4_MultiExperIndiv_Stand.ResultCountIndiv_sum -
Z3_MultiExperIndiv_Results.ResultCount_sum) AS ResultCount1,
SUM(Z3_MultiExperIndiv_Results.Result_sum) AS Results2,
SUM(Z3_MultiExperIndiv_Results.ResultCount_sum) AS ResultCount2
INTO Z5_StandardConversion_data
FROM Z3_MultiExperIndiv_Results INNER JOIN
Z4_MultiExperIndiv_Stand ON Z3_MultiExperIndiv_Results.IndividualID =
Z4_MultiExperIndiv_Stand.IndividualID
GROUP BY Z4_MultiExperIndiv_Stand.Experiment,
Z3_MultiExperIndiv_Results.ExperimentID
Then I divide each set of summed results by its number of results, and divide one average by the other to get my conversion-to-standard factor for each experiment.
SELECT ExperimentID1,
ExperimentID2,
Results1,
ResultCount1,
Results2,
ResultCount2,
Results1 / ResultCount1 AS Result1_avg,
Results2 / ResultCount2 AS Result2_avg,
(ResultCount2 * Results1) / (ResultCount1 * Results2) AS Conversion
FROM Z5_StandardConversion_data
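Once the conversion factors are computed, applying them is a single join; a sketch, assuming the final SELECT above has been stored in a (hypothetical) table Z6_Conversion:

```sql
-- Multiply each raw result by its experiment's conversion factor
-- to express it as a "standard result".
SELECT r.ExperimentID,
       r.IndividualID,
       r.Result * c.Conversion AS StandardResult
FROM Results r
INNER JOIN Z6_Conversion c
        ON r.ExperimentID = c.ExperimentID2;
```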
The question asked is: Report the nations where the average yield of the shares exceeds the average yield of all shares
I have a solution, but I am unsure whether it answers the question. I only received one row of output and I'm not sure whether I should get several.
SELECT Nations.nationName,
AVG(dividend/price*100) AS Yield
FROM Shares, Nations
WHERE Shares.nationID=Nations.ID
GROUP BY Nations.nationName
HAVING AVG(dividend/price*100)>
(SELECT AVG(dividend/price*100) FROM Shares);
My output only shows one result; I'm just wondering whether this should be the outcome.
When Percentage formatting is applied to the column, the result is automatically multiplied by 100 when displayed, i.e. a value of 0.25 is displayed as 25%. As such, either remove the formatting or remove the 100 multiplier from your code.
I would also suggest using an inner join over a Cartesian Product, i.e.:
select n.nationname, avg(s.dividend/s.price) as yield
from shares s inner join nations n on s.nationid = n.id
group by n.nationname
As for the correctness of your query, you can easily check this by calculating the average yield of all shares either manually or using a separate query, and then identifying which nations should be output by your query.
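As a sketch of that check, assuming the Shares/Nations schema from the question:

```sql
-- Overall average yield across all shares, for manual comparison.
SELECT AVG(s.dividend / s.price) AS overall_yield
FROM shares s;

-- Per-nation averages; nations whose figure exceeds the overall one
-- are exactly those the HAVING-filtered query should return.
SELECT n.nationname, AVG(s.dividend / s.price) AS yield
FROM shares s
INNER JOIN nations n ON s.nationid = n.id
GROUP BY n.nationname
ORDER BY yield DESC;
```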
I am trying to come up with some arithmetic calculations for some survey data. I want to do these calculations for a number of segments and want to figure out how to do it without writing numerous SELECT statements.
This is what I have so far:
FACT table. This table holds survey data at a respondent level - for example, if a survey had 10 questions, this table will have 11 columns: a column to identify the respondent_ID and 10 other columns to hold the responses to those questions.
DIMENSION table. This table holds the segments we want to view the survey data by, at a respondent level - for example, if we want to view survey responses by membership_status and age_bracket, this table will have 3 columns: a column to identify the respondent_ID, and two columns to identify membership_status and age_bracket.
OUTPUT.
I want aggregate calculations that summarize the responses to the survey overall and to each question. I also want to be able to get this information for every possible segment in the DIMENSION table.
I can do the query below, however I'll need to do this for every segment:
SELECT
COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IN ('8', '9', '10') THEN f.RESPONDENT_ID END))*1.0 / COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IS NOT NULL THEN f.RESPONDENT_ID END))*1.0 AS CSAT_1
FROM FACT f
JOIN DIMENSION d ON f.RESPONDENT_ID = d.RESPONDENT_ID
WHERE d.MEMBERSHIP_STATUS = 'ACTIVE'
The calculation above gives us something called a top-3 box. That is just one calculation; I will need many of them, and every calculation will need to be done for each segment. To get a calculation for inactive members, I would need to run another query with d.MEMBERSHIP_STATUS = 'INACTIVE', and yet another query with no filter to get the overall calculation.
Is there a way I could store all the arithmetic calculations needed in my output as a function (maybe in a temp table or something)? My thought is that it would be better to define the calculations somewhere once, and then call them to produce every calculation for every segment I have.
I can't fully envision how to get there, or whether this is even a good approach, so guidance and detailed SQL code would be extremely helpful. Examples please!
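One common way to avoid repeating the query per segment is GROUPING SETS; a sketch, assuming the FACT/DIMENSION schema above and a dialect that supports it (SQL Server, PostgreSQL, Oracle):

```sql
-- One pass produces CSAT_1 per membership_status, per age_bracket,
-- and overall (the empty grouping set), instead of one query each.
SELECT
    d.MEMBERSHIP_STATUS,
    d.AGE_BRACKET,
    COUNT(DISTINCT CASE WHEN f.QUESTION_1 IN ('8', '9', '10')
                        THEN f.RESPONDENT_ID END) * 1.0
      / COUNT(DISTINCT CASE WHEN f.QUESTION_1 IS NOT NULL
                            THEN f.RESPONDENT_ID END) AS CSAT_1
FROM FACT f
JOIN DIMENSION d ON f.RESPONDENT_ID = d.RESPONDENT_ID
GROUP BY GROUPING SETS (
    (d.MEMBERSHIP_STATUS),
    (d.AGE_BRACKET),
    ()   -- no filter: the overall calculation
);
```

Rows belonging to the overall set show NULL in both segment columns; GROUPING() can be used to tell those apart from genuine NULL segment values.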
There are 2 tables and an expected result: the total cost of each engagement. Multiple tests are taken during each engagement, and each test has a set cost. The expected result must be in terms of EngagementId, EngagementCost.
The 2 tables, with their respective fields:
- EngagementTest (EngagementId, TestId)
- Test (TestId, TestCost)
How would one go about calculating the cost of each engagement?
This is as far as I managed to get:
SELECT EngagementId, COUNT(TESTId)
FROM EngagementTest
GROUP BY EngagementId;
Try a SUM of the TestCost column rather than a COUNT. COUNT just tells you the number of rows. SUM adds up the values within the rows and gives you a total. Also your existing query doesn't actually use the table that contains the cost data. You can INNER JOIN the two tables via TestId and then GROUP BY the EngagementId so you get the sum of each engagement.
Something like this:
SELECT
ET.EngagementId,
SUM(T.TestCost)
FROM
EngagementTest ET
INNER JOIN Test T
ON T.TestId = ET.TestId
GROUP BY
ET.EngagementId
It can be achieved using the query below.
SELECT i.EngagementId, SUM(t.TestCost)
FROM EngagementTest i
INNER JOIN Test t
ON i.TestId = t.TestId
GROUP BY i.EngagementId
I work for a small company dealing with herbal ingredients. We regularly measure the effectiveness of the ingredients based on the "product mix" (how much of ingredients A, B and C). I have a table with thousands of rows, like the following:
PRODUCT Ingredient A Ingredient B Ingredient C EFFECTIVENESS
1 A 28 94 550 4,1
2 B 50 105 400 4,3
3 C 30 104 312 3,5
.. Etc etc etc etc Etc
What I want as a result is the table below. I have been using Excel for the last few years, but it is difficult to handle millions of rows, so I would now like something similar in SQL. I made several attempts with PIVOT and subqueries, but did not manage to get the result I needed.
In particular, the first three columns contain various ranges/criteria. The 'Average effectiveness' column holds the average effectiveness of the 'Total products' that meet those criteria. Because the ranges number in the hundreds (for Ingredient A alone I have more than 100 different ranges, and similarly for Ingredients B and C), I would like a way to produce all combinations of A, B, C ranges automatically.
Ingr. A Ingr. B Ingr. C Total products Average Effectiveness
1-10 50-60 90-110 ??? ???
1-10 50-60 110-130 ??? ??
1-10 50-60 130-150 ???? ??
1-10 60-70 150-170 ??? ??
10-20 60-70 90-110 ??? ??
10-20 60-70 110-130 ??? ??
10-20 60-70 130-150 ?? ??
Etc etc
I'm unable to give a more specific answer, but I think what you need to do is:
Use the CUBE to get all of the combinations and to aggregate the SUM and AVG values
Summarizing Data Using CUBE
The CUBE query will take its data from a nested query that has your data stored by the range of a value rather than the actual value. You can refer to SQL's CASE expression for more information on transforming the data so that it stores the range of a value rather than the value.
So, in other words, first you transform your data so that you're storing which range a value occurs in. Then from that transformed data, you summarize it using the CUBE to get all the combinations. So #1 is the outer query and #2 is the inner query.
Here is a very rough idea of what the query might look like, just to give you an idea:
SELECT Ingr_A, Ingr_B, Ingr_C, COUNT(*), AVG(Effectiveness)
FROM (SELECT
Product,
Effectiveness,
"Ingr_A" =
CASE
WHEN Ingredient_A >= 10 and Ingredient_A < 20 THEN '[10, 20)'
WHEN Ingredient_A >= 20 and Ingredient_A < 30 THEN '[20, 30)'
...
END,
"Ingr_B" =
CASE
(like above)
END,
"Ingr_C"
(etc.)
FROM ProductsTable) AS ranged
GROUP BY Ingr_A, Ingr_B, Ingr_C WITH CUBE
I have some entries in my database, in my case videos, with a rating, popularity and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it comes out as an integer representing the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as much on ID 2 and 7 times as much on ID 3.
So every hit should be random, but with a probability of (boost / sum of boosts). The probability for ID 3 in this example should therefore be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 go to the (second and first) elements, which are then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.67.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively, so I could just do a range query over (0..sum()), but then I would have to re-index everything after an item whenever it changes, or develop some swapping algorithm; that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "please choose a random ad with the given probability". I need it for another purpose, but that gives you a final picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors whose running sum surpasses the random value
In our first example we then have 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost always differs. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are a million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( Trying to refine it.
I have to somehow include the count of entries with each boost factor, but I am stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
@sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
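To actually draw a single row with this answer's weighting, one would keep only the top-ranked row; a self-contained sketch on SQL Server, reusing the sample data from above:

```sql
-- Sample table: one row per entry, with its boost weight.
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

-- One random draw: rank rows by per-row random value times boost,
-- keep the highest.
SELECT TOP (1) id
FROM @sample
ORDER BY RAND(CHECKSUM(NEWID())) * boost DESC
```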
I dare to suggest a straightforward solution with two queries, using cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored in a variable; let's call it {random_number}.
Then select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost reaches {random_number}:
SET @cumulative_boost = 0;
SELECT
id,
@cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
table
HAVING
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
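On databases with window functions (MySQL 8+, SQL Server, PostgreSQL), the same cumulative idea can be written without session variables; a minimal sketch, where videos stands in for the question's table and @random_number is the value drawn in the first step:

```sql
-- First row whose running total of boost reaches the random threshold;
-- each row is then hit with probability boost / SUM(boost).
SELECT id
FROM (
    SELECT id,
           SUM(boost) OVER (ORDER BY id) AS cumulative_boost
    FROM videos
) t
WHERE t.cumulative_boost >= @random_number
ORDER BY t.cumulative_boost
LIMIT 1;
```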
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you had a higher chance to win "the lottery".
Since I didn't trust any of the solutions I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did would, in your case, look a little bit like this:
SELECT values.id, values.boost
FROM (SELECT id, boost FROM foo) AS values
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= values.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum number of boost is less than the maximum returned in the number query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that if a row has a boost of 2 it ends up duplicated in the final result, and if a row has a boost of 100 it ends up in the final set 100 times. In other words: if the sum of boosts is 4212, which it was in my case, you get 4212 rows in the final set.
Finally I let MySQL sort it randomly.
Edit: for the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost.
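That numbers table might be built like this (a sketch using a MySQL 8+ recursive CTE; n = 100 and the names numbers and foo are assumptions):

```sql
-- Materialize the integers 1..100 once, then join as suggested above:
-- a row with boost b appears b times before the random sort.
CREATE TEMPORARY TABLE numbers AS
WITH RECURSIVE seq (id) AS (
    SELECT 1
    UNION ALL
    SELECT id + 1 FROM seq WHERE id < 100
)
SELECT id FROM seq;

SELECT v.id, v.boost
FROM (SELECT id, boost FROM foo) AS v
INNER JOIN numbers ON numbers.id <= v.boost
ORDER BY RAND();
```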