Let's say my defect severity levels are 4 (Critical), 3 (Serious), 2 (Medium), 1 (Low), and the total number of defects is 4.
We can use the following steps:
1) We assign a number to each severity: Blocker = 9, Critical = 8, Major = 3, Minor = 2, Trivial = 1
2) Then we multiply the number of issues in each category by its assigned number, as in: (Number of Blocker issues * 9) + (Number of Critical issues * 8)
3) Then we divide by the total issue count
e.g. ((Blocker issue count * 9) + (Critical issue count * 8) + (Major issue count * 3) + (Minor issue count * 2) + (Trivial issue count * 1)) / Total issue count
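For instance, with hypothetical counts of 2 Blocker, 3 Critical, 5 Major, 4 Minor and 6 Trivial defects (20 in total), the index would be ((2 * 9) + (3 * 8) + (5 * 3) + (4 * 2) + (6 * 1)) / 20 = 71 / 20 = 3.55.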
#hirosht Defect Severity Index provides a measurement of the quality of a product under test. So, across multiple test iterations, if we see the DSI dropping, that may indicate that the quality of the product/feature is increasing. Having said that, the numbers may mislead us, and we should not take this alone as an indication of increasing quality: we also need to take into consideration the number of defects logged per iteration and the severity of the defects identified in each cycle before making our decision.
The table tests contains data on power tests of a certain type of rocket engine. The values in the following columns mean:
ID - test number
SN - engine serial number
T - test duration in seconds
M1, M2, M3 - power values recorded at the beginning, halfway point, and end of the test duration.
Your task is to display, for each engine, the test in which it achieved the highest average value of the three power measurements. However, you must take into account the following limitations:
1. We are only interested in those engines that have participated in at least 5 tests.
2. Of the tests for engines that meet condition 1, we are only interested in those that lasted at least a minute and where the lowest value of the three measurements was not less than 90% of the highest value of the three measurements.
3. We are interested only in those engines for which the tests selected after meeting the criteria in point 2 meet the following condition: the lowest average of the three measurement results of the tests for a given engine is not less than 85% of the highest average of measurements among these tests.
The resulting table should contain four columns:
ID
SN
MAX - containing the highest value of the three measurements in the given test
MAX_AVG - containing the highest average of measurements from all tests for a given engine (taking into account the conditions described above). Round this value to two decimal places.
Sort the results from the highest to the lowest average.
I am a student and my professor gave me this assignment; below I am sending my idea for solving the problem.
SELECT
ID, SN, MAX(M1, M2, M3) AS MAX, ROUND(MAX((M1 + M2 + M3)/3.0), 2) AS MAX_AVG
FROM
tests
WHERE
T >= 60
AND MIN(M1, M2, M3) >= MAX(M1, M2, M3) * 0.9
GROUP BY
SN
HAVING
COUNT(SN) >= 5
AND MIN((M1 + M2 + M3)/3.0) >= MAX((M1 + M2 + M3)/3.0) * 0.85
ORDER BY
MAX_AVG ASC;
You do not say what your difficulty is. But you did include your attempt, and made an effort.
However, you have a condition backwards, so that might be it.
Say that you have an engine that participated in 5 tests, but the fifth only lasted 55 seconds. Condition 1 says that you must include that engine, because it did participate in five tests:
1. We are only interested in those engines that have participated in at least 5 tests.
Of course, for that engine, you only want the first four tests:
2. Of the tests for engines that meet condition 1, we are only interested in those that lasted at least a minute
But your WHERE excludes the fifth test from the start, so that leaves four, and the HAVING excludes that engine.
You probably want to try a sub-SELECT to handle this case.
WITH
test_level_summary AS
(
SELECT
ID,
SN,
MIN(M1, M2, M3) AS MIN_M,
MAX(M1, M2, M3) AS MAX_M,
ROUND((M1 + M2 + M3)/3.0, 2) AS AVG_M,
COUNT(*) OVER (PARTITION BY SN) AS SN_TEST_COUNT
FROM
tests
),
valid_engines AS
(
SELECT
*,
MIN(AVG_M) OVER (PARTITION BY SN) AS MIN_AVG_M,
MAX(AVG_M) OVER (PARTITION BY SN) AS MAX_AVG_M
FROM
test_level_summary
WHERE
SN_TEST_COUNT >= 5
AND T >= 60
AND MIN_M >= MAX_M * 0.9
)
SELECT
ID, SN, MAX_M, AVG_M
FROM
valid_engines
WHERE
MIN_AVG_M >= MAX_AVG_M * 0.85
AND AVG_M = MAX_AVG_M
ORDER BY
AVG_M DESC;
For example:
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue LIMIT 10
The only difference I can see between using and not using LIMIT 10 is the amount of data returned for display to the user.
The system still orders all the data first before performing a LIMIT.
The below is applicable to BigQuery.
It is not necessarily 100% technically correct, but it is close enough that I hope it will give you an idea of why LIMIT N is extremely important to consider in BigQuery.
Assume you have 1,000,000 rows of data and 8 workers to process a query like the one below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field
Round 1: To sort this data, each worker gets 125,000 rows – so now you have 8 sorted sets of 125,000 rows each
Round 2: Worker #1 sends its sorted data (125,000 rows) to worker #2, #3 sends to #4, and so on. So now we have 4 workers, each producing an ordered set of 250,000 rows
Round 3: The above logic is repeated and now we have just 2 workers, each producing an ordered list of 500,000 rows
Round 4: And finally, just one worker producing the final ordered set of 1,000,000 rows
Of course, based on the number of rows and the number of available workers, the number of rounds can be different from the above example
In summary, what we have here:
a. We have a huge amount of data being transferred between workers – this can be quite a factor in performance going down
b. And there is a chance that one of the workers will not be able to process the amount of data distributed to it. This can happen earlier or later and usually manifests as a “Resources exceeded …” type of error
So, now, if you have LIMIT as part of the query, as below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field LIMIT 10
Round 1 is going to be the same. But starting with Round 2, ONLY the top 10 rows will be sent to the next worker – thus in each round after the first one, only 20 rows will be processed and only the top 10 will be sent for further processing.
Hopefully you can see how different these two processes are in terms of the volume of data being sent between workers and how much work each worker needs to do to sort its respective data.
To Summarize:
Without LIMIT 10:
• Initial rows moved (Round 1): 1,000,000
• Initial rows ordered (Round 1): 1,000,000
• Intermediate rows moved (Rounds 2 - 4): 1,500,000
• Overall merged ordered rows (Rounds 2 - 4): 1,500,000
• Final result: 1,000,000 rows
With LIMIT 10:
• Initial rows moved (Round 1): 1,000,000
• Initial rows ordered (Round 1): 1,000,000
• Intermediate rows moved (Rounds 2 - 4): 70
• Overall merged ordered rows (Rounds 2 - 4): 140
• Final result: 10 rows
Hopefully the above numbers clearly show the performance difference you gain by using LIMIT N, and in some cases even the ability to successfully run the query at all without a "Resources exceeded ..." error.
This answer assumes you are asking about the difference between the following two variants:
ORDER BY totalRevenue
ORDER BY totalRevenue LIMIT 10
In many databases, if a suitable index existed involving totalRevenue, the LIMIT query could stop sorting after finding the top 10 records.
In the absence of any index, as you pointed out, both versions would have to do a full sort, and therefore should perform the same.
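As a rough illustration for such an index-backed database (not BigQuery; the index name below is just a placeholder), the idea would be:
-- Hypothetical sketch for an index-capable database (not BigQuery):
-- with an index on totalRevenue the engine can walk the index in order
-- and stop after the first 10 rows instead of sorting the whole table.
CREATE INDEX idx_totalrevenue ON companyperformance (totalRevenue);

SELECT company_ID, totalRevenue
FROM companyperformance
ORDER BY totalRevenue
LIMIT 10;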
Also, there is a potentially major performance difference between the two if the table is large. In the LIMIT version, BigQuery only has to send across 10 records, while in the non-LIMIT version, potentially much more data has to be sent.
There is no performance gain. BigQuery still has to go through all the records in the table.
You can partition your data in order to cut the number of records that BigQuery has to read. That will increase performance. You can read more information here:
https://cloud.google.com/bigquery/docs/partitioned-tables
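As a rough sketch (the DATE column report_date is an assumption here; use whatever column fits your data), a partitioned copy of the example table could look like this:
-- Hypothetical sketch: create a date-partitioned copy of the table, then
-- filter on the partition column so BigQuery prunes partitions instead of
-- scanning every record.
CREATE TABLE `BQdataset.companyperformance_partitioned`
PARTITION BY report_date AS
SELECT * FROM `BQdataset.companyperformance`;

SELECT company_ID, totalRevenue
FROM `BQdataset.companyperformance_partitioned`
WHERE report_date = DATE "2024-01-01"
ORDER BY totalRevenue DESC
LIMIT 10;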
See the statistical difference in the BigQuery UI between the two queries below:
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 1000
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 10000
As you can see, BQ will return to the UI immediately after the limit criteria is reached; this results in better performance and less traffic on the network.
I am working with a SQLite database and I have three tables describing buildings, rooms, and scheduled events.
The tables look like this:
Buildings(ID,Name)
Rooms(ID,BuildingID,Number)
Events(ID,BuildingID,RoomID,Days,s_time,e_time)
So every event is associated with a building and a room. The column Days contains an integer which is a product of prime numbers corresponding to days of the week (a value of 21 means the event occurs on Tuesday = 3 and Thursday = 7).
I am hoping to find a way to generate a report of rooms in a specific building that will be open in the next few hours, along with how long they will be open for.
Here is what I have so far:
SELECT Rooms.Number
FROM Rooms
INNER JOIN Buildings on ( Rooms.BuildingID = Buildings.ID )
WHERE
Buildings.Name = "BuildingName"
EXCEPT
SELECT Events.RoomID
FROM Events
INNER JOIN Buildings on ( Events.BuildingID = Buildings.ID )
WHERE
Buildings.Name = "BuildingName" AND
Events.days & 11 = 0 AND
time("now", "localtime") BETWEEN events.s_time AND events.e_time;
Here I find all rooms for a specific building and then I remove rooms which currently have a scheduled event in progress.
I am looking forward to any helpful tips/comments.
If you're storing dates as the product of primes, the modulo (%) operator might be more useful:
SELECT * FROM Events
INNER JOIN Buildings on (Events.BuildingID = Buildings.ID)
WHERE
(Events.Days % 2 = 0 OR Events.Days % 5 = 0)
Would select events happening on either a Monday or Wednesday.
I do have to point out, though, that storing the product of primes is expensive both computationally and in storage. It is much easier to store the sum of powers of two (Mon = 1, Tues = 2, Wed = 4, Thurs = 8, Fri = 16, Sat = 32, Sun = 64).
The largest possible value for your current implementation is 510,510. The smallest data type to store such a number is int (32 bits per row) and retrieving the encoded data requires up to 7 modulo (%) operations.
The largest possible value for a 2^n summation method is 127 which can be stored in a tinyint (8 bits per row) and retrieving the encoded data would use bitwise and (&) which is somewhat cheaper (and therefore faster).
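A minimal sketch of that bit-flag scheme, assuming the Mon = 1 ... Sun = 64 mapping above (so an event on Tuesday and Thursday would store Days = 2 + 8 = 10):
-- Events that occur on Monday (bit 1) or Wednesday (bit 4):
SELECT * FROM Events
WHERE (Events.Days & 1) != 0 OR (Events.Days & 4) != 0;
-- equivalently: WHERE (Events.Days & 5) != 0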
Probably not an issue for what you're working with, but it's a good habit to choose whatever method gives you the best space and performance efficiency lest you hit serious problems should your solution be implemented at larger scales.
I have some entries in my database, in my case videos with a rating, popularity, and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it turns out to be an integer that represents, as a percentage, how often this entry should be hit in comparison to the others.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2, and 7 times as many on ID 3.
So every hit should be random, but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work. Consider the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.
So 0.5 of a hit goes to the second element,
and 0.5 of a hit goes to the (second and first) elements, which are then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively, so I could just do a range query from (0 - sum()), but then I would have to re-index everything coming after an item whenever I change it, or develop some swapping algorithm or something... but that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with a given probability"... However, I need it for another purpose; this is just to give you a final picture of what it should do.
Edit:
Thanks to ken's answer I thought about the following approach:
Calculate a random value from 0 to sum(DISTINCT boost):
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
Select the boost factor at which the running sum of distinct boost factors surpasses the random value.
Then, in our first example, we have 1 with a 0.1, 2 with a 0.2, and 7 with a 0.7 probability.
Now select one random entry from all entries having this boost factor.
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( ... trying to refine it.
I have to somehow include the count of entries with this boost factor... but I am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
@sample
GROUP BY
id, boost
ORDER BY
weighted DESC
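To actually pull a single weighted row from that ordering (a usage sketch, assuming the same @sample table), just take the top row:
-- Pick one row, weighted by boost
SELECT TOP (1)
id
FROM
@sample
ORDER BY
RAND(CHECKSUM(NEWID())) * boost DESC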
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select the sum of boosts, and generate a random number between 0 and that sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost reaches {random_number}:
SET @cumulative_boost = 0;
SELECT
id,
@cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
table
HAVING
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, then you would have a higher chance of winning "the lottery".
Since I didn't trust any of the results I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
SELECT * FROM
(SELECT id, boost FROM foo) AS `values`
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
That the maximum boost value is less than the maximum returned by the numbers query.
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that a row with a boost of 2 ends up twice in the final result, and a row with a boost of 100 ends up in the final set 100 times. In other words, if the sum of boosts is 4212, which it was in my case, you will have 4212 rows in the final set.
Finally I let MySQL sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the IDs don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost, as sketched below.
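If you are on MySQL 8+, a recursive CTE can stand in for that numbers table; this is only a sketch (the table foo and the cap of 100 are assumptions carried over from the example above):
-- Generate 1..100 on the fly, repeat each row "boost" times via the join,
-- then pick one of the repeated rows at random.
WITH RECURSIVE numbers AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM numbers WHERE n < 100
)
SELECT foo.id
FROM foo
INNER JOIN numbers ON numbers.n <= foo.boost
ORDER BY RAND()
LIMIT 1;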
I want to create a table with each row containing some sort of weight. Then I want to select random values with probability equal to (weight of that row) / (total weight of all rows). For example, with 5 rows having weights 1, 2, 3, 4, 5, over 1000 draws I'd get the first row approximately 1/15 * 1000 ≈ 67 times, and so on.
The table is to be filled manually. Then I'll take a random value from it. But I want to have an ability to change the probabilities on the filling stage.
I found this nice little algorithm in Quod Libet. You could probably translate it to some procedural SQL.
function WeightedShuffle(list of items with weights):
max_score ← the sum of every item’s weight
choice ← random number in the range [0, max_score)
current ← 0
for each item (i, weight) in items:
current ← current + weight
if current ≥ choice or i is the last item:
return item i
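A rough SQL translation of that pick step (MySQL 8+ flavour; the table items with columns id and weight is an assumption, and the running total plays the role of current):
-- "choice": one random number in [0, max_score)
SET @choice = (SELECT RAND() * SUM(weight) FROM items);
-- walk the cumulative weights and return the first row that reaches it
SELECT id
FROM (
SELECT id, SUM(weight) OVER (ORDER BY id) AS running_total
FROM items
) AS t
WHERE running_total >= @choice
ORDER BY running_total
LIMIT 1;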
The easiest (and maybe best/safest?) way to do this is to add those rows to the table as many times as you want the weight to be - say I want "Tree" to be found 2x more often than "Dog" - I insert "Tree" twice into the table, insert "Dog" once, and just select elements at random one by one.
If the rows are complex/big, then it would be best to create a separate table (weighted_Elements or something) in which you just have foreign keys to the real rows, inserted as many times as the weights dictate.
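A tiny sketch of that lookup-table variant (the table and column names here are made up; weighted_Elements simply repeats each foreign key weight times):
-- Each element id appears in weighted_Elements as many times as its weight,
-- so a plain uniform pick becomes a weighted pick.
CREATE TABLE weighted_Elements (
element_id INT NOT NULL REFERENCES elements(id)
);
SELECT element_id
FROM weighted_Elements
ORDER BY RAND()
LIMIT 1;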
The best possible scenario (if I understand your question properly) is to set up your table as you normally would and then add two columns, both INTs.
Column 1: Weight - This column would hold your weight value, going from -X to +X, X being the highest value you want to have as a weight (i.e. X=100, -100 to 100). This value is populated to give the row an actual weight and increase or decrease the probability of it coming up.
Column 2: Count - This column would hold the count of how many times this row has come up; this column is needed only if you want to use fair weighting. Fair weighting prevents one row from always showing up. (I.e. if you have one row weighted at 100 and another at 2, the row with 100 will always show up; this column will allow weight 2 to become more 'valuable' as you get more weight-100 results.) This column should be incremented by 1 each time a row result is pulled, but you can make the logic more advanced later so it adds the weight etc.
Logic: It's really simple now. Your query simply has to request all rows as you normally would, then make an extra select that (you can change the logic here to whatever you want) takes the weight, subtracts the count, and orders by that column, as sketched below.
The end result should be a table where your heavier weights appear more often up to a certain point, after which the system evenly distributes itself out. If you leave out column 2, you will have a system that always returns the same weighted order unless you offset the base of the query (i.e. LIMIT [RANDOM NUMBER], [NUMBER OF ROWS TO RETURN]).
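A rough sketch of that weight-minus-count ordering (the table name entries and the exact column names are assumptions, not from the question):
-- Pick the row whose weight, discounted by how often it has already been
-- returned, is currently highest; then record the hit.
SELECT id, Weight, `Count`
FROM entries
ORDER BY (Weight - `Count`) DESC
LIMIT 1;
-- after returning a row, bump its counter (id 3 here is just an example)
UPDATE entries SET `Count` = `Count` + 1 WHERE id = 3;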
I'm not an expert in probability theory, but assuming you have a column called WEIGHT, how about
select FIELD_1, ... FIELD_N, (rand() * WEIGHT) as SCORE
from YOURTABLE
order by SCORE
limit 0, 10
This would give you 10 records, but you can change the limit clause, of course.
The problem is called Reservoir Sampling (https://en.wikipedia.org/wiki/Reservoir_sampling)
The A-Res algorithm is easy to implement in SQL:
SELECT *
FROM table
ORDER BY pow(rand(), 1 / weight) DESC
LIMIT 10;
I came looking for the answer to the same question - I decided to come up with this:
id weight
1 5
2 1
SELECT * FROM table ORDER BY RAND()/weight
It's not exact - but then it is using random, so I might not expect exact. I ran it 70 times and got number 2 ten times. I would have expected 1/6th but I got 1/7th. I'd say that's pretty close. I'd have to run a script a few thousand times to get a really good idea of whether it's working.