Removing repeating values from sql - sql

I have a question for my homework in class that goes as such:
The Professor wants to review information about questions on quizzes that appear to be difficult.
Create a view named HardQuizzes that contains the quiz number, quiz date, average score, question id, number of students who chose A, number of students who chose B, number of students who chose C, and number of students who chose D for each question on a quiz where the average score is less than 15. Verify that the view has been created correctly. Confirm the change.
And I think I came up with a way to get the answer with this:
CREATE VIEW HardQuizzes
AS
SELECT DISTINCT
QQ.QuizNum,
QQ.QuizDate,
AvgScore,
NumChoseA,
NumChoseB,
NumChoseC,
NumChoseD
FROM
QuizQuestions QQ,
Quizzes Q
WHERE
QQ.QuizNum = Q.QuizNum
AND AvgScore < 15
But when I do this it creates the view but so many of the values repeat and I cant figure out a way to stop them repeating. Is there a way do that?
This is a screenshot of the view when I make it

Since it's an assignment, hoping to just nudge you in the right direction -
What other data is available in the original tables that you can investigate?
Are there multiple classes taking each of these quizzes?
If there are multiple classes, would the intent of the question lead you to believe the records should be combined?
If that's the case this could be accomplished with GROUP BY and aggregation functions over the NumChose columns.

Related

Create and define functions to apply to different segments/filters that exist (Instead of doing one very long SQL query)

I am trying to come up with some arithmetic calculations for some survey data. I want to do these calculations for a number of segments and want to figure out how to do it without writing numerous SELECT statements.
This is what I have so far:
FACT table. This tables holds survey data at a respondent level - for example, if a survey had 10 questions, this table will have 11 columns: a column to identify the respondent_ID and 10 other columns to identify the responses to those questions.
DIMENSION table. This table segments we want to view the survey data by at a respondent level - for example, if we want to view survey responses by membership_status and age_bracket, this table will have 3 columns: a column to identify the respondent_ID, and two columns to identify membership_status and age_bracket.
OUTPUT.
I want to get aggregate calculations to summarizes the responses to the survey overall and to each question. I also want to be able to get this information for all possible segments that exist in the DIMENSIONS table.
I can do the query below, however I'll need to do this for every segment:
SELECT
COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IN ('8', '9', '10') THEN f.RESPONDENT_ID END))*1.0 / COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IS NOT NULL THEN f.RESPONDENT_ID END))*1.0 AS CSAT_1
FROM FACT f
JOIN DIMENSION d ON f.RESPONDENT_ID = d.RESPONDENT_ID
WHERE d.MEMBERSHIP_STATUS = 'ACTIVE'
The calculation above gives us something called a top 3 box. That is just one calculation, I will need to do many of them. Additionally, ever calculation will need to be done for each segment. In order to get a calculation for nonactive members, I would need to run another query and set d.MEMBERSHIP_STATUS = 'INACTIVE' and I would need to run another query with no filter, to get the overall calculation.
Is there a way I could store all my arithmetic calculations needed in my output as a function (maybe in a temp table or something) - my thought is that it'll be better to set the functions somewhere, and then when I need to calculate the output, I would some how call the function to do all the calculations I need, and give me all the calculations for every segment I have?
I can't fully envision how to get there, or if this is even a good solution, so guidance and detailed SQL code would be extremely helpful.Examples please!

SQL counting number of rows

I am looking for a way to search for a certain number of rows as a quality check. For example, we have tables that have a certain set of results that are needed.
Here is a quick table for an example:
ID: Name: Result: Reportable:
ONE A 10 X
TWO B 12 X
THREE C 1
FOUR D 18 X
FOUR(redo) D 11 X
So we are looking to double check results as there are people who accidentally report results multiple times (as in the case with ID FOUR). We have used having counts but we need the numbers to be specific and need a query to verify that number is satisfied.
In the table above we only want IDs ONE, TWO, and FOUR, however we have 4 results (one extra). Currently we have our check showing the count needed (ie 3) and the current result count (4) to show the mismatch but want a query to easily only show the result needed. We would need the redo result most of the time so we have set it so we take the latest date, but it doesn't help filter how many rows or results. I apologize if anything is confusing and I am not able to share the SQL query that we have currently. It's my first time posting so if I need to clarify anything please let me know as this seems to be very complicated. Thank you for your time.
EDIT: The details
We have one table (Table A) letting us know which results are reportable. The ones that are reportable go into another table (Table B). We have had issues in which people have made too many results reportable which overpopulates the Table B. Our old query had a count in Table B, but due to mistakes in people placing multiple reportables, samples which had many redos seem to be finished as they were all placed and met the count in Table B.
So now by using the Table A that helps tell us how many are Reportable, we want this to double check that the samples are indeed ready.
As I understand the question, you want ids that have multiple reportables. Assuming you really mean name, then:
select name
from t
where reportable = 'X'
group by name
having count(*) >= 2;

Using distinct and then an aggregate function in Postgresql?

This is a pretty basic problem and for whatever reason I can't find a reasonable solution. I'll do my best to explain.
Say you have an event ticket (section, row, seat #). Each ticket belongs to an attendee. Multiple tickets can belong to the same attendee. Each attendee has a worth (ex: Attendee #1 is worth $10,000). That said, here's what I want to do:
1. Group the tickets by their section
2. Get number of tickets (count)
3. Get total worth of the attendees in that section
Here's where I'm having problems: If Attendee #1 is worth $10,000 and is using 4 tickets, sum(attendees.worth) is returning $40,000. Which is not accurate. The worth should be $10,000. Yet when I make the result distinct on the attendee, the count is not accurate. In an ideal world it'd be nice to do something like
select
tickets.section,
count(tickets.*) as count,
sum(DISTINCT ON (attendees.id) attendees.worth) as total_worth
from
tickets
INNER JOIN
attendees ON attendees.id = tickets.attendee_id
GROUP BY tickets.section
Obviously this query doesn't work. How can I accomplish this same thing in a single query? OR is it even possible? I'd prefer to stay away from sub queries too because this is part of a much larger solution where I would need to do this across multiple tables.
Also, the worth should follow the ticket divided evenly. Ex: $10,000 / 4. Each ticket has an attendee worth of $5,000. So if the tickets are in different sections, they take their prorated worth with them.
Thanks for your help.
You need to aggregate the tickets before the attendees:
select ta.section, sum(ta.numtickets) as count, sum(a.worth) as total_worth
from (select attendee_id, section, count(*) as numtickets
from tickets
group by attendee_id, section
) ta INNER JOIN
attendees a
ON a.id = ta.attendee_id
GROUP BY ta.section
You still have a problem of a single attendee having seats in multiple sections. However, you do not specify how to solve that (apportion the worth? randomly choose one section? attribute it to all sections? canonically choose a section?)
Using jsonb_object_agg:
select
tickets.section,
count(tickets.*) as count,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(attendees.id, attendees.worth))
) as total_worth
from
tickets
INNER JOIN
attendees ON attendees.id = tickets.attendee_id
GROUP BY tickets.section

Joining Tables in Oracle 8i to Limit Results

My background in SQL is limited, but my Googling isn't, I feel that I might just be missing the vocabulary to ask this question properly so hopefully beyond an answer to my question I can get the vocabulary I need to research this issue further.
I have a parts table - PARTS
I have a Purchase Order table - PO
and I have a PO Line Item table - PO_LINEITEM
The question I'm attempting to answer is given a particular part I want to get the latest purchase order and then look at the price we paid.
The PO table holds the date (PO_DATE) of when the Purchase Order was filled and the PO_LINEITEM table holds the information regarding the particular line item such as part primary key (PART_PRIMARYKEY) and price (PART_PRICE). The PARTS table isn't as important as the rest except that I need to return a PART primary key so that I can base the resulting view off the PARTS table
I've been through sub-queries and scalable sub-queries and the like, but I can't seem to find the right combination.
I started from a base of:
SELECT a.PO_DATE, b.PART_PRIMARYKEY, b.PART_PRICE
FROM PO a, PO_LINEITEM b
WHERE a.PO_PRIMARYKEY = b.PO_PRIMARYKEY
As you would expect this returns a list of every instance of an object in a PO with it's price and the date the Purchase Order was filled. The closest I have come to crack this is by using the MAX function on the date such as:
SELECT MAX(a.PO_DATE) AS DATE, b.PART_PRIMARYKEY, b.PART_PRICE
FROM PO a, PO_LINEITEM b
WHERE a.PO_PRIMARYKEY = b.PO_PRIMARYKEY
GROUP BY b.PART_PRIMARYKEY, b.PART_PRICE
This returns a Max date for each price a we paid for a particular part, so:
PART 1234, £12.95, 12/08/2012
PART 1234, £13.00, 14/08/2012
PART 1234. £11.15, 17/08/2012
PART 2345, £5.25, 12/08/2012
PART 2345, £5.65, 13/08/2012
etc.
What I need is:
PART 1234, £11.15, 17/08/2012
PART 2345. £5.65, 13/08/2012
If I could just group by the PART_PRIMARYKEY that would be excellent, but I get an ORA-00979 not a GROUP BY expression when I try.
Like I said I feel that my lack of vocabulary around this issue is impeding me finding an answer, so if anyone could point me in the right direction I'd be grateful
So hopefully I'm not asking a question that is asked every other day, but haven't found because I didn't use the magical combination of words to find.
Thank you for any help you can offer.
Look up Analytic Functions. They were introduced in 8i though I'm not sure how advanced they were at the time compared to how very good they can be in 11. A few links that I've used to understand them:
http://www.oracle-base.com/articles/misc/analytic-functions.php
http://www.orafaq.com/node/55
Though, a sub-query such as this might suffice (I may have your column naming mixed up):
select A.PART_PRIMARYKEY, B.PART_PRICE, A.PO_DATE
from PO A, PO_LINEITEM B
where A.PO_PRIMARYKEY = B.PO_PRIMARYKEY
and (A.PART_PRIMARYKEY, A.PO_DATE) in
( select A.PART_PRIMARYKEY, max(A.PO_DATE)
from PO A, PO_LINEITEM B
where A.PO_PRIMARYKEY = B.PO_PRIMARYKEY
group by A.PART_PRIMARYKEY);

minimizing runtime of calculating several ranks (based on grades) on large sql tables, how short can it get (ms access)

I've been stuck with the rather famous problem of ranking students by grade for a couple weeks now, and while I've learned much, I still haven't solved my problem (the ranks are generated, but the process is too slow):
I have a large table (320,000 rows) that contains the student codes (serves as an identifier, instead of their names), the students classroom, the test , the tests date, the subject, the question number and the students grade on that question. This table is the base for everything else that is calculated and its size makes all these calculations very very slow, to the point where I find me almost breaking everything here at work.
First, some intel on the school (very little info, required to understand the problem)
Here at the school we have weekly tests over several subjects. The school is also separated in classrooms with different purposes (one is focused on math, physics and chemistry, another one is focused on biology, and the last one focuses on history, Portuguese and geography). But they all do the same tests every week.
What we want to do is calculate the standard deviation for each question for everyone in the school (not per-classroom) and the average grade per question (also for everyone in the school), and then generate the following ranks (all of them per date):
-Rank per subject per classroom (with "raw" grades), Rank per subject considering the whole school (with "raw" grades) and Rank per subject considering the whole school (using normalized grades, with the standard deviation per question and the average grade per question information)
-The same ranks that were mentioned above, but not per Subject, considering instead all subjects
As you can see, after calculating the average grades and the standard deviations, we still need to calculate the sums of the grades on each question, and rank according to these sums (the actual subject/test grades). I've attacked this problem in a few ways:
1)
Created two tables, one with the grades per student per subject (fields: Students code, Students classroom, Date of test, Subject, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades) and another with the grades per student per test (all subjects taken into account, fields: Students code, Students classroom, Date of test, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades).
The insertion of data in these tables takes about 50 seconds
Then, I tried using SQL to rank, however, I ran into some problems:
-Access has no ROW_NUMBER or RANK functions, and thus I have to use queries with COUNT, like (below is just a simplified version):
SELECT 1+(SELECT Count(*) FROM grades_table_per_subject t2 WHERE
t2.Grade > t1.Grade AND t1.Date=t2.Date AND t1.Subject=t2.Subject) AS [Global Rank],
1+(SELECT Count(*) FROM grades_table_per_subject t3 WHERE t3.Grade > t1.Grade AND
t3.Date=t1.Date AND t3.Subject=t1.Subject AND t3.Classroom=t1.Classroom) AS
[Rank in classroom] FROM grades_table_per_subject;
There still is the rank with the normalized grades in the query above, but I omitted it.
The table grades_table_per_subject has about 45,000 lines and this query takes more than 15 minutes here, even with indexing (tried many different index combinations, even some odd ones when I saw that the ones that should work didn't).
I also tried to ORDER BY Count() DESC the inner selects, but I hit ctrl+break after 7 minutes and no results.
2)
Added the following fields to the tables above:
Rank in Classroom, Rank in School, Rank in School using normalized grades
Then I tried using VBA with DAO and manually update the Rank fields, running the following code (simplified version):
Set rs = CurrentDb.OpenRecordset("SELECT Classroom, Date, Subject, Grade, [Rank in classroom] FROM
grades_table_per_subject ORDER BY Date, Classroom, Subject, Grade DESC;", dbOpenDynaset)
...
...
rs.movefirst
i=1
While Not rs.eof
'Verifies if there was a change on either one of Subject, Classroom, Date and if so:
...
i = 1
...
rs.Edit
rs![Rank in classroom]=i
rs.Update
i = i + 1
rs.movenext
Wend
rs.close
This obviously builds only one of the ranks (in this case per subject per classroom), and it takes alone 3min 10sec.
I verified that it takes so long due to the writes on the table (rs.Edit and rs.Update are the culprits, commenting them makes the whole thing run in only 4 seconds), but I need the ranks written to the table to generate an access report later.
FINALLY:
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly. The times we have achieved, however, make this impossible.
Overall, the question to be asked is the following:
-Is there a way to calculate the ranks shown above through an Access Query under 10 seconds, or to use VBA and calculate-insert these ranks to the table in a similar time considering the size of the tables used here?
Also, I would love to see a list of efficient ranking algorithms, so that even if I can't do everything quickly, I can improve it as much as possible.
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly.
Why?
Why bother regenerating the same data over and over? It's most likely preferable to generate these statistics when the data changes and just look them up every other time. Redoing work you've already done whenever somebody wants to check something is just silly.
I just saw you say ms access only
so ignore this answer -- or consider moving to a real DB if you want to be able to do this type of power processing.
original answer below
I don't have access to your test data, but how fast does this run?
SELECT RANK () OVER (PARTITION BY [Date],[Subject] ORDER BY Grade) AS [Global Rank],
RANK () OVER (PARTITION BY [Date],[Subject], Classroom ORDER BY Grade) AS [Rank in classroom]
FROM grades_table_per_subject
My guess is you are not going to be able to beat SQL Servers ranking speed in VBA, if this is not fast enough then you need to look in the profiler and see what indexes it suggests you make.