Optimizing my stored procedure - is this the right way to do it? - sql

SELECT TOP 1 @CurrentStudentID = StudentID
FROM Courses WITH (NOLOCK)
WHERE Courses.CourseID = @CourseID
ORDER BY StudentID
-- Loop through all the students and find if he/she is registered for more than one course.
WHILE (@@ROWCOUNT > 0 AND @CurrentStudentID IS NOT NULL)
BEGIN
    -- Select all other courses student is currently registered in.
    IF @@ROWCOUNT > 0
    BEGIN
        -- return required information
    END
    ELSE
    BEGIN
        -- Perform some operations
    END
    -- Select the next registered student
    SELECT TOP 1 @CurrentStudentID = StudentID
    FROM Courses WITH (NOLOCK)
    WHERE Courses.CourseID = @CourseID AND
          Courses.StudentID > @CurrentStudentID
    ORDER BY StudentID
END
Can someone help with my logic here? I wrote a stored procedure to find out if a student of a course is currently taking other courses from the same school.
I'm particularly worried about the two SELECT queries and the performance of a while loop if the number of students is huge. The way I am doing it feels very contrived. I'm sure there are better ways to do this.
I've done SQL profiling on this stored procedure and its duration can range from 0 to 60 ms for a single call. I don't understand why the same stored procedure's execution time is so inconsistent.
Appreciate any help. I only have 1 year plus of SQL Server 2008 experience.
Thanks in advance.

As I mentioned, SQL is a set-based language. In other words, it operates on whole sets of rows at once, which allows for efficient comparisons between groups of data. Procedural languages such as C++ or Java typically work through such data one row at a time, much like a cursor does.
High level as this definition is, the point is to think of your data like EXCEL sheets. You have predefined columns such as CourseID and StudentID, that have information in the other columns that are dependent on those values (CourseID 1:1 Course_Name) and some information that is repetitive (CourseID can have multiple students).
True normalization includes removing interdependent columns, but let's not worry about that right now. The main focus is on what makes sense for the business. Your table has identifying columns for its courses and students, so you do not need a cursor as long as those values do not have conflicting interdependent values.
SELECT StudentID, COUNT(COURSEID) AS CLASS_NUM
FROM COURSES
GROUP BY StudentID
HAVING COUNT(COURSEID) > 1
The GROUP BY returns one row per distinct combination of values in the listed columns, collapsing the other rows and allowing aggregate functions like COUNT(). (Note: COUNT(COURSEID) skips rows where COURSEID is NULL. Use COUNT(*) to count every row, or wrap the column in ISNULL if NULLs should be counted.)
You have not written a loop, and yet you achieve the same result. After SQL collapses the rows, you can use a HAVING clause to further limit the result sets from the GROUP BY if needed.
Way faster than a cursor, definitely. :)
Now, if your table includes students in different semesters and years, you might consider adding this to the GROUP BY, so that you have sets in your GROUP BY (StudentID and Year)
Also, recall that the SELECT clause is LOGICALLY evaluated AFTER the GROUP BY and HAVING clauses, so any column listed in the SELECT clause must either appear in the GROUP BY or be wrapped in an aggregate function.
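To make the set-based approach concrete, here is a minimal sketch that runs the GROUP BY / HAVING query against a small in-memory table. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the sample rows and the course number 101 are made up for illustration. The extra WHERE clause, which limits the result to students of one given course as the question asks, is my own addition and not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Courses (CourseID INTEGER, StudentID INTEGER)")
# Hypothetical sample data: (CourseID, StudentID)
conn.executemany(
    "INSERT INTO Courses (CourseID, StudentID) VALUES (?, ?)",
    [(101, 1), (102, 1), (101, 2), (103, 3), (101, 3), (102, 3), (104, 4), (105, 4)],
)

target_course = 101  # assumed course of interest

# One set-based query replaces the WHILE loop: keep only students enrolled
# in the target course, then keep those with more than one course in total.
multi_course_students = conn.execute(
    """
    SELECT StudentID, COUNT(CourseID) AS CLASS_NUM
    FROM Courses
    WHERE StudentID IN (SELECT StudentID FROM Courses WHERE CourseID = ?)
    GROUP BY StudentID
    HAVING COUNT(CourseID) > 1
    ORDER BY StudentID
    """,
    (target_course,),
).fetchall()

print(multi_course_students)
```

Student 2 is dropped because they take only one course, and student 4 is dropped because they are not in course 101.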

Related

If I have multiple values in a column, how can I count it in SQL?

Let me illustrate this:
Student_ID    Course
StudentID1    CourseA
StudentID2    CourseB
StudentID3    CourseA CourseB
There is an existing table that has data that looks like the one above (not exactly Student & Course, this is more for illustration purposes) and my job is to count how many students there are for each course. The table is very large and I do not know how many courses there are or what they are (easily in the thousands), so I wonder if there is a way I can get a list of all these courses and their counts through SQL?
The current method that my team did is SELECT DISTINCT COURSE, COUNT(STUDENT_ID) FROM TABLE GROUP BY COURSE, but this does not work because it treats "CourseA CourseB" as its own.
There are some other columns in this table that I might need to do a SUM as well.
Appreciate any advice on this, thanks!
You could use the query below to find the number of students for each course:
select course, count(*) as no_of_students
from table
group by course;
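Note that a plain GROUP BY on the column still treats "CourseA CourseB" as its own group. If the column really holds several space-separated values per row, one option is to split them into one row per student/course pair before grouping. The following is only a sketch under assumptions: it uses Python's sqlite3 module, the table names Enrollment and EnrollmentSplit are hypothetical, and it assumes a single space is the only separator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Enrollment (Student_ID TEXT, Course TEXT)")
conn.executemany(
    "INSERT INTO Enrollment VALUES (?, ?)",
    [
        ("StudentID1", "CourseA"),
        ("StudentID2", "CourseB"),
        ("StudentID3", "CourseA CourseB"),
    ],
)

# Split each multi-valued Course cell into one row per (student, course).
conn.execute("CREATE TABLE EnrollmentSplit (Student_ID TEXT, Course TEXT)")
rows = conn.execute("SELECT Student_ID, Course FROM Enrollment").fetchall()
conn.executemany(
    "INSERT INTO EnrollmentSplit VALUES (?, ?)",
    [(student, course) for student, courses in rows for course in courses.split()],
)

# Now the ordinary GROUP BY counts each course correctly.
course_counts = conn.execute(
    "SELECT Course, COUNT(*) AS no_of_students "
    "FROM EnrollmentSplit GROUP BY Course ORDER BY Course"
).fetchall()

print(course_counts)
```

Once the data is in one-row-per-pair form, SUMs over the other columns can be added to the same GROUP BY query.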

What does GROUP BY on multiple columns mean?

I use Oracle 11g. I have read a lot of articles about this, but I don't understand
how exactly it happens in the database. Let's say that we have two tables:
select * from Employee
select * from student
So when we want to group by multiple columns:
SELECT SUBJECT, YEAR, Count(*)
FROM Student
GROUP BY SUBJECT, YEAR;
So my question is: what exactly happens in the database? I mean, does the query do the count(*) first for every column in the group by and then sort it, or what? Can anyone explain it in detail?
SQL is a descriptive (declarative) language, not a procedural language.
What the query does is determine all rows in the original data where the group by keys are the same. It then reduces them to one row.
For example, in your data, these rows all have the same group by keys:
subject year name
English 1 Harsh
English 1 Pratik
English 1 Ramesh
You are saying to group by subject, year, so these become:
Subject Year Count(*)
English 1 3
Often, this aggregation is implemented using sorting. However, that is up to the database, and there are many other algorithms. You cannot assume that the database will sort the data. But, if it is easier for you, you can think of the data as being sorted by the group by keys in order to identify the groups. Just one caution: the returned rows are not necessarily in any particular order (unless your query includes an order by).
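As a small illustration of the reduction described above, the sketch below groups by two keys at once. It uses Python's sqlite3 module rather than Oracle, and the last two sample rows (Jay and Asha) are made up so that more than one group appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (subject TEXT, year INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [
        ("English", 1, "Harsh"),
        ("English", 1, "Pratik"),
        ("English", 1, "Ramesh"),
        ("English", 2, "Jay"),   # hypothetical extra row
        ("Maths", 1, "Asha"),    # hypothetical extra row
    ],
)

# Rows that agree on BOTH subject and year collapse into a single row.
# The explicit ORDER BY is what guarantees the output order, not the GROUP BY.
groups = conn.execute(
    """
    SELECT subject, year, COUNT(*)
    FROM Student
    GROUP BY subject, year
    ORDER BY subject, year
    """
).fetchall()

for row in groups:
    print(row)
```

The three English/1 rows become one row with a count of 3, while English/2 and Maths/1 each form their own group.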

SELECT TOP 1 is returning multiple records

I shall link my database down below.
I have a query called 'TestMonday1'. It returns the student with the fewest 'NoOfFrees' and inserts the result of the query into the lesson table. Running the query should help explain what I mean. The problem I'm having is that my SQL code has 'SELECT TOP 1', yet if two students have the same number of frees, the query returns both records. With this being a timetable planner, it should only ever return one result. I shall also put the code below.
Many thanks
Code:
INSERT INTO Lesson ( StudentID, LessonStart, LessonEnd, DayOfWeek )
SELECT TOP 1 Availability.StudentID, Availability.StartTime,
Availability.EndTime, Availability.DayOfWeek
FROM Availability
WHERE
Availability.StartTime='16:00:00' AND
Availability.EndTime='18:00:00' AND
Availability.DayOfWeek='Monday' AND
LessonTaken IS NULL
ORDER BY
Availability.NoOfFrees;
This happens because Access returns all records in case of ties in ORDER BY (all records returned have the same values of fields used in ORDER BY).
You can add another field to ORDER BY to make sure there are no ties. StudentID looks like a good candidate (though I don't know your schema; replace it with something else if that suits better):
ORDER BY
Availability.NoOfFrees, Availability.StudentID;

Creating a denormalized table from a normalized key-value table using 100s of joins

I have an ETL process which takes values from an input table which is a key value table with each row having a field ID and turning it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easier to add new fields.
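Here is a scaled-down sketch of the conditional aggregation above, with two fields instead of 188. It runs on SQLite through Python's sqlite3 module, so it says nothing about SQL Server performance or minimal logging. The sample rows are invented, and Day is stored as TEXT because SQLite has no DATE type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (StudentId INTEGER PRIMARY KEY);
    CREATE TABLE Days (Day TEXT PRIMARY KEY);
    CREATE TABLE StudentFieldValues (
        FieldId   INTEGER NOT NULL,
        StudentId INTEGER NOT NULL,
        Day       TEXT NOT NULL,
        Value     REAL,
        PRIMARY KEY (FieldId, StudentId, Day)
    );
    INSERT INTO Students VALUES (1), (2);
    INSERT INTO Days VALUES ('2015-01-05'), ('2015-01-06');
    INSERT INTO StudentFieldValues VALUES
        (1, 1, '2015-01-05', 4.0),
        (2, 1, '2015-01-05', 10.0),
        (1, 2, '2015-01-06', 3.0);
    """
)

# One join plus one MAX(CASE ...) per output column, instead of one join per field.
student_days = conn.execute(
    """
    SELECT s.StudentId, d.Day,
           MAX(CASE WHEN sfv.FieldId = 1 THEN sfv.Value END) AS NumberOfClasses,
           MAX(CASE WHEN sfv.FieldId = 2 THEN sfv.Value END) AS MinutesLateToSchool
    FROM Students s
    CROSS JOIN Days d
    LEFT OUTER JOIN StudentFieldValues sfv
        ON s.StudentId = sfv.StudentId AND d.Day = sfv.Day
    GROUP BY s.StudentId, d.Day
    ORDER BY s.StudentId, d.Day
    """
).fetchall()

for row in student_days:
    print(row)
```

Student/day combinations with no stored values still appear, with NULL (None in Python) in every field column, which matches what the LEFT OUTER JOIN version produces.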
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value).
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instead, query it as a set of rows, as it is stored in the database, and fetch the result set back into your application. There you write code to read the data row by row, and apply the "fields" to fields of an object or a hashmap or something.
I think there may be some trial and error here to see what works but here are some things you can try:
Disable indexes and re-enable after data load is complete
Disable any triggers that don't need to run during data loads.
The above was taken from an msdn post where someone was doing something similar to what you are.
Think about trying to only update the de-normalized table based on changed records if this is possible. Limiting the result set would be much more efficient if this is a possibility.
You could try a more threaded iterative approach in code (C#, vb, etc) to build this table by student where you aren't doing the X number of joins all at one time.

SQL: selective subqueries

I have a SQL query (MS SQL Server) where I add columns to the resultset using subselects:
SELECT P.name,
(select count(*) from cars C where C.type = 'sports') AS sportscars,
(select count(*) from cars C where C.type = 'family') AS familycars,
(select count(*) from cars C where C.type = 'business') AS businesscars
FROM people P
WHERE P.id = 1;
The query above is just from a test setup that's a bit nonsense, but it serves well enough as example I think. The query I'm actually working on spans a number of complex tables which only distracts from the issue at hand.
In the example above, each record in the table "people" also has three additional columns: "wantsSportscar", "wantsFamilycar" and "wantsBusinesscar". Now what I want to do is only do the subselect of each additional column if the respective "wants....." field in the people table is set to "true". In other words, I only want to do the first subselect if P.wantsSportscar is set to true for that specific person. The second and third subselects should work in a similar manner.
So the way this query should work is that it shows the name of a specific person and the number of models available for the types of cars he wants to own. It might be worth noting that my final resultset will always only contain a single record, namely that of one specific user.
It's important that if a person is not interested in a certain type of cars, that the column for that type will not be included in the final resultset. An example to be sure this is clear:
If person A wants a sportscar and a familycar, the result would include the columns "name", "sportscars" and "familycars".
If person B wants a businesscar, the result would include the columns "name" and "businesscar".
I've been trying to use various combinations with IF, CASE and EXISTS statements, but so far I've not been able to get a syntactically correct solution. Does anyone know if this is even possible? Note that the query will be stored in a Stored Procedure.
In your case, there are 8 column layouts possible and to do this, you will need 8 separate queries (or build your query dynamically).
It's not possible to change the resultset layout within a single query.
Instead, you may design your query as follows:
SELECT P.name,
CASE WHEN P.wantsSportscar = 1 THEN (select count(*) from cars C where C.type = 'sports') ELSE NULL END AS sportscars,
CASE WHEN P.wantsFamilycar = 1 THEN (select count(*) from cars C where C.type = 'family') ELSE NULL END AS familycars,
CASE WHEN P.wantsBusinesscar = 1 THEN (select count(*) from cars C where C.type = 'business') ELSE NULL END AS businesscars
FROM people P
WHERE P.id = 1
This selects NULL in the appropriate column if a person doesn't want that type of car; you can then handle these NULLs on the client side.
Note that relational model answers the queries in terms of relations.
In your case, the relation is as follows: "this person's needs are satisfied by this many sports cars, this many business cars and this many family cars".
Relational model always answers this specific question with a quaternary relation.
It doesn't omit any of the relation members: instead, it just sets them to NULL which is the SQL's way to show that the member of a relation is not defined, applicable or meaningful.
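A minimal sketch of this fixed-layout approach, with the NULL handling done on the client: it uses Python's sqlite3 module instead of SQL Server, the "wants" column names follow the question, and the person and car rows (plus the cars.model column) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (
        id INTEGER PRIMARY KEY, name TEXT,
        wantsSportscar INTEGER, wantsFamilycar INTEGER, wantsBusinesscar INTEGER
    );
    CREATE TABLE cars (model TEXT, type TEXT);
    INSERT INTO people VALUES (1, 'Alice', 1, 1, 0);
    INSERT INTO cars VALUES
        ('S1', 'sports'), ('S2', 'sports'), ('F1', 'family'), ('B1', 'business');
    """
)

# The result always has four columns; unwanted types come back as NULL.
cur = conn.execute(
    """
    SELECT P.name AS name,
           CASE WHEN P.wantsSportscar = 1
                THEN (SELECT COUNT(*) FROM cars C WHERE C.type = 'sports') END AS sportscars,
           CASE WHEN P.wantsFamilycar = 1
                THEN (SELECT COUNT(*) FROM cars C WHERE C.type = 'family') END AS familycars,
           CASE WHEN P.wantsBusinesscar = 1
                THEN (SELECT COUNT(*) FROM cars C WHERE C.type = 'business') END AS businesscars
    FROM people P
    WHERE P.id = ?
    """,
    (1,),
)
row = cur.fetchone()
column_names = [col[0] for col in cur.description]

# Client side: drop the columns whose value is NULL (None in Python).
wanted = {name: value for name, value in zip(column_names, row) if value is not None}

print(row)
print(wanted)
```

The query itself keeps a static shape, which is what a stored procedure needs, and only the presentation layer decides which columns to show.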
I'm mostly an Oracle guy but there's a high chance the same applies. Unless I've misunderstood, what you want is not possible at that level - you will always have a static number of columns. Your query can control if the column is empty but since in the outer-most part of the query you have specified X number of columns, you are guaranteed to get X columns in your resultset.
As I said, I am unfamiliar with MS SQL Server but I'm guessing there will be some way of executing dynamic SQL, in which case you should research that since it should allow you to build a more flexible query.
You may be able to do what you want by first selecting the values as separate rows into a temp table, then doing a PIVOT on that table (turning the rows into columns).
"It's important that if a person is not interested in a certain type of cars, that the column for that type will not be included in the final resultset. An example to be sure this is clear:"
You will not be able to do it in plain SQL. I suggest you just make this column NULL or ZERO.
If you want the query to expand dynamically when new car types are added, then PIVOTing could help you somewhat.
There are three fundamentals you want to learn to make this work easy. The first is data normalization, the second is GROUP BY, and the third is PIVOT.
First, data normalization. Your design of the people table is not in first normal form. The columns "wantsports", "wantfamily", "wantbusiness" are really a repeating group, although they may not look like one. If you can modify the table design, you will find it advantageous to create a third table, let's call it "peoplewant", with two key columns, personid and cartype. I can go into detail about why this design will be more flexible and powerful if you like, but I'm going to skip that for now.
On to GROUP BY. This allows you to produce a result that summarizes each group in one row of the result.
SELECT
    p.name,
    c.type,
    count(*) as carcount
FROM people p
INNER JOIN peoplewant pw ON p.id = pw.personid
INNER JOIN cars c on pw.cartype = c.type
WHERE
    p.id = 1
GROUP BY
    p.name,
    c.type
This (untested) query gives you the result you want, except that the result has a separate row for each car type the person wants.
Finally, PIVOT. The PIVOT tool in your DBMS allows you to turn this result into a form where there is just one row for the person, and there is a separate column for each of the cartypes wanted by that person. I haven't used PIVOT myself, so I'll let somebody else edit this response to provide an example using PIVOT.
If you use the same technique to retrieve data for multiple people in one sweep, keep in mind that a column will appear for each wanted type that any person wants, and zeroes will appear in the PIVOT result for persons who do not want a car type that is in the result columns.
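To try the normalized design and the GROUP BY step, here is a small sketch using Python's sqlite3 module. The peoplewant table follows the description above; the sample people, the cars.model column and the car rows are assumptions made for the example. The PIVOT step is left out because SQLite has no PIVOT operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE peoplewant (
        personid INTEGER, cartype TEXT,
        PRIMARY KEY (personid, cartype)
    );
    CREATE TABLE cars (model TEXT, type TEXT);
    INSERT INTO people VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO peoplewant VALUES (1, 'sports'), (1, 'family'), (2, 'business');
    INSERT INTO cars VALUES
        ('S1', 'sports'), ('S2', 'sports'), ('F1', 'family'), ('B1', 'business');
    """
)

# One result row per car type the person wants; unwanted types never appear.
car_counts = conn.execute(
    """
    SELECT p.name, c.type, COUNT(*) AS carcount
    FROM people p
    INNER JOIN peoplewant pw ON p.id = pw.personid
    INNER JOIN cars c ON pw.cartype = c.type
    WHERE p.id = ?
    GROUP BY p.name, c.type
    ORDER BY c.type
    """,
    (1,),
).fetchall()

print(car_counts)
```

Adding a new car type then only requires new rows in cars and peoplewant, with no change to the table layout or the query.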
Just came across this post through a google search, so I realize I'm late to this party by a bit, but .. sure this really is possible to do... however, I wouldn't suggest actually doing it this way because it's usually considered a Very Bad Thing (tm).
Dynamic SQL is your answer.
Before I say how to do it, I want to preface this with, Dynamic SQL is a very dangerous thing, if you aren't sanitizing your input from the application.
So, therefore, proceed with caution:
declare @sqlToExecute nvarchar(max);
declare @includeSportsCars bit;
declare @includeFamilyCars bit;
declare @includeBusinessCars bit;
set @includeBusinessCars = 1;
set @includeFamilyCars = 1;
set @includeSportsCars = 1;
-- Each optional column carries its own leading comma, so any combination of flags yields valid SQL.
set @sqlToExecute = 'SELECT P.name';
if @includeSportsCars = 1
    set @sqlToExecute = @sqlToExecute + ', (select count(*) from cars C where C.type = ''sports'') AS sportscars';
if @includeFamilyCars = 1
    set @sqlToExecute = @sqlToExecute + ', (select count(*) from cars C where C.type = ''family'') AS familycars';
if @includeBusinessCars = 1
    set @sqlToExecute = @sqlToExecute + ', (select count(*) from cars C where C.type = ''business'') AS businesscars';
set @sqlToExecute = @sqlToExecute + ' FROM people P WHERE P.id = 1;';
exec(@sqlToExecute);
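The same idea can be sketched on the application side. The example below builds the column list in Python and runs it on SQLite via the sqlite3 module; it is an illustration under assumptions, not a drop-in replacement for the stored procedure. The helper build_query and the ALLOWED_TYPES whitelist are hypothetical names. Only whitelisted values are ever spliced into the SQL text, and the person id is passed as a bound parameter, which addresses the sanitizing concern mentioned above.

```python
import sqlite3

# Whitelist mapping car type -> column alias; nothing else reaches the SQL text.
ALLOWED_TYPES = {"sports": "sportscars", "family": "familycars", "business": "businesscars"}


def build_query(wanted_types):
    """Build a SELECT whose column list depends on the wanted car types."""
    columns = ["P.name AS name"]
    for car_type in wanted_types:
        alias = ALLOWED_TYPES[car_type]  # raises KeyError for unknown input
        columns.append(
            f"(SELECT COUNT(*) FROM cars C WHERE C.type = '{car_type}') AS {alias}"
        )
    return "SELECT " + ", ".join(columns) + " FROM people P WHERE P.id = ?"


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cars (model TEXT, type TEXT);
    INSERT INTO people VALUES (1, 'Alice');
    INSERT INTO cars VALUES
        ('S1', 'sports'), ('S2', 'sports'), ('F1', 'family'), ('B1', 'business');
    """
)

cur = conn.execute(build_query(["sports", "business"]), (1,))
dynamic_columns = [col[0] for col in cur.description]
dynamic_row = cur.fetchone()

print(dynamic_columns)
print(dynamic_row)
```

Because the column list is joined with ", ", any subset of types, including none at all, produces syntactically valid SQL.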