SQL, to loop or not to loop? - sql

the problem story goes like:
consider a program to manage bank accounts with balance limits for each customer
{table Customers, table Limits} where for each Customer.id there's one Limit record
then the client said to store a history for the limits' changes, it's not a problem since I've already had date column for Limit but the active/latest limits's view-query needs to be changed
before: Customer-Limit was 1 to 1 so a simple select did the job
now: it would show all the Limits' records which means multiple records for each Customers and I need the latest Limits only so I thought of something like this pseudo code
foreach( id in Customers)
{
select top 1 *
from Limits
where Limits.customer_id = id
order by Limits.date
}
but while looking through SO for similar issues, I came across stuff like
"95% of the time when you need a looping structure in tSQL you are probably doing it wrong"-JohnFx
and
"SQL is primarily a set-orientated language - it's generally a bad idea to use a loop in it."-Mark Bannister
can anyone confirm/explain why is it wrong to loop? and in the explained problem above, what am I getting wrong that I need to loop?
thanks in advance
update : my solution
in light of TomTom's answer & suggested link here and before Dean kindly answered with code I came up with this
SELECT *
FROM Customers c
LEFT JOIN Limits a ON a.customer_id = c.id
AND a.date =
(
SELECT MAX(date)
FROM Limits z
WHERE z.customer_id = a.customer_id
)
thought I'd share :>
thanks for your response,
happy coding

Will this do?
;with l as (
select *, row_number() over(partition by customer_id order by date desc) as rn
from limits
)
select *
from customers c
left join l on c.customer_id = l.customer_id and l.rn = 1

I am assuming that earlier (i.e. before implementing the history functionality) you must be updating the Limits table. Now, for implementing the history functionality you have started inserting new records. Doesnt this trigger a lot of changes in your databases and code?
Instead of inserting new records, how about keeping the original functionality as is and creating a new table say Limits_History which will store all the old values from Limits table before updating it? Then all you need to do is fetch records from this table if you want to show history. This will not cause any changes in your existing SPs and code hence will be less error prone.
To insert record in the Limits_History table, you can simply create an AFTER TRIGGER and use the deleted magic table. Hence you need not worry about calling an SP or something to maintain history. The trigger will do this for you. Good examples of trigger are here
Hope this helps

It is wrong. You can do the same by quyting customers and limits with a subquery limiting to the most recent record on limit.
This is similar in concept to the query presented in Most recent record in a left join
You may have to do so in 2 joins - get most recent date, then get limit for the date. While this may look complex - it is a beginner issue, talk complex when you have sql statements reaching 2 printed pages and more ;)
Now, for an operational system the table design is broken - limits should contain the mos trecent limit, and a LimitHistory table the historical (or: all) entries, allowing fast retrieval of the CURRENT limit (which will be the one to apply to all transaction) without the overhead of the history. The table design you have assumes all limits are identical - that may be the truth (is the truth) for a reporting data warehouse, but is wrong for a transactional system as the history is not transacted.

Confirmation for why loop is wrong is exactly in the quoted parts in your question - SQL is a set-orientated language.
This means when you work on sets there's no reason to loop through the single rows, because you already have the 'result' (set) of data you want to work on.
Then the work you are doing should be done on the set of rows, because otherwise your selection is wrong.
That being said there are of course situations where looping is done in SQL and it will generally be done via cursors if on data, or done via a while loop if calculating stuff. (generally, exceptions always change).
However, as also mentioned in the quotes, often when you feel like using a loop you either shouldn't (it's poor performance) or you're doing logic in the wrong part of your application.
Basically - it is similar to how object orientated languages works on objects and references to said objects. Set based language works on - well, sets of data.
SQL is basically made to function in that manner - query relational data into result sets - so when working with the language, you should let it do what it can do and work on that. Just as if it was Java or any other language.

Related

Why is there no `select last` or `select bottom` in SQL Server like there is `select top`?

I know this might probably sound like a stupid question, but please bear with me.
In SQL-server we have
SELECT TOP N ...
now in that we can get the first n rows in ascending order (by default), cool. If we want records to be sorted on any other column, we just specify that in the order by clause, something like this...
SELECT TOP N ... ORDER BY [ColumnName]
Even more cool. But what if I want the last row? I just write something like this...
SELECT TOP N ... ORDER BY [ColumnName] DESC
But there is a slight concern with that. I said concern and not issue because it isn't actually an issue. By this way, I could get the last row based on that column, but what if I want the last row that was inserted. I know about SCOPE_IDENTITY, IDENT_CURRENT and ##IDENTITY, but consider a heap (a table without a clustered index) without any identity column, and multiple accesses from many places (please don't go into this too much as to how and when these multiple operation are happening, this doesn't concern the main thing). So in this case there doesn't seems to be an easy way to find which row was actually inserted last. Some might answer this as
If you do a select * from [table] the last row shown in the sql result window will be the last one inserted.
To anything thinking about this, this is not actually the case, at least not always and one that you can always rely on (msdn, please read the Advanced Scanning section).
So the question boils down to this, as in the title itself. Why doesn't SQL Server have a
SELECT LAST
or say
SELECT BOTTOM
or something like that, where we don't have to specify the Order By and then it would give the last record inserted in the table at the time of executing the query (again I am not going into details about how would this result in case of uncommitted reads or phantom reads).
But if still, someone argues that we can't talk about this without talking about these read levels, then, for them, we could make it behave as the same way as TOP work but just the opposite. But if your argument is then we don't need it as we can always do
SELECT TOP N ... ORDER BY [ColumnName] DESC
then I really don't know what to say here. I know we can do that, but are there any relation based reason, or some semantics based reason, or some other reason due to which we don't have or can't have this SELECT LAST/BOTTOM. I am not looking for way to does Order By, I am looking for reason as to why do don't have it or can't have it.
Extra
I don't know much about how NOSQL works, but I've worked (just a little bit) with mongodb and elastic search, and there too doesn't seems to be anything like this. Is the reason they don't have it is because no one ever had it before, or is it for some reason not plausible?
UPDATE
I don't need to know that I need to specify order by descending or not. Please read the question and understand my concern before answering or commenting. I know how will I get the last row. That's not even the question, the main question boils down to why no select last/bottom like it's counterpart.
UPDATE 2
After the answers from Vladimir and Pieter, I just wanted to update that I know the the order is not guaranteed if I do a SELECT TOP without ORDER BY. I know from what I wrote earlier in the question might make an impression that I don't know that's the case, but if you just look a further down, I have given a link to msdn and have mentioned that the SELECT TOP without ORDER BY doesn't guarantees any ordering. So please don't add this to your answer that my statement in wrong, as I have already clarified that myself after a couple of lines (where I provided the link to msdn).
You can think of it like this.
SELECT TOP N without ORDER BY returns some N rows, neither first, nor last, just some. Which rows it returns is not defined. You can run the same statement 10 times and get 10 different sets of rows each time.
So, if the server had a syntax SELECT LAST N, then result of this statement without ORDER BY would again be undefined, which is exactly what you get with existing SELECT TOP N without ORDER BY.
You have stressed in your question that you know and understand what I've written below, but I'll still keep it to make it clear for everyone reading this later.
Your first phrase in the question
In SQL-server we have SELECT TOP N ... now in that we can get the
first n rows in ascending order (by default), cool.
is not correct. With SELECT TOP N without ORDER BY you get N "random" rows. Well, not really random, the server doesn't jump randomly from row to row on purpose. It chooses some deterministic way to scan through the table, but there could be many different ways to scan the table and server is free to change the chosen path when it wants. This is what is meant by "undefined".
The server doesn't track the order in which rows were inserted into the table, so again your assumption that results of SELECT TOP N without ORDER BY are determined by the order in which rows were inserted in the table is not correct.
So, the answer to your final question
why no select last/bottom like it's counterpart.
is:
without ORDER BY results of SELECT LAST N would be exactly the same as results of SELECT TOP N - undefined.
with ORDER BY result of SELECT LAST N ... ORDER BY X ASC is exactly the same as result of SELECT TOP N ... ORDER BY X DESC.
So, there is no point to have two key words that do the same thing.
There is a good point in the Pieter's answer: the word TOP is somewhat misleading. It really means LIMIT result set to some number of rows.
By the way, since SQL Server 2012 they added support for ANSI standard OFFSET:
OFFSET { integer_constant | offset_row_count_expression } { ROW | ROWS }
[
FETCH { FIRST | NEXT } {integer_constant | fetch_row_count_expression } { ROW | ROWS } ONLY
]
Here adding another key word was justified that it is ANSI standard AND it adds important functionality - pagination, which didn't exist before.
I would like to thank #Razort4x here for providing a very good link to MSDN in his question. The "Advanced Scanning" section there has an excellent example of mechanism called "merry-go-round scanning", which demonstrates why the order of the results returned from a SELECT statement cannot be guaranteed without an ORDER BY clause.
This concept is often misunderstood and I've seen many question here on SO that would greatly benefit if they had a quote from that link.
The answer to your question
Why doesn't SQL Server have a SELECT LAST or say SELECT BOTTOM or
something like that, where we don't have to specify the ORDER BY and
then it would give the last record inserted in the table at the time
of executing the query (again I am not going into details about how
would this result in case of uncommitted reads or phantom reads).
is:
The devil is in the details that you want to omit. To know which record was the "last inserted in the table at the time of executing the query" (and to know this in a somewhat consistent/non-random manner) the server would need to keep track of this information somehow. Even if it is possible in all scenarios of multiple simultaneously running transactions, it is most likely costly from the performance point of view. Not every SELECT would request this information (in fact very few or none at all), but the overhead of tracking this information would always be there.
So, you can think of it like this: by default the server doesn't do anything specific to know/keep track of the order in which the rows were inserted, because it affects performance, but if you need to know that you can use, for example, IDENTITY column. Microsoft could have designed the server engine in such a way that it required an IDENTITY column in every table, but they made it optional, which is good in my opinion. I know better than the server which of my tables need IDENTITY column and which do not.
Summary
I'd like to summarise that you can look at SELECT LAST without ORDER BY in two different ways.
1) When you expect SELECT LAST to behave in line with existing SELECT TOP. In this case result is undefined for both LAST and TOP, i.e. result is effectively the same. In this case it boils down to (not) having another keyword. Language developers (T-SQL language in this case) are always reluctant to add keywords, unless there are good reasons for it. In this case it is clearly avoidable.
2) When you expect SELECT LAST to behave as SELECT LAST INSERTED ROW. Which should, by the way, extend the same expectations to SELECT TOP to behave as SELECT FIRST INSERTED ROW or add new keywords LAST_INSERTED, FIRST_INSERTED to keep existing keyword TOP intact. In this case it boils down to the performance and added overhead of such behaviour. At the moment the server allows you to avoid this performance penalty if you don't need this information. If you do need it IDENTITY is a pretty good solution if you use it carefully.
There is no select last because there is no need for it. Consider a "select top 1 * from table" . Top 1 would get you the first row that is returned. And then the process stops.
But there is no guarantees about ordering if you don't specify an order by. So it may as well be any row in the dataset you get back.
Now do a "select last 1 * from table". Now the database will have to process all the rows in order to get you the last one.
And because ordering is non-deterministic, it may as well be the same result as from the select "top 1".
See now where the problem comes? Without an order by top and last are actually the same, only "last" will take more time. And with an order by, there's really only a need for top.
SELECT TOP N ...
now in that we can get the first n rows in ascending order (by
default), cool. If we want records to be sorted on any other column,
we just specify that in the order by clause, something like this...
What you say here is totally wrong and absolutely NOT how it works. There is no guarantee on what order you get. Ascending order on what ?
create table mytest(id int, id2 int)
insert into mytest(id,id2)values(1,5),(2,4),(3,3),(4,2),(5,1)
select top 1 * from mytest
select * from mytest
create clustered index myindex on mytest(id2)
select top 1 * from mytest
select * from mytest
insert into mytest(id,id2)values(6,0)
select top 1 * from mytest
Try this code line by line and see what you get with the last "select top 1".....you get in this case the last inserted record.
update
I think you understand that "select top 1 * from table" basically means: "Select a random row from the table".
So what would last mean? "Select the last random row from the table?" Wouldn't the last random row from a table be conceptually the same as saying any 1 random row from the table? And if that's true, top and last are the same, so there is no need for last.
Update 2
In hindsight I was happier with the syntax mysql uses : LIMIT.
Top doesn't say anything about ordering it is only there to specify the number of rows to be returned.
Limits the rows returned in a query result set to a specified number of rows or percentage of rows in SQL Server 2014.
The reasons why SELECT LAST_INSERTED does not make sense.
It cannot be easily applied to non-heap tables.
Heap data can be freely moved by DBMS so those "natural" order is subject to change. To keep it the system needs some additional mechanism which seems to be a useless waste.
If really desired it can be simulated by adding some 'auto-increment' column.
SQL Server ordering is arbitrary unless otherwise stated. It's set based, therefore you must define what your set is. Correct SCOPE_IDENTITY() is the correct way to capture the last inserted record, or the OUTPUT clause. Why would you do inserts on a heap that you need to reference chronologically anyway?? That's super bad database design.

Having Trouble Running An SQL Update Statement

Forgive me but I'm relatively new to SQL.
I am trying to update a column of a table I created with a function I created but when I run the Update Statement, nothing happens, I just see the underscore flashing (I'm assuming its trying to run it). The Update Statement is updating around 60,000 fields so I assume it should take a little while but it's been 10 minutes and no good.
I would just like to know if anyone knows just some general reasons that the underscore may be flashing. I know this is super general but I've just never seen this before.
Here's an image of what I'm talking about:
http://i.imgur.com/Xk3kM2U.png?1
EDIT: There are exactly 67,662 records in the table.
I've also just screenshotted the query and linked it.
Your old-style joins have no join condition between the ap1/r1 pair and the ap2/r2 pair, so you're calling your calc_distance() function for 67,662 * 67,622 combinations of coordinates. The use of distinct is potentially a warning that you know you're getting duplicates. And then there is no correlation between the subquery and the update itself, so you're repeating that for each row in temproute. That will take a while.
It looks like you maybe don't want to be looking at the source airport from two copies of the route table; but the source and destination airports from a single copy.
Something like (untested):
UPDATE temproute tr
SET distance = (
SELECT calc_distance(ap2.latitude, ap2.longitude, ap1.latitude, ap1.longitude)
FROM routes r
JOIN airports ap1 ON ap1.icaoairport = r.sourceid
JOIN airports ap2 ON ap2.icaoairport = r.destid
WHERE r.routeid = tr.routeid
);
If temproute is a copy of route too, which the line count implies, then you don't need to refer to route directly at all in the subquery, perhaps.
But I'm speculating about what you're doing.

Query to Find Adjacent Date Records

There exists in my database a page_history table; the idea is that whenever a record in the page table is changed, that record's old values are stored in the history table.
My job now is to find occasions in which a record was changed, and retrieve the pre- and post-conditions of that change. Specifically, I want to know when a page changed groups, and what groups were involved in the change. The query I have below can find these instances, but with the use of the min function, I can only get back the values that match between the two records:
select page_id,
original_group,
min(created2) change_date
from (select h.page_id,
h.group_id original_group,
i.group_id new_group,
h.created_dttm created1,
i.created_dttm created2
from page_history h,
page_history i
where h.page_id = i.page_id
and h.created_dttm < i.created_dttm
and h.group_id != i.group_id)
group by page_id, original_group, created1
order by page_id
When I try to get, say, any details of the second record, like new_group, I'm hit with a ORA-00979: not a GROUP BY expression error. I don't want to group by new_group, though, because that's going to destroy the logic (I think it would find records displaying times a page changed from a group to another group, regardless of any changes to other groups in between).
My question, then, is how can I modify this query, or go about writing a new one, that achieves a similar end, but with the added availability of columns that do not match between the two records? In essence, how can I find that min record without sacrificing all the other columns I'm not trying to compare? I don't exactly need a complete answer, any suggestions that point me in the right direction would be appreciated.
I use PL/SQL Developer, and it looks like version 11.2.0.2.0 of Oracle.
EDIT: I have found a solution. It's not pretty, and I'd still like to see some alternatives, but if helping me out would threaten to explode your brain, I would advise relocating to an easier question.
Without seeing your table structure it's hard to re-write the query but when you have a min function used like that it invariably seems better to put it into a separate sub select to get what you want and then compare the result of that.

Learning ExecuteSQL in FMP12, a few questions

I have joined a new job where I am required to use FileMaker (and gradually transition systems to other databases). I have been a DB Admin of a MS SQL Server database for ~2 years, and I am very well versed in PL/SQL and T-SQL. I am trying to pan my SQL knowledge to FMP using the ExecuteSQL functionaloty, and I'm kinda running into a lot of small pains :)
I have 2 tables: Movies and Genres. The relevant columns are:
Movies(MovieId, MovieName, GenreId, Rating)
Genres(GenreId, GenreName)
I'm trying to find the movie with the highest rating in each genre. The SQL query for this would be:
SELECT M.MovieName
FROM Movies M INNER JOIN Genres G ON M.GenreId=G.GenreId
WHERE M.Rating=
(
SELECT MAX(Rating) FROM Movies WHERE GenreId = M.GenreId
)
I translated this as best as I could to an ExecuteSQL query:
ExecuteSQL ("
SELECT M::MovieName FROM Movies M INNER JOIN Genres G ON M::GenreId=G::GenreId
WHERE M::Rating =
(SELECT MAX(M2::Rating) FROM Movies M2 WHERE M2::GenreId = M::GenreId)
"; "" ; "")
I set the field type to Text and also ensured values are not stored. But all I see are '?' marks.
What am I doing incorrectly here? I'm sorry if it's something really stupid, but I'm new to FMP and any suggestions would be appreciated.
Thank you!
--
Ram
UPDATE: Solution and the thought process it took to get there:
Thanks to everyone that helped me solve the problem. You guys made me realize that traditional SQL thought process does not exactly pan to FMP, and when I probed around, what I realized is that to best use SQL knowledge in FMP, I should be considering each column independently and not think of the entire result set when I write a query. This would mean that for my current functionality, the JOIN is no longer necessary. The JOIN was to bring in the GenreName, which is a different column that FMP automatically maps. I just needed to remove the JOIN, and it works perfectly.
TL;DR: The thought process context should be the current column, not the entire expected result set.
Once again, thank you #MissJack, #Chuck (how did you even get that username?), #pft221 and #michael.hor257k
I've found that FileMaker is very particular in its formatting of queries using the ExecuteSQL function. In many cases, standard SQL syntax will work fine, but in some cases you have to make some slight (but important) tweaks.
I can see two things here that might be causing the problem...
ExecuteSQL ("
SELECT M::MovieName FROM Movies M INNER JOIN Genres G ON
M::GenreId=G::GenreId
WHERE M::Rating =
(SELECT MAX(M2::Rating) FROM Movies M2 WHERE M2::GenreId = M::GenreId)
"; "" ; "")
You can't use the standard FMP table::field format inside the query.
Within the quotes inside the ExecuteSQL function, you should follow the SQL format of table.column. So M::MovieName should be M.MovieName.
I don't see an AS anywhere in your code.
In order to create an alias, you must state it explicitly. For example, in your FROM, it should be Movies AS M.
I think if you fix those two things, it should probably work. However, I've had some trouble with JOINs myself, as my primary experience is with FMP, and I'm only just now becoming more familiar with SQL syntax.
Because it's incredibly hard to debug SQL in FMP, the best advice I can give you here is to start small. Begin with a very basic query, and once you're sure that's working, gradually add more complicated elements one at a time until you encounter the dreaded ?.
There's a number of great posts on FileMaker Hacks all about ExecuteSQL:
Since you're already familiar with SQL, I'd start with this one: The Missing FM 12 ExecuteSQL Reference. There's a link to a PDF of the entire article if you scroll down to the bottom of the post.
I was going to recommend a few more specific articles (like the series on Robust Coding, or Dynamic Parameters), but since I'm new here and I can't include more than 2 links, just go to FileMaker Hacks and search for "ExecuteSQL". You'll find a number of useful posts.
NB If you're using FMP Advanced, the Data Viewer is a great tool for testing SQL. But beware: complex queries on large databases can sometimes send it into fits and freeze the program.
The first thing to keep in mind when working with FileMaker and ExecuteSQL() is the difference between tables and table occurrences. This is a concept that's somewhat unique to FileMaker. Succinctly, tables store the data, but table occurrences define the context of that data. Table occurrences are what you're seeing in FileMaker's relationship graph, and the ExecuteSQL() function needs to reference the table occurrences in its query.
I agree with MissJack regarding the need to start small in building the SQL statement and use the Data Viewer in FileMaker Pro Advanced, but there's one more recommendation I can offer, which is to use SeedCode's SQL Explorer. It does require the adding of table occurrences and fields to duplicate the naming in your existing solution, but this is pretty easy to do and the file they offer includes a wizard for building the SQL query.

How could i write this code in a more performant way?

In our app people have 1 or multiple projects. These projects have a start and an end date. People have a limited amount of available days.
Now we have a page that displays the availability of a given person on a week by week basis. It currently shows 18 weeks.
The way we currently calculate the available time for a given week is like this:
def days_available(query_date=Date.today)
days_engaged = projects.current.where("start_date < ? AND finish_date > ?", query_date, query_date).sum(:days_on_project)
available = days_total - hours_engaged
end
This means that to display the page descibed above the app will fire 18(!) queries into the database. We have pages that lists the availability of multiple people in a table. For these pages the amount of queries is quickly becomes staggering.
It is also quite slow.
How could we handle the availability retrieval in a more performant manner?
This is quite a common scenario when working with date ranges in an entity. Easy and fastest way is in SQL:
Join your events to a number generated date table (see generate days from date range) so that you have a row for each day a person or people are occupied. Once you have the data in this form it is simply a matter of grouping by the week date part of the date and counting the rows per grouping.
You can extend this to group by person for multiple person queries.
From a SQL point of view, I'd advise using a stored procedure and pass in your date/range requirement, you can then return a recordset for a user or possibly multiple users. This way your code just has to access db once.
You can then output recordset data in one go, by iterating through.
Hope this helps.
USE Stored procedure to fire your query to SQL to get data.
Pass paramerts in your case it is today's date to the SQl query.
Apply your conditions and Logic in the SQL Stored procedure , Using procedure is the goood and fastest way to retrieve data from the SQL , also it will prevent your code from the SQL injection too.
Call that SP from your Code as i dont know the Ruby on raisl I cant provide you steps about how to Call the Stored procedure from it.
After that the data fdetched as per you stored procedure will be available in Data table or something like that.
After getting the data you can perform all you need
Hope this helps
see what query is executed. further you may make comand explain to your query
explain select * from project where start_date < any_date and end_date> any_date2
you see the plan of query . Use this plan to optimized your query.
for example :
if you have index using field end_date replace a condition(end_date> any_date2 and start_date < any_date) . this step will using index if you have index on this field. But it step is db dependent . example is for nysql. if you want use index in mysql you must have using index condition on left part of where
There's not really enough information in your question to know exactly what you're trying to achieve here, e.g. the code snippet doesn't make use of the returned database query, so you could just remove it to make it faster. Perhaps this is just a bug in the code you posted?
Having said that, there are some techniques you should look into to implement your functionality.
I would take a look at using data warehouse techniques. I would think of your 'availability information' as a Fact table in a star schema, with 'Dates' and 'People' as Dimension tables.
You can then use queries to get stuff like - list of users for this projects for this week, and their availability.
Data warehousing has a whole bunch of resources you can tap into to help make this perform well, but there's also a lot of terminology that can be confusing, but for this type of 'I need to slice and dice my data across several sets of things (people and time)', Data Warehousing techniques can be quite powerful.
As I dont understand ruby on rails,from sql point of view i suggest you to write a stored procedure and return a dataset.And do the necessary table operations on the dataset from front end.It will reduce the unnecessary calls to DB.