MySQL select from db efficiency and time-consumption

Which one of those will take the least amount of time?
1.
q = 'SELECT * FROM table;'
result = db->query(q)
while (row = mysql_fetch_assoc(result))
{
    userName = row['userName'];
    userPassword = row['userPassword'];
    if (enteredN == userName && enteredP == userPassword)
        return true;
}
return false;
2.
q = "SELECT * FROM table WHERE userName='" . enteredN . "' AND userPassword='" . enteredP . "';"
...

The second one is guaranteed to be significantly faster on just about any database management system. You may not notice a difference if you have just a few rows in the table, but when you get to have thousands of rows, it will become quite obvious.
As a general rule, you should let the database management system handle the filtering, grouping, and sorting that you need; database management systems are designed to do those things and are generally highly optimized for them.
If this is a frequently used query, make sure you use an index on the unique username field.
As soulmerge brings up, you do need to be careful with SQL injection; see the PHP Manual for information on how to protect against it. While you're at it, you should seriously consider storing password hashes, not the passwords themselves; the MySQL Manual has information on various functions that can help you do this.
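For reference, a hedged sketch of both suggestions in MySQL syntax. The table and column names come from the question; the index name and sample values are made up, and in PHP you would normally bind the parameters through mysqli or PDO rather than session variables (and store a password hash, as noted above):
CREATE UNIQUE INDEX idx_username ON `table` (userName);
PREPARE login_stmt FROM
    'SELECT 1 FROM `table` WHERE userName = ? AND userPassword = ? LIMIT 1';
SET @n = 'alice', @p = 'secret';    -- placeholder values; real code binds the user input
EXECUTE login_stmt USING @n, @p;    -- returns a row only when the credentials match
DEALLOCATE PREPARE login_stmt;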

Number 2 is bound to be much, much faster (unless the table contains only a few rows).
It is not only because SQL servers are very efficient at filtering (and they can use indexes etc., which the loop in #1 doesn't have access to), but also because #1 may cause a very significant amount of data to be transferred from the server to the PHP application.
Furthermore, solution #1 would send every user's login credentials over the wire, exposing them to possible snooping somewhere along the line. Note that solution #2 also carries a potential security risk if the channel between the SQL server and the application is not secure; the risk is somewhat lessened because the full set of account details is not transmitted with every single login attempt.
Beyond this risk (which speaks to internal security), there is also the very real risk of SQL injection from outside attackers (as pointed out by others in this post). This, however, can be addressed by escaping the single-quote characters contained in end-user-supplied strings (or, better, by using parameterized queries).

Is this INSERT likely to cause any locking/concurrency issues?

In an effort to avoid auto sequence numbers and the like for one reason or another in this particular database, I wondered if anyone could see any problems with this:
INSERT INTO user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1) FROM user;
I'm using PostgreSQL (but also trying to be as database-agnostic as possible).
EDIT:
There are two reasons for me wanting to do this:
Keeping dependency on any particular RDBMS low.
Not having to worry about updating sequences if the data is batch-updated to a central database.
Insert performance is not an issue as the only tables where this will be needed are set-up tables.
EDIT-2:
The idea I'm playing with is that each table in the database have a human-generated SiteCode as part of their key, so we always have a compound key. This effectively partitions the data on SiteCode and would allow taking the data from a particular site and putting it somewhere else (obviously on the same database structure). For instance, this would allow backing up of various operational sites onto one central database, but also allow that central database to have operational sites using it.
I could still use sequences, but it seems messy. The actual INSERT would look more like this:
INSERT INTO user (sitecode, label, username, password, user_id)
SELECT 'SITE001', 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1)
FROM user
WHERE sitecode='SITE001';
If that makes sense.
I've done something similar before and it worked fine, however the central database in that case was never operational (it was more of a way of centrally viewing data / analyzing) so it did not need to generate ids.
EDIT-3:
I'm starting to think it'd be simpler to only ever allow the centralised database to be either active-only or backup-only, thus avoiding the problem completely and allowing a more simple design.
Oh well back to the drawing board!
There are a couple of points:
Postgres uses Multi-Version Concurrency Control (MVCC), so readers never wait on writers and vice versa. But there is of course a serialization that happens upon each write. If you are going to bulk-load data into the system, look at the COPY command. It is much faster than running a large batch of INSERT statements.
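A minimal COPY sketch (PostgreSQL syntax; the file path and column list are assumptions, and "user" is quoted because it is a reserved word):
COPY "user" (label, username, password, user_id)
FROM '/tmp/users.csv'
WITH (FORMAT csv, HEADER true);   -- server-side file; use \copy in psql for a client-side file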
The MAX(user_id) can be answered with an index, and probably is, if there is an index on the user_id column. But the real problem is that if two transactions start at the same time, they will see the same MAX(user_id) value. It leads me to the next point:
The canonical way of handling numbers like user_ids is with SEQUENCEs. These are essentially a place from which you can draw the next user id. If you are really worried about the performance of generating the next sequence number, you can generate a batch of them per thread and then only request a new batch when it is exhausted (sometimes called a HiLo sequence).
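A minimal sketch of the sequence approach (PostgreSQL syntax; the sequence name is an assumption, and CACHE preallocates a block of values per session, which is similar in spirit to a HiLo scheme):
CREATE SEQUENCE user_id_seq CACHE 20;
INSERT INTO "user" (label, username, password, user_id)   -- "user" quoted: reserved word
VALUES ('Test', 'test', 'test', nextval('user_id_seq'));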
You may want the user_ids packed up nice and tight as increasing numbers, but I think you should try to let go of that. The reason is that deleting a user will create a hole anyway, so I'd not worry too much if the sequence values are not strictly contiguous.
Yes, I can see a huge problem. Don't do it.
Multiple connections can get the EXACT SAME id at the same time. I was going to add "under load", but it doesn't even need to be under load; it just needs the right timing between two queries.
To avoid it, you can use transactions or locking mechanisms or isolation levels specific to each DB, but once we get to that stage, you might as well use the dbms-specific sequence/identity/autonumber etc.
EDIT
Regarding EDIT-2 of the question: there is no reason to fear gaps in user_id, so you can have one sequence across all sites. If gaps are OK, some options are:
use guaranteed update statements, such as (in SQL Server):
UPDATE tblSiteSequenceNo SET @nextnum = nextnum = nextnum + 1
Multiple callers of this statement are each guaranteed to get a unique number (see the sketch below).
use a single table that produces the identity/sequence/autonumber (db specific)
If you cannot have gaps at all, consider using a transaction mechanism that restricts access while you are running the MAX() query. Either that, or use a proliferation of sequences/tables with identity or autonumber columns, manipulated via dynamic SQL using the same technique as for a single sequence.
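A fuller sketch of the guaranteed-update option (SQL Server syntax; the one-row table tblSiteSequenceNo(nextnum int) is an assumption):
DECLARE @nextnum int;
UPDATE tblSiteSequenceNo
    SET @nextnum = nextnum = nextnum + 1;       -- increments and captures the new value in one statement
INSERT INTO [user] (label, username, password, user_id)
    VALUES ('Test', 'test', 'test', @nextnum);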
By all means use a sequence to generate unique numbers. They are fast, transaction safe and reliable.
Any self-written implementation of a "sequence generator" is either not scalable for a multi-user environment (because you need to do heavy locking) or simply not correct.
If you do need to be DBMS-independent, then create an abstraction layer that uses sequences for those DBMS that support them (Postgres, Oracle, Firebird, DB2, Ingres, Informix, ...) and a self-written generator on those that don't.
Trying to create a system than is DBMS independent, simply means it will run equally slow on all systems if you don't exploit the advantages of each DBMS.
Your goal is a good one. Avoiding IDENTITY and AUTOINCREMENT columns means avoiding a whole plethora of administration problems. Here is just one example of the many.
However, most responders on SO will not appreciate it; the popular (as opposed to technical) response is "always stick an AUTOINCREMENT Id column on everything that moves".
A next-sequential number is fine, all vendors have optimised it.
As long as this code is inside a Transaction, as it should be, two users will not get the same MAX()+1 value. There is a concept called Isolation Level which needs to be understood when coding Transactions.
Getting away from user_id and onto a more meaningful key such as ShortName or State plus UserNo is even better (the former spreads the contention, the latter avoids the next-sequential contention altogether; relevant for high-volume systems).
What MVCC promises and what it actually delivers are two different things. Just surf the net or search SO to view the hundreds of problems re PostgreSQL/MVCC. In the realm of computers, the laws of physics apply; nothing is free. MVCC stores private copies of all rows touched and resolves collisions at the end of the Transaction, resulting in far more Rollbacks, whereas 2PL blocks at the beginning of the Transaction and waits, without the massive storage of copies.
Most people with actual experience of MVCC do not recommend it for high-contention, high-volume systems.
The first example code block is fine.
As per Comments, this item no longer applies: The second example code block has an issue. "SITE001" is not a compound key, it is a compounded column. Do not do that; separate "SITE" and "001" into two discrete columns. And if "SITE" is a fixed, repeating value, it can be eliminated.
Different users can end up with the same user_id, because concurrent SELECT statements will see the same MAX(user_id).
If you don't want to use a SEQUENCE, you have to use an extra table with a single record and update this single record every time you need a new unique id:
CREATE TABLE my_sequence(id INT);
INSERT INTO my_sequence(id) VALUES (0);  -- seed the single record that holds the counter
BEGIN;
UPDATE my_sequence SET id = COALESCE(id, 0) + 1;
INSERT INTO user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', id FROM my_sequence;
COMMIT;
I agree with maksymko, but not because I dislike sequences or autoincrementing numbers, as they have their place. If you need a value to be unique throughout your "various operational sites" i.e. not only within the confines of the single database instance, a globally unique identifier is a robust, simple solution.
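A minimal sketch of the GUID approach (PostgreSQL syntax; gen_random_uuid() is built in from version 13 and needs the pgcrypto extension on older versions, which is an assumption about the setup):
CREATE TABLE "user" (
    user_id  uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- globally unique across all sites
    sitecode text NOT NULL,
    label    text,
    username text,
    password text
);
INSERT INTO "user" (sitecode, label, username, password)
VALUES ('SITE001', 'Test', 'test', 'test');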

SQL Performance Question

I have a question regarding the performance of SQL. I will illustrate my problem with pseudocode.
I am wondering which will perform better and by how much, say for 10 items on each page load, in .NET. Is it a lot faster? A little faster? A difference not noticeable on the SQL side?
foreach(item in mylist) {
CallSQLStoredProc(item.id);
}
vs
int[] ids; // array of ids
CallSQLStoredProc(ids) // stored procedure returns more than one row for each id
The second option will certainly be faster because it is a single network round trip, as well as a single SP call.
Definitely the second, varying from about 10x faster to a little faster.
If whatever you're doing with the ids can be done as a set operation, you'll get several times the performance compared to calling the SP individually.
I regularly have procs that look like:
create procedure proc ( @ids varchar(max) ) as
select * from users_tbl u
inner join splitCSVs(@ids) c
    on c.id = u.id
--so on and so forth
That's a set-based operation, as opposed to the procedural method of using a cursor in the proc, or using the for loop to iterate over calling the procedure with an individual id.
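splitCSVs here is a user-defined table-valued function rather than a built-in; a minimal, loop-based sketch of what it might look like (SQL Server 2008+ syntax, assuming integer ids; SQL Server 2016+ could use the built-in STRING_SPLIT instead):
CREATE FUNCTION splitCSVs (@ids varchar(max))
RETURNS @result TABLE (id int)
AS
BEGIN
    DECLARE @pos int = CHARINDEX(',', @ids);
    WHILE @pos > 0
    BEGIN
        INSERT INTO @result (id) VALUES (CAST(LEFT(@ids, @pos - 1) AS int));  -- token before the comma
        SET @ids = SUBSTRING(@ids, @pos + 1, LEN(@ids));                      -- drop it and continue
        SET @pos = CHARINDEX(',', @ids);
    END
    IF LEN(@ids) > 0
        INSERT INTO @result (id) VALUES (CAST(@ids AS int));                  -- last token
    RETURN;
END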
Since this wouldn't fit in a comment for ocdecio's answer...
Just to expand on it... in most systems that I've seen the network traffic is the limiting factor for performance (assuming a reasonably tuned database and front-end code that isn't absolutely horrible). Even if your web server and database server are on the same machine, the interprocess communication can be a limiting factor if you have frequent calls back and forth between the two.
On each page load, or the first time the page is loaded? I would not want to do that for every single postback in an ASP.NET page.
To answer your question more directly, if you're able to get multiple records by sending multiple IDs, do that. More efficient and more scalable should you ever need more than 10 items.
It all depends on how the proc is coded: if you pass in 10 items to the second proc and that proc then uses a cursor to get those rows, then the first call might be faster.
Iterating anything is always going to cause more overhead. There aren't many situations where iteration improves performance.
My advice has always been to avoid two things in programming:
if-then-else statements
iteration
You will always have situations where you will use both, but the less you use them, the more potential your application has to run faster and more smoothly.
How much faster the second will be really depends on too many things. The network overhead might be insignificant compared to the size of your result sets.
There is another alternative (which should be faster than either depending on the locking behavior), which is to call all of them asynchronously - then your page can effectively complete when the longest one completes. Obviously, this will require some additional coding.
In this example, there is only one SP overhead. We'll assume the SP returns either a single rowset which the client will split/process or multiple rowsets:
int[] ids; // array of ids
CallSQLStoredProc(ids) // stored procedure returns more than one row for each id
In this example, the SP call overhead is n times that of the single call, and the calls are serialized:
foreach(item in mylist) {
CallSQLStoredProc(item.id);
}
In the third alternative:
foreach(item in mylist) {
StartSQLStoredProc(item.id);
}
// Continue building the page until you reach a point where you absolutely have to have the data
wait();
This still incurs n DB call overheads, but performance can improve depending on the capacity of the SQL Server and the network to parallelize the workload. In addition, you get the benefit of starting the SQL Server working while the page is still building.
The single SP solution can still win out, particularly if it can assemble a single result set with a UNION where the SQL Server can parallelize the task. However, if the result sets have separate schemas or the UNION cannot perform well, a multiple-SP asynchronous solution can beat it (and can also take advantage of the ability to do other work in the page).
If you want scalability in your application, you will want to use caching as much as possible. You should be running any shared queries only once and storing the result in the cache.
As for your query, provided you aren't using a cursor for each ID, the batched version should be faster, particularly if network latency has a meaningful impact on what you're doing. When in doubt, measure. I've been surprised many times when I actually implemented timing on my functions to see how long different things took.
In .NET, System.Diagnostics.Stopwatch is your friend :).
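On the database side, SQL Server can report its own timings as well; a minimal sketch (the proc name and parameter come from the examples above and are assumptions):
SET STATISTICS TIME ON;    -- reports parse/compile and execution CPU/elapsed time
EXEC CallSQLStoredProc @ids = '1,2,3,4,5,6,7,8,9,10';
SET STATISTICS TIME OFF;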

Performance benefit when SQL query is limited vs calling entire row?

How much of a performance benefit is there in selecting only the required fields in a query instead of querying the entire row? For example, if I have a row of 10 fields but only need 5 in the display, is it worth querying only those 5? What is the performance benefit of this limitation vs. the risk of having to go back and add fields to the SQL query later if needed?
It's not just the extra data aspect that you need to consider. Selecting all columns will negate the usefulness of covering indexes, since a bookmark lookup into the clustered index (or table) will be required.
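A minimal illustration of the covering-index point (SQL Server syntax; the table and column names are made up):
CREATE INDEX ix_orders_customer
    ON orders (customer_id)
    INCLUDE (order_date, total);   -- extra columns stored at the leaf level of the index
-- covered: answered entirely from the index, no bookmark/key lookup
SELECT customer_id, order_date, total
FROM orders
WHERE customer_id = 42;
-- SELECT * on the same filter forces a lookup into the clustered index (or heap) for every row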
It depends on how many rows are selected and how much memory those extra fields consume. It can run much slower if, for example, several text/blob fields are present or many rows are selected.
How is adding fields later a risk? Modifying queries to fit changing requirements is a natural part of the development process.
The only benefit I know of to explicitly naming your columns in your select statement is that if a column your code uses gets renamed, your select statement will fail before your code does. Even better, if your select statement is within a proc, your proc and the DB script would not compile. This is very handy if you are using tools like VS DB Edition to compile/verify DB scripts.
Otherwise the performance difference would be negligible.
The number of fields retrieved is a second-order effect on performance relative to the large overhead of the SQL request itself -- going out of process, across the network to another host, and possibly to disk on that host takes many more cycles than shoveling a few extra bytes of data.
Obviously if the extra fields include a megabyte blob the equation is skewed. But my experience is that the transaction overhead is of the same order, or larger, than the actual data retrieved. I remember vaguely from many years ago that an "empty" NOP TNS request is about 100 bytes on the wire.
If the SQL server is not the same machine from which you're querying, then selecting the extra columns transfers more data over the network (which can be a bottleneck), not forgetting that it also has to read more data from disk and allocate more memory to hold the results.
No single thing would cause a problem by itself, but add them up and together they cause performance issues. Every little bit helps when you have lots of queries or lots of data.
The risk I guess would be that you have to add the fields to the query later which possibly means changing code, but then you generally have to add more code to handle extra fields anyway.

Encrypted database query

I've just found out about Stack Overflow and I'm checking whether there are ideas for a constraint I'm facing with some friends in a project, though this is more of a theoretical question to which I've been trying to find an answer for some time.
I'm not much given into cryptography but if I'm not clear enough I'll try to edit/comment to clarify any questions.
Trying to be brief, the environment is something like this:
An application where the front-end has access to the encryption/decryption keys and the back-end is just used for storage and queries.
Having a database in which you can't have plaintext access to a couple of fields, for example "address", which is text/varchar as usual.
You don't have access to the key for decrypting the information, and all information arrives to the database already encrypted.
The main problem is something like this: how to consistently make queries on the database when it's impossible to do stuff like "where address like '%F§YU/´~#JKSks23%'". (If anyone feels they have an answer for this, feel free to shoot it.)
But is it OK to do where address='±!NNsj3~^º-:'? Or would that also completely eat up the database?
Another constraint that might apply is that the front end doesn't have much processing power available, so encrypting/decrypting information already pushes it to its limits. (Saying this just to avoid replies like "Export a join of tables to the front end and query it there".)
Could someone point me in a direction to keep thinking about it?
Well, thanks for such fast replies at 4 AM; for a first-time use I'm really impressed with this community. (Or maybe it's just the different time zone.)
Just feeding some information:
The main problem is all around partial matching, as a mandatory requirement in most databases is to allow partial matches. The main constraint is actually that the database owner must not be able to look inside the database for information. During the last 10 minutes I've come up with a possible solution, which again raises possible database problems; I'll add it here:
Possible solution to allow semi partial matching:
The password + a couple of public fields of the user are actually the key for encrypting. For authentication the idea is to encrypt a static value and compare it within the database.
Creating a new set of tables where information is stored in a tokenized way, meaning something like: "4th Street" would become 2 encrypted rows (one for '4th', another for 'Street'). This would already allow semi-partial matching, as a search could be performed on the separate tables.
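A sketch of what that token table might look like (generic SQL; names and types are made up, and it assumes the tokens are encrypted deterministically with the client-held key so that equality comparisons on ciphertext are meaningful):
CREATE TABLE address_tokens (
    person_id int NOT NULL,            -- primary key of the row in the real table
    token_enc varbinary(256) NOT NULL  -- one encrypted word, e.g. '4th' or 'Street'
);
CREATE INDEX ix_address_tokens ON address_tokens (token_enc);
-- the client encrypts the search term and sends only the ciphertext:
SELECT DISTINCT person_id
FROM address_tokens
WHERE token_enc = 0x1A2B3C;            -- placeholder ciphertext for 'Street'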
New question:
Would this probably eat up the database server again, or does anyone think it is a viable solution for the partial matching problem?
Post Scriptum: I've unaccepted the answer from Cade Roux just to allow for further discussion and especially a possible answer to the new question.
You can do it the way you describe (effectively querying the hash, say), but there aren't many systems with that requirement, because at that point the security requirements interfere with other requirements for the system to be usable, i.e. no partial matches, since the encryption rules those out. It's the same problem with compression. Years ago, in a very small environment, I had to compress the data before putting it in the data format. Of course, those fields could not easily be searched.
In a more typical application, ultimately, the keys are going to be available to someone in the chain - probably the web server.
For end-user traffic, SSL protects that pipe. Some network switches can protect it between web server and database, and storing encrypted data in the database is fine, but you're not going to query on encrypted data like that.
And once the data is displayed, it's out there on the machine, so any general-purpose computing device can be circumvented at that point, and you have perimeter defenses outside of your application which really come into play.
Why not encrypt the disk holding the database tables, encrypt the database connections, and let the database operate normally?
[I don't really understand the context/constraints that require this level of paranoia]
EDIT: "law constraints" eh? I hope you're not involved in anything illegal, I'd hate to be an inadvertent accessory... ;-)
If the (ahem) legal constraints force this solution, then that's all there is to be done: no LIKE matches, and slow responses if the client machines can't handle it.
A few months ago I came across the same problem: the whole database (except for indexes) is encrypted, and the problem of partial matches came up.
I searched the Internet looking for a solution, but it seems there's not much to do about this other than a workaround.
The solution I've finally adopted is:
Create a temporary table containing the decrypted data of the field against which the query is being performed, plus another field holding the primary key of the real table (obviously, this field doesn't have to be decrypted, as it is plain text).
Perform the partial match against that temporary table and retrieve the identifiers.
Query the real table for those identifiers and return the result.
Drop the temporary table.
I am aware that this supposes a non-trivial overhead, but I haven't found another way to perform this task when it is mandatory that the database is fully encrypted.
Depending on each particular case, you may be able to limit the number of rows inserted into the temporary table without losing data for the result (only consider the rows that belong to the user performing the query, etc.).
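A sketch of those steps (PostgreSQL syntax; table and column names are hypothetical, and it assumes the application decrypts the address values client-side and bulk-inserts them into the temporary table):
CREATE TEMP TABLE tmp_addresses (person_id int PRIMARY KEY, address_plain text);
-- ... application inserts (person_id, decrypted address) pairs here ...
SELECT p.*                                   -- return the encrypted rows whose address matched
FROM people p
JOIN tmp_addresses t ON t.person_id = p.person_id
WHERE t.address_plain ILIKE '%4th street%';  -- partial match runs against the plain text
DROP TABLE tmp_addresses;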
You want to use MD5 hashing. Basically, it takes your string and turns it into a hash that cannot be reversed. You can then use it to validate against things later. For example:
$salt = "123-=asd";
$address = "3412 g ave";
$sql = "INSERT INTO addresses (address) VALUES ('" . md5($salt . $address) . "')";
mysql_query($sql);
Then, to validate an address in the future:
$salt = "123-=asd";
$address = "3412 g ave";
$sql = "SELECT address FROM addresses WHERE address = '" . md5($salt . $address) . "'";
$res = mysql_query($sql);
if (mysql_fetch_row($res))
// exists
else
// does not
Now it is hashed on the database side, so nobody can read the original value from the stored data. However, someone who finds the salt in your source code could still brute-force the hashes.
http://en.wikipedia.org/wiki/MD5
If you need to store sensitive data that you want to query later, I'd recommend storing it in plain text and restricting access to those tables as much as you can.
If you can't do that, and you don't want the overhead in the front end, you can build a component in the back end, running on a server, that processes the encrypted data.
Making queries against encrypted data? If you're using a good encryption algorithm, I can't imagine how to do that.

Simulating queries of large views for benchmarking purposes

A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data, up to the point where it is just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third options tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
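One way to fabricate an arbitrary number of rows without a loop or cursor is to cross join a system catalog with itself and take TOP (n) rows of generated values; a sketch in SQL Server syntax (the benchmark column names are made up):
DECLARE @n int = 100000;
INSERT INTO benchmark (some_guid, some_number)
SELECT TOP (@n)
    NEWID(),                          -- random uniqueidentifier per row
    ABS(CHECKSUM(NEWID())) % 1000     -- pseudo-random int in [0, 999]
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;         -- yields millions of candidate rows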
For data for testing CRM-type tables, I highly recommend fakenamegenerator.com; you can get 40,000 fake names for free.
You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.
If you want results you can rely on you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out if you get results that aren't good enough with the other options you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
To populate the rows, perhaps something like this would do:
WHILE (SELECT count(1) FROM benchmark) < 100000
INSERT INTO benchmark
SELECT TOP 100000 * FROM actualData