SQL: How many records can be selected at a time?

I have a question: when running a SELECT query, how many records can be returned at a time? In other words, what is the maximum limit on the number of records a SELECT can return in SQL Server 2000, 2005, and 2008?
Thanks in advance.

There's no hard limit that I'm aware of on SQL Server's side on how many rows you can SELECT. If you could INSERT them all, you can read them all back out.
However, if you select millions of rows at a time, you may run into issues like your client running out of memory or your connection timing out before all the data you SELECTed can be transmitted.

I don't believe there is a built-in 'limit' on selecting rows; it comes down to the architecture that SQL Server is running on (i.e. 32-bit/64-bit, memory available, etc.). Certainly you can select hundreds of thousands of rows without issue.
But... if you ever find yourself asking for that many rows from a database, you should optimise your code / stored procedures so that you only retrieve a subset at a time.
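For example, here is a minimal keyset-paging sketch that works on all three versions mentioned (the Clients table, its ClientId key, and @LastClientId are hypothetical):
SELECT TOP 1000 ClientId, ClientName
FROM Clients
WHERE ClientId > @LastClientId  -- pass 0 on the first call
ORDER BY ClientId;
Each call returns the next 1,000 rows after the last key the client saw, so no single round trip carries the whole table.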

As @paolo says, there is no SQL-defined hard limit; you can specify a limit in your query with the LIMIT keyword (although that is database-dependent).
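For instance, row limiting is spelled differently across engines; each of these (against a hypothetical orders table) returns at most 100 rows:
SELECT * FROM orders LIMIT 100;            -- MySQL / PostgreSQL / SQLite
SELECT TOP 100 * FROM orders;              -- SQL Server
SELECT * FROM orders WHERE ROWNUM <= 100;  -- Oracle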
However, there is one important point: performing a SELECT query against a real database server typically does not transfer all the data from the server to the client, or load everything into memory, up front. A query usually holds a cursor into a result set, and as the client iterates through the result set more rows are fetched from the server, usually in chunks. So unless a client explicitly copies all data from the result set into memory, at no point will this implicitly happen.
This is all completely database-dependent, and in many cases drivers allow tweaking of the chunk size, etc.

Related

The fastest way to extract all records from Oracle

I have an Oracle table containing 900 million records. The table is partitioned into 24 partitions and has indexes.
I tried using a hint and set the fetch buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there any faster way to get all 900 million records (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into, say, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only affects the way the database retrieves the data; the data is still sent to the client on a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, since then you can read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
Not really an answer but too long for a comment.
Too many variables impact this to give informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid the network entirely. Once the file(s) are complete, it will be quicker to compress them and transfer them to the destination than to move the data directly from database to local file via JDBC row-by-row processing.
With JDBC, remember to set the cursor fetch size to reduce round-tripping - setFetchSize. The default value is tiny (10, I think); try something like 1000 and see how that helps.
As for the query, you are writing to a file so even though Oracle might process the query in parallel, your write to file process probably doesn't so it's a bottleneck.
My approach would be to write the Java program to operate on a range of values passed as command-line parameters, and experiment to find which range size and number of concurrent instances give optimal performance. The range will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range value is an indexed column, ideally the partition key).
Roughly speaking, I would start with a range of 5m and run as many concurrent instances as CPU cores minus 2; this is not a scientifically derived number, just one that I tend to use as a first stab to see what happens.
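As a minimal sketch (my_table and its id column are hypothetical, standing in for your table and partition key), each concurrent instance would run something like:
SELECT *
FROM my_table
WHERE id >= :range_start
  AND id < :range_end;
If the ranges line up with partition boundaries, each query touches only its own partition.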

Checking for duplicates before inserting into a SQL database

So I've been doing some research, and I need to write an INSERT statement to insert unique client names into a table on my server. However, the database already has thousands of clients in it, and when inserting new clients we need to check whether they already exist before attempting to add them to the system.
My question is: what would be the best/fastest way to do this? Would it be better to run a simple select query on the clients table (ordered ascending) and do a binary search or something on the results, or perhaps just run a SQL query similar to the one below?
IF NOT EXISTS (SELECT 1 FROM clients AS c WHERE c.clientname = ?)
BEGIN
    INSERT INTO clients (clientname, address, ...)
    VALUES (?, ?, ...)
END
Is this a slow statement? I may have to run the insert several hundred times per submission.
The standard advice is to create a UNIQUE constraint if you want a given column to be unique.
ALTER TABLE clients ADD UNIQUE KEY (clientname);
Then try to do the INSERT, and it'll succeed if there is no matching row, and it'll fail if there is a duplicate. No SELECT is necessary.
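A single-statement alternative is also possible; this is only a sketch against the same clients table (it runs as-is on SQL Server and PostgreSQL, while MySQL wants a FROM DUAL):
INSERT INTO clients (clientname, address)
SELECT ?, ?
WHERE NOT EXISTS (SELECT 1 FROM clients WHERE clientname = ?);
Keep the UNIQUE constraint even so: two concurrent sessions can both pass the NOT EXISTS check, and the constraint is what actually guarantees uniqueness.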
It is not too uncommon to calculate the cost of a SQL query in terms of disk operations (usually, reading/writing a block (typically 8 KB) is the unit for your costs; in-memory DBs change this line of thought somewhat).
If you have hundreds, possibly thousands, of items, and each item is... say, 20 bytes, then your full database may fit in a single block on disk (400 items/block). Maybe it needs a couple more blocks, but hurray: it is a negligibly small number. With such a small database, your data will probably linger in the database's memory cache, and you will only need to pay for write access.
As your database grows, an index keeps the number of block accesses small: a B-tree lookup needs only logarithmically many block reads instead of a full scan.
Neither your solution nor Bill's will cause any write access if an item is already present in the database, so both should be equally fast.
The interesting part would be:
I may have to run the insert several hundred times per each submission.
That would mean that you might write one and the same block on disk hundreds of times. It would be faster if you could do this in a single step. However, this is indeed a problem, as I am not aware of any standard SQL function that allows this behavior. MySQL's INSERT offers a way to specify a number of values in a single statement. This MIGHT be a considerable plus (I don't know how smartly MySQL handles this situation), but it is specific to MySQL and not portable.
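For illustration, the multi-row form looks like this (the clients table and values are made up; SQL Server 2008+ and PostgreSQL happen to accept the same syntax):
INSERT INTO clients (clientname, address)
VALUES ('Acme', '1 Main St'),
       ('Globex', '2 Side St'),
       ('Initech', '3 Back St');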
Another way to speed things up is to not wait until the blocks you have changed are written to disk. This comes at the risk of losing data without notice, but can be a significant performance boost. Again, this is specific to the DBMS that you use. E.g. if you use MySQL with InnoDB, you can set the option innodb_flush_log_at_trx_commit=0 in your my.ini to achieve this behaviour.
Would it be better to run a simple select query on the clients table (ordered by ASC), and do a binary search or something on the results
This would needlessly copy large amounts of data from your DBMS to the client (which may well be on a different machine, communicating over a network protocol). It would still be OK for your small DB, but it does not scale well. It would only be of use if it helped you save the data to disk in a single operation.

Windows 2003 server becomes very slow while executing a query that retrieves hundreds of thousands of records

I executed a query that retrieves more than 100,000 records, using joins.
While this query was running, the whole server became very slow, and this affected other sites trying to run normal queries to fetch records.
In this case, the query retrieving that many records and the other queries running simultaneously are against different databases.
Your query retrieving hundreds of thousands of records is probably causing significant IO and thrashing the buffer pool. You need to address this from several directions:
review your requirements. Why are you retrieving hundreds of thousands of records? For sure no human can look at so many. Any analysis should be pushed to the server, retrieving only aggregate results (see the sketch after this list)
Why do you need to frequently analyze hundreds of thousands of records? Perhaps you need an ETL pipeline to extract the required aggregates/summations on a daily basis
Maybe the query does need to analyse hundreds of thousands of records, and perhaps you're missing an index
If none of the above applies, it simply means you need a bigger boat: your current hardware cannot handle the requirements.
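As a minimal sketch of pushing analysis to the server (the orders table and its columns are hypothetical): return one summary row per group instead of every underlying row:
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;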

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application that adds rows with 10 columns to a PostgreSQL server using the libpq library. Right now, my server is running on the same machine as my Visual C++ application.
I have added around 15-20 million records. A simple total count using select count(*) from <tableName>; takes 4-5 minutes.
I have indexed my table on the time at which I am entering the data (timecode). Most of the time I need counts with different WHERE / AND clauses added.
Is there any way to make things faster? I need to make it as fast as possible, because once the server moves to the network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*), which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link, one is to use an index column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another is to use a Postgres specific function for an approximate value like:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
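Since most of your counts filter on the indexed timecode column, a WHERE clause on that column lets the planner use an index range scan instead of walking the whole table; a hypothetical example (the date range is made up):
SELECT count(*)
FROM mytable
WHERE timecode >= '2012-01-01' AND timecode < '2012-02-01';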
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
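A quick way to get both numbers (a sketch; mytable is a placeholder, and you do the division by hand):
EXPLAIN ANALYZE SELECT count(*) FROM mytable;        -- reports the actual runtime
SELECT pg_size_pretty(pg_relation_size('mytable'));  -- reports the on-disk size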
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. The work happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table it can be helpful.
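As a minimal sketch of the inheritance-style partitioning described in that link (table and column names are hypothetical, partitioning by month on a timecode column):
CREATE TABLE measurements (id bigint, timecode timestamp, value float8);
CREATE TABLE measurements_2012_01 (
    CHECK (timecode >= '2012-01-01' AND timecode < '2012-02-01')
) INHERITS (measurements);
With constraint_exclusion enabled, queries whose WHERE clause matches the CHECK constraint scan only the relevant child table.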

Select query too slow > 5min

I have a table MyTable with 29,000 rows.
MyTable structure {
StudentId bigint,
....
}
There are more than 10 columns. The database is on a hosting server.
From SSMS I execute the query:
SELECT *
FROM MyTable
Is it normal that the execution lasts more than 5 min?
First of all, retrieving all the data from a remote database is never a good idea; you are using a significant share of bandwidth. Hopefully, the query you are using is only for debugging purposes and will never hit production.
You did not mention whether it took 5 minutes before you started receiving anything, or whether you received your data over the course of those 5 minutes at a constant rate.
In the first situation, not receiving rows at all might indicate that a lock is held on your table by another operation.
In the latter situation, you are constantly receiving rows, but at a slow rate. Bandwidth and server load play a big part in that. To get a rough idea of the amount of data that you are downloading, run this stored procedure:
EXEC sp_spaceused 'YourTableName';
Consider that the server has to upload that data and that you have to download it.
Binary and xml fields (also called BLOB fields) usually hold a lot of data, and you may not be able to control the amount of data users store in those fields.
Try checking the size of your variable length fields (varchar, xml and varbinary) by running a DATALENGTH on your column:
SELECT DATALENGTH(MyField) FROM MyTable
You can also get an average:
SELECT AVG(DATALENGTH(MyField)) FROM MyTable
A good idea concerning BLOB fields is to retrieve them only when needed, not when you are loading a list of data.
For example, assume an XML field stored in a PurchaseOrder table. If you wish to display the list of POs to your user, you usually don't need to retrieve that field unless the user opens the PO.
Many recent ORMs, like NHibernate, offer lazy loading for columns, along with paging so you can retrieve a small number of rows at a time.
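This is roughly the query shape such ORMs emit for a page; a ROW_NUMBER()-based sketch for SQL Server 2005+ (MyTable and StudentId are from the question above; the page bounds are made up):
WITH numbered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY StudentId) AS rn
    FROM MyTable
)
SELECT *
FROM numbered
WHERE rn BETWEEN 101 AND 150;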
Ayende posted a rant about loading unbounded result sets two weeks ago.
You're right - the select query shouldn't take that long. It's not the number of rows; it's likely the type of data you've got in that table/view, and perhaps the storage configuration (slow disk, filegroup config, etc.).
Some ideas to consider to remedy this performance problem:
be specific about the columns that you want to retrieve. For ad-hoc queries, SELECT * is fine, but recognize that the RDBMS will work slightly harder to determine which columns are on the table/view.
gathering the values of any columns of datatype text or varbinary will take proportionally longer depending on the data within those fields.
consider the indexes (do you have any?) on the table/view?
is this a Prod database, where more/other activity might be hitting this table?
If you edit your question, perhaps include the full table definition so that we can get a real look at what's happening with the datatypes.
I would recommend that you consider OMG Ponies's recommendation - it could be due to the bandwidth between the box and your machine, so
try remoting into the box and see how long the query takes on that machine.
If it takes almost the same amount of time, then the problem lies in the database design, the underlying hardware, or other factors (column datatypes, wrong indexes, overall load on the machine, overall hits to this table, etc.)
If it takes significantly less time, then the problem is surely between your machine and the box - ideally this shouldn't be a big problem, because the web server will be closer to the DB server, probably on the same LAN (so it should be much faster in the real world). Also, I'm sure you wouldn't use a 'SELECT *' in the actual app to pick 29,000 rows, so it shouldn't create a big problem.