How to load 1 million records from a database fast?


We have a Firebird database with 1.000.000 records that must be processed after ALL of them are loaded into RAM. To get all of them we must extract the data using (select * first 1000 ...), which takes 8 hours. What is the solution for this?

Does each of your "select * first 1000" (as you described it) do a full table scan? Look at those queries, and make sure they are using an index.

How long does it take to construct the DTO object that you are creating with each data read?
{
    int a = read.GetInt32(0);
    int b = read.GetInt32(1);
    mylist.Add(new DTO(a, b));
}
You are creating a million of these objects. If it takes 29 milliseconds to create one DTO object, then that is going to take over 8 hours to complete.

"to load data from a table with 1.000.000 rows in C# using a firebird db takes on a Pentium 4 3Ghz at least 8 hours"
Everybody's been assuming you were running a SQL query to select the records from the database. Something like
select *
from your_big_table
/
Because that really would take a few seconds. Well, a little longer to display it on a screen, but executing the actual select should be lightning fast.
But that reference to C# makes me think you're doing something else. Perhaps what you really have is an RBAR loop instantiating one million objects. I can see how that might take a little longer. But even so, eight hours? Where does the time go?
edit
My guess was right and you are instantiating 1000000 objects in a loop. The correct advice would be to find some other way of doing whatever it is you do once you have got all your objects in memory. Without knowing more about the details it is hard to give specifics. But it seems unlikely this is a UI thing - what user is going to peruse a million objects?
So a general observation will have to suffice: use bulk operations to implement bulk activity. SQL databases excel at handling sets. Leverage the power of SQL to process your million rows in a single set, rather than as individual rows.
If you don't find this answer helpful then you need to give us more details regarding what you're trying to achieve.

What sort of processing do you need to do that would require to load them in memory and not just process them via SQL statements?
There are two techniques I use that work depending on what I am trying to do.
Assuming there is some sort of artificial key (identity), work in batches, incrementing the last identity value processed.
BCP the data out to a text file, churn through the updates, then BCP it back in, remembering to turn off constraints and indexes before the IN step.
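The first technique (batching on an artificial key) might look roughly like the sketch below. It is written in Python against an in-memory SQLite database purely so that it can be run as-is; the table name big_table, its columns and the batch size are assumptions for illustration, and on Firebird the LIMIT clause would have to be replaced by that engine's own row-limiting syntax.

```python
import sqlite3


def iter_batches(conn, batch_size=1000):
    """Yield lists of (id, a, b) rows, resuming after the last id seen."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, a, b FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Remember the highest key processed so the next query seeks past it
        # through the primary key index instead of rescanning earlier rows.
        last_id = rows[-1][0]


# Demo data: a hypothetical table with 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO big_table (id, a, b) VALUES (?, ?, ?)",
    ((i, i * 2, i * 3) for i in range(1, 10001)),
)

batches = list(iter_batches(conn, batch_size=1000))
total_rows = sum(len(batch) for batch in batches)
print(len(batches), total_rows)
```

Because each batch starts from the last key seen rather than skipping N rows, the cost of a batch should stay roughly constant no matter how far into the table you are.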

Take a look at this:
http://www.firebirdfaq.org/faq13/

Related

How long should a query that returns 5 million records take?

I realise the answer should probably be 'as little time as possible' but I'm trying to learn how to optimise databases and I have no idea what an acceptable time is for my hardware.
For a start I'm using my local machine with a copy of SQL Server 2008 Express. I have a dual-core processor, 2GB RAM and a 64-bit OS (if that makes a difference). I'm only using a simple table with about 6 varchar fields.
At first I queried the data without any indexing. This took a ridiculously long amount of time so I cancelled and added a clustered index (using the PK) to the table. This cut the time down to 1 minute 14 sec. I have no idea if this is the best I can get or whether I'm still able to cut this down even further?
Am I limited by my hardware or is there anything else I can do to my table/database/queries to get results faster?
FYI I'm only using a standard SELECT * FROM <Table> to retrieve my results.
EDIT: Just to clarify, I'm only doing this for testing purposes. I don't NEED to pull out all the data, I'm just using that as a consistent test to see if I can cut down the query times.
I suppose what I'm asking is: Is there anything I can do to speed up the performance of my queries other than a) upgrading hardware and b) adding indexes (assuming the schema is already good)?

I think you are asking the wrong question.
First of all - why do you need so many records at one time on the local machine? What do you want to do with them? I'm asking because I think you want to transfer this data somewhere, so you should be measuring how long it takes to transfer the data.
Some advice:
Your application should not select 5 million records at a time. Try to split your query and get the data in smaller sets.
UPDATE:
Because you are doing this for testing, I suggest that you do the following to improve performance:
Remove * from your query - it takes SQL Server some time to resolve this.
Put your data in temporary storage; try using a VIEW or a temporary table for this.
Use plan caching on your server.
But even if you're just testing, I still don't understand why you would need such tests if your application would never use such a query. Testing just for the sake of testing is a bad use of time.

Look at the query execution plan. If your query is doing a table scan, it will obviously take a long time. The query execution plan can help you decide what kind of indexing you would need on the table. Also, creating table partitions can help sometimes in cases where the data is partitioned by a condition (usually date and time).

I did 5.5 million in 20 seconds. That's taking over 100k schedules with different frequencies and forecasting them for the next 25 years. Just max scenario testing, but proves the speed you can achieve in a scheduling system as an example.

The best-optimized approach depends on the indexing strategy you choose. Like many of the answers above, I too would say that partitioning the table would help sometimes. And it's not best practice to query all the billion records in a single go; you will get much better results if you query the data partially, in iterations. You may check this link to clear up any doubts about the minimum requirements for SQL Server 2008: Minimum H/W and S/W Requirements for Sql server 2008

When fetching 5 million rows you are almost certainly going to spool to tempdb. You should try to optimize your tempdb by adding additional files. If you have multiple drives on separate disks, you should split the table data into different ndf files located on separate disks. Partitioning won't help when querying all the data on the disk.
You can also use a query hint (MAXDOP) to force parallelism; this will increase CPU utilization. Ensure that the columns contain as few nulls as possible, and rebuild your indexes and stats.

Select query too slow > 5min

I have a table MyTable with 29,000 rows.
MyTable structure {
StudentId bigint,
....
}
The table has more than 10 columns. The database is on the hosting server.
From SSMS I execute the query:
SELECT *
FROM MyTable
Is it normal that the execution lasts more than 5 min?

First of all, retrieving all the data from a remote database is never a good idea. You are using an important share of bandwidth. Hopefully, the query you are using is only used for debugging purpose and should never hit production.
You did not mention if it took 5 minutes before you started receiving something or if you are receiving your data over the course of that 5 minutes, at a constant rate.
In the first situation, not receiving rows at all might indicate that a lock is held on your table by another operation.
In the latter situation, you are constantly receiving rows, but at a slow rate. Bandwidth and server load play a big part in that. To get a rough idea of the amount of data that you are downloading, run this stored procedure:
EXEC sp_spaceused 'YourTableName';
Consider that the server has to upload that data and that you have to download the data.
Binary and XML fields (also called BLOB fields) usually hold a lot of data, and you may not be able to control the amount of data stored by the user in those fields.
Try checking the size of your variable length fields (varchar, xml and varbinary) by running a DATALENGTH on your column:
SELECT DATALENGTH(MyField) FROM MyTable
You can also get an average:
SELECT AVG(DATALENGTH(MyField)) FROM MyTable
A good idea concerning BLOB fields is to retrieve them only when needed and not when you are loading a list of data.
For example, assume an XML field stored in a PurchaseOrder table. If you wish to display the list of POs to your user, you usually don't need to retrieve that field, unless the user opens the PO.
Many recent ORMs, like NHibernate, offer lazy loading for columns, along with paging, so you can retrieve a small number of rows.
Ayende posted a rant about loading unbounded result sets two weeks ago.

You're right - the select query shouldn't take that long. It's not the number of rows. Likely it's the type of data you've got on that table/view, and perhaps the storage configuration (slow disk, filegroups config, etc).
Some ideas to consider to remedy this performance problem:
be specific in the columns that you want to retrieve. For ad-hoc queries, SELECT * is fine, but recognize that the RDBMS will work slightly harder to determine which columns are on the table/view.
gathering the values of any columns of datatype text or varbinary will take proportionally longer depending on the data within those fields.
consider the indexes on the table/view (do you have any?)
is this a Prod database, where more/other activity might be hitting this table?
If you edit your question, perhaps include the full table definition so that we can get a real look at what's happening with the datatypes.

I would recommend that you consider OMG Ponies's recommendation - it could be due to the bandwidth between the box and your machine, so try to remote into the box and see how long the query takes on that machine.
If it takes almost the same amount of time, then the problem lies either in the database design or underlying hardware, or other factors (table datatypes, wrong indexes, overall load on the machine, overall hits to this table, etc).
If it takes significantly less time, then the problem is surely between your machine and the box - ideally this shouldn't be a big problem, because the web server will be closer to the db server, probably on the same LAN (so it should be much faster in the real world). Also, I'm sure you wouldn't use a 'Select *' in the actual app to pick 29000 rows, so it shouldn't cause much of a problem.

Postgres: How to fire multiple queries at the same time?

I have a procedure which updates record values, and I want to run it against all records in a table (over 30k records). The procedure's execution time is from 2 up to 10 seconds, because it depends on network load.
Right now I'm doing UPDATE table SET field = procedure_name(paramns); but with that number of records it takes up to 40 min to process the whole table.
At the moment I'm using 4 different connections which fork to the background and fire the query with a WHERE clause set to iterate over the modulo of row ids to speed this up ( WHERE id_field % 4 = ), and this works well and cuts the table populate time down to ~10 mins.
But I want to avoid using cron, shell jobs and multiple connections for this. I know that it can be done with libpq, but is there a way to fire up a query (4 different non-blocking queries) and not wait until it finishes executing, within a single connection?
Or can anyone point me to some clues on how to write that function, using Postgres internals, or simply in C and bind it as a stored procedure?
Cheers Darius

I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case and not that it's either an I/O bottleneck on the DB or the possibility that procedure_name(paramns); is taking a long time. (If the procedure itself were taking 2-10 seconds, it would take something like 2500 min to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to 1/4. So in particular it is not an I/O issue on the DB server.
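To make that back-of-the-envelope reasoning concrete, here is the arithmetic as a tiny Python sketch. The numbers are the ones from the question; the variable names are just for illustration.

```python
records = 30_000
observed_total_min = 40  # single-connection run reported in the question

# Total time if every call really took 2 to 10 seconds.
low_min = records * 2 / 60
high_min = records * 10 / 60

# Per-record time that the observed 40 minutes actually implies.
implied_sec_per_record = observed_total_min * 60 / records

print(low_min, high_min, implied_sec_per_record)
```

So 2-10 seconds per call would mean somewhere between 1000 and 5000 minutes in total, while the observed 40 minutes works out to roughly 0.08 seconds per record.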
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. but, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a select statement. May need to use creative joins. If it's an SP of course now you are moving the logic to the client.
Use that to have the program create an XML or other importable flat file format with the PK of the record to update, and the new field value or values. Write all the updates to this file instead of executing them on the DB.
have a temp table on the database that matches the layout of this flat file
run an import on the database - clear the temp table and import the file
do an update of a join of the temp table and the table to be updated, e.g., UPDATE mytbl, mytemp WHERE myPK=mytempPK SET myval=mytempnewval (use the right join syntax of course).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
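As a rough illustration of the staging-table steps above, here is a minimal sketch using Python and an in-memory SQLite database, chosen only so that it runs without a server. The names mytbl, mytemp, pk, myval and newval are hypothetical, the file import step is replaced by a batched insert, and on Postgres the final statement would more naturally be written with UPDATE ... FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytbl (pk INTEGER PRIMARY KEY, myval TEXT)")
conn.executemany(
    "INSERT INTO mytbl (pk, myval) VALUES (?, ?)",
    [(i, "old") for i in range(1, 6)],
)

# Step 1: compute the new values on the client (hypothetical results here).
updates = [(1, "new-1"), (3, "new-3"), (5, "new-5")]

# Step 2: load them into a staging table in one batch, instead of
# issuing one UPDATE per record.
conn.execute("CREATE TEMP TABLE mytemp (pk INTEGER PRIMARY KEY, newval TEXT)")
conn.executemany("INSERT INTO mytemp (pk, newval) VALUES (?, ?)", updates)

# Step 3: apply all changes with a single set-based UPDATE.
conn.execute(
    """
    UPDATE mytbl
    SET myval = (SELECT newval FROM mytemp WHERE mytemp.pk = mytbl.pk)
    WHERE pk IN (SELECT pk FROM mytemp)
    """
)
conn.commit()

result = conn.execute("SELECT pk, myval FROM mytbl ORDER BY pk").fetchall()
print(result)
```

Rows 2 and 4 keep their old value because they have no entry in the staging table; the WHERE clause is what prevents them from being overwritten with NULL.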

It is possible to update multiple rows at once. Below is an example in Postgres:
UPDATE
    table_name
SET
    column_name = temp.column_name
FROM
    (VALUES
        (<id1>, <value1>),
        (<id2>, <value2>),
        (<id3>, <value3>)
    ) AS temp("id", "column_name")
WHERE
    table_name.id = temp.id

PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages, you have to dig into the manuals.

I think you can't. A single connection can handle only one query at a time. It's described in the libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."

SQL, selecting and updating

I am trying to select 100s of rows from a DB that contains 100000s of rows and update those rows afterwards.
The problem is that I don't want to go to the DB twice for this purpose, since the update only marks those rows as "read".
Is there any way I can do this in Java using simple JDBC libraries? (hopefully without using stored procedures)
update: ok here is some clarification.
There are a few instances of the same application running on different servers. They all need to select 100s of "UNREAD" rows sorted according to the creation_date column, read the blob data within them, write it to a file and ftp that file to some server. (I know, prehistoric, but requirements are requirements.)
The read-and-update part is there to ensure that each instance gets a different set of data (and in order, so tricks like odds and evens won't work :/).
We select the data for update. The data transfers over the wire (we wait and wait) and then we update the rows as "READ", then release the lock for reading. This entire thing takes too long. By reading and updating at the same time, I would like to reduce the lock time (from the time we use select for update to the actual update) so that using multiple instances would increase the rows read per second.
Still have ideas?

It seems to me there might be more than one way to interpret the question here.
1. You are selecting the rows for the sole purpose of updating them and not reading them.
2. You are selecting the rows to show to somebody, and marking them as read either one at a time or all as a group.
3. You want to select the rows and mark them as read at the time you select them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a downstream system has received them, or whatever). For this, you'll probably have to do another update. If you query for the primary key, in addition to the other columns you'll need in the first select, you will probably have an easier time of updating, as the DB won't have to do table or index scans to find the rows.
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That's worked out well when I need to perform a lot of updates that are of the exact same form.
Option 3, where you want to select and update all in one shot. I don't find much use for this, personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what db you are working with here and can't really offer specifics.
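For option 2, the select-with-primary-key followed by a batched update could be sketched as follows. This is Python with an in-memory SQLite database so that it runs standalone; executemany stands in for JDBC's addBatch/executeBatch, and the messages table with its columns is a made-up schema based on the description in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, creation_date TEXT, body TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO messages (creation_date, body, status) VALUES (?, ?, 'UNREAD')",
    [(f"2020-01-{day:02d}", f"payload {day}") for day in range(1, 11)],
)

# Select the oldest unread rows and include the primary key, so the
# follow-up update can address each row directly.
rows = conn.execute(
    "SELECT id, body FROM messages "
    "WHERE status = 'UNREAD' ORDER BY creation_date LIMIT 3"
).fetchall()

# ... write the bodies to a file, transfer it, and so on ...

# Mark exactly the rows that were processed, as one batch of
# identically shaped statements keyed on the primary key.
conn.executemany(
    "UPDATE messages SET status = 'READ' WHERE id = ?",
    [(row[0],) for row in rows],
)
conn.commit()

unread_left = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE status = 'UNREAD'"
).fetchone()[0]
print([row[0] for row in rows], unread_left)
```

With a real client/server database the benefit comes from sending the updates in one batch rather than one round trip per row; SQLite has no network hop, so this only shows the shape of the code.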

Going to the DB isn't so bad. If you aren't returning anything 'across the wire' then an update shouldn't do you too much damage, and it's only a few hundred thousand rows. What is your worry?

If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
Don't be too code-centric. Let the database do the job it was designed for.

Can't you just use the same connection without closing it?

Simulating queries of large views for benchmarking purposes

A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data until it is just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third option tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.

For data for testing CRM type tables, I highly recommend fakenamegenerator.com, you can get 40,000 fake names for free.

You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.

If you want results you can rely on you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out if you get results that aren't good enough with the other options you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.

The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
To populate the rows, perhaps something like this would do:
WHILE (SELECT count(1) FROM benchmark) < 100000
INSERT INTO benchmark
SELECT TOP 100000 * FROM actualData
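A variant of that loop which stops at exactly the requested row count, even when the source size does not divide the target evenly, might look like the sketch below. It uses Python with an in-memory SQLite database so it can be run as-is; the single name column is an assumed stand-in for the real schema, and on SQL Server the LIMIT clause would become TOP.

```python
import sqlite3


def fill_benchmark(conn, target_rows):
    """Copy rows from actualData into benchmark until it holds target_rows rows."""
    source_rows = conn.execute("SELECT COUNT(*) FROM actualData").fetchone()[0]
    if source_rows == 0:
        raise ValueError("actualData is empty; nothing to duplicate")
    current = conn.execute("SELECT COUNT(*) FROM benchmark").fetchone()[0]
    while current < target_rows:
        # Never copy more rows than are still missing, so the loop
        # cannot overshoot the target.
        remaining = target_rows - current
        conn.execute(
            "INSERT INTO benchmark (name) SELECT name FROM actualData LIMIT ?",
            (remaining,),
        )
        current = conn.execute("SELECT COUNT(*) FROM benchmark").fetchone()[0]
    conn.commit()
    return current


# Demo: 30 source rows duplicated up to 100 benchmark rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actualData (name TEXT)")
conn.execute("CREATE TABLE benchmark (name TEXT)")
conn.executemany(
    "INSERT INTO actualData (name) VALUES (?)",
    [(f"contact {i}",) for i in range(30)],
)

final_count = fill_benchmark(conn, 100)
print(final_count)
```

The empty-source check matters because the loop would otherwise never terminate when actualData has no rows.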
