libpq very slow for large (20 million record) database - sql

I am new to SQL/RDBMS.
I have an application that adds rows with 10 columns to a PostgreSQL server using the libpq library. Right now the server is running on the same machine as my Visual C++ application.
I have added around 15-20 million records. A simple total count using select count(*) from <tableName>; takes 4-5 minutes.
I have indexed the table on the column that records when each row was inserted (timecode). Most of the time I need counts with different WHERE / AND clauses added.
Is there any way to make this faster? I need it as fast as possible, because once the server moves onto the network things will become much slower.
Thanks

I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*), which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link, one is to use an index column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another option is to read the planner's row estimate from the system catalog:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run, among many other factors.

Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
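As a rough sketch in psql (the table name is a placeholder), you can get both numbers like this:

SELECT pg_size_pretty(pg_relation_size('mytable'));  -- size of the table on disk
\timing on
SELECT count(*) FROM mytable;                         -- psql prints the elapsed time

-- throughput ≈ table size in MB divided by elapsed seconds;
-- single-digit MB/s points at the I/O subsystem rather than the query itself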
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default PostgreSQL configuration is deliberately conservative so that it does not consume too many resources and plays nicely with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
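As purely illustrative starting points (assuming a machine with roughly 8 GB of RAM mostly dedicated to PostgreSQL - adjust for your own hardware and workload), the relevant postgresql.conf entries might look like:

shared_buffers = 2GB              # default is far smaller
effective_cache_size = 6GB        # roughly shared_buffers plus the OS disk cache
work_mem = 64MB                   # per sort/hash operation, so mind concurrency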
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.

You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.
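A minimal sketch of the inheritance-based partitioning those docs describe (PostgreSQL 9.1 era; all table and column names here are made up):

CREATE TABLE measurements (
    timecode  timestamp NOT NULL,
    value     integer
);

CREATE TABLE measurements_2012_01 (
    CHECK (timecode >= DATE '2012-01-01' AND timecode < DATE '2012-02-01')
) INHERITS (measurements);

CREATE INDEX measurements_2012_01_timecode_idx
    ON measurements_2012_01 (timecode);

-- with constraint_exclusion enabled, a count restricted to January
-- only has to scan the matching child table:
SELECT count(*) FROM measurements
WHERE timecode >= DATE '2012-01-01' AND timecode < DATE '2012-02-01';

You also need to route inserts into the right child table, either by inserting into the child directly or via a trigger on the parent; the linked chapter covers both.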

Postgres query optimization

On Postgres 9.0, I set both enable_indexscan and enable_seqscan to off. Why does this improve query performance by 2x?
This may help some queries run faster, but is almost certain to make other queries slower. It's interesting information for diagnostic purposes, but a bad idea for a long-term "solution".
PostgreSQL uses a cost-based optimizer, which looks at the costs of all possible plans based on statistics gathered by scanning your tables (normally by autovacuum) and costing factors. If it's not choosing the fastest plan, it is usually because your costing factors don't accurately model actual costs for your environment, statistics are not up-to-date, or statistics are not fine-grained enough.
After turning enable_indexscan and enable_seqscan back on:
I have generally found the cpu_tuple_cost default to be too low; I have often seen better plans chosen by setting that to 0.03 instead of the default 0.01; and I've never seen that override cause problems.
If the active portion of your database fits in RAM, try reducing both seq_page_cost and random_page_cost to 0.1.
Be sure to set effective_cache_size to the sum of shared_buffers and whatever your OS is showing as cached.
Never disable autovacuum. You might want to adjust parameters, but do that very carefully, with small incremental changes and subsequent monitoring.
You may need to occasionally run explicit VACUUM ANALYZE or ANALYZE commands, especially for temporary tables or tables which have just had a lot of modifications and are about to be used in queries.
You might want to increase default_statistics_target, from_collapse_limit, join_collapse_limit, or some geqo settings; but it's hard to tell whether those are appropriate without a lot more detail than you've given so far.
You can try out a query with different costing factors set on a single connection. When you confirm a configuration which works well for your whole mix (i.e., it accurately models costs in your environment), you should make the updates in your postgresql.conf file.
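For example (the query and table name are placeholders), on a single session:

SET cpu_tuple_cost = 0.03;
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
EXPLAIN ANALYZE SELECT count(*) FROM mytable WHERE some_column > 42;

Once the plans look right across your whole workload, move the same settings into postgresql.conf so every connection gets them.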
If you want more targeted help, please show the structure of the tables, the query itself, and the results of running EXPLAIN ANALYZE for the query. A description of your OS and hardware helps a lot, too, along with your PostgreSQL configuration.
Why?
The most logical answer is because of the way your database tables are configured.
Without you posting your table schemas I can only hazard a guess that your indexes don't have high cardinality.
That is to say, if your index contains too much information to be useful, it will be far less efficient, or indeed slower.
Cardinality is a measure of how unique a row in your index is. The lower the cardinality, the slower your query will be.
A perfect example is having a boolean field in your index; perhaps you have a Contacts table in your database and it has a boolean column that records true or false depending on whether the customer would like to be contacted by a third party.
For example, if you ran select * from Contacts where OptIn = true; you can imagine that you'd get a lot of contacts back; say 50% of them in our case.
Now if you add this 'OptIn' column to an index on that same table, it stands to reason that no matter how selective the other columns are, you will always match 50% of the table, because of the value of 'OptIn'.
This is a perfect example of low cardinality; it will be slow because any query involving that index has to select 50% of the rows in the table before further WHERE filters can whittle the dataset down again.
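To make that concrete (table and column names are just for illustration):

-- an index on a boolean column has only two distinct values
CREATE INDEX contacts_optin_idx ON Contacts (OptIn);

-- roughly half the table matches, so the planner will often ignore
-- this index and scan the table anyway
SELECT * FROM Contacts WHERE OptIn = true;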
Long story short: if your indexes include bad fields or simply cover every column in the table, then the SQL engine has to resort to testing row-by-agonizing-row.
Anyway, the above is theoretical in your case; but it is a known common reason for why queries suddenly start taking much longer.
Please fill in the gaps regarding your data structure, index definitions and the actual query that is really slow!

How long should a query that returns 5 million records take?

I realise the answer should probably be 'as little time as possible' but I'm trying to learn how to optimise databases and I have no idea what an acceptable time is for my hardware.
For a start I'm using my local machine with a copy of SQL Server 2008 Express. I have a dual-core processor, 2 GB of RAM and a 64-bit OS (if that makes a difference). I'm only using a simple table with about 6 varchar fields.
At first I queried the data without any indexing. This took a ridiculously long amount of time so I cancelled and added a clustered index (using the PK) to the table. This cut the time down to 1 minute 14 sec. I have no idea if this is the best I can get or whether I'm still able to cut this down even further?
Am I limited by my hardware or is there anything else I can do to my table/database/queries to get results faster?
FYI I'm only using a standard SELECT * FROM <Table> to retrieve my results.
EDIT: Just to clarify, I'm only doing this for testing purposes. I don't NEED to pull out all the data, I'm just using that as a consistent test to see if I can cut down the query times.
I suppose what I'm asking is: Is there anything I can do to speed up the performance of my queries other than a) upgrading hardware and b) adding indexes (assuming the schema is already good)?
I think you are asking the wrong question.
First of all - why do you need so many records at one time on the local machine? What do you want to do with them? I'm asking because I think you want to transfer this data somewhere, so you should be measuring how long it takes to transfer the data.
Some advice:
Your application should not select 5 million records at a time. Try to split your query and get the data in smaller sets.
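For example, on SQL Server 2008 (which has no OFFSET/FETCH yet) you can page through the rows with ROW_NUMBER(); the table and key column here are placeholders:

;WITH numbered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS rn
    FROM MyTable
)
SELECT *
FROM numbered
WHERE rn BETWEEN 1 AND 10000;   -- shift the range forward to fetch the next batch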
UPDATE:
Because you are doing this for testing, I suggest that you
Remove * from your query - it takes SQL Server some time to resolve the column list.
Put your data in temporary storage; try using a VIEW or a temporary table for this.
Use plan caching on your server to improve performance.
But even if you're just testing, I still don't understand why you would need such tests if your application will never use such a query. Testing just for the sake of testing is a bad use of time.
Look at the query execution plan. If your query is doing a table scan, it will obviously take a long time. The query execution plan can help you decide what kind of indexing you would need on the table. Also, creating table partitions can help sometimes in cases where the data is partitioned by a condition (usually date and time).
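As a sketch (object names are placeholders), you can look at the estimated plan in SQL Server without actually running the query:

SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM MyTable WHERE SomeColumn = 42;
GO
SET SHOWPLAN_TEXT OFF;
GO

A Table Scan in that output for a selective WHERE clause is the usual sign that a supporting index is missing.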
I did 5.5 million in 20 seconds. That's taking over 100k schedules with different frequencies and forecasting them for the next 25 years. Just max scenario testing, but proves the speed you can achieve in a scheduling system as an example.
The best optimization approach depends on the indexing strategy you choose. Like many of the answers above, I too would say that partitioning the table can sometimes help. It is also not best practice to query billions of records in a single pass; you will get much better results if you query the data partially, in iterations. You may check this link for the minimum requirements for SQL Server 2008: Minimum H/W and S/W Requirements for Sql server 2008
When fetching 5 million rows you are almost certainly going to spool to tempdb. You should try to optimize tempdb by adding additional files; if you have multiple drives on separate disks, split the table data into different .ndf files located on separate disks. Partitioning won't help when you are querying all the data on the disk.
You can also use a query hint (MAXDOP) to force parallelism, which will increase CPU utilization. Ensure that the columns contain as few NULLs as possible, and rebuild your indexes and statistics.
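A sketch of that hint (the table name is a placeholder):

SELECT *
FROM MyTable
OPTION (MAXDOP 4);   -- allow up to 4 CPUs for this query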

Performance of queries using count(*) on tables with many rows (300 million+)

I understand there are limitations to using sqlite, but I'd like to know if it should be able to handle this scenario.
My table has over 300 million records and the db is about 12 gigs. The data import util with sqlite is nice and fast. But then I added an index to a string column in this table, and it ran all night to complete this operation. I haven't compared this to other db's, but seemed quite slow to me.
Now that my index is added, I'm wanting to look for duplicates in the data. So I'm trying to run a "having count > 0" query and it seems to be taking hours as well. My query looks like:
select col1, count(*)
from table1
group by col1
having count(*) > 1
I would assume this query would use my index on col1, but the slow query execution makes me wonder if it is not?
Would perhaps sql server handle this kind of thing better?
SQLite's count() isn't optimized - it does a full table scan even if the column is indexed. Run EXPLAIN QUERY PLAN to verify and you'll see:
EXPLAIN QUERY PLAN SELECT COUNT(FIELD_NAME) FROM TABLE_NAME;
I get something like this:
0|0|0|SCAN TABLE TABLE_NAME (~1000000 rows)
But then I added an index to a string column in this table, and it ran all night to complete this operation. I haven't compared this to other db's, but seemed quite slow to me.
I hate to tell you, but what does your server look like? Not arguing, but that is a potentially very resource-intensive operation that may require a lot of IO, and normal computers or cheap web servers with a slow hard disc are not suited for significant database work. I run database projects of hundreds of gigabytes, and my smallest "large data" server has 2 SSDs and 8 VelociRaptors for data and log. The largest one has 3 storage nodes with a total of 1000 GB of SSDs - simply because IO is what a db server lives and breathes on.
So I'm trying to run a "having count > 0" query and it seems to be taking hours as well
How much RAM? Enough to fit it all in memory, or a low-memory virtual server where the missing memory turns into bad IO? How much memory can / does SQLite use? How is temp storage set up? In memory? SQL Server would possibly use a lot of memory / tempdb space for this type of check.
Increase the SQLite cache via PRAGMA cache_size=<number of pages>. The memory used is <number of pages> times <size of page> (which can be set via PRAGMA page_size=<size of page>).
By setting those values to 16000 and 32768 respectively (about 512 MB), I was able to get one program's bulk load down from 20 minutes to 2 minutes. (Although I think that if the disk on that system hadn't been so slow, this might not have had as much effect.)
You might not have this extra memory available on lesser embedded platforms, so I don't recommend increasing it as much as I did there, but for desktop- or laptop-class systems it can help greatly.
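A sketch using the numbers above (note that page_size only takes effect on a freshly created database or after a VACUUM, so cache_size is usually the knob to turn on an existing file):

PRAGMA page_size = 32768;     -- 32 KB pages
PRAGMA cache_size = 16000;    -- 16000 pages * 32 KB ≈ 512 MB of cache
SELECT col1, count(*) FROM table1 GROUP BY col1 HAVING count(*) > 1;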

Is it possible to Cache the result set of a select query in the database?

I am trying to optimize the search query which is the most used in our system. So far I have added some missing indexes and that has helped slightly. But I want to further reduce the load on the db server. One option I will use is caching the result set as a List in the ASP.NET Cache so that I don't have to hit the db as often.
However, I was wondering if there is a way to cache some portion of the select query at the db as well. For example, for the search results we consider only users who have been active in the last 180 days and who have share-info set to true. So this is like a superset which the db computes every time and then applies other conditions, such as the category and city that are passed in. Is it possible to somehow cache the superset so that I can run queries against it rather than against the whole table? Would creating a View help here? I am a bit hesitant to create a view, as I have read that managing views can be an overhead and takes away some flexibility to modify the tables.
I am using SQL Server 2005, so I cannot create a filtered index on the table, which I think would have helped.
I agree with @Neville K. SQL Server is pretty smart at caching data in memory. You might see limited or no performance gains for your effort.
You could consider indexed views (Enterprise Edition only) http://technet.microsoft.com/en-us/library/cc917715.aspx for your sub-query.
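A sketch of what such an indexed view might look like (all object names are made up; indexed views require SCHEMABINDING, two-part names and a unique clustered index, and they cannot use non-deterministic functions such as GETDATE(), so the 180-day activity cutoff has to stay in the outer query):

CREATE VIEW dbo.SharingUsers
WITH SCHEMABINDING
AS
SELECT UserId, City, Category, LastActiveDate
FROM dbo.Users
WHERE ShareInfo = 1;
GO
CREATE UNIQUE CLUSTERED INDEX IX_SharingUsers ON dbo.SharingUsers (UserId);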
It is, of course, possible to do this - but I'm not sure if it will help.
You can create a scheduled job - once a night, perhaps - which populates a table called "active_users_with_share_info" by truncating it, and then repopulating it based on a select query that keeps only users active in the last 180 days with "share_info = true".
Then you can join your search query to this table.
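A minimal sketch of that nightly refresh (all table and column names are hypothetical):

TRUNCATE TABLE dbo.active_users_with_share_info;

INSERT INTO dbo.active_users_with_share_info (UserId, City, Category)
SELECT UserId, City, Category
FROM dbo.Users
WHERE ShareInfo = 1
  AND LastActiveDate >= DATEADD(day, -180, GETDATE());

The search then joins to dbo.active_users_with_share_info instead of the full users table.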
However, I doubt this would do much good - SQL Server is pretty smart at caching. Unless you're dealing with huge volumes of data (hundreds of millions of records), or very limited hardware, I doubt you'd get any measurable performance improvement - but by all means try it!
Of course, the price for this would be more moving parts in your application, more interesting failure modes (what happens if the overnight batch fails silently?), and more training for any new developers you bring into the team.

How many rows in a database are TOO MANY?

I've a MySQL InnoDB table with 1,000,000 registers. Is this too much? Or can databases handle this and more? I ask because I noticed that some queries (for example, getting the last register of a table) are slower (seconds) in the table with 1 million registers than in one with 100.
I've a MySQL InnoDB table with 1000000 registers. Is this too much?
No, 1,000,000 rows (AKA records) is not too much for a database.
I ask because I noticed that some queries (for example, getting the last register of a table) are slower (seconds) in the table with 1 million registers than in one with 100.
There's a lot to account for in that statement. The usual suspects are:
Poorly written query
Not using a primary key, assuming one even exists on the table
Poorly designed data model (table structure)
Lack of indexes
I have a database with more than 97,000,000 records (30 GB data file), and have had no problem.
Just remember to define and maintain your table indexes.
So it's obvious that 1,000,000 is not MANY! (But if you don't index, then yes, it is MANY.)
Use EXPLAIN to examine your query and see if there is anything wrong with the query plan.
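For example (the table and column names are guesses):

EXPLAIN SELECT * FROM mytable ORDER BY id DESC LIMIT 1;
-- a "type" of ALL means a full table scan; an index on the ORDER BY column
-- lets MySQL read just the one row it needs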
I think this is a common misconception - size is only one part of the equation when it comes to database scalability. There are other issues that are hard (or harder):
How large is the working set (i.e. how much data needs to be loaded in memory and actively worked on). If you just insert data and then do nothing with it, it's actually an easy problem to solve.
What level of concurrency is required? Is there just one user inserting/reading, or do we have many thousands of clients operating at once?
What levels of durability and consistency of performance are required? Do we have to make sure that we can honor each commit? Is it okay if the average transaction is fast, or do we want to make sure that all transactions are reliably fast (something like six sigma quality control - http://www.mysqlperformanceblog.com/2010/06/07/performance-optimization-and-six-sigma/)?
Do you need to perform operational tasks, such as ALTERing the table schema? In InnoDB this is possible, but incredibly slow, since it often has to create a temporary table in the foreground, blocking access to the table while it runs.
So I'm going to state the two limiting issues are going to be:
Your own skill at writing queries / having good indexes.
How much pain you can tolerate waiting on ALTER TABLE statements.
If you mean 1 million rows, then it depends on how your indexing is done and the configuration of your hardware. A million rows is not a large amount for an enterprise database, or even a dev database on decent equipment.
If you mean 1 million columns (not sure that's even possible in MySQL) then yes, this seems a bit large and will probably cause problems.
Register? Do you mean record?
One million records is not a real big deal for a database these days. If you run into any issue, it's likely not the database system itself, but rather the hardware that you're running it on. You're not going to run into a problem with the DB before you run out of hardware to throw at it, most likely.
Now, obviously some queries are slower than others, but if two very similar queries run in vastly different times, you need to figure out what the database's execution plan is and optimize for it, i.e. use correct indexes, proper normalization, etc.
Incidentally, there is no such thing as a "last" record in a table; from a logical standpoint, rows have no inherent order.
I've seen non-partitioned tables with several billion (indexed) records, that self-joined for analytical work. We eventually partitioned the thing but honestly we didn't see that much difference.
That said, that was in Oracle and I have not tested that volume of data in MySQL. Indexes are your friend :)
Assuming that by "registers" you mean "records": no, it's not too much. MySQL scales really well and can hold as many records as you have space for on your hard disk.
Obviously, though, search queries will be slower. There is really no way around that except making sure the fields are properly indexed.
The larger the table gets (as in more rows in it), the slower queries will typically run if there are no indexes. Once you add the right indexes your query performance should improve or at least not degrade as much as the table grows. However, if the query itself returns more rows as the table gets bigger, then you'll start to see degradation again.
While 1M rows are not that many, it also depends on how much memory you have on the DB server. If the table is too big to be cached in memory by the server, then queries will be slower.
Using the query as provided will be exceptionally slow because it has to use a sort-merge to order the data.
I would recommend rethinking the design so that an index is used to retrieve the rows, or making sure the data is already stored in that order so no sorting is needed.