I use HSQLDB and I have built a database with a table that contains 10,000,000 records. It takes about 15 minutes to build this table. Then, in another program that needs these data, I try to read them. I thought that reading them in groups of 100,000 would be faster, so I execute this query:
rs = st.executeQuery("SELECT * FROM PATIENT WHERE pid>=" + start + " AND pid<=" + end + ";");
where start and end define the group I want to read each time.
I have made an index on pid but query execution is still very slow. Actually it's been running for 24 minutes and has fetched the first 24 out of 100 groups. Is this normal? What else can I do?
Thank you!
It is a good idea to select in groups of 100,000. You can verify that your query is using the index by executing this statement with your query:
EXPLAIN PLAN FOR SELECT ...
If the query is indeed using the index, you can speed up selection by defining
SET TABLE PATIENT CLUSTERED ON (PID)
or by defining PID as the primary key.
You should also look at increasing the cache size of the database, or increasing the nio file usage limit, or other tweaks discussed in the Guide. Use the latest HSQLDB jars from http://hsqldb.org/support/
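A minimal sketch of the statements mentioned above (the constraint name and the pid range are illustrative):

-- check that the index on pid is actually used
EXPLAIN PLAN FOR SELECT * FROM PATIENT WHERE pid >= 1 AND pid <= 100000;

-- store the rows in pid order so each group is read from contiguous pages
SET TABLE PATIENT CLUSTERED ON (PID);

-- or make pid the primary key (constraint name is hypothetical)
ALTER TABLE PATIENT ADD CONSTRAINT pk_patient PRIMARY KEY (pid);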
A simple SQL SELECT query is running forever for a particular ID in SQL Server 2012.
This query is running forever; it should return 10000 rows:
select *
from employees
where company_id = 34
If I change the query to
select *
from employees
where company_id = 12
it returns 7000 rows very quickly.
Employees is a view created by joining different tables.
Could there be a problem in the view?
One possibility is that you have a very large table. Such a query is probably scanning the entire table and returning matching rows as they are encountered.
My guess is that rows for company 12 are encountered before rows for company 34.
If this is the case, then an index on (company_id) should help.
There may be other causes as well. Here are two other possibilities:
Contention on the rows for company_id 34 that is causing delays in reading the data (this would depend on the isolation level you are using and the nature of concurrent updates).
An unlimited-size column that is populated with very big values for company_id 34 and is empty or very small for company_id 12.
There may be other possibilities as well.
One of the things you can do to speed up the process is to create an index on the company_id column; a B-tree index would speed up the search.
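A minimal sketch of such an index. Since employees is a view, the index has to be created on whichever base table actually holds company_id; the table and index names below are hypothetical:

-- replace dbo.employees_base with the base table behind the employees view
CREATE INDEX ix_employees_company_id ON dbo.employees_base (company_id);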
Without looking at the structure of the table and execution plan, here are a few things that can be suggested apart from what Gordon has already covered:
1. Could you create indexes on the underlying tables that cover this query? That would include indexes on the 'searched' and 'sorted' columns (joins, WHERE clause, ORDER BY, GROUP BY, DISTINCT), with the SELECTed columns in the INCLUDE part of the indexes (in the case of a nonclustered rowstore index). The aim is to see an 'index seek' in the execution plan.
2. Update statistics on the underlying tables. (As a side note, I would suggest keeping 'AUTO CREATE' and 'AUTO UPDATE' statistics ON unless you have a reason not to do that automatically in your application.)
3. I would also like to know when defragmentation was last performed on the server. Long-overdue defragmentation could be a very good reason why you see this kind of issue for certain values, especially on a table that takes a lot of write operations.
4. Execute the query again. Even if you do not have proper information about #3 above, you can try to execute the query while skipping step #3.
5. While the query is running, check the wait stats on the server by querying the DMVs sys.dm_os_wait_stats and sys.dm_tran_locks (a small sketch follows this list). Check whether the wait is due to CXPACKET (waits caused by other parallel processes), PAGEIOLATCH (reading from disk rather than RAM), or locks. This is the starting point of the investigation; it will point you to the root cause, and you can then take appropriate measures accordingly.
6. An additional quick check: look at 'Available RAM' in the server's Task Manager. Make sure your SQL Server's RAM is not being used up by other unnecessary applications or sessions.
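A minimal sketch of the DMV checks mentioned in item 5 (requires VIEW SERVER STATE permission):

-- top waits accumulated on the instance since the last restart
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- locks currently held or requested in the current database, to spot blocking
SELECT request_session_id, resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();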
I have two queries that take too long and time out when run inside an Azure website.
1st:
SELECT Value FROM SEN.ValueTable WHERE OptId = @optId
2nd:
INSERT INTO SEN.ValueTable (Value, OptId)
SELECT Value, OptId FROM REF.ValueTable WHERE OptId = @optId
Both SELECTs will always return 7,860 values. The problem is that I run around 10 of these queries with different @optId values. At first I ran them without any indexes, and the 1st query would time out every now and then. I then added a non-clustered index to SEN.ValueTable, and then the 2nd query began to time out.
1st Query from an Azure VM
2nd Query from an Azure Web App
I've tried to increase the timeout through the .config files, but the queries still time out within 30 seconds. (There is no time limit from the customer; retrieving data from the SQL database will not be the slow part of the application anyway.)
Is there any way to speed this up or get rid of the timeouts? Will indexing REF.ValueTable speed up the insert at all?
First, the obvious solution is to add an index to SEN.ValueTable(OptId, Value) and to have no index on REF.ValueTable(OptId, Value). I think this gets around your performance problem.
More importantly, it should not be taking 30 seconds to fetch or insert 7,860 rows -- nothing like that. So, what else is going on? Is there a trigger on REF.ValueTable that might be slowing things down? Are there other constraints? Are the columns particularly wide? I mean, if Value is VARCHAR(MAX) and normally 100 MB, then inserting the values might be an issue.
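A minimal sketch of the first suggestion (the index name is hypothetical):

-- supports the SELECT against SEN.ValueTable filtered on OptId
CREATE NONCLUSTERED INDEX ix_SEN_ValueTable_OptId_Value
ON SEN.ValueTable (OptId, Value);
-- if Value is too wide to be an index key, use (OptId) INCLUDE (Value) instead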
If you really run such a query:
SELECT Value, OptId
FROM REF.ValueTable
WHERE OptId = @optId;
The best index for it would be the following:
CREATE INDEX idx_ValueTable_OptId_Value
ON REF.ValueTable (OptId)
INCLUDE (Value);
Any index will slow inserts down, but will benefit read queries. If you want a more elaborate answer, post more details - table DDLs and execution plans.
Try resumable online index rebuild -
https://azure.microsoft.com/en-us/blog/resumable-online-index-rebuild-is-in-public-preview-for-azure-sql-db/
I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often am running into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from the temporary table (to remove last month's data), e.g. DELETE FROM tempTable
Import the text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
Please help. Any and all suggestions welcome.
Subqueries like the one you give us in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
are only good on one row at a time, so you must be looping. Think set based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
INNER JOIN myUsers u ON t.external_id = u.external_id
and remove your loops; this will update all rows in one statement and be significantly faster!
Needs more information. I regularly manipulate 3-4 million rows in a 150 million row table and I am NOT thinking this is a lot of data. I have a "products" table that contains about 8 million entries - including full text search. No problems either.
Can you just elaborate on your hardware? I assume "normal desktop PC" or "low-end server", both with an absolutely non-optimal disc layout, and thus tons of IO problems - on updates.
Make sure you have indexes on the tables you are selecting from. In your example UPDATE command, you look up user_id in the myUsers table by external_id, so myUsers should have an index on external_id (ideally covering user_id as well). The downside of indexes is that they increase the time for inserts/updates, so make sure you don't have unnecessary indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.
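A minimal sketch of that lookup index (the index name is hypothetical):

-- speeds up the lookup/join on external_id and covers the selected user_id column
CREATE NONCLUSTERED INDEX ix_myUsers_external_id
ON myUsers (external_id)
INCLUDE (user_id);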
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
Look at the KM's answer and don't forget about indexes and primary keys.
Are you indexing your temp table after importing the data?
tempTable.external_id should definitely have an index since it is in the WHERE clause.
There are more efficient ways of importing large blocks of data. Look in SQL Server Books Online under bcp (the bulk copy program).
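As a sketch, the T-SQL counterpart to bcp is BULK INSERT; the file path, delimiters and header row below are hypothetical:

-- load the customer's text file straight into the temp table
BULK INSERT tempTable
FROM 'C:\imports\customer_updates.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);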
I am trying to extract application log file from a single table. The select query statement is pretty straightforward.
select top 200000 *
from dbo.transactionlog
where rowid > 7
  and rowid < 700000
  and Project = 'AmWINS'
The query time for the above select is over 5 minutes. Is that considered long? While the select is running, a bulk insert is also running.
[EDIT]
Actually, I am having a serious problem with my current production logging database.
Basically, we have only one table (transactionlog); all application logs are inserted into this table. For a project like AmWINS, based on a SELECT COUNT result, we have about 800K+ records inserted per day. The insertion of records runs 24 hours a day in the production environment. Users want to extract data from the table when they need to check the transaction logs, so we have to be able to select records out of the table when necessary.
I tried to simulate this in a UAT environment by pumping in the same volume as production, which has grown to 10 million records as of today. While extracting records, I ran a simultaneous bulk insert to make it look like the production environment. It took about 5 minutes just to extract 200k records.
While the extraction was running, I monitored the physical SQL server and the CPU spiked to 95%.
The table has 13 fields and an identity column (rowid) of type bigint; rowid is the PK.
Indexes are created on Date, Project, Module and RefNumber.
The table is created with row locks and page locks enabled.
I am using SQL server 2005.
Hope you guys can give me some professional advice to enlighten me. Thanks.
It may be possible for you to use the "Nolock" table hint, as described here:
Table Hints MSDN
Your SQL would become something like this:
select top 200000 * from dbo.transactionlog with (nolock) ...
This would achieve better performance if you aren't concerned about the complete accuracy of the data returned.
What are you doing with the 200,000 rows? Are you running this over a network? Depending on the width of your table, just getting that amount of data across the network could be the bulk of the time spent.
It depends on your hardware. Pulling 200,000 rows out while data is being inserted requires some serious IO, so unless you have a 30+ disk system, it will be slow.
Also, is your rowID column indexed? This will help with the select, but could slow down the bulk insert.
I am not sure, but doesn't bulk insert in MS SQL lock the whole table?
As ck already said, indexing is important. So make sure you have an appropriate index ready. I would set an index not only on rowId but also on Project. I would also rewrite the where clause to:
WHERE Project = 'AmWINS' AND rowid BETWEEN 8 AND 699999
Reason: I guess Project is more restrictive than rowid and - correct me if I'm wrong - BETWEEN is faster than a < and > comparison.
You could also export this as a local dat or sql file.
No amount of indexing will help here because it's a SELECT * query, so it's most likely a PK scan or a horrendous bookmark lookup.
And the TOP is meaningless because there is no ORDER BY.
The simultaneous insert is probably misleading as far as I can tell, unless the table has only 2 columns and the bulk insert is locking the whole table. With a simple int IDENTITY column, the insert and select may not interfere with each other, either.
Especially if the bulk insert is only a few 1000s of rows (or even 10,000s)
Edit: the TOP and rowid values do not imply a million-plus rows.
Edit: I'm running SQL Server 2008
I have about 400,000 rows in my table. I would like to duplicate these rows until my table has about 160 million rows. I have been using a statement like this:
INSERT INTO [DB].[dbo].[Sales]
([TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate])
SELECT [TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate]
FROM [DB].[dbo].[Sales]
This process is very slow, and I have to re-issue the query a large number of times. Is there a better way to do this?
To do this many inserts you will want to disable all indexes and constraints (including foreign keys) and then run a series of:
INSERT INTO mytable
SELECT fields FROM mytable
If you need to specify the ID, pick some number like 80,000,000 and include ID + 80000000 in the SELECT list. Run it as many times as necessary (no more than 10, since the table should double each time).
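A minimal sketch of that variant, assuming a hypothetical table mytable with an ID column plus two data columns:

-- doubles the row count while keeping IDs unique by shifting them past the existing range
INSERT INTO mytable (ID, field1, field2)
SELECT ID + 80000000, field1, field2
FROM mytable;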
Also, don't run within a transaction. The overhead of doing so over such a huge dataset will be enormous. You'll probably run out of resources (rollback segments or whatever your database uses) anyway.
Then re-enable all the constraints and indexes. This will take a long time, but overall it will be quicker than maintaining the indexes and checking the constraints on a per-row basis.
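A hedged sketch of the disable/re-enable steps against the Sales table from the question (only nonclustered indexes should be disabled -- disabling the clustered index makes the table unreadable; the index name is hypothetical):

-- before the inserts: skip constraint checks and stop maintaining nonclustered indexes
ALTER TABLE [DB].[dbo].[Sales] NOCHECK CONSTRAINT ALL;
ALTER INDEX IX_Sales_LoadDate ON [DB].[dbo].[Sales] DISABLE;  -- hypothetical nonclustered index

-- after the inserts: rebuild the indexes and re-enable the constraints with validation
ALTER INDEX ALL ON [DB].[dbo].[Sales] REBUILD;
ALTER TABLE [DB].[dbo].[Sales] WITH CHECK CHECK CONSTRAINT ALL;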
Since each time you run that command it will double the size of your table, you would only need to run it about 9 times (400,000 * 2^9 = 204,800,000). Yes, it might take a while because copying that much data takes some time.
The speed of the insert will depend on a number of things...the physical disk speed, indexes, etc. I would recommend removing all indexes from the table and adding them back when you're done. If the table is heavily indexed then that should help quite a bit.
You should be able to repeatedly run that query in a loop until the desired number of rows is achieved. Every time you run it you'll double the data, so you'll end up with:
400,000
800,000
1,600,000
3,200,000
6,400,000
12,800,000
25,600,000
51,200,000
102,400,000
204,800,000
After nine executions.
You don't state your SQL database, but most have a bulk loading tool to handle this scenario. Check the docs. If you have to do it with INSERTs, remove all indexes from the table first and reapply them after the data is INSERTed; this will generally be much faster than indexing during insertion.
This may still take a while to run... you might want to turn off logging while you create your data (see the recovery model sketch after the query).
-- the CROSS JOIN against 400 arbitrary rows inserts each existing row 400 times per run
INSERT INTO [DB].[dbo].[Sales] (
    [TotalCost], [SalesAmount], [ETLLoadID],
    [LoadDate], [UpdateDate]
)
SELECT s.[TotalCost], s.[SalesAmount], s.[ETLLoadID],
       s.[LoadDate], s.[UpdateDate]
FROM [DB].[dbo].[Sales] s WITH (NOLOCK)
CROSS JOIN (SELECT TOP 400 totalcost FROM [DB].[dbo].[Sales] WITH (NOLOCK)) o
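As a hedged sketch of the "turn off logging" idea: SQL Server never disables logging entirely, but switching the database to a minimally logged recovery model for the duration of the load can reduce log overhead (the database name is taken from the query above):

-- minimally log bulk operations during the load
ALTER DATABASE [DB] SET RECOVERY BULK_LOGGED;

-- ... run the INSERT ... SELECT above as many times as needed ...

-- restore full logging (and take a full or log backup) when done
ALTER DATABASE [DB] SET RECOVERY FULL;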