SQL Table causes slow running Queries - sql

I have a sql table that has approximately 6,000 records with 17 columns each. If I do a basic search on the table (i.e. Select * from table_Orders)... it takes 1.5 minutes to return all the records! Any query that I run using this table is also very slow. I have reindexed the table so fragmentation is not an issue. This table has 2 nvarchar(max) columns in it that is storing xml data. Returning the table without these columns is extremely fast (less than 1 second). So, i'm guessing it's the xml data that is bogging down the queries. Is there anything I can do to speed up performance of queries that utilize columns with xml in them? Any insight will be greatly appreciated. I don't typically work with xml within sql, so I don't even know where to start.

This sounds like network speed and NOT search times. Even a table scan of 6000 rows only takes a fraction of a second - to search all the rows. Returning those rows to a client though... you're downloading all that data, so you're going to see a difference when you retrieve a lot of it. This has nothing to do with "query performance" and there isn't anything you can do about it unless you can make the network faster or deliver less data.
You can test this by issuing queries searching for a key in your clustered index. Assuming you have a clustered index on RowID...
select RowId, NonXmlColumn where RowId = 3 -- or some other reasonable key
select RowId, XmlColumn where RowId = 3
The search time for those queries will be the same. So, any difference in speed can be attributed to the network.

I would not store that data in a table. I would not store in VARCHAR if I did. Not to sound like a jerk.
SQL Server has a xml datatype: http://msdn.microsoft.com/en-us/library/hh403385.aspx
It says there is limitations. You should see what applies to your scenario.
If you need to preserve the XML - stick it somewhere else and pull from it what ever fields you'll need to search on.

Related

Executing same simple select statement or stored procedure on SQL Azure takes long time or times-out

I have two SQL Server Azure instances with Standard S2: 50 DTUs. When I run simple select statements on two instances, one of them takes more time than other or times out. Slower one have more records in tables in slower instance.
Both the instances have same table schema. Number of records in tables present in slower instances, LogEvidence table have 1324928 and LogItem table have 649391. Number of records in tables present in faster instances, LogEvidence table have 89504 and LogItem table have 89496.
Below is the simple select statement
select count(*) from logitem
Above simple select statement takes 0s on faster instance and on slower instance it takes 138s. And if I execute any stored procedure, slower instance takes more times or times out.
Time taken by both instances should be almost same.
Those simple queries perform big scans on the table and involve reading all rows. If the table has a clustered index you don't have to perform a SELECT COUNT(*) to know the number of records the table has. The following query should to that faster:
SELECT OBJECT_NAME(ps.object_id) , i.name , row_count
FROM sys.dm_db_partition_stats AS ps INNER JOIN sys.indexes AS i
ON ps.index_id = i.index_id AND ps.object_id = i.object_id
WHERE i.name like '%logitem%'
If the table does not have an Id please add an autoid on the table and make it the clustered index.
You can also try to add a useless WHERE clause like below to the query, and you may get a better performance.
SELECT count(*)
FROM logitem
WHERE id > 0
Where Id is the autoid column.
I had some experience with azure, and from your description I think there is one of following things you can do:
Since you are using only count, then indexes play no role. Though I understand other answer says to use where id>0, but azure should count 1M rows without 30 second timeout. But for other queries you need Indexes, or Azure will fail.
Check if your server is not under maintenance, it is low chance but it does happen with us, we are on s4 and occasionally our server just get slow, but after 10-30 minute it works fine. Maybe the actual hardware get in some process that slows it down.
This is most important reason for slow execution, especially if you have lot of write and delete happen on your server. Check the database size. Azure database got fragmented too quickly, we have to optimize it's data fragmentation every 10 days, if your bacpac size is 100MB and your database size in Azure shows like 5-6 GB, then it definitely need optimization as lot of fragments were generated. MSDN has given some queries to recreate indexes and remove fragmentation, I don't remember them URL, but simple google search will bring that. It should speed things up.
Azure has feature that auto generate indexes, check if both table share same indexes, maybe your faster version has some index Azure create by itself.
You should step back and ponder your assumption:
1. "performance should be about the same" - you have more data in one case vs. the other. In the limit, you should expect the performance of the second one to potentially be somewhat slower than the original one.
Now, let's go into the "why" it can be slower and how you can investigate each case:
Step 1: Look at the query plans for each case and see what you have. Likely, you will have something like:
StreamAgg <- Clustered Index Scan
(if you have other b-tree indexes, you might scan one of them and it might be faster since the index would not be as wide and thus the index will have fewer pages to scan)
Step 2: You can look at the actual execution times and resource use for each query to see why they are different. One way to do this is to run "set statistics time on", then "set statistics io on", then run your query. it will dump out extra information into SSMS when you run the query from there. (You can read about this here: https://learn.microsoft.com/en-us/sql/t-sql/statements/set-statistics-io-transact-sql?view=sql-server-2017)
If you review the output from each one, you may find reasons for why the performance is different. One possible explanation is that the amount of memory is limited in an S2 and you are just at the boundary for where all the pages fit in memory vs. not for the two examples. In that case, doing a count(*) query would need to cycle through all the pages and do much more IO than in the smaller case where they might all be in memory already.
Step 3: You can also potentially examine the query store to get insight into why one case is fast and one case is not. An overview of how to use it is here:
https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
Note: it is on-by-default in SQL Azure so you can just go look at the time window when you ran the queries to get insight into what was happening at that time in your database.
Finally, you might consider ways to make the query go faster if you need it to be faster.
* creating a narrow b-tree index on the table may help for that one query (count(*) doesn't return any columns and just needs a count of rows from some non-filtered index).
* you could use a Columnstore (which requires an S3 or above for memory reasons). This kind of column-oriented index is optimized for this kind of query and would be much faster as the size of the table increases in the future.
Hope that help

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used on the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what should I look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivable involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTKEY is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source table was DISTKEY ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the query by reading very little data from disk because it realized that very few columns of data (or perhaps no columns) were required to read from disk to display the count. This really depends upon the contents of that Query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all data that is returned by the query to disk, count query does not, that explains the difference. Writing all rows is more expensive operation compared to reading row count.

Large Denormalized Table Optimization

I have a single large denormalized table that mirrors the make up of a fixed length flat file that is loaded yearly. 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the where clause of the query that is run most against this table. Index Frag is .01. Performance on the query is good, sub second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table. What else can I do to improve performance.
Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
Second, be sure that the lengths of the fields are less than nvarchar(4000)/varchar(8000). Anything larger causes the values to be stored on a separate page, increasing retrieval time.
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.
When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?

Sql Server Paging issue

Friends,
I have already implemented paging in my SP -
with MyData As (
select ROW_NUMBER() over (order by somecolumn desc) AS [Row],
x,y,z,...
)
Select x,y,z,...
From MyData
Where [Row] between ((#currentPage - 1) * #pageSize + 1) and (#currentPage*#pageSize)
The problem here is that data is retried very fast if with clause return smaller number of rows but it takes long time when there are millions of records. Sometimes it times out.
Is there any other alternative?
Thanks for sharing your valuable time.
SQL server optimisation is a very broad subject and it is pretty much impossible to work out the issue with the limited amount of information you have posted. However since you're in a rush for a solution - Firstly I would suggest checking your actual execution plan, post it here, and making sure that the index is actually being used - if this is not the case then consider using the FASTFIRSTROW table hint to force the index to be used - check here and here on how it can improve things and here in how it can make things worse.
Next to consider is SQL parameter sniffing - it's unlikely from what you have said but possible check here for details enter link description here
For large scale performance gains you may need to look at architectural changes at the very least ensure that your transaction logs are on a different disk to your data.The reason you separate the database files from the log files is because database access is random and log access is sequential. Best practice dictates that you don't mix those two I/O types on the same disk
Also if you've got million of rows then you really need to consider splitting the data across multiple disks.
Finally I would strongly consider partioning either the table or the index see here for a start
The reason why your query is slow is that you have sort whole table on every request. To speed it up significantly you need to avoid sorting big chuck of data, at cost of CPU, HDD/Memory or limitations on pagination logic.
As there is not much information about how you table is sorted and if you insert in the middle / delete entries very often, I'll narrow down you question by making these assumptions:
I would imagine you have a table storing an archive of articles. New entries are mostly at the bottom of the table, entries from the middle of the table deleted rarely.
You sort always by the same column somecolumn and in the same order, e.g. descending.
You do not have any user entered filters (like article title or author).
This makes the table static in terms of the output: each article will be on the same place, unless a new one inserted. New one come to the top of your output. Then you can store ROW_NUMBER() OVER () as a column. A more convenient solution will be an IDENTITY column. It will speed up things if you create a clustered index on this column
alter table add [Record_Number] int null IDENTITY
This new column is added as null so you can populate values first time. Then you can make it not null.
On the other hand you can last row number very quickly by
select #Max_Row = SELECT MAX(row_number) from MyTable
Now when you have total number of rows, page size and page number you can select rows you need in one statement without sorting the whole lot.
Select * From MyTable
Where row_number between
(#Max_Row - #Page * #Page_Size) + 1 AND
#Max_Row -(#Page - 1) * #Page_Size
If you do have a filter in your CTE, then give some more information about how your data is structured, so we can think of a way to limit the scope of CTE.

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log', there are huge records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance - you could try changing your SELECT statement to only include columns you are interested in, however it won't change the number of logical reads that you are doing and so I suspect it will only have a neglible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). Its not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it) set up a job to archive that to some back-up, and then remove unused records. That will keep the table size down reducing the amount of time it takes search the table.
Depending on what kinda of SQL database you're using, you might look into Horizaontal Partitioning. Oftentimes, this can be done entirely on the database side of things so you won't need to change your code.
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives to your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
do you need all the columns, people usually do SELECT * because they are too lazy to list 5 columns of the 15 that the table has.
Get more RAM, themore RAM you have the more data can live in cache which is 1000 times faster than reading from disk
For me there are two things that you can do,
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In preagg you would have a "logs" table, "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of logs and logs_temp table is identical. The flow of application would be in this way, all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records are far less as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts would be fast too, because the amount of data is not much in the logs table as it holds the data only for the last hour, so index regeneration on inserts would take very less time as compared to very large data-set hence making the inserts fast.
You can read more about Shadow Table trick here
I employed the pre-aggregation method in a news website built on wordpress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items, and there are like 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 secs to under a second. I intend on making the plugin publically available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values, what order you put them in will affect performance, but assuming you have a small number of possible loglevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just date seems more likely scenario that on just log level).
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
l_ownname VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
l_tabname VARCHAR2(255) := 'log'; -- Table to analyze
l_estimate_percent NUMBER(3) := 5; -- Percentage of rows to estimate (NULL means compute)
BEGIN
dbms_stats.gather_table_stats (
ownname => l_ownname ,
tabname => l_tabname,
estimate_percent => l_estimate_percent,
method_opt => 'FOR ALL INDEXED COLUMNS',
cascade => TRUE
);
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you shoud consider to partition it by range on creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle