Increase Speed In SQL Big Data - sql

I have a table of 50 million records with 17 columns. I want to publish its data into some other tables, and I have built those tables for this.
I wrote a SQL script for this work, but its speed is very low.
The main problem is that before I insert a record into a table, I must check that the record does not already exist in that table.
Of course I have already done some optimization in my code. For example, I replaced a cursor with a WHILE loop, but the speed is still very low.
What can I do to increase the speed and optimize this?

I must check that the record does not already exist in that table.
Let the database do the work via a unique constraint or index. Decide on the columns that cannot be identical and run something like:
create unique index unq_t_col1_col2_col3 on t(col1, col2, col3);
The database will then return an error if you attempt to insert a duplicate.
This is standard functionality and should be available in any database. But, you should tag your question with the database you are using and provide more information about what you mean by a duplicate.
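If the target happens to be SQL Server (an assumption, since the question isn't tagged), the unique index itself can be told to silently discard duplicates instead of raising an error, which removes the need for any pre-insert existence check. A minimal sketch, reusing the index above and a hypothetical staging table:

CREATE UNIQUE INDEX unq_t_col1_col2_col3
    ON t (col1, col2, col3)
    WITH (IGNORE_DUP_KEY = ON);  -- duplicate rows are skipped with a warning instead of failing the insert

INSERT INTO t (col1, col2, col3)
SELECT col1, col2, col3
FROM source_table;  -- hypothetical staging table holding the 50 million source records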

Related

Indexing a table with duplicate records

I have a SQL Server table with around 50,000 rows. The table gets updated once a day by some upstream process.
The following query has been fired from application:
SELECT * FROM Table1 where Field1 = "somevalue"
The "Field1" column contains duplicate values. I am trying to improve performance of the above query. I cannot modify the code in the application side. So limiting column instead of "SELECT *" is not possible. I am planning to index the table. Should I define a NON-CLUSTERED index on "Field1" column in order to improve performance? Or some other kind of indexing would help? Is there any other ways to improve performance from DB side ?
Yes, a non-clustered index on Field1 should serve your purposes...
For example,
CREATE NONCLUSTERED INDEX Idx_Table1_Field1 ON Table1 (Field1)
The best thing you can do is run SP_BlitzIndex by Brent Ozar to get a better picture of your entire database index setup (including this table).
http://www.brentozar.com/blitzindex/
If your table already has a clustered index (which it should - apply one following these principles), you should first look at the execution plan to see what it is advocating.
Further, if the table is only updated once a day, and presumably during off hours, you can easily compress the table; given that its data is mostly repetitive, you will save over 50% of the IO and space on the query and incur only a small CPU overhead. Table compression has no effect on the data itself, only on the space it occupies. This feature is only available in SQL Server Enterprise.
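If compression does turn out to be appropriate, a sketch of enabling page compression might look like this (row compression is the lighter alternative):

-- Rebuild the table with page-level compression
ALTER TABLE Table1 REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Compress the non-clustered index as well, if one was created
ALTER INDEX Idx_Table1_Field1 ON Table1 REBUILD WITH (DATA_COMPRESSION = PAGE);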
Last but not least, are your data types properly set? E.g. are you pulling from a datetime column when it could easily be a date, or from a bigint when it could easily be an int?
Asking a question about how to create an index really isn't a proper question for Stack, e.g.
CREATE NONCLUSTERED INDEX Idx_Table1_Field1 ON Table1 (Field1)
As that is already documented on MSDN, and an index can even be created in SSMS by right-clicking the Indexes node under a given table, the question you should be asking is how to properly address performance improvements in your environment related to indexing. Finally, analyze whether or not your end query result really necessitates a SELECT * - this is a common oversight in data display: a table with 30 columns is selected into a dataset when the developer only plans on showing 5 of them, so populating only those 5 columns would be roughly a 6x IO saving.
Please also note the famous index maintenance script by Ola Hallengren.

I want to create a copy of a table, fewer columns than original... but still updated from it

I have a 30 GB table, which has 30-40 columns. I create reports using this table and it causes performance problems. I use just 4-5 columns of this table for the reports, so I want to create a second table for the reports. But the second table must be updated when the original table is changed, without using triggers.
No matter what my query is, when it is executed SQL Server tries to cache all 30 GB. When the cache is full, it starts to use disk. This is what I actually want to avoid.
How can I do this?
Is there a way of doing this using SSIS?
Thanks in advance.
CREATE VIEW myView
AS
SELECT
    column1,
    column3,
    column4 * column7 AS column4x7  -- a computed column in a view needs an explicit name
FROM
    yourTable
A view is effectively just a stored query, like a macro. You can then select from that view as if it were a normal table.
Unless you go for materialised views, it's not really a table, it's just a query. So it won't speed anything up, but it does encapsulate code and assist in controlling what data different users/logins can read.
If you are using SQL Server, what you want is an indexed view. Create a view using the columns you want and then place an index on it.
An indexed view stores the data in the view. It should keep the view up-to-date with the underlying table, and it should reduce the I/O for reading the table. Note: this assumes that your 4-5 columns are much narrower than the overall table.
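A minimal sketch of such an indexed view, assuming the source table is dbo.yourTable with a unique id column and that the hypothetical column names below stand in for your 4-5 report columns:

-- SCHEMABINDING is required before a view can be indexed
CREATE VIEW dbo.ReportView
WITH SCHEMABINDING
AS
SELECT id, column1, column3, column4
FROM dbo.yourTable;
GO

-- The unique clustered index is what materialises the view's data on disk
CREATE UNIQUE CLUSTERED INDEX IX_ReportView ON dbo.ReportView (id);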
Dems' answer with the view seems ideal, but if you are truly looking for a new table, create it and have it automatically updated with triggers.
Triggers placed on the primary table can be added for all Insert, Update and Delete actions upon it. When the action happens, the trigger fires and can be used to do additional work, such as updating your new secondary table. You will pull from the Inserted and Deleted tables (MSDN).
There are many great existing articles here on triggers:
Article 1, Article 2, Google Search
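As a rough sketch of just the insert case (the table and column names here are placeholders, and matching UPDATE and DELETE triggers would be needed for a complete solution):

CREATE TRIGGER trg_BigTable_Insert
ON dbo.BigTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy only the narrow set of report columns from the rows just inserted
    INSERT INTO dbo.ReportTable (id, column1, column3, column4)
    SELECT id, column1, column3, column4
    FROM inserted;
END;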
You can create that second table just like you're thinking, and use triggers to update table 2 whenever table 1 is updated.
However, triggers present performance problems of their own; the speed of your inserts and updates will suffer. I would recommend looking for more conventional alternatives to improve query performance; it sounds like you are on SQL Server, since you mentioned SSIS.
Since it's only 4-5 out of 30 columns, have you tried adding an index which covers the query? I'm not sure if there are even more columns in your WHERE clause, but you should try that first. A covering index would actually do exactly what you're describing, since the table would never need to be touched by the query. Of course, this does cost a little in terms of space and insert/update performance. There's always a tradeoff.
On top of that, I can't believe that you would need to pull a large percentage of rows for any given report out of a 30 gb table. It's simply too much data for a report to have. A filtered index can improve query performance even more by only indexing the rows that are most likely to be asked for. If you have a report which lists the results for the past calendar month, you could add a condition to only index the rows WHERE report_date > '5/1/2012' for example.
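A sketch combining both ideas - the table name, column names and cut-off date are placeholders for whatever the report actually filters on:

-- The INCLUDE list carries the 4-5 report columns so the base table never has to be touched,
-- and the WHERE clause keeps the index small by only indexing recent rows
CREATE NONCLUSTERED INDEX IX_BigTable_Reports
ON dbo.BigTable (report_date)
INCLUDE (column1, column3, column4)
WHERE report_date > '20120501';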

how to optimize sql server table for faster response?

I found that a table has 50 thousand records and it takes one minute to fetch the data from the SQL Server table just by issuing a SQL statement. There is one primary key, which means a clustered index is already there. I just do not understand why it takes one minute. Besides an index, what other ways are there to optimize a table to get the data faster? In this situation, what do I need to do for a faster response? Also, tell me how we can always write optimized SQL. Please tell me all the steps for optimization in detail.
thanks.
The fastest way to optimize the indexes on a table is to use the SQL Server Database Engine Tuning Advisor. Take a look here: http://www.youtube.com/watch?v=gjT8wL92mqE
Select only the columns you need, rather than select *. If your table has some large columns e.g. OLE types or other binary data (maybe used for storing images etc) then you may be transferring vastly more data off disk and over the network than you need.
As others have said, an index is no help to you when you are selecting all rows (no where clause). Using an index would be slower in such cases because of the index read and table lookup for each row, vs full table scan.
If you are running select * from employee (as per question comment) then no amount of indexing will help you. It's an "Every column for every row" query: there is no magic for this.
Usually, adding a WHERE clause won't help a SELECT * query either.
What you can check is index and statistics maintenance. Do you do any? Here's a Google search
Or change how you use the data...
Edit:
Why a WHERE clause usually won't help...
If you add a WHERE clause that is not on the PK:
- you'll still need to scan the table unless you add an index on the searched column
- then you'll need a key/bookmark lookup for each match unless you make the index covering
- with SELECT *, you would need to add all columns to the index to make it covering
- for many hits, the index will probably be ignored anyway to avoid key/bookmark lookups.
Unless there is a network issue or similar, the problem is reading all columns, not the lack of a WHERE clause.
If you did SELECT col13 FROM MyTable and had an index on col13, the index will probably be used.
With SELECT * FROM MyTable WHERE DateCol < '20090101' and an index on DateCol, if the predicate matched 40% of the table the index would probably be ignored, or you'd have expensive key/bookmark lookups.
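As a quick illustration of the difference, using hypothetical names:

-- A narrow query can be answered entirely from a small index on the selected column
CREATE NONCLUSTERED INDEX IX_MyTable_col13 ON MyTable (col13);

SELECT col13 FROM MyTable;  -- index scan: reads only the col13 index pages
SELECT * FROM MyTable;      -- table/clustered index scan: reads every column of every row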
Irrespective of the merits of returning the whole table to your application, that does sound like an unexpectedly long time to retrieve just 50,000 rows of employee data.
Does your query have an ORDER BY or is it literally just select * from employee?
What is the definition of the employee table? Does it contain any particularly wide columns? Are you storing binary data such as their CVs or employee photo in it?
How are you issuing the SQL and retrieving the results?
What isolation level are your select statements running at? (You can use SQL Profiler to check this.)
Are you encountering blocking? Does adding NOLOCK to the query speed things up dramatically?
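A quick way to test the blocking theory (keeping in mind that NOLOCK permits dirty reads, so treat it as a diagnostic rather than a fix):

-- If this returns dramatically faster, the original query is probably being blocked by other sessions
SELECT * FROM employee WITH (NOLOCK);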

Fast check of existence of an entry in a SQL database

Nitpicker Question:
I'd like to have a function returning a boolean to check whether a table has an entry or not. And I need to call this a lot, so some optimizing is needed.
I use MySQL for now, but it should be fairly basic...
So should I use
select id from table where a=b limit 1;
or
select count(*) as cnt from table where a=b;
or something completly different?
I think the SELECT with LIMIT should stop after the first find, while count(*) needs to check all entries, so the SELECT could be faster.
The simplest thing would be to run a few loops and test it, but my tests were not helpful. (My test system seemed to be in use by other things too, which diluted my results.)
this "need" is often indicative of a situation where you are trying to INSERT or UPDATE. the two most common situations are bulk loading/updating of rows, or hit counting.
checking for existence of a row first can be avoided using the INSERT ... ON DUPLICATE KEY UPDATE statement. for a hit counter, just a single statement is needed. for bulk loading, load the data in to a temporary table, then use INSERT ... ON DUPLICATE KEY UPDATE using the temp table as the source.
but if you can't use this, then the fastest way will be select id from table where a=b limit 1; along with force index to make sure mysql looks ONLY at the index.
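For instance, the hit-counter case might look like this - a sketch assuming a table hits with a unique key on page_id:

-- Inserts a new counter row, or bumps the existing one, in a single statement
INSERT INTO hits (page_id, hit_count)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE hit_count = hit_count + 1;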
The LIMIT 1 will tell MySQL to stop searching after it finds one row. If there can be multiple rows that match the criteria, this is faster than count(*).
There are more ways to optimize this, but the exact nature would depend on the amount of rows and the spread of a and b. I'd go with the "where a=b" approach until you actually encounter performance issues. Databases are often so fast that most queries are no performance issue at all.
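If you specifically want a boolean back, one further option (not covered above, shown only as a sketch with placeholder names) is to wrap the probe in EXISTS, which also stops at the first matching row:

-- Returns 1 if at least one matching row exists, 0 otherwise
SELECT EXISTS(SELECT 1 FROM my_table WHERE a = 'b') AS has_row;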

faster way to use sets in MySQL

I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..
There are roughly 7 million rows in the table. Currently, the table is being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
In the actual query, over 560 user_ids are currently in the IN clause. With several million records in the table, this query is slow!
There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.
I know that SELECT * is A Bad Thing and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. There are another 50 of those being returned, but they are needed for the report.
I imagine there's a much better way to access the data for the user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?
I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.
EDIT
Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
does that help?
First, check if there is an index on USER_ID and make sure it's used.
You can do it with running EXPLAIN.
Second, create a temporary table and use it in a JOIN:
CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL, PRIMARY KEY (user_id));
-- populate it with the ids you need, e.g.:
-- INSERT INTO temptable (user_id) VALUES (32343), (45676), (12345), ...;

SELECT c.*
FROM temptable t
JOIN customers c
  ON c.user_id = t.user_id;
Third, how many rows does your query return?
If it returns almost all rows, then it just will be slow, since it will have to pump all these millions over the connection channel, to begin with.
NULL will not slow your query down, since the IN condition only satisfies non-NULL values which are indexed.
Update:
The index is used, the plan is fine except that it returns more than half a million rows.
Do you really need to put all these 638,000 rows into the report?
Hope it's not printed: bad for rainforests, global warming and stuff.
Speaking seriously, you seem to need either aggregation or pagination on your query.
"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).
First things first - does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too huge to reasonably fit into RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what's holding things up.
I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10 etc.) and at least two spindles. If you're not, go away and get that.
You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
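A rough sketch of what re-clustering on user_id could look like in InnoDB - user_id would first have to be made NOT NULL (so any existing NULLs need handling beforehand), and record_id stays in the key to keep it unique; treat this as illustrative only:

ALTER TABLE customers
    MODIFY user_id INT NOT NULL,        -- primary key columns cannot allow NULL
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id);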
Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.
BUT my biggest tips are:
Have a goal in mind, work out what it is, and when you reach it, stop.
Don't take anybody's word for it - try it and see
Ensure that your performance test system is the same hardware spec as production
Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
Use synthetic data if it is not possible to use production data (copying production data may be logistically difficult - remember your database is > 32 GB - and it may also violate security policies).
If your query is optimal (as it probably already is), try tuning the schema, then the database itself.
Is this your most important query? Is this a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still must make random disk reads to retrieve the columns (key lookups), even after finding the records that match (index seek on the user_Id index).
If you cannot change the clustered index, then you might want to consider an ETL process (simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.
Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.
Are they the same ~560 ids every time, or a different ~560 ids on different runs of the query?
You could just insert your 560 UserIDs into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.
You can try inserting the ids you need to query on into a temp table and inner joining both tables. I don't know if that would help.