Best Practise for SQL Replication / Load Managing - sql

I'm currently running an Ubuntu Server with Mariadb on it. It serves all sql requests for a Website (with a good amount of requests on it).
Few times a day we import large CSV files into the Database to update our Data. The Problem is, since those csv blast the db (import takes around 15 Minutes).
It seems to be using only 1 Core from 4 but still, the website (or better its sql requests during that time) get ridiculously slow. Now the question for me is what can I do here to affect the Website less to none?
I was considering Database replication to a different server, but I'm expecting that to use the same amount of ressources during the import time so no real benefit here I guess?
The other thing I considered is to have 2 SQL Databases, and during an import all requests should be switched to the other Database Server and I would basicly do each import twice, one on Server 1 (during that time Server 2 should serve the site) once thats done, the website has to be swiched to Server 1 and the import is done on Server 2. While that would work, it seems to be quite an effort for a non perfect solution (like how are requests handled during the switch from Server 1 to 2 and so on.
So what solutions exist here, preferably somewhat affordable.
All ideas and hints are welcome.
Thanks in advance
Best Regards
Menax

Is the import replacing an entire table? If so, load it into a separate table, then swap it into place. Essentially zero downtime, namely during the RENAME TABLE. For details, see http://mysql.rjweb.org/doc.php/deletebig or possibly http://mysql.rjweb.org/doc.php/staging_table
If the import is doing something else, please provide details.
One connection uses one core, no more.
More (from Comments)
SELECT id, marchants_id
from products
WHERE links LIKE '%https://´www.merchant.com/productsite_5'
limit 1
That is hard to optimize because of the leading wildcard in the LIKE. Is that really what you need? As it stands, that query must scan the table.
SELECT id, price
from price_history
WHERE product_id = 5
order by id desc
limit 1
That would benefit from INDEX(product_id, id, price) -- in that order. With that index, the query will be as close to instantaneous as possible.
Please provide the rest of the transaction with the Update and Insert, plus SHOW CREATE TABLE. There is quite possibly a way to "batch" the actions rather than doing one product price at a time. This may speed it up 10-fold.
Flip-flopping between two servers -- only if the data is readonly. If you are otherwise modifying the tables, that would be a big nightmare.
For completely replacing a table,....
CREATE a new TABLE
Populate it
RENAME TABLE to swap the new table into place.
(But I still don't understand your processing well enough to say if this is best. When you say "switch the live database", are you referring to a server (computer), a DATABASE (schema), or just one TABLE?

Related

MS SQL Server Query caching

One of my projects has a very large database on which I can't edit indexes etc., have to work as it is.
What I saw when testing some queries that I will be running on their database via a service that I am writing in .net. Is that they are quite slow when ran the first time?
What they used to do before is - they have 2 main (large) tables that are used mostly. They showed me that they open SQL Server Management Studio and run a
SELECT *
FROM table1
JOIN table2
a query that takes around 5 minutes to run the first time, but then takes about 30 seconds if you run it again without closing SQL Server Management Studio. What they do is they keep open SQL Server Management Studio 24/7 so that when one of their programs executes queries that are related to these 2 tables (which seems to be almost all queries ran by their program) in order to have the 30 seconds run time instead of the 5 minutes.
This happens because I assume the 2 tables get cached and then there are no (or close to none) disk reads.
Is this a good idea to have a service which then runs a query to cache these 2 tables every now and then? Or is there a better solution to this, given the fact that I can't edit indexes or split the tables, etc.?
Edit:
Sorry just I was possibly unclear, the DB hopefully has indexes already, just I am not allowed to edit them or anything.
Edit 2:
Query plan
This could be a candidate for an indexed view (if you can persuade your DBA to create it!), something like:
CREATE VIEW transhead_transdata
WITH SCHEMABINDING
AS
SELECT
<columns of interest>
FROM
transhead th
JOIN transdata td
ON th.GID = td.HeadGID;
GO
CREATE UNIQUE CLUSTERED INDEX transjoined_uci ON transhead_transdata (<something unique>);
This will "precompute" the JOIN (and keep it in sync as transhead and transdata change).
You can't create indexes? This is your biggest problem regarding performance. A better solution would be to create the proper indexes and address any performance by checking wait stats, resource contention, etc... I'd start with Brent Ozar's blog and open source tools, and move forward from there.
Keeping SSMS open doesn't prevent the plan cache from being cleared. I would start with a few links.
Understanding the query plan cache
Check your current plan cache
Understanding why the cache would clear (memory constraint, too many plans (can't hold them all), Index Rebuild operation, etc. Brent talks about this in this answer
How to clear it manually
Aside from that... that query is suspect. I wouldn't expect your application to use those results. That is, I wouldn't expect you to load every row and column from two tables into your application every time it was called. Understand that a different query on those same tables, like selecting less columns, adding a predicate, etc could and likely would cause SQL Server to generate a new query plan that was more optimized. The current query, without predicates and selecting every column... and no indexes as you stated, would simply do two table scans. Any increase in performance going forward wouldn't be because the plan was cached, but because the data was stored in memory and subsequent reads wouldn't experience physical reads. i.e. it is reading from memory versus disk.
There's a lot more that could be said, but I'll stop here.
You might also consider putting this query into a stored procedure which can then be scheduled to run at a regular interval through SQL Agent that will keep the required pages cached.
Thanks to both #scsimon #Branko Dimitrijevic for their answers I think they were really useful and the one that guided me in the right direction.
In the end it turns out that the 2 biggest issues were hardware resources (RAM, no SSD), and Auto Close feature that was set to True.
Other fixes that I have made (writing it here for anyone else that tries to improve):
A helper service tool will rearrange(defragment) indexes once every
week and will rebuild them once a month.
Create a view which has all the columns from the 2 tables in question - to eliminate JOIN cost.
Advised that a DBA can probably help with better tables/indexes
Advised to improve server hardware...
Will accept #Branko Dimitrijevic 's answer as I can't accept both

Is it possible to Cache the result set of a select query in the database?

I am trying to optimize the search query which is the most used in our system. So far I have added some missing indexes and that has helped slightly. But I want to further reduce the load on the db server. One option that I will use is caching the result set as a LIST in the asp.net Cache so that I don't have to hit the db often.
However, I was wondering if there is a way to Cache some portions of the select query at the db as well. e.g. for the search results we consider only users who have been active in the last 180 days and who have share-info set as true. So this is like a super set which the db processes everytime and then applies other conditions such as category specified, city etc. which are passed. Is it possible to somehow Cache the Super Set so that I can run queries against the super set rather than run the query against the whole table? Will creating a View help in this? I am a bit hesitant to create a view as I read managing views can be an overhead and takes away some flexibility to modfy the tables.
I am using Sql-Server 2005 so cannot create a filtered index on the table, which I think would have been helpful.
I agree with #Neville K. SQL Server is pretty smart at caching data in memory. You might see limited / no performance gains for your effort.
You could consider indexed views (Enterprise Edition only) http://technet.microsoft.com/en-us/library/cc917715.aspx for your sub-query.
It is, of course, possible to do this - but I'm not sure if it will help.
You can create a scheduled job - once a night, perhaps - which populates a table called "active_users_with_share_info" by truncating it, and then repopulating it based on a select query filtering out users active in the last 180 days with "share_info = true".
Then you can join your search query to this table.
However, I doubt this would do much good - SQL Server is pretty smart at caching. Unless you're dealing with huge volumes of data (100 of millions of records), or very limited hardware, I doubt you'd get any measurable performance improvements - but by all means try it!
Of course, the price for this would be more moving parts in your application, more interesting failure modes (what happens if the overnight batch fails silently?), and more training for any new developers you bring into the team.

Select query too slow > 5min

I have a tableMyTable with 29,000 rows.
MyTable structure {
StudentId bigint,
....
}
Number of columns > 10 columns. The database in the hosting server.
From SSMS i execute the query:
SELECT *
FROM MyTable
Is it normal that the execution lasts more than 5 min?
First of all, retrieving all the data from a remote database is never a good idea. You are using an important share of bandwidth. Hopefully, the query you are using is only used for debugging purpose and should never hit production.
You did not mention if it took 5 minutes before you started receiving something or if you are receiving your data over the course of that 5 minutes, at a constant rate.
In the first situation, not receiving rows at all might indicating a that a lock is effective on your table, due to another operation.
In the latter situation, you are constantly receiving rows, but at a slow rate. Bandwidth and server load play a big part in that. To get you a rough idea of the amount of data that you are downloading, run this stored procedure:
EXEC sp_spaceused 'YourTableName';
Consider that the server has to upload that data and that you have to download the data.
Binary and xml fields (also called BLOB field) usually take a lot of data and you may not be able to control the amount of data stored by the user in those field.
Try checking the size of your variable length fields (varchar, xml and varbinary) by running a DATALENGTH on your column:
SELECT DATALENGTH(MyField) FROM MyTable
You can also get an average:
SELECT AVG(DATALENGTH(MyField)) FROM MyTable
A good idea concerning BLOB field is to retrieve them only when needer and not when you are loading a list of data.
For example, assume a XML field stored in a PurchaseOrder table. If you wish to display the list of PO to your user, you usually don't need to retrieve that field, unless the user open the PO.
Many recent ORM, like nHibernate, offers lazy loading for columns, along with paging so you can retrieve a small amount of row.
Ayende posted a rent about loading unbounded result set two weeks ago.
You're right - the select query shouldn't take that long. It's not the number of rows. Likely it's the type of data you've got on that table/view, and perhaps the storage configuration (slow disk, filegroups config, etc).
Some ideas to consider to remedy this performance problem:
be specific in the columns that you want to retrieve. For ad-hoc queries, SELECT * is fine, but recognize that the RDBMS will work slightly harder to determine which columns are on the table/view.
gathering the values any columns of datatype text, varbinary will take proportionally longer depending on the data within those fields.
consider the indexes (do you have any?) on the table/view?
is this a Prod database, where more/other activity might be hitting this table?
If you edit your question, perhaps include the full table definition so that we can get a real look at what's happening with the datatypes.
I would recommend that you consider OMG Ponies's recommendation - it could be due to the bandwidth between the box and your machine, so
try to remote the box and see how long the query takes on that machine.
If it takes almost same amount of time, then the problem lies either in the database design or underlying hardware, or other factors (table datatypes, wrong indexes, overall load on the machine, overall hits to this table, etc)
if it takes significantly less amount of time, then the problem is surely between your machine and the box - ideally this shouldn't be a big problem, because the web server will be closer to the db server, probably on same LAN (so it should be much faster in the real world). Also, I'm sure you wouldn't use a 'Select *' in the actual app to pick 29000 rows, so it shouldn't create a lot of problem.

SQL -> MongoDB Export Performance Issues

I'm trying to set up an automated process to regularly transform and export a large MS SQL 2008 database to MongoDB.
There is not a 1-1 correspondence between tables in SQL and collections in MongoDB -- for example the Address table in SQL is translated into an array embedded in each customer's record in Mongo and so on.
Right now I have a 3 step process:
Export all the relevant portions of the database to XML using a FOR XML query.
Translate XML to mongoimport-friendly JSON using XSLT
Import to mongo using mongoimport
The bottleneck right now seems to be #2. XML->JSON conversion for 3 million customer records (each with demographic info and embedded address and order arrays) takes hours with libxslt.
It seems hard to believe that there's not already some pre-built way to do this, but I can't seem to find one anywhere.
Questions:
A) Are there any pre-existing utilities I could use to do this?
B) If no, is there a way I could speed up my process?
C) Am I approaching the whole problem the wrong way?
Another approach is to go through each table and add information to mongo on a record by record basis and let Mongo do the denormalizing! For instance to add each phone number, just go through the phone number table and do a '$addToSet' for each phone number to the record.
You can also do this in parallel and do tables separately. This may speed things up but may 'fragment' the mongo database more.
You may want to add any required indexes before you start, otherwise adding the indexes at the end may be a large delay.

Simulating queries of large views for benchmarking purposes

A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data and is just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third option tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
For data for testing CRM type tables, I highly recommend fakenamegenerator.com, you can get 40,000 fake names for free.
You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.
If you want results you can rely on you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out if you get results that aren't good enough with the other options you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
To populate the rows, perhaps something like this would do:
WHILE (SELECT count(1) FROM benchmark) < 100000
INSERT INTO benchmark
SELECT TOP 100000 * FROM actualData