I am running some ETL on my Azure SQL DW at DW500,
so I have 20 concurrency slots available.
Some of my queries require the xlargerc resource class (RC), some largerc, etc.,
so the expected load can vary from query to query.
Is there any option to control the assigned RC in the query directly,
e.g. using OPTION or any other hints?
The only workaround I could find so far is to create separate users with different resource classes assigned, which is not really feasible.
thanks in advance,
-gerhard
There is currently no option to control this at query level. You have to be logged in as the appropriate user with the appropriate resource class (smallrc, mediumrc, largerc, and xlargerc) assigned to them.
DWU500 is pretty low, with a maximum of 20 concurrent queries and only 20 concurrency slots. Remember that an xlargerc user takes 16 of those slots, as per the documentation, so you could only have one other mediumrc user or four smallrc users running at the same time. In other words, you could not have one largerc and one xlargerc user running at the same time; those queries would queue.
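To make the slot arithmetic concrete, here is a minimal sketch in Python. The xlargerc value of 16 and the total of 20 come from the figures above; the smallrc, mediumrc and largerc values (1, 4, 8) are my assumption for DW500, and the admission logic is a simplification, not the service's actual scheduler.

```python
# Concurrency slots consumed per query by each resource class at DW500.
# xlargerc = 16 and the 20-slot total come from the answer above;
# the other three values are assumptions for illustration.
SLOTS_AT_DW500 = {"smallrc": 1, "mediumrc": 4, "largerc": 8, "xlargerc": 16}
TOTAL_SLOTS = 20


def admit(queries, total_slots=TOTAL_SLOTS, slots=SLOTS_AT_DW500):
    """Split queries (resource class names, in arrival order) into running and queued.

    Simplification: a query that does not fit is queued, and later, smaller
    queries may still be admitted. The real scheduler may behave differently.
    """
    free = total_slots
    running, queued = [], []
    for rc in queries:
        need = slots[rc]
        if need <= free:
            running.append(rc)
            free -= need
        else:
            queued.append(rc)
    return running, queued, free


if __name__ == "__main__":
    for workload in (
        ["xlargerc", "mediumrc"],
        ["xlargerc", "smallrc", "smallrc", "smallrc", "smallrc"],
        ["xlargerc", "largerc"],
    ):
        print(workload, "->", admit(workload))
```

With these numbers, the xlargerc plus largerc workload needs 24 slots, so the largerc query ends up in the queued list.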
Can you tell us a bit more about your scenario? For example, why switch users during ETL? What ETL tool are you using, e.g. SSIS, Azure Data Factory, etc.?
If you think this is a worthwhile option, consider making a feedback request.
Currently, I am encountering an issue with suspended queries in Azure Synapse when executing stored procedure calls from ADF.
I followed the suggestions in the link below to troubleshoot the issue:
Deleted due to sensitive information
The troubleshooting queries returned the results below:
I checked whether a transaction lock was the issue: I killed a few suspended or running queries that had been running for more than 15 hours. I also checked the rest of the running queries, but nothing there would cause a transaction lock. I tried running the stored procedure (the one that is blocked, as mentioned above) manually from Azure Data Studio, and it took 40 seconds to complete.
The suspended query from ADF, however, took nearly an hour to finish.
Any suggestions for troubleshooting this issue are much appreciated.
Thanks
There are a number of factors you must always consider when tuning queries in Azure Synapse Analytics dedicated SQL pools:
DWU - what DWU is your pool at? Lower DWUs mean fewer concurrent users and lower performance, and should not be used for any kind of performance tuning. Crank it up temporarily to rule this out as a problem, bearing in mind that changing it disconnects any active queries. Also bear in mind that not all queries respond to higher DWU.
Resource class - what resource class is associated with the user executing these queries? Remember the default is smallrc, and the admin user always has smallrc. Understand static and dynamic resource classes. The DMV sys.dm_pdw_exec_requests will give you useful information on this. Experiment with your workload to find the sweet spot between performance and concurrency for each resource class. Encourage your dev team to use labels in their queries: OPTION ( LABEL = 'some informative label' )
Table geometry - this is the distribution (ROUND_ROBIN|HASH|REPLICATE) of your table and the indexing choice (CLUSTERED COLUMNSTORE|CLUSTERED INDEX|HEAP). Clustered columnstore and round robin are the defaults, but they are not always appropriate. Consider what is appropriate for your tables.
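As a rough illustration of the label advice, the following Python sketch builds a labelled statement and a lookup query against sys.dm_pdw_exec_requests. The helper names, the table dbo.FactSales and the label text are hypothetical; the helpers only build SQL text, so running it is up to whatever client you already use.

```python
def with_label(sql: str, label: str) -> str:
    """Append OPTION (LABEL = '...') to a statement that has no OPTION clause yet."""
    escaped = label.replace("'", "''")  # double single quotes for a T-SQL literal
    return f"{sql.rstrip().rstrip(';')}\nOPTION (LABEL = '{escaped}');"


def requests_for_label_sql(label: str) -> str:
    """Build a query that lists requests carrying the given label."""
    escaped = label.replace("'", "''")
    return (
        "SELECT request_id, status, resource_class, submit_time, total_elapsed_time\n"
        "FROM sys.dm_pdw_exec_requests\n"
        f"WHERE [label] = '{escaped}'\n"
        "ORDER BY submit_time DESC;"
    )


if __name__ == "__main__":
    # dbo.FactSales and the label text are made-up examples.
    print(with_label("SELECT COUNT(*) FROM dbo.FactSales;", "nightly-etl: row count"))
    print(requests_for_label_sql("nightly-etl: row count"))
```

The resource_class column in the second query shows which class each labelled request actually ran under, which is handy when comparing runs under different users.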
If you work through those and still have an issue, you can start to look at statistics and workload classification, but gathering information on the points above should give you a good idea.
If you are just doing single value INSERTs, then don't. Dedicated SQL pools are terrible with these. Convert these to load from a file in a single INSERT / COPY INTO.
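A minimal sketch of that conversion in Python, assuming the rows are already in memory: write them to a single CSV payload, stage the file somewhere the pool can read, and load it with one COPY INTO. The table name and URL are placeholders, the upload step is left out, and authentication options are omitted, so treat it as an outline rather than a complete loader.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Serialize all rows into one CSV payload instead of one INSERT per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def copy_into_sql(table, file_url):
    """Build a COPY INTO statement that loads the staged file in one operation."""
    return (
        f"COPY INTO {table}\n"
        f"FROM '{file_url}'\n"
        "WITH (FILE_TYPE = 'CSV', FIRSTROW = 2);"  # FIRSTROW = 2 skips the header row
    )


if __name__ == "__main__":
    payload = rows_to_csv([(1, "widget"), (2, "gadget")], ["id", "name"])
    print(payload)
    # Placeholder table and URL; upload the payload to this location first.
    print(copy_into_sql("dbo.Product", "https://example.invalid/staging/product.csv"))
```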
As I understand it, BigQuery's caching mechanism is on a per user basis. But we'd like to be able to share the cache on something like a project/dataset/table level.
For example, John & Mary both work on the same Google project XYZ. They love using BigQuery, and both query the table Bar in dataset Foo i.e. XYZ:Foo.Bar to get beautiful insights from their data.
John logs in and writes a query against XYZ:Foo.Bar which takes 10 seconds to execute. A few minutes later Mary logs in and composes the exact same query on XYZ:Foo.Bar. It also takes 10 seconds, but she does not get a cache hit.
Is there anything that can be done to share the query cache across users i.e. on a project/dataset/table level? Or have I missed something obvious?
BigQuery doesn't share cache across users for privacy reasons - but it could be an interesting feature request to propose: https://code.google.com/p/google-bigquery/.
An alternative you could implement today is a proxy that would connect to BigQuery on behalf of your users with a service account. For example, you get the BigQuery native cache and an application level cache when using http://demo.redash.io. Same with Datalab - as it uses a service account by default, results are cached for users in the same project.
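The application-level half of such a proxy could look roughly like this in Python. The `run_query` callable is hypothetical and stands in for whatever client the proxy uses to reach BigQuery with its service account; the one-hour TTL and the whitespace normalization are arbitrary choices for the sketch.

```python
import hashlib
import time


class SharedQueryCache:
    """Application-level cache shared by every user who goes through the proxy."""

    def __init__(self, run_query, ttl_seconds=3600.0, clock=time.monotonic):
        self._run_query = run_query  # hypothetical callable: SQL text -> result
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # cache key -> (expiry time, result)

    @staticmethod
    def _key(sql):
        # Collapse whitespace so trivially different spellings share an entry.
        # Note: this also collapses whitespace inside string literals.
        normalized = " ".join(sql.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def query(self, sql):
        key = self._key(sql)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        result = self._run_query(sql)
        self._entries[key] = (now + self._ttl, result)
        return result


if __name__ == "__main__":
    calls = []

    def fake_backend(sql):  # stands in for the real warehouse client
        calls.append(sql)
        return [("row",)]

    cache = SharedQueryCache(fake_backend)
    cache.query("SELECT COUNT(*) FROM Foo.Bar")    # John: runs on the backend
    cache.query("SELECT  COUNT(*)  FROM Foo.Bar")  # Mary: served from the cache
    print("backend calls:", len(calls))
```

Keep in mind that a shared cache like this hands one user's results to another, so it only makes sense when everyone behind the proxy is allowed to see the same data.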
I am working on an ASP.NET MVC web application; the back end is SQL Server 2012.
This application will provide billing, accounting, and inventory management. Users will create an account by signing up, just like on http://www.quickbooks.in. Each user will create some masters and various transactions. There is no limit; a user can create an unlimited number of records in the database.
I want to keep database performance stable under heavy data load. I am maintaining proper indexing and primary keys, but there will be a heavy load on the database per user.
So, should I create a separate database for each user, or should I maintain one database, adding a UserID column to each table and partitioning on UserID?
I am not an expert in SQL Server, so please provide suggestions with clear specifications.
Please inform me if there is any lack of information.
A DB per user is what happens when customers need to be able to pack up and leave, taking the actual database with them. Think of a self-hosted WordPress website. Or when there are incredible risks to one user accidentally seeing another user's data, so it's safer to rely on the server's security model than to rely on remembering to add the UserId filter to all your queries. I can't imagine a scenario like that, but who knows: maybe if the privacy laws allowed for jail time, I would rather have data partitioned by security rules than by carefully written WHERE clauses.
If you did go database-per-user, creating a new user will be 10x more effort. While INSERT, UPDATE and so on stay the same from version to version, the syntax for database creation, user creation, permission granting and so on can evolve enough to break those scripts with each SQL Server version upgrade.
Also, this will multiply your migration headaches by the number of users. Let's say you have 5000 users and you need to add some new columns, change a column's data type, update a trigger, and so on. Instead of needing to run that change script once, you need to run it 5000 times.
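To give a feel for what "run it 5000 times" means in practice, here is a small Python sketch that applies one change script to every tenant database and reports which ones failed. It uses in-memory SQLite databases as stand-ins for per-user SQL Server databases, and the table and column names are made up.

```python
import sqlite3


def migrate_all(connections, script):
    """Apply one change script to every tenant database; return the names that failed."""
    failed = []
    for name, conn in connections.items():
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(script)
        except sqlite3.Error:
            failed.append(name)  # this tenant did not get the change
    return failed


if __name__ == "__main__":
    # Three in-memory databases stand in for per-user databases.
    tenants = {f"user_{i}": sqlite3.connect(":memory:") for i in range(3)}
    for conn in tenants.values():
        conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, total REAL)")
    # Simulate one database that has drifted from the others.
    tenants["user_1"].execute("ALTER TABLE invoice ADD COLUMN currency TEXT")

    print(migrate_all(tenants, "ALTER TABLE invoice ADD COLUMN currency TEXT"))
```

The loop itself is trivial; the headache is the list it returns, because every failed tenant has to be investigated and fixed individually.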
Per-user DBs also probably waste disk space. Each of those databases is going to have a transaction log, sitting idle and taking up the minimum log space.
As for load, if collectively your 5000 users are doing 1 billion inserts, updates and so on per day, my intuition tells me that it's going to be faster on one database, unless there is some sort of contention issue (everyone reading and writing to the same table at the same time, and the same pages of the same table). Each database uses machine resources (probably threads and memory) for housekeeping, so these extra DBs can't be free.
Anyhow, the best thing to do is to simulate the two architectures and use a random data generator to simulate load and see how they perform.
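A skeleton for that kind of experiment might look like the Python sketch below. It uses in-memory SQLite purely so the harness is runnable anywhere; the timings it prints say nothing about SQL Server, so you would swap in real connections and a realistic schema before drawing conclusions. The table name, row counts and seed are arbitrary.

```python
import random
import sqlite3
import time


def random_rows(user_ids, rows_per_user, seed=0):
    """Generate (user_id, amount) rows in shuffled order to mimic interleaved users."""
    rng = random.Random(seed)
    rows = [(u, rng.uniform(1, 1000)) for u in user_ids for _ in range(rows_per_user)]
    rng.shuffle(rows)
    return rows


def load_shared(rows):
    """Architecture A: one database, UserID column on the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE txn (user_id INTEGER, amount REAL)")
    conn.execute("CREATE INDEX ix_txn_user ON txn (user_id)")
    for row in rows:
        conn.execute("INSERT INTO txn (user_id, amount) VALUES (?, ?)", row)
    conn.commit()
    return conn


def load_per_user(rows):
    """Architecture B: one database per user, no UserID column."""
    conns = {}
    for user_id, amount in rows:
        if user_id not in conns:
            conns[user_id] = sqlite3.connect(":memory:")
            conns[user_id].execute("CREATE TABLE txn (amount REAL)")
        conns[user_id].execute("INSERT INTO txn (amount) VALUES (?)", (amount,))
    for conn in conns.values():
        conn.commit()
    return conns


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    rows = random_rows(range(50), 200)
    _, shared_s = timed(load_shared, rows)
    _, per_user_s = timed(load_per_user, rows)
    print(f"shared: {shared_s:.3f}s  per-user: {per_user_s:.3f}s")
```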
It's not an easy answer to give.
First, there is logical design to be considered. Then you have integrity, security, management and performance (in that order).
A database is a logical unit of data, self contained. Ideally, you should be able to take a database, move it to another instance, probably change the connection strings and be running again.
All the constraints are database-level. No foreign keys can exist referencing some object outside the database.
So, try thinking in these terms first.
How would you reliably prevent one user from messing up another user's data? Keep in mind that it's just a matter of time before someone opens an Excel sheet and fires off queries against the database, bypassing your application. Row-level security in SQL Server is something you don't want to deal with.
Multiple databases mean that all management tasks should be scripted out and executed on all databases. Yes, there is some overhead to it, but once you set it up it's just a matter of monitoring. If a database goes suspect, it's a single customer down, not all of them. You can even have different versions for different customers if each customer has its own database. Additionally, if you roll out an upgrade, you can do it per customer, so the impact will be much smaller.
Performance is the least relevant factor here. Of course, it really depends on how many customers and how much data, but proper indexing will solve these issues. Scale-out is much easier with multiple databases.
BTW, partitioning, as you mentioned it, is never a performance booster; it's simply a management feature, allowing for faster loading and evicting of data from a table.
I'd probably put each customer in separate database, but it's up to you eventually to make a decision for yourself. Hope I've helped some with this.
I have 10 reports and 250 customers. All the reports are run by my customers, and each report takes parameters. Depending on the parameters, the same report connects to a different database and gets its results. I know that with different parameters caching is not an option, but I don't want to run these reports on live data during the daytime. Is there anything I can do (snapshot, subscription) that can run overnight and either send these reports or save a snapshot that could be used for the next 24 hours?
Thanks in advance.
As M Fredrickson suggests, subscriptions might work here depending on the number of different reports to be sent.
Another approach is to consolidate your data query into a single shared dataset. Shared datasets can have caching enabled, and there are several options for refreshing that cache, such as on first access or on a timed schedule. See MSDN for more details.
The challenge with a cached dataset is to figure out how to remove all parameters from the actual data query by moving them elsewhere, usually the dataset filter in the report, or into the filters of the individual data elements, such as your tablixes.
I use this approach to refresh a 10 minute query overnight, and then return the report all day long in less than 30 seconds, with many different possible parameters filtering the dataset.
You can also mix this approach with others by using multiple datasets in your report, some cached and some not.
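The idea of moving parameters out of the query and into filters can be sketched in a few lines of Python. This is only an analogy for what the report server does, not SSRS code: `load_dataset` stands in for the slow, parameter-free query that is refreshed overnight, and the rows are invented sample data.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def load_dataset():
    """Stand-in for the slow, parameter-free query; it runs once and is then cached."""
    return (
        {"customer": "A", "region": "North", "total": 120.0},
        {"customer": "A", "region": "South", "total": 80.0},
        {"customer": "B", "region": "North", "total": 45.5},
    )


def report(customer=None, region=None):
    """Apply the report parameters as filters on the cached rows, not in the query."""
    return [
        row
        for row in load_dataset()
        if (customer is None or row["customer"] == customer)
        and (region is None or row["region"] == region)
    ]


if __name__ == "__main__":
    print(report(customer="A"))
    print(report(region="North"))
    print(load_dataset.cache_info())  # shows how often the slow query actually ran
```

The trade-off is that the cached query has to return every row any parameter combination might need, so it works best when the unfiltered result set is of a manageable size.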
I would suggest going the route of subscriptions. While you could do some fancy hack to get multiple snapshots of a single report, it would be cleaner to use subscriptions.
However, since you've got 250 customers, and 10 different reports, I doubt that you'll want to configure and manage 2,500 different subscriptions within Report Manager... so I would suggest that you create a data driven subscription for each of the reports.
We're building a Silverlight application which will be offered as SaaS. The end product is a Silverlight client that connects to a WCF service. As the number of clients is potentially large, updating needs to be easy, preferably so that all instances can be updated in one go.
Not having implemented multi tenancy before, I'm looking for opinions on how to achieve
Easy upgrades
Data security
Scalability
Three different models to consider are listed on MSDN:
Separate databases. This is not easy to maintain as all schema changes will have to be applied to each customer's database individually. Are there other drawbacks? A pro is data separation and security. This also allows for slight modifications per customer (which might be more hassle than it's worth!)
Shared Database, Shared Schema. A TenantID column is added to each table. Ensuring that each customer gets the correct data is potentially dangerous. Easy to maintain and scales well (?).
Shared Database, Separate Schemas. Similar to the first model, but each customer has its own set of tables in the database. Hard to restore backups for a single customer. Maintainability otherwise similar to model 1 (?).
Any recommendations on articles on the subject? Has anybody explored something similar with a Silverlight SaaS app? What do I need to consider on the client side?
It depends on the type of application and the scale of the data. Each one has drawbacks.
1a) Separate databases + single instance of WCF/client. Keeping everything in sync will be a challenge. How do you upgrade X number of DB servers at the same time? What if one fails and is now out of sync and not compatible with the client/WCF layer?
1b) "Silos", separate DB/WCF/Client for each customer. You don't have the sync issue but you do have the overhead of managing many different instances of each layer. Also you will have to look at SQL licensing, I can't remember if separate instances of SQL are licensed separately ($$$). Even if you can install as many instances as you want, the overhead of multiple instances will not be trivial after a certain point.
3) Basically same issues as 1a/b except for licensing.
2) Best upgrade/management scenario. You are right that maintaining data isolation is a huge concern (1a technically shares this issue at a higher level). The other issue is that if your application is data intensive, you have to worry about data scalability, for example if every customer is expected to have tens or hundreds of millions of rows of data. Then you will start to run into issues with query performance for individual customers due to the total volume across the customer base. Clients are more forgiving of slowdowns caused by their own data volume. Being told it's slow because the other 99 clients' data is large is generally a no-go.
Unless you know for a fact you will be dealing with huge data volumes from the start I would probably go with #2 for now, and begin looking at clustering or moving to 1a/b setup if needed in the future.
We also have a SaaS product and we use solution #2 (Shared DB / Shared Schema with TenantId). Some things to consider for Shared DB / Shared Schema for all:
As mentioned above, a high volume of data for one tenant may affect the performance of the other tenants if you're not careful; for starters, index your tables properly and carefully, and never ever run queries that force a table scan. Monitor query performance, and at least plan and design to be able to partition your DB later on, based on some criteria that make sense for your domain.
Data separation is very, very important; you don't want to end up showing one tenant a piece of data that belongs to another tenant. Every query must have a WHERE TenantId = ... in it, and you should be able to verify/enforce this during dev.
Extensibility of the schema is something that solutions 1 and 3 may give you, but you can work around it by designing a way to extend the fields associated with the documents/tables in your domain where that makes sense (i.e. metadata for tables, as the MSDN article mentions).
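One possible way to enforce the TenantId rule during development is to route all data access through a thin wrapper that rejects unfiltered statements. The Python sketch below uses SQLite and a purely textual check, so it is a development-time guard rather than a security boundary; the class, table and column names are hypothetical.

```python
import sqlite3


class TenantScopedDb:
    """Wrapper that refuses any statement that does not bind the tenant id."""

    def __init__(self, conn, tenant_id):
        self._conn = conn
        self._tenant_id = tenant_id

    def query(self, sql, params=None):
        # Crude dev-time check: the statement must use the :tenant_id parameter.
        if ":tenant_id" not in sql:
            raise ValueError("query is missing a TenantId filter")
        # The wrapper supplies the tenant id itself, so callers cannot override it.
        bound = {**(params or {}), "tenant_id": self._tenant_id}
        return self._conn.execute(sql, bound).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (tenant_id INTEGER, title TEXT)")
    conn.executemany(
        "INSERT INTO document VALUES (?, ?)",
        [(1, "invoice-1"), (2, "invoice-2")],
    )
    db = TenantScopedDb(conn, tenant_id=1)
    print(db.query("SELECT title FROM document WHERE tenant_id = :tenant_id"))
    try:
        db.query("SELECT title FROM document")
    except ValueError as exc:
        print("rejected:", exc)
```

A check like this catches the forgotten WHERE clause, but not a statement that mentions the parameter without really filtering on it, so it complements code review and tests rather than replacing them.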
What about solutions that provide an out of the box architecture like Apprenda's SaaSGrid? They let you make database decisions at deploy and maintenance time and not at design time. It seems they actively transform and manage the data layer, as well as provide an upgrade engine.
I have a similar case, but my solution takes advantage of both approaches.
Where the data is placed, and how, is the tenant's question. As a tenant, of course I don't want my data to be shared; I want my data isolated and secure, and I want to be able to get it any time I want.
Certain data can be shared, e.g. a company list. So there should be a global database plus tenant databases; just make sure to lock down the tenant database schema in operation, and have a procedure to update all tenant databases at once.
Anyway, in the SaaS model everything is delivered as a server/web service, so no matter where the database is, the data should come to the client as a service and only be rendered by the client GUI.
Thanks
Existing answers are good. You should look deeply into the issue of upgrading and managing multiple databases. Without knowing the specific app, it might turn out easier to have multiple databases and not have to pay the extra cost of tracking the TenantID. This might not end up being the right decision, but you should certainly be wary of the dev cost of data sharing.