Sharing a table row between users in SQL Server Azure

Context: a mobile note-taking application connected to Windows Azure Mobile Services (SQL Server Azure).
Currently I have 2 tables: Users & Notes.
A user downloads their notes by querying the Notes table for all notes whose userID matches their own.
Example:
SELECT * FROM Notes WHERE userID = myID;
But now I want my users to be able to share notes between them, so...
I'm thinking of adding "SharedList" and "SharedListMember" tables, where each note will have a shared list whose members are stored in the SharedListMember child table.
Example:
SELECT DISTINCT n.*
FROM Notes n
LEFT OUTER JOIN SharedList l ON n.list = l.id
LEFT OUTER JOIN SharedListMember lm ON l.id = lm.list
WHERE lm.memberID = myID OR n.userID = myID;
I have used LEFT OUTER JOINs because not all notes will be shared.
I would be adding indexes on Notes.list, SharedList.id (Primary Key), SharedListMember.memberID, and SharedListMember.list.
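In case it helps, the rough shape I have in mind is something like this (just a sketch; the column types are my assumptions):
-- Sketch of the tables and indexes described above
CREATE TABLE SharedList (
    id INT IDENTITY(1,1) PRIMARY KEY          -- one list per shared note
);
CREATE TABLE SharedListMember (
    list INT NOT NULL REFERENCES SharedList (id),
    memberID INT NOT NULL,                    -- the user the note is shared with
    PRIMARY KEY (list, memberID)              -- doubles as the index on SharedListMember.list
);
-- Notes.list points at SharedList.id and is NULL when the note is not shared
CREATE INDEX IX_Notes_list ON Notes (list);
CREATE INDEX IX_SharedListMember_memberID ON SharedListMember (memberID);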
Questions:
How much performance impact can I expect with this setup? Is there a faster way?
I currently query about 1,000 notes in less than a second. What would happen if I had 10 million notes?

You will likely notice no impact at 10 million notes with this SQL query.
Your bottlenecks will be bandwidth back to your app (especially if your notes ever contain attachments) and the latency of the SQL calls to the database, so cache locally if you can and make async calls where practical in your application.
This is a case of not over-optimizing a solution that isn't causing a problem. SQL Azure is highly optimized; I have millions of rows in some of my tables, and queries far more complicated than the one you have shown return in less than a second.

Related

SQL: Insert Into Table 1 From Table 2 then Update Table 2 - Performance Increase

I am working to increase the speed and performance of a database process that I have inherited. Before this process runs, a utility uploads about a million or more records into an Upload Table. That part is pretty quick, but things start to slow down once we start adding/updating/moving items from the Upload Table into other tables in the database.
I have read a few articles stating that NOT EXISTS may be quicker than SELECT DISTINCT, so I was thinking about refactoring the following code to use it (I have sketched that version after the two queries below), but I was also wondering whether there is a way to combine these two queries to increase the speed.
The Upload Table contains many columns; I am only showing the Product portion, but there are also Store columns (the same number of columns as Product) and many other details that are not a one-to-one relationship between tables.
The first query inserts the product into the Product table if it does not already exist; the next step then updates the Upload Table with Product IDs for all the records in the Upload Table.
INSERT INTO Product (ProductCode, ProductDescription, ProductCodeQualifier, UnitOfMeasure)
SELECT DISTINCT
ut.ProductCode, ut.ProductDescription, ut.ProductCodeQualifier, ut.UnitOfMeasure
FROM
Upload_Table ut
LEFT JOIN
Product p ON (ut.ProductCode = p.ProductCode)
AND (ut.ProductDescription = p.ProductDescription)
AND (ut.ProductCodeQualifier = p.ProductCodeQualifier)
AND (ut.UnitOfMeasure = p.UnitOfMeasure)
WHERE
p.Id IS NULL
AND ut.UploadId = 123456;
UPDATE Upload_Table
SET ProductId = Product.Id
FROM Upload_Table
INNER JOIN Product ON Upload_Table.ProductCode = Product.ProductCode
AND Upload_Table.ProductDescription = Product.ProductDescription
AND Upload_Table.ProductCodeQualifier = Product.ProductCodeQualifier
AND Upload_Table.UnitOfMeasure = Product.UnitOfMeasure
WHERE (Upload_Table.UploadId = 123456)
Any help or suggestions would be greatly appreciated. I am decent with my understanding of SQL but I am not an expert.
Thanks!
I have not yet tried to make any changes to this part, as I am trying to find the best approach for speed increases and a better understanding of how this process can be improved.
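For what it's worth, the NOT EXISTS version of the insert I was considering (untested) looks roughly like this:
INSERT INTO Product (ProductCode, ProductDescription, ProductCodeQualifier, UnitOfMeasure)
SELECT DISTINCT
    ut.ProductCode, ut.ProductDescription, ut.ProductCodeQualifier, ut.UnitOfMeasure
FROM Upload_Table ut
WHERE ut.UploadId = 123456
  AND NOT EXISTS (SELECT 1
                  FROM Product p
                  WHERE p.ProductCode = ut.ProductCode
                    AND p.ProductDescription = ut.ProductDescription
                    AND p.ProductCodeQualifier = ut.ProductCodeQualifier
                    AND p.UnitOfMeasure = ut.UnitOfMeasure);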
Recommendations:
You can disable triggers, foreign keys, constraints, and indexes before inserting and updating, and re-enable them all afterwards, because indexes, triggers, and foreign keys hurt insert (and update) performance badly.
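For example, around the load that might look something like this (a sketch only; the index and constraint names here are hypothetical):
-- Disable a nonclustered index and a foreign key before the load (hypothetical names)
ALTER INDEX IX_Product_ProductCode ON Product DISABLE;
ALTER TABLE Upload_Table NOCHECK CONSTRAINT FK_UploadTable_Product;
-- ... run the insert and update here ...
-- Re-enable afterwards
ALTER TABLE Upload_Table WITH CHECK CHECK CONSTRAINT FK_UploadTable_Product;
ALTER INDEX IX_Product_ProductCode ON Product REBUILD;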
It is not recommended to use auto-commit transaction mode during the update or insert process. This gives very bad performance, because in auto-commit mode a transaction is committed automatically after every inserted record. For best performance, I recommend committing only after every 1,000 (or 10,000) inserted records.
If you can, run this process (insert or update) periodically, multiple times a day; you could also drive it with triggers. I don't know your business logic, so this variant may not suit you.
And don't forget to analyze the execution plans for your queries.
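As a rough sketch of the batching idea for your UPDATE step (the 10,000-row batch size and the assumption that ProductId starts out NULL are mine; adjust for your tables):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;
    UPDATE TOP (10000) ut
    SET ut.ProductId = p.Id
    FROM Upload_Table ut
    INNER JOIN Product p ON ut.ProductCode = p.ProductCode
                        AND ut.ProductDescription = p.ProductDescription
                        AND ut.ProductCodeQualifier = p.ProductCodeQualifier
                        AND ut.UnitOfMeasure = p.UnitOfMeasure
    WHERE ut.UploadId = 123456
      AND ut.ProductId IS NULL;        -- only touch rows not yet updated
    SET @rows = @@ROWCOUNT;            -- capture before COMMIT resets it
    COMMIT TRANSACTION;
END;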

Google BigQuery Query exceeded resource limits

I'm setting up a crude data warehouse for my company and I've successfully pulled contact, company, deal and association data from our CRM into bigquery, but when I join these together into a master table for analysis via our BI platform, I continually get the error:
Query exceeded resource limits. This query used 22602 CPU seconds but would charge only 40M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 22602 CPU seconds were used, and this query must use less than 10200 CPU seconds.
As such, I'm looking to optimise my query. I've already removed all GROUP BY and ORDER BY clauses, and have tried using WHERE clauses to do additional filtering, but this seems illogical to me as it would add processing demands.
My current query is:
SELECT
coy.company_id,
cont.contact_id,
deals.deal_id,
{another 52 fields}
FROM `{contacts}` AS cont
LEFT JOIN `{assoc-contact}` AS ac
ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
ON CAST(ac.from_id AS int64) = coy.company_id
LEFT JOIN `{assoc-deal}` AS ad
ON coy.company_id = CAST(ad.from_id AS int64)
LEFT JOIN `{deals}` AS deals
ON ad.to_id = deals.deal_id;
FYI, {assoc-contact} and {assoc-deal} are separate views I created from the associations table to make it easier to associate those tables with the companies table.
It should also be noted that this query has occasionally run successfully, so I know it works; it just fails about 90% of the time because the query is so big.
TLDR;
Check your join keys. 99% of the time the cause of the problem is a combinatoric explosion.
I can't know for sure since I don't have access to the data in the underlying tables, but I will give a general resolution method which, in my experience, has found the root cause every time.
Long Answer
Investigation method
Say you are joining two tables
SELECT
cols
FROM L
JOIN R ON L.c1 = R.c1 AND L.c2 = R.c2
and you run into this error. The first thing you should do is check for duplicates in both tables.
SELECT
c1, c2, COUNT(1) as nb
FROM L
GROUP BY c1, c2
ORDER BY nb DESC
And the same thing for each table involved in a join.
I bet you will find that your join keys are duplicated. BigQuery is very scalable, so in my experience this error happens when you have a join key that repeats more than 100,000 times in both tables. That means that after your join, you will have 100,000^2 = 10 billion rows!
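If that is what you find here, one possible fix (a sketch based on the query in the question, assuming the association views only need to contribute from_id and to_id to the join) is to collapse each association view to distinct key pairs before joining:
WITH ac AS (
  SELECT DISTINCT from_id, to_id FROM `{assoc-contact}`
),
ad AS (
  SELECT DISTINCT from_id, to_id FROM `{assoc-deal}`
)
SELECT
  coy.company_id,
  cont.contact_id,
  deals.deal_id
  -- {other fields}
FROM `{contacts}` AS cont
LEFT JOIN ac ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy ON CAST(ac.from_id AS INT64) = coy.company_id
LEFT JOIN ad ON coy.company_id = CAST(ad.from_id AS INT64)
LEFT JOIN `{deals}` AS deals ON ad.to_id = deals.deal_id;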
Why BigQuery gives this error
In my experience, this error message means that your query does too much computation compared to the size of your inputs.
No wonder you're getting this if you end up with 10 billion rows after joining tables with a few million rows each.
BigQuery's on-demand pricing model is based on the amount of data read from your tables. This means that people could try to abuse it by, say, running CPU-intensive computations while reading small datasets. To give an extreme example, imagine someone writes a JavaScript UDF to mine bitcoin and runs it on BigQuery:
SELECT MINE_BITCOIN_UDF()
The query will be billed $0 because it doesn't read anything, but will consume hours of Google's CPU. Of course they had to do something about this.
So this ratio exists to make sure that users don't do anything sketchy by using hours of CPU while processing a few MB of input.
Other MPP platforms with a different pricing model (e.g. Azure Synapse, which charges based on the amount of bytes processed, not read like BQ) would perhaps have run without complaining, and then billed you 10 TB for that 40 MB table.
P.S.: Sorry for the late and long answer, it's probably too late for the person who asked, but hopefully it will help whoever runs into that error.

Strategies for optimizing views referencing views in SQL Server?

Update:
Out of respect for your time, I am adding indexes to the tables that the subviews call from. I will come back and edit this once I have improved things as much as I can, to minimize complexity and be more specific in my request for help. I can also delete and rewrite this post if that is preferred.
The bottleneck is the subview. The estimated plan shows most of the work is a Hash Match between the tables and the subview. link to query plan
I understand that the predicates and the join columns should be indexed. What I'm not sure about are ideal strategies for the subviews.
Should I:
Convert the subview to a table-valued function? I heard from an expert that this is an ideal solution, but they didn't cover why. I don't know whether the indexed columns from the subview carry into the main view (see the sketch after the main view below for what I have in mind).
Or do I need to convert the main view into a table-valued function too, to take advantage of the indexes?
Or maybe I'm way off and don't need to convert to a table-valued function at all?
Main view:
SELECT *
FROM table1 WITH (INDEX(IX_table1))
INNER JOIN table2 WITH (INDEX(IX_table2)) ON table1.field1 = table2.field1
AND table1.field2 = table2.field2
LEFT JOIN SubView AS SubView1 WITH (NOLOCK) ON table1.field1 = SubView1.field1
AND table1.field2 = SubView1.field2
AND table2.field3 = SubView1.field3
AND table2.field4 = SubView1.field4
WHERE table1.PredicateDate >= dynamicDate
AND table1.field1 IN (3, 4)
AND table1.field5 = 0
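To make the first option concrete, what I was imagining is an inline table-valued function along these lines (a sketch only; dbo.SubViewSourceTable stands in for whatever the subview actually reads):
CREATE FUNCTION dbo.fnSubView1 ()
RETURNS TABLE
AS
RETURN
(
    SELECT field1, field2, field3, field4    -- plus the other columns the subview returns
    FROM dbo.SubViewSourceTable              -- placeholder for the subview's source tables
);
The main view would then use LEFT JOIN dbo.fnSubView1() AS SubView1 in place of the subview. If the function instead took the date as a parameter so the filter could be pushed down, the main view would presumably have to become a function as well, which I assume is what my second question is getting at.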
Apologies . . . I put this in the answer section instead of the comment section. But after trying to fix it, it didn't matter . . . I don't have enough reputation to add a comment.
This is for tables on a Microsoft ERP system. Microsoft has their default indexes on tables that shouldn't be changed or deleted. On any ERP upgrades, the indexes get recreated by Microsoft anyways.
The tables for most of the reporting are order history headers (8 million records) and lines (57 million records). These tables get populated when an order is transferred to invoice or an invoice is posted. In the first situation, the order goes to the history table and an invoice is created in the open table. In the second situation, an invoice is moved to the history table when it is posted. For these processes, the ERP system has a thick client (that hasn't changed much since 2010 or earlier). The process is rather long, touches many tables, and does not use an explicit SQL transaction. If this process is interrupted, a manual fix-up is required for any tables that were not updated.
The READONLY/READUNCOMMITTED approach was initially used for large reporting against the live database. The views that Vinh is using run against a replication server that is now in place. READONLY is normally used for information from previous months/days, so current-day changes are not a problem. The large reports were slowing down the transfer and posting processes discussed in the previous paragraph. The posting process currently takes about 1 hour to post 500 transactions, so it is good any time we can keep the process from slowing down.
Why a specific index is specified: the 57 million rows are divided into order types (SOPTYPE 2 (order), 3 (invoice), and 5 (backorder)). Most of the Microsoft indexes use SOPTYPE as the first field in the index, so most of the queries end up using an index scan rather than an index seek. In some cases, just specifying the index reduces the query time from 2 minutes to 5 seconds. When comparing the index scores, both indexes may be at 80%, but SQL tends to choose the index with SOPTYPE as the first index field.
We are probably one of the larger data users of the particular ERP system. I don't believe Microsoft has optimized for this data size.
I hope this information helps.
It is taking a while to optimize the subviews due to dependencies outside my control. We're not able to delete the post, so I'm closing out this question for now.

Optimize calls to a commonly called, expensive query

I have a view in my database which returns the last-updated value for a number of tables. This is to prevent those tables from being queried directly for changes by the application; the application runs in a multi-user environment, and these tables may be updated frequently in short bursts, then ignored for hours at a time.
I have a view called vwLastUpdated:
CREATE VIEW vwLastUpdated as
SELECT Tasks, Items, ListItems FROM
(Select Max(ModifiedTime) as Tasks from tblTasks) a CROSS JOIN
(Select Max(ModifiedTime) as Items from tblItems) b CROSS JOIN
(Select Max(ModifiedTime) as ListItems from tblListItem) c
Clients are configured to call this view around every 10-30 seconds (user configurable). The trouble is, when there are a lot of clients (around 80 at one site), the view gets hit very, very frequently; it usually takes a few milliseconds to run, but sometimes takes 200-300 ms when updates are occurring, and this seems to slow down the front end during heavy use. The tables are properly indexed on ModifiedTime DESC.
Some sites are using SQL Express; other sites have the full version of SQL Server, and there I can design the view differently and use Agent to update a common table (tblLastUpdated) directly, essentially running the above query every 5 seconds.
What could I do to make the process more efficient and reduce the load on the database server where SQL Express is used?
The client sites are on a minimum of SQL Server 2008 (up to SQL 2012)
Do you have an index on the following fields?
tblTasks(ModifiedTime)
tblItems(ModifiedTime)
tblListItem(ModifiedTime)
This should ensure pretty good performance.
If you do, and there is still the problem of interacting locks, then you might consider having another table with this information. If you do updates/inserts directly on the tables, this would require triggers. If you wrap updates and inserts in stored procedures, then you can do the changes there.
This would basically be turning the view into a table and updating that table whenever the data changes. That update should be very fast and have minimal interaction with other queries.
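A minimal sketch of that approach, assuming you can add triggers to the three tables (the trigger name and column types here are guesses; tblLastUpdated is the name you mentioned):
-- One-row table that replaces the view
CREATE TABLE tblLastUpdated (
    Tasks DATETIME NULL,
    Items DATETIME NULL,
    ListItems DATETIME NULL
);
INSERT INTO tblLastUpdated (Tasks, Items, ListItems) VALUES (NULL, NULL, NULL);
GO
-- Keep the Tasks column current; repeat the same pattern for tblItems and tblListItem
CREATE TRIGGER trgTasksLastUpdated ON tblTasks
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE tblLastUpdated
    SET Tasks = (SELECT MAX(ModifiedTime) FROM inserted);
END;
Clients would then read SELECT Tasks, Items, ListItems FROM tblLastUpdated instead of hitting the view.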

How to stop a running BigQuery query

Is there any way to cancel a running query?
I use the web interface. First I ran a series of tests on tables of 10k and then 20k rows, and the responses came back in seconds. But then I ran the triple-join query on a table of 100k rows, and it seems endless after thousands of seconds.
I just wanted to run some tests before moving all the work to BigQuery, but now I'm afraid it's going to use up the whole monthly 100 GB free limit and more.
The table is a simple set of key-value pairs of integer values.
The shell command bq cancel job_id will do this now. You can get the job_id from the Query History tab in the BigQuery console. If you started the query via the CLI, it will have logged the job_id to the standard output.
There isn't currently a way to stop a running query, either via the API or the UI. You may be able to close the query builder (via the 'x' in the top right of the UI) and re-open it to make the UI responsive again. We're currently working on this feature in the UI.
It is surprising that the query would take so long, even for a join, on tables of that size, unless your join is on non-unique keys and is therefore spending time generating the cross-product of matching keys. For example:
SELECT t1.foo
FROM (SELECT 1 as one, foo FROM table1) t1
JOIN (SELECT 1 as one, bar FROM table2) t2
ON t1.one = t2.one
Would generate n x m rows, where n is the number of rows in table1 and m is the number of rows in table2. Is there any chance your query is doing something similar? If not, can you send the query? (Maybe in another SO question, related to slow join performance.)
We didn't find a way to stop jobs while using the Java API, and as far as I know you can't stop a job from the web interface.