What is the best approach to fetch a flag if objects are connected? - sql

Suppose we have two entities/tables - Users and Games (could be anything instead), and a user can mark multiple games as a favourite. So we also have a user_favourite_game (user_id, game_id) table.
Then suppose a user is fetching a list of all available games, and some of them should have the "favourite" flag = true (pagination is used, so we'll assume 20 games are fetched each time). I see two approaches here:
We can make one request that populates the "favourite" field, e.g.
SELECT
    g.*,
    ufg.game_id IS NOT NULL AS favourite
FROM
    games g
    LEFT JOIN user_favourite_game ufg
        ON ufg.user_id = :userId AND g.id = ufg.game_id
ORDER BY
    g.id;
We can select the games and then perform 20 additional requests, one per game, to check whether each game is one of the user's favourites.
Which approach is better to use and why? Any other ideas?
On the last project we used the second approach because of the complexity of the computations required for each entity. It was a lot more complicated than the example above and close to impossible to calculate inside a single query.
But in general, it seems to me that in such simple cases a single query with a JOIN should run faster than 20 simple queries. Although, I'm not sure how it will behave once there is a lot of data in the user_favourite_game table.

Use the database for what it's designed to do and have it give you the results as part of your original query.
The time your DB will spend performing the outer join on the user favorite game table will likely be less than the network overhead of 20 separate requests for the favorite flag.
Make sure the tables are indexed appropriately as they grow and have accurate statistics.
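For example, a composite index that matches the join predicate keeps the lookup cheap even as the favourites table grows (a minimal sketch; the index name is arbitrary):

-- covers the ON clause (user_id, game_id), so the LEFT JOIN becomes an index lookup
CREATE INDEX idx_ufg_user_game
    ON user_favourite_game (user_id, game_id);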
This isn't a hard and fast rule, and actual performance testing should guide you, but I have observed plenty of applications that were harmed by network chattiness. If your round-trip cost for each request is 250ms, your 20 calls will be very expensive. If your round-trip cost is 1ms, people might never notice.

Firing 20 queries (irrespective of how simple they are) will always slow your application down. The factors include network cost, query parsing and execution, etc.
You should fire one query to get the page of available games and then a second query to get the list of the user's "favourite" games, passing the ids of the games present in that page. Then set/unset the flag by looping over the result. This way you make only 2 DB calls, which improves performance significantly.
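A minimal sketch of that two-query approach, assuming the page size of 20 from the question, LIMIT/OFFSET-style pagination, and placeholder parameters:

-- query 1: fetch one page of games
SELECT g.*
FROM games g
ORDER BY g.id
LIMIT 20 OFFSET :offset;

-- query 2: which of those games has the user marked as favourite?
SELECT ufg.game_id
FROM user_favourite_game ufg
WHERE ufg.user_id = :userId
  AND ufg.game_id IN (:idsFromPage);

-- the application then sets favourite = true for the returned game_ids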

Related

How can I efficiently roll up stats on a huge dataset?

This is likely my ignorance of basic data science, but here goes...
I have a massive database of events--many billions. I have up to 5 or 6 filters users can select, and whatever filters the user selects become the WHERE clause of my SQL query. In a status display I need to show a couple of simple computed stats over this data, with the filters applied (average, %, that kind of thing--easy computations).
My problem is this:
If I do it the straightforward way, with SQL such as
select a,b,c from Events where fieldX=filter1_value and fieldY=filter2_value
where the filter values are provided by the user's manipulation of the filter controls, I then feed the results through a simple compute engine to roll up the stats.
I have high confidence this works functionally, but... I'm very concerned that even with filters you can easily have many millions of results, and simply rolling these up, even for a simple computation, will take too long for real-time display (i.e. providing results to a web UI via REST).
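For comparison, the same rollup could be expressed entirely inside the database, so only the aggregates cross the wire instead of millions of rows. A minimal sketch (durationMs and succeeded are hypothetical columns standing in for whatever the real stats need):

-- durationMs and succeeded are illustrative only; the filters mirror the query above
select count(*) as event_count,
       avg(durationMs) as avg_duration,
       100.0 * sum(case when succeeded = 1 then 1 else 0 end) / count(*) as pct_succeeded
from Events
where fieldX = filter1_value
  and fieldY = filter2_value;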
Alternatively, I could precompute ready-to-eat stats, but the combinatorial explosion of potential filter values is (I think) too vast to allow that.
Is there a different/better way to allow me to do this and still be fast?

How to efficiently filter a large number of records based on user permissions on specific records with specific criteria?

I'm working as a maintainer for a legacy Java-based cargo railway consignment note accounting system. There is a serious performance issue with retrieving a list of consignment notes to display on their website.
I cannot publish the entire query, but here are some statistics to give the general idea:
- it has 17 left joins
- it has a huge WHERE clause with 5 OR groups to determine whether a user is allowed to access a record because of a specific relation to the record (consignor, consignee, carrier, payer, supervisor) and to check the user's permission to access records related to a specific railway station
- each OR group has, on average, two exists() checks with subqueries on data related to the record, also checking the station permission
- when expanded to be human-readable, the query is about 200 lines long
Essentially, the availability of each record to the currently logged-in user depends on the following factors:
- the company of the user
- the company of the carrier, consignee, consignor, payer of each specific consignment note
- every consignment note has multiple route sections and every section has its own carrier and payer, thus requiring further access control conditions to make these records visible to the user
- every consignment note and every route section has origin and destination stations, and a user is allowed to see the record only if he has been given access to any of these stations (using a simple relation table).
There are about 2 million consignment note records in the database and the customer is complaining that it takes too long to load a page with 20 records.
Unfortunately it is not possible to optimize the final query before passing it to the RDBMS (Oracle 11g, to be specific) because the system has a complex architecture and a home-brew ORM tool, and the final query is assembled in at least three different places that are responsible for collecting the fields to select, collecting the joins, adding the criteria selected in the UI and, finally, the reason for this question - the permission-related filter.
I wouldn't say that the final query is very complex; on the contrary, it is simple in its nature but it's just huge.
I'm afraid caching solutions wouldn't be very effective in this case because the data changes very often and the cache would be overwritten every minute or so. Also, because of individual permissions, each user would need their own cache, which would have to be maintained.
Besides the usual recommendations - dealing with indexes and optimizing each subquery as much as possible - are there any other well-known solutions for filtering a large number of records based on complex permission rules?
Just my two cents, since I see no other answers around.
First of all you would need to get the execution plan of the query. Without it, it's not that easy to get an idea of what could be improved. It sounds like a nice challenge, if it weren't for your urgency.
Well, you say the query has 17 left joins. Does that mean there is a single main table in the query? If so, then that's the first section I would optimize. The key aspect is to reduce the TABLE ACCESS BY ROWID operations as much as possible on that table. The typical solution is to add well tailored indexes to narrow down the INDEX RANGE SCAN as much as possible on that table, therefore reducing the heap fetches.
Then, when navigating the rest of the [outer] tables (presumably using NESTED LOOPS), you can try materializing some of those conditions into simple 0/1 flags that you can use instead of evaluating the whole conditions.
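As a hypothetical sketch (the table and column names are invented for illustration), a repeated exists() check can be collapsed into a flag computed once per row in an inline view:

-- hypothetical schema: compute the station permission once as a 0/1 flag,
-- with this SELECT wrapped as an inline view by the outer query
SELECT n.*,
       CASE WHEN EXISTS (SELECT 1
                         FROM user_station_access usa
                         WHERE usa.user_id = :userId
                           AND usa.station_id IN (n.origin_station_id, n.destination_station_id))
            THEN 1 ELSE 0 END AS station_ok
FROM consignment_note n

The outer WHERE clause's OR groups can then test station_ok = 1 instead of repeating the subquery.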
Also, if you only need 20 rows, I would expect that to be very fast... well, as long as the query is properly pipelined. If in your case it's taking too long, then it probably isn't. Are you sorting/aggregating/windowing by some specific condition that prevents pipelining? That condition could be the most important thing to index if you just need 20 rows.
Finally, you could try avoiding heap fetches by using "covering indexes". That could really improve performance of your query, but I would leave it as a last resort, since they have their downsides.
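For example (again with hypothetical column names), a covering index contains every column the query touches for that table, so Oracle can answer from the index alone and skip the TABLE ACCESS BY ROWID step:

-- hypothetical columns; the point is that both the filtered and the selected columns are in the index
CREATE INDEX ix_note_covering
    ON consignment_note (consignor_company_id, origin_station_id, id, status);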
Well, again, a good solution really requires taking a good look at the execution plan. If you're still game, post it and I can take a look at it.

is it ok to loop a sql query in a programming language

I have a question about retrieving data from the database.
There are two tables, and the master table's id is always inserted into the other table.
I know that data can be retrieved from the two tables with a join, but I want to know:
if I first retrieve all my desired data from the master table and then, in a loop (in the programming language), query the other table to retrieve the related data, which approach is more efficient and why?
As far as efficiency goes the rule is you want to minimize the number of round trips to the database, because each trip adds a lot of time. (This may not be as big a deal if the database is on the same box as the application calling it. In the world I live in the database is never on the same box as the application.) Having your application loop means you make a trip to the database for every row in the master table, so the time your operation takes grows linearly with the number of master table rows.
Be aware that in dev or test environments you may be able to get away with inefficient queries if there isn't very much test data. In production you may see a lot more data than you tested with.
It is more efficient to work in the database, in fewer, larger queries, but unless the site or program is going to be very busy, I doubt it'll make much difference whether the loop is inside the database or outside it. If it is a website application, then running large loops outside the database and waiting on the results will take significantly more time.
What you're describing is sometimes called the N+1 problem. The 1 is your first query against the master table, the N is the number of queries against your detail table.
This is almost always a big mistake for performance.*
The problem is typically associated with using an ORM. The ORM queries your database entities as though they were objects, and the mistake is to assume that instantiating a data object is no more costly than creating an ordinary object. But of course you can write code that does the same thing yourself, without using an ORM.
The hidden cost is that you now have code that automatically runs N queries, and N is determined by the number of matching rows in your master table. What happens when 10,000 rows match your master query? You won't get any warning before your database is expected to execute those queries at runtime.
And it may be unnecessary. What if the master query matches 10,000 rows, but you really only wanted the 27 rows for which there are detail rows (in other words, an INNER JOIN)?
Some people are concerned with the number of queries because of network overhead. I'm not as concerned about that. You should not have a slow network between your app and your database. If you do, then you have a bigger problem than the N+1 problem.
I'm more concerned about the overhead of running thousands of queries per second when you don't have to. The overhead is in memory and all the code needed to parse and create an SQL statement in the server process.
Just Google for "sql n+1 problem" and you'll find lots of people discussing how bad this is, how to detect it in your code, and how to solve it (spoiler: do a JOIN).
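A sketch of the difference, with hypothetical master/detail tables:

-- N+1: one query for the master rows...
SELECT id FROM master;
-- ...then one query per master row, executed in an application loop
SELECT * FROM detail WHERE master_id = :id;

-- the JOIN alternative: a single round trip
SELECT m.id, d.*
FROM master m
JOIN detail d ON d.master_id = m.id;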
* Of course every rule has exceptions, so to answer this for your application, you'll have to do load-testing with some representative sample of data and traffic.

Best approach to cache Counts from SQL tables?

I would like to develop a Forum from scratch, with special needs and customization.
I would like to prepare my forum for intensive usage, and I am wondering how to cache things like a user's post count and reply count.
Having only three tables, tblForum, tblForumTopics and tblForumReplies, what is the best approach to caching the user topic and reply counts?
Think of a simple scenario: a user clicks a link, opens the Replies.aspx?id=x&page=y page, and starts reading replies. On the HTTP request, the server runs an SQL command which fetches all the replies for that page, also inner joining with tblForumReplies to find out the number of replies for each user that replied:
select
    tblForumReplies.*,
    tblFR.TotalReplies
from
    tblForumReplies
    inner join
    (
        select IdRepliedBy, count(*) as TotalReplies
        from tblForumReplies
        group by IdRepliedBy
    ) as tblFR
        on tblFR.IdRepliedBy = tblForumReplies.IdRepliedBy
Unfortunately this approach is very CPU intensive, and I would like to hear your ideas on how to cache things like table counts.
If I count replies for each user on insert/delete and store the count in a separate field, how do I keep it synchronized with manual data changes? Suppose I manually delete replies directly in SQL.
These are the three approaches I'd be thinking of:
1) Maybe SQL Server performance will be good enough that you don't need to cache. You might be underestimating how well SQL Server can do its job. If you do your joins right, it's just one query to get all the counts of all the users that are in that thread. If you are thinking of this as one query per user, that's wrong.
2) Don't cache. Redundantly store the user counts on the user table. Update the user row whenever a post is inserted or deleted.
3) If you have thousands of users, even many thousands, but not millions, you might find that it's practical to cache the users and their counts in the web layer's memory - for ASP.NET, the "Application" cache.
I would not bother with caching until I am sure I need it. In my experience there is no way to predict the places that will require caching. Take an iterative approach: implement without a cache, then gather statistics, and then implement the right kind of caching (there are many kinds - content, data, aggregates, distributed and so on).
BTW, I do not think that your query is CPU intensive. SQL Server will optimize that stuff and COUNT(*) will run in ticks...
tbl prefixes suck -- as much as Replies.aspx?id=x&page=y URIs do. Consider ASP.NET MVC, or at least its routing part.
Second, do not optimize prematurely. However, if you really need to, denormalize your data: add a TotalReplies column to your ForumTopics table and either rely on your DAL/BL to keep this field up to date (possibly with a scheduled task to resync it), or use triggers.
For each reply you would need to keep TotalReplies and TotalDirectReplies. That way, you can support a tree-like structure of replies and keep the counts updated throughout the entire hierarchy without having to count each time.
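A rough sketch of the trigger route (IdTopic is an assumed column name; adjust to the real schema):

CREATE TRIGGER trg_ForumReplies_TotalReplies
ON tblForumReplies
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- +1 for each inserted reply, -1 for each deleted one, aggregated per topic
    UPDATE t
    SET TotalReplies = t.TotalReplies + d.delta
    FROM tblForumTopics t
    JOIN (SELECT IdTopic, SUM(delta) AS delta
          FROM (SELECT IdTopic, 1 AS delta FROM inserted
                UNION ALL
                SELECT IdTopic, -1 FROM deleted) x
          GROUP BY IdTopic) d
        ON d.IdTopic = t.IdTopic;
END;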

Efficient way to model aggregate data of a many-to-one relationship (e.g. votes count on a stackoverflow question)

I'm curious about what would be the best way to model this for optimized performance... I'm not as concerned about real-time data integrity.
I'll continue with the stackoverflow example.
Question
    id
    title
Votes
    id
    user
    question
A question has many votes.
For many queries however, we're only concerned with the aggregate number of votes (e.g. to show next to the question).
Good relational db theory would create the two entities (Q and V) as separate relations, requiring a join then a sum or count aggregate call.
Another possibility is to break normal form and occasionally materialize the aggregate value of votes as an attribute in Question (e.g. Question.votes). Performance is gained on reads; however, depending on how stale you are willing to let your "votes" data get, it requires a lot more writes to that Question record... in turn hindering performance.
Other techniques involving caching, etc. can be used. But I'm just wondering, performance-wise, what's the best solution? Let's say the site is high traffic and receives considerably more votes than questions.
Open to non-relational models as well.
It's unlikely that a join will be too slow in this case, especially if you have an index on (question) in the Votes table.
If it is REALLY too slow, you can cache the vote count in the Question table:
id - title - votecount
You can update the votecount whenever you record a vote. For example, from a stored procedure or directly from your application code.
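For example, as part of recording the vote (a minimal sketch using the tables above; parameter names are placeholders):

-- record the vote and bump the cached count in the same transaction
INSERT INTO votes ([user], question) VALUES (@user_id, @question_id);
UPDATE questions SET votecount = votecount + 1 WHERE id = @question_id;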
Those updates are tricky, but since you're not that worried about consistency, I guess it's ok if the vote count is sometimes not exactly right. To fix any errors, you can periodically regenerate all the cached counts, like so:
UPDATE q
SET votecount = vc.votecount
FROM questions q
JOIN (SELECT q2.id, COUNT(v.question) AS votecount
      FROM questions q2
      LEFT JOIN votes v ON v.question = q2.id
      GROUP BY q2.id) vc
    ON vc.id = q.id
The aggregate count(v.question) returns 0 if no matching vote was found, as opposed to count(*), which would return 1.
If locks are an issue, consider using "with (nolock)" or "set transaction isolation level read uncommitted" to bypass locks (again, this is based on data integrity being a low priority).
As an alternative to nolock, consider "read committed snapshot", which is meant for databases with heavy read and less write activity. You can turn it on with:
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
It is available in SQL Server 2005 and higher. This is how Oracle works by default, and it's what stackoverflow itself uses. There's even a Coding Horror blog entry about it.
I used indexed views from SQL 2005 all over the place for this kind of thing on a social networking site. Our load definitely had a high ratio of reads to writes, so it worked well for us.
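For reference, an indexed view for the vote count might look like the sketch below (SCHEMABINDING, COUNT_BIG and a unique clustered index are requirements of indexed views; table and column names follow the earlier example):

CREATE VIEW dbo.QuestionVoteCounts
WITH SCHEMABINDING
AS
SELECT question, COUNT_BIG(*) AS votecount
FROM dbo.votes
GROUP BY question;
GO

-- materializing index: the engine now maintains the count on every insert/delete of a vote
CREATE UNIQUE CLUSTERED INDEX IX_QuestionVoteCounts
    ON dbo.QuestionVoteCounts (question);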
I would suggest keeping the vote count in memory for the lifetime of the application.
Why hit a db for something as simple as a count when, at some point, you will have loaded the item once and asked what the initial count was, rather than asking on every request?
It also has a lot to do with how you are implementing repositories: if your question object lazy loads the votes but eagerly loads the count of votes, then you can speed up the process without the issue of keeping everything in memory. Still keep the votes in the db; just maintain the count in your application.