I expect this is a common enough use case, but I'm unsure of the best way to leverage database features to do it. Hopefully the community can help.
Given a business domain where a number of attributes make up a record; we can just call these a, b, c.
Each of these belongs to a parent record, of which there can be many.
Given an external data source that posts updates to those attributes at arbitrary times, and typically only to a subset of them, you get instructions like:
z:{a:3}
or
y:{b:2,c:100}
What are good ways to query Postgres for the 'current state', i.e. a single-row result that represents the most recent value of each of a, b, c, for each of the parent records?
The overall current state looks like:
x:{a:0, b:0, c:1}
y:{a:1, b:2, c:3}
z:{a:2, b:65, c:6}
If it matters, the time between updates to a single value could be arbitrarily long.
I am deliberately avoiding a table that keeps updating and rewriting an individual row for the state, because write contention could be a problem, and I think there must be a better overall pattern.
Your question is a bit theoretical, but in essence you are describing a top-1-per-group problem. In Postgres, you can use DISTINCT ON for this.
Assuming that your table is called mytable, where attributes are stored in column attribute, and that column ordering_id defines the ordering of the rows (that could be a timestamp or a serial, for example), you would phrase the query as:
select distinct on (attribute) t.*
from mytable t
order by attribute, ordering_id desc
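If you also need the result pivoted back into one row per parent record, as in the "current state" example in the question, one option is to take the latest row per parent/attribute pair and then aggregate. A sketch under the same assumptions, adding a hypothetical parent_id column that identifies the parent record and a value column that holds the posted value (FILTER requires Postgres 9.4+):
select parent_id,
max(value) filter (where attribute = 'a') as a,
max(value) filter (where attribute = 'b') as b,
max(value) filter (where attribute = 'c') as c
from (
select distinct on (parent_id, attribute) parent_id, attribute, value
from mytable
order by parent_id, attribute, ordering_id desc
) latest
group by parent_id;
The inner query keeps only the newest row per parent/attribute, so each max() simply picks out that single value for its column.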
Scenario
Suppose I am building a database for a Messenger app. Let there be two tables, a User table and a Conversation table. Each conversation has a list of participating users, and each user has a list of conversations they are in. In short, there is a many-to-many relationship between Users and Conversations tables.
Now suppose I want to load the first 10 conversations of a user's list of conversations in descending chronological order when I open the app. Assuming that # Conversations in table >> # Conversations a user has >> 10, a brute force way is to load every conversation in the user's list, then order them in-memory, and finally return the first 10. I think this is how a normal SQL engine will process such a query.
Concern
My concern is that when the number of conversations a user has becomes very large, this operation becomes too resource-consuming. Is there any faster way to achieve the same result (fetching a sorted sublist of records from a table) with possibly additional database setup?
Example
For instance, imagine a user has 300 conversations, and we want to page through these conversations in order. The above method would either download all 300 conversations to the client and sort them locally, or let the server do the sorting. The first method uses too much bandwidth and the information may not be up to date, and the second method requires pulling all 300 conversations from the database each time we page.
Question
My question is this: is my concern of this particular case valid? If so, how should I modify my database setup to avoid this issue? How are some existing examples like Facebook Messenger handling this? If not, why is this not a performance concern?
Edit
I realised after asking the question that in an RDBMS we would simply create a third table to store the many-to-many relationship, and building an index on this table would solve this problem. However, would NoSQL databases that support storing lists in columns (more specifically, AWS DynamoDB) have an advantage over traditional RDBMS in this case?
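For example, the kind of junction table and index I have in mind (the names, and the denormalised last_activity column, are purely illustrative):
CREATE TABLE ConversationMember
(
user_id INTEGER NOT NULL,
conversation_id INTEGER NOT NULL,
last_activity TIMESTAMP NOT NULL,
PRIMARY KEY (user_id, conversation_id)
);
CREATE INDEX ix_conversationmember_user_activity
ON ConversationMember (user_id, last_activity DESC);
The second index would let the database return a user's 10 most recent conversations in index order, without sorting all of them.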
It looks as though the table list you posited is not adequate to represent the data you're trying to extract. Presuming that a conversation can have no more than one creator, that user ID can safely be stored in the Conversation table.
But the likely structure of the tables will include a "comment" table, with (at a minimum) the following fields:
* Primary key -- record id for _this_ comment
* conversation_id -- reference to the conversation this comment is part of
* user_id -- The user ID of the person making this comment
* parent_id -- The comment that preceded this one (presuming threaded conversations)
* create_dt -- Datetime that the comment was added to the thread
* comment_body -- The actual comment itself.
If this is indeed the case, you'd be looking at a query that looks something like this:
SELECT conversation_id, MAX(create_dt) AS last_activity
FROM
(
SELECT conversation_id, create_dt
FROM Conversation
WHERE person_id = {DesiredPerson}
UNION
SELECT conversation_id, create_dt
FROM Comment
WHERE person_id = {DesiredPerson}
) AS activity
GROUP BY conversation_id
ORDER BY last_activity DESC
LIMIT 10
...will give the IDs of the 10 most recent conversations in which DesiredPerson has participated.
Contrary to your belief, database optimizers are smart enough that the query will NOT require both subqueries to be fully evaluated to produce the desired result. If there are appropriate indexes on the tables, this should be a pretty efficient query (e.g. a compound index on each table covering person_id and create_dt, ideally including conversation_id). In fact, the query could then be satisfied without referencing the tables at all; the result can be computed entirely from the indexes. Using MySQL's LIMIT clause with both a count and an offset should allow you to handle paging pretty efficiently.
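For reference, the indexes assumed above might be created along these lines (the index names are illustrative):
CREATE INDEX ix_conversation_person_dt ON Conversation (person_id, create_dt, conversation_id);
CREATE INDEX ix_comment_person_dt ON Comment (person_id, create_dt, conversation_id);
With person_id leading, each branch of the UNION can be filtered and read in date order from the index alone.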
Is there any faster way to achieve the same result (fetching a sorted sublist of records from a table) with possibly additional database setup?
Yes, there is.
This "additional database setup" is called "index". I think every relational DBMS allows to create indexes.
There can be several types of indexes, but most common is a b-tree index, where data is stored in a balanced tree, which allows to quickly find the necessary element(s) and read the data in the order by which the index is sorted.
Index is a supplementary structure stored and maintained by the database engine on disk in addition to the main table data. You can usually create many different indexes on the same table. The engine would try to pick the most suitable index when running the specific query. Different queries may use different indexes.
Since index structure has to be maintained when underlying data changes, it means that usually creating an index helps the SELECT queries, but somewhat slows down UPDATE, DELETE and INSERT. This is why it is usually a trade-off and requires some skill to identify what set of indexes should exist. It largely depends on what kind of queries run and their relative importance.
For a specific example of how to implement efficient pagination with the help of appropriate index have a look at Pagination Done the Right Way from the web-site, that is called Use the index, Luke.
It also has a good intro into Anatomy of an SQL Index and many other useful articles.
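As a rough illustration of the keyset ("seek") pagination idea described there, applied to the conversations example (the table and column names here are assumptions, and an index on (user_id, last_message_dt) is presumed):
-- first page
SELECT conversation_id, last_message_dt
FROM user_conversations
WHERE user_id = 42
ORDER BY last_message_dt DESC
LIMIT 10;
-- next page: seek past the last row already shown instead of using OFFSET
SELECT conversation_id, last_message_dt
FROM user_conversations
WHERE user_id = 42
AND last_message_dt < '2019-06-01 10:30:00' -- last_message_dt of the previous page's final row
ORDER BY last_message_dt DESC
LIMIT 10;
Because the index is already sorted by (user_id, last_message_dt), each page is read straight off the index without scanning or sorting the rest of the user's rows.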
Is my concern of this particular case valid?
It is not valid for 300 rows, but becomes more and more important as your tables grow in size. For 300 million rows most likely it would be rather important.
How does a function like SUM work? If I execute
select id,sum(a) from mytable group by id
does it sort by id and then sum over each range of equal ids? I am no planner expert, but that looks like what is happening; mytable is maybe a hundred million rows with a few million distinct ids.
Or does it just keep a hash of id -> current_sum, and then for each row either increment the running sum for that id or add a new key? Isn't that far faster and less memory-hungry?
SQL standards try to dictate external behavior, not internal behavior. In this particular case, a SQL implementation that conforms to (one of the many) standards is supposed to act as if it does things in this order:
1. Build a working table from all the table constructors in the FROM clause. (There's only one in your example.)
2. For the GROUP BY clause, partition the working table into groups, reduce each group to one row, and replace the working table with the grouped table.
3. Resolve the expressions in the SELECT clause.
Query optimizers that follow SQL standards are free to rearrange things however they like, as long as the result is the same as if it had followed those steps.
You can find more details in the answers and comments to this SO question.
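If you want to see which strategy the planner actually picks for your table, EXPLAIN shows it directly; a HashAggregate node means the hash-of-running-sums approach, while a GroupAggregate under a Sort (or an ordered index scan) means sort-then-sum:
EXPLAIN SELECT id, sum(a) FROM mytable GROUP BY id;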
So, I found this, http://helmingstay.blogspot.com/2009/06/postgresql-poetry-aggregate-median-with.html, which claims that it does indeed use the accumulator pattern. Hmmm.
Suppose I have two tables:
Group
(
id integer primary key,
someData1 text,
someData2 text
)
GroupMember
(
id integer primary key,
group_id foreign key to Group.id,
someData text
)
I'm aware that my SQL syntax is not correct :) Hopefully it's clear enough. My problem is this: I want to load a group record and all the GroupMember records associated with that group. As I see it, there are two options.
A single query:
SELECT Group.id, Group.someData1, Group.someData2, GroupMember.id, GroupMember.someData
FROM Group INNER JOIN GroupMember ...
WHERE Group.id = 4;
Two queries:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4;
SELECT id, someData
FROM GroupMember
WHERE group_id = 4;
The first solution has the advantage of needing only one database round trip, but has the disadvantage of returning redundant data (all the group data is duplicated for every group member).
The second solution returns no duplicate data but involves two round trips to the database.
What is preferable here? I suppose there's some threshold such that if the group sizes become sufficiently large, the cost of returning all the redundant data is going to be greater than the overhead involved with an additional database call. What other things should I be thinking about here?
Thanks,
Jordan
If you actually want the results joined, I believe it is always more efficient to do the joining at the server level. The SQL processor is designed to match sets of data.
If you really want the results of two SQL statements, you can always send both statements in one batch, separated by a semicolon, and get two result sets back with one round trip to the DB.
How the data is finally used is an important and unknown factor.
I suggest the single query method for most applications. Proper indexing will keep the query more efficient than the two query method.
The single query method also has the benefit of remaining valid if you need to select more than one group.
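For example, the same join pattern extended to several groups at once (a sketch: the join condition is the obvious one on group_id, and note that Group is a reserved word in most dialects, so it may need quoting):
SELECT Group.id, Group.someData1, Group.someData2, GroupMember.id, GroupMember.someData
FROM Group INNER JOIN GroupMember ON GroupMember.group_id = Group.id
WHERE Group.id IN (4, 5, 6);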
If you are only ever going to retrieve a single group record with each request to the database, then I would go with the second option. If you are retrieving multiple group records and their associated group member records, go with the join, as it will be much quicker.
In general, it depends on what type of data you are trying to display.
If you are showing a single group and all its members, performance differences between the two options would be negligible.
If you are showing many groups and all of their members, the overhead of having to make a roundtrip to the database for each successive group will quickly outweigh any benefit you got from receiving a little less data.
Some other things you might want to consider in your reasoning:
Result Set Size - For many groups and members, your result set size may become a limiting factor as the amount of data to retrieve and keep in memory grows. This is likely to occur with the second option. You may want to consider paging the data, so that you are only retrieving a certain subset at a time.
Lazy Loading - If you are only getting the members of some groups, or a user is requesting the members one group at a time, consider Lazy Loading. This means only making the additional query to get the group's members when needed. This makes sense only in certain use cases, but it can be much more effective than retrieving all data up front.
Depending on the type of database and your frontend application, you can return the results of two SQL statements on one trip (A stored procedure in SQL Server 2005 for example).
If you are creating a report that requires many fields from the Group table, you may not want the increased amount of data with the first query.
If this is some type of data entry app, you've probably already presented the Group data to the user, so they could fill in the group id on the where clause (or preferably via some parameter) and now they need the member results.
It really, really, really depends on what use you will make of the data.
For instance, if you were assembling a list of group members for a mail shot, needed the group name for each letter you're going to send to a member, and had no other use for the Group level, then the single joined query makes a lot of sense.
But if, say, you're coding a master-detail screen or report, with a page for each group and displaying information at both the Group and the Member levels then the two separate queries is probably most useful.
Unless you are retrieving quite large amounts of data (tens of thousands of groups with hundreds of members per group, or similar orders of magnitude), it is unlikely you are going to see much difference in performance between the two approaches.
On a simple query like this I would try to perform it in one query. The overhead of two database calls will probably exceed the additional SQL processing time of the combined query.
A UNION clause will do this for you:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4
UNION
SELECT id, someData, null
FROM GroupMember
WHERE group_id = 4;
Having dutifully normalised all my data, I'm having a problem combining 3NF rows into a single row for output.
Up until now I've been doing this with server-side coding, but for various reasons I now need to select all rows related to another row, and combine them in a single row, all in MySQL...
So to try and explain:
I have three tables.
Categories
Articles
CategoryArticles_3NF
A category record contains a CategoryID plus a title, description, etc. It can contain any number of articles from the Articles table, each consisting of an ArticleID plus a text field to house the content.
The CategoryArticles_3NF table is used to link the two, so it contains both the CategoryID and the ArticleID.
Now, if I select a Category record, and I JOIN the Articles table via the linking CategoryArticles_3NF table, the result is a separate row for each article contained within that category.
The issue is that I want to output one single row for each category, containing content from all articles within.
If that sounds like a ridiculous request, it's because it is. I'm just using articles as a good way to describe the problem. My data is actually somewhat different.
Anyway - the only way I can see to achieve this is to use GROUP_CONCAT to concatenate the content fields together - the problem with this is that there is a limit on how much data it can return, and I need it to be able to handle significantly more.
Can anyone tell me how to do this?
Thanks.
Without more information, this sounds like something that should be done in the front end.
If you need to, you can increase the size limit of GROUP_CONCAT by setting the system variable group_concat_max_len. It has a limit based on max_allowed_packet, which you can also increase. I think that the max size for a packet is 1GB. If you need to go higher than that then there are some serious flaws in your design.
EDIT: So that this is in the answer and not just buried in the comments...
If you don't want to change the group_concat_max_len globally then you can change it for just your session with:
SET SESSION group_concat_max_len = <your value here>
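Putting the two together for the categories/articles example in the question (the column names such as Title and Content are assumptions):
SET SESSION group_concat_max_len = 1048576;
SELECT c.CategoryID,
c.Title,
GROUP_CONCAT(a.Content SEPARATOR '\n---\n') AS AllContent
FROM Categories c
JOIN CategoryArticles_3NF ca ON ca.CategoryID = c.CategoryID
JOIN Articles a ON a.ArticleID = ca.ArticleID
GROUP BY c.CategoryID, c.Title;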
OK, so practically every database-based application has to deal with "non-active" records, either via soft deletions or by marking something as "to be ignored". I'm curious whether there are any radical alternative thoughts on an 'active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Would anyone suggest that non-active records be moved off to a separate table, with a UNION done to join the two where appropriate?
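For example, something like this, where people_archive is a hypothetical second table with the same structure:
SELECT * FROM people
UNION ALL
SELECT * FROM people_archive;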
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (1),
PARTITION inactive_records VALUES (0)
);
If you wanted to, you could put each partition in a different tablespace. You can also partition your indexes as well.
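The "active view" mentioned above could then be as simple as this (the view name is illustrative):
CREATE OR REPLACE VIEW active_people AS
SELECT id, name, active
FROM people
WHERE active = 1;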
Incidentally, this seems to be a repeat of another question; as a newbie I need to ask, what's the procedure for dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only pull active records in most situations, you could create views that contain only the active records. That way it's much harder to accidentally leave out the active filter.
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, and everything becomes more complicated: unarchiving the stuff, and so on. What do you do with related data? If you move all of that too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
Looking at it strictly as a piece of data, I think the approach shown in the original post is proper. The active flag is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method: previously you had to update and insert, now you have to insert and update (i.e. instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of just the changes. That's hardly going to have any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
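A rough sketch of that flow, with made-up table and column names (an orders table mirrored by orders_history):
-- copy the current row into the history table, then overwrite it in place
INSERT INTO orders_history
SELECT * FROM orders WHERE order_id = 42;
UPDATE orders
SET status = 'SHIPPED',
updated_at = CURRENT_TIMESTAMP
WHERE order_id = 42;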
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table becomes necessary? Again, it looks straightforward:
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality of this column is exactly two! Any database query optimiser will ignore this index because of its low cardinality and do a table scan.
Before filling up your schema with helpful flags, consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If they mostly don't come back once they are buried, and are only used for summaries/reports/whatever, then moving them will make your main table smaller and your queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use depends on the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they won't be used, but also allows us to maintain data integrity with relations.
We use the "move to a separate table" method where the data is no longer needed and is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used: one for Deleted, one for Disabled, etc. Or, if space is an issue, a flag for Disabled would suffice, and rows that are deleted could simply be removed.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - it's not a bad idea to archive deleted records immediately, though you might use a combination approach of marking them as deleted and then copying them to archive tables.
2) Of course the option to hard delete a record still exists - though we developers tend to be data pack-rats. I suggest you look at the business process and decide whether there is any need to even keep the data: if there is, do so; if there isn't, you should probably feel free to just throw the stuff away - again, according to the specific business scenario.
From a 'purist perspective' the relational model doesn't differentiate between a view and a table - both are relations. So the use of a view that applies the discriminator is perfectly meaningful and valid, provided the entities are correctly named, e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people, as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question, but for those searching for how to index low-cardinality/low-selectivity columns, I'd like to propose the following approach, which avoids partitioning, secondary tables, etc.:
The trick is to use a "dateInactivated" column that stores the timestamp of when the record was inactivated/deleted. As the name implies, the value is NULL while the record is active; once the record is inactivated, write in the current system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows, since each such record will have a (more or less) unique value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
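In DDL terms the pattern might look like this (a sketch; the index name is illustrative):
ALTER TABLE people ADD COLUMN dateInactivated TIMESTAMP NULL;
CREATE INDEX ix_people_dateinactivated ON people (dateInactivated);
-- soft-delete instead of DELETE
UPDATE people SET dateInactivated = CURRENT_TIMESTAMP WHERE id = 42;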
Filtering big tables on a bit flag is not really good in terms of performance. In cases where 'active' denotes virtual deletion, you can create a 'TableName_deleted' table with the same structure and move deleted data there using a delete trigger.
That solution helps with performance and simplifies data queries.
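A minimal sketch of that trigger approach in MySQL, assuming the people table from the question (only the columns shown there):
CREATE TABLE people_deleted LIKE people;
CREATE TRIGGER people_archive_delete
BEFORE DELETE ON people
FOR EACH ROW
INSERT INTO people_deleted (id, name, active)
VALUES (OLD.id, OLD.name, OLD.active);
After that, an ordinary DELETE on people leaves a copy of each removed row in people_deleted.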