Query to get all revisions of an object graph - sql

I'm implementing an audit log on a database, so everything has a CreatedAt and a RemovedAt column. Now I want to be able to list all revisions of an object graph but the best way I can think of for this is to use unions. I need to get every unique CreatedAt and RemovedAt id.
If I'm getting a list of countries with provinces the union looks like this:
SELECT c.CreatedAt AS RevisionId from Countries as c where c.LocalId = @Country
UNION
SELECT p.CreatedAt AS RevisionId from Provinces as p
INNER JOIN Countries as c ON p.CountryId=c.LocalId AND c.LocalId = @Country
UNION
SELECT c.RemovedAt AS RevisionId from Countries as c where c.LocalId = @Country
UNION
SELECT p.RemovedAt AS RevisionId from Provinces as p
INNER JOIN Countries as c ON p.CountryId=c.LocalId AND c.LocalId = @Country
For more complicated queries this could get quite complicated and possibly perform very poorly so I wanted to see if anyone could think of a better approach. This is in MSSQL Server.
I need them all in a single list because this is being used in a from clause and the real data comes from joining on this.

You have most likely already implemented your solution, but to address a few issues; I would suggest considering Aleris's solution, or some derivative thereof.
In your tables, you have a "removed at" field -- well, if that field were active (populated), technically the data shouldn't be there -- or perhaps your implementation has it flagged for deletion, which will break the logging once it is removed.
What happens when you have multiple updates during a reporting period -- the previous log entries would be overwritten.
Having a separate log allows for archival of the log information and allows you to set a different log analysis cycle from your update/edit cycles.
Add whatever "linking" fields required to enable you to get back to your original source data OR make the descriptions sufficiently verbose.
The fields contained in your log are up to you but Aleris's solution is direct. I may create an action table and change the field type from varchar to int, as a link into the action table -- forcing the developers to some standardized actions.
Hope it helps.

An alternative would be to create an audit log that might look like this:
AuditLog table
EntityName varchar(2000),
Action varchar(255),
EntityId int,
OccurrenceDate datetime
where EntityName is the name of the table (e.g. Countries, Provinces), Action is the audit action (e.g. Created, Removed) and EntityId is the primary key of the modified row in the original table.
The table would need to be kept synchronized on each action performed on the tables. There are a couple of ways to do this:
1) Create triggers on each table that add rows to AuditLog
2) From your application, add rows to AuditLog each time a change is made to the respective tables
With this solution it is very simple to get a list of log entries.
If you need columns from the original table, you can get them with a join like this:
select *
from
Countries C
join AuditLog L on C.Id = L.EntityId and L.EntityName = 'Countries'
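To make option 1 concrete, here is a minimal sketch of a trigger-maintained audit log. It runs on SQLite through Python's sqlite3 module purely so that it is self-contained; SQL Server trigger syntax differs (it uses the inserted/deleted pseudo-tables). The trigger names, the Name column and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Countries (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE AuditLog (
    EntityName     TEXT,
    Action         TEXT,
    EntityId       INTEGER,
    OccurrenceDate TEXT DEFAULT CURRENT_TIMESTAMP
);
-- One trigger per table and action keeps AuditLog in sync.
CREATE TRIGGER Countries_Created AFTER INSERT ON Countries
BEGIN
    INSERT INTO AuditLog (EntityName, Action, EntityId)
    VALUES ('Countries', 'Created', NEW.Id);
END;
CREATE TRIGGER Countries_Removed AFTER DELETE ON Countries
BEGIN
    INSERT INTO AuditLog (EntityName, Action, EntityId)
    VALUES ('Countries', 'Removed', OLD.Id);
END;
""")

conn.execute("INSERT INTO Countries (Id, Name) VALUES (1, 'Canada')")
conn.execute("INSERT INTO Countries (Id, Name) VALUES (2, 'France')")
conn.execute("DELETE FROM Countries WHERE Id = 2")

# Every change is now a row in the log, in the order it happened.
audit_rows = conn.execute(
    "SELECT EntityName, Action, EntityId FROM AuditLog ORDER BY rowid"
).fetchall()

# Joining from the log with a LEFT JOIN keeps entries for rows that
# have since been deleted (their Name comes back as NULL).
joined = conn.execute("""
    SELECT L.Action, L.EntityId, C.Name
    FROM AuditLog L
    LEFT JOIN Countries C ON C.Id = L.EntityId
    WHERE L.EntityName = 'Countries'
    ORDER BY L.rowid
""").fetchall()
```

Note that the sketch physically deletes the row, so the log is the only place the history survives. If rows are only flagged as removed, an inner join from the source table works just as well.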

You could probably do it with a full outer join that never matches, plus COALESCE, but the union is probably still better from a performance standpoint. Test both to be sure.
SELECT
COALESCE(C.CreatedAt, P.CreatedAt)
FROM
dbo.Countries C
FULL OUTER JOIN dbo.Provinces P ON
1 = 0
WHERE
C.LocalID = @Country OR
P.CountryId = @Country

Using 'LIKE' in a join (Replacing CASE with dimension table)

I've successfully prototyped a system that is working quite well and it is now time for me to go back and clean some things up before proceeding - as per suggestion from my senior.
In a generic sense, we're using views to only give us customers from one company and group them into one parent company. For example, grouping 'Tesco UK & Ireland' as company 'Tesco'.
I do this with:
CASE
WHEN CustName = 'Tesco UK & Ireland' THEN 'Tesco'
ELSE CustName
END
However, there is one issue with this approach (which works until you need to incorporate the grouping as a dimension table). Some companies have so many specific names that covering all of them would take a CASE statement hundreds of lines long, and sometimes the customer names aren't uploaded correctly. Another example with a random company: 'PC World' is what I'd expect, although sometimes I'm given 'Currys PC World', 'PC World Glasgow' and other variations. So to combat this I've tried:
CASE
WHEN CustName LIKE 'Tesco UK & Ireland' THEN 'Tesco'
WHEN CustName LIKE '%PC World%' THEN 'PC World Other'
ELSE CustName
END
However, I was wondering if there is a way to incorporate this in a dimension/mapping table?
Ideally, I'd like to join on CustName to a dimension table and be given the generic name.
Any ideas?
Paul.
I would suggest that you maintain a mapping table with two (important) columns, one for the original name and one for the mapped name. In queries, you would use a left join.
Here is an example:
create table CompanyNameSynonyms (
CompanyNameSynonymId int identity(1, 1) primary key,
CompanyName varchar(255) unique,
MappedName varchar(255),
CreatedAt datetime default getdate()
);
Then a query would look like:
select coalesce(cns.MappedName, t.CompanyName) as Name, count(*)
from t left join
CompanyNameSynonyms cns
on t.CompanyName = cns.CompanyName
group by coalesce(cns.MappedName, t.CompanyName);
You do need to populate this with all examples of the alternative names, and then keep the data up-to-date. However, I consider this a benefit for three reasons. First, being explicit is usually a good idea for such reports, to avoid unnecessary confusion. Second, the join using = is faster than join using like with wildcards. Third, all the code that uses this table will be updated when the table is updated.
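Here is a small runnable sketch of that mapping-table approach, using SQLite via Python's sqlite3 module so it is self-contained. The table t and the sample company rows (including 'Asda') are made up for illustration; names with no synonym row fall through unchanged thanks to COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (CompanyName TEXT);
CREATE TABLE CompanyNameSynonyms (
    CompanyName TEXT UNIQUE,   -- name as it arrives in the data
    MappedName  TEXT           -- generic name to report under
);
INSERT INTO t VALUES
    ('Tesco UK & Ireland'), ('Tesco'), ('Currys PC World'),
    ('PC World Glasgow'), ('PC World'), ('Asda');
INSERT INTO CompanyNameSynonyms VALUES
    ('Tesco UK & Ireland', 'Tesco'),
    ('Currys PC World',    'PC World'),
    ('PC World Glasgow',   'PC World');
""")

# LEFT JOIN keeps unmapped names; COALESCE falls back to the raw name.
name_counts = conn.execute("""
    SELECT COALESCE(cns.MappedName, t.CompanyName) AS Name, COUNT(*) AS n
    FROM t
    LEFT JOIN CompanyNameSynonyms cns ON t.CompanyName = cns.CompanyName
    GROUP BY COALESCE(cns.MappedName, t.CompanyName)
    ORDER BY Name
""").fetchall()
```

A new spelling that shows up in the data only needs one extra row in CompanyNameSynonyms; none of the queries change.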

Subquery that matches column with several ranges defined in table

I've got a pretty common setup for an address database: a person is tied to a company with a join table, the company can have an address and so forth.
All pretty normalized and easy to use. But for search performance, I'm creating a materialized, rather denormalized view. I only need a very limited set of information and quick queries. Most of everything that's usually done via a join table is now in an array. Depending on the query, I can either search it directly or join it via unnest.
As a complement to my zipcodes column (varchar[]), I'd like to add a states column that has the (German federal) states already precomputed, so that I don't have to transform a query to include all kinds of range comparisons.
My mapping data is in a table like this:
CREATE TABLE zip2state (
state TEXT NOT NULL,
range_start CHARACTER VARYING(5) NOT NULL,
range_end CHARACTER VARYING(5) NOT NULL
)
Each state has several ranges, and ranges can overlap (one zip code can be for two different states). Some ranges have range_start = range_end.
Now I'm a bit at wit's end on how to get that into a materialized view all at once. Normally, I'd feel tempted to just do it iteratively (via trigger or on the application level).
Or as we're just talking about 5 digits, I could create a big table mapping zip to state directly instead of doing it via a range (my current favorite, yet something ugly enough that it prompted me to ask whether there's a better way)
Any way to do that in SQL, with a table like the above (or something similar)? I'm at postgres 9.3, all features allowed...
For completeness' sake, here's the subquery for the zip codes:
(select array_agg(distinct address.zipcode)
from affiliation
join company
on affiliation.ins_id = company.id
join address
on address.com_id = company.id
where affiliation.per_id = person.id) AS zipcodes,
I suggest a LATERAL join instead of the correlated subquery to conveniently compute both columns at once. Could look like this:
SELECT p.*, z.*
FROM person p
LEFT JOIN LATERAL (
SELECT array_agg(DISTINCT d.zipcode) AS zipcodes
, array_agg(DISTINCT z.state) AS states
FROM affiliation a
-- JOIN company c ON a.ins_id = c.id -- suspect you don't need this
JOIN address d ON d.com_id = a.ins_id -- c.id
LEFT JOIN zip2state z ON d.zipcode BETWEEN z.range_start AND z.range_end
WHERE a.per_id = p.id
) z ON true;
If referential integrity is guaranteed, you don't need to join to the table company at all. I took the shortcut.
Be aware that varchar or text behaves differently than expected for numbers. For example: '333' > '0999'. If all zip codes have 5 digits you are fine.
Related:
What is the difference between LATERAL and a subquery in PostgreSQL?
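The string-comparison caveat and the range join can be checked with a small sketch. It uses SQLite via Python's sqlite3 module rather than Postgres (so no LATERAL or array_agg), and the state names 'A', 'B', 'C' and all ranges are placeholders, not real German data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Text is compared character by character, so '333' sorts after '0999'
# even though 333 < 999 as numbers.
text_cmp = conn.execute("SELECT '333' > '0999'").fetchone()[0]
num_cmp = conn.execute("SELECT 333 > 999").fetchone()[0]

conn.executescript("""
CREATE TABLE zip2state (
    state       TEXT NOT NULL,
    range_start TEXT NOT NULL,
    range_end   TEXT NOT NULL
);
INSERT INTO zip2state VALUES
    ('A', '01000', '01999'),
    ('B', '01900', '02999'),   -- overlaps 'A' for 01900..01999
    ('C', '05000', '05000');   -- range_start = range_end
CREATE TABLE address (zipcode TEXT);
INSERT INTO address VALUES ('01950'), ('05000'), ('09999');
""")

# With fixed-width, zero-padded codes, BETWEEN on text gives the same
# answer as a numeric range check. LEFT JOIN keeps unmatched zip codes.
zip_states = conn.execute("""
    SELECT d.zipcode, z.state
    FROM address d
    LEFT JOIN zip2state z
           ON d.zipcode BETWEEN z.range_start AND z.range_end
    ORDER BY d.zipcode, z.state
""").fetchall()
```

A zip code inside two overlapping ranges produces two rows, which is why the aggregate in the query above uses DISTINCT.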

How can I compare two tables and delete on matching fields (not matching records)

Scenario: A sampling survey needs to be performed on a membership of 20,000 individuals. The survey sample size is 3,500 of the 20,000 members. All members are in table tblMember. The same survey was performed the previous year, and the members who were surveyed are in tblSurvey08. Membership data can change over the year (e.g. new email address, etc.) but the MemberID data stays the same.
How do I remove the MemberIDs/records contained in tblSurvey08 from tblMember to create a new table of potential members to be surveyed (let's call it tblPotentialSurvey09)? Again, the record for an individual member may not match between the two tables, but the MemberID field will remain constant.
I am fairly new at this stuff but I seem to be having a problem Googling a solution - I could use the EXCEPT function but the records for the individuals members are not necessarily the same from one table to next - just the MemberID may be the same.
Thanks
SELECT
m.* -- replace with an explicit column list
FROM
tblMember m
LEFT JOIN
tblSurvey08 s08
ON m.MemberID = s08.MemberID
WHERE
s08.MemberID IS NULL
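will give you only the members not in the 08 survey. This join is often at least as fast as a NOT IN construct, and unlike NOT IN it still returns the right rows if tblSurvey08.MemberID contains NULLs.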
A new table is not such a great idea, since you are duplicating data. A view with the above query would be a better choice.
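For anyone who wants to try the anti-join, here is a minimal sketch on SQLite via Python's sqlite3 module. The Email column and the sample rows are made up; only MemberID is used for matching, so differing details between the two tables don't matter. It also shows how NOT IN behaves once the subquery returns a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblMember   (MemberID INTEGER PRIMARY KEY, Email TEXT);
CREATE TABLE tblSurvey08 (MemberID INTEGER, Email TEXT);
INSERT INTO tblMember VALUES
    (1, 'a@new.example'), (2, 'b@x.example'), (3, 'c@x.example');
-- Member 1 was surveyed last year under an older email address.
INSERT INTO tblSurvey08 VALUES (1, 'a@old.example');
""")

left_join_sql = """
    SELECT m.MemberID
    FROM tblMember m
    LEFT JOIN tblSurvey08 s08 ON m.MemberID = s08.MemberID
    WHERE s08.MemberID IS NULL
    ORDER BY m.MemberID
"""
not_in_sql = """
    SELECT MemberID
    FROM tblMember
    WHERE MemberID NOT IN (SELECT MemberID FROM tblSurvey08)
    ORDER BY MemberID
"""

via_left_join = [r[0] for r in conn.execute(left_join_sql)]
via_not_in = [r[0] for r in conn.execute(not_in_sql)]

# One NULL MemberID in the survey table changes the NOT IN result:
# "x NOT IN (1, NULL)" is never true, so no rows come back.
conn.execute("INSERT INTO tblSurvey08 VALUES (NULL, 'unknown')")
via_left_join_with_null = [r[0] for r in conn.execute(left_join_sql)]
via_not_in_with_null = [r[0] for r in conn.execute(not_in_sql)]
```

Both forms agree while MemberID has no NULLs; the LEFT JOIN (or NOT EXISTS) form is the safer habit if that is not guaranteed.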
I apologize in advance if I didn't understand your question but I think this is what you're asking for. You can use the insert into statement.
insert into tblPotentialSurvey09
select your_criteria from tblMember where tblMember.MemberId not in (
select MemberId from tblSurvey08
)
First of all, I wouldn't create a new table just for selecting potential members. Instead, I would create a new true/false (1/0) field telling if they are eligible.
However, if you'd still want to copy data to the new table, here's how you can do it:
INSERT INTO tblPotentialSurvey09 (MemberID)
SELECT MemberID
FROM tblMember m
WHERE NOT EXISTS (SELECT 1 FROM tblSurvey08 s WHERE s.MemberID = m.MemberID)
If you just want to create a new field as I suggested, a similar query would do the job.
An outer join should do:
select m_09.MemberID
from tblMember m_09 left outer join
tblSurvey08 m_08 on m_09.MemberID = m_08.MemberID
where
m_08.MemberID is null

Best Practice to querying a Lookup table

I am trying to figure out a way to query a property feature lookup table.
I have a property table that contains rental property information (address, rent, deposit, # of bedrooms, etc.) along with another table (Property_Feature) that represents the features of this property (pool, air conditioning, laundry on-site, etc.). The features themselves are defined in yet another table labeled Feature.
Property
pid - primary key
other property details
Feature
fid - primary key
name
value
Property_Feature
id - primary key
pid - foreign key (Property)
fid - foreign key (Feature)
Let's say someone wants to search for a property that has air conditioning, a pool and laundry on-site. How do you query the Property_Feature table for multiple features for the same property if each row only represents one feature? What would the SQL query look like? Is this possible? Is there a better solution?
Thanks for the help and insight.
In terms of database design, yours is the right way to do it. It's correctly normalized.
For the query, I would simply use exists, like this:
select * from Property
where
exists (select * from Property_Feature where pid = property.pid and fid = 'key_air_conditioning')
and
exists (select * from Property_Feature where pid = property.pid and fid = 'key_pool')
Where key_air_conditioning and key_pool are obviously the keys for those features.
The performance will be OK even for large databases.
Here's the query that will find all the properties with a pool:
select
p.*
from
property p
inner join property_feature pf on
p.pid = pf.pid
inner join feature f on
pf.fid = f.fid
where
f.name = 'Pool'
I use inner joins instead of EXISTS since it tends to be a bit faster.
You can also do something like this:
SELECT *
FROM Property p
WHERE 3 =
( SELECT COUNT(*)
FROM Property_Feature pf
, Feature f
WHERE pf.pid = p.pid
AND pf.fid = f.fid
AND f.name in ('air conditioning', 'pool', 'laundry on-site')
);
Obviously, if your front end is capturing the fids of the feature items when the user is selecting them, you can dispense with the join to Feature and constrain directly on fid. Your front end would know what the count of features selected was, so determining the value for "3" above is trivial.
Compare it, performance wise, to the tekBlues construction above; depending on your data distribution, either one of these might be the faster query.
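Here is a runnable sketch of the count-based variant, on SQLite via Python's sqlite3 module. The helper name properties_with_all, the address column and the sample rows are made up for illustration. It assumes each (pid, fid) pair appears at most once and that the list of requested names is non-empty with no duplicates; otherwise the count comparison would be off.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Property (pid INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE Feature  (fid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Property_Feature (
    id  INTEGER PRIMARY KEY,
    pid INTEGER REFERENCES Property(pid),
    fid INTEGER REFERENCES Feature(fid),
    UNIQUE (pid, fid)          -- the count trick relies on this
);
INSERT INTO Property VALUES (1, '1 Main St'), (2, '2 Oak Ave'), (3, '3 Elm Rd');
INSERT INTO Feature VALUES
    (10, 'pool'), (20, 'air conditioning'), (30, 'laundry on-site');
INSERT INTO Property_Feature (pid, fid) VALUES
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 30);
""")


def properties_with_all(conn, feature_names):
    """Return pids of properties that have every feature in feature_names."""
    placeholders = ", ".join("?" for _ in feature_names)
    sql = f"""
        SELECT p.pid
        FROM Property p
        WHERE ? = (
            SELECT COUNT(*)
            FROM Property_Feature pf
            JOIN Feature f ON pf.fid = f.fid
            WHERE pf.pid = p.pid
              AND f.name IN ({placeholders})
        )
        ORDER BY p.pid
    """
    # First parameter is the number of requested features, then the names.
    params = (len(feature_names), *feature_names)
    return [row[0] for row in conn.execute(sql, params)]


all_three = properties_with_all(
    conn, ["air conditioning", "pool", "laundry on-site"]
)
pool_and_ac = properties_with_all(conn, ["pool", "air conditioning"])
```

The front end only has to pass the selected names; the "3" is derived from the length of that list.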

SQL database problems with addressbook table design

I am writing an address book module for my software right now. I have the database set up so far that it supports a very flexible address-book configuration.
I can create n-entries for every type I want. Type means here data like 'email', 'address', 'telephone' etc.
I have a table named 'contact_profiles'.
This only has two columns:
id Primary key
date_created DATETIME
And then there is a table called contact_attributes. This one is a little more complex:
id PK
#profile (Foreign key to contact_profiles.id)
type VARCHAR describing the type of the entry (name, email, phone, fax, website, ...) I should probably change this to a SET later.
value Text (containing the value for the attribute).
I can now link to these profiles, for example from my user's table. But from here I run into problems.
At the moment I would have to create a JOIN for each value that I want to retrieve.
Is there a possibility to somehow create a View, that gives me a result with the type's as columns?
So right now I would get something like
#profile type value
1 email name@domain.tld
1 name Sebastian Hoitz
1 website domain.tld
But it would be nice to get a result like this:
#profile email name website
1 name@domain.tld Sebastian Hoitz domain.tld
The reason I do not want to create the table layout like this initially is, that there might always be things to add and I want to be able to have multiple attributes of the same type.
So do you know if there is any possibility to convert this dynamically?
If you need a better description please let me know.
You have reinvented a database design called Entity-Attribute-Value. This design has a lot of weaknesses, including the weakness you've discovered: it's very hard to reproduce a query result in a conventional format, with one column per attribute.
Here's an example of what you must do:
SELECT c.id, c.date_created,
c1.value AS name,
c2.value AS email,
c3.value AS phone,
c4.value AS fax,
c5.value AS website
FROM contact_profiles c
LEFT OUTER JOIN contact_attributes c1
ON (c.id = c1.profile AND c1.type = 'name')
LEFT OUTER JOIN contact_attributes c2
ON (c.id = c2.profile AND c2.type = 'email')
LEFT OUTER JOIN contact_attributes c3
ON (c.id = c3.profile AND c3.type = 'phone')
LEFT OUTER JOIN contact_attributes c4
ON (c.id = c4.profile AND c4.type = 'fax')
LEFT OUTER JOIN contact_attributes c5
ON (c.id = c5.profile AND c5.type = 'website');
You must add another LEFT OUTER JOIN for every attribute. You must know the attributes at the time you write the query. You must use LEFT OUTER JOIN and not INNER JOIN because there's no way to make an attribute mandatory (the equivalent of simply declaring a column NOT NULL).
It's far more efficient to retrieve the attributes as they are stored, and then write application code to loop through the result set, building an object or associative array with an entry for each attribute. You don't need to know all the attributes this way, and you don't have to execute an n-way join.
SELECT * FROM contact_profiles c
LEFT OUTER JOIN contact_attributes ca ON (c.id = ca.profile);
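As a rough sketch of that application-side loop, here it is in Python with sqlite3 standing in for the real database. The function name load_profiles, the dates and the second email address are made up for illustration. Each profile ends up as a dictionary whose attributes map a type to a list of values, so multiple entries of the same type are kept.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact_profiles (id INTEGER PRIMARY KEY, date_created TEXT);
CREATE TABLE contact_attributes (
    id      INTEGER PRIMARY KEY,
    profile INTEGER REFERENCES contact_profiles(id),
    type    TEXT,
    value   TEXT
);
INSERT INTO contact_profiles VALUES (1, '2009-01-01'), (2, '2009-01-02');
INSERT INTO contact_attributes (profile, type, value) VALUES
    (1, 'name',    'Sebastian Hoitz'),
    (1, 'email',   'name@domain.tld'),
    (1, 'email',   'other@domain.tld'),
    (1, 'website', 'domain.tld');
""")


def load_profiles(conn):
    """Read attributes as stored and group them per profile in Python."""
    profiles = {}
    rows = conn.execute("""
        SELECT c.id, c.date_created, ca.type, ca.value
        FROM contact_profiles c
        LEFT OUTER JOIN contact_attributes ca ON c.id = ca.profile
        ORDER BY c.id, ca.id
    """)
    for pid, created, attr_type, value in rows:
        entry = profiles.setdefault(
            pid, {"date_created": created, "attributes": defaultdict(list)}
        )
        # Profiles without attributes yield one row with NULL type/value.
        if attr_type is not None:
            entry["attributes"][attr_type].append(value)
    return profiles


profiles = load_profiles(conn)
```

No attribute names appear in the query, so adding a new type later needs no change here.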
You asked in a comment what to use instead of the EAV design if you need this level of flexibility. SQL is not the correct solution if you truly need unlimited metadata flexibility. Here are some alternatives:
Store a TEXT BLOB, containing all the attributes structured in XML or YAML format.
Use a semantic data modeling solution like Sesame, in which any entity can have dynamic attributes.
Abandon databases and use flat files.
EAV and any of these alternative solutions are a lot of work. You should consider very carefully whether you truly need this degree of flexibility in your data model, because things are far simpler if you can treat the metadata structure as relatively unchanging.
If you are limiting yourself to displaying a single email, name, website, etc. for each person in this query, I'd use subqueries:
SELECT cp.ID profile
,(SELECT value FROM contact_attributes WHERE type = 'name' and profile = cp.id) name
,(SELECT value FROM contact_attributes WHERE type = 'email' and profile = cp.id) email
,(SELECT value FROM contact_attributes WHERE type = 'website' and profile = cp.id) website
,(SELECT value FROM contact_attributes WHERE type = 'phone' and profile = cp.id) phone
FROM contact_profiles cp
If you're using SQL Server, you could also look at PIVOT.
If you want to show multiple emails, phones, etc., then consider that each profile must have the same number of them or you'll have blanks.
I'd also factor out the type column. Create a table called contact_attribute_types which would hold "email", "website", etc. Then you'd store the contact_attribute_types.id integer value in the contact_attributes table.
You will need to generate a query like:
select #profile,
max(case when type='email' then value end) as email,
max(case when type='name' then value end) as name,
max(case when type='website' then value end) as website
from mytable
group by #profile
However, that will only show one value for each type per #profile. Your DBMS may have a function you can use instead of MAX to concatenate all the values as a comma-separated string, or you may be able to write one.
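A quick sketch of that pivot, run on SQLite via Python's sqlite3 module. The column is called profile here because #profile is not a valid bare identifier in SQLite, and the second profile and the extra email are made-up sample data. GROUP_CONCAT is SQLite's concatenating aggregate; other systems spell it differently (for example string_agg).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact_attributes (profile INTEGER, type TEXT, value TEXT);
INSERT INTO contact_attributes VALUES
    (1, 'email',   'name@domain.tld'),
    (1, 'email',   'alt@domain.tld'),
    (1, 'name',    'Sebastian Hoitz'),
    (1, 'website', 'domain.tld'),
    (2, 'name',    'Jane Doe');
""")

# MAX(CASE ...) keeps one value per type; a CASE with no ELSE yields NULL
# for other types, and aggregates skip NULLs.
pivoted = conn.execute("""
    SELECT profile,
           MAX(CASE WHEN type = 'email'   THEN value END) AS email,
           MAX(CASE WHEN type = 'name'    THEN value END) AS name,
           MAX(CASE WHEN type = 'website' THEN value END) AS website,
           GROUP_CONCAT(CASE WHEN type = 'email' THEN value END) AS all_emails
    FROM contact_attributes
    GROUP BY profile
    ORDER BY profile
""").fetchall()
```

With two emails on profile 1, MAX returns only the alphabetically last one, while all_emails holds both, comma-separated (in no guaranteed order).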
This kind of data model is generally best avoided for the reasons you have already mentioned!
You create a view for each contact type
When you want all the information you pull from the entire table, when you want a subset of a specific contact type, you pull from the view.
I'd create a stored procedure that takes the intent {all, phone, email, address} as one of the parameters and then derive the data. All my app code would call this stored procedure to get the data. Also, when a new type is added (which should be very infrequent), you create another view and modify only this sproc.
I've implemented a similar design for multiple small/med size systems and have had no issues.
Am I missing something? This seems trivial?
EDIT:
I see what I was missing... You are trying to be normalized and denormalized at the same time. I'm not sure what the rest of your business rules are for pulling records. You could have profiles with multiple or null values for phone/email/addresses etc. I would keep your data format the same and again use a sproc to create the specific view you want. As your business needs change, you leave your data alone and just create another sproc to access it.
There is no one right answer for this question, as one would need to know, for your specific organization or application, how many of those contact methods the business wants to collect, how current they want the information to be, and how much flexibility they are willing to invest in.
Of course, many of us here could make some good guesses as to what the average business would want to do, but the real answer is to find out what your project and your users are interested in.
BTW, all architecture questions about "best"-ness require this sort of cost, benefit, and risk analysis.
Now that document-oriented databases are getting more and more popular, one could use one of them to store all this information in one entry, thereby eliminating all those extra joins and queries.