I have a table in BQ which I refresh on a daily basis. It's a full snapshot every day.
I have a business requirement to create deltas of that feed.
Table details:
The table contains 10 columns.
Out of the 10 columns, 5 change on a daily basis. How do I identify which columns changed and create a snapshot of only those changes?
For example, here are the columns in tableA; the columns that change frequently are shown in bold.
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - ebook
last_product_purchase_date - 2018-05-01
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - null
third_product_purchase_date - null
fourth_product - null
fourth_product_purchase_date - null
After more purchases, the data will look like this:
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - Hardbook
last_product_purchase_date - 2018-05-17
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - CD
third_product_purchase_date - 2017-01-01
fourth_product - null
fourth_product_purchase_date - null
first_product = first product ever purchased
last_product = most recent product purchased
This is just one row of data for one customer. I have millions of customers with all these columns, and roughly half a million of the rows will be updated on a daily basis.
In my delta, I just want the rows where any of the column values changed.
It seems like you have a column for each product bought, repeated for each ordinal (first, second, third...); perhaps this comes from a de-normalized dimensional model. To query the last "update" you would have to compare each column to the previous row's value using the LEAD function. This would use a lot of computation and might not be optimal.
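For illustration, a hedged sketch of that LEAD-based comparison in BigQuery Standard SQL, shown for a single column and assuming the daily snapshots are retained in a table mydataset.daily_snapshot with a snapshot_date column (both names are assumptions):
SELECT custid, snapshot_date, last_product, prev_last_product
FROM (
  SELECT
    custid,
    snapshot_date,
    last_product,
    -- LEAD over descending dates fetches the prior day's value
    LEAD(last_product) OVER (PARTITION BY custid ORDER BY snapshot_date DESC) AS prev_last_product
  FROM mydataset.daily_snapshot
)
WHERE IFNULL(last_product, '') != IFNULL(prev_last_product, '');
Repeating that comparison for all 10 columns is exactly what makes this approach expensive.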
I recommend using repeated fields. The product and product_purchase_date would become repeated fields, and you could simply query using WHERE product_purchase_date = CURRENT_DATE(), which would use much less computation.
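A minimal sketch of what that might look like, assuming the table is restructured as mydataset.customers with a repeated STRUCT field called products (all names here are assumptions):
-- Assumed schema: products ARRAY<STRUCT<product STRING, purchase_date DATE>>
SELECT custid, p.product, p.purchase_date
FROM mydataset.customers,
     UNNEST(products) AS p  -- flattens the repeated field into rows
WHERE p.purchase_date = CURRENT_DATE();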
De-normalized dimensional models are meant to reduce computation on traditional data warehouses. BigQuery, being a fast, highly scalable enterprise data warehouse, has plenty of computing power.
To get a better understanding of how BigQuery works under the hood, I recommend reviewing this document.
I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse this categorization that I use in the post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments (though there are others with similar problems). First, I need to store the available values for these tables. This will make those values available in the application when managing an employee, and allow them to be managed when adding/deleting departments and such. For instance, the Locations table may look like:
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find the use of start/end dates perfectly easy for people to understand: Agent and Location fact tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth taking note of.
1. If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August?
2. Dealing with overlapping date periods can give messy queries, and more importantly, slow queries.
The first one seems simply a matter of convention, but it can have implications when dealing with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact, you don't need to know whether the date is DAY bound or can include a time component. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
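To make the contrast concrete, here are the two joins side by side (schematic fragments only, reusing the AgentDailyData example; Mapping stands for whichever time-dependent mapping table is being joined, and the names are assumptions):
-- INCLUSIVE EndDate: must nudge the boundary to catch intra-day events
ON AgentDailyData.EventTimeStamp >= Mapping.StartDate
AND AgentDailyData.EventTimeStamp < Mapping.EndDate + 1
-- EXCLUSIVE EndDate: no edge adjustment, correct at any time granularity
ON AgentDailyData.EventTimeStamp >= Mapping.StartDate
AND AgentDailyData.EventTimeStamp < Mapping.EndDate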
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
TableA
INNER JOIN
TableB
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index, but it could give a massive range of dates. And then, for each of those records, the exclusive dates could be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But instead we've limited the number of rows that can be returned by searching TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade-off: using disk space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively, I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc., I found it very powerful, adaptable, and maintainable. I saw people making fewer coding mistakes, writing code in less time, the code running faster, and being much more able to adapt to clients' changing needs.
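To illustrate why, a hedged sketch of joining daily fact data to such a one-row-per-day mapping; every table and column name here is an assumption:
select f.EventDate, m.LocationId, sum(f.HoursWorked) as TotalHours
from AgentDailyFacts f
join EmployeeLocationDays m
    on m.EmployeeId = f.EmployeeId
   and m.EffectiveDate = f.EventDate  -- plain equality: no ranges, no + 1 day
group by f.EventDate, m.LocationId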
The only two downsides were:
1. More disk space taken (but trivial compared to the size of the fact tables)
2. Inserts and updates to this mapping were slower
The slow-down for inserts and updates only actually mattered once, where this model was being used to represent a constantly changing process net, with the app wanting to change the mapping about 30 times a second. Even then it worked; it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity. If you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit flag and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value, or no records if a value isn't currently set. Have the foreign key tie back to the employee and tie to the location.
Create a column in the employee table that points to the current active location in the history. When you need to get the employee's location, you join to the history table based on this ID. When you need to get all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history goes, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps the old items around. As such, you can name it something like EmployeeLocations.
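To make the three steps concrete, a minimal T-SQL sketch of this structure; every table and column name below is an illustrative assumption:
create table Locations
(
    LocationId   int identity primary key,
    LocationName varchar(100) not null,
    IsDeleted    bit not null default 0  -- soft delete: hide it, never re-use an ID
)
go
create table EmployeeLocations
(
    EmployeeLocationId int identity primary key,
    EmployeeId    int not null references Employees(EmployeeId),
    LocationId    int not null references Locations(LocationId),
    EffectiveDate datetime not null
)
go
-- The employee row points at its current record in the history table
alter table Employees
    add CurrentEmployeeLocationId int null
        references EmployeeLocations(EmployeeLocationId)
go
-- Current location, with no date comparisons needed
select e.EmployeeId, l.LocationName
from Employees e
join EmployeeLocations el on el.EmployeeLocationId = e.CurrentEmployeeLocationId
join Locations l on l.LocationId = el.LocationId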
I've been beating my head on the desk trying to figure this one out. I have a table that stores job information, and reasons for a job not being completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending that stores what those values actually represent: 01 = Not Enough Info, 02 = Not Enough Time, 03 = Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reason. I can break out the first value, but can't figure out how to do the second. What my limited skills have so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find(',' || fld || ',', ',02,') > 0
assuming your SQL dialect has a string search function (find in this case, but I think it's CHARINDEX for SQL Server).
This ensures the field begins and ends with a comma (comma plus field plus comma) before looking for a specific desired value (with the commas on either side to guarantee a full sub-column match).
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is paid only on row insert/update, not on every single SELECT, amortising that cost efficiently.
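A hedged sketch of that trigger in T-SQL, assuming the two new columns are called Reason1 and Reason2 and that Job_Number identifies a row (all of these names are assumptions):
alter table jobid add Reason1 char(2) null, Reason2 char(2) null
go
create trigger trg_jobid_split_pending on jobid
after insert, update
as
begin
    set nocount on;
    update j
    set Reason1 = left(i.pendinginfo, 2),
        -- a second reason exists only when the value contains a comma
        Reason2 = case when charindex(',', i.pendinginfo) > 0
                       then substring(i.pendinginfo, charindex(',', i.pendinginfo) + 1, 2)
                  end
    from jobid j
    join inserted i on i.Job_Number = j.Job_Number
end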
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach with support tables like these:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
03 | Waiting Review
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOINs or subqueries.
If you need backward compatibility for your software, you can add a view to achieve this.
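For example, a minimal sketch of the join, plus a compatibility view that rebuilds the old comma-separated column (the FOR XML PATH trick assumes SQL Server, and that pendingID is stored as char):
select j.jobID, j.userID, p.pendingID, p.pendingText
from JOBS j
join JOB_PENDING jp on jp.jobID = j.jobID
join PENDING p on p.pendingID = jp.pendingID

create view JOBS_LEGACY as
select j.jobID, j.userID,
       stuff((select ',' + jp.pendingID
              from JOB_PENDING jp
              where jp.jobID = j.jobID
              order by jp.pendingID
              for xml path('')), 1, 1, '') as pendingInfo
from JOBS j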
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write two procedures in my site code, not SQL code:
One procedure converts the table field (eventTypeIds) value, like "3,4,15,6", into a ViewState array, so I can use it anywhere in code.
The other procedure does the opposite: it collects whatever options were checked and converts them back into a comma-separated string for storage.
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
    -- 1..N sequence; JobId must have at least as many rows as the longest PENDING_INFO string
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
    FROM JobId
),
Split AS
(
    SELECT JOB_NUMBER, USER_ASSIGNED,
           SUBSTRING(PENDING_INFO, Numbers.N,
                     CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
    FROM JobId
    JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
                AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split
JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs, then extract the appropriate part of the string.
While I agree with the DBA perspective that you shouldn't store multiple values in a single field, it is doable, as below, and can be practical for application logic despite some performance issues. Let's say you have 10,000 user groups, each having on average 1,000 members. You may want to have a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,'), each number being a memberID, all separated with commas. Also add a comma at the start and end of the data. Then your SELECT would use LIKE '%,someID,%'.
If you cannot change your data ('01,02,03' or similar), then let's say you want rows containing 01: you can still use select ... where pendinginfo LIKE '01,%' OR pendinginfo LIKE '%,01' OR pendinginfo LIKE '%,01,%', which will ensure it matches at the start, end, or inside, while avoiding similar numbers (e.g. 101).
For ten years we've been using the same custom sorting on our tables, and I'm wondering if there is another solution which involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like our replication to replicate unnecessary entries. I had a look into nested sets, but they don't seem to do the job for us.
Base table:
id | a_sort
---+-------
1 10
2 20
3 30
After inserting:
insert into table (a_sort) values(15)
An entry at the second position.
id | a_sort
---+-------
1 10
2 20
3 30
4 15
Ordering the table with:
select * from table order by a_sort
and resorting all the a_sort entries, updating at least id=(2,3,4)
will of course produce the desired output:
id | a_sort
---+-------
1 10
4 20
2 30
3 40
The column names, the column count, the datatypes, a possible join, possible triggers, and the way the resorting is done are all irrelevant to the problem. Also, we've found some pretty neat ways to do this task fast.
The only question is: how the heck can we reduce the updates in the DB to 1 or 2 max?
Seems like an awfully common problem.
The Captain Obvious in me once thought: "use an a_sort float(53) and insert using a fixed value of ordervaluefirstentry + abs(ordervaluefirstentry - ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)
You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
See:
http://en.wikipedia.org/wiki/Linked_list
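A minimal sketch of the singly linked variant in SQL; the table and column names (items, next_id) are assumptions:
create table items
(
    id      int primary key,
    next_id int null references items(id)  -- null marks the tail of the list
)

-- Insert a new row with id 4 between existing rows 1 and 2:
-- exactly one insert plus one update, regardless of table size
insert into items (id, next_id) values (4, 2)  -- new row points at the old successor
update items set next_id = 4 where id = 1      -- predecessor now points at the new row
Reading the rows back in order then means walking the chain (e.g. with a recursive CTE), which is the trade-off for the cheap inserts.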
Excuse the long question!
We have two database tables, e.g. Car and Wheel. They are related in that a wheel belongs to a car and a car has multiple wheels. The wheels, however, can be changed without affecting the "version" of the car. The car's record can be updated (e.g. paint job) without affecting the version of the wheels (i.e. no cascade updating).
For example, Car table currently looks like this:
CarId, CarVer, VersionTime, Colour
1 1 9:00 Red
1 2 9:30 Blue
1 3 9:45 Yellow
1 4 10:00 Black
The Wheels table looks like this (this car only has two wheels!)
WheelId, WheelVer, VersionTime, CarId
1 1 9:00 1
1 2 9:40 1
1 3 10:05 1
2 1 9:00 1
So, there have been 4 versions of this two-wheeled car. Its second wheel (WheelId 2) hasn't changed; the first wheel was changed (e.g. painted) at 9:40 and again at 10:05.
How do I efficiently do "as of" queries that can be joined to other tables as required? Note that this is a new database and we own the schema, so we can change it or add audit tables to make this query easier. We've tried one audit-table approach (with columns: CarId, CarVersion, WheelId, WheelVersion, CarVerTime, WheelVerTime), but it didn't really improve our query.
Example query: Show the Car ID 1 as it was, including its wheel records as of 9:50. This query should result in these two rows being returned:
WheelId, WheelVer, WheelVerTime, CarId, CarVer, CarVerTime, CarColour
1 2 9:40 1 3 9:45 Yellow
2 1 9:00 1 3 9:45 Yellow
The best query we could come up with was this:
select c.CarId, c.VersionTime, w.WheelId,w.WheelVer,w.VersionTime,w.CarId
from Cars c,
( select w.WheelId,w.WheelVer,w.VersionTime,w.CarId
from Wheels w
where w.VersionTime <= "12 Jun 2009 09:50"
group by w.WheelId,w.CarId
having w.WheelVer = max(w.WheelVer)
) w
where c.CarId = w.CarId
and c.CarId = 1
and c.VersionTime <= "12 Jun 2009 09:50"
group by c.CarId, w.WheelId,w.WheelVer,w.VersionTime,w.CarId
having c.CarVer = max(c.CarVer)
And, if you wanted to try this then the create table and insert record SQL is here:
create table Wheels
(
WheelId int not null,
WheelVer int not null,
VersionTime datetime not null,
CarId int not null,
PRIMARY KEY (WheelId,WheelVer)
)
go
insert into Wheels values (1,1,'12 Jun 2009 09:00', 1)
go
insert into Wheels values (1,2,'12 Jun 2009 09:40', 1)
go
insert into Wheels values (1,3,'12 Jun 2009 10:05', 1)
go
insert into Wheels values (2,1,'12 Jun 2009 09:00', 1)
go
create table Cars
(
CarId int not null,
CarVer int not null,
VersionTime datetime not null,
colour varchar(50) not null,
PRIMARY KEY (CarId,CarVer)
)
go
insert into Cars values (1,1,'12 Jun 2009 09:00', 'Red')
go
insert into Cars values (1,2,'12 Jun 2009 09:30', 'Blue')
go
insert into Cars values (1,3,'12 Jun 2009 09:45', 'Yellow')
go
insert into Cars values (1,4,'12 Jun 2009 10:00', 'Black')
go
This kind of table is known as a valid-time state table in the literature. It is universally accepted that each row should model a period by having a start date and an end date. Basically, the unit of work in SQL is the row, and a row should completely define the entity; by having just one date per row, not only do your queries become more complex, your design is compromised by splitting sub-atomic parts onto different rows.
As mentioned by Erwin Smout, one of the definitive books on the subject is:
Richard T. Snodgrass (1999). Developing Time-Oriented Database Applications in SQL
It's out of print but happily is available as a free download PDF (link above).
I have actually read it and have implemented many of the concepts. Much of the text is in ISO/ANSI Standard SQL-92, and although some examples have been implemented in proprietary SQL syntaxes, including SQL Server's (also available as downloads), I found the conceptual information much more useful.
Joe Celko also has a book, 'Thinking in Sets: Auxiliary, Temporal, and Virtual Tables in SQL', largely derived from Snodgrass's work, though I have to say where the two diverge I find Snodgrass's approaches preferable.
I concur this stuff is hard to implement in the SQL products we currently have. We think long and hard before making data temporal; if we can get away with merely 'historical' then we will. Much of the temporal functionality in SQL-92 is missing from SQL Server e.g. INTERVAL, OVERLAPS, etc. Some things as fundamental as sequenced 'primary keys' to ensure periods do not overlap cannot be implemented using CHECK constraints in SQL Server, necessitating triggers and/or UDFs.
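To illustrate that last point, a hedged sketch of enforcing non-overlapping periods with a trigger in SQL Server, since a CHECK constraint cannot see other rows (the CarStates table and all of its columns are assumptions):
create trigger trg_CarStates_no_overlap on CarStates
after insert, update
as
begin
    set nocount on;
    if exists (
        select 1
        from CarStates a
        join inserted i on i.CarId = a.CarId
        where a.CarVer <> i.CarVer
          and i.StartTime < a.EndTime   -- standard interval-overlap test,
          and a.StartTime < i.EndTime)  -- assuming exclusive end times
    begin
        raiserror('Overlapping period for the same car', 16, 1)
        rollback transaction
    end
end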
Snodgrass's book is based on his work for SQL3, a proposed extension to Standard SQL to provide much better support for temporal databases, though sadly this seems to have been effectively shelved years ago :(
As-of queries are easier when each row has a start and an end time. Storing the end time in the table would be most efficient, but if this is hard, you can query it like:
select
ThisCar.CarId
, StartTime = ThisCar.VersionTime
, EndTime = NextCar.VersionTime
from Cars ThisCar
left join Cars NextCar
on NextCar.CarId = ThisCar.CarId
and ThisCar.VersionTime < NextCar.VersionTime
left join Cars BetweenCar
on BetweenCar.CarId = ThisCar.CarId
and ThisCar.VersionTime < BetweenCar.VersionTime
and BetweenCar.VersionTime < NextCar.VersionTime
where BetweenCar.CarId is null
You can store this in a view. Say the view is called vwCars, you can select a car for a particular date like:
select *
from vwCars
where StartTime <= '2009-06-12 09:15'
and ('2009-06-12 09:15' < EndTime or EndTime is null)
You could store this in a table-valued function instead, but that might carry a steep performance penalty.
Depending on your application, you might want to push the versioning to secondary auditing tables that would have both a start date and a nullable end date. I found in a high-traffic OLTP system that the versioning approach can become fairly expensive, and if most of your reads pull the latest version then this might be beneficial.
By using a start and end date you can query the ancillary tables looking for a date that is between start and stop, or greater than start.
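For example, a minimal sketch against such an audit table (the table name, column names, and @as_of variable are assumptions):
select *
from EmployeeLocationsAudit
where StartDate <= @as_of
  and (EndDate > @as_of or EndDate is null)  -- null end date means still current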
Storing the end time in the table for each situation indeed makes the queries easier to express, but creates the problem of maintaining integrity rules such as "no two distinct situations for the same car (wheel/...) may overlap" (still reasonably doable) and "there cannot be holes in the timeseries of distinct situations of any single car (wheel/...)" (more troublesome).
Not storing the end time in the table for each situation forces you to write self-joins each time you need to invoke an Allen operator (overlaps, merges, contains, ...) on the time intervals implied by the only time column you have.
SQL is just a nightmare if you need to do this kind of temporal stuff.
And incidentally, even just accurately formulating these queries in natural language is a nightmare. To illustrate: you said that you needed "as-of" queries, but your examples excluded the situations which were "as-of" 10:05 (WheelVer 3) and 10:00 (colour Black). This despite the fact that those situations are definitely also "as-of" 09:50.
You may be interested in a read of "Temporal Data and the Relational Model". Keep in mind that the treatment in this book is entirely abstract, since, as the book itself says, "this book is not about technology available anywhere today".
The other standard textbook on the subject (I'm told), is one by Snodgrass, but I don't know the title. I'm told the authors of these two books take completely opposite stances as to what the solution ought to be.
This query will return duplicates if you have two rows with the same exact version time for a single car ID, but that's a matter of defining what you consider to be the "latest" one in that situation. I haven't had a chance to test this yet, but I think it will give you what you need. It's at least pretty close.
SELECT
C.car_id,
C.car_version,
C.colour,
C.version_time AS car_version_time,
W.wheel_id,
W.wheel_version,
W.version_time AS wheel_version_time
FROM
Cars C
LEFT OUTER JOIN Cars C2 ON
C2.car_id = C.car_id AND
C2.version_time <= #as_of_time AND
C2.version_time > C.version_time
LEFT OUTER JOIN Wheels W ON
W.car_id = C.car_id AND
W.version_time <= #as_of_time
LEFT OUTER JOIN Wheels W2 ON
W2.car_id = C.car_id AND
W2.wheel_id = W.wheel_id AND
W2.version_time <= #as_of_time AND
W2.version_time > W.version_time
WHERE
C.version_time <= #as_of_time AND
C2.car_id IS NULL AND
W2.wheel_id IS NULL