Technique for querying date-based log data - SQL

I have date-based log data for financial records. Every time a record changes, a new copy of the record is made in the database.
The current method I am using, which I describe below, is complex and performs poorly. I am dealing with millions of rows and lots of log tables.
The logs are tables in my database that mimic the table being logged, with the addition of a unique log identifier and a log date.
For instance, database table RecordLog looks like this:
LogId | RecordId | Log Date   | Record Data
--------------------------------------------
1     | 1        | 2019-07-02 | ...
2     | 1        | 2019-05-12 | ...
3     | 1        | 2019-03-22 | ...
4     | 1        | 2019-01-01 | ...
5     | 1        | 2018-08-01 | ...
6     | 2        | 2018-01-01 | ...
7     | 3        | 2019-01-01 | ...
8     | 3        | 2019-02-15 | ...
9     | 3        | 2018-10-15 | ...
- The LogId is the unique log id for the RecordLog table, while the RecordId references the unique identifier on the Record table.
- The Record Data mimics the rest of the Record table's columns.
A lot of reporting/analytics is based on a point in time. For instance, the user wants to know the state of affairs at 2019-01-02.
In that case we would get these rows, since they are the closest recorded instances with a log date <= 2019-01-02:
LogId | RecordId | Log Date   | Record Data
--------------------------------------------
4     | 1        | 2019-01-01 | ...
6     | 2        | 2018-01-01 | ...
7     | 3        | 2019-01-01 | ...
In order to perform these queries now, I am utilizing an inner query.
select * from RecordLog
where ...
  and ...
  and ...
  and RecordLog.LogId in (
      select max(InnerRecordLog.LogId)
      from RecordLog as InnerRecordLog
      where InnerRecordLog.LogDate <= ?
      group by InnerRecordLog.RecordId
  )
One of the challenges is that I am using HQL to write these queries, which limits my access to some native database options.

Postgres has a great extension called distinct on which is perfectly suited for this:
select distinct on (rl.recordid) rl.*
from recordlog rl
where rl.logdate <= '2019-01-02'
order by rl.recordid, rl.logdate desc;
distinct on (as used here) returns one record per recordid (the keys in parentheses). The specific record is the latest logdate record -- but subject to the where conditions, of course.
In other databases, the most efficient method is usually a correlated subquery:
select rl.*
from recordlog rl
where rl.logdate = (select max(rl2.logdate)
                    from recordlog rl2
                    where rl2.recordid = rl.recordid
                      and rl2.logdate <= '2019-01-02'
                   );
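A third option, not mentioned above, is a window function, which most modern databases support (SQL Server, Oracle, Postgres). This is only a sketch against the same recordlog table and, like distinct on, it would likely need to run as a native query rather than HQL:
select *
from (select rl.*,
             row_number() over (partition by rl.recordid
                                order by rl.logdate desc) as rn
      from recordlog rl
      where rl.logdate <= '2019-01-02'
     ) latest
where rn = 1;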

Related

Dynamically generate unique primary key for a column composition in a T-SQL SELECT statement

Does anybody know if there's a way to generate a unique integer primary key in an MS SQL Server / Transact-SQL SELECT statement?
My problem is that I have to merge 3 DataTables programmatically by a composition of columns (like a composite primary key). The composition isn't the actual primary key, though.
My tables look somewhat like this:
Table 1: Base Table, which needs to be filled:
+----------+------+----------------+------------+----------+
|operatorid|opcode|bookkeeping_date|cash_amount |tip_amount|
+----------+------+----------------+------------+----------+
| 1 | 1 | 01.01.2018 |null |null |
+----------+------+----------------+------------+----------+
| 1 | 1 | 01.02.2018 |null |null |
+----------+------+----------------+------------+----------+
| 2 | 2 | 01.02.2018 |null |null |
+----------+------+----------------+------------+----------+
Table 2: Cash Data Table to be merged with base table
+----------+------+----------------+------------+
|operatorid|opcode|bookkeeping_date|cash_amount |
+----------+------+----------------+------------+
| 1 | 1 | 01.01.2018 |2.50 |
+----------+------+----------------+------------+
| 1 | 1 | 01.02.2018 |17.80 |
+----------+------+----------------+------------+
| 2 | 2 | 01.02.2018 |4.20 |
+----------+------+----------------+------------+
Table 3: Tip Data Table to be merged with base table:
+----------+------+----------------+----------+
|operatorid|opcode|bookkeeping_date|tip_amount|
+----------+------+----------------+----------+
| 1 | 1 | 01.01.2018 |3.50 |
+----------+------+----------------+----------+
| 1 | 1 | 01.02.2018 |4.20 |
+----------+------+----------------+----------+
| 2 | 2 | 01.02.2018 |0.00 |
+----------+------+----------------+----------+
So, simplified, the goal is to fill "Table 1: Base Table" by merging the DataTables. We already have a C# method to manage the merge by primary key after the DataTables are selected.
My problem now is that I don't have a unique primary key, only the composition of "operatorid" and "bookkeeping_date".
Is there a way I could modify my SELECT statements for each table to get a unique integer by hash or checksum or stuff like that?
Edit: The cash amount and the tip amount are summed values in the table SELECT statements, using the SUM() aggregate.
Best regards
Epanalepsis
Data can regularly change in these tables, and since historical records could always be introduced that would throw off any ranking or row-number functions, it'd probably be best to go with a route that uses operatorid and bookkeeping_date in tandem to create a unique key.
One simple approach would be to convert the date to an integer value, and append the operatorid to the end.
1/1/2018 becomes "43099". Operatorid 1 becomes a "1" at the end (or "01"/"001"/"0001" if these IDs have a chance of becoming larger integers).
select
    convert(int,
        convert(varchar(6),
            convert(int, convert(datetime, replace(bookkeeping_date, '.', '-'), 110)))
        /* datetime -> int gives days since 1900-01-01, e.g. 2018-01-01 -> 43099 */
        + convert(varchar(2), operatorid)) as unique_id
from (select 1 as operatorid, 1 as opcode, '01.01.2018' as bookkeeping_date, 3.50 as tip_amount) tip
You'd have a few centuries before the date integers reach six digits and might cause some complications, but you could also pad the date integer to a fixed width to accommodate that if it's a concern.
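A variation on that fixed-width idea (only a sketch, and it assumes operatorid always stays below 1000) is to build the key arithmetically from a yyyymmdd-style integer instead of concatenating strings:
select convert(bigint, convert(varchar(8), convert(date, bookkeeping_date, 104), 112)) * 1000
       + operatorid as unique_id
from (select 1 as operatorid, '01.01.2018' as bookkeeping_date) t
/* style 104 parses 'dd.mm.yyyy', style 112 renders 'yyyymmdd',
   so operatorid 1 on 01.01.2018 becomes 20180101001 */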

BigQuery DML COUNT() across multiple tables

I'm looking for a mechanism to check the accuracy of data that I import daily into multiple BigQuery tables. Each table has a similar format, with a DATE and an ID column. The table format looks like this:
Table_1
| DATE | ID |
| 2018-10-01 | A |
| 2018-10-01 | B |
| 2018-10-02 | A |
| 2018-10-02 | B |
| 2018-10-02 | C |
What I want to monitor is the evolution of the number of IDs, through an output table like this:
CONTROL_TABLE
| DATE | COUNT(Table1.ID) | COUNT(Table2.ID) | COUNT(Table3.ID) |
| 2018-10-01 | 2 | 487654 | 675386 |
| 2018-10-02 | 3 | 488756 | 675447 |
I'm trying to do this with a single SQL query, but I face several limits with the DML, such as:
-> A single SELECT with all the tables joined is out of the question for performance reasons (20+ tables with millions of lines)
-> I was thinking of going through temporary tables, but it seems I cannot run multiple DELETE + INSERT statements on several tables with DML
-> I cannot use a wildcard table as the output of the query
Would anyone have an idea how to get such a result in an optimized way, ideally through a single query?
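One possible sketch (the dataset.Table_1 through dataset.Table_3 names are assumptions, not from the question) is to aggregate each table separately and union only the small per-date counts, so the millions of detail rows are never joined:
SELECT
  DATE,
  SUM(IF(src = 1, cnt, 0)) AS count_table1,
  SUM(IF(src = 2, cnt, 0)) AS count_table2,
  SUM(IF(src = 3, cnt, 0)) AS count_table3
FROM (
  SELECT DATE, 1 AS src, COUNT(ID) AS cnt FROM `dataset.Table_1` GROUP BY DATE
  UNION ALL
  SELECT DATE, 2, COUNT(ID) FROM `dataset.Table_2` GROUP BY DATE
  UNION ALL
  SELECT DATE, 3, COUNT(ID) FROM `dataset.Table_3` GROUP BY DATE
) AS per_table
GROUP BY DATE
ORDER BY DATE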

Duplicates when filtering by group with Mondrian

I'm trying to create a Mondrian schema to be used in Saiku. The rest of the schema is working correctly, but the main filter isn't.
I have tried several ways of making this work, but so far I always get duplicates.
This issue can be reproduced with only two tables, which we'll call fact_table and user_group. The fact_table contains the user_id and measures, such as:
user_id|amount
1 |10
2 |15
3 |17
The user_group table contains the user_id and the group(s) the user belongs to. If a user belongs to several groups, it will have several rows:
user_id|group_id
1 |100
1 |200
2 |100
Every time I run a query for the groups 100 and 200 I get the following incorrect data:
user_id|amount
1 |20
2 |15
Note that the amount for user 1 is doubled because that user belongs to two groups. The problem is that a dimension is not expected to have duplicate ids. Is there any way to make this work?
It seems your data warehouse schema does not follow star schema rules. Your dimension table user_group should contain only one key column with unique values (the user_id column is insufficient).
A few possible solutions come into play:
1) Add a group_id column to fact_table (this leads to duplicate amount records).
Fact table:
user_id | group_id | amount
1 | 100 | 10
1 | 200 | 10
2 | 100 | 15
3 | #null | 17
2) Treat the fact_table and user_group tables as dimensions and build a new fact table on top of them.
Fact table:
new_fact_key | user_id | group_id
1 | 1 | 100
2 | 1 | 200
3 | 2 | 100
4 | 3 | #null
1st Dimension table:
user_id|amount
1 |10
2 |15
3 |17
2nd Dimension table:
group_id
100
200
You can replace the #null values with surrogate keys. I recommend reading The Data Warehouse Toolkit, 3rd Edition, to find out more about star-schema concepts and surrogate keys.
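As a rough sketch of option 2, with the table and column names assumed from the examples above, the new fact table could be populated along these lines:
select row_number() over (order by f.user_id, ug.group_id) as new_fact_key,
       f.user_id,
       ug.group_id
from fact_table f
left join user_group ug on ug.user_id = f.user_id;
/* the left join keeps user 3, who has no group, with a null group_id */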

Column 'Course.Course_Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

I need to link two tables' columns, please help me. This is my code:
SELECT Student.Stu_Course_ID, Course.Course_Name, COUNT(Student.Stu_ID) AS NoOfStudent FROM Student
INNER JOIN Course
ON Student.Stu_Course_ID=Course.Course_ID
GROUP BY Stu_Course_ID;
This is my course table:
__________________________________________
|Course_ID | Course_Name |
|1 | B.Eng in Software Engineering |
|2 | M.Eng in Software Engineering |
|3 | BSC in Business IT |
I got the number of students from the student table:
_____________________________
|Stu_Course_ID | NoOfStudents |
|1 | 30 |
|2 | 12 |
|3 | 20 |
This is what I want:
____________________________________________________________
|Stu_Course_ID | Course_Name | NoOfStudents|
|1 | B.Eng in Software Engineering | 30 |
|2 | M.Eng in Software Engineering | 12 |
|3 | BSC in Business IT | 20 |
You need to add Course.Course_Name to your group by clause:
SELECT Student.Stu_Course_ID,
       Course.Course_Name,
       COUNT(Student.Stu_ID) AS NoOfStudent
FROM Student
INNER JOIN Course
  ON Student.Stu_Course_ID = Course.Course_ID
GROUP BY Student.Stu_Course_ID, Course.Course_Name;
Imagine the following simple table (T):
ID | Column1 | Column2 |
----|---------+----------|
1 | A | X |
2 | A | Y |
Your query is similar to this:
SELECT ID, Column1, COUNT(*) AS Count
FROM T
GROUP BY Column1;
You know you have 2 records for A in Column1, so you expect a count of 2. However, you are also selecting ID, and there are two different values for ID where Column1 = A, so the following result:
ID | Column1 | Count |
----|---------+----------|
1 | A | 2 |
Is no more or less correct than
ID | Column1 | Count |
----|---------+----------|
2 | A | 2 |
This is why ID cannot be contained in the select list unless it is included in the group by clause or is part of an aggregate function.
For what it's worth, if Course_ID is the primary key of the table Course, then the following query is legal according to the SQL standard and will work in PostgreSQL, and I suspect at some point Microsoft will build this functionality into SQL Server too:
SELECT Course.Course_ID,
       Course.Course_Name,
       COUNT(Student.Stu_ID) AS NoOfStudent
FROM Student
INNER JOIN Course
  ON Student.Stu_Course_ID = Course.Course_ID
GROUP BY Course.Course_ID;
The reason is that since Course.Course_ID is the primary key of Course, there can be no duplicates of it in the table; therefore there can be only one value of Course_Name for each Course_ID.
You have to list the non-aggregated columns you want to retrieve in the GROUP BY clause, so you have to add Course.Course_Name there as well.

select multi row inside of a table (not same condition)

I will try to explain an issue I have been facing recently.
I have designed tables to track the changes applied by users inside a depot of an NLP engine.
I have two tables named Token and Lexeme. Each token has an id that connects directly to a row of the Lexeme table, so I can always find the latest, updated lexeme by looking it up through the Token table.
Here is their schema:
Token Table:
+-----+----------+----------+
| Id | token |LexemeId* |
+-----+----------+----------+
* LexemeId refers to a row inside the Lexeme table.
Lexeme Table:
+-----+---------------------+-------------+
| Id | some information |UpdatedFrom* |
+-----+---------------------+-------------+
* The UpdatedFrom field refers to another row inside the Lexeme table.
Null means there are no more rows related to this token (lexeme).
An example:
Token Table:
+-----+----------+----------+
| 0 | A |4 |
| 1 | B |1 |
+-----+----------+----------+
Lexeme Table:
+-----+----------------------+-------------+
| 0 | A information#1 |NULL |
| 1 | B information |NULL |
| 2 | A information#2 |0 |
| 3 | A information#3 |2 |
| 4 | A information#4 |3 |
+-----+----------------------+-------------+
I hope that makes things clear.
I want to write a stored procedure to collect all records related to each token. For example, for token 'A', I expect to get an array (or data table) that looks like this:
+-----+----------------------+-------------+
| id | informations | updated from|
+-----+----------------------+-------------+
| 0 | A information#1 |NULL |
| 2 | A information#2 |0 |
| 3 | A information#3 |2 |
| 4 | A information#4 |3 |
+-----+----------------------+-------------+
Does anybody have any idea to help me?
My knowledge of SQL scripts is limited to UPDATE, INSERT, and SELECT statements, nothing more.
Thanks in advance.
Assuming this is in an RDBMS that supports recursive CTEs, try:
with cte as
(
    select t.id as TokenId, t.token, l.Id, l.SomeInformation, l.UpdatedFrom
    from Token t
    join Lexeme l on t.LexemeId = l.id
    union all
    select t.TokenId, t.token, l.Id, l.SomeInformation, l.UpdatedFrom
    from cte t
    join Lexeme l on t.UpdatedFrom = l.id
)
select Id, SomeInformation, UpdatedFrom
from cte
where TokenId = 0 /* token = 'A' */
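If the database is PostgreSQL or MySQL 8+, the only change that should be needed is the RECURSIVE keyword; a sketch under that assumption, reusing the same table and column names:
with recursive cte as
(
    select t.id as TokenId, t.token, l.Id, l.SomeInformation, l.UpdatedFrom
    from Token t
    join Lexeme l on t.LexemeId = l.id
    union all
    select c.TokenId, c.token, l.Id, l.SomeInformation, l.UpdatedFrom
    from cte c
    join Lexeme l on c.UpdatedFrom = l.id
)
select Id, SomeInformation, UpdatedFrom
from cte
where TokenId = 0 /* token = 'A' */;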