ActiveRecord timestamp wrong assignment - ruby-on-rails-3

For the following records:
updated_at | created_at | id
----------------------------+----------------------------+---------
2016-08-26 12:33:35.900201 | 2016-08-25 12:33:13.782502 | 2951380
2016-08-26 12:33:35.916025 | 2016-08-25 12:33:13.781838 | 2951379
2016-08-25 12:33:13.684854 | 2016-08-25 12:33:13.684854 | 2951377
2016-08-25 12:33:13.684753 | 2016-08-25 12:33:13.684753 | 2951378
2016-08-25 12:33:13.652293 | 2016-08-25 12:33:13.652293 | 2951376
2016-08-26 12:32:59.669535 | 2016-08-25 12:33:13.589147 | 2951375
2016-08-26 12:32:59.680676 | 2016-08-25 12:33:13.556841 | 2951374
2016-08-26 12:32:59.559429 | 2016-08-25 12:33:13.496964 | 2951373
2016-08-26 12:32:59.573863 | 2016-08-25 12:33:13.461594 | 2951372
2016-08-26 12:31:10.338129 | 2016-08-25 12:33:13.400724 | 2951371
ID 2951378 has an earlier created_at (and updated_at) than the 2951377 record!
Does anyone have any idea how that can happen? These records are inserted by a queue worker handler.

Imagine several transactions occurring simultaneously. They all need auto-generated ids, but the database cannot reserve the same ids for each transaction, because if they all succeed, their rows would collide on commit.
So each transaction gets its own set of auto-incremented values. Transaction A might start before transaction B and get some ids allocated, but then B finishes first and its larger IDs get saved with an earlier time.
It's not a sign of any error. It is a reminder that you should never assume the order of auto-generated IDs correlates to the sequence of events in a DB.
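As a quick sanity check (a sketch, assuming the table is called records), you can list pairs of rows whose id order and created_at order disagree:
SELECT newer.id, newer.created_at, older.id, older.created_at
FROM records AS newer
JOIN records AS older
  ON newer.id > older.id
 AND newer.created_at < older.created_at
LIMIT 10;
Any pairs returned just mean the ids were handed out in a different order than the rows were timestamped, which is exactly the situation shown above.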

It looks like Rails calculates the timestamps itself, rather than relying on the database to do so:
https://github.com/rails/rails/blob/55dfa009769962367c58563480c9f776ae0f53ea/activerecord/lib/active_record/timestamp.rb#L120
So it makes sense that this kind of situation can happen when multiple workers are saving records at the same time.
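For contrast, letting the database assign the value would look something like the sketch below (PostgreSQL syntax; the table name is assumed). Note that Rails would still send its own timestamp unless you also stop it from managing the column:
ALTER TABLE records ALTER COLUMN created_at SET DEFAULT now();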

Related

What is the best way to calculate how long is an object in a specific state?

I am looking for a recommended solution to solve problems like the one below.
Preferably with PostgreSQL, but any other SQL-based solution would help.
So there is a table like this:
day | object_id | property_01 | ...
2022-01-24 | object_01 | A | ...
2022-01-23 | object_01 | A | ...
2022-01-22 | object_01 | A | ...
2022-01-21 | object_01 | B | ...
2022-01-20 | object_01 | B | ...
2022-01-19 | object_01 | A | ...
2022-01-18 | object_01 | A | ...
2022-01-17 | object_01 | B | ...
2022-01-16 | object_01 | A | ...
The base table is a daily "backup", so normally there are no gaps in the sequence of days.
There will always be a row from the creation of an object until it is deleted from the main DB.
What I need to tell is: if an object is in state "A" on the day of the query, how long has it been in that state?
In the example the right answer is "since 2022-01-22", so a simple MIN() does not give the right answer.
Also, I do not need to list all of the intervals spent in state "A", just how long the current one has lasted.
What I tried:
I. Select all the object_ids that are in state "A" on a given date.
II. Select all the rows for those object_ids.
Now I am trying to write a recursive query (a rough SQL version of steps I and II is sketched below), but if there is a better solution, it would be a great help.
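A minimal sketch of those two steps, assuming the table is called backups and the query day is 2022-01-24:
-- Step I: objects that are in state 'A' on the query day
SELECT object_id
FROM backups
WHERE day = DATE '2022-01-24'
  AND property_01 = 'A';
-- Step II: all rows for those objects, newest first
SELECT b.*
FROM backups AS b
WHERE b.object_id IN (
  SELECT object_id
  FROM backups
  WHERE day = DATE '2022-01-24'
    AND property_01 = 'A'
)
ORDER BY b.object_id, b.day DESC;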
Thanks a lot!

DBT snapshots with non-unique records in the source

I’m interested to know whether someone here has ever come across a situation where the source is not always unique when dealing with snapshots in dbt.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the dbt solution runs, my source could have more than one row for the same unique id, as the data has changed more than once since the last run.
Ideally, I’d like to update the respective dbt_valid_to columns from the snapshot table with the earliest updated_at record from the source and subsequently add the new records to the snapshot table making the latest updated_at record the current one.
I know how to achieve this using window functions but not sure how to handle such situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | null |
Source Table
|**id**|**some_attribute**| **updated_at** |
| 123 | ABCD | 2021-01-01 00:00:00 |-> already been loaded to snapshot
| 123 | ZABC | 2021-06-30 00:00:00 |-> already been loaded to snapshot
-------------------------------------------
| 123 | ZZAB | 2021-11-21 00:10:00 |
| 123 | FXAB | 2021-11-21 15:11:00 |
Snapshot Desired Result
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123 | ZZAB | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123 | FXAB | 2021-11-21 15:11:00 | null |
Standard snapshots operate under the assumption that the source table we are snapshotting is being changed without storing history. That is the opposite of the behaviour we have here (the source table we are snapshotting is basically an append-only log of events), which means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
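As a rough illustration of that idea (a sketch only, leaving out the incremental materialization config; the source reference and column names are assumptions), the validity windows can be derived straight from the append-only rows with a window function:
select
    id,
    some_attribute,
    updated_at as valid_from,
    lead(updated_at) over (partition by id order by updated_at) as valid_to
from {{ source('lake', 'my_append_only_table') }}
Rows whose valid_to is null are the current versions, which matches the desired result above.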
I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a staging view downstream of the source that applies the window function you describe. Then you snapshot that view.
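A minimal sketch of such a staging view, assuming you only need the latest version of each id per run (the source reference and column names are made up):
with ranked as (
    select
        id,
        some_attribute,
        updated_at,
        row_number() over (partition by id order by updated_at desc) as rn
    from {{ source('lake', 'my_append_only_table') }}
)
select id, some_attribute, updated_at
from ranked
where rn = 1
Keep in mind that a snapshot on top of this view captures at most one change per id per run, so versions that come and go between runs are not preserved.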
However, I do see potential for a new snapshot strategy that handles append-only sources. Perhaps you’d like to peruse the dbt snapshot docs and the source code of the existing strategies to see if you’d like to build a new one!

Can you delete old entries from a table index?

I made a reminder application that heavily writes and reads records with future datetimes, but rarely touches records with past datetimes. The reminders are indexed by remind_at, so a million records means a million index entries, but this speeds up checking for records that are due in the next hour.
| uuid | user_id | text | remind_at | ... | ... | ... |
| ------- | ------- | ------------ | ------------------- | --- | --- | --- |
| 45c1... | 23 | Buy paint | 2019-01-01 20:00:00 | ... | ... | ... |
| 23f1... | 924 | Pick up car | 2019-02-01 20:00:00 | ... | ... | ... |
| 2d84... | 650 | Call mom | 2020-03-01 20:00:00 | ... | ... | ... |
| 3f1a... | 81 | Get shoes | 2020-04-01 20:00:00 | ... | ... | ... |
The problem is performance. Once the database grows big, retrieving any record becomes relatively slow.
I'm trying to find out which RDBMSs offer a fully or semi-automated way to get better retrieval performance for future datetimes, since past datetimes are rarely retrieved or checked.
A neat solution, if it exists, would be to instruct the RDBMS to prune old entries from the index. I don't know whether any RDBMS allows that, but PostgreSQL, SQL Server, and SQLite do support a "partial index". Still, what would happen if I recreate an index on a table with millions of records?
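For what it's worth, here is a minimal PostgreSQL sketch of that partial-index idea (the table, index names, and cutoff date are made up). The index predicate cannot call now(), so the index has to be recreated periodically with a newer cutoff; CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY let you do that without blocking writes, even on a table with millions of rows:
-- build the new, smaller index without blocking writes
CREATE INDEX CONCURRENTLY reminders_upcoming_2020_idx
    ON reminders (remind_at)
    WHERE remind_at > '2020-03-01';
-- once it is ready, drop the previous one
DROP INDEX CONCURRENTLY reminders_upcoming_2019_idx;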
Some solutions that didn't fit the bill:
Horizontal scaling: it would replicate the same problem n times.
Vertical scaling: still doesn't fix the problem.
Sharding: could be, since every instance would hold a part of the database, but the app would have to handle the sharding key.
Two databases: okay, one fast and one slow. Moving old entries to the "slow instance" (the toaster) would have to be done manually, and the app would have to be heavily modified to check both databases, since it doesn't know up front which one holds a given record. The logic grows considerably.
Anyway, the whole point is to make upcoming (or the nearest) reminders snappier to retrieve while disregarding the performance of retrieving older entries.

I think I need a loop in an MS Access Query

I have a table of login and logout times for users; the table looks something like this:
| ID | User | WorkDate | Start | Finish |
| 1 | Bill | 07/12/2017 | 09:00:00 | 17:00:00 |
| 2 | John | 07/12/2017 | 09:00:00 | 12:00:00 |
| 3 | John | 07/12/2017 | 12:30:00 | 17:00:00 |
| 4 | Mary | 07/12/2017 | 09:00:00 | 10:00:00 |
| 5 | Mary | 07/12/2017 | 10:10:00 | 12:00:00 |
| 6 | Mary | 07/12/2017 | 12:10:00 | 17:00:00 |
I'm running a query to find out the length of the breaks that each user took, by taking a DateDiff between the Min of Finish and the Max of Start, then doing some other sums/queries to work out their break length.
This works where I have a maximum of two rows per User per WorkDate, so rows 1, 2 and 3 give me workable data.
Rows 4, 5 and 6 do not.
So, long story short: how can I calculate the break times from the above data in an MS Access query? I'm assuming I'm going to need some looping statement, but I have no idea where to begin.
Here is a solution that comes to mind first.
First query: get the Min/Max start and finish times per user per day.
Second query: calculate the total elapsed time for each day, using the Min(Start) and Max(Finish) from the first query.
Third query: calculate the time worked for each shift (the difference between its start and finish times) and then sum those per day.
Fourth query: take the difference between the total from the second query and the total from the third query. That difference is the amount of break time they took.
If you need additional help, I can provide some screenshots of example queries.
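A rough sketch of those queries in Access SQL, assuming the table is called Logins and collapsing the first two steps into one saved query (all names here are made up).
qryDailySpan (elapsed time between first login and last logout, in minutes):
SELECT [User], WorkDate, DateDiff("n", Min([Start]), Max([Finish])) AS SpanMinutes
FROM Logins
GROUP BY [User], WorkDate;
qryDailyWorked (sum of the individual shift lengths, in minutes):
SELECT [User], WorkDate, Sum(DateDiff("n", [Start], [Finish])) AS WorkedMinutes
FROM Logins
GROUP BY [User], WorkDate;
Final query (the difference between the two is the total break time):
SELECT s.[User], s.WorkDate, s.SpanMinutes - w.WorkedMinutes AS BreakMinutes
FROM qryDailySpan AS s INNER JOIN qryDailyWorked AS w
ON s.[User] = w.[User] AND s.WorkDate = w.WorkDate;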

How to use update_all when all records are different?

How can I use update_all, if I want to update a column of 300,000 records all with a variety of different values?
What I want to do is something like:
Model.update_all(:column => [2,33,94,32]).where(:id => [22974,22975,22976,22977])
But unfortunately this doesn't work, and it's even worse for 300,000 entries.
From the ActiveRecord#update documentation:
people = { 1 => { "first_name" => "David" }, 2 => { "first_name" => "Jeremy" } }
Person.update(people.keys, people.values)
So in your case:
updates = {22974 => {column: 2}, 22975 => {column: 33}, 22976 => {column: 94}, 22977 => {column: 32}}
Model.update(updates.keys, updates.values)
Edit: Just had a look at the source, and this is generating n SQL queries too... So probably not the best solution
The only way I found to do it is to generate an INSERT INTO request with the updated values. I'm using the gem "activerecord-import" for that.
For example,
I have a table with val values
+--------+--------------+---------+------------+-----+-------------------------+-------------------------+
| pkey | id | site_id | feature_id | val | created_at | updated_at |
+--------+--------------+---------+------------+-----+-------------------------+-------------------------+
| 1 | | 125 | 7 | 88 | 2016-01-27 10:25:45 UTC | 2016-02-05 11:18:14 UTC |
| 111765 | 0001-0000024 | 125 | 7 | 86 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:18:14 UTC |
| 111766 | 0001-0000062 | 125 | 7 | 15 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:18:14 UTC |
| 111767 | 0001-0000079 | 125 | 7 | 19 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:18:14 UTC |
| 111768 | 0001-0000086 | 125 | 7 | 33 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:18:14 UTC |
+--------+--------------+---------+------------+-----+-------------------------+-------------------------+
Select records:
products = CustomProduct.limit(5)
Update records as needed:
products.each_with_index{|p, i| p.val = i}
Save records in a single request:
CustomProduct.import products.to_a, :on_duplicate_key_update => [:val]
All your records will be updated in a single request. Please see the "activerecord-import" gem documentation for more details.
+--------+--------------+---------+------------+-----+-------------------------+-------------------------+
| pkey | id | site_id | feature_id | val | created_at | updated_at |
+--------+--------------+---------+------------+-----+-------------------------+-------------------------+
| 1 | | 125 | 7 | 0 | 2016-01-27 10:25:45 UTC | 2016-02-05 11:19:49 UTC |
| 111765 | 0001-0000024 | 125 | 7 | 1 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:19:49 UTC |
| 111766 | 0001-0000062 | 125 | 7 | 2 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:19:49 UTC |
| 111767 | 0001-0000079 | 125 | 7 | 3 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:19:49 UTC |
| 111768 | 0001-0000086 | 125 | 7 | 4 | 2016-01-27 11:33:22 UTC | 2016-02-05 11:19:49 UTC |
+--------+--------------+---------+------------+-----+-------------------------+-------------------------+
The short answer to your question is: you can't.
The point of update_all is to assign the same value to the column for all records (matching the condition if provided). The reason that is useful is that it does it in a single SQL statement.
I agree with Shime's answer for correctness, although it will generate n SQL calls. So maybe there is something more to your problem that you're not telling us. Perhaps you can iterate over each possible value, calling update_all for the objects that should get that value. Then it's a matter of either building the appropriate hash or, even better, if the condition is based on something in the Model itself, passing the condition to update_all.
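In raw SQL terms (a sketch, with made-up table and column names and the ids from the question), update_all is one UPDATE with one value, so the grouping idea above becomes one such statement per distinct value:
UPDATE models SET col = 2 WHERE id IN (22974);
UPDATE models SET col = 33 WHERE id IN (22975);
UPDATE models SET col = 94 WHERE id IN (22976);
UPDATE models SET col = 32 WHERE id IN (22977);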
This is my 2020 answer:
The most upvoted answer is wrong; as the author himself states, it will trigger n SQL queries, one for each row.
The second most upvoted answer suggests the gem "activerecord-import", which is the way to go. However, it works by instantiating ActiveRecord models, and if you are in the market for a gem like this, you're probably looking for extreme performance (it was our case, anyway).
So this is what we did. First, you build an array of hashes, each hash containing the id of the record you want to update and any other fields.
For instance:
records = [{ id: 1, name: 'Bob' }, { id: 2, name: 'Wilson' },...]
Then you invoke the gem like this:
YourModelName.import(records, on_duplicate_key_update: [:name, :other_columns_whose_keys_are_present_in_the_hash], validate: false, timestamps: false)
Explanation:
on_duplicate_key_update means that, if the database finds a collision on the primary key (and it will on every row, since we're talking about updating existing records), it will NOT fail, and will instead update the columns you pass in that array.
If you don't pass validate: false (the default is true), it will try to instantiate a new model instance for each row and will probably fail validation (since your hashes contain only partial information).
timestamps: false is also optional, but it's good to know it's there.
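For the curious, the single statement the gem ends up sending is roughly of this shape (MySQL flavour; the table and column names are assumptions):
INSERT INTO your_models (id, name)
VALUES (1, 'Bob'), (2, 'Wilson')
ON DUPLICATE KEY UPDATE name = VALUES(name);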