PostgreSQL query to get events occurring within microseconds of each other

PostgreSQL query to get events occurring within microseconds of each other - sql

Ruby on Rails' ActiveRecord's destroy_all method generates one SQL DELETE query per record. We have a page in a web application that allows a user to delete associated records. Since the deleted_at timestamps are off by a few microseconds, it becomes challenging to group all the records that were deleted at the same time. Here is some example data:
id | deleted_at
----+----------------------------
71 |
45 | 2014-04-29 18:35:00.676153
46 | 2014-04-29 18:35:00.685657
47 | 2014-04-30 21:11:00.73096
48 | 2014-04-30 21:11:00.738533
49 | 2014-04-30 21:11:00.745232
50 |
51 |
(8 rows)
So you can see there were two events here, one affecting 2 rows (with ids 45 and 46) and one affecting 3 rows (with ids 47, 48, and 49). My question is how can I query this table so that each event is grouped into a single row? I've considered using extract(microseconds from r.deleted_at) but that would fail if the seconds were wrapped. I want a query, or even a function that compares each record and groups those with deleted_at within a certain threshold of each other.
Edit: I should mention that I can't use delete_all because I want callbacks to be run. We are using the paranoia gem to soft-delete records.

I found a workaround. Instead of depending on ActiveRecord callbacks, I can simply set the timestamp manually using update_all. So instead of Model.where(...).delete_all I do Model.where(...).update_all(deleted_at: Time.current). Now I can easily group by timestamp.
If anyone wants to answer the original PostgreSQL query question, I'll be glad to choose the correct answer for points.

Related

MariaDB - embed function to automatically sum columns and store result?

it is possible to store a function IN the table to automatically sum a group of columns and store the result in a final column?
ie:
+----+------------+-----------+-------------+------------+
| id | appleCount | pearCount | bananaCount | totalFruit |
+----+------------+-----------+-------------+------------+
| 1 | 300 | 60 | 120 | 480 |
+----+------------+-----------+-------------+------------+
where the column totalFruit is automatically calculated from the previous three columns and updated as the other columns update. in this specific application, there is ONLY going to be the one row. it would be spanky-handy to be able to just push the updated counts and then pull the calculated total out. i seem to recall reading about this ability somewhere, but for the life of me, i can't recall where... :poop:
if there is not way to do this, that's cool. but if there is... :smile:
TIA!
WR!

Yes, it is possible. But is it worth it? It is simple enough to do
SELECT ...
appleCount + pearCount + bananaCount AS totalFruit
...
See MariaDB Generated Columns for how to generate the extra column -- either as a real extra column or "virtual". What version of MariaDB?--There are a number of changes over time.
(MySQL users: 5.7.6 has a similar MySQL Generated Columns.)

django database design when you will have too many rows

I have a django web app with postgres db; the general operation is that every day I have an array of values that need to be stored in one of the tables.
There is no foreseeable need to query the values of the array but need to be able to plot the values for a specific day.
The problem is that this array is pretty big and if I were to store it in the db, I'd have 60 million rows per year but if I store each row as a blob object, I'd have 60 thousand rows per year.
Is is a good decision to use a blob object to reduce table size when you do not want to query with the row of values?
Here are the two options:
option1: keeping all
group(foreignkey)| parent(foreignkey) | pos(int) | length(int)
A | B | 232 | 45
A | B | 233 | 45
A | B | 234 | 45
A | B | 233 | 46
...
option2: collapsing the array into a blob:
group(fk)| parent(fk) | mean_len(float)| values(blob)
A | B | 45 |[(pos=232, len=45),...]
...
so I do NOT want to query pos or length but I want to query group or parent.
An example of read query that I'm talking about is:
SELECT * FROM "mytable"
LEFT OUTER JOIN "group"
ON ( "group"."id" = "grouptable"."id" )
ORDER BY "pos" DESC LIMIT 100
which is a typical django admin list_view page main query.

I tried loading the data and tried displaying the table in the django admin page without doing any complex query (just a read query).
When I get pass 1.5 millions rows, the admin page freezes. All it takes is a some count query on that table to cause the app to crash so I should definitely either keep the data as a blob or not keep it in the db at all and use the filesystem instead.
I want to emphasize that I've used django 1.8 as my test bench so this is not a postgres evaluation but rather a system evaluation with django admin and postgres.

Cumulative average number of records created for specific day of week or date range

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue, I don't have a whole lot of experience with MySql when it comes to collating cumulative sum and running averages. I've thrown together the following query which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute and there are only 80k records within the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS t1.day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.date_time_entry)) = t1.day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY t1.day_of_week;
Halps? Please, don't make me cut myself again. :'(

How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.

I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week

The reason why your query takes so long is because of your inner select, you are essentialy running 6,400,000,000 queries. With a query like this your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed or the user logs in and checks the report after.
Even with the optimization written by OMG Ponies (bellow) you are still looking at around the same number of queries.
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week

custom sorting or ordering a table without resorting the whole shebang

For ten years we've been using the same custom sorting on our tables, I'm wondering if there is another solution which involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like to have our replication replicate unnecessary entries.I had a look into nested sets, but it doesn't seem to do the job for us.
Base table:
id | a_sort
---+-------
1 10
2 20
3 30
After inserting:
insert into table (a_sort) values(15)
An entry at the second position.
id | a_sort
---+-------
1 10
2 20
3 30
4 15
Ordering the table with:
select * from table order by a_sort
and resorting all the a_sort entries, updating at least id=(2,3,4)
will of course produce the desired output:
id | a_sort
---+-------
1 10
4 20
2 30
3 40
The column names, the column count, datatypes, a possible join, possible triggers or the way the resorting is done is/are irrelevant to the problem.Also we've found some pretty neat ways to do this task fast.
only; how the heck can we reduce the updates in the db to 1 or 2 max.
Seems like an awfully common problem.
The captain obvious in me thougth once "use an a_sort float(53), insert using a fixed value of ordervaluefirstentry+abs(ordervaluefirstentry-ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)

You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
See:
http://en.wikipedia.org/wiki/Linked_list

Best way to use hibernate for complex queries like top N per group

I'm working now for a while on a reporting applications where I use hibernate to define my queries. However, more and more I get the feeling that for reporting use cases this is not the best approach.
The queries only result partial columns, and thus not typed objects
(unless you cast all fields in java).
It is hard to express queries without going straight into sql or
hql.
My current problem is that I want to get the top N per group, for example the last 5 days per element in a group, where on each day I display the amount of visitors.
The result should look like:
| RowName | 1-1-2009 | 2-1-2009 | 3-1-2009 | 4-1-2009 | 5-1-2009
| SomeName| 1 | 42 | 34 | 32 | 35
What is the best approach to transform the data which is stored per day per row to an output like this? Is it time to fall back on regular sql and work with untyped data?
I really want to use typed objects for my results but java makes my life pretty hard for that. Any suggestions are welcome!

Using the Criteria API, you can do this:
Session session = ...;
Criteria criteria = session.createCriteria(MyClass.class);
criteria.setFirstResult(1);
criteria.setMaxResults(5);
... any other criteria ...
List topFive = criteria.list();
To do this in vanilla SQL (and to confirm that Hibernate is doing what you expect) check out this SO post:

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

PostgreSQL query to get events occurring within microseconds of each other - sql

Related

MariaDB - embed function to automatically sum columns and store result?

django database design when you will have too many rows

Cumulative average number of records created for specific day of week or date range

custom sorting or ordering a table without resorting the whole shebang

Best way to use hibernate for complex queries like top N per group

Categories

Resources