Histogram for time periods using SQLite (regular buckets 1h wide)

I have a series of visitor data and would like to represent its timespans as a histogram the same way Google Maps does.
My table has two columns, firstSeen and lastSeen. Both contain (I think) Unix timestamps in milliseconds, such as 1581981627248.0 or 1581981629641.0; their type is REAL.
I have to say I'm a little lost here; I'm not very experienced with SQL. Calculating the average stay time and so on was easy, but this one is trickier.
I could easily do something like the following query:
SELECT
    strftime('%H', datetime(round(firstseen / 1000), 'unixepoch', 'weekday 1')) AS "Hour"
FROM visitors;
This is already for only one day (Monday), which I guess is okay.
Then I could count and group by the hour. But wouldn't this alone be wrong? I would only be counting the firstSeen field, but what about a visitor who was present from 08:59 to 09:00? As I understand it, their presence would have to be counted twice in this case: once for the 08:00 to 08:59 slot and once for the 09:00 to 09:59 slot. I also see another problem with empty slots: would it be possible to include these, and would that even make sense?
I hope it's clear what I'd like to accomplish and someone can point me in the right direction.
Edit:
Added an MRE:
CREATE TABLE `detections_2` (
    `firstseen` REAL NOT NULL,
    `lastseen` REAL NOT NULL
);
INSERT INTO detections_2 (firstseen, lastseen)
VALUES
(1581892607,1581892644),
(1581892607,1581892694),
(1581892607,1581892703),
(1581892607,1581892629),
(1581892607,1581892619),
(1581892607,1581892683),
(1581892607,1581892702),
(1581892607,1581892651),
(1581892607,1581892697),
(1581892607,1581892654),
(1581892607,1581892680),
(1581892607,1581892619),
(1581892607,1581892700),
(1581892607,1581892716),
(1581892607,1581892700),
(1581892607,1581892643),
(1581892607,1581892720),
(1581892607,1581892647),
(1581892607,1581892726),
(1581892607,1581892679),
(1581892607,1581892665),
(1581892607,1581892701),
(1581892607,1581892659),
(1581892607,1581892725),
(1581892607,1581892662),
(1581892607,1581892661),
(1581979007,1581979037),
(1581979007,1581979054),
(1581979007,1581979038),
(1581979007,1581979100),
(1581979007,1581979027),
(1581979007,1581979080),
(1581979007,1581979034),
(1581979007,1581979119),
(1581979007,1581979027),
(1581979007,1581979093),
(1581979007,1581979068),
(1581979007,1581979061),
(1581979007,1581979115),
(1581979007,1581979126),
(1581979007,1581979106),
(1581979007,1581979114),
(1581979007,1581979078),
(1581979007,1581979078),
(1581979007,1581979056),
(1581979007,1581979117),
(1581979007,1581979040),
(1581979007,1581979057),
(1581979007,1581979068),
(1581979007,1581979103);
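
One way to handle the double-counting concern: expand each visit into one row per hour bucket it overlaps, then count per bucket. Here is a sketch against the MRE above, using a recursive CTE (it assumes timestamps in seconds, as in the MRE; divide by 1000 first for the millisecond data):

WITH RECURSIVE buckets(bucket, lastseen) AS (
    -- anchor: the hour bucket each visit starts in
    SELECT CAST(firstseen / 3600 AS INTEGER) * 3600, lastseen
    FROM detections_2
    UNION ALL
    -- step: one extra row per further hour the visit reaches into
    SELECT bucket + 3600, lastseen
    FROM buckets
    WHERE bucket + 3600 <= lastseen
)
SELECT datetime(bucket, 'unixepoch') AS hour_start,
       COUNT(*) AS visitors
FROM buckets
GROUP BY bucket
ORDER BY bucket;

A visit from 08:59 to 09:00 lands in both the 08:00 and 09:00 buckets, as desired. Empty slots do not appear in this output; to include them you would generate the full range of hours with a second recursive CTE and LEFT JOIN the counts onto it.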

Related

SQL: Reduce resultset to X rows?

I have the following MySQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window, from one day up to many years. There are measurement values roughly every minute in the DB.
So the number of entries for a time window can vary widely, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a graphical chart on a webpage.
If the chart is, let's say, 800px wide, it does not make sense to fetch thousands of rows from the database if the time window is big. I cannot show more than 800 values on the chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on the DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every n-th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DBs.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, "trunc" it, average the values, and group by the truncated date. That could work for me, but it would require a rework of my table structure. Hmm... maybe there's more... still researching...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp, `entity`, `value`
FROM `measuredata`
WHERE `entity` = 38 AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp
It works, but my DB/index/structure seems not really optimized for this: a query for the last year took ~75 sec (slow test machine) and finally returned only one value per day. This can be combined with avg(value), but that further increases query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with GROUP BY.
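Generalizing that idea from one-day buckets to roughly N buckets over an arbitrary window: compute a bucket width from the window size, then floor each timestamp to its bucket. A hedged sketch in MySQL (the window bounds and the 800-bucket target are placeholders; table and column names follow the question):

-- Downsample an arbitrary window to at most ~800 averaged rows.
SET @start = UNIX_TIMESTAMP('2019-01-01');
SET @end   = UNIX_TIMESTAMP('2020-01-01');
SET @width = GREATEST(1, FLOOR((@end - @start) / 800));  -- seconds per bucket

SELECT `timestamp` - ((`timestamp` - @start) % @width) AS bucket_start,
       AVG(`value`) AS avg_value
FROM measuredata
WHERE entity = 38 AND `timestamp` BETWEEN @start AND @end
GROUP BY bucket_start
ORDER BY bucket_start;

An index on (entity, timestamp) would let the WHERE clause use a range scan, which should help with the ~75 sec timings mentioned above.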
There is probably no efficient way to do this. But, if you want, you can break the rows into equal-sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . .  -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;

Selecting data from two different tables

I have 2 tables.
Table TSTRSN
[P]Client
[P]Year
[P]Rule_Nbr
Type_Code
Table TSTOCK
[P]Client
[P]Year
TimeStamp
EndOfFiscalYear
( [P] means Primary Key)
The request is twofold:
1) List a count of all the Rule_Nbr within a given time (from TimeStamp).
...then User chooses a specific Rule_Nbr...
2) List all Client, Year, EndOfFiscalYear for that specific Rule_Nbr
So for part 1) I have to take the Rule_Nbr, find the matching Client and Year, and use those to look up the TimeStamp. If it falls within the right time window, increment the count by 1... and so on.
Then for part 2) I could either save the data from part 1 (I don't know if this is feasible given the size of the tables) or redo query 1) for just one Rule_Nbr.
I'm very new to SQL/DB2... so how do I go about doing this? My first thought was to make an array, store TSTRSN.Client/Year/Rule_Nbr, and then prune it by comparing it to TSTOCK.Client/Year/TimeStamp, but I wonder if there's a better way (I'm not even sure if arrays exist in DB2!).
Any tips?
What you're looking for is the JOIN keyword.
http://www.gatebase.toucansurf.com/db2examples13.html
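As a sketch of what that join might look like for both parts (the :start_ts, :end_ts, and :chosen_rule parameters are placeholders, not names from the question):

-- Part 1: count each Rule_Nbr whose matching TSTOCK row falls in the window.
SELECT r.Rule_Nbr, COUNT(*) AS rule_count
FROM TSTRSN r
JOIN TSTOCK s
  ON s.Client = r.Client AND s.Year = r.Year
WHERE s.TimeStamp BETWEEN :start_ts AND :end_ts
GROUP BY r.Rule_Nbr;

-- Part 2: details for the Rule_Nbr the user picked, same window.
SELECT s.Client, s.Year, s.EndOfFiscalYear
FROM TSTRSN r
JOIN TSTOCK s
  ON s.Client = r.Client AND s.Year = r.Year
WHERE r.Rule_Nbr = :chosen_rule
  AND s.TimeStamp BETWEEN :start_ts AND :end_ts;

No arrays needed: part 2 is just query 1's join with an extra filter.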

How to cast nvarchar to integer?

I have a database I'm running queries on where I cannot change the schema. I am in my second year of database management. We have not touched much on writing actual SQL as opposed to just using the GUI to create our queries and manage our DBs.
I have a population attribute that I need to run SUM on, but the population column's datatype is nvarchar. I need to cast it to int.
I don't know how to do that in SQL though! Can someone please show me? I've been fiddling with it for a while and I'm out of ideas. I'm very unfamiliar with SQL (as simple as it looks) and this would be helpful.
SELECT dbo_City.CityName, Sum(dbo_City.Population) AS SumOfPopulation
FROM dbo_City
GROUP BY dbo_City.CityName
ORDER BY dbo_City.CityName, Sum(dbo_City.Population);
I need to find which cities have populations between 189,999 and 200,000, which is a very simple query. I'm grouping by city and using the sum of the population. I'm not sure where to insert the 189,999-200,000 range in the query, but I can figure that out later. Right now I'm stuck on casting the nvarchar Population field to an int so I can run this query!
I found the answer here:
Using SUM on nvarchar field
SELECT SUM(CAST(NVarcharCol as int))
But I'm not sure how to apply this solution. Specifically, I'm not sure where to insert this code in the SQL provided above, and I don't understand why the nvarchar column is called NVarcharCol.
From MSDN:
Syntax for CAST:
CAST ( expression AS data_type [ ( length ) ] )
Your solution should look something like this:
SELECT c.CityName, CAST(c.Population AS INT) AS SumOfPopulation
FROM dbo_City AS c
WHERE ISNUMERIC(c.Population) = 1
  AND CAST(c.Population AS INT) BETWEEN 189999 AND 200000
ORDER BY c.CityName, CAST(c.Population AS INT)
You shouldn't need the SUM function unless you want the total population of the table, which would make more sense for a table of countries, cities, and city populations, or if this particular city table is broken down further (such as by individual zip codes). In that case, the following would be preferable:
SELECT c.CityName, SUM(CAST(c.Population AS INT)) AS SumOfPopulation
FROM dbo_City AS c
WHERE ISNUMERIC(c.Population) = 1
GROUP BY c.CityName
HAVING SUM(CAST(c.Population AS INT)) BETWEEN 189999 AND 200000
ORDER BY c.CityName, SUM(CAST(c.Population AS INT))
I hope this helps point you in the right direction.
Edit: Integrated the "fail safe" from your linked question, which should stop that error from coming up. It adds a filter so that only values that can be cast to a numeric type are processed (without extra processing such as removing the comma, as in vkp's answer).
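Note that ISNUMERIC has edge cases: it returns 1 for strings like '$5' or '1,234' that still fail CAST to INT. On SQL Server 2012 or later, TRY_CAST sidesteps this by returning NULL for values it cannot convert; a sketch of the grouped query rewritten that way:

-- TRY_CAST yields NULL (which SUM ignores) instead of erroring
-- on values such as '1,234' that pass the ISNUMERIC check.
SELECT c.CityName, SUM(TRY_CAST(c.Population AS INT)) AS SumOfPopulation
FROM dbo_City AS c
GROUP BY c.CityName
HAVING SUM(TRY_CAST(c.Population AS INT)) BETWEEN 189999 AND 200000
ORDER BY c.CityName;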
I ran into a similar problem where I had temperatures (temp) stored as nvarchar. I wanted to filter out all temperatures under 50F. Unfortunately,
WHERE (temp > '5')
would include temperatures that started with a - sign (-5, -6, ...); even worse, I discovered that temperatures over 100F were also getting discarded, since string comparison is lexicographic ('100' sorts before '5' character by character).
SOLUTION:
WHERE (CAST(temp AS SMALLINT) > 50)
I know this doesn't directly answer your question, but for the life of me I couldn't find a specific answer to my problem anywhere on the web, and I thought it would be lame to answer my own question, so I wanted to add my discovery to your answer.

SQL Selecting records where one date range doesn't intersect another

I'm trying to write a simple reservation program for a campground.
I have a table for campsites (one record for every site available at the campground).
I have a table for visitors which uses the campsite table's id as a foreign key, along with a check in date and check out date.
What I need to do is gather a potential check in and check out date from the user and then gather all the campsites that are NOT being used at any point in that range of dates.
I think I'm close to the solution but there's one piece I seem to be missing.
I'm using 2 queries.
1) Gather all the campsites that are occupied during that date range.
2) Gather all campsites that are not in query 1.
This is my first query:
SELECT Visitors.CampsiteID, Visitors.CheckInDate, Visitors.CheckOutDate
FROM Visitors
WHERE (((Visitors.CheckInDate) >= #CHECKINDATE#
And (Visitors.CheckInDate) <= #CHECKOUTDATE#)
Or ((Visitors.CheckOutDate) >= #CHECKINDATE#
And (Visitors.CheckOutDate) <= #CHECKOUTDATE#));
I think I'm missing something: if #CHECKINDATE# and #CHECKOUTDATE# both fall between someone else's check-in and check-out dates, this doesn't catch it.
I know I could split this into two queries, one dealing with just #CHECKINDATE# and the other with #CHECKOUTDATE#, but I figure there's a cleaner way to do this and I'm just not coming up with it.
This is my second one, which I think is fine the way it is:
SELECT DISTINCT Campsites.ID, qryCampS_NotAvailable.CampsiteID
FROM Campsites LEFT JOIN qryCampS_NotAvailable
ON Campsites.ID = qryCampS_NotAvailable.CampsiteID
WHERE (((qryCampS_NotAvailable.CampsiteID) Is Null));
To get records that overlap with the requested time period, use this simple logic. Two time periods overlap when one starts before the other ends and the other ends after the first starts:
SELECT v.CampsiteID, v.CheckInDate, v.CheckOutDate
FROM Visitors v
WHERE v.CheckInDate <= #CHECKOUTDATE#
  AND v.CheckOutDate >= #CHECKINDATE#;
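Combining that overlap test with your second query, the whole thing can be collapsed into one statement; a sketch using NOT EXISTS (same tables and #...# parameters as in the question):

SELECT c.ID
FROM Campsites c
WHERE NOT EXISTS (
    SELECT 1
    FROM Visitors v
    WHERE v.CampsiteID = c.ID
      AND v.CheckInDate <= #CHECKOUTDATE#
      AND v.CheckOutDate >= #CHECKINDATE#
);

Each campsite is returned only if no visitor record overlaps the requested range.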

Summarizing records by date in Rails 3 ActiveRecord

I have a large table with many records that share a timestamp. I want a result set with a column summed per timestamp. I see how you can simply use the 'sum' method to get a column's total; I need to, however, group by a date column, which is far less obvious. I know I could use 'find_by_sql', but it would be hideous to code as I have to do this for over 20 columns. I assume AR must have some magic for this that escapes me?
Data set example:
table/model: games/Game
player_name, points_scored, game_date
john, 20, 08-20-2012
sue, 30, 08-20-2012
john, 12, 08-21-2012
sue, 10, 08-21-2012
What I want to see in my results is:
game_date, total_points
08-20-2012, 50
08-21-2012, 22
Here is a crude example of what the SQL query would look like:
SELECT game_date, SUM(points_scored)
FROM games
GROUP BY game_date
Mind you, I actually have 20 'score' columns to SUM by timestamp.
How can I simply use AR to do this? Thanks in advance.
OK, it took some digging and playing around, but I figured it out. I was hoping to find something better than 'find_by_sql', and I did, though it isn't a whole lot better. Again, knowing that I need to SUM 20+ columns by timestamp, here is the solution in the context of the example above:
results = Game.select('game_date, SUM(points_scored) AS "points_scored"').group('game_date')
Now, that doesn't look so bad, but I have to type the 20+ SUM() expressions inside that 'select' method. It doesn't save a whole lot of work over 'find_by_sql', but it works.
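For reference, with the 20+ score columns the generated SQL simply grows one SUM term per column; a sketch with hypothetical extra columns (assists and rebounds are illustrative names, not from the question):

SELECT game_date,
       SUM(points_scored) AS points_scored,
       SUM(assists)       AS assists,     -- hypothetical column
       SUM(rebounds)      AS rebounds     -- hypothetical column
FROM games
GROUP BY game_date;

Building that column list programmatically (e.g. mapping over an array of column names and joining the SUM fragments into the select string) would avoid typing each one by hand.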