How to speed up SQL query with date GROUP BY? - sql

I have a normal SQLite database table called table1 with 7 columns and of course a rowid. The first column is a custom_id number, the second is a date in the format YYYY-MM-DD, and the other 5 are real-number data columns. There are about 10M rows in the database, and the custom_id and date columns have indices.
What I want to do is to speed up the following query:
SELECT date,max(data1) AS maximum
FROM table1
WHERE custom_id = '1123' AND data1 <> 'NaN'
GROUP BY strftime('%Y-%m', date)
I want to find the maximum valid (not NaN) data1 value for custom_id 1123 for each year-month combination. The query above works fine, but it takes 10 seconds on the first run and under 1 second on the second run, which is OK for me. I run the query on my home PC's Apache server with PHP. I think Apache uses some caching, which would explain the difference.
But the question is: how do I speed up the first run? I have many other custom_ids to query, and not all of them can be cached! Do I need more indices? A different kind of query?

We are going to create an index that supports the following operations:
Retrieve the records of a specific custom_id
Aggregate by month
Creating the following index is not possible, because SQLite (at least in the version used here) treats strftime as a non-deterministic function
create index table1_ix on table1 (custom_id,strftime('%Y-%m', date));
non-deterministic functions prohibited in index expressions
So instead of strftime('%Y-%m', date) we are going to use substr(date,1,7)
create index table1_ix on table1 (custom_id,substr(date,1,7));
The query should be changed accordingly
select substr(date,1,7), max(data1) as maximum
from table1
where custom_id = '1123'
and data1 <> 'NaN'
group by substr(date,1,7)
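Here is a minimal sketch of the expression-index approach, run through Python's built-in sqlite3 module against an in-memory database. The sample rows and the cut-down 3-column table are made up for illustration, and the NaN value is assumed to be stored as the text 'NaN', as the WHERE clause in the question suggests. The query plan is printed so you can check whether your SQLite build actually picks table1_ix; the planner's choice is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down, hypothetical version of table1 (the real table has 5 data columns).
conn.execute("CREATE TABLE table1 (custom_id INTEGER, date TEXT, data1 REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        (1123, "2016-01-05", 1.5),
        (1123, "2016-01-20", 3.0),
        (1123, "2016-01-25", "NaN"),  # assumption: NaN is stored as text
        (1123, "2016-02-10", 2.0),
        (999, "2016-01-07", 9.9),
    ],
)
# Expression index: substr() is deterministic, so SQLite accepts it here.
conn.execute("CREATE INDEX table1_ix ON table1 (custom_id, substr(date, 1, 7))")

query = """
SELECT substr(date, 1, 7) AS month, max(data1) AS maximum
FROM table1
WHERE custom_id = 1123 AND data1 <> 'NaN'
GROUP BY substr(date, 1, 7)
ORDER BY month
"""
monthly_max = conn.execute(query).fetchall()
print(monthly_max)

# Inspect the plan to see which index (if any) the planner chose.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

The GROUP BY expression has to match the indexed expression for the index to be considered, which is why the query uses substr(date, 1, 7) rather than strftime.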

I am guessing this is what you intend:
SELECT strftime('%Y-%m', date), max(data1) AS maximum
FROM table1
WHERE custom_id = 1123 AND data1 <> 'NaN'
GROUP BY strftime('%Y-%m', date)
Start with an index on table1(custom_id, date).

Related

SELECT MIN from a subset of data obtained through GROUP BY

There is a database in place with hourly timeseries data, where every row in the DB represents one hour. Example:
TIMESERIES TABLE
id date_and_time entry_category
1 2017/01/20 12:00 type_1
2 2017/01/20 13:00 type_1
3 2017/01/20 12:00 type_2
4 2017/01/20 12:00 type_3
First I used the GROUP BY statement to find the latest date and time for each type of entry category:
SELECT MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category;
However, I now want to find the LEAST RECENT among the datetimes I obtained with the query above. I will need to use SELECT MIN(date_and_time) somehow, but how do I let SQL know that I want to treat the output of my previous query as a "new table" to run a new SELECT query on? The output of my total query should be a single value; for the sample displayed above, date_and_time = 2017/01/20 12:00.
I've tried using aliases but can't get them to do the trick; they only rename existing columns or tables (or I'm misusing them). There are many questions out there that list the MAX or MIN for a particular group (e.g. https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ or Select max value of each group), which is what I have already achieved, but I now want to work on this list of obtained datetimes. My database structure is very simple, but I lack the knowledge to string these queries together.
Thanks, cheers!
You can use your first query as a subquery, which is what you describe as using the first query's output as the input for the second query. This returns a single row containing the minimum date, as required.
SELECT MIN(date_and_time)
FROM (SELECT MAX(date_and_time) as date_and_time, entry_category
FROM timeseries_table
GROUP BY entry_category)a;
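To see the subquery approach work on the sample data from the question, here is a small sketch using Python's built-in sqlite3 module. Storing date_and_time as text is an assumption; MIN and MAX then compare strings, which only gives the right answer because the format is zero-padded and ordered from year down to minute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeseries_table ("
    "id INTEGER PRIMARY KEY, date_and_time TEXT, entry_category TEXT)"
)
conn.executemany(
    "INSERT INTO timeseries_table VALUES (?, ?, ?)",
    [
        (1, "2017/01/20 12:00", "type_1"),
        (2, "2017/01/20 13:00", "type_1"),
        (3, "2017/01/20 12:00", "type_2"),
        (4, "2017/01/20 12:00", "type_3"),
    ],
)

# Step 1: the latest entry per category (the question's first query).
latest_per_category = conn.execute(
    """
    SELECT entry_category, MAX(date_and_time)
    FROM timeseries_table
    GROUP BY entry_category
    ORDER BY entry_category
    """
).fetchall()

# Step 2: treat that result as a derived table and take its minimum.
least_recent = conn.execute(
    """
    SELECT MIN(date_and_time)
    FROM (SELECT MAX(date_and_time) AS date_and_time, entry_category
          FROM timeseries_table
          GROUP BY entry_category) AS a
    """
).fetchone()[0]

print(latest_per_category)
print(least_recent)
```

The alias on the derived table (a) is what lets the outer SELECT refer to it; some engines require it, so it is safest to always include one.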
Is this what you want?
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC;
If several categories tie on that value, TOP 1 (without WITH TIES) returns an arbitrary one of them. To make the result deterministic, include an additional sort key:
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category;

SQL script to find previous value, not necessarily previous row

Is there a way in SQL to find a previous value, not necessarily in the previous row, within the same SELECT statement?
See picture below. I'd like to add another column, ELAPSED, that calculates the time difference between TIMERSTART, but only when DEVICEID is the same, and I_TYPE is viewDisplayed. e.g. subtract 1 from 2, store difference in 3, store 0 in 4 because i_type is not viewDisplayed, subtract 2 from 5, store difference in 6, and so on.
It has to be a statement, I can't use a stored procedure in this case.
SELECT DEVICEID, I_TYPE, TIMERSTART,
0 AS ELAPSED -- CASE WHEN <CONDITION> THEN TIMEDIFF() ELSE 0 END AS ELAPSED
FROM CLIENT_USAGE
ORDER BY TIMERSTART ASC
I'm using SAP HANA DB, but it works pretty much like the latest version of MS-SQL. So, if you know how to make it work in SQL, I can make it work in HANA.
You can make a subquery to find the last time entered previous to the row in question.
-- Correlated scalar subquery: latest earlier timerstart for the same device and i_type.
-- Subtracting timestamps is dialect-specific; use your engine's date-difference function if needed.
select CU.deviceid, CU.i_type, CU.timerstart,
CU.timerstart - (select max(C.timerstart)
from CLIENT_USAGE C
where C.i_type = CU.i_type
and C.deviceid = CU.deviceid
and C.timerstart < CU.timerstart) as elapsed
from CLIENT_USAGE CU
order by CU.timerstart asc
This is a rough sketch of what the SQL should look like. I do not know what the primary key of this table is, whether it is i_type alone or i_type and deviceid, but this should at least help with how to calculate the field. I do not think it is necessary to store the value unless the table is very large or the hardware is very slow; it can be calculated rather easily each time the query is run.
SAP HANA supports window functions:
select DEVICEID,
TIMERSTART,
lag(TIMERSTART) over (partition by DEVICEID order by TIMERSTART) as previous_start
from CLIENT_USAGE
Then you can wrap this in parentheses (as a subquery) and manipulate the data to your heart's content.
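As a sketch of how LAG could feed the ELAPSED column, here is one reading of the question run through Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The rows, the buttonClicked type and the integer-seconds TIMERSTART values are all made up, and the rule used (elapsed is the gap to the previous row of the same device, but only on viewDisplayed rows, else 0) is an assumption about what the asker wants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CLIENT_USAGE (DEVICEID TEXT, I_TYPE TEXT, TIMERSTART INTEGER)"
)
conn.executemany(
    "INSERT INTO CLIENT_USAGE VALUES (?, ?, ?)",
    [
        ("A", "viewDisplayed", 100),
        ("A", "viewDisplayed", 130),
        ("A", "buttonClicked", 150),  # hypothetical other event type
        ("B", "viewDisplayed", 110),
        ("B", "viewDisplayed", 170),
    ],
)

rows = conn.execute(
    """
    SELECT DEVICEID, I_TYPE, TIMERSTART,
           CASE WHEN I_TYPE = 'viewDisplayed'
                -- first row of a device has no predecessor, so fall back to 0
                THEN COALESCE(TIMERSTART - LAG(TIMERSTART) OVER (
                         PARTITION BY DEVICEID ORDER BY TIMERSTART), 0)
                ELSE 0
           END AS ELAPSED
    FROM CLIENT_USAGE
    ORDER BY DEVICEID, TIMERSTART
    """
).fetchall()

for row in rows:
    print(row)
```

If the previous value should come only from earlier viewDisplayed rows, adding I_TYPE to the PARTITION BY list would be one way to get that.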

SQL find period that contain dates of specific year

I have a table (let's call it AAA) containing 3 columns: ID, DateFrom, DateTo.
I want to write a query that returns all the records whose DateFrom-DateTo period contains at least 1 day of a specific year (e.g. 2016).
I am using SQL Server 2005.
Thank you
Another way is this:
SELECT <columns list>
FROM AAA
WHERE DateFrom <= '2016-12-31' AND DateTo >= '2016-01-01'
If you have an index on DateFrom and DateTo, this query allows SQL Server to use that index, unlike the query in Max xaM's answer.
On a small table you will probably see no difference, but on a large one that query can take a big performance hit, since SQL Server can't use an index when the column in the WHERE clause is wrapped in a function.
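The overlap test can be checked on a handful of rows. This sketch uses Python's built-in sqlite3 module as a stand-in for SQL Server, with made-up rows, ISO date strings and a hypothetical index name. It also runs a year-extraction filter (strftime here, in place of DATEPART) for comparison, to show how the two conditions treat a period that spans the whole year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AAA (ID INTEGER PRIMARY KEY, DateFrom TEXT, DateTo TEXT)")
conn.executemany(
    "INSERT INTO AAA VALUES (?, ?, ?)",
    [
        (1, "2015-06-01", "2015-12-31"),  # ends before 2016
        (2, "2015-12-15", "2016-01-10"),  # runs into 2016
        (3, "2016-03-01", "2016-04-01"),  # entirely inside 2016
        (4, "2015-01-01", "2017-12-31"),  # spans all of 2016
        (5, "2017-01-01", "2017-02-01"),  # starts after 2016
    ],
)
conn.execute("CREATE INDEX aaa_dates_ix ON AAA (DateFrom, DateTo)")

# Two periods overlap when each one starts before the other ends.
overlap_ids = [
    row[0]
    for row in conn.execute(
        "SELECT ID FROM AAA "
        "WHERE DateFrom <= '2016-12-31' AND DateTo >= '2016-01-01' "
        "ORDER BY ID"
    )
]

# Comparing only the year of each endpoint, for contrast.
year_match_ids = [
    row[0]
    for row in conn.execute(
        "SELECT ID FROM AAA "
        "WHERE strftime('%Y', DateFrom) = '2016' OR strftime('%Y', DateTo) = '2016' "
        "ORDER BY ID"
    )
]

print(overlap_ids)
print(year_match_ids)
```

Row 4 starts in 2015 and ends in 2017, so neither endpoint falls in 2016; only the range comparison picks it up.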
Try this:
SELECT * FROM AAA
WHERE DATEPART(YEAR,DateFrom)=2016 OR DATEPART(YEAR,DateTo)=2016
Well you can use the following query
select * from Table1
WHERE DateDiff(day,DateFrom,DateTo)>0
AND YEAR(DateFrom) = YEAR(DateTo)
And here is the result:
Enjoy :D !

Query aggregate faster than MAX

I have a fairly large table in which one of the columns is a date column. The query I execute is as follows.
select max(date) from tbl where date < to_date('10/01/2010','MM/DD/YYYY')
That is, I want to find the value closest to and less than a particular date. This takes considerable time because of the max on the large table. Is there a faster way to do this? Maybe using LAST_VALUE?
Put an index on the date column and the query should be plenty fast.
1) Add an index to the date column. Simply put, an index allows the database engine to store information about the data so it will speed up most queries where that column is one of the clauses. Info here http://docs.oracle.com/cd/B28359_01/server.111/b28310/indexes003.htm
2) Consider adding a second clause to the query. You have where date < to_date('10/01/2010','MM/DD/YYYY') now, why not change it to:
where date < to_date('10/01/2010','MM/DD/YYYY') and date > to_date('09/30/2010', 'MM/DD/YYYY')
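since this will reduce the number of scanned rows. Note that this only gives the same answer if a row is known to exist in that window; otherwise max returns NULL.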
Try
select date from (
select date from tbl where date < to_date('10/01/2010','MM/DD/YYYY') order by date desc
) where rownum = 1
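A small sketch to confirm that the two formulations agree once the date column is indexed. It uses Python's built-in sqlite3 module rather than Oracle, so LIMIT 1 stands in for the rownum = 1 wrapper and ISO date strings stand in for to_date; the rows and the index name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (date TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?)",
    [("2010-09-15",), ("2010-09-30",), ("2010-10-01",), ("2010-10-05",)],
)
# Hypothetical index name; an ordered index lets the engine jump to the cutoff.
conn.execute("CREATE INDEX tbl_date_ix ON tbl (date)")

cutoff = "2010-10-01"

# Aggregate form, as in the question.
via_max = conn.execute(
    "SELECT max(date) FROM tbl WHERE date < ?", (cutoff,)
).fetchone()[0]

# Sort-and-take-first form, as in the answer above.
via_order = conn.execute(
    "SELECT date FROM tbl WHERE date < ? ORDER BY date DESC LIMIT 1", (cutoff,)
).fetchone()[0]

print(via_max, via_order)
```

Whether either form is faster on a given engine depends on its optimizer, so it is worth comparing the execution plans on the real table.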

Group by in t-sql not displaying single result

See the image below. I have a table, tbl_AccountTransaction, with 10 rows. The lowermost table has the columns AccountTransactionId, AgreementId and so on. What I want is a single row that is the sum of all amounts for an agreement id. Say I have agreement id = 23; when I run my query it gives me two rows instead of a single row, since there is a sub-second difference between the insertion times.
So I need a way to get the row 1550 | 23 | 2011-03-21
Update
I have updated my query to this
SELECT Sum(Amount) as Amount,AgreementID, StatementDate
FROM tbl_AccountTranscation
Where TranscationDate is null
GROUP BY AgreementID,Convert(date,StatementDate,101)
but I am still getting the same error
Msg 8120, Level 16, State 1, Line 1
Column 'tbl_AccountTranscation.StatementDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Your group by clause is in error
group by agreementid, convert(date,statementdate,101)
This makes it group by the date (without time) of the statementdate column, whereas the original groups by statementdate (including time) and only then strips the time information from each output row.
To be clear, you weren't supposed to change the SELECT clause
SELECT Sum(Amount) as Amount,AgreementID, Convert(date,StatementDate,101)
FROM tbl_AccountTranscation
Where TranscationDate is null
GROUP BY AgreementID,Convert(date,StatementDate,101)
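To illustrate the difference between grouping on the full datetime and grouping on its date part, here is a sketch using Python's built-in sqlite3 module, where date(StatementDate) stands in for T-SQL's Convert(date, StatementDate, 101). The two timestamps come from the question; the amounts (chosen to add up to 1550) and the extra agreement 24 row are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_AccountTranscation ("
    "AccountTransactionId INTEGER PRIMARY KEY, AgreementID INTEGER, "
    "Amount INTEGER, StatementDate TEXT, TranscationDate TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_AccountTranscation VALUES (?, ?, ?, ?, ?)",
    [
        (1, 23, 1000, "2011-03-21 14:38:59.470", None),
        (2, 23, 550, "2011-03-21 14:38:59.487", None),
        (3, 24, 200, "2011-03-22 09:00:00.000", None),
    ],
)

# Grouping on the full timestamp keeps the two agreement-23 rows apart.
by_datetime = conn.execute(
    """
    SELECT SUM(Amount), AgreementID, StatementDate
    FROM tbl_AccountTranscation
    WHERE TranscationDate IS NULL
    GROUP BY AgreementID, StatementDate
    """
).fetchall()

# Grouping on the date part merges them; the same expression appears
# in both the select list and the GROUP BY.
by_date = conn.execute(
    """
    SELECT SUM(Amount) AS Amount, AgreementID, date(StatementDate)
    FROM tbl_AccountTranscation
    WHERE TranscationDate IS NULL
    GROUP BY AgreementID, date(StatementDate)
    ORDER BY AgreementID
    """
).fetchall()

print(len(by_datetime))
print(by_date)
```

SQLite is lenient about selecting a column that is not in the GROUP BY, whereas SQL Server raises Msg 8120 for it, so the select list and the GROUP BY must use the same expression there.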
Because you have a Group By StatementDate.
In your example you have 2 StatementDates:
2011-03-21 14:38:59.470
2011-03-21 14:38:59.487
Change your query in the Group by section instead of StatementDate to be:
Convert(Date, StatementDate, 101)
Have you tried to
Group by Convert(date, ...)
instead of the StatementDate?
You are close. You need to combine your two approaches. This should do it:
SELECT Sum(Amount) as Amount,AgreementID, Convert(date,StatementDate,101)
FROM tbl_AccountTranscation
Where TranscationDate is null
GROUP BY AgreementID,Convert(date,StatementDate,101)
If you never need the time, then perhaps you should change the datatype so that you don't have to do a lot of unnecessary converting in most queries. SQL Server 2008 has a date datatype that doesn't include the time. In earlier versions you could add an additional, automatically generated date column that strips out the time component, so that all the dates look like '2011-01-01 00:00:00:000'; then you can do date comparisons directly, having done the conversion only once. This lets you keep both the actual datetime and just the date.
You should group by DATEPART(..., StatementDate)
Ref: http://msdn.microsoft.com/en-us/library/ms174420.aspx