Fastest way to perform time average of multiple calculations in SQL?

I have a question about the fastest way to run a SQL Server query on a table, TheTable, that has the following fields: TimeStamp, Col1, Col2, Col3, Col4.
I don't maintain the database; I only have access to it. I need to perform 10 calculations similar to:
Col2*Col3 + 5
5*POWER(Col3,7) + 4*POWER(Col2,6) + 3*POWER(Col1,5)
Then I have to find the AVG and MAX of the calculation results using data from a chosen day (there are 8 months of data in the database so far). Since the data are sampled every 0.1 seconds, 864,000 rows go into each calculation. I want to make sure the query runs as quickly as possible. Is there a better way than this:
SELECT AVG(Col2*Col3 + 5),
AVG(5*POWER(Col3,7) + 4*POWER(Col2,6) + 3*POWER(Col1,5)),
MAX(Col2*Col3 + 5),
MAX(5*POWER(Col3,7) + 4*POWER(Col2,6) + 3*POWER(Col1,5))
FROM TheTable
WHERE TimeStamp >= '2010-08-31 00:00:00.000'
AND TimeStamp < '2010-09-01 00:00:00.000'
Thanks!

You could create those as computed (calculated) columns, and set Is Persisted to true. That will persist the calculated value to disk on insert, and make subsequent queries against those values very quick.
Alternately, if you cannot modify the table schema, you could create an Indexed View that calculates the values for you.
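For illustration, a rough sketch of the computed-column variant. The column and index names here are made up, and it assumes ALTER rights on the table (which the asker may not have):
-- Persisted computed columns: the expressions are evaluated once, at insert time
ALTER TABLE TheTable ADD Calc1 AS (Col2*Col3 + 5) PERSISTED;
ALTER TABLE TheTable ADD Calc2 AS
    (5*POWER(Col3,7) + 4*POWER(Col2,6) + 3*POWER(Col1,5)) PERSISTED;
-- Persisted, deterministic computed columns can also be indexed,
-- e.g. to support the daily range scan:
CREATE INDEX IX_TheTable_Day ON TheTable (TimeStamp) INCLUDE (Calc1, Calc2);
The indexed-view route is similar: a schema-bound view exposing the same expressions, with a unique clustered index created on it.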

How about doing these calculations when you insert the data rather than when you select it? Then, for a given day, you only have to aggregate the stored values.
TableName
---------
TimeStamp
Col1
Col2
Col3
Col4
Calc1
Calc2
Calc3
and insert like so:
INSERT INTO TableName (...)
VALUES
(..., @Col2Val*@Col3Val + 5, ...)
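Note that what gets stored is the row-level result, not an aggregate; the aggregation still happens at query time, but over a precomputed column. Assuming a Calc1 column as in the layout above:
SELECT AVG(Calc1), MAX(Calc1)
FROM TableName
WHERE TimeStamp >= '2010-08-31' AND TimeStamp < '2010-09-01'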

Your only real bet is to calculate the values ahead of time, either as computed columns or as persisted columns in a view; see Improving Performance with SQL Server 2005 Indexed Views. If you are unable to alter the database, you could pull the data out of it into your own database, computing the columns as you insert them, and then run your queries off your own database.
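A hedged sketch of that copy step; the destination table MyDb.dbo.TheTableCalc and the high-water-mark variable are placeholders of my own:
-- Incrementally copy rows across, computing the expressions once on the way in
INSERT INTO MyDb.dbo.TheTableCalc (TimeStamp, Calc1, Calc2)
SELECT TimeStamp,
       Col2*Col3 + 5,
       5*POWER(Col3,7) + 4*POWER(Col2,6) + 3*POWER(Col1,5)
FROM SourceDb.dbo.TheTable
WHERE TimeStamp > @LastCopiedTimeStamp; -- track the last copied timestamp yourself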

Related

Query Distinct/Unique Value Counts for All Tables in a database - MS SQL Server

Is there a way to query the table name, column name, and distinct value count per column for all tables in a database? How can I do this?
I'm using SQL Server Management Studio (2018).
I need to get a resulting table like this:
TableName ColumnName Distinct Values
table1 col1 10
table1 col2 9
table1 col3 20
table2 col1 10
table2 col2 9
... ... ...
Thank you in advance.
Is there an efficient way to query the table name, column name, and distinct value count per column for all tables in a Database?
TL;DR: No, there is not an efficient way.
To be able to do this, you're going to need dynamic SQL and then a (hell of a) lot of COUNT(DISTINCT {Column}) aggregate functions. Adding the DISTINCT operator to a COUNT makes the function far more expensive; it can easily slow down a simple query with only one COUNT(DISTINCT {Column}), and you want to do this on EVERY column in EVERY table in your database (or at least every column whose data type COUNT(DISTINCT {Column}) can handle).
There is no way to make that efficient. It will instead be incredibly slow; I would not be surprised if it took hours, or days, to run on a large enough database, and it could easily end up a deadlock victim.
Personally, I would rethink what ever it is you are trying to achieve with this requirement.
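If you need it regardless, here is a hedged sketch of the dynamic SQL approach; the temp table name and the excluded-type list are my own choices, and this will be exactly as slow as described above:
-- Collect the results of one COUNT(DISTINCT) statement per column
CREATE TABLE #Counts (TableName sysname, ColumnName sysname, DistinctValues bigint);

DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql
    + N'INSERT INTO #Counts SELECT ''' + c.TABLE_NAME + N''', '''
    + c.COLUMN_NAME + N''', COUNT(DISTINCT ' + QUOTENAME(c.COLUMN_NAME)
    + N') FROM ' + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME)
    + N';' + CHAR(10)
FROM INFORMATION_SCHEMA.COLUMNS AS c
JOIN INFORMATION_SCHEMA.TABLES AS t
  ON t.TABLE_SCHEMA = c.TABLE_SCHEMA AND t.TABLE_NAME = c.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
  -- skip types COUNT(DISTINCT) cannot be applied to
  AND c.DATA_TYPE NOT IN ('text','ntext','image','xml','geography','geometry','hierarchyid');

EXEC sp_executesql @sql; -- the temp table is visible inside the dynamic batch
SELECT * FROM #Counts ORDER BY TableName, ColumnName;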

Is hive partitioning hierarchical in nature?

Say we have a table partitioned as:
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18), and hour will store the hour in 24-hour format (e.g. 13). And combination_id is going to be a combination of all of these, with single-digit values left-padded with a 0. So in this example the combination id is 2016071813.
So we fire a query (let's call it Query A):
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually a combination of year, month, day and hour. So will this query fail to take proper advantage of the partitioning?
In other words, if I have another query, call it Query B, will it be more optimal than Query A, or is there no difference?
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If the Hive partitioning scheme really is hierarchical in nature, then Query B should be better from a performance point of view, is my thinking. I actually want to decide whether to get rid of combination_id from the partitioning scheme altogether if it is not contributing to better performance at all.
The only real advantage of using the combination id is being able to use the BETWEEN operator in a select:
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of partitioning scheme, it is going to hamper performance.
Yes, Hive partitioning is hierarchical.
You can check this simply by printing the partitions of the table using the query below.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need to specify combination_id as a partition column if you are not using it for querying.
You can partition either by
Year, month, day, hour columns
or
combination_id only
Partitioning by multiple columns helps the performance of grouping operations.
Say you want to find the maximum of col1 for the month of March in two years (2016 & 2015).
Hive can fetch the records easily by going to the specific year partitions (year=2016/year=2015) and the month partition (month=3).
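A hedged illustration of that pruning, assuming the year/month/day/hour scheme above:
-- Only the year=2015/2016, month=3 partition directories are scanned
SELECT MAX(col1)
FROM MyTable
WHERE year IN (2015, 2016) AND month = 3;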

SUM Every row as Total

I searched before posting but didn't find anything. I don't know if what I want is possible.
I want the sum of every column in the same row. For a better explanation, I attach a picture. I am using SQL Server 2005.
Example:
Thanks for your time.
Normally a requirement like this points to a flaw in the design of your database. Maybe you should refactor it and create another table into which you insert one row per column value, with a foreign key to the main table. That would be much more efficient and would make it easier to maintain and to write queries.
However, you can do it in this way:
SELECT [TOTAL ROW] = Col1 + Col2 + Col3 + Col4 + .....,
OtherColumns ...
FROM dbo.TableName
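One caveat: with this approach, a NULL in any column makes the whole sum NULL. If that matters, a variant (same made-up column list as above) is:
SELECT [TOTAL ROW] = ISNULL(Col1, 0) + ISNULL(Col2, 0) + ISNULL(Col3, 0) + ISNULL(Col4, 0),
       OtherColumns ...
FROM dbo.TableName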

SQLite slowdown as database grows (rolling log)

I'm having a problem with massive SQLite slowdowns in my C application and have no idea whether this is to be expected or whether I'm not using SQLite correctly.
The db uses a rolling log like the one explained here: http://dt.deviantart.com/journal/Build-Your-Own-Circular-Log-with-MySQL-222550965.
The table being written to has about 170 float columns and is set to roll over at 2 million rows. The query to insert rows looks like:
INSERT OR REPLACE INTO table_name (log_id, <170 column names>)
VALUES ((SELECT COALESCE(MAX(log_id), 0) % max_rows + 1 FROM table_name AS t),
        <170 floats>)
The insert time seems to grow linearly with the number of rows. The first insert takes much less than a second, while the 60,000th takes 30 seconds. Is this what you'd expect? The db is stored on an ext3-formatted SD card; could this be a factor?
When you use MAX(log_id), you're asking the database to find the maximum value of log_id in the table. If you have no index on that column, the only way it can determine the maximum value is to scan the entire table.
You can add an index to the log_id column with an SQL command like:
create unique index idx1 on table_name (log_id);
Mind you, this could take a while on a particularly large table. If you can, try it on a copy first.
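As a quick sanity check that the index is actually used, you can inspect the query plan:
EXPLAIN QUERY PLAN SELECT MAX(log_id) FROM table_name;
-- with the index in place, SQLite can answer MAX() with a single index seek
-- instead of a full table scan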

TSQL query with TOP 1 + ORDER BY or max/min + GROUP BY?

I need to get the record of value and timestamp with the max timestamp. The combination of value and timestamp is the primary key. It seems there are two ways to get a max/min value. One is by using TOP 1 + ORDER BY:
SELECT TOP 1
value, timestamp
FROM myTable
WHERE value = @value
ORDER BY timestamp DESC
Another one is by MAX() + GROUP BY:
SELECT value, max(timestamp)
FROM myTable
WHERE value = @value
GROUP BY value
Is the second one better than the first in terms of performance? On my previous question, I read one person's comment that "to sort n items, the first one is O(n²), the second O(n)". What about the case where I have an index on both value and timestamp?
If you don't have a composite index on (value, timestamp), then both will be poor, and probably equally poor at that.
With an index, they'll probably perform the same, thanks to the Query Optimiser.
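For reference, a sketch of that composite index (the index name is made up):
CREATE INDEX IX_myTable_value_timestamp ON myTable (value, timestamp DESC);
With (value, timestamp) indexed, both queries should reduce to a single seek to the end of the matching range.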
You can also quickly test for yourself by using these to see resources used:
SET STATISTICS IO ON
SET STATISTICS TIME ON
...but the best way is to use the Graphical Execution Plans
You should see huge differences in the IO + CPU with and without an index, especially for larger tables.
Note: You have a 3rd option
SELECT @value AS value, max(timestamp)
FROM myTable
WHERE value = @value
This will return a NULL when there are no rows, which makes it slightly different from the other two.
For anyone who finds this in a search and wants to know about Postgres (not applicable to the OP): if the column is indexed, the plans will be identical.