I am new to the world of SQL queries; I am comfortable writing basic SQL queries limited to CRUD operations.
I am now working on a project where I have to write complex queries, and I am seeking help on how to do it.
Scenario
I have a table x
The logic I need to implement is:
1. The first record starts with some default value, let us say 0, as StartCount.
2. I need to add the numbers Add1 + Add2 and deduct Minus.
3. The result of step 2 + StartCount becomes my EndCount.
4. The next month's StartCount is the EndCount of the previous row.
I have to repeat steps 2, 3 and 4 for all the rows in the table.
How can I do this using SQL?
You want a cumulative sum, which is available using window/analytic functions. It looks something like this:
select x.*,
       (first_value(startcount) over (order by <ordercol>) +
        sum(add1 + add2 - minus) over (order by <ordercol>)
       ) as yourvalue
from x;
<ordercol> is needed because SQL tables represent unordered sets. You need a column that specifies the ordering of the rows.
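If it helps to see how the running total maps to StartCount and EndCount, here is a minimal sketch. It assumes the ordering column is called month_col and the opening StartCount is 0; both are placeholders, not names from your table:

SELECT x.*,
       COALESCE(SUM(add1 + add2 - minus) OVER (
                    ORDER BY month_col
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0)
           AS startcount_calc,   -- running total excluding the current row
       SUM(add1 + add2 - minus) OVER (ORDER BY month_col)
           AS endcount_calc      -- running total including the current row
FROM x;

If the first StartCount is not 0, just add that constant to both expressions.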
Related
I have the following MySQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window, from one day up to many years. There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary a lot, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a graphical chart on a webpage.
If the chart is, let's say, 800px wide, it does not make sense to fetch thousands of rows from the database when the time window is big; I cannot show more than 800 values on the chart anyhow.
So, is there a way to reduce the result set directly on the DB side?
I know "average", "sum" etc. as aggregate functions. But how can I, for example, aggregate 100k rows from a big time window into, let's say, 800 final rows?
Just getting those 100k rows and letting the chart do the magic is not the preferred option; transfer size is one reason why this is not an option.
Isn't there something on the DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every Nth row to shrink X down to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that solves the issue, I'm willing to switch databases.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unix timestamp but a date, "trunc" it, average the values and group by the truncated date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more... still researching...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp, `entity`, `value`
FROM `measuredata`
WHERE `entity` = 38 AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp
Works, but my DB/index/structure doesn't seem really optimized for this: a query for the last year took ~75 sec (slow test machine) and returns only one value per day. This can be combined with avg(value), but that further increases the query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with GROUP BY.
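Generalizing that, an untested sketch of the same idea with a fixed number of buckets (say 800) instead of one-day buckets, using the column names from the schema above (the window bounds are placeholders):

SET @from  = UNIX_TIMESTAMP('2019-01-25');
SET @to    = UNIX_TIMESTAMP('2020-01-25');
SET @width = CEIL((@to - @from) / 800);   -- seconds per bucket

SELECT @from + @width * FLOOR((`timestamp` - @from) / @width) AS bucket_start,
       AVG(`value`) AS avg_value,
       COUNT(*)     AS samples
FROM measuredata
WHERE entityid = 38
  AND `timestamp` >= @from
  AND `timestamp` < @to
GROUP BY bucket_start
ORDER BY bucket_start;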
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . .  -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;
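Note that NTILE() and ROW_NUMBER() here are window functions; on MySQL they require version 8.0 or later, while PostgreSQL has supported them for many years.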
I'm looking for an SQL query that will give me a simple percentage value based on the number of occurrences of a value in a table with a single data column.
Example:
Table has single column of data, which has a header and 10 data rows:
COLUMN_HEADER
XYZ://abc123xyz456-0
XYZ://abc123xyz456-1
XYZ://abc123xyz456-2
XYZ://abc123xyz456-3
ABC://abc123xyz456-4
XYZ://abc123xyz456-5
XYZ://abc123xyz456-6
ABC://abc123xyz456-7
XYZ://abc123xyz456-8
XYZ://abc123xyz456-9
I'm looking for the query to find all data that does not start with XYZ://
and give that as a % of the row count.
In the above example, there are two rows that start with ABC:// and eight that start with XYZ://, therefore the result should be:
80.00%
(so 8 out of 10 rows do not start with XYZ://)
As you can tell by now I'm a noob in SQL.
MS SQL 2014
Thanks in advance.
You can do this with conditional aggregation:
select avg(case when COLUMN_HEADER like 'XYZ://%' then 1.0 else 0 end) as xyz_ratio
Your logic and examples are backwards. 80% of the rows have values that do start with "XYZ://". Use like or not like as appropriate.
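If you want it as a percentage to two decimal places, a sketch (assuming the data sits in a table called dbo.MyData; substitute your real table name), counting the rows that do not start with 'XYZ://':

SELECT CAST(100.0 * SUM(CASE WHEN COLUMN_HEADER NOT LIKE 'XYZ://%' THEN 1 ELSE 0 END)
            / COUNT(*) AS DECIMAL(5, 2)) AS pct_not_xyz
FROM dbo.MyData;

Flip NOT LIKE to LIKE if you actually want the share that does start with XYZ://.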
I need to compare the current row and the previous row and, based on some comparison, derive a column value. The approach I am currently following is to build two different record sets, apply a rank function to each, and then join on the ranks. However, this seems tedious; is there a better way to achieve this? I am currently writing a query something like the below:
select <compare columns from the two record sets and derive the column value>
from (select <some complex logic>, rank as rnk from a) rcdset,
     (select <some complex logic>, rank + 1 as rnk from a) rcdset2
where rcdset.rnk = rcdset2.rnk (+)
Database - Oracle 10g
Use LAG(value_expr) OVER (ORDER BY rank_col) to retrieve the value (value_expr) from the previous row (order defined by rank_col); see http://oracle-base.com/articles/misc/lag-lead-analytic-functions.php
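For example, a sketch with hypothetical table and column names (your_table, some_value, rank_col), deriving a column from the comparison with the previous row:

SELECT t.*,
       CASE
         WHEN t.some_value > LAG(t.some_value) OVER (ORDER BY t.rank_col) THEN 'INCREASED'
         WHEN t.some_value < LAG(t.some_value) OVER (ORDER BY t.rank_col) THEN 'DECREASED'
         ELSE 'UNCHANGED'
       END AS change_flag
FROM your_table t;

The first row has no previous row, so LAG returns NULL there and the CASE falls through to the ELSE branch.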
I was always bothered by how I should approach these, and which solution is better. I guess the sample code should explain it better.
Let's imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, with each calculation being based on a previous one. In other words, something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE 'A%'
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE 'A%'
The other one involves constructing a multilevel query like this:
SELECT
t2.*,
(t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
SELECT
t1.*,
t1.Value / t1.NewValue1 AS SomeOtherValue
FROM
(
SELECT
*,
Value + 10 AS NewValue1
FROM
MyTable
WHERE
Name LIKE 'A%'
) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tried a number of different combinations of calculations in both variants. They always produced the same execution plan, so it can be assumed that there is no difference in performance. From a code usability perspective the first approach is obviously better, as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes back to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation of the individual expressions. Without knowing the order, SQL cannot guarantee that the alias is defined before it is evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one expression to another when it is expedient. In my opinion, this is something that the writer of the query needs to decide, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.
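For what it's worth, a sketch of the CTE variant using the columns from your question; each level adds one derived column that the next level can reuse:

WITH step1 AS (
    SELECT *, Value + 10 AS NewValue1
    FROM MyTable
    WHERE Name LIKE 'A%'
),
step2 AS (
    SELECT *, Value / NewValue1 AS SomeOtherValue
    FROM step1
)
SELECT *,
       (Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM step2;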
Can anyone please help me understand the CSUM function below?
What will be the output in each case?
csum(1,1),
csum(1,1) + emp_no
csum(1,emp_no)+emp_no
CSUM is an old, deprecated function from V2R3, over 15 years ago. It can always be rewritten using newer, Standard SQL-compliant syntax.
CSUM(1,1) returns the same as ROW_NUMBER() OVER (ORDER BY 1), a sequence starting with 1.
But you should never use it like that: ORDER BY 1 within a windowed aggregate function is not the same as the final ORDER BY 1 of a SELECT; it orders all rows by the same constant value 1. Teradata calculates those functions in parallel based on the values in PARTITION BY and ORDER BY, which means all rows with the same PARTITION/ORDER data are processed on a single AMP. If there is only a single value, one AMP will process all rows, resulting in a totally skewed distribution.
Instead of ORDER BY 1 you should order by a column with few rows per value, ideally one that is more or less unique.
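For instance, a sketch of the Standard SQL rewrite (the table name emp is only an assumption):

SELECT emp_no,
       ROW_NUMBER() OVER (ORDER BY emp_no) AS seq,
       ROW_NUMBER() OVER (ORDER BY emp_no) + emp_no AS seq_plus_emp_no
FROM emp;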
csum(1,emp_no)+emp_no is probably used with another SELECT to get the current maximum value of a column and add the new sequential values to it, i.e. creating your own gap-less sequence numbers.
This is the best way to do it:
SELECT ROW_NUMBER() OVER (ORDER BY column(s)_with_a_low_number_of_rows_per_value)
+ COALESCE((SELECT MAX(seqnum) FROM table),0)
,....
FROM table