Nested SQL in Presto not resolving a column name when trying to apply WHERE

I have been trying to create an automated chart / query for the utilisation of my router.
I have a nested query that returns the following:
Record_Date | Mbps_IN | Mbps_OUT
YYYYMMDD HH:00 | 1234 | 1234
This should have one entry per hour, but due to data collection issues from my router there are often missing hours or even days of data. The counter is a "delta": elsewhere in the raw data I capture the delta of data volume since the previous record, which results in a flat line for a number of hours and then a very big value, often 2-3 times larger, because it contains multiple hours of utilisation recorded against the first hour the data feed returned.
Ultimately I would like to find a way to smooth / build an average from this spike and backfill the missing hours. (but that is a challenge for another day).
In the first instance, I would simply like to select only the rows where the value in Mbps_In is less than 1000.
However, when I do this from either metabase or a dbeaver connection direct to my PrestoDB I get an error:
Column 'results.Mbps_In' cannot be resolved {:message "line 27:7: Column 'results.Mbps_in' cannot be resolved", :errorCode 47, :errorName "COLUMN_NOT_FOUND",
My query works just fine to give the tabular output, including the outliers, as follows:
select
metrics_date_hour Record_Date
,round(In_Utilisation_Mbps_Total,2) as Mbps_In
,round(Out_Utilisation_Mbps_Total,2) as Mbps_Out
from (
nested query
) results
-- WHERE results.Mbps_In < 1000
Group By Record_Date
Order By Record_Date desc
When I uncomment the WHERE clause, I get the column-resolution error above.
I feel like this should not be difficult, but I have tried a few variations, including referencing some of the original columns that were processed earlier to produce this results output, and I am still failing to correctly reference the column from the results table.
Updated with successful query:
select
metrics_date_hour Record_Date
,round(sum(In_Utilisation_Mbps_Total),2) as Mbps_In
,round(sum(Out_Utilisation_Mbps_Total),2) as Mbps_Out
from (
nested query
) results
-- WHERE results.Mbps_In < 1000 - I didn't get this to work
Group By Record_Date
Having sum(In_Utilisation_Mbps_Total) < 1000
Order By Record_Date desc

The error is produced because you don't have a column named Mbps_In in your nested query. I think that you really need a HAVING clause, not a WHERE. Try changing it to this:
select
metrics_date_hour Record_Date
,round(In_Utilisation_Mbps_Total,2) as Mbps_In
,round(Out_Utilisation_Mbps_Total,2) as Mbps_Out
from (
nested query
) results
Group By Record_Date
Having Mbps_In < 1000
Order By Record_Date desc
If you still want to use the WHERE clause, you need to reference the underlying column name:
select
metrics_date_hour Record_Date
,round(In_Utilisation_Mbps_Total,2) as Mbps_In
,round(Out_Utilisation_Mbps_Total,2) as Mbps_Out
from (
nested query
) results
Where In_Utilisation_Mbps_Total < 1000
Group By Record_Date
Order By Record_Date desc
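If you want to filter on the aliased values themselves, another option is to wrap the output in one more derived table, since an alias defined in a SELECT list is not visible to that same query's WHERE clause (WHERE is evaluated before SELECT). A sketch reusing the names from the question, with "nested query" standing in for the real subquery:
select *
from (
select
metrics_date_hour Record_Date
,round(In_Utilisation_Mbps_Total,2) as Mbps_In
,round(Out_Utilisation_Mbps_Total,2) as Mbps_Out
from (
nested query
) results
) filtered
where filtered.Mbps_In < 1000
order by Record_Date desc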

Related

druid sql query - count distinctly for a multi value field across records

Is there a way to do a distinct count across different rows for a multi-value field in Druid SQL, such that a value is only counted once per array? E.g. suppose I have the records below:
shippingSpeed
[standard, standard, standard, ground]
[standard,ground]
[ground,ground]
Expected Result:
standard 2
ground 3
I tried the query below, but it aggregates the field count inside each array and then gives the total count across all records:
SELECT
"shippingSpeed", count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC
Result:
standard 4
ground 4
This is because GROUP BY on a multi-valued column will UNNEST the array into multiple rows, so each item is correctly counted as one instance.
If you want to remove duplicates, define "shippingSpeed" at ingestion time with the property:
"multiValueHandling": "SORTED_SET"
You can find more details here: https://druid.apache.org/docs/latest/querying/multi-value-dimensions.html#overview
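For context, a sketch of where that property might sit in a string dimension's entry inside the ingestion spec's dimensionsSpec (see the linked docs for the authoritative layout):
"dimensionsSpec": {
"dimensions": [
{ "type": "string", "name": "shippingSpeed", "multiValueHandling": "SORTED_SET" }
]
}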
Okay, there are some undocumented functions that you can use.
SELECT
array_set_add(MV_TO_ARRAY("shippingSpeed"), null), count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC
which might work.
MV_TO_ARRAY -> converts the multi value col to an array
array_set_add -> creates a set out of the arrays. Since we do not have 2 arrays, the second argument is null.
But what @sergio said might be the easiest option.

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I need an equivalent query based on window functions that reports the total value of the balance column for all the days up to and including the given date.
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above to the entire dataset up to and including the desired day. For example, for day 2009-01-31 the result is 97.13522530000000000000, and for day 2009-01-15, when we filter time as time < '2009-01-16 00:00:00', it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
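For a single day, that check is just the first query with a time cutoff (using the table and column names from the question and the example date above):
SELECT SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_day
FROM tbl
WHERE time < '2009-01-16 00:00:00'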
EDIT 2:
Thank you all so very much for your participation and support.
The reason for the differences in the result sets of the queries was in the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.
Consider calculating the running total via a window function after aggregating the data to day level. And since you aggregate with a single condition, the FILTER condition can be converted to a basic WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily
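The cumulative behaviour relies on the default window frame: with an ORDER BY and no explicit frame, PostgreSQL uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so each day sums everything up to and including itself. Written out explicitly, the window expression is equivalent to:
SUM(total_balance) OVER (
ORDER BY daily
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS total_value_per_day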

SELECT MIN from a subset of data obtained through GROUP BY

There is a database in place with hourly timeseries data, where every row in the DB represents one hour. Example:
TIMESERIES TABLE
id date_and_time entry_category
1 2017/01/20 12:00 type_1
2 2017/01/20 13:00 type_1
3 2017/01/20 12:00 type_2
4 2017/01/20 12:00 type_3
First I used the GROUP BY statement to find the latest date and time for each type of entry category:
SELECT MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category;
However, now I want to find the LEAST RECENT date and time among the datetimes obtained with the query above. I will need SELECT MIN(date_and_time) somehow, but how do I tell SQL to treat the output of my previous query as a "new table" to run a new SELECT query on? The output of the whole query should be a single value; for the sample above, date_and_time = 2017/01/20 12:00.
I've tried using aliases, but they don't seem to do the trick; they only rename existing columns or tables (or I'm misusing them). There are many questions out there about finding the MAX or MIN row per group (e.g. https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ or Select max value of each group), which is what I have already achieved, but now I want to work on this list of obtained datetimes. My database structure is very simple, but I lack the knowledge to string these queries together.
Thanks, cheers!
You can use your first query as a sub-query; that is essentially what you describe as using the first query's output as the input for the second query. Here you will get the one-row output of the minimum date, as required.
SELECT MIN(date_and_time)
FROM (SELECT MAX(date_and_time) as date_and_time, entry_category
FROM timeseries_table
GROUP BY entry_category) a;
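Applied to the sample table, the inner query yields 13:00 for type_1 and 12:00 for type_2 and type_3, so the outer MIN returns the expected single value, 2017/01/20 12:00.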
Is this what you want?
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC;
With TOP 1, ties on the date are broken arbitrarily. If you want a deterministic result, include an additional sort key:
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category;
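Note that TOP is SQL Server syntax. On engines that follow the SQL standard (for example PostgreSQL, or Oracle 12c and later), the same query can be written with FETCH FIRST:
SELECT MAX(date_and_time) AS date_and_time, entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category
FETCH FIRST 1 ROW ONLY;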

Access: Compare current field value in subquery

I am trying to create a subquery in MS Access where the HAVING clause compares a value on the current record. I created the queries separately, but am having a hard time trying to combine them.
I have the following query, which is a Purchase Order list (POsFullDetail) and should show the first occurrence of the order date of a PO for a given stock number (StockNum):
SELECT POsFullDetail.PO, POsFullDetail.OrderDate, POsFullDetail.StockNum,
(SELECT First(POsFullDetail.OrderDate) AS FirstOfOrderDate
FROM POsFullDetail
GROUP BY POsFullDetail.StockNum
HAVING POsFullDetail.StockNum = POsFullDetail.StockNum.Value
ORDER BY First(POsFullDetail.OrderDate)
) AS First_Date
FROM POsFullDetail;
The statement that I am trying to work with is POsFullDetail.StockNum.Value
The way it is set up, it asks for a value. When I created the subquery separately, I entered the stock number directly.
The subquery gives you the first order date per stocknum.
When using it as a subquery, you are no longer interested in the first order date per stocknum, but in the first order date for the stocknum.
SELECT POsFullDetail.PO, POsFullDetail.OrderDate, POsFullDetail.StockNum,
(
SELECT First(SameStockNum.OrderDate) AS FirstOfOrderDate
FROM POsFullDetail AS SameStockNum
WHERE SameStockNum.StockNum = POsFullDetail.StockNum
) AS First_Date
FROM POsFullDetail;
As you see, you must use a table alias so you can link the table to itself. Though you are working with the same table, you refer to it once as POsFullDetail and once as SameStockNum, which enables the link SameStockNum.StockNum = POsFullDetail.StockNum.
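An equivalent formulation joins to a grouped derived table instead of using a correlated subquery; a sketch, assuming "first order date" means the earliest date (MIN) rather than Access's order-dependent First():
SELECT p.PO, p.OrderDate, p.StockNum, f.First_Date
FROM POsFullDetail AS p
INNER JOIN (
SELECT StockNum, MIN(OrderDate) AS First_Date
FROM POsFullDetail
GROUP BY StockNum
) AS f ON p.StockNum = f.StockNum;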

Oracle SQL last n records

I have read tons of articles about getting the last n records in Oracle SQL using rownum, but in my case it does not give me the correct rows.
My table has 3 columns: message (varchar2), mes_date (date), and mes_time (varchar2).
Let's say it contains 3 records:
Hello world | 20-OCT-14 | 23:50
World Hello | 21-OCT-14 | 02:32
Hello Hello | 20-OCT-14 | 23:52
I want to get the last 2 records ordered by date and time (first row the oldest, second row the newest date/time).
I am using this query:
SELECT *
FROM (SELECT message
FROM messages
ORDER BY MES_DATE, MES_TIME DESC
)
WHERE ROWNUM <= 2 ORDER BY ROWNUM DESC;
Instead of getting row #3 first and row #2 second, I get row #1 and then row #3.
What should I do to get the oldest dates/times on top, followed by the newest?
Maybe that helps:
SELECT *
FROM (SELECT message,
mes_date,
mes_time,
ROW_NUMBER() OVER (ORDER BY TO_DATE(TO_CHAR(mes_date, 'YYYY-MM-DD') || ' ' || mes_time, 'YYYY-MM-DD HH24:MI') DESC) rank
FROM messages
)
WHERE rank <= 2
ORDER BY rank
I am really sorry to disappoint, but in Oracle there is no such thing as "the last two records".
The table structure does not append data at the end and does not keep a visible insert-time property (the only time being held is for the sole purpose of "flashback queries": supplying results as of a point in time, such as the time the query started).
The last inserted record is not something you can query using the database.
What can you do? You can create a trigger that orders the inserted records using a sequence, and select based on it (so SELECT * FROM (SELECT * FROM messages ORDER BY seq DESC) WHERE rownum < 3); that will assure order only if the sequence does not cache values.
Notice that if the column containing the message date rarely gets multiple events with the same timestamp, you can use that column, as the other answer suggested; but if more than 2 events share a timestamp, such a query will give you an arbitrary two records, not the actual last two.
AGAIN: Oracle cannot be queried for the last two rows inserted, since its data structures do not maintain insert order, and the ordering you see when running "SELECT *" is independent of the actual insert order in some specific cases.
If you have any questions regarding any part of this answer - post it down here, and I'll focus on explaining it in more depth.
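A minimal sketch of that trigger/sequence idea, assuming a hypothetical seq column added to the messages table from the question:
ALTER TABLE messages ADD (seq NUMBER);
CREATE SEQUENCE messages_seq NOCACHE;
CREATE OR REPLACE TRIGGER messages_bi
BEFORE INSERT ON messages
FOR EACH ROW
BEGIN
:NEW.seq := messages_seq.NEXTVAL;
END;
/
-- last two inserted rows, newest first
SELECT * FROM (SELECT * FROM messages ORDER BY seq DESC) WHERE ROWNUM < 3;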
select * from table
minus
select * from table
where rownum <= (select count(*) from table) - n
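On Oracle 12c and later, the "last n by date and time, oldest first" requirement can also be written with FETCH FIRST; a sketch for n = 2, assuming mes_time stays zero-padded HH24:MI so string order matches time order:
SELECT message, mes_date, mes_time
FROM (SELECT message, mes_date, mes_time
FROM messages
ORDER BY mes_date DESC, mes_time DESC
FETCH FIRST 2 ROWS ONLY)
ORDER BY mes_date, mes_time;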