SQL: aggregate table data in subgroups - sql

Fellow SQLers,
I have a simple yet complex problem that I need to solve with SQL (SQLite) for my FME script to run efficiently. I have data I want to aggregate, but in subgroups, taking the z-level into account. It's a geological profile, and I want to simplify the incoming data before I create my profile.
Example data
x,y,z_from,z_to,value
1,2,1,2,A
1,2,2,3,A
1,2,3,4,B
1,2,4,5,A
1,2,5,6,A
1,2,6,7,A
1,2,7,8,C
With SQL I want to aggregate on value, but only across rows that are adjacent (sort of a dissolve, but on non-spatial data). I also want to calculate (in SQL as well) the correct z_from and z_to values. So the result from the example data above should be:
x,y,z_from,z_to,value
1,2,1,3,A
1,2,3,4,B
1,2,4,7,A
1,2,7,8,C
Any help on this will be remembered for all eternity.
Peter

It looks like you want all rows where the "next" value is different, as defined by z_to. Unfortunately, SQLite versions before 3.25 don't support lead(), but you can do:
select e.*
from (select e.*,
             (select e2.value
              from example e2
              where e2.z_to > e.z_to
              order by e2.z_to
              limit 1
             ) as next_value
      from example e
     ) e
where next_value is null or next_value <> value;
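The query above returns the last row of each run, so z_from still has to be recovered separately. As a hedged alternative, here is a minimal sketch that merges each run in one pass using the difference of two row numbers. It assumes SQLite 3.25 or later (where window functions are available) and assumes the intervals for one x,y location are contiguous in z. It is run through Python's sqlite3 module only to make it self-contained; the table name example comes from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example (x INT, y INT, z_from INT, z_to INT, value TEXT)"
)
conn.executemany(
    "INSERT INTO example VALUES (?, ?, ?, ?, ?)",
    [
        (1, 2, 1, 2, "A"),
        (1, 2, 2, 3, "A"),
        (1, 2, 3, 4, "B"),
        (1, 2, 4, 5, "A"),
        (1, 2, 5, 6, "A"),
        (1, 2, 6, 7, "A"),
        (1, 2, 7, 8, "C"),
    ],
)

# Rows in the same unbroken run of a value share the same difference
# between their overall row number and their per-value row number.
query = """
SELECT x, y, MIN(z_from) AS z_from, MAX(z_to) AS z_to, value
FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY x, y ORDER BY z_from)
         - ROW_NUMBER() OVER (PARTITION BY x, y, value ORDER BY z_from) AS grp
    FROM example e
) AS t
GROUP BY x, y, value, grp
ORDER BY x, y, z_from
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

On the sample data this yields the four merged rows listed in the question. If the data can contain gaps in z within a run of equal values, the grouping would need an extra condition on z_from and z_to.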

Related

comparing current row and previous in oracle sql

I need to compare the current row with the previous row and, based on some comparison, derive a column value. The approach I'm currently following is to build two different record sets, apply the rank function to each, and then join them on rank. This works, but it seems tedious. Is there a better way to achieve this? I'm currently writing a query something like the one below:
select
<comparing columns from two record sets and deriving column value>
from
(select <some complex logic>, rank as rnk from a) rcdset,
(select <some complex logic>, rank + 1 as rnk from a) rcdset2
where rcdset.rnk = rcdset2.rnk (+)
Database - Oracle 10g
Use LAG(value_expr) OVER (ORDER BY rank_col) to retrieve the value (value_expr) from the previous row (order defined by rank_col); see http://oracle-base.com/articles/misc/lag-lead-analytic-functions.php
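As a rough illustration of the LAG suggestion, here is a minimal sketch. It runs against SQLite (3.25 or later) through Python's sqlite3 module rather than Oracle 10g, purely so it is self-contained; the LAG ... OVER (ORDER BY ...) call has the same shape in both. The table name readings and the sample values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: rank_col defines the row order, value_expr is compared.
conn.execute("CREATE TABLE readings (rank_col INTEGER, value_expr INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10), (2, 15), (3, 12)],
)

# LAG pulls the previous row's value into the current row, so the
# comparison needs no self-join on rank and rank + 1.
rows = conn.execute(
    """
    SELECT rank_col,
           value_expr,
           LAG(value_expr) OVER (ORDER BY rank_col) AS prev_value,
           value_expr - LAG(value_expr) OVER (ORDER BY rank_col) AS change
    FROM readings
    ORDER BY rank_col
    """
).fetchall()
for row in rows:
    print(row)
```

The first row has no predecessor, so prev_value and the derived column are NULL (None in Python) there.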

Oracle Data Gaps

I'm looking for a query that meets this requirement:
That currently gives us the number of BACs at the entity (which is something we need). The database assigns the BAC IDs consecutively within each accounting entity. So we need to add one more field to the query showing the current highest BAC ID at the entity. And once we have that, just filter the results down to anyplace the number of records doesn't equal the highest ID.
My current query:
select accounting_entity_id, count(bac_id)
from dc.pl_bac_information
group by accounting_entity_id
having count(bac_id) > 1;
Use analytic functions for this:
select bi.*
from (select bi.*, max(bac_id) over (partition by accounting_entity_id) as max_bac_id
from dc.pl_bac_information bi
) bi
where bac_id = max_bac_id;
This assumes you are using Oracle.
SELECT ACCOUNTING_ENTITY_ID
FROM DC.PL_BAC_INFORMATION
GROUP BY ACCOUNTING_ENTITY_ID
HAVING COUNT(BAC_ID) > 1 AND COUNT(BAC_ID) != MAX(BAC_ID);
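To see the count-versus-maximum check in action, here is a small sketch run against SQLite through Python's sqlite3 module (not Oracle). The table name follows the question without the dc schema prefix, and the sample IDs are invented: entity 1 has no gap, entity 2 is missing ID 3, and entity 3 has a single record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pl_bac_information (accounting_entity_id INT, bac_id INT)"
)
conn.executemany(
    "INSERT INTO pl_bac_information VALUES (?, ?)",
    [
        (1, 1), (1, 2), (1, 3),   # consecutive IDs, no gap
        (2, 1), (2, 2), (2, 4),   # ID 3 is missing
        (3, 1),                   # single record, filtered by COUNT > 1
    ],
)

# An entity has a gap when its record count differs from its highest ID,
# assuming IDs are assigned consecutively starting at 1.
gaps = conn.execute(
    """
    SELECT accounting_entity_id,
           COUNT(bac_id) AS bac_count,
           MAX(bac_id)   AS max_bac_id
    FROM pl_bac_information
    GROUP BY accounting_entity_id
    HAVING COUNT(bac_id) > 1 AND COUNT(bac_id) != MAX(bac_id)
    ORDER BY accounting_entity_id
    """
).fetchall()
print(gaps)
```

Only entity 2 is reported, together with its record count and highest ID.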

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get a start.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM  (
   SELECT *, min(place) OVER (PARTITION BY athlete
                              ORDER BY day
                              ROWS 3 PRECEDING) AS best
   FROM   t
   ) sub
WHERE  best > 1
This uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls are in the manual. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
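The windowed min() approach can be checked against the sample data from the question. The sketch below is a test harness, not part of the original answer: it runs the same query shape on SQLite 3.25 or later through Python's sqlite3 module (the answer targets PostgreSQL), uses the table name t and the column names day, athlete and place from the answer, and spells the frame out as ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, athlete TEXT, place INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2013-06-22", "Johnson", 2),
        ("2013-06-21", "Johnson", 1),
        ("2013-06-20", "Johnson", 4),
        ("2013-06-19", "Johnson", 2),
        ("2013-06-18", "Johnson", 3),
        ("2013-06-17", "Johnson", 4),
        ("2013-06-16", "Johnson", 3),
        ("2013-06-15", "Johnson", 1),
    ],
)

# best = lowest place among the current row and the three rows before it.
# The filter on best sits one query level up, because window functions
# are evaluated after the WHERE clause of their own query level.
no_win = conn.execute(
    """
    SELECT day, athlete, place
    FROM (
        SELECT day, athlete, place,
               MIN(place) OVER (PARTITION BY athlete
                                ORDER BY day
                                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS best
        FROM t
    ) AS sub
    WHERE best > 1
    ORDER BY day
    """
).fetchall()
for row in no_win:
    print(row)
```

On this data the two rows for 2013-06-19 and 2013-06-20 come back, matching the rows the question expects.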
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
    (
    select  row_number() over (partition by athlete order by date) rn
    ,       *
    from    Table1
    )
select  *
from    CTE cur
where   not exists
        (
        select  *
        from    CTE prev
        where   prev.place = 1
                and prev.athlete = cur.athlete
                and prev.rn between cur.rn - 3 and cur.rn
        )
Live example at SQL Fiddle.

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, QUANTILES currently does not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the WHERE clause.
For example I use the below query for every distinct row in column-y if I want to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> == <each distinct row in column-y>
Does the BigQuery team plan on having some functionality to allow grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
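To make clear what a per-group continuous median computes, here is a small plain-Python sketch. It assumes percentile_cont interpolates linearly between the two nearest sorted values, which is the usual definition of a continuous percentile; it is not BigQuery's implementation. The room names and readings are made up.

```python
from collections import defaultdict


def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation (assumed definition)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)   # fractional index into the data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Hypothetical (room, data) rows standing in for the sensor table.
readings = [
    ("room_a", 20.0), ("room_a", 22.0), ("room_a", 21.0),
    ("room_b", 18.0), ("room_b", 19.0),
]

# Equivalent of PARTITION BY room: collect the values per room first.
by_room = defaultdict(list)
for room, data in readings:
    by_room[room].append(data)

medians = {room: percentile_cont(vals, 0.5) for room, vals in by_room.items()}
print(medians)
```

With an odd number of values the median is the middle value; with an even number it is the midpoint of the two middle values.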
While there are efficient algorithms to compute quantiles, they are somewhat memory intensive, so trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
If your per-group size is fixed, you may be able to hack it using a combination of order, nest and nth. For instance, if there are 9 distinct values of f2 per value of f1, for the median:
select f1,nth(5,f2) within record from (
select f1,nest(f2) f2 from (
select f1, f2 from table
group by f1,f2
order by f2
) group by f1
);
Not sure if the sorted order in subquery is guaranteed to survive the second group, but it worked in a simple test I tried.

Oracle Group by issue

I have the query below. The problem is that the subquery for the last column, productdescr, returns two records, so the query fails even with DISTINCT. I need to add one more column to the WHERE clause of that subquery so that it returns one record. The issue is that the column I need to add should not be part of the GROUP BY clause.
SELECT product_billing_id,
billing_ele,
SUM(round(summary_net_amt_excl_gst/100)) gross,
(SELECT DISTINCT description
FROM RES.tariff_nt
WHERE product_billing_id = aa.product_billing_id
AND billing_ele = aa.billing_ele) productdescr
FROM bil.bill_sum aa
WHERE file_id = 38613 --1=1
AND line_type = 'D'
AND (product_billing_id, billing_ele) IN (SELECT DISTINCT
product_billing_id,
billing_ele
FROM bil.bill_l2 )
AND trans_type_desc <> 'Change'
GROUP BY product_billing_id, billing_ele
I want to modify the select statement as shown below, adding a new filter to the WHERE clause so that it returns one record.
(SELECT DISTINCT description
FROM RES.tariff_nt
WHERE product_billing_id = aa.product_billing_id
AND billing_ele = aa.billing_ele
AND (rate_structure_start_date <= TO_DATE(aa.p_effective_date,'yyyymmdd')
AND rate_structure_end_date > TO_DATE(aa.p_effective_date,'yyyymmdd'))
) productdescr
The aa.p_effective_date should not be a part of GROUP BY clause. How can I do it? Oracle is the Database.
So there are multiple RES.tariff_nt records for a given product_billing_id/billing_ele, differentiated by the start/end dates.
You want the description for the record that encompasses the p_effective_date from bil.bill_sum. The kicker is that you can't (or don't want to) include that in the GROUP BY. That suggests you've got multiple rows in bil.bill_sum with different effective dates.
The issue is what you want to happen when you are summarising those multiple rows with different dates. Which of those dates do you want to use to get the description?
If it doesn't matter, simply use MIN(aa.p_effective_date), or MAX.
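One way to apply the MIN suggestion is sketched below. It is a variation, not the asker's exact query: it aggregates first and then joins to the tariff table, rather than using a correlated scalar subquery. It runs on SQLite through Python's sqlite3 module instead of Oracle, drops the schema prefixes, simplifies gross to a plain SUM, and compares the yyyymmdd dates as text. All sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bill_sum (
        product_billing_id INT, billing_ele TEXT,
        summary_net_amt_excl_gst INT, p_effective_date TEXT);
    CREATE TABLE tariff_nt (
        product_billing_id INT, billing_ele TEXT, description TEXT,
        rate_structure_start_date TEXT, rate_structure_end_date TEXT);
    INSERT INTO bill_sum VALUES (1, 'E', 1000, '20130105');
    INSERT INTO bill_sum VALUES (1, 'E', 2000, '20130210');
    INSERT INTO tariff_nt VALUES (1, 'E', 'Old tariff', '20130101', '20130201');
    INSERT INTO tariff_nt VALUES (1, 'E', 'New tariff', '20130201', '20140101');
    """
)

# Step 1 (inner query): aggregate per product and pick one effective date
# with MIN, so the date never has to appear in GROUP BY.
# Step 2 (join): look up the tariff row whose date range covers that date.
result = conn.execute(
    """
    SELECT s.product_billing_id, s.billing_ele, s.gross, t.description
    FROM (
        SELECT product_billing_id,
               billing_ele,
               SUM(summary_net_amt_excl_gst) AS gross,
               MIN(p_effective_date)         AS eff_date
        FROM bill_sum
        GROUP BY product_billing_id, billing_ele
    ) AS s
    JOIN tariff_nt t
      ON t.product_billing_id = s.product_billing_id
     AND t.billing_ele = s.billing_ele
     AND t.rate_structure_start_date <= s.eff_date
     AND t.rate_structure_end_date > s.eff_date
    """
).fetchall()
print(result)
```

Swapping MIN for MAX would pick the description that applies to the latest effective date instead.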
Have you looked into the Oracle analytic functions? A good reference is "Analytical Functions by Example".