Computing grouped medians in DolphinDB - sql

I have a DFS table in DolphinDB. I tried to run a query that would compute grouped medians on this table. But it just threw an exception.
select median(col1) from t group by col2
The aggregated function in column med(v1) doesn't have a map-reduce implementation and can't be applied to a partitioned or distributed table.
It seems that DolphinDB does not support a distributed median algorithm.

The aggregate function median differs from average in that it can't be computed with map-reduce, so we have to pull the data for each group and apply the median function to it. DolphinDB's repartition mechanism makes this much easier.
// repartition the data source by the distinct values of col2, so each partition holds one group
ds = repartitionDS(<select first(col2) as col2, median(col1) as col1 from t>, `col2, VALUE)
// run the query on every partition and union the per-group results
mr(ds, x->x,,unionAll{false})

Related

Rename aggregation results in a pandas data frame

In a pandas data frame I am doing some statistical analysis on a column (heart rate). I aggregate by patient id and hour of measure, then compute all the statistics (mean, max, etc.). My question is how to rename the returned columns (e.g. sum_heart_rate instead of sum, min_heart_rate instead of min), given a call like the following:
newdataframe= df2.groupby(['DayHour','subject_id']).agg({"Heart Rate":['sum' ,'min','max','std', 'count','var','skew']})
You can use the template below; add more columns if needed.
newdataframe = df2.groupby(['DayHour','subject_id']).agg(sum_heart_rate=('Heart Rate', 'sum'), min_heart_rate=('Heart Rate', 'min'))
For pandas versions below 0.25, use the code below:
newdataframe = df2.groupby(['DayHour','subject_id'])['Heart Rate'].agg([('sum_heart_rate','sum'), ('min_heart_rate','min')])

PostgreSQL: Writing a max() window function with multiple partition expressions?

I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
    address_token, list_date, original_list_price,
    max(original_list_price) OVER (PARTITION BY address_token, list_date) AS max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION (e.g. using address_token only, nothing after that). Sometimes the query times out. (I use Mode Analytics and get this error: An I/O error occurred while sending to the backend) So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make Windows functions, especially the Partition part run faster? e.g. use certain data types over others, try to avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window function's partitioning clause should not have a big impact on performance. Do realize that your query returns all the rows in the table, so the result set might be very large.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
       max(original_list_price) OVER (PARTITION BY address_token, list_date) AS max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
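For reference, a minimal sketch of creating that index in PostgreSQL (the index name is illustrative; the table and column names are taken from the question):
-- partition keys first, then the aggregated column, per the recommendation above
CREATE INDEX idx_table1_token_date_price
    ON table1 (address_token, list_date, original_list_price);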
You could try writing the query as:
select t1.*,
       (select max(t2.original_list_price)
        from table1 t2
        where t2.address_token = t1.address_token and t2.list_date = t1.list_date
       ) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.

How to properly compute weighted average for zeroes in SQL

I have the following problem: I'm computing a weighted average in SQL as SUM(Value * Weight) / SUM(Weight). However, it can happen that the rows are empty (so SUM(Weight) = 0), and in this case the query fails with a division by zero. Is it possible to return 0 as the result in this case?
I have tried CASE SUM(Weight) WHEN 0 THEN 0 ELSE SUM(Value * Weight) / SUM(Weight) END, but I'm afraid that it evaluates SUM(Weight) twice, and that can be fairly expensive in my case.
Use NULLIF and ISNULL:
ISNULL(SUM(Value * Weight) / NULLIF(SUM(Weight),0),0)
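For instance, wrapped in a full grouped query (SQL Server syntax to match ISNULL; the Measurements table and GroupId column are illustrative):
SELECT GroupId,
       -- NULLIF turns a zero weight sum into NULL, the division then yields NULL,
       -- and ISNULL maps that NULL back to 0
       ISNULL(SUM(Value * Weight) / NULLIF(SUM(Weight), 0), 0) AS weighted_avg
FROM Measurements
GROUP BY GroupId;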
The SQL engine doesn't compute sum(Weight) twice, just once. The conceptual process is:
1) compute the full cartesian join of all tables in the from clause
2) apply the join criteria to filter the results
3) apply the where clause criteria to filter the results
4) partition this result set into groups as defined by the group by clause
5) collapse each such group into one row, computing any aggregate functions that have been specified and keeping only the columns listed in the result set (aggregate results and grouping columns)
6) apply the criteria in the having clause to filter the grouped results
7) drop all columns but those specified in the query's result columns, creating those that are computed expressions
8) apply the ordering specified in the order by clause
No actual SQL engine does this, but it must behave as if that is what happened. Your aggregate function is computed just once, along with any other aggregate functions, in a single pass.

BigQuery COUNT(DISTINCT value) vs COUNT(value)

I found a glitch/bug in BigQuery.
We have a table based on Bank Statistics data under
starschema.net:clouddb:bank.Banks_token
If I run the following query:
SELECT count(*) as totalrow,
count(DISTINCT BankId ) as bankidcnt
FROM bank.Banks_token;
And I get the following result:
Row  totalrow  bankidcnt
1    9513      9903
My problem is: if the table has only 9513 rows, how can I get 9903 distinct values, which is 390 more than the row count of the table?
In BigQuery, COUNT DISTINCT is a statistical approximation for all results greater than 1000.
You can provide an optional second argument to give the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual amount of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
See the complete documentation here:
https://developers.google.com/bigquery/docs/query-reference#aggfunctions
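For example, the query from the question rewritten with the explicit threshold (legacy SQL; the second argument raises the approximation threshold above the table's row count):
SELECT count(*) AS totalrow,
       count(DISTINCT BankId, 10000) AS bankidcnt
FROM bank.Banks_token;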
UPDATE 2017:
With BigQuery #standardSQL COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approx results? See this article.
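A minimal #standardSQL sketch of both, assuming the same table (the dataset path is shortened for illustration):
#standardSQL
SELECT
  COUNT(DISTINCT BankId) AS exact_count,         -- always exact in standard SQL
  APPROX_COUNT_DISTINCT(BankId) AS approx_count  -- approximate, but cheaper on large tables
FROM `bank.Banks_token`;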
I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n > numRows)
Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions
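Applied to the query from the question (legacy SQL):
SELECT count(*) AS totalrow,
       EXACT_COUNT_DISTINCT(BankId) AS bankidcnt
FROM bank.Banks_token;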

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, quantiles currently do not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the WHERE clause.
For example, I run the query below for every distinct value in column-y to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> == <each distinct row in column-y>
Does the BigQuery team plan on adding functionality to allow grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
While there are efficient algorithms to compute quantiles, they are somewhat memory intensive, and trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
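If that is acceptable, a rough sketch of that alternative using the question's placeholders (the bounds stand in for whatever cutoff you consider an outlier):
SELECT <column-y>, AVG(<column-x>) AS trimmed_avg
FROM <table>
WHERE <column-x> BETWEEN <lower_bound> AND <upper_bound>
GROUP BY <column-y>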
If your per-group size is fixed, you may be able to hack it using a combination of order, nest and nth. For instance, if there are 9 distinct values of f2 per value of f1, for the median:
select f1, nth(5, f2) within record
from (
  select f1, nest(f2) f2
  from (
    select f1, f2
    from table
    group by f1, f2
    order by f2
  )
  group by f1
);
Not sure if the sorted order in the subquery is guaranteed to survive the second group by, but it worked in a simple test I tried.