For example, i wanted to write Window functions like sum over (window)
Since over clause is not supported by Druid, how do i achieve the same using Druid Native query API or SQL API?
You should use a GroupBy Query. As Druid is a time series database, you have to specify your interval (window) where you want to query data from. You can use aggregation methods over this data, for example a SUM() aggregation.
If you want, you can also do extra filtering within your aggregation, like "only sum records where city=paris"). You could also apply the SUM aggregation only to records which exists in a certain time window within your selected interval.
If you are a PHP user then maybe this package is handy for you: https://github.com/level23/druid-client#sum
We have tried to implement an easy way to query such data.
Related
I am trying to aggregate strings that belong to the same product code in one row. Which Qlik sense aggregation function should I use?
image
I am able to aggregate integers in such example, but failed for string aggregation.
Have you tried maxstring() - this is a string aggregation function.
As x3ja mentioned, you can use an aggregation function in charts that will work for strings, including:
MaxString()
Only()
Concat()
These can result in the type of thing you're looking for:
It's worth noting, though, that this sort of problem is almost always an issue with the underlying data model. Depending on what your source data looks like, you should consider investigating your use of Join and/or Concatenate. You can see more info on how to use those functions on this Qlik Help page.
Here's a very basic example of using a Join to properly combine the data in a way that results in all data showing up a single record without needing any aggregations in the table chart:
I am working on bigquery with standard sql and I have the following problem.
I am transforming a table with millions of data, but I will only work with the data of yesterday and today.
The result of that query (which is already listed) I have to store in another table.
The problem is that what must be executed every 1 hour and when creating the scheduled query and placing the option of "write append", the data that has been previously saved will be duplicated.
I need something like "write to table if it does not exist"
You should have your scheduled query written with replace in mind:
REPLACE TABLE `dataset.mytable`
AS
SELECT 1;
this way you fully replace on each run.
Update:
You may use MERGE statement to skip existing rows and add only new ones.
Materialized views are appending only new data.
They can query only single table, support only a limited set of aggregation functions (APPROX_COUNT_DISTINCT, ARRAY_AGG, AVG, COUNT, HLL_COUNT.INIT, MAX, MIN, SUM) and do not support a computation on top of an aggregation but maybe they will fit your use case.
So I have the following model in Django:
class MemberLoyalty(models.Model):
date_time = models.DateField(primary_key=True)
member = models.ForeignKey(Member, models.DO_NOTHING)
loyalty_value = models.IntegerField()
My goal is to have all the tuples grouped by the member with the most recent date. There are many ways to do it, one of them is using a subquery that groups by the member with max date_time and filtering member_loyalty with its results. The working sql for this solution is as follows:
SELECT
*
FROM
member_loyalty
WHERE
(date_time , member_id) IN (SELECT
max(date_time), member_id
FROM
member_loyalty
GROUP BY member_id);
Another way to do this would be by joining with the subquery.
How could i translate this on a django query? I could not find a way to filter with two fields using IN, nor a way to join with a subquery using a specific ON statement.
I've tried:
cls.objects.values('member_id', 'loyalty_value').annotate(latest_date=Max('date_time'))
But it starts grouping by the loyalty_value.
Also tried building the subquery, but cant find how to join it or use it on a filter:
subquery = cls.objects.values('member_id').annotate(max_date=Max('date_time'))
Also, I am using Mysql so I can not make use of the .distinct('param') method.
This is a typical greatest-per-group query. Stack-overflow even has a tag for it.
I believe the most efficient way to do it with the recent versions of Django is via a window query. Something along the lines should do the trick.
MemberLoyalty.objects.all().annotate(my_max=Window(
expression=Max('date_time'),
partition_by=F('member')
)).filter(my_max=F('date_time'))
Update: This actually won't work, because Window annotations are not filterable. I think in order to filter on window annotation you need to wrap it inside a Subquery, but with Subquery you are actually not obligated to use a Window function, there is another way to do it, which is my next example.
If either MySQL or Django does not support window queries, then a Subquery comes into play.
MemberLoyalty.objects.filter(
date_time=Subquery(
(MemberLoyalty.objects
.filter(member=OuterRef('member'))
.values('member')
.annotate(max_date=Max('date_time'))
.values('max_date')[:1]
)
)
)
If event Subqueries are not available (pre Django 1.11) then this should also work:
MemberLoyalty.objects.annotate(
max_date=Max('member__memberloyalty_set__date_time')
).filter(max_date=F('date_time'))
I'm looking for a list of pre-defined aggregation functions in Spark SQL. I have in mind something analogous to Presto Aggregate Functions.
I Ctrl+F'd around a little in the SQL API docs to no avail... it's also hard to tell at a glance which functions are for aggregation vs. not. For example, if I didn't know avg is an aggregation function I'd be hard pressed to tell it is one (in a way that's actually scalable to the full set of functions):
avg - avg(expr) - Returns the mean calculated from values of a group.
If such a list doesn't exist, can someone at least confirm to me that there's no pre-defined function like any/bool_or or all/bool_and to determine if any or all of a boolean column in a group are true (or false)?
For now, my workaround is
select grp_col, count(if(bool_col, true, NULL)) > 0 any_agg
Just take a look at Spark Docs on Aggregate functions section
The list of functions is here under Relational Grouped Dataset - specifically the API's that return DataFrame (not RelationalGroupedDataSet):
https://spark.apache.org/docs/latest/api/scala/index.html?org/apache/spark/sql/RelationalGroupedDataset.html#org.apache.spark.sql.RelationalGroupedDataset
I'm trying to group calls by month but I need to do it in the database and not with ruby. Here is the current code:
Call.limit(1000).group_by { |t| t.created_at.month }
Which returns:
SELECT `calls`.* FROM `calls` ORDER BY created_at desc LIMIT 1000
Then ruby does the grouping. What should I do to make the database do the work ?
Thank you.
The short answer, is that you cannot achieve the same result at SQL level.
Here's the full explanation.
First of all, what should be the result of that call? You can use the PG/SQL Group BY statement, however it's likely the result is not what you expect.
The Group By syntax is designed to group rows with a pattern, and compute and aggregate function. In your case, even assuming you create a query that uses date_trunc to group by a part of the timestamp, the aggregate function does not permit you to return a dataset structured like the Ruby group_by method.
Why do you want to compute such grouping at database level?
If you have specific requirements or computation limits, then work on a custom method.
Use Call.limit(1000).group("month(created_at)")
Please checkout mysql date-time methods appropriate in your case. But .group() will do the mysql grouping.
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_month