Matlab: Explicit or named association of `splitapply` arguments with table VariableNames

I've been biting the bullet on the extra code and bookkeeping needed in Matlab to perform common SQL operations. Here is an example of a typical SQL code pattern for generating metrics that summarize a data table tDat:
SELECT vGrouping, MEAN( x - y ) AS rollup1, VAR(y+z) AS rollup2
INTO tRollups FROM tDat GROUP BY vGrouping
My SQL's a bit rusty, but the general idea should be clear to SQLers. Here is the Matlab equivalent:
% Create test data
tDat = array2table( floor(10*rand(5,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrouping = ( rand(5,1) > 0.5 )
% Calculate summary metrics for each group of data
[vGroup,grps] = findgroups(tDat.vGrouping)
fRollup = @(a,b,c)[ mean(a-b) var(b+c) ] % Calculates summary metric
rollups = splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup )
% Code pattern 1 to assemble results
tRollups = [ array2table( grps , 'VariableNames',{'group'} ) ...
array2table( rollups , ...
'VariableNames',{'rollup1','rollup2'} ) ]
% Code pattern 2 to assemble results
tRollups = array2table( [grps rollups], ...
'VariableNames',{'group','rollup1','rollup2'} )
It's not a fair comparison because the Matlab code contains the data setup, as well as two possible code patterns for assembling the summary metrics. Furthermore, I've added comments -- not to make the Matlab code more voluminous, but because it is so much busier that some cognitive signposts are needed to aid in the reading.
Code volume aside, however, one of the things that bugs me is that the rollup expressions in fRollup are not explicitly associated with the names of the input or output data columns. The arguments are dummy arguments, and the actual input data columns from tDat are specified in the splitapply invocation. The association with the fRollup arguments is positional, so field/variable names themselves can't enforce correct association. Likewise, the output columns in tRollups are specified in the array2table invocation, again positionally associated with the fRollup output.
This makes the rather simple relationships in the SQL statement very difficult to see in the Matlab code. Is there an alternate pattern or design idiom that doesn't have this drawback, but hopefully, doesn't incur much in the way of other drawbacks?
AFTERNOTE: For some reason, even though the following doesn't solve the named/explicit association of splitapply input/output arguments with actual input/output variables, I still find it easier to see the relationships. The code definitely looks less noisy. The key is that the function fRollup for generating summary metrics on the data now returns multiple outputs instead of bundling them into a single arrayed output. This allows me to explicitly name properties of the scalar struct ssRollups as the targets of the assignment. I don't need all sorts of conversions to tables, with the extra code to designate VariableNames, just to concatenate the results with the identified groups. Instead, the group identities start off as just another property grps in the same struct (ssRollups) as the splitapply results -- in fact, it is the first property that brings the struct into existence.
% File tmp.m
%-----------
function tmp
% Create test data
tDat = array2table( floor(10*rand(5,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrouping = ( rand(5,1) > 0.5 )
% Find the groups
[ vGroup, ssRollups.grps ] = findgroups(tDat.vGrouping)
% Calculate summary metrics for each group of data
[ ssRollups.rollup1 ssRollups.rollup2 ] = ...
splitapply( @fRollup, tDat(:,{'x','y','z'}), vGroup );
% Display using nice table formatting
struct2table( ssRollups )
end % function tmp
function [rollup1 rollup2] = fRollup(a,b,c)
rollup1 = mean(a-b);
rollup2 = var(b+c);
end % function fRollup
As a multiple-output function, however, fRollup seems better implemented as a non-anonymous function. To me, that actually documents the multiple outputs better, despite the less compact code. It may be just one of those situations where more compactness is less readable, making the data relationships harder to see. However, it does require that the entire passage of code be made into a function (tmp in this case), unless you don't mind breaking fRollup out into its own function and m-file. I prefer not to litter my file system with such tiny snippet functions meant to be used in one place.

This "answer" doesn't directly deal with explicit named association between the actual input/output variables and the arguments of the function handle supplied to splitapply. However, it significantly simplifies the code in the initial example, hopefully making the relationship between the function arguments and the input/output variables easier to see. This solution was initially included in the AFTERNOTE in the question. Since better answers don't appear to be forthcoming soon, I've decided to break it out as the answer. It uses deal to implement an anonymous multiple-output function for splitapply to apply to the data groups delineated by its grouping argument.
% Create test data
tDat = array2table( floor(10*rand(5,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrouping = ( rand(5,1) > 0.5 )
% Find the groups
[ vGroup, ssRollups.grps ] = findgroups(tDat.vGrouping)
% Calculate summary metrics for each group of data
fRollup = @(a,b,c) deal( mean(a-b), var(b+c) )
[ ssRollups.rollup1 ssRollups.rollup2 ] = ...
splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup );
% Display using nice table formatting
struct2table( ssRollups )
Until a better solution emerges, this approach will be my go-to idiom for splitapply.
Here is a variation that uses a table variable for the output of splitapply. This might be more convenient when using multiple grouping variables because findgroups will pass the grouping variable names to the output variable tRollups on the LHS:
% Create test data
tDat = array2table( floor(10*rand(8,3)) , ...
'VariableNames',{'x','y','z'} );
tDat = [ tDat ...
array2table( rand(8,2)>0.5 , ...
'VariableNames',{'vGrpng1','vGrpng2'} ) ];
% Find the groups
[ vGroup, tRollups ] = findgroups(tDat(:,{'vGrpng1','vGrpng2'}));
% Calculate summary metrics for each group of data
fRollup = @(a,b,c) deal( mean(a-b), var(b+c) )
[ tRollups.rollup1 tRollups.rollup2 ] = ...
splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup );
tRollups
And here is a version that uses multiple grouping variables and a scalar struct instead of a table for the outputs of findgroups and splitapply:
% Create test data
tDat = array2table( floor(10*rand(8,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrpng1 = rand(8,1)>0.5 ;
tDat.vGrpng2 = rand(8,1)>0.5
% Find the groups
[ vGroup, ssRollups.vGrpng1, ssRollups.vGrpng2 ] = ...
findgroups( tDat.vGrpng1, tDat.vGrpng2 );
% Calculate summary metrics for each group of data
fRollup = @(a,b,c) deal( mean(a-b), var(b+c) )
[ ssRollups.rollup1 ssRollups.rollup2 ] = ...
splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup );
% Display using nice table formatting
struct2table( ssRollups )

Related

replace .withColumn with a df.select

I am doing a basic transformation on my pyspark dataframe, but here I am using multiple .withColumn statements.
def trim_and_lower_col(col_name):
return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))
df = (
source_df.withColumn("browser", trim_and_lower_col("browser"))
.withColumn("browser_type", trim_and_lower_col("browser_type"))
.withColumn("domains", trim_and_lower_col("domains"))
)
I read that creating multiple withColumn statements isn't very efficient and I should use df.select() instead.
I tried this:
cols_to_transform = [
"browser",
"browser_type",
"domains"
]
df = (
source_df.select([trim_and_lower_col(col).alias(col) for col in cols_to_transform] + source_df.columns)
)
but it gives me a duplicate column error
What else can I try?
The duplicate column error comes from passing each transformed column twice in that list: once as the newly transformed column (through .alias) and once as the original column (by name, from source_df.columns). This solution will let you use a single select statement, preserve the column order, and avoid the duplication issue:
df = (
source_df.select([trim_and_lower_col(col).alias(col) if col in cols_to_transform else col for col in source_df.columns])
)
Chaining many .withColumn calls does pose a problem, as the unresolved query plan can get pretty large and cause a StackOverflow error on the Spark driver during query plan optimisation. One good explanation of this problem is shared here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
You are naming your new columns with .alias(col), which means they have the same name as the columns you used to create them.
During creation (using .withColumn) this does not pose a problem.
But as soon as you try to select, Spark does not know which of the duplicate columns to pick.
You could fix it for example by giving the new columns a suffix:
cols_to_transform = [
"browser",
"browser_type",
"domains"
]
df = (
source_df.select([trim_and_lower_col(col).alias(f"{col}_new") for col in cols_to_transform] + source_df.columns)
)
Another solution, which does pollute the DAG though, would be:
cols_to_transform = [
"browser",
"browser_type",
"domains"
]
for col in cols_to_transform:
source_df = source_df.withColumn(col, trim_and_lower_col(col))
If you only have these few withColumn calls, keep using them. They are still far more readable, and thus more maintainable and self-explanatory.
If you look into it, you'll see that Spark says to be careful with withColumn when you have something like 200 of them.
Using select also makes your code more error prone, since it's more complex to read.
Now, if you have many columns, I would define
the list of the column to transform,
the list of the column to keep
then do the select
cols_to_transform = [
"browser",
"browser_type",
"domains"
]
cols_to_keep = [c for c in df.columns if c not in cols_to_transform]
cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]
source_df.select(*cols_to_keep, *cols_transformed)
This would give you the same column order as the withColumns.

how to dynamically build select list from a API payload using PyPika

I have a JSON API payload containing tablename, columnlist - how to build a SELECT query from it using pypika?
So far I have been able to use a string columnlist, but not able to do advanced querying using functions, analytics etc.
from pypika import Table, Query, functions as fn
def generate_sql (tablename, collist):
table = Table(tablename)
columns = [str(table)+'.'+each for each in collist]
q = Query.from_(table).select(*columns)
return q.get_sql(quote_char=None)
tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print (generate_sql(tablename, collist)) #1
table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print (q.get_sql(quote_char=None)) #2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query in a string by yourself; that defeats the whole purpose of pypika.
What you can do in your case, where the table name and columns come as text in a JSON object, is use * to unpack those values from the collist and use the obj[key] syntax to get a table attribute by name from a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point. If you are only sending column names that you know belong to that table, the above solution is enough.
But if you have complex SQL expressions that reference SQL functions or even different tables, I suggest you rethink your decision to send that as JSON. You might end up with something as complex as pypika itself, like a custom parser or whatever. In that case your better option would be to change the format of your JSON response object.
If you know you only need to support a very limited set of capabilities, it could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or alias
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re
tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']
def parsed(cols):
pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
yield from (re.sub(pattern, subst, col) for col in cols)
table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print (q.get_sql(quote_char=None)) #2
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

Can I store intermediate results?

I want to store intermediate results to avoid multiple calculations for one thing. What I'm looking for is something like this:
h1_activ = sigmoid(self.bias_visiblie + T.dot(D, self.W))
h1_sample = h1_activ > rnds.uniform((n_samples, self.n_hidden ))
f_h1_sample = theano.function(
inputs=[D],
outputs=h1_sample,
# I'd like to take the result from 'h1_sample' and store it into 'H1_sample'
updates=[(self.H1_sample, ??? )]
)
The code above does not run of course but is there a way to do something like this? Storing an intermediate value into a shared variable?
You can write the final results, which use the same intermediate results, in the same theano.function.
For example:
h1_activ = sigmoid(self.bias_visiblie + T.dot(D, self.W))
h1_sample = h1_activ > rnds.uniform((n_samples, self.n_hidden ))
# h2_sample use the intermediate result h1_sample.
h2_sample = h1_sample * 2
f_h1_sample = theano.function(
inputs=[D],
outputs=[h1_sample, h2_sample],
)
h2_sample is a final result which uses h1_sample.
You can also save the intermediate results and use them as inputs in another theano.function.
Different theano.functions correspond to different computation graphs, and I don't think any calculation can be shared between different computation graphs.
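As a minimal sketch of that second suggestion (with made-up shapes and names, and leaving out the bias and sampling details from the question), an intermediate value can be computed by one theano.function and fed to another as an ordinary input:
import numpy as np
import theano
import theano.tensor as T
# Hypothetical weights and input sizes, for illustration only.
rng = np.random.RandomState(0)
W = theano.shared(rng.randn(4, 3).astype(theano.config.floatX), name='W')
D = T.matrix('D')
h1_activ = T.nnet.sigmoid(T.dot(D, W))   # the intermediate result
# The first function returns the intermediate value...
f_h1 = theano.function(inputs=[D], outputs=h1_activ)
# ...which is then passed as an ordinary input to a second function.
H1 = T.matrix('H1')
f_next = theano.function(inputs=[H1], outputs=H1 * 2)
h1_val = f_h1(rng.randn(5, 4).astype(theano.config.floatX))
print(f_next(h1_val))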

Factorial in Google BigQuery

I need to compute the factorial of a variable in Google BigQuery - is there a function for this? I cannot find one in the documentation here:
https://cloud.google.com/bigquery/query-reference#arithmeticoperators
My proposed solution at this point is to compute the factorial for numbers 1 through 100 and upload that as a table and join with that table. If you have something better, please advise.
Since context may reveal a better solution: the factorial is used in computing the Poisson probability of a random variable (the number of events in a window of time). See the first equation here: https://en.wikipedia.org/wiki/Poisson_distribution
Try below. Quick & dirty example
select number, factorial
FROM js(
// input table
(select number from
(select 4 as number),
(select 6 as number),
(select 12 as number)
),
// input columns
number,
// output schema
"[{name: 'number', type: 'integer'},
{name: 'factorial', type: 'integer'}]",
// function
"function(r, emit){
function fact(num)
{
if(num<0)
return 0;
var fact=1;
for(var i=num;i>1;i--)
fact*=i;
return fact;
}
var factorial = fact(r.number)
emit({number: r.number, factorial: factorial});
}"
)
If the direct approach works for the values you need the Poisson distribution calculated on, then cool. If you reach the point where it blows up or gives you inaccurate results, then read on for numerical analysis fun times.
In general you'll get better range and numerical stability if you do the arithmetic on the logarithms, and then exp() as the final operation.
You want: c^k / k! * exp(-c).
Compute its log, ln( c^k / k! * exp(-c) ),
i.e. k ln(c) - ln(k!) - c.
Take exp() of that.
Well but how do we get ln(k!) without computing k!? There's a function called the gamma function, whose practical point here is that its logarithm gammaln() can be approximated directly, and ln(k!) = gammaln(k+1).
There is a Javascript gammaln() in Phil Mainwaring's answer here, which I have not tested, but assuming it works it should fit into a UDF for you.
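To make the arithmetic concrete, here is the same log-space computation sketched in plain Python (not BigQuery), using math.lgamma for ln(k!); a JavaScript UDF would follow the same steps with a gammaln implementation such as the one linked above:
import math
def poisson_pmf(k, c):
    # P(X = k) for X ~ Poisson(c), computed as
    # exp(k*ln(c) - ln(k!) - c), with ln(k!) = lgamma(k+1).
    return math.exp(k * math.log(c) - math.lgamma(k + 1) - c)
# k! alone would overflow a double here, but the log-space form stays finite.
print(poisson_pmf(300, 250))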
Extending Mikhail's answer to be general and correct for computing the factorial of all numbers 1 to n, where n < 500, the following solution holds and can be computed efficiently:
select number, factorial
FROM js(
// input table
(
SELECT
ROW_NUMBER() OVER() AS number,
some_thing_from_the_table
FROM
[any table with at least LIMIT many entries]
LIMIT
100 #Change this to any number to compute factorials from 1 to this number
),
// input columns
number,
// output schema
"[{name: 'number', type: 'integer'},
{name: 'factorial', type: 'float'}]",
// function
"function(r, emit){
function fact(num)
{
if(num<0)
return 0;
var fact=1;
for(var i=num;i>1;i--)
fact*=i;
return fact;
}
// Use toExponential and parseFloat to handle large integers in both Javascript and BigQuery
emit({number: r.number, factorial: parseFloat(fact(r.number).toExponential())});
}"
)
You can get up to 27! using a SQL UDF. Above that value the NUMERIC type gets an overflow error.
CREATE OR REPLACE FUNCTION factorial(integer_expr INT64) AS ( (
SELECT
ARRAY<numeric>[
1,
1,
2,
6,
24,
120,
720,
5040,
40320,
362880,
3628800,
39916800,
479001600,
6227020800,
87178291200,
1307674368000,
20922789888000,
355687428096000,
6402373705728000,
121645100408832000,
2432902008176640000,
51090942171709440000.,
1124000727777607680000.,
25852016738884976640000.,
620448401733239439360000.,
15511210043330985984000000.,
403291461126605635584000000.,
10888869450418352160768000000.][
OFFSET
(integer_expr)] AS val ) );
select factorial(10);
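As an aside, the list of literals above doesn't have to be typed by hand; a few lines of plain Python (not BigQuery) can generate it, since Python integers don't overflow:
import math
# Print 0! through 27! for pasting into the SQL UDF above.
# (In BigQuery, the values above 20! must be written as NUMERIC literals,
# hence the trailing '.' in the snippet.)
for k in range(28):
    print(f"{math.factorial(k)},")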

Setting group_by in specialized query

I need to perform data smoothing using averaging, with a non-standard group_by variable that is created on-the-fly. My model consists of two tables:
class WthrStn(models.Model):
name=models.CharField(max_length=64, error_messages=MOD_ERR_MSGS)
owner_email=models.EmailField('Contact email')
location_city=models.CharField(max_length=32, blank=True)
location_state=models.CharField(max_length=32, blank=True)
...
class WthrData(models.Model):
stn=models.ForeignKey(WthrStn)
date=models.DateField()
time=models.TimeField()
temptr_out=models.DecimalField(max_digits=5, decimal_places=2)
temptr_in=models.DecimalField(max_digits=5, decimal_places=2)
class Meta:
ordering = ['-date','-time']
unique_together = (("date", "time", "stn"),)
The data in WthrData table are entered from an xml file in variable time increments, currently 15 or 30 minutes, but that could vary and change over time. There are >20000 records in that table. I want to provide an option to display the data smoothed to variable time units, e.g. 30 minutes, 1, 2 or N hours (60, 120, 180, etc minutes)
I am using SQLite3 as the DB engine. I tested the following sql, which proved quite adequate to perform the smoothing in 'bins' of N-minute duration:
select id, date, time, 24*60*julianday(datetime(date || time))/N jsec, avg(temptr_out)
as temptr_out, avg(temptr_in) as temptr_in, avg(barom_mmhg) as barom_mmhg,
avg(wind_mph) as wind_mph, avg(wind_dir) as wind_dir, avg(humid_pct) as humid_pct,
avg(rain_in) as rain_in, avg(rain_rate) as rain_rate,
datetime(avg(julianday(datetime(date || time)))) as avg_date from wthr_wthrdata where
stn_id=19 group by round(jsec,0) order by stn_id,date,time;
Note that I create an output variable 'jsec' using the SQLite3 function 'julianday', which returns the number of days in the integer part and the fraction of a day in the decimal part. So multiplying by 24*60 gives the number of minutes, and dividing by the N-minute resolution gives a nice 'group by' variable, compensating for the varying time increments of the raw data.
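For illustration only, here is the same binning arithmetic in plain Python (separate from the Django/SQLite code in question), showing how timestamps map to N-minute bins:
from datetime import datetime
REF = datetime(2000, 1, 1)  # arbitrary fixed origin (julianday uses its own epoch)
def bin_index(ts, n_minutes):
    # Mirrors round(24*60*julianday(datetime(date || time)) / N, 0):
    # minutes since the origin, divided by the bin width, then rounded.
    minutes = (ts - REF).total_seconds() / 60.0
    return round(minutes / n_minutes)
# With N = 30, readings at 12:01 and 12:06 land in the same bin,
# while 12:16 falls in the next one, regardless of the raw sampling interval.
for t in (datetime(2016, 5, 1, 12, 1),
          datetime(2016, 5, 1, 12, 6),
          datetime(2016, 5, 1, 12, 16)):
    print(t.time(), bin_index(t, 30))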
How can I implement this in Django? I have tried the objects.raw(), but that returns a RawQuerySet, not a QuerySet to the view, so I get error messages from the html template:
</p>
Number of data entries: {{ valid_form|length }}
</p>
I have tried using a standard Query, with code like this:
wthrdta=WthrData.objects.all()
wthrdta.extra(select={'jsec':'24*60*julianday(datetime(date || time))/{}'.format(n)})
wthrdta.extra(select = {'temptr_out':'avg(temptr_out)',
'temptr_in':'avg(temptr_in)',
'barom_mmhg':'avg(barom_mmhg)',
'wind_mph':'avg(wind_mph)',
'wind_dir':'avg(wind_dir)',
'humid_pct':'avg(humid_pct)',
'rain_in':'avg(rain_in)',
'rain_sum_in':'sum(rain_in)',
'rain_rate':'avg(rain_rate)',
'avg_date':'datetime(avg(julianday(datetime(date || time))))'})
Note that here I use the sql avg() functions instead of the django aggregate() or annotate(). This seems to generate correct sql code, but I can't seem to get the group_by set properly to the jsec data that is created at the top.
Any suggestions for how to approach this? All I really need is for the QuerySet.raw() method to return a QuerySet, or something that can be converted to a QuerySet instead of a RawQuerySet. I cannot find an easy way to do that.
The answer to this turns out to be really simple, using a hint I found at https://gist.github.com/carymrobbins/8477219 though I modified his code slightly. To return a QuerySet from a RawQuerySet, all I did was add the following to my models.py file, right above the WthrData class definition:
from django.db import connection  # needed for the raw cursor used below

class MyManager(models.Manager):
def raw_as_qs(self, raw_query, params=()):
"""Execute a raw query and return a QuerySet. The first column in the
result set must be the id field for the model.
:type raw_query: str | unicode
:type params: tuple[T] | dict[str | unicode, T]
:rtype: django.db.models.query.QuerySet
"""
cursor = connection.cursor()
try:
cursor.execute(raw_query, params)
return self.filter(id__in=(x[0] for x in cursor))
finally:
cursor.close()
Then in my class definition for WthrData:
class WthrData(models.Model):
objects=MyManager()
......
and later in the WthrData class:
def get_smoothWthrData(stn_id,n):
sqlcode='select id, date, time, 24*60*julianday(datetime(date || time))/%s jsec, avg(temptr_out) as temptr_out, avg(temptr_in) as temptr_in, avg(barom_mmhg) as barom_mmhg, avg(wind_mph) as wind_mph, avg(wind_dir) as wind_dir, avg(humid_pct) as humid_pct, avg(rain_in) as rain_in, avg(rain_rate) as rain_rate, datetime(avg(julianday(datetime(date || time)))) as avg_date from wthr_wthrdata where stn_id=%s group by round(jsec,0) order by stn_id,date,time;'
return WthrData.objects.raw_as_qs(sqlcode,[n,stn_id]);
This allows me to grab results from the highly populated WthrData table, smoothed over time increments, and the results come back as a QuerySet instead of a RawQuerySet.
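For completeness, here is a hypothetical view (the view name and template path are made up for illustration, and it assumes get_smoothWthrData is exposed as a plain function or @staticmethod) showing why getting a real QuerySet back matters; template expressions from the question, such as {{ valid_form|length }}, now work on the result:
# views.py -- hypothetical usage sketch
from django.shortcuts import render
from .models import WthrData
def smoothed_data(request, stn_id, n_minutes):
    valid_form = WthrData.get_smoothWthrData(stn_id, n_minutes)
    # A real QuerySet supports len(), slicing, further .filter() calls, etc.
    return render(request, 'wthr/smoothed.html', {'valid_form': valid_form})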