Dynamic TABLE_DATE_RANGE in BigQuery

Is there a way of using a date variable as an argument of TABLE_DATE_RANGE()?
I mean, my goal is to analyze the behavior of users in the week after they've made a purchase.
What I'm trying to get is something like this:
TABLE_DATE_RANGE([mydata.],
TIMESTAMP(purchaseDate),
TIMESTAMP(DATE_ADD(purchaseDate,7,'DAY')))
where I've previously calculated 'purchaseDate' by querying a fixed period of time. This would make the queried time range dynamic for each user. I'm not sure whether this approach goes against BigQuery's structure logic.

TABLE_DATE_RANGE will not accept field names, if only because no field values are available at the time the function is evaluated: the set of tables to scan is resolved before any rows are read.
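A common workaround is to scan a fixed superset of tables with constant boundaries and push each user's seven-day window into the WHERE clause instead. Below is a legacy-SQL sketch; mydata.events_, mydata.purchases, and all field names are hypothetical:
SELECT e.userId, COUNT(*) AS events_in_week_after_purchase
FROM TABLE_DATE_RANGE([mydata.events_],
                      TIMESTAMP('2015-01-01'),  -- fixed, wide boundaries
                      TIMESTAMP('2015-12-31')) AS e
JOIN [mydata.purchases] AS p ON e.userId = p.userId
WHERE e.eventTime >= p.purchaseDate
  AND e.eventTime < DATE_ADD(p.purchaseDate, 7, 'DAY')
GROUP BY e.userId
You pay to scan the whole fixed range, but each user's results are still restricted to their own post-purchase week (for large tables, legacy SQL requires JOIN EACH).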

Related

What's the best way to account for missing records when performing aggregate queries?

I have a table in QuestDB with IoT sensor data. The usual operation pattern is that sensors write to the table while they have an active internet connection, which means any given sensor may be sending me data anywhere from a few minutes to a few hours per day, or constantly. When I want to run an aggregate query on top of this, how can I account for missing values?
If I want an average by minute over a 24 hour period, but 4 hours of data is missing, will my results be skewed? For example:
select avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m
It becomes obvious when graphing that I'm skipping directly to the next reported value, so instead of a cyclical pattern I get a sudden cliff where the sensor comes back online.
If you want to fill missing values, there is also the option of using the FILL keyword in SAMPLE BY aggregations. There are a few ways you can use this, such as filling with the previous value, linear interpolation, or specifying a constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(linear);
There are some more examples of how to use this in the official documentation.
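For instance, staying with the same iot_logger table, the other strategies mentioned above would look like this (a sketch; the constant 0 is just an example value):
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(prev);
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(0);
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(null);
fill(prev) carries the last reported value forward, fill(0) substitutes the constant, and fill(null) emits explicit NULL rows so the gaps stay visible when graphing.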
Aggregation functions like avg() ignore missing data (for example null values).
So no, your results will not be skewed if your sensors do not send data for some time.
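A quick generic-SQL sanity check of that behavior (syntax varies slightly between databases):
-- avg() skips the NULL, so this returns 2.0 rather than (1 + 0 + 3) / 3
select avg(x)
from (
  select 1 as x
  union all select null
  union all select 3
) t;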

How to make a query that computes the difference of two TimeField objects/attributes in Django?

Suppose I have a model that has four attributes:
name,
time in,
time out,
date.
time in and time out are TimeField objects. Now, I want to write a Django query that tells me who was available in the office for the longest duration in a given range.
I am not sure how to calculate the time difference (time out - time in) on the fly. Do I need to add another attribute like time duration? I was hoping to avoid that.
I don't think it's possible using vanilla Django ORM.
Two solutions come to my mind:
Fetch the results in RAM and do the computation.
Add a new duration field to your model. You can first run an update query to calculate the duration for all existing rows in your db:
from django.db.models import F
Class.objects.update(duration=F('time_out') - F('time_in'))
And then you can order_by duration and get the first entry as your max duration.
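For reference, here is a sketch of the SQL this roughly corresponds to, assuming MySQL (where TIME columns are subtracted with TIMEDIFF) and a hypothetical attendance table:
-- Backfill the new duration column
UPDATE attendance
SET duration = TIMEDIFF(time_out, time_in);

-- Longest stay within a given date range
SELECT name, duration
FROM attendance
WHERE date BETWEEN '2015-01-01' AND '2015-01-31'
ORDER BY duration DESC
LIMIT 1;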

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space used by my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space each field in a table takes up? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field of your interest
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it; just check the validation message for bytesProcessed, and that will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such column profiling for many tables, or for a table with many columns, you can script it in your preferred language: use the Tables.get API to get the table schema, loop through all the fields, build the respective SELECT statement for each, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as noted above, is the size of the respective column.
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as STRING, you could get the average length by querying, for example, the first 1000 rows, and use that in your storage calculations.
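A sketch of that sampling in legacy BigQuery SQL, with hypothetical dataset, table, and column names:
SELECT AVG(LENGTH(myStringField)) AS avg_len
FROM (
  SELECT myStringField
  FROM [mydataset.mytable]
  LIMIT 1000
)
Multiplying avg_len by the table's row count then gives a rough size for the column; the pricing page above lists the exact per-type storage sizes, including the per-value overhead for strings.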

Classify data using Mahout

I'm new to Apache Mahout and working on a classification problem.
The problem states:
There exists a set of data in a text file and I need to fetch some or all of the data from the file depending upon the given span of time.
Span of time: each record has a date of transaction.
So, time span would be calculated using the logic (Sys_Date - Transaction_Date).
Thus, output would vary depending upon whether data is required for last month / week / specific number of days.
How can this filtering be achieved using Apache Mahout?
This by itself does not sound like a machine learning problem at all. You want to put your data in a database of some kind and query for records in a date range. Then, you want to do something with that data. This is not something ML tools do.
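For example, the "last month / week / specific number of days" filtering described above is an ordinary date-range query. A sketch in SQL, with a hypothetical transactions table:
-- Records from the last 30 days; change the interval for week- or N-day windows
SELECT *
FROM transactions
WHERE transaction_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY);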
I haven't worked much with Hadoop yet, but it seems to me that this video should help:
http://www.youtube.com/watch?v=KwW7bQRykHI&feature=player_embedded
After the filtering, you can use the result in Mahout (for solving the classification problem).

Ruby Rails Complex SQL with aggregate function and DayOfWeek

Rails 2.3.4
I have searched google, and have not found an answer to my dilemma.
For this discussion, I have two models. Users and Entries. Users can have many Entries (one for each day).
Entries have values and sent_at dates.
I want to query and display the average value of entries for a user BY DAY OF WEEK. So if a user has entered values for, say, the past 3 weeks, I want to show the average value for Sundays, Mondays, etc. In MySQL, it is simple:
SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = ? GROUP BY 1
That query will return between 0 and 7 records, depending upon how many days a user has had at least one entry.
I've looked at find_by_sql, but while I am searching Entry, I don't want to return an Entry object; instead, I need an array of up to 7 days and averages...
Also, I am a bit concerned about the performance of this, as we would like to load it on the user model when a user logs in, so that it can be displayed on their dashboard. Any advice/pointers are welcome. I am relatively new to Rails.
You can query the database directly, no need to use an actual ActiveRecord object. For example:
ActiveRecord::Base.connection.execute "SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at);"
This will give you a Mysql::Result or Mysql2::Result, an enumerable you can then call each or all on to view your results.
As for caching, I would recommend using memcached, but any other rails caching strategy will work as well. The nice benefit of memcached is that you can have your cache expire after a certain amount of time. For example:
result = Rails.cache.fetch("user/#{user.id}/averages", :expires_in => 1.day) do
# Your sql query and results go here
end
This would put your results into memcached for one day under the key "user/#{user.id}/averages". For example, if you were the user with id 10, your averages would be in memcached under 'user/10/averages', and the next time you went to perform this query (within the same day) the cached version would be used instead of actually hitting the database.
Untested, but something like this should work:
@user.entries.select('DAYOFWEEK(sent_at) as day, AVG(value) as average').group('1').all
NOTE: When you use select to specify columns explicitly, the returned objects are read-only; Rails can't reliably determine which columns can and can't be modified. In this case you probably wouldn't try to modify the selected columns, but you can't modify your sent_at or value columns through the resulting objects either.
Check out the ActiveRecord Querying Guide for a breakdown of what's going on here in a fairly newb-friendly format. Oh, and if that query doesn't work, please post back so others that may stumble across this can see that (and I can possibly update).
Since that won't work due to entries returning an array, we can try using join instead:
User.where(:id => params[:id]).joins(:entries).select('...').group('1').all
Again, I don't know if this will work. Usually you can specify where after joins, but I haven't seen select combined in there. A tricky bit here is that the select is probably going to eliminate returning any data about the user at all. It might make more sense just to eschew find_by_* methods in favor of writing a method in the Entry model that just calls your query with select_all (docs) and skips the association mapping.