Best way in MySQL or Rails to get AVG per day within a specific date range

I'm trying to make a graph in Rails showing, for example, the average sales amount per day for each day in a given date range.
Say I have a products_sold model which has a "sales_price" float attribute. If a specific day has no sales (e.g. none in the model/db), I want to simply return 0 for that day.
What's the best way in MySQL/Rails to get this done? I know I can do something like this:
(This SQL query might be completely the wrong way to get what I want, too.)
SELECT avg(sales_price) AS avg, DATE_FORMAT(created_at, '%m-%d-%Y') AS date
FROM products_sold WHERE merchant_id = 1 GROUP BY date;
And get results like this:
| avg | date |
23 01-03-2009
50 01-05-2009
34 01-07-2009
... ...
What I'd like to get is this:
| avg | date |
23 01-03-2009
0 01-04-2009
50 01-05-2009
0 01-06-2009
34 01-07-2009
0 01-08-2009
... ...
Can I do this with SQL or will I have to post-process the results to find what dates in the daterange aren't in the SQL result set? Perhaps I need some sub-selects or IF statements?
Thanks for any help everyone.

Is there a reason (other than the date one already mentioned) why you wouldn't use the built-in group function capabilities in ActiveRecord? You seem to be concerned about "post-processing", which I don't think is really something to worry about.
You're in Rails, so you should probably be looking for a Rails solution first[1]. My first thought would be to do something like
Product.average(:sales_price, :group => "DATE(created_at)", :conditions => ["merchant_id=?", 1])
which ActiveRecord turns into pretty much the SQL you described. Assuming there's a declared has_many association between Merchant and Product, you'd probably be better off using that, so something like:
ave_prices = Merchant.find(1).products.average(:sales_price, :group => "DATE(created_at)")
(I'm hoping that your description of the model as "products_sold" is some kind of transcription error, btw - if not, you're somewhat off-message with your class naming!)
After all that, you're back where you started, but you got there in a more conventional Rails way (and Rails really values conventions!). Now we need to fill in the gaps.
I'll assume you know your date range, let's say it's defined as all dates from from_date to to_date.
date_aves = (from_date..to_date).map{|dt| [dt, 0]}
That builds the complete list of dates as an array of [date, 0] pairs. We don't need the zero entries for the dates where the query returned an average:
ave_price_dates = ave_prices.collect{|ave_price| ave_price[0]} # build an array of dates
date_aves.delete_if { |dt| ave_price_dates.index(dt[0]) } # remove zero entries for dates retrieved from DB
date_aves.concat(ave_prices.to_a) # add the query results
date_aves = date_aves.sort_by{|ave| ave[0] } # sort by date (sort_by returns a new array)
That lot looks a bit cluttered to me: I think it could be terser and cleaner. I'd investigate building a Hash or Struct rather than staying in arrays.
[1] I'm not saying don't use SQL - situations do occur where ActiveRecord can't generate the most efficient query and you fall back on find_by_sql. That's fine, it's supposed to be like that, but I think you should try to use it only as a last resort.

For any such query, you will need to find a mechanism to generate a table with one row for each date that you want to report on. Then you will do an outer join of that table with the data table you are analyzing. You may also have to play with NVL or COALESCE to convert nulls into zeroes.
The hard part is working out how to generate the (temporary) table that contains the list of dates for the range you need to analyze. That is DBMS-specific.
Your idea of mapping date/time values to a single date is spot on, though. You'd need to pull a similar trick - mapping all the dates to an ISO 8601 date format like 2009-W01 for week 01 - if you wanted to analyze weekly sales.
Also, you would do better to map your DATE format to 2009-01-08 notation because then you can sort in date order using a plain character sort.
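As a rough illustration of the "date table plus outer join plus COALESCE" approach described above, here is a minimal sketch in Python using the standard library's sqlite3 module. SQLite stands in for MySQL here, the recursive CTE named calendar plays the role of the temporary date table, and the sample rows are made up, so treat the details as assumptions rather than MySQL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products_sold (merchant_id INTEGER, sales_price REAL, created_at TEXT);
    -- Made-up sample data: two sales on Jan 3, one on Jan 5, one for another merchant.
    INSERT INTO products_sold VALUES
        (1, 20.0, '2009-01-03 09:15:00'),
        (1, 26.0, '2009-01-03 17:40:00'),
        (1, 50.0, '2009-01-05 11:00:00'),
        (2, 99.0, '2009-01-04 12:00:00');
""")

query = """
WITH RECURSIVE calendar(day) AS (
    -- One row per date in the inclusive range [:start, :stop].
    SELECT :start
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar WHERE day < :stop
)
SELECT calendar.day, COALESCE(AVG(p.sales_price), 0) AS avg_price
FROM calendar
LEFT JOIN products_sold AS p
       ON date(p.created_at) = calendar.day
      AND p.merchant_id = :merchant   -- filter in ON, so empty days survive
GROUP BY calendar.day
ORDER BY calendar.day
"""

params = {"start": "2009-01-03", "stop": "2009-01-06", "merchant": 1}
rows = conn.execute(query, params).fetchall()
for day, avg_price in rows:
    print(day, avg_price)
```

Keeping the merchant filter in the join condition matters: moved into a WHERE clause, it would remove exactly the zero-sales days the outer join was meant to keep.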

To dry up a bit:
ave_prices = Merchant.find(1).products.average(:sales_price, :group => "DATE(created_at)")
date_aves = (from_date..to_date).map{|dt| [dt, ave_prices[dt.strftime("%Y-%m-%d")] || 0]}
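The same gap-filling idea is not specific to Ruby. A minimal Python sketch, assuming the grouped query has already been loaded into a dict keyed by ISO date strings (the function name fill_missing_days and the sample values are hypothetical):

```python
from datetime import date, timedelta

def fill_missing_days(averages, from_date, to_date, default=0):
    """Return [(day, average)] for every day in the inclusive range."""
    n_days = (to_date - from_date).days + 1
    days = (from_date + timedelta(days=i) for i in range(n_days))
    # Look each day up in the query results; fall back to the default.
    return [(day, averages.get(day.isoformat(), default)) for day in days]

# Assumed shape of the grouped query result: {"YYYY-MM-DD": average}.
ave_prices = {"2009-01-03": 23, "2009-01-05": 50, "2009-01-07": 34}
date_aves = fill_missing_days(ave_prices, date(2009, 1, 3), date(2009, 1, 8))
for day, average in date_aves:
    print(day.strftime("%m-%d-%Y"), average)
```

Looking up each date in a hash keeps the result in date order without the delete, concat and sort steps.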

Does MySQL have set-returning functions? I.e. functions that return different values on each row of a query? As an example from PostgreSQL, you can do:
select 'foo', generate_series(3, 5);
This will produce a result set consisting of 2 columns and 3 rows, where the left column contains 'foo' on each row and the right column contains 3, 4 and 5.
So, assuming you have an equivalent of generate_series() in MySQL, and subqueries: What you need is a LEFT OUTER JOIN from this function to the query that you already have. That will ensure you see each date appear in the output:
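SELECT
COALESCE(avg(sales_price), 0) as avg,
DATE_FORMAT(the_date, '%m-%d-%Y') as date
FROM (select cast('2008-01-01' as date) + generate_series(0, 364) as the_date) date_range
LEFT OUTER JOIN products_sold on (the_date = DATE(created_at) AND merchant_id = 1)
GROUP BY date;
The merchant_id filter sits in the join condition rather than in a WHERE clause, because a WHERE on the outer-joined table would discard the rows for dates without sales. COALESCE turns the NULL average for those dates into 0. You may need to fiddle with this a bit to get the syntax right for MySQL.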

Related

Select and manipulate SQL data, DISTINCT and SUM?

I'm trying to make a small report for myself to see how much time gets entered into my system every day.
The goal is for my SQL to report the operator name, the total time worked and the total NG products found for one specific day.
In this order:
1.) Sort out my data for a specific 'date'. I.E 2016-06-03
2.) Present a DISTINCT value for 'operators'
3.) SUM() all time registered at this 'date' and by this 'operator' under 'total_working_time_h'
4.) SUM() all no_of_defects registered at this 'date' and by this 'operator' under 'no_of_defects'
date, operator, total_working_time_h, no_of_defects
Currently I get the data I want by using the Query below. But now I need both the DISTINCT value of the operator and the SUM of the information. Can I use sub-queries for this or should it be done by a loop? Any other hints where I can learn more about how to solve this?
If I use DISTINCT, I can't sum my data the way I'm trying to.
SELECT date, operator, total_working_time_h, no_of_defects FROM {$table_work_hours} WHERE date = '2016-06-03'
Without knowing the table structure or contents, the following query is only a good guess. The bits to notice and work with are SUM() and GROUP BY. The actual syntax will vary a bit depending on which RDBMS you are using.
SELECT
date
,operator
,SUM(total_working_time_h) AS total_working_time_h
,SUM(no_of_defects) AS no_of_defects
FROM {$table_work_hours}
WHERE date = '2016-06-03'
GROUP BY
date
,operator
(Take out the WHERE clause or replace it with a range of dates to get results per operator per date.)
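To see the effect of SUM() with GROUP BY on concrete data, here is a small sketch using Python's sqlite3 module. The table name work_hours and the sample rows are assumptions made for the illustration; only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_hours (
        date TEXT, operator TEXT, total_working_time_h REAL, no_of_defects INTEGER
    );
    -- Made-up rows: anna has two entries on the day of interest.
    INSERT INTO work_hours VALUES
        ('2016-06-03', 'anna', 2.5, 1),
        ('2016-06-03', 'anna', 4.0, 0),
        ('2016-06-03', 'ben',  3.0, 2),
        ('2016-06-04', 'anna', 8.0, 5);
""")

rows = conn.execute("""
    SELECT date, operator,
           SUM(total_working_time_h) AS total_working_time_h,
           SUM(no_of_defects)        AS no_of_defects
    FROM work_hours
    WHERE date = ?
    GROUP BY date, operator      -- one output row per operator per date
    ORDER BY operator
""", ("2016-06-03",)).fetchall()

for row in rows:
    print(row)
```

GROUP BY already yields one row per operator, which is why no DISTINCT is needed.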
I'm not sure why you are trying to use DISTINCT. You want to know the totals (number of hours, etc.) per operator for a specific date.
Do this:
Select Date, Operator, 'SumWorkHrs'=sum(total_working_time_h),
'SumDefects'=sum(no_of_defects) from {$table_work_hours}
Where date='2016-06-03'
Group By Date, Operator
Try this:
SELECT operator,
SUM(total_working_time_h) AS total_working_time_h,
SUM(no_of_defects) AS no_of_defects
FROM {$table_work_hours} WHERE
date = '2016-06-03'
GROUP BY operator

OrientDB Time Span Search Query

In OrientDB I have setup a time series using this use case. However, instead of appending my Vertex as an embedded list to the respective hour I have opted to just create an edge from the hour to the time dependent Vertex.
For argument's sake, let's say that each hour has up to 60 time vertices, each identified by a timestamp. This means I can perform the following query to obtain a specific desired vertex:
SELECT FROM ( SELECT expand( month[5].day[12].hour[0].out() ) FROM Year WHERE year = 2015) WHERE timestamp = 1434146922
I can see from the use case that I can use UNION to get several specified time branches in one go.
SELECT expand( records ) FROM (
SELECT union( month[3].day[20].hour[10].out(), month[3].day[20].hour[11].out() ) AS records
FROM Year WHERE year = 2015
)
This works fine if you only have a small number of branches but it doesn't work very well if you want to get all the records for a given time span. Say you wanted to get all the records between;
month[3].day[20].hour[11] -> month[3].day[29].hour[23]
I could iterate through the timespan and create a huge union query, but at some point the query would become too long, and my guess is that it wouldn't be very efficient. I could also completely bypass the time branches and query the vertices directly based on the timestamp.
SELECT FROM Sector WHERE timestamp BETWEEN 1406588622 AND 1406588624
The problem is that you lose all the efficiencies gained by the time branches.
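To get a feel for how large the "iterate through the timespan and build a union" query becomes, here is a minimal Python sketch that generates it. The helper names are hypothetical, and it assumes the branch indices equal the calendar month, day and hour numbers, which may not match how your time tree is actually indexed.

```python
from datetime import datetime, timedelta

def hour_branches(start, end):
    """Yield one 'month[..].day[..].hour[..].out()' path per hour in [start, end]."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        # ASSUMPTION: indices equal the calendar month, day and hour.
        yield f"month[{current.month}].day[{current.day}].hour[{current.hour}].out()"
        current += timedelta(hours=1)

def build_union_query(start, end):
    """Build a single union query covering every hour branch in the span."""
    if start.year != end.year:
        raise ValueError("this sketch only handles spans within one year")
    branches = ", ".join(hour_branches(start, end))
    return (
        "SELECT expand( records ) FROM ( "
        f"SELECT union( {branches} ) AS records "
        f"FROM Year WHERE year = {start.year} )"
    )

# The span from the question: month 3, day 20, hour 11 to day 29, hour 23.
query = build_union_query(datetime(2015, 3, 20, 11), datetime(2015, 3, 29, 23))
print(query.count(".out()"), "branches,", len(query), "characters")
```

Even this ten-day span needs a couple of hundred branches, which supports the concern about query length.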
By experimenting and reading a bit about data types in OrientDB, I found that square brackets allow:
filtering by one index, for example out()[0]
filtering by multiple indexes, for example out()[0,2,4]
filtering by ranges, for example out()[0-9]
OPTION 1 (UPDATE) :
Using a union to join multiple time branches is the only option if you don't want to create all the indexes and if your range is small. Here is a query example using union in the documentation.
OPTION 2 :
If you always have all the indexes created for your time tree and you filter on wide ranges, you should filter by ranges. This performs better than option 1, at the cost of having to create all the indexes you want to filter on. See the official documentation about field parts.
This is what the query would look like:
select
*
from
(
select
expand(hour[0-23].out())
from
(select
expand(month[3].day[20-29])
from
Year
where
year = 2015)
)
where timestamp > 1406588622
I would highly recommend reading this.

sqlalchemy select by date column only x newest days

Suppose I have a table MyTable with a column some_date (date type, of course) and I want to select the newest 3 months of data (or x days).
What is the best way to achieve this?
Please notice that the date should not be measured from today but rather from the newest date in the table (which might be older than today).
I need to find the maximum date and compare it to each row - if the difference is less than x days, return it.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? Must I have a subquery to find the maximum date? How do I select the last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?) and I don't think that it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10
You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
from datetime import datetime, timedelta
one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
Message.query.filter(Message.created > one_day_ago).all()
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question it looks like I failed to take into account the fact that you want to compare two dates which are in the database rather than today's day. I'm pretty sure that this sort of behavior is going to be database specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs
1. The difference between two DATES is always an INTEGER, representing the number of DAYS difference
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
2. You may add or subtract an INTEGER to a DATE to produce another DATE
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
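For comparison, the same two operations on the Python side look like this. It is only a sketch of the standard library's datetime behaviour, where the difference comes back as a timedelta rather than a bare integer:

```python
from datetime import date, timedelta

# Subtracting two dates gives a timedelta; .days is the whole-day difference.
difference = date(1999, 12, 30) - date(1999, 12, 11)
print(difference.days)

# Adding a timedelta to a date gives another date.
later = date(1999, 12, 11) + timedelta(days=19)
print(later)
```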
You're probably using timestamps if you are storing dates in Postgres. Doing math with timestamps produces an interval object. SQLAlchemy works with timedeltas as a representation of intervals. So you could do something like:
one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < one_day)
I haven't tested this exactly, but I've done things like this and they have worked.
I ended up doing two selects: one to get the max date and another to get the data.
Using the datediff recipe from this thread, I added a datediff function and used the query q = session.query(MyTable).filter(datediff(max_date, some_date) < 10)
I still don't think this is the best way, but until someone proves me wrong, it will have to do...
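For reference, the "newest x days relative to the table's own maximum" condition can also be written as one statement with an uncorrelated scalar subquery. The sketch below uses Python's sqlite3 module and raw SQL rather than SQLAlchemy; the date() modifier syntax is SQLite-specific and the sample rows are made up, so other dialects will need their own date arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, some_date TEXT);
    -- Made-up data whose newest date is well in the past.
    INSERT INTO my_table (some_date) VALUES
        ('2012-01-01'), ('2012-03-05'), ('2012-03-20'),
        ('2012-03-28'), ('2012-03-30');
""")

x_days = 10
rows = conn.execute("""
    SELECT some_date
    FROM my_table
    -- The subquery does not reference the outer row, so the cutoff
    -- can be computed once and compared against the bare column.
    WHERE some_date > (SELECT date(MAX(some_date), ?) FROM my_table)
    ORDER BY some_date
""", (f"-{x_days} days",)).fetchall()

print(rows)
```

Writing the condition as "column > cutoff" instead of "max - column < x" keeps the column free of arithmetic, which is generally friendlier to an index on some_date.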

How to get all rows from a table inserted in a particular date.

I am trying to write a query that gets all the rows of a table for a particular date.
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE='2013-05-07'
However that does not work, because in the table the COLUMN_CONTAINING_DATE contains data like '2013-05-07 00:00:01' etc. So, this would work
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE>='2013-05-07' AND COLUMN_CONTAINING_DATE<'2013-05-08'
However, I don't want to go for option 2 because it feels hacky. I would rather write a query that says "get me all the rows for a given date" and somehow not bother about the hours and minutes in COLUMN_CONTAINING_DATE.
I am trying to have this query run on both H2 and DB2.
Any suggestions?
You can do:
select *
from MY_Table
where trunc(COLUMN_CONTAINING_DATE) = '2013-05-07';
However, the version that you describe as a "hack" is actually better. When a function is wrapped around the column, many SQL optimizers will not use an index on it. With direct comparisons against the bare column, an index can be used.
Use something like this
SELECT * FROM MY_TABLE WHERE DATE(COLUMN_CONTAINING_DATE)=DATE('2013-05-07')
You can ease this if you use the Temporal data management capability from DB2 10.1.
For more information:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/
If your concerns are related to the different data types (timestamp in the column, and a string containing a date), you can do this:
SELECT * FROM MY_TABLE
WHERE
COLUMN_CONTAINING_DATE >= '2013-05-07 00:00:00'
and COLUMN_CONTAINING_DATE < '2013-05-08 00:00:00'
I'd pay attention to the formatting of the WHERE clause, because this improves readability a lot if you have to look at your queries two months later. Just pick a style you prefer for ranges like "a <= x < b". Unfortunately SQL's BETWEEN does not support this, because it includes both endpoints.
One could argue that the milliseconds are still missing, so perfectionists may append another ".0" in the timestamp ...
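The half-open range "day <= x < next day" is easy to generate in application code, which takes away most of the hacky feel. A minimal sketch in Python with sqlite3 standing in for H2/DB2; the helper name day_bounds is hypothetical, and the column is stored as text here only because SQLite has no timestamp type:

```python
import sqlite3
from datetime import date, timedelta

def day_bounds(day):
    """Return (start, end) strings for the half-open range [day, day + 1 day)."""
    return day.isoformat(), (day + timedelta(days=1)).isoformat()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (column_containing_date TEXT);
    CREATE INDEX idx_my_table_date ON my_table (column_containing_date);
    -- Made-up rows around the day boundaries.
    INSERT INTO my_table VALUES
        ('2013-05-06 23:59:59'),
        ('2013-05-07 00:00:01'),
        ('2013-05-07 18:30:00'),
        ('2013-05-08 00:00:00');
""")

start, end = day_bounds(date(2013, 5, 7))
rows = conn.execute(
    "SELECT column_containing_date FROM my_table "
    "WHERE column_containing_date >= ? AND column_containing_date < ? "
    "ORDER BY column_containing_date",
    (start, end),  # bound parameters; the column itself is left untouched
).fetchall()

print(rows)
```

Because the upper bound is exclusive, there is no need to decide how many fractional-second digits to append to "23:59:59".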

How can I query just the month and day of a DATE column?

I have a date of birth DATE column in a customer table with ~13 million rows. I would like to query this table to find all customers who were born on a certain month and day of that month, but any year.
Can I do this by casting the date into a char and doing a subscript query on the cast, or should I create an additional char column and update it to hold just the month and day, or create three new integer columns to hold month, day and year, respectively?
This will be a very frequently used query criteria...
EDIT:... and the table has ~13 million rows.
Can you please provide an example of your best solution?
If it will be frequently used, consider a 'functional index'. Searching on that term at the Informix 11.70 InfoCentre produces a number of relevant hits.
You can use:
WHERE MONTH(date_col) = 12 AND DAY(date_col) = 25;
You can also play games such as:
WHERE MONTH(date_col) * 100 + DAY(date_col) = 1225;
This might be more suitable for a functional index, but isn't as clear for everyday use. Note that in the absence of a functional index, invoking functions on a column in the criterion means that an index is unlikely to be used.
You could easily write a stored procedure too:
CREATE FUNCTION mmdd(date_val DATE DEFAULT TODAY) RETURNING SMALLINT AS mmdd;
RETURN MONTH(date_val) * 100 + DAY(date_val);
END FUNCTION;
And use it as:
WHERE mmdd(date_col) = 1225;
Depending on how frequently you do this and how fast it needs to run you might think about splitting the date column into day, month and year columns. This would make search faster but cause all sorts of other problems when you want to retrieve a whole date (and also problems in validating that it is a date) - not a great idea.
Assuming speed isn't a problem, I would do something like:
select *
FROM Table
WHERE MONTH(DateOfBirthColumn) = SomeMonth AND DAY(DateOfBirthColumn) = SomeDay
I don't have Informix in front of me at the moment, but I think the syntax is right.
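The "month * 100 + day" trick can be tried out quickly outside Informix. The sketch below uses Python's sqlite3 module and registers a Python function named mmdd to mimic the stored procedure; the table contents are made up, and SQLite stores the dates as ISO strings, so this only illustrates the logic and says nothing about Informix index behaviour.

```python
import sqlite3
from datetime import date

def mmdd(iso_date):
    """Mirror of the mmdd idea: month * 100 + day, e.g. 1225 for Dec 25."""
    d = date.fromisoformat(iso_date)
    return d.month * 100 + d.day

conn = sqlite3.connect(":memory:")
conn.create_function("mmdd", 1, mmdd)  # expose the helper to SQL
conn.executescript("""
    CREATE TABLE customer (name TEXT, date_of_birth TEXT);
    -- Made-up customers; two share a Dec 25 birthday in different years.
    INSERT INTO customer VALUES
        ('Ada', '1975-12-25'),
        ('Bob', '1990-12-25'),
        ('Cy',  '1990-12-24'),
        ('Di',  '1982-01-25');
""")

rows = conn.execute(
    "SELECT name FROM customer WHERE mmdd(date_of_birth) = ? ORDER BY name",
    (1225,),
).fetchall()

print(rows)
```

Without an index on the computed value, a query like this has to evaluate the function for every row, which is the cost the functional index is meant to avoid on a 13-million-row table.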