Note: I am following http://railscasts.com/episodes/364-active-record-reputation-system for this example.
I have a Location model where I let the user rate the item (1 to 5 stars):
has_reputation :location_rating, source: :user, aggregated_by: :average
In my controller I am either deleting the rating:
@location.delete_evaluation(:location_rating, current_user)
or updating it:
@location.add_or_update_evaluation(:location_rating, params[:rating], current_user)
I am brutally confused about how to query:
the current average rating of a Location by all users
the rating of a given Location by a given user
the count of ratings on a given Location
I am trying @location.reputation_for(:location_rating) and getting some weird behavior. If a given user deletes a rating, the new reputation_for value seems to average in the old ratings.
For example:
1.9.3p0 :038 > RSEvaluation.all
RSEvaluation Load (1.7ms) SELECT "rs_evaluations".* FROM "rs_evaluations"
+----+-----------------+-----------+-------------+-----------+-------------+-------+-------------------------+-------------------------+
| id | reputation_name | source_id | source_type | target_id | target_type | value | created_at | updated_at |
+----+-----------------+-----------+-------------+-----------+-------------+-------+-------------------------+-------------------------+
| 46 | tracking | 1 | User | 6 | Location | 0.0 | 2012-08-28 05:15:13 UTC | 2012-08-28 05:15:13 UTC |
| 48 | tracking | 1 | User | 5 | Location | 0.0 | 2012-09-04 14:44:37 UTC | 2012-09-04 14:44:37 UTC |
| 51 | tracking | 1 | User | 8 | Location | 0.0 | 2012-09-12 06:51:58 UTC | 2012-09-12 06:51:58 UTC |
| 52 | tracking | 1 | User | 1 | Location | 0.0 | 2012-09-12 14:54:39 UTC | 2012-09-12 14:54:39 UTC |
| 54 | location_rating | 1 | User | 5 | Location | 3.0 | 2012-09-19 05:10:46 UTC | 2012-09-19 05:18:16 UTC |
| 56 | tracking | 11 | User | 5 | Location | 0.0 | 2012-09-19 05:47:10 UTC | 2012-09-19 05:47:10 UTC |
| 58 | location_rating | 11 | User | 3 | Location | 2.0 | 2012-09-19 06:33:12 UTC | 2012-09-19 06:33:12 UTC |
| 61 | location_rating | 11 | User | 5 | Location | 5.0 | 2012-09-19 07:15:42 UTC | 2012-09-19 07:15:42 UTC |
+----+-----------------+-----------+-------------+-----------+-------------+-------+-------------------------+-------------------------+
For Location id = 5, shouldn't my reputation be (3 + 5) / 2 = 4? I currently get 2.59375. If I update User 11's rating to 1, the new reputation becomes 0.59375.
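As a sanity check (this is plain SQL against the rs_evaluations table above, not the gem's API), the numbers I would expect for Location 5 are:
SELECT AVG(value) AS avg_rating,    -- expected 4.0, from rows 54 and 61
       COUNT(*)   AS rating_count   -- expected 2
FROM rs_evaluations
WHERE reputation_name = 'location_rating'
  AND target_type = 'Location'
  AND target_id = 5;
-- adding AND source_id = 11 would give just User 11's rating (5.0)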
I am sure I am missing something obvious here (as I am a Rails newbie).
Version 2.0.0 has a fix for this problem so you should update your gem.
I am trying to downsample and upsample time series data in MonetDB.
Time series database systems (TSDS) usually offer an option to downsample and upsample with an operator like SAMPLE BY (1h).
My time series data looks like the following:
sql>select * from datapoints limit 5;
+----------------------------+------------+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
| time | id_station | temperature | discharge | ph | oxygen | oxygen_saturation |
+============================+============+==========================+==========================+==========================+==========================+==========================+
| 2019-03-01 00:00:00.000000 | 0 | 407.052 | 0.954 | 7.79 | 12.14 | 12.14 |
| 2019-03-01 00:00:10.000000 | 0 | 407.052 | 0.954 | 7.79 | 12.13 | 12.13 |
| 2019-03-01 00:00:20.000000 | 0 | 407.051 | 0.954 | 7.79 | 12.13 | 12.13 |
| 2019-03-01 00:00:30.000000 | 0 | 407.051 | 0.953 | 7.79 | 12.12 | 12.12 |
| 2019-03-01 00:00:40.000000 | 0 | 407.051 | 0.952 | 7.78 | 12.12 | 12.12 |
+----------------------------+------------+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
I tried the following query, but the results are aggregated across all days together, which is not what I am looking for:
sql>SELECT EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "hour";
+------+--------------------------+
| hour | avg_ph                   |
+======+==========================+
|    0 |        8.041121283524923 |
|    1 |        8.041086970785418 |
|    2 |        8.041152801724111 |
|    3 |         8.04107828783526 |
|    4 |        8.041060110153223 |
|    5 |        8.041167286877407 |
|  ... |                      ... |
|   23 |        8.041219444444451 |
+------+--------------------------+
I then tried to aggregate the time series data first by day and then by hour:
SELECT EXTRACT(DATE FROM time) AS "day", EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";
But I am getting the following exception:
syntax error, unexpected sqlDATE in: "select extract(date"
My question: how could I aggregate/downsample the data to a specific period of time (e.g. obtain an aggregated value every 2 days or 12 hours)?
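One approach that might work (a sketch only; it assumes MonetDB accepts CAST(time AS DATE) in place of the unsupported EXTRACT(DATE FROM time), and that dividing two integers yields integer division) is to group by the calendar date plus a derived bucket, e.g. a 12-hour bucket:
SELECT CAST(time AS DATE)           AS "day",
       EXTRACT(HOUR FROM time) / 12 AS "half_day",  -- 0 for 00:00-11:59, 1 for 12:00-23:59
       AVG(ph)                      AS avg_ph
FROM datapoints
GROUP BY "day", "half_day"
ORDER BY "day", "half_day";
A 2-day bucket could be built the same way, by deriving a day number counted from a fixed start date and dividing it by 2.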
I want to track each user's yearly spend, starting from a particular month that begins their cycle, so that they don't exceed the allowed limits. I have the following two tables:
Spend (contains one row per user per month; I can modify the Date column of this table to any format needed, if it helps):
+----+-----------+------+-------+-------+
| ID | Date | Year | Month | Spend |
+----+-----------+------+-------+-------+
| 11 | 01-Sep-19 | 2019 | 9 | 10 |
+----+-----------+------+-------+-------+
| 11 | 01-Oct-19 | 2019 | 10 | 23 |
+----+-----------+------+-------+-------+
| 11 | 01-Nov-19 | 2019 | 11 | 27 |
+----+-----------+------+-------+-------+
| 11 | 01-Dec-19 | 2019 | 12 | 14 |
+----+-----------+------+-------+-------+
| 11 | 01-Jan-20 | 2020 | 1 | 13 |
+----+-----------+------+-------+-------+
| 11 | 01-Feb-20 | 2020 | 2 | 33 |
+----+-----------+------+-------+-------+
| 11 | 01-Mar-20 | 2020 | 3 | 25 |
+----+-----------+------+-------+-------+
| 11 | 01-Apr-20 | 2020 | 4 | 17 |
+----+-----------+------+-------+-------+
| 11 | 01-May-20 | 2020 | 5 | 14 |
+----+-----------+------+-------+-------+
| 11 | 01-Jun-20 | 2020 | 6 | 10 |
+----+-----------+------+-------+-------+
| 11 | 01-Jul-20 | 2020 | 7 | 46 |
+----+-----------+------+-------+-------+
| 11 | 01-Aug-20 | 2020 | 8 | 53 |
+----+-----------+------+-------+-------+
| 11 | 01-Sep-20 | 2020 | 9 | 38 |
+----+-----------+------+-------+-------+
| 11 | 01-Oct-20 | 2020 | 10 | 22 |
+----+-----------+------+-------+-------+
| 11 | 01-Nov-20 | 2020 | 11 | 29 |
+----+-----------+------+-------+-------+
| 50 | 01-Jul-20 | 2020 | 7 | 56 |
+----+-----------+------+-------+-------+
| 50 | 01-Aug-20 | 2020 | 8 | 62 |
+----+-----------+------+-------+-------+
| 50 | 01-Sep-20 | 2020 | 9 | 77 |
+----+-----------+------+-------+-------+
| 50 | 01-Oct-20 | 2020 | 10 | 52 |
+----+-----------+------+-------+-------+
| 50 | 01-Nov-20 | 2020 | 11 | 45 |
+----+-----------+------+-------+-------+
Billing Cycle (contains the months between which we calculate each user's total spend):
+-----+------------+----------+
| ID | StartMonth | EndMonth |
+-----+------------+----------+
| 11 | 10 | 9 |
+-----+------------+----------+
| 50 | 9 | 8 |
+-----+------------+----------+
Sample Output:
+----+-------+------------+
| ID | Cycle | TotalSpend |
+----+-------+------------+
| 11 | 1 | 10 |
+----+-------+------------+
| 11 | 2 | 313 |
+----+-------+------------+
| 11 | 3 | 51 |
+----+-------+------------+
| 50 | 1 | 118 |
+----+-------+------------+
| 50 | 2 | 174 |
+----+-------+------------+
In the sample output, for ID = 11, cycle 1 indicates spend in Sep'19, cycle 2 indicates total spend from Oct'19 (Month 10) to Sep'20 (Month 9) and cycle 3 indicates total spend for the next 12 months from Oct'20 (till whichever month data is present).
I'm a beginner at SQL and I believe doing this might require the use of CTEs/subqueries. I would appreciate any help or guidance.
Since this seems to be an exercise of some sort, I'm not going to provide a full answer, but give you hints to how this could be solved conceptually.
First, I think you should associate the spend entries with their effective cycles (including a cycle number) for the required date range. This could be done with a recursive CTE. Recursive CTEs are not the most efficient approach, but since we don't have the effective cycles with their numbers as a distinct table, it can still be a workable solution.
The result then just needs to be grouped by ID and cycle number, with the amounts summed up, and you're done.
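To illustrate only that final grouping step (spend_with_cycle is a made-up name for the intermediate result of the first step, where each Spend row has already been labelled with its cycle number):
SELECT ID, CycleNo AS Cycle, SUM(Spend) AS TotalSpend
FROM spend_with_cycle          -- hypothetical intermediate result, not an existing table
GROUP BY ID, CycleNo
ORDER BY ID, CycleNo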
I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1 | 2019-07-31 |
| 1 | 2019-06-30 |
| 2 | 2019-10-31 |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1 | 2019-07-29 | 1 |
| 1 | 2019-07-30 | 1 |
| 1 | 2019-08-01 | 2 |
| 1 | 2019-08-02 | 2 |
| 1 | 2019-08-03 | 2 |
| 1 | 2019-06-29 | 0 |
| 1 | 2019-06-30 | 0 |
| 2 | 2019-10-30 | 5 |
| 2 | 2019-10-31 | NULL |
| 2 | 2019-11-01 | 6 |
| 2 | 2019-11-02 | 6 |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). I want to keep only the record with the minimum daily date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that does not seem to be an option in Spark SQL. I'm wondering if there is a similarly efficient method that works in Spark.
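One possible approach (a sketch, not taken from that post; it assumes Spark's built-in date_add and window functions, and that the 5-day buffer is exclusive) is a range join followed by row_number to keep the earliest qualifying daily row:
SELECT ID, month_end_date, daily_date, new_status
FROM (
    SELECT m.ID,
           m.month_end_date,
           d.daily_date,
           d.new_status,
           ROW_NUMBER() OVER (PARTITION BY m.ID, m.month_end_date
                              ORDER BY d.daily_date) AS rn
    FROM month_df m
    JOIN daily_df d
      ON d.ID = m.ID
     AND d.daily_date >= m.month_end_date
     AND d.daily_date < DATE_ADD(m.month_end_date, 5)
     AND d.new_status IS NOT NULL   -- drop null statuses before ranking
) t
WHERE rn = 1;
On the sample data this should keep 2019-08-01 (status 2) for the 2019-07-31 month end, 2019-06-30 (status 0) for 2019-06-30, and 2019-11-01 (status 6) for 2019-10-31, since the 2019-10-31 row has a null status.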
Let's say I have these tables.
Users
| id | name |
|----|------|
| 1 | bob |
Posts
| id | title | created_at | user_id |
|----|---------------|----------------------------|---------|
| 1 | hello world | 2020-05-15 18:29:13.163687 | 1 |
| 2 | hello world 2 | 2020-06-15 18:29:13.163687 | 1 |
| 3 | hello world 3 | 2020-07-15 18:29:13.163687 | 1 |
Snoozes
| id | start_at | end_at | user_id |
|----|----------------------------|----------------------------|---------|
| 1 | 2020-05-01 18:29:13.163687 | 2020-05-30 18:29:13.163687 | 1 |
| 2 | 2020-06-01 18:29:13.163687 | 2020-06-30 18:29:13.163687 | 1 |
| 3 | 2020-07-01 18:29:13.163687 | 2020-07-13 18:29:13.163687 | 1 |
For each user, I want to get the posts that they created when they were not in snooze mode. The number of snooze mode instances they have will vary.
If done correctly with the example data, the only post I'd get back is post id 3.
You can use not exists:
-- keep posts whose created_at does not fall inside any snooze window of the same user
select p.*
from posts p
where not exists (select 1
                  from snoozes s
                  where p.user_id = s.user_id and
                        p.created_at between s.start_at and s.end_at
                 );
I have an ACTIVE_TRANSPORTATION table:
+--------+----------+--------+
| ATN_ID | TYPE | LENGTH |
+--------+----------+--------+
| 1 | SIDEWALK | 20.6 |
| 2 | SIDEWALK | 30.1 |
| 3 | TRAIL | 15.9 |
| 4 | TRAIL | 40.4 |
| 5 | SIDEWALK | 35.2 |
| 6 | TRAIL | 50.5 |
+--------+----------+--------+
It is related to an INSPECTION table via the ATN_ID:
+---------+--------+------------------+
| INSP_ID | ATN_ID | LENGTH_INSPECTED |
+---------+--------+------------------+
| 101 | 2 | 15.2 |
| 102 | 3 | 5.4 |
| 103 | 5 | 15.9 |
| 104 | 6 | 20.1 |
+---------+--------+------------------+
I want to summarize the information like this:
+----------+--------+-------------------+
| TYPE | LENGTH | PERCENT_INSPECTED |
+----------+--------+-------------------+
| SIDEWALK | 85.9 | 36% |
| TRAIL | 106.8 | 23% |
+----------+--------+-------------------+
How can I do this within a single query?
Here is the updated answer using Access 2010. Note that LENGTH is reserved in Access, so the alias needs to be changed to LENGTH_.
SELECT TYPE,
       SUM(LENGTH) AS LENGTH_,
       SUM(IIF(ISNULL(LENGTH_INSPECTED), 0, LENGTH_INSPECTED)) / SUM(LENGTH) AS PERCENT_INSPECTED
FROM ACTIVE_TRANSPORTATION A
LEFT JOIN INSPECTION B
    ON A.ATN_ID = B.ATN_ID
GROUP BY TYPE
Here is the answer I originally had, using T-SQL on SQL Server 2014:
SELECT SUM(LENGTH) AS LENGTH,
       SUM(ISNULL(LENGTH_INSPECTED, 0)) / SUM(LENGTH) AS PERCENT_INSPECTED,
       TYPE
FROM ACTIVE_TRANSPORTATION A
LEFT JOIN INSPECTION B
    ON A.ATN_ID = B.ATN_ID
GROUP BY TYPE
Let me know if you need it converted to a percentage, rounded, etc., but I'm guessing that part is easy for you.
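For example, in the T-SQL version a rounded whole-number percent could be produced like this (a sketch; adjust the rounding as needed):
SELECT TYPE,
       SUM(LENGTH) AS LENGTH,
       -- same ratio as above, multiplied by 100 and rounded to a whole number
       CAST(ROUND(SUM(ISNULL(LENGTH_INSPECTED, 0)) / SUM(LENGTH) * 100, 0) AS INT) AS PERCENT_INSPECTED
FROM ACTIVE_TRANSPORTATION A
LEFT JOIN INSPECTION B
    ON A.ATN_ID = B.ATN_ID
GROUP BY TYPE
Appending a literal '%' sign is then just a display concern.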