I want to implement a function that computes a rolling mean and median on time series data using nearest-neighbour points, and I am wondering how I can do it in SQL.
To illustrate what I am trying to do:
Assume I have two datasets:
First, data on the vehicles parked across all carparks in the USA every day (from 2022-01-01 to 2022-07-18), including each vehicle's exact parking location (latitude, longitude) and the cost of the vehicle.
Second, Vehicle Tracking System data containing 10 unique Tracker IDs, each with its date installed and its latitude and longitude in a particular carpark.
Example of the first data:
Date, vehicle ID, cost of vehicle (in $), latitude, longitude
2022-07-18, 1, $4000, 3.15, 100.15
2022-07-18, 2, $3000, 3.11, 100.13
2022-07-17, 3, $5000, 3.12, 100.10
...
Example of second data:
Tracker ID, Date Installed, latitude, longitude
A, 2022-07-18, 3.14, 100.12
B, 2022-07-20, 5.10, 105.14
For each Tracker, I want to know the mean and median cost of the nearest 5 vehicles parked within 1 km of it, based on the past 5 days of parking data as well as the past 10 days of parking data, counted from its date of installation. (Vehicles can move in and out of a carpark freely at any time.)
Example of how my output should be:
Tracker ID, Mean_Cost_5, Median_Cost_5, Mean_Cost_10, Median_Cost_10
Tracker 1, $4000, $4000, $7000, $5000 (Using carpark data from 5 and 10 days before 2022-07-18)
This should be done for each of the 10 Trackers.
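To make the intended computation concrete, here is a minimal Python sketch for a single tracker (it is not the SQL I am after). It rests on assumptions of mine: distance is the haversine great-circle distance, the window is [install date - N days, install date] with both ends inclusive, and every daily sighting of a vehicle counts as its own row. The names haversine_km and tracker_stats and the sample rows are made up for illustration.

```python
from datetime import date, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import mean, median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def tracker_stats(tracker, parked, days, k=5, radius_km=1.0):
    """Mean and median cost of the k nearest sightings within radius_km.

    Assumed window: [installed - days, installed], both ends inclusive.
    Returns (None, None) when no sighting qualifies.
    """
    start = tracker["installed"] - timedelta(days=days)
    candidates = []
    for row in parked:
        if not (start <= row["date"] <= tracker["installed"]):
            continue
        dist = haversine_km(tracker["lat"], tracker["lon"], row["lat"], row["lon"])
        if dist <= radius_km:
            candidates.append((dist, row["cost"]))
    # Sort by distance and keep the k closest costs.
    nearest = [cost for _, cost in sorted(candidates)[:k]]
    if not nearest:
        return None, None
    return mean(nearest), median(nearest)


# Hypothetical rows placed close to tracker A so that some fall within 1 km.
tracker = {"id": "A", "installed": date(2022, 7, 18), "lat": 3.14, "lon": 100.12}
parked = [
    {"date": date(2022, 7, 18), "id": 1, "cost": 4000, "lat": 3.141, "lon": 100.12},
    {"date": date(2022, 7, 17), "id": 2, "cost": 3000, "lat": 3.142, "lon": 100.12},
    {"date": date(2022, 7, 16), "id": 3, "cost": 5000, "lat": 3.143, "lon": 100.12},
    {"date": date(2022, 7, 10), "id": 4, "cost": 9000, "lat": 3.1405, "lon": 100.12},
    {"date": date(2022, 7, 18), "id": 5, "cost": 1000, "lat": 3.20, "lon": 100.12},
]

stats_5 = tracker_stats(tracker, parked, days=5)
stats_10 = tracker_stats(tracker, parked, days=10)
print(tracker["id"], stats_5, stats_10)
```

Vehicle 5 is more than 1 km away and is ignored; vehicle 4 only enters the 10-day window.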
How can I implement it in SQL?
Related
I have suicide data for 100 countries for the years 1985-2016.
It has a column, suicides/100k pop, which is basically the suicide rate.
The problem is that the data is split by year and then further by age group and gender.
I want the overall top 10 countries with the highest suicides/100k pop.
I have attached a preview of the table.
I am not able to build a query: since the data is split by year, you can't simply sum the yearly suicide rates to find the overall suicide rate.
I built a DataFrame containing the following data:
Daily Price of Gas Future of N Day;
Daily Price of Petroil Future of N Day;
Daily Price of Day-Ahead Electricity Market in Italy.
The data cover the 2010 to 2022 time range, so 12 years of historical data.
The DataFrame head looks like this:
PETROIL GAS ELECTRICITY
0 64.138395 2.496172 68.608696
1 65.196161 2.482612 113.739130
2 64.982403 2.505938 112.086957
3 64.272606 2.500000 110.043478
4 65.993436 2.521739 95.260870
On this DataFrame I tried to build the correlation matrix with the pandas method .corr() and ran into one big issue:
If I take all 12 years of data I get:
almost zero correlation between the Electricity and Petroil prices;
low correlation (0.12) between the Electricity and Gas prices.
Whereas if I split the data into three time ranges (2010-2014, 2014-2018, 2018-2022), I get much higher correlations (in both cases between 0.40 and 0.60).
So I am here asking these two questions:
Why do I get such a large difference when I split the time ranges?
Considering I am doing this kind of analysis to use Petroil and Gas
prices to predict the electricity price, which of these two analysis
should I consider? The first one (with low correlation) that
considers the entire time range or the second one (with higher
correlation) that is split into different time ranges?
Thank you for your answers.
I am a stock trader who visualizes data in QuickSight. I identify the trades I want to submit to the market, sometimes for the same stock, at the same time, but in opposite directions depending on the price of the stock at that time. See below for an example of trades I might identify for 1/19/22 0800:
Date, Hour, Stock, Direction, Price, Volume
1/19/22, 0800, Apple, BUY, $10, 2
1/19/22, 0800, Apple, SELL, $20, 1
1/19/22, 0800, Microsoft, BUY, $15, 3
Using QuickSight, I want to visualize (in pivot tables and charts) the volume that I trade, using the maximum possible trade volume. For example, QuickSight simply sums the Volume column to 6, when really I want it to sum to 5, because the max possible trade volume for that hour is 5. The Apple trades in the example are mutually exclusive, because the stock price cannot be both beneath $10 (triggering a BUY) and above $20 (triggering a SELL) at the same date-time. Therefore, I want the day's traded volume to reflect the MAX possible volume I could have traded (2+3).
I have used the maxOver() function as so: maxOver({volume}, [{stock}, {date}, {hour}], PRE_AGG), but I would like to view my trade volume rolled up to the day as so:
Date, Volume
1/19, 5
Is there a way to do this using QuickSight calculated fields? Should this aggregation be done with a SQL custom field?
Add a new calculated field called volume_direction_specifier:
{Volume} * 10 + ifelse({Direction}='BUY', 1, 2)
This is a single number that encodes both the direction and the volume. (It is needed for cases where the max possible volume is the same for both the BUY and SELL entries within the same hour.)
Then compute the maxOver on this new field in a calculated field called max_volume_direction_specifier:
maxOver({volume_direction_specifier}, [{stock}, {date}, {hour}], PRE_AGG)
Add a new field called volume_for_max_trade_volume_per_hour, which gives the Volume for rows that have the max volume_direction_specifier per hour:
ifelse(volume_direction_specifier = max_volume_direction_specifier, {volume}, null)
And finally, you should be able to add volume_for_max_trade_volume_per_hour to your table (grouped by day) and its SUM will give the maximum possible trade volume per day.
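The same logic can be checked outside QuickSight. The following is a minimal Python sketch of the steps above, using the three example rows from the question; the function names specifier and max_daily_volume are illustrative, and it assumes integer volumes with at most one BUY and one SELL row per stock, date and hour.

```python
from collections import defaultdict

# Example rows from the question: (date, hour, stock, direction, volume).
trades = [
    ("1/19/22", "0800", "Apple", "BUY", 2),
    ("1/19/22", "0800", "Apple", "SELL", 1),
    ("1/19/22", "0800", "Microsoft", "BUY", 3),
]


def specifier(direction, volume):
    # Same encoding as volume_direction_specifier: volume * 10 + 1 (BUY) or 2 (SELL).
    return volume * 10 + (1 if direction == "BUY" else 2)


def max_daily_volume(rows):
    # Step 1: max specifier per (stock, date, hour), like maxOver(..., PRE_AGG).
    best = defaultdict(int)
    for day, hour, stock, direction, volume in rows:
        key = (stock, day, hour)
        best[key] = max(best[key], specifier(direction, volume))
    # Step 2: keep the volume only on rows that hold the max, then sum per day.
    per_day = defaultdict(int)
    for day, hour, stock, direction, volume in rows:
        if specifier(direction, volume) == best[(stock, day, hour)]:
            per_day[day] += volume
    return dict(per_day)


daily = max_daily_volume(trades)
print(daily)  # {'1/19/22': 5}
```

Because the direction is folded into the specifier, a BUY and a SELL with equal volume still produce different numbers, so only one of them is kept.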
My goal is to generate a report showing the average occupancy of a garage (y-axis) at a given day of the week and/or time of day. My data model is as follows:
Garage has_many Cars and Garage has_many Appointments, through: :cars
Car has_many Appointments
Appointment has fields such as:
picked_up_at (datetime)
returned_at (datetime)
Also, Garage has a field capacity (integer), which is the maximum number of cars that will fit in the garage.
If I have a list of Appointments spanning the last 6 months, and I would like to generate a line-graph with the x-axis showing each day of the week, broken down into 4-hour intervals, and the y-axis showing the average % occupancy (# of cars in the garage / capacity) over the 6 month period for the given day/hour interval, how can I go about gathering this data to report on?
E.g. a car is In from the time of one Appointment's return until the next Appointment's pickup, and Out from the Appointment's pickup until its returned_at time.
I am having a lot of trouble making the connection from these data points to the best way to meaningfully report on and present them to the end user.
I am using Rails 4.1 and Ruby 2.0.
Edit: SQL Fiddle - http://sqlfiddle.com/#!9/a72fe/1
This query would do it all (adapted to your added fiddle):
SELECT a.ts, g.*, round((a.ct * numeric '100') / g.capacity, 2) AS pct
FROM  (
   SELECT ts, c.garage_id, count(*) AS ct
   FROM   generate_series(timestamp '2015-06-01 00:00'  -- lower and
                        , timestamp '2015-12-01 00:00'  -- upper bound of range
                        , interval '4h') ts
   JOIN   appointment a ON a.picked_up_at <= ts         -- incl. lower
                       AND (a.returned_at > ts OR
                            a.returned_at IS NULL)      -- excl. upper bound
   JOIN   car c ON c.id = a.car_id
   GROUP  BY 1, 2
   ) a
JOIN   garage g ON g.id = a.garage_id
ORDER  BY 1, 2;
SQL Fiddle.
If returned_at IS NULL, this query assumes that the car is still in use. So NULL shouldn't occur for other cases or you have an error in the calculation.
First, I build the time series with the convenient generate_series() function.
Then join to appointments where the timestamp falls inside a booking.
I assume every appointment includes its lower and excludes its upper timestamp bound, as is the widespread convention.
Aggregate and count before we join to garages (faster this way). Compare:
Aggregate a single column in query with many columns
Percent calculations in the outer SELECT.
I multiply the bigint number with numeric (or optionally real or float) to preserve fractional digits, which would be cut off in an integer division. Then I round to two fractional digits.
Note that this is not exactly the average percentage of each 4-hour period, but only the current percentage at each point in time, which is an approximation of the true average. You might start with an odd timestamp like '2015-06-01 01:17' so as not to fall right between bookings, which would probably turn over at full hours or so; that might increase the mean error of the approximation.
You can do an exact calculation for 4h periods, too, but that's more sophisticated. One simple technique would be to reduce the interval to 10 minutes or some granularity that's detailed enough to capture the full picture.
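To illustrate the sampling idea outside the database, here is a minimal Python sketch that mirrors the query for a single garage: it steps through timestamps and counts the appointments covering each one. The function name sample_pct, the sample appointments and the capacity of 4 are made up for illustration. Unlike the inner join in the query, this sketch also emits timestamps with a count of zero.

```python
from datetime import datetime, timedelta


def sample_pct(appointments, capacity, start, end, step):
    """Percent of capacity covered by appointments at each sample timestamp.

    An appointment covers ts when picked_up_at <= ts < returned_at;
    returned_at = None is treated as still open, as in the query.
    """
    out = []
    ts = start
    while ts <= end:  # generate_series() also includes the upper bound
        ct = sum(
            1
            for picked_up, returned in appointments
            if picked_up <= ts and (returned is None or returned > ts)
        )
        out.append((ts, round(100.0 * ct / capacity, 2)))
        ts += step
    return out


# Hypothetical appointments as (picked_up_at, returned_at) pairs.
appointments = [
    (datetime(2015, 6, 1, 1, 0), datetime(2015, 6, 1, 9, 0)),
    (datetime(2015, 6, 1, 3, 0), datetime(2015, 6, 1, 5, 0)),
    (datetime(2015, 6, 1, 7, 0), None),
]

result = sample_pct(
    appointments,
    capacity=4,
    start=datetime(2015, 6, 1, 0, 0),
    end=datetime(2015, 6, 1, 12, 0),
    step=timedelta(hours=4),
)
for ts, pct in result:
    print(ts, pct)
```

Passing a smaller step such as timedelta(minutes=10) gives the finer granularity mentioned above, at the cost of more sample points.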
Related (with an example for exact calculation):
Calculate working hours between 2 dates in PostgreSQL
Average stock history table
I have a table that is used to calculate a daily completion score by individuals at various locations. Example: on day 1, 9/10 people completed the task, so the location score is 90%. The dimension is "ReferenceDate." The expression is a calculation of count(distinct if(taskcompleted=yes, AccountNumber)) / count(distinct AccountNumber).
Now, I want to report on the average scores per month. I DO NOT want to aggregate all the data and then divide; I want the daily average. Example:
day 1: 9/10 = 90%
day 2: 90/100 = 90% (many more people showed up at the same location)
average of two days is 90%.
It's not 99/110,
and it's also not distinct(99) / distinct(110). It is the simpler (.9 + .9) / 2.
Does this make sense?
What I have now is a line graph showing the daily trend across many months. I need to roll that up into bar charts by month and then compare multiple locations so we can see what locations are having the lower average completion scores.
You need to use the aggr() function to tell QlikView to do the sum day by day and then average the answers.
It should look something like this. (I just split the lines to show which terms are working together.)
avg(
aggr(
count(distinct if(taskcompleted=yes, AccountNumber))
/ count(distinct AccountNumber)
,ReferenceDate)
)
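For comparison, the same "ratio per day, then average" logic can be sketched in plain Python. This is only an illustration of what the aggr() expression computes, not QlikView code; the function name average_daily_score and the rows are made up to match the numbers in the question (9/10 on day 1, 90/100 on day 2).

```python
from collections import defaultdict


def average_daily_score(rows):
    """Mean of the per-day completion ratios, one ratio per ReferenceDate."""
    completed = defaultdict(set)  # day -> distinct accounts that completed the task
    seen = defaultdict(set)       # day -> distinct accounts overall
    for day, account, task_completed in rows:
        seen[day].add(account)
        if task_completed:
            completed[day].add(account)
    daily = [len(completed[day]) / len(seen[day]) for day in seen]
    return sum(daily) / len(daily)


# Made-up rows: (ReferenceDate, AccountNumber, taskcompleted).
rows = [("day 1", n, n < 9) for n in range(10)]
rows += [("day 2", n, n < 90) for n in range(100)]

score = average_daily_score(rows)
print(score)
```

Each day contributes one ratio regardless of how many accounts it has, which is the difference from dividing the pooled counts.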