SQL: Calculate cumulative total based on rows within the same table

I'm trying to calculate a cumulative total for a field for each row in a table.
Consider the number of passengers on a bus: I know how many people get on and off at each stop, but I need to add to this the load on the bus arriving at each stop.
I've got as far as a field which calculates how the load changes at each stop, but how do I get the load from the stop before it? Note that there are a number of trips within the same table, so for Stop 1 of a new trip the load should be zero.
I've tried searching, but being new to this I'm not even sure what I should be looking for, and I'm not sure the results I do get are relevant.
SELECT [Tripnumber], [Stop], SUM([Boarders] - [Alighters]) AS LoadChange
FROM table
GROUP BY [Tripnumber], [Stop], [Boarders], [Alighters]
ORDER BY [Tripnumber], [Stop]

You can use window functions:
SELECT [Tripnumber], [Stop],
       SUM([Boarders] - [Alighters]) OVER (PARTITION BY [Tripnumber] ORDER BY [Stop]) AS LoadChange
FROM table;
I don't think the GROUP BY is necessary.
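The query above gives the load leaving each stop. If you want the load arriving at each stop (before that stop's boarders and alighters are applied), a minimal sketch, assuming the same table and column names as in the question, is to end the window frame one row earlier:
-- Running total of all stops before the current one;
-- COALESCE turns the NULL at Stop 1 of each trip into a zero load.
SELECT [Tripnumber], [Stop],
       COALESCE(SUM([Boarders] - [Alighters]) OVER (
                    PARTITION BY [Tripnumber]
                    ORDER BY [Stop]
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                ), 0) AS ArrivingLoad
FROM table;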

Related

I need a moving average over a window of 5 rows across 2 columns; the same team may appear in either column (see the details)

The table I have is like:
- year_round
- home_team_name
- away_team_name
- home_kicks
- away_kicks
I need a moving average on home_kicks and away_kicks. One record holds both home_kicks and away_kicks, so for a team's home kicks I also need to include the rounds where the same team played as the away team. At the moment I use a very complex and slow query; I want to use the partitioning and window functions of modern SQL, but I cannot get my head around this. I know how to do a moving average with a partition and window function for one column like home_kicks, but I cannot also pick up the same team when it plays as the away team in previous rounds.
select fms.year_round,
       avg(fms.home_kicks) over (partition by fms.home_team_name
                                 order by fms.year_round desc
                                 range between 5 preceding and 1 preceding) as avg_home_kicks_5
from fw_matches_stats fms
order by fms.year_round desc
I guess I need a self inner join to select the home team when it appears as the away team, but I'm not sure how to do the calculations.
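One common approach (a sketch, not a tested solution for this schema, and assuming year_round orders the rounds) is to unpivot each match into two rows, one per team, so that a team's home and away appearances land in a single partition:
-- Each match contributes one row per team; the window then sees
-- home and away kicks for the same team as one ordered stream.
with team_kicks as (
    select year_round, home_team_name as team, home_kicks as kicks
    from fw_matches_stats
    union all
    select year_round, away_team_name as team, away_kicks as kicks
    from fw_matches_stats
)
select year_round, team, kicks,
       avg(kicks) over (partition by team
                        order by year_round
                        rows between 5 preceding and 1 preceding) as avg_kicks_5
from team_kicks
order by year_round desc;
Using rows between rather than range between makes the frame count rows, which matches "a window of 5 rows" even when round values are not consecutive.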

SQL: Reduce resultset to X rows?

I have the following MySQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window, from "one day" up to "many years". There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary widely, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a graphical chart on a webpage.
If the chart is, let's say, 800px wide, it makes no sense to fetch thousands of rows from the database when the time window is large; I cannot show more than 800 values on the chart anyhow.
So, is there a way to reduce the result set directly on the DB side?
I know aggregate functions like AVG and SUM, but how can I aggregate, say, 100k rows from a big time window down to 800 final rows?
Just fetching those 100k rows and letting the chart do the magic is not the preferred option; transfer size is one reason why not.
Isn't there something on the DB side I can use?
Something like AVG() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every nth row to shrink X down to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that solves this, I'm willing to switch databases.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, "trunc" it, average the values, and group by the truncated date. That could work for me, but it would require a rework of my table structure. Hmm... maybe there's more... still researching...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp, `entity`, `value`
FROM `measuredata`
WHERE `entity` = 38 AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp
It works, but my DB/index/structure doesn't seem optimized for this: a query over the last year took ~75 sec (slow test machine), and it only returns one value per day anyway. This can be combined with avg(value), but that further increases query time (~82 sec). I will see if it's possible to optimize this further, but I now have an idea of how "downsampling" data works, especially aggregation in combination with GROUP BY.
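For reference, here is the combined version as a sketch (the one-day bucket of 86400 seconds is illustrative; dividing the window length in seconds by 800 instead would target roughly 800 output rows for any window):
-- One output row per day-sized bucket; AVG smooths the values
-- inside each bucket instead of returning an arbitrary one.
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp,
       AVG(`value`) AS avg_value
FROM `measuredata`
WHERE `entity` = 38
  AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp
ORDER BY aggtimestamp;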
There is probably no efficient way to do this. But if you want, you can break the rows into equal-sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . .  -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;
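ntile(800) deals the ordered rows into 800 roughly equal buckets, and seqnum = 1 keeps the earliest row of each bucket, so the result never exceeds 800 rows no matter how large the window is. If you would rather smooth than sample, a variant of the same idea (a sketch under the same assumptions) averages within each tile instead:
select tile,
       min(timestamp) as bucket_start,
       avg(value) as avg_value
from (select md.*, ntile(800) over (order by timestamp) as tile
      from measuredata md
      where . . .  -- your filtering conditions here
     ) md
group by tile
order by tile;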

PowerPivot Ranking Groups using DAX's Rankx - Ranking Using Sum of a Field

I am trying to rank groups by summing a field (not a calculated column) for each group, so that I get a static answer for each row in my table.
For example, I may have a table with state, agent, and sales. Sales is a field, not a measure. There can be many agents within a state, so there are many rows for each individual state. I am trying to rank the states by total sales within each state.
I have tried many things, but the ones that make the most sense to me are:
rankx(CALCULATETABLE(Table,allexcept(Table,Table[AGENT]),sum([Sales]),,DESC)
and
=rankx(SUMMARIZE(State,Table[State],"Sales",sum(Table[Sales])),[Sales])
The first one creates a table where it sums sales without grouping by agent, and then tries to rank based on that. I get #ERROR on this one.
The second one creates a table using SUMMARIZE with only the sum of Sales grouped by state, then tries to take that table and rank the states based on Sales. For this one I get a rank of 1 for every row.
I think, but am not sure, that my problem comes from Sales being a static field and not a calculated measure. I can't figure out where to go from here. Any help?
Assuming your data has one row per state and agent with a Sales column, have you tried this:
Ranking Measure = RANKX(ALL('Table'[STATE]),CALCULATE(SUM('Table'[Sales])))
The ALL('Table'[STATE]) says to rank all states. The CALCULATE(SUM('Table'[Sales])) says to rank by the sum of their sales. The CALCULATE wrapper is important; a plain SUM('Table'[Sales]) will be filtered to the current row context, resulting in every state being ranked #1. (Alternatively, you can spin off SUM('Table'[Sales]) into a separate Sales measure - which I'd recommend.)
Note: the ranks will change based on slicers/filters (e.g. a filter by agent will re-rank the states by that agent). If you're looking for a static rank of states by their total sales (i.e. not affected by filters on agent and always looking at the entire table), then try this:
Static Ranking Measure = CALCULATE([Ranking Measure], ALLEXCEPT('Table', 'Table'[State]))
This takes the same ranking measure, but removes all filters except the state filter (which you need to leave, as that's the column you're ranking by).
I did figure out a solution that's pretty simple, but it's messier than I'd like. If it's the only thing that works, though, that's okay.
I created a new table with each distinct state along with its sum of sales, then just do a basic RANKX on that table.

Calculate First Time Buyers and Repeating Buyers using MAQL Queries in the GoodData platform

I have recently been working with the GoodData platform. I don't have much experience in MAQL, but I am working on it. I have built some metrics and reports in GoodData. Recently I tried to create metrics for calculating Total Buyers, First Time Buyers, and Repeating Buyers. I created these three reports and they work perfectly. But when I try to add an order date parent filter, the First Time Buyers and Repeating Buyers values come out wrong. Please have a look at the following queries.
I can find out the correct values using sql queries.
MAQL Queries:
TOTAL ORDERS:
SELECT COUNT(NexternalOrderNo) BY CustomerNo WITHOUT PF
TOTAL FIRSTTIMEBUYERS:
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF=1) WITHOUT PF
TOTAL REPEATINGBUYERS:
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF>1) WITHOUT PF
Can any one suggest a logic for finding these values using MAQL
It's not entirely clear what you want to do; if you could provide more details about the report you need, that would be great.
It's not necessary to put "WITHOUT PF" into the metrics. This clause prevents the filter from being applied, so when you remove it, the parent filter will be used, and you will probably get what you want. Specifically, modify this:
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF>1) WITHOUT PF
to:
SELECT COUNT(CustomerNo) WHERE (TOTAL ORDER WO PF>1)
The only thing you are missing here is "ALL IN ALL OTHER DIMENSIONS", aka "ALL OTHER".
This keyword locks and overrides all attributes in all other dimensions, keeping them from having any effect on the metric. You can read more about it in the MAQL Reference Guide.
FIRSTTIMEBUYERS:
SELECT COUNT(CustomerNo)
WHERE (SELECT IFNULL(COUNT(NexternalOrderNo), 0) BY Customer ID, ALL OTHER) = 1
REPEATINGBUYERS:
SELECT COUNT(CustomerNo)
WHERE (SELECT IFNULL(COUNT(NexternalOrderNo), 0) BY Customer ID, ALL OTHER) > 1

SQL convert sample points into durations

This is similar to Compute dates and durations in mysql query, except that I don't have a unique ID column to work with, and I have samples, not start/end points.
As an interesting experiment, I set cron to run ps aux > `date +%Y-%m-%d_%H-%M`.txt. I now have around 250,000 samples of "what the machine was running".
I would like to turn this into a list of "process | cmd | start | stop". The assumption is that a 'start' event is the first sample where the pair existed, and a 'stop' event is the first sample where it stopped existing; there is no chance of a sample being "missing" or anything like that.
That said, what ways exist for doing this transformation, preferably in SQL (on the grounds that I like SQL, and this seems like a nice challenge)? If pids could not be repeated, this would be a trivial task: put everything in a table and take MIN(time) and MAX(time) per pid, as sketched below. However, since PID/cmd pairs are repeated (I checked; there are duplicates), I need a method that does a true "find all contiguous segments" search.
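For reference, that trivial case would be just (a sketch; the table name samples and its columns are hypothetical):
-- Works only if each pid maps to exactly one process run.
SELECT pid, MIN(time) AS start, MAX(time) AS stop
FROM samples
GROUP BY pid;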
If necessary I can do something of the form
Load file0 -> oldList
ForEach fileN:
    Load fileN -> newList
    oldList - newList = closedN
    newList - oldList = openedN
    oldList = newList
But that is not SQL and not interesting. And who knows, I might end up having real SQL data with this property to deal with at some point.
I'm thinking of something where one first constructs a table of diffs, and then joins all closes against all opens and pulls the minimum-distance close after each open, but I'm wondering if there's a better way.
You don't mention which database you are using. Let me assume that you are using a database that supports ranking functions, since that simplifies the solution.
The key to solving this is an observation: you want to assign an id to each run of a pid, so that contiguous appearances can be grouped together. I am going to assume that a pid represents a new process whenever it did not appear in the previous timestamped output.
Now, the idea is:
- Assign a sequential number to each set of output: the first call to ps gets 1, the next gets 2, and so on, based on date.
- Assign a sequential number to each appearance of a pid, based on date: the first appearance gets 1, the next gets 2, and so on.
- For appearances of a pid in consecutive samples, the difference between these two numbers is constant. We can call this the groupid for that run.
So, this is the query in action:
select groupid, pid, min(time), max(time)
from (select t.*,
             (dense_rank() over (order by time) -
              row_number() over (partition by pid order by time)
             ) as groupid
      from t
     ) t
group by groupid, pid
This works in most databases (SQL Server, Oracle, DB2, Postgres, Teradata, among others). It does not work in MySQL before version 8.0, because older MySQL does not support window/analytic functions.
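To see why the difference is constant over a contiguous run, here is a worked toy example (all numbers hypothetical) for one pid that appears in samples 1 through 3, disappears, and reappears in sample 5:
dense_rank (sample #) | row_number (this pid) | groupid
1                     | 1                     | 0
2                     | 2                     | 0
3                     | 3                     | 0
5                     | 4                     | 1
Both counters advance by one while the pid stays present, so their difference changes only when the pid skips a sample; each distinct difference marks one contiguous run.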