How long "1" is - sql

I have a question about SQL Server 2008 R2.
I have the following data (screenshot of the SQL query and its output):
I am trying to calculate the difference in time between consecutive records where wartosc = 1 and wartosc = 0. I then want to sum this difference for all records.
For instance in my example:
Row 1 has wartosc = 1 and czas = 2016-07-14 22:01:36 and
Row 2 has wartosc = 0 and czas = 2016-07-14 22:02:06
I would like to find the time difference between Row 1 and Row 2. I would like to do this for all records and sum the resulting time differences.
A row is added whenever the value changes; now I need to sum the time between each 1-row and the 0-row that follows it, for example:
(2016-07-14 22:02:06 - 2016-07-14 22:01:36) + (2016-07-14 22:04:56
- 2016-07-14 22:02:11) + (2016-07-14 22:17:01 - 2016-07-14 22:14:56) ...
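
For what it's worth, a minimal sketch of one way to compute this, assuming the rows live in a table called pomiary (a placeholder name) with the wartosc and czas columns from the screenshot. SQL Server 2008 R2 has no LEAD(), so each row is paired with the next one via ROW_NUMBER() and a self-join:

-- pomiary is a hypothetical table name standing in for the real one
WITH ordered AS (
    SELECT wartosc, czas,
           ROW_NUMBER() OVER (ORDER BY czas) AS rn
    FROM pomiary
)
SELECT SUM(DATEDIFF(SECOND, a.czas, b.czas)) AS total_seconds
FROM ordered a
JOIN ordered b ON b.rn = a.rn + 1   -- b is the row immediately after a
WHERE a.wartosc = 1
  AND b.wartosc = 0;

Because a row is only written when the value changes, every 1-row (except possibly the last) is immediately followed by a 0-row, so summing those pairwise gaps gives the total time spent at 1.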

Related

updating the next several row values based on the value of a row in another column

I'm trying to figure out how to add the values of one column (the amount column) to the next few rows, based on the value of another column (the days column). If days is greater than 1, I add the amount to that many following rows; so if days is three, I add the amount to the next two rows (the first day is the current row itself). I think this is easier with a copy of the amount column, so I made a copy called backlog.
So let's say the amount column represents the number of support tickets that need to be resolved each day, and each amount takes some number of days to resolve. I need the total for a day to be the sum of that day's value and the outstanding tickets from earlier days. So if I have an amount of 1 that takes 2 days, I count 1 ticket today and add that same 1 to tomorrow's ticket amount. If this doesn't make sense, the examples below will. I have a solution as well, but my main issue is doing this efficiently.
Here is a sample dataframe to use:
import random

import numpy as np
import pandas as pd

amount = list(np.zeros(10)) + [random.randint(1, 3) for val in range(15)]
random.shuffle(amount)
ex = pd.DataFrame({
    'Amount': amount
})
ex.loc[ex['Amount'] > 0, 'Days'] = [random.randint(0, 4) for val in range(15)]
ex.loc[ex['Amount'] == 0, 'Days'] = 0
ex['Days'] = ex['Days'].astype(int)
ex['Backlog'] = ex['Amount']
ex.head(10)
Input Dataframe:
Amount  Days  Backlog
     2     0        2
     1     3        1
     2     2        2
     3     0        3
Desired Output Dataframe:
Amount  Days  Backlog
     2     0        2
     1     3        1
     2     2        3
     3     0        6
In the last two values of the backlog column, I have a value of 3 (2 from the current day amount plus 1 from the prior day amount) and a value of 6 (3 for the current day + 2 from the previous day amount + 1 from two days ago).
I have made code for this below, which I think achieves the outcome:
for i in range(len(ex['Amount'])):
    days = ex['Days'].iloc[i]
    if days >= 2:
        for j in range(1, days):
            if i + j >= len(ex['Amount']):
                break
            # .loc avoids pandas' chained-assignment pitfall here
            ex.loc[ex.index[i + j], 'Backlog'] += ex['Amount'].iloc[i]
The problem is that I'm already using two for loops to slice the dataframe for two features, so when this code runs as a function over a very large dataframe it is far too slow, and my main goal has been to implement a faster way to do this. Is there a more efficient pandas method to achieve the same outcome, ideally without slow iteration or a nested for loop? I'm at a loss.
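
One possible direction, sketched under the assumptions above (Backlog starts out equal to Amount): enumerate every (source row, offset) spill pair up front, then apply them all in a single np.add.at call instead of a nested Python loop.

import numpy as np

amounts = ex['Amount'].to_numpy()
days = ex['Days'].to_numpy()
backlog = amounts.astype(float).copy()

# one entry per "spill": row i contributes to rows i+1 .. i+days[i]-1
rows = np.repeat(np.arange(len(ex)), np.maximum(days - 1, 0))
offsets = np.concatenate([np.arange(1, d) for d in days])
targets = rows + offsets
valid = targets < len(amounts)   # drop spills that run past the end of the frame
np.add.at(backlog, targets[valid], amounts[rows[valid]])

ex['Backlog'] = backlog

np.add.at accumulates repeated target indices correctly, which a plain backlog[targets] += ... would not; the only remaining Python-level loop is the short list comprehension building the offsets.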

Calculate time difference in milliseconds using Pandas

I have a dataframe timings as follows:
start_ms end_ms
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z
I'm trying to calculate the time difference between the start_ms and end_ms of each row in milliseconds, i.e. I wish to get the result
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
I can convert the timestamps to datetime column by column, but I'm not sure whether the order of the values is retained.
start_ms_time = pd.to_datetime(timings['start_ms'])
end_ms_time = pd.to_datetime(timings['end_ms'])
Is it possible to convert the timestamps to datetime inside timings, and add the time difference column? Do I even need to convert to get the difference? How do I calculate the time difference in milliseconds?
Subtract the columns with Series.sub, then take Series.dt.components:
start_ms_time = pd.to_datetime(timings['start_ms'])
end_ms_time = pd.to_datetime(timings['end_ms'])
timings['diff'] = end_ms_time.sub(start_ms_time).dt.components.milliseconds
print(timings)
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
Or use Series.dt.total_seconds, multiply by 1000 and cast to integers; unlike components.milliseconds, which returns only the millisecond component, this stays correct when a difference reaches a full second or more:
timings['diff'] = end_ms_time.sub(start_ms_time).dt.total_seconds().mul(1000).astype(int)
print(timings)
start_ms end_ms diff
0 2020-09-01T08:11:19.336Z 2020-09-01T08:11:19.336Z 0
1 2020-09-01T08:11:20.652Z 2020-09-01T08:11:20.662Z 10
2 2020-09-01T08:11:20.670Z 2020-09-01T08:11:20.688Z 18
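
For reference, a self-contained sketch of the whole flow with the question's example values, converting in place inside timings; row order is preserved because these are element-wise column operations aligned on the index:

import pandas as pd

timings = pd.DataFrame({
    'start_ms': ['2020-09-01T08:11:19.336Z',
                 '2020-09-01T08:11:20.652Z',
                 '2020-09-01T08:11:20.670Z'],
    'end_ms': ['2020-09-01T08:11:19.336Z',
               '2020-09-01T08:11:20.662Z',
               '2020-09-01T08:11:20.688Z'],
})

# convert both columns in place, then subtract directly
timings['start_ms'] = pd.to_datetime(timings['start_ms'])
timings['end_ms'] = pd.to_datetime(timings['end_ms'])
timings['diff'] = (timings['end_ms'] - timings['start_ms']).dt.total_seconds().mul(1000).astype(int)
print(timings)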

Grouping rows so a column sums to no more than 10 per group

I have a table that looks like:
col1
------
2
2
3
4
5
6
7
with values sorted in ascending order.
I want to assign each row to groups with labels 0,1,...,n so that each group has a total of no more than 10. So in the above example it would look like this:
col1 |label
------------
2 0
2 0
3 0
4 1
5 1
6 2
7 3
I tried using this:
floor(sum(col1) OVER (ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / 10)
But this doesn't work correctly, because it is performing the operations as:
floor(2/10) = 0
floor([2+2]/10) = 0
floor([2+2+3]/10) = 0
floor([2+2+3+4]/10) = 1
floor([2+2+3+4+5]/10 = 1
floor([2+2+3+4+5+6]/10 = 2
floor([2+2+3+4+5+6+7]/10) = 2
It's all coincidentally correct until the last calculation, because even though
[2+2+3+4+5+6+7] / 10 = 2.9
and
floor(2.9) = 2
what it should do is realise that 6 + 7 > 10, so the row with value 7 needs to be in its own group: increment the group number and allocate that row to the new group.
What I really want is this: whenever the running sum exceeds 10, set group number = group number + 1, allocate the CURRENT ROW to this new group, and restart the running sum from the CURRENT ROW.
This is too long for a comment.
Solving this problem requires scanning the table, row-by-row. In SQL, this would be through a recursive CTE (or hierarchical query). Hive supports neither of these.
The issue is that each time a group is defined, the difference between 10 and the sum is "forgotten". That is, when you are further down in the list, what happens earlier on is not a simple accumulation of the available data. You need to know how it was split into groups.
A related problem is solvable. The related problem would assign all rows to groups of size 10, splitting rows between two groups. Then you would know what group a later row is in based only on the cumulative sum of the previous rows.
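
Where recursion is available, the row-by-row scan itself is short. A rough sketch as a recursive CTE (the table name t is a placeholder; this will not run on Hive): carry the running sum forward and open a new label whenever adding the next value would push the sum past 10.

-- t is a hypothetical table holding the sorted col1 values
WITH nums AS (
    SELECT col1, ROW_NUMBER() OVER (ORDER BY col1) AS rn
    FROM t
),
grp AS (
    SELECT rn, col1, col1 AS running_sum, 0 AS label
    FROM nums
    WHERE rn = 1
    UNION ALL
    SELECT n.rn, n.col1,
           CASE WHEN g.running_sum + n.col1 > 10
                THEN n.col1 ELSE g.running_sum + n.col1 END,
           CASE WHEN g.running_sum + n.col1 > 10
                THEN g.label + 1 ELSE g.label END
    FROM grp g
    JOIN nums n ON n.rn = g.rn + 1
)
SELECT col1, label
FROM grp
ORDER BY rn;

On the sample data this yields labels 0, 0, 0, 1, 1, 2, 3 with group totals 7, 9, 6 and 7, all of which stay at or under 10.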

Calculate diff() between selected rows

I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
       time  bit
index
0      0.24    0
1     0.245    0
2      0.47    1
3     0.471    1
4     0.479    0
5      0.58    1
...     ...  ...
I want to select those rows where the time difference is, let's say, <0.01 s, but only the differences between rows with bit 1 and bit 0. So in the above example I would select only rows 3 and 4 (or either one of them). I thought I would calculate the diff() of the time column, but I also need to somehow select on the 0/1 bit.
Coming from the future to answer this one. You can apply a function to the dataframe that finds the indices of the rows that adhere to the condition and returns the row pairs accordingly:
def filter_(x, threshold=0.01):
    # rows where the gap to the previous row is small and the bit flipped
    indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
    mask = indices.union(indices - 1)  # include the row preceding each hit
    return x[mask]

print(df.apply(filter_, args=(0.01,)))
Output:
time bit
3 0.471 1
4 0.479 0
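
As an aside, the same selection can be sketched without apply at all by building the boolean mask once (column names as in the question):

hit = (df['time'].diff() < 0.01) & (df['bit'].diff().abs() == 1)
pairs = hit | hit.shift(-1, fill_value=False)   # each hit plus the row before it
print(df[pairs])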

SQL query to find unique values

I need to write a query that collapses the input based on a selection and outputs another table. The selection criteria are as follows: for each ID, look through the AGREE column for a Y; if there is no Y, output 0; if there is a single Y, output that year; if there are multiple Ys, output the most recent year.
Input table:
ID  AGREE  YEAR
1   N      2003
2   Y      2005
2   N      2015
3   N      2005
3   N      2007
3   Y      2011
3   Y      1999
4   N      2005
4   N      2010
Output table:
ID  AGREE  YEAR
1   N      0
2   Y      2005
3   Y      2011
4   N      0
Here is my solution:
SELECT id, MAX(agree), MAX(CASE WHEN agree = 'Y' THEN year ELSE 0 END)
FROM [input table]
GROUP BY id
It rests on grouping by the id field and using MAX to return "Y" when it is present for the group ("Y" sorts after "N"), and to return the largest year where agree is "Y". Note that you say "most recent": if the table contains years in the future, this returns the year furthest into the future rather than the most recent one.
Note: there is an alternative approach using sub-queries that is often faster. If you run into performance issues, it is worth pursuing.
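
A hedged sketch of that sub-query variant, keeping [input table] as the answer's placeholder name: collect the distinct IDs once, then left-join each ID's maximum year over its Y rows.

-- [input table] is the placeholder table name used above
SELECT t.id,
       CASE WHEN y.max_year IS NULL THEN 'N' ELSE 'Y' END AS agree,
       COALESCE(y.max_year, 0) AS year
FROM (SELECT DISTINCT id FROM [input table]) t
LEFT JOIN (SELECT id, MAX(year) AS max_year
           FROM [input table]
           WHERE agree = 'Y'
           GROUP BY id) y
       ON y.id = t.id;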