How do you iterate through a data frame based on the value in a row - pandas

I have a data frame which I am trying to iterate through, however not based on time, but on an increase of 10 for example
Column A
Column B
12:05
1
13:05
6
14:05
11
15:05
16
so in this case it would return a new data frame with the rows with 1 and 11. How am I able to do this? The different methods that I have tried such as asfreq resample etc. don't seem to work. They say invalid frequency. The reason I think about this is that it is not time based. What is the function that allows me to do this that isn't time based but based on a numerical value such as 10 or 7. I don't want the every nth number, but every time the column value changes by 10 from the last selected value. ex 1 to 11 then if the next values were 12 15 17 21, it would be 21.

here is one way to do it
# do a remainder division, and choose rows where remainder is zero
# offset by the first value, to make calculation simpler
first_val = df.loc[0]['Column B']
df.loc[((df['Column B'] - first_val) % 10).eq(0)]
Column A Column B
0 12:05 1
2 14:05 11

Related

Find max and last value from a googlesheet query skipping x rows

I have a data set in google sheets, for each week of data I have 3 rows. I wish to query the data in every second row to calculate the max value and the last value.
For instance:
ROW
DATA
1
800
2
Text
3
500
4
More text
5
600
6
Blah
7
700
8
Blah
For Max value I have the following which will return 800
MAX(FILTER(QUERY(A1:A,"Select * skipping 2"), QUERY(A1:A,"Select * skipping 2") <> 0))
How do I change it up to return the last value? Which should return 700
try:
=LOOKUP(2^99,FILTER(A:A,A:A<>0))
#rockinfreakshow answer will successfully find the last number.
To filter a range by n amount of rows, you can use:
=FILTER(A:A,MOD(ROW(A:A),n)=1)
Change n with your desired value, and 1 with the number of row you want to get. 1 for the first, 2 for the second, but 0 if you want the nth one. To find MAX, just wrap it in MAX()
To find the last one, even if it's a text or number, you can use SORTN and SEQUENCE:
=SORTN(FILTER(A:A,MOD(ROW(A:A),n)=1,A:A<>""),1,1, SEQUENCE(COUNTA(FILTER(A:A,MOD(ROW(A:A),n)=1,A:A<>""))),0)
It orders the elements in reverse order and only chooses the first one
Remember to change n with the number of rows and =1 with the number of row you want to choose

Collapse pandas DataFrame based on daily column value

I have a pandas DataFrame with multiple measurements per day (for example hourly measurements, but that is not necessarily the case), but I want to keep only the hour for which a certain column is the daily minimum.
My one day in my data frame looks somewhat like this
DATE Value Distance
17 1979-1-2T00:00:00.0 15.5669870447436 34.87
18 1979-1-2T01:00:00.0 81.6306803714536 31.342
19 1979-1-2T02:00:00.0 83.1854759740486 33.264
20 1979-1-2T03:00:00.0 23.8659679630303 32.34
21 1979-1-2T04:00:00.0 63.2755504429306 31.973
22 1979-1-2T05:00:00.0 91.2129044773733 34.091
23 1979-1-2T06:00:00.0 76.493130052689 36.837
24 1979-1-2T07:00:00.0 63.5443183375785 34.383
25 1979-1-2T08:00:00.0 40.9255407683688 35.275
26 1979-1-2T09:00:00.0 54.5583051827551 32.152
27 1979-1-2T10:00:00.0 26.2690011881422 35.104
28 1979-1-2T11:00:00.0 71.3059740399097 37.28
29 1979-1-2T12:00:00.0 54.0111262724049 38.963
30 1979-1-2T13:00:00.0 91.3518048568241 36.696
31 1979-1-2T14:00:00.0 81.7651763485069 34.832
32 1979-1-2T15:00:00.0 90.5695814525067 35.473
33 1979-1-2T16:00:00.0 88.4550315358515 30.998
34 1979-1-2T17:00:00.0 41.6276969038137 32.353
35 1979-1-2T18:00:00.0 79.3818377264749 30.15
36 1979-1-2T19:00:00.0 79.1672568582629 37.07
37 1979-1-2T20:00:00.0 1.48337999844262 28.525
38 1979-1-2T21:00:00.0 87.9110385474789 38.323
39 1979-1-2T22:00:00.0 38.6646421460678 23.251
40 1979-1-2T23:00:00.0 88.4920153764757 31.236
I would like to keep all rows that have the minimum "distance" per day, so for the one day shown above, one would have only one row left (the one with index value 39). I know how to collapse the data frame so that I only have the Distance column left. I can do that - if I first set the DATE as index - with
df_short = df.groupby(df.index.floor('D'))["Distance"].min()
But I also want the Value column in my final result. How do I keep all columns?
It doesn't seem to work if I do
df_short = df.groupby(df.index.floor('D')).min(["Distance"])
This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does.
Maybe this is already posted somewhere, but I have trouble finding it.
You can use aggregate
df_short = df.groupby(df.index.floor('D')).agg({'Distance': min, 'Value': max})
If you want the kept Value column is the same with minimum of Distance column:
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
Make a datetime Index:
df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)
Resample and find the min Distance's location:
df.loc[df.resample('D')['Distance'].idxmin()]
Output:
Value Distance
DATE
1979-01-02 22:00:00 38.664642 23.251

Pandas count rows before/after after current row

I need to calculate some measures on a window of my dataframe, with the value of interest in the centre of the window. To be more clear I use an example: if I have a dataset of 10 rows and a window size of 2, when I am in the 5th row I need to compute for example the mean of the values in 3rd, 4th, 5th, 6th and 7th row. When I am in the first row, I will not have the previous rows so I need to use only the following ones (so in the example, to compute the mean of 1st, 2nd and 3rd rows); if there are some rows but not enough, I need to use all the rows that are present (so fpr example if I am in the 2nd row, I will use 1st, 2nd, 3rd and 4th). How can I do that? As the title of my question suggest, the first idea I had was to count the number of rows preceding and following the current one, but I don't know how to do that. I am not forced to use this method, so if you have any suggestions on a better method feel free to share it.
What you want is a rolling mean with min_periods=1, center=True:
df = pd.DataFrame({'col': range(10)})
N = 2 # numbers of rows before/after to include
df['rolling_mean'] = df['col'].rolling(2*N+1, min_periods=1, center=True).mean()
output:
col rolling_mean
0 0 1.0
1 1 1.5
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 6 6.0
7 7 7.0
8 8 7.5
9 9 8.0
I assume that you have the target_row and window_size numbers as an input. You are trying to do an operation on a window_size of rows around the target_row in a dataframe df, and I gather from your question that you already know that you can't just grab +/- the window size, because it might exceed the size of the dataframe. Instead, just quickly define the resulting start and end rows based on the dataframe size, and then pull out the window you want:
start_row = max(target_row - window_size, 0)
end_row = min(target_row + window_size, len(df)-1)
window = df.iloc[start_row:end_row+1,:]
Then you can perform whatever operation you want on the window such as taking an average with window.mean().

SQL Max Consecutive Values in a number set using recursion

The following SQL query is supposed to return the max consecutive numbers in a set.
WITH RECURSIVE Mystery(X,Y) AS (SELECT A AS X, A AS Y FROM R)
UNION (SELECT m1.X, m2.Y
FROM Mystery m1, Mystery m2
WHERE m2.X = m1.Y + 1)
SELECT MAX(Y-X) + 1 FROM Mystery;
This query on the set {7, 9, 10, 14, 15, 16, 18} returns 3, because {14 15 16} is the longest chain of consecutive numbers and there are three numbers in that chain. But when I try to work through this manually I don't see how it arrives at that result.
For example, given the number set above I could create two columns:
m1.x
m2.y
7
7
9
9
10
10
14
14
15
15
16
16
18
18
If we are working on rows and columns, not the actual data, as I understand it WHERE m2.X = m1.Y + 1 takes the value from the next row in Y and puts it in the current row of X, like so
m1.X
m2.Y
9
7
10
9
14
10
15
14
16
15
18
16
18
Null?
The main part on which I am uncertain is where in the SQL recursion actually happens. According to Denis Lukichev recursion is the R part - or in this case the RECURSIVE Mystery(X,Y) - and stops when the table is empty. But if the above is true, how would the table ever empty?
Since I don't know how to proceed with the above, let me try a different direction. If WHERE m2.X = m1.Y + 1 is actually a comparison, the result should be:
m1.X
m2.Y
14
14
15
15
16
16
But at this point, it seems that it should continue recursively on this until only two rows are left (nothing else to compare). If it stops here to get the correct count of 3 rows (2 + 1), what is actually stopping the recursion?
I understand that for the above example the MAX(Y-X) + 1 effectively returns the actual number of recursion steps and adds 1.
But if I have 7 consecutive numbers and the recursion flows down to 2 rows, should this not end up with an incorrect 3 as the result? I understand recursion in C++ and other languages, but this is confusing to me.
Full disclosure, yes it appears this is a common university question, but I am retired, discovered this while researching recursion for my use, and need to understand how it works to use similar recursion in my projects.
Based on this db<>fiddle shared previously, you may find it instructive to alter the CTE to include an iteration number as follows, and then to show the content of the CTE rather than the output of final SELECT. Here's an amended CTE and its content after the recursion is complete:
Amended CTE
WITH RECURSIVE Mystery(X,Y) AS ((SELECT A AS X, A AS Y, 1 as Z FROM R)
UNION (SELECT m1.X, m2.A, Z+1
FROM Mystery m1
JOIN R m2 ON m2.A = m1.Y + 1))
CTE Content
x
y
z
7
7
1
9
9
1
10
10
1
14
14
1
15
15
1
16
16
1
18
18
1
9
10
2
14
15
2
15
16
2
14
16
3
The Z field holds the iteration count. Where Z = 1 we've simply got the rows from the table R. The, values X and Y are both from the field A. In terms of what we are attempting to achieve these represent sequences consecutive numbers, which start at X and continue to (at least) Y.
Where Z = 2, the second iteration, we find all the rows first iteration where there is a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That becomes the new highest number, and we add one to the number of iterations. As only three numbers in our original data set have successors within the set, there are only three rows output in the second iteration.
Where Z = 3, the third iteration, we find all the rows of the second iteration (note we are not considering all the rows of the first iteration again), where there is, again, a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That, again, becomes the new highest number, and we add one to the number of iterations.
The process will attempt a fourth iteration, but as there are no rows in R where the value is one more than the Y values from our third iteration, no extra data gets added to the CTE and recursion ends.
Going back to the original db<>fiddle, the process then searches our CTE content to output MAX(Y-X) + 1, which is the maximum difference between the first and last values in any consecutive sequence, plus one. This finds it's value from the record produced in the third iteration, using ((16-14) + 1) which has a value of 3.
For this specific piece of code, the output is always equivalent to the value in the Z field as every addition of a row through the recursion adds one to Z and adds one to Y.

How to create new columns using groupby based on logical expressions

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns, 'MT_Value','M_Value', and 'T_Data', one who has the mean of the data grouped by year and month, which I accomplished by doing this.
data.groupby(['Year','Month']).mean()
But for M_value I need to do the mean of only the values different from zero, and for T_Data I need the count of the values that are zero divided by the total of values, I guess that for the last one I need to divide the amount of values that are zero by the amount of total data grouped, but honestly I am a bit lost. I looked on google and they say something about transform but I didn't understood very well
Thank you.
You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor!=0),
T_Data=data.Valor.eq(0))
.groupby(['Year','Month'])
[['Valor','M_Value','T_Data']]
.mean()
)
Explanation: assign will create new columns with respective names. Now
data.Valor.where(data.Valor!=0) will replace 0 values with nan, which will be ignored when we call mean().
data.Valor.eq(0) will replace 0 with 1 and other values with 0. So when you do mean(), you compute count(Valor==0)/total_count().
Output:
Valor M_Value T_Data
Year Month
1970 1 2.306452 6.500000 0.645161
2 1.507143 4.688889 0.678571
3 2.064516 7.111111 0.709677
4 11.816667 13.634615 0.133333
5 7.974194 11.236364 0.290323
... ... ... ...
1997 10 3.745161 7.740000 0.516129
11 11.626667 21.800000 0.466667
12 0.564516 4.375000 0.870968
1998 1 2.000000 15.500000 0.870968
2 1.545455 5.666667 0.727273
[331 rows x 3 columns]