Is there an efficient way for pandas to get tail rows with a condition?

I want to get the tail rows that satisfy a condition.
For example:
I want to get all the negative rows at the tail of column 'A', like:
test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})
I expect a 'method' that returns a new frame like:
A
0 -1
1 -2
2 -3
Note that the number of negative values in the tail is not known in advance, so I cannot simply call test.tail(3).
It looks like the tail() function pandas provides only works with a fixed number of rows.
My input data frame might be very large, and I don't want to run a simple loop checking rows one by one.
Is there a smart way to do that?

Is this what you wanted?
test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})
test = test.iloc[::-1]
test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1]
Output:
A
9 -3
8 -2
7 -1
Edit: if you want to get it back into the original order:
test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1].iloc[::-1]
A
7 -1
8 -2
9 -3
Optionally add .reset_index(drop=True) if you need an index starting at 0.

What's the tail for? It seems like you just need the negative numbers
test.query("A < 0")
Update: find where the sign changes, split the array, and choose the last piece
split_points = (test.A.shift(1)<0) == (test.A<0)
np.split(test, split_points.loc[lambda x: x==False].index.tolist())[-1]
Output:
A
7 -1
8 -2
9 -3

Just sharing a picture comparing the performance of the two answers above.
Thanks Patryk and Macro.

I improved my test above and ran another round, as I felt the old sample size was too small and I was afraid the %%time measurement might not be accurate.
My new test uses a very large head of 10,000,000 numbers and a tail of 3 negative numbers,
so it can show how the overall data frame size impacts performance.
The code is below:
%%time
arr = np.arange(1,10000000,1)
arr = np.concatenate((arr, [-2,-3,-4]))
test = pd.DataFrame({'A':arr})
test = test.iloc[::-1]
test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1].iloc[::-1]
%%time
arr = np.arange(1,10000000,1)
arr = np.concatenate((arr, [-2,-3,-4]))
test = pd.DataFrame({'A':arr})
split_points = (test.A.shift(1)<0) == (test.A<0)
np.split(test, split_points.loc[lambda x: x==False].index.tolist())[-1]
To account for system noise I ran the test 10 times; the two methods perform very similarly, and in about 50% of the runs Patryk's code is even faster.
Check out the image below.
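As an aside (not part of the original post), here is a minimal sketch that times the two approaches with timeit, which repeats the call and is usually more stable than a single %%time cell; it reuses the same test data and simply wraps each answer's code in a function:
import timeit

import numpy as np
import pandas as pd

# Same test data as above: a long non-negative head and a 3-element negative tail
arr = np.concatenate((np.arange(1, 10000000), [-2, -3, -4]))
test = pd.DataFrame({'A': arr})

def tail_by_slicing(df):
    # First answer's approach: reverse, slice up to the first non-negative row, reverse back
    rev = df.iloc[::-1]
    return rev.loc[rev.index.max():rev[rev['A'].ge(0)].index[0] + 1].iloc[::-1]

def tail_by_splitting(df):
    # Second answer's approach: split where the sign changes and keep the last piece
    split_points = (df.A.shift(1) < 0) == (df.A < 0)
    return np.split(df, split_points.loc[lambda x: x == False].index.tolist())[-1]

print(timeit.timeit(lambda: tail_by_slicing(test), number=10))
print(timeit.timeit(lambda: tail_by_splitting(test), number=10))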

Related

Pandas DataFrame: How to drop continuous rows gracefully

I have a data frame with two types of rows: SWITCH and RESULT.
My expectation is to drop adjacent "SWITCH" rows and keep only the last SWITCH of each block, while keeping all the RESULT rows.
I did it with DataFrame.iterrows(), basically scanning line by line, which is not pythonic.
Can you advise if you see a better way?
Below are the sample data and the code I'm using:
import pandas as pd

data = {'TYPE': ['SWITCH', 'SWITCH', 'SWITCH',
                 'SWITCH', 'RESULT', 'RESULT', 'RESULT',
                 'RESULT', 'RESULT', 'SWITCH', 'SWITCH',
                 'RESULT', 'RESULT', 'RESULT', 'RESULT'],
        'RESULT': ['YES',
                   'NO', 'NO', 'YES',
                   'DONE', 'DONE', 'DONE',
                   'DONE', 'DONE', 'NO',
                   'YES', 'DONE', 'DONE',
                   'DONE', 'DONE']}
df = pd.DataFrame(data)

l = []
start = -1
for index, row in df.iterrows():
    type = row["TYPE"]
    if type == "RESULT":
        if start == -1:
            start = index
    elif type == "SWITCH":
        if start == -1:
            df.drop(index=[*range(index, index+1, 1)], inplace=True)
            continue
        end = index - 1
        if start <= end:
            df.drop(index=[*range(start, end, 1)], inplace=True)
        start = index + 1
print(df)
I just checked the output and found my code doesn't do what I'm looking for.
Before applying the code:
Since index 0 through index 3 are all "SWITCH", I want to drop indexes 0/1/2 and keep only index 3, as this is one "block of SWITCH".
Similarly, for indexes 9/10 I want to keep index 10 only.
TYPE RESULT
0 SWITCH YES
1 SWITCH NO
2 SWITCH NO
3 SWITCH YES
4 RESULT DONE
5 RESULT DONE
6 RESULT DONE
7 RESULT DONE
8 RESULT DONE
9 SWITCH NO
10 SWITCH YES
11 RESULT DONE
12 RESULT DONE
13 RESULT DONE
14 RESULT DONE
Expected output:
TYPE RESULT
3 SWITCH YES
4 RESULT DONE
5 RESULT DONE
6 RESULT DONE
7 RESULT DONE
8 RESULT DONE
10 SWITCH YES
11 RESULT DONE
12 RESULT DONE
13 RESULT DONE
14 RESULT DONE
Actual output:
TYPE RESULT
8 RESULT DONE
9 SWITCH NO
10 SWITCH YES
11 RESULT DONE
12 RESULT DONE
13 RESULT DONE
14 RESULT DONE
If I understand you correctly, for each group of consecutive rows with TYPE == "SWITCH" you want to keep the last row. This can be done as follows:
df_processed = df[(df.TYPE != "SWITCH") | (df.TYPE.shift(-1) != "SWITCH")]
The output for the provided example data matches the expected output above.
Iterating the rows of a dataframe is considered bad practice and should be avoided.
I believe you are looking for something along these lines:
# Get the rows where TYPE == RESULT
df_type_result = df[df['TYPE'] == 'RESULT']
# Get the last index when the result type == SWITCH
idxs = df.reset_index().groupby(['TYPE', 'RESULT']).last().loc['SWITCH']['index']
df_type_switch = df.loc[idxs]
# Concatenate and sort the results
df_result = pd.concat([df_type_result, df_type_switch]).sort_index()
df_result
A lazy solution
df["DROP"] = df["TYPE"].shift(-1)
df = df.loc[~((df["TYPE"]=="SWITCH")&(df["DROP"]=="SWITCH"))]
df.drop(columns="DROP", inplace=True)
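For reference, another common way to express "a block of consecutive SWITCH rows" (not taken from the answers above) is to label runs of equal TYPE with a shift/cumsum trick and keep the last row of each run; a minimal sketch reusing the df from the question:
# Sketch only: label consecutive runs of TYPE, then keep all RESULT rows
# plus the last row of every SWITCH run
block_id = (df['TYPE'] != df['TYPE'].shift()).cumsum()
last_in_block = df.groupby(block_id).cumcount(ascending=False) == 0
df_result = df[(df['TYPE'] != 'SWITCH') | last_in_block]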

How can I optimize my for loop so that it can run on a 320,000-row DataFrame?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses up to a given match, taking the previous matches into account.
The problem is that I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds, which is why I think it's a computation-time problem.
I also tried to first use groupby("clubid") and then run my for loop within each group, but I got lost.
Something else that bothers me: I have at least 2 rows with exactly the same date/hour, because there are at least two identical dates per match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "date":     [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
    "othercol": ["a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"]
})

temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
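If the goal is a running count of wins and losses per club at each match, rather than overall totals (that is one reading of "taking into account the previous match"), here is a hedged sketch under that assumption, building on the dat frame above:
# Hypothetical sketch: running win/loss counts per club, ordered by date,
# counting the current match as well (mirroring the >= comparison in the loop)
dat = dat.sort_values(["clubid", "date"])
dat["NW_tot"] = dat.groupby("clubid")["win_bool"].cumsum()
dat["NL_tot"] = dat.groupby("clubid").cumcount() + 1 - dat["NW_tot"]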

How do I combine and aggregate Data Frame rows

I have a data frame which looks somewhat like this:
endPoint power time
device1 -4 0
device2 3 0
device3 -2 0
device4 0 0
device5 5 0
device6 -5 0
device1 4 1
device2 -3 1
device3 5 1
device4 -2 1
device5 1 1
device6 4 1
....
device1 6 x
device2 -5 x
device3 4 x
device4 3 x
device5 -1 x
device6 1 x
I want to change it into something like this:
span powerAboveThreshold time
device1-device3 true 0
device2-device6 true 0
...
devicex-devicey false w
I want to aggregate rows into two new columns and group this by time and span. The value of powerAboveThreshold depends on whether or not the power for either device in the span is above 0 - so if devicex or devicey is below 0 then it will be false.
As a side-note, there is one span of devices which contains 4 devices - whereas the rest contain just 2 devices. I need to design with this in mind.
device1-device3-device6-device2
I am using the Apache Spark DataFrame API/Spark SQL to accomplish this.
edit:
Could I convert the DataFrame to an RDD and compute it that way?
edit2:
Follow-up questions to Daniel L:
Seems like a great answer from what I have understood so far. I have a few questions:
Will the RDD have the expected structure if I convert it from a DataFrame?
What is going on in this part of the program? .aggregateByKey(Result())((result, sample) => aggregateSample(result, sample), addResults). I see that it runs aggregateSample() with each key-value pair (result, sample), but how does the addResults call work? Will it be called on each item relating to a key to add each successive Result generated by aggregateSample to the previous ones? I don't fully understand.
What is .map(_._2) doing?
In what situation will result.span be empty in the aggregateSample function?
In what situation will res1.span be empty in the addResults function?
Sorry for all of the questions, but I'm new to functional programming, Scala, and Spark so I have a lot to wrap my head around!
I'm not sure you can do the text concatenation as you want on DataFrames (maybe you can), but on a normal RDD you can do this:
val rdd = sc.makeRDD(Seq(
  ("device1", -4, 0),
  ("device2", 3, 0),
  ("device3", -2, 0),
  ("device4", 0, 0),
  ("device5", 5, 0),
  ("device6", -5, 0),
  ("device1", 4, 1),
  ("device2", -3, 1),
  ("device3", 5, 1),
  ("device4", 1, 1),
  ("device5", 1, 1),
  ("device6", 4, 1)))

val spanMap = Map(
  "device1" -> 1,
  "device2" -> 1,
  "device3" -> 1,
  "device4" -> 2,
  "device5" -> 2,
  "device6" -> 1
)

case class Result(var span: String = "", var aboveThreshold: Boolean = true, var time: Int = -1)

def aggregateSample(result: Result, sample: (String, Int, Int)) = {
  result.time = sample._3
  result.aboveThreshold = result.aboveThreshold && (sample._2 > 0)
  if (result.span.isEmpty)
    result.span += sample._1
  else
    result.span += "-" + sample._1
  result
}

def addResults(res1: Result, res2: Result) = {
  res1.aboveThreshold = res1.aboveThreshold && res2.aboveThreshold
  if (res1.span.isEmpty)
    res1.span += res2.span
  else
    res1.span += "-" + res2.span
  res1
}

val results = rdd
  .map(x => (x._3, spanMap.getOrElse(x._1, 0)) -> x) // Create a key to aggregate with, by time and span
  .aggregateByKey(Result())((result, sample) => aggregateSample(result, sample), addResults)
  .map(_._2)

results.collect().foreach(println(_))
And it prints this, which is what I understood you needed:
Result(device4-device5,false,0)
Result(device4-device5,true,1)
Result(device1-device2-device3-device6,false,0)
Result(device1-device2-device3-device6,false,1)
Here I use a map that tells me which devices go together (for your pairings and the 4-device exception); you might want to replace it with some other function, hardcode it as a static function to avoid serialization, or use a broadcast variable.
=================== Edit ==========================
Seems like a great answer from what I have understood so far.
Feel free to upvote/accept it, it helps me and others looking for things to answer :-)
Will the RDD have the expected structure if I convert it from a DataFrame?
Yes, the main difference is that a DataFrame includes a schema, so it can better optimize the underlying calls. It should be trivial to use this schema directly or map it to the tuples I used as an example; I did it mostly for convenience. Hernan just posted another answer that shows some of this (and also copied the initial test data I used, for convenience), so I won't repeat that piece. But as he mentions, your device-span grouping and presentation is tricky, which is why I preferred a more imperative approach on an RDD.
What is going on in this part of the program? .aggregateByKey(Result())((result, sample) => aggregateSample(result, sample), addResults). I see that it runs aggregateSample() with each key-value pair (result, sample), but how does the addResults call work? Will it be called on each item relating to a key to add each successive Result generated by aggregateSample to the previous ones? I don't fully understand.
aggregateByKey is a very efficient function. To avoid shuffling all the data from one node to another before merging, it first aggregates samples locally into a single result per key (the first function). Then it shuffles these partial results around and merges them (the second function).
What is .map(_._2) doing?
It simply discards the key from the key/value RDD after the aggregation; you no longer care about it at that point, so I just keep the result.
In what situation will result.span be empty in the aggregateSample function?
In what situation will res1.span be empty in the addResults function?
When you do aggregation, you need to provide a "zero" value. For instance, if you were aggregating numbers, Spark would do (0 + firstValue) + secondValue, etc. The if clause prevents adding a spurious '-' before the first device name, since we only put it between devices; it is no different from handling an extra comma in a list of items. Check the documentation and samples for aggregateByKey, they will help you a lot.
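To illustrate the zero value and the two functions (this example is not from the original answer), here is a tiny PySpark sketch, assuming an active SparkContext named sc:
# Hypothetical example: sum values per key with aggregateByKey
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
sums = pairs.aggregateByKey(
    0,                       # the "zero" value each per-partition accumulator starts from
    lambda acc, v: acc + v,  # merge one value into the local accumulator (first function)
    lambda a, b: a + b)      # merge accumulators across partitions (second function)
print(sums.collect())        # e.g. [('a', 3), ('b', 3)]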
This is the implementation for dataframes (without the concatenated names):
val data = Seq(
  ("device1", -4, 0),
  ("device2", 3, 0),
  ("device3", -2, 0),
  ("device4", 0, 0),
  ("device5", 5, 0),
  ("device6", -5, 0),
  ("device1", 4, 1),
  ("device2", -3, 1),
  ("device3", 5, 1),
  ("device4", 1, 1),
  ("device5", 1, 1),
  ("device6", 4, 1)).toDF("endPoint", "power", "time")

val mapping = Seq(
  "device1" -> 1,
  "device2" -> 1,
  "device3" -> 1,
  "device4" -> 2,
  "device5" -> 2,
  "device6" -> 1).toDF("endPoint", "span")

data.as("A").
  join(mapping.as("B"), $"B.endpoint" === $"A.endpoint", "inner").
  groupBy($"B.span", $"A.time").
  agg(min($"A.power" > 0).as("powerAboveThreshold")).
  show()
Concatenated names are quite a bit harder; this requires you either to write your own UDAF (supported in the next version of Spark) or to use a combination of Hive functions.
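As a side note not in the original answers: later Spark releases ship collect_list, sort_array and concat_ws, which make the concatenated names possible without a custom UDAF. A hedged PySpark sketch, assuming a SparkSession named spark and the same data and mapping tables as above:
from pyspark.sql import functions as F

# Hypothetical setup: a SparkSession called `spark` is assumed to exist
data = spark.createDataFrame(
    [("device1", -4, 0), ("device2", 3, 0), ("device3", -2, 0),
     ("device4", 0, 0), ("device5", 5, 0), ("device6", -5, 0),
     ("device1", 4, 1), ("device2", -3, 1), ("device3", 5, 1),
     ("device4", 1, 1), ("device5", 1, 1), ("device6", 4, 1)],
    ["endPoint", "power", "time"])
mapping = spark.createDataFrame(
    [("device1", 1), ("device2", 1), ("device3", 1),
     ("device4", 2), ("device5", 2), ("device6", 1)],
    ["endPoint", "span"])

(data.join(mapping, "endPoint")
     .groupBy("span", "time")
     .agg(F.concat_ws("-", F.sort_array(F.collect_list("endPoint"))).alias("devices"),
          F.min(F.col("power") > 0).alias("powerAboveThreshold"))
     .show())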

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and above in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['length'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason it is not good is that you don't know at run time whether data['group'][mask3], for example, will be a view, and thus actually change the dataframe, or a copy, in which case the dataframe would remain unchanged. It took me quite some time to explain this to her, since she argued, correctly, that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified in two different ways that the assignment took place:
By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2 the group column contained no null values and had all the values we wanted, but by method 1 it didn't!
Can anyone explain this?
See the docs here.
You don't want to set values this way, for exactly the reason you pointed out: since you don't know whether it's a view, you don't know whether you are actually changing the data. pandas 0.13 will raise/warn when you attempt to do this, but it's easiest/best to just access it like:
data.loc[mask3, 'group'] = 3
which guarantees an in-place setitem.
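For completeness, a small sketch (not from the original answer) of the whole assignment written with .loc, plus an equivalent one-step binning via pd.cut; the column names follow the question:
import numpy as np
import pandas as pd

# Explicit .loc assignments: these always operate on the original frame
data['group'] = np.nan
data.loc[data['length'] < 10, 'group'] = 1
data.loc[(data['length'] > 9) & (data['length'] < 15), 'group'] = 2
data.loc[data['length'] > 14, 'group'] = 3

# Alternative: the same binning in one step (right-closed bins: <=9, 10-14, >=15)
data['group'] = pd.cut(data['length'], bins=[-np.inf, 9, 14, np.inf], labels=[1, 2, 3])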

Comparing vectors

I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 rows (and many columns, of which 2 are relevant to this problem; I'll call them M1 and M2). I have another data.frame whose column V1, with about 10,000 elements, is essential to this task. My task is this:
For each element of V1, find where it occurs in M2 and pull out the corresponding M1. I can do this using a for-loop, but it is terribly slow! I am used to Matlab and Perl, and this is taking forever in R! Surely there's a better way. I would appreciate any suggestions for accomplishing this task...
for (x in c(1:length(V$V1))) {
    start[x] = M$M1[M$M2 == V$V1[x]]
}
Only one element will match, so I can use the logical statement to directly get the element into the start vector. How can I vectorize this?
Thank you!
Here is another solution, using the same example as @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
         test replications elapsed relative
2     f_aix()        10000  12.907 7.068456
3   f_chase()        10000   2.010 1.100767
1 f_ramnath()        10000   1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.