Python - Looping through dataframe using methods other than .iterrows() - pandas

Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
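import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    # assumed loop body, inferred from the table above: each character is
    # 1.0 wide and starts 0.1 after the previous one ends
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)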
x0 and x1 represent the starting and ending position of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If that difference is larger than 0.1, the character is the start of a new string. I need to produce a dataframe of strings and their respective x0 and x1, which I currently do by looping through the dataframe using .iterrows():
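string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        # the first row always starts a new string
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    elif round(row['x0']-x1[-1],1) == 0.1:
        string[-1] += row['Character']
        x1[-1] = row['x1']
    else:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])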
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there any other faster way to achieve this?

You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.loc[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a True or False column, where every time a True appears it means a new group needs to be created. The cumsum will take care of assigning the same id to each group.
As suggested by @ShubhamSharma, you could do:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column, had to use this because of problems with floating point
grouper = same.cumsum()
The other part remains the same.
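To make the cumsum grouping concrete, here is a minimal self-contained sketch that applies the rounded variant to the sample data. The frame is typed in by hand from the table in the question, so the exact float values are an assumption, and the intermediate names gap, starts_new and group_id are only for illustration:

```python
import pandas as pd

# Sample data from the question, with the space rows already dropped.
df = pd.DataFrame(
    {
        "Character": list("Thisisasample."),
        "x0": [0.0, 1.1, 2.2, 3.3, 5.5, 6.6, 8.8, 11.0, 12.1, 13.2, 14.3, 15.4, 16.5, 17.6],
        "x1": [1.0, 2.1, 3.2, 4.3, 6.5, 7.6, 9.8, 12.0, 13.1, 14.2, 15.3, 16.4, 17.5, 18.6],
    },
    index=[0, 1, 2, 3, 5, 6, 8, 10, 11, 12, 13, 14, 15, 16],
)

# Gap between each character's start and the previous character's end.
gap = df["x0"] - df["x1"].shift()

# True marks the first character of a new string. The first row has no
# predecessor (NaN gap), compares as False, and so stays in group 0.
starts_new = gap.round(3).gt(0.1)

# The running count of True flags gives one integer label per string.
group_id = starts_new.cumsum()

res = df.groupby(group_id).agg({"Character": "".join, "x0": "first", "x1": "last"})
print(res)
```

Rounding before the comparison plays the same role as the 0.00001 tolerance in the first version: it keeps tiny floating point errors in the subtraction from being mistaken for a gap.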


pandas: filter rows of DataFrame with other condition dataframe

I have a pandas dataframe like this
id f1 f2 f3 f4
1 8.327 9.905 8.133 0.785
2 3.549 0.452 7.798 5.797
3 0.011 0.238 1.291 7.593
4 0.325 0.792 4.643 4.3
5 7.093 7.312 3.641 9.88
6 2.88 7.834 5.727 6.984
7 5.554 1.649 4.018 0.623
8 2.501 2.941 9.323 0.565
9 1.032 6.961 3.905 8.116
10 9.68 7.922 7.015 7.542
11 8.096 4.344 1.153 5.244
I would like to filter the data using conditions held in another dataframe. I want to find the records that satisfy all of the following conditions.
variable interval
1 f1 (0,4)
2 f2 [1,3]
3 f3 (5,+np.inf)
4 f4 [0,10]
I know I can achieve this with the following code.
df.query('f1>0 and f1<4 and f2>=1 and f2<=3 and f3>5 and f4>=0 and f4<=10')
# or
The downside is that I need to modify the code if the conditions change. Is there a pythonic way to handle this issue?
You can dynamically construct a query using:
left_op = lambda y: '>=' if y.closed_left else '>'
right_op = lambda y: '<=' if y.closed_right else '<'
construct_query = lambda x, y: f"({x}{left_op(y)}{y.left} and {x}{right_op(y)}{y.right})"
qry = " and ".join(
    df2.apply(lambda x: construct_query(x.variable, x.interval),
              axis = 1).tolist()
)
where df2 is your second dataframe with variable and interval columns.
For your example data qry looks like:
'(f1>0 and f1<4) and (f2>=1 and f2<=3) and (f3>5 and f3<inf) and (f4>=0 and f4<=10)'
Now if you do df.query(qry) it will give:
id f1 f2 f3 f4
7 8 2.501 2.941 9.323 0.565
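If you would rather not build a query string, a boolean mask can be assembled from the same conditions frame. This is a sketch of an alternative rather than the method above. It assumes df2 stores pd.Interval objects in its interval column, and build_mask is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "id": range(1, 12),
    "f1": [8.327, 3.549, 0.011, 0.325, 7.093, 2.88, 5.554, 2.501, 1.032, 9.68, 8.096],
    "f2": [9.905, 0.452, 0.238, 0.792, 7.312, 7.834, 1.649, 2.941, 6.961, 7.922, 4.344],
    "f3": [8.133, 7.798, 1.291, 4.643, 3.641, 5.727, 4.018, 9.323, 3.905, 7.015, 1.153],
    "f4": [0.785, 5.797, 7.593, 4.3, 9.88, 6.984, 0.623, 0.565, 8.116, 7.542, 5.244],
})

# Assumed layout of the conditions frame: one pd.Interval per variable.
# dtype=object keeps intervals with different closedness in one column.
df2 = pd.DataFrame({
    "variable": ["f1", "f2", "f3", "f4"],
    "interval": pd.Series([
        pd.Interval(0, 4, closed="neither"),
        pd.Interval(1, 3, closed="both"),
        pd.Interval(5, np.inf, closed="neither"),
        pd.Interval(0, 10, closed="both"),
    ], dtype=object),
})

def build_mask(data, conditions):
    """AND together one boolean mask per (variable, interval) row."""
    mask = pd.Series(True, index=data.index)
    for variable, interval in zip(conditions["variable"], conditions["interval"]):
        col = data[variable]
        lower = col >= interval.left if interval.closed_left else col > interval.left
        upper = col <= interval.right if interval.closed_right else col < interval.right
        mask &= lower & upper
    return mask

filtered = df[build_mask(df, df2)]
print(filtered)
```

Changing the conditions then only means editing df2; the filtering code stays the same.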

Add rows based on condition and also create and update new columns

I have a pandas dataframe with a few thousand rows; a subset of it is shown below:
fr var
1.1 10px
2.9 12pz
Expected Output:
fr var vard varv
1.1 10px -5 xval
1.1 10px 5 zval
2.9 12pz -6 zval
2.9 12pz 6 xval
For rows - Each row is to be split into two
Conditions for new columns:
'vard' - divide the numeric part of 'var' column by 2 and store it as two rows in 'vard', one negative and one positive value.
'varv' - if 'px' is in 'var column' and 'vard' has negative value, then 'varv' should be 'xval' else 'zval'.
Similarly if 'pz' is in 'var' column and 'vard' has negative value, then 'varv' should be 'zval' else 'xval'.
I have read through various answers to similar problems and tried many options like 'iterrows', 'shift', 'explode' etc., but was not able to get the expected output.
Use Series.str.extract to split var into its numeric and non-numeric parts, convert the numeric part to integers and divide by 2. Then concat the frame with a copy whose vard is multiplied by -1, sort the index and reset it. Finally, use numpy.where to set the new values by condition:
df[['vard','varv']] = df['var'].str.extract(r'(\d+)(\D+)')
df['vard'] = df['vard'].astype(int).div(2)
df = pd.concat([df, df.assign(vard = df['vard']*-1)]).sort_index().reset_index(drop=True)
m = (df['varv'].eq('px') & df['vard'].lt(0)) | (df['varv'].eq('pz') & df['vard'].gt(0))
df['varv'] = np.where(m, 'xval','zval')
print (df)
fr var vard varv
0 1.1 10px 5.0 zval
1 1.1 10px -5.0 xval
2 2.9 12pz 6.0 xval
3 2.9 12pz -6.0 zval
This can be done easily with the melt function.
# recreate your dataframe
df = pd.DataFrame([[1.1, '10px'], [2.9, '12pz']], columns=['fr','var'])
# split the var into its two components by creating two new columns
df['vard_p'] = df['var'].str[:-2]
df['vard_p'] = df['vard_p'].astype(float)/2
df['vard_n'] = -df['vard_p']
# get the varv from the var (I assumed it was simply the last character in the string)
df['varv'] = df['var'].str[-1]+'val'
# and here you melt on the two new vard columns to get the dataframe in the format you wanted
df = pd.melt(df, id_vars=['fr','var','varv'], value_vars=['vard_p','vard_n'])
# now rename or drop the new columns
df = df.drop(columns='variable').rename(columns={'value': 'vard'})
fr var varv vard
0 1.1 10px xval 5.0
1 2.9 12pz zval 6.0
2 1.1 10px xval -5.0
3 2.9 12pz zval -6.0
Hope it helped

How can I merge two data frames on a range of dates? [duplicate]

Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second data set has the same grouping variable 'x', and positions 'pos' within each group:
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x will never overlap but there may be gaps of values not in any region.
Now, I believe I should be using a rolling join. From what I can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want; however, "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the roll join so I could report "start", "pos", "end" as desired?
Overlap joins were implemented with commit 1375 in data.table v1.9.3, and are available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:
29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.
Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all iff d >= a and c <= b. And y is entirely contained within x iff a <= c and d <= b. For the different types of overlaps implemented, please have a look at ?foverlaps.
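The two conditions translate directly into comparisons. A minimal Python sketch follows; the function names overlaps and within are illustrative only and are not part of data.table:

```python
def overlaps(a, b, c, d):
    """True if y = [c, d] shares at least one point with x = [a, b]."""
    return d >= a and c <= b


def within(a, b, c, d):
    """True if y = [c, d] lies entirely inside x = [a, b]."""
    return a <= c and d <= b


# Region 'a' from d1 is [1, 3]; region 'b' is [5, 11].
print(overlaps(1, 3, 3, 5))    # the intervals touch at 3
print(within(1, 3, 2, 2))      # a single position, written as [2, 2]
print(within(5, 11, 12, 12))   # position 12 falls outside [5, 11]
```

Every interval that is within another also overlaps it, but not the other way round, which is why the choice of overlap type matters.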
Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlaps to do an overlap join with d1:
require(data.table) ## 1.9.3
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
# x start end pos pos2
# 1: a 1 3 2 2
# 2: a 1 3 3 3
# 3: c 19 22 20 20
# 4: e 7 25 10 10
by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.
The type of overlap is within, and we'd like to have all matches, only if there is a match.
NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).
So what's the advantage?
A benchmark on the code above on your data results in foverlaps() slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?
What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.
What if d1 has about 40K rows and d2 has 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:
Generate data:
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE),
start = pmin(idx1, idx2),
end = pmax(idx1, idx2))
d2 = data.table(x = sample(letters[1:15], k, TRUE),
pos1 = sample(60:150, k, TRUE))
system.time({
    setkey(d1)  # foverlaps needs a key on d1; the interval columns come last
    d2[, pos2 := pos1]
    ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user system elapsed
# 3.028 0.635 3.745
This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.
Gabor's solutions:
## new session - data.table solution
setkey(d1, x)
system.time(ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)])
# user system elapsed
# 15.714 4.424 20.324
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:
# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join
d2 using (x) where pos1 between start and end"))
# user system elapsed
# 73.955 1.605 77.049
Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]
Note that this overlap join is designed for problems where d2 doesn't necessarily have identical start and end coordinates (e.g. genomics, the field where I come from, where d2 usually has about 30-150 million or more rows).
foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.
NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.
data.table v1.9.8+ has a new feature - non-equi joins. With that, this operation becomes even more straightforward:
require(data.table) #v1.9.8+
# no need to set keys on `d1` or `d2`
d2[d1, .(x, pos=x.pos, start, end), on=.(x, pos>=start, pos<=end), nomatch=0L]
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
1) sqldf This is not data.table, but complex join criteria are easy to specify in a straightforward manner in SQL:
sqldf("select * from d1 join d2 using (x) where pos between start and end")
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
2) data.table For a data.table answer try this:
setkey(d1, x)
setkey(d2, x)
d1[d2][between(pos, start, end)]
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Note that this does have the disadvantage of forming the possibly large intermediate result d1[d2], which SQL may not do. The remaining solutions may have this problem too.
3) dplyr This suggests the corresponding dplyr solution. We also use between from data.table:
library(dplyr)
library(data.table) # between
d1 %>%
inner_join(d2) %>%
filter(between(pos, start, end))
Joining by: "x"
x start end pos
1 a 1 3 2
2 a 1 3 3
3 c 19 22 20
4 e 7 25 10
4) merge/subset Using only base R:
subset(merge(d1, d2), start <= pos & pos <= end)
x start end pos
1: a 1 3 2
2: a 1 3 3
3: c 19 22 20
4: e 7 25 10
Added: Note that the data.table solution here is much faster than the one in the other answer:
library(rbenchmark) # provides benchmark()
dt1 <- function() {
    d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
    d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
    setkey(d1, x, start)
    idx1 = d1[d2, which=TRUE, roll=Inf] # last observation carried forwards
    setkey(d1, x, end)
    idx2 = d1[d2, which=TRUE, roll=-Inf] # next observation carried backwards
    idx = which(!is.na(idx1) & !is.na(idx2))
    ans1 <<- cbind(d1[idx1[idx]], d2[idx, list(pos)])
}
dt2 <- function() {
    d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
    d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
    setkey(d1, x)
    ans2 <<- d1[d2][between(pos, start, end)]
}
benchmark(dt1(), dt2())[1:4]
## test replications elapsed relative
## 1 dt1() 100 1.45 1.667
## 2 dt2() 100 0.87 1.000 <-- from (2) above
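For readers working in pandas rather than R, the join-then-filter idea from (2) and (4) can be sketched as follows. This is an illustrative translation, not part of the original answers, and it shares the same drawback: the intermediate frame joined holds every pairing within each x before the filter is applied.

```python
import pandas as pd

# The two tables from the question.
d1 = pd.DataFrame({"x": list("abcde"),
                   "start": [1, 5, 19, 30, 7],
                   "end": [3, 11, 22, 39, 25]})
d2 = pd.DataFrame({"x": list("aabbcde"),
                   "pos": [2, 3, 3, 12, 20, 52, 10]})

# Equi-join on x first; this pairs every position with every region of its group.
joined = d2.merge(d1, on="x", how="inner")

# Then keep only the rows where pos lies in [start, end] (inclusive on both ends).
in_range = joined["pos"].between(joined["start"], joined["end"])
result = joined[in_range].reset_index(drop=True)
print(result)
```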
Overlap joins are available in dplyr 1.1.0 via the function join_by.
With join_by, you can do an overlap join with between, or manually with >= and <=:
inner_join(d2, d1, by = join_by(x, between(pos, start, end)))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
inner_join(d2, d1, by = join_by(x, pos >= start, pos <= end))
# x pos start end
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25
Using fuzzyjoin :
result <- fuzzyjoin::fuzzy_inner_join(d2, d1,
by = c('x', 'pos' = 'start', 'pos' = 'end'),
match_fun = list(`==`, `>=`, `<=`))
# x.x pos x.y start end
# <chr> <dbl> <chr> <dbl> <dbl>
#1 a 2 a 1 3
#2 a 3 a 1 3
#3 c 20 c 19 22
#4 e 10 e 7 25
Since fuzzyjoin returns all the columns we might need to do some cleaning to keep the columns that we want.
result %>% select(x = x.x, pos, start, end)
# A tibble: 4 x 4
# x pos start end
# <chr> <dbl> <dbl> <dbl>
#1 a 2 1 3
#2 a 3 1 3
#3 c 20 19 22
#4 e 10 7 25

Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?
I'll be as clear as possible without the full example that piRSquared suggested you provide.
The output of value_counts is a Series, so the values of your original Series can be retrieved from the index of the value_counts result. Displaying the count for only one of the values is then simply a matter of slicing that Series:
my_value_count = weather_data["snowfall"].value_counts()
0.0 12683
If you want to display only for a list of variables:
0.0 12683
0.2 32
0.1 31
As you have M and T in your values, I suspect the other values will be treated as strings and not float. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
0.0 12683
0.2 32
0.1 31
0.3 18
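Here is a small runnable sketch of the three selections. It assumes, as suspected above, that the column holds strings because of the M and T entries. The snowfall Series is a made-up stand-in for weather_data["snowfall"], and the names single, subset and below are illustrative:

```python
import pandas as pd

# Hypothetical stand-in for weather_data["snowfall"]; all values are strings.
snowfall = pd.Series(["0.0"] * 5 + ["M"] * 3 + ["T"] * 2 + ["0.2", "0.2", "0.1", "0.5"])
counts = snowfall.value_counts()

# Count for a single value: index the result by its label.
single = counts["0.0"]

# Counts for a chosen list of labels.
subset = counts.loc[["0.0", "0.2", "0.1"]]

# Counts for labels that satisfy a numeric condition. Non-numeric labels
# ("M", "T") are coerced to NaN, which never satisfies the comparison.
numeric_labels = pd.to_numeric(pd.Series(counts.index, index=counts.index), errors="coerce")
below = counts[numeric_labels < 0.4]

print(single)
print(subset)
print(below)
```

If the column really is numeric, the coercion step is unnecessary and the comparison on the index shown above works directly.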
Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want. Then you can use groupby and sum.
For example, suppose you want to count the frequency only where a column has a non-null value. In my case, that means rows with an actual non-null completion_date value:
dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')
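The same flag-then-sum pattern on a tiny made-up frame, in case it helps to see the numbers. The data and the has_date column name are hypothetical; only the HAZARD_ID and ACTUAL_COMPLETION_DATE column names come from the snippet above:

```python
import pandas as pd

# Hypothetical data: one row per mitigation plan.
dataset = pd.DataFrame({
    "HAZARD_ID": [1, 1, 2, 2, 2],
    "ACTUAL_COMPLETION_DATE": ["2020-01-05", None, None, "2020-03-01", "2020-04-11"],
})

# Flag the rows that satisfy the condition (here: a completion date is present).
dataset["has_date"] = dataset["ACTUAL_COMPLETION_DATE"].notna().astype(int)

# One count per HAZARD_ID ...
per_hazard = dataset.groupby("HAZARD_ID")["has_date"].sum()

# ... or the same count broadcast back onto every row, as transform('sum') does.
dataset["count_per_hazard"] = dataset.groupby("HAZARD_ID")["has_date"].transform("sum")

print(per_hazard)
print(dataset)
```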

Most efficient way to shift MultiIndex time series

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
a b c d e
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
a b c d e
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
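In pandas releases much newer than the 0.8.1 discussed here, groupby objects have their own shift method, which avoids calling a Python lambda once per group. A minimal sketch on a small stand-in frame follows; the data is made up, and no timing is claimed for the full 718k-row case:

```python
import numpy as np
import pandas as pd

# Small stand-in for the (pool, month) indexed frame from the question.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["pool", "month"]
)
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), index=index, columns=["a", "b"])

# Grouped shift without a per-group lambda.
lead = df.groupby(level="pool").shift(-1)  # next month's value within each pool
lag = df.groupby(level="pool").shift(1)    # previous month's value within each pool

print(lead)
print(lag)
```

Unlike the unstack/shift/stack route, this keeps the original row order and index, and the rows with no neighbour inside their pool are filled with NaN instead of being dropped.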