Count previous occurrences in Pandas [duplicate] - pandas

I feel like there is a better way than this:
import pandas as pd

df = pd.DataFrame(
    columns=" index c1 c2 v1 ".split(),
    data=[
        [ 0, "A", "X", 3],
        [ 1, "A", "X", 5],
        [ 2, "A", "Y", 7],
        [ 3, "A", "Y", 1],
        [ 4, "B", "X", 3],
        [ 5, "B", "X", 1],
        [ 6, "B", "X", 3],
        [ 7, "B", "Y", 1],
        [ 8, "C", "X", 7],
        [ 9, "C", "Y", 4],
        [10, "C", "Y", 1],
        [11, "C", "Y", 6],
    ],
).set_index("index", drop=True)

def callback(x):
    x['seq'] = range(1, x.shape[0] + 1)
    return x

df = df.groupby(['c1', 'c2']).apply(callback)
print(df)
To achieve this:
c1 c2 v1 seq
0 A X 3 1
1 A X 5 2
2 A Y 7 1
3 A Y 1 2
4 B X 3 1
5 B X 1 2
6 B X 3 3
7 B Y 1 1
8 C X 7 1
9 C Y 4 1
10 C Y 1 2
11 C Y 6 3
Is there a way to do it that avoids the callback?

Use cumcount(); see the docs.
In [4]: df.groupby(['c1', 'c2']).cumcount()
Out[4]:
0 0
1 1
2 0
3 1
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
dtype: int64
If you want the numbering to start at 1:
In [5]: df.groupby(['c1', 'c2']).cumcount()+1
Out[5]:
0 1
1 2
2 1
3 2
4 1
5 2
6 3
7 1
8 1
9 1
10 2
11 3
dtype: int64
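Since cumcount() returns a Series aligned with the original index, it can be assigned straight back to the frame as the seq column the question asks for:
df['seq'] = df.groupby(['c1', 'c2']).cumcount() + 1
print(df)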

This might be useful
df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
It will create, for each userID, a sequence of its ItemID values joined by '->', in date order.

If you have a dataframe similar to the one below and you want to add a seq column built from c1 or c2, i.e. keep a running count of similar values (or reset it whenever a flag comes up) in the other column(s), read on.
df = pd.DataFrame(
    columns=" c1 c2 seq".split(),
    data=[
        ["A",    1, 1],
        ["A1",   0, 2],
        ["A11",  0, 3],
        ["A111", 0, 4],
        ["B",    1, 1],
        ["B1",   0, 2],
        ["B111", 0, 3],
        ["C",    1, 1],
        ["C11",  0, 2],
    ],
)
Then first find the group starters (str.contains() and eq() are used below, but any method that creates a boolean Series, such as lt(), ne(), isna(), etc., works), and call cumsum() on the result to create a Series in which each group has a unique identifying value. Finally, use that Series as the grouper in a groupby().cumcount() operation.
In summary, use a code similar to the one below.
# build a grouper Series for similar values
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
# or build a grouper Series from flags (1s)
groups = df['c2'].eq(1).cumsum()
# groupby using the above grouper
df['seq'] = df.groupby(groups).cumcount().add(1)
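To see what that grouper looks like on the sample frame above, here is a quick sketch (the seq2 column name is only for verification against the existing seq column):
# group starters: True at "A", "B", "C"; cumsum() turns them into group ids 1, 2, 3
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
print(groups.tolist())   # [1, 1, 1, 1, 2, 2, 2, 3, 3]

# running count within each of those groups, starting at 1 (matches the seq column)
df['seq2'] = df.groupby(groups).cumcount().add(1)
print(df)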

The cleanliness of Jeff's answer is nice, but I prefer to sort explicitly... though generally without overwriting my df for these types of use cases (e.g. Shaina Raza's answer).
So, to create a new column sequenced by 'v1' within each ('c1', 'c2') group:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount()
you can check with:
df.sort_values(by=['c1','c2','seq'])
or, if you want to overwrite the df, then:
df = df.sort_values(by=['c1','c2','seq']).reset_index()

Related

pandas groupby per-group value

I have this data:
df = pd.DataFrame({
    "dim1": ["aaa", "aaa", "aaa", "aaa", "aaa", "aaa"],
    "dim2": ["xxx", "xxx", "xxx", "yyy", "yyy", "yyy"],
    "iter": [0, 1, 2, 0, 1, 2],
    "value1": [100, 101, 99, 500, 490, 510],
    "value2": [10000, 10100, 9900, 50000, 49000, 51000],
})
I then group by dim1/dim2 and, out of all iterations, pick the value1/value2 pair with the minimum value1:
df = df.groupby(["dim1", "dim2"], group_keys=False) \
       .apply(lambda x: x.sort_values("value1").head(1)).drop(columns=["iter"])
which returns:
dim1 dim2 value1 value2
aaa xxx 99 9900
aaa yyy 490 49000
My question: how can I add a new column that contains the min value1 per dim1 group:
dim1 dim2 value1 value2 new_col
aaa xxx 99 9900 99
aaa yyy 490 49000 99
I tried something like this, which didn't work:
df["new_col"] = df.groupby(["dim1"], group_keys=False) \
.apply(lambda x: x.value1.head(1))
IIUC, you can use .groupby + .transform afterwards:
df["new_col"] = df.groupby("dim1")["value1"].transform("min")
print(df)
Prints:
dim1 dim2 value1 value2 new_col
2 aaa xxx 99 9900 99
4 aaa yyy 490 49000 99
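The .apply(lambda x: x.value1.head(1)) attempt fails because it returns one row per group rather than a value for every row, while .transform broadcasts the group minimum back to each row, aligned on the index, so it can be assigned directly. A minimal, self-contained sketch of that behaviour (values copied from the reduced frame above):
import pandas as pd

# the reduced frame from the question, with its original index
df = pd.DataFrame(
    {"dim1": ["aaa", "aaa"], "dim2": ["xxx", "yyy"],
     "value1": [99, 490], "value2": [9900, 49000]},
    index=[2, 4],
)

# transform("min") returns one value per row, so direct assignment works
df["new_col"] = df.groupby("dim1")["value1"].transform("min")
print(df)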

Calculate the difference between all rows and a specific row in the dataframe

This is a similar question to this thread.
Lets consider df as:
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
How can you calculate the difference between all rows and the row at the Nth index in a group (the lowest index for EACH group) for column "B", and put it in column "D"? I want to calculate the mean square displacement for my data, i.e. the difference between the values in a column for each group and the first row that appears in that group.
I tried:
df['D'] = df.groupby(["A"])['B'].sub(df.groupby(['A'])["B"].iloc[0])
Group = df.groupby(["A"])
However using .sub and groupby raise the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
the desired result would be like this:
A B C D
0 a 2 3 0 *lowest index in group "a"
1 b 5 6 0 *lowest index in group "b"
2 c 8 9 0 *lowest index in group "c"
3 a 0 0 -2
4 a 8 7 6
5 c 2 1 -6
I guess this answer could be enough of a hint for you:
import pandas as pd
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9], ["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns=["A", "B", "C"])
print("df:")
print(df)
print()
groupA = df.groupby(['A'])
print("groupA:")
print(groupA.groups)
print()
print("lowest indices for each group from columnA:")
lowest_indices = dict()
for k, v in groupA.groups.items():
    lowest_indices[k] = v[0]
print(lowest_indices)
print()
columnB = df['B']
print("columnB:")
print(columnB)
print()
df['D'] = df['B']
for i in range(len(df)):
    group_at_i = df['A'].iloc[i]
    lowest_index_of_that = lowest_indices[group_at_i]
    b_element_at_that_index = df['B'].iloc[lowest_index_of_that]
    the_difference = df['B'].iloc[i] - b_element_at_that_index
    df.loc[i, 'D'] = the_difference
print("df:")
print(df)
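As a more concise alternative (a sketch of the usual vectorised idiom, not part of the answer above), groupby().transform('first') broadcasts the first B value of each group back onto every row, so the difference can be taken in one step:
import pandas as pd

df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],
                   ["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns=["A", "B", "C"])

# subtract the B value of each group's first row from every row in that group
df["D"] = df["B"] - df.groupby("A")["B"].transform("first")
print(df)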

pandas row wise comparison and apply condition

This is my dataframe:
df = pd.DataFrame(
    {
        "name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
        "score": [3, 5, 6, 2, 4, 1],
    }
)
I want to compare the score of bob_x with bob_y and retain the row with the lowest score, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
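If you need to keep the entire winning row (all columns, original names included) rather than just the minimum score, one option (a sketch, not part of the original answer) is to group on the split-off base name and select rows with idxmin:
import pandas as pd

df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
                   "score": [3, 5, 6, 2, 4, 1]})

# group by the part before "_" and keep the full row holding each group's lowest score
base = df["name"].str.split("_").str[0]
result = df.loc[df.groupby(base)["score"].idxmin()]
print(result)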

Lookup smallest value greater than current

I have an objects table and a lookup table. In the objects table, I'm looking to add the smallest value from the lookup table that is greater than the object's number.
I found this similar question, but it's about finding a value greater than a constant, rather than one that changes for each row.
In code:
import pandas as pd
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30}])
lookup = pd.DataFrame([{"number": 3}, {"number": 12}, {"number": 40}])
expected = pd.DataFrame(
    [
        {"id": 1, "number": 10, "smallest_greater": 12},
        {"id": 2, "number": 30, "smallest_greater": 40},
    ]
)
First compare every value in objects['number'] against all of lookup['number'] to build a 2D boolean mask, take the cumulative sum along each row and test where it first equals 1, and use numpy.argmax to get the position of the matching value in lookup['number'].
The output is then assembled with numpy.where, which overwrites all rows with no match with NaN.
import numpy as np

objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30},
                        {"id": 3, "number": 100}, {"id": 4, "number": 1}])
print(objects)
   id  number
0   1      10
1   2      30
2   3     100
3   4       1

m1 = lookup['number'].values >= objects['number'].values[:, None]
m2 = np.cumsum(m1, axis=1) == 1
m3 = np.any(m1, axis=1)
out = lookup['number'].values[m2.argmax(axis=1)]
objects['smallest_greater'] = np.where(m3, out, np.nan)
print(objects)
   id  number  smallest_greater
0   1      10              12.0
1   2      30              40.0
2   3     100               NaN
3   4       1               3.0
smallest_greater = []
for i in objects['number']:
    # take the first (smallest) lookup value strictly greater than i;
    # assumes such a value exists, otherwise index[0] raises an IndexError
    smallest_greater.append(lookup['number'][lookup[lookup['number'] > i].sort_values(by='number').index[0]])
objects['smallest_greater'] = smallest_greater
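Another option (a sketch, not from the answers above) is pd.merge_asof with direction='forward' and allow_exact_matches=False, which attaches the nearest strictly greater lookup value to each row and leaves NaN where none exists:
import pandas as pd

objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30}])
lookup = pd.DataFrame([{"number": 3}, {"number": 12}, {"number": 40}])

out = pd.merge_asof(
    objects.sort_values("number"),
    lookup.rename(columns={"number": "smallest_greater"}).sort_values("smallest_greater"),
    left_on="number",
    right_on="smallest_greater",
    direction="forward",          # nearest lookup value at or after 'number'
    allow_exact_matches=False,    # ...but strictly greater, never equal
).sort_values("id")
print(out)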

Tensorflow, Reshape like a convolution

I have a matrix of shape [3,3,256], and my final output must be [4,2,2,256]. I have to do a reshape that works like a 'convolution' without changing the values (in this case using a 2x2 filter). Is there a method to do this using tensorflow?
If I understand your question correctly, you want to store the original values redundantly in the new structure, like this (without the last dim of 256):
[ [ 1 2 3 ] [ [ 1 2 ] [ [ 2 3 ] [ [ 4 5 ] [ [ 5 6 ]
[ 4 5 6 ] => [ 4 5 ] ], [ 5 6 ] ], [ 7 8 ] ], [ 8 9 ] ]
[ 7 8 9 ] ]
If yes, you can use indexing, like this, with x being the original tensor, and then stack them:
x2 = []
for i in range( 2 ):
    for j in range( 2 ):
        x2.append( x[ i : i + 2, j : j + 2, : ] )
y = tf.stack( x2, axis = 0 )
Based on your comment, if you really want to avoid using any loops, you might use tf.extract_image_patches as below (tested code), but you should run some tests because this might actually be much worse than the above in terms of efficiency and performance:
import tensorflow as tf
sess = tf.Session()
x = tf.constant( [ [ [ 1, -1 ], [ 2, -2 ], [ 3, -3 ] ],
                   [ [ 4, -4 ], [ 5, -5 ], [ 6, -6 ] ],
                   [ [ 7, -7 ], [ 8, -8 ], [ 9, -9 ] ] ] )
xT = tf.transpose( x, perm = [ 2, 0, 1 ] )     # have to put channel dim as batch for tf.extract_image_patches
xTE = tf.expand_dims( xT, axis = -1 )          # extend dims to have fake channel dim
xP = tf.extract_image_patches( xTE, ksizes = [ 1, 2, 2, 1 ],
                               strides = [ 1, 1, 1, 1 ], rates = [ 1, 1, 1, 1 ], padding = "VALID" )
y = tf.transpose( xP, perm = [ 3, 1, 2, 0 ] )  # move dims back to original and new dim up front
print( sess.run(y) )
Output (horizontal separator lines added manually for readability):
[[[[ 1 -1]
   [ 2 -2]]
  [[ 4 -4]
   [ 5 -5]]]
 ---------
 [[[ 2 -2]
   [ 3 -3]]
  [[ 5 -5]
   [ 6 -6]]]
 ---------
 [[[ 4 -4]
   [ 5 -5]]
  [[ 7 -7]
   [ 8 -8]]]
 ---------
 [[[ 5 -5]
   [ 6 -6]]
  [[ 8 -8]
   [ 9 -9]]]]
I had a similar problem and found that tf.contrib.kfac.utils has a function called extract_convolution_patches. Suppose you have a tensor X with shape (1, 3, 3, 256), where the leading 1 is the batch size; you can call
Y = tf.contrib.kfac.utils.extract_convolution_patches(X, (2, 2, 256, 1), padding='VALID')
Y.shape # (1, 2, 2, 2, 2, 256)
The first two 2's are the number of output positions (they make up the 4 in your description). The latter two 2's are the shape of the filters. You can then call
Y = tf.reshape(Y, [4,2,2,256])
to get your final result.
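Note that tf.contrib (and tf.Session) were removed in TensorFlow 2.x; the equivalent patch extraction lives at tf.image.extract_patches. Below is a sketch adapting the extract_image_patches answer above to TF 2, using a 2-channel stand-in for the 256 channels:
import tensorflow as tf

# 3x3 input with 2 channels (a small stand-in for the 256-channel case)
base = tf.reshape(tf.range(1, 10), (3, 3))
x = tf.stack([base, -base], axis=-1)                 # shape (3, 3, 2)

xT = tf.transpose(x, perm=[2, 0, 1])                 # channels act as the batch dim
xTE = tf.expand_dims(xT, axis=-1)                    # add a fake channel dim
xP = tf.image.extract_patches(xTE, sizes=[1, 2, 2, 1],
                              strides=[1, 1, 1, 1], rates=[1, 1, 1, 1],
                              padding="VALID")       # shape (2, 2, 2, 4)
y = tf.transpose(xP, perm=[3, 1, 2, 0])              # shape (4, 2, 2, 2)
print(y.numpy())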