I am new to data analysis. I want to find the position of the cell containing an input string.
example:
Price  | Rate p/lot | Total Comm
947.2  | 1.25       | CAD 1.25
129.3  | 2.1        | CAD 1.25
161.69 | 0.8        | CAD 2.00
How do I find the position of the string "CAD 2.00"?
The required output is (2, 2).
In [353]: rows, cols = np.where(df == 'CAD 2.00')
In [354]: rows
Out[354]: array([2], dtype=int64)
In [355]: cols
Out[355]: array([2], dtype=int64)
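For reference, a minimal self-contained sketch (the column names and values are taken from the sample table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price': [947.2, 129.3, 161.69],
                   'Rate p/lot': [1.25, 2.1, 0.8],
                   'Total Comm': ['CAD 1.25', 'CAD 1.25', 'CAD 2.00']})

rows, cols = np.where(df == 'CAD 2.00')  # boolean mask over the whole frame
print(rows, cols)  # [2] [2], i.e. position (2, 2)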
Rename the columns to their numeric positions, stack, and use idxmax to get the first occurrence of the value:
d = dict(zip(df.columns, range(len(df.columns))))
s = df.rename(columns=d).stack()
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
If you want to check all occurrences, use boolean indexing and convert the MultiIndex to a list:
a = s[(s == 'CAD 1.25')].index.tolist()
print (a)
[(0, 2), (1, 2)]
Explanation:
Create a dict that maps column names to their positions:
d = dict(zip(df.columns, range(len(df.columns))))
print (d)
{'Rate p/lot': 1, 'Price': 0, 'Total Comm': 2}
print (df.rename(columns=d))
0 1 2
0 947.20 1.25 CAD 1.25
1 129.30 2.10 CAD 1.25
2 161.69 0.80 CAD 2.00
Then reshape with stack to get a MultiIndex of (row, column) positions:
s = df.rename(columns=d).stack()
print (s)
0 0 947.2
1 1.25
2 CAD 1.25
1 0 129.3
1 2.1
2 CAD 1.25
2 0 161.69
1 0.8
2 CAD 2.00
dtype: object
Compare by string:
print (s == 'CAD 2.00')
0 0 False
1 False
2 False
1 0 False
1 False
2 False
2 0 False
1 False
2 True
dtype: bool
Then get the position of the first True, that is, the values of the MultiIndex:
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
Another solution is to use numpy.nonzero to check the values, zip the indices together and convert them to a list:
i, j = (df.values == 'CAD 2.00').nonzero()
t = list(zip(i, j))
print (t)
[(2, 2)]
i, j = (df.values == 'CAD 1.25').nonzero()
t = list(zip(i, j))
print (t)
[(0, 2), (1, 2)]
A simple alternative:
def value_loc(value, df):
    # scan columns left to right; return (column position, row label) of the first match
    for col in list(df):
        if value in df[col].values:
            return (list(df).index(col), df[col][df[col] == value].index[0])
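A usage sketch with the example frame above; note this helper returns (column position, row label), which coincides with (2, 2) for this data:
print(value_loc('CAD 2.00', df))  # (2, 2)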
Related
I need to remove the last n rows where Status equals 1
v = df[df['Status'] == 1].count()
f = df[df['Status'] == 0].count()
diff = v - f
diff
df2 = df[~df['Status'] == 1].tail(diff).all() #ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
df2
Check whether Status equals 1 and keep only the places where it does (.loc[lambda s: s] does that via boolean indexing); the index of the last n such rows is then dropped:
df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
sample run:
In [343]: df
Out[343]:
Status
0 1
1 2
2 3
3 2
4 1
5 1
6 1
7 2
In [344]: n
Out[344]: 2
In [345]: df.Status.eq(1)
Out[345]:
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 False
Name: Status, dtype: bool
In [346]: df.Status.eq(1).loc[lambda s: s]
Out[346]:
0 True
4 True
5 True
6 True
Name: Status, dtype: bool
In [347]: df.Status.eq(1).loc[lambda s: s].tail(n)
Out[347]:
5 True
6 True
Name: Status, dtype: bool
In [348]: df.Status.eq(1).loc[lambda s: s].tail(n).index
Out[348]: Int64Index([5, 6], dtype='int64')
In [349]: df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
Out[349]:
Status
0 1
1 2
2 3
3 2
4 1
7 2
Using groupby() and transform() to mark rows to keep:
df = pd.DataFrame({"Status": [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]})
n = 3
df["Keep"] = df.groupby("Status")["Status"].transform(
lambda x: x.reset_index().index < len(x) - n if x.name == 1 else True
)
df.loc[df["Keep"]].drop(columns="Keep")
What I have: the length can differ per id, so sometimes one id has 4 rows with different values in the column val; the other columns all have the same values.
df1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3], 'val': ['06123','nick','#gmail','06454','abey','#gmail','06888','sisi'], 'media': ['nrc','nrc','nrc','nrc','nrc','nrc','nrc','nrc']})
What I need:
id  kolom 1  kolom 2  kolom 3  media
1 06123 nick #gmail nrc
2 06454 abey #gmail nrc
3 6888 sisi None nrc
I hope I gave a good example in the corrected way; thanks for the help.
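Side note: the constructor above passes 9 ids but only 8 values for val, so pandas will raise a ValueError; a runnable variant for the snippets below (padding val with None purely for illustration) is:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                    'val': ['06123', 'nick', '#gmail', '06454', 'abey', '#gmail', '06888', 'sisi', None],
                    'media': ['nrc'] * 9})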
df2 = df1.groupby('id').agg(list)
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else 'None')
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else 'None')
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2.drop(columns='val')
Here is another way. Since your original dataframe's lists don't all have the same length (which will get you a ValueError), you can define it as:
data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
id val media
0 1 06123 nrc
1 1 nick nrc
2 1 #gmail nrc
3 2 06454 nrc
4 2 abey nrc
5 2 #gmail nrc
6 3 06888 nrc
7 3 sisi nrc
8 3 NaN NaN
Afterwards, you can replace the np.nan values with an empty string, so that you can group by your id column and join the values in val separated by a ,.
df = df.replace(np.nan, "", regex=True)
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
id val
0 1.0 06123,nick,#gmail
1 2.0 06454,abey,#gmail
2 3.0 06888,sisi,
Then, you only need to transform the new val column into 3 columns by splitting the string inside, with any method you want. For example,
new_cols = df_new["val"].str.split(",", expand=True) # Good ol' split
df_new["kolom 1"] = new_cols[0] # Assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new.drop("val", 1, inplace=True) # Delete previous val
df_new["media"] = "nrc" # Add the media column again
df_new = df_new.replace("", np.nan, regex=True) # If necessary, replace empty string with np.nan
>>> df_new
id kolom 1 kolom 2 kolom 3 media
0 1.0 06123 nick #gmail nrc
1 2.0 06454 abey #gmail nrc
2 3.0 06888 sisi NaN nrc
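An alternative sketch, assuming the padded df1 from the side note above: number the rows within each id with cumcount and pivot them into columns (the kolom names are illustrative):
out = (df1.assign(kolom='kolom ' + (df1.groupby('id').cumcount() + 1).astype(str))
          .pivot(index='id', columns='kolom', values='val')
          .reset_index())
out['media'] = df1.groupby('id')['media'].first().values  # media is constant per id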
I have the following dataframe
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 0}, {'id':'d', 'val':0}])
What I want is to replace the 0's with successive values starting at max value + 1
The result I want is as follows:
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 3}, {'id':'d', 'val':4}])
I tried the following:
for _, r in df.iterrows():
    if r.val == 0:
        r.val = df.val.max() + 1
However, is there a one-line way to do the above?
Filter only the 0 rows with boolean indexing and DataFrame.loc, then assign a range whose length is the count of True values in the condition, shifted by the maximum value plus 1 (plus 1 because Python's range counts from 0):
df.loc[df['val'].eq(0), 'val'] = range(df['val'].eq(0).sum()) + df.val.max() + 1
print (df)
id val
0 a 1
1 b 2
2 c 3
3 d 4
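An equivalent sketch that avoids adding a scalar to a range object (assuming the original df, before the assignment above):
import numpy as np

mask = df['val'].eq(0)
df.loc[mask, 'val'] = df['val'].max() + np.arange(1, mask.sum() + 1)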
I have the following dataframe
import pandas as pd
d = {
'ID':[1,2,3,4,5],
'Price1':[5,9,4,3,9],
'Price2':[9,10,13,14,18],
'Type':['A','A','B','C','D'],
}
df = pd.DataFrame(data = d)
df
For applying the formula without condition I use the following code
df = df.eval(
'Price = (Price1*Price1)/2'
)
df
How do I apply the formulas, which have different conditions, without splitting the dataframe?
Need a new column called Price_on_type
The formula differs for each type:
For type A the formula for Price_on_type = Price1+Price1
For type B the formula for Price_on_type = (Price1+Price1)/2
For type C the formula for Price_on_type = Price1
For type D the formula for Price_on_type = Price2
Expected Output:
import pandas as pd
d = {
'ID':[1,2,3,4,5],
'Price1':[5,9,4,3,9],
'Price2':[9,10,13,14,18],
'Price':[12.5,40.5, 8.0, 4.5, 40.5],
'Price_on_type':[14,19,8.0,3,18],
}
df = pd.DataFrame(data = d)
df
You can use numpy.select:
masks = [df['Type'] == 'A',
         df['Type'] == 'B',
         df['Type'] == 'C',
         df['Type'] == 'D']
vals = [df.eval('(Price1*Price1)'),
        df.eval('(Price1*Price1)/2'),
        df['Price1'],
        df['Price2']]
Or, with the formulas written out directly (this is the version that produces the output shown below):
vals = [df['Price1'] + df['Price2'],
        (df['Price1'] + df['Price2']) / 2,
        df['Price1'],
        df['Price2']]
df['Price_on_type'] = np.select(masks, vals)
print (df)
ID Price1 Price2 Type Price_on_type
0 1 5 9 A 14.0
1 2 9 10 A 19.0
2 3 4 13 B 8.5
3 4 3 14 C 3.0
4 5 9 18 D 18.0
If your data is not too big, use apply with a custom function on axis=1:
def Prices(x):
    dict_sw = {
        'A': x.Price1 + x.Price2,
        'B': (x.Price1 + x.Price2)/2,
        'C': x.Price1,
        'D': x.Price2,
    }
    return dict_sw[x.Type]
In [239]: df['Price_on_type'] = df.apply(Prices, axis=1)
In [240]: df
Out[240]:
ID Price1 Price2 Type Price_on_type
0 1 5 9 A 14.0
1 2 9 10 A 19.0
2 3 4 13 B 8.5
3 4 3 14 C 3.0
4 5 9 18 D 18.0
Or use the trick that True converts to 1 and False to 0:
df['Price_on_type'] = \
    (df.Type == 'A') * (df.Price1 + df.Price2) + \
    (df.Type == 'B') * (df.Price1 + df.Price2)/2 + \
    (df.Type == 'C') * df.Price1 + \
    (df.Type == 'D') * df.Price2
Out[308]:
ID Price1 Price2 Type Price_on_type
0 1 5 9 A 14.0
1 2 9 10 A 19.0
2 3 4 13 B 8.5
3 4 3 14 C 3.0
4 5 9 18 D 18.0
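A close variant sketch of the dict-based apply above (assuming the same df): keep the formulas as callables, so only the branch for the row's Type is evaluated:
formulas = {
    'A': lambda r: r.Price1 + r.Price2,
    'B': lambda r: (r.Price1 + r.Price2) / 2,
    'C': lambda r: r.Price1,
    'D': lambda r: r.Price2,
}
df['Price_on_type'] = df.apply(lambda r: formulas[r.Type](r), axis=1)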
I would like to find a pattern in a categorical variable of a dataframe, going down the rows. I can see how to use Series.shift() to look up/down and boolean logic to find the pattern; however, I want to do this within a grouping variable and also label all rows that are part of the pattern, not just the starting row.
Code:
import pandas as pd
from numpy.random import choice, randn
import string
# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
'group_var': choice(list(string.ascii_uppercase), n_rows),
'row_pat': choice([0, 1, 2, 3], n_rows),
'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)
This returns the first 10 rows of the sorted frame.
I can find the start of the pattern (with no grouping though) by this:
# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0
# flag the row at the start of the pattern
df['pat_flag'] = \
    df['row_pat'].eq(p0) & \
    df['row_pat'].shift(-1).eq(p1) & \
    df['row_pat'].shift(-2).eq(p2) & \
    df['row_pat'].shift(-3).eq(p3)
df.head(10)
What I can't figure out is how to do this only within the "group_var" and, instead of returning True for the start of the pattern, return True for all rows that are part of the pattern.
Appreciate any tips on how to solve this!
Thanks...
I think you have 2 ways - a simpler but slower solution, or a faster but more complicated one.
Use Rolling.apply and test for the pattern,
replace 0s with NaNs using mask,
use bfill with a limit (same as fillna with method='bfill') to repeat the 1s,
then fillna the remaining NaNs with 0,
and last cast to bool with astype:
pat = np.asarray([1, 2, 2, 0])
N = len(pat)
df['rm0'] = (df['row_pat'].rolling(window=N, min_periods=N)
                          .apply(lambda x: (x==pat).all())
                          .mask(lambda x: x == 0)
                          .bfill(limit=N-1)
                          .fillna(0)
                          .astype(bool)
             )
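Note that the rolling call above scans the whole column, so a match could in principle straddle two groups; a per-group sketch of the same idea with groupby().transform (assuming the same df, pat and N) is:
df['rm0_grp'] = (df.groupby('group_var')['row_pat']
                   .transform(lambda s: s.rolling(N, min_periods=N)
                                         .apply(lambda x: (x == pat).all())
                                         .mask(lambda x: x == 0)
                                         .bfill(limit=N - 1)
                                         .fillna(0))
                   .astype(bool))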
If performance is important, use strides; the linked solution was modified:
Use a rolling window approach,
compare with the pattern and return True for full matches with all,
get the indices of the first occurrences with np.mgrid and indexing,
create all covered indices with a list comprehension,
and test membership with numpy.in1d to create the new column:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
Another solution, thanks #divakar:
from scipy.ndimage.morphology import binary_dilation

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
Timings:
np.random.seed(456)
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage.morphology import binary_dilation
import string
# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
'group_var': choice(list(string.ascii_uppercase), n_rows),
'row_pat': choice([0, 1, 2, 3], n_rows),
'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
print (df.iloc[460:480])
date_time group_var row_pat values rm0 rm1 rm2
12045 2019-06-25 21:00:00 A 3 -0.081152 False False False
12094 2019-06-27 22:00:00 A 1 -0.818167 False False False
12125 2019-06-29 05:00:00 A 0 -0.051088 False False False
12143 2019-06-29 23:00:00 A 0 -0.937589 False False False
12145 2019-06-30 01:00:00 A 3 0.298460 False False False
12158 2019-06-30 14:00:00 A 1 0.647161 False False False
12164 2019-06-30 20:00:00 A 3 -0.735538 False False False
12210 2019-07-02 18:00:00 A 1 -0.881740 False False False
12341 2019-07-08 05:00:00 A 3 0.525652 False False False
12343 2019-07-08 07:00:00 A 1 0.311598 False False False
12358 2019-07-08 22:00:00 A 1 -0.710150 True True True
12360 2019-07-09 00:00:00 A 2 -0.752216 True True True
12400 2019-07-10 16:00:00 A 2 -0.205122 True True True
12404 2019-07-10 20:00:00 A 0 1.342591 True True True
12413 2019-07-11 05:00:00 A 1 1.707748 False False False
12506 2019-07-15 02:00:00 A 2 0.319227 False False False
12527 2019-07-15 23:00:00 A 3 2.130917 False False False
12600 2019-07-19 00:00:00 A 1 -1.314070 False False False
12604 2019-07-19 04:00:00 A 0 0.869059 False False False
12613 2019-07-19 13:00:00 A 2 1.342101 False False False
In [225]: %%timeit
...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
...: .apply(lambda x: (x==pat).all())
...: .mask(lambda x: x == 0)
...: .bfill(limit=N-1)
...: .fillna(0)
...: .astype(bool)
...: )
...:
1 loop, best of 3: 356 ms per loop
In [226]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...: c = np.mgrid[0:len(b)][b]
...: d = [i for x in c for i in range(x, x+N)]
...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
...:
100 loops, best of 3: 7.63 ms per loop
In [227]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...:
...: m = (rolling_window(arr, len(pat)) == pat).all(1)
...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
...:
100 loops, best of 3: 7.25 ms per loop
You could make use of the .rolling() method and then simply compare the windows it returns with the array that contains the pattern you are attempting to match on.
pattern = np.asarray([1.0, 2.0, 2.0, 0.0])
n_obs = len(pattern)
df['rolling_match'] = (df['row_pat']
                       .rolling(window=n_obs, min_periods=n_obs)
                       .apply(lambda x: (x==pattern).all())
                       .astype(bool)             # All as bools
                       .shift(-1 * (n_obs - 1))  # Shift back
                       .fillna(False)            # convert NaNs to False
                       )
It is important to specify the min periods here in order to ensure that you only find exact matches (and so the equality check won't fail when the shapes are misaligned). The apply function is doing a pairwise check between the two arrays, and then we use the .all() to ensure all match. We convert to a bool, and then call shift on the function to move it to being a 'forward looking' indicator instead of only occurring after the fact.
Help on the rolling functionality available here -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
This works as follows:
a) For every group, it takes a window of size 4 and scans through the column until it finds the combination (1,2,2,0) in exact sequence. As soon as it finds the sequence, it populates the corresponding index values of new column 'pat_flag' with 1.
b) If it doesn't find the combination, it populates the column with 0.
pattern = [1,2,2,0]
def get_pattern(df):
    df = df.reset_index(drop=True)
    df['pat_flag'] = 0
    get_indexes = []
    temp = []
    for index, row in df.iterrows():
        mindex = index + 1
        # get the next 4 values
        for j in range(mindex, mindex+4):
            if j == df.shape[0]:
                break
            else:
                get_indexes.append(j)
                temp.append(df.loc[j,'row_pat'])
        # check if sequence is matched
        if temp == pattern:
            df.loc[get_indexes,'pat_flag'] = 1
        else:
            # reset if the pattern is not found in given window
            temp = []
            get_indexes = []
    return df
# apply function to the groups
df = df.groupby('group_var').apply(get_pattern)
## snippet of output
date_time group_var row_pat values pat_flag
41 2018-03-13 21:00:00 C 3 0.731114 0
42 2018-03-14 05:00:00 C 0 1.350164 0
43 2018-03-14 11:00:00 C 1 -0.429754 1
44 2018-03-14 12:00:00 C 2 1.238879 1
45 2018-03-15 17:00:00 C 2 -0.739192 1
46 2018-03-18 06:00:00 C 0 0.806509 1
47 2018-03-20 06:00:00 C 1 0.065105 0
48 2018-03-20 08:00:00 C 1 0.004336 0
Expanding on Emmet02's answer: use the rolling function for all groups and set the match column to True for all indices covered by a matching pattern:
pattern = np.asarray([1,2,2,0])
# Create a match column in the main dataframe
df = df.assign(match=False)
for group_var, group in df.groupby("group_var"):
    # Per group do rolling window matching, the last
    # values of matching patterns in array 'match'
    # will be True
    match = (
        group['row_pat']
        .rolling(window=len(pattern), min_periods=len(pattern))
        .apply(lambda x: (x==pattern).all())
    )
    # Get indices of matches in current group
    idx = np.arange(len(group))[match == True]
    # Include all indices of matching pattern,
    # counting back from last index in pattern
    idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
    # Update matches
    match.values[idx] = True
    df.loc[group.index, 'match'] = match
df[df.match==True]
edit: Without a for loop
# Do rolling matching in group clause
match = (
    df.groupby("group_var")
    .rolling(len(pattern))
    .row_pat.apply(lambda x: (x==pattern).all())
)
# Convert NaNs
match = (~match.isnull() & match)
# Get indices of matches in current group
idx = np.arange(len(df))[match]
# Include all indices of matching pattern
idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
# Mark all indices that are selected by "idx" in match-column
df = df.assign(match=df.index.isin(df.index[idx]))
You can do this by defining a custom aggregate function, using it in a groupby statement, and finally merging the result back into the original dataframe. Something like this:
Aggregate function:
def pattern_detect(column):
    # define any other pattern to detect here
    p0, p1, p2, p3 = 1, 2, 2, 0
    found = column.eq(p0) & \
            column.shift(-1).eq(p1) & \
            column.shift(-2).eq(p2) & \
            column.shift(-3).eq(p3)
    return found.any()
Use the groupby function next:
grp = df.groupby('group_var').agg([pattern_detect])['row_pat']
Now merge it back to the original dataframe:
df = df.merge(grp, left_on='group_var',right_index=True, how='left')