Pandas: how to find the position of a cell containing a sub-string

Example:
Price  | Rate p/lot | Total Comm
947.2  | 1.25       | BAM 1.25
129.3  | 2.1        | $ 1.25
161.69 | $ 0.8      | CAD 2.00

If I search for ['$', 'CAD'], the expected output is:
[(1, 2), (2, 1), (2, 2)]

Sorry, I found a solution like this; it may help someone:
import pandas as pd

df = pd.DataFrame([[947.2, 1.25, 'BAM 1.25'],
                   [129.3, 2.1, '$ 1.25'],
                   [161.69, '0.8 $', 'CAD 2.00']],
                  columns=['Price', 'Rate p/lot', 'Total Comm'])

row, column = df.applymap(lambda x: x if any(s in str(x) for s in ['$', 'CAD']) else None).values.nonzero()
t = list(zip(row, column))

You can use in with applymap:
i, j = (df.applymap(lambda x: '$' in str(x))).values.nonzero()
t = list(zip(i, j))
print (t)
[(1, 2), (2, 1)]
L = ['$', 'CAD']
i, j = (df.applymap(lambda x: any(y for y in L if y in str(x)))).values.nonzero()
#another solution
#i, j = (df.applymap(lambda x: any(s in str(x) for s in L))).values.nonzero()
t = list(zip(i, j))
print (t)
[(1, 2), (2, 1), (2, 2)]

Use str.contains:
from functools import reduce
from itertools import product

df = df.astype(str)
result = reduce(lambda x, y: x + y,
                [list(product([i], list(df.iloc[:, i][df.iloc[:, i].str.contains(r'\$|CAD')].index)))
                 for i in range(len(df.columns))])
Output
[(1, 2), (2, 1), (2, 2)]
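If the reduce/product one-liner is hard to follow, here is a rough equivalent (my own sketch, assuming the same df and the same search terms) that walks the columns with str.contains and collects (row, column) positions:
import numpy as np

positions = []
for j, col in enumerate(df.columns):
    # treat every cell as a string and test it against the pattern '$ or CAD'
    hit = df[col].astype(str).str.contains(r'\$|CAD')
    positions.extend((i, j) for i in np.flatnonzero(hit))

print(sorted(positions))
# [(1, 2), (2, 1), (2, 2)]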

Related

How do I select rows from a pandas df without returning False values?

I have a df and I need to select rows based on some conditions in multiple columns.
Here is what I have:
import pandas as pd

dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns=['a', 'b', 'c'])
df_dat = df[(df[['a','b']].isin(['k','p','q','j']) & df['c'] > 3)] | df[(~df[['a','b']].isin(['k','p','q','j']) & df['c'] > 2 )]
Expected result: [('p','q', 5), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3)]
The result I am getting is an all-False dataframe.
When you have a complicated condition, I recommend building the conditions outside the slice. (Your version parses as (isin(...) & df['c']) > 3, because & binds more tightly than the comparison; also, isin on a two-column frame returns a frame of booleans, which needs .any(axis=1) to collapse into one mask per row.)
cond1 = df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3)
cond2 = (~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2)
out = df.loc[cond1 | cond2]
Out[305]:
   a     b  c
0  p     q  5
2  p     -  5
3  -     p  4
4  q  pkjq  3

Matrix Vector Product across Multiple Dimensions

I have two arrays:
A = torch.rand((64, 128, 10, 10))
B = torch.rand((64, 128, 10))
I would like to compute the product C, doing a matrix-vector multiplication batched over the first and second dimensions of A and B, so:
# C should have shape: (64, 128, 10)
for i in range(0, 64):
    for j in range(0, 128):
        C[i,j] = torch.matmul(A[i,j], B[i,j])
Does anyone know how to do this using torch.einsum? I tried the following, but I am getting an incorrect result.
C = torch.einsum('ijkl, ijk -> ijk', A, B)
Here are the options with numpy (I don't have torch).
In [120]: A = np.random.random((64, 128, 10, 10))
...: B = np.random.random((64, 128, 10))
Your iterative reference case:
In [122]: C = np.zeros((64,128,10))
     ...: # C should have shape: (64, 128, 10)
     ...: for i in range(0, 64):
     ...:     for j in range(0, 128):
     ...:         C[i,j] = np.matmul(A[i,j], B[i,j])
     ...:
matmul with full broadcasting:
In [123]: D = np.matmul(A, B[:,:,:,None])
In [125]: C.shape
Out[125]: (64, 128, 10)
In [126]: D.shape # D has an extra size 1 dimension
Out[126]: (64, 128, 10, 1)
In [127]: np.allclose(C,D[...,0]) # or use squeeze
Out[127]: True
The einsum equivalent:
In [128]: E = np.einsum('ijkl,ijl->ijk', A, B)
In [129]: np.allclose(C,E)
Out[129]: True
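Translating back to torch (a sketch I have not run here, since the answer above uses numpy), the same einsum spec applies, and a batched matmul with a trailing size-1 axis gives the same result:
import torch

A = torch.rand((64, 128, 10, 10))
B = torch.rand((64, 128, 10))

# contract A's last axis against B's last axis, batching over i and j
C = torch.einsum('ijkl,ijl->ijk', A, B)

# equivalent: treat B as a stack of column vectors, then drop the size-1 axis
D = torch.matmul(A, B.unsqueeze(-1)).squeeze(-1)

print(C.shape)               # torch.Size([64, 128, 10])
print(torch.allclose(C, D))  # expected: True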

Pandas Return Cell position containing string

I am new to data analysis. I want to find the position of the cell containing an input string.
Example:
Price  | Rate p/lot | Total Comm
947.2  | 1.25       | CAD 1.25
129.3  | 2.1        | CAD 1.25
161.69 | 0.8        | CAD 2.00
How do I find the position of the string "CAD 2.00"?
The required output is (2, 2).
In [353]: rows, cols = np.where(df == 'CAD 2.00')
In [354]: rows
Out[354]: array([2], dtype=int64)
In [355]: cols
Out[355]: array([2], dtype=int64)
Rename the column names to a numeric range, stack, and for the first occurrence of the value use idxmax:
d = dict(zip(df.columns, range(len(df.columns))))
s = df.rename(columns=d).stack()
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
If you want to check all occurrences, use boolean indexing and convert the MultiIndex to a list:
a = s[(s == 'CAD 1.25')].index.tolist()
print (a)
[(0, 2), (1, 2)]
Explanation:
Create a dict for renaming the column names to a range:
d = dict(zip(df.columns, range(len(df.columns))))
print (d)
{'Rate p/lot': 1, 'Price': 0, 'Total Comm': 2}
print (df.rename(columns=d))
        0     1         2
0  947.20  1.25  CAD 1.25
1  129.30  2.10  CAD 1.25
2  161.69  0.80  CAD 2.00
Then reshape with stack to get a MultiIndex of positions:
s = df.rename(columns=d).stack()
print (s)
0  0       947.2
   1        1.25
   2    CAD 1.25
1  0       129.3
   1         2.1
   2    CAD 1.25
2  0      161.69
   1         0.8
   2    CAD 2.00
dtype: object
Compare by string:
print (s == 'CAD 2.00')
0  0    False
   1    False
   2    False
1  0    False
   1    False
   2    False
2  0    False
   1    False
   2     True
dtype: bool
And get the position of the first True, i.e. the values of the MultiIndex:
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
Another solution is to use numpy.nonzero to check the values, then zip the row and column arrays together and convert to a list:
i, j = (df.values == 'CAD 2.00').nonzero()
t = list(zip(i, j))
print (t)
[(2, 2)]
i, j = (df.values == 'CAD 1.25').nonzero()
t = list(zip(i, j))
print (t)
[(0, 2), (1, 2)]
A simple alternative:
def value_loc(value, df):
    for col in list(df):
        if value in df[col].values:
            return (list(df).index(col), df[col][df[col] == value].index[0])
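Usage, assuming the DataFrame from the question above (note that this helper returns (column position, row label), which happens to coincide with (row, column) here):
print(value_loc('CAD 2.00', df))   # (2, 2)
print(value_loc('CAD 1.25', df))   # (2, 0) -- column 2, first matching row 0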

Fast way to combine multiple columns of type float into one column of type array(float)

I have a dataset like this:
df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3]
    },
    index=[0, 1, 2])
Here, we have three columns ["5985-0.0", "5985-0.1", "5985-0.2"] that represent the first, second and third float readings of thing 5985-0 -- i.e. .x represents an array index.
I'd like to take multiple columns and collapse them into a single column 5985-0 containing some kind of list of float, which I can do like this:
srccols = ["5985-0.0", "5985-0.1", "5985-0.2"]
df["5985-0"] = df[srccols].apply(tuple, axis=1)
df.drop(srccols, axis=1)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
which I can then store as an SQL table with an array column.
However, apply(tuple) is very slow. Is there a faster, more idiomatic pandas way to combine multiple columns into one?
(First person to say "normalized" gets a downvote).
My Choice
Assuming I know the columns
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(cols, 1).join(s)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Base Case
Using filter, join, and apply(tuple, 1)
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(cols, 1).join(s)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Option 2
Using filter, join, pd.Series
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
   333-0     5985-0
0    123  [1, 1, 1]
1    123  [2, 2, 2]
2    123  [3, 3, 3]
Option 3
Using filter, join, pd.Series, and zip
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
print(df.drop(cols, 1).join(s))
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Timing
Large Data Set
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(cols, 1).join(s)
1 loop, best of 3: 350 ms per loop
%%timeit
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(cols, 1).join(s)
100 loops, best of 3: 4.06 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
100 loops, best of 3: 4.56 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
100 loops, best of 3: 6.89 ms per loop
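As a side note, recent pandas versions warn about or reject the positional axis argument used in df.drop(cols, 1) above. A rough modern rewrite of Option 2 (my own sketch, not part of the original answer) that also pins the new Series to df's index so the join aligns even for a non-default index:
import pandas as pd

df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3],
})

thing = '5985-0'
d = df.filter(like=thing)

# one Python list per row, built from the filtered block's 2-D array
s = pd.Series(d.to_numpy().tolist(), name=thing, index=df.index)

out = df.drop(columns=d.columns).join(s)
print(out)
#    333-0     5985-0
# 0    123  [1, 1, 1]
# 1    123  [2, 2, 2]
# 2    123  [3, 3, 3]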

Data handling for matplotlib histogram with error bars

I've got a data set which is a list of tuples in python like this:
dataSet = [(6.1248199999999997, 27), (6.4400500000000003, 4), (5.9150600000000004, 1), (5.5388400000000004, 38), (5.82559, 1), (7.6892199999999997, 2), (6.9047799999999997, 1), (6.3516300000000001, 76), (6.5168699999999999, 1), (7.4382099999999998, 1), (5.4493299999999998, 1), (5.6254099999999996, 1), (6.3227700000000002, 1), (5.3321899999999998, 11), (6.7402300000000004, 4), (7.6701499999999996, 1), (5.4589400000000001, 3), (6.3089700000000004, 1), (6.5926099999999996, 2), (6.0003000000000002, 5), (5.9845800000000002, 1), (6.4967499999999996, 2), (6.51227, 6), (7.0302600000000002, 1), (5.7271200000000002, 49), (7.5311300000000001, 7), (5.9495800000000001, 2), (5.1487299999999996, 18), (5.7637099999999997, 6), (5.5144500000000001, 44), (6.7988499999999998, 1), (5.2578399999999998, 1)]
The first element of each tuple is an energy and the second is a counter of how many sensors were affected.
I want to create a histogram to study the relation between the number of affected sensors and the energy. I'm pretty new to matplotlib (and python), but this is what I've done so far:
import math
import matplotlib.pyplot as plt
dataSet = [(6.1248199999999997, 27), (6.4400500000000003, 4), (5.9150600000000004, 1), (5.5388400000000004, 38), (5.82559, 1), (7.6892199999999997, 2), (6.9047799999999997, 1), (6.3516300000000001, 76), (6.5168699999999999, 1), (7.4382099999999998, 1), (5.4493299999999998, 1), (5.6254099999999996, 1), (6.3227700000000002, 1), (5.3321899999999998, 11), (6.7402300000000004, 4), (7.6701499999999996, 1), (5.4589400000000001, 3), (6.3089700000000004, 1), (6.5926099999999996, 2), (6.0003000000000002, 5), (5.9845800000000002, 1), (6.4967499999999996, 2), (6.51227, 6), (7.0302600000000002, 1), (5.7271200000000002, 49), (7.5311300000000001, 7), (5.9495800000000001, 2), (5.1487299999999996, 18), (5.7637099999999997, 6), (5.5144500000000001, 44), (6.7988499999999998, 1), (5.2578399999999998, 1)]
binWidth = .2
binnedDataSet = []
#create another list and append the "binning-value"
for item in dataSet:
    binnedDataSet.append((item[0], item[1], math.floor(item[0]/binWidth)*binWidth))
energies, sensorHits, binnedEnergy = [[q[i] for q in binnedDataSet] for i in (0,1,2)]
plt.plot(binnedEnergy, sensorHits, 'ro')
plt.show()
This works so far (although it doesn't even look like a histogram ;-) but OK), but now I want to calculate the mean value for each bin and append some error bars.
What's the way to do it? I looked at histogram examples for matplotlib, but they all use one-dimensional data which will be counted, so you get a frequency spectrum… That's not really what I want.
I am somewhat confused by exactly what you are trying to do, but I think this (to first order) will do what I think you want:
bin_width = .2
bottom = 5.0
top = 8.0
binned_data = [0.0] * int(math.ceil((top - bottom) / bin_width))
binned_count = [0] * int(math.ceil((top - bottom) / bin_width))
n_bins = len(binned_data)

for E, cnt in dataSet:
    if E < bottom or E > top:
        print('out of range')
        continue
    bin_id = int(math.floor(n_bins * (E - bottom) / (top - bottom)))
    binned_data[bin_id] += cnt     # sum of counts in this bin
    binned_count[bin_id] += 1      # number of samples in this bin

binned_averaged_data = [C_sum / hits if hits > 0 else 0 for C_sum, hits in zip(binned_data, binned_count)]
bin_edges = [bottom + j * bin_width for j in range(len(binned_data))]

plt.bar(bin_edges, binned_averaged_data, width=bin_width)
I would also suggest looking into numpy; it would make this much simpler to write.
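For illustration, here is a numpy-based sketch (my own addition, not from the original answer) that computes the mean hit count per energy bin together with its standard error, and plots them with error bars; it reuses dataSet from the question:
import numpy as np
import matplotlib.pyplot as plt

energies = np.array([e for e, _ in dataSet])
hits = np.array([h for _, h in dataSet])

bin_edges = np.linspace(5.0, 8.0, 16)            # 15 bins of width 0.2
bin_ids = np.digitize(energies, bin_edges) - 1   # 0-based bin index per sample

centers, means, errors = [], [], []
for b in range(len(bin_edges) - 1):
    in_bin = hits[bin_ids == b]
    if in_bin.size == 0:
        continue
    centers.append(0.5 * (bin_edges[b] + bin_edges[b + 1]))
    means.append(in_bin.mean())
    # standard error of the mean as the error bar (0 if only one sample)
    errors.append(in_bin.std(ddof=1) / np.sqrt(in_bin.size) if in_bin.size > 1 else 0.0)

plt.errorbar(centers, means, yerr=errors, fmt='o')
plt.xlabel('Energy')
plt.ylabel('Mean sensor hits per bin')
plt.show()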