Pandas: how to find the position of a cell containing a sub-string

Example:
Price  | Rate p/lot | Total Comm
947.2  | 1.25       | BAM 1.25
129.3  | 2.1        | $ 1.25
161.69 | $ 0.8      | CAD 2.00

If I search for ['$', 'CAD'], the expected output is:
[(1, 2), (2, 1), (2, 2)]

Sorry, I found a solution like this; it may help someone:
import pandas as pd

df = pd.DataFrame([[947.2, 1.25, 'BAM 1.25'],
                   [129.3, 2.1, '$ 1.25'],
                   [161.69, '0.8 $', 'CAD 2.00']],
                  columns=['Price', 'Rate p/lot', 'Total Comm'])

row, column = df.applymap(lambda x: x if any(s in str(x) for s in ['$', 'CAD']) else None).values.nonzero()
t = list(zip(row, column))

You can use in with applymap:
i, j = (df.applymap(lambda x: '$' in str(x))).values.nonzero()
t = list(zip(i, j))
print (t)
[(1, 2), (2, 1)]
L = ['$', 'CAD']
i, j = (df.applymap(lambda x: any(y for y in L if y in str(x)))).values.nonzero()
#another solution
#i, j = (df.applymap(lambda x: any(s in str(x) for s in L))).values.nonzero()
t = list(zip(i, j))
print (t)
[(1, 2), (2, 1), (2, 2)]

Use str.contains:
from functools import reduce
from itertools import product

df = df.astype(str)
result = reduce(lambda x, y: x + y,
                [list(product([i], list(df.iloc[:, i][df.iloc[:, i].str.contains(r'\$|CAD')].index)))
                 for i in range(len(df.columns))])
Output
[(1, 2), (2, 1), (2, 2)]
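If the reduce/product one-liner is hard to follow, here is a rough equivalent (my own sketch, assuming the same df and the same search terms) that walks the columns with str.contains and collects (row, column) positions:
import numpy as np

positions = []
for j, col in enumerate(df.columns):
    # treat every cell as a string and test it against the pattern '$ or CAD'
    hit = df[col].astype(str).str.contains(r'\$|CAD')
    positions.extend((i, j) for i in np.flatnonzero(hit))

print(sorted(positions))
# [(1, 2), (2, 1), (2, 2)]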

Related

How do I select rows from a pandas df without returning False values?

I have a df and I need to select rows based on some conditions in multiple columns.
Here is what I have:
import pandas as pd

dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns=['a', 'b', 'c'])
df_dat = df[(df[['a','b']].isin(['k','p','q','j']) & df['c'] > 3)] | df[(~df[['a','b']].isin(['k','p','q','j']) & df['c'] > 2 )]
Expected result: [('p','q', 5), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3)]
The result I am getting is an all-False dataframe.
When you have a complicated condition, I recommend building the conditions outside the slice. (Your version parses as (isin(...) & df['c']) > 3, because & binds more tightly than the comparison; also, isin on a two-column frame returns a frame of booleans, which needs .any(axis=1) to collapse into one mask per row.)
cond1 = df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3)
cond2 = (~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2)
out = df.loc[cond1 | cond2]
Out[305]:
   a     b  c
0  p     q  5
2  p     -  5
3  -     p  4
4  q  pkjq  3

Matrix Vector Product across Multiple Dimensions

I have two arrays:
A = torch.rand((64, 128, 10, 10))
B = torch.rand((64, 128, 10))
I would like to compute the product C, doing a matrix-vector multiplication batched over the first and second dimensions of A and B, so:
# C should have shape: (64, 128, 10)
for i in range(0, 64):
    for j in range(0, 128):
        C[i,j] = torch.matmul(A[i,j], B[i,j])
Does anyone know how to do this using torch.einsum? I tried the following, but I am getting an incorrect result.
C = torch.einsum('ijkl, ijk -> ijk', A, B)
Here are the options with numpy (I don't have torch).
In [120]: A = np.random.random((64, 128, 10, 10))
...: B = np.random.random((64, 128, 10))
Your iterative reference case:
In [122]: C = np.zeros((64,128,10))
     ...: # C should have shape: (64, 128, 10)
     ...: for i in range(0, 64):
     ...:     for j in range(0, 128):
     ...:         C[i,j] = np.matmul(A[i,j], B[i,j])
     ...:
matmul with full broadcasting:
In [123]: D = np.matmul(A, B[:,:,:,None])
In [125]: C.shape
Out[125]: (64, 128, 10)
In [126]: D.shape # D has an extra size 1 dimension
Out[126]: (64, 128, 10, 1)
In [127]: np.allclose(C,D[...,0]) # or use squeeze
Out[127]: True
The einsum equivalent:
In [128]: E = np.einsum('ijkl,ijl->ijk', A, B)
In [129]: np.allclose(C,E)
Out[129]: True
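Translating back to torch (a sketch I have not run here, since the answer above uses numpy), the same einsum spec applies, and a batched matmul with a trailing size-1 axis gives the same result:
import torch

A = torch.rand((64, 128, 10, 10))
B = torch.rand((64, 128, 10))

# contract A's last axis against B's last axis, batching over i and j
C = torch.einsum('ijkl,ijl->ijk', A, B)

# equivalent: treat B as a stack of column vectors, then drop the size-1 axis
D = torch.matmul(A, B.unsqueeze(-1)).squeeze(-1)

print(C.shape)               # torch.Size([64, 128, 10])
print(torch.allclose(C, D))  # expected: True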

Pandas Return Cell position containing string

I am new to data analysis. I want to find the position of the cell containing an input string.
Example:
Price  | Rate p/lot | Total Comm
947.2  | 1.25       | CAD 1.25
129.3  | 2.1        | CAD 1.25
161.69 | 0.8        | CAD 2.00
How do I find the position of the string "CAD 2.00"?
The required output is (2, 2).
In [353]: rows, cols = np.where(df == 'CAD 2.00')
In [354]: rows
Out[354]: array([2], dtype=int64)
In [355]: cols
Out[355]: array([2], dtype=int64)
Rename the column names to a numeric range, stack, and for the first occurrence of the value use idxmax:
d = dict(zip(df.columns, range(len(df.columns))))
s = df.rename(columns=d).stack()
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
If you want to check all occurrences, use boolean indexing and convert the MultiIndex to a list:
a = s[(s == 'CAD 1.25')].index.tolist()
print (a)
[(0, 2), (1, 2)]
Explanation:
Create a dict for renaming the column names to a range:
d = dict(zip(df.columns, range(len(df.columns))))
print (d)
{'Rate p/lot': 1, 'Price': 0, 'Total Comm': 2}
print (df.rename(columns=d))
        0     1         2
0  947.20  1.25  CAD 1.25
1  129.30  2.10  CAD 1.25
2  161.69  0.80  CAD 2.00
Then reshape with stack to get a MultiIndex of positions:
s = df.rename(columns=d).stack()
print (s)
0  0       947.2
   1        1.25
   2    CAD 1.25
1  0       129.3
   1         2.1
   2    CAD 1.25
2  0      161.69
   1         0.8
   2    CAD 2.00
dtype: object
Compare by string:
print (s == 'CAD 2.00')
0  0    False
   1    False
   2    False
1  0    False
   1    False
   2    False
2  0    False
   1    False
   2     True
dtype: bool
And get the position of the first True, i.e. the values of the MultiIndex:
a = (s == 'CAD 2.00').idxmax()
print (a)
(2, 2)
Another solution is to use numpy.nonzero to check the values, then zip the row and column arrays together and convert to a list:
i, j = (df.values == 'CAD 2.00').nonzero()
t = list(zip(i, j))
print (t)
[(2, 2)]
i, j = (df.values == 'CAD 1.25').nonzero()
t = list(zip(i, j))
print (t)
[(0, 2), (1, 2)]
A simple alternative:
def value_loc(value, df):
    for col in list(df):
        if value in df[col].values:
            return (list(df).index(col), df[col][df[col] == value].index[0])
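Usage, assuming the DataFrame from the question above (note that this helper returns (column position, row label), which happens to coincide with (row, column) here):
print(value_loc('CAD 2.00', df))   # (2, 2)
print(value_loc('CAD 1.25', df))   # (2, 0) -- column 2, first matching row 0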

Fast way to combine multiple columns of type float into one column of type array(float)

I have a dataset like this:
df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3]
    },
    index=[0, 1, 2])
Here, we have three columns ["5985-0.0", "5985-0.1", "5985-0.2"] that represent the first, second and third float readings of thing 5985-0 -- i.e. .x represents an array index.
I'd like to take multiple columns and collapse them into a single column 5985-0 containing some kind of list of float, which I can do like this:
srccols = ["5985-0.0", "5985-0.1", "5985-0.2"]
df["5985-0"] = df[srccols].apply(tuple, axis=1)
df.drop(srccols, axis=1)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
which I can then store as an SQL table with an array column.
However, apply(tuple) is very slow. Is there a faster, more idiomatic pandas way to combine multiple columns into one?
(First person to say "normalized" gets a downvote).
My Choice
Assuming I know the columns
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(cols, 1).join(s)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Base Case
Using filter, join, and apply(tuple, 1)
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(cols, 1).join(s)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Option 2
Using filter, join, pd.Series
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
   333-0     5985-0
0    123  [1, 1, 1]
1    123  [2, 2, 2]
2    123  [3, 3, 3]
Option 3
Using filter, join, pd.Series, and zip
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
print(df.drop(cols, 1).join(s))
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Timing
Large Data Set
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(cols, 1).join(s)
1 loop, best of 3: 350 ms per loop
%%timeit
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(cols, 1).join(s)
100 loops, best of 3: 4.06 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
100 loops, best of 3: 4.56 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
100 loops, best of 3: 6.89 ms per loop
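As a side note, recent pandas versions warn about or reject the positional axis argument used in df.drop(cols, 1) above. A rough modern rewrite of Option 2 (my own sketch, not part of the original answer) that also pins the new Series to df's index so the join aligns even for a non-default index:
import pandas as pd

df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3],
})

thing = '5985-0'
d = df.filter(like=thing)

# one Python list per row, built from the filtered block's 2-D array
s = pd.Series(d.to_numpy().tolist(), name=thing, index=df.index)

out = df.drop(columns=d.columns).join(s)
print(out)
#    333-0     5985-0
# 0    123  [1, 1, 1]
# 1    123  [2, 2, 2]
# 2    123  [3, 3, 3]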

Data handling for matplotlib histogram with error bars

I've got a data set which is a list of tuples in python like this:
dataSet = [(6.1248199999999997, 27), (6.4400500000000003, 4), (5.9150600000000004, 1), (5.5388400000000004, 38), (5.82559, 1), (7.6892199999999997, 2), (6.9047799999999997, 1), (6.3516300000000001, 76), (6.5168699999999999, 1), (7.4382099999999998, 1), (5.4493299999999998, 1), (5.6254099999999996, 1), (6.3227700000000002, 1), (5.3321899999999998, 11), (6.7402300000000004, 4), (7.6701499999999996, 1), (5.4589400000000001, 3), (6.3089700000000004, 1), (6.5926099999999996, 2), (6.0003000000000002, 5), (5.9845800000000002, 1), (6.4967499999999996, 2), (6.51227, 6), (7.0302600000000002, 1), (5.7271200000000002, 49), (7.5311300000000001, 7), (5.9495800000000001, 2), (5.1487299999999996, 18), (5.7637099999999997, 6), (5.5144500000000001, 44), (6.7988499999999998, 1), (5.2578399999999998, 1)]
The first element of each tuple is an energy and the second is a counter of how many sensors were affected.
I want to create a histogram to study the relation between the number of affected sensors and the energy. I'm pretty new to matplotlib (and python), but this is what I've done so far:
import math
import matplotlib.pyplot as plt
dataSet = [(6.1248199999999997, 27), (6.4400500000000003, 4), (5.9150600000000004, 1), (5.5388400000000004, 38), (5.82559, 1), (7.6892199999999997, 2), (6.9047799999999997, 1), (6.3516300000000001, 76), (6.5168699999999999, 1), (7.4382099999999998, 1), (5.4493299999999998, 1), (5.6254099999999996, 1), (6.3227700000000002, 1), (5.3321899999999998, 11), (6.7402300000000004, 4), (7.6701499999999996, 1), (5.4589400000000001, 3), (6.3089700000000004, 1), (6.5926099999999996, 2), (6.0003000000000002, 5), (5.9845800000000002, 1), (6.4967499999999996, 2), (6.51227, 6), (7.0302600000000002, 1), (5.7271200000000002, 49), (7.5311300000000001, 7), (5.9495800000000001, 2), (5.1487299999999996, 18), (5.7637099999999997, 6), (5.5144500000000001, 44), (6.7988499999999998, 1), (5.2578399999999998, 1)]
binWidth = .2
binnedDataSet = []
#create another list and append the "binning-value"
for item in dataSet:
    binnedDataSet.append((item[0], item[1], math.floor(item[0]/binWidth)*binWidth))
energies, sensorHits, binnedEnergy = [[q[i] for q in binnedDataSet] for i in (0,1,2)]
plt.plot(binnedEnergy, sensorHits, 'ro')
plt.show()
This works so far (although it doesn't even look like a histogram ;-) but OK), but now I want to calculate the mean value for each bin and append some error bars.
What's the way to do it? I looked at histogram examples for matplotlib, but they all use one-dimensional data which will be counted, so you get a frequency spectrum… That's not really what I want.
I am somewhat confused by exactly what you are trying to do, but I think this (to first order) will do what I think you want:
bin_width = .2
bottom = 5.0
top = 8.0
binned_data = [0.0] * int(math.ceil((top - bottom) / bin_width))
binned_count = [0] * int(math.ceil((top - bottom) / bin_width))
n_bins = len(binned_data)

for E, cnt in dataSet:
    if E < bottom or E > top:
        print('out of range')
        continue
    bin_id = int(math.floor(n_bins * (E - bottom) / (top - bottom)))
    binned_data[bin_id] += cnt     # sum of counts in this bin
    binned_count[bin_id] += 1      # number of samples in this bin

binned_averaged_data = [C_sum / hits if hits > 0 else 0 for C_sum, hits in zip(binned_data, binned_count)]
bin_edges = [bottom + j * bin_width for j in range(len(binned_data))]

plt.bar(bin_edges, binned_averaged_data, width=bin_width)
I would also suggest looking into numpy; it would make this much simpler to write.
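For illustration, here is a numpy-based sketch (my own addition, not from the original answer) that computes the mean hit count per energy bin together with its standard error, and plots them with error bars; it reuses dataSet from the question:
import numpy as np
import matplotlib.pyplot as plt

energies = np.array([e for e, _ in dataSet])
hits = np.array([h for _, h in dataSet])

bin_edges = np.linspace(5.0, 8.0, 16)            # 15 bins of width 0.2
bin_ids = np.digitize(energies, bin_edges) - 1   # 0-based bin index per sample

centers, means, errors = [], [], []
for b in range(len(bin_edges) - 1):
    in_bin = hits[bin_ids == b]
    if in_bin.size == 0:
        continue
    centers.append(0.5 * (bin_edges[b] + bin_edges[b + 1]))
    means.append(in_bin.mean())
    # standard error of the mean as the error bar (0 if only one sample)
    errors.append(in_bin.std(ddof=1) / np.sqrt(in_bin.size) if in_bin.size > 1 else 0.0)

plt.errorbar(centers, means, yerr=errors, fmt='o')
plt.xlabel('Energy')
plt.ylabel('Mean sensor hits per bin')
plt.show()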