Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get the requirements as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So I want to put the value '*' as a marker meaning the end of the table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index(es), i.e. the names of the column and row, or the index numbers.
Is there any way I can find the index (or a list of indexes) of a certain value, like a coordinate?
for example, when the data frame is like below,
  column_1 column_2
1      'a'      'b'
2      'c'      'd'
how can I get 'column_2' and 2 from the value 'd'? It's something like the opposite of .loc or .iloc.

Interesting question. I also used a list comprehension, but with np.where. Still, I'd be surprised if there isn't a less clunky way.
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) position, not the custom (1-based) index you have. If you have a fixed offset you could just add +1 to the output, or map the positions back through df.index, as in the sketch below.
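A small variation (a sketch, same assumptions as the code above) that returns the actual index labels instead of positions:
[(c, df.index[np.where(df[c] == 'd')[0]].tolist()) for c in df.columns if (df[c] == 'd').any()]
> [('column_2', [2])]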

If I understand what you are looking for: find the (index value, column location) for a value in a DataFrame. You can use a list comprehension in a loop. It probably won't be the fastest if your DataFrame is large.
import pandas as pd

# assume this dataframe
df = pd.DataFrame({'col': ['abc', 'def', 'wert', 'abc'], 'col2': ['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]

I might be misunderstanding something, but np.where should get the job done.
import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
solution = np.where(df_tmp == 'd')
solution contains the positional row and column indices as a pair of arrays, here (array([1]), array([1])).
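To convert those positions into (index label, column name) coordinates, one option (a small sketch building on the code above) is:
rows, cols = np.where(df_tmp == 'd')
list(zip(df_tmp.index[rows], df_tmp.columns[cols]))
# [(2, 'column_2')]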
Hope this helps!

To search for a single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search for a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
It also works when the value occurs in multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
  column_1 column_2
1        a        b
2      NaN        d
stack() moves the columns into the index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
Now the result has a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this MultiIndex to a list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
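If the input order matters, one workaround (a sketch) is to search the values one at a time and concatenate the results:
[t for v in ['d', 'b', 'a'] for t in df[df == v].stack().index.tolist()]
[Out]:
[(2, 'column_2'), (1, 'column_2'), (1, 'column_1')]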

Had a similar need and this worked perfectly.
# normalize case so the search is case-insensitive
# (raw_df is the original DataFrame; in pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the positional index of the first row containing the value
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the positional index of the first column containing the value
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# do whatever you want, e.g. replace the value above that cell
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'
Note that list.index(True) returns only the first match and raises ValueError if there is none.
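As a concrete, hypothetical example tied to the original question, locating a '*' end-of-table marker might look like this (the frame and column names are made up):
import pandas as pd

raw_df = pd.DataFrame({'c1': ['a', 'c', '*'], 'c2': ['b', 'd', None]})
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
df.isin(['*']).any(axis=1).tolist().index(True)  # row position -> 2
df.isin(['*']).any(axis=0).tolist().index(True)  # column position -> 0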

Related

Select columns from data frame using a 1/0 list, pandas

I have a list of 1s and 0s where each element corresponds to a column of a data frame, for example:
df.columns = ['a','b','c']
binary_list = [0,1,0]
Based on that, I want to select only column b from the data frame, as in my binary list only the element corresponding to b is 1.
Is there a way to perform that in pandas?
P.S. This is my first time posting on Stack Overflow; apologies if I am not following a specific styling.
If the binary list is aligned with the columns, you can use boolean indexing:
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
binary_list = [0, 1, 0]
df.loc[:, list(map(bool, binary_list))]  # use a concrete list; a bare map object is not treated as a boolean mask
Output:
   b
0  2
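An equivalent (a sketch, assuming the list has one entry per column) is to build a numpy boolean mask:
import numpy as np

df.loc[:, np.asarray(binary_list, dtype=bool)]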

Concatenate multiple dfs with the same dimensions, apply functions to the cell values of all dfs, and store the result in the cell

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 9, size=(2, 2)))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(2, 2)))
Let's say that after concatenating df1 and df2 (in the real case I have many dfs of size 700x200), I get something like the table below (I don't need to see this table, it's just for explanation):
        col a   col b
row a   [1,4]   [7,8]
row b   [9,2]   [2,0]
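One way to build that intermediate table of per-cell lists (a sketch, not from the original post, assuming all dfs share the same index and columns; the name cells is made up) is to concatenate and group by the row index:
cells = pd.concat([df1, df2]).groupby(level=0).agg(list)  # each cell becomes a list of the stacked values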
Then I want to pass each cell's values to the compute function below and store the result back in the cell:
def compute(row, column, cell_values):
    baseline_df = [2, 4, 6, 7, 8]
    result = baseline_df
    for values in cell_values:
        if (column - row) != dict[values]:  # dict contains specific values
            result = baseline_df
        else:
            result = result.apply(func, value=values)
    return result.loc[column - row]

def func(df, value):
    # operation
    result_df = df * value
    return result_df
What I want is to take df1 and df2, concatenate them, apply the above function, and get the results, in a really fast way.
In the actual use case the df is quite big, and running this for all cells would take a significant amount of time; I need a faster way to perform this.
Note:
This is my idea of doing this. I hope you understand what my requirements are; please let me know if they are not clear.
Currently I am using something like the line below, which just takes the max value of the cell and does the calculation (func) later.
This will just give the max value of all cells combined:
dfs = pd.concat(grid).max(level=0)
The final result should be something like this after the calculation (the same 2D array with new cell data):
        col a   col b
row a   0.1     0.7
row b   0.9     0.6
Different approaches are also welcome

Error when using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not duplicate rows. I want to use groupby.apply to remove the duplication. An example is as follows.
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
   A  B  C
0  a  1  1
1  a  1  2
2  b  1  1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to be with creating the new column D; if I create column D outside the function, the code runs fine. Can someone explain what causes the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify the original group. Another problem is that this function should return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then applying it with df.groupby('A').apply(get_uniq_t) returns:
   A  B  C
A
a  a  1  2
b  b  1  1
Edit following the comment
In my opinion it is not allowed to modify the original group, as it would indirectly modify the original DataFrame. At the very least pandas displays a warning about it, and it is considered bad practice; search the Web for SettingWithCopyWarning for a more extensive description.
My code (the get_uniq_t function) does not modify the original group. It only returns one row from the current group, selected based on which row yields the greatest value of df.C * 10 + df.B. So when you apply this function, the result is a new DataFrame, with consecutive rows equal to the results of this function for consecutive groups.
You can perform an operation equivalent to modification when you create some new content, e.g. as the result of a groupby instruction, and then save it under the same variable which so far held the source DataFrame.
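For reference, a sketch of an equivalent without groupby.apply (same df as above): compute the key once, then pick each group's maximizing row with idxmax:
key = df.C * 10 + df.B
df.loc[key.groupby(df.A).idxmax()]
This returns the same rows, indexed by their original row labels rather than by group.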

Dataframe: selecting rows with given conditions and operating on them

I have a dataframe that looks like this:
import pandas as pd
import numpy as np

# 'DD' and 'EE' start empty; a scalar NaN broadcasts to the column length
# (empty lists would raise "arrays must all be same length")
df = pd.DataFrame({'AA': [1, 1, 2, 2], 'BB': ['C', 'D', 'C', 'D'], 'CC': [10, 20, 30, 40],
                   'DD': np.nan, 'EE': np.nan})
Now I want to multiply a value in column 'CC' by 2 if 'AA' == 1 and 'BB' == 'C'. For example, the first row meets the conditions, so the value in column 'CC', which is 10, will be multiplied by 2 and the output will go in the same row of the 'DD' column.
I will have other requirements for other pairs of 'AA' and 'BB', but it will be a good start if I can get the idea of how to apply multiplication to rows that meet conditions.
Thank you so much.
m0 = df.AA == 1
m1 = df.BB == "C"
df.loc[m0 & m1, "DD"] = df.loc[m0 & m1, "CC"] * 2
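For the other ('AA', 'BB') pairs mentioned in the question, one way to avoid a chain of masks (a sketch; the second rule and its factor are made up) is np.select:
# each (AA, BB) pair gets its own multiplier
conditions = [(df.AA == 1) & (df.BB == 'C'), (df.AA == 2) & (df.BB == 'D')]
choices = [df.CC * 2, df.CC * 3]
df['DD'] = np.select(conditions, choices, default=np.nan)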

Pandas: Issue with min() on Categorical columns

I have the following df where columns A,B,C are categorical variables with strict ordering:
import pandas as pd

df = pd.DataFrame([[0, 1, 'PASS', 'PASS', 'PASS'],
                   [0, 2, 'CHAIN', 'FAIL', 'PASS'],
                   [0, 3, 'PASS', 'PASS', 'TATPG'],
                   [0, 4, 'FAIL', 'PASS', 'FAIL'],
                   [0, 5, 'FAIL', 'ATPG', 'FAIL']],
                  columns=['X', 'Y', 'A', 'B', 'C'])
# astype('category', categories=..., ordered=True) was removed in pandas 1.0;
# use a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=['CHAIN', 'ATPG', 'TATPG', 'PASS', 'FAIL'], ordered=True)
for c in ['A', 'B', 'C']:
    df[c] = df[c].astype(cat_type)
I want to create a new column D defined as min('A', 'B', 'C') under that ordering. For example, row 1 contains 'CHAIN', which is the smallest value, hence D[1] = CHAIN, and so on. The D column should come out as follows:
D[0] = PASS, D[1] = CHAIN, D[2] = TATPG, D[3] = PASS, D[4] = ATPG
I tried:
df['D'] = df[['A','B','C']].apply(min, axis=1)
However, this does not work: apply() converts the A/B/C values to type object, and hence min() sorts lexicographically instead of using the ordering I provided.
I also tried:
df['D'] = df[['A', 'B', 'C']].transpose().min(axis=0)
transpose() also results in columns A/B/C being changed to type object instead of category.
Any ideas on how to do this correctly? I'd rather not recast the columns as categorical a second time inside apply(). In general, I'll be creating a bunch of indicator columns using this formula:
df[indicator] = df[[any subset of (A,B,C)]].min()
I have found a solution that applies sorted with a key:
d = {'CHAIN': 0, 'ATPG': 1, 'TATPG': 2, 'PASS': 3, 'FAIL': 4}

def func(row):
    return sorted(row, key=lambda x: d[x])[0]
df['D'] = df[['A','B','C']].apply(func, axis=1)
It gives you the result you're looking for:
0 PASS
1 CHAIN
2 TATPG
3 PASS
4 ATPG
However, it does not make use of pandas's native ordering of categorical variables.
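A variant that does use the categorical ordering (a sketch, assuming the columns were built with the ordered CategoricalDtype above): compare the integer category codes, which follow the declared order, and take the value at each row's minimum:
import numpy as np

codes = df[['A', 'B', 'C']].apply(lambda s: s.cat.codes)   # ints ordered CHAIN < ATPG < TATPG < PASS < FAIL
min_pos = codes.to_numpy().argmin(axis=1)                  # column position of the row-wise minimum
df['D'] = df[['A', 'B', 'C']].to_numpy()[np.arange(len(df)), min_pos]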