I have the following df where columns A, B, C are categorical variables with a strict ordering:
from pandas import DataFrame
import pandas as pd

df = DataFrame([[0, 1, 'PASS', 'PASS', 'PASS'],
                [0, 2, 'CHAIN', 'FAIL', 'PASS'],
                [0, 3, 'PASS', 'PASS', 'TATPG'],
                [0, 4, 'FAIL', 'PASS', 'FAIL'],
                [0, 5, 'FAIL', 'ATPG', 'FAIL']],
               columns=['X', 'Y', 'A', 'B', 'C'])

cat_type = pd.CategoricalDtype(categories=['CHAIN', 'ATPG', 'TATPG', 'PASS', 'FAIL'], ordered=True)
for c in ['A', 'B', 'C']:
    df[c] = df[c].astype(cat_type)
I want to create a new column D defined as min('A', 'B', 'C') under that ordering. For example, the smallest value in row 1 is 'CHAIN', hence D[1] = CHAIN, and so on. The D column should come out as follows:
D[0] = PASS, D[1] = CHAIN, D[2] = TATPG, D[3] = PASS, D[4] = ATPG
I tried:
df['D'] = df[['A','B','C']].apply(min, axis=1)
However, this does not work: apply() converts the A/B/C values to type object, so min() compares them lexicographically instead of by the ordering I provided.
I also tried:
df['D'] = df[['A', 'B', 'C']].transpose().min(axis=0)
transpose() likewise results in the columns A/B/C being converted to type object instead of category.
Any ideas on how to do this correctly? I'd rather not recast the columns as categorical a second time if using apply(). In general, I'll be creating a bunch of indicator columns using this formula:
df[indicator] = df[[any subset of (A,B,C)]].min()
I have found a solution that applies sorted() with a key:
d = {'CHAIN': 0, 'ATPG': 1, 'TATPG': 2, 'PASS': 3, 'FAIL': 4}

def func(row):
    return sorted(row, key=lambda x: d[x])[0]

df['D'] = df[['A', 'B', 'C']].apply(func, axis=1)
It gives you the result you're looking for:
0 PASS
1 CHAIN
2 TATPG
3 PASS
4 ATPG
However, it does not make use of pandas' native ordering of categorical variables.
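A sketch that stays inside pandas' categorical machinery, assuming all three columns share the same ordered categories as above: compare the underlying integer codes, take the row-wise minimum, and rebuild an ordered categorical from the result.

cats = ['CHAIN', 'ATPG', 'TATPG', 'PASS', 'FAIL']

# each categorical column's .cat.codes is an integer Series ordered like the categories
codes = df[['A', 'B', 'C']].apply(lambda s: s.cat.codes)

# the row-wise min of the codes picks the smallest category per row;
# from_codes turns those integers back into an ordered categorical
df['D'] = pd.Categorical.from_codes(codes.min(axis=1), categories=cats, ordered=True)

This avoids the object fallback entirely: min() runs on plain integers, and the category dtype is restored at the end.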
I want to manipulate a data set to make it suitable for ANOVA testing. The df is currently structured like df1: many data points of several types, separated by contextual categories. As I understand it (which may be wrong), I need to restructure the df so that it more closely resembles df2. I'm sure it involves melt and sort, but I'm not sure how to get all the way there. What's the way to do this, or is there a better way to do ANOVA testing on this kind of data?
The real df I'm using has hundreds of data points, and many more types and categories, so it has to be a solution that can be applied realistically to more than 6 values.
df1 = pd.DataFrame({'length': [1, 2, 3, 4, 5, 6],
                    'width': [1, 2, 3, 4, 5, 6],
                    'type': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'type2': ['x', 'y', 'x', 'y', 'y', 'x']})
df2 = pd.DataFrame({'A(x) length': [*length values that are types A,X*],
                    'B(x) length': [*length values that are types B,X*],
                    'C(x) length': [*length values that are types C,X*]})
Edit: I've updated df2 to more accurately reflect what I'm asking. Maybe df restructuring isn't the answer; how would I write the ANOVA call to apply the test to df1?
fvalue, pvalue = f_oneway(df2[*Axlength*], df2[*Bxlength*], df2[*Cxlength*])
The exact expected output remains unclear, but you might want:
df2 = df1.melt(['type', 'type2'])
group = df2['type'] + '(' + df2['type2'] + ') ' + df2['variable']
df2 = df2.groupby(group)['value'].agg(list)
Output:
A(x) length [1]
A(x) width [1]
A(y) length [4]
A(y) width [4]
B(y) length [2, 5]
B(y) width [2, 5]
C(x) length [3, 6]
C(x) width [3, 6]
Name: value, dtype: object
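From there, a sketch of actually running the test with scipy's f_oneway, one sample per (type, type2) group. With this toy data most groups contain a single value, so the result is degenerate; the real df with many points per group is what makes it meaningful.

from scipy.stats import f_oneway

# one list of length values per (type, type2) group
groups = df1.groupby(['type', 'type2'])['length'].apply(list)
fvalue, pvalue = f_oneway(*groups)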
I have a list of 1s and 0s where each element corresponds to a column of a DataFrame, for example:
df.columns = ['a', 'b', 'c']
binary_list = [0, 1, 0]
Based on that, I want to select only column b from the DataFrame, since b is the only column whose entry in my binary list is 1.
Is there a way to perform that in pandas?
P.S. this is my first time posting on Stack Overflow; apologies if I am not following a specific styling.
If the binary list is aligned with the columns, you can use boolean indexing:
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
binary_list = [0,1,0]
df.loc[:, map(bool, binary_list)]
Output:
b
0 2
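An equivalent sketch using an explicit NumPy boolean mask, which also makes it easy to recover the selected column names:

import numpy as np

mask = np.asarray(binary_list, dtype=bool)
df.loc[:, mask]       # same selection as above
df.columns[mask]      # Index(['b'], dtype='object')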
I have a DataFrame with a multi-index, and each row has a value. I want to unstack it, but it has duplicated keys. I want to convert it into a wide format where one level of the multi-index becomes the row labels, the other level becomes the columns, and the cells hold the values from the original df; if a value doesn't exist, the cell is NaN.
How should I do this?
See if this example helps you:
import pandas as pd

df = pd.DataFrame(data={'value': [1, 3, 5, 8, 5, 7]},
                  index=[[1, 2, 3, 4, 5, 6], ['A', 'B', 'A', 'C', 'C', 'A']])
df = df.reset_index(level=1)
df = df.pivot_table(values='value', index=df.index, columns='level_1')
Original data:
     value
1 A      1
2 B      3
3 A      5
4 C      8
5 C      5
6 A      7
Result:
level_1    A    B    C
1        1.0  NaN  NaN
2        NaN  3.0  NaN
3        5.0  NaN  NaN
4        NaN  NaN  8.0
5        NaN  NaN  5.0
6        7.0  NaN  NaN
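Note that pivot_table silently aggregates duplicated (index, column) pairs, taking the mean by default; that is why it succeeds where unstack raises on duplicates. If averaging is not what you want, a sketch of passing a different aggfunc, using the intermediate frame from just after reset_index:

# keep the first of the duplicated values instead of averaging them
df.pivot_table(values='value', index=df.index, columns='level_1', aggfunc='first')

# or collect all duplicated values into lists
df.pivot_table(values='value', index=df.index, columns='level_1', aggfunc=list)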
My goal is to transform the array into a DataFrame, and the error occurs only at the columns=... argument:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=[housing.columns, 'rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Consequently, it returns
AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 12
It says I only passed in 4 columns, but housing.columns itself has 9 columns.
Here is what I get when I run housing.columns:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity'],
      dtype='object')
So, my question is: how can I merge the existing columns (housing.columns) with the 3 new columns ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']?
You can use Index.union to add a list of columns to existing dataframe columns:
columns = housing.columns.union(
    ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'],
    sort=False)
Or convert to a list and then add the remaining columns as a list:
columns = (housing.columns.tolist() +
           ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Then:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=columns)
An example:
Assume this df:
df = pd.util.testing.makeDataFrame()
print(df.columns)
#Index(['A', 'B', 'C', 'D'], dtype='object')
When you put this into a list:
[df.columns, 'E', 'F', 'G']
you get a 4-item list whose first element is itself an Index, which is exactly why pandas counts only 4 columns:
[Index(['A', 'B', 'C', 'D'], dtype='object'), 'E', 'F', 'G']
versus when you use union:
df.columns.union(['E','F','G'],sort=False)
You get:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
Currently I'm trying to automate scheduling.
I'll get the requirement as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So, I want to put the value '*' as a marker meaning end of table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index (column and row names, or index numbers).
Is there any way to find the index (or a list of indexes) of a certain value, like a coordinate?
For example, when the DataFrame is like below,

  column_1 column_2
1      'a'      'b'
2      'c'      'd'

how can I get 'column_2' and '2' from the value 'd'? It's something like the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
# [('column_2', [1])]
Note that it returns the numeric (0-based) index, not the custom (1-based) index you have. If you have a fixed offset you could just add a +1 or whatever to the output.
If I understand what you are looking for: find the (index value, column location) for a value in a DataFrame. You can use a list comprehension in a loop. It probably won't be the fastest if your DataFrame is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution should contain row and column index.
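If you want the index and column labels rather than positional indices, a small sketch on the same df_tmp that maps the positions back through .index and .columns:

rows, cols = np.where(df_tmp == 'd')
[(df_tmp.index[r], df_tmp.columns[c]) for r, c in zip(rows, cols)]
# [(2, 'column_2')]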
Hope this helps!
To search for a single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search for a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
This also works when the value occurs in multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
column_1 column_2
1 a b
2 NaN d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
The stacked result is a Series with a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
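If that order matters, a sketch that loops over the searched values explicitly and groups the matches per value:

values = ['d', 'b', 'a']
[(v, df[df == v].stack().index.tolist()) for v in values]
# [('d', [(2, 'column_2')]), ('b', [(1, 'column_2')]), ('a', [(1, 'column_1')])]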
Had a similar need, and this worked perfectly:
# deal with case sensitivity by upper-casing every string cell first
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)

# positional index of the first row containing the value
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)

# positional index of the first column containing the value
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)

# do whatever you want, e.g. replace the value in the cell above
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'

Note that .index(True) only finds the first occurrence of the value.