Get column-index from column-name in pandas? [duplicate] - pandas

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?

Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.

Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]

DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()

For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels) get_indexer_for. It takes the same args as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing with, f.i. for float values taking the nearest value with a tolerance. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)

When you might be looking to find multiple column matches, a vectorized solution using searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
cols = df.columns.values
sidx = np.argsort(cols)
return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
apple banana pear orange peach
0 8 3 4 4 2
1 4 4 3 0 1
2 1 2 6 8 1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])

Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values()[location]
Using #DSM Example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values()[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location] #(thanks to #roobie-nuby for pointing that out in comments.)

To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.

how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]

When the column might or might not exist, then the following (variant from above works.
ix = 'none'
try:
ix = list(df.columns).index('Col_X')
except ValueError as e:
ix = None
pass
if ix is None:
# do something

import random
def char_range(c1, c2): # question 7001144
for c in range(ord(c1), ord(c2)+1):
yield chr(c)
df = pd.DataFrame()
for c in char_range('a', 'z'):
df[f'{c}'] = random.sample(range(10), 3) # Random Data
rearranged = random.sample(range(26), 26) # Random Order
df = df.iloc[:, rearranged]
print(df.iloc[:,:15]) # 15 Col View
for col in df.columns: # List of indices and columns
print(str(df.columns.get_loc(col)) + '\t' + col)
![Results](Results

Related

Pandas: how do I select the columns of a dataframe that are of type list?

I have a pandas dataframe which some columns are of type list. All my columns are of the same type (or all elements are a list or not). How do I select just these columns?
If I get the dtypes they all are of object type:
In [44]: df = pd.DataFrame([['a', [1,2]], ['b', [3,4]]], columns=['x', 'y'])
In [45]: df
Out[45]:
x y
0 a [1, 2]
1 b [3, 4]
In [46]: df.dtypes
Out[46]:
x object
y object
dtype: object
You can write your own function that designates whether or not a column contains all lists:
def all_lists(array):
return all(isinstance(x, list) for x in array)
list_types = df.apply(all_lists)
print(list_types)
x False
y True
dtype: bool
Then to subset our data using the list_types Series:
list_type_subset = df.loc[:, list_types]
print(list_type_subset)
y
0 [1, 2]
1 [3, 4]
Some important notes:
If you have a huge dataframe checking if each value is a list object can be time consuming, since we're iterating over each value and checking by hand.
list is not an actual dtype- the "object" dtype is essentially a catchall used by pandas when the dtype is not a number, boolean, datetime, or other official dtype. This is why we have to do things in a slightly roundabout way.
I just got an ugly way to do it:
df.loc[:, [name for name, typ in df.iloc[0].map(type).items() if typ==list]]
The result is:
Out[66]:
y
0 [1, 2]
1 [3, 4]
I'd accept a more elegant answer :-)

Compare two pandas data frame from csv

I have 2 csv files and i need to compare them using by pandas. The values in these two files are the same so I expect the df result to be empty but it shows to me they are different. Do you think i miss something when i read csv files? or another things to test/fix?
df1=pd.read_csv('apc2019.csv', sep = '|', lineterminator=True)
df2=pd.read_csv('apc2020.csv', sep = '|', lineterminator=True)
df = pd.concat([df1,df2]).drop_duplicates(keep=False)
print(df)
I'd recommend to find what's the difference first, but it is hard with the pd.equals since it will only give you either True or False, can you try this?
from pandas._testing import assert_frame_equal
assert_frame_equal(df1, df2)
This will tell you exactly the difference, and it has different levels of 'tolerance' (for example if you don't care about the column names, of the types etc)
Details here
If you want to compare with a tolerance in values:
In [20]: from pandas._testing import assert_frame_equal
...: df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [1, 9]})
...: df2 = pd.DataFrame({'a': [1, 2], 'b': [3, 5], 'c': [1.5, 8.5]})
In [21]: assert_frame_equal(df1, df2, check_less_precise=-1, check_dtype=False)
By defaut chekc_dtype is True, so it will raise an exception if you have floats vs ints.
The other parameter to change is the check_less_precise by using negatives you make the allowed error bigger

NumPy: generalize one-hot encoding to k-hot encoding

I'm using this code to one-hot encode values:
idxs = np.array([1, 3, 2])
vals = np.zeros((idxs.size, idxs.max()+1))
vals[np.arange(idxs.size), idxs] = 1
But I would like to generalize it to k-hot encoding (where shape of vals would be same, but each row can contain k ones).
Unfortunatelly, I can't figure out how to index multiple cols from each row. I tried vals[0:2, [[0, 1], [3]] to select first and second column from first row and third column from second row, but it does not work.
It's called advanced-indexing.
to select first and second column from first row and third column from second row
You just need to pass the respective rows and columns in separate iterables (tuple, list):
In [9]: a
Out[9]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
In [10]: a[[0, 0, 1],[0, 1, 3]]
Out[10]: array([0, 1, 8])

Indexing a sub-array by lists [duplicate]

This question already has an answer here:
Assign values to numpy.array
(1 answer)
Closed 5 years ago.
I have some array A and 2 lists of indices ind1 and ind2, one for each axis. Now this gives me a slice of the array, to which I need to assign some new values. Problem is, my approach for this does not work.
Let me demonstrate with an example. First I create an array, and try to access some slice:
>>> A=numpy.arange(9).reshape(3,3)
>>> ind1, ind2 = [0,1], [1,2]
>>> A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> A[ind1,ind2]
array([1, 5])
Now this just gives me 2 values, not the 2-by-2 matrix I was going for. So I tried this:
>>> A[ind1,:][:,ind2]
array([[1, 2],
[4, 5]])
Okay, better. Now let's say these value should be 0:
>>> A[ind1,:][:,ind2]=0
>>> A
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
If I try to assign like this, the array A does not get updated, because of the double indexing (I am only assigning to some copy of A, which gets discarded). Is there some way to index the sub array by just indexing once?
Note: Indexing by selecting some appropriate range like A[:2,1:3] would work for this example, but I need something that works with any arbitrary list of indices.
What about using meshgrid to create your 2d-indexes? As follows
>>> import numpy as np
>>> A = np.arange(9).reshape(3,3)
>>> ind1, ind2 = [0,1],[1,2]
>>> ind12 = np.meshgrid(ind1,ind2, indexing='ij')
>>> # = np.ix_(ind1,ind2) as pointed out by #Divakar
>>> A[ind12]
[[1 2]
[4 5]]
And finally
>>> A[ind12] = 0
>>> A
[[0 0 0]
[3 0 0]
[6 7 8]]
Which works with any arbitrary list of indices.
>>> ind1, ind2 = [0,2],[0,2]
>>> ind12 = np.meshgrid(ind1,ind2, indexing='ij')
>>> A[ind12] = 100
[[100 1 100]
[ 3 4 5]
[100 7 100]]
As pointed out by #hpaulj in comments, note that np.ix_(ind1,ind2) is actually equivalent to the following use of np.meshgrid,
>>> np.meshgrid(ind1,ind2, indexing='ij', sparse=True)
Which is a priori even more efficient. This is a major point in the np.ix_'s favor when the parameters indexing and sparse are constantly set to 'ij' and True respectively.

Seaborn groupby pandas Series

I want to visualize my data into box plots that are grouped by another variable shown here in my terrible drawing:
So what I do is to use a pandas series variable to tell pandas that I have grouped variables so this is what I do:
import pandas as pd
import seaborn as sns
#example data for reproduciblity
a = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
])
#converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
#Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, what I would have expected to get was to have two boxplots each describing only the first column, grouped by their corresponding column in the second column (the column converted to Series), while the above plot shows each column separately which is not what I want.
A column in a Dataframe is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
#example data for reproduciblity
df = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit, giving columns a label makes it a bit more clear in my opinion.
edit:
If you want to plot all columns separately you (i think) basically want all combinations of the values in your groupby column and any other column. So if you Dataframe looks like this:
a b grouper
0 2 5 1
1 4 9 2
2 5 3 1
3 10 6 2
4 9 7 2
5 3 11 1
And you want boxplots for columns a and b while grouped by the column grouper. You should flatten the columns and change the groupby column to contain values like a1, a2, b1 etc.
Here is a crude way which i think should work, given the Dataframe shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are more fancy ways of restructuring the Dataframe. Especially the flattening of the hierarchy after pivoting is hard to read, i dont like it.
This is a new answer for an old question because in seaborn and pandas are some changes through version updates. Because of this changes the answer of Rutger is not working anymore.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the log:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1] ,[4, 2],[5, 1],
[10, 2],[9, 2],[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn with x and y as parameter
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
# usinge pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
Example 3: rearanging the DataFrame to pass is directly to seaborn
def df_rename_by_group(data:pd.DataFrame, col:str)->pd.DataFrame:
'''This function takes a DataFrame, groups by one column and returns
a new DataFrame where the old columnnames are extended by the group item.
'''
grouper = df.groupby(col)
max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
_df = pd.DataFrame(index=range(max_length_of_group))
for i in grouper.groups.keys():
helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
helper.reset_index(drop=True, inplace=True)
_df = _df.join(helper)
return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() doesnot take groupby.
Probably you are gonna see
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
The best idea to group data and use in boxplot passing the data as groupby dataframe value.
import seaborn as sns
grouDataFrame = nameDataFrame(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here B column data contains numeric value and grouped is done on the basis of A. All the grouped value with their respective column are added and boxplot diagram is plotted. Hope this helps.