pandas DataFrame generic column values

Is there a consistent way of getting pandas column values via DF['ColName'], including index columns? If 'ColName' is an index column, you get a KeyError.
It is very inconvenient to have to determine, every time a column name is passed in, whether it is an index column or a regular column, and then handle it differently.
Thank you.

Consider the DataFrame df:
import pandas as pd

df = pd.DataFrame(
    dict(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=['x', 'y', 'z'],
    ),
    pd.MultiIndex.from_tuples(
        [
            ('cat', 'red'),
            ('dog', 'blue'),
            ('bird', 'yellow'),
        ],
        names=['species', 'color']
    )
)
print(df)
                A  B  C
species color
cat     red     1  4  x
dog     blue    2  5  y
bird    yellow  3  6  z
You can always refer to levels of the index in the same way you'd refer to columns if you reset_index() first.
Grab column 'A':
df.reset_index()['A']
0 1
1 2
2 3
Name: A, dtype: int64
Grab 'color' without reset_index():
df['color']
> KeyError
With reset_index():
df.reset_index()['color']
0 red
1 blue
2 yellow
Name: color, dtype: object
This doesn't come without its downside: that index was potentially useful to have for column 'A'.
df['A']
species  color
cat      red      1
dog      blue     2
bird     yellow   3
Name: A, dtype: int64
The index stays automatically aligned with the values of column 'A', which was the whole point of it being the index.
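If you need a single accessor that works for both regular columns and index levels without resetting the index, here is a minimal sketch of a helper; the name get_col and its behavior are assumptions for illustration, not a pandas API:
def get_col(df, name):
    # Return `name` as a Series, whether it is a regular column
    # or a level of the (possibly Multi-) index.
    if name in df.columns:
        return df[name]
    if name in (df.index.names or []):
        # get_level_values works for flat and MultiIndex alike
        return df.index.get_level_values(name).to_series(index=df.index, name=name)
    raise KeyError(name)
With the df above, get_col(df, 'A') and get_col(df, 'color') both return a Series aligned to the original index.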

Related

How to drop a row if the first column's value is empty?

I have a dataframe in which the 1st column is empty in some of the rows, and I want to drop those rows. I saw this as one way to drop a row:
df = df.dropna(axis=0, subset=['1st_row'])
I don't know the column names and I want to drop by column index (the 1st column). Is that possible?
You could select columns (or rows) by position using iloc.
For instance, the following drops all rows where the first column is null:
df = pd.DataFrame({
    'a': [pd.NA, 1, 2, 3, pd.NA, 4, 5],
    'b': list('abcdefg')
})
df2 = df[df.iloc[:, 0].notnull()]
df2 outputs:
a b
1 1 b
2 2 c
3 3 d
5 4 f
6 5 g
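If you'd rather stick with dropna, you can resolve the first column's label by position and pass it to subset; a small sketch, assuming the same df as above:
# Equivalent: look up the first column's name positionally,
# then let dropna do the filtering.
df2 = df.dropna(subset=[df.columns[0]])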

Re-define dataframe index with map function

I have a dataframe like this. I want to know how I can apply a map function to its index to rename it into an easier format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
d
apple_017 1
orange_054 2
orange_061 3
orange_053 4
There are only two labels in the indices of the dataframe, so each entry is either apple or orange in this case, and this is how I tried:
data.index = data.index.map(i = "apple" if "apple" in i else "orange")
(Apparently it's not how it works)
Desired output:
d
apple 1
orange 2
orange 3
orange 4
Appreciate anyone's help and suggestion!
Try via split():
df.index = df.index.str.split('_').str[0]
OR via map():
df.index = df.index.map(lambda x: 'apple' if 'apple' in x else 'orange')
output of df:
d
apple 1
orange 2
orange 3
orange 4
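If the suffix is always an underscore followed by digits, a regex-based variant of the split approach achieves the same result (the pattern here is an assumption about the label format):
# Strip a trailing "_<digits>" suffix from every index label
df.index = df.index.str.replace(r'_\d+$', '', regex=True)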

Pandas conditional creation of a series/dataframe column for entries containing lists

I started with this post:
Pandas conditional creation of a series/dataframe column
In my specific case my data look like this:
Type Set
1 A [1,2,3]
2 B [1,2,3]
3 B [3,2,1]
4 C [2,4,1]
I borrowed the idea of using np.where, so to create a new column based on the last element of the list in each entry, I wrote:
df['color'] = np.where(df['Set'].str[-1] == 3, 'green', 'red')
and this yields:
  Type      Set  color
0    Z  [1,2,3]  green
1    Z  [1,2,3]  green
2    X  [3,2,1]    red
3    Y  [2,4,1]    red
Now I wish to be more flexible: I want to say that if 3 shows up in the list at all, I will assign color=green. I tried using in and isin(), but they don't work with np.where. I'd like to learn what other options I have in a similar format to the above (not using a for loop, if possible).
The desired output:
  Type      Set  color
0    Z  [1,2,3]  green
1    Z  [1,2,3]  green
2    X  [3,2,1]  green
3    Y  [2,4,1]    red
Try with explode:
df['new'] = np.where(df.Set.explode().eq(3).groupby(level=0).any(), 'green', 'red')
df
Out[131]:
         Set    new
0  [1, 2, 3]  green
1  [1, 2, 3]  green
2  [3, 2, 1]  green
3  [2, 4, 1]    red
You can also convert to string and use str.contains:
find = 3
df['color'] = np.where(df['Set'].astype(str).str.contains(str(find)), 'green', 'red')
Or with a DataFrame, where the condition will be:
pd.DataFrame(df['Set'].to_list(), index=df.index).eq(3).any(axis=1)
print(df)
Type Set color
1 A [1, 2, 3] green
2 B [1, 2, 3] green
3 B [3, 2, 1] green
4 C [2, 4, 1] red
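Although the question asks to avoid for loops, for list-valued columns a plain comprehension over the column is often the clearest option and avoids the string round-trip; a sketch, assuming the same df as above:
# 'green' wherever 3 appears anywhere in the list, else 'red'
df['color'] = ['green' if 3 in lst else 'red' for lst in df['Set']]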

Reset the categories of categorical index in Pandas

I have a dataframe with a categorical column.
I remove all the rows having one of the categories.
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
df.color = df.color.astype('category')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
Remove Brown from the dataframe and from its categories:
df = df.query('color != "Brown"')
df.color = df.color.cat.remove_categories('Brown')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
7 Red
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
There's (now?) a pandas method that does exactly that: remove_unused_categories.
Its only parameter, inplace, has been deprecated since pandas 1.2.0, so the following example (based on Scott's answer) does not use inplace:
>>> df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
... df.color = df.color.astype('category')
... df.color.head()
0 Green
1 Brown
2 Blue
3 Red
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
>>> df = df[df.color != "Brown"]
... df.color = df.color.cat.remove_unused_categories()
... df.color.head()
0 Green
2 Blue
3 Red
5 Red
6 Green
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
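If the frame has several categorical columns, the same cleanup can be applied to all of them in one pass; a small sketch, assuming you want every categorical column pruned:
# Prune unused categories from every categorical column
for col in df.select_dtypes('category').columns:
    df[col] = df[col].cat.remove_unused_categories()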

How to deal with list-like data in a pandas DataFrame column

Consider the following example:
I have a table of emails, each with an email id, and two label columns, generated through different code paths, containing a list of labels associated with those emails.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'labels1': [np.array(['red']), np.array(['blue', 'green']), np.array(['blue']), np.nan],
    'labels2': [np.nan, np.nan, np.array(['yellow', 'purple']), np.array(['magenta'])]
})
df
id labels1 labels2
0 1 [red] NaN
1 2 [blue, green] NaN
2 3 [blue] [yellow, purple]
3 4 NaN [magenta]
So, I need a way to produce the following DataFrame:
df_merge
id labels
0 1 [red]
1 2 [blue, green]
2 3 [blue, yellow, purple]
3 4 [magenta]
But using lambda functions as I might do with scalar column data throws a ValueError exception:
df.apply(lambda x: np.unique(np.append(x['labels1'], x['labels2'])), axis=1)
ValueError: Shape of passed values is (4, 2), indices imply (4, 4)
I've tried many different variations on the above, all to no avail. I'm wondering if perhaps array-like column data like this is a pandas anti-pattern, and if so, what are better approaches?
Make the NaNs into empty lists using applymap, then sum across rows:
df[['id']].assign(
    labels=df[['labels1', 'labels2']].applymap(
        lambda x: list(x) if isinstance(x, (list, np.ndarray)) else []
    ).sum(axis=1)
)
id labels
0 1 [red]
1 2 [blue, green]
2 3 [blue, yellow, purple]
3 4 [magenta]
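An alternative that leans on pandas reshaping instead of Python-level list handling: stack the two label columns (stack drops the NaN cells by default), explode the arrays, and collect the unique labels per id. A sketch, assuming the same df as above:
df_merge = (
    df.set_index('id')[['labels1', 'labels2']]
      .stack()              # drops the NaN cells
      .explode()            # one row per individual label
      .groupby(level='id')
      .unique()             # unique labels per id, as arrays
      .rename('labels')
      .reset_index()
)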