I have a Pandas DataFrame that has two categorical columns:
import pandas as pd

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"]}).astype("category")
x y
0 A L
1 A L
2 B M
3 B M
4 B M
5 C N
6 C M
I would like to combine the two columns and store the results as a new categorical column but separated by " - ". One naive way of doing this is to convert the columns to strings:
df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))
x y z
0 A L A - L
1 A L A - L
2 B M B - M
3 B M B - M
4 B M B - M
5 C N C - N
6 C M C - M
This works for a small toy example, but I need z to be of category dtype (not string). My x and y contain categorical strings (with 88903 and 39132 categories, respectively) that may be 50-100 characters long, across around 500K rows, so converting these columns to strings first causes memory to explode.
Is there a more efficient way to get a categorical output without using a ton of memory and taking too long?
You can try this:
import pandas as pd
from itertools import product
# original data
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"]}).astype("category")
# extract unique categories
c1 = df.x.cat.categories
c2 = df.y.cat.categories
# make data frame with all possible category combinations
df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])
# create desired column
df_cats = df_cats.assign(grp=df_cats.x.astype('str') + ' - ' + df_cats.y.astype('str'))
# join this column to the original data
pd.merge(df, df_cats, how="left", on=["x", "y"])
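Note that the merged grp column is still object dtype, and with ~88903 × ~39132 categories the full Cartesian product itself would be enormous. A minimal sketch of a variant that builds labels only for the pairs that actually occur and casts the result to category (the names pairs and out are made up for illustration):
# build the combined label only for observed (x, y) pairs
pairs = df[["x", "y"]].drop_duplicates()
pairs["z"] = (pairs["x"].astype(str) + " - " + pairs["y"].astype(str)).astype("category")
# attach the categorical column back to the original rows
out = df.merge(pairs, on=["x", "y"], how="left")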
I have a function with 3 parameters:
def foo(df, columns, additional_col=None):
    df = df[columns + additional_col]
If the additional_col parameter is passed, it should be appended to columns; otherwise columns alone should be used as the column selection.
Example:
columns = ["A", "B", "C", "D"]
additional_col = ["X", "Y"]
If additional_col is passed when calling foo, the column selection should be
df[["A", "B", "C", "D", "X", "Y"]]; if additional_col is None, it should be df[["A", "B", "C", "D"]].
I tried join, map, and split but couldn't achieve the desired output.
Thanks
Firstly, make a copy of the columns list to prevent the side effect of extending the caller's original list.
If additional_col is a non-empty list, it evaluates to True in an if-statement.
So if additional_col has items, you can extend the copied columns list with extend().
If it does not, just use the columns list as-is.
Here is the code:
import pandas as pd

def foo(df, columns, additional_col=None):
    # copy so the caller's list is not mutated
    columns = list(columns)
    if additional_col:
        columns.extend(additional_col)
    return df[columns]

data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                     "X": ["a", "b", "c"], "Y": ["d", "e", "f"]})
cols = ["A", "B", "C"]
a = ["X", "Y"]
print(foo(data, cols, a))
print("-------------------")
print(foo(data, cols))
Output:
A B C X Y
0 1 4 7 a d
1 2 5 8 b e
2 3 6 9 c f
-------------------
A B C
0 1 4 7
1 2 5 8
2 3 6 9
This question already has answers here:
Python & Pandas: How to query if a list-type column contains something?
(7 answers)
How to filter a DataFrame column of lists for those that contain a certain item
(1 answer)
Closed 7 months ago.
I have a pandas dataframe with lists. I want to be able to search using one item in the list. For example,
import pandas as pd
# initialize list elements
data = [[10, ["a", "b", "c"]], [20, ["d", "e", "f"]], [30, ["c"]]]
# create the pandas DataFrame with column names provided explicitly
df = pd.DataFrame(data, columns=["Numbers", "Characters"])
# print data
df
   Numbers Characters
0       10  [a, b, c]
1       20  [d, e, f]
2       30        [c]
If I search for "e", the output should be:
   Numbers Characters
1       20  [d, e, f]
df[df.apply(lambda x: "e" in x.Characters, axis=1)]
searchCharacter = 'e'
df[df['Characters'].apply(lambda x: len(set(x).intersection({searchCharacter})) > 0)]
Numbers Characters
1 20 [d, e, f]
# explode the lists to one row per element, then test each element
s = df['Characters'].explode() == 'e'
# keep the original rows whose exploded elements matched
df.loc[s[s].index]
Result
Numbers Characters
1 20 [d, e, f]
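If you need to match any of several items, a sketch extending the explode idea (the wanted list is illustrative):
# rows whose list contains any of the wanted items
wanted = ["e", "c"]
s = df['Characters'].explode().isin(wanted)
df.loc[s[s].index.unique()]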
Given a list of column names, only some or none of which exist in a dataframe, what's the least verbose way of getting the first existing column or None?
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
cols = ["d", "e", "c"]
This is fairly short but raises StopIteration when no column matches:
col = next(filter(lambda c: c in df, cols))
df[col]
0 3
1 6
Name: c, dtype: int64
Is there a better way?
You can do it by giving next a default, which is returned when the iterator is exhausted:
col = next(filter(lambda c: c in df, cols), None)
One idea (note that Index.intersection with sort=False keeps df.columns order rather than cols order, which matters if several columns match):
col = next(iter(df.columns.intersection(cols, sort=False)), None)
@Learnings is a mess answered it beautifully and you should use that solution, but here is another one-line solution with the walrus operator:
col = intersect[0] if (intersect := [c for c in cols if c in df.columns]) else None
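A sketch of wrapping the generator version into a small reusable helper (the name first_existing_col is made up):
def first_existing_col(df, candidates):
    # return the first candidate that is a column of df, else None
    return next((c for c in candidates if c in df.columns), None)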
Using Pandas, I'd like to "groupby" and calculate the mean values for each group of my Dataframe. I do it like this:
import pandas as pd

data = {
    "group": ["A", "B", "C", "A", "A", "B", "B", "C", "A"],
    "value": [5, 6, 8, 7, 3, 9, 4, 6, 5]
}

df = pd.DataFrame(data)
print(df)

g = df.groupby('group').mean()
print(g)
Which gives me:
value
group
A 5.000000
B 6.333333
C 7.000000
However, I'd like to exclude groups which have, let's say, less than 3 entries (so that the mean has somewhat of a value). In this case, it would exclude group "C" from the results. How can I implement this?
Filter the groups based on their size and then take the mean. Note that this first line gives the overall mean of the remaining rows, not per-group means:
overall = df.groupby('group').filter(lambda x: len(x) >= 3).mean(numeric_only=True)
# if you want the mean group-wise after filtering out the small groups
result = df.groupby('group').filter(lambda x: len(x) >= 3).groupby('group').mean().reset_index()
Output:
group value
0 A 5.000000
1 B 6.333333
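If grouping twice feels wasteful, a minimal sketch of a variant that computes group sizes once with transform (assuming the same df as above):
# keep rows whose group has at least 3 members, then aggregate once
sizes = df.groupby('group')['value'].transform('size')
result = df[sizes >= 3].groupby('group')['value'].mean().reset_index()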
I want to merge rows whose values belong to the same group in a lookup table.
I have two pandas dataframes:
Name Values
"ABC" ["A", "B", "C"]
"DEF" ["D", "E", "F"]
and
Value First Second
"A" 0 5
"B" 2 1
"C" 3 5
"Z" 3 0
I would like to get:
Name First Second
"ABC" 5 11
"Z" 3 0
Is there an easy way to do that? I didn't find a good approach.
Try with explode, then map the key; unmatched values (like "Z") fall back to their own value as the group key:
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name'])
out = df.groupby(s.fillna(df['Value'])).agg({'First': 'sum', 'Second': 'sum'}).rename_axis('Name').reset_index()
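For reference, a minimal sketch of the two input frames under the answer's naming (df2 for the Name/Values lookup, df for the value table; the names are assumptions, since the question doesn't name its frames):
import pandas as pd

# lookup table: group name -> member values (assumed name: df2)
df2 = pd.DataFrame({"Name": ["ABC", "DEF"],
                    "Values": [["A", "B", "C"], ["D", "E", "F"]]})
# value table to aggregate (assumed name: df)
df = pd.DataFrame({"Value": ["A", "B", "C", "Z"],
                   "First": [0, 2, 3, 3],
                   "Second": [5, 1, 5, 0]})
With that setup, out matches the desired result:
  Name  First  Second
0  ABC      5      11
1    Z      3       0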