I have a Pandas DataFrame that has two categorical columns:
import pandas as pd

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"]}).astype("category")
x y
0 A L
1 A L
2 B M
3 B M
4 B M
5 C N
6 C M
I would like to combine the two columns and store the results as a new categorical column but separated by " - ". One naive way of doing this is to convert the columns to strings:
df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))
x y z
0 A L A - L
1 A L A - L
2 B M B - M
3 B M B - M
4 B M B - M
5 C N C - N
6 C M C - M
This works for a small toy example, but I need z to be of category dtype (not string). My x and y contain categorical strings (with 88903 and 39132 categories, respectively) that may be 50-100 characters long, across around 500K rows, so converting these columns to strings first causes memory to explode.
Is there a more efficient way to get a categorical output without using a ton of memory and taking too long?
You can try this:
import pandas as pd
from itertools import product
# original data
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"]}).astype("category")
# extract unique categories
c1 = df.x.cat.categories
c2 = df.y.cat.categories
# make data frame with all possible category combinations
df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])
# create desired column
df_cats = df_cats.assign(grp=df_cats.x.astype('str') + ' - ' + df_cats.y.astype('str'))
# join this column to the original data
pd.merge(df, df_cats, how="left", on=["x", "y"])
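Note that the merged grp column is still object dtype, and with ~88903 × ~39132 categories the full Cartesian product itself would be enormous. A minimal sketch of a variant that builds labels only for the pairs that actually occur and casts the result to category (the names pairs and out are made up for illustration):
# build the combined label only for observed (x, y) pairs
pairs = df[["x", "y"]].drop_duplicates()
pairs["z"] = (pairs["x"].astype(str) + " - " + pairs["y"].astype(str)).astype("category")
# attach the categorical column back to the original rows
out = df.merge(pairs, on=["x", "y"], how="left")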
I have a function with 3 parameters:
def foo(df, columns, additional_col=None):
    df = df[columns + additional_col]
If the additional_col parameter is passed, it should be appended to columns; otherwise columns alone should be used as the column selection.
Example:
columns = ["A", "B", "C", "D"]
additional_col = ["X", "Y"]
If additional_col is passed when calling foo, the column selection should be
df[["A", "B", "C", "D", "X", "Y"]]; if additional_col is None, it should be df[["A", "B", "C", "D"]].
I tried join, map, and split but couldn't achieve the desired output.
Thanks
Firstly, make a copy of the columns list to prevent the side effect of extending the caller's original list.
If additional_col is a non-empty list, it evaluates to True in an if-statement.
So if additional_col has items, you can extend the copied columns list with extend().
If it does not, just use the columns list as-is.
Here is the code:
import pandas as pd

def foo(df, columns, additional_col=None):
    # copy so the caller's list is not mutated
    columns = list(columns)
    if additional_col:
        columns.extend(additional_col)
    return df[columns]

data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                     "X": ["a", "b", "c"], "Y": ["d", "e", "f"]})
cols = ["A", "B", "C"]
a = ["X", "Y"]
print(foo(data, cols, a))
print("-------------------")
print(foo(data, cols))
Output:
A B C X Y
0 1 4 7 a d
1 2 5 8 b e
2 3 6 9 c f
-------------------
A B C
0 1 4 7
1 2 5 8
2 3 6 9
This question already has answers here:
Python & Pandas: How to query if a list-type column contains something?
(7 answers)
How to filter a DataFrame column of lists for those that contain a certain item
(1 answer)
Closed 7 months ago.
I have a pandas dataframe with lists. I want to be able to search using one item in the list. For example,
import pandas as pd
# initialize list elements
data = [[10, ["a", "b", "c"]], [20, ["d", "e", "f"]], [30, ["c"]]]
# create the pandas DataFrame with column names provided explicitly
df = pd.DataFrame(data, columns=["Numbers", "Characters"])
# print data
df
   Numbers Characters
0       10  [a, b, c]
1       20  [d, e, f]
2       30        [c]
If I search for "e", the output should be:
   Numbers Characters
1       20  [d, e, f]
df[df.apply(lambda x: "e" in x.Characters, axis=1)]
searchCharacter = 'e'
df[df['Characters'].apply(lambda x: len(set(x).intersection({searchCharacter})) > 0)]
Numbers Characters
1 20 [d, e, f]
# explode the lists to one row per element, then test each element
s = df['Characters'].explode() == 'e'
# keep the original rows whose exploded elements matched
df.loc[s[s].index]
Result
Numbers Characters
1 20 [d, e, f]
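If you need to match any of several items, a sketch extending the explode idea (the wanted list is illustrative):
# rows whose list contains any of the wanted items
wanted = ["e", "c"]
s = df['Characters'].explode().isin(wanted)
df.loc[s[s].index.unique()]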
Given a list of column names, only some or none of which exist in a dataframe, what's the least verbose way of getting the first existing column or None?
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
cols = ["d", "e", "c"]
This is fairly short but raises StopIteration when no column matches:
col = next(filter(lambda c: c in df, cols))
df[col]
0 3
1 6
Name: c, dtype: int64
Is there a better way?
You can do it by giving next a default, which is returned when the iterator is exhausted:
col = next(filter(lambda c: c in df, cols), None)
One idea (note that Index.intersection with sort=False keeps df.columns order rather than cols order, which matters if several columns match):
col = next(iter(df.columns.intersection(cols, sort=False)), None)
@Learnings is a mess answered it beautifully and you should use that solution, but here is another one-line solution with the walrus operator:
col = intersect[0] if (intersect := [c for c in cols if c in df.columns]) else None
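A sketch of wrapping the generator version into a small reusable helper (the name first_existing_col is made up):
def first_existing_col(df, candidates):
    # return the first candidate that is a column of df, else None
    return next((c for c in candidates if c in df.columns), None)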
Using Pandas, I'd like to "groupby" and calculate the mean values for each group of my Dataframe. I do it like this:
import pandas as pd

data = {
    "group": ["A", "B", "C", "A", "A", "B", "B", "C", "A"],
    "value": [5, 6, 8, 7, 3, 9, 4, 6, 5]
}

df = pd.DataFrame(data)
print(df)

g = df.groupby('group').mean()
print(g)
Which gives me:
value
group
A 5.000000
B 6.333333
C 7.000000
However, I'd like to exclude groups which have, let's say, less than 3 entries (so that the mean has somewhat of a value). In this case, it would exclude group "C" from the results. How can I implement this?
Filter the groups based on their size and then take the mean. Note that this first line gives the overall mean of the remaining rows, not per-group means:
overall = df.groupby('group').filter(lambda x: len(x) >= 3).mean(numeric_only=True)
# if you want the mean group-wise after filtering out the small groups
result = df.groupby('group').filter(lambda x: len(x) >= 3).groupby('group').mean().reset_index()
Output:
group value
0 A 5.000000
1 B 6.333333
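If grouping twice feels wasteful, a minimal sketch of a variant that computes group sizes once with transform (assuming the same df as above):
# keep rows whose group has at least 3 members, then aggregate once
sizes = df.groupby('group')['value'].transform('size')
result = df[sizes >= 3].groupby('group')['value'].mean().reset_index()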
I want to merge rows whose values belong to the same group in a lookup table.
I have two pandas dataframes:
Name Values
"ABC" ["A", "B", "C"]
"DEF" ["D", "E", "F"]
and
Value First Second
"A" 0 5
"B" 2 1
"C" 3 5
"Z" 3 0
I would like to get:
Name First Second
"ABC" 5 11
"Z" 3 0
Is there an easy way to do that? I didn't find a good approach.
Try with explode, then map the key; unmatched values (like "Z") fall back to their own value as the group key:
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name'])
out = df.groupby(s.fillna(df['Value'])).agg({'First': 'sum', 'Second': 'sum'}).rename_axis('Name').reset_index()
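For reference, a minimal sketch of the two input frames under the answer's naming (df2 for the Name/Values lookup, df for the value table; the names are assumptions, since the question doesn't name its frames):
import pandas as pd

# lookup table: group name -> member values (assumed name: df2)
df2 = pd.DataFrame({"Name": ["ABC", "DEF"],
                    "Values": [["A", "B", "C"], ["D", "E", "F"]]})
# value table to aggregate (assumed name: df)
df = pd.DataFrame({"Value": ["A", "B", "C", "Z"],
                   "First": [0, 2, 3, 3],
                   "Second": [5, 1, 5, 0]})
With that setup, out matches the desired result:
  Name  First  Second
0  ABC      5      11
1    Z      3       0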