Merge dataframe rows if values are in the dictionary - pandas

I want to merge rows whose values belong to the same entry in the dictionary.
I have two pandas dataframes:
Name   Values
"ABC"  ["A", "B", "C"]
"DEF"  ["D", "E", "F"]
and
Value  First  Second
"A"    0      5
"B"    2      1
"C"    3      5
"Z"    3      0
I would like to get:
Name   First  Second
"ABC"  5      11
"Z"    3      0
Is there an easy way to do that? I couldn't find anything that works.

Try with explode, then map each value to its key (here df2 is the Name/Values frame and df the Value/First/Second frame); values with no dictionary entry (like "Z") keep their own name:
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name']).fillna(df['Value'])
out = df.groupby(s)[['First', 'Second']].sum().rename_axis('Name').reset_index()
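A self-contained version with the sample data (the frame names df and df2 are assumptions, since the question doesn't name them):

import pandas as pd

df2 = pd.DataFrame({"Name": ["ABC", "DEF"],
                    "Values": [["A", "B", "C"], ["D", "E", "F"]]})
df = pd.DataFrame({"Value": ["A", "B", "C", "Z"],
                   "First": [0, 2, 3, 3],
                   "Second": [5, 1, 5, 0]})

# one row per (Name, single value), then a value -> Name lookup table
lookup = df2.explode("Values").set_index("Values")["Name"]

# map each value to its group name; unmatched values keep their own name
s = df["Value"].map(lookup).fillna(df["Value"])

out = df.groupby(s)[["First", "Second"]].sum().rename_axis("Name").reset_index()
print(out)
#   Name  First  Second
# 0  ABC      5      11
# 1    Z      3       0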

Related

How to make a column selection in a dataframe when a few columns are passed as a list

I have a function that takes 3 parameters:
def foo(df, columns, additional_col=None):
    df = df[columns + additional_col]  # pseudocode
If the additional_col parameter is passed, it should be appended to columns; otherwise columns alone should be used for the selection.
Example:
columns = ["A", "B", "C", "D"]
additional_col = ["X", "Y"]
If additional_col is passed when calling foo, the column selection should be df[["A", "B", "C", "D", "X", "Y"]]; if additional_col is None, it should be df[["A", "B", "C", "D"]].
I tried join, map and split but couldn't achieve the desired output.
Thanks
First, make a copy of the columns list to prevent the side effect of extending the caller's original list.
If additional_col is a non-empty list, it evaluates as True in an if-statement.
So if additional_col has items, you can extend the columns list using the extend() function.
If it does not have items, then just use the original columns list.
Here is the code:
def foo(df, columns, additional_col=None):
    # copy so the caller's list is not mutated by extend()
    columns = list(columns)
    if additional_col:  # a non-empty list is truthy
        columns.extend(additional_col)
    return df[columns]
import pandas as pd

data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                     "X": ['a', 'b', 'c'], "Y": ['d', 'e', 'f']})
cols = ["A", "B", "C"]
a = ["X", "Y"]
print(foo(data, cols, a))
print("-------------------")
print(foo(data, cols))
Output:
A B C X Y
0 1 4 7 a d
1 2 5 8 b e
2 3 6 9 c f
-------------------
A B C
0 1 4 7
1 2 5 8
2 3 6 9
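The same behavior also fits in one line; a minimal variant, assuming additional_col is always either None or a list, where `or` substitutes an empty list when additional_col is falsy:

def foo(df, columns, additional_col=None):
    # additional_col or [] falls back to an empty list for None
    return df[list(columns) + list(additional_col or [])]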

Cannot append items to end of list and concatenate data frames

I am looping through data frame columns to obtain specific pieces of data - so far I have been successful except for the last three pieces of data I need. When I attempt to append these pieces of data to the list, they are appended to the beginning of the list and not at the end (I need them to be at the end).
Therefore, when I convert this list into a data frame and attempt to concatenate it to another data frame I have prepared, the values are all in the wrong places.
This is my code:
descs = ["a", "b", "c", "d", "e", "f", "g"]
data = []
stats = []

for desc in descs:
    data.append({
        "Description": desc
    })

for column in df:
    if df[column].name == "column1":
        counts = df[column].value_counts()
        stats.extend([
            counts.sum(),
            counts[True],
            counts[False]
        ])
    elif df[column].name == "date_column":
        stats.append(
            df[column].min().date()
        )
    # Everything is fine up until this elif block
    # I THINK THIS IS WHERE THE PROBLEM IS, I DON'T KNOW HOW TO FIX IT
    elif df[column].name == "column2":
        stats.extend([
            df[column].max(),
            round(df[column].loc[df["column1"] == True].agg({column: "mean"}), 2),
            round(df[column].loc[df["column1"] == False].agg({column: "mean"}), 2)
        ])
Up until the second elif block, when I run this code and concatenate data and stats as data frames with pd.concat([pd.DataFrame(data), pd.DataFrame({"Statistic": stats})], axis=1), I get the following output, which is what I want:
Description  Statistic
"a"          38495
"b"          3459
"c"          234
"d"          1984-06-2
"e"          NaN
"f"          NaN
"g"          NaN
When I run the above code chunk including the second elif block, the output is messed up:
Description  Statistic
"a"          [78, [454], [45]]
"b"          38495
"c"          3459
"d"          234
"e"          1984-06-2
"f"          NaN
"g"          NaN
The values in the first row of the data frame, [78, 454, 45], should appear (in that order) where the NaNs are in the first table.
What am I doing wrong?
You're really close to making this work the way you want!
A couple of things to make your life simpler:
df[column].name isn't needed because you can just use column
Looping through columns and having multiple if statements on their names to calculate summary statistics works, but you'll make your life easier if you look into .groupby() with .agg()
And that brings me to your issue: .agg() returns a pandas Series, and you just want a single number. Try
round(df[column].loc[df["column1"] == False].mean(),2)
instead. :)
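In context, the second elif block from the question would become something like this (same column names as the question):

    elif df[column].name == "column2":
        stats.extend([
            df[column].max(),
            # .mean() returns a scalar, so round() now gets a single number
            round(df[column].loc[df["column1"] == True].mean(), 2),
            round(df[column].loc[df["column1"] == False].mean(), 2)
        ])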
Update: it now looks like you are hitting the second elif with the first column, so loop over the columns explicitly, in the order you want them processed:
cols = ["column1", "date_column", "column2"]
for column in cols:
    if column == "column1":
        ...  # same if/elif bodies as above, now evaluated in a fixed order

Combining Two Pandas Categorical Columns into One Column

I have a Pandas DataFrame that has two categorical columns:
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
"y": ["L", "L", "M", "M", "M", "N", "M"],
}).astype("category", "category")
x y
0 A L
1 A L
2 B M
3 B M
4 B M
5 C N
6 C M
I would like to combine the two columns and store the results as a new categorical column but separated by " - ". One naive way of doing this is to convert the columns to strings:
df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))
x y z
0 A L A - L
1 A L A - L
2 B M B - M
3 B M B - M
4 B M B - M
5 C N C - N
6 C M C - M
This works for a small toy example, but I need z to be of category dtype (not string). My x and y contain categorical strings (88903 and 39132 categories, respectively) that may be 50-100 characters long, over around 500K rows, so converting these columns to strings first makes memory usage explode.
Is there a more efficient way to get a categorical output without using a ton of memory and taking too long?
You can try this:
import pandas as pd
from itertools import product
# original data
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
"y": ["L", "L", "M", "M", "M", "N", "M"],
}).astype("category", "category")
# extract unique categories
c1 = df.x.cat.categories
c2 = df.y.cat.categories
# make data frame with all possible category combinations
df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])
# create desired column
df_cats = df_cats.assign(grp=df_cats.x.astype('str') + ' - ' + df_cats.y.astype('str'))
# join this column to the original data
pd.merge(df, df_cats, how="left", left_on=["x", "y"], right_on=["x", "y"])
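If the full cross product of the categories is too large (here it would be 88903 × 39132 pairs), a more memory-conscious sketch is to build each combined label only once, for the (x, y) pairs that actually occur, and assemble the result directly as a categorical; the intermediate names below are my own:

import pandas as pd

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                   }).astype("category")

# one integer id per (x, y) pair, assuming no missing values (code -1)
nx = df["x"].cat.codes.to_numpy()
ny = df["y"].cat.codes.to_numpy()
n_y = len(df["y"].cat.categories)
pair_id = nx * n_y + ny

# factorize the ids so each observed pair gets one code
codes, uniques = pd.factorize(pair_id)

# build the "X - Y" label string once per observed pair
labels = [f"{df['x'].cat.categories[u // n_y]} - {df['y'].cat.categories[u % n_y]}"
          for u in uniques]

df["z"] = pd.Categorical.from_codes(codes, labels)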

How to convert a pandas df into a dict based on repeated rows (the row value has to be the key)

I want to ask how to convert a pandas df into a dict like I show below:
I have df like this:
col1  col2
"a"   12
"a"   2
"b"   34
"c"   9
"c"   45
and I need dict like this:
d = {"a":[12, 2], "b":[34], "c":[9, 45]}
Can anyone give me some tips?
First aggregate to lists and then convert output to dict:
df.groupby('col1')['col2'].agg(list).to_dict()
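A quick check with the sample data from the question:

import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b", "c", "c"],
                   "col2": [12, 2, 34, 9, 45]})

d = df.groupby("col1")["col2"].agg(list).to_dict()
print(d)  # {'a': [12, 2], 'b': [34], 'c': [9, 45]}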

Translate my SKUs using a dictionary with Pandas

I have a table which has internal SKUs in column 0 and then synonyms along that row. The number of synonyms is not constant (ranging from 0 to 7, but will have a tendency to grow)
I need an efficient function which will allow me to take SKUs from one column in a large table and translate them to the internal SKU (column 0) of my other table.
This is my current function which takes an array of SKUs from one table, searches for them in another and gives me the first column value where it finds a synonym.
def new_array(dfarray1, array1, trans_dic):
    missing_values = set([])
    new_array = []
    for value in array1:
        pos = trans_dic.eq(str(value)).any(axis=1)
        if len(pos[pos]) > 0:
            new_array.append(trans_dic['sku_0'][pos[pos].index[0]])
        else:
            missing_values.add(str(value))
    if len(missing_values) > 0:
        print("The following values are missing in dictionary. They are in DF called:" + dfarray1)
        print(missing_values)
        sys.exit()
    else:
        return new_array
I'm sure that this is very badly written because it takes my laptop about 3 minutes to go through about 75K values only. Can anyone help me make this faster?
Some questions asked previously:
What types are your function parameters? (can guess pandas, but no way to know for sure)
Yes. I am working on two pandas dataframes.
What does your table even look like?
Dictionary table:
SKU0   Synonym 0  Synonym 1  Synonym 2
foo    bar        bar1
foo1   baar1
foo2   baaar0                baar2
Values table:
SKU     Value  Value1  value1
foo     3      1       7
baar1   4      5       7
baaar0  5      5       9
Desired table:
SKU     Value  Value1  value1
foo     3      1       7
foo1    4      5       7
foo2    5      5       9
What does the rest of your code that is calling this function look like?
df1.sku = new_array('df1', list(df1.sku), sku_dic)
Given the dictionary dataframe in the format
import numpy as np
import pandas as pd

df_dict = pd.DataFrame({
    "SKU0": ["foo", "foo1", "foo2"],
    "Synonym 0": ["bar", "baar1", "baaar0"],
    "Synonym 1": ["bar1", np.nan, np.nan],
    "Synonym 2": [np.nan, np.nan, "baar2"]
})
and a values dataframe in the format
df_values = pd.DataFrame({
    "SKU": ["foo", "baar1", "baaar0"],
    "Value": [3, 4, 5],
    "Value1": [1, 5, 5],
    "value1": [7, 7, 9]
})
you can get the output you want by first using pd.melt to restructure your dictionary dataframe and then join it to your values dataframe. Then you can use some extra logic to check which column to take the final value from and to select the final columns needed.
(
    df_dict
    # convert the dict df from wide to long format
    .melt(id_vars=["SKU0"])
    # filter out rows where there is no synonym
    .loc[lambda x: x["value"].notna()]
    # join the dictionary to the values df
    .merge(df_values, how="right", left_on="value", right_on="SKU")
    # take the final value from "SKU0" if available, else keep "SKU"
    .assign(SKU=lambda x: np.where(x["SKU0"].isna(), x["SKU"], x["SKU0"]))
    # select the final columns needed in the output
    [["SKU", "Value", "Value1", "value1"]]
)
# output
SKU Value Value1 value1
0 foo 3 1 7
1 foo1 4 5 7
2 foo2 5 5 9
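Given the roughly 3 minutes the original loop takes for 75K values, it is also worth noting that the whole translation reduces to one vectorized .map() over a flat synonym-to-SKU0 lookup; a sketch reusing df_dict and df_values from above:

mapping = (df_dict.melt(id_vars="SKU0")
                  .dropna(subset=["value"])
                  .set_index("value")["SKU0"])

# translate in place; SKUs without a synonym entry keep their original value
df_values["SKU"] = df_values["SKU"].map(mapping).fillna(df_values["SKU"])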