Conditional mapping among columns of two data frames with pandas

I need your advice on how to map columns between data frames.
I have put it in a simple form so that it's easier to understand:
df = dataframe
EXAMPLE:
import pandas as pd

df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
For column X of df1, we need to look for a value in df2's columns A, C, and D, in that order. The search stops at the first non-empty value found, and that value is selected.
2nd step:
If the selected value is "Other", then column X of df1 should instead be taken from columns F, G, and H, again in order, until a non-empty value is found.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance

Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns, per row."""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]

# First pass over A, C, D; wherever that picks "Other", fall back to F, G, H
col_x = first_non_empty(df2, ['A', 'C', 'D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F', 'G', 'H']))
# Note: this assumes df1 shares df2's index; if df1 is truly empty as in the example,
# build it directly instead, e.g. df1 = col_x.to_frame('X')
df1['X'] = col_x
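A quick sanity check against the example data (illustration only, not part of the required code):
print(first_non_empty(df2, ['A', 'C', 'D']).tolist())  # ['D1', 'Other', 'A1']
print(col_x.tolist())                                   # ['D1', 'H2', 'A1']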

Related

Pandas Dataframe GroupBy Agg - LAMBDA - single values go to preexisting or new lists and preexisting lists fusion

I have this DataFrame to group by key:
df = pd.DataFrame({
    'key': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '5'],
    'data1': [['A', 'B', 'C'], 'D', 'P', 'E', ['F', 'G', 'H'], ['I', 'J'], ['K', 'L'], 'M', 'N', 'O'],
    'data2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df
I want to group by key and sum data2; that part is fine.
But concerning data1, I want the following:
If a list doesn't exist yet:
Single values stay as they are when the key was not duplicated
Single values assigned to the same key are combined into a new list
If a list already exists:
Other single values are appended to it
Other lists' values are appended to it
The resulting DataFrame should then be:
dfgood = pd.DataFrame({
    'key': ['1', '2', '3', '4', '5'],
    'data1': [['A', 'B', 'C', 'D', 'P'], ['F', 'G', 'H', 'E'], ['I', 'J', 'K', 'L'], ['M', 'N'], 'O'],
    'data2': [6, 9, 13, 17, 10]
})
dfgood
In fact, I don't really care about the order of the data1 values inside the lists. It could also be any other structure that keeps them together, even a string with separators or a set, if that makes it easier to do this the way you think is best.
I thought about two solutions.
First, going this way:
dfgood = df.groupby('key', as_index=False).agg({
    'data1': lambda x: x.iloc[0].append(x.iloc[1]) if type(x.iloc[0]) == list else list(x),
    'data2': sum,
})
dfgood
It doesn't work because of index out of range in x.iloc[1].
I also tried the following, because data1 was organized like this in another groupby from the question at this link:
dfgood = df.groupby('key', as_index=False).agg({
    'data1': lambda g: g.iloc[0] if len(g) == 1 else list(g),
    'data2': sum,
})
dfgood
But it's creating new lists from preexisting lists or values and not appending data to already existing lists.
Another way to do it, though I think it's more complicated and there should be a better or faster solution:
Turn the data1 lists and single values into individual series with apply,
use wide_to_long to keep single values for each key,
then group by, applying:
dfgood = df.groupby('key', as_index=False).agg({
    'data1': lambda g: g.iloc[0] if len(g) == 1 else list(g),
    'data2': sum,
})
dfgood
I think my problem is that I don't know how to use lambdas correctly, and I try stupid things like x.iloc[1] in the previous example. I've looked at a lot of tutorials about lambdas, but it's still fuzzy in my mind.
The problem is combining lists with scalars. A possible solution is to first create lists from the scalars and then flatten them in groupby.agg:
dfgood = (df.assign(data1=df['data1'].apply(lambda y: y if isinstance(y, list) else [y]))
            .groupby('key', as_index=False)
            .agg({
                'data1': lambda x: [z for y in x for z in y],
                'data2': sum,
            }))
print(dfgood)
key data1 data2
0 1 [A, B, C, D, P] 6
1 2 [E, F, G, H] 9
2 3 [I, J, K, L] 13
3 4 [M, N] 17
4 5 [O] 10
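For clarity, the assign step only wraps scalars so that every cell holds a list before flattening. A tiny standalone illustration (not part of the answer itself):
sample = pd.Series([['A', 'B', 'C'], 'D'])
print(sample.apply(lambda y: y if isinstance(y, list) else [y]).tolist())  # [['A', 'B', 'C'], ['D']]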
Another idea is to use a flatten function that flattens only lists, not strings:
# https://stackoverflow.com/a/5286571/2901002
def flatten(foo):
    for x in foo:
        if hasattr(x, '__iter__') and not isinstance(x, str):
            for y in flatten(x):
                yield y
        else:
            yield x

dfgood = (df.groupby('key', as_index=False)
            .agg({'data1': lambda x: list(flatten(x)),
                  'data2': sum}))
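To see what flatten does with a mixed group of scalars and lists, here is a small standalone check; the groupby result itself should match the output shown above, up to the order of values inside the lists:
print(list(flatten(['E', ['F', 'G', 'H']])))  # ['E', 'F', 'G', 'H']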
You could explode to get individual rows, then aggregate again with groupby+agg after taking care of masking the duplicated values in data2 (to avoid summing duplicates):
(df.explode('data1')
   .assign(data2=lambda d: d['data2'].mask(d.duplicated(['key', 'data2']), 0))
   .groupby('key')
   .agg({'data1': list, 'data2': 'sum'})
)
output:
data1 data2
key
1 [A, B, C, D, P] 6
2 [E, F, G, H] 9
3 [I, J, K, L] 13
4 [M, N] 17
5 [O] 10
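For intuition on why the mask is needed: explode repeats a row's data2 value once per list element, so without the mask those repeats would be summed several times. Tracing key '1' from the example data (illustration only):
# after df.explode('data1'), key '1' becomes five rows:
#   key  data1  data2
#   '1'  'A'    1
#   '1'  'B'    1   <- duplicated (key, data2), masked to 0
#   '1'  'C'    1   <- duplicated (key, data2), masked to 0
#   '1'  'D'    2
#   '1'  'P'    3
# so data2 sums to 1 + 0 + 0 + 2 + 3 = 6, matching the expected result.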

Collapsing a pandas dataframe into a single column of all items and their occurrences

I have a data frame consisting of a mixture of NaNs and strings, e.g.
data = {'String1': ['NaN', 'tree', 'car', 'tree'],
        'String2': ['cat', 'dog', 'car', 'tree'],
        'String3': ['fish', 'tree', 'NaN', 'tree']}
ddf = pd.DataFrame(data)
I want to:
1: count the total number of items and put the counts in a new data frame, e.g.
NaN=2
tree=5
car=2
fish=1
cat=1
dog=1
2: count the total number of items when compared to a separate, longer list (a column of another data frame), e.g.
df['compare'] =
NaN
tree
car
fish
cat
dog
rabbit
Pear
Orange
snow
rain
Thanks
Jason
For the first question:
from collections import Counter
import pandas as pd

data = {
    "String1": ["NaN", "tree", "car", "tree"],
    "String2": ["cat", "dog", "car", "tree"],
    "String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)

# 1: count every value across all columns
a = Counter(ddf.stack().tolist())
df_result = pd.DataFrame(dict(a), index=['Count']).T

# 2: look the counts up against the longer comparison list
df = pd.DataFrame({'vals': ['NaN', 'tree', 'car', 'fish', 'cat', 'dog', 'rabbit', 'Pear', 'Orange', 'snow', 'rain']})
df_counts = df.vals.map(df_result.to_dict()['Count'])
This should do it :)
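As a side note, a more compact route to the same counts (just a sketch, reusing ddf from above) is to stack the frame and use value_counts:
counts = ddf.stack().value_counts()
print(counts.to_dict())  # same counts as the Counter result above (ordering may differ)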
You can use the following code to count items over the whole data frame.
import pandas as pd

data = {'String1': ['NaN', 'tree', 'car', 'tree'],
        'String2': ['cat', 'dog', 'car', 'tree'],
        'String3': ['fish', 'tree', 'NaN', 'tree']}
df = pd.DataFrame(data)

def get_counts(df: pd.DataFrame) -> dict:
    res = {}
    for col in df.columns:
        vc = df[col].value_counts().to_dict()
        for k, v in vc.items():
            if k in res:
                res[k] += v
            else:
                res[k] = v
    return res

counts = get_counts(df)
Output
>>> print(counts)
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}

How to split every string in a list in a dataframe column

I have a dataframe with a column containing lists of strings of the form 'A:B'. I'd like to add a new column containing the set of first elements obtained by splitting each string on ':'.
data = [
    {'Name': 'A', 'Servers': ['A:s1', 'B:s2', 'C:s3', 'C:s2']},
    {'Name': 'B', 'Servers': ['B:s1', 'C:s2', 'B:s3', 'A:s2']},
    {'Name': 'C', 'Servers': ['G:s1', 'X:s2', 'Y:s3']}
]
df = pd.DataFrame(data)
df
df['Clusters'] = [
{'A', 'B', 'C'},
{'B', 'C', 'A'},
{'G', 'X', 'Y'}
]
Learn how to use apply
In [5]: df['Clusters'] = df['Servers'].apply(lambda x: {p.split(':')[0] for p in x})
In [6]: df
Out[6]:
Name Servers Clusters
0 A [A:s1, B:s2, C:s3, C:s2] {A, B, C}
1 B [B:s1, C:s2, B:s3, A:s2] {C, B, A}
2 C [G:s1, X:s2, Y:s3] {X, Y, G}
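The same result can also be produced without apply, using a plain set comprehension over the column (a sketch equivalent to the line above):
df['Clusters'] = [{s.split(':')[0] for s in servers} for servers in df['Servers']]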

Arrays to row in pandas

I have the following dict which I want to convert into a pandas DataFrame. The dict has nested lists which can appear for one node but not for another.
dis={"companies": [{"object_id": 123,
"name": "Abd ",
"contact_name": ["xxxx",
"yyyy"],
"contact_id":[1234,
33455]
},
{"object_id": 654,
"name": "DDSPP"},
{"object_id": 987,
"name": "CCD"}
]}
I want it as:
object_id, name, contact_name, contact_id
123,Abd,xxxx,1234
123,Abd,yyyy,
654,DDSPP,,
987,CCD,,
How can I achieve this?
I was trying to do something like:
abc = pd.DataFrame(dis).set_index['object_id','contact_name']
but it says
'method' object is not subscriptable
This is inspired by @jezrael's answer at this link: Splitting multiple columns into rows in pandas dataframe
Use:
s = {"companies": [{"object_id": 123,
"name": "Abd ",
"contact_name": ["xxxx",
"yyyy"],
"contact_id":[1234,
33455]
},
{"object_id": 654,
"name": "DDSPP"},
{"object_id": 987,
"name": "CCD"}
]}
df = pd.DataFrame(s)  # convert into a DataFrame
df = df['companies'].apply(pd.Series)  # split the internal keys and values into columns
split1 = df.apply(lambda x: pd.Series(x['contact_id']), axis=1).stack().reset_index(level=1, drop=True)
split2 = df.apply(lambda x: pd.Series(x['contact_name']), axis=1).stack().reset_index(level=1, drop=True)
df1 = pd.concat([split1, split2], axis=1, keys=['contact_id', 'contact_name'])
pd.options.display.float_format = '{:.0f}'.format
print(df.drop(['contact_id', 'contact_name'], axis=1).join(df1).reset_index(drop=True))
Output with regular index:
name object_id contact_id contact_name
0 Abd 123 1234 xxxx
1 Abd 123 33455 yyyy
2 DDSPP 654 nan NaN
3 CCD 987 nan NaN
Is this something you were looking for?
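On reasonably recent pandas (1.3+, which can explode several columns at once), a shorter route to roughly the same shape is possible. This is only a sketch reusing the s dict above, not part of the original answer:
df = pd.json_normalize(s['companies'])
out = df.explode(['contact_name', 'contact_id'], ignore_index=True)
print(out)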
If you have only one column that needs converting, then you can use something shorter, like this:
df = pd.DataFrame(s['companies'])
d = df.loc[0].apply(pd.Series)  # expand the first row's lists into columns 0 and 1
d[1] = d[1].fillna(d[0])
pd.concat([df.drop(index=0), d.T])
Otherwise, if you need to do this with more than one row, you can use this approach, but it has to be modified.
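As an aside, the 'method' object is not subscriptable error in the question comes from using square brackets on set_index instead of calling it; also note that pd.DataFrame(dis) only yields a single 'companies' column, so the nested records have to be expanded first (as above) before such an index can be set. The call form would be:
df.set_index(['object_id', 'contact_name'])  # parentheses, not df.set_index[...]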

Selecting data from a dataframe based on a tuple

Suppose I have the following dataframe
df = DataFrame({'vals': [1, 2, 3, 4],
                'ids': ['a', 'b', 'a', 'n']})
I want to select all the rows which are in the list
[ (1,a), (3,f) ]
I have tried using boolean indexing like so
to_search = {'vals': [1, 3],
             'ids': ['a', 'f']}
df.isin(to_search)
I expect only the first row to match but I get the first and the third row
ids vals
0 True True
1 True False
2 True True
3 False False
Is there any way to match exactly the values at a particular index instead of matching any value?
You might create a DataFrame for what you want to match, and then merge it:
In [32]: df2 = DataFrame([[1,'a'],[3,'f']], columns=['vals', 'ids'])
In [33]: df.merge(df2)
Out[33]:
ids vals
0 a 1
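Alternatively (a sketch, not from the original answer), you can build a MultiIndex from the two columns and test membership against the list of tuples, which avoids the merge and keeps the original index:
mask = df.set_index(['vals', 'ids']).index.isin([(1, 'a'), (3, 'f')])
print(df[mask])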