Dictionary Unique Keys Rename and Replace - pandas

I have a dictionary format structure like this
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"1407273790":5,"1801032636":20,"1174813554":1,"1215470448":2,"1053754655":4,"1891751228":1},
{"1497066526":19,"1801032636":16,"1215470448":11,"1891751228":18},
{"1215470448":8,"1407273790":4},]})
Now I want to create a unique list of keys and create names for them like this -
np_code np_rename
1407273790 np_1
1801032636 np_2
1174813554 np_3
1215470448 np_4
1053754655 np_5
1891751228 np_6
1497066526 np_7
And finally replace the new names in main dataframe df -
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"np_1":5,"np_2":20,"np_3":1,"np_4":2,"np_5":4,"np_6":1},
{"np_7":19,"1801032636":16,"np_4":11,"np_6":18},
{"np_4":8,"np_1":4},]})

You can use apply here:
Assuming the unique list dataframe is unique_list_df:
u = df['CODES'].map(lambda x: [*x.keys()]).explode().unique()
d = dict(zip(u,'np_'+pd.Index((pd.factorize(u)[0]+1).astype(str))))
f = lambda x: {d.get(k,k): v for k,v in x.items()}
df['CODES'] = df['CODES'].apply(f)
print(df)
ID CODES
0 A {'np_1': 5, 'np_2': 20, 'np_3': 1, 'np_4': 2, ...
1 B {'np_7': 19, 'np_2': 16, 'np_4': 11, 'np_6': 18}
2 C {'np_4': 8, 'np_1': 4}

Related

How can I add several columns within a dataframe (broadcasting)?

import numpy as np
import pandas as pd
data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]
df = pd.DataFrame(data = data, index = ['A', 'B', 'C'], columns = ['Bulgary', 'Robbery', 'Car Theft'])
df
I get the following:
Bulgary
Robbery
Car Theft
A
30
19
6
B
12
23
14
C
8
18
20
I would like to assign:
df['Total'] = df['Bulgary'] + df['Robbery'] + df['Car Theft']
But does this operation have to be done manually? I am looking for a function that can handle conveniently.
#pseudocode
#df['Total'] = df.Some_Column_Adding_Function([0:3])
#df['Total'] == df['Bulgary'] + df['Robbery'] + df['Car Theft'] returns True
Similarly, how do I add across rows?
Use sum:
df['Total'] = df.sum(axis=1)
Or if you want subset of columns:
df['Total'] = df[df.columns[0:3]].sum(axis=1)
# or df['Total'] = df[['Bulgary', 'Robbery', 'Car Theft']].sum(axis=1)

Pandas Dataframe GroupBy Agg - LAMBDA - single values go to preexisting or new lists and preexisting lists fusion

I have this DataFrame to groupby key:
df = pd.DataFrame({
'key': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '5'],
'data1': [['A', 'B', 'C'], 'D', 'P', 'E', ['F', 'G', 'H'], ['I', 'J'], ['K', 'L'], 'M', 'N', 'O']
'data2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df
I want to make the groupby key and sum data2, it's ok for this part.
But concerning data1, I want to :
If a list doesn't exist yet:
Single values don't change when key was not duplicated
Single values assigned to a key are combined into a new list
If a list already exist:
Other single values are append to it
Other lists values are append to it
The resulting DataFrame should then be :
dfgood = pd.DataFrame({
'key': ['1', '2', '3', '4', '5'],
'data1': [['A', 'B', 'C', 'D', 'P'], ['F', 'G', 'H', 'E'], ['I', 'J', 'K', 'L'], ['M', 'N'], 'O']
'data2': [6, 9, 13, 17, 10]
})
dfgood
In fact, I don't really care about the order of data1 values into the lists, it could also be any structure that keep them together, even a string with separators or a set, if it's easier to make it go the way you think best to do this.
I thought about two solutions :
Going that way :
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda x: x.iloc[0].append(x.iloc[1]) if type(x.iloc[0])==list else list(x),
'data2' : sum,
})
dfgood
It doesn't work because of index out of range in x.iloc[1].
I also tried, because data1 was organized like this in another groupby from the question on this link:
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
'data2' : sum,
})
dfgood
But it's creating new lists from preexisting lists or values and not appending data to already existing lists.
Another way to do it, but I think it's more complicated and there should be a better or faster solution :
Turning data1 lists and single values into individual series with apply,
use wide_to_long to keep single values for each key,
Then groupby applying :
dfgood = df.groupby('key', as_index=False).agg({
'data1' : lambda g: g.iloc[0] if len(g) == 1 else list(g)),
'data2' : sum,
})
dfgood
I think my problem is that I don't know how to use lambdas correctly and I try stupid things like x.iloc[1] in the previous example. I've looked at a lot of tutorial about lambdas, but it's still fuzzy in my mind.
There is problem combinations lists with scalars, possible solution is create first lists form scalars and then flatten them in groupby.agg:
dfgood = (df.assign(data1 = df['data1'].apply(lambda y: y if isinstance(y, list) else [y]))
.groupby('key', as_index=False).agg({
'data1' : lambda x: [z for y in x for z in y],
'data2' : sum,
})
)
print (dfgood)
key data1 data2
0 1 [A, B, C, D, P] 6
1 2 [E, F, G, H] 9
2 3 [I, J, K, L] 13
3 4 [M, N] 17
4 5 [O] 10
Another idea is use flatten function for flatten only lists, not strings:
#https://stackoverflow.com/a/5286571/2901002
def flatten(foo):
for x in foo:
if hasattr(x, '__iter__') and not isinstance(x, str):
for y in flatten(x):
yield y
else:
yield x
dfgood = (df.groupby('key', as_index=False).agg({
'data1' : lambda x: list(flatten(x)),
'data2' : sum}))
You could explode to get individual rows, then aggregate again with groupby+agg after taking care of masking the duplicated values in data2 (to avoid summing duplicates):
(df.explode('data1')
.assign(data2=lambda d: d['data2'].mask(d.duplicated(['key', 'data2']), 0))
.groupby('key')
.agg({'data1': list, 'data2': 'sum'})
)
output:
data1 data2
key
1 [A, B, C, D, P] 6
2 [E, F, G, H] 9
3 [I, J, K, L] 13
4 [M, N] 17
5 [O] 10

Collapsing a PANDAs dataframe into a single column of all items and their occurances

I have a data frame consisting of a mixture of NaN's and strings e.g
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
ddf = pd.DataFrame(data)
I want to
1:count the total number of items and put in a new data frame e.g
NaN=2
tree=5
car=2
fish=1
cat=1
dog=1
2:Count the total number of items when compared to a separate longer list (column of a another data frame, e.g
df['compare'] =
NaN
tree
car
fish
cat
dog
rabbit
Pear
Orange
snow
rain
Thanks
Jason
For the first question:
from collections import Counter
data = {
"String1": ["NaN", "tree", "car", "tree"],
"String2": ["cat", "dog", "car", "tree"],
"String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)
a = Counter(ddf.stack().tolist())
df_result = pd.DataFrame(dict(a), index=['Count']).T
df = pd.DataFrame({'vals':['NaN', 'tree', 'car', 'fish', 'cat', 'dog', 'rabbit', 'Pear', 'Orange', 'snow', 'rain']})
df_counts = df.vals.map(df_result.to_dict()['Count'])
THis should do :)
You can use the following code for count of items over all data frame.
import pandas as pd
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
df = pd.DataFrame(data)
def get_counts(df: pd.DataFrame) -> dict:
res = {}
for col in df.columns:
vc = df[col].value_counts().to_dict()
for k,v in vc.items():
if k in res:
res[k] += v
else:
res[k] = v
return res
counts = get_counts(df)
Output
>>> print(counts)
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}

How to split every string in a list in a dataframe column

I have a dataframe with a column containing a list of strings 'A:B'. I'd like to modify this so there is a new column which contains a set split by ':' containing the first element.
data = [
{'Name': 'A', 'Servers':['A:s1', 'B:s2', 'C:s3', 'C:s2']},
{'Name': 'B', 'Servers':['B:s1', 'C:s2', 'B:s3', 'A:s2']},
{'Name': 'C', 'Servers':['G:s1', 'X:s2', 'Y:s3']}
]
df = pd.DataFrame(data)
df
df['Clusters'] = [
{'A', 'B', 'C'},
{'B', 'C', 'A'},
{'G', 'X', 'Y'}
]
Learn how to use apply
In [5]: df['Clusters'] = df['Servers'].apply(lambda x: {p.split(':')[0] for p in x})
In [6]: df
Out[6]:
Name Servers Clusters
0 A [A:s1, B:s2, C:s3, C:s2] {A, B, C}
1 B [B:s1, C:s2, B:s3, A:s2] {C, B, A}
2 C [G:s1, X:s2, Y:s3] {X, Y, G}

Aggregate/Remove duplicate rows in DataFrame based on swapped index levels

Sample input
import pandas as pd
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
Which looks like this:
value
from to type
A B 1 5
B C 2 2
A 1 1
C B 1 3
Goal
I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (from, to, type), if there exists a row (to, from, type), the value of the second row should be added to the first row and the second row be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and dropped, leading to the following desired result.
Sample result
value
from to type
A B 1 6
B C 2 2
C B 1 3
This is my best try so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from':'to', 'to':'from'})
df_both = df.join(df3.set_index(
['from', 'to', 'type']),
rsuffix='_b').sum(axis=1)
# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a,b,t in df_both.index:
if (b,a,t) in df_both.index and not (b,a,t) in rows_to_keep:
rows_to_keep.append((a,b,t))
rows_to_remove.append((b,a,t))
df_final = df_both.drop(rows_to_remove)
df_final
Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?
Not sure how much better this is, but it's certainly different
import pandas as pd
from collections import Counter
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
ls = df.to_records()
ls = list(ls)
ls2=[]
for l in ls:
i=0
while i <= l[3]:
ls2.append(list(l)[:3])
i+=1
counted = Counter(tuple(sorted(entry)) for entry in ls2)