group by key with pandas series and export to_dict() - pandas

I have a dictionary that looks like this:
d = {1:0, 2:0, 3:1, 4:0, 5:2, 6:1, 7:2, 8:0}
And I want to group the keys by their values, so that I get:
pandas_ordered = { 0:[1,2,4,8], 1:[3,6], 2:[5,7] }
But this command does not work:
pd.Series(list(d.values())).groupby(list(d.keys())).to_dict()
Below is a full example:
# Example:
import pandas as pd
d = {1:0, 2:0, 3:1, 4:0, 5:2, 6:1, 7:2, 8:0}
def pandas_groupby(dictionary):
    values = list(dictionary.values())
    keys = list(dictionary.keys())
    return pd.Series(values).groupby(keys).to_dict()
pandas_groupby(d)
The above code produces the error:
AttributeError: Cannot access callable attribute 'to_dict' of
'SeriesGroupBy' objects, try using the 'apply' method
Any ideas on how to do this?

Your dict is already given by the groups in your groupby:
d = {1:0, 2:0, 3:1, 4:0, 5:2, 6:1, 7:2, 8:0}
s = pd.Series(d)
s.groupby(s).groups
{0: Int64Index([1, 2, 4, 8], dtype='int64'),
1: Int64Index([3, 6], dtype='int64'),
2: Int64Index([5, 7], dtype='int64')}
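If you want plain lists rather than Index objects, a quick conversion of the same result:
{k: list(v) for k, v in s.groupby(s).groups.items()}
{0: [1, 2, 4, 8], 1: [3, 6], 2: [5, 7]}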
But of course, you can always agg and customize:
s.groupby(s).agg(lambda x: tuple(x.index)).to_dict()
{0: (1, 2, 4, 8), 1: (3, 6), 2: (5, 7)}
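For completeness, the same grouping needs no pandas at all; here is a minimal pure-Python sketch (group_keys_by_value is just an illustrative name):
from collections import defaultdict

def group_keys_by_value(d):
    # collect the keys of d under their shared value
    groups = defaultdict(list)
    for key, value in d.items():
        groups[value].append(key)
    return dict(groups)

group_keys_by_value({1: 0, 2: 0, 3: 1, 4: 0, 5: 2, 6: 1, 7: 2, 8: 0})
{0: [1, 2, 4, 8], 1: [3, 6], 2: [5, 7]}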

Related

Julia "MethodError: no method matching build_tree"

I have a very simple sample script:
using Pkg
Pkg.add("DecisionTree")
Pkg.add("DataFrames")
using DataFrames
using DecisionTree
dat = DataFrame(A=[1, 2, 3, 4, 5], B=[2, 5, 1, 2, 6])
model = build_tree(dat[!, "A"], dat[!, "B"])
Which returns an error:
ERROR: LoadError: MethodError: no method matching build_tree(::Vector{Int64}, ::Vector{Int64})
Closest candidates are:
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}, ::Any) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}, ::Any, ::Any) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
What is going on? How do I deal with that?
Your argument types do not match any build_tree method: the features must be an AbstractMatrix, not a plain Vector. Try this:
C = reshape(dat[!, "B"], (1, 5))                   # turn the vector into a 1×5 matrix
model = DecisionTree.build_tree(dat[!, "A"], C')   # C' is the 5×1 feature matrix

Dictionary Unique Keys Rename and Replace

I have a DataFrame with a column of dictionaries, like this:
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"1407273790":5,"1801032636":20,"1174813554":1,"1215470448":2,"1053754655":4,"1891751228":1},
{"1497066526":19,"1801032636":16,"1215470448":11,"1891751228":18},
{"1215470448":8,"1407273790":4},]})
Now I want to create a list of the unique keys and assign new names to them, like this -
np_code np_rename
1407273790 np_1
1801032636 np_2
1174813554 np_3
1215470448 np_4
1053754655 np_5
1891751228 np_6
1497066526 np_7
And finally replace the keys with their new names in the main dataframe df -
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"np_1":5,"np_2":20,"np_3":1,"np_4":2,"np_5":4,"np_6":1},
{"np_7":19,"1801032636":16,"np_4":11,"np_6":18},
{"np_4":8,"np_1":4},]})
You can use apply here, after building a mapping from the unique keys to their new names:
# unique keys across all CODES dicts, in order of first appearance
u = df['CODES'].map(lambda x: [*x.keys()]).explode().unique()
# map each key to 'np_1', 'np_2', ...
d = dict(zip(u, 'np_' + pd.Index((pd.factorize(u)[0] + 1).astype(str))))
# rename the keys of each dict, leaving unmapped keys unchanged
f = lambda x: {d.get(k, k): v for k, v in x.items()}
df['CODES'] = df['CODES'].apply(f)
print(df)
ID CODES
0 A {'np_1': 5, 'np_2': 20, 'np_3': 1, 'np_4': 2, ...
1 B {'np_7': 19, 'np_2': 16, 'np_4': 11, 'np_6': 18}
2 C {'np_4': 8, 'np_1': 4}
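An equivalent sketch with plain Python, assuming df is defined as above (dicts preserve insertion order in Python 3.7+):
# ordered unique keys across all CODES dicts
seen = dict.fromkeys(k for codes in df['CODES'] for k in codes)
mapping = {k: f'np_{i}' for i, k in enumerate(seen, start=1)}
df['CODES'] = df['CODES'].apply(lambda codes: {mapping.get(k, k): v for k, v in codes.items()})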

How can I speed up this function in Python?

I am trying to figure out a way to speed up this function. I am trying to do all pairwise comparisons between the rows and columns of a dataframe (pairwise_df) and store the result. The comparison requires two numpy arrays of continuous values taken from another dataframe (df).
pairwise_df = pd.DataFrame(index = ['insert1', 'insert2', 'insert3'], columns = ['insert1', 'insert2', 'insert3'])
df = pd.DataFrame(data = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
[2, 3, 4, 5, 7, 9, 10, 1, 2, 3]], index = ['insert1', 'insert2', 'insert3'], columns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for row in list(pairwise_df.index.values):
    for col in list(pairwise_df):
        pairwise_df.at[row, col] = cosine_sim(np.array(df.loc[row]), np.array(df.loc[col]))
This works, but takes about 18 minutes to run on a 2000 x 2000 dataframe, and I'm sure there are ways to speed this up, but my programming experience is minimal.
The cosine_sim function is here, but the function used will vary so it doesn't matter too much:
def cosine_sim(x, y):
    dot = np.dot(x, y)
    norma = np.linalg.norm(x)
    normb = np.linalg.norm(y)
    cos = dot / (norma * normb)
    return cos
Thanks!
You can avoid loops to compute cosine similarity by creating the array of all combinations using np.tile and np.reshape. The trick here is to use np.einsum to replace the dot product.
m = df.values
x = np.tile(m, m.shape[0]).reshape(-1, m.shape[1])
y = np.tile(m.T, m.shape[0]).T
c = np.einsum('ij,ij->i', x, y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
>>> c.reshape(-1, m.shape[0])
array([[1. , 0.57142857, 0.75283826],
[0.57142857, 1. , 0.74102903],
[0.75283826, 0.74102903, 1. ]])
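An even simpler sketch of the same computation (an assumed-equivalent variant, not the answer's code): normalize each row once, then a single matrix product gives every pairwise cosine similarity at the same time.
import numpy as np
m = df.values.astype(float)
unit = m / np.linalg.norm(m, axis=1, keepdims=True)  # scale each row to unit length
pairwise = unit @ unit.T                             # pairwise[i, j] == cosine_sim(row i, row j)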

Aggregate/Remove duplicate rows in DataFrame based on swapped index levels

Sample input
import pandas as pd
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
Which looks like this:
              value
from to type
A    B  1         5
B    C  2         2
     A  1         1
C    B  1         3
Goal
I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (from, to, type), if there exists a row (to, from, type), the value of the second row should be added to the first row and the second row be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and dropped, leading to the following desired result.
Sample result
              value
from to type
A    B  1         6
B    C  2         2
C    B  1         3
This is my best try so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from':'to', 'to':'from'})
df_both = df.join(df3.set_index(
    ['from', 'to', 'type']),
    rsuffix='_b').sum(axis=1)
# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a, b, t in df_both.index:
    if (b, a, t) in df_both.index and not (b, a, t) in rows_to_keep:
        rows_to_keep.append((a, b, t))
        rows_to_remove.append((b, a, t))
df_final = df_both.drop(rows_to_remove)
df_final
Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?
Not sure how much better this is, but it's certainly different
import pandas as pd
from collections import Counter
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
ls = df.to_records()
ls = list(ls)
ls2 = []
for l in ls:
    # repeat each (from, to, type) triple `value` times
    i = 0
    while i < l[3]:
        ls2.append(list(l)[:3])
        i += 1
# count unordered (from, to) pairs per type; sort only the node pair, since a
# mixed str/int triple cannot be sorted in Python 3
counted = Counter((*sorted(entry[:2]), entry[2]) for entry in ls2)
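A more direct, vectorised sketch of the same idea (an alternative of my own, not the answer above), assuming df is the indexed frame from the question and a reasonably recent pandas: build an order-independent pair key with np.sort, group on it, and keep the first-seen orientation.
import numpy as np
tmp = df.reset_index()
pair = np.sort(tmp[['from', 'to']].to_numpy(), axis=1)   # ('B','A') -> ('A','B')
out = (tmp.assign(k1=pair[:, 0], k2=pair[:, 1])
          .groupby(['k1', 'k2', 'type'], sort=False, as_index=False)
          .agg({'from': 'first', 'to': 'first', 'value': 'sum'})
          .set_index(['from', 'to', 'type'])[['value']])
print(out)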

Generator to yield gap tuples from zipped iterables

Let's say that I have an arbitrary number of iterables, all of which can be assumed to be sorted, and contain elements all of the same type (integers, for illustration's sake).
a = (1, 2, 3, 4, 5)
b = (2, 4, 5)
c = (1, 2, 3, 5)
I would like to write a generator function yielding the following:
(1, None, 1)
(2, 2, 2)
(3, None, 3)
(4, 4, None)
(5, 5, 5)
In other words, progressively yield sorted tuples with gaps where elements are missing from the input iterables.
My take on this, using only iterators, not heaps:
a = (1, 2, 4, 5)
b = (2, 5)
c = (1, 2, 6)
d = (1,)
inputs = [iter(x) for x in (a, b, c, d)]
def minwithreplacement(currents, inputs, minitem, done):
    for i in xrange(len(currents)):
        if currents[i] == minitem:
            try:
                currents[i] = inputs[i].next()
            except StopIteration:
                currents[i] = None
                done[0] += 1
            yield minitem
        else:
            yield None

def dothing(inputs):
    currents = [it.next() for it in inputs]
    done = [0]
    while done[0] != len(currents):
        yield minwithreplacement(currents, inputs, min(x for x in currents if x), done)

print [list(x) for x in dothing(inputs)]  # Consuming iterators for display purposes
>>>[[1, None, 1, 1], [2, 2, 2, None], [4, None, None, None], [5, 5, None, None], [None, None, 6, None]]
We first need a variation of heapq.merge which also yields the index. You can get that by copy-pasting heapq.merge, and replacing each yield v with yield itnum, v. (I omit that part from my answer for readability).
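One possible stand-in for that modified merge (my own sketch, not the omitted code), so the snippet below runs as written: reuse heapq.merge on (value, index) pairs instead of copy-pasting its source.
import heapq

def merge(iterables):
    # tag each value with the index of the iterable it came from, then let
    # heapq.merge interleave the tagged streams in sorted order
    tagged = (((v, i) for v in it) for i, it in enumerate(iterables))
    for v, i in heapq.merge(*tagged):
        yield i, v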
Now we can do:
from collections import deque, OrderedDict

def f(*iterables):
    pending = OrderedDict()
    for i, v in merge(iterables):
        if (not pending) or pending.keys()[-1] < v:
            # a new greatest value
            pending[v] = [None] * len(iterables)
        pending[v][i] = v
        # yield all values smaller than v
        while len(pending) > 1 and pending.keys()[0] < v:
            yield pending.pop(pending.keys()[0])
    # yield remaining
    while pending:
        yield pending.pop(pending.keys()[0])
print list(f((1,2,3,4,5), (2,4,5), (1,2,3,5)))
=> [[1, None, 1], [2, 2, 2], [3, None, 3], [4, 4, None], [5, 5, 5]]
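The code above is Python 2 (subscriptable keys(), print statement). For Python 3, a compact alternative sketch of my own builds on heapq.merge plus itertools.groupby:
import heapq
from itertools import groupby
from operator import itemgetter

def gap_tuples(*iterables):
    n = len(iterables)
    # merge (value, index) pairs in sorted order, then group equal values
    tagged = (((v, i) for v in it) for i, it in enumerate(iterables))
    for value, group in groupby(heapq.merge(*tagged), key=itemgetter(0)):
        row = [None] * n
        for _, i in group:
            row[i] = value
        yield tuple(row)

print(list(gap_tuples((1, 2, 3, 4, 5), (2, 4, 5), (1, 2, 3, 5))))
# [(1, None, 1), (2, 2, 2), (3, None, 3), (4, 4, None), (5, 5, 5)]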