Convert pandas Series/DataFrame to numpy matrix, unpacking coordinates from index - numpy

I have a pandas Series like so:
A 1
B 2
C 3
AB 4
AC 5
BA 4
BC 8
CA 5
CB 8
Is there simple code to convert this to a matrix like so:
1 4 5
4 2 8
5 8 3
I'm looking for something fairly dynamic and built in, rather than many loops hard-coded to this 3x3 problem.

You can do it this way.
import pandas as pd
# your raw data
raw_index = 'A B C AB AC BA BC CA CB'.split()
values = [1, 2, 3, 4, 5, 4, 8, 5, 8]
# reformat index
index = [(a[0], a[-1]) for a in raw_index]
multi_index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(values, columns=['values'], index=multi_index)
df.unstack()
Out[47]:
values
A B C
A 1 4 5
B 4 2 8
C 5 8 3
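If the goal is the numpy matrix itself rather than the unstacked frame, one extra step gets there. A minimal sketch (using a plain 2-D array, which numpy recommends over np.matrix):

```python
import pandas as pd

# rebuild the answer's frame, then pull out a plain 2-D numpy array
raw_index = 'A B C AB AC BA BC CA CB'.split()
values = [1, 2, 3, 4, 5, 4, 8, 5, 8]
index = pd.MultiIndex.from_tuples([(a[0], a[-1]) for a in raw_index])
s = pd.Series(values, index=index)
m = s.unstack().values  # rows/cols sorted A, B, C
# m.tolist() == [[1, 4, 5], [4, 2, 8], [5, 8, 3]]
```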

For a pd.DataFrame, use the .values attribute or else the .to_records(...) method.
For a pd.Series, use the .unstack() method, as Jianxun Li said.
import numpy as np
import pandas as pd
d = pd.DataFrame(data = {
'val':[1,2,3,4,5,4,8,5,8],  # 'val' first so the record order matches the output below
'var':['A','B','C','AB','AC','BA','BC','CA','CB'] })
# Here are some options for converting to np.matrix ...
np.matrix( d.to_records(index=False) )
# matrix([[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'AB'), (5, 'AC'), (4, 'BA'),
# (8, 'BC'), (5, 'CA'), (8, 'CB')]],
# dtype=[('val', '<i8'), ('var', 'O')])
# Here you can add code to rearrange it, e.g.
[(val, idx[0], idx[-1]) for val,idx in d.to_records(index=False) ]
# [(1, 'A', 'A'), (2, 'B', 'B'), (3, 'C', 'C'), (4, 'A', 'B'), (5, 'A', 'C'), (4, 'B', 'A'), (8, 'B', 'C'), (5, 'C', 'A'), (8, 'C', 'B')]
# and if you need numeric row- and col-indices:
[ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ]
# [(1, 0, 0), (2, 1, 1), (3, 2, 2), (4, 0, 1), (5, 0, 2), (4, 1, 0), (8, 1, 2), (5, 2, 0), (8, 2, 1)]
# you can sort by them:
sorted([ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ], key=lambda x: x[1:] )  # sort by (row, col)
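Putting the pieces above together, here is a minimal sketch that fills a dense 3x3 array from the (value, row, col) triples; the columns are selected explicitly because the record order from to_records depends on the constructor's column order:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(data={
    'val': [1, 2, 3, 4, 5, 4, 8, 5, 8],
    'var': ['A', 'B', 'C', 'AB', 'AC', 'BA', 'BC', 'CA', 'CB']})
# (value, row, col) triples, selecting columns explicitly so order is fixed
triples = [(val, 'ABC'.index(idx[0]), 'ABC'.index(idx[-1]))
           for val, idx in d[['val', 'var']].to_records(index=False)]
m = np.zeros((3, 3), dtype=int)
for val, r, c in triples:
    m[r, c] = val
# m.tolist() == [[1, 4, 5], [4, 2, 8], [5, 8, 3]]
```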

Related

Create dataframe from rows which comprise differing column names

(Question rewritten as per comment-suggestions)
Suppose I have data like this:
{
2012: [ ('A', 9), ('C', 7), ('D', 4) ],
2013: [ ('B', 7), ('C', 6), ('E', 1) ]
}
How would I construct a dataframe that will account for the 'missing columns' in the rows?
i.e.
year A B C D E
0 2012 9 0 7 4 0
1 2013 0 7 6 0 1
I suppose I can perform a trivial preliminary manipulation to get:
[
[ ('year', 2012), ('A', 9), ('C', 7), ('D', 4) ],
[ ('year', 2013), ('B', 7), ('C', 6), ('E', 1) ]
]
You could first apply the method suggested in this post by @jezrael, create a df with the standard constructor, and then use df.pivot to get the df in the desired shape:
import pandas as pd
data = {
2012: [ ('A', 9), ('C', 7), ('D', 4) ],
2013: [ ('B', 7), ('C', 6), ('E', 1) ]
}
L = [(k, *t) for k, v in data.items() for t in v]
df = pd.DataFrame(L).rename(columns={0:'year'})\
.pivot(index='year', columns=1, values=2).fillna(0).reset_index(drop=False)
df.columns.name = None
print(df)
year A B C D E
0 2012 9.0 0.0 7.0 4.0 0.0
1 2013 0.0 7.0 6.0 0.0 1.0
If the values are all ints, you could do .fillna(0).astype(int).reset_index(drop=False), of course.
import pandas as pd
data = {
2012: [ ('A', 9), ('C', 7), ('D', 4) ],
2013: [ ('B', 7), ('C', 6), ('E', 1) ]
}
L = [(k, *t) for k, v in data.items() for t in v]
df = pd.DataFrame(L).rename(columns={0:'year'})\
.pivot(index='year', columns=1, values=2).fillna(0).reset_index(drop=False)
df.columns.name = None
df = df.astype({'A':'int','B':'int','C':'int','D':'int','E':'int'})
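An alternative sketch that reaches the same result without pivot: build a Series with a (year, column) MultiIndex and unstack it with fill_value=0, which also keeps the integer dtype:

```python
import pandas as pd

data = {2012: [('A', 9), ('C', 7), ('D', 4)],
        2013: [('B', 7), ('C', 6), ('E', 1)]}
pairs = [(year, col, val) for year, lst in data.items() for col, val in lst]
s = pd.Series([v for _, _, v in pairs],
              index=pd.MultiIndex.from_tuples([(y, c) for y, c, _ in pairs]))
# fill_value=0 plugs the 'missing columns' per year
df = s.unstack(fill_value=0).rename_axis('year').reset_index()
# df.columns: ['year', 'A', 'B', 'C', 'D', 'E']
```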

List of the (row, col) of the n largest values in a numeric pandas DataFrame?

Given a pandas DataFrame of numeric values, how can one produce a list of the .loc cell locations that can then be used to obtain the corresponding n largest values in the entire DataFrame?
For example:
        A        B      C      D       E
X       1.3      3.6    33     61.38   0.3
Y       3.14     2.71   64     23.2    21
Z       1024     42     66     137     22.2
T       63.123   111    1.23   14.16   50.49
An n of 3 would produce the (row,col) pairs for the values 1024, 137 and 111.
These locations could then, as usual, be fed to .loc to extract those values from the DataFrame. i.e.
df.loc['Z','A']
df.loc['Z','D']
df.loc['T','B']
Note: It is easy to mistake this question for one that involves .idxmax. That isn't applicable because the n largest may include multiple values from the same row and/or column.
You could try:
>>> import pandas as pd
>>> data = {0 : [1.3, 3.14, 1024, 63.123], 1: [3.6, 2.71, 42, 111], 2 : [33, 64, 66, 1.23], 3 : [61.38, 23.2, 137, 14.16], 4 : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data)
>>> df
0 1 2 3 4
0 1.300 3.60 33.00 61.38 0.30
1 3.140 2.71 64.00 23.20 21.00
2 1024.000 42.00 66.00 137.00 22.20
3 63.123 111.00 1.23 14.16 50.49
>>>
>>> a = list(zip(*df.stack().nlargest(3).index.labels))
>>> a
[(2, 0), (2, 3), (3, 1)]
>>> # then ...
>>> df.loc[a[0]]
1024.0
>>>
>>> # all sorted in decreasing order ...
>>> list(zip(*df.stack().nlargest(20).index.labels))
[(2, 0), (2, 3), (3, 1), (2, 2), (1, 2), (3, 0), (0, 3), (3, 4), (2, 1), (0, 2), (1, 3), (2, 4), (1, 4), (3, 3), (0, 1), (1, 0), (1, 1), (0, 0), (3, 2), (0, 4)]
Edit: In pandas versions 0.24.0 and above, MultiIndex.labels has been replaced by MultiIndex.codes (see Deprecations in What’s new in 0.24.0 (January 25, 2019)). The above code will throw AttributeError: 'MultiIndex' object has no attribute 'labels' and needs to be updated as follows:
>>> a = list(zip(*df.stack().nlargest(3).index.codes))
>>> a
[(2, 0), (2, 3), (3, 1)]
Edit 2: This question has become a "moving target", as the OP keeps changing it (this is my last update/edit). In the last update, OP's dataframe looks as follows:
>>> data = {'A' : [1.3, 3.14, 1024, 63.123], 'B' : [3.6, 2.71, 42, 111], 'C' : [33, 64, 66, 1.23], 'D' : [61.38, 23.2, 137, 14.16], 'E' : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])
>>> df
A B C D E
X 1.300 3.60 33.00 61.38 0.30
Y 3.140 2.71 64.00 23.20 21.00
Z 1024.000 42.00 66.00 137.00 22.20
T 63.123 111.00 1.23 14.16 50.49
The desired output can be obtained using:
>>> a = df.stack().nlargest(3).index
>>> a
MultiIndex([('Z', 'A'),
('Z', 'D'),
('T', 'B')],
)
>>>
>>> df.loc[a[0]]
1024.0
The trick is to use np.unravel_index on the result of np.argsort:
Example:
import numpy as np
import pandas as pd
N = 5
df = pd.DataFrame([[11, 3, 50, -3],
[5, 73, 11, 100],
[75, 9, -2, 44]])
s_ix = np.argsort(df.values, axis=None)[::-1][:N]
labels = np.unravel_index(s_ix, df.shape)
labels = list(zip(*labels))
print(labels) # --> [(1, 3), (2, 0), (1, 1), (0, 2), (2, 3)]
print(df.loc[labels[0]]) # --> 100
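The same unravel_index approach maps straight back onto the labelled frame from the question; a sketch that translates the positional (row, col) pairs into index/column labels ready for .loc:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.3, 3.14, 1024, 63.123],
                   'B': [3.6, 2.71, 42, 111],
                   'C': [33, 64, 66, 1.23],
                   'D': [61.38, 23.2, 137, 14.16],
                   'E': [0.3, 21, 22.2, 50.49]},
                  index=['X', 'Y', 'Z', 'T'])
n = 3
flat = np.argsort(df.values, axis=None)[::-1][:n]   # flat positions, largest first
rows, cols = np.unravel_index(flat, df.shape)       # back to (row, col) positions
locs = [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]
# locs == [('Z', 'A'), ('Z', 'D'), ('T', 'B')] -- each usable as df.loc[row, col]
```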

Pure PostgreSQL replacement for PL/R sample() function?

Our new database does not (and will not) support PL/R usage, which we rely on extensively to implement a random weighted sample function:
CREATE OR REPLACE FUNCTION sample(
ids bigint[],
size integer,
seed integer DEFAULT 1,
with_replacement boolean DEFAULT false,
probabilities numeric[] DEFAULT NULL::numeric[])
RETURNS bigint[]
LANGUAGE 'plr'
COST 100
VOLATILE
AS $BODY$
set.seed(seed)
ids = as.integer(ids)
if (length(ids) == 1) {
s = rep(ids,size)
} else {
s = sample(ids,size, with_replacement,probabilities)
}
return(s)
$BODY$;
Is there a purely SQL approach to this same function? This post shows an approach that selects a single random row, but does not have the functionality of sampling multiple groups at once.
As far as I know, SQL Fiddle does not support PL/R, so see below for a quick replication example:
CREATE TABLE test
(category text, uid integer, weight numeric)
;
INSERT INTO test
(category, uid, weight)
VALUES
('a', 1, 45),
('a', 2, 10),
('a', 3, 25),
('a', 4, 100),
('a', 5, 30),
('b', 6, 20),
('b', 7, 10),
('b', 8, 80),
('b', 9, 40),
('b', 10, 15),
('c', 11, 20),
('c', 12, 10),
('c', 13, 80),
('c', 14, 40),
('c', 15, 15)
;
SELECT category,
unnest(diffusion_shared.sample(array_agg(uid ORDER BY uid),
1,
1,
True,
array_agg(weight ORDER BY uid))
) as uid
FROM test
WHERE category IN ('a', 'b')
GROUP BY category;
Which outputs:
category uid
'a' 4
'b' 8
Any ideas?

pandas Multiindex - set_index with list of tuples

I ran into the following issue: I have an existing MultiIndex and want to replace a single level with a list of tuples, but I get a strange TypeError.
Code to reproduce:
idx = pd.MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
(2, u'one'), (2, u'two')],
names=['foo', 'bar'])
idx.set_levels([3, 5], level=0) # works fine
idx.set_levels([(1,2),(3,4)], level=0) #TypeError: Levels must be list-like
Can anyone comment:
1) What's the issue?
2) What's the best method to replace index (int values -> tuple values)
Thanks!
For me, building a new index with the constructor works:
idx = pd.MultiIndex.from_product([[(1,2),(3,4)], idx.levels[1]], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['foo', 'bar'])
EDIT1:
df = pd.DataFrame({'A':list('abcdef'),
'B':[1,2,1,2,2,1],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}).set_index(['B','C'])
#dynamically generate dictionary mapping old level values to tuples
new = [(1, 2), (3, 4)]
d = dict(zip(df.index.levels[0], new))
print (d)
{1: (1, 2), 2: (3, 4)}
#explicitly define the dictionary
d = {1:(1,2), 2:(3,4)}
#rename first level of the MultiIndex
df = df.rename(index=d, level=0)
print (df)
A D E F
B C
(1, 2) 7 a 1 5 a
(3, 4) 8 b 3 3 a
(1, 2) 9 c 5 6 a
(3, 4) 4 d 7 9 b
2 e 1 2 b
(1, 2) 3 f 0 4 b
EDIT:
import numpy as np
new = [(1, 2), (3, 4)]
lvl0 = list(map(tuple, np.array(new)[pd.factorize(idx.get_level_values(0))[0]].tolist()))
print (lvl0)
[(1, 2), (1, 2), (3, 4), (3, 4)]
idx = pd.MultiIndex.from_arrays([lvl0, idx.get_level_values(1)], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['foo', 'bar'])
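Equivalently, a plain-Python sketch of the same idea: map each old level-0 value to its replacement tuple and rebuild the index with from_arrays:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
                                 (2, u'one'), (2, u'two')],
                                names=['foo', 'bar'])
mapping = {1: (1, 2), 2: (3, 4)}  # old level-0 value -> replacement tuple
lvl0 = [mapping[v] for v in idx.get_level_values(0)]
idx2 = pd.MultiIndex.from_arrays([lvl0, idx.get_level_values(1)],
                                 names=idx.names)
# list(idx2) == [((1, 2), 'one'), ((1, 2), 'two'), ((3, 4), 'one'), ((3, 4), 'two')]
```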

Pandas: apply tupleize_cols to dataframe without to_csv()?

I like the tupleize_cols option in the to_csv() function. Is this functionality available on an in-memory dataframe? I would like to clean up the tuples of the multi-indexed columns to 'reportable' column names automatically.
Thanks,
Luc
Just use .values on the index
In [1]: i = pd.MultiIndex.from_product([[1,2,3],['a','b','c']])
In [2]: i
Out[2]:
MultiIndex(levels=[[1, 2, 3], [u'a', u'b', u'c']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
In [3]: i.values
Out[3]:
array([(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'),
(3, 'a'), (3, 'b'), (3, 'c')], dtype=object)
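Applied to a DataFrame's columns, the same .values trick yields the tuples in memory, and those can then be joined into 'reportable' flat names. A sketch (the underscore-joined names are just one possible choice):

```python
import pandas as pd

df = pd.DataFrame(0, index=[0],
                  columns=pd.MultiIndex.from_product([[1, 2], ['a', 'b']]))
tuples = list(df.columns.values)  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
df.columns = ['_'.join(map(str, t)) for t in tuples]
# df.columns: ['1_a', '1_b', '2_a', '2_b']
```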