Pandas: apply tupleize_cols to dataframe without to_csv()?

I like the tupleize_cols option in the to_csv() function. Is this functionality available on an in-memory dataframe? I would like to clean up the tuples of the multi-indexed columns into 'reportable' column names automatically.
Thanks,
Luc

Just use .values on the index
In [1]: i = pd.MultiIndex.from_product([[1,2,3],['a','b','c']])
In [2]: i
Out[2]:
MultiIndex(levels=[[1, 2, 3], [u'a', u'b', u'c']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
In [3]: i.values
Out[3]:
array([(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'),
(3, 'a'), (3, 'b'), (3, 'c')], dtype=object)
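To get from those tuples to 'reportable' column names, one option is to join each tuple into a single string. This is only a minimal sketch: the example frame and the underscore separator are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Hypothetical frame with two-level columns (not from the question).
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=pd.MultiIndex.from_product([["x", "y"], ["min", "max"]]),
)

# .values yields one tuple per column; join the parts into a flat name.
flat = df.copy()
flat.columns = ["_".join(map(str, col)) for col in df.columns.values]

print(flat.columns.tolist())
```

Recent pandas versions also offer MultiIndex.to_flat_index(), which returns the same tuples as a regular Index if the tuples themselves are what you need.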

Related

pd.MultiIndex from product

from pandas documentation:
numbers = [0, 1, 2]
colors = ['green', 'purple']
pd.MultiIndex.from_product([numbers, colors],names=['number', 'color'])
MultiIndex([(0, 'green'),
(0, 'purple'),
(1, 'green'),
(1, 'purple'),
(2, 'green'),
(2, 'purple')],
names=['number', 'color'])
what I got:
MultiIndex(levels=[[0, 1, 2], ['green', 'purple']],
codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['numbers', 'colors'])
Can someone please help me understand why I got this output from the same code?
That is how earlier pandas versions represented a MultiIndex. On my system, pandas 1.0.3 gives the former output and 0.24.2 gives the latter. Make sure your installed version matches the version the docs were written for.
See the "Better repr for MultiIndex" enhancement, which was released in v0.25.0.
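To check this on your own machine, it may help to print the installed version and compare the index contents rather than its repr. A small sketch, where the variable name mi is just an illustrative choice:

```python
import pandas as pd

numbers = [0, 1, 2]
colors = ['green', 'purple']
mi = pd.MultiIndex.from_product([numbers, colors], names=['number', 'color'])

# The repr depends on the pandas version; the contents do not.
print(pd.__version__)
print(mi.tolist())          # the tuples shown by the newer repr
print(list(mi.levels[0]))   # the levels shown by the older repr
print(list(mi.levels[1]))
```

Both representations describe the same index: the tuples are what you get by combining the levels according to the codes.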

Pandas DataFrame to Dict with (Row, Column) tuple as keys and int value at those location as values

I am trying to make the following DataFrame
       A      B      C      D      E
A      0   7324  11765   6937  10424
B   7324      0  17791   3532   5902
C  11765  17791      0  17184  20608
D   6937   3532  17184      0   6550
E  10424   5902  20608   6550      0
to look something like this:
{
('A','A'): 0,
('A','B'): 7324,
('A','C'): 11765,
.
.
.
('E','C'): 20608,
('E','D'): 6550,
('E','E'): 0,
}
Simply put, the output is a dictionary whose keys are (row, column) 2-tuples and whose values are the values at those locations. Thank you!
Use stack and then convert to a dict:
df.stack().to_dict()
{('A', 'A'): 0,
('A', 'B'): 7324,
('A', 'C'): 11765,
('A', 'D'): 6937,
('A', 'E'): 10424,
('B', 'A'): 7324,
('B', 'B'): 0,
('B', 'C'): 17791,
('B', 'D'): 3532,
('B', 'E'): 5902,
('C', 'A'): 11765,
('C', 'B'): 17791,
('C', 'C'): 0,
('C', 'D'): 17184,
('C', 'E'): 20608,
('D', 'A'): 6937,
('D', 'B'): 3532,
('D', 'C'): 17184,
('D', 'D'): 0,
('D', 'E'): 6550,
('E', 'A'): 10424,
('E', 'B'): 5902,
('E', 'C'): 20608,
('E', 'D'): 6550,
('E', 'E'): 0}
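For completeness, here is a self-contained sketch that rebuilds the frame from the question and applies the same stack().to_dict() call. The variable names labels, rows, and result are illustrative.

```python
import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E']
rows = [
    [0, 7324, 11765, 6937, 10424],
    [7324, 0, 17791, 3532, 5902],
    [11765, 17791, 0, 17184, 20608],
    [6937, 3532, 17184, 0, 6550],
    [10424, 5902, 20608, 6550, 0],
]
df = pd.DataFrame(rows, index=labels, columns=labels)

# stack() moves the column labels into a second index level, so each
# value is keyed by a (row, column) pair; to_dict() keeps those pairs as keys.
result = df.stack().to_dict()

print(result[('A', 'B')], result[('E', 'C')])
```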

Can fromfile omit fields?

I am reading data from a given binary format, however I am only interested in a subset of the fields.
For example:
MY_DTYPE = np.dtype({'names': ('A', 'B', 'C'), 'formats': ('<f8', '<u2', 'u1')})
data = np.fromfile(infile, count=-1, dtype=MY_DTYPE)
Assuming I don't really need data['C'], is it possible to specify which fields I want to keep in the first place?
Simulate the load:
In [117]: MY_DTYPE = np.dtype({'names': ('A', 'B', 'C'), 'formats': ('<f8', '<u2', 'u1')})
In [118]: data = np.zeros(3, MY_DTYPE)
In [119]: data
Out[119]:
array([(0., 0, 0), (0., 0, 0), (0., 0, 0)],
dtype=[('A', '<f8'), ('B', '<u2'), ('C', 'u1')])
In [120]: data['C']
Out[120]: array([0, 0, 0], dtype=uint8)
In recent NumPy versions (1.16 and later), multifield indexing creates a view:
In [121]: data[['A','B']]
Out[121]:
array([(0., 0), (0., 0), (0., 0)],
dtype={'names':['A','B'], 'formats':['<f8','<u2'], 'offsets':[0,8], 'itemsize':11})
NumPy provides a repack_fields function to make a proper packed copy:
In [122]: import numpy.lib.recfunctions as rf
In [123]: rf.repack_fields(data[['A','B']])
Out[123]: array([(0., 0), (0., 0), (0., 0)], dtype=[('A', '<f8'), ('B', '<u2')])
See the docs of repack_fields for more information, or look at recent release notes.
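Put together, the approach above amounts to loading every field and then keeping only the ones you need. The sketch below writes a small temporary file to stand in for the real input; the file name and the sample values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import numpy.lib.recfunctions as rf

MY_DTYPE = np.dtype({'names': ('A', 'B', 'C'), 'formats': ('<f8', '<u2', 'u1')})

# Made-up records standing in for the real binary file.
full = np.zeros(3, MY_DTYPE)
full['A'] = [1.5, 2.5, 3.5]
full['B'] = [10, 20, 30]
full['C'] = [7, 8, 9]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.bin')
    full.tofile(path)
    data = np.fromfile(path, count=-1, dtype=MY_DTYPE)

# Multifield indexing gives a view that still spans the bytes of 'C';
# repack_fields copies it into a dtype without the unused bytes.
subset = rf.repack_fields(data[['A', 'B']])

print(subset.dtype, subset.dtype.itemsize)
```

Note that the whole file is still read into memory first; the repacked copy only saves space afterwards.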

pandas Multiindex - set_index with list of tuples

I ran into the following issue. I have an existing MultiIndex and want to replace the values of a single level with a list of tuples, but I get a strange error.
Code to reproduce:
idx = pd.MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
(2, u'one'), (2, u'two')],
names=['foo', 'bar'])
idx.set_levels([3, 5], level=0) # works fine
idx.set_levels([(1,2),(3,4)], level=0) #TypeError: Levels must be list-like
Can anyone comment:
1) What's the issue?
2) What's the best method to replace index (int values -> tuple values)
Thanks!
For me, building a new MultiIndex with the constructor works:
idx = pd.MultiIndex.from_product([[(1,2),(3,4)], idx.levels[1]], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['foo', 'bar'])
EDIT1:
df = pd.DataFrame({'A':list('abcdef'),
'B':[1,2,1,2,2,1],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}).set_index(['B','C'])
#dynamically generate dictionary with list of tuples
new = [(1, 2), (3, 4)]
d = dict(zip(df.index.levels[0], new))
print (d)
{1: (1, 2), 2: (3, 4)}
#explicit define dictionary
d = {1:(1,2), 2:(3,4)}
#rename first level of MultiIndex
df = df.rename(index=d, level=0)
print (df)
          A  D  E  F
B      C
(1, 2) 7  a  1  5  a
(3, 4) 8  b  3  3  a
(1, 2) 9  c  5  6  a
(3, 4) 4  d  7  9  b
       2  e  1  2  b
(1, 2) 3  f  0  4  b
EDIT:
new = [(1, 2), (3, 4)]
lvl0 = list(map(tuple, np.array(new)[pd.factorize(idx.get_level_values(0))[0]].tolist()))
print (lvl0)
[(1, 2), (1, 2), (3, 4), (3, 4)]
idx = pd.MultiIndex.from_arrays([lvl0, idx.get_level_values(1)], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['foo', 'bar'])
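A somewhat simpler variant of the same idea is to map the old level values to tuples with a plain dict and rebuild the index with from_arrays. This is a sketch under the assumption that every value in the first level has an entry in the (hypothetical) mapping dict.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two')],
    names=['foo', 'bar'],
)

# Hypothetical mapping from old int values to new tuple values.
mapping = {1: (1, 2), 2: (3, 4)}

# Translate the first level row by row, then rebuild the MultiIndex.
lvl0 = [mapping[v] for v in idx.get_level_values(0)]
new_idx = pd.MultiIndex.from_arrays(
    [lvl0, idx.get_level_values(1)], names=idx.names
)

print(new_idx.get_level_values(0).tolist())
```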

Convert pandas Series/DataFrame to numpy matrix, unpacking coordinates from index

I have a pandas series as so:
A 1
B 2
C 3
AB 4
AC 5
BA 4
BC 8
CA 5
CB 8
Simple code to convert to a matrix as such:
1 4 5
4 2 8
5 8 3
Something fairly dynamic and built in, rather than many loops to fix this 3x3 problem.
You can do it this way.
import pandas as pd
# your raw data
raw_index = 'A B C AB AC BA BC CA CB'.split()
values = [1, 2, 3, 4, 5, 4, 8, 5, 8]
# reformat index
index = [(a[0], a[-1]) for a in raw_index]
multi_index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(values, columns=['values'], index=multi_index)
df.unstack()
Out[47]:
  values
       A  B  C
A      1  4  5
B      4  2  8
C      5  8  3
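If a plain NumPy array is the end goal, the same unstack idea can be applied to the Series directly and followed by to_numpy(). A minimal sketch, assuming (as in the question) that one-letter keys such as 'A' are the diagonal entries:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4, 8, 5, 8],
              index='A B C AB AC BA BC CA CB'.split())

# First and last character give (row, column); 'A' becomes ('A', 'A').
s.index = pd.MultiIndex.from_tuples([(k[0], k[-1]) for k in s.index])

# unstack() pivots the second level into columns; to_numpy() drops the labels.
matrix = s.unstack().to_numpy()

print(matrix)
```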
For a pd.DataFrame, use the .values attribute or else the .to_records(...) method.
For a pd.Series, use the .unstack() method, as Jianxun Li said.
import numpy as np
import pandas as pd
# 'val' is listed first so that the records unpack as (val, var) below
d = pd.DataFrame(data = {
'val':[1,2,3,4,5,4,8,5,8],
'var':['A','B','C','AB','AC','BA','BC','CA','CB'] })
# Here are some options for converting to np.matrix ...
np.matrix( d.to_records(index=False) )
# matrix([[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'AB'), (5, 'AC'), (4, 'BA'),
# (8, 'BC'), (5, 'CA'), (8, 'CB')]],
# dtype=[('val', '<i8'), ('var', 'O')])
# Here you can add code to rearrange it, e.g.
[(val, idx[0], idx[-1]) for val,idx in d.to_records(index=False) ]
# [(1, 'A', 'A'), (2, 'B', 'B'), (3, 'C', 'C'), (4, 'A', 'B'), (5, 'A', 'C'), (4, 'B', 'A'), (8, 'B', 'C'), (5, 'C', 'A'), (8, 'C', 'B')]
# and if you need numeric row- and col-indices:
[ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ]
# [(1, 0, 0), (2, 1, 1), (3, 2, 2), (4, 0, 1), (5, 0, 2), (4, 1, 0), (8, 1, 2), (5, 2, 0), (8, 2, 1)]
# you can sort by them:
sorted([ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ], key=lambda x: x[1:3] )
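Once the numeric row and column indices are known, the values can be written straight into a NumPy array. The following is only a sketch: it derives the label order from the data instead of a hard-coded alphabet string, and the names labels, pos, and matrix are illustrative.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    'val': [1, 2, 3, 4, 5, 4, 8, 5, 8],
    'var': ['A', 'B', 'C', 'AB', 'AC', 'BA', 'BC', 'CA', 'CB'],
})

# Collect the distinct letters and give each one a position.
labels = sorted({ch for key in d['var'] for ch in key})
pos = {label: i for i, label in enumerate(labels)}

# Fill the matrix: first letter is the row, last letter is the column.
matrix = np.zeros((len(labels), len(labels)), dtype=int)
for val, key in zip(d['val'], d['var']):
    matrix[pos[key[0]], pos[key[-1]]] = val

print(matrix)
```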