Create dataframe from rows with differing column names - pandas

(Question rewritten as per comment-suggestions)
Suppose I have data like this:
{
    2012: [ ('A', 9), ('C', 7), ('D', 4) ],
    2013: [ ('B', 7), ('C', 6), ('E', 1) ]
}
How would I construct a dataframe that will account for the 'missing columns' in the rows?
i.e.
year A B C D E
0 2012 9 0 7 4 0
1 2013 0 7 6 0 1
I suppose I can perform a trivial preliminary manipulation to get:
[
    [ ('year', 2012), ('A', 9), ('C', 7), ('D', 4) ],
    [ ('year', 2013), ('B', 7), ('C', 6), ('E', 1) ]
]
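A minimal sketch of that preliminary step (assuming the dict above is stored in a variable named data):
rows = [[('year', year)] + pairs for year, pairs in data.items()]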

You could first apply the method suggested in this post by @jezrael to create a df with the standard constructor, and then use df.pivot to get it into the desired shape:
import pandas as pd
data = {
    2012: [ ('A', 9), ('C', 7), ('D', 4) ],
    2013: [ ('B', 7), ('C', 6), ('E', 1) ]
}
# flatten the dict into (year, column, value) triples
L = [(k, *t) for k, v in data.items() for t in v]
df = pd.DataFrame(L).rename(columns={0:'year'})\
       .pivot(index='year', columns=1, values=2).fillna(0).reset_index(drop=False)
df.columns.name = None
print(df)
year A B C D E
0 2012 9.0 0.0 7.0 4.0 0.0
1 2013 0.0 7.0 6.0 0.0 1.0
If the values are all ints, you could do .fillna(0).astype(int).reset_index(drop=False), of course.

The complete code, casting the value columns back to int:
import pandas as pd
data = {
    2012: [ ('A', 9), ('C', 7), ('D', 4) ],
    2013: [ ('B', 7), ('C', 6), ('E', 1) ]
}
L = [(k, *t) for k, v in data.items() for t in v]
df = pd.DataFrame(L).rename(columns={0:'year'})\
       .pivot(index='year', columns=1, values=2).fillna(0).reset_index(drop=False)
df.columns.name = None
df = df.astype({'A':'int','B':'int','C':'int','D':'int','E':'int'})
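An alternative sketch that skips the pivot entirely: build a dict of dicts and let the DataFrame constructor align the missing columns (sort_index is only there to keep the columns alphabetical):
import pandas as pd
data = {
    2012: [('A', 9), ('C', 7), ('D', 4)],
    2013: [('B', 7), ('C', 6), ('E', 1)],
}
df2 = (pd.DataFrame({year: dict(pairs) for year, pairs in data.items()})
         .T.sort_index(axis=1)
         .fillna(0).astype(int)
         .rename_axis('year').reset_index())
print(df2)
#    year  A  B  C  D  E
# 0  2012  9  0  7  4  0
# 1  2013  0  7  6  0  1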

Related

Pandas: Calculate max value from filtered elements of the same group

I have an input data frame like this:
import numpy as np
import pandas as pd

# input data frame
df = pd.DataFrame(
    [
        ("A", 11, 1),
        ("A", 12, 2),
        ("A", 13, 3),
        ("A", 14, 4),
        ("B", 21, 1),
        ("B", 22, 2),
        ("B", 23, 3),
        ("B", 24, 4),
    ],
    columns=("key", "ord", "val"),
)
I am looking for a simple way (without iteration) to calculate, for each group (key) and each element of the group, the maximum of the values from the previous rows in the same group. The result should look like this:
# wanted output data frame
df = pd.DataFrame(
    [
        ("A", 11, 1, np.nan),  # no previous element in this group, so it should be null
        ("A", 12, 2, 1),       # max of vals = [1] in group "A" and ord < 12
        ("A", 13, 3, 2),       # max of vals = [1, 2] in group "A" and ord < 13
        ("A", 14, 4, 3),       # max of vals = [1, 2, 3] in group "A" and ord < 14
        ("B", 21, 1, np.nan),
        ("B", 22, 2, 1),
        ("B", 23, 3, 2),
        ("B", 24, 4, 3),
    ],
    columns=("key", "ord", "val", "max_val_before"),
)
I tried to group and filter, but my solution does not give me the expected results. Is this possible without iterating over each row manually? Thank you very much.
I have saved the notebook also on Kaggle:
https://www.kaggle.com/maciejbednarz/mean-previous
Let us try cummax with shift
df.groupby('key').val.apply(lambda x : x.cummax().shift())
Out[221]:
0 NaN
1 1.0
2 2.0
3 3.0
4 NaN
5 1.0
6 2.0
7 3.0
Name: val, dtype: float64
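To attach that result as the max_val_before column, transform (which preserves the original index) can be assigned straight back; a minimal sketch using the question's data:
import pandas as pd

df = pd.DataFrame(
    [("A", 11, 1), ("A", 12, 2), ("A", 13, 3), ("A", 14, 4),
     ("B", 21, 1), ("B", 22, 2), ("B", 23, 3), ("B", 24, 4)],
    columns=("key", "ord", "val"),
)
# transform keeps the result aligned row-for-row with df
df["max_val_before"] = df.groupby("key")["val"].transform(lambda x: x.cummax().shift())
print(df)
#   key  ord  val  max_val_before
# 0   A   11    1             NaN
# 1   A   12    2             1.0
# 2   A   13    3             2.0
# 3   A   14    4             3.0
# 4   B   21    1             NaN
# 5   B   22    2             1.0
# 6   B   23    3             2.0
# 7   B   24    4             3.0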

List of the (row, col) of the n largest values in a numeric pandas DataFrame?

Given a Pandas DataFrame of numeric values, how can one produce a list of the .loc cell locations that can then be used to obtain the corresponding n largest values in the entire DataFrame?
For example:
        A        B      C      D       E
X       1.3      3.6    33     61.38   0.3
Y       3.14     2.71   64     23.2    21
Z       1024     42     66     137     22.2
T       63.123   111    1.23   14.16   50.49
An n of 3 would produce the (row,col) pairs for the values 1024, 137 and 111.
These locations could then, as usual, be fed to .loc to extract those values from the DataFrame. i.e.
df.loc['Z','A']
df.loc['Z','D']
df.loc['T','B']
Note: It is easy to mistake this question for one that involves .idxmax. That isn't applicable here because multiple of the n largest values may come from the same row and/or column.
You could try:
>>> import pandas as pd
>>> data = {0 : [1.3, 3.14, 1024, 63.123], 1: [3.6, 2.71, 42, 111], 2 : [33, 64, 66, 1.23], 3 : [61.38, 23.2, 137, 14.16], 4 : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data)
>>> df
0 1 2 3 4
0 1.300 3.60 33.00 61.38 0.30
1 3.140 2.71 64.00 23.20 21.00
2 1024.000 42.00 66.00 137.00 22.20
3 63.123 111.00 1.23 14.16 50.49
>>>
>>> a = list(zip(*df.stack().nlargest(3).index.labels))
>>> a
[(2, 0), (2, 3), (3, 1)]
>>> # then ...
>>> df.loc[a[0]]
1024.0
>>>
>>> # all sorted in decreasing order ...
>>> list(zip(*df.stack().nlargest(20).index.labels))
[(2, 0), (2, 3), (3, 1), (2, 2), (1, 2), (3, 0), (0, 3), (3, 4), (2, 1), (0, 2), (1, 3), (2, 4), (1, 4), (3, 3), (0, 1), (1, 0), (1, 1), (0, 0), (3, 2), (0, 4)]
Edit: In pandas versions 0.24.0 and above, MultiIndex.labels has been replaced by MultiIndex.codes (see Deprecations in What’s new in 0.24.0 (January 25, 2019)). The above code will throw AttributeError: 'MultiIndex' object has no attribute 'labels' and needs to be updated as follows:
>>> a = list(zip(*df.stack().nlargest(3).index.codes))
>>> a
[(2, 0), (2, 3), (3, 1)]
Edit 2: This question has become a "moving target", as the OP keeps changing it (this is my last update/edit). In the last update, OP's dataframe looks as follows:
>>> data = {'A' : [1.3, 3.14, 1024, 63.123], 'B' : [3.6, 2.71, 42, 111], 'C' : [33, 64, 66, 1.23], 'D' : [61.38, 23.2, 137, 14.16], 'E' : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])
>>> df
A B C D E
X 1.300 3.60 33.00 61.38 0.30
Y 3.140 2.71 64.00 23.20 21.00
Z 1024.000 42.00 66.00 137.00 22.20
T 63.123 111.00 1.23 14.16 50.49
The desired output can be obtained using:
>>> a = df.stack().nlargest(3).index
>>> a
MultiIndex([('Z', 'A'),
('Z', 'D'),
('T', 'B')],
)
>>>
>>> df.loc[a[0]]
1024.0
The trick is to use np.unravel_index on the result of np.argsort.
Example:
import numpy as np
import pandas as pd
N = 5
df = pd.DataFrame([[11, 3, 50, -3],
                   [5, 73, 11, 100],
                   [75, 9, -2, 44]])
s_ix = np.argsort(df.values, axis=None)[::-1][:N]
labels = np.unravel_index(s_ix, df.shape)
labels = list(zip(*labels))
print(labels) # --> [(1, 3), (2, 0), (1, 1), (0, 2), (2, 3)]
print(df.loc[labels[0]]) # --> 100
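If the frame has labelled rows and columns (as in the question's table), the positional pairs can be mapped back to labels before calling .loc; a small sketch on top of the code above:
# map positional (row, col) pairs to index/column labels
named = [(df.index[r], df.columns[c]) for r, c in labels]
print(df.loc[named[0]])  # --> 100 (same cell, addressed by label)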

Pure PostgreSQL replacement for PL/R sample() function?

Our new database does not (and will not) support PL/R usage, which we rely on extensively to implement a random weighted sample function:
CREATE OR REPLACE FUNCTION sample(
ids bigint[],
size integer,
seed integer DEFAULT 1,
with_replacement boolean DEFAULT false,
probabilities numeric[] DEFAULT NULL::numeric[])
RETURNS bigint[]
LANGUAGE 'plr'
COST 100
VOLATILE
AS $BODY$
    set.seed(seed)
    ids = as.integer(ids)
    if (length(ids) == 1) {
        s = rep(ids, size)
    } else {
        s = sample(ids, size, with_replacement, probabilities)
    }
    return(s)
$BODY$;
Is there a purely SQL approach to this same function? This post shows an approach that selects a single random row, but does not support sampling multiple groups at once.
As far as I know, SQL Fiddle does not support PL/R, so see below for a quick replication example:
CREATE TABLE test
(category text, uid integer, weight numeric)
;
INSERT INTO test
(category, uid, weight)
VALUES
('a', 1, 45),
('a', 2, 10),
('a', 3, 25),
('a', 4, 100),
('a', 5, 30),
('b', 6, 20),
('b', 7, 10),
('b', 8, 80),
('b', 9, 40),
('b', 10, 15),
('c', 11, 20),
('c', 12, 10),
('c', 13, 80),
('c', 14, 40),
('c', 15, 15)
;
SELECT category,
unnest(diffusion_shared.sample(array_agg(uid ORDER BY uid),
1,
1,
True,
array_agg(weight ORDER BY uid))
) as uid
FROM test
WHERE category IN ('a', 'b')
GROUP BY category;
Which outputs:
category uid
'a' 4
'b' 8
Any ideas?
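Not a drop-in replacement, but for the common case of drawing one uid per group, a pure-SQL sketch using the weighted-random-key trick (Efraimidis-Spirakis: order by random() ^ (1/weight) descending) might look like this; there is no seed argument, although setseed() makes random() reproducible within a session:
-- a sketch, not a full replacement for sample():
-- one weighted pick per category via random() ^ (1/weight)
SELECT DISTINCT ON (category)
       category, uid
FROM test
WHERE category IN ('a', 'b')
ORDER BY category, random() ^ (1.0 / weight::float8) DESC;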

Pandas DataFrame to Dict with (Row, Column) tuples as keys and the int values at those locations as values

I am trying to get the following DataFrame
A B C D E
A 0 7324 11765 6937 10424
B 7324 0 17791 3532 5902
C 11765 17791 0 17184 20608
D 6937 3532 17184 0 6550
E 10424 5902 20608 6550 0
to look something like this:
{
('A','A'): 0,
('A','B'): 7324,
('A','C'): 11765,
.
.
.
('E','C'): 20608,
('E','D'): 6550,
('E','E'): 0,
}
Simply put, the output is a dictionary whose keys are (row, column) 2-tuples and whose values are the DataFrame values at those locations. Thank you!
stack and then convert to dict:
df.stack().to_dict()
{('A', 'A'): 0,
('A', 'B'): 7324,
('A', 'C'): 11765,
('A', 'D'): 6937,
('A', 'E'): 10424,
('B', 'A'): 7324,
('B', 'B'): 0,
('B', 'C'): 17791,
('B', 'D'): 3532,
('B', 'E'): 5902,
('C', 'A'): 11765,
('C', 'B'): 17791,
('C', 'C'): 0,
('C', 'D'): 17184,
('C', 'E'): 20608,
('D', 'A'): 6937,
('D', 'B'): 3532,
('D', 'C'): 17184,
('D', 'D'): 0,
('D', 'E'): 6550,
('E', 'A'): 10424,
('E', 'B'): 5902,
('E', 'C'): 20608,
('E', 'D'): 6550,
('E', 'E'): 0}
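For a self-contained run, the matrix above can be rebuilt first (values copied from the question):
import pandas as pd

labels = list('ABCDE')
values = [[0,     7324,  11765, 6937,  10424],
          [7324,  0,     17791, 3532,  5902],
          [11765, 17791, 0,     17184, 20608],
          [6937,  3532,  17184, 0,     6550],
          [10424, 5902,  20608, 6550,  0]]
df = pd.DataFrame(values, index=labels, columns=labels)
d = df.stack().to_dict()
print(d[('A', 'B')])  # 7324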

Convert pandas Series/DataFrame to numpy matrix, unpacking coordinates from index

I have a pandas Series like so:
A 1
B 2
C 3
AB 4
AC 5
BA 4
BC 8
CA 5
CB 8
I'd like simple code to convert it to a matrix like this:
1 4 5
4 2 8
5 8 3
I'm after something fairly dynamic and built-in, rather than many loops hard-coded to this 3x3 case.
You can do it this way.
import pandas as pd
# your raw data
raw_index = 'A B C AB AC BA BC CA CB'.split()
values = [1, 2, 3, 4, 5, 4, 8, 5, 8]
# reformat index
index = [(a[0], a[-1]) for a in raw_index]
multi_index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(values, columns=['values'], index=multi_index)
df.unstack()
Out[47]:
values
A B C
A 1 4 5
B 4 2 8
C 5 8 3
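If the goal is a plain NumPy array rather than a DataFrame, .values (or .to_numpy() in newer pandas) on the unstacked result gives it; a small sketch continuing the example above:
mat = df.unstack().values
print(mat)
# [[1 4 5]
#  [4 2 8]
#  [5 8 3]]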
For a pd.DataFrame, use the .values attribute or the .to_records(...) method.
For a pd.Series, use the .unstack() method, as Jianxun Li said.
import numpy as np
import pandas as pd
# note: 'val' is listed first so that to_records() yields (val, var)
# tuples, matching the unpacking below
d = pd.DataFrame(data = {
    'val':[1,2,3,4,5,4,8,5,8],
    'var':['A','B','C','AB','AC','BA','BC','CA','CB'] })
# Here are some options for converting to np.matrix ...
np.matrix( d.to_records(index=False) )
# matrix([[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'AB'), (5, 'AC'), (4, 'BA'),
# (8, 'BC'), (5, 'CA'), (8, 'CB')]],
# dtype=[('val', '<i8'), ('var', 'O')])
# Here you can add code to rearrange it, e.g.
[(val, idx[0], idx[-1]) for val,idx in d.to_records(index=False) ]
# [(1, 'A', 'A'), (2, 'B', 'B'), (3, 'C', 'C'), (4, 'A', 'B'), (5, 'A', 'C'), (4, 'B', 'A'), (8, 'B', 'C'), (5, 'C', 'A'), (8, 'C', 'B')]
# and if you need numeric row- and col-indices:
[ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ]
# [(1, 0, 0), (2, 1, 1), (3, 2, 2), (4, 0, 1), (5, 0, 2), (4, 1, 0), (8, 1, 2), (5, 2, 0), (8, 2, 1)]
# you can sort by them:
sorted([ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ], key=lambda x: x[1:3] )
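To finish the rearrangement as an actual matrix, a sketch that fills a zero array from those (val, row, col) triples, deriving the alphabet from the data instead of hard-coding it:
# single letters that appear in the index strings
alphabet = sorted({ch for v in d['var'] for ch in v})   # ['A', 'B', 'C']
n = len(alphabet)
m = np.zeros((n, n), dtype=int)
for val, idx in d.to_records(index=False):
    m[alphabet.index(idx[0]), alphabet.index(idx[-1])] = val
print(m)
# [[1 4 5]
#  [4 2 8]
#  [5 8 3]]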