Pandas: Calculate the max of previous values within the same group

I have an input data frame like this:
import numpy as np
import pandas as pd

# input data frame
df = pd.DataFrame(
    [
        ("A", 11, 1),
        ("A", 12, 2),
        ("A", 13, 3),
        ("A", 14, 4),
        ("B", 21, 1),
        ("B", 22, 2),
        ("B", 23, 3),
        ("B", 24, 4),
    ],
    columns=("key", "ord", "val"),
)
I am looking for a simple way (without iteration) to calculate, for each group (key) and each element of the group, the maximum of the values from the previous rows in the same group. The result should look like this:
# wanted output data frame
df = pd.DataFrame(
    [
        ("A", 11, 1, np.nan),  # no previous element in this group, so it should be NaN
        ("A", 12, 2, 1),       # max of vals = [1] in group "A" with ord < 12
        ("A", 13, 3, 2),       # max of vals = [1, 2] in group "A" with ord < 13
        ("A", 14, 4, 3),       # max of vals = [1, 2, 3] in group "A" with ord < 14
        ("B", 21, 1, np.nan),
        ("B", 22, 2, 1),
        ("B", 23, 3, 2),
        ("B", 24, 4, 3),
    ],
    columns=("key", "ord", "val", "max_val_before"),
)
I tried to group and filter, but my solution does not give the expected results. Is this possible without iterating over each row manually? Thank you very much.
I have saved the notebook also on Kaggle:
https://www.kaggle.com/maciejbednarz/mean-previous

Let us try cummax with shift
df.groupby('key').val.apply(lambda x : x.cummax().shift())
Out[221]:
0 NaN
1 1.0
2 2.0
3 3.0
4 NaN
5 1.0
6 2.0
7 3.0
Name: val, dtype: float64
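To attach the result to the original frame as the max_val_before column, the same cummax/shift idea works with transform, which keeps the original row alignment (a minimal sketch):
df['max_val_before'] = df.groupby('key')['val'].transform(lambda x: x.cummax().shift())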

Related

Pandas: how to read specific CSV columns when the file doesn't contain a header

usecols = [*range(1, 5), *range(7, 9), *range(11, 13)]
df = pd.read_csv('/content/drive/MyDrive/weather.csv', header=None, usecols=usecols, names=['d', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity'])
I'm trying this but always get
ValueError: Usecols do not match columns, columns expected but not found: [1, 2, 3, 4, 7, 8, 11, 12]
The problem you are seeing is due to a mismatch between the number of columns designated by usecols and the number of columns designated by names.
Usecols:
[1, 2, 3, 4, 7, 8, 11, 12] - 8 columns
names:
'd', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity' - 9 columns
Change the code so that the last range in usecols ends at 14 rather than 13:
Code:
usecols = [*range(1, 5), *range(7, 9), *range(11, 14)]
df = pd.read_csv('/content/drive/MyDrive/weather.csv', header=None, usecols=usecols, names=['d', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity'])
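To catch this kind of mismatch early, one can assert that the two lists have equal lengths before calling read_csv (a minimal sketch using the lists from the question):
names = ['d', 'm', 'y', 'time', 'temp1', 'outtemp', 'temp2', 'air_pressure', 'humidity']
usecols = [*range(1, 5), *range(7, 9), *range(11, 14)]
# read_csv requires len(usecols) == len(names) when both are given
assert len(usecols) == len(names), f"{len(usecols)} usecols vs {len(names)} names"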

List of the (row, col) of the n largest values in a numeric pandas DataFrame?

Given a pandas DataFrame of numeric values, how can one produce a list of .loc cell locations that can then be used to obtain the corresponding n largest values in the entire DataFrame?
For example:
       A       B     C     D      E
X      1.3     3.6   33    61.38  0.3
Y      3.14    2.71  64    23.2   21
Z      1024    42    66    137    22.2
T      63.123  111   1.23  14.16  50.49
An n of 3 would produce the (row,col) pairs for the values 1024, 137 and 111.
These locations could then, as usual, be fed to .loc to extract those values from the DataFrame. i.e.
df.loc['Z','A']
df.loc['Z','D']
df.loc['T','B']
Note: It is easy to mistake this question for one that involves .idxmax. That isn't applicable here because the n largest values may include multiple values from the same row and/or column.
You could try:
>>> data = {0 : [1.3, 3.14, 1024, 63.123], 1: [3.6, 2.71, 42, 111], 2 : [33, 64, 66, 1.23], 3 : [61.38, 23.2, 137, 14.16], 4 : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data)
>>> df
          0       1      2       3      4
0     1.300    3.60  33.00   61.38   0.30
1     3.140    2.71  64.00   23.20  21.00
2  1024.000   42.00  66.00  137.00  22.20
3    63.123  111.00   1.23   14.16  50.49
>>>
>>> a = list(zip(*df.stack().nlargest(3).index.labels))
>>> a
[(2, 0), (2, 3), (3, 1)]
>>> # then ...
>>> df.loc[a[0]]
1024.0
>>>
>>> # all sorted in decreasing order ...
>>> list(zip(*df.stack().nlargest(20).index.labels))
[(2, 0), (2, 3), (3, 1), (2, 2), (1, 2), (3, 0), (0, 3), (3, 4), (2, 1), (0, 2), (1, 3), (2, 4), (1, 4), (3, 3), (0, 1), (1, 0), (1, 1), (0, 0), (3, 2), (0, 4)]
Edit: In pandas versions 0.24.0 and above, MultiIndex.labels has been replaced by MultiIndex.codes (see Deprecations in What’s new in 0.24.0, January 25, 2019). The above code will throw AttributeError: 'MultiIndex' object has no attribute 'labels' and needs to be updated as follows:
>>> a = list(zip(*df.stack().nlargest(3).index.codes))
>>> a
[(2, 0), (2, 3), (3, 1)]
Edit 2: This question has become a "moving target", as the OP keeps changing it (this is my last update/edit). In the last update, OP's dataframe looks as follows:
>>> data = {'A' : [1.3, 3.14, 1024, 63.123], 'B' : [3.6, 2.71, 42, 111], 'C' : [33, 64, 66, 1.23], 'D' : [61.38, 23.2, 137, 14.16], 'E' : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])
>>> df
          A       B      C       D      E
X     1.300    3.60  33.00   61.38   0.30
Y     3.140    2.71  64.00   23.20  21.00
Z  1024.000   42.00  66.00  137.00  22.20
T    63.123  111.00   1.23   14.16  50.49
The desired output can be obtained using:
>>> a = df.stack().nlargest(3).index
>>> a
MultiIndex([('Z', 'A'),
            ('Z', 'D'),
            ('T', 'B')],
           )
>>>
>>> df.loc[a[0]]
1024.0
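If plain (row, col) tuples are preferred over a MultiIndex, converting it to a list yields the .loc-ready pairs directly:
>>> list(a)
[('Z', 'A'), ('Z', 'D'), ('T', 'B')]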
The trick is to use np.unravel_index on the result of np.argsort.
Example:
import numpy as np
import pandas as pd
N = 5
df = pd.DataFrame([[11, 3, 50, -3],
                   [5, 73, 11, 100],
                   [75, 9, -2, 44]])
s_ix = np.argsort(df.values, axis=None)[::-1][:N]
labels = np.unravel_index(s_ix, df.shape)
labels = list(zip(*labels))
print(labels) # --> [(1, 3), (2, 0), (1, 1), (0, 2), (2, 3)]
print(df.loc[labels[0]]) # --> 100
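Note that np.unravel_index yields positional indices. If the frame has non-default row/column labels, they can be mapped through df.index and df.columns before calling .loc (a minimal sketch; here the labels happen to coincide with the positions):
loc_labels = [(df.index[r], df.columns[c]) for r, c in labels]
print(df.loc[loc_labels[0]])  # --> 100, same cell as labels[0]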

How much unique data is there, and how do I put it all in a table?

I would like to query in SQL how many unique values there are and how many rows there are. How do I do this in SQL so that I get a result like the Python output below?
In Python I could do the following:
d = {'sellerid': [1, 1, 1, 2, 2, 3, 3, 3],
     'modelnumber': [85, 45, 85, 12, 85, 74, 85, 12],
     'modelgroup': [2, 3, 2, 1, 2, 3, 2, 1]}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
unique_sellerid = df['sellerid'].nunique()
print("unique_sellerid", unique_sellerid)
unique_modelnumber = df['modelnumber'].nunique()
print("unique_modelnumber", unique_modelnumber)
unique_modelgroup = df['modelgroup'].nunique()
print("unique_modelgroup", unique_modelgroup)
total_rows = df.shape[0]
print("total_rows", total_rows)
[OUT]
unique_sellerid 3
unique_modelnumber 4
unique_modelgroup 3
total_rows 8
I want a query that returns a result like the Python output above.
Here is the dummy table:
CREATE TABLE cars (
    sellerid INT NOT NULL,
    modelnumber INT NOT NULL,
    modelgroup INT
);
INSERT INTO cars
(sellerid, modelnumber, modelgroup)
VALUES
(1, 85, 2),
(1, 45, 3),
(1, 85, 2),
(2, 12, 1),
(2, 85, 2),
(3, 74, 3),
(3, 85, 2),
(3, 12, 1);
You could use the count(distinct column) aggregation function, like:
select
    count(distinct col1) as nunique_col1,
    count(distinct col2) as nunique_col2,
    count(1) as nb_rows
from your_table
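Applied to the cars dummy table above, the concrete query would look like this (a sketch in standard SQL; the aliases are illustrative):
select
    count(distinct sellerid)    as unique_sellerid,
    count(distinct modelnumber) as unique_modelnumber,
    count(distinct modelgroup)  as unique_modelgroup,
    count(*)                    as total_rows
from cars;
which returns 3, 4, 3 and 8, matching the Python output.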
In pandas, you can also apply the nunique() function to the whole DataFrame rather than to each column separately: df.nunique()
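For the example frame that gives (after dropping the helper 'Dataframe' column added in the question):
df[['sellerid', 'modelnumber', 'modelgroup']].nunique()
sellerid       3
modelnumber    4
modelgroup     3
dtype: int64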

pandas Multiindex - set_index with list of tuples

I ran into the following issue: I have an existing MultiIndex and want to replace a single level with a list of tuples, but I get a strange error.
Code to reproduce:
idx = pd.MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
                                 (2, u'one'), (2, u'two')],
                                names=['foo', 'bar'])
idx.set_levels([3, 5], level=0)            # works fine
idx.set_levels([(1, 2), (3, 4)], level=0)  # TypeError: Levels must be list-like
Can anyone comment:
1) What's the issue?
2) What's the best method to replace the index (int values -> tuple values)?
Thanks!
For me, building a new index with the from_product constructor works:
idx = pd.MultiIndex.from_product([[(1,2),(3,4)], idx.levels[1]], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['foo', 'bar'])
EDIT1:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [1, 2, 1, 2, 2, 1],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')}).set_index(['B', 'C'])
# dynamically generate dictionary with list of tuples
new = [(1, 2), (3, 4)]
d = dict(zip(df.index.levels[0], new))
print (d)
{1: (1, 2), 2: (3, 4)}
# explicitly define dictionary
d = {1:(1,2), 2:(3,4)}
# rename first level of MultiIndex
df = df.rename(index=d, level=0)
print (df)
          A  D  E  F
B      C
(1, 2) 7  a  1  5  a
(3, 4) 8  b  3  3  a
(1, 2) 9  c  5  6  a
(3, 4) 4  d  7  9  b
       2  e  1  2  b
(1, 2) 3  f  0  4  b
EDIT:
new = [(1, 2), (3, 4)]
lvl0 = list(map(tuple, np.array(new)[pd.factorize(idx.get_level_values(0))[0]].tolist()))
print (lvl0)
[(1, 2), (1, 2), (3, 4), (3, 4)]
idx = pd.MultiIndex.from_arrays([lvl0, idx.get_level_values(1)], names=idx.names)
print (idx)
MultiIndex(levels=[[(1, 2), (3, 4)], ['one', 'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['foo', 'bar'])

Convert pandas Series/DataFrame to numpy matrix, unpacking coordinates from index

I have a pandas series as so:
A 1
B 2
C 3
AB 4
AC 5
BA 4
BC 8
CA 5
CB 8
I'd like simple code to convert it to a matrix like this:
1 4 5
4 2 8
5 8 3
I'm looking for something fairly dynamic and built in, rather than many loops tailored to this 3x3 problem.
You can do it this way.
import pandas as pd
# your raw data
raw_index = 'A B C AB AC BA BC CA CB'.split()
values = [1, 2, 3, 4, 5, 4, 8, 5, 8]
# reformat index
index = [(a[0], a[-1]) for a in raw_index]
multi_index = pd.MultiIndex.from_tuples(index)
df = pd.DataFrame(values, columns=['values'], index=multi_index)
df.unstack()
Out[47]:
  values
       A  B  C
A      1  4  5
B      4  2  8
C      5  8  3
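If a plain numpy array is needed rather than a DataFrame, the values of the unstacked frame can be taken directly (a minimal sketch building on the code above):
m = df.unstack().values
# array([[1, 4, 5],
#        [4, 2, 8],
#        [5, 8, 3]])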
For a pd.DataFrame, use the .values attribute or else the .to_records(...) method.
For a pd.Series, use the .unstack() method, as Jianxun Li said.
import numpy as np
import pandas as pd
# note: 'val' is listed first so that to_records yields (val, var) pairs, as unpacked below
d = pd.DataFrame(data={
    'val': [1, 2, 3, 4, 5, 4, 8, 5, 8],
    'var': ['A', 'B', 'C', 'AB', 'AC', 'BA', 'BC', 'CA', 'CB']})
# Here are some options for converting to np.matrix ...
np.matrix( d.to_records(index=False) )
# matrix([[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'AB'), (5, 'AC'), (4, 'BA'),
# (8, 'BC'), (5, 'CA'), (8, 'CB')]],
# dtype=[('val', '<i8'), ('var', 'O')])
# Here you can add code to rearrange it, e.g.
[(val, idx[0], idx[-1]) for val,idx in d.to_records(index=False) ]
# [(1, 'A', 'A'), (2, 'B', 'B'), (3, 'C', 'C'), (4, 'A', 'B'), (5, 'A', 'C'), (4, 'B', 'A'), (8, 'B', 'C'), (5, 'C', 'A'), (8, 'C', 'B')]
# and if you need numeric row- and col-indices:
[ (val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]) ) for val,idx in d.to_records(index=False) ]
# [(1, 0, 0), (2, 1, 1), (3, 2, 2), (4, 0, 1), (5, 0, 2), (4, 1, 0), (8, 1, 2), (5, 2, 0), (8, 2, 1)]
# you can sort by the (row, col) indices:
sorted([(val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1])) for val, idx in d.to_records(index=False)], key=lambda x: x[1:])
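If the dense matrix itself is the goal, the (val, row, col) triples can be scattered into a numpy array (a minimal sketch, assuming the single letters map to contiguous 0-based positions as above):
triples = [(val, 'ABCDEF...'.index(idx[0]), 'ABCDEF...'.index(idx[-1]))
           for val, idx in d.to_records(index=False)]
n = max(max(r, c) for _, r, c in triples) + 1  # matrix size, 3 here
m = np.zeros((n, n))
for val, r, c in triples:
    m[r, c] = val
# m --> [[1, 4, 5], [4, 2, 8], [5, 8, 3]]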