How do I solve this kind of problem through pandas.cut()?

I have my data as
data = pd.DataFrame({'A':[3,50,50,60],'B':[49,5,37,59],'C':[15,34,43,6],'D':[35,39,10,25]})
If I use cut this way
p = ['A','S','T','U','V','C','Z']
bins = [0,30,35,40,45,50,55,60]
data['A*'] = pd.cut(data.A,bins,labels=p)
print(data)
I get
A B C D A*
0 3 49 15 35 A
1 50 5 34 39 V
2 50 37 43 10 V
3 60 59 6 25 Z
How would I cut it to get
A B C D A*
0 3 49 15 35 3A
1 50 5 34 39 50V
2 50 37 43 10 50V
3 60 59 6 25 60Z
I tried this, but it doesn't work:
for x in data.A:
    p = [str(x)+'A', str(x)+'S', str(x)+'T', str(x)+'U', str(x)+'V', str(x)+'C', str(x)+'Z']
    bins = [0,30,35,40,45,50,55,60]
data['A*'] = pd.cut(data.A, bins, labels=p)
It gives me this (the loop rebinds p on every pass, so only the labels built from the last value of A, 60, survive):
A B C D A*
0 3 49 15 35 60A
1 50 5 34 39 60V
2 50 37 43 10 60V
3 60 59 6 25 60Z

Convert column A to strings, convert the categoricals returned by pd.cut to strings as well, and join them together:
p = ['A','S','T','U','V','C','Z']
bins = [0,30,35,40,45,50,55,60]
data['A*'] = data.A.astype(str) + pd.cut(data.A,bins,labels=p).astype(str)
print(data)
A B C D A*
0 3 49 15 35 3A
1 50 5 34 39 50V
2 50 37 43 10 50V
3 60 59 6 25 60Z
EDIT:
To process all columns, it is possible to use DataFrame.apply:
data = data.apply(lambda x: x.astype(str) + pd.cut(x,bins,labels=p).astype(str))
print(data)
A B C D
0 3A 49V 15A 35S
1 50V 5A 34S 39T
2 50V 37T 43U 10A
3 60Z 59Z 6A 25A
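One caveat, not covered in the answer above: pd.cut returns NaN for any value that falls outside the bins (here, anything equal to 0 or above 60), and .astype(str) turns those NaNs into the literal string 'nan'. A minimal guard, as a sketch (labels is just an intermediate variable introduced here):
# blank out labels for out-of-range values instead of appending 'nan'
labels = pd.cut(data.A, bins, labels=p).astype(str)
data['A*'] = data.A.astype(str) + labels.where(labels != 'nan', '')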

Related

iteration calculation based on another dataframe

How do I do an iterative calculation to produce df2 as the desired output shown below? Any reference links would be appreciated, many thanks for helping.
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2:
a b c
0 1 0 5 >> values from df1.iloc[0]
1 19 18 9 >> values from (df1.iloc[1] * 2) + (df2.iloc[0] * 1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + (df2.iloc[1] * 1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + (df2.iloc[2] * 1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + (df2.iloc[3] * 1)
IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
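Why the one-liner works (a derivation, not spelled out in the original answer): with x[0] = d[0] and x[n] = d[n]*2 + x[n-1], unrolling gives x[n] = 2*(d[0] + ... + d[n]) - d[0], i.e. twice the cumulative sum minus the first row. A self-contained check:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 9, 2, 6, 6],
                    'b': [0, 9, 2, 3, 1],
                    'c': [5, 2, 8, 0, 7]})

# closed form: twice the cumulative sum minus the first row
closed = df1.mul(2).cumsum().sub(df1.iloc[0])

# explicit recurrence for comparison
rows = [df1.iloc[0]]
for i in range(1, len(df1)):
    rows.append(df1.iloc[i] * 2 + rows[-1])
looped = pd.DataFrame(rows).reset_index(drop=True)

assert closed.equals(looped)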
More complex operation
If you want x[n] = s[n]*2 + x[n-1]*3 (which is what the code below computes), you need to iterate:
def process(s):
    out = [s.iloc[0]]
    for x in s.iloc[1:]:
        out.append(x * 2 + out[-1] * 3)
    return out
df1.apply(process)
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671
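As a side note (an addition, not part of the original answer), the same recurrence can be written with itertools.accumulate, which keeps the explicit loop out of user code:
from itertools import accumulate

def process_acc(s):
    # seeds with the first element, then applies x[n] = s[n]*2 + x[n-1]*3
    return list(accumulate(s, lambda prev, x: x * 2 + prev * 3))

df1.apply(process_acc)  # same output as process above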

Reordering a DF by category in a preset order

df = pd.DataFrame(np.random.randint(0,100,size=(15, 3)), columns=list('NMO'))
df['Category1'] = ['I','I','I','I','I','G','G','G','G','G','P','P','I','I','P']
df['Category2'] = ['W','W','C','C','C','W','W','W','W','W','O','O','O','O','O']
Imagining this df is much larger with many more categories, how might I sort it, retaining all the characteristics of any given row, in a predetermined order? E.g. sorting the df only by 'Category1', such that all the P's come first, then the I's, then the G's.
You can use a categorical type:
cat_type = pd.CategoricalDtype(categories=["P", "I", "G"], ordered=True)
df['Category1'] = df['Category1'].astype(cat_type)
print(df.sort_values(by='Category1'))
Prints:
N M O Category1 Category2
10 49 37 44 P O
11 72 64 66 P O
14 39 98 32 P O
0 93 12 89 I W
1 20 74 21 I W
2 25 22 24 I C
3 47 11 33 I C
4 60 16 34 I C
12 0 90 6 I O
13 13 35 80 I O
5 84 64 67 G W
6 70 47 83 G W
7 61 57 76 G W
8 19 8 3 G W
9 7 8 5 G W
For PIG order (reverse alphabetical order):
df.sort_values('Category1', ascending=False)
For custom sorting:
df['Category1'] = pd.Categorical(df['Category1'], ['P','G','I'])
df = df.sort_values('Category1')
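On pandas 1.1 or newer there is also the key parameter of sort_values, which lets you express a one-off custom order without changing the column's dtype; a sketch using the same df:
# rank each category by its position in the desired order
order = {cat: i for i, cat in enumerate(['P', 'I', 'G'])}
df = df.sort_values('Category1', key=lambda s: s.map(order))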

R: How to make a violin/box plot of the last (or any) data points in a time series?

I have the following data frame, A, and would like to make a violin/box plot of the last data points (or any other selected time) for all IDs in the time series, i.e. for time = 90 the values for ID = 1...5 should be plotted.
A = data.frame(ID = rep(seq(1,5), each = 10),
               time = rep(seq(0,90, by = 10), 5),
               value = rnorm(50))
ID time value
1 1 0 0.056152116
2 1 10 0.560673698
3 1 20 -0.240922725
4 1 30 -1.054686869
5 1 40 -0.734477812
6 1 50 1.123602646
7 1 60 -2.242830898
8 1 70 -0.818526167
9 1 80 1.476234401
10 1 90 -0.332324134
11 2 0 -1.486034438
12 2 10 0.222252053
13 2 20 -0.675720560
14 2 30 -3.144918043
15 2 40 3.058383376
16 2 50 0.978174555
17 2 60 -0.280927730
18 2 70 -0.188338714
19 2 80 -1.115583389
20 2 90 0.362044729
...
41 5 0 0.687402844
42 5 10 -1.127714642
43 5 20 0.117758547
44 5 30 0.507666153
45 5 40 0.205580300
46 5 50 -1.033018214
47 5 60 -1.906279605
48 5 70 0.117539035
49 5 80 -0.968888556
50 5 90 0.122049005
Try this:
set.seed(42)
A = data.frame(ID = rep(seq(1,5), each = 10),
               time = rep(seq(0,90, by = 10), 5),
               value = rnorm(50))

library(ggplot2)
library(dplyr)

filter(A, time == 90) %>%
  ggplot(aes(y = value)) +
  geom_boxplot()
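For a violin plot instead of a box plot, swap geom_boxplot() for geom_violin() in the same pipeline; to plot a different time point, change the value tested in the filter() call.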

Dropping list of rows from multi-level pandas dataframe where first two levels have duplicate indices

I would like to drop a list of row indices from a multi-level data frame, where the first two levels have duplicate entries. I imagine it is possible to do this without a loop, but thus far I have not found a way.
I have attempted to use the DataFrame.drop method by providing a list of row index combinations, though this does not have the desired effect. As an example:
import numpy as np
import pandas as pd
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

def src_rec(n, mult):
    src = [[no] * mult for no in range(1, n)]
    src = [item for sublist in src for item in sublist]
    rec = [no for no in range(1, n)] * mult
    return src, rec
src, rec = src_rec(4,4)
miindex = pd.MultiIndex.from_arrays([src * 2,
                                     rec * 2,
                                     mklbl('C', 24)])
dfmi = pd.DataFrame(np.arange(len(miindex) * 2)
                      .reshape((len(miindex), 2)),
                    index=miindex)
I would like to drop all rows with index values (1,2,:) and (2,3,:)
As = [1, 2]
Bs = [2, 3]
dfmi.drop(pd.MultiIndex.from_arrays([As,Bs]))
The result of this is:
0 1
1 1 C0 0 1
2 1 C18 36 37
2 C19 38 39
3 3 C20 40 41
1 C21 42 43
2 C22 44 45
3 C23 46 47
While my desired result is:
0 1
1 1 C0 0 1
3 C2 4 5
1 C3 6 7
2 2 C4 8 9
1 C6 12 13
2 C7 14 15
3 3 C8 16 17
1 C9 18 19
2 C10 20 21
3 C11 22 23
1 1 C12 24 25
3 C14 28 29
1 C15 30 31
2 2 C16 32 33
1 C18 36 37
2 C19 38 39
3 3 C20 40 41
1 C21 42 43
2 C22 44 45
3 C23 46 47
An example of doing this in a loop is:
for A, B in zip(As, Bs):
    dfmi_drop_idx = dfmi.loc[(A, B, slice(None)), :].index
    dfmi.drop(dfmi_drop_idx, inplace=True, errors='raise')
Use boolean indexing, testing membership with Index.isin:
m = pd.MultiIndex.from_arrays([As,Bs])
df = dfmi[~dfmi.reset_index(level=2, drop=True).index.isin(m)]
print (df)
0 1
1 1 C0 0 1
3 C2 4 5
1 C3 6 7
2 2 C4 8 9
1 C6 12 13
2 C7 14 15
3 3 C8 16 17
1 C9 18 19
2 C10 20 21
3 C11 22 23
1 1 C12 24 25
3 C14 28 29
1 C15 30 31
2 2 C16 32 33
1 C18 36 37
2 C19 38 39
3 3 C20 40 41
1 C21 42 43
2 C22 44 45
3 C23 46 47
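An equivalent formulation (a sketch, not from the original answer) drops the third level directly from the index instead of resetting it:
# drop level 2, then test the remaining (A, B) pairs for membership
m = pd.MultiIndex.from_arrays([As, Bs])
df = dfmi[~dfmi.index.droplevel(2).isin(m)]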

Updating values in a table if present in a different table using pandas merge

I have 2 tables, Table p and Table q. The contents of Table p are to be updated from Table q.
Table p:
A B C
1 45 22 25
2 34 46 56
3 59 55 44
Table q:
A B C
1 34 46 59
2 59 55 49
I want to merge these two tables on columns 'A' and 'B' such that if a row's ('A', 'B') pair from table p is not present in table q, its value in column 'C' in table p stays the same.
Tried:
p['C'] = p.merge(q, on=['A','B'], how='left')['C_y']
Output:
A B C
1 45 22 NaN
2 34 46 59
3 59 55 49
Desired Output:
A B C
1 45 22 25
2 34 46 59
3 59 55 49
I can create a different column, merge, and then combine it back into column 'C' of table p, but that seems lengthy. Is there a more direct way to do this?
You can use DataFrame.update (here df1 is table p and df2 is table q):
keycol=['A','B']
df1=df1.set_index(keycol)
df1.update(df2.set_index(keycol))
df1
Out[762]:
C
A B
45 22 25.0
34 46 59.0
59 55 49.0
df1.reset_index()
Out[763]:
A B C
0 45 22 25.0
1 34 46 59.0
2 59 55 49.0
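If you would rather stay with merge, as in the original attempt, a hedged sketch that falls back to p's own C column for unmatched rows (assuming the ('A', 'B') pairs in q are unique):
# left-merge keeps every row of p; where q had no match, keep p's C
merged = p.merge(q, on=['A', 'B'], how='left', suffixes=('', '_q'))
p['C'] = merged['C_q'].fillna(merged['C']).values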
Another solution, using map:
df1.A.map(df2.set_index('A').B).fillna(df1.B)
Out[727]:
1 22.0
2 46.0
3 55.0
Name: A, dtype: float64