I'm looking to create a new column based on the ordering of two other columns, preferably using tidyverse functions, but any suggestions are appreciated. I have a table of around 1,300 entries and several columns, but a sample of my data looks something like this:
Number of people  TotalOrder  TotalQuantile
              12           1              1
              19           2              1
              21           3              2
              45           5              2
              53           5              3
              55           6              3
              60           7              4
              75           8              4
But I want a fourth column that ranks TotalOrder within TotalQuantile, so the data looks something like this:
Number of people  TotalOrder  TotalQuantile  NewOrder
              12           1              1         1
              19           2              1         2
              21           3              2         1
              45           5              2         2
              53           5              3         1
              55           6              3         2
              60           7              4         1
              75           8              4         2
I've tried a few things like filtering and arranging, but it hasn't worked out. Thanks for the help.
library(dplyr)
df <- structure(list(
  Number.of.people = c(12L, 19L, 21L, 45L, 53L, 55L, 60L, 75L),
  TotalOrder = c(1L, 2L, 3L, 5L, 5L, 6L, 7L, 8L),
  TotalQuantile = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)),
  row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

df %>%
  group_by(TotalQuantile) %>%
  mutate(NewOrder = row_number())
# A tibble: 8 x 4
# Groups: TotalQuantile [4]
Number.of.people TotalOrder TotalQuantile NewOrder
<int> <int> <int> <int>
1 12 1 1 1
2 19 2 1 2
3 21 3 2 1
4 45 5 2 2
5 53 5 3 1
6 55 6 3 2
7 60 7 4 1
8 75 8 4 2
I want to create a MultiIndex for my data frame such that MAE, MSE, RMSE, and MPE are grouped together under a new index level. Similarly, each remaining set of four columns should be grouped under the same level, but with a different name.
mux3 = pd.MultiIndex.from_product([list('ABCD'), list('1234')],
                                  names=['one', 'two'])  # dummy data
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)  # dummy data frame
print(df3)  # intended grouped output
Assuming the column groups are already in the appropriate order, we can create an np.arange over the length of the columns, floor-divide by 4 to get the group numbers, and build a MultiIndex.from_arrays.
Sample Input and Output:
import numpy as np
import pandas as pd
initial_index = [1, 2, 3, 4] * 3
np.random.seed(5)
df3 = pd.DataFrame(
np.random.choice(10, (3, len(initial_index))), columns=initial_index
)
1 2 3 4 1 2 3 4 1 2 3 4 # Column headers are in repeating order
0 3 6 6 0 9 8 4 7 0 0 7 1
1 5 7 0 1 4 6 2 9 9 9 9 1
2 2 7 0 5 0 0 4 4 9 3 2 4
# Create New Columns
df3.columns = pd.MultiIndex.from_arrays([
np.arange(len(df3.columns)) // 4, # Group Each set of 4 columns together
df3.columns # Keep level 1 the same as current columns
], names=['one', 'two']) # Set Names (optional)
df3
one 0 1 2
two 1 2 3 4 1 2 3 4 1 2 3 4
0 3 6 6 0 9 8 4 7 0 0 7 1
1 5 7 0 1 4 6 2 9 9 9 9 1
2 2 7 0 5 0 0 4 4 9 3 2 4
If columns are in mixed order:
np.random.seed(5)
df3 = pd.DataFrame(
np.random.choice(10, (3, 8)), columns=[1, 1, 3, 2, 4, 3, 2, 4]
)
df3
1 1 3 2 4 3 2 4 # Cannot select groups positionally
0 3 6 6 0 9 8 4 7
1 0 0 7 1 5 7 0 1
2 4 6 2 9 9 9 9 1
We can convert the columns with Index.to_series, enumerate them with groupby + cumcount, and then sort_index if needed to get them in order:
df3.columns = pd.MultiIndex.from_arrays([
# Enumerate Groups to create new level 0 index
df3.columns.to_series().groupby(df3.columns).cumcount(),
df3.columns
], names=['one', 'two']) # Set Names (optional)
# Sort to Order Correctly
# (Do not sort before setting the columns; it will break alignment with the data)
df3 = df3.sort_index(axis=1)
df3
one 0 1
two 1 2 3 4 1 2 3 4 # Notice Data has moved with headers
0 3 0 6 9 6 4 8 7
1 0 1 7 5 0 0 7 1
2 4 9 2 9 6 9 9 1
My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
df = pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
'p1.g': {0: 6, 1: 24},
'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
With pd.wide_to_long, you need to reorder the name parts around the delimiter; in this case we move the a, b, ... to the front and the p1, p2 to the back before reshaping:
temp = df.copy()
temp = temp.rename(columns = lambda col: ".".join(col.split(".")[::-1]))

(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index())
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation of pd.wide_to_long is the need to reorder the name parts before reshaping. The other limitation is that the stubnames have to be explicitly specified.
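If listing the stubnames by hand is a concern, they can be derived from the column names first. This is a small sketch, not part of the original answer, and assumes temp still holds the reversed column names ("a.p1", "b.p1", ...) from the block above:

# Derive the stubnames ("a" .. "g") from the renamed columns instead of hard-coding them
stubs = sorted({col.split(".")[0] for col in temp.columns})

(pd.wide_to_long(temp.reset_index(),
                 stubnames = stubs,
                 sep=".",
                 suffix=".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index())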
Another option is via stack, where the columns are split based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible and does not require us to list the column names. Its limitation is that it fails if the index is not unique.
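If your real frame does have duplicate index labels, one workaround (a sketch, not part of the original answer) is to move to a fresh RangeIndex before stacking:

# Hypothetical duplicate-index workaround: reset to a RangeIndex first
# (drop=True discards the old labels; use drop=False if you need to keep them)
temp = df.reset_index(drop=True)
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()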
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install janitor
import janitor
df.pivot_longer(index = None,
names_to = ("side", ".value"),
names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The workhorse here is .value. It tells pivot_longer that anything after the . should remain as column names, while anything before it should be collected into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; pivot_longer abstracts that for us. It can also handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
first split the columns and convert into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
   .pivot_longer(
       index = None,
       names_to = ('.value', 'side', '.value'),
       names_pattern = r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
   .pivot_longer(
       index = None,
       names_to = ('.value', 'side', '.value'),
       names_sep = r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Does anyone know how to randomly delete and add rows based on ID? Here is a reproducible example:
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data based on ID, the dataset should look something like this, and the total number of observations should match the original:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
We could sample the rows to delete from each cluster by drawing a random-sized sample (up to nrow(d)) of a new id-cluster key, id2. We then add the same number of rows in a new cluster, numbered one above the current maximum.
From your expected output it looks like you want a minimum of 2 observations per cluster in the result. We can handle that in the function arguments and guard against nonsensical combinations with a few stopifnot() checks. A repeat loop breaks once the conditions are met.
FUN <- function(d, cl.obs = 2, min.cl = NA) {
  l.cl <- length(unique(d$cluster))
  if (is.na(min.cl)) min.cl <- l.cl
  stopifnot(cl.obs <= min(table(d$cluster)))
  stopifnot(min.cl <= l.cl + 1)
  stopifnot(cl.obs*min.cl <= nrow(d))
  d$id2 <- Reduce(paste, d[c("id", "cluster")])  # unique id-cluster key
  repeat({
    ## draw a random number of id-cluster combinations to delete
    samp <- sample(d$id2, sample(1:nrow(d), 1))
    l <- length(samp)
    if (l == 0) {
      return(d[, -5])  # nothing sampled: return the data unchanged (minus id2)
    } else {
      ## add the same number of rows in a new cluster (max cluster + 1)
      a <- cbind(id = 1:l, cluster = max(d$cluster) + 1,
                 matrix(rnorm(l*2), , 2, dimnames = list(NULL, letters[24:25])))
      o <- rbind(d[!d$id2 %in% samp, -5], a)
      cl.tb <- table(o$cluster)
      ## accept only if every cluster keeps >= cl.obs rows and there are >= min.cl clusters
      if (all(cl.tb >= cl.obs) & length(cl.tb) >= min.cl) break
    }
  })
  return(`rownames<-`(o, NULL))
}
set.seed(42)
FUN(d)
# id cluster x y
# 1 1 1 -0.30663859 1.37095845
# 2 2 1 -1.78130843 -0.56469817
# 3 3 1 -0.17191736 0.36312841
# 4 4 1 1.21467470 0.63286260
# 5 1 2 -0.43046913 -0.10612452
# 6 2 2 -0.25726938 1.51152200
# 7 3 2 -1.76316309 -0.09465904
# 8 4 2 0.46009735 2.01842371
# 9 5 2 -0.63999488 -0.06271410
# 10 1 3 0.45545012 1.30486965
# 11 2 3 0.70483734 2.28664539
# 12 3 3 1.03510352 -1.38886070
# 13 5 3 0.50495512 -0.13332134
# 14 2 4 -0.78445901 -0.28425292
# 15 3 4 -0.85090759 -2.65645542
# 16 4 4 -2.41420765 -2.44046693
# 17 5 4 0.03612261 1.32011335
# 18 1 5 -0.43144620 -0.78383894
# 19 2 5 0.65564788 1.57572752
# 20 3 5 0.32192527 0.64289931
Using arguments:
set.seed(666)
FUN(d, cl.obs=1)
# id cluster x y
# 1 4 1 1.21467470 0.63286260 ## just one obs in cl. 1
# 2 2 2 -0.25726938 1.51152200
# 3 3 2 -1.76316309 -0.09465904
# 4 5 2 -0.63999488 -0.06271410
# 5 1 3 0.45545012 1.30486965
# 6 3 3 1.03510352 -1.38886070
# 7 5 3 0.50495512 -0.13332134
# 8 1 4 -1.71700868 0.63595040
# 9 3 4 -0.85090759 -2.65645542
# 10 1 5 -0.08365711 0.07771005
# 11 2 5 0.25683143 2.12925556
# 12 3 5 -1.07362365 0.63895459
# 13 4 5 -0.62286788 0.26934743
# 14 5 5 0.28499111 2.29896933
# 15 6 5 1.05156653 -1.37464590
# 16 7 5 -0.25952120 0.66236713
# 17 8 5 0.02230428 0.48351632
# 18 9 5 -0.01440929 1.23229183
# 19 10 5 1.33285534 -1.77762517
# 20 11 5 0.14842679 0.88552740
Data:
d <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), cluster = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
4L, 4L), x = c(-0.306638594078475, -1.78130843398, -0.171917355759621,
1.2146746991726, 1.89519346126497, -0.4304691316062, -0.25726938276893,
-1.76316308519478, 0.460097354831271, -0.639994875960119, 0.455450123241219,
0.704837337228819, 1.03510352196992, -0.608926375407211, 0.50495512329797,
-1.71700867907334, -0.784459008379496, -0.850907594176518, -2.41420764994663,
0.0361226068922556), y = c(1.37095844714667, -0.564698171396089,
0.363128411337339, 0.63286260496104, 0.404268323140999, -0.106124516091484,
1.51152199743894, -0.0946590384130976, 2.01842371387704, -0.062714099052421,
1.30486965422349, 2.28664539270111, -1.38886070111234, -0.278788766817371,
-0.133321336393658, 0.635950398070074, -0.284252921416072, -2.65645542090478,
-2.44046692857552, 1.32011334573019)), class = "data.frame", row.names = c(NA,
-20L))
Below is a small dataset of transaction records, with ID, DATE (day of the month), and a dummy variable Bad_CREDIT. I would like to pull out all the transactions from the first bad-credit transaction onwards.
The OUTPUT column indicates the correct result, which is rows 1, 2, 3, 5, 6, 8, and 10.
This is just an example, there could be thousands of rows. SQL, R, SPSS will all work. Thank you.
   DATE  ID  Bad_CREDIT  OUTPUT
1    12   A           1       1
2    15   A           1       1
3    18   A           0       1
4     2   B           0       0
5    10   B           1       1
6    20   B           0       1
7     5   C           0       0
8    15   C           1       1
9     1   D           0       0
10    9   E           1       1
You can arrange the data by ID and DATE and, for each ID, assign 0 to the first row when its Bad_CREDIT value is 0 and 1 otherwise.
library(dplyr)
df %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  mutate(OUTPUT = as.integer(!(first(Bad_CREDIT) == 0 & row_number() == 1)))
# DATE ID Bad_CREDIT OUTPUT
# <int> <chr> <int> <int>
# 1 12 A 1 1
# 2 15 A 1 1
# 3 18 A 0 1
# 4 2 B 0 0
# 5 10 B 1 1
# 6 20 B 0 1
# 7 5 C 0 0
# 8 15 C 1 1
# 9 1 D 0 0
#10 9 E 1 1
data
df <- structure(list(DATE = c(12L, 15L, 18L, 2L, 10L, 20L, 5L, 15L,
1L, 9L), ID = c("A", "A", "A", "B", "B", "B", "C", "C", "D",
"E"), Bad_CREDIT = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L)),
row.names = c(NA, -10L), class = "data.frame")
If I understand correctly, you can use window functions:
select t.*
from (select t.*,
             min(case when bad_credit = 1 then date end) over (partition by id) as min_bd_date
      from t
     ) t
where date >= min_bd_date;
You can also do this with a correlated subquery:
select t.*
from t
where t.date >= (select min(t2.date)
                 from t t2
                 where t2.id = t.id and
                       t2.bad_credit = 1
                );
If this is in a database, then I think SQL is likely the better place to address this. However, if you already have it in R, then ...
Here's an R method, using dplyr:
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(OUTPUT2 = +cumany(Bad_CREDIT)) %>%
  ungroup()
# # A tibble: 10 x 5
# DATE ID Bad_CREDIT OUTPUT OUTPUT2
# <int> <chr> <int> <int> <int>
# 1 12 A 1 1 1
# 2 15 A 1 1 1
# 3 18 A 0 1 1
# 4 2 B 0 0 0
# 5 10 B 1 1 1
# 6 20 B 0 1 1
# 7 5 C 0 0 0
# 8 15 C 1 1 1
# 9 1 D 0 0 0
# 10 9 E 1 1 1
Because this is effectively a simple grouping operation, the base R and data.table solutions are just as straightforward.
+ave(dat$Bad_CREDIT, dat$ID, FUN=cumany)
# [1] 1 1 1 0 1 1 0 1 0 1
library(data.table)
datDT <- as.data.table(dat)
datDT[, OUTPUT2 := +cumany(Bad_CREDIT), by = .(ID)]
You can use EXISTS as follows:
select t.*
from your_table t
where exists
      (select 1
       from your_table tt
       where t.id = tt.id
         and t.date >= tt.date
         and tt.bad_credit = 1);
This is for SPSS:
sort cases by ID date.
compute PullOut=Bad_CREDIT.
if $casenum>1 and ID=lag(ID) and lag(PullOut)=1 PullOut=1.
exe.
I have a dataframe like this:
df = pd.DataFrame([[1, 2],
                   [1, 4],
                   [1, 5],
                   [2, 65],
                   [2, 34],
                   [2, 23],
                   [2, 45]], columns=['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use scipy's implementation of winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01,0.01]))
Output
>>> df
label score score_winsor
0 1 2 2
1 1 4 4
2 1 5 5
3 2 65 65
4 2 34 34
5 2 23 23
6 2 45 45
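Note that score_winsor is identical to score here: winsorize works on counts of observations, and with only three or four values per group a 1% limit trims nothing. A purely illustrative tweak (much larger limits than the question asks for) makes the effect visible on the four-row group:

# Illustrative only: with 25% limits, one value on each end of the 4-row group is replaced
df["score_winsor_25"] = df.groupby('label')['score'].transform(
    lambda x: winsorize(x, limits=[0.25, 0.25]))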
This works:
import numpy as np

df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
label score score_winsor
0 1 2 2.04
1 1 4 4.00
2 1 5 4.98
3 2 65 64.40
4 2 34 34.00
5 2 23 23.33
6 2 45 45.00
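As a variation on the same idea (not from the original answer, just an equivalent formulation), Series.clip expresses the per-group bounding a little more directly:

# Equivalent to the np.maximum/np.minimum version: bound each group's scores
# to its own 1st and 99th percentiles
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: x.clip(lower=x.quantile(.01), upper=x.quantile(.99)))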