I would like to perform stratified sampling on two columns of a dataframe. I am working with very small probabilities, and I run into a problem at the end. Here is my methodology.
library(splitstackshape)
#Creation of a dataframe similar to the one I'm working on.
data1 <- data.frame(categorie_metier = sample(c("agriculteur", "artisan", "autre", "cadres", "employes", "ouvriers", "prof_int"), 429, replace = TRUE, prob = c(0.01, 0.05, 0.14, 0.41, 0.25, 0.04, 0.10)), en_teletravail = sample(c("0", "1"), 429, replace = TRUE, prob = c(0.59, 0.41)), stringsAsFactors = TRUE)
#Creation of a dataframe to simulate my probabilities.
data2 <- data.frame(categorie_metier = sample(c("agriculteur", "artisan", "autre", "cadres", "employes", "ouvriers", "prof_int"), 1000000, replace = TRUE, prob = c(0.01, 0.03, 0.27, 0.21, 0.13, 0.10, 0.25)), en_teletravail = sample(c("0", "1"), 1000000, replace = TRUE, prob = c(0.991, 0.009)), stringsAsFactors = TRUE)
#Grouping of columns.
data2$groupe <- paste(data2$categorie_metier, data2$en_teletravail)
#Extraction of groups in a variable. Objective: Create an output dataframe of 50 lines.
gsize <- 50 * round(prop.table(table(data2$groupe)), 2)
gsize = as.list(gsize)
#Generation of the output dataframe.
data3 <- stratified(data1, c("categorie_metier", "en_teletravail"), gsize)
Error in stratified(data1, c("categorie_metier", "en_teletravail"), gsize) :
Incompatible sizes supplied
According to my research, this error is caused by values of 0 in "gsize". They are unavoidable, because I am working with very small probabilities.
How can I handle these zero values, given that I cannot increase the size of data3?
Thank you.
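If the zeros are indeed the cause, one option is to compute the group sizes so that they always sum to the target and then drop any group whose size is still 0. The arithmetic is language-independent, so here is a minimal sketch in plain Python using largest-remainder rounding. The `allocate` helper and the counts are invented for illustration; they are not part of splitstackshape and not taken from the data above.

```python
def allocate(counts, total):
    """Split `total` rows across strata in proportion to `counts`.

    Uses integer arithmetic (largest-remainder rounding) so the sizes
    always sum to `total`, then drops strata that end up with 0 rows.
    """
    n = sum(counts.values())
    # Integer part of each stratum's share.
    sizes = {k: c * total // n for k, c in counts.items()}
    # Fractional part, kept as an integer remainder to avoid float error.
    remainders = {k: c * total % n for k, c in counts.items()}
    leftover = total - sum(sizes.values())
    # Hand the leftover rows to the strata with the largest remainders.
    for k in sorted(counts, key=lambda k: remainders[k], reverse=True)[:leftover]:
        sizes[k] += 1
    # Strata that still have size 0 are simply not sampled.
    return {k: s for k, s in sizes.items() if s > 0}


# Hypothetical group counts with a few very rare groups.
counts = {"cadres 0": 400, "employes 0": 300, "autre 0": 286,
          "cadres 1": 8, "employes 1": 4, "autre 1": 2}
gsize = allocate(counts, 50)
print(gsize)
```

The resulting sizes could then be passed on only for the groups that remain. Whether the sampling function accepts a reduced set of groups is something to check against its documentation.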
I am currently working on longitudinal data and trying to reshape the data from the wide format to the long. The naming pattern of the time-varying variables is r*variable (for example, height data collected in wave 1 is r1height). The identifiers are hhid (household id) and pn (person id). The data itself is unbalanced. Some variables are observed from first wave to last wave, but others are only observed from the middle of the study (i.e., wave 3 to 5).
I have already reshaped the data using merged.stack from the splitstackshape package (see the code below).
df <- data.frame(hhid = c("10001", "10002", "10003", "10004"),
pn = c("001", "001", "001", "002"),
r1weight = c(56, 76, 87, 64),
r2weight = c(57, 75, 88, 66),
r3weight = c(56, 76, 87, 65),
r4weight = c(78,99,23,32),
r5weight = c(55, 77, 84, 65),
r1height = c(151, 163, 173, 153),
r2height = c(154, 164, NA, 154),
r3height = c(NA, 165, NA, 152),
r4height = c(153, 162, 172, 154),
r5height = c(152,161,171,154),
r3bmi = c(22,23,24,25),
r4bmi = c(23,24,20,19),
r5bmi = c(21,14,22,19))
library(splitstackshape)
# Merge stack (this is what I want)
long1 <- merged.stack(df, id.vars = c("hhid", "pn"),
var.stubs = c("weight", "height", "bmi"),
sep = "var.stubs", atStart = F, keep.all = FALSE)
Now I want to know whether I can use the "reshape" function to get the same result. I have tried it but failed. For example, the reshape call in the code below returns bizarre long-format data. I suspect the "sep" argument is causing the problem, but I don't know how to specify a pattern for my time-varying variables.
# Reshape (Wrong results)
library(reshape)
namelist <- names(df)
namelist <- namelist[namelist %in% c("hhid", "pn") == FALSE]
long2 <- reshape(data=df,
varying = namelist,
sep = "",
direction = "long",
idvar = c("hhid", "pn"))
Could anyone let me know how to address this problem?
Thanks
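To make the naming pattern concrete, here is a small sketch in plain Python (not R, and not a replacement for reshape) that splits names such as r1height into a wave number and a variable name and builds one record per person and wave. The regular expression and the `wide_to_long` helper are assumptions made for illustration only.

```python
import re

# Assumed naming pattern: "r", then the wave number, then the variable name.
NAME_PATTERN = re.compile(r"^r(\d+)([A-Za-z_]\w*)$")


def wide_to_long(rows, id_keys=("hhid", "pn")):
    """Turn wide records into one record per (identifiers, wave)."""
    long_records = {}
    for row in rows:
        ids = tuple(row[k] for k in id_keys)
        for column, value in row.items():
            match = NAME_PATTERN.match(column)
            if match is None:
                continue  # identifier column, not a time-varying variable
            wave, variable = int(match.group(1)), match.group(2)
            long_records.setdefault(ids + (wave,), {})[variable] = value
    keys = tuple(id_keys) + ("wave",)
    return [dict(zip(keys, key), **values)
            for key, values in sorted(long_records.items())]


wide = [{"hhid": "10001", "pn": "001",
         "r1weight": 56, "r1height": 151, "r3bmi": 22}]
long_rows = wide_to_long(wide)
print(long_rows)
```

A variable that was not collected in a given wave is simply absent from that wave's record, which is one way to represent the unbalanced design.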
Is there a way to run wilcox.test by group, calculate confidence intervals, and then plot the results with ggplot?
My "data":
library(plyr)   # ddply
library(dplyr)  # bind_rows
zero <- sample(0:0, 50, replace = TRUE)
small <- sample(1:5, 20, replace = TRUE)
medium <- sample(5:25, 15, replace = TRUE)
high <- sample(150:300, 5, replace = TRUE)
f <- function(x){
return(data.frame(ID=deparse(substitute(x)), value=x))
}
all <- bind_rows(f(zero), f(small), f(medium), f(high))
all <- as.data.frame(all[,-1])
names(all)[1] <- "value"
all$group <- c("a", "b", "c")
My attempt:
x <- ddply(all, .(group), function(x) {wilcox.test(all$value, conf.int=TRUE, conf.level=0.95)})
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results must be all atomic, or all data frames
In addition: There were 12 warnings (use warnings() to see them)
I'd then like to plot the pseudo-medians with their respective confidence intervals, but I'm also not sure how to save the results in a form ggplot can work from.
When I run the xgboost ranking demo with 2 samples per group and eval_metric=auc, it shows the warning 'Dataset is empty, or contains only positive or negative samples'.
I have tried modifying dtarget for the training and validation groups many times, but it has no effect. The problem occurs only when I set 2 samples for every group in dgroup, such as [2,2,2]. I don't know where the problem is.
My xgboost parameters are:
xgb_rank_params1 = {
'booster': 'gbtree',
'eta': 0.1,
'gamma': 1.0,
'min_child_weight': 0.1,
'objective': 'rank:pairwise',
'eval_metric': 'auc',
'max_depth': 6,
'num_boost_round': 10,
'save_period': 0
}
The data preparation code is:
import numpy as np
from xgboost import DMatrix, train
n_group = 3
n_choice = 2
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtarget = [1, 0, 1, 0, 1, 0]
# **problem here: occurs when n_choice = 2 samples are set for every group**
dgroup = np.array([n_choice for i in range(n_group)]).flatten()
# build the training DMatrix and set its groups (very important here!)
xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group(dgroup)
# generate eval data
dtrain_eval = np.random.uniform(0, 100, [n_group * n_choice, 2])
xgbTrain_eval = DMatrix(dtrain_eval, label=dtarget)
xgbTrain_eval.set_group(dgroup)
evallist = [(xgbTrain, 'train'), (xgbTrain_eval, 'eval')]
rankModel = train(xgb_rank_params1, xgbTrain, num_boost_round=20, evals=evallist)
The output is:
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[0] train-auc:nan eval-auc:nan
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[1] train-auc:nan eval-auc:nan
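One possible reading of the warning, offered as an assumption rather than something confirmed above: with a ranking objective the AUC is evaluated inside each group, and a group of two rows contains a single positive/negative pair, so its AUC can only be 0, 0.5, or 1. A metric implementation may treat such groups as uninformative. The sketch below is plain Python, not xgboost code, and `pairwise_auc` and `group_aucs` are hypothetical helpers.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns nan when only one class
    is present, because no pair can be formed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def group_aucs(labels, scores, group_sizes):
    """Compute the pairwise AUC separately for each consecutive group."""
    result, start = [], 0
    for size in group_sizes:
        result.append(pairwise_auc(labels[start:start + size],
                                   scores[start:start + size]))
        start += size
    return result


labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.2, 0.8, 0.5, 0.5]  # made-up predictions
per_group = group_aucs(labels, scores, [2, 2, 2])
print(per_group)
```

If this reading is right, it may be worth testing with three or more rows per group, or with a ranking metric such as ndcg or map in place of auc.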
I am trying to annotate my subplots inside a for loop. Each subplot should have its RMS value printed on the plot. I tried the following:
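import numpy as np
import pandas as pd
import plotly.offline as pyo
from plotly import tools
figg = tools.make_subplots(rows=4, cols=1)
# Y is a sine curve with one value per X sample
fake_date = {"X": np.arange(1, 101, 0.5), "Y": np.sin(np.arange(200)), "Z": [x + 1 for x in range(10)] * 20}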
fake_date = pd.DataFrame(fake_date)
fake_date.sort_values("Z")
unique_ids = fake_date['Z'].unique()
train_id, test_id = np.split(np.random.permutation(unique_ids), [int(.6 * len(unique_ids))])
for i, j in enumerate(test_id):
    x_test = fake_date[fake_date['Z'].isin([test_id[i]])]
    y_test = fake_date[fake_date['Z'].isin([test_id[i]])]
    # Evaluate
    rms_test = 0.04
    r_test = 0.9
    Real = {'type' : 'scatter',
            'x' : x_test.X,
            'y' : x_test.Y,
            "mode" : 'lines+markers',
            "name" : 'Real'}
    figg.append_trace(Real, i+1, 1)
    figg['layout'].update( annotations=[dict(x = 10,y = 0.2, text= rms_test, xref= "x1",yref="y1")] )
figg['layout'].update(height=1800, width=600, title='Testing')
pyo.iplot(figg)
This does not work, although the answer given here seems to work for others. Can anyone point out what I am doing wrong?
I generated fake data for reproducibility.
I am not sure exactly where you want to place the RMS value, but the sample code below should help you achieve what you want.
We create a list, annotation_arr, and append one annotation per subplot inside the for loop.
We need to set the axis references (xref and yref) for each individual subplot. Remember that the first axis is x, the second is x2, and so on, so I have written a conditional expression for that. Please check the code below and let me know if there are any issues!
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
from plotly import tools
import numpy as np
import pandas as pd
init_notebook_mode(connected=True)
rows = 4
figg = tools.make_subplots(rows=rows, cols=1)
fake_date = {"X": np.arange(0, 100, 0.5), "Y": [np.sin(x) for x in range(200)], "Z": [x + 1 for x in range(10)] * 20}
fake_date = pd.DataFrame(fake_date)
fake_date.sort_values("Z")
unique_ids = fake_date['Z'].unique()
train_id, test_id = np.split(np.random.permutation(unique_ids), [int(.6 * len(unique_ids))])
top = 0
annotation_arr = []
for i, j in enumerate(test_id):
    x_test = fake_date[fake_date['Z'].isin([test_id[i]])]
    y_test = fake_date[fake_date['Z'].isin([test_id[i]])]
    # Evaluate
    rms_test = 0.04
    r_test = 0.9
    Real = {'type' : 'scatter',
            'x' : x_test.X,
            'y' : x_test.Y,
            "mode" : 'lines+markers',
            "name" : 'Real'}
    top = top + 1/rows
    i_val = "" if i == 0 else i + 1
    annotation_arr.append(dict(x = r_test,y = top, text= rms_test, xref= "x"+str(i_val),yref="y"+str(i_val)))
    figg.append_trace(Real, i+1, 1)
figg['layout'].update( annotations=annotation_arr )
figg['layout'].update(height=1800, width=600, title='Testing')
iplot(figg)
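The axis-naming rule used above can also be isolated in a small helper, which makes it easier to reuse and to check without drawing anything. This is only a sketch in plain Python: `axis_ref` and `build_annotations` are hypothetical names, and the coordinates are placeholders.

```python
def axis_ref(prefix, subplot_index):
    """Return the axis reference for a 1-based subplot index.

    The first subplot uses the bare prefix ("x" or "y"); later ones
    append their index ("x2", "y3", ...).
    """
    return prefix if subplot_index == 1 else f"{prefix}{subplot_index}"


def build_annotations(texts, x=10, y=0.2):
    """Create one annotation dict per subplot, in subplot order."""
    return [
        dict(x=x, y=y, text=str(text),
             xref=axis_ref("x", k), yref=axis_ref("y", k),
             showarrow=False)
        for k, text in enumerate(texts, start=1)
    ]


annotations = build_annotations([0.04, 0.05, 0.03, 0.06])
print([a["xref"] for a in annotations])
```

The resulting list can be passed to the layout in a single update after the loop, so that later annotations do not overwrite earlier ones.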
I would like to map a color to each row of the dataframe as a function of two columns. It would be much easier with just one column as the argument, but how can I achieve this with two columns?
What I have done so far:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = np.random.rand(3,10)
i = [[30,10], [10, 30], [60, 60]]
names = ['a', 'b']
index = pd.MultiIndex.from_tuples(i, names = names)
df = pd.DataFrame(a, index=index).reset_index()
c1 = plt.cm.Greens(np.linspace(0.2,0.8,3))
c2 = plt.cm.Blues(np.linspace(0.2,0.8,3))
#c3 = plt.cm.Reds(np.linspace(0.2,0.8,3))
color = np.vstack((c1,c2))
a = df.a.sort_values().values
b = df.b.sort_values().values
mapping = dict()
for i in range(len(a)):
    mapping[a[i]] = {}
    for ii in range(len(b)):
        mapping[a[i]][b[ii]] = color[i+ii]
Maybe something similar to df['color'] = df.apply(lambda x: mapping[x.a][x.b])?
Looks like you answered your own question. apply can work across rows if you set the axis argument to 1: df['color'] = df.apply(lambda x: mapping[x.a][x.b], axis=1)
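The same row-wise lookup can be sketched without pandas, which shows what the function passed to apply does for each row. This is an illustration under assumptions: the rows reuse the (a, b) pairs from the question, the palette names are placeholders, and the color index mirrors the i + ii scheme used to build the mapping.

```python
# Rows with the two columns that decide the color.
rows = [{"a": 30, "b": 10}, {"a": 10, "b": 30}, {"a": 60, "b": 60}]

# Placeholder palette; positions 0..4 cover every possible i + j.
palette = ["green-1", "green-2", "green-3", "blue-1", "blue-2"]

a_levels = sorted({row["a"] for row in rows})
b_levels = sorted({row["b"] for row in rows})

# One flat dict keyed by the (a, b) pair instead of a nested dict.
mapping = {
    (a, b): palette[i + j]
    for i, a in enumerate(a_levels)
    for j, b in enumerate(b_levels)
}

# Equivalent of a row-wise apply: look up each row's pair.
colors = [mapping[(row["a"], row["b"])] for row in rows]
print(colors)
```

Note that with the i + j index, the pairs (30, 10) and (10, 30) receive the same color. If every pair should be distinct, an index such as i * len(b_levels) + j with a longer palette would be one way to do it.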