I have a data set where I estimated home range size for 41 individuals using 10 different estimators. I wanted to test whether these estimators differed significantly from each other, so I fit a linear mixed-effects model with nlme as follows:
mod3fML1b <- lme(size ~ band*nobsS + Sex, random = ~1|snake, data = HR.compare, weights = vf6, method = 'REML')
size is home range size (log-transformed to meet normality assumptions), band is a factor with 10 levels representing my 10 estimators, nobsS is a measure of sampling intensity (number of locations standardized by the number of sampling days), Sex is male/female, and snake is the random effect of individual. The summary output is as follows:
> summary(mod3fML1b)$tTable
Value Std.Error DF t-value p-value
(Intercept) 4.099809927 0.24824427 351 16.51522508 9.632758e-46
bandBPIdiagonal -0.344448847 0.05724340 351 -6.01726718 4.462854e-09
bandBPIfull -0.303881612 0.05593369 351 -5.43289002 1.038615e-07
bandHREFdiagonal -0.053639559 0.06749969 351 -0.79466377 4.273461e-01
bandHREFfull -0.131436107 0.06471844 351 -2.03089130 4.301931e-02
bandLCVdiagonal -0.186017321 0.11520159 351 -1.61471137 1.072717e-01
bandLCVfull -0.186017321 0.11520176 351 -1.61470908 1.072722e-01
bandLCVsingle -0.181687618 0.11940300 351 -1.52163366 1.290012e-01
bandSCVdiagonal -0.163761675 0.05816320 351 -2.81555466 5.144255e-03
bandSCVfull -0.120439828 0.05720672 351 -2.10534398 3.597148e-02
nobsS 0.406335759 1.16315897 38 0.34933811 7.287640e-01
SexMale 1.457327832 0.23373517 38 6.23495313 2.711438e-07
bandBPIdiagonal:nobsS -0.415077222 0.24845863 351 -1.67060896 9.569032e-02
bandBPIfull:nobsS -0.442855108 0.24277399 351 -1.82414559 6.898016e-02
bandHREFdiagonal:nobsS -0.143832274 0.29297492 351 -0.49093716 6.237777e-01
bandHREFfull:nobsS -0.007949556 0.28090319 351 -0.02829999 9.774390e-01
bandLCVdiagonal:nobsS -2.072417895 0.50001973 351 -4.14467228 4.270077e-05
bandLCVfull:nobsS -2.072417895 0.50002044 351 -4.14466640 4.270182e-05
bandLCVsingle:nobsS -2.171179830 0.51825545 351 -4.18940086 3.542319e-05
bandSCVdiagonal:nobsS -0.745834058 0.25245092 351 -2.95437248 3.344486e-03
bandSCVfull:nobsS -0.766847848 0.24829943 351 -3.08839960 2.172757e-03
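For reference, the 95% CI implied by any contrast row in this table is approximately Value ± 1.96 * SE (with DF = 351 the normal approximation is close). A minimal Python sketch using the bandBPIdiagonal row:

value, se = -0.344448847, 0.05724340   # bandBPIdiagonal row from the tTable
lci, uci = value - 1.96 * se, value + 1.96 * se
print(lci, uci)   # about (-0.457, -0.232): the contrast CI excludes zero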
The p-values indicate that several levels are significantly different from the reference level, and the patterns are consistent with what I expected from the raw data. However, I wanted to display the mean log(home range size) with 95% confidence intervals (not prediction intervals) to compare effect sizes among the estimators. To do this I calculated predicted values using the predict.lme function and SEs using code from a previous R thread at:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q1/003320.html
However, my 95% CIs broadly overlapped, much more than I would have expected based on the p-values from my lme model. I have copied the values I obtained below, which show the broad overlap:
data<-data.frame(band=c('BPId','BPIf','REFs','REFd','REFf','LCVs','LCVd','LCVf','SCVd','SCVF'),
log.mean.size=c(4.99,5.03,5.41,5.33,5.28,4.88,4.89,4.89,5.13,5.17),
lci=c(4.62,4.65,5.03,4.95,4.90,4.90,4.50,4.50,4.75,4.79),
uci=c(5.38,5.41,5.79,5.71,5.66,5.28,5.29,5.29,5.51,5.54))
Does anyone have suggestions for why the p-values show significant differences among estimator levels while the CIs overlap? I apologize for not providing any data, but I would be happy to email a copy of my raw data and full code if that would help.
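For illustration only, here is a toy Python sketch (made-up numbers) of the pattern I mean: when responses are strongly correlated within individuals, the group-mean CIs can overlap broadly even though the paired, within-individual contrast is clearly significant.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
individual = rng.normal(5.0, 1.0, 41)        # large between-individual spread
a = individual + rng.normal(0.00, 0.1, 41)   # estimator A per individual
b = individual + rng.normal(0.15, 0.1, 41)   # estimator B: small fixed offset

for name, x in [('A', a), ('B', b)]:
    m, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))
    print(name, m - 1.96 * se, m + 1.96 * se)   # overlapping group-mean CIs

print(stats.ttest_rel(a, b))   # paired contrast: highly significant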
Thanks,
JBauder
I am trying to evaluate the performance of my object detection and tracking pipeline on the standard industry benchmark, the 2DMOT Challenge 2015. I have downloaded the dataset but I am unable to understand the data fields in the labelled ground-truth data.
I understand the first six columns of the dataset but not the remaining four. The following is sample data from the directory 2DMOT2015\train\ETH-Bahnhof\gt:
frame no. object_id bb_left bb_top bb_width bb_height (?) (?) (?) (?)
1 1 212 204 20 57 0 -3.1784 16.34 0.45739
1 2 223 181 36 104 1 -1.407 9.0212 0.68774
Does anyone know what these remaining fields represent?
The last three fields represent the 3D real-world coordinates of the object. A similar data structure can be found in the ETH-Bahnhof, ETH-Sunnyday, PETS09-S2L1, and TUD-Stadtmitte videos in 2DMOT2015. For ground truth, score = 1; but it sometimes varies between 0 and 1, in which case it acts as a flag value, and a zero means the line is not to be considered for evaluation. So the data fields are in the format:
frame_no, object_id, bb_left, bb_top, bb_width, bb_height, score, X, Y, Z
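A sketch of loading such a file with pandas, assuming the standard gt.txt inside the gt directory (the path and file name here are illustrative):

import pandas as pd

cols = ['frame', 'object_id', 'bb_left', 'bb_top', 'bb_width',
        'bb_height', 'score', 'X', 'Y', 'Z']
gt = pd.read_csv('2DMOT2015/train/ETH-Bahnhof/gt/gt.txt',
                 header=None, names=cols)

# per the flag convention above: rows with score == 0 are ignored in evaluation
gt_eval = gt[gt['score'] != 0]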
I am using data collected from two different instruments that have different resolutions because of their different sampling rates. Over a given time span, one of the sets has >10k entries while the other has ~2.5k. They capture data over the same time interval, however, and I want to plot them on top of each other even though their resolutions differ. The minimum and maximum x of both sets are the same; one simply has more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake: I was plotting both datasets against the time data from the first instrument. Of course each dataset needs to be plotted against its own time data; I had passed the first instrument's time data to the second plot by mistake.
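For anyone hitting the same issue, a minimal matplotlib sketch (invented stand-in data) that plots each series against its own time array, with a second y-axis since the two instruments measure different quantities:

import numpy as np
import matplotlib.pyplot as plt

time_fast = np.arange(0, 100.2, 0.2)   # 5 Hz instrument
value_fast = 10 + 0.4 * time_fast
time_slow = np.arange(0, 101, 1.0)     # 1 Hz instrument
value_slow = 100 + 3.3 * time_slow

fig, ax1 = plt.subplots()
ax1.plot(time_fast, value_fast, color='tab:blue')
ax1.set_xlabel('time (s)')
ax1.set_ylabel('instrument 1', color='tab:blue')
ax2 = ax1.twinx()                      # second y-axis, shared time axis
ax2.plot(time_slow, value_slow, color='tab:orange')
ax2.set_ylabel('instrument 2', color='tab:orange')
plt.show()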
I have voice data 1.85 seconds long, and I extract its features using MFCC (with the library from James Lyons). It returns 184 x 13 features. I am using a 10 millisecond frame step, a 25 millisecond frame size, and 13 DCT coefficients. How can it return 184 frames? I still cannot understand, because the last frame's length would not be 25 milliseconds. Is there a formula that explains how it can return 184? Thank you in advance.
Picture the frames sliding along the signal: each new window starts 10 ms after the previous one, and the last window extends past the end of the signal, covering more than the remaining samples.
If you have 184 windows, the region you cover is 183 * 10 + 25 = 1855 ms, slightly more than the 1850 ms signal.
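In code, the frame count follows the formula below (a sketch assuming a 16 kHz sample rate; python_speech_features zero-pads the final, partial window so it still yields a full frame):

import math

sample_rate = 16000                     # assumed; adjust to your audio
signal_len = int(1.85 * sample_rate)    # 29600 samples
frame_len = int(0.025 * sample_rate)    # 25 ms -> 400 samples
frame_step = int(0.010 * sample_rate)   # 10 ms -> 160 samples

num_frames = 1 + math.ceil((signal_len - frame_len) / frame_step)
print(num_frames)                       # 184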
I am working on VBA code where I have data (for slope inclinometers) at various depths, like so:
Depth A0 A180 Checksum B0 B180 Checksum
4.5 (-1256) 1258 2 (-394) 378 (-16)
4.5 (-1250) 1257 7 (-396) 376 (-20)
4.5 (-1257) 1257 0 (-400) 374 (-26)
Depth A0 A180 Checksum B0 B180 Checksum
5 (-1214) 1214 0 (-472) 459 (-13)
5 (-1215) 1212 -3 (-472) 455 (-17)
5 (-1216) 1211 -5 (-473) 455 (-18)
An unknown amount of data will be present (it depends on how much the user transfers to this sheet).
Now I need to be able to calculate the A Axis Displacement, the B Axis Displacement, and the Resultant, which are computed as follows:
A Axis Displacement = [((A0 - A180)/2) - ((A0* - A180*)/2)] * (constant/constant)
where * denotes the initial readings, which are always the first row of data at the specified depth.
B Axis Displacement = [((B0 - B180)/2) - ((B0* - B180*)/2)] * (constant/constant)
where * again denotes the initial readings at that depth.
Resultant = SQRT[(A Axis Displacement)^2 + (B Axis Displacement)^2]
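To make the arithmetic concrete, here is a quick check of these formulas in Python against the depth-4.5 rows above (the constant ratio is omitted, as in the answer below):

import math

a0_i, a180_i, b0_i, b180_i = -1256, 1258, -394, 378   # initial readings (first row at depth 4.5)
a0, a180, b0, b180 = -1250, 1257, -396, 376           # a later reading at the same depth

a_disp = (a0 - a180) / 2 - (a0_i - a180_i) / 2        # = 3.5
b_disp = (b0 - b180) / 2 - (b0_i - b180_i) / 2        # = 0.0
resultant = math.sqrt(a_disp**2 + b_disp**2)          # = 3.5
print(a_disp, b_disp, resultant)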
I'm struggling to find examples of how to implement this in VBA, since an unknown number of depths will be present on the same sheet and the formula needs to start over at each new depth.
Any help/tips would be greatly appreciated!
"how I can implement this using vba as there will be various depths present..."
You can still do this purely with formulas and a simple auto-fill, because a formula can find the first occurrence of the current depth and perform all the necessary calculations, leaving header rows and blank rows empty. For instance, you can enter these formulas in row 2 and fill them down through all the rows.
H2 (A Axis Displacement):
=IF(ISNUMBER($A2),0.5*(B2-C2-VLOOKUP($A2,$A:$F,2,0)+VLOOKUP($A2,$A:$F,3,0)), "")
I2 (B Axis Displacement):
=IF(ISNUMBER($A2),0.5*(E2-F2-VLOOKUP($A2,$A:$F,5,0)+VLOOKUP($A2,$A:$F,6,0)), "")
J2 (Resultant):
=IF(ISNUMBER($A2),SQRT(SUMSQ(H2,I2)),"")
P.S. In the displacement formulas I omitted the (constant/constant) factor, as it is irrelevant to the answer; you can easily multiply the 0.5 factor by whatever you need.
Say I have a data frame, df, with columns: id | site | time | clicks | impressions.
I want to use k-fold cross-validation: split the data randomly into k = 10 equal-sized partitions, based on e.g. the id column. I think of this as a mapping from id to {0, 1, ..., 9}, i.e. a new column 'fold' running from 0 to 9.
Then I iteratively take 9/10 of the partitions as training data and the remaining 1/10 as validation data (so first fold == 0 is validation and the rest is training, then fold == 1 is validation, and so on). I am thinking of this as a generator based on grouping by the fold column.
Finally, I want to group all the training data by site and time (and similarly for the validation data); in other words, sum over the fold index while keeping the site and time indices.
What is the right way of doing this in pandas?
The way I thought of doing it at the moment is:
df_sum = df.groupby(['fold', 'site', 'time']).sum()
# df_sum now has indices fold, site, time
# create a new Series object, dat, name='cross', by mapping fold indices
# to 'training'/'validation'
df_train_val = df_sum.groupby([dat, 'site', 'time']).sum()
df_train_val.xs('validation', level='cross')
The direct problem I run into is that groupby on columns can handle mixing in a Series object, but groupby on a MultiIndex can't (the df_train_val assignment above doesn't work). Obviously I could use reset_index, but given that I want to group over site and time (to aggregate over folds 1 to 9, say) this seems wrong. (I assume grouping is much faster on indices than on 'raw' columns.)
So, Question 1: is this the right way to do cross-validation followed by aggregation in pandas (more generally, grouping and then regrouping based on MultiIndex values)?
Question 2: is there a way of mixing arbitrary mappings with multilevel indices?
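For Question 2, one possible approach (a sketch on toy data; the names are illustrative) is to build each grouping key explicitly from get_level_values, which lets an arbitrary mapping be mixed with the existing index levels without reset_index:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'fold': rng.integers(0, 10, 1000),
                   'site': rng.integers(0, 5, 1000),
                   'time': rng.integers(0, 24, 1000),
                   'clicks': rng.integers(0, 100, 1000)})
df_sum = df.groupby(['fold', 'site', 'time']).sum()

# map the 'fold' level to training/validation without resetting the index
cross = df_sum.index.get_level_values('fold').map(
    lambda f: 'validation' if f == 0 else 'training').rename('cross')
df_train_val = df_sum.groupby([cross,
                               df_sum.index.get_level_values('site'),
                               df_sum.index.get_level_values('time')]).sum()
print(df_train_val.xs('validation', level='cross').head())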
This generator seems to do what I want. You pass in the grouped data, with one index level corresponding to the fold (0 to n_folds - 1).
def split_fold2(fold_data, n_folds, new_fold_col='fold'):
    # fold_data: grouped/summed frame whose MultiIndex includes a fold
    # level running from 0 to n_folds - 1
    i_fold = 0
    indices = list(fold_data.index.names)
    slicers = [slice(None)] * len(fold_data.index.names)
    fold_index = fold_data.index.names.index(new_fold_col)
    indices.remove(new_fold_col)   # the levels to keep (e.g. site, time)
    while i_fold < n_folds:
        # select every fold except the current one for training
        slicers[fold_index] = [i for i in range(n_folds) if i != i_fold]
        slicers_tuple = tuple(slicers)
        train_data = fold_data.loc[slicers_tuple, :].groupby(level=indices).sum()
        # the current fold is the validation set
        val_data = fold_data.xs(i_fold, level=new_fold_col)
        yield train_data, val_data
        i_fold += 1
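Hypothetical usage, assuming df has the 'fold' column described above:

fold_data = df.groupby(['fold', 'site', 'time']).sum()
for train_data, val_data in split_fold2(fold_data, n_folds=10):
    pass   # fit on train_data (indexed by site, time), score on val_data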
On my data set this takes:
CPU times: user 812 ms, sys: 180 ms, total: 992 ms; Wall time: 991 ms
(to retrieve one fold)
Replacing the train_data assignment with
train_data = fold_data.select(lambda x: x[fold_index] != i_fold).groupby(level=indices).sum()
takes
CPU times: user 2.59 s, sys: 263 ms, total: 2.85 s; Wall time: 2.83 s