Pandas - Indexing by not in index

Googled around a bit and couldn't seem to find anything on this.
Is there an option to access data in a pandas data frame using "not index"?
So something like
df_index = asdf = pandas.MultiIndex(levels=[
['2014-10-19', '2014-10-20', '2014-10-21', '2014-10-22', '2014-10-30'],
[u'after_work', u'all_day', u'breakfast', u'lunch', u'mid_evening']],
labels=[[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4],
[4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 2, 0, 1, 3, 4]],
names=[u'start_date', u'time_group'])
And then I would like to be able to call the following to get everything not in df_index:
df.ix[~df_index]
I know you can do this with logical (boolean) indexing in pandas. I'm just curious whether I could do it using an Index object.

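You can use df.drop(df_index, errors="ignore"). This drops the rows whose labels appear in df_index and silently skips any labels that are not present in df.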
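A minimal sketch of that approach on a small made-up frame (the sample values and the variable names other than df and df_index are assumptions for illustration). It also shows a boolean-mask variant built with Index.isin, which is probably the closest spelling to the ~df_index idea in the question:

```python
import pandas as pd

# Hypothetical sample data; only the index structure mirrors the question.
full_index = pd.MultiIndex.from_product(
    [["2014-10-19", "2014-10-20"], ["all_day", "breakfast", "lunch"]],
    names=["start_date", "time_group"],
)
df = pd.DataFrame({"value": range(len(full_index))}, index=full_index)

# The index whose rows should be excluded; the last entry is absent from df.
df_index = pd.MultiIndex.from_tuples(
    [("2014-10-19", "lunch"), ("2014-10-20", "all_day"), ("2014-10-30", "lunch")],
    names=["start_date", "time_group"],
)

# Option 1: drop the listed rows, ignoring labels that are not present.
not_in_index = df.drop(df_index, errors="ignore")

# Option 2: boolean mask built from index membership, then negated.
not_in_index_mask = df[~df.index.isin(df_index)]
```

Both options should leave the same four rows here; the mask variant is handy when the condition has to be combined with other boolean filters.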

Related

Creating empty pandas dataframe with Multi-Index

I'm trying to create an empty pandas.DataFrame with a MultiIndex that I can later fill column by column with my data. I've looked at other answers (here and here), but they all work with data that is not filled in column by column, or that is somehow connected across the different columns.
The information I want to be contained in the Multi-Index looks like this:
GCM_list = ['BCC-CSM2-MR', 'CAMS-CSM1-0', 'CESM2', 'CESM2-WACCM', 'CMCC-CM2-SR5', 'EC-Earth3', 'EC-Earth3-Veg', 'FGOALS-f3-L', 'GFDL-ESM4', 'INM-CM4-8', 'INM-CM5-0', 'MPI-ESM1-2-HR', 'MRI-ESM2-0', 'NorESM2-MM', 'TaiESM1']
SSP_list = ['SSP_126', 'SSP_245', 'SSP_370', 'SSP_585']
index_years = [2030, 2040, 2050, 2060, 2070, 2080, 2090, 2100]
And I want it to look somewhat like this (for the three first items in GCM_list):
     BCC-CSM2-MR                      CAMS-CSM1-0                      CESM2
     SSP_126 SSP_245 SSP_370 SSP_585  SSP_126 SSP_245 SSP_370 SSP_585  SSP_126 SSP_245 SSP_370 SSP_585
2030    |       |
2040    |       |
2050    V       V
2060    1       2
2070
2080
2090
2100
The "arrows" in the first two columns represent how and in what order I want to fill the DataFrame after the index is created, in case that's important for this question.
I've tried building the index like this, but I'm not sure what to make of the result. How should I proceed? Is there a way to build this empty dataframe so that I can fill it column after column?
arrays = [GCM_list, SSP_list]
index = pd.MultiIndex.from_arrays(arrays, names=('GCM', 'SSP'))
>>> index
MultiIndex(levels=[[u'BCC-CSM2-MR', u'CAMS-CSM1-0', u'CESM2', u'CESM2-WACCM', u'CMCC-CM2-SR5', u'EC-Earth3', u'EC-Earth3-Veg', u'FGOALS-f3-L', u'GFDL-ESM4', u'INM-CM4-8', u'INM-CM5-0', u'MPI-ESM1-2-HR', u'MRI-ESM2-0', u'NorESM2-MM', u'TaiESM1'], [u'SSP_126', u'SSP_245', u'SSP_370', u'SSP_585']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]],
names=[u'GCM', u'SSP'])
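Use MultiIndex.from_product, which builds every (GCM, SSP) combination; from_arrays pairs its arrays element by element and expects them to have equal length: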
arrays = [GCM_list, SSP_list]
mux = pd.MultiIndex.from_product(arrays, names=('GCM', 'SSP'))
df = pd.DataFrame(columns=mux, index=index_years)
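As a rough, self-contained sketch of how the empty frame could then be filled one column at a time (GCM_list is shortened to three models here, and the placeholder values are invented for illustration):

```python
import pandas as pd

# Shortened version of GCM_list from the question, to keep the frame small.
GCM_list = ["BCC-CSM2-MR", "CAMS-CSM1-0", "CESM2"]
SSP_list = ["SSP_126", "SSP_245", "SSP_370", "SSP_585"]
index_years = [2030, 2040, 2050, 2060, 2070, 2080, 2090, 2100]

# Every (GCM, SSP) pair becomes one column; all cells start out as NaN.
mux = pd.MultiIndex.from_product([GCM_list, SSP_list], names=("GCM", "SSP"))
df = pd.DataFrame(columns=mux, index=index_years)

# Fill one column at a time by addressing it with a (GCM, SSP) tuple.
# The values are placeholders, not real model output.
placeholder = [float(i) for i in range(len(index_years))]
df[("BCC-CSM2-MR", "SSP_126")] = placeholder
df[("BCC-CSM2-MR", "SSP_245")] = placeholder
```

Columns that have not been assigned yet stay NaN, so the frame can be filled in any order.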

How to group thousands of dates and IDs into columns in Pandas for machine learning?

I have a large dataframe with multiple columns and rows. They are grouped by geographic location and date. The problem is I have too many columns with dates. I think I need to further develop this dataframe so that I have: "GeographyCode", "Number of Awards", "Secondary School Stage", "SCQF Level" and "DateCode" as single rows. I do not know if my data can be used for Scikit Learn Linear Regression. Please help.
pivot02.columns
MultiIndex(levels=[['Number Of Awards', 'SCQF Level', 'Secondary School Stage'], ['2002/2003', '2003/2004', '2004/2005', '2005/2006', '2006/2007', '2007/2008', '2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]],
names=[None, 'DateCode'])
I have successfully grouped by geographic location, 'Number Of Awards', 'SCQF Level' and 'Secondary School Stage'. But the final output has MultiIndex columns, and I do not know whether I can use that in linear regression. Is this OK for machine learning?
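One possible way to get from this wide layout to one row per location and date is to move the DateCode column level into the rows with stack. The sketch below uses a tiny made-up stand-in for pivot02 (the geography codes and numbers are hypothetical, and it assumes GeographyCode is the row index), so treat it as an illustration rather than a drop-in solution:

```python
import pandas as pd

# Hypothetical stand-in for pivot02: two areas, two school years.
columns = pd.MultiIndex.from_product(
    [["Number Of Awards", "SCQF Level", "Secondary School Stage"],
     ["2002/2003", "2003/2004"]],
    names=[None, "DateCode"],
)
pivot02 = pd.DataFrame(
    [[10, 12, 5, 5, 4, 4], [20, 25, 6, 6, 5, 5]],
    index=pd.Index(["S001", "S002"], name="GeographyCode"),
    columns=columns,
)

# Move the DateCode column level into the rows, then flatten the index
# so GeographyCode and DateCode become ordinary columns.
long_df = pivot02.stack("DateCode").reset_index()
```

The result has one row per (GeographyCode, DateCode) pair and one plain column per measure, which is the flat, two-dimensional shape that estimators such as scikit-learn's LinearRegression expect.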

Specifying integer latent variable in Stan

I'm learning Bayesian data analysis. I'm trying to replicate Trond Reitan's tutorials, which were originally written for WinBUGS, in Stan.
Specifically, I have the following data and model:
weta.windata<-list(numdet=c(0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 2, 0, 3, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 3, 1, 1, 3, 1, 1, 2, 0, 2, 1, 1, 1, 1,0, 0, 0, 2, 0, 2, 4, 3, 1, 0, 0, 2, 0, 2, 2, 1, 0, 0, 1),
numvisit=c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,4, 4, 4, 4, 4, 4, 4 ,4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
nsites=72)
model_string1="
data{
int nsites;
real<lower=0> numdet[nsites];
real<lower=0> numvisit[nsites];
}
parameters{
real<lower=0> p;
real<lower=0> psi;
int<lower=0> z[nsites];
}
model{
p~uniform(0,1);
psi~uniform(0,1);
for(i in 1:nsites){
z[i]~ bernoulli(psi);
p.site[i]~z[i]*p;
numdet[i]~binomial(numvisit[i],p.site[i]);
}
}
"
mcmc_samples <- stan(model_code=model_string1,
data=weta.windata,
pars=c("p","psi","z"),
chains=3, iter=30000, warmup=10000)
The context is detecting weta in fields. There are 72 sites. For each site, researchers visited several times (numvisit) and recorded the number of visits on which weta were found (numdet).
There is a latent variable z describing whether a site has weta or not. psi is the probability that a site has weta, and p is the detection rate.
The problem I have is that I cannot declare z as an integer:
parameters or transformed parameters cannot be integer or integer array; found declared type int, parameter name=z
Problem with declaration.
However, if I set z to be real, that is,
real<lower=0> z[nsites];
the problem becomes that a real variable cannot be given a bernoulli distribution:
No matches for:
real ~ bernoulli(real)
I'm very new to stan. Forgive me if this question is very silly.
Stan doesn't support integer parameters or hacks to let you pretend real variables are integers. What it does support is marginalizing the integer variables out of the density. You can then reconstruct them with much more efficiency and much higher tail resolution.
The chapter in the manual on latent discrete parameters is the place to start. It includes an implementation of the CJS population models, which may be familiar. I implemented the Dorazio and Royle occupancy models as a case study, and Hiroki Ito translated the entire Kéry and Schaub book to Stan. They're all linked under users >> documentation on the web site.
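To make the marginalization concrete, here is a minimal sketch in plain Python (not Stan) of the per-site likelihood with z summed out. It assumes the occupancy model from the question, z ~ Bernoulli(psi) and numdet ~ Binomial(numvisit, z * p), with p and psi strictly between 0 and 1; the function names are made up for this illustration. In a Stan program the same two-branch expression would be added to the log density in the model block.

```python
import math

def log_binomial_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of observing k successes."""
    return (math.log(math.comb(n, k))
            + k * math.log(p) + (n - k) * math.log1p(-p))

def site_log_lik(numdet, numvisit, p, psi):
    """Log-likelihood of one site with the occupancy indicator z summed out."""
    occupied = math.log(psi) + log_binomial_pmf(numdet, numvisit, p)
    if numdet > 0:
        # At least one detection means the site must be occupied (z = 1).
        return occupied
    # No detections: either occupied but missed on every visit, or unoccupied.
    unoccupied = math.log1p(-psi)
    m = max(occupied, unoccupied)
    # Numerically stable log(exp(occupied) + exp(unoccupied)).
    return m + math.log(math.exp(occupied - m) + math.exp(unoccupied - m))

def prob_occupied(numdet, numvisit, p, psi):
    """Probability that z = 1 for one site, given values of p and psi."""
    if numdet > 0:
        return 1.0
    occupied = psi * (1.0 - p) ** numvisit
    return occupied / (occupied + (1.0 - psi))
```

prob_occupied corresponds to the reconstruction step: evaluated at each posterior draw of p and psi and averaged, it gives an estimate of the probability that a site with no detections is nevertheless occupied.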
I ran into this mysterious error with ulam while working through the practice problems in Statistical Rethinking. When you're constructing a list to pass to the data argument of ulam, be sure to use = rather than <- for assignment. If you don't, the list you construct won't have named components, and a missing name produces this error.

How to populate a queue from preloaded data with variable length sequences?

Say I have input data with variable length sequences loaded into memory:
sentences = [
[0, 1, 2, 3, 4, 5, 6, 7],
[0, 1, 2, 3, 4, 5, 6, 7],
[0, 1, 2, 3, 4, 5 ],
[0, 1, 2, 3, 4, 5 ]
]
How can I use this to fill a queue? E.g. something like:
padding_q = tf.PaddingFIFOQueue(
capacity=len(sentences),
dtypes=[tf.int32], shapes=[[None]])
qr = tf.train.QueueRunner(padding_q, [the_wanted_op])
What does the_wanted_op look like? Each run should enqueue one sentence, and after four runs each sentence should have been enqueued exactly once.

Convert string to integer pandas dataframe index

I have a pandas dataframe with a MultiIndex. Unfortunately, one of the index levels stores years as strings,
e.g. '2010', '2011'.
How do I convert these to integers?
More concretely
MultiIndex(levels=[[u'2010', u'2011'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, , ...]], names=[u'Year', u'Month'])
df_cbs_prelim_total.index.set_levels(df_cbs_prelim_total.index.get_level_values(0).astype('int'))
seems to do it, but not in place. Is there a proper way of changing them?
Cheers,
Mike
It will probably be cleaner to do this before you assign it as the index (as @EdChum points out), but when you already have it as the index, you can indeed use set_levels to alter the values of one level of your MultiIndex. A slightly cleaner version than your code (you can use index.levels[..]):
In [165]: idx = pd.MultiIndex.from_product([[1,2,3], ['2011','2012','2013']])
In [166]: idx
Out[166]:
MultiIndex(levels=[[1, 2, 3], [u'2011', u'2012', u'2013']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
In [167]: idx.levels[1]
Out[167]: Index([u'2011', u'2012', u'2013'], dtype='object')
In [168]: idx = idx.set_levels(idx.levels[1].astype(int), level=1)
In [169]: idx
Out[169]:
MultiIndex(levels=[[1, 2, 3], [2011, 2012, 2013]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
You have to reassign the result to save the changes (as is done above; in your case this would be df_cbs_prelim_total.index = df_cbs_prelim_total.index.set_levels(...)).
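Put together for a frame rather than a bare index, the reassignment could look like the sketch below. The frame df and its contents are made up for illustration; only the Year/Month level names follow the question:

```python
import pandas as pd

# Hypothetical frame whose Year level holds strings, as in the question.
idx = pd.MultiIndex.from_product(
    [["2010", "2011"], range(1, 13)], names=["Year", "Month"]
)
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# set_levels returns a new MultiIndex, so assign the result back to df.index.
df.index = df.index.set_levels(df.index.levels[0].astype(int), level="Year")
```

After the reassignment, lookups use integer years, for example df.loc[(2011, 1)].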