Using pandas to plot - array error - pandas

I have a file that looks like this:
> loc.38167 h3k4me1 1.8299 1.5343 0.0 0.0 1.8299 1.5343 0.0 ....
> loc.08652 h3k4me3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ....
I want to plot 500 random 'loc.' points on a graph. Each loc. has 100 values. I use the following python script:
file = open('h3k4me3.tab.data')
data = {}
for line in file:
cols = line.strip().split('\t')
vals = map(float,cols[2:])
data[cols[0]] = vals
file.close
randomA = data.keys()[:500]
window = int(math.ceil(5000.0 / 100))
xticks = range(-2500,2500,window)
sns.tsplot([data[k] for k in randomA],time=xticks)
However, I get
ValueError: arrays must all be same length

Related

pandas assign result from list of columns

Suppose I have a dataframe as shown below:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 0.0
And I have the list of columns which I want to populate with the value of 1, when A is negative.
idx = df.A < 0
cols = ['B', 'C']
So in this case, I want the indices [1, 'B'] and [4, 'C'] set to 1.
What I tried:
However, doing df.loc[idx, cols] = 1 sets the entire row to be 1, and not just the individual column. I also tried doing df.loc[idx, cols] = pd.get_dummies(cols) which gave the result:
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 1.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 NaN NaN
I'm assuming this is because the index of get_dummies and the dataframe don't line up.
Expected Output:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
So what's the best (read fastest) way to do this. In my case, there are 1000's of rows and 5 columns.
Timing of results:
TLDR: editing values directly is faster.
%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1
123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1
266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use numpy indexing for improve performance:
idx = df.A < 0
res = ['B', 'C']
arr = df.values
arr[idx, df.columns.get_indexer(res)] = 1
print (arr)
[[ 0.49671415 0. 0. ]
[-0.1382643 1. 0. ]
[ 0.64768854 0. 0. ]
[ 1.52302986 0. 0. ]
[-0.23415337 0. 1. ]]
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Alternative:
idx = df.A < 0
res = ['B', 'C']
df.values[idx, df.columns.get_indexer(res)] = 1
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
ind = df.index[idx]
for idx,col in zip(ind,res):
...: df.at[idx,col] = 1
In [7]: df
Out[7]:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0

Mixture prior not working in JAGS, only when likelihood term included

The code at the bottom will replicate the problem, just copy and paste it into R.
What I want is for the mean and precision to be (-100, 100) 30% of the time, and (200, 1000) for 70% of the time. Think of it as lined up in a, b, and p.
So 'pick' should be 1 30% of the time, and 2 70% of the time.
What actually happens is that on every iteration, pick is 2 (or 1 if the first element of p is the larger one). You can see this in the summary, where the quantiles for 'pick', 'testa', and 'testb' remain unchanged throughout. The strangest thing is that if you remove the likelihood loop, pick then works exactly as intended.
I hope this explains the problem, if not let me know. It's my first time posting so I'm bound to have messed things up.
library(rjags)
n = 10
y <- rnorm(n, 5, 10)
a = c(-100, 200)
b = c(100, 1000)
p = c(0.3, 0.7)
## Model
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 10)
}
# ISSUE HERE: MIXTURE PRIOR
mu ~ dnorm(a[pick], b[pick])
pick ~ dcat(p[1:2])
testa = a[pick]
testb = b[pick]
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'testa', 'testb', 'mu'), n.iter = 10000)
summary(res)
I think you are having problems for a couple of reasons. First, the data that you have supplied to the model (i.e., y) is not a mixture of normal distributions. As a result, the model itself has no need to mix. I would instead generate data something like this:
set.seed(320)
# number of samples
n <- 10
# Because it is a mixture of 2 we can just use an indicator variable.
# here, pick (in the long run), would be '1' 30% of the time.
pick <- rbinom(n, 1, p[1])
# generate the data. b is in terms of precision so we are converting this
# to standard deviations (which is what R wants).
y_det <- pick * rnorm(n, a[1], sqrt(1/b[1])) + (1 - pick) * rnorm(n, a[2], sqrt(1/b[2]))
# add a small amount of noise, can change to be more as necessary.
y <- rnorm(n, y_det, 1)
These data look more like what you would want to supply to a mixture model.
Following this, I would code the model up in a similar way as I did the data generation process. I want some indicator variable to jump between the two normal distributions. Thus, mu may change for each scalar in y.
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], 10)
mu[i] <- mu_ind[i] * a_mu + (1 - mu_ind[i]) * b_mu
mu_ind[i] ~ dbern(p[1])
}
a_mu ~ dnorm(a[1], b[1])
b_mu ~ dnorm(a[2], b[2])
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('mu_ind', 'a_mu', 'b_mu'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
a_mu -100.4 -100.3 -100.2 -100.1 -100
b_mu 199.9 200.0 200.0 200.0 200
mu_ind[1] 0.0 0.0 0.0 0.0 0
mu_ind[2] 1.0 1.0 1.0 1.0 1
mu_ind[3] 0.0 0.0 0.0 0.0 0
mu_ind[4] 1.0 1.0 1.0 1.0 1
mu_ind[5] 0.0 0.0 0.0 0.0 0
mu_ind[6] 0.0 0.0 0.0 0.0 0
mu_ind[7] 1.0 1.0 1.0 1.0 1
mu_ind[8] 0.0 0.0 0.0 0.0 0
mu_ind[9] 0.0 0.0 0.0 0.0 0
mu_ind[10] 1.0 1.0 1.0 1.0 1
If you supplied more data, you would (in the long run) have the indicator variable mu_ind take the value of 1 30% of the time. If you had more than 2 distributions you could instead use dcat. Thus, an alternative and more generalized way of doing this would be (and I am borrowing heavily from this post by John Kruschke):
mod_str = "model {
# Likelihood:
for( i in 1 : n ) {
y[i] ~ dnorm( mu[i] , 10 )
mu[i] <- muOfpick[ pick[i] ]
pick[i] ~ dcat( p[1:2] )
}
# Prior:
for ( i in 1:2 ) {
muOfpick[i] ~ dnorm( a[i] , b[i] )
}
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'muOfpick'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
muOfpick[1] -100.4 -100.3 -100.2 -100.1 -100
muOfpick[2] 199.9 200.0 200.0 200.0 200
pick[1] 2.0 2.0 2.0 2.0 2
pick[2] 1.0 1.0 1.0 1.0 1
pick[3] 2.0 2.0 2.0 2.0 2
pick[4] 1.0 1.0 1.0 1.0 1
pick[5] 2.0 2.0 2.0 2.0 2
pick[6] 2.0 2.0 2.0 2.0 2
pick[7] 1.0 1.0 1.0 1.0 1
pick[8] 2.0 2.0 2.0 2.0 2
pick[9] 2.0 2.0 2.0 2.0 2
pick[10] 1.0 1.0 1.0 1.0 1
The link above includes even more priors (e.g., a Dirichlet prior on the probabilities incorporated into the Categorical distribution).

How can I loop through this pandas dataframe faster?

I have this loop that iterates over a dataframe and creates a cumulative value. I have around 450k rows in my dataframe and it takes in excess of 30 minutes to complete.
Here is the head of my dataframe:
timestamp open high low close volume vol_thrs flg
1970-01-01 09:30:59 136.01 136.08 135.94 136.030 5379100 0.0 0.0
1970-01-01 09:31:59 136.03 136.16 136.01 136.139 759900 0.0 0.0
1970-01-01 09:32:59 136.15 136.18 136.10 136.180 609000 0.0 0.0
1970-01-01 09:33:59 136.18 136.18 136.07 136.100 510900 0.0 0.0
1970-01-01 09:34:59 136.11 136.15 136.05 136.110 306400 0.0 0.0
The timestamp column is the index.
Any thoughts on how I make this quicker?
for (i, (idx, row)) in enumerate(df.iterrows()):
if i == 0:
tmp_cum = df.loc[idx, 'volume']
else:
tmp_cum = tmp_cum + df.loc[idx, 'volume']
if tmp_cum >= df.loc[idx, 'vol_thrs']:
tmp_cum = 0
df.loc[idx, 'flg'] = 1
Try using df.at instead of df.loc, as so:
for (i, (idx, row)) in enumerate(df.iterrows()):
if i == 0:
tmp_cum = df.at[idx, 'volume']
else:
tmp_cum = tmp_cum + df.at[idx, 'volume']
if tmp_cum >= df.at[idx, 'vol_thrs']:
tmp_cum = 0
df.at[idx, 'flg'] = 1
df.at should theoretically perform better. df.at is better if you're accessing a single data value, which is the case in your function. df.loc will let you do slicing, but df.at won't.

How to create a new column in a Pandas DataFrame using pandas.cut method?

I have a column with house prices that looks like this:
0 0.0
1 1480000.0
2 1035000.0
3 0.0
4 1465000.0
5 850000.0
6 1600000.0
7 0.0
8 0.0
9 0.0
Name: Price, dtype: float64
and I want to create a new column called data['PriceRanges'] which sets each price in a given range. This is what my code looks like:
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)
for i in range(0, 12000000, 50000):
bins = np.array(i)
labels = np.array(str(i))
data['PriceRange'] = pd.cut(data.Price, bins=bins, labels=labels, right=True)
And I get this Error message:
TypeError: len() of unsized object
I've been trying different approaches and seem to be stuck here. I'd really appreciate some help.
Thanks,
Hugo
There is problem you overwrite bins and labels in loop, so there is only last value.
for i in range(0, 12000000, 50000):
bins = np.array(i)
labels = np.array(str(i))
print (bins)
11950000
print (labels)
11950000
There is no necessary loop, only instead range use numpy alternative arange and for labels create ranges. Last add parameter include_lowest=True to cut for include first value of bins (0) to first group.
bins = np.arange(0, 12000000, 50000)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
'200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
'400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
bins=bins,
labels=labels,
right=True,
include_lowest=True)
print (data)
Price PriceRange
0 0.0 0 - 50000
1 1480000.0 1450001 - 1500000
2 1035000.0 1000001 - 1050000
3 0.0 0 - 50000
4 1465000.0 1450001 - 1500000
5 850000.0 800001 - 850000
6 1600000.0 1550001 - 1600000
7 0.0 0 - 50000
8 0.0 0 - 50000
9 0.0 0 - 50000

How do I aggregate sub-dataframes in pandas?

Suppose I have two-leveled multi-indexed dataframe
In [1]: index = pd.MultiIndex.from_tuples([(i,j) for i in range(3)
: for j in range(1+i)], names=list('ij') )
: df = pd.DataFrame(0.1*np.arange(2*len(index)).reshape(-1,2),
: columns=list('xy'), index=index )
: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
: return subdf['x'].mean() / subdf['y'].mean()
:
: level0 = df.index.levels[0].values
: pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
: index=pd.Index(level0, name=index.names[0]) )
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
Need GroupBy.apply, which working with DataFrame:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function. You can calculate the 'within group means' with agg then perform an eval to get the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64