Using numpy for polynomial fit on pandas dataframe

I have a dataframe containing astronomical data. I'm using statsmodels.formula.api to try to apply a polynomial fit to the dataframe, using columns labelled log_z and U, B, V, and other variables. So far I've got
sources['log_z'] = np.log10(sources.z)
mask = ~np.isnan((B-I)) & ~np.isnan(log_z)
model = ols(formula='(B-I) + np.power((U-R),2) ~ log_z', data = [log_z[mask], (B-I)[mask]]).fit()
but I keep getting
PatsyError: Error evaluating factor: TypeError: list indices must be integers or slices, not str
(B-I) + np.power((U-R),2) ~ log_z
^^^^^^^^^^^^^^^^^
even though I'm passing arrays into the function. I get the same error message (apart from the last line) no matter what arrays I use, or how I format them. Can anyone see what I'm doing wrong?
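The error message suggests that patsy looks up column names with string keys, which a plain list does not support; passing the DataFrame itself (data=sources) is the usual way to give it named columns. Since the title asks about numpy, here is a minimal sketch of a quadratic fit with np.polyfit instead of ols. The synthetic arrays stand in for sources['log_z'] and the B - I color, and their values are made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for the question's columns (hypothetical values).
rng = np.random.default_rng(0)
log_z = rng.uniform(-2.0, 0.5, size=200)
color = 0.5 * log_z**2 - 1.2 * log_z + 0.3   # plays the role of B - I
color[::25] = np.nan                          # a few missing measurements

# Keep only rows where both quantities are finite.
mask = np.isfinite(log_z) & np.isfinite(color)

# Degree-2 least-squares fit; coefficients come back highest power first.
coeffs = np.polyfit(log_z[mask], color[mask], deg=2)
fitted = np.polyval(coeffs, log_z[mask])
```

With a real DataFrame, the two inputs could be taken from the columns with .to_numpy() before masking.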

Related

Construct NumPy matrix row by row

I'm trying to construct a 2D NumPy array from values in an extant 2D NumPy array using an iterative process. Using ordinary python lists the process I'm describing would look like so:
coords = #data from file contained in a 2D list
d = #integer
edges = []
for i in range(d+1):
    for j in range(i+1, d+1):
        edge = coords[j] - coords[i]
        edges.append(edge)
However, the NumPy array imposes restrictions that do not permit the process shown above. Below I try to do the same thing using NumPy arrays, and it should immediately be clear where the problems are:
coords = np.genfromtxt('Energies.txt', dtype=float, skip_header=1)
d = #integer
#how to initialize?
for i in range(d+1):
    for j in range(i+1, d+1):
        edge = coords[j] - coords[i]
        #how to append?
Because .append does not exist for NumPy arrays I need to rely on concatenate or stack instead. But these functions are designed to join existing arrays, and I don't have anything to concatenate or stack until after the first iteration of my loop. So I suppose I need to change my data flow, but I'm unsure how to go about this.
Any help would be greatly appreciated. Thanks in advance.
That function is numpy.meshgrid [1]; it does this by default.
[1] https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.meshgrid.html
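For the row-by-row construction asked about above, two common patterns are sketched below: collect the rows in a Python list and convert once at the end, or compute all index pairs up front with np.triu_indices and subtract in one step. The small coords array is a hypothetical stand-in for the data read from Energies.txt, and d is assumed to be the number of rows minus one.

```python
import numpy as np

# Hypothetical stand-in for the coordinates loaded from the file:
# d + 1 points, one per row.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
d = coords.shape[0] - 1

# Option 1: collect rows in a list, convert to an array once at the end.
rows = []
for i in range(d + 1):
    for j in range(i + 1, d + 1):
        rows.append(coords[j] - coords[i])
edges_loop = np.array(rows)

# Option 2: all index pairs (i, j) with i < j, in the same order as the loops.
i_idx, j_idx = np.triu_indices(d + 1, k=1)
edges_vec = coords[j_idx] - coords[i_idx]
```

Both options avoid calling concatenate inside the loop, which would copy the growing array on every iteration.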

Vectorizing text from data frame column using pandas

I have a DataFrame with a text column, headline_text. I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes I found out the fit_transform function returns a scipy.sparse.csr.csr_matrix and that is not callable.
Is there another way to do this?
Thanks!
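There are a number of problems with your code. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents that fit_transform expects. Passing a DataFrame (double brackets) would iterate over its column names instead.
fit_transform returns a csr matrix, not a function, which is why apply cannot call it.
pd.DataFrame(....toarray()) creates a DataFrame from the dense version of that matrix. With a large vocabulary the dense array can become very large.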

Read a file consisting two columns of pure(?) double into a complex NumPy array

Almost the same questions were asked by the following:
Numpy read complex numbers from text file
Writing and reading complex numbers using numpy.savetxt and numpy.loadtxt
loading complex numbers with numpy.loadtxt
Reading complex data into numpy array
However, the questions above involve slightly different input formats, e.g. parentheses, from the file content here.
Consider a file named example containing two columns of pure(?) double:
0.8355544313622164 0
1.199174279986189 0
1.417275292218002 0
I am able to generate a complex numpy array (np.complex128, since the input is float64) by doing the following:
data = np.loadtxt("./example", dtype=np.float64, delimiter='\t')
complexData = data.T[0] + 1j*data.T[1]
Printing complexData now gives:
[ 0.83555443+0.j 1.19917428+0.j 1.41727529+0.j ... ]
Is it possible to reduce the above approach into a neater one?
For example, changing data type to np.complex64 raises TypeError:
data = np.loadtxt("./example", dtype=np.complex64, delimiter='\t')
Instead of converting the real array to complex with
complexData = data.T[0] + 1j*data.T[1]
you can create a complex view of the data:
complexData = data.view(np.complex128)
Then data and complexData share the underlying array of floating point numbers, but complexData interprets those values as complex numbers.
complexData will be an array with shape (n, 1). To get rid of the extraneous second dimension, you can use
complexData = data.view(np.complex128)[:, 0]
You could do the conversion immediately upon reading the data. For example, my sample file called "real.txt" is
0.8355544313622164 0
1.199174279986189 0
1.417275292218002 0
3.141592653589793 -1
and it is not tab-delimited, so I'll use the default delimiter. To read the data as complex:
In [18]: z = np.loadtxt('real.txt').view(np.complex128)[:, 0]
In [19]: z
Out[19]: array([0.83555443+0.j, 1.19917428+0.j, 1.41727529+0.j, 3.14159265-1.j])
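The view approach above can be checked without a file on disk. The sketch below reads the same four rows from an in-memory string (an assumption made so the example is self-contained) and compares the view with the explicit real + 1j*imag construction. The view relies on the loaded array being C-contiguous float64 with exactly two columns, which is what np.loadtxt produces by default for this input.

```python
import io
import numpy as np

# In-memory stand-in for the two-column text file (values from the example above).
text = """0.8355544313622164 0
1.199174279986189 0
1.417275292218002 0
3.141592653589793 -1
"""

data = np.loadtxt(io.StringIO(text))   # shape (4, 2), dtype float64

# Reinterpret each (real, imag) pair of float64 values as one complex128.
z_view = data.view(np.complex128)[:, 0]

# Explicit construction, which allocates a new array instead of sharing memory.
z_copy = data[:, 0] + 1j * data[:, 1]
```

Because z_view shares memory with data, writing to one changes the other; use the explicit construction if an independent array is needed.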

assign certain entries of Tensor, like set_subtensor of Theano

Can I just assign values to certain entries in a tensor? I ran into this problem when computing the cross-correlation matrix of an NxP feature matrix feats, where N is the number of observations and P is the dimension. Some columns are constant, so their standard deviation is zero, and I don't want to divide by the std for those constant columns. Here is what I did:
fmean, fvar = tf.nn.moments(feats, axes = [0], keep_dims = False)
fstd = tf.sqrt(fvar)
feats = feats - fmean
sel = (fstd != 0)
feats[:, sel] = feats[:, sel]/ fstd[sel]
corr = tf.matmul(tf.transpose(feats), feats)
However, I got this error: TypeError: 'Tensor' object does not support item assignment. Is there any workaround for this issue?
You can make your feats a tf.Variable and use tf.scatter_update to update locations selectively.
It's a bit awkward in that scatter_update needs a list of linear indices to update, so you'd need to convert your implicit 2D specification [:, sel] into an explicit list of 1D indices. There's an example of constructing 1D indices from 2D here
There's some work in simplifying this kind of use-case in issue #206
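A different workaround, not the scatter_update approach described above, is to avoid item assignment altogether: replace every zero standard deviation with 1 and then divide the whole matrix, so constant columns are only centered. The sketch below shows the idea in NumPy with a made-up 3x3 feats matrix; in TensorFlow the same selection can presumably be written with tf.where, which is an assumption about how you would port it rather than tested TensorFlow code.

```python
import numpy as np

# Hypothetical N x P feature matrix; column 1 is constant.
feats = np.array([[1.0, 5.0, 2.0],
                  [2.0, 5.0, 0.0],
                  [3.0, 5.0, 4.0]])

mean = feats.mean(axis=0)
std = feats.std(axis=0)

# Replace zero standard deviations with 1 so constant columns are only centered.
safe_std = np.where(std != 0, std, 1.0)
normalized = (feats - mean) / safe_std

# Cross-correlation matrix, as in the question (not divided by N here).
corr = normalized.T @ normalized
```

No element is ever assigned in place, so the pattern works with immutable tensors as well as arrays.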

How to get a subarray in numpy

I have a 3D array and I want to get a sub-array of size (2n+1) along each dimension, centered around an index indx. Using slices I can use
y[slice(indx[0]-n,indx[0]+n+1),slice(indx[1]-n,indx[1]+n+1),slice(indx[2]-n,indx[2]+n+1)]
which will only get uglier if I want a different size for each dimension. Is there a nicer way to do this?
You don't need to use the slice constructor unless you want to store the slice object for later use. Instead, you can simply do:
y[indx[0]-n:indx[0]+n+1, indx[1]-n:indx[1]+n+1, indx[2]-n:indx[2]+n+1]
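If you want to do this without specifying each index separately, you can build a tuple of slices with a generator expression (recent NumPy versions reject a list of slices as an index, so the tuple matters):
y[tuple(slice(i-n, i+n+1) for i in indx)]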
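The same idea extends to a different window size per dimension, which was the case the question worried about. This is a minimal sketch; the array y, the center indx and the half-widths in half are made-up values for illustration.

```python
import numpy as np

y = np.arange(17 * 18 * 16).reshape(17, 18, 16)   # hypothetical 3D array
indx = (5, 10, 8)                                  # center index
half = (4, 3, 2)                                   # half-width per dimension

# One slice per dimension, packed in a tuple so NumPy treats it as
# basic (multi-dimensional) slicing.
window = tuple(slice(i - n, i + n + 1) for i, n in zip(indx, half))
sub = y[window]
```

If i - n can go negative near an edge, the slice would wrap around to the end of the axis, so clamp the start with max(i - n, 0) in that case.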
You can create numpy arrays for indexing into the different dimensions of the 3D array and then use the ix_ function to create an indexing map and thus get the sliced output. The benefit of ix_ is that it allows for broadcasted indexing maps. More info on this can be found here. You can then specify different window sizes for each dimension for a generic solution. Here's the implementation with sample input data -
import numpy as np
A = np.random.randint(0,9,(17,18,16)) # Input array
indx = np.array([5,10,8]) # Pivot indices for each dim
N = [4,3,2] # Window sizes
# Arrays of start & stop indices
start = indx - N
stop = indx + N + 1
# Create indexing arrays for each dimension
xc = np.arange(start[0],stop[0])
yc = np.arange(start[1],stop[1])
zc = np.arange(start[2],stop[2])
# Create mesh from multiple arrays for use as indexing map
# and thus get desired sliced output
Aout = A[np.ix_(xc,yc,zc)]
Thus, for the given data with window sizes array, N = [4,3,2], the whos info shows -
In [318]: whos
Variable Type Data/Info
-------------------------------
A ndarray 17x18x16: 4896 elems, type `int32`, 19584 bytes
Aout ndarray 9x7x5: 315 elems, type `int32`, 1260 bytes
The whos info for the output Aout is consistent with the intended output shape, which must be 2N+1 along each dimension.
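One practical difference between the two answers is worth checking: indexing through np.ix_ is advanced indexing and returns a new array, while plain slices over the same ranges return a view of the input. The sketch below reuses the shapes from the example above, with np.arange data in place of the random input so the result is reproducible.

```python
import numpy as np

A = np.arange(17 * 18 * 16).reshape(17, 18, 16)
indx = np.array([5, 10, 8])    # pivot indices for each dimension
N = np.array([4, 3, 2])        # half-widths for each dimension
start, stop = indx - N, indx + N + 1

# Advanced indexing through np.ix_ returns a new array.
via_ix = A[np.ix_(*(np.arange(a, b) for a, b in zip(start, stop)))]

# Basic slicing over the same ranges returns a view of A.
via_slices = A[tuple(slice(a, b) for a, b in zip(start, stop))]
```

Both give the same 9x7x5 values; choose slices when you want to avoid the copy, and ix_ when the indices along a dimension are not a contiguous range.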