Store regression coefficients, merge back into data-frame - pandas

I'm trying to estimate a random effects model, and store those coefficients. I then want to merge them to the data-frame to predict the dependent variable.
There is a random effect coefficient for each group. In the data-frame, if an observation belongs to group 1, I want the group 1 coefficient listed there. For observations in group 2, the group 2 coefficient and so on.
I am able to access and store the coefficients. But I'm not able to merge them back into the data-frame. I'm not sure how to think of it. Here is the code I have so far:
md = smf.mixedlm('y ~ x', data=train, groups=train['GroupID'])
mdf = md.fit()
I tried storing the coefficients in three ways:
re_coeffs = pd.Series(mdf.random_effects.values) #creates a series with shape (1,)
re_coeffs = [(k) for k in mdf.random_effects.values()] #creates a list with the coeffs
re_coeffs = np.array(mdf.random_effects.values) #creates array with shape ()
All of them work, but none of them let me merge them back into the original data-frame. I'm not sure about using a dictionary or a list, or generally how to think about merging these coefficients back into the original data-frame.
I'll appreciate any suggestions for this.

This seems to work:
md = smf.mixedlm('y ~ x', data=train, groups=train['GroupID'])
mdf = md.fit()
# random_effects is a dict keyed by group label; transpose so that each group is one row
re_df = pd.DataFrame(mdf.random_effects).T
re_df['GroupID'] = re_df.index
merged = pd.merge(train, re_df, on=['GroupID'])
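A minimal sketch of the same idea without fitting a model, assuming random_effects is a dict that maps each group label to a pandas Series of coefficients. The toy train frame, the stand-in dict, and the re_ column prefix are hypothetical and only there for illustration:

```python
import pandas as pd

# Toy training data (hypothetical values).
train = pd.DataFrame({"GroupID": [1, 1, 2, 2, 3],
                      "x": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Stand-in for mdf.random_effects: group label -> Series of random effects.
random_effects = {
    1: pd.Series({"Group": 0.5}),
    2: pd.Series({"Group": -0.2}),
    3: pd.Series({"Group": 0.1}),
}

# One row per group, one column per random effect.
re_df = pd.DataFrame(random_effects).T
re_df.index.name = "GroupID"
re_df = re_df.add_prefix("re_").reset_index()

# A left merge keeps every observation and attaches its group's coefficient.
merged = train.merge(re_df, on="GroupID", how="left")
print(merged)
```

Keeping the dict keys (rather than only the values) is what makes the merge possible, because the keys are the group labels that also appear in the data-frame.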

How to build a numpy matrix one row at a time?

I'm trying to build a matrix one row at a time.
import numpy as np
f = np.matrix([])
f = np.vstack([ f, np.matrix([1]) ])
This is the error message.
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 1
As you can see, np.matrix([]) is NOT an empty list. I'm going to have to do this some other way. But what? I'd rather not do an ugly workaround kludge.
You have to give the initial matrix some dimensions. Either fill it with zeros or use np.empty():
f = np.empty(shape = [1,1])
f = np.vstack([f,np.matrix([1])])
You can use np.hstack for the first row instead, then use np.vstack iteratively.
arr = np.array([])
arr = np.hstack((arr, np.array([1,1,1])))
arr = np.vstack((arr, np.array([2,2,2])))
Now you can convert into a matrix.
mat = np.asmatrix(arr)
Good grief. It appears there is no way to do what I want. Kludgetown it is. I'll build an array with a bogus first entry, then when I'm done make a copy without the bogosity.
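For what it's worth, a sketch of two ways that avoid a bogus first row, assuming the number of columns is known up front (3 here is just an example value): start from an array with zero rows, or collect the rows in a plain list and convert once at the end.

```python
import numpy as np

# Option 1: start with zero rows and a fixed number of columns.
f = np.empty((0, 3))
f = np.vstack([f, [1, 1, 1]])
f = np.vstack([f, [2, 2, 2]])

# Option 2: accumulate rows in a list and build the array once.
rows = []
for i in range(1, 4):
    rows.append([i, i, i])
arr = np.array(rows)

print(f.shape, arr.shape)
```

The list version is usually the cheaper one, since every np.vstack call copies the whole array built so far.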

Difficulty with numpy broadcasting

I have two 2d point clouds (oldPts and newPts) which I wish to combine. They are mx2 and nx2 numpy integer arrays with m and n of order 2000. newPts contains many duplicates or near duplicates of oldPts and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2d representation of oldPts (H). I then compare each newPt to an NxN area of H and if it is empty I accept the point. This last part I am currently doing with a Python loop which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different method of going about the problem? The working code is below:
npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into 2d array
H, xedg,yedg= np.histogram2d(oldPts[:,0],oldPts[:,1],
                             bins = [xMax,yMax],
                             range = [[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0,pt[0]-N):min(xMax,pt[0]+N),
             max(0,pt[1]- N):min(yMax,pt[1]+N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
The right way to do this is to compute the distance between each pair of 2-long vectors, and then accept only the new points that are "different enough" from every old point, using scipy.spatial.distance.cdist:
import numpy as np
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)
from scipy.spatial.distance import cdist
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)
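okIndex = np.min(dist, axis=0) > 5 # True where a new point is farther than 5 from every old point
print(np.sum(okIndex)) # number of accepted new points
finalNew = newPts[okIndex,:]
print(finalNew.shape) # (number of accepted points, 2)
Above I use the Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew, which is why the test takes the minimum over the old points. With the standard-normal toy data above, a threshold of 5 accepts few if any points, so you will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking the best one.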
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get nearly the same answer as the histogram by passing the metric='chebyshev' keyword argument to cdist (the square window you check in H corresponds to the Chebyshev, or maximum-coordinate, distance rather than the Euclidean one), assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)
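Since the question asked about broadcasting, here is a sketch of the same filter in plain NumPy, assuming integer points and a square window of half-width N as in the question. The random toy data and the names old_pts, new_pts and final_new are made up for the example, and the boundary handling only roughly mirrors the half-open slice used in the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
old_pts = rng.integers(0, 100, size=(200, 2))  # toy stand-in for oldPts
new_pts = rng.integers(0, 100, size=(300, 2))  # toy stand-in for newPts
N = 5

# diff[i, j] holds |new_pts[i] - old_pts[j]| per coordinate, shape (300, 200, 2).
diff = np.abs(new_pts[:, None, :] - old_pts[None, :, :])

# Chebyshev distance: the larger of the two coordinate differences.
cheb = diff.max(axis=2)

# Keep a new point only if every old point is at least N away.
keep = cheb.min(axis=1) >= N
final_new = new_pts[keep]
print(final_new.shape)
```

For m and n around 2000 the intermediate array has 2000 x 2000 x 2 entries, which is fine in memory; for much larger clouds a spatial index would be the safer choice.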

Pseudoinverse calculation in Python

Problem
I was working on the problem described here. I have two goals.
1. For any given system of linear equations, figure out which variables have unique solutions.
2. For those variables with unique solutions, return the minimal list of equations such that knowing those equations determines the value of that variable.
For example, in the following set of equations
X = a + b
Y = a + b + c
Z = a + b + c + d
The appropriate output should be c and d, where X and Y determine c and Y and Z determine d.
Parameters
I'm provided a two-column pandas DataFrame named InputDataSet, where the two columns are Equation and Variable. Each row represents a variable's membership in a given equation. For example, the above set of equations would be represented as
InputDataSet = pd.DataFrame([['X','a'],['X','b'],['Y','a'],['Y','b'],['Y','c'],
['Z','a'],['Z','b'],['Z','c'],['Z','d']],columns=['Equation','Variable'])
The output will also be stored in a two-column DataFrame, named OutputDataSet, where the first column contains the variables that have a unique solution and the second is a comma-delimited string of the minimal set of equations needed to solve for the given variable. For example, the correct OutputDataSet would look like
OutputDataSet = pd.DataFrame([['c','X,Y'],['d','Y,Z']],columns=['Variable','EquationList'])
Current Solution
My current solution takes the InputDataSet and converts it into a NetworkX graph. After splitting the graph into connected subgraphs, it then converts the graph into a biadjacency matrix (since the graph by nature is bipartite). After this conversion, the SVD is computed, and the nullspace and pseudoinverse are calculated from the SVD (To see how they are calculated, see here and here: look at the source code for numpy.linalg.pinv and the cookbook function for nullspace. I fused the two functions since they both use SVD).
After calculating nullspace and pseudo-inverse, and rounding to a given tolerance, I find all rows in the nullspace where all of the coefficients are 0, and return those variables as those with a unique solution, and return those equations with non-zero coefficients for those variables in the pseudo-inverse.
Here is the code:
import networkx as nx
import pandas as pd
import numpy as np
import numpy.core as cr
def svd_lite(a, tol=1e-2):
    wrap = getattr(a, "__array_prepare__", a.__array_wrap__)
    rcond = cr.asarray(tol)
    a = a.conjugate()
    u, s, vt = np.linalg.svd(a)
    nnz = (s >= tol).sum()
    ns = vt[nnz:].conj().T
    shape = a.shape
    if shape[0]>shape[1]:
        u = u[:,:shape[1]]
    elif shape[1]>shape[0]:
        vt = vt[:shape[0]]
    cutoff = rcond[..., cr.newaxis] * cr.amax(s, axis=-1, keepdims=True)
    large = s > cutoff
    s = cr.divide(1, s, where=large, out=s)
    s[~large] = 0
    res = cr.matmul(cr.swapaxes(vt, -1, -2), cr.multiply(s[..., cr.newaxis],
                                                         cr.swapaxes(u, -1, -2)))
    return (wrap(res),ns)

cols = InputDataSet.columns
tolexp=2
graphs = nx.connected_component_subgraphs(nx.from_pandas_dataframe(InputDataSet,cols[0],
                                                                   cols[1]))
OutputDataSet = []
Eqs = InputDataSet[cols[0]].unique()
Vars = InputDataSet[cols[1]].unique()
for i in graphs:
    EqList = np.array([val for val in np.array(i.nodes) if val in Eqs])
    VarList = [val for val in np.array(i.nodes) if val in Vars]
    pinv,nulls = svd_lite(nx.bipartite.biadjacency_matrix(i,EqList,VarList,format='csc')
                          .astype(float).todense(),tol=10**-tolexp)
    df2 = np.where(~np.round(nulls,tolexp).any(axis=1))[0]
    df3 = np.round(np.array(pinv),tolexp)
    OutputDataSet.extend([[VarList[i],",".join(EqList[np.nonzero(df3[i])])] for i in df2])
OutputDataSet = pd.DataFrame(OutputDataSet)
Issues
On the data that I've tested this algorithm on, it performs pretty well with decent execution time. However, the main issue is that it suggests far too many equations as required to determine a given variable.
Often, with datasets of 10,000 equations, the algorithm will claim that 8,000 of those 10,000 are required to determine a given variable, which most definitely is not the case.
I tried raising the tolerance (what I round the coefficients in the pseudo-inverse) to .1, but even then, nearly 5000 equations had non-zero coefficients.
I had conjectured that perhaps the pseudo-inverse is collapsing upon a non-optimal set of coefficients, but the Moore-Penrose pseudoinverse is unique, so that isn't a possibility.
Am I doing something wrong here? Or is the approach I'm taking not going to give me what I desire?
Further Notes
All of the coefficients of all of the variables are 1
The results the current algorithm is producing are reliable ... When I multiply any vector of equation totals by the pseudoinverse generated by the algorithm, I get values essentially equal to those claimed to have a unique solution, which is promising.
What I want to know here is either whether I'm doing something wrong in how I'm extrapolating information from the pseudo-inverse, or whether my approach is completely wrong.
I apologize for not posting any actual results, but not only are they quite large, but they are somewhat unintuitive since they are reformatted into an XML which would probably take another question to explain anyways.
Thank you for your time!
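As a small sanity check of the approach described in the question, here is a sketch on the three-equation example (X, Y, Z over a, b, c, d). It uses np.linalg.svd and np.linalg.pinv directly instead of the fused svd_lite function, and the tolerance value is an assumption chosen for this toy case only.

```python
import numpy as np

# Rows are equations X, Y, Z; columns are variables a, b, c, d.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 1, 1, 1]], dtype=float)
equations = np.array(["X", "Y", "Z"])
variables = ["a", "b", "c", "d"]
tol = 1e-8  # hypothetical tolerance for this toy example

# Null space from the SVD: right singular vectors beyond the numerical rank.
u, s, vt = np.linalg.svd(A)
rank = int((s > tol).sum())
null_space = vt[rank:].T        # shape (n_variables, n_variables - rank)
pinv = np.linalg.pinv(A)        # shape (n_variables, n_equations)

# A variable has a unique solution when its null-space row is all zeros.
unique = ~(np.abs(null_space) > tol).any(axis=1)

# For each such variable, list the equations with non-zero pseudoinverse weight.
result = {
    variables[j]: ",".join(equations[np.abs(pinv[j]) > tol])
    for j in np.flatnonzero(unique)
}
print(result)
```

On this example the sketch reports c from X,Y and d from Y,Z, matching the expected output. One thing that may be relevant to the issue above: the pseudoinverse row for a determined variable is the minimum-norm combination of equations that isolates it, not the sparsest one, so with many redundant equations the weight can be spread over far more equations than are strictly needed.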

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my Tfidf data, which was generated from a column in my dataframe called "merged"
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TF-IDF matrix, I have a list of additional numeric features. Each list has length 40 and consists of floats.
To clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column to the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
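This will do the work (note that X.toarray() makes the matrix dense, about 5.5 GB as float64 at this shape):
df1 = pd.DataFrame(X.toarray())  # convert the sparse matrix to a dense array
# df2 is your DataFrame of size 57k x 40
newDf = pd.concat([df1, df2], axis=1)  # newDf is the required DataFrame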
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    row2 = list(map(float, row))  # list() is needed on Python 3, where map returns an iterator
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original tfidf sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
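Put together, the steps might look like the sketch below. It is a toy version under stated assumptions: a tiny hand-made sparse matrix stands in for the real TF-IDF output, and the other_data strings are invented, so only the shape of the workflow carries over.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins: two documents, two TF-IDF columns, three extra features each.
merged = pd.DataFrame({"other_data": ["0.1,0.2,0.3", "0.4,0.5,0.6"]})
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# Parse each comma-separated string into a row of floats.
extra = np.array([[float(v) for v in row.split(",")]
                  for row in merged["other_data"]])

# Stack the sparse and dense parts column-wise; convert back to CSR,
# because hstack returns COO format by default.
combined = sparse.hstack([X, sparse.csr_matrix(extra)]).tocsr()
print(combined.shape)
```

Staying sparse avoids materializing the full TF-IDF matrix in memory, which matters at the sizes mentioned in the question.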
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.

Create color histogram of an image using tensorflow

Is there a neat way to compute a color histogram of an image? Maybe by abusing the internal code of tf.histogram_summary? From what I've seen, this code is not very modular and calls directly some C++ code.
Thanks in advance.
I would use tf.unsorted_segment_sum, where the "segment IDs" are computed from the color values and the thing you sum is a tf.ones vector. Note that tf.unsorted_segment_sum is probably better thought of as "bucket sum". It implements dest[segment] += thing_to_sum -- exactly the operation you need for a histogram.
In rough pseudocode (meaning I haven't run this):
binned_values = tf.reshape(tf.floor(img_r * (NUM_BINS-1)), [-1])
binned_values = tf.cast(binned_values, tf.int32)
ones = tf.ones_like(binned_values, dtype=tf.int32)
counts = tf.unsorted_segment_sum(ones, binned_values, NUM_BINS)
You could accomplish this in one pass instead of separating out the r, g, and b values with a split if you wanted to cleverly construct your "ones" to look like "100100..." for red, "010010" for green, etc., but I suspect it would be slower overall, and harder to read. I'd just do the split that you proposed above.
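To make the "bucket sum" idea concrete, here is a NumPy analogue (not TensorFlow code) of the dest[segment] += thing_to_sum operation. It assumes the red channel is already scaled to [0, 1]; the tiny img_r array and NUM_BINS value are invented for the example.

```python
import numpy as np

NUM_BINS = 4
# Toy red channel, assumed to be scaled to [0, 1].
img_r = np.array([[0.0, 0.1],
                  [0.5, 1.0]])

# Same binning as the pseudocode: floor(value * (NUM_BINS - 1)), flattened.
binned = np.floor(img_r * (NUM_BINS - 1)).astype(np.int64).reshape(-1)

# Bucket sum: counts[binned[k]] += 1 for every pixel k.
counts = np.zeros(NUM_BINS, dtype=np.int64)
np.add.at(counts, binned, 1)
print(counts)
```

np.add.at is used rather than counts[binned] += 1 because the latter would only apply one increment per distinct index.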
This is what I'm using right now:
# Assumption: img is a tensor of the size [img_width, img_height, 3], normalized to the range [-1, 1].
with tf.variable_scope('color_hist_producer') as scope:
    bin_size = 0.2
    hist_entries = []
    # Split image into single channels
    # (pre-1.0 argument order; TensorFlow 1.0+ uses tf.split(img, 3, axis=2))
    img_r, img_g, img_b = tf.split(2, 3, img)
    for img_chan in [img_r, img_g, img_b]:
        for idx, i in enumerate(np.arange(-1, 1, bin_size)):
            gt = tf.greater(img_chan, i)
            leq = tf.less_equal(img_chan, i + bin_size)
            # Put together with logical_and, cast to float and sum up entries -> gives count for current bin.
            hist_entries.append(tf.reduce_sum(tf.cast(tf.logical_and(gt, leq), tf.float32)))
    # Pack scalars together to a tensor, then normalize histogram.
    # (tf.pack was renamed tf.stack in TensorFlow 1.0)
    hist = tf.nn.l2_normalize(tf.pack(hist_entries), 0)
tf.histogram_fixed_width might be what you are looking for.
Full documentation: https://www.tensorflow.org/api_docs/python/tf/histogram_fixed_width