how tf.contrib.layers.bucketized_column works? - tensorflow

I am trying to figure out how to use
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
addressed in https://www.tensorflow.org/tutorials/wide.

One example usage defined as a comment in: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/feature_column.py
Copying from that file:
age_column = real_valued_column("age")
bucketized_age_column = bucketized_column(
source_column=age_column,
boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
first_layer = input_from_feature_columns(...,
feature_columns=[..., bucketized_age_column])
second_layer = fully_connected(first_layer, ...)
# For linear models
predicitions, _, _ = weighted_sum_from_feature_columns(...,
feature_columns=[..., bucketized_age_column])

Related

Feeding Word Embedding Matrix into a Pytorch LSTM Model

I have a LSTM model I am using to predict the unemployment rate from federal reserve filings. It uses glove vectors and vocab2index embedding and the training went as planned. However, upon attempting to feed a word embedding into the model for prediction testing it keeps throwing various errors.
Here is the model:
def load_glove_vectors(glove_file= glove_embedding_vectors_text_file):
"""Load the glove word vectors"""
word_vectors = {}
with open(glove_file) as f:
for line in f:
split = line.split()
word_vectors[split[0]] = np.array([float(x) for x in split[1:]])
return word_vectors
def get_emb_matrix(pretrained, word_counts, emb_size = 300):
""" Creates embedding matrix from word vectors"""
vocab_size = len(word_counts) + 2
vocab_to_idx = {}
vocab = ["", "UNK"]
W = np.zeros((vocab_size, emb_size), dtype="float32")
W[0] = np.zeros(emb_size, dtype='float32') # adding a vector for padding
W[1] = np.random.uniform(-0.25, 0.25, emb_size) # adding a vector for unknown words
vocab_to_idx["UNK"] = 1
i = 2
for word in word_counts:
if word in word_vecs:
W[i] = word_vecs[word]
else:
W[i] = np.random.uniform(-0.25,0.25, emb_size)
vocab_to_idx[word] = i
vocab.append(word)
i += 1
return W, np.array(vocab), vocab_to_idx
word_vecs = load_glove_vectors()
pretrained_weights, vocab, vocab2index = get_emb_matrix(word_vecs, counts)
Unfortunately when I feed this array
[array([ 3, 10, 6287, 6, 113, 271, 3, 6639, 104, 5105, 7525,
104, 7526, 9, 23, 9, 10, 11, 24, 7527, 7528, 104,
11, 24, 7529, 7530, 104, 11, 24, 7531, 7530, 104, 11,
24, 7532, 7530, 104, 11, 24, 7533, 7534, 24, 7535, 7536,
104, 7537, 104, 7538, 7539, 7540, 6643, 7541, 7354, 7542, 7543,
7544, 9, 23, 9, 10, 11, 24, 25, 8, 10, 11,
24, 3, 10, 663, 168, 9, 10, 290, 291, 3, 4909,
198, 10, 1478, 169, 15, 4621, 3, 3244, 3, 59, 1967,
113, 59, 520, 198, 25, 5105, 7545, 7546, 7547, 7546, 7548,
7549, 7550, 1874, 10, 7551, 9, 10, 11, 24, 7552, 6287,
7553, 7554, 7555, 24, 7556, 24, 7557, 7558, 7559, 6, 7560,
323, 169, 10, 7561, 1432, 6, 3134, 3, 7562, 6, 7563,
1862, 7144, 741, 3, 3961, 7564, 7565, 520, 7566, 4833, 7567,
7568, 4901, 7569, 7570, 4901, 7571, 1874, 7572, 12, 13, 7573,
10, 7574, 7575, 59, 7576, 59, 638, 1620, 7577, 271, 6488,
59, 7578, 7579, 7580, 7581, 271, 7582, 7583, 24, 669, 5932,
7584, 9, 113, 271, 3764, 3, 5930, 3, 59, 4901, 7585,
793, 7586, 7587, 6, 1482, 520, 7588, 520, 7589, 3246, 7590,
13, 7591])
into torch.LongTensor() I keep getting the following error:
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
Any ideas on how to remedy? I am fairly new to AI in general, and I am an economist by trade so I am almost certain I have made a boneheaded error.

Replace values by indices of corresponding to another array

I am a newbie in numpy. I have an array A of size [x,] of values and an array B of size [y,] (y > x). I want as result an array C of size [x,] filled with indices of B.
Here is an example of inputs and outputs:
>>> A = [10, 20, 30, 10, 40, 50, 10, 50, 20]
>>> B = [10, 20, 30, 40, 50]
>>> C = #Some operations
>>> C
[0, 1, 2, 0, 3, 4, 0, 4, 1]
I didn't find the way how to do this. Please advice me. Thank you.
I think you are looking for searchsorted, assuming that B is sorted increasingly:
C = np.searchsorted(B,A)
Output:
array([0, 1, 2, 0, 3, 4, 0, 4, 1])
Update for general situation where B is not sorted. We can do an argsort:
# let's swap 40 and 50 in B
# expect the output to have 3 and 4 swapped
B = [10, 20, 30, 50, 40]
BB = np.sort(B)
C = np.argsort(B)[np.searchsorted(BB,A)]
Output:
array([0, 1, 2, 0, 4, 3, 0, 3, 1], dtype=int64)
You can double check:
(np.array(B)[C] == A).all()
# True
For general python lists
A = [10, 20, 30, 10, 40, 50, 10, 50, 20]
B = [10, 20, 30, 40, 50]
C = [A.index(e) for e in A if e in B]
print(C)
You can try this code
A = np.array([10, 20, 30, 10, 40, 50, 10, 50, 20])
B = np.array([10, 20, 30, 40, 50])
np.argmax(B==A[:,None],axis=1)

I need to filter the column from the beginning of a sentence

In my code, I can filter a column from exact texts, and it works without problems. However, it is necessary to filter another column with the beginning of a sentence.
The phrases in this column are:
A_2020.092222
A_2020.090787
B_2020.983898
B_2020.209308
So, I need to receive everything that starts with A_20 and B_20.
Thanks in advance
My code:
from bs4 import BeautifulSoup
import pandas as pd
import zipfile, urllib.request, shutil, time, csv, datetime, os, sys, os.path
#location
dt = datetime.datetime.now()
file_csv = "/home/Downloads/source.CSV"
file_csv_new = "/var/www/html/Data/Test.csv"
#open CSV
with open(file_csv, 'r', encoding='CP1251') as file:
reader = csv.reader(file, delimiter=';')
data = list(reader)
#list to dataframe
df = pd.DataFrame(data)
#filter UF
df = df.loc[df[9].isin(['PR','SC','RS'])]
#filter key
# A_ & B_
df = df.loc[df[35].isin(['A_20','B_20'])]
#print (df)
#Empty DataFrame
#Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
#Index: []
#[0 rows x 119 columns]```
Give the following a try:
lst1 = ['A_2020.092222', 'A_2020.090787 ', 'B_2020.983898', 'B_2020.209308', 'C_2020.209308', 'D_2020.209308']
df = pd.DataFrame(lst1, columns =['Name'])
df.loc[df.Name.str.startswith(('A_20','B_20'))]

Pymc size / indexing issue

I am trying to model Kruschke's "filtration-condensation experiment" with pymc 2.3.5. (numpy 1.10.1)
Basicaly there are:
4 groups
each group has 40 individuals
each individual has 64 Bernoulli trials (correct/incorrect)
What I am modeling:
each individual's results are Binomial distribution (e.g. 45 correct out of 64).
my belief about each individual's performance is Beta distribution.
this Beta distribution is influenced by group to which individual belongs (through parameters A=mu*kappa and B=(1-mu)*kappa)
my belief about how strong each group's influence is Gamma distribution (kappa variable)
my belief about each group's average is Beta distribution (mu variable)
The problem:
when I do modeling with "size=" parameters, pymc get's lost
when I seperate each distribution manually (no size=) the pymc does good job
I include the code below:
Data
import numpy as np
import seaborn as sns
import pymc as pm
from pymc.Matplot import plot as mcplot
%matplotlib inline
# Data
ncond = 4
nSubj = 40
trials = 64
N = np.repeat([trials], (ncond * nSubj))
z = np.array([45, 63, 58, 64, 58, 63, 51, 60, 59, 47, 63, 61, 60, 51, 59, 45,
61, 59, 60, 58, 63, 56, 63, 64, 64, 60, 64, 62, 49, 64, 64, 58, 64, 52, 64, 64,
64, 62, 64, 61, 59, 59, 55, 62, 51, 58, 55, 54, 59, 57, 58, 60, 54, 42, 59, 57,
59, 53, 53, 42, 59, 57, 29, 36, 51, 64, 60, 54, 54, 38, 61, 60, 61, 60, 62, 55,
38, 43, 58, 60, 44, 44, 32, 56, 43, 36, 38, 48, 32, 40, 40, 34, 45, 42, 41, 32,
48, 36, 29, 37, 53, 55, 50, 47, 46, 44, 50, 56, 58, 42, 58, 54, 57, 54, 51, 49,
52, 51, 49, 51, 46, 46, 42, 49, 46, 56, 42, 53, 55, 51, 55, 49, 53, 55, 40, 46,
56, 47, 54, 54, 42, 34, 35, 41, 48, 46, 39, 55, 30, 49, 27, 51, 41, 36, 45, 41,
53, 32, 43, 33])
condition = np.repeat([0,1,2,3], nSubj)
Does not work
# modeling
mu = pm.Beta('mu', 1, 1, size=ncond)
kappa = pm.Gamma('gamma', 1, 0.1, size=ncond)
# Prior
theta = pm.Beta('theta', mu[condition] * kappa[condition], (1 - mu[condition]) * kappa[condition], size=len(z))
# likelihood
y = pm.Binomial('y', p=theta, n=N, value=z, observed=True)
# model
model = pm.Model([mu, kappa, theta, y])
mcmc = pm.MCMC(model)
#mcmc.use_step_method(pm.Metropolis, mu)
#mcmc.use_step_method(pm.Metropolis, theta)
#mcmc.assign_step_methods()
mcmc.sample(100000, burn=20000, thin=3)
# outputs never converge and does vary in new simulations
mcplot(mcmc.trace('mu'), common_scale=False)
Works
z1 = z[:40]
z2 = z[40:80]
z3 = z[80:120]
z4 = z[120:]
Nv = N[:40]
mu1 = pm.Beta('mu1', 1, 1)
mu2 = pm.Beta('mu2', 1, 1)
mu3 = pm.Beta('mu3', 1, 1)
mu4 = pm.Beta('mu4', 1, 1)
kappa1 = pm.Gamma('gamma1', 1, 0.1)
kappa2 = pm.Gamma('gamma2', 1, 0.1)
kappa3 = pm.Gamma('gamma3', 1, 0.1)
kappa4 = pm.Gamma('gamma4', 1, 0.1)
# Prior
theta1 = pm.Beta('theta1', mu1 * kappa1, (1 - mu1) * kappa1, size=len(Nv))
theta2 = pm.Beta('theta2', mu2 * kappa2, (1 - mu2) * kappa2, size=len(Nv))
theta3 = pm.Beta('theta3', mu3 * kappa3, (1 - mu3) * kappa3, size=len(Nv))
theta4 = pm.Beta('theta4', mu4 * kappa4, (1 - mu4) * kappa4, size=len(Nv))
# likelihood
y1 = pm.Binomial('y1', p=theta1, n=Nv, value=z1, observed=True)
y2 = pm.Binomial('y2', p=theta2, n=Nv, value=z2, observed=True)
y3 = pm.Binomial('y3', p=theta3, n=Nv, value=z3, observed=True)
y4 = pm.Binomial('y4', p=theta4, n=Nv, value=z4, observed=True)
# model
model = pm.Model([mu1, kappa1, theta1, y1, mu2, kappa2, theta2, y2,
mu3, kappa3, theta3, y3, mu4, kappa4, theta4, y4])
mcmc = pm.MCMC(model)
#mcmc.use_step_method(pm.Metropolis, mu)
#mcmc.use_step_method(pm.Metropolis, theta)
#mcmc.assign_step_methods()
mcmc.sample(100000, burn=20000, thin=3)
# outputs converge and are not too much different in every simulation
mcplot(mcmc.trace('mu1'), common_scale=False)
mcplot(mcmc.trace('mu2'), common_scale=False)
mcplot(mcmc.trace('mu3'), common_scale=False)
mcplot(mcmc.trace('mu4'), common_scale=False)
mcmc.summary()
Can someone please explain it to me why mu[condition] and gamma[condition] does not work? :)
I guess that not splitting thetas into different variables is the problem but cannot understand why and maybe there is a way to pass a shape parameter to size= on theta?
First of all, I can confirm that the first version doesn't lead to stable results. What I can't confirm is that the second one is much better; I have seen very different results also with the second code, with values for the first mu parameter varying between 0.17 and 0.9 for different runs.
The convergence problems can be cured by using good starting values for the Markov chain. This can be done by first doing a maximum a posteriori (MAP) estimate, and then starting the Markov chain from there. The MAP step is computationally inexpensive and leads to a converging Markov chain with reproducible results for both variants of your code. For reference and comparison: The values I see for the four mu parameters are around 0.94 / 0.86 / 0.72 and 0.71.
You can do the MAP estimation by inserting the following two lines of code right after the line in which you define your model with "model=pm.Model(...":
map_ = pm.MAP(model)
map_.fit()
This technique is covered in more detail in Cameron Davidson-Pilon's Bayesian Methods for Hackers, together with other helpful topics around PyMC.

Why extreme large value to 0 frequency fft (numpy.fft.fft method)

I have a signal ts which has rougly mean 40 and applied fft on that with code
ts = array([25, 40, 30, 40, 29, 48, 36, 32, 34, 38, 15, 33, 40, 32, 41, 25, 37,49, 41, 35, 23, 22, 36, 44, 28, 36, 32, 37, 39, 51])
index = fftshift(fftfreq(len(ts)))
ft_ts =fftshift(fft(ts))
output
ft_ts = array([ -76.00000000 +8.34887715e-14j, -57.72501110 +1.17054586e+01j,
7.69492662 +9.79582336e+00j, -29.11145618 -7.22493645e+00j,
14.92140414 +4.58471353e+01j, -26.00000000 -4.67653718e+01j,
-39.61803399 -2.83601821e+01j, -11.34044003 +8.66215368e+00j,
23.68703939 +1.57391882e+01j, -64.88854382 -2.44499549e+01j,
50.00000000 -3.98371686e+01j, 4.09382150 -6.27663403e+00j,
-37.38196601 -3.06708342e+01j, 35.97162964 +1.31929223e+01j,
18.69662985 -2.20453671e+00j, 1048.00000000 +0.00000000e+00j,
18.69662985 +2.20453671e+00j, 35.97162964 -1.31929223e+01j,
-37.38196601 +3.06708342e+01j, 4.09382150 +6.27663403e+00j,
50.00000000 +3.98371686e+01j, -64.88854382 +2.44499549e+01j,
23.68703939 -1.57391882e+01j, -11.34044003 -8.66215368e+00j,
-39.61803399 +2.83601821e+01j, -26.00000000 +4.67653718e+01j,
14.92140414 -4.58471353e+01j, -29.11145618 +7.22493645e+00j,
7.69492662 -9.79582336e+00j, -57.72501110 -1.17054586e+01j])
at 0 frequency ft_ts has value of 1048. Shouldn't that be the mean of my original signal ts which is 40 ? What happened here ?
Many thanks
The FFT is not normalized, so the first term should be the sum, not the mean.
For example, see the definition here
and you can see, that when k=0, the exponential term is 1, and you'll just get the sum of x_n.
This is why the first item in fft(np.ones(10)) is 10, not 1. 1 is the mean (since it's an array of ones), and 10 is the sum.