Going from a numpy array to a pandas dataframe changes values

Consider the following code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

dog = np.random.rand(10, 10)
frog = pd.DataFrame(dog, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
slog = StandardScaler()
mog = slog.fit_transform(frog.values)
frog[frog.columns] = mog
OK, now we should have a dataframe whose values are the standard-scaled array. But:
frog.describe()
gives:
(screenshot of the frog.describe() output: every column has std 1.054093)
Note that the standard deviation is 1.05
While
np.std(mog, axis=0)
Gives the expected:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
What gives?

The standard deviation computed by the describe method is the sample standard deviation, while StandardScaler uses the population standard deviation. The only difference between the two is whether the sum of the squared differences from the mean is divided by n-1 (sample) or n (population).
numpy.std computes the population standard deviation by default, but you can get the sample standard deviation by passing ddof=1, and the result agrees with the values computed by describe:
In [54]: np.std(mog, axis=0, ddof=1)
Out[54]:
array([1.05409255, 1.05409255, 1.05409255, 1.05409255, 1.05409255,
1.05409255, 1.05409255, 1.05409255, 1.05409255, 1.05409255])
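As a quick sanity check, the ratio between the two estimates is sqrt(n/(n-1)), which for the 10 rows here is exactly the 1.054 that describe reports:
import numpy as np

n = 10
# sample std = population std * sqrt(n / (n - 1))
print(np.sqrt(n / (n - 1)))  # 1.0540925533894598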

How to do one hot encoding in tft using hardcoded values

I want to apply one-hot encoding to my categorical features. I see how one can use tf.one_hot to do that, but one_hot accepts indices, so I'd first need to map the tokens to indices. All of the examples I've found compute the vocab over the entire dataset; I don't want to do that, as I have a hard-coded dict of possible values. Something like:
CATEG = {
    'feature1': ['a', 'b', 'c'],
    'feature2': ['foo', 'bar'],
}
I just need the preprocessing_fn to map the tokens to indices and then run them through tf.one_hot. How can I do that?
For example, tft.apply_vocabulary sounds like what I need, but then I see that it takes a deferred_vocab_filename_tensor of type common_types.TemporaryAnalyzerOutputType. The description says:
The deferred vocab filename tensor as returned by tft.vocabulary, as long as the frequencies were not stored.
And I see that tft.vocabulary is again computing the vocab:
Computes the unique values taken by x, which can be a Tensor or CompositeTensor of any size. The unique values will be aggregated over all dimensions of x and all instances.
Why doesn't something simple like this exist?
The simplest option is probably to use tf.equal (via the == operator), as follows:
import tensorflow as tf

CATEG = {
    'feature1': ['a', 'b', 'c'],
    'feature2': ['foo', 'bar'],
}

tokens = tf.constant(CATEG['feature2'])
inputs = tf.constant(["foo", "foo", "bar", "none"])
# broadcast the (2, 1) token column against the (1, 4) input row;
# each row of the result is the indicator vector for one vocabulary token
onehot = tf.cast(tf.expand_dims(tokens, 1) == tf.expand_dims(inputs, 0), dtype=tf.float32)
print(onehot)
# [[1., 1., 0., 0.],
#  [0., 0., 1., 0.]]
Add batch dims if needed.
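If you'd rather actually map tokens to indices and go through tf.one_hot, as the question describes, here is a minimal sketch using a static lookup table (assuming TF 2.x eager execution; the table construction is my own illustration, not a tft API):
import tensorflow as tf

tokens = ['foo', 'bar']
# build a token -> index table from the hard-coded vocabulary
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(tokens),
        values=tf.range(len(tokens), dtype=tf.int64)),
    default_value=-1)  # unknown tokens map to -1

inputs = tf.constant(["foo", "foo", "bar", "none"])
indices = table.lookup(inputs)                   # [0, 0, 1, -1]
onehot = tf.one_hot(indices, depth=len(tokens))  # the -1 row comes out all zeros
print(onehot)  # shape (4, 2), one row per input token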

Saving and loading a custom dtype to/from a text file with numpy

I introduce my own data type and want to furnish it with save/load functions that operate on text files, but I fail to provide a proper fmt string to numpy.savetxt(). The problem arises because one of the fields of my dtype is a tuple (two floats in the naive example below), which I think effectively results in an attempt to save a 3D object with savetxt().
It can be made to work only by saving the pair of floats as "%s" (but then I cannot loadtxt() them; variant 1 in the code) or by introducing an inefficient my_repr() function (variant 2 below).
I cannot believe that numpy does not provide an efficient formatter/save/load API for custom types. Does anyone have an idea of how to solve this nicely?
import numpy as np

def main():
    my_type = np.dtype([('single_int', np.int64),  # np.int is deprecated; np.int64 matches '<i8'
                        ('two_floats', np.float64, (2,))])
    my_var = np.array([(1, (2., 3.)),
                       (4, (5., 6.))],
                      dtype=my_type)

    # Verification
    print(my_var)
    print(my_var['two_floats'])

    # Let's try to save and load it in three variants
    variant = 2
    if variant == 0:
        # the line below does not work: "ValueError: fmt has wrong number of % formats: %d %f %f"
        np.savetxt('f.txt', my_var, fmt='%d %f %f')
        # so I don't even try to load
    elif variant == 1:
        # the line below does work, but saves the floats between '[]', which makes them not loadable later
        np.savetxt('f.txt', my_var, fmt='%d %s')
        # lines such as "1 [2. 3.]" won't load; the line below raises an exception
        my_var_loaded = np.loadtxt('f.txt', dtype=my_type)
    elif variant == 2:
        # an ugly workaround:
        def my_repr(o):
            return [(elem['single_int'], *elem['two_floats']) for elem in o]
        # and then the rest works fine:
        np.savetxt('f.txt', my_repr(my_var), fmt='%d %f %f')
        my_var_loaded = np.loadtxt('f.txt', dtype=my_type)
        print('my_var_loaded')
        print(my_var_loaded)

if __name__ == '__main__':
    main()
In [115]: my_type = np.dtype([('single_int', np.int64),
...: ('two_floats', np.float64, (2,))])
In [116]: my_var = np.array( [(1, (2., 3.)),
...: (4, (5., 6.))
...: ],
...: dtype=my_type)
In [117]: my_var
Out[117]:
array([(1, [2., 3.]), (4, [5., 6.])],
dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])
Jumping straight to the loading step:
In [118]: txt = """1 2. 3.
...: 4 5. 6."""
In [119]: np.genfromtxt(txt.splitlines(), dtype=my_type)
Out[119]:
array([(1, [2., 3.]), (4, [5., 6.])],
dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])
As I commented, savetxt is essentially doing:
for row in my_var:
    f.write(fmt % tuple(row))
So we have to, one way or another, work with or around basic Python % formatting. Either that, or write our own text file. There's nothing magical about savetxt; it's plain Python.
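For instance, a minimal sketch of "writing our own text file", flattening each record by hand exactly as the my_repr workaround does:
# flatten each structured record into "int float float" by hand
with open('f.txt', 'w') as f:
    for row in my_var:
        f.write('%d %f %f\n' % (row['single_int'], *row['two_floats']))
# f.txt now holds lines like "1 2.000000 3.000000", which genfromtxt can read back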
===
Recent numpy versions include a function to 'flatten' a structured array:
In [120]: import numpy.lib.recfunctions as rf
In [121]: arr = rf.structured_to_unstructured(my_var)
In [122]: arr
Out[122]:
array([[1., 2., 3.],
[4., 5., 6.]])
In [123]: np.savetxt('test.csv', arr, fmt='%d %f %f')
In [124]: cat test.csv
1 2.000000 3.000000
4 5.000000 6.000000
In [125]: np.genfromtxt('test.csv', dtype=my_type)
Out[125]:
array([(1, [2., 3.]), (4, [5., 6.])],
dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])
edit
Saving an object dtype array gets around a lot of the formatting issues:
In [182]: my_var
Out[182]:
array([(1, [2., 3.]), (4, [5., 6.])],
dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])
In [183]: def my_repr(o):
...: return [(elem['single_int'], *elem['two_floats']) for elem in o]
...:
In [184]: my_repr(my_var)
Out[184]: [(1, 2.0, 3.0), (4, 5.0, 6.0)]
In [185]: np.array(_,object)
Out[185]:
array([[1, 2.0, 3.0],
[4, 5.0, 6.0]], dtype=object)
In [186]: np.savetxt('f.txt', _, fmt='%d %f %f')
In [187]: cat f.txt
1 2.000000 3.000000
4 5.000000 6.000000
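And that file round-trips back into the structured dtype the same way as before:
my_var_loaded = np.genfromtxt('f.txt', dtype=my_type)
# array([(1, [2., 3.]), (4, [5., 6.])],
#       dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])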

How can I efficiently "stretch" present values in an array over absent ones

Where 'absent' can mean either nan or np.masked, whichever is easiest to implement this with.
For instance:
>>> from numpy import nan
>>> do_it([1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan])
array([1, 1, 1, 2, 2, 3, 3, 3, 4, 3, 3, 2, 2])
# each nan is replaced with the first non-nan value before it
>>> do_it([nan, nan, 2, nan])
array([nan, nan, 2, 2])
# don't care too much about the outcome here, but this seems sensible
I can see how you'd do this with a for loop:
def do_it(a):
    res = []
    last_val = nan
    for item in a:
        if not np.isnan(item):
            last_val = item
        res.append(last_val)
    return np.asarray(res)
Is there a faster way to vectorize it?
Assuming there are no zeros or negative values in your data (numpy.nan_to_num turns each NaN into 0, so a 0 or negative value in the data would confuse the running maximum):
b = numpy.maximum.accumulate(numpy.nan_to_num(a))
# array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 4.])
mask = numpy.isnan(a)
a[mask] = b[mask]
# array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 3.])
EDIT: As pointed out by Eric, an even better solution is to replace nans with -inf:
mask = numpy.isnan(a)
a[mask] = -numpy.inf
b = numpy.maximum.accumulate(a)
a[mask] = b[mask]
Cumsumming over an array of flags provides a good way to determine which numbers to write over the NaNs:
def do_it(x):
    x = np.asarray(x)
    is_valid = ~np.isnan(x)
    is_valid[0] = True
    valid_elems = x[is_valid]
    replacement_indices = is_valid.cumsum() - 1
    return valid_elems[replacement_indices]
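Running this on the question's first example reproduces the expected output:
import numpy as np
from numpy import nan

print(do_it([1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan]))
# [1. 1. 1. 2. 2. 3. 3. 3. 4. 3. 3. 2. 2.]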
Working from @Benjamin's deleted solution, everything is great if you work with indices:
def do_it(data, valid=None, axis=0):
    # normalize the inputs to match the question examples
    data = np.asarray(data)
    if valid is None:
        valid = ~np.isnan(data)
    # flat array of the data values
    data_flat = data.ravel()
    # array of indices such that data_flat[indices] == data
    indices = np.arange(data.size).reshape(data.shape)
    # thanks to @Benjamin here
    stretched_indices = np.maximum.accumulate(valid*indices, axis=axis)
    return data_flat[stretched_indices]
Comparing solution runtime:
>>> import numpy as np
>>> data = np.random.rand(10000)
>>> %timeit do_it_question(data)
10000 loops, best of 3: 17.3 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 179 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 182 µs per loop
# with lots of nans
>>> data[data > 0.25] = np.nan
>>> %timeit do_it_question(data)
10000 loops, best of 3: 18.9 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 177 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 231 µs per loop
So both this and @user2357112's solution blow the solution in the question out of the water, but this one has a slight edge over @user2357112's when there are a lot of NaNs.

Get specific elements from ndarray of ndarrays with shape (n,)

Given the ndarray:
A = np.array([np.array([1], dtype='f'),
              np.array([2, 3], dtype='f'),
              np.array([4, 5], dtype='f'),
              np.array([6], dtype='f'),
              np.array([7, 8, 9], dtype='f')],
             dtype=object)  # dtype=object is required for ragged input on recent numpy
which displays as:
A
array([array([ 1.], dtype=float32), array([ 2., 3.], dtype=float32),
array([ 4., 5.], dtype=float32), array([ 6.], dtype=float32),
array([ 7., 8., 9.], dtype=float32)], dtype=object)
I am trying to create a new array from the first element of each "sub-array" of A. To show you what I mean, below is some code creating the array I want, using a loop. I would like to achieve the same thing as efficiently as possible, since my array A is quite large (~50000 entries) and I need to do the operation many times.
B = np.zeros(len(A))
for i, val in enumerate(A):
    B[i] = val[0]
B
array([ 1., 2., 4., 6., 7.])
Here's an approach that concatenates all elements into a 1D array and then selects the first elements by linear indexing. The implementation would look like this:
lens = np.array([len(item) for item in A])
# np.append(0, lens[:-1].cumsum()) gives the position where each sub-array starts
out = np.concatenate(A)[np.append(0, lens[:-1].cumsum())]
The bottleneck would be the concatenation part, but that might be offset if there are a huge number of elements with small lengths. So the efficiency would depend on the format of the input array.
I suggest transforming your original jagged array of arrays into a single masked array:
B = np.ma.masked_all((len(A), max(map(len, A))))
for ii, row in enumerate(A):
    B[ii, :len(row)] = row
Now you have:
[[1.0 -- --]
[2.0 3.0 --]
[4.0 5.0 --]
[6.0 -- --]
[7.0 8.0 9.0]]
And you can get the first column this way:
B[:,0].data
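Since the question mentions doing the operation many times, the payoff is that B is built once and every later extraction is a plain slice, e.g.:
first = B[:, 0].data   # array([1., 2., 4., 6., 7.])
second = B[:, 1]       # second elements, masked where a row is too short
print(second)          # [-- 3.0 5.0 -- 8.0]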

Disable numpy fancy indexing and assignment?

This post identifies a "feature" that I would like to disable.
Current numpy behavior:
>>> from numpy import *
>>> a = arange(10)
>>> a[a>5] = arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3])
The reason it's a problem: say I wanted an array to have two different sets of values on either side of a breakpoint (e.g., for making a "broken power-law" or some other simple piecewise function). I might accidentally do something like this:
>>> x = empty(10)
>>> a = arange(10)
>>> x[a<=5] = 0 # this is fine
>>> x[a>5] = a**2 # this is not
# but what I really meant is this
>>> x[a>5] = a[a>5]**2
The first behavior, x[a>5] = a**2, yields something I would consider counterintuitive: the left-hand and right-hand shapes disagree and the right side is not a scalar, but numpy lets the assignment through anyway. As pointed out in the other post, x[5:] = a**2 is not allowed.
So, my question: is there any way to make x[a>5] = a**2 raise an Exception instead of performing the assignment? I'm very worried that I have typos hiding in my code because I never before suspected this behavior.
I don't know of a way offhand to disable a core numpy feature. (For what it's worth, recent numpy versions raise a ValueError for this size-mismatched boolean assignment, so newer releases effectively disable it for you.) Instead of disabling the behavior, you could try using np.select:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html
In [110]: x = np.empty(10)
In [111]: a = np.arange(10)
In [112]: x[a<=5] = 0
In [113]: x[a>5] = a**2
In [114]: x
Out[114]: array([ 0., 0., 0., 0., 0., 0., 0., 1., 4., 9.])
In [117]: condlist = [a<=5,a>5]
In [119]: choicelist=[0,a**2]
In [120]: x = np.select(condlist,choicelist)
In [121]: x
Out[121]: array([ 0, 0, 0, 0, 0, 0, 36, 49, 64, 81])
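If the two-branch piecewise case is all you need, np.where gives the same shape-safety with lighter syntax; a quick sketch:
import numpy as np

a = np.arange(10)
# both branches are evaluated elementwise over aligned shapes, so the silent
# size mismatch of x[a>5] = a**2 simply cannot happen here
x = np.where(a <= 5, 0, a**2)
print(x)  # [ 0  0  0  0  0  0 36 49 64 81]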