What does `replace` in `pandas.DataFrame.sample()` do? [duplicate]

This page explains the function numpy.random.choice. However, I am confused about the third parameter, replace. What does it do, and in which cases is it useful? Thanks!

It controls whether each drawn sample is returned to the sample pool before the next draw. If you want only unique samples, this should be False.

You can use it when you want to sample some elements from a list without repeats; in that case, set replace=False.
e.g.
from numpy import random as rd
ary = list(range(10))
# usage
In [18]: rd.choice(ary, size=8, replace=False)
Out[18]: array([0, 5, 9, 8, 2, 1, 6, 3])  # no repeated elements
In [19]: rd.choice(ary, size=8, replace=True)
Out[19]: array([4, 9, 8, 5, 4, 1, 1, 9])  # elements may be repeated
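The duplicate target, pandas.DataFrame.sample(), exposes the same parameter with the same meaning; a minimal sketch (df here is just a throwaway example frame):
import pandas as pd

df = pd.DataFrame({"a": range(5)})

df.sample(n=3, replace=False)   # three distinct rows (the default behaviour)
df.sample(n=10, replace=True)   # rows may repeat, so n can exceed len(df)
With replace=False, asking for more rows than the frame has raises a ValueError.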

Related

Numpy subarrays and relative indexing

I have been searching for a standard method to create a subarray using relative indices. Take the following array into consideration:
>>> m = np.arange(25).reshape([5, 5])
>>> m
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
I want to access the 3x3 matrix at a specific array position, for example [2,2]:
>>> x, y = 2, 2
>>> m[slice(x-1, x+2), slice(y-1, y+2)]
array([[ 6,  7,  8],
       [11, 12, 13],
       [16, 17, 18]])
For the above, I would like something like m.subarray(pos=[2, 2], shape=[3, 3]).
I want to sample an ndarray of n dimensions at a specific position, which might change.
I did not want to use a loop, as it might be inefficient. The SciPy functions correlate and convolve do this very efficiently, but for all positions; I am interested in sampling only one.
Ideally the answer would also handle the edges; in my case I would like, for example, to have wrap mode:
(a b c d | a b c d | a b c d)
EDIT:
Based on the answer from @Carlos Horn, I could create the following function.
import numpy as np
from math import floor, ceil

def cell_neighbours(array, index, shape):
    # pad by half the window on each side so edge windows wrap around
    pads = [(floor(dim / 2), ceil(dim / 2)) for dim in shape]
    array = np.pad(array, pads, "wrap")
    views = np.lib.stride_tricks.sliding_window_view
    return views(array, shape)[tuple(index)]
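A quick sanity check of this function on the 5x5 example from the question:
m = np.arange(25).reshape(5, 5)
print(cell_neighbours(m, (2, 2), (3, 3)))
# [[ 6  7  8]
#  [11 12 13]
#  [16 17 18]]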
My last concern is speed; from the docs: "For many applications using a sliding window view can be convenient, but potentially very slow. Often specialized solutions exist."
Starting from this, it may be easier to find a faster solution.
You could build a view of 3x3 matrices into the array as follows:
import numpy as np
m = np.arange(25).reshape(5,5)
m3x3view = np.lib.stride_tricks.sliding_window_view(m, (3,3))
Note that this shifts your indexing by half the window size, meaning
x_view = x - 3//2
y_view = y - 3//2
print(m3x3view[x_view,y_view]) # gives your result
In case a copy operation is fine, you could use:
mpad = np.pad(m, 1, mode="wrap")
mpad3x3view = np.lib.stride_tricks.sliding_window_view(mpad, (3,3))
print(mpad3x3view[x % 5,y % 5])
to use arbitrary x, y integer values.

Assign numpy matrix to pandas columns

I have a dataframe with 48870 rows and calculated embeddings with shape (48870, 768).
I want to assign these embeddings to a pandas column.
When I try
test['original_text_embeddings'] = embeddings
I get the error: Wrong number of items passed 768, placement implies 1
I know that something like df.loc['original_text_embeddings'] = embeddings[0] will work, but I need to automate this process.
A dataframe/column needs a 1d list/array:
In [84]: x = np.arange(12).reshape(3,4)
In [85]: pd.Series(x)
...
ValueError: Data must be 1-dimensional
Splitting the array into a list (of arrays):
In [86]: pd.Series(list(x))
Out[86]:
0     [0, 1, 2, 3]
1     [4, 5, 6, 7]
2    [8, 9, 10, 11]
dtype: object
In [87]: _.to_numpy()
Out[87]:
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])],
dtype=object)
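Applied to the question's variables (assuming test and embeddings are as described there), this suggests wrapping the rows in a list so the column receives one object per row:
# each of the 48870 rows of the (48870, 768) array becomes one list element
test['original_text_embeddings'] = list(embeddings)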
Your embeddings have 768 columns, which would translate to 768 columns in a data frame. You are trying to assign all of those columns to just one column of the data frame, which is not possible.
What you could do instead is generate a new data frame from the embeddings and concatenate the test df with the embedding df:
embedding_df = pd.DataFrame(embeddings)
test = pd.concat([test, embedding_df], axis=1)
Have a look at the documentation for handling indexes and concatenating on different axes:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
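One caveat: pd.concat aligns on the index, so if test does not have a default RangeIndex the embedding columns can end up misaligned or full of NaNs. A sketch of a safer variant (the emb_ column prefix is just an illustrative choice):
embedding_df = pd.DataFrame(embeddings).add_prefix('emb_')
test = pd.concat([test.reset_index(drop=True), embedding_df], axis=1)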

Why does numpy.delete always delete one extra element in my test?

I want to use numpy.delete to delete certain elements in an array:
import numpy as np
aa = np.array([1,2,3,4,5,6,7,8])
bb = np.array([0,0,0,0,0,0,0,0], dtype='bool')
np.delete(aa,bb)
gives me the result:
array([2, 3, 4, 5, 6, 7, 8])
I expected a result like this:
array([1, 2, 3, 4, 5, 6, 7, 8])
And if I change bb to
bb = np.array([1,0,0,0,0,0,0,0], dtype='bool')
I get:
np.delete(aa, bb)
array([3, 4, 5, 6, 7, 8])
The code is simple, but I do not understand why numpy.delete behaves like this. Any explanations?
When I paste your code into a REPL, I get the intended output. I am using NumPy v1.19.4 and Python 3.8.5. Check if there is an update for NumPy, and make sure that you are not doing any operations afterwards that may remove the first item in the array.
np.delete uses an array of ints as the indices to remove. If you pass bools, they are converted to ints (False = 0, True = 1). So in your first example you are saying "remove the value at index 0", and in the second example you have a 1 and a 0, so it removes the values at indexes 1 and 0.
In the future, numpy will not cast the booleans to integers:
FutureWarning: in the future insert will treat boolean arrays and array-likes as boolean index instead of casting it to integer
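If the intent is mask semantics, i.e. drop the elements flagged True, a version-independent sketch avoids the cast entirely:
import numpy as np

aa = np.array([1, 2, 3, 4, 5, 6, 7, 8])
bb = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype='bool')

aa[~bb]                            # boolean masking: keep where bb is False
np.delete(aa, np.flatnonzero(bb))  # delete by explicit integer indices
# both give array([2, 3, 4, 5, 6, 7, 8])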

Slice a numpy array based on an interval?

Is it possible to systematically slice a 1d array of length m by an interval n in numpy? Say I have a list of 1000 values; could I break that into 10 lists of 100 values easily?
You can use both np.array_split() and np.split(), which are in fact the same except for one difference noted in the np.array_split() documentation.
From the documentation:
x = np.arange(8.0)
np.array_split(x, 3)
#Result
[array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7.])]
Split an array into multiple sub-arrays.
Please refer to the split documentation. The only difference between these functions is that array_split allows indices_or_sections to be an integer that does not equally divide the axis. For an array of length l that should be split into n sections, it returns l % n sub-arrays of size l//n + 1 and the rest of size l//n.
array_split also allows one to split with unequal spacing, should this ever meet your needs:
ar = np.arange(0, 20, dtype='int')
s = [2, 7, 12, 17]
np.array_split(ar, s)
Out[80]:
[array([0, 1]),
 array([2, 3, 4, 5, 6]),
 array([ 7,  8,  9, 10, 11]),
 array([12, 13, 14, 15, 16]),
 array([17, 18, 19])]
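For the literal case in the question, 1000 values into 10 equal chunks, a minimal sketch:
import numpy as np

arr = np.arange(1000)
chunks = np.split(arr, 10)    # np.split requires an even division; otherwise use np.array_split
len(chunks), chunks[0].shape  # (10, (100,))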

Numpy rebinning a 2D array

I am looking for a fast way to do numerical binning of a 2D numpy array. By binning I mean calculating submatrix averages or cumulative values. For example, x = numpy.arange(16).reshape(4, 4) would be split into 4 submatrices of 2x2 each, giving numpy.array([[2.5, 4.5], [10.5, 12.5]]), where 2.5 = numpy.average([0, 1, 4, 5]), etc.
How do I perform such an operation in an efficient way? I don't really have any idea how to approach this.
Many thanks!
You can use a higher dimensional view of your array and take the average along the extra dimensions:
In [12]: a = np.arange(36).reshape(6, 6)
In [13]: a
Out[13]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
In [14]: a_view = a.reshape(3, 2, 3, 2)
In [15]: a_view.mean(axis=3).mean(axis=1)
Out[15]:
array([[ 3.5,  5.5,  7.5],
       [15.5, 17.5, 19.5],
       [27.5, 29.5, 31.5]])
In general, if you want bins of shape (a, b) for an array of (rows, cols), you should reshape it with .reshape(rows // a, a, cols // b, b). Note also that the order of the .mean calls matters: a_view.mean(axis=1).mean(axis=3) will raise an error, because a_view.mean(axis=1) has only three dimensions. a_view.mean(axis=1).mean(axis=2) will work fine, but it makes it harder to understand what is going on.
As is, the above code only works if you can fit an integer number of bins inside your array, i.e. if a divides rows and b divides cols. There are ways to deal with other cases, but you will have to define the behavior you want then.
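Wrapped up as a helper, under the stated assumption that a divides rows and b divides cols (rebin_mean is an illustrative name, not a numpy function):
import numpy as np

def rebin_mean(arr, a, b):
    '''Average over non-overlapping (a, b) blocks of a 2D array.'''
    rows, cols = arr.shape
    return arr.reshape(rows // a, a, cols // b, b).mean(axis=(1, 3))

# reproduces the example from the question:
x = np.arange(16).reshape(4, 4)
print(rebin_mean(x, 2, 2))  # [[ 2.5  4.5]
                            #  [10.5 12.5]]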
See the SciPy Cookbook on rebinning, which provides this snippet:
import numpy as np

def rebin(a, *args):
    '''rebin ndarray data into a smaller ndarray of the same rank whose dimensions
    are factors of the original dimensions. e.g. an array with 6 columns and 4 rows
    can be reduced to have 6, 3, 2 or 1 columns and 4, 2 or 1 rows.
    example usages:
    >>> a = np.random.rand(6, 4); b = rebin(a, 3, 2)
    >>> a = np.random.rand(6); b = rebin(a, 2)
    '''
    shape = a.shape
    lenShape = len(shape)
    factor = np.asarray(shape) // np.asarray(args)  # integer block size per axis
    # builds e.g. "a.reshape(args[0],factor[0],args[1],factor[1],).sum(1).sum(2)/factor[0]/factor[1]"
    evList = ['a.reshape('] + \
             ['args[%d],factor[%d],' % (i, i) for i in range(lenShape)] + \
             [')'] + ['.sum(%d)' % (i + 1) for i in range(lenShape)] + \
             ['/factor[%d]' % i for i in range(lenShape)]
    print(''.join(evList))
    return eval(''.join(evList))
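A quick check against the question's 4x4 example (using the Python 3 form of the snippet above):
x = np.arange(16).reshape(4, 4)
rebin(x, 2, 2)
# prints: a.reshape(args[0],factor[0],args[1],factor[1],).sum(1).sum(2)/factor[0]/factor[1]
# returns: array([[ 2.5,  4.5],
#                 [10.5, 12.5]])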
I assume that you only want to know how to generally build a function that performs well and does something with arrays, just like numpy.reshape in your example. So if performance really matters and you're already using numpy, you can write your own C code, as numpy does. For example, the implementation of arange is completely in C. Almost everything in numpy that matters in terms of performance is implemented in C.
However, before doing so you should try to implement the code in Python and see if the performance is good enough. Try to make the Python code as efficient as possible. If it still doesn't suit your performance needs, go the C way.
You may read about that in the docs.