PySpark PandasUDFType.SCALAR on a Row array column fails with a type mismatch

I want to use PandasUDFType.SCALAR to operate on the array in each Row, as below:
df = spark.createDataFrame([([1, 2, 3, 2],), ([4, 5, 5, 4],)], ['data'])
@pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
def s(x):
    z = x.apply(lambda xx: xx*2)
    return z
df.select(s(df.data)).show()
but it fails with:
pyarrow.lib.ArrowInvalid: trying to convert NumPy type int32 but got int64
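One likely reading of the error: Spark infers Python ints as 64-bit longs, so each array reaches the UDF as int64, while the declared return type ArrayType(IntegerType()) expects int32. The sketch below reproduces only the pandas side, without Spark. The Series `x` is a hand-built stand-in for what the UDF would receive, and `double_as_int32` is a hypothetical name, not part of the original code.

```python
import numpy as np
import pandas as pd

# Assumption: stand-in for the Series a scalar pandas UDF gets for an
# array column; each element is a NumPy array of 64-bit integers.
x = pd.Series([
    np.array([1, 2, 3, 2], dtype=np.int64),
    np.array([4, 5, 5, 4], dtype=np.int64),
])


def double_as_int32(series):
    """Double every array and cast it to int32 to match IntegerType."""
    return series.apply(lambda arr: (arr * 2).astype(np.int32))


z = double_as_int32(x)
print(z[0].dtype)
print(z[1].tolist())
```

Inside the UDF, the same cast would go into the lambda. Declaring the return type as ArrayType(LongType()) instead should also remove the mismatch.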

Dataframe index with isclose function

I have a dataframe with numerical values between 0 and 1. I am trying to create simple summary statistics manually. With a boolean comparison I can get the index, but when I try to use math.isclose, the function does not work and raises an error.
For example:
import math
import pandas as pd
df1 = pd.DataFrame({'col1': [0, .05, 0.74, 0.76, 1],
                    'col2': [0, 0.05, 0.5, 0.75, 1],
                    'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 6, 7, 8, 9]})
result75 = df1.index[round(df1['col2'],2) == 0.75].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This gives the correct result, but occasionally the result is NaN, so I tried:
result75 = df1.index[math.isclose(round(df1['col2'],2), 0.75, abs_tol = 0.011)].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This results in the following error message:
TypeError: cannot convert the series to <class 'float'>
Both are type "bool" so not sure what is going wrong here...
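math.isclose only accepts scalar floats, so passing it a Series raises the TypeError. A range comparison on the column works: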
rows_meeting_condition = df1[(df1['col2'] > 0.74) & (df1['col2'] < 0.76)]
print(rows_meeting_condition['x2'].mean())
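If a tolerance-based test is preferred over a range, NumPy's element-wise np.isclose accepts a whole column. A minimal sketch, assuming the same df1 as in the question; setting rtol=0 is a choice made here so that only the absolute tolerance applies, mirroring abs_tol in the original attempt.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [0, .05, 0.74, 0.76, 1],
                    'col2': [0, 0.05, 0.5, 0.75, 1],
                    'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 6, 7, 8, 9]})

# Element-wise closeness test; returns a boolean array, one entry per row.
mask = np.isclose(df1['col2'], 0.75, rtol=0, atol=0.011)

# Select the x2 values of the matching rows and average them.
value75 = df1.loc[mask, 'x2']
print(value75.mean())  # 8.0
```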

Taking a list of 2-d arrays and getting the non-zero values as ones in a single array with Numpy

I have a list of 2-d numpy arrays, and I wish to create one array consisting of the non-zero values (or-wise) of each array set to 1. For example
arr1 = np.array([[1,0],[0,0]])
arr2 = np.array([[0,10],[0,0]])
arr3 = np.array([[0,0],[0,8]])
arrs = [arr1, arr2, arr3]
And so my op would yield
op(arrs) = [[1, 1], [0, 1]]
What is an efficient way to do this in numpy (for about 8 image arrays of 600 by 600)?
It took me a while to understand the question. Try summing all the arrays element-wise, keeping their dimensions, and then replacing the non-zero values with 1, as follows:
def op(arrs):
    return np.where(np.add.reduce(arrs) != 0, 1, 0)
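Summing can cancel out when the arrays contain negative values (1 and -1 add up to 0). A variant that tests each array for non-zero entries before combining avoids that; this is a sketch under the assumption that all arrays share one shape, and `op_any` is a hypothetical name.

```python
import numpy as np


def op_any(arrs):
    """Return 1 wherever at least one array is non-zero, else 0."""
    stacked = np.stack(arrs)                 # shape: (n_arrays, rows, cols)
    return np.any(stacked != 0, axis=0).astype(int)


arr1 = np.array([[1, 0], [0, 0]])
arr2 = np.array([[0, 10], [0, 0]])
arr3 = np.array([[0, 0], [0, 8]])
result = op_any([arr1, arr2, arr3])
print(result)
```

For non-negative image data both versions give the same answer.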

Is there a numpy function like np.full(), but for arrays as fill value?

I'm trying to build an array of some given shape in which all elements are given by another array. Is there a function in numpy which does that efficiently, similar to np.full(), or any other elegant way, without simply employing for loops?
Example: Let's say I want an array with shape
(dim1,dim2) filled with a given, constant scalar value. Numpy has np.full() for this:
my_array = np.full((dim1,dim2),value)
I'm looking for an analogous way of doing this, but I want the array to be filled with another array of shape (filldim1,filldim2). A brute-force way would be this:
my_array = np.array([])
for i in range(dim1):
    for j in range(dim2):
        my_array = np.append(my_array,fill_array)
my_array = my_array.reshape((dim1,dim2,filldim1,filldim2))
EDIT
I was being stupid: np.full() does take an array as the fill value if the shape is extended accordingly:
my_array = np.full((dim1,dim2,filldim1,filldim2),fill_array)
Thanks for pointing that out, @Arne!
You can use np.tile:
>>> shape = (2, 3)
>>> fill_shape = (4, 5)
>>> fill_arr = np.random.randn(*fill_shape)
>>> arr = np.tile(fill_arr, [*shape, 1, 1])
>>> arr.shape
(2, 3, 4, 5)
>>> np.all(arr[0, 0] == fill_arr)
True
Edit: better answer, as suggested by @Arne, directly using np.full:
>>> arr = np.full([*shape, *fill_shape], fill_arr)
>>> arr.shape
(2, 3, 4, 5)
>>> np.all(arr[0, 0] == fill_arr)
True
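If the result only needs to be read, np.broadcast_to gives the same shape without copying the fill array. A minimal sketch, reusing the names shape and fill_shape from the answer above; the fill values here are arbitrary.

```python
import numpy as np

shape = (2, 3)
fill_shape = (4, 5)
fill_arr = np.arange(20.0).reshape(fill_shape)

# Read-only view: every (i, j) slot refers to the same fill_arr data.
view = np.broadcast_to(fill_arr, (*shape, *fill_shape))
print(view.shape)  # (2, 3, 4, 5)

# Take a copy when an independent, writable array is needed.
writable = view.copy()
```

The view is not writable, so np.full or an explicit copy remains the right choice when the array will be modified.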

Python pandas json 2D array

I'm relatively new to pandas. I have a JSON file and a Python file:
{"dataset":{
"id": 123,
"data": [["2015-10-16",1,2,3,4,5,6],
["2015-10-15",7,8,9,10,11,12],
["2015-10-14",13,14,15,16,17]]
}}
and
import pandas
x = pandas.read_json('sample.json')
y = x.dataset.data
print(x.dataset)
Printing x.dataset and y works fine, but when I try to access a sub-element of y, I get a 'buffer' type. What's going on, and how can I access the data inside the array? Attempting y[0][1] raises an out-of-bounds error, and iterating over y returns a strange series of 'nul' characters. Yet the first portion of the data does show up when printing x.dataset.
In the pandas version used here, the data attribute of a Series points to the memory buffer of all the data contained in that series:
>>> df = pandas.read_json('sample.json')
>>> type(df.dataset)
pandas.core.series.Series
>>> type(df.dataset.data)
memoryview
If you have a column/row named "data", you have to access it by its string name, e.g.:
>>> type(df.dataset['data'])
list
Because of surprises like this, it's usually considered best practice to access columns through indexing rather than through attribute access. If you do this, you will get your desired result:
>>> df['dataset']['data']
[['2015-10-16', 1, 2, 3, 4, 5, 6],
['2015-10-15', 7, 8, 9, 10, 11, 12],
['2015-10-14', 13, 14, 15, 16, 17]]
>>> arr = df['dataset']['data']
>>> arr[0][0]
'2015-10-16'
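When the goal is a table rather than a nested list, one option is to read the JSON with the standard json module and pass the rows to pandas directly. A sketch, assuming the same content as sample.json but inlined as a string so it runs on its own; the names payload, rows, and frame are illustrative.

```python
import json
import pandas as pd

raw = """
{"dataset": {
  "id": 123,
  "data": [["2015-10-16", 1, 2, 3, 4, 5, 6],
           ["2015-10-15", 7, 8, 9, 10, 11, 12],
           ["2015-10-14", 13, 14, 15, 16, 17]]
}}
"""

payload = json.loads(raw)
rows = payload["dataset"]["data"]   # plain list of lists
print(rows[0][1])                   # 1

# One DataFrame row per inner list; the shorter last row is padded with NaN.
frame = pd.DataFrame(rows)
print(frame.shape)                  # (3, 7)
```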

Numpy : resize array

I have two NumPy arrays of sizes 994 and 1000. When I do the operation below:
X * Y
I get the error "ValueError: operands could not be broadcast together with shapes (994) (1000)".
As a fix, I am trying to pad trailing zeros onto the smaller array with the method below:
padzero = 0
if(bw.size > w.size):
    padzero = bw.size - w.size
    w = np.pad(w,padzero, 'constant', constant_values=0)
if(bw.size < w.size):
    padzero = w.size - bw.size
    bw = np.pad(bw,padzero, 'constant', constant_values=0)
The problem now is that if the size difference is 6, twelve zeros get padded onto the array (six on each side), when I need exactly six.
I have tried several ways to fix this without success. If I try the following:
bw = np.pad(bw,padzero/2, 'constant', constant_values=0)
I get:
ValueError: Unable to create correctly shaped tuple from 3.0
How can I fix this?
a = np.array([1, 2, 3])
Pass pad_width as a (before, after) tuple. To insert zeros at the front:
np.pad(a,(2,0),'constant', constant_values=0)
array([0, 0, 1, 2, 3])
To insert zeros at the back:
np.pad(a,(0,2),'constant', constant_values=0)
array([1, 2, 3, 0, 0])
At both the front and the back:
np.pad(a,(1,1),'constant', constant_values=0)
array([0, 1, 2, 3, 0])
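Applied to the question, padding only at the back by the size difference could look like the sketch below. The helper name pad_to_match is hypothetical, and the arrays of ones stand in for the real w and bw.

```python
import numpy as np


def pad_to_match(a, b):
    """Zero-pad the shorter 1-D array at its end so both lengths match."""
    n = max(a.size, b.size)
    # (0, k) adds no zeros before the data and k zeros after it.
    a = np.pad(a, (0, n - a.size), 'constant', constant_values=0)
    b = np.pad(b, (0, n - b.size), 'constant', constant_values=0)
    return a, b


w = np.ones(994)
bw = np.ones(1000)
w, bw = pad_to_match(w, bw)
product = w * bw
print(w.size, bw.size)  # 1000 1000
```

Because the pad width is computed per array, the array that is already the longer one gets (0, 0) and is left unchanged.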