Inconsistent behavior when slicing a 2d array in PostgreSQL

Let's say I have a 2d array:
# SELECT ARRAY[ARRAY[1,2], ARRAY[3,4]];
array
---------------
{{1,2},{3,4}}
(1 row)
Now, if I want to get the first element of each inner array, adding (...)[:][1] will do the trick:
# SELECT (ARRAY[ARRAY[1,2], ARRAY[3,4]])[:][1];
array
-----------
{{1},{3}}
(1 row)
BUT: If I want to obtain the second element of each inner array, I have to opt for adding (...)[:][2:2], as (...)[:][2] would return the untouched array again:
# SELECT (ARRAY[ARRAY[1,2], ARRAY[3,4]])[:][2];
array
---------------
{{1,2},{3,4}}
(1 row)
# SELECT (ARRAY[ARRAY[1,2], ARRAY[3,4]])[:][2:2];
array
-----------
{{2},{4}}
(1 row)
What is the reason for this inconsistent behavior?

I think the documentation explains this pretty well:
If any dimension is written as a slice, i.e., contains a colon, then all dimensions are treated as slices. Any dimension that has only a single number (no colon) is treated as being from 1 to the number specified.
That is, once any subscript is written as a slice, Postgres treats every dimension as a slice. A dimension given as a single number n is expanded to the slice 1:n.
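As a quick illustration of that rule, using the array from the question: because the first dimension is a slice, the single number 2 in the second dimension is expanded to 1:2, so these two expressions return the same thing, while 2:2 really restricts the second dimension:
# SELECT (ARRAY[ARRAY[1,2], ARRAY[3,4]])[:][2];     -- treated as [1:2][1:2], returns {{1,2},{3,4}}
# SELECT (ARRAY[ARRAY[1,2], ARRAY[3,4]])[1:2][1:2]; -- same result
# SELECT (ARRAY[ARRAY[1,2], ARRAY[3,4]])[:][2:2];   -- only the second column: {{2},{4}}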

Related

Fastest way to find two minimum values in each column of NumPy array

If I want to find the minimum value in each column of a NumPy array, I can use the numpy.amin() function. However, is there a way to find the two minimum values in each column that is faster than sorting each column?
You can simply use np.partition along the columns to get the smallest N numbers:
N = 2
np.partition(a,kth=N-1,axis=0)[:N]
This doesn't actually sort the entire data; it simply partitions it into two sections such that the smallest N numbers end up in the first section (also called a partial sort).
Bonus (getting the top N elements): similarly, to get the top N numbers per column, simply use a negative kth value:
np.partition(a,kth=-N,axis=0)[-N:]
Along other axes and higher-dimensional arrays
To use it along other axes, change the axis value. So, along rows it would be axis=1 for a 2D array, and the same extends to higher-dimensional ndarrays.
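A minimal runnable sketch putting the above together, with a small made-up array (the names a and N follow the answer):
import numpy as np

a = np.array([[5, 1, 9],
              [2, 8, 3],
              [7, 4, 6],
              [0, 2, 8]])
N = 2

# two smallest values per column (order within each pair is not guaranteed)
smallest = np.partition(a, kth=N-1, axis=0)[:N]

# two largest values per column, via a negative kth
largest = np.partition(a, kth=-N, axis=0)[-N:]

# along rows instead of columns
smallest_per_row = np.partition(a, kth=N-1, axis=1)[:, :N]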
Use the min() method, and specify the axis you want to take the minimum over:
a = np.random.rand(10,3)
a.min(axis=0)
gives:
array([0.04435587, 0.00163139, 0.06327353])
a.min(axis=1)
gives
array([0.01354386, 0.08996586, 0.19332211, 0.00163139, 0.55650945,
0.08409907, 0.23015718, 0.31463493, 0.49117553, 0.53646868])

AttributeError: 'int' object has no attribute 'count' while using itertuples() method with dataframes

I am trying to iterate over rows in a Pandas DataFrame using the itertuples() method, which works quite fine for my case. Now I want to check if a specific value ('x') is in a specific tuple. I used the count() method for that, as I need to use the number of occurrences of x later.
The weird part is, for some tuples that works just fine (e.g. in my case (namedtuple[7].count('x')) + (namedtuple[8].count('x')) ), but for others (e.g. namedtuple[9].count('x')) I get an AttributeError: 'int' object has no attribute 'count'.
Would appreciate your help very much!
Apparently, some columns of your DataFrame are of object type (actually strings)
and some of them are of int type (more generally, numbers).
To count occurrences of x in each row, you should apply a function to each row which:
checks whether the type of the current element is str,
if it is, returns count('x'),
if not, returns 0 (don't attempt to look for x in a number).
So far this function returns a Series with the number of x occurrences in each column
(separately), so to compute the total for the whole row, this Series should
be summed.
Example of working code:
Test DataFrame:
     C1   C2  C3
0  axxv  bxy  10
1    vx   cy  20
2    vv   vx  30
Code:
for ind, row in df.iterrows():
    print(ind, row.apply(lambda it:
                         it.count('x') if type(it).__name__ == 'str' else 0).sum())
(in my opinion, iterrows is more convenient here).
The result is:
0 3
1 1
2 1
So as you can see, it is possible to count occurrences of x,
even when some columns are not strings.
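For reference, a self-contained version of the above; the DataFrame construction mirrors the test data, and isinstance(it, str) is used as an equivalent of the type-name check:
import pandas as pd

df = pd.DataFrame({'C1': ['axxv', 'vx', 'vv'],
                   'C2': ['bxy', 'cy', 'vx'],
                   'C3': [10, 20, 30]})

for ind, row in df.iterrows():
    # count 'x' only in string cells; numeric cells contribute 0
    total = row.apply(lambda it: it.count('x') if isinstance(it, str) else 0).sum()
    print(ind, total)   # prints: 0 3, 1 1, 2 1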

Remove the last element from a json[]?

I have a json[] array (_result_group) in PostgreSQL 9.4, and I want to remove its last json element (_current). I prepared with:
_result_group := (SELECT array_append(_result_group,_current));
And tried to remove with:
SELECT _result_group[1:array_length(_result_group,1) -1] INTO _result_group;
But it didn't work.
How to do this?
To remove the last element from any array (including json[]) with the means available in Postgres 9.4, obviously within plpgsql code:
_result_group := _result_group[1:cardinality(_result_group)-1];
Assuming a 1-dimensional array with default subscripts starting with 1.
You get an empty array for empty array input and null for null.
According to the manual, cardinality() ...
returns the total number of elements in the array, or 0 if the array is empty
Then just take the array slice from 1 to cardinality -1.
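A quick sketch outside plpgsql, with an inline three-element json[] literal, just to illustrate the slice:
SELECT arr[1:cardinality(arr) - 1] AS trimmed
FROM  (SELECT ARRAY['{"a":1}', '{"b":2}', '{"c":3}']::json[] AS arr) sub;
-- trimmed keeps the first two elements; the last one is dropped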
Then again, your attempt should work as well:
SELECT _result_group[1:array_length(_result_group,1) -1] INTO _result_group;
For non-standard array subscripts see:
Normalize array subscripts for 1-dimensional array so they start with 1

How can I select the nth element from an array's 2nd dimension?

I have a 2-dimensional int array, and I'd like to get the 2nd element from every array in the 2nd dimension. So for example, I'd like to get 2, 4, and 6 from the array literal '{{1,2},{3,4},{5,6}}'. Is this possible? I've searched the docs but I haven't found anything that can do what I want.
unnest(arr[:][2:2]) will give you a table expression for what you want (where arr is the name of your array column).
If you want to get a 1 dimensional array of those elements, you can use array(select * from unnest(arr[:][2:2])) (because arr[:][2:2] is still a 2 dimensional one).
http://rextester.com/VLOJ18858
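For example, with the array literal from the question inlined in place of the column arr:
SELECT unnest(('{{1,2},{3,4},{5,6}}'::int[])[:][2:2]);                  -- rows: 2, 4, 6
SELECT array(SELECT unnest(('{{1,2},{3,4},{5,6}}'::int[])[:][2:2]));    -- {2,4,6}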

Numpy maximum(arrays)--how to determine the array each max value came from

I have numpy arrays representing July temperature for each year since 1950.
I want to use numpy.maximum(temp1950, temp1951, temp1952, ..temp2014)
to determine the maximum July temperature at each cell.
I need the maximum for each cell, but numpy.maximum(array1, array2) works with only two arrays at a time.
How do I determine the year that each max value came from?
Thanks to Praveen, the following works fine:
import numpy

array1 = numpy.array([[1, 2], [3, 4]])
array2 = numpy.array([[3, 4], [1, 2]])
array3 = numpy.array([[9, 1], [1, 9]])
all_arrays = numpy.dstack((array1, array2, array3))
# maxvalues = numpy.maximum(all_arrays)  # will not work
all_arrays.max(axis=2)  # this returns the max from each cell location
max_indexes = numpy.argmax(all_arrays, axis=2)  # this returns the correct indexes
The answer is argmax, except that you need to do this along the required axis. If you have 65 years' worth of temperatures, it doesn't make sense to keep them in separate arrays.
Instead, put them all into a single 2D array using something like np.vstack and then take the argmax along axis 0.
alltemps = np.vstack((temp1950, temp1951, ..., temp2014))
maxindexes = np.argmax(alltemps, axis=0)
If your temperature arrays are already 2D for some reason, then you can use np.dstack to stack in depth instead. Then you'll have to take argmax over axis=2.
For the specific example in your question, you're looking for something like:
t = np.dstack((array1, array2))  # Note the double parentheses: you need to pass
                                 # a tuple to the function.
maxindexes = np.argmax(t, axis=2)
PS: If you are getting the data out of a file, I suggest putting them in a single array to start with. It gets hard to handle 65 variable names.
You need to use NumPy's argmax.
It will give you the index of the largest element in the array, which you can map to the year.
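A small sketch of that year lookup, using made-up numbers and the stacking approach from the answer above (the 1D-per-year case; variable names are illustrative):
import numpy as np

# three made-up "years" of July temperatures on a 4-cell grid (flattened to 1D)
temp1950 = np.array([21.0, 25.5, 19.2, 30.1])
temp1951 = np.array([22.3, 24.0, 20.5, 29.8])
temp1952 = np.array([20.9, 26.1, 18.7, 31.4])

alltemps = np.vstack((temp1950, temp1951, temp1952))   # shape (3, 4): years x cells
years = np.array([1950, 1951, 1952])

maxindexes = np.argmax(alltemps, axis=0)   # index of the hottest year for each cell
hottest_year = years[maxindexes]           # array([1951, 1952, 1951, 1952])
max_temp = alltemps.max(axis=0)            # the maximum temperature itself per cell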