Numpy maximum(arrays)--how to determine the array each max value came from - numpy

I have numpy arrays representing July temperature for each year since 1950.
I can use the numpy.maximum(temp1950,temp1951,temp1952,..temp2014)
to determine the maximum July temperature at each cell.
I need the maximum for each cell..the numpy.maximum() works for only 2 arrays
How do I determine the year that each max value came from?
Also the numpy.maximum(array1,array2) works comparing only two arrays.
Thanks to Praveen, the following works fine:
array1 = numpy.array( ([1,2],[3,4]) )
array2 = numpy.array( ([3,4],[1,2]) )
array3 = numpy.array( ([9,1],[1,9]) )
all_arrays = numpy.dstack((array1,array2,array3))
#maxvalues = numpy.maximum(all_arrays)#will not work
all_arrays.max(axis=2) #this returns the max from each cell location
max_indexes = numpy.argmax(all_arrays,axis=2)#this returns correct indexes

The answer is argmax, except that you need to do this along the required axis. If you have 65 years' worth of temperatures, it doesn't make sense to keep them in separate arrays.
Instead, put them all into a single 2D dimensional array using something like np.vstack and then take the argmax over rows.
alltemps = np.vstack((temp1950, temp1951, ..., temp2014))
maxindexes = np.argmax(alltemps, axis=0)
If your temperature arrays are already 2D for some reason, then you can use np.dstack to stack in depth instead. Then you'll have to take argmax over axis=2.
For the specific example in your question, you're looking for something like:
t = np.dstack((array1, array2)) # Note the double parantheses. You need to pass
# a tuple to the function
maxindexes = np.argmax(t, axis=2)
PS: If you are getting the data out of a file, I suggest putting them in a single array to start with. It gets hard to handle 65 variable names.

You need to use Numpy's argmax
It would give you the index of the largest element in the array, which you can map to the year.

Related

Pinescript index range

Is there a way to code an index range in pinescript? For example, If I want to include all close values between 10 bars and 5 bars ago. Everything between close[10] and close[5].
In python this would be close[5:10] but I cannot find any literature discussing a range of indexes.
thanks!
You could code a function to do that, but to be clear [] has a specific meaning in Pinescript as a history-referencing operator. I think what you are asking for is a way to construct an array of values from a series based on indicies.
This would work if you're use float values like OHLC
//#version=4
study("My Script")
range(_src, _a, _b) =>
_arr = array.new_float(0)
for i = _a to _b - 1
array.push(_arr, _src[i])
_arr
someCloses = range(close, 5, 10)
plot(array.size(someCloses))
But with this you are converting your data to a different type. So make sure to look at the available array functions.

Fastest way to find two minimum values in each column of NumPy array

If I want to find the minimum value in each column of a NumPy array, I can use the numpy.amin() function. However, is there a way to find the two minimum values in each column, that is faster than sorting each column?
You can simply use np.partition along the columns to get smallest N numbers -
N = 2
np.partition(a,kth=N-1,axis=0)[:N]
This doesn't actually sort the entire data, simply partitions into two sections such that smallest N numbers are in the first section, also called as partial-sort.
Bonus (Getting top N elements) : Similarly, to get the top N numbers per col, simply use negative kth value -
np.partition(a,kth=-N,axis=0)[-N:]
Along other axes and higher dim arrays
To use it along other axes, change the axis value. So, along rows, it would be axis=1 for a 2D array and extend the same way for higher dimension ndarrays.
Use the min() method, and specify the axis you want to average over:
a = np.random.rand(10,3)
a.min(axis=0)
gives:
array([0.04435587, 0.00163139, 0.06327353])
a.min(axis=1)
gives
array([0.01354386, 0.08996586, 0.19332211, 0.00163139, 0.55650945,
0.08409907, 0.23015718, 0.31463493, 0.49117553, 0.53646868])

How do I reverse each value in a column bit wise for a hex number?

I have a dataframe which has a column called hexa which has hex values like this. They are of dtype object.
hexa
0 00802259AA8D6204
1 00802259AA7F4504
2 00802259AA8D5A04
I would like to remove the first and last bits and reverse the values bitwise as follows:
hexa-rev
0 628DAA592280
1 457FAA592280
2 5A8DAA592280
Please help
I'll show you the complete solution up here and then explain its parts below:
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
reversed_bits = [list_of_bits[-i] for i in range(1,len(list_of_bits)+1)]
return ''.join(reversed_bits)
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
There are possibly a couple ways of doing it, but this way should solve your problem. The general strategy will be defining a function and then using the apply() method to apply it to all values in the column. It should look something like this:
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
Now we need to define the function we're going to apply to it. Breaking it down into its parts, we strip the first and last bit by indexing. Because of how negative indexes work, this will eliminate the first and last bit, regardless of the size. Your result is a list of characters that we will join together after processing.
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
The second line iterates through the list of characters, matches the first and second character of each bit together, and then concatenates them into a single string representing the bit.
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
The second to last line returns the list you just made in reverse order. Lastly, the function returns a single string of bits.
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
reversed_bits = [list_of_bits[-i] for i in range(1,len(list_of_bits)+1)]
return ''.join(reversed_bits)
I explained it in reverse order, but you want to define this function that you want applied to your column, and then use the apply() function to make it happen.

Calling preprocessing.scale on a heterogeneous array

I have this TypeError as per below, I have checked my df and it all contains numbers only, can this be caused when I converted to numpy array? After the conversion the array has items like
[Timestamp('1993-02-11 00:00:00') 28.1216 28.3374 ...]
Any suggestion how to solve this, please?
df:
Date Open High Low Close Volume
9 1993-02-11 28.1216 28.3374 28.1216 28.2197 19500
10 1993-02-12 28.1804 28.1804 28.0038 28.0038 42500
11 1993-02-16 27.9253 27.9253 27.2581 27.2974 374800
12 1993-02-17 27.2974 27.3366 27.1796 27.2777 210900
X = np.array(df.drop(['High'], 1))
X = preprocessing.scale(X)
TypeError: float() argument must be a string or a number
While you're saying that your dataframe "all contains numbers only", you also note that the first column consists of datetime objects. The error is telling you that preprocessing.scale only wants to work with float values.
The real question, however, is what you expect to happen to begin with. preprocessing.scale centers values on the mean and normalizes the variance. This is such that measured quantities are all represented on roughly the same footing. Now, your first column tells you what dates your data correspond to, while the rest of the columns are numeric data themselves. Why would you want to normalize the dates? How would you normalize the dates?
Semantically speaking, I believe you should leave your dates alone. Whatever post-processing you're planning to perform on your numerical data, the normalized data should still be parameterized by the original dates. If you want to process your dates too, you need to come up with an explicit way to handle your dates to something numeric (say, elapsed time from a given date in given units).
So I believe you should drop your dates from your processing round altogether, and start with
X = df.drop(['Date','High'], 1).as_matrix()

Apply function with pandas dataframe - POS tagger computation time

I'm very confused on the apply function for pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure the way of setting up my apply statement or my function.
def noun_count(row):
x = tagger(df['string'][row].split())
# array flattening and filtering out all but nouns, then summing them
return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count[row] and get the correct value for any index but I can't figure out how to make it work with apply how I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
def tagger_nouns(x):
list_of_lists = st.tag(x.split())
flat = [y for z in list_of_lists for y in z]
Parts_of_speech = [row[1] for row in flattened]
c = Counter(Parts_of_speech)
nouns = c['NN']+c['NNS']+c['NNP']+c['NNPS']
return nouns
I'm using the Stanford tagger, but I have a big problem with computation time, and I'm using the left 3 words model. I'm noticing that it's calling the .jar file again and again (java keeps opening and closing in the task manager) and maybe that's unavoidable, but it's really taking far too long to run. Any way I can speed it up?
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2