loading np array very slow - numpy

New to Python (very cool), first question. I am reading a 50+ MB ASCII file, scanning for property tags and parsing the data into a NumPy array. I placed timing reports throughout the loop and found the culprit: the while loop using np.append(). I am wondering if there is a faster method.
This is a sample input file format with fake data for debugging:
...
tag parameter
char name "Poro"
array float data 100
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 56 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100
endtag
...
and this is the code fragment, where the while loop is taking 70 seconds for a 350k-element array:
def readParameter(self, parameterName):
    startTime = time.time()
    intervalTime = time.time()
    token = "tag parameter"
    self.inputBuffer.seek(0)
    for lineno, line in enumerate(self.inputBuffer, 1):
        if token in line:
            line = self.inputBuffer.next().replace('"', '').split()
            elapsedTime = time.time() - intervalTime
            logging.debug(" Time to readParameter find token: " + str(elapsedTime))
            intervalTime = time.time()
            if line[2] == parameterName:
                line = self.inputBuffer.next()
                line = self.inputBuffer.next()
                np.parameterArray = np.fromstring(line, dtype=float, sep=" ")
                line = self.inputBuffer.next()
                # this loop is the bottleneck: ~70 s for a 350k-element array
                while not "endtag" in line:
                    np.parameterArray = np.append(np.parameterArray, np.fromstring(line, dtype=float, sep=" "))
                    line = self.inputBuffer.next()
                elapsedTime = time.time() - startTime
                logging.debug(" Time to readParameter load array: " + str(elapsedTime))
                break
    elapsedTime = time.time() - startTime
    logging.debug(" Time to readParameter: " + str(elapsedTime))
    logging.debug(np.parameterArray)
    np.parameterArray = self.make3D(np.parameterArray)
    return np.parameterArray
Thanks, Jeff

Appending to an array requires resizing it, which usually means allocating a new block of memory big enough to hold the new array, copying the existing data to the new location, and freeing the memory it used to occupy. All of those operations are expensive, and you're doing them on every iteration. With 350k elements, it's essentially a memory-allocation and fragmentation stress test.
Pre-allocate your array. You've got the count parameter, so make an array of that size, and inside your loop assign each newly parsed element to the next slot in the array instead of appending it. You'll have to keep your own counter of how many elements have been filled. (You could instead iterate over the elements of the blank array and replace them, but that would make error handling a bit trickier to add.)
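A minimal sketch of that approach, assuming the element count from the header line (array float data 100) has already been parsed into count; the function and variable names here are illustrative, not from the original code:
import numpy as np

def read_array(inputBuffer, count):
    data = np.empty(count, dtype=float)  # allocate the full array once, up front
    filled = 0  # our own counter of how many slots are filled
    for line in inputBuffer:
        if "endtag" in line:
            break
        values = np.fromstring(line, dtype=float, sep=" ")
        data[filled:filled + len(values)] = values  # assign into the next free slots
        filled += len(values)
    return data[:filled]  # trim in case fewer values arrived than promised
np.fromstring is kept here to match the question's code; on newer NumPy it is deprecated, and np.array(line.split(), dtype=float) does the same job.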

Related

How to visualize multi-indexed series into a heatmap in pandas?

I am trying to turn this kind of series:
Animal Idol
50 60 15
55 14
81 14
80 13
56 11
53 10
58 9
57 9
50 9
59 6
52 6
61 1
52 52 64
58 28
55 21
81 17
60 16
50 16
56 15
80 12
61 10
59 10
53 9
57 4
53 53 27
56 14
58 10
50 9
80 8
52 6
55 6
61 5
81 5
60 4
57 4
59 3
Into something looking more like this:
Animal/Idol 60 55 81 80 ...
50 15 14 14 13
52 16 21 17 12
53 4 6 5 8
...
The base for the series here is actually a data frame that looks like this (the unnamed values in the series are counts of how many times a given animal/idol pair repeats, and there are many idols for each animal):
Animal Idol
1058 50 50
1061 50 50
1197 50 50
1357 50 50
1637 50 50
... ... ...
2780 81 81
2913 81 81
2915 81 81
3238 81 81
3324 81 81
Sadly, I have no clue how to convert either of these two into the desired form. I guess the proper name for it is a pivot table, but I could not get a good result using pivot tables. How would you transform either of these into the form I need? I would also like to know how to visualize this kind of pivot table (if that's the right name) as a heat map, where the colour of each cell differs based on its value (the higher the value, the deeper the colour). Thanks in advance!
I think you are looking for .unstack() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html) to unstack the data.
To visualize it you can use multiple tools. I like using holoviews (https://holoviews.org/);
hv.Image can plot a 2D array, so you can use hv.Image(df.unstack().values) to do that.
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.random.randint(0, 100, 100)},
                  index=pd.MultiIndex.from_tuples([(i, j) for i in range(10) for j in range(10)]))
df
unstack:
df_unstacked = df.unstack()
df_unstacked
plot:
import holoviews as hv
hv.Image(df_unstacked.values)
or to plot with matplotlib:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
im = ax.imshow(df_unstacked.values)
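Applied to the question's Animal/Idol data, the same unstack idea plus a labeled heat map might look roughly like this; a sketch assuming df is the base frame shown in the question, with Animal and Idol columns:
import matplotlib.pyplot as plt

# counts: a Series indexed by (Animal, Idol), i.e. the series from the question
counts = df.groupby(['Animal', 'Idol']).size()

# pivot into a 2D table: rows = Animal, columns = Idol
table = counts.unstack(fill_value=0)

# heat map: deeper colour for higher counts
fig, ax = plt.subplots()
im = ax.imshow(table.values)
ax.set_xticks(range(len(table.columns)))
ax.set_xticklabels(table.columns)
ax.set_yticks(range(len(table.index)))
ax.set_yticklabels(table.index)
fig.colorbar(im, ax=ax, label='count')
plt.show()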

Reverse the order of the rows by chunks of n rows

Consider the following sequence:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
which produces:
A B C D
0 56 83 99 46
1 40 70 22 51
2 70 9 78 33
3 65 72 79 87
4 0 6 22 73
.. .. .. .. ..
95 35 76 62 97
96 86 85 50 65
97 15 79 82 62
98 21 20 19 32
99 21 0 51 89
I can reverse the sequence with the following command:
df.iloc[::-1]
That gives me the following result:
A B C D
99 21 0 51 89
98 21 20 19 32
97 15 79 82 62
96 86 85 50 65
95 35 76 62 97
.. .. .. .. ..
4 0 6 22 73
3 65 72 79 87
2 70 9 78 33
1 40 70 22 51
0 56 83 99 46
How would I rewrite the code if I wanted to reverse the sequence every nth row, e.g. every 4th row?
IIUC, you want to reverse by chunk (3, 2, 1, 0, 7, 6, 5, 4…):
One option is to use groupby with a custom group:
N = 4
group = df.index//N
# if the index is not a linear range
# import numpy as np
# np.arange(len(df))//N
df.groupby(group).apply(lambda d: d.iloc[::-1]).droplevel(0)
output:
A B C D
3 45 33 73 77
2 91 34 19 68
1 12 25 55 19
0 65 48 17 4
7 99 99 95 9
.. .. .. .. ..
92 89 68 48 67
99 99 28 52 87
98 47 49 21 8
97 80 18 92 5
96 49 12 24 40
[100 rows x 4 columns]
A very fast method, based only on indexing, is to use numpy to generate a list of the indices reversed by chunk:
import numpy as np
N = 4
idx = np.arange(len(df)).reshape(-1, N)[:, ::-1].ravel()
# array([ 3, 2, 1, 0, 7, 6, 5, 4, 11, ...])
# slice using iloc
df.iloc[idx]
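One caveat: reshape(-1, N) only works when len(df) is an exact multiple of N. A small sketch of one way to handle a ragged final chunk, assuming the leftover rows should be reversed among themselves:
import numpy as np

N = 4
n = len(df)
tail = n % N  # rows left over after the last full chunk
if tail:
    full = np.arange(n - tail).reshape(-1, N)[:, ::-1].ravel()
    idx = np.concatenate([full, np.arange(n - tail, n)[::-1]])
else:
    idx = np.arange(n).reshape(-1, N)[:, ::-1].ravel()
df.iloc[idx]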

actions.perform() in Selenium only works the first time, then it does nothing. What can be the problem?

The code below is an automated Cookie Clicker I wrote to experiment with ActionChains. It's based on a tutorial video, at 9:42. (Link)
When I run this code, the for loop runs 1000 times but only 1 click happens. Multiple clicks only happen if I remove the "#" from the commented line, so that actions.click(cookie) runs each time. In the video, that one extra line of code is not necessary. What can be the cause of that?
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
s = Service("C:\Program Files (x86)\chromedriver.exe")
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get("https://orteil.dashnet.org/cookieclicker/")
driver.implicitly_wait(5)
cookie=driver.find_element(By.ID,"bigCookie")
cookie_count = driver.find_element(By.ID,"cookies")
actions = ActionChains(driver)
actions.click(cookie)
for i in range(1000):
    #actions.click(cookie)
    actions.perform()
    count = int(cookie_count.text.split(" ")[0])
    print(i, count)
driver.quit()
The ActionChains implementation
ActionChains can be used in a chain pattern. When you call methods for actions on the ActionChains object, the actions are stored in a queue in the ActionChains object. When you call perform(), the events are fired in the order they are queued up.
perform()
Performs all stored actions.
Conclusion
perform() fires the events stored in the queue. In your use case, actions.click(cookie) is the event.
Your optimal code block will be:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://orteil.dashnet.org/cookieclicker/")
cookie_count = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#cookies")))
cookie = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#bigCookie")))
for i in range(100):
    ActionChains(driver).click(cookie).perform()
    count = cookie_count.text.split(" ")[0]
    print(i, count)
driver.quit()
Console Output:
0 0
1 1
2 2
3 3
4 4
...
97 97
98 98
99 99
I assume you are using actions for the sake of using them, or to learn about them, since you could simply call cookie.click() to get the desired result.
Actions are used when you need to perform some "action" on an element other than finding or clicking it, i.e. you want to right-click, or click and hold, or send a keystroke combination, and so on. Check Selenium Actions for more info.
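For illustration, a couple of composite-action sketches; the element variables (menu_element, item, source, target) are hypothetical:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# right-click an element
ActionChains(driver).context_click(menu_element).perform()

# Ctrl+click to multi-select
ActionChains(driver).key_down(Keys.CONTROL).click(item).key_up(Keys.CONTROL).perform()

# click and hold, drag to another element, then release
ActionChains(driver).click_and_hold(source).move_to_element(target).release().perform()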
As for using actions to click, you also need to understand that perform() takes the composite of your queued actions (calling the build function of Actions) and executes them. Since your actions are declared outside the for loop, after the first click the perform() call has no more actions to perform.
TLDR:
Either remove the "#" from actions.click(cookie) inside your for loop, or use cookie.click() to get the same result without using actions.
for i in range(10):
    actions.click(cookie)
    actions.perform()
    #cookie.click()
    count = int(cookie_count.text.split(" ")[0])
    print(i, count)
driver.quit()
Colab Notebook of it working

How to compare dates with np.nanmin() [duplicate]

How can I reference the minimum value of two dataframe columns as part of a pandas dataframe equation? I tried using the Python min() function, which did not work. I'm sorry if this is well documented somewhere, but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() *Cp* (data[' Thi'] - data[' Tci'])
I also tried pandas' min() function, which did not work either.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name; I wasn't sure where the index comes into play.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
        'flow_d': [np.random.randint(100) for _ in range(rows)],
        'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
If you are trying to get the row-wise minimum of two or more columns, use pandas.DataFrame.min. Note that by default axis=0; specifying axis=1 is necessary.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50
If you'd like to get a single minimum value across multiple columns:
data[['flow_h','flow_c']].min().min()
The first min() calculates the minimum per column and returns a pandas Series. The second min() returns the minimum of those per-column minimums.

Repeating the format specifiers in awk

I am trying to format the output of AWK's printf() function. More precisely, I am trying to print a matrix with very long rows, and I would like to wrap them and continue on the next line. What I am trying to do is best illustrated using Fortran. Consider the following Fortran statement:
write(*,'(10I5)')(i,i=1,100)
The output would be the integers in the range 1:100 printed in rows of 10 elements.
Is it possible to do the same in AWK? I could do it by offsetting the index and printing a newline with "\n". The question is whether that can be done in an elegant manner, as in Fortran.
Thanks,
As suggested in the comments, I would like to explain the Fortran code given as an example above.
(i,i=1,100) ! => a do loop going from 1 to 100
write(*,'(10I5)') ! => a formatted write statement
10I5 says print 10 integers, allocating a 5-character slot for each integer.
The trick is that once the 10 x 5 character slots given by the formatted write are exhausted, output jumps to the next line, so no trailing "\n" is needed.
This may help you
[akshay#localhost tmp]$ cat test.for
implicit none
integer i
write(*,'(10I5)')(i,i=1,100)
end
[akshay#localhost tmp]$ gfortran test.for
[akshay#localhost tmp]$ ./a.out
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
[akshay#localhost tmp]$ awk 'BEGIN{for(i=1;i<=100;i++)printf("%5d%s",i,i%10?"":"\n")}'
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
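For clarity: %5d gives each integer a 5-character slot, matching Fortran's I5, and the ternary i%10?"":"\n" prints a newline after every tenth value, which reproduces the 10I5 row wrapping.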