How to compare dates with np.nanmin() [duplicate]

How can I reference the minimum value of two dataframes as part of a pandas dataframe equation? I tried using the Python min() function, which did not work. I'm sorry if this is well-documented somewhere, but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() * Cp * (data[' Thi'] - data[' Tci'])
I also tried pandas' min() function, which did not work either.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name; I wasn't sure where the index comes into play.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
        'flow_d': [np.random.randint(100) for _ in range(rows)],
        'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50

If you are trying to get the row-wise minimum of two or more columns, use pandas.DataFrame.min. Note that min defaults to axis=0 (column-wise), so specifying axis=1 is necessary for a row-wise result.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50

If you want a single minimum value across multiple columns:
data[['flow_h','flow_c']].min().min()
the first "min()" calculates the minimum per column and returns a pandas series. The second "min" returns the minimum of the minimums per column.

Related

How to visualize multi-indexed series into a heatmap in pandas?

I am trying to turn this kind of a series:
Animal Idol
50 60 15
55 14
81 14
80 13
56 11
53 10
58 9
57 9
50 9
59 6
52 6
61 1
52 52 64
58 28
55 21
81 17
60 16
50 16
56 15
80 12
61 10
59 10
53 9
57 4
53 53 27
56 14
58 10
50 9
80 8
52 6
55 6
61 5
81 5
60 4
57 4
59 3
Into something looking more like this:
Animal/Idol 60 55 81 80 ...
50 15 14 14 13
52 16 21 17 12
53 4 6 5 8
...
My basis for the series here is actually a data frame that looks like this (the unnamed values in the series are counts of how many times a given animal/idol pair repeats, and there are many idols for each animal):
Animal Idol
1058 50 50
1061 50 50
1197 50 50
1357 50 50
1637 50 50
... ... ...
2780 81 81
2913 81 81
2915 81 81
3238 81 81
3324 81 81
Sadly, I have no clue how to convert either of these two into the desired form. I guess the proper name for it is a pivot table, but I could not get a good result using pivot tables. How would you transform either of these into the form I need? I would also like to know how to visualize this kind of pivot table (if that's a good name) as a heat map, where the colour of each cell differs based on its value (the higher the value, the deeper the colour). Thanks in advance!
I think you are looking for .unstack() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html) to unstack the data.
To visualize it you can use multiple tools. I like using holoviews (https://holoviews.org/);
hv.Image can plot a 2D array, so you can use hv.Image(df.unstack().values) to do that.
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.random.randint(0, 100, 100)},
                  index=pd.MultiIndex.from_tuples([(i, j) for i in range(10) for j in range(10)]))
df
unstack:
df_unstacked = df.unstack()
df_unstacked
plot:
import holoviews as hv
hv.Image(df_unstacked.values)
or to plot with matplotlib:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(df_unstacked.values)
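To answer the heat-map part of the question more fully, the plot can be given a colour bar and tick labels taken from the unstacked frame. A sketch that continues from df_unstacked above (the use of imshow and this particular labelling are additions, not part of the original answer):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(df_unstacked.values)

# label the ticks with the actual index/column values
ax.set_xticks(range(df_unstacked.shape[1]))
ax.set_xticklabels(df_unstacked.columns.get_level_values(-1))
ax.set_yticks(range(df_unstacked.shape[0]))
ax.set_yticklabels(df_unstacked.index)

fig.colorbar(im, ax=ax)   # higher value -> deeper colour
plt.show()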

Reverse the order of the rows by chunks of n rows

Consider the following sequence:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
which produces:
A B C D
0 56 83 99 46
1 40 70 22 51
2 70 9 78 33
3 65 72 79 87
4 0 6 22 73
.. .. .. .. ..
95 35 76 62 97
96 86 85 50 65
97 15 79 82 62
98 21 20 19 32
99 21 0 51 89
I can reverse the sequence with the following command:
df.iloc[::-1]
That gives me the following result:
A B C D
99 21 0 51 89
98 21 20 19 32
97 15 79 82 62
96 86 85 50 65
95 35 76 62 97
.. .. .. .. ..
4 0 6 22 73
3 65 72 79 87
2 70 9 78 33
1 40 70 22 51
0 56 83 99 46
How would I rewrite the code if I wanted to reverse the sequence every nth row, e.g. every 4th row?
IIUC, you want to reverse by chunk (3, 2, 1, 0, 7, 6, 5, 4, …):
One option is to use groupby with a custom group:
N = 4
group = df.index // N
# if the index is not a simple 0..n-1 range, use instead:
# import numpy as np
# group = np.arange(len(df)) // N
df.groupby(group).apply(lambda d: d.iloc[::-1]).droplevel(0)
output:
A B C D
3 45 33 73 77
2 91 34 19 68
1 12 25 55 19
0 65 48 17 4
7 99 99 95 9
.. .. .. .. ..
92 89 68 48 67
99 99 28 52 87
98 47 49 21 8
97 80 18 92 5
96 49 12 24 40
[100 rows x 4 columns]
A very fast method, based only on indexing, is to use numpy to generate a list of the indices reversed by chunk:
import numpy as np
N = 4
idx = np.arange(len(df)).reshape(-1, N)[:, ::-1].ravel()
# array([ 3, 2, 1, 0, 7, 6, 5, 4, 11, ...])
# slice using iloc
df.iloc[idx]
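Note that reshape(-1, N) only works when the number of rows is an exact multiple of N. A sketch of one way to handle a leftover partial chunk (the variable names here are illustrative, not from the original answer):
import numpy as np

N = 4
n = len(df)
full = (n // N) * N                                     # rows that form complete chunks
idx = np.arange(full).reshape(-1, N)[:, ::-1].ravel()   # reverse each complete chunk
tail = np.arange(full, n)[::-1]                         # reverse the leftover rows, if any
df.iloc[np.concatenate([idx, tail])]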

SQL I need the highest number from column + count duplicate values

I'm looking for a query that gives a list of the RepairCost for each BikeNumber,
but duplicate BikeNumbers have to be summed together. So BikeNumber 18 costs a total of 22 + 58 = 80:
Id RepairCost BikeNumber
16 82 23
88 51 20
12 20 19
33 22 18
40 58 18
69 41 17
10 2 16
66 35 15
If I understand the question, the query is pretty simple:
SELECT BikeNumber, SUM(RepairCost)
FROM YourTable
GROUP BY BikeNumber
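For comparison, here is a sketch of the same aggregation in pandas, assuming the table has been loaded into a DataFrame (the construction below simply mirrors the sample rows in the question):
import pandas as pd

df = pd.DataFrame({'Id': [16, 88, 12, 33, 40, 69, 10, 66],
                   'RepairCost': [82, 51, 20, 22, 58, 41, 2, 35],
                   'BikeNumber': [23, 20, 19, 18, 18, 17, 16, 15]})

# equivalent of SELECT BikeNumber, SUM(RepairCost) ... GROUP BY BikeNumber
totals = df.groupby('BikeNumber', as_index=False)['RepairCost'].sum()
print(totals)   # BikeNumber 18 comes out as 80 (22 + 58)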

Strange results with VAR and STDEV

This
SELECT
AVG(s.Amount/100)[Avg],
STDEV(s.Amount/100) [StDev],
VAR(s.Amount/100) [Var]
Returns this:
Avg StDev Var
133 550.82021581146 303402.910146583
Statistics aren't my strongest suit, but how is it possible that the standard deviation and variance are larger than the average? Not only that, the variance is almost 100x larger than the largest sample in the set.
Here is the entire sample set, with the above replaced with
SELECT s.Amount/100
while the rest of the query is identical
Amount
4645
3182
422
377
359
298
278
242
230
213
182
180
174
166
150
130
116
113
109
107
102
96
84
78
78
76
66
64
61
60
60
60
59
59
56
49
46
41
41
39
38
36
29
27
26
25
25
25
24
24
24
22
22
22
20
20
19
19
19
19
19
18
17
17
17
16
14
13
12
12
12
11
11
10
10
10
10
9
9
9
8
8
8
7
7
6
6
6
3
3
3
3
2
2
2
2
2
1
1
1
1
1
1
You need to read a book on statistics, or at least start with the Wikipedia pages that describe the concepts.
The standard deviation and variance are closely related: the variance is simply the square of the standard deviation. You can check that this is true of your numbers.
There is not really a relationship between the standard deviation and the average. The standard deviation is measuring the dispersal of the data around the average. The data can be arbitrarily dispersed around an average.
You might be confused because there are estimates on standard deviation/standard error when you assume a particular distribution of the data. However, those estimates are about the distribution and not about the data.
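A quick check in plain Python, using the figures reported above, confirms the square relationship:
import math

stdev = 550.82021581146
var = 303402.910146583

print(stdev ** 2)      # ~303402.91 -> matches the reported variance
print(math.sqrt(var))  # ~550.82    -> matches the reported standard deviation

# Nothing forces the standard deviation to stay below the average: the few very
# large values in the set (4645, 3182, ...) pull the dispersion far above the mean.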

loading np array very slow

New to python (very cool), first question. I am reading a 50+ MB ASCII file, scanning for property tags and parsing the data into a numpy array. I have placed timing reports throughout the loop and found the culprit: the while loop using np.append(). Wondering if there is a faster method.
This is a sample input file format with fake data for debugging:
...
tag parameter
char name "Poro"
array float data 100
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 56 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100
endtag
...
and this is the code fragment; the while loop near the end is what takes 70 seconds for a 350k-element array:
def readParameter(self, parameterName):
    startTime = time.time()
    intervalTime = time.time()
    token = "tag parameter"
    self.inputBuffer.seek(0)
    for lineno, line in enumerate(self.inputBuffer, 1):
        if token in line:
            line = self.inputBuffer.next().replace('"', '').split()
            elapsedTime = time.time() - intervalTime
            logging.debug(" Time to readParameter find token: " + str(elapsedTime))
            intervalTime = time.time()
            if line[2] == parameterName:
                line = self.inputBuffer.next()
                line = self.inputBuffer.next()
                np.parameterArray = np.fromstring(line, dtype=float, sep=" ")
                line = self.inputBuffer.next()
                while not "endtag" in line:
                    np.parameterArray = np.append(np.parameterArray, np.fromstring(line, dtype=float, sep=" "))
                    line = self.inputBuffer.next()
                elapsedTime = time.time() - startTime
                logging.debug(" Time to readParameter load array: " + str(elapsedTime))
                break
    elapsedTime = time.time() - startTime
    logging.debug(" Time to readParameter: " + str(elapsedTime))
    logging.debug(np.parameterArray)
    np.parameterArray = self.make3D(np.parameterArray)
    return np.parameterArray
Thanks, Jeff
Appending to an array requires resizing the array, which usually means allocating a new block of memory big enough to hold the new array, copying the existing array to the new location, and freeing the memory it used to use. All of those operations are expensive, and you're doing them for every append. With 350k elements, it's basically a memory-allocation and fragmentation stress test.
Pre-allocate your array. You've got the count parameter, so make an array that size, and inside your loop, just assign the newly-parsed element to the next spot in the array, instead of appending it. You'll have to keep your own counter of how many elements have been filled. (You could instead iterate over the elements of the blank array and replace them, but that would make error handling a bit trickier to add in.)
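A minimal sketch of the pre-allocation idea (the helper name read_block and the assumption that the count comes from the 'array float data <count>' header line are illustrative, not the original code):
import numpy as np

def read_block(lines, count):
    # `lines` is assumed to yield the whitespace-separated data lines up to 'endtag';
    # `count` is the element count declared on the 'array float data <count>' line.
    values = np.empty(count, dtype=float)   # allocate once, up front
    filled = 0                              # next free slot in the array
    for line in lines:
        if "endtag" in line:
            break
        chunk = np.fromstring(line, dtype=float, sep=" ")
        values[filled:filled + len(chunk)] = chunk   # copy into the pre-allocated block
        filled += len(chunk)
    return values[:filled]   # trim in case the file held fewer values than declared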