I am trying to turn this kind of series:
Animal Idol
50 60 15
55 14
81 14
80 13
56 11
53 10
58 9
57 9
50 9
59 6
52 6
61 1
52 52 64
58 28
55 21
81 17
60 16
50 16
56 15
80 12
61 10
59 10
53 9
57 4
53 53 27
56 14
58 10
50 9
80 8
52 6
55 6
61 5
81 5
60 4
57 4
59 3
Into something looking more like this:
Animal/Idol 60 55 81 80 ...
50 15 14 14 13
52 16 21 17 12
53 4 6 5 8
...
The series above is actually derived from a data frame that looks like this (the unnamed values in the series are counts of how many times each animal/idol pair repeats, and there are many idols for each animal):
Animal Idol
1058 50 50
1061 50 50
1197 50 50
1357 50 50
1637 50 50
... ... ...
2780 81 81
2913 81 81
2915 81 81
3238 81 81
3324 81 81
Sadly, I have no clue how to convert either of these two into the desired form. I guess the proper name for it is a pivot table, but I could not get the right result using one. How would you transform either of these into the form I need? I would also like to know how to visualize this kind of pivot table (if that's the right name) as a heat map, where the colour of each cell differs based on its value (the higher the value, the deeper the colour). Thanks in advance!
I think you are looking for .unstack() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html) to move the inner index level into columns.
To visualize, you can use multiple tools. I like using holoviews (https://holoviews.org/);
hv.Image should be able to plot a 2D array, so you can use hv.Image(df.unstack().values).
Example:
df = pd.DataFrame({'data': np.random.randint(0, 100, 100)}, index=pd.MultiIndex.from_tuples([(i, j) for i in range(10) for j in range(10)]))
df
unstack:
df_unstacked = df.unstack()
df_unstacked
plot:
import holoviews as hv
hv.Image(df_unstacked.values)
or to plot with matplotlib:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
im = ax.imshow(df_unstacked.values)
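For the original Animal/Idol data specifically, pd.crosstab can build the counts table in one step. A sketch, assuming the raw frame has one row per observed (Animal, Idol) pair as in the question:

```python
import pandas as pd

# Sketch assuming a raw frame with one row per observed (Animal, Idol)
# pair, mirroring the question's base data frame.
df = pd.DataFrame({'Animal': [50, 50, 50, 52, 52, 81],
                   'Idol':   [60, 60, 55, 52, 58, 81]})

# Count each (Animal, Idol) pair and spread Idol into columns; absent
# pairs become 0, which is usually what you want for a heat map.
table = pd.crosstab(df['Animal'], df['Idol'])
print(table)
```

table.values can then go straight into hv.Image or ax.imshow for the heat map.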
Related
Consider the following sequence:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
which produces:
A B C D
0 56 83 99 46
1 40 70 22 51
2 70 9 78 33
3 65 72 79 87
4 0 6 22 73
.. .. .. .. ..
95 35 76 62 97
96 86 85 50 65
97 15 79 82 62
98 21 20 19 32
99 21 0 51 89
I can reverse the sequence with the following command:
df.iloc[::-1]
That gives me the following result:
A B C D
99 21 0 51 89
98 21 20 19 32
97 15 79 82 62
96 86 85 50 65
95 35 76 62 97
.. .. .. .. ..
4 0 6 22 73
3 65 72 79 87
2 70 9 78 33
1 40 70 22 51
0 56 83 99 46
How would I rewrite the code if I wanted to reverse the sequence every nth row, e.g. every 4th row?
IIUC, you want to reverse by chunk (3, 2, 1, 0, 7, 6, 5, 4, …):
One option is to use groupby with a custom group:
N = 4
group = df.index//N
# if the index is not a linear range
# import numpy as np
# np.arange(len(df))//N
df.groupby(group).apply(lambda d: d.iloc[::-1]).droplevel(0)
output:
A B C D
3 45 33 73 77
2 91 34 19 68
1 12 25 55 19
0 65 48 17 4
7 99 99 95 9
.. .. .. .. ..
92 89 68 48 67
99 99 28 52 87
98 47 49 21 8
97 80 18 92 5
96 49 12 24 40
[100 rows x 4 columns]
A very fast method, based only on indexing, is to use numpy to generate a list of the indices reversed by chunk:
import numpy as np
N = 4
idx = np.arange(len(df)).reshape(-1, N)[:, ::-1].ravel()
# array([ 3, 2, 1, 0, 7, 6, 5, 4, 11, ...])
# slice using iloc
df.iloc[idx]
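One caveat: the reshape only works when len(df) is an exact multiple of N. A sketch that also handles a leftover partial chunk (the helper name is my own):

```python
import numpy as np
import pandas as pd

def chunk_reversed_index(n_rows, chunk):
    """Positions 0..n_rows-1 reversed within each chunk; the final
    partial chunk (if any) is reversed on its own."""
    idx = np.arange(n_rows)
    full = (n_rows // chunk) * chunk          # rows covered by full chunks
    head = idx[:full].reshape(-1, chunk)[:, ::-1].ravel()
    tail = idx[full:][::-1]                   # leftover partial chunk
    return np.concatenate([head, tail])

df = pd.DataFrame({'A': range(10)})
out = df.iloc[chunk_reversed_index(len(df), 4)]
print(out['A'].tolist())  # [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```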
How can I reference the minimum of two columns as part of a pandas dataframe equation? I tried using the Python min() function, which did not work. I'm sorry if this is well documented somewhere, but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() *Cp* (data[' Thi'] - data[' Tci'])
I also tried to use pandas min() function, which is also not working.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name, I wasn't sure where the index comes into play.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
'flow_d': [np.random.randint(100) for _ in range(rows)],
'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
If you are trying to get the row-wise minimum of two or more columns, use pandas.DataFrame.min. Note that by default axis=0; specifying axis=1 is necessary.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50
If you want a single minimum value across multiple columns:
data[['flow_h','flow_c']].min().min()
The first min() calculates the minimum per column and returns a pandas Series. The second min() returns the minimum of those per-column minimums.
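For exactly two columns, numpy.minimum is an element-wise alternative that can also be written inline in an equation. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

np.random.seed(365)
rows = 10
data = pd.DataFrame({'flow_c': np.random.randint(0, 100, rows),
                     'flow_h': np.random.randint(0, 100, rows)})

# Element-wise minimum of the two columns; for two columns this is
# equivalent to data[['flow_h', 'flow_c']].min(axis=1).
data['min_c_h'] = np.minimum(data['flow_h'], data['flow_c'])
```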
I have a data frame, df, which looks like this:
index New Old Map Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to produce a data frame that looks like this:
Limit <=18 >18
total mean total mean
New xx1 yy1 aa1 bb1
Old xx2 yy2 aa2 bb2
MAP xx3 yy3 aa3 bb3
I tried this without success:
df.groupby('Limit')['New', 'Old', 'MAP'].[sum(), mean()].T
How can I achieve this in pandas?
You can use groupby with agg, then transpose by T and unstack:
print (df[['New', 'Old', 'Map', 'Limit']].groupby('Limit').agg(['sum', 'mean']).T.unstack())
Limit <= 18 > 18
sum mean sum mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
As suggested in a comment, this looks nicer:
print (df.groupby('Limit')[['New', 'Old', 'Map']].agg(['sum', 'mean']).T.unstack())
And if you need the columns labelled total and mean:
print (df.groupby('Limit')[['New', 'Old', 'Map']]
       .agg([('total', 'sum'), ('mean', 'mean')])
       .T
       .unstack())
Limit <= 18 > 18
total mean total mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
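Note that on recent pandas versions the dict-renaming form of agg raises a SpecificationError; passing (name, function) pairs is the supported replacement. A runnable sketch built from the question's rows, which reproduces the table above:

```python
import pandas as pd

# The fifteen rows from the question.
df = pd.DataFrame({
    'New': [93, 163, 134, 117, 194, 125, 66, 120, 150, 149, 148, 92, 64, 84, 99],
    'Old': [35, 93, 78, 81, 108, 57, 39, 83, 98, 99, 85, 55, 24, 53, 70],
    'Map': [54, 116, 96, 93, 136, 79, 48, 95, 115, 115, 106, 67, 37, 63, 79],
    'Limit': ['> 18'] * 5 + ['<= 18'] + ['> 18'] * 5 + ['<= 18'] + ['> 18'] * 3,
})

# (name, function) pairs label the aggregates 'total' and 'mean';
# transposing and unstacking puts Limit on top of the column index.
out = (df.groupby('Limit')[['New', 'Old', 'Map']]
         .agg([('total', 'sum'), ('mean', 'mean')])
         .T
         .unstack())
print(out)
```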
I have the data stored in columns that I need to change to rows. The transpose method is not working as expected.
reg_no st_name six seven eight nine ten
number
1 1200210606 DORIN 18 28 98 78 58
2 1001200049 PRAMA 79 69 59 19 29
3 1205210026 PILLA 47 57 67 27 17
4 1205210064 SUSAT 16 16 66 76 86
5 10002100113 PAVITH 15 85 75 65 15
The expected results will look something like this.
1 1200210606 DORIN six 18
1 1200210606 DORIN seven 28
1 1200210606 DORIN eight 98
1 1200210606 DORIN nine 78
1 1200210606 DORIN ten 58
2 1001200049 PRAMA six 79
2 1001200049 PRAMA seven 69
2 1001200049 PRAMA eight 59
2 1001200049 PRAMA nine 19
2 1001200049 PRAMA ten 29
3 1205210026 PILLA six 47
3 1205210026 PILLA seven 57
3 1205210026 PILLA eight 67
3 1205210026 PILLA nine 27
3 1205210026 PILLA ten 17
4 1205210064 SUSAT six 16
4 1205210064 SUSAT seven 16
4 1205210064 SUSAT eight 66
4 1205210064 SUSAT nine 76
4 1205210064 SUSAT ten 86
5 10002100113 PAVITH six 15
5 10002100113 PAVITH seven 85
5 10002100113 PAVITH eight 75
5 10002100113 PAVITH nine 65
5 10002100113 PAVITH ten 15
You are trying to convert wide format to long format.
Use the melt function for that.
ref : http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
import pandas as pd
df.reset_index().melt(id_vars=['number', 'reg_no', 'st_name'])  # df is your dataframe object
You can use sort_values() after melting to get the exact order.
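A runnable sketch with the first two rows of the question's data (note that number is the index, so it has to be reset before melting):

```python
import pandas as pd

# First two rows of the question's data; 'number' is the index.
df = pd.DataFrame({'reg_no': [1200210606, 1001200049],
                   'st_name': ['DORIN', 'PRAMA'],
                   'six': [18, 79], 'seven': [28, 69], 'eight': [98, 59],
                   'nine': [78, 19], 'ten': [58, 29]},
                  index=pd.Index([1, 2], name='number'))

# reset_index keeps 'number' as an identifier column; the stable sort
# groups the rows per student while preserving the six..ten order.
long = (df.reset_index()
          .melt(id_vars=['number', 'reg_no', 'st_name'])
          .sort_values('number', kind='stable'))
print(long)
```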
This
SELECT
AVG(s.Amount/100)[Avg],
STDEV(s.Amount/100) [StDev],
VAR(s.Amount/100) [Var]
Returns this:
Avg StDev Var
133 550.82021581146 303402.910146583
Statistics aren't my strongest suit, but how is it possible that the standard deviation and variance are larger than the average? Not only that, the variance is almost 100x larger than the largest sample in the set.
Here is the entire sample set, with the above replaced with
SELECT s.Amount/100
while the rest of the query is identical
Amount
4645
3182
422
377
359
298
278
242
230
213
182
180
174
166
150
130
116
113
109
107
102
96
84
78
78
76
66
64
61
60
60
60
59
59
56
49
46
41
41
39
38
36
29
27
26
25
25
25
24
24
24
22
22
22
20
20
19
19
19
19
19
18
17
17
17
16
14
13
12
12
12
11
11
10
10
10
10
9
9
9
8
8
8
7
7
6
6
6
3
3
3
3
2
2
2
2
2
1
1
1
1
1
1
You need to read a book on statistics, or at least start with the Wikipedia pages that describe the concepts.
The standard deviation and variance are closely related: the variance is exactly the square of the standard deviation. You can check this against your numbers: the StDev value above, squared, gives the Var value.
There is not really a relationship between the standard deviation and the average. The standard deviation is measuring the dispersal of the data around the average. The data can be arbitrarily dispersed around an average.
You might be confused because there are estimates on standard deviation/standard error when you assume a particular distribution of the data. However, those estimates are about the distribution and not about the data.
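Both points, that variance is the square of the standard deviation, and that skewed data can have a standard deviation well above its mean, are easy to check numerically. A Python sketch with made-up numbers shaped like the Amount data:

```python
import statistics

# A small skewed sample: a few large values inflate the spread far
# beyond the mean, just as in the Amount data above.
sample = [4645, 3182, 422, 100, 50, 25, 10, 5, 2, 1]

mean = statistics.mean(sample)
stdev = statistics.stdev(sample)    # sample standard deviation, like SQL STDEV
var = statistics.variance(sample)   # sample variance, like SQL VAR

# Variance is exactly the square of the standard deviation...
assert abs(var - stdev ** 2) < 1e-6
# ...and this heavily skewed sample has a stdev well above its mean.
print(mean, stdev, var)
```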