I am trying to better understand hstack, vstack, and dstack in NumPy.
import numpy as np
a = np.arange(96).reshape(2,4,4,3)
print(a)
print("dimensions of a:", np.ndim(a))
print("Shape of a:", a.shape)
b = np.arange(201,225).reshape(2,4,3)
print(b)
print("Shape of b:", b.shape)
c = np.arange(101,133).reshape(2,4,4)
print(c)
print("dimensions of c:", np.ndim(c))
print("Shape of c:", c.shape)
a is:
[[[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
[[12 13 14]
[15 16 17]
[18 19 20]
[21 22 23]]
[[24 25 26]
[27 28 29]
[30 31 32]
[33 34 35]]
[[36 37 38]
[39 40 41]
[42 43 44]
[45 46 47]]]
[[[48 49 50]
[51 52 53]
[54 55 56]
[57 58 59]]
[[60 61 62]
[63 64 65]
[66 67 68]
[69 70 71]]
[[72 73 74]
[75 76 77]
[78 79 80]
[81 82 83]]
[[84 85 86]
[87 88 89]
[90 91 92]
[93 94 95]]]]
and c is:
[[[101 102 103 104]
[105 106 107 108]
[109 110 111 112]
[113 114 115 116]]
[[117 118 119 120]
[121 122 123 124]
[125 126 127 128]
[129 130 131 132]]]
and b is:
[[[201 202 203]
[204 205 206]
[207 208 209]
[210 211 212]]
[[213 214 215]
[216 217 218]
[219 220 221]
[222 223 224]]]
How do I reshape c so that I can use hstack correctly: I wish to add one column for each row in each of the dimensions.
How do I reshape b so that I can use vstack correctly: I wish to add one row for each column in each of the dimensions.
I would like to come up with a general rule on the dimensions to check for the array that needs to be added to an existing array.
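To an array a of shape (2,4,4,3) you can concatenate an array of shape:
(1,4,4,3) with axis=0
(2,1,4,3) with axis=1
(2,4,1,3) with axis=2
(2,4,4,1) with axis=3
The size along the concatenation axis can be anything, not just 1; all the other dimensions must match.
Read, and reread as needed, the np.concatenate docs.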
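As a general rule to check before concatenating, the two arrays need the same number of dimensions and the same size on every axis except the one being joined. A minimal sketch of that check, where valid_concat_axes is a hypothetical helper name (not a NumPy function) and a, b, c are the arrays from the question:

```python
import numpy as np

def valid_concat_axes(base_shape, other_shape):
    """Return the axes along which np.concatenate would accept both shapes.

    Hypothetical helper for illustration; not part of NumPy.
    """
    if len(base_shape) != len(other_shape):
        return []  # concatenate requires the same number of dimensions
    axes = []
    for axis in range(len(base_shape)):
        # every dimension except the joined one has to match exactly
        others_match = all(
            m == n
            for i, (m, n) in enumerate(zip(base_shape, other_shape))
            if i != axis
        )
        if others_match:
            axes.append(axis)
    return axes

a = np.arange(96).reshape(2, 4, 4, 3)
b = np.arange(201, 225).reshape(2, 4, 3)
c = np.arange(101, 133).reshape(2, 4, 4)

print(valid_concat_axes(a.shape, c.shape))                 # [] (3-D vs 4-D)
print(valid_concat_axes(a.shape, c[..., None].shape))      # [3]
print(valid_concat_axes(a.shape, b[:, :, None, :].shape))  # [2]
```

An empty list means the second array has to be reshaped (for example by adding a length-1 axis) before it can be joined.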
edit
In previous post(s) I've summarized the code of hstack and vstack, though you can easily read it via the [source] link in the official docs.
When should I use hstack/vstack vs append vs concatenate vs column_stack?
hstack makes sure all arguments are atleast_1d and does a concatenate on axis 0 (for 1-D inputs) or axis 1 (otherwise). vstack makes sure all are atleast_2d, and does a concatenate on axis 0.
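The equivalences can be checked directly on small arrays. A minimal sketch, with x and m as made-up example inputs:

```python
import numpy as np

x = np.arange(3)                 # 1-D, shape (3,)
m = np.arange(6).reshape(2, 3)   # 2-D, shape (2, 3)

# hstack: 1-D inputs are joined along axis 0, anything else along axis 1
assert np.array_equal(np.hstack((x, x)), np.concatenate((x, x), axis=0))
assert np.array_equal(np.hstack((m, m)), np.concatenate((m, m), axis=1))

# vstack: inputs are promoted to at least 2-D, then joined along axis 0
x2 = np.atleast_2d(x)            # shape (1, 3)
assert np.array_equal(np.vstack((x, x)), np.concatenate((x2, x2), axis=0))
assert np.array_equal(np.vstack((m, m)), np.concatenate((m, m), axis=0))

h = np.hstack((m, m))
v = np.vstack((m, m))
print(h.shape, v.shape)          # (2, 6) (4, 3)
```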
Maybe I should have insisted on seeing your attempts and any errors (and attempts to understand the errors).
For adding c to a:
In [58]: np.hstack((a,c))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [58], in <cell line: 1>()
----> 1 np.hstack((a,c))
File <__array_function__ internals>:5, in hstack(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\core\shape_base.py:345, in hstack(tup)
343 return _nx.concatenate(arrs, 0)
344 else:
--> 345 return _nx.concatenate(arrs, 1)
File <__array_function__ internals>:5, in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 3 dimension(s)
Notice, the error was raised by concatenate, and focuses on the number of dimensions - 4d and 3d. The hstack wrapper did not change inputs at all.
If I add a trailing dimension to c, I get:
In [62]: c[...,None].shape
Out[62]: (2, 4, 4, 1)
In [63]: np.concatenate((a, c[...,None]),axis=3).shape
Out[63]: (2, 4, 4, 4)
Similarly for b:
In [64]: np.concatenate((a, b[...,None,:]),axis=2).shape
Out[64]: (2, 4, 5, 3)
The hstack/vstack docs specify concatenation along the 2nd and 1st axis, respectively. But you want to use axis 2 or 3, so those 'stack' functions don't apply, do they?
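Putting both cases together, here is a minimal sketch using the a, b and c from the question. np.expand_dims is used as an equivalent spelling of the None indexing above; the names b4 and c4 are only illustrative. Since dstack concatenates along axis 2 once its inputs are at least 3-D, it gives the same result as the axis=2 case, but there is no stack helper for axis 3.

```python
import numpy as np

a = np.arange(96).reshape(2, 4, 4, 3)
b = np.arange(201, 225).reshape(2, 4, 3)
c = np.arange(101, 133).reshape(2, 4, 4)

# give b and c a length-1 axis where the new row / column should go
b4 = np.expand_dims(b, axis=2)   # (2, 4, 1, 3), same as b[:, :, None, :]
c4 = np.expand_dims(c, axis=3)   # (2, 4, 4, 1), same as c[..., None]

with_b = np.concatenate((a, b4), axis=2)   # one extra row per 4x3 block
with_c = np.concatenate((a, c4), axis=3)   # one extra column per row

# dstack joins along axis 2, so it matches the axis=2 concatenate
with_b_dstack = np.dstack((a, b4))

print(with_b.shape)   # (2, 4, 5, 3)
print(with_c.shape)   # (2, 4, 4, 4)
print(with_c[0, 0, 0])  # [  0   1   2 101]
```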
I'm pretty new to NumPy arrays, and was not able to find a good explanation / example for my issue. I saw things like take() or take_along_axis() but I didn't understand what was going on...
I have this 2D NumPy array, which may contain N sub-arrays of 5 values each (h, s, i, x, y):
values = np.array([
[1,2,3,4,5],
[1,22,33,44,55],
[1,22,333,444,555],
[1,22,333,4444,5555],
[1,222,33,44,55],
[1,222,330,440,550],
[10,20,30,40,50],
[100,200,300,400,500],
])
As you can see, the same value can be repeated at the same index in several sub-arrays.
I want to regroup the sub-arrays by these index values, like this:
1  2  3  4  5
   22  33  44  55
       333  444  555
            4444  5555
   222  33  44  55
        330  440  550
10  20  30  40  50
100  200  300  400  500
The goal is to obtain a regular array like:
array = [1, 2, 3, 4 , 5, 22, 33, 44, 55, 333, 444, 555, 4444, 5555, 222, 33, 44, 55, 330, 440, 550, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500]
Thank you very much for your support.
You can use the flatten method:
list(values.flatten())
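Note that flatten returns all 40 elements, including the repeated leading values, so on its own it does not produce the 31-element array shown in the question. Below is a minimal sketch of one way to get that array. It assumes the intended rule is to drop the leading values a row shares with the row before it; that rule is inferred from the desired output, and drop_shared_prefix is a hypothetical helper name.

```python
import numpy as np

values = np.array([
    [1, 2, 3, 4, 5],
    [1, 22, 33, 44, 55],
    [1, 22, 333, 444, 555],
    [1, 22, 333, 4444, 5555],
    [1, 222, 33, 44, 55],
    [1, 222, 330, 440, 550],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
])

def drop_shared_prefix(rows):
    """Flatten rows, skipping the leading values each row shares with the previous row.

    Assumed rule, inferred from the desired output in the question.
    """
    out = list(rows[0])
    for prev, cur in zip(rows[:-1], rows[1:]):
        differs = prev != cur
        # first column where the two rows differ; an identical row adds nothing
        start = int(np.argmax(differs)) if differs.any() else len(cur)
        out.extend(cur[start:])
    return [int(v) for v in out]

result = drop_shared_prefix(values)
print(len(values.flatten()), len(result))  # 40 31
```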
I'd like to plot a trellis stacked bar chart in Altair, like the one in the Trellis Stacked Bar Chart example.
I have this dataset:
pd.DataFrame({
'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
'read': [3, 0, 0, 114, 27, 82],
'write': [70, 0, 0, 45, 655, 203],
'read-write': [313, 322, 45, 89, 90, 12]
})
storage project read write read-write
0 dev01 omega 3 70 313
1 dev01 alpha 0 0 322
2 dev01 beta 0 0 45
3 dev02 omega 114 45 89
4 dev02 beta 27 655 90
5 dev03 alpha 82 203 12
What I can't figure out is how to specify the read, write, read-write columns as the colors / values for Altair.
Your data is wide-form, and must be converted to long-form to be used in Altair encodings. See Long-Form vs. Wide-Form Data in Altair's documentation for more information.
This can be addressed by modifying the input data in Pandas using pd.melt, but it is often more convenient to use Altair's Fold Transform to do this reshaping within the chart specification. For example:
import pandas as pd
import altair as alt
df = pd.DataFrame({
'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
'read': [3, 0, 0, 114, 27, 82],
'write': [70, 0, 0, 45, 655, 203],
'read-write': [313, 322, 45, 89, 90, 12]
})
alt.Chart(df).transform_fold(
['read', 'write', 'read-write'],
as_=['mode', 'value']
).mark_bar().encode(
x='value:Q',
y='project:N',
column='storage:N',
color='mode:N'
).properties(
width=200
)
You need to melt your desired columns into a new column:
# assuming your DataFrame is assigned to `df`
cols_to_melt = ['read', 'write', 'read-write']
cols_to_keep = df.columns.difference(cols_to_melt)
df = df.melt(id_vars=cols_to_keep, value_vars=cols_to_melt, var_name='mode')
So you get the following:
project storage mode value
0 omega dev01 read 3
1 alpha dev01 read 0
2 beta dev01 read 0
3 omega dev02 read 114
4 beta dev02 read 27
5 alpha dev03 read 82
6 omega dev01 write 70
7 alpha dev01 write 0
8 beta dev01 write 0
9 omega dev02 write 45
10 beta dev02 write 655
11 alpha dev03 write 203
12 omega dev01 read-write 313
13 alpha dev01 read-write 322
14 beta dev01 read-write 45
15 omega dev02 read-write 89
16 beta dev02 read-write 90
17 alpha dev03 read-write 12
Then, in the Altair snippet from the example, use color='mode' instead of color='site'.
I have a Pandas dataframe that I need to sort by the data columns' maximum values. I am having trouble performing the sort because all of the sorting examples that I have found operate on all of the columns in the dataframe when performing the sort. In this case I need to sort only a subset of the columns. The first column contains a date, and the remaining 90 columns contain data. The 90 data columns are currently sorted alphabetically by their column name. I would like to sort them in decreasing order of their maximum value, which happens to be in the last row.
In the bigger scheme of things, this question is about how to perform sorting on a range of columns within a dataframe, rather than sorting all of the columns in the dataframe. There may be cases, for example, where I need to sort only columns 2 through 12 of a dataframe, while leaving the remaining columns in their existing order.
Here is a sample of the unsorted dataframe:
df.tail()
Date ADAMS ALLEN BARTHOLOMEW BENTON BLACKFORD BOONE BROWN ... WABASH WARREN WARRICK WASHINGTON WAYNE WELLS WHITE WHITLEY
65 2020-05-10 8 828 356 13 14 227 28 ... 64 12 123 48 53 11 149 22
66 2020-05-11 8 860 367 16 14 235 28 ... 67 12 126 48 56 12 161 23
67 2020-05-12 8 872 371 17 14 235 28 ... 67 12 131 49 56 12 162 23
68 2020-05-13 9 897 382 17 14 249 29 ... 68 12 140 50 58 13 164 27
69 2020-05-14 9 955 394 21 14 252 29 ... 69 12 145 50 60 15 164 28
I would like to perform the sort so that the column with the largest value in row 69 is placed right after df['Date'], with the columns ordered so that the values in row 69 decrease from left to right. Once that is done, I'd like to create a series containing the column headers, to generate a rank list. Using the visible columns as an example, the desired list would be:
rank_list=[ "ALLEN", "BARTHOLOMEW", "BOONE", "WHITE", "WARRICK", ... "BLACKFORD", "WARREN", "ADAMS" ]
My biggest hurdle at present is that when I perform the sort I'm not able to exclude the Date column, and I'm receiving a type error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
I am new to Pandas, so I apologize if there is a solution to this problem that should be obvious. Thanks.
You can do it with sort_values, once you have selected the right row and the range of columns:
import numpy as np
import pandas as pd
# data sample
np.random.seed(86)
df = pd.DataFrame({'date':pd.date_range('2020-05-15', periods=5),
'a': np.random.randint(0,50, 5),
'b': np.random.randint(0,50, 5),
'c': np.random.randint(0,50, 5),
'd': np.random.randint(0,50, 5)})
# parameters
start_idx = 1 # note: indexing starts at 0, so 1 is the second column
end_idx = df.shape[1] # one past the last column
row_position = df.shape[0]-1 # the last row
# create the new order: sort the chosen row's values, keep the resulting column labels
new_col_order = df.columns.tolist()
new_col_order[start_idx:end_idx] = df.iloc[row_position, start_idx:end_idx]\
.sort_values(ascending=False).index
# reorder
df = df[new_col_order]
print(df)
date c a d b
0 2020-05-15 30 20 44 40
1 2020-05-16 45 32 29 9
2 2020-05-17 17 44 14 27
3 2020-05-18 13 28 4 41
4 2020-05-19 41 35 14 12 #as you can see, the columns are now c, a, d, b
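Applied to the question's data, the same idea also gives the rank list directly. This is a minimal sketch that uses only the last two rows and the first six county columns visible in the question. Selecting the data columns before taking the row keeps Date out of the comparison, which avoids the Timestamp/int TypeError.

```python
import pandas as pd

# subset of the question's data: last two rows, first six data columns
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-13", "2020-05-14"]),
    "ADAMS": [9, 9],
    "ALLEN": [897, 955],
    "BARTHOLOMEW": [382, 394],
    "BENTON": [17, 21],
    "BLACKFORD": [14, 14],
    "BOONE": [249, 252],
})

data_cols = df.columns[1:]            # everything except Date
last_row = df[data_cols].iloc[-1]     # numeric Series indexed by column name

# column names ordered by decreasing value in the last row
rank_list = last_row.sort_values(ascending=False).index.tolist()

# Date stays first, data columns follow in rank order
df_sorted = df[["Date"] + rank_list]

print(rank_list)
# ['ALLEN', 'BARTHOLOMEW', 'BOONE', 'BENTON', 'BLACKFORD', 'ADAMS']
```

To sort only a range such as columns 2 through 12, the same pattern applies with data_cols = df.columns[2:13] and the untouched columns placed back around the sorted block.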
I suggest the following:
# initialize the provided sample data frame
df = pd.DataFrame([['65 2020-05-10', 8, 828, 356, 13, 14, 227, 28, 64, 12, 123, 48, 53, 11, 149, 22],
['66 2020-05-11', 8, 860, 367, 16, 14, 235, 28, 67, 12, 126, 48, 56, 12, 161, 23],
['67 2020-05-12', 8, 872, 371, 17, 14, 235, 28, 67, 12, 131, 49, 56, 12, 162, 23],
['68 2020-05-13', 9, 897, 382, 17, 14, 249, 29, 68, 12, 140, 50, 58, 13, 164, 27],
['69 2020-05-14', 9, 955, 394, 21, 14, 252, 29, 69, 12, 145, 50, 60, 15, 164, 28]],
columns = ['Date', 'ADAMS', 'ALLEN', 'BARTHOLOMEW', 'BENTON', 'BLACKFORD', 'BOONE', 'BROWN', 'WABASH', 'WARREN', 'WARRICK', 'WASHINGTON', 'WAYNE', 'WELLS', 'WHITE', 'WHITLEY']
)
# a list of tuples in the form (column_name, max_value)
column_max_list = [(column, df[column].max()) for column in df.columns.values[1:]]
# sort the list descending by the max value
column_max_list_sorted = sorted(column_max_list, key = lambda tup: tup[1], reverse = True)
# extract only the column names
rank_list = [tup[0] for tup in column_max_list_sorted]
for i in range(len(rank_list)):
# get the column to insert next
col = df[rank_list[i]]
# drop the column to be inserted back
df.drop(columns = [rank_list[i]], inplace = True)
# insert the column at the correct index
df.insert(loc = i + 1, column = rank_list[i], value = col)
This yields the desired rank_list
['ALLEN', 'BARTHOLOMEW', 'BOONE', 'WHITE', 'WARRICK', 'WABASH', 'WAYNE', 'WASHINGTON', 'BROWN', 'WHITLEY', 'BENTON', 'WELLS', 'BLACKFORD', 'WARREN', 'ADAMS']
as well as the desired df:
Date ALLEN BARTHOLOMEW BOONE WHITE ...
0 65 2020-05-10 828 356 227 149 ...
1 66 2020-05-11 860 367 235 161 ...
2 67 2020-05-12 872 371 235 162 ...
3 68 2020-05-13 897 382 249 164 ...
4 69 2020-05-14 955 394 252 164 ...
Let's say I have a pandas series:
>> t.head()
Timestamp
2014-02-01 05:43:26 35.592899
2014-02-01 06:18:32 33.898003
2014-02-01 10:04:04 33.898003
2014-02-01 10:36:30 35.592899
2014-02-01 12:20:32 40.677601
and what I want is a frequency table with bins I can set. This sounds easy, but the closest I've come is via matplotlib:
In [8]: fd = plt.hist(t, bins=range(20,50))
In [9]: fd
Out[9]:
(array([ 0, 0, 1, 0, 0, 3, 0, 3, 1, 0, 8, 0, 11, 20, 0, 18, 0,
19, 6, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]),
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
<a list of 29 Patch objects>)
but that of course actually plots the histogram. I can find lots of advice on how to plot histograms, but not on how to simply form the frequency distribution; from the above I have the 'bins' as fd[1] (or at least the lower bounds of them) and the values as fd[0].
I want the frequency distribution on its own in order to later form a Dataframe with the distributions of a number of series, (all with the same bins). I feel there must be a way to do it without matplotlib?
UPDATE: desired results:
{'Station1': 20 0
21 0
22 1
23 0
24 0
25 3
26 0
27 3
28 1
29 0
30 8
31 0
32 11
33 20
34 0
35 18
36 0
37 19
38 6
39 0
40 2
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
dtype: int32}
These are wind speeds: once I have similar data from a number of different met stations I want to be able to form a DataFrame with the bins as the index and the columns as the freq. distrs.
VALUE_COUNTS()
I did think about value counts, it gives me this:
33.898003 20
37.287800 19
35.592899 18
32.203102 11
30.508202 8
38.982700 6
27.118401 3
25.423500 3
40.677601 2
28.813301 1
22.033701 1
dtype: int64
The data itself is clearly A/D converted. The problem is that if the next met station has slightly different values, such as 33.898006 instead of 33.898003, then I'll get a new 'bin' just for that one value. I want to guarantee that the bins are the same for each set of data.
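One way to get the counts without matplotlib is np.histogram, which accepts the same bin edges as plt.hist and returns the counts and edges without drawing anything. A minimal sketch follows. The series t is rebuilt from the value_counts listing above, so it is a stand-in for the real data with the timestamps omitted, and "Station1" is the column name from the desired result.

```python
import numpy as np
import pandas as pd

# stand-in for the real series, rebuilt from the value_counts output
speeds = [33.898003, 37.287800, 35.592899, 32.203102, 30.508202, 38.982700,
          27.118401, 25.423500, 40.677601, 28.813301, 22.033701]
counts = [20, 19, 18, 11, 8, 6, 3, 3, 2, 1, 1]
t = pd.Series(np.repeat(speeds, counts))

# fixed edges shared by every station, so the bins are always identical
edges = np.arange(20, 50)
freq, _ = np.histogram(t.to_numpy(), bins=edges)

# index each count by the lower edge of its bin
station1 = pd.Series(freq, index=edges[:-1], name="Station1")

# one column per station, bins as the index
table = pd.DataFrame({"Station1": station1})
print(table.loc[[22, 25, 33, 37]])
```

Because the edges are fixed up front, a value like 33.898006 from another station still falls in the 33 bin, and further stations can be added as extra columns computed with the same edges.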