NumPy hstack, vstack and dstack usage - numpy

I am trying to better understand hstack, vstack, and dstack in NumPy.
import numpy as np

a = np.arange(96).reshape(2, 4, 4, 3)
print(a)
print("dimensions of a:", np.ndim(a))
print("Shape of a:", a.shape)

b = np.arange(201, 225).reshape(2, 4, 3)
print(b)
print("Shape of b:", b.shape)

c = np.arange(101, 133).reshape(2, 4, 4)
print(c)
print("dimensions of c:", np.ndim(c))
print("Shape of c:", c.shape)
a is:
[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]
   [ 9 10 11]]

  [[12 13 14]
   [15 16 17]
   [18 19 20]
   [21 22 23]]

  [[24 25 26]
   [27 28 29]
   [30 31 32]
   [33 34 35]]

  [[36 37 38]
   [39 40 41]
   [42 43 44]
   [45 46 47]]]


 [[[48 49 50]
   [51 52 53]
   [54 55 56]
   [57 58 59]]

  [[60 61 62]
   [63 64 65]
   [66 67 68]
   [69 70 71]]

  [[72 73 74]
   [75 76 77]
   [78 79 80]
   [81 82 83]]

  [[84 85 86]
   [87 88 89]
   [90 91 92]
   [93 94 95]]]]
and c is:
[[[101 102 103 104]
  [105 106 107 108]
  [109 110 111 112]
  [113 114 115 116]]

 [[117 118 119 120]
  [121 122 123 124]
  [125 126 127 128]
  [129 130 131 132]]]
and b is:
[[[201 202 203]
  [204 205 206]
  [207 208 209]
  [210 211 212]]

 [[213 214 215]
  [216 217 218]
  [219 220 221]
  [222 223 224]]]
How do I reshape c so that I can use hstack correctly? I wish to add one column to each row in each of the dimensions.
How do I reshape b so that I can use vstack correctly? I wish to add one row for each column in each of the dimensions.
I would like to arrive at a general rule for which dimensions to check on the array that needs to be added to an existing array.

You can concatenate to a (2,4,4,3) array:
a (1,4,4,3) with axis=0
a (2,1,4,3) with axis=1
a (2,4,1,3) with axis=2
a (2,4,4,1) with axis=3
Read, and reread as needed, the np.concatenate docs.
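A minimal sketch checking those pairings (the zero arrays are just stand-ins for whatever you want to append):
import numpy as np

a = np.arange(96).reshape(2, 4, 4, 3)
print(np.concatenate((a, np.zeros((1, 4, 4, 3), a.dtype)), axis=0).shape)  # (3, 4, 4, 3)
print(np.concatenate((a, np.zeros((2, 1, 4, 3), a.dtype)), axis=1).shape)  # (2, 5, 4, 3)
print(np.concatenate((a, np.zeros((2, 4, 1, 3), a.dtype)), axis=2).shape)  # (2, 4, 5, 3)
print(np.concatenate((a, np.zeros((2, 4, 4, 1), a.dtype)), axis=3).shape)  # (2, 4, 4, 4)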
Edit:
In previous post(s) I've summarized the code of hstack and vstack, though you can easily read it via the [source] link in the official docs.
When should I use hstack/vstack vs append vs concatenate vs column_stack?
hstack makes sure all arguments are atleast_1d and does a concatenate on axis 0 (for 1-d inputs) or axis 1 (otherwise). vstack makes sure all are atleast_2d, and does a concatenate on axis 0.
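A sketch of those equivalences, using 2-d placeholder arrays x, y, v:
import numpy as np

x = np.ones((2, 3))
y = np.zeros((2, 2))
v = np.zeros((1, 3))

assert np.array_equal(np.hstack((x, y)), np.concatenate((x, y), axis=1))
assert np.array_equal(np.vstack((x, v)), np.concatenate((x, v), axis=0))
print("both equivalences hold")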
Maybe I should have insisted on seeing your attempts and any errors (and attempts to understand the errors).
For adding c to a:
In [58]: np.hstack((a,c))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [58], in <cell line: 1>()
----> 1 np.hstack((a,c))
File <__array_function__ internals>:5, in hstack(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\core\shape_base.py:345, in hstack(tup)
343 return _nx.concatenate(arrs, 0)
344 else:
--> 345 return _nx.concatenate(arrs, 1)
File <__array_function__ internals>:5, in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 3 dimension(s)
Notice that the error was raised by concatenate, and that it focuses on the number of dimensions - 4d versus 3d. The hstack wrapper did not change the inputs at all.
If I add a trailing dimension to c, I get:
In [62]: c[...,None].shape
Out[62]: (2, 4, 4, 1)
In [63]: np.concatenate((a, c[...,None]),axis=3).shape
Out[63]: (2, 4, 4, 4)
Similarly for b:
In [64]: np.concatenate((a, b[...,None,:]),axis=2).shape
Out[64]: (2, 4, 5, 3)
The hstack/vstack docs specify concatenation on the 2nd and 1st axes respectively, but you want to concatenate on axis 2 or 3. So those 'stack' functions don't apply, do they?
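As for the general rule you asked about: to concatenate y onto x along axis k, the arrays must have the same number of dimensions and identical shapes on every axis except k. A sketch (can_concatenate is a hypothetical helper, not part of NumPy):
import numpy as np

def can_concatenate(x, y, axis):
    # same ndim, and shapes match on every axis except the concatenation axis
    return (x.ndim == y.ndim and
            all(xs == ys
                for i, (xs, ys) in enumerate(zip(x.shape, y.shape))
                if i != axis))

a = np.arange(96).reshape(2, 4, 4, 3)
c = np.arange(101, 133).reshape(2, 4, 4)
print(can_concatenate(a, c[..., None], axis=3))  # True
print(can_concatenate(a, c[..., None], axis=2))  # False: 1 != 3 on axis 3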

Related

Create a dataframe from a series with a TimeSeriesIndex multiplied by another series

Let's say I have a series, ser1, with a TimeSeriesIndex of length x. I also have another series, ser2, of length y. How do I multiply these so that I get a dataframe of shape (x, y), where the index comes from ser1 and the columns are the index labels of ser2? I want every element of ser2 to be multiplied by the value of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.
Try this, using np.outer with the pandas DataFrame constructor:
pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
              a    b    c    d    e
2021-01-01  100  200  300  400  500
2021-01-02  105  210  315  420  525
2021-01-03  110  220  330  440  550
2021-01-04  114  228  342  456  570
2021-01-05   89  178  267  356  445
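For reference, a broadcasting-based sketch that builds the same frame without np.outer (ser1 and test_ser2 are redefined here so the snippet stands alone):
import numpy as np
import pandas as pd

ser1 = pd.Series([100, 105, 110, 114, 89],
                 index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'),
                 name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# column vector of ser1 values times row of ser2 values, via broadcasting
out = pd.DataFrame(ser1.to_numpy()[:, None] * test_ser2.to_numpy(),
                   index=ser1.index, columns=test_ser2.index)
print(out.equals(pd.DataFrame(np.outer(ser1, test_ser2),
                              index=ser1.index, columns=test_ser2.index)))  # True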

Plot a Trellis Stacked Bar Chart in Altair by combining column values

I'd like to plot a Trellis Stacked Bar Chart like the Trellis Stacked Bar Chart example in Altair's documentation.
I have this dataset:
import pandas as pd

pd.DataFrame({
    'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
    'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
    'read': [3, 0, 0, 114, 27, 82],
    'write': [70, 0, 0, 45, 655, 203],
    'read-write': [313, 322, 45, 89, 90, 12]
})
storage project read write read-write
0 dev01 omega 3 70 313
1 dev01 alpha 0 0 322
2 dev01 beta 0 0 45
3 dev02 omega 114 45 89
4 dev02 beta 27 655 90
5 dev03 alpha 82 203 12
What I can't figure out is how to specify the read, write, read-write columns as the colors / values for Altair.
Your data is wide-form, and must be converted to long-form to be used in Altair encodings. See Long-Form vs. Wide-Form Data in Altair's documentation for more information.
This can be addressed by modifying the input data in Pandas using pd.melt, but it is often more convenient to use Altair's Fold Transform to do this reshaping within the chart specification. For example:
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
    'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
    'read': [3, 0, 0, 114, 27, 82],
    'write': [70, 0, 0, 45, 655, 203],
    'read-write': [313, 322, 45, 89, 90, 12]
})

alt.Chart(df).transform_fold(
    ['read', 'write', 'read-write'],
    as_=['mode', 'value']
).mark_bar().encode(
    x='value:Q',
    y='project:N',
    column='storage:N',
    color='mode:N'
).properties(
    width=200
)
You need to melt your desired columns into a new column:
# assuming your DataFrame is assigned to `df`
cols_to_melt = ['read', 'write', 'read-write']
cols_to_keep = df.columns.difference(cols_to_melt)
df = df.melt(cols_to_keep, cols_to_melt, 'mode')
So you get the following:
project storage mode value
0 omega dev01 read 3
1 alpha dev01 read 0
2 beta dev01 read 0
3 omega dev02 read 114
4 beta dev02 read 27
5 alpha dev03 read 82
6 omega dev01 write 70
7 alpha dev01 write 0
8 beta dev01 write 0
9 omega dev02 write 45
10 beta dev02 write 655
11 alpha dev03 write 203
12 omega dev01 read-write 313
13 alpha dev01 read-write 322
14 beta dev01 read-write 45
15 omega dev02 read-write 89
16 beta dev02 read-write 90
17 alpha dev03 read-write 12
Then in the altair snippet, instead of color='site', use color='mode'.
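For completeness, a sketch of the melt-based route end to end; it should render the same chart as the fold-transform version above:
import pandas as pd
import altair as alt

df = pd.DataFrame({
    'storage': ['dev01', 'dev01', 'dev01', 'dev02', 'dev02', 'dev03'],
    'project': ['omega', 'alpha', 'beta', 'omega', 'beta', 'alpha'],
    'read': [3, 0, 0, 114, 27, 82],
    'write': [70, 0, 0, 45, 655, 203],
    'read-write': [313, 322, 45, 89, 90, 12]
})

# long-form frame: one row per (storage, project, mode) combination
long_df = df.melt(['storage', 'project'], var_name='mode', value_name='value')

alt.Chart(long_df).mark_bar().encode(
    x='value:Q',
    y='project:N',
    column='storage:N',
    color='mode:N'
).properties(
    width=200
)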

Pandas: How to Sort a Range of Columns in a Dataframe?

I have a Pandas dataframe that I need to sort by the data columns' maximum values. I am having trouble performing the sort because all of the sorting examples that I have found operate on all of the columns in the dataframe when performing the sort. In this case I need to sort only a subset of the columns. The first column contains a date, and the remaining 90 columns contain data. The 90 data columns are currently sorted alphabetically by their column name. I would like to sort them in decreasing order of their maximum value, which happens to be in the last row.
In the bigger scheme of things, this question is about how to perform sorting on a range of columns within a dataframe, rather than sorting all of the columns in the dataframe. There may be cases, for example, where I need to sort only columns 2 through 12 of a dataframe, while leaving the remaining columns in their existing order.
Here is a sample of the unsorted dataframe:
df.tail()
Date ADAMS ALLEN BARTHOLOMEW BENTON BLACKFORD BOONE BROWN ... WABASH WARREN WARRICK WASHINGTON WAYNE WELLS WHITE WHITLEY
65 2020-05-10 8 828 356 13 14 227 28 ... 64 12 123 48 53 11 149 22
66 2020-05-11 8 860 367 16 14 235 28 ... 67 12 126 48 56 12 161 23
67 2020-05-12 8 872 371 17 14 235 28 ... 67 12 131 49 56 12 162 23
68 2020-05-13 9 897 382 17 14 249 29 ... 68 12 140 50 58 13 164 27
69 2020-05-14 9 955 394 21 14 252 29 ... 69 12 145 50 60 15 164 28
I would like to perform the sort so that the column with the largest value in row 69 is placed immediately after df['Date'], with the columns ordered so that the values in row 69 decrease from left to right. Once that is done, I'd like to create a series containing the column headers, to generate a rank list. Using the visible columns as an example, the desired list would be:
rank_list=[ "ALLEN", "BARTHOLOMEW", "BOONE", "WHITE", "WARRICK", ... "BLACKFORD", "WARREN", "ADAMS" ]
My biggest hurdle at present is that when I perform the sort I'm not able to exclude the Date column, and I'm receiving a type error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
I am new to Pandas, so I apologize if there is a solution to this problem that should be obvious. Thanks.
You can do it this way using sort_values, once you have selected the right row and the range of columns:
# data sample
import numpy as np
import pandas as pd

np.random.seed(86)
df = pd.DataFrame({'date': pd.date_range('2020-05-15', periods=5),
                   'a': np.random.randint(0, 50, 5),
                   'b': np.random.randint(0, 50, 5),
                   'c': np.random.randint(0, 50, 5),
                   'd': np.random.randint(0, 50, 5)})

# parameters
start_idx = 1                    # note: indexing starts at 0, so 1 is the second column
end_idx = df.shape[1]            # up to the last column
row_position = df.shape[0] - 1   # the last row

# create the new column order
new_col_order = df.columns.tolist()
new_col_order[start_idx:end_idx] = df.iloc[row_position, start_idx:end_idx]\
                                     .sort_values(ascending=False).index

# reorder
df = df[new_col_order]
print(df)
date c a d b
0 2020-05-15 30 20 44 40
1 2020-05-16 45 32 29 9
2 2020-05-17 17 44 14 27
3 2020-05-18 13 28 4 41
4 2020-05-19 41 35 14 12    # as you can see, the columns are now c, a, d, b
I suggest the following:
# initialize the provided sample data frame
import pandas as pd

df = pd.DataFrame([['65 2020-05-10', 8, 828, 356, 13, 14, 227, 28, 64, 12, 123, 48, 53, 11, 149, 22],
                   ['66 2020-05-11', 8, 860, 367, 16, 14, 235, 28, 67, 12, 126, 48, 56, 12, 161, 23],
                   ['67 2020-05-12', 8, 872, 371, 17, 14, 235, 28, 67, 12, 131, 49, 56, 12, 162, 23],
                   ['68 2020-05-13', 9, 897, 382, 17, 14, 249, 29, 68, 12, 140, 50, 58, 13, 164, 27],
                   ['69 2020-05-14', 9, 955, 394, 21, 14, 252, 29, 69, 12, 145, 50, 60, 15, 164, 28]],
                  columns=['Date', 'ADAMS', 'ALLEN', 'BARTHOLOMEW', 'BENTON', 'BLACKFORD', 'BOONE',
                           'BROWN', 'WABASH', 'WARREN', 'WARRICK', 'WASHINGTON', 'WAYNE', 'WELLS',
                           'WHITE', 'WHITLEY'])

# a list of tuples in the form (column_name, max_value)
column_max_list = [(column, df[column].max()) for column in df.columns.values[1:]]

# sort the list descending by the max value
column_max_list_sorted = sorted(column_max_list, key=lambda tup: tup[1], reverse=True)

# extract only the column names
rank_list = [tup[0] for tup in column_max_list_sorted]

for i in range(len(rank_list)):
    # get the column to insert next
    col = df[rank_list[i]]
    # drop the column so it can be re-inserted
    df.drop(columns=[rank_list[i]], inplace=True)
    # insert the column at its ranked position
    df.insert(loc=i + 1, column=rank_list[i], value=col)
This yields the desired rank_list
['ALLEN', 'BARTHOLOMEW', 'BOONE', 'WHITE', 'WARRICK', 'WABASH', 'WAYNE', 'WASHINGTON', 'BROWN', 'WHITLEY', 'BENTON', 'WELLS', 'BLACKFORD', 'WARREN', 'ADAMS']
as well as the desired df:
Date ALLEN BARTHOLOMEW BOONE WHITE ...
0 65 2020-05-10 828 356 227 149 ...
1 66 2020-05-11 860 367 235 161 ...
2 67 2020-05-12 872 371 235 162 ...
3 68 2020-05-13 897 382 249 164 ...
4 69 2020-05-14 955 394 252 164 ...
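For comparison, a more compact sketch of the same ranking idea on a small stand-in frame (three counties only; the real frame has 'Date' plus 90 data columns):
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2020-05-10', periods=3),
                   'ADAMS': [8, 8, 9],
                   'ALLEN': [828, 860, 955],
                   'WHITE': [149, 161, 164]})

# rank data columns by their maximum, excluding the Date column entirely
rank_list = df.drop(columns='Date').max().sort_values(ascending=False).index.tolist()
df = df[['Date'] + rank_list]
print(rank_list)  # ['ALLEN', 'WHITE', 'ADAMS']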

Appending pandas data to hdf store, getting 'TypeError: object of type 'int' has no len()' error

Motivation:
I have about 30 million rows of data, one column being an index value, the other being a list of 512 int32 numbers. I wish to only retrieve maybe a thousand or so at a time, so I want to create some sort of datastore that can look up the data by index, while leaving the rest on the disk.
Right now the data is split up into 184 files, which can be opened by pandas.
This is what my dataframe looks like
df.head()
IndexID NumpyIds
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
There is the index, and then the column 'NumpyIds' which are numpy arrays of size 512, containing int32 ints.
I then tried this:
store = pd.HDFStore('/data2.h5')
store.put('index', df, format='table', append=True, data_columns=True)
And got this
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-05b956667991> in <module>()
----> 1 store.put('index', df, format='table', append=True, data_columns=True)
2 store.close
4 frames
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors)
1040 data_columns=data_columns,
1041 encoding=encoding,
-> 1042 errors=errors,
1043 )
1044
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors)
1707 dropna=dropna,
1708 nan_rep=nan_rep,
-> 1709 data_columns=data_columns,
1710 )
1711
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns)
4141 min_itemsize=min_itemsize,
4142 nan_rep=nan_rep,
-> 4143 data_columns=data_columns,
4144 )
4145
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize)
3811 nan_rep=nan_rep,
3812 encoding=self.encoding,
-> 3813 errors=self.errors,
3814 )
3815 adj_name = _maybe_adjust_name(new_name, self.version)
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors)
4798 # we cannot serialize this data, so report an exception on a column
4799 # by column basis
-> 4800 for i in range(len(block.shape[0])):
4801
4802 col = block.iget(i)
TypeError: object of type 'int' has no len()
What am I trying to do?
I have 184 pandas files which I am trying to concatenate into 1 hdf file for fast look up using the index.
For example
store['index'][21]
Would give me that 512 dimension vector for the index of 21.
Edit:
I tried creating a column for every number, so
df[[str(i) for i in range(512)]] = pd.DataFrame(df.NumpyIds.to_numpy(), index=df.index)
df.drop(columns='NumpyIds', inplace=True)
store.put('index', df, format='table', append=True)
store.close()
This works, although I feel it may be a hack rather than an ideal workaround. But now the issue is that I can't seem to retrieve those values by index:
store.select(key='index', start=2163410)
returns
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
IndexID
0 rows × 512 columns
This returns the column names, but no data in the columns. This method also takes a lot of RAM; I am wondering if it loads all the data at once, rather than just the specified index.
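One hedged sketch, assuming the 512-column layout above with IndexID as the stored frame's named index: table-format stores can be queried by index value through where=, which reads only the matching rows (start= is a row offset, not an index lookup):
import pandas as pd

# query by index value; only matching rows are read from the table-format store
subset = pd.read_hdf('/data2.h5', key='index', where='IndexID == 2163410')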
Another workaround I'm trying is opening the data directly in h5py:
import h5py

df = pd.read_hdf(hdf_files[0])
df.set_index('IndexID', inplace=True)
df.to_hdf('testhdf.h5', key='df')
h = h5py.File('testhdf.h5', 'r')
But I can't seem to figure out how to retrieve data by index from this store
h['df'][2163410]
/usr/local/lib/python3.6/dist-packages/h5py/_hl/base.py in _e(self, name, lcpl)
135 else:
136 try:
--> 137 name = name.encode('ascii')
138 coding = h5t.CSET_ASCII
139 except UnicodeEncodeError:
AttributeError: 'int' object has no attribute 'encode'
As far as I know, this is a bug; see #34274.
I've fixed it in #38919. It now shows an appropriate error message.

The polynomial interpolation of degree 4

For the provided data set, write equations for calculating the polynomial interpolation of degree 4 and find the formula for f by hand.
x = [1, 2, 3, 4, 5]
y = f(x) = [5, 31, 121, 341, 781]
The pyramid of iterated differences is
5   31   121   341   781
  26   90   220   440
    64   130   220
      66   90
        24
You can extend the table by keeping the last row at the constant 24 and computing forwards and backwards, or you can read off the coefficients of the Newton interpolation polynomial from the leading entry of each row (5, 26, 64, 66, 24, divided by 0!, 1!, 2!, 3!, 4!). Either way, the extended value table for x = -3..10 is
[61, 11, 1, 1, 5, 31, 121, 341, 781, 1555, 2801, 4681, 7381, 11111]
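A quick check of that reading (a sketch; the Newton form expands to x**4 + x**3 + x**2 + x + 1):
def f(x):
    # Newton form: 5 + 26(x-1) + 32(x-1)(x-2) + 11(x-1)(x-2)(x-3) + (x-1)(x-2)(x-3)(x-4)
    return (5 + 26*(x - 1) + 32*(x - 1)*(x - 2)
            + 11*(x - 1)*(x - 2)*(x - 3)
            + (x - 1)*(x - 2)*(x - 3)*(x - 4))

print([f(x) for x in range(-3, 11)])
# [61, 11, 1, 1, 5, 31, 121, 341, 781, 1555, 2801, 4681, 7381, 11111]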