Numpy array - Regroup values when index equals a specified value - numpy

I'm pretty new to NumPy arrays and was not able to find a good explanation or example for my issue. I saw things like take() or take_along_axis(), but I didn't understand what was going on...
I have this 2D NumPy array, which may contain N sub-arrays of 5 values each (h, s, i, x, y):
values = np.array([
[1,2,3,4,5],
[1,22,33,44,55],
[1,22,333,444,555],
[1,22,333,4444,5555],
[1,222,33,44,55],
[1,222,330,440,550],
[10,20,30,40,50],
[100,200,300,400,500],
])
As you can see, values can be repeated at the same index (column) from one sub-array to the next.
I want to regroup the sub-arrays by index values, such as:
1
    2
        3
            4
                5
    22
        33
            44
                55
        333
            444
                555
            4444
                5555
    222
        33
            44
                55
        330
            440
                550
10
    20
        30
            40
                50
100
    200
        300
            400
                500
The goal is to obtain a regular array like:
array = [1, 2, 3, 4, 5, 22, 33, 44, 55, 333, 444, 555, 4444, 5555, 222, 33, 44, 55, 330, 440, 550, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500]
Thank you very much for your support.

You can use the flatten method:
list(values.flatten())
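Note that flatten() on its own returns all 40 elements, including the leading values that a row shares with the row above it, while the desired output has 31. Here is a minimal sketch of one way to get the requested result, assuming each row should contribute only the values from the first column where it differs from the previous row. The name drop_shared_prefix is a hypothetical helper, not a NumPy function.

```python
import numpy as np

values = np.array([
    [1, 2, 3, 4, 5],
    [1, 22, 33, 44, 55],
    [1, 22, 333, 444, 555],
    [1, 22, 333, 4444, 5555],
    [1, 222, 33, 44, 55],
    [1, 222, 330, 440, 550],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
])


def drop_shared_prefix(rows):
    """Flatten rows, skipping the leading values each row shares with the row above."""
    rows = np.asarray(rows)
    # True where an element differs from the element directly above it;
    # the first row has nothing above it, so it is kept entirely.
    differs = np.ones(rows.shape, dtype=bool)
    differs[1:] = rows[1:] != rows[:-1]
    # Once a row differs in some column, keep that column and everything to its right.
    keep = np.logical_or.accumulate(differs, axis=1)
    # Boolean indexing returns the kept elements in row-major order.
    return rows[keep]


result = drop_shared_prefix(values)
print(result.tolist())
```

The cumulative OR matters for a row such as [1, 222, 33, 44, 55]: it only needs to differ from the previous row at the second column for all later columns to be kept.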

Pandas conditional average by groups

Updated:
The first answer in this post comes very close to solving this problem too. However, it does not take columns A and C into account.
Pandas Average If in Python : Combining groupby mean with conditional statement
There is a DataFrame with 3 columns. I would like to add 2 new columns which are:
the rolling average of B by A and C (window of 2: the current row and the previous row that satisfy the condition, i.e. have the same A and C)
the rolling average of B by A and C (window of 2 over the previous two rows that satisfy the condition, i.e. have the same A and C)
For the second part, I have a date and a sequence which could be used as the basis of the rolling average calculation.
Any ideas?
df = pd.DataFrame({'A': ['t1', 't1', 't1', 't1', 't2', 't2', 't2', 't2','t1'],
'B': [100, 104, 108, 110, 102, 110, 98, 100, 200],
'C': ['h', 'a', 'a', 'a', 'a', 'h', 'h', 'h','h'],
'expected1': [100, 104, 106, 109, 102, 110, 104, 99, 150],
'expected2': [0, 0, 104, 106, 0, 0, 110, 104, 100]}, columns=['A', 'B', 'C','expected1','expected2'])
df
Use lazy group:
grp = df.groupby(['A', 'C'], sort=False)['B']
df['mean'] = grp.transform('mean')
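# rolling() returns its rows in group order, so drop the group levels
# and let pandas align the result on the original index
df['mean_avg'] = grp.rolling(2, min_periods=1).mean().droplevel(['A', 'C'])
Output:
>>> df[['A', 'B', 'C', 'mean', 'mean_avg']]
    A    B  C        mean  mean_avg
0  t1  100  h  150.000000     100.0
1  t1  104  a  107.333333     104.0
2  t1  108  a  107.333333     106.0
3  t1  110  a  107.333333     109.0
4  t2  102  a  102.000000     102.0
5  t2  110  h  102.666667     110.0
6  t2   98  h  102.666667     104.0
7  t2  100  h  102.666667      99.0
8  t1  200  h  150.000000     150.0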
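Both requested columns can also be sketched with groupby(...).transform, which returns its result on the original row index. This assumes that "the previous 2" means shifting each group by one row before taking the rolling mean, and that missing values should be shown as 0, as in expected2. The column names roll_current and roll_previous are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['t1', 't1', 't1', 't1', 't2', 't2', 't2', 't2', 't1'],
    'B': [100, 104, 108, 110, 102, 110, 98, 100, 200],
    'C': ['h', 'a', 'a', 'a', 'a', 'h', 'h', 'h', 'h'],
    'expected1': [100, 104, 106, 109, 102, 110, 104, 99, 150],
    'expected2': [0, 0, 104, 106, 0, 0, 110, 104, 100],
})

grp = df.groupby(['A', 'C'], sort=False)['B']

# Mean of the current row and the previous row within the same (A, C) group.
df['roll_current'] = grp.transform(lambda s: s.rolling(2, min_periods=1).mean())

# Mean of the two rows before the current one within the same group:
# shift by one row first, then roll; rows with no history become 0.
df['roll_previous'] = grp.transform(
    lambda s: s.shift().rolling(2, min_periods=1).mean()
).fillna(0)

print(df)
```

Sorting by the date or sequence column before grouping would make "previous" follow that order instead of the current row order.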

Pandas: How to Sort a Range of Columns in a Dataframe?

I have a Pandas dataframe that I need to sort by the data columns' maximum values. I am having trouble performing the sort because all of the sorting examples that I have found operate on all of the columns in the dataframe when performing the sort. In this case I need to sort only a subset of the columns. The first column contains a date, and the remaining 90 columns contain data. The 90 data columns are currently sorted alphabetically by their column name. I would like to sort them in decreasing order of their maximum value, which happens to be in the last row.
In the bigger scheme of things, this question is about how to perform sorting on a range of columns within a dataframe, rather than sorting all of the columns in the dataframe. There may be cases, for example, where I need to sort only columns 2 through 12 of a dataframe, while leaving the remaining columns in their existing order.
Here is a sample of the unsorted dataframe:
df.tail()
Date ADAMS ALLEN BARTHOLOMEW BENTON BLACKFORD BOONE BROWN ... WABASH WARREN WARRICK WASHINGTON WAYNE WELLS WHITE WHITLEY
65 2020-05-10 8 828 356 13 14 227 28 ... 64 12 123 48 53 11 149 22
66 2020-05-11 8 860 367 16 14 235 28 ... 67 12 126 48 56 12 161 23
67 2020-05-12 8 872 371 17 14 235 28 ... 67 12 131 49 56 12 162 23
68 2020-05-13 9 897 382 17 14 249 29 ... 68 12 140 50 58 13 164 27
69 2020-05-14 9 955 394 21 14 252 29 ... 69 12 145 50 60 15 164 28
I would like to perform the sort so that the column with the largest value in row 69 is placed right after df['Date'], with the remaining columns ordered so that the values in row 69 decrease from left to right. Once that is done, I'd like to create a series containing the column headers, to generate a rank list. Using the visible columns as an example, the desired list would be:
rank_list=[ "ALLEN", "BARTHOLOMEW", "BOONE", "WHITE", "WARRICK", ... "BLACKFORD", "WARREN", "ADAMS" ]
My biggest hurdle at present is that when I perform the sort I'm not able to exclude the Date column, and I'm receiving a type error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
I am new to Pandas so I apologize if there is a solution to this problem that should be obvious. thanks.
You can do it with sort_values once you have selected the right row and the range of columns:
import numpy as np
import pandas as pd

# sample data
np.random.seed(86)
df = pd.DataFrame({'date': pd.date_range('2020-05-15', periods=5),
                   'a': np.random.randint(0, 50, 5),
                   'b': np.random.randint(0, 50, 5),
                   'c': np.random.randint(0, 50, 5),
                   'd': np.random.randint(0, 50, 5)})
# parameters
start_idx = 1  # note: indexing starts at 0, so 1 is the second column
end_idx = df.shape[1]  # one past the last column
row_position = df.shape[0] - 1  # the last row
# create the new order
new_col_order = df.columns.tolist()
new_col_order[start_idx:end_idx] = df.iloc[row_position, start_idx:end_idx]\
    .sort_values(ascending=False).index
# reorder
df = df[new_col_order]
print(df)
date c a d b
0 2020-05-15 30 20 44 40
1 2020-05-16 45 32 29 9
2 2020-05-17 17 44 14 27
3 2020-05-18 13 28 4 41
4 2020-05-19 41 35 14 12 #as you can see, the columns are now c, a, d, b
I suggest the following:
# initialize the provided sample data frame
df = pd.DataFrame([['65 2020-05-10', 8, 828, 356, 13, 14, 227, 28, 64, 12, 123, 48, 53, 11, 149, 22],
['66 2020-05-11', 8, 860, 367, 16, 14, 235, 28, 67, 12, 126, 48, 56, 12, 161, 23],
['67 2020-05-12', 8, 872, 371, 17, 14, 235, 28, 67, 12, 131, 49, 56, 12, 162, 23],
['68 2020-05-13', 9, 897, 382, 17, 14, 249, 29, 68, 12, 140, 50, 58, 13, 164, 27],
['69 2020-05-14', 9, 955, 394, 21, 14, 252, 29, 69, 12, 145, 50, 60, 15, 164, 28]],
columns = ['Date', 'ADAMS', 'ALLEN', 'BARTHOLOMEW', 'BENTON', 'BLACKFORD', 'BOONE', 'BROWN', 'WABASH', 'WARREN', 'WARRICK', 'WASHINGTON', 'WAYNE', 'WELLS', 'WHITE', 'WHITLEY']
)
# a list of tuples in the form (column_name, max_value)
column_max_list = [(column, df[column].max()) for column in df.columns.values[1:]]
# sort the list descending by the max value
column_max_list_sorted = sorted(column_max_list, key = lambda tup: tup[1], reverse = True)
# extract only the column names
rank_list = [tup[0] for tup in column_max_list_sorted]
for i in range(len(rank_list)):
    # get the column to insert next
    col = df[rank_list[i]]
    # drop the column to be inserted back
    df.drop(columns = [rank_list[i]], inplace = True)
    # insert the column at the correct index
    df.insert(loc = i + 1, column = rank_list[i], value = col)
This yields the desired rank_list
['ALLEN', 'BARTHOLOMEW', 'BOONE', 'WHITE', 'WARRICK', 'WABASH', 'WAYNE', 'WASHINGTON', 'BROWN', 'WHITLEY', 'BENTON', 'WELLS', 'BLACKFORD', 'WARREN', 'ADAMS']
as well as the desired df:
Date ALLEN BARTHOLOMEW BOONE WHITE ...
0 65 2020-05-10 828 356 227 149 ...
1 66 2020-05-11 860 367 235 161 ...
2 67 2020-05-12 872 371 235 162 ...
3 68 2020-05-13 897 382 249 164 ...
4 69 2020-05-14 955 394 252 164 ...
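For the more general case raised in the question (sorting only a positional range of columns and leaving the rest in place), the same idea can be wrapped in a small function. This is a sketch only: sort_column_range is a hypothetical helper name, and the sample frame is a cut-down version of the data in the question.

```python
import pandas as pd


def sort_column_range(df, start, stop, row=-1):
    """Reorder columns start..stop-1 by their value in `row`, largest first.

    Columns outside the range keep their positions. Returns the reordered
    frame and the ranked column names.
    """
    cols = df.columns.tolist()
    ranked = df.iloc[row, start:stop].sort_values(ascending=False).index.tolist()
    cols[start:stop] = ranked
    return df[cols], ranked


sample = pd.DataFrame({
    'Date': pd.to_datetime(['2020-05-13', '2020-05-14']),
    'ADAMS': [9, 9],
    'ALLEN': [897, 955],
    'BARTHOLOMEW': [382, 394],
    'BENTON': [17, 21],
})

# Sort every column after 'Date' by the value in the last row.
sorted_df, rank_list = sort_column_range(sample, 1, sample.shape[1])
print(rank_list)
print(sorted_df)
```

Because the Date column is never part of the slice that gets sorted, the Timestamp/int comparison error from the question does not come up.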

How to efficiently convert dictionary of Lists into Pandas dataframe? [duplicate]

I am trying to create a dataframe from a list of values that contains nested dictionaries. This is my data:
d=[{'user': 200,
'p_val': {'a': 10, 'b': 200},
'f_val': {'a': 20, 'b': 300},
'life': 8},
{'user': 202,
'p_val': {'a': 100, 'b': 200},
'f_val': {'a': 200, 'b': 300},
'life': 8}]
I am trying to turn it into a dataframe as follows:
user new_col f_val p_val life
200 a 20 10 8
200 b 300 200 8
202 a 200 100 8
202 b 300 200 8
I looked at other answers, none of them matched my requirement.
The nearest I could find was this, and it still did not work for me.
Any help would be much appreciated! Thank you
Try
df = pd.concat([pd.DataFrame(e) for e in d])
df.reset_index().rename(columns={'index': 'newcol'})
newcol user p_val f_val life
0 a 200 10 20 8
1 b 200 200 300 8
2 a 202 100 200 8
3 b 202 200 300 8
The first line makes a DataFrame with the index being the a and b values. The second line makes this the newcol column.
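To match the column name and order from the question exactly, the same approach can be chained as below. This is a sketch that assumes every record has the same keys and that the nested dictionaries of one record share the same inner keys.

```python
import pandas as pd

d = [{'user': 200,
      'p_val': {'a': 10, 'b': 200},
      'f_val': {'a': 20, 'b': 300},
      'life': 8},
     {'user': 202,
      'p_val': {'a': 100, 'b': 200},
      'f_val': {'a': 200, 'b': 300},
      'life': 8}]

# Each record becomes a small frame indexed by the inner keys ('a', 'b');
# the scalar fields (user, life) are broadcast to every row.
frames = [pd.DataFrame(e) for e in d]

df = (
    pd.concat(frames)
    .rename_axis('new_col')   # name the index so reset_index creates 'new_col'
    .reset_index()
    [['user', 'new_col', 'f_val', 'p_val', 'life']]
)
print(df)
```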

The polynomial interpolation of degree 4

For the provided data set, write equations for calculating the polynomial interpolation of degree 4 and find the formula for f by hand.
x = [1, 2, 3, 4, 5]
y = f(x) = [5, 31, 121, 341, 781]
The pyramid of iterated differences is
5 31 121 341 781
26 90 220 440
64 130 220
66 90
24
You can extend the table by setting the last line constant 24 and computing backwards or you could read off the coefficients of the Newton interpolation polynomial. Anyway, the extended value table for x=-3..10 is
[61, 11, 1, 1, 5, 31, 121, 341, 781, 1555, 2801, 4681, 7381, 11111]
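Reading the coefficients off the left edge of the pyramid (5, 26, 64, 66, 24) and dividing the k-th one by k! gives the Newton form p(x) = 5 + 26(x-1) + 32(x-1)(x-2) + 11(x-1)(x-2)(x-3) + (x-1)(x-2)(x-3)(x-4), which expands to x^4 + x^3 + x^2 + x + 1. The short sketch below rebuilds the pyramid and evaluates that form; it assumes unit spacing between the x values, and the function names are chosen here for illustration.

```python
from fractions import Fraction
from math import factorial

xs = [1, 2, 3, 4, 5]
ys = [5, 31, 121, 341, 781]


def leading_differences(values):
    """Return the first entry of each row of the iterated-difference pyramid."""
    leading = []
    row = list(values)
    while row:
        leading.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return leading


def newton_eval(x, x0, leading):
    """Evaluate the Newton forward-difference polynomial (step size 1) at x."""
    total = Fraction(0)
    product = Fraction(1)  # running product (x - x0)(x - x0 - 1)...
    for k, delta in enumerate(leading):
        total += Fraction(delta, factorial(k)) * product
        product *= x - (x0 + k)
    return total


leading = leading_differences(ys)
extended = [int(newton_eval(x, xs[0], leading)) for x in range(-3, 11)]
print(leading)
print(extended)
```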

Frequency distribution of series in pandas

Let's say I have a pandas series:
>> t.head()
Timestamp
2014-02-01 05:43:26 35.592899
2014-02-01 06:18:32 33.898003
2014-02-01 10:04:04 33.898003
2014-02-01 10:36:30 35.592899
2014-02-01 12:20:32 40.677601
and what I want is a frequency table with bins I can set. This sounds easy but the closest I've come to is via matplotlib
In [8]: fd = plt.hist(t, bins=range(20,50))
In [9]: fd
Out[9]:
(array([ 0, 0, 1, 0, 0, 3, 0, 3, 1, 0, 8, 0, 11, 20, 0, 18, 0,
19, 6, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]),
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
<a list of 29 Patch objects>)
but that of course actually plots the histogram. I can find lots of advice on how to plot histograms, but not on how to simply form the frequency distribution; from the above I have the 'bins' as fd[1] (or at least the lower bounds of them) and the values as fd[0].
I want the frequency distribution on its own in order to later form a Dataframe with the distributions of a number of series, (all with the same bins). I feel there must be a way to do it without matplotlib?
UPDATE: desired results:
{'Station1': 20 0
21 0
22 1
23 0
24 0
25 3
26 0
27 3
28 1
29 0
30 8
31 0
32 11
33 20
34 0
35 18
36 0
37 19
38 6
39 0
40 2
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
dtype: int32}
These are wind speeds: once I have similar data from a number of different met stations I want to be able to form a DataFrame with the bins as the index and the columns as the freq. distrs.
VALUE_COUNTS()
I did think about value counts, it gives me this:
33.898003 20
37.287800 19
35.592899 18
32.203102 11
30.508202 8
38.982700 6
27.118401 3
25.423500 3
40.677601 2
28.813301 1
22.033701 1
dtype: int64
The data itself is clearly A/D converted: the thing is that suppose the next met station has different indices, such as 33.898006 instead of 33.898003, then I'll get a new 'bin' just for that one - I want to guarantee that the bins are the same for each set of data.
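One way to get the counts without plotting is np.histogram with an explicit list of bin edges, so every station is counted against the same bins. The sketch below uses the five values shown in t.head() for the first station; the second station's values and the helper name frequency_table are invented for illustration.

```python
import numpy as np
import pandas as pd


def frequency_table(series, bin_edges):
    """Count values per bin and index the counts by each bin's lower edge."""
    counts, edges = np.histogram(series.to_numpy(), bins=bin_edges)
    return pd.Series(counts, index=edges[:-1])


# The same edges for every station: 29 bins, [20, 21), [21, 22), ..., [48, 49].
bins = np.arange(20, 50)

station1 = pd.Series([35.592899, 33.898003, 33.898003, 35.592899, 40.677601])
# Hypothetical second station with slightly different A/D values.
station2 = pd.Series([33.898006, 37.287800, 22.033701])

table = pd.DataFrame({
    'Station1': frequency_table(station1, bins),
    'Station2': frequency_table(station2, bins),
})
print(table)
```

With shared edges, 33.898003 and 33.898006 both land in the bin starting at 33, which avoids the extra "bin" that value_counts() would create.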