Frequency distribution of series in pandas - pandas

Let's say I have a pandas series:
>> t.head()
Timestamp
2014-02-01 05:43:26 35.592899
2014-02-01 06:18:32 33.898003
2014-02-01 10:04:04 33.898003
2014-02-01 10:36:30 35.592899
2014-02-01 12:20:32 40.677601
and what I want is a frequency table with bins I can set. This sounds easy but the closest I've come to is via matplotlib
In [8]: fd = plt.hist(t, bins=range(20,50))
In [9]: fd
Out[9]:
(array([ 0, 0, 1, 0, 0, 3, 0, 3, 1, 0, 8, 0, 11, 20, 0, 18, 0,
19, 6, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]),
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
<a list of 29 Patch objects>)
but that of course actually plots the histogram. I can find lots of advice on how to plot histograms, but not on how to simply form the frequency distribution; from the above I have the 'bins' as fd[1] (or at least the lower bounds of them) and the values as fd[0].
I want the frequency distribution on its own in order to later form a Dataframe with the distributions of a number of series, (all with the same bins). I feel there must be a way to do it without matplotlib?
UPDATE: desired results:
{'Station1': 20 0
21 0
22 1
23 0
24 0
25 3
26 0
27 3
28 1
29 0
30 8
31 0
32 11
33 20
34 0
35 18
36 0
37 19
38 6
39 0
40 2
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
dtype: int32}
These are wind speeds: once I have similar data from a number of different met stations I want to be able to form a DataFrame with the bins as the index and the columns as the freq. distrs.
VALUE_COUNTS()
I did think about value counts, it gives me this:
33.898003 20
37.287800 19
35.592899 18
32.203102 11
30.508202 8
38.982700 6
27.118401 3
25.423500 3
40.677601 2
28.813301 1
22.033701 1
dtype: int64
The data itself is clearly A/D converted: the thing is that suppose the next met station has different indices, such as 33.898006 instead of 33.898003, then I'll get a new 'bin' just for that one - I want to guarantee that the bins are the same for each set of data.
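One way to get the same counts without plotting is numpy.histogram, which returns only the counts and the bin edges. The sketch below is a minimal illustration under assumptions: the short series `t` is a stand-in built from the five values shown in t.head(), and the column name "Station1" is taken from the desired output. Because the bin edges are fixed up front, every station ends up with the same index regardless of small differences such as 33.898006 versus 33.898003.

```python
import numpy as np
import pandas as pd

# Stand-in for the real wind-speed series `t` (only the five values from t.head()).
t = pd.Series([35.592899, 33.898003, 33.898003, 35.592899, 40.677601])

# Fixed bin edges 20, 21, ..., 49, the same ones passed to plt.hist above.
bins = np.arange(20, 50)

# np.histogram returns (counts, edges) and draws nothing.
counts, edges = np.histogram(t, bins=bins)

# Index each count by the lower edge of its bin.
freq = pd.Series(counts, index=edges[:-1], name="Station1")

# Further stations binned with the same edges can be added as extra columns.
table = pd.DataFrame({"Station1": freq})
print(table[table["Station1"] > 0])
```

Each additional station can be binned with the same `bins` array and added as another column of `table`.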

Related

Pandas: How to Sort a Range of Columns in a Dataframe?

I have a Pandas dataframe that I need to sort by the data columns' maximum values. I am having trouble performing the sort because all of the sorting examples that I have found operate on all of the columns in the dataframe when performing the sort. In this case I need to sort only a subset of the columns. The first column contains a date, and the remaining 90 columns contain data. The 90 data columns are currently sorted alphabetically by their column name. I would like to sort them in decreasing order of their maximum value, which happens to be in the last row.
In the bigger scheme of things, this question is about how to perform sorting on a range of columns within a dataframe, rather than sorting all of the columns in the dataframe. There may be cases, for example, where I need to sort only columns 2 through 12 of a dataframe, while leaving the remaining columns in their existing order.
Here is a sample of the unsorted dataframe:
df.tail()
Date ADAMS ALLEN BARTHOLOMEW BENTON BLACKFORD BOONE BROWN ... WABASH WARREN WARRICK WASHINGTON WAYNE WELLS WHITE WHITLEY
65 2020-05-10 8 828 356 13 14 227 28 ... 64 12 123 48 53 11 149 22
66 2020-05-11 8 860 367 16 14 235 28 ... 67 12 126 48 56 12 161 23
67 2020-05-12 8 872 371 17 14 235 28 ... 67 12 131 49 56 12 162 23
68 2020-05-13 9 897 382 17 14 249 29 ... 68 12 140 50 58 13 164 27
69 2020-05-14 9 955 394 21 14 252 29 ... 69 12 145 50 60 15 164 28
I would like to perform the sort so that the column with the largest value in row 69 is placed right after df['Date'], with the remaining columns ordered so that the values in row 69 decrease from left to right. Once that is done, I'd like to create a series containing the column headers, to generate a rank list. Using the visible columns as an example, the desired list would be:
rank_list=[ "ALLEN", "BARTHOLOMEW", "BOONE", "WHITE", "WARRICK", ... "BLACKFORD", "WARREN", "ADAMS" ]
My biggest hurdle at present is that when I perform the sort I'm not able to exclude the Date column, and I'm receiving a type error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
I am new to Pandas so I apologize if there is a solution to this problem that should be obvious. thanks.
You can do this with sort_values, once you have selected the right row and the range of columns:
import numpy as np
import pandas as pd

# data sample
np.random.seed(86)
df = pd.DataFrame({'date': pd.date_range('2020-05-15', periods=5),
                   'a': np.random.randint(0, 50, 5),
                   'b': np.random.randint(0, 50, 5),
                   'c': np.random.randint(0, 50, 5),
                   'd': np.random.randint(0, 50, 5)})
# parameters
start_idx = 1  # note: indexing starts at 0, so 1 is the second column
end_idx = df.shape[1]  # one past the last column
row_position = df.shape[0] - 1  # the last row
# create the new order
new_col_order = df.columns.tolist()
new_col_order[start_idx:end_idx] = df.iloc[row_position, start_idx:end_idx]\
    .sort_values(ascending=False).index
# reorder
df = df[new_col_order]
print(df)
date c a d b
0 2020-05-15 30 20 44 40
1 2020-05-16 45 32 29 9
2 2020-05-17 17 44 14 27
3 2020-05-18 13 28 4 41
4 2020-05-19 41 35 14 12 #as you can see, the columns are now c, a, d, b
I suggest the following:
# initialize the provided sample data frame
df = pd.DataFrame([['65 2020-05-10', 8, 828, 356, 13, 14, 227, 28, 64, 12, 123, 48, 53, 11, 149, 22],
['66 2020-05-11', 8, 860, 367, 16, 14, 235, 28, 67, 12, 126, 48, 56, 12, 161, 23],
['67 2020-05-12', 8, 872, 371, 17, 14, 235, 28, 67, 12, 131, 49, 56, 12, 162, 23],
['68 2020-05-13', 9, 897, 382, 17, 14, 249, 29, 68, 12, 140, 50, 58, 13, 164, 27],
['69 2020-05-14', 9, 955, 394, 21, 14, 252, 29, 69, 12, 145, 50, 60, 15, 164, 28]],
columns = ['Date', 'ADAMS', 'ALLEN', 'BARTHOLOMEW', 'BENTON', 'BLACKFORD', 'BOONE', 'BROWN', 'WABASH', 'WARREN', 'WARRICK', 'WASHINGTON', 'WAYNE', 'WELLS', 'WHITE', 'WHITLEY']
)
# a list of tuples in the form (column_name, max_value)
column_max_list = [(column, df[column].max()) for column in df.columns.values[1:]]
# sort the list descending by the max value
column_max_list_sorted = sorted(column_max_list, key = lambda tup: tup[1], reverse = True)
# extract only the column names
rank_list = [tup[0] for tup in column_max_list_sorted]
for i in range(len(rank_list)):
    # get the column to insert next
    col = df[rank_list[i]]
    # drop the column to be inserted back
    df.drop(columns = [rank_list[i]], inplace = True)
    # insert the column at the correct index
    df.insert(loc = i + 1, column = rank_list[i], value = col)
This yields the desired rank_list
['ALLEN', 'BARTHOLOMEW', 'BOONE', 'WHITE', 'WARRICK', 'WABASH', 'WAYNE', 'WASHINGTON', 'BROWN', 'WHITLEY', 'BENTON', 'WELLS', 'BLACKFORD', 'WARREN', 'ADAMS']
as well as the desired df:
Date ALLEN BARTHOLOMEW BOONE WHITE ...
0 65 2020-05-10 828 356 227 149 ...
1 66 2020-05-11 860 367 235 161 ...
2 67 2020-05-12 872 371 235 162 ...
3 68 2020-05-13 897 382 249 164 ...
4 69 2020-05-14 955 394 252 164 ...
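For comparison, the rank list and the reordered frame can also be produced without a loop. This is only a sketch on a cut-down, hypothetical sample (four of the county columns and two of the rows from the question); it ranks by each column's maximum, which the question says coincides with the last row.

```python
import pandas as pd

# Cut-down sample: two rows and four data columns taken from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-13", "2020-05-14"]),
    "ADAMS": [9, 9],
    "ALLEN": [897, 955],
    "BOONE": [249, 252],
    "WHITE": [164, 164],
})

# Leave Date out of the comparison, take each column's maximum,
# and sort the column names by that maximum in decreasing order.
rank_list = (
    df.drop(columns="Date")
      .max()
      .sort_values(ascending=False)
      .index
      .tolist()
)

# Put Date first, followed by the ranked data columns.
df_sorted = df[["Date"] + rank_list]
print(rank_list)
print(df_sorted)
```

Dropping the Date column before comparing is what avoids the Timestamp-versus-int TypeError mentioned in the question.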

Reshaping column values into rows with Identifier column at the end

I have Power measurements from different sensors, i.e. A1_Pin, A2_Pin and so on. These measurements are recorded in a file as columns, and each row is uniquely identified by its timestamp.
df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
'12/15/2019', '12/16/2019'],
'A1_Pin': [2, 8, 8, 3, 9],
'A2_Pin': [1, 2, 3, 4, 5],
'A3_Pin': [85, 36, 78, 32, 75]})
I want to reform the table so that each row corresponds to one sensor. The last column indicates the sensor ID to which the row data belongs to.
The final table should look like:
df2 = pd.DataFrame({'DateTime': ['12/12/2019', '12/12/2019', '12/12/2019',
'12/13/2019', '12/13/2019','12/13/2019', '12/14/2019', '12/14/2019',
'12/14/2019', '12/15/2019','12/15/2019', '12/15/2019', '12/16/2019',
'12/16/2019', '12/16/2019'],
'Power': [2, 1, 85,8, 2, 36, 8,3,78, 3, 4, 32, 9, 5, 75],
'ModID': ['A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN']})
I have tried Groupby, Melt, Reshape, Stack and loops but could not do that. If anyone could help? Thanks
When you tried stack, you were on the right track. You need to call set_index first and reset_index afterwards, like this:
df2 = df1.set_index('DateTime').stack().reset_index(name='Power')\
    .rename(columns={'level_1':'ModID'})  # to match the column names in your expected output
And you get:
print (df2)
DateTime ModID Power
0 12/12/2019 A1_Pin 2
1 12/12/2019 A2_Pin 1
2 12/12/2019 A3_Pin 85
3 12/13/2019 A1_Pin 8
4 12/13/2019 A2_Pin 2
5 12/13/2019 A3_Pin 36
6 12/14/2019 A1_Pin 8
7 12/14/2019 A2_Pin 3
8 12/14/2019 A3_Pin 78
9 12/15/2019 A1_Pin 3
10 12/15/2019 A2_Pin 4
11 12/15/2019 A3_Pin 32
12 12/16/2019 A1_Pin 9
13 12/16/2019 A2_Pin 5
14 12/16/2019 A3_Pin 75
I'd try something like this:
df1.set_index('DateTime').unstack().reset_index()
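Since melt was one of the things tried in the question, here is a hedged sketch of how it could produce the same layout. The names `ModID` and `Power` come from the desired output. The stable sort on DateTime is an assumption about the wanted row order (grouped by date, sensors in column order); it works here because the date strings all share the same month and year, so real data would be better converted with pd.to_datetime first.

```python
import pandas as pd

df1 = pd.DataFrame({
    "DateTime": ["12/12/2019", "12/13/2019", "12/14/2019",
                 "12/15/2019", "12/16/2019"],
    "A1_Pin": [2, 8, 8, 3, 9],
    "A2_Pin": [1, 2, 3, 4, 5],
    "A3_Pin": [85, 36, 78, 32, 75],
})

# melt turns each sensor column into rows: one (DateTime, ModID, Power) per value.
long_df = df1.melt(id_vars="DateTime", var_name="ModID", value_name="Power")

# melt lists all A1_Pin rows first, then A2_Pin, then A3_Pin. A stable sort on
# DateTime regroups them by date while keeping the sensor order inside each date.
df2 = (
    long_df.sort_values("DateTime", kind="mergesort")
           .reset_index(drop=True)[["DateTime", "Power", "ModID"]]
)
print(df2)
```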

The polynomial interpolation of degree 4

For the provided data set, write equations for calculating the polynomial interpolation of degree 4 and find the formula for f by hand.
x = [1, 2, 3, 4, 5]
y = f(x) = [5, 31, 121, 341, 781]
The pyramid of iterated differences is
5 31 121 341 781
26 90 220 440
64 130 220
66 90
24
You can extend the table by holding the last row constant at 24 and computing backwards, or you can read the coefficients of the Newton interpolation polynomial off the pyramid. Either way, the extended value table for x=-3..10 is
[61, 11, 1, 1, 5, 31, 121, 341, 781, 1555, 2801, 4681, 7381, 11111]
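To spell out the second route: the leading entries of the pyramid rows are 5, 26, 64, 66, 24, and dividing the k-th one by k! gives the Newton coefficients 5, 26, 32, 11, 1. So f(x) = 5 + 26(x-1) + 32(x-1)(x-2) + 11(x-1)(x-2)(x-3) + (x-1)(x-2)(x-3)(x-4), which expands to f(x) = x^4 + x^3 + x^2 + x + 1. The short script below is only a check of that hand calculation; the helper name `newton` is made up for illustration.

```python
from fractions import Fraction
from math import factorial

x0 = 1                       # first node; nodes are x0, x0+1, ..., x0+4
y = [5, 31, 121, 341, 781]   # given values f(1)..f(5)

# Build the pyramid of iterated differences, keeping the first entry of each row.
row = list(y)
leading = []
while row:
    leading.append(row[0])
    row = [b - a for a, b in zip(row, row[1:])]
# leading is now [5, 26, 64, 66, 24]

# Newton forward-difference coefficients: k-th difference divided by k!.
coeffs = [Fraction(d, factorial(k)) for k, d in enumerate(leading)]

def newton(x):
    """Evaluate the Newton form at x (hypothetical helper for this check)."""
    total, product = Fraction(0), 1
    for k, c in enumerate(coeffs):
        total += c * product
        product *= x - (x0 + k)
    return total

values = [int(newton(x)) for x in range(-3, 11)]
print(values)
```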

plotting histograms in pandas

I am looking to plot two sets of data, each with 10 points in them, in overlapping bins.
values1 = [29, 31, 24, 30, 30, 14, 25, 35, 27, 31]
values2 = [36, 29, 29, 29, 34, 33, 27, 34, 36, 39]
When I add them to a dataframe they come out as 2 columns.
I am looking to plot 2 rows, each with 10 overlapping columns.
df1 = pd.DataFrame(values1, values2)
and subsequently when I plot them using histograms they do not come out correctly
df1.plot.hist(stacked = True)
plt.show()
So my aim is to do a pairwise comparison between each of the numbers in the arrays. 29 - 36 , 31 - 29 , 24 - 29 etc.
I would like to plot them so that they overlap as this example
http://pandas.pydata.org/pandas-docs/stable/_images/hist_new_stacked.png
however I have only two values instead of three as in the example.
You can pass them as values to a dict:
values1 = [29, 31, 24, 30, 30, 14, 25, 35, 27, 31]
values2 = [36, 29, 29, 29, 34, 33, 27, 34, 36, 39]
df1 = pd.DataFrame({'values1':values1, 'values2':values2})
df1.plot.hist(stacked = True)
What you did caused the constructor to interpret the first argument as a single column of data and the second as the index values, since the first two positional parameters are data and index:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Compare the difference:
In [166]:
pd.DataFrame(values1, values2)
Out[166]:
0
36 29
29 31
29 24
29 30
34 30
33 14
27 25
34 35
36 27
39 31
In [167]:
pd.DataFrame({'values1':values1, 'values2':values2})
Out[167]:
values1 values2
0 29 36
1 31 29
2 24 29
3 30 29
4 30 34
5 14 33
6 25 27
7 35 34
8 27 36
9 31 39
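If the aim is for the two distributions to overlap rather than stack on top of each other, one option is to drop stacked=True and make the bars semi-transparent. This is a sketch, not the answer's method: the alpha and bins values are arbitrary choices, and the Agg backend line is only there so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove to get an interactive window
import matplotlib.pyplot as plt
import pandas as pd

values1 = [29, 31, 24, 30, 30, 14, 25, 35, 27, 31]
values2 = [36, 29, 29, 29, 34, 33, 27, 34, 36, 39]

# One column per data set, as in the dict-based constructor call above.
df1 = pd.DataFrame({"values1": values1, "values2": values2})

# Without stacked=True both histograms share the same bins and are drawn
# over each other; alpha < 1 keeps the one underneath visible.
ax = df1.plot.hist(alpha=0.5, bins=10)
ax.set_xlabel("value")
plt.show()
```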

Pandas DataFrame.update with MultiIndex label

Given a DataFrame A with MultiIndex and a DataFrame B with one-dimensional index, how to update column values of A with new values from B where the index of B should be matched with the second index label of A.
Test data:
begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]
values_updated = [10, 20, 3, 4, 50, 60]
multiindexed = pd.DataFrame({'begin': begin,
'end': end,
'value': values})
multiindexed.set_index(['begin', 'end'], inplace=True)
singleindexed = pd.DataFrame.from_dict(dict(zip([10, 11, 14, 15],
[10, 20, 50, 60])),
orient='index')
singleindexed.columns = ['value']
And the desired result should be
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
Now I was thinking about a variant of
multiindexed.update(singleindexed)
I searched the docs of DataFrame.update, but could not find anything w.r.t. index handling.
Am I missing an easier way to accomplish this?
You can use loc to select the matching rows in multiindexed and then assign the new values from singleindexed.values:
print(singleindexed.index)
Int64Index([10, 11, 14, 15], dtype='int64')
print(singleindexed.values)
[[10]
[20]
[50]
[60]]
idx = pd.IndexSlice
print(multiindexed.loc[idx[:, singleindexed.index], :])
value
begin end
10 10 1
11 2
14 14 5
15 6
multiindexed.loc[idx[:, singleindexed.index],:] = singleindexed.values
print(multiindexed)
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
See "Using slicers" in the pandas MultiIndex documentation.
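Another way to express "match B's index against the second level of A" is to look up each row's end label in singleindexed and keep the old value where there is no match. The following is a sketch under that reading of the question, not a replacement for DataFrame.update; the intermediate names `end_level` and `looked_up` are made up for illustration.

```python
import pandas as pd

begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]

multiindexed = pd.DataFrame({"begin": begin, "end": end, "value": values})
multiindexed = multiindexed.set_index(["begin", "end"])

singleindexed = pd.DataFrame({"value": [10, 20, 50, 60]},
                             index=[10, 11, 14, 15])

# The 'end' label of every row in the MultiIndex.
end_level = multiindexed.index.get_level_values("end")

# Look each label up in singleindexed; labels without a match become NaN.
looked_up = pd.Series(end_level.map(singleindexed["value"]).to_numpy(),
                      index=multiindexed.index)

# Keep the old value wherever there was no match, then restore the int dtype.
multiindexed["value"] = looked_up.fillna(multiindexed["value"]).astype(int)
print(multiindexed)
```

Unlike the positional assignment with .values, this lookup does not rely on the rows of singleindexed being in the same order as the selected rows of multiindexed.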