plotting histograms in pandas

I am looking to plot two sets of data, each with 10 points, in overlapping bins.
values1 = [29, 31, 24, 30, 30, 14, 25, 35, 27, 31]
values2 = [36, 29, 29, 29, 34, 33, 27, 34, 36, 39]
When I add them to a dataframe they come out as 2 columns.
I am looking to plot 2 rows, each with 10 overlapping columns.
df1 = pd.DataFrame(values1, values2)
and subsequently, when I plot them as histograms, they do not come out correctly:
df1.plot.hist(stacked = True)
plt.show()
So my aim is to do a pairwise comparison between the numbers in the arrays: 29 with 36, 31 with 29, 24 with 29, and so on.
I would like to plot them so that they overlap, as in this example:
http://pandas.pydata.org/pandas-docs/stable/_images/hist_new_stacked.png
However, I have only two series instead of the three in the example.

You can pass them as values to a dict:
values1 = [29, 31, 24, 30, 30, 14, 25, 35, 27, 31]
values2 = [36, 29, 29, 29, 34, 33, 27, 34, 36, 39]
df1 = pd.DataFrame({'values1':values1, 'values2':values2})
df1.plot.hist(stacked = True)
What you did caused the constructor to interpret the first positional argument as a single column of data and the second as the index values:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Compare the difference:
In [166]:
pd.DataFrame(values1, values2)
Out[166]:
0
36 29
29 31
29 24
29 30
34 30
33 14
27 25
34 35
36 27
39 31
In [167]:
pd.DataFrame({'values1':values1, 'values2':values2})
Out[167]:
values1 values2
0 29 36
1 31 29
2 24 29
3 30 29
4 30 34
5 14 33
6 25 27
7 35 34
8 27 36
9 31 39
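Note that stacked=True stacks the two series on top of each other. If you want the bars to truly overlap, as in the linked image, a minimal sketch using semi-transparent bars over shared bins (the alpha value and bin range here are assumptions, not part of the original answer):
import matplotlib.pyplot as plt
import pandas as pd

values1 = [29, 31, 24, 30, 30, 14, 25, 35, 27, 31]
values2 = [36, 29, 29, 29, 34, 33, 27, 34, 36, 39]
df1 = pd.DataFrame({'values1': values1, 'values2': values2})

# shared bins + alpha keep both distributions visible where they overlap
df1.plot.hist(bins=range(10, 45, 2), alpha=0.5)
plt.show()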

Related

Pandas: How to Sort a Range of Columns in a Dataframe?

I have a Pandas dataframe that I need to sort by the data columns' maximum values. I am having trouble because all of the sorting examples I have found operate on every column in the dataframe. In this case I need to sort only a subset of the columns. The first column contains a date, and the remaining 90 columns contain data. The 90 data columns are currently sorted alphabetically by their column name. I would like to sort them in decreasing order of their maximum value, which happens to be in the last row.
In the bigger scheme of things, this question is about how to perform sorting on a range of columns within a dataframe, rather than sorting all of the columns in the dataframe. There may be cases, for example, where I need to sort only columns 2 through 12 of a dataframe, while leaving the remaining columns in their existing order.
Here is a sample of the unsorted dataframe:
df.tail()
Date ADAMS ALLEN BARTHOLOMEW BENTON BLACKFORD BOONE BROWN ... WABASH WARREN WARRICK WASHINGTON WAYNE WELLS WHITE WHITLEY
65 2020-05-10 8 828 356 13 14 227 28 ... 64 12 123 48 53 11 149 22
66 2020-05-11 8 860 367 16 14 235 28 ... 67 12 126 48 56 12 161 23
67 2020-05-12 8 872 371 17 14 235 28 ... 67 12 131 49 56 12 162 23
68 2020-05-13 9 897 382 17 14 249 29 ... 68 12 140 50 58 13 164 27
69 2020-05-14 9 955 394 21 14 252 29 ... 69 12 145 50 60 15 164 28
I would like to perform the sort so that the column with the largest value in row 69 is placed after df['Date'], with the columns ordered so that the values in row 69 decrease from left to right. Once that is done, I'd like to create a series containing the column headers, to generate a rank list. Using the visible columns as an example, the desired list would be:
rank_list=[ "ALLEN", "BARTHOLOMEW", "BOONE", "WHITE", "WARRICK", ... "BLACKFORD", "WARREN", "ADAMS" ]
My biggest hurdle at present is that when I perform the sort I'm not able to exclude the Date column, and I'm receiving a type error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
I am new to Pandas, so I apologize if there is a solution to this problem that should be obvious. Thanks.
You can do it this way using sort_values, once you have selected the right row and the range of columns:
import numpy as np
import pandas as pd

# data sample
np.random.seed(86)
df = pd.DataFrame({'date': pd.date_range('2020-05-15', periods=5),
                   'a': np.random.randint(0, 50, 5),
                   'b': np.random.randint(0, 50, 5),
                   'c': np.random.randint(0, 50, 5),
                   'd': np.random.randint(0, 50, 5)})

# parameters
start_idx = 1                   # note: indexing starts at 0, so 1 is the second column
end_idx = df.shape[1]           # up to the last column
row_position = df.shape[0] - 1  # the last row

# create the new order
new_col_order = df.columns.tolist()
new_col_order[start_idx:end_idx] = df.iloc[row_position, start_idx:end_idx]\
                                     .sort_values(ascending=False).index

# reorder
df = df[new_col_order]
print(df)
date c a d b
0 2020-05-15 30 20 44 40
1 2020-05-16 45 32 29 9
2 2020-05-17 17 44 14 27
3 2020-05-18 13 28 4 41
4 2020-05-19 41 35 14 12
As you can see, the columns are now c, a, d, b.
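If you also need the rank_list from the question, it can be read straight off the reordered column list; a small addition to the sketch above (not part of the original answer):
rank_list = new_col_order[start_idx:end_idx]  # column names in decreasing order of the last row
print(rank_list)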
I suggest the following:
# initialize the provided sample data frame
df = pd.DataFrame([['65 2020-05-10', 8, 828, 356, 13, 14, 227, 28, 64, 12, 123, 48, 53, 11, 149, 22],
                   ['66 2020-05-11', 8, 860, 367, 16, 14, 235, 28, 67, 12, 126, 48, 56, 12, 161, 23],
                   ['67 2020-05-12', 8, 872, 371, 17, 14, 235, 28, 67, 12, 131, 49, 56, 12, 162, 23],
                   ['68 2020-05-13', 9, 897, 382, 17, 14, 249, 29, 68, 12, 140, 50, 58, 13, 164, 27],
                   ['69 2020-05-14', 9, 955, 394, 21, 14, 252, 29, 69, 12, 145, 50, 60, 15, 164, 28]],
                  columns=['Date', 'ADAMS', 'ALLEN', 'BARTHOLOMEW', 'BENTON', 'BLACKFORD', 'BOONE', 'BROWN',
                           'WABASH', 'WARREN', 'WARRICK', 'WASHINGTON', 'WAYNE', 'WELLS', 'WHITE', 'WHITLEY'])
# a list of tuples in the form (column_name, max_value)
column_max_list = [(column, df[column].max()) for column in df.columns.values[1:]]
# sort the list descending by the max value
column_max_list_sorted = sorted(column_max_list, key=lambda tup: tup[1], reverse=True)
# extract only the column names
rank_list = [tup[0] for tup in column_max_list_sorted]

for i in range(len(rank_list)):
    # get the column to insert next
    col = df[rank_list[i]]
    # drop the column so it can be inserted back
    df.drop(columns=[rank_list[i]], inplace=True)
    # insert the column at the correct position
    df.insert(loc=i + 1, column=rank_list[i], value=col)
This yields the desired rank_list
['ALLEN', 'BARTHOLOMEW', 'BOONE', 'WHITE', 'WARRICK', 'WABASH', 'WAYNE', 'WASHINGTON', 'BROWN', 'WHITLEY', 'BENTON', 'WELLS', 'BLACKFORD', 'WARREN', 'ADAMS']
as well as the desired df:
Date ALLEN BARTHOLOMEW BOONE WHITE ...
0 65 2020-05-10 828 356 227 149 ...
1 66 2020-05-11 860 367 235 161 ...
2 67 2020-05-12 872 371 235 162 ...
3 68 2020-05-13 897 382 249 164 ...
4 69 2020-05-14 955 394 252 164 ...
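As a side note, once rank_list is computed, the drop/insert loop above can be replaced by a single column selection, which is equivalent and shorter:
df = df[['Date'] + rank_list]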

How to create a partially filled column in pandas

I have a df_trg with, say, 10 rows numbered 0-9.
From various sources I get values for an additional column foo, each covering only a subset of the rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
    df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with updating data), and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation by adding a foo column, initially filled with an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
    df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is shorter and (at least a little) nicer than yours.
Alternative: fill the foo column initially with NaN; but in that case the updating values will be converted to float (a side effect of using NaN).
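A minimal sketch of that alternative, assuming the same df and sources as above:
df = df.assign(foo=np.nan)
for src in sources:
    df.foo.update(other=src)
# rows 5 and 8 stay NaN; the updated values become floats (e.g. 11.0)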

appending a list as dataframe row

I have a list of counts with 5 elements.
counts = [33, 35, 17, 38, 29]
This counts list is updated with new numbers every day, so I wanted to create a dataframe and append the counts data as a new row each day. Every element of the list should appear in a separate column of the dataframe. What I want to do is the following:
df = pd.DataFrame(columns = ['w1', 'w2', 'w3', 'w4', 'w5'])
df = df.append(counts)
but instead of adding the counts data as a row, it adds a new column. Any help on how to do this correctly?
Assume counts on day0 is [33, 35, 17, 38, 29] and on day1 is [30, 36, 20, 34, 32]; what I want as output is the following:
w1 w2 w3 w4 w5
0 33 35 17 38 29
1 30 36 20 34 32
where the index represents the day on which the counts were taken.
Appending to a DataFrame is possible, but because it is slow when there are many rows, it is better to create a list of lists and build the DataFrame with the constructor:
counts = [33, 35, 17, 38, 29]
counts1 = [37, 8, 1, 2, 0]
L = [counts, counts1]
df = pd.DataFrame(L, columns = ['w1', 'w2', 'w3', 'w4', 'w5'])
print (df)
w1 w2 w3 w4 w5
0 33 35 17 38 29
1 37 8 1 2 0
But if you need it, e.g. to append only one row per day, then it is necessary to create a Series with the same index values as the columns:
df = pd.DataFrame(columns = ['w1', 'w2', 'w3', 'w4', 'w5'])
counts = [33, 35, 17, 38, 29]
s = pd.Series(counts, index=df.columns)
df = df.append(s, ignore_index=True)
print (df)
w1 w2 w3 w4 w5
0 33 35 17 38 29
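Note: DataFrame.append is deprecated in recent pandas versions (and removed in pandas 2.0). An equivalent using pd.concat, converting the Series to a one-row frame first:
df = pd.concat([df, s.to_frame().T], ignore_index=True)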
If you know day0 and day1 in the beginning, then you want to construct it as jezrael suggests.
But I'm assuming that you want to be able to add a new row in a loop.
Solution with loc
When using loc you need to use an index value that doesn't already exist. In this case, I'm assuming that we are maintaining a generic RangeIndex. If that is the case, I'll assume that the next index is the same as the length of the current DataFrame.
df.loc[len(df), :] = counts
df
w1 w2 w3 w4 w5
0 33 35 17 38 29
Let's make the loop.
day0 = [33, 35, 17, 38, 29]
day1 = [30, 36, 20, 34, 32]

df = pd.DataFrame(columns=['w1', 'w2', 'w3', 'w4', 'w5'])  # start again from an empty frame
for counts in [day0, day1]:
    df.loc[len(df), :] = counts
df
w1 w2 w3 w4 w5
0 33.0 35.0 17.0 38.0 29.0
1 30.0 36.0 20.0 34.0 32.0
import random

ran_list = []
for i in range(12):
    ran_list.append(round(random.uniform(133.33, 266.66), 3))
print(ran_list)
df = pd.DataFrame([ran_list])
df

Pandas DataFrame.update with MultiIndex label

Given a DataFrame A with a MultiIndex and a DataFrame B with a one-dimensional index, how can the column values of A be updated with new values from B, where the index of B is matched against the second index level of A?
Test data:
begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]
values_updated = [10, 20, 3, 4, 50, 60]
multiindexed = pd.DataFrame({'begin': begin,
                             'end': end,
                             'value': values})
multiindexed.set_index(['begin', 'end'], inplace=True)
singleindexed = pd.DataFrame.from_dict(dict(zip([10, 11, 14, 15],
                                                [10, 20, 50, 60])),
                                       orient='index')
singleindexed.columns = ['value']
And the desired result should be
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
Now I was thinking about a variant of
multiindexed.update(singleindexed)
I searched the docs of DataFrame.update, but could not find anything w.r.t. index handling.
Am I missing an easier way to accomplish this?
You can use loc for selecting the matching rows in multiindexed and then assign the new values from the values array:
print(singleindexed.index)
Int64Index([10, 11, 14, 15], dtype='int64')
print(singleindexed.values)
[[10]
[20]
[50]
[60]]
idx = pd.IndexSlice
print(multiindexed.loc[idx[:, singleindexed.index], :])
value
begin end
10 10 1
11 2
14 14 5
15 6
multiindexed.loc[idx[:, singleindexed.index], :] = singleindexed.values
print(multiindexed)
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
See "Using slicers" in the MultiIndex section of the docs.
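If you prefer to avoid IndexSlice, a minimal alternative sketch that matches on the end level explicitly (this assumes the level names from the test data above; it is not part of the original answer):
# boolean mask of rows whose 'end' label appears in singleindexed
end_labels = multiindexed.index.get_level_values('end')
mask = end_labels.isin(singleindexed.index)
multiindexed.loc[mask, 'value'] = singleindexed['value'].loc[end_labels[mask]].values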

Frequency distribution of series in pandas

Let's say I have a pandas series:
>> t.head()
Timestamp
2014-02-01 05:43:26 35.592899
2014-02-01 06:18:32 33.898003
2014-02-01 10:04:04 33.898003
2014-02-01 10:36:30 35.592899
2014-02-01 12:20:32 40.677601
and what I want is a frequency table with bins I can set. This sounds easy, but the closest I've come is via matplotlib:
In [8]: fd = plt.hist(t, bins=range(20,50))
In [9]: fd
Out[9]:
(array([ 0, 0, 1, 0, 0, 3, 0, 3, 1, 0, 8, 0, 11, 20, 0, 18, 0,
19, 6, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]),
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
<a list of 29 Patch objects>)
but that of course actually plots the histogram. I can find lots of advice on how to plot histograms, but not on how to simply form the frequency distribution; from the above I have the 'bins' as fd[1] (or at least their lower bounds) and the counts as fd[0].
I want the frequency distribution on its own in order to later form a DataFrame with the distributions of a number of series (all with the same bins). I feel there must be a way to do it without matplotlib.
UPDATE: desired results:
{'Station1': 20 0
21 0
22 1
23 0
24 0
25 3
26 0
27 3
28 1
29 0
30 8
31 0
32 11
33 20
34 0
35 18
36 0
37 19
38 6
39 0
40 2
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
dtype: int32}
These are wind speeds: once I have similar data from a number of different met stations I want to be able to form a DataFrame with the bins as the index and the columns as the freq. distrs.
value_counts()
I did think about value_counts; it gives me this:
33.898003 20
37.287800 19
35.592899 18
32.203102 11
30.508202 8
38.982700 6
27.118401 3
25.423500 3
40.677601 2
28.813301 1
22.033701 1
dtype: int64
The data itself is clearly A/D converted. The thing is, suppose the next met station produces slightly different values, such as 33.898006 instead of 33.898003; then I'll get a new 'bin' just for that one value. I want to guarantee that the bins are the same for each set of data.
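A minimal sketch of one way to do this without matplotlib, using np.histogram (the same function plt.hist uses internally) and wrapping the result in a Series; the bin range is taken from the question:
import numpy as np
import pandas as pd

counts, edges = np.histogram(t, bins=range(20, 50))
freq = pd.Series(counts, index=edges[:-1])  # index = lower bound of each bin
# stations sharing the same bins can then be combined, e.g.:
# pd.DataFrame({'Station1': freq})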