Barplot based on coloured sectors - dataframe

I am creating a bar plot where the colour of each bar is proportional to its value. However, is it possible to break each bar down into colour sectors? Below is what I have and what I mean:
import pandas as pd
import plotly.express as px
df = px.data.wind()
df_test = df[df["strength"]=='0-1']
fig = px.bar_polar(df_test, r='frequency', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()
Below is what I want to achieve.
Is it possible to have polar sectors instead of the actual bars? Or maybe at least add the horizontal axis to the bars? Or is there another plotly chart type for this?

You can create linearly spaced arrays for each direction in df_test. Currently df_test looks like the following:
direction strength frequency
0 N 0-1 0.5
1 NNE 0-1 0.6
2 NE 0-1 0.5
3 ENE 0-1 0.4
4 E 0-1 0.4
5 ESE 0-1 0.3
6 SE 0-1 0.4
7 SSE 0-1 0.4
8 S 0-1 0.6
9 SSW 0-1 0.4
10 SW 0-1 0.5
11 WSW 0-1 0.6
12 W 0-1 0.6
13 WNW 0-1 0.5
14 NW 0-1 0.4
15 NNW 0-1 0.1
What we want is for each direction to have linearly spaced frequencies between 0.1 and the ending frequency value in df_test for that particular direction, such as the following:
direction strength frequency
0 N 0-1 0.1
1 N 0-1 0.2
2 N 0-1 0.3
3 N 0-1 0.4
4 N 0-1 0.5
5 NNE 0-1 0.1
...
We can perform a groupby on direction for df_test, then use np.arange to obtain the linearly spaced array for that particular direction, and build the resulting dataframe using pd.concat. Then we can sort the new DataFrame so the directions are in the correct order (as this is the format that px.bar_polar expects).
import numpy as np
import pandas as pd
import plotly.express as px
df = px.data.wind()
df_test = df[df["strength"]=='0-1']
df_test_sectors = pd.DataFrame(columns=df_test.columns)
## this only works if each group has one row
for direction, df_direction in df_test.groupby('direction'):
    frequency_stop = df_direction['frequency'].tolist()[0]
    frequencies = np.arange(0.1, frequency_stop + 0.1, 0.1)
    df_sector = pd.DataFrame({
        'direction': [direction] * len(frequencies),
        'strength': ['0-1'] * len(frequencies),
        'frequency': frequencies
    })
    df_test_sectors = pd.concat([df_test_sectors, df_sector])
df_test_sectors = df_test_sectors.reset_index(drop=True)
df_test_sectors['direction'] = pd.Categorical(
    df_test_sectors['direction'],
    df_test.direction.tolist()  # sort the directions into the same order as in df_test
)
df_test_sectors['frequency'] = df_test_sectors['frequency'].astype(float)
df_test_sectors = df_test_sectors.sort_values(['direction', 'frequency'])
fig = px.bar_polar(df_test_sectors, r='frequency', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()
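For reference, here is a loop-free sketch of the same expansion (not part of the original answer). It assumes every frequency is a positive multiple of the 0.1 step, which holds for this dataset; steps and df_expanded are names introduced here for illustration:
steps = (df_test['frequency'] / 0.1).round().astype(int)  # number of 0.1-wide sectors per row
df_expanded = df_test.loc[df_test.index.repeat(steps)].copy()  # repeat each row once per sector
df_expanded['frequency'] = (df_expanded.groupby(level=0).cumcount() + 1) * 0.1  # 0.1, 0.2, ..., f
fig = px.bar_polar(df_expanded, r='frequency', theta='direction', color='frequency', color_continuous_scale='YlGn')
fig.show()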

Related

return list by dataframe linear interpolation

I have a dataframe that has, let's say 5 entries.
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order of [moment, stress, strain], based on the linear interpolation of strain = 0.45
I have read a couple of threads about the interpolate() method from pandas, but that is used when you have NaN entries and want to fill them in.
How do I accomplish a similar task in my case?
Thank you
One method is to add a new row with NaN values to your dataframe and sort it (DataFrame.append has been removed in recent pandas, so pd.concat is used here):
import numpy as np
import pandas as pd

df = pd.concat(
    [df, pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])],
    ignore_index=True,
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
moment stress
strain
0.11 0.1200 13.00
0.12 0.2300 14.00
0.45 0.4775 14.75
0.56 0.5600 15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
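As a side note (not from the original answer), numpy's np.interp can perform the same linear lookup directly, assuming df is the original three-row frame sorted by ascending strain:
import numpy as np

strain_target = 0.45
moment_i = np.interp(strain_target, df['strain'], df['moment'])
stress_i = np.interp(strain_target, df['strain'], df['stress'])
print([moment_i, stress_i, strain_target])  # approximately [0.4775, 14.75, 0.45]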

Calculate nearest distance to certain points in python

I have a dataset as shown below; each sample has x and y values and a corresponding result:
Sr. X Y Result
1   2  12 Positive
2   4  3  Positive
....
Visualization
Grid size is 12 * 8
How can I calculate the nearest distance of each sample from the red (positive) points?
Red = Positive,
Blue = Negative
Sr. X Y Result Nearest-distance-red
1 2 23 Positive ?
2 4 3 Negative ?
....
It's a lot easier when there is sample data, so make sure to include that next time.
I'll generate random data:
import numpy as np
import pandas as pd
import sklearn
x = np.linspace(1,50)
y = np.linspace(1,50)
GRID = np.meshgrid(x,y)
grid_colors = 1 * (np.random.random(GRID[0].size) > .8)
sample_data = pd.DataFrame( {'X': GRID[0].flatten(), 'Y':GRID[1].flatten(), 'grid_color' : grid_colors})
sample_data.plot.scatter(x="X",y='Y', c='grid_color', colormap='bwr', figsize=(10,10))
BallTree (or KDTree) can create a tree that you can query:
from sklearn.neighbors import BallTree
red_points = sample_data[sample_data.grid_color == 1]
blue_points = sample_data[sample_data.grid_color != 1]
tree = BallTree(red_points[['X','Y']], leaf_size=15, metric='minkowski')
and use it with
distance, index = tree.query(sample_data[['X','Y']], k=1)
now add it to the DataFrame
sample_data['nearest_point_distance'] = distance
sample_data['nearest_point_X'] = red_points.X.values[index]
sample_data['nearest_point_Y'] = red_points.Y.values[index]
which gives
     X    Y  grid_color  nearest_point_distance  nearest_point_X  nearest_point_Y
0  1.0  1.0           0                     2.0              3.0              1.0
1  2.0  1.0           0                     1.0              3.0              1.0
2  3.0  1.0           1                     0.0              3.0              1.0
3  4.0  1.0           0                     1.0              3.0              1.0
4  5.0  1.0           1                     0.0              5.0              1.0
To modify this so that red points do not find themselves, query for the nearest k=2 points instead of k=1:
distance, index = tree.query(sample_data[['X','Y']], k=2)
Then, with the help of numpy indexing, make the red points use the second match instead of the first:
sample_size = GRID[0].size
sample_data['nearest_point_distance'] = distance[np.arange(sample_size),sample_data.grid_color]
sample_data['nearest_point_X'] = red_points.X.values[index[np.arange(sample_size),sample_data.grid_color]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[np.arange(sample_size),sample_data.grid_color]]
The output format is the same, but due to randomness it won't agree with the picture made earlier.
cKDTree from scipy can also calculate that distance for you. Something along these lines should work (query expects an array of points and returns both distances and indices):
from scipy.spatial import cKDTree

distances, _ = cKDTree(coordinates_of_red_points).query(df[['x', 'y']], k=1)
df['Distance_To_Red'] = distances
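For completeness, a minimal self-contained sketch of the same idea; the column names and sample values below are assumptions for illustration, not taken from the question:
from scipy.spatial import cKDTree
import pandas as pd

df = pd.DataFrame({'x': [2, 4, 7], 'y': [12, 3, 5],
                   'Result': ['Positive', 'Negative', 'Negative']})
red = df.loc[df['Result'] == 'Positive', ['x', 'y']].to_numpy()  # coordinates of the red points
distances, _ = cKDTree(red).query(df[['x', 'y']].to_numpy(), k=1)  # nearest red point per row
df['Distance_To_Red'] = distances
print(df)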

Plotting interval of data in dataframe

A bit new to python so maybe code could be improved.
I have a txt file with x and y values, separated by some NaN in between.
Data goes from -x to x and then comes back (x to -x) but with somewhat different values of y, say:
x=np.array([-0.02,-0.01,0,0.01,0.02,NaN,1,NaN,0.02,0.01,0,-0.01,-0.02])
I would like to plot (with matplotlib) the data up to the first NaN with a certain format, the point x=1 with another format, and the last set of data with a third format (colour, marker, linewidth...).
Of course the data I have is a bit more complex, but I guess this is a simple, useful approximation.
Any idea?
I'm using pandas as my data manipulation tool
You can create a group label taking the cumsum of where x is null. Then you can define a dictionary keyed by the label with values being a dictionary containing all of the plotting parameters. Use groupby to plot each group separately, unpacking all the parameters to set the arguments for that group.
Sample Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.array([-0.02,-0.01,0,0.01,0.02,np.nan,1,np.nan,0.02,0.01,0,-0.01,-0.02])
df = pd.DataFrame({'x': x})
Code
df['label'] = df.x.isnull().cumsum().where(df.x.notnull())
plot_params = {0: {'lw': 2, 'color': 'red', 'marker': 'o'},
               1: {'lw': 6, 'color': 'black', 'marker': 's'},
               2: {'lw': 9, 'color': 'blue', 'marker': 'x'}}
fig, ax = plt.subplots(figsize=(3,3))
for label, gp in df.groupby('label'):
    gp.plot(y='x', **plot_params[label], ax=ax, legend=None)
plt.show()
This is what df looks like for reference after defining the group label
print(df)
x label
0 -0.02 0.0
1 -0.01 0.0
2 0.00 0.0
3 0.01 0.0
4 0.02 0.0
5 NaN NaN
6 1.00 1.0
7 NaN NaN
8 0.02 2.0
9 0.01 2.0
10 0.00 2.0
11 -0.01 2.0
12 -0.02 2.0

Fast way to expand a column of lists in pandas data frame to a single column

I have a data frame with text and the sentiment score corresponding to it. I've created a column which stores all the bigrams. Now I want to create a DataFrame which has this bigram column expanded, with the score against each bigram. When I do the second step using a for loop it's painfully slow.
Pandas >= 0.25
You can use explode.
df = df.explode('bigrams')
Dummy Example:
import pandas as pd
df1 = pd.DataFrame({'score': [0.2, 0.3],
                    'bigrams': [['a', 'b', 'c', 'e'], ['f', 'g']]})
print(df1)
=========================
df1:
score bigrams
0 0.2 [a, b, c, e]
1 0.3 [f, g]
===========================
df1 = df1.explode('bigrams')
print(df1)
=============================
df1:
score bigrams
0 0.2 a
0 0.2 b
0 0.2 c
0 0.2 e
1 0.3 f
1 0.3 g
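For pandas < 0.25, where explode is not available, here is a hedged sketch of the same expansion using index repetition and itertools.chain; df2 is a hypothetical un-exploded frame like the original df1:
import itertools
import pandas as pd

df2 = pd.DataFrame({'score': [0.2, 0.3],
                    'bigrams': [['a', 'b', 'c', 'e'], ['f', 'g']]})
lengths = df2['bigrams'].str.len()                     # number of bigrams per row
df2_long = df2.loc[df2.index.repeat(lengths)].copy()   # repeat each row once per bigram
df2_long['bigrams'] = list(itertools.chain.from_iterable(df2['bigrams']))  # flatten the lists
print(df2_long)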

Pandas bar plot with continuous x axis

I am trying to make a bar chart in pandas, with two data series coming from a groupby:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().plot(kind='bar', layout=(2,2))
The x axis is not continuous, and only shows values that are in the dataset. In this example, it jumps from 11 to 13.
How can I make it continuous?
**EDIT 2:**
I tried JohnE's data-centric approach, and it works. It creates a new index with no missing values:
temp = data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose()
temp.reindex(np.arange(temp.index.min(), temp.index.max())).plot(kind='bar', layout=(2,2))
However, I assume there should be a better approach using a histogram instead of a bar plot. The best I could do with histograms is:
data.groupby(['popup','UID']).size().groupby(level=0).plot(kind='hist', bins=30, alpha=0.5, layout=(2,2), legend=True)
But I didn't find any option in the hist plot to get the same rendering as the bar plot, without overlapping bars.
**EDIT:** Here is some information to answer the comments.
Data sample:
     INSEE   C1  popup  C3                  date                                   UID
0  75101.0  0.0      0 NaN  2017-05-17T13:20:16Z  ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
0  75101.0  0.0      0 NaN  2017-05-17T14:23:51Z  ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
1  31557.0  0.0      1 NaN  2017-05-17T14:58:27Z  bafe9715-3a07-4d9b-b85c-0bbf658a9115
First groupby result (sample):
data.groupby(['popup','UID']).size().head(3)
popup UID
0 016d3e7e-1901-4f84-be0e-117988ec57a8 6
01c15455-29cc-4d1e-8743-638fd0f51602 6
03fc9eb0-c5fb-4205-91f0-4b74f78a8b96 3
dtype: int64
Second groupby result (sample):
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().head(3)
popup
0 1 46
3 23
4 22
dtype: int64
After unstack and transpose:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().head(3)
popup 0 1
1 46.0 38.0
2 21.0 35.0
3 23.0 22.0
There is a solution using a histogram plot via matplotlib.axes.Axes.hist. It is better to use histograms than bar plots for this purpose, as we can choose the number of bins.
import matplotlib.pyplot as plt

# Separate groups by 'popup' and count the number of records for each 'UID'
popup_values = data['popup'].unique()
count_by_popup = [data[data['popup'] == popup_value].groupby(['UID']).size() for popup_value in popup_values]
# Create histogram
fig, ax = plt.subplots()
ax.hist(count_by_popup, 20, histtype='bar', label=[str(x) for x in popup_values])
ax.legend()
plt.show()