My goal is to find the indexes of the local maxima and minima of the function in pandas or matplotlib.
Let us say we have a noisy signal with its local maxima and minima already plotted like in the following link:
https://stackoverflow.com/a/50836425/15934571
Here is the code (copied from the link above):
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
n = 5 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal,
order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal,
order=n)[0]]['data']
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
So, I do not have any idea how to continue from this point and find indexes corresponding to the obtained local maxima and minima on the plot. Would appreciate any help!
You can use df['min'].notna(), which returns a boolean Series that is True in the rows where min is not NaN. To find the indexes of the local minima, use it as a mask with the .loc method.
df.loc[df['min'].notna()].index
The output result for your example is:
Int64Index([0, 11, 21, 35, 43, 54, 67, 81, 105, 127, 141, 161, 168, 187], dtype='int64')
You can use a similar procedure to find the local maxima.
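As a minimal sketch of that similar procedure for the maxima, assuming the same df and n as in the question: the rows where df['max'] is not NaN give the index labels of the local maxima. Note that argrelextrema already returns integer positions, so with the default RangeIndex used here the two should agree.

```python
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

# Rebuild the noisy AR(1) sample from the question
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

n = 5  # number of points to be checked before and after

# argrelextrema returns a tuple of position arrays; [0] is the first axis
max_pos = argrelextrema(df['data'].values, np.greater_equal, order=n)[0]
df['max'] = df.iloc[max_pos]['data']

# Index labels of the rows where a local maximum was recorded
max_idx = df.index[df['max'].notna()]
print(max_idx.tolist())
```

If the DataFrame had a non-default index (dates, for example), max_idx would hold those labels while max_pos would still hold integer positions.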
Related
I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]
X = df_features.drop(columns=["target"])
estimator = RandomForestClassifier()
estimator.fit(X, y)
No, sklearn doesn't use the index as one of your features. When you call the fit method, the check_array function is applied to the input. If you dig into check_array, you can see that the input is converted into a NumPy array, which strips the index from your dataframe, as np.array does below:
import pandas as pd
import numpy as np
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df
   Name  Age
0   tom   10
1  nick   15
2  juli   14
np.array(df)
array([['tom', 10],
       ['nick', 15],
       ['juli', 14]], dtype=object)
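The same point can be checked on a purely numeric feature frame. This is a small sketch with made-up column names and index values (they are not from the question): the converted array has one column per feature and nothing for the index, and the index only becomes a feature if it is turned into a column explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with a non-trivial index
X = pd.DataFrame(
    {"feat1": [1.0, 2.0, 3.0], "feat2": [4.0, 5.0, 6.0]},
    index=[100, 200, 300],
)

arr = np.asarray(X)
print(arr.shape)  # (3, 2): two feature columns, the index is not included

# If the index should be used as a feature, it has to be made a column first
X_with_index = X.reset_index()
print(np.asarray(X_with_index).shape)  # (3, 3)
```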
I am a bit confused about what sort of package to use in order to plot my data which typically consists of 10 different categories (e.g. Temperatures) with 3 or 4 parallel measurements each. Here I have tried just using pandas (Trial1+2) and seaborn (Trial3).
In the end, what I would like to have is a scatterplot showing the three measurements from each category, and additionally drawing an average line through all my data (see example A and B below in figure).
I know that I can place my data in a CSV file, which I can import using the pandas package in a Jupyter notebook. That is where my problem starts, and I now think it might be related to indexing or data types. I get a lot of errors saying that x and y must be the same size, or that the index 'Degrees' is not defined... I will show the most successful trials below.
I have tried several things so far using the made-up dataset 'Dummydata', which is very representative of the kind of things I will do with my real data.
My test CSV File:
It's a .csv file with four columns: the first is the temperature, and the next three are the first, second and third measurements at the corresponding temperature (y1, y2, y3).
in[]: Dummydata.to_dict()
Out[]:
{'Degrees': {0: 0,
1: 10,
2: 20,
3: 30,
4: 40,
5: 50,
6: 60,
7: 70,
8: 80,
9: 90},
'y1': {0: 20, 1: 25, 2: 34, 3: 35, 4: 45, 5: 70, 6: 46, 7: 20, 8: 10, 9: 15},
'y2': {0: 20, 1: 24, 2: 32, 3: 36, 4: 41, 5: 77, 6: 48, 7: 23, 8: 19, 9: 16},
'y3': {0: 18, 1: 26, 2: 36, 3: 37, 4: 42, 5: 75, 6: 46, 7: 21, 8: 15, 9: 16}}
Trial 1: trying to achieve a scatterplot
import pandas as pd
import matplotlib.pyplot as plt
Dummydata = pd.read_csv('DummyData.csv','r',delimiter=(';'), header=0)
y = ['y1','y2','y3']
x = ['Degrees']
Dummydata.plot(x,y)
This will give a nice line plot but also produce the UserWarning: Pandas doesn't allow columns to be created via a new attribute name (??).
If I change the plot to Dummydata.plot.scatter(x,y) then I get the error: x and y must be the same size... I know that the shape of my data is (10,4) because it has 10 rows and 4 columns, so how can I redefine this to be okay for pandas?
Trial 2: same thing small adjustments
import pandas as pd
import matplotlib.pyplot as plt
#import the .csv file, set the delimiter to ; and use the first line (0) as the header
Dummydata = pd.read_csv('DummyData.csv','r',delimiter=(';'), header = 0)
x =('Degrees')
y1 =('y1')
y2 =('y2')
y3 =('y3')
Dummydata.plot([x,y3]) #works fine for one value, but prints y1 and y2 ?? why?
Dummydata.plot([x,y1]) # also works, but prints out y2 and y3 ?? why?
Dummydata.plot([x,y]) # get error all arrays must be same length?
Dummydata.plot.scatter([x,y]) # many error, no plot
Somehow I must tell pandas that the data shape (10,4) is okay? I'm not sure what I'm doing wrong here.
Trial 3: using seaborn and try to get a scatterplot
I simply started by making a factorplot, where I again ran into the same problem of not being able to get more than one y value onto my graph. I don't think converting this to a scatter plot would be hard if I just knew how to add more data to one graph.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import the .csv file using pandas
Dummydata = pd.read_csv('DummyData.csv', 'r', delimiter=(';'))
#Checking what the file looks like
#Dummydata.head(2)
x =('Degrees')
y1 =('y1')
y2 =('y2')
y3 =('y3')
y =(['y1','y2','y3'])
Factorplot =sns.factorplot(x='Degrees',y='y1',data=Dummydata)
The factorplot works fine for one dataset. However, when trying to add more y values (either defining y =(['y1','y2','y3']) beforehand or in the plotting call), I get errors like: Could not interpret input 'y'. For instance, for this input:
Factorplot =sns.factorplot(x='Degrees',y='y',data=Dummydata)
or
Factorplot =sns.factorplot(x='Degrees',y=(['y1','y2','y3']),data=Dummydata)
#Error: cannot copy sequence with size 3 to array axis with dimension 10
What I would like to achieve is something like this: in (A) I would like a scatterplot with a rolling mean, and in (B) I would like to plot only the average of each category, also showing the standard deviation, and additionally draw a rolling mean across the categories, as follows:
I don't want to type my data values in manually; I want to import them from a .csv file (because the datasets can become very big).
Is there something wrong with the way I am organising my csv file?
All help appreciated.
Compute rolling statistics with rolling. Compute the mean and standard deviation with mean and std. Plot the data with plot. Add y-error bars with the yerr keyword argument.
data = data.set_index('Degrees').rolling(window=6).mean()
mean = data.mean(axis='columns')
std = data.std(axis='columns')
ax = mean.plot()
data.plot(style='o', ax=ax)
plt.figure()
mean.plot(yerr=std, capsize=3)
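A self-contained sketch of the same idea, under a few assumptions: Dummydata is built inline from the values shown in the question instead of being read from the CSV, the rolling window of 3 is an arbitrary choice, and long_df, ax_a and ax_b are names introduced here for illustration. Reshaping to long form with melt gives one row per measurement, which is what a scatter plot needs for x and y to have the same length.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same values as Dummydata.to_dict() in the question (normally read from the CSV)
Dummydata = pd.DataFrame({
    'Degrees': [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    'y1': [20, 25, 34, 35, 45, 70, 46, 20, 10, 15],
    'y2': [20, 24, 32, 36, 41, 77, 48, 23, 19, 16],
    'y3': [18, 26, 36, 37, 42, 75, 46, 21, 15, 16],
})

# Long form: one row per (Degrees, measurement), so x and y have equal length
long_df = Dummydata.melt(id_vars='Degrees', var_name='measurement',
                         value_name='value')

wide = Dummydata.set_index('Degrees')
mean = wide.mean(axis='columns')  # average of y1..y3 per temperature
std = wide.std(axis='columns')    # spread of y1..y3 per temperature
# Rolling mean of the per-temperature averages (window size is arbitrary)
rolling = mean.rolling(window=3, center=True).mean()

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))
# (A) every measurement as a point, plus the rolling mean
ax_a.scatter(long_df['Degrees'], long_df['value'], s=15)
ax_a.plot(rolling.index, rolling.values, color='k')
# (B) per-temperature mean with standard-deviation error bars
ax_b.errorbar(mean.index, mean.values, yerr=std.values, fmt='o', capsize=3)
ax_b.plot(rolling.index, rolling.values, color='k')
fig.tight_layout()
# Call plt.show() or fig.savefig(...) to display or store the figure
```

With a centered window of 3, the first and last temperatures have no rolling value, so the line starts at 10 and ends at 80 degrees.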
I've got the following data:
I'm interested in fitting a line on the 'middle bit' (intercept 0). How do I do that? It would be useful to get a figure for the gradient as well.
(FYI: this is a list of cash transactions, in and out. The gradient would be the profit or loss.)
Here's some of the data:
https://gist.github.com/chrism2671/1081c13b6760878b457a112d2041622f
You can use numpy.polyfit and numpy.poly1d to achieve that:
import matplotlib.pyplot as plt
import numpy as np
# Create data
ls = np.linspace(0, 100)
s = np.random.rand(len(ls))*100 + ls
# Fit the data
z = np.polyfit(ls, s, deg=1)
p = np.poly1d(z)
# Plotting
plt.figure(figsize=(16,4.5))
plt.plot(ls, s,
alpha=.75, label='signal')
plt.plot(ls, p(ls),
linewidth=1, linestyle='--', color='r', label='polyfit')
plt.legend(ncol=2)
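Note that np.polyfit with deg=1 also fits an intercept. If the line has to pass through the origin, as the question asks, one option is a least-squares fit of y = m * x alone. The following is a sketch on made-up data (the slope 2.5 and the noise level are assumptions for illustration, not values from the linked gist):

```python
import numpy as np

# Made-up data standing in for the 'middle bit' of the series
x = np.linspace(0, 100, 50)
rng = np.random.default_rng(0)
y = 2.5 * x + rng.normal(scale=5.0, size=x.size)

# np.polyfit with deg=1 returns [gradient, intercept]
gradient, intercept = np.polyfit(x, y, deg=1)

# Forcing the line through the origin: least squares on y = m * x only
m, *_ = np.linalg.lstsq(x[:, np.newaxis], y, rcond=None)
gradient_through_origin = m[0]

print(gradient, intercept, gradient_through_origin)
```

To restrict the fit to the middle part of real data, x and y would first be sliced to that range (for example with a boolean mask) before either call.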
Using the data you provided:
My DataFrame's structure
trx.columns
Index(['dest', 'orig', 'timestamp', 'transcode', 'amount'], dtype='object')
I'm trying to plot transcode (transaction code) against amount to see how much money is spent per transaction. I made sure to convert transcode to a categorical type, as seen below.
trx['transcode']
...
Name: transcode, Length: 21893, dtype: category
Categories (3, int64): [1, 17, 99]
The result I get from doing plt.scatter(trx['transcode'], trx['amount']) is
Scatter plot
While the above plot is not entirely wrong, I would like the X axis to contain just the three possible values of transcode [1, 17, 99] instead of the entire [1, 100] range.
Thanks!
In matplotlib 2.1 and later you can plot categorical variables by using strings: if you provide the column for the x values as strings, matplotlib will treat them as categories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
plt.scatter(df["x"].astype(str), df["y"])
plt.margins(x=0.5)
plt.show()
In order to obtain the same in matplotlib <=2.0, one would plot against some index instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
u, inv = np.unique(df["x"], return_inverse=True)
plt.scatter(inv, df["y"])
plt.xticks(range(len(u)),u)
plt.margins(x=0.5)
plt.show()
The same plot can be obtained using seaborn's stripplot:
import seaborn as sns
sns.stripplot(x="x", y="y", data=df)
And a potentially nicer representation can be done via seaborn's swarmplot:
sns.swarmplot(x="x", y="y", data=df)
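The index trick used above for matplotlib <=2.0 relies on np.unique with return_inverse=True. A tiny sketch with made-up transcode values shows what the two returned arrays contain:

```python
import numpy as np

codes = np.array([17, 1, 99, 1, 17])  # made-up transcode values

# u: sorted unique categories; inv: position of each value within u
u, inv = np.unique(codes, return_inverse=True)
print(u)    # [ 1 17 99]
print(inv)  # [1 0 2 0 1]

# inv gives evenly spaced x positions and u gives the tick labels,
# i.e. plt.scatter(inv, y) followed by plt.xticks(range(len(u)), u)
```

Indexing u with inv reproduces the original values, which is why the tick labels line up with the plotted positions.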
I'm doing a k-means clustering of activities on some open source projects on GitHub and am trying to plot the results together with the cluster centroids using Seaborn Scatterplot Matrix.
I can successfully plot the results of the clustering analysis (example tsv output below)
user_id issue_comments issues_created pull_request_review_comments pull_requests category
1 0.14936519790888722 2.0100502512562812 0.0 0.60790273556231 Group 0
1882 0.11202389843166542 0.5025125628140703 0.0 0.0 Group 1
2 2.315160567587752 20.603015075376884 0.13297872340425532 1.21580547112462 Group 2
1789 36.8185212845407 82.91457286432161 75.66489361702128 74.46808510638297 Group 3
The problem I'm having is that I'd also like to be able to plot the centroids of the clusters on the matrix plot. Currently my plotting script looks like this:
import seaborn as sns
import pandas as pd
from pylab import savefig
sns.set()
# Use the first column (user_id) as the index so that it is not plotted.
# pd.DataFrame.from_csv did this by default but was removed in pandas 1.0.
data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)
grid = sns.pairplot(data, hue="category", diag_kind="kde")
savefig('normalised_clusters.png', dpi = 150)
This produces the expected output:
I'd like to be able to mark on each of these plots the centroids of the clusters. I can think of two ways to do this:
1. Create a new 'CENTROID' category and just plot this together with the other points.
2. Manually add extra points to the plots after calling sns.pairplot(data, hue="category", diag_kind="kde").
If (1) is the solution then I'd like to be able to customise the marker (perhaps a star?) to make it more prominent.
If (2) I'm all ears. I'm pretty new to Seaborn and Matplotlib so any assistance would be very welcome :-)
pairplot isn't going to be all that well suited to this sort of thing, but it's possible to make it work with a few tricks. Here's what I would do.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sns.set_color_codes()
# Make some random iid data
cov = np.eye(3)
ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
np.random.multivariate_normal([1, 1, 1], cov, 50)])
ds = pd.DataFrame(ds, columns=["x", "y", "z"])
# Fit the k means model and label the observations
km = KMeans(2).fit(ds)
ds["label"] = km.labels_.astype(str)
Now comes the non-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations while identifying the centroids as appropriate using the label column:
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)
Then you just need to use PairGrid, which is a bit more flexible than pairplot and will allow you to map other plot attributes by the hue variable along with the color (at the expense of not being able to draw histograms on the diagonals):
g = sns.PairGrid(full_ds, hue="label",
hue_order=["0", "1", "0 centroid", "1 centroid"],
palette=["b", "r", "b", "r"],
hue_kws={"s": [20, 20, 500, 500],
"marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
An alternate solution would be to plot the observations as normal then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.
# Plot the data
g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])
# Change the PairGrid dataset and add a new layer
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
g.data = centroids
g.hue_vals = [0, 1]
g.map_offdiag(plt.scatter, s=500, marker="*")
I know I'm a bit late to the party, but here is a generalized version of mwaskom's code that works with n clusters. It might save someone a few minutes.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def cluster_scatter_matrix(data_norm, cluster_number):
    sns.set_color_codes()
    # Remember the feature columns before the label column is added,
    # otherwise the centroid frame gets one column name too many
    feature_cols = list(data_norm.columns)
    km = KMeans(cluster_number).fit(data_norm[feature_cols])
    data_norm["label"] = km.labels_.astype(str)
    centroids = pd.DataFrame(km.cluster_centers_, columns=feature_cols)
    centroids["label"] = [str(n)+" centroid" for n in range(cluster_number)]
    full_ds = pd.concat([data_norm, centroids], ignore_index=True)
    g = sns.PairGrid(full_ds, hue="label",
                     hue_order=[str(n) for n in range(cluster_number)]+[str(n)+" centroid" for n in range(cluster_number)],
                     #palette=["b", "r", "b", "r"],
                     hue_kws={"s": [ 20 for n in range(cluster_number)]+[500 for n in range(cluster_number)],
                              "marker": [ 'o' for n in range(cluster_number)]+['*' for n in range(cluster_number)]}
                     )
    g.map(plt.scatter, linewidth=1, edgecolor="w")
    g.add_legend()
    return g
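If the cluster labels already sit in a column, as with the category column in the question's TSV, and the fitted KMeans object is not at hand, the centroid rows can be derived from the table itself. This is a sketch on made-up numbers, and it assumes that each centroid equals the mean of the points assigned to its cluster (which holds for a converged k-means; otherwise km.cluster_centers_ is the safer source):

```python
import pandas as pd

# Made-up stand-in for the clustered table; 'category' holds the cluster label
data = pd.DataFrame({
    "issues_created": [2.0, 0.5, 1.5, 20.0, 22.0],
    "pull_requests": [0.6, 0.0, 0.3, 1.2, 1.8],
    "category": ["Group 0", "Group 0", "Group 0", "Group 1", "Group 1"],
})

# Assumption: each centroid is the mean of the points assigned to its cluster
centroids = data.groupby("category").mean().reset_index()
centroids["category"] = centroids["category"] + " centroid"

# Observations and centroids in one frame, told apart by the category column
full_ds = pd.concat([data, centroids], ignore_index=True)

# Per-level plot attributes, in the same order as hue_order
groups = sorted(data["category"].unique())
hue_order = groups + [g + " centroid" for g in groups]
hue_kws = {"s": [20] * len(groups) + [500] * len(groups),
           "marker": ["o"] * len(groups) + ["*"] * len(groups)}
print(full_ds)
```

full_ds, hue_order and hue_kws can then be passed to sns.PairGrid with hue="category", in the same way as in the answers above.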