I plotted a graph from a text input file, and now I have to apply Prim's algorithm to it. How can I do that? Below is my code for generating the graph from the text file:
import matplotlib.pyplot as plt
import networkx as nx
f = open('input10.txt')
G = nx.Graph()
x = f.read()
x = x.split()
y = [float(i) for i in x]

# add the nodes along with their plotting positions
for i in range(1, 30, 3):
    G.add_node(y[i], pos=(y[i+1], y[i+2]))

def last_index(y):
    return len(y) - 1

z = last_index(y)

# add the weighted edges
for i in range(31, z-3, 5):
    G.add_edge(y[i], y[i+1], weight=y[i+2])

pos = nx.get_node_attributes(G, 'pos')
weight = nx.get_edge_attributes(G, 'weight')
plt.figure()
nx.draw(G, pos)
Use the nodes u = y[i], v = y[i+1] and the weight y[i+2] to build an adjacency matrix or adjacency list of the graph, and then apply Prim's algorithm (a minimal sketch follows); you can find a good and easy tutorial here: Prim's Minimum Spanning Tree.
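A minimal sketch of that approach, reusing the y and z variables from the parsing code above (this is the standard heap-based variant of Prim's, where adj is an adjacency list of the form {node: [(neighbor, weight), ...]}):
import heapq
from collections import defaultdict

def prim_mst(adj, start):
    # grow the tree from `start`, always taking the cheapest edge
    # that leaves the visited set
    visited = {start}
    heap = [(w, start, v) for v, w in adj[start]]
    heapq.heapify(heap)
    mst = []
    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        mst.append((u, v, w))
        for nxt, nw in adj[v]:
            if nxt not in visited:
                heapq.heappush(heap, (nw, v, nxt))
    return mst

# build the adjacency list from the same (u, v, weight) triples as above
adj = defaultdict(list)
for i in range(31, z-3, 5):
    u, v, w = y[i], y[i+1], y[i+2]
    adj[u].append((v, w))
    adj[v].append((u, w))

mst_edges = prim_mst(adj, next(iter(adj)))
Note that networkx can also do this directly via nx.minimum_spanning_tree(G, algorithm='prim'), since G already carries the weight attribute.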
I have a simple notebook to read in text files, vectorize the text, use K-means clustering to label the documents, and then plot them. For testing purposes I chose a small number of documents from three distinct sources (Edgar Allan Poe fiction, Russian Troll Twitter, and Ukraine news) and deliberately chose K=3 as a kind of surface-validity check. My problem is that in one visible cluster, several plot points are colored the same as a (visibly) far-away cluster.
[Figure: 3 clusters, with edge points colored as distant clusters]
My code:
# import pandas to use dataframes and handle tabular data
import pandas as pd
# read in the data using panda's "read_csv" function
col_list = ["DOC_ID", "TEXT"]
data = pd.read_csv('/User/Documents/NLP/Three_Genre_Samples.csv', usecols=col_list)
# use a regular expression to clean annoying "\n" newline characters
data = data.replace(r'\n',' ', regex=True)
#import sklearn for TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# vectorize text in the df and fit the TEXT data.
# using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
vectorizer = TfidfVectorizer(stop_words='english')  # the string 'english' selects the built-in stop word list
X = vectorizer.fit_transform(data.TEXT)
# deliberate "K" value from the document sources
true_k = 3
# define an unsupervised clustering "model" using KMeans
from sklearn.cluster import KMeans
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
#fit model to data
model.fit(X)
# define cluster labels (integers; a human needs to make them interpretable)
labels=model.labels_
title=[data.DOC_ID]
#make a "clustered" version of the dataframe
data_cl=data
# add label values as a new column, "Cluster"
data_cl['Cluster'] = labels
# output new, clustered dataframe to a csv file
data_cl.to_csv('/Users/Documents/NLP/Three_Genre_Samples_clustered.csv')
# plot document clusters:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
model_indices = model.fit_predict(X)
pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(X.toarray())
colors = ["r", "b", "c"]
x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))
ax.scatter(x_axis, y_axis, c=[colors[d] for d in model_indices])
for i, txt in enumerate(labels):
    ax.annotate(txt, (x_axis[i] + .005, y_axis[i]), size=16)
I'd be grateful for any insight.
I am using Python 3.8 on Windows 10, trying to make a plot with about 700M points in it (sound-wave analysis). Here: Interactive large plot with ~20 million sample points and gigabytes of data
Vaex was highly recommended. I am trying to use examples from the Vaex tutorial, but the graph does not appear. I could not find a good example on the Internet.
import vaex
import numpy as np
df = vaex.example()
df.plot1d(df.x, limits='99.7%');
The Vaex documents don't mention that pyplot.show() should be used to display the figure. plot1d plots a histogram. How can I plot just connected points?
I am pretty sure that the vaex documentation explains that the (now deprecated) method .plot1d(...) is a wrapper around matplotlib plotting routines.
If you would like to create custom plots using the binned data, you can take this approach (I also found it in their docs):
import vaex
import numpy as np
import pylab as plt
# Load example data
df = vaex.example()
# Do the binning yourself
counts = df.count(binby=df.x, shape=64, limits='99.7%')
# Take care of the x-axis
limits = df.limits_percentage(df.x, percentage=99.7)
xvals = np.linspace(limits[0], limits[1], num=64)
# Create your custom plot via matplotlib, plotly or your favorite tool
plt.plot(xvals, counts, marker='o', ms=5)
plt.show()
I am running 5-fold cross-validation with a random forest, as such:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
forest = RandomForestClassifier(n_estimators=100, max_depth=8, max_features=6)
cv_results = cross_validate(forest, X, y, cv=5, scoring=scoring)
However, I want to plot the ROC curves for the 5 outputs on one graph. The documentation only provides an example of plotting the ROC curve with cross-validation when specifically using StratifiedKFold (see the documentation here: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py).
I tried tweaking the code to make it work for cross_validate, but to no avail.
How do I make a ROC curve with the 5 results from the cross_validate output plotted on a single graph?
Thanks in advance
cross_validate is a model-validation tool rather than a splitter class. You need to choose the splitter class that is right for you. You are probably after KFold. Something like this:
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)
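From there, a minimal sketch of drawing one ROC curve per fold onto shared axes (assuming scikit-learn >= 1.0 for RocCurveDisplay, binary labels, and that the X, y from the question are NumPy arrays):
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import KFold

cv = KFold(n_splits=5)
forest = RandomForestClassifier(n_estimators=100, max_depth=8, max_features=6)

fig, ax = plt.subplots()
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    # fit on this fold's training split, then draw its ROC curve
    forest.fit(X[train_idx], y[train_idx])
    RocCurveDisplay.from_estimator(forest, X[test_idx], y[test_idx],
                                   name=f"Fold {fold}", ax=ax)
plt.show()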
I generated the attached image using matplotlib (PNG format). I would like to use EPS or PDF, but I find that with all the data points, the figure is really slow to render on the screen. Other than just plotting less of the data, is there any way to optimize it so that it loads faster?
I think you have three options:
As you mentioned yourself, you can plot fewer points. For the plot you showed in your question I think it would be fine to only plot every other point.
As @tcaswell stated in his comment, you can use a line instead of points, which will be rendered more efficiently (a sketch combining this with point-thinning follows this list).
You could rasterize the blue dots. Matplotlib allows you to selectively rasterize single artists, so if you pass rasterized=True to the plotting command you will get a bitmapped version of the points in the output file. This will be way faster to load at the price of limited zooming due to the resolution of the bitmap. (Note that the axes and all the other elements of the plot will remain as vector graphics and font elements).
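A minimal sketch of the first two options together, on hypothetical stand-in data (plot only every other point, and draw a single line artist instead of hundreds of thousands of markers):
import numpy as np
import matplotlib.pyplot as plt

# hypothetical signal standing in for the real data
x = np.linspace(0, 1, 400_000)
y = np.cumsum(np.random.randn(400_000))

# x[::2]/y[::2] keeps every other point; a line avoids one marker per sample
plt.plot(x[::2], y[::2], linewidth=0.5)
plt.savefig("line.pdf", format='pdf')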
First, if you want to show a "trend" in your plot, and considering that the x, y arrays you are plotting are huge, you could apply random sub-sampling to your x, y arrays, keeping a fraction of your data:
import numpy as np
import matplotlib.pyplot as plt
fraction = 0.50  # expected fraction of points to keep

x_resampled = []
y_resampled = []
for k in range(len(x)):
    # keep each point independently with probability `fraction`
    if np.random.rand() < fraction:
        x_resampled.append(x[k])
        y_resampled.append(y[k])

plt.scatter(x_resampled, y_resampled, s=6)
plt.show()
Second, have you considered using a log scale on the x-axis to increase visibility? For instance, as a one-line change to the figure above:
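plt.xscale('log')  # compress a wide x-range so dense regions stay readable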
In this example, only the plotting area is rasterized; the axes are still in vector format:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.uniform(size=400000)
y = np.random.uniform(size=400000)
plt.scatter(x, y, marker='x', rasterized=True)
plt.savefig("norm.pdf", format='pdf')
Right now, I'm trying to fit a curve to a large set of data; there are two arrays, x and y, each with 352 elements. I've fit a polynomial to the data, which works fine:
import numpy as np
import matplotlib.pyplot as plt
coeff=np.polyfit(x, y, 20)
poly=np.poly1d(coeff)
But I need a more accurately optimized curve, so I've been trying to fit a curve with scipy. Here's the code that I have so far:
import numpy as np
import scipy.optimize as sp

coeff=np.polyfit(x, y, 20)
poly=np.poly1d(coeff)
poly_y=poly(x)
def poly_func(x):
    return poly(x)
param=sp.curve_fit(poly_func, x, y)
But all it returns is this:
ValueError: Unable to determine number of fit parameters.
How can I get this to work? (Or how can I fit a curve to this data?)
Your fit function does not make sense: it takes no parameters to fit.
curve_fit uses a non-linear optimizer, which needs an initial guess of the fitting parameters.
If no guess is given, it tries to determine the number of parameters via introspection, which fails for your function, and sets them all to one (something you almost never want).
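A minimal sketch of a fit function curve_fit can work with, using a hypothetical cubic model (a, b, c, d are the free parameters; p0 is the optional initial guess):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c, d):
    # the free parameters appear explicitly in the signature,
    # so curve_fit can tell how many there are
    return a * x**3 + b * x**2 + c * x + d

params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0, 1.0])
fitted_y = model(x, *params)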