In [14]: import seaborn as sns
...: import matplotlib.pyplot as plt
...:
...: l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51, 52, 53, 53, 53, 53, 55, 55, 55,
...: 55, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58,
...: 58, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 61,
...: 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62,
...: 62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 65,
...: 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66,
...: 67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 70,
...: 70, 70, 71, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73,
...: 74, 74, 74, 74, 74, 75, 75, 75, 76, 77, 77, 78, 78, 79, 79, 79, 79, 80,
...: 80, 80, 80, 81, 81, 81, 81, 83, 84, 84, 85, 86, 86, 86, 86, 87, 87, 87,
...: 87, 87, 88, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 92,
...: 92, 93, 93, 93, 94, 95, 95, 96, 98, 98, 99, 100, 102, 104, 105, 107, 108,
...: 109, 110, 110, 113, 113, 115, 116, 118, 119, 121]
...:
...: sns.distplot(l, kde=True, rug=False)
...:
/Users/congminmin/.venv/wbkg/lib/python3.7/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[14]: <AxesSubplot:ylabel='Density'>
<Figure size 1008x720 with 1 Axes>
In [15]: plt.show()
In [16]:
It doesn't give any error, as shown above in my IPython session. UMAP behaves the same way. If I put the code into a Python file and run it, it doesn't show any visualization either, again with no error.
My OS:
macOS Big Sur, 11.6
This is my first time trying the seaborn and UMAP libraries, with no success so far.
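For reference, here is a minimal sketch of the migration the deprecation warning suggests, plus an explicit backend selection, which is a common fix when plots silently fail to appear on macOS (the backend name is an assumption about the local setup):
import matplotlib
matplotlib.use("MacOSX")  # assumption: force a GUI backend; try "TkAgg" if this one is unavailable
import matplotlib.pyplot as plt
import seaborn as sns

l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51]  # shortened sample of the data above

# histplot is the axes-level replacement that the FutureWarning points to
sns.histplot(l, kde=True, stat="density")
plt.show()  # when run as a script, this blocks until the window is closed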
I have a dictionary saved in a .npy file that contains two clusters of IDs that I want to plot in a scatter plot (the ID values saved under key '0' form one cluster and the ID values saved under key '1' form the other).
My script:
import numpy as np
import matplotlib.pyplot as plt
data=np.load("dict.npy",allow_pickle=True)
print(data)
array({0: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
191, 192, 193, 194, 195, 196, 197, 198, 199]), 1: array([ 89, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121, 122, 123, 124])}, dtype=object)
An example, as you requested:
#you will need these libraries:
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
Then generate some random 2D data, just for this example:
#the data you want to cluster
X = np.random.multivariate_normal(mean=[1,2], cov=[[.5, .25], [.25,.75]], size=1800)
plt.scatter(*X.T, alpha=.25, color="k")
Finally run the clustering and see the result:
X_cluster = KMeans(n_clusters=2).fit_predict(X)
for c in set(X_cluster):
    plt.scatter(*X[X_cluster==c].T, alpha=.25)
plt.figure(figsize=(7,7))
data = data.item()  # np.load returns a 0-d object array, so unwrap the dict first
for cluster in data:
    plt.scatter(X[data[cluster],0], X[data[cluster],1])
plt.show()
where X is the dataset that you used for the clustering; it has shape (N,2), with N the number of samples.
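Putting the pieces together, a minimal end-to-end sketch (assuming dict.npy stores the dictionary shown above and that X is the (N,2) array the clusters were computed from; the random X here is only a stand-in):
import numpy as np
import matplotlib.pyplot as plt

# stand-in for the real (N,2) dataset used for the clustering
X = np.random.multivariate_normal(mean=[1,2], cov=[[.5, .25], [.25,.75]], size=200)

data = np.load("dict.npy", allow_pickle=True).item()  # unwrap the 0-d object array into a dict

plt.figure(figsize=(7,7))
for cluster, indices in data.items():
    plt.scatter(X[indices, 0], X[indices, 1], label=f"cluster {cluster}")
plt.legend()
plt.show()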
I have a big one dimensional array X.shape = (10000,), and a vector of indices y = [0, 7, 9995].
I would like to get a matrix with rows
[
X[0 : 100],
X[7 : 107],
concat(X[9995:], X[:95]),
]
That is, slices of length 100, starting at each index, with wrap-around.
I can do that with a python loop, but I'm wondering if there's a smarter batched way of doing it in pytorch or numpy, since my arrays can be quite large.
Quite simple, actually.
For each element E in y, create a range from E to E + 100
Concatenate all the ranges horizontally
Modulo the resulting array by the length of X
indexes = np.hstack([np.arange(v, v + 100) for v in y]) % X.shape[0]
Output:
>>> indexes
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,
83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93,
94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105, 106, 9995, 9996, 9997, 9998, 9999, 0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94])
Now just index X with that:
X[indexes]
This is a version of user17242583's answer that doesn't use a Python loop:
N, BS, S = 10000, 1000, 100
X = np.random.randn(N)
h = np.random.randint(N, size=(BS,))
indexes = (h[..., None] + np.arange(S)) % N
result = X[indexes]
In PyTorch I also found another solution using unfold:
import torch
X_t = torch.from_numpy(X)              # work on a tensor view of the numpy array
wrapped = torch.cat((X_t, X_t[:S-1]))  # append the first S-1 elements for wrap-around
strides = wrapped.unfold(dimension=0, size=S, step=1)  # all length-S windows
result = strides[torch.from_numpy(h)]
But I haven't run experiments yet to see which one is more efficient.
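For what it's worth, here is a quick sanity check (reusing the variables from the snippets above) that both vectorized versions agree with a plain Python loop:
# naive reference: one wrapped slice of length S per start index
expected = np.stack([np.take(X, np.arange(v, v + S), mode="wrap") for v in h])

assert np.allclose(X[indexes], expected)      # numpy version
assert np.allclose(result.numpy(), expected)  # unfold version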
How can I know the initialisation points that were used for KMeans after running KMeans from sklearn.cluster?
For each of my clusters, I need to return each feature of the initialisation points used (the original input was in a Pandas dataframe).
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X,y)
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.items():  # iteritems() is Python 2; use items() on Python 3
    temp = [key, value]
    dictlist.append(temp)
print(dictlist)
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
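Note that sklearn does not store the initialisation points after fitting. If you need to know them, one option (a sketch, not the only way) is to choose them yourself and pass them in via init, which KMeans accepts as an array of shape (n_clusters, n_features):
# pick 3 random rows of X as explicit initial centroids
rng = np.random.RandomState(0)
init_points = X[rng.choice(len(X), size=3, replace=False)]

clf = KMeans(n_clusters=3, init=init_points, n_init=1)
clf.fit(X)

print(init_points)           # the initialisation points, known by construction
print(clf.cluster_centers_)  # the final centroids after convergence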
Does it shuffle once per epoch, or something else?
What is the difference of tf.train.shuffle_batch and tf.train.batch?
Could someone explain it? Thanks.
First take a look at the documentation (https://www.tensorflow.org/api_docs/python/tf/train/shuffle_batch and https://www.tensorflow.org/api_docs/python/tf/train/batch). Internally, batch is built around a FIFOQueue, while shuffle_batch is built around a RandomShuffleQueue.
Consider the following toy example: it puts 1 to 100 in a constant, feeds that through both tf.train.shuffle_batch and tf.train.batch, and then prints the results.
import tensorflow as tf
import numpy as np
data = np.arange(1, 100 + 1)
data_input = tf.constant(data)
batch_shuffle = tf.train.shuffle_batch([data_input], enqueue_many=True, batch_size=10, capacity=100, min_after_dequeue=10, allow_smaller_final_batch=True)
batch_no_shuffle = tf.train.batch([data_input], enqueue_many=True, batch_size=10, capacity=100, allow_smaller_final_batch=True)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        print(i, sess.run([batch_shuffle, batch_no_shuffle]))
    coord.request_stop()
    coord.join(threads)
Which yields:
0 [array([23, 48, 15, 46, 78, 89, 18, 37, 88, 4]), array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])]
1 [array([80, 10, 5, 76, 50, 53, 1, 72, 67, 14]), array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20])]
2 [array([11, 85, 56, 21, 86, 12, 9, 7, 24, 1]), array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30])]
3 [array([ 8, 79, 90, 81, 71, 2, 20, 63, 73, 26]), array([31, 32, 33, 34, 35, 36, 37, 38, 39, 40])]
4 [array([84, 82, 33, 6, 39, 6, 25, 19, 19, 34]), array([41, 42, 43, 44, 45, 46, 47, 48, 49, 50])]
5 [array([27, 41, 21, 37, 60, 16, 12, 16, 24, 57]), array([51, 52, 53, 54, 55, 56, 57, 58, 59, 60])]
6 [array([69, 40, 52, 55, 29, 15, 45, 4, 7, 42]), array([61, 62, 63, 64, 65, 66, 67, 68, 69, 70])]
7 [array([61, 30, 53, 95, 22, 33, 10, 34, 41, 13]), array([71, 72, 73, 74, 75, 76, 77, 78, 79, 80])]
8 [array([45, 52, 57, 35, 70, 51, 8, 94, 68, 47]), array([81, 82, 83, 84, 85, 86, 87, 88, 89, 90])]
9 [array([35, 28, 83, 65, 80, 84, 71, 72, 26, 77]), array([ 91, 92, 93, 94, 95, 96, 97, 98, 99, 100])]
tf.train.shuffle_batch() shuffles every epoch.
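As an aside, these queue-based functions were removed in TensorFlow 2.x. A rough modern equivalent of the toy example (a sketch, assuming TF 2.x and eager execution) uses tf.data:
import tensorflow as tf

# the shuffle buffer is refilled as items are drawn, analogous to
# shuffle_batch's capacity/min_after_dequeue behaviour
shuffled = tf.data.Dataset.range(1, 101).shuffle(buffer_size=100).batch(10)
in_order = tf.data.Dataset.range(1, 101).batch(10)

for i, (s, b) in enumerate(zip(shuffled, in_order)):
    print(i, s.numpy(), b.numpy())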
I have the following query, which runs on SQL Server 2008. It works well when the data is small, but when it is huge I get an exception. Is there any way I can optimize the query?
select
distinct olu.ID
from olu_1 (nolock) olu
join mystat (nolock) s
on s.stat_int = olu.stat_int
cross apply
dbo.GetFeeds
(
s.stat_id,
olu.cha_int,
olu.odr_int,
olu.odr_line_id,
olu.ID
) channels
join event_details (nolock) fed
on fed.air_date = olu.intended_air_date
and fed.cha_int = channels.cha_int
and fed.break_code_int = olu.break_code_int
join formats (nolock) fmt
on fed.format_int = fmt.format_int
where
fed.cha_int in (125, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 35, 36, 37, 38, 39, 40, 41, 43, 117, 45, 42, 44, 47, 49, 50, 51, 46, 52, 53, 54, 55, 56, 48, 59, 60, 57, 63, 58, 62, 64, 66, 69, 68, 67, 65, 70, 73, 71, 74, 72, 75, 76, 77, 78, 79, 82, 80, 159, 160, 161, 81, 83, 84, 85, 88, 87, 86, 89, 90, 61, 91, 92, 93, 95, 96, 97, 98, 99, 100, 94, 155, 156, 157, 158, 103, 104, 102, 101, 105, 106, 107, 108, 109, 110, 119, 111, 167, 168, 169, 112, 113, 114, 115, 116, 170, 118, 120, 121, 122, 123, 127, 162, 163, 164, 165, 166, 128, 129, 130, 124, 133, 131, 132, 126, 134, 136, 135, 137, 171, 138, 172, 173, 174) and
fed.air_date between '5/27/2013 12:00:00 AM' and '6/2/2013 12:00:00 AM' and
fmt.cha_int in (125, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 35, 36, 37, 38, 39, 40, 41, 43, 117, 45, 42, 44, 47, 49, 50, 51, 46, 52, 53, 54, 55, 56, 48, 59, 60, 57, 63, 58, 62, 64, 66, 69, 68, 67, 65, 70, 73, 71, 74, 72, 75, 76, 77, 78, 79, 82, 80, 159, 160, 161, 81, 83, 84, 85, 88, 87, 86, 89, 90, 61, 91, 92, 93, 95, 96, 97, 98, 99, 100, 94, 155, 156, 157, 158, 103, 104, 102, 101, 105, 106, 107, 108, 109, 110, 119, 111, 167, 168, 169, 112, 113, 114, 115, 116, 170, 118, 120, 121, 122, 123, 127, 162, 163, 164, 165, 166, 128, 129, 130, 124, 133, 131, 132, 126, 134, 136, 135, 137, 171, 138, 172, 173, 174) and
fmt.air_date between '5/27/2013 12:00:00 AM' and '6/2/2013 12:00:00 AM'
From IN (Transact-SQL)
Including an extremely large number of values (many thousands) in an
IN clause can consume resources and return errors 8623 or 8632. To
work around this problem, store the items in the IN list in a table.
So I would recommend inserting the values into a table variable (or temp table) and then either joining onto that table or selecting the IN list from it.
So, something like:
DECLARE @Table TABLE(
    val INT
)

INSERT INTO @Table VALUES (1),(2),(3),(4),(5)

SELECT *
FROM MyTable
WHERE ID IN (SELECT val FROM @Table)
Make a backup of your production database (with many rows) and play with it locally on your development machine. The optimization may take some time, and it may actually be quite hard if you are new to SQL. Break the query down into several temporary tables and join them together at the end. Try removing the dbo.GetFeeds(...) function from the query to see whether that function is the problem.