I am learning seaborn from
http://seaborn.pydata.org/tutorial/aesthetics.html
In the import section, please explain this line:
np.random.seed(sum(map(ord, "aesthetics")))
What does this line do? Please explain each element in it.
Also, in the offset sine wave plot, how is this line defined?
plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)
The important thing first: the line np.random.seed(sum(map(ord, "aesthetics"))) is completely irrelevant for seaborn to work. So in principle you don't have to worry about it at all.
ord gives the integer Unicode code point of a character
map applies a function to every item of an iterable
sum sums up the elements of an iterable.
So map(ord, "aesthetics") yields the numbers [97, 101, 115, 116, 104, 101, 116, 105, 99, 115] (as a list in Python 2, as a lazy iterator in Python 3), which sum to 1069.
This number is then fed to np.random.seed. It is a seed for numpy's random number generator. By specifying the seed, you make sure that any random numbers drawn afterwards are based on this seed.
The point of this is to make random numbers reproducible. Having specified the seed allows me to know that when generating a random number like np.random.randint(10) the result will be 4 (for seed 1069).
This is extremely useful to make examples reproducible, and it's the reason they use it in the seaborn tutorial to make sure that the plots generated from random numbers are actually the same everywhere.
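The steps above can be checked one at a time. This is a minimal sketch, assuming only that NumPy is installed; the names code_points, first and second are illustrative and do not come from the tutorial.

```python
import numpy as np

# Step by step: characters -> code points -> their sum.
# list() is needed on Python 3, where map returns a lazy iterator.
code_points = list(map(ord, "aesthetics"))
seed = sum(code_points)
print(code_points)  # [97, 101, 115, 116, 104, 101, 116, 105, 99, 115]
print(seed)         # 1069

# Seeding twice with the same value replays the same "random" sequence.
np.random.seed(seed)
first = np.random.rand(3)
np.random.seed(seed)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True
```

Any fixed integer would serve equally well as a seed; deriving it from a string is just one way to pick a different number for each tutorial page.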
One could of course argue that this command confuses people more than seeing different plots when reproducing the tutorial would, but that's a different question I guess.
Let's say that I have a dataset with multiple input features and one single output. For the sake of simplicity, let's say the output is binary. Either zero or one.
I want to split this dataset into k parts and use a k-fold cross-validation model to learn the mapping from the input features to the output one. If the dataset is imbalanced, the ratio between the number of records with output 0 and 1 is not going to be one. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.
I think it's important that within each part of k-folds we should see the same ratio of 0s and 1s in order for successful training (the same 9 to 1 ratio). I know how to do this in Pandas but my question is how to do it in TFX.
Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:
output = tfx.proto.Output(
split_config=tfx.proto.SplitConfig(splits=[
tfx.proto.SplitConfig.Split(name='fold_1', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_2', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_3', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_4', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_5', hash_buckets=1)
]))
example_gen = CsvExampleGen(input_base=input_dir, output_config=output)
But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?
BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless there's a feature with the ID of the fold for each example which I think is not practical since I might want to change the number of folds as part of the experiment without changing the dataset.
I'm going to answer my own question but only as a workaround. I'll be happy to see someone develop a real solution to this question.
What I could come up with at this point was to split the dataset into a number of tfrecord files. I've chosen a "composite" number of files so I can group them into (almost) any number of folds I want. For this, I've settled on 60 since it can be divided by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want KFold with k higher than 12). Then at the time of loading them, I have to somehow select which files will go into each split. There are two things to consider here.
First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:
input = tfx.proto.Input(splits=[
tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
input_config=input)
Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:
fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...
The directory names follow this pattern: between each pair of underscores I put a divisor, a dot, and then the remainder, that is, the fold this file falls into when the dataset is split into that many folds. I include one such divisor.remainder pair for every fold count I might want to use later, at the time of loading the dataset.
In the example above, I'll have the option to load them into 3, 4, 5, 6, or 10 folds. The first file will be loaded as part of the 0th split no matter how many folds I choose, while the second file will be in the 1st split of 5-fold and the 6th split of 10-fold.
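The naming scheme can be sketched in a few lines. This assumes each file is identified by an integer shard id from 0 to 59 and that the remainders are that id modulo each divisor, which is consistent with the four example names above (ids 0, 36, 12 and 48). The helper shard_dir_name and the constant DIVISORS are hypothetical names used only for illustration, and the standard-library fnmatch stands in for the glob matching done at load time.

```python
from fnmatch import fnmatch

DIVISORS = (3, 4, 5, 6, 10)  # fold counts that should be selectable at load time

def shard_dir_name(shard_id, divisors=DIVISORS):
    """Encode shard_id % d for every divisor d in the directory name."""
    return "fold_" + "_".join(f"{d}.{shard_id % d}" for d in divisors)

names = [shard_dir_name(s) for s in range(60)]
print(names[36])  # fold_3.0_4.0_5.1_6.0_10.6

# For every supported fold count, each shard matches exactly one fold pattern.
for k in DIVISORS:
    for name in names:
        hits = [i for i in range(k) if fnmatch(name, f"fold_*{k}.{i}*")]
        assert len(hits) == 1
```

Because 60 is a multiple of every divisor, each fold receives the same number of shards (60 / k) for any of the supported fold counts.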
And this is how I'll load them:
NUM_FOLDS = 5
input = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name=f'fold_{index + 1}',
                          pattern=f"fold_*{NUM_FOLDS}.{index}*/*")
    for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input)
I could change the NUM_FOLDS to any of the options 3, 4, 5, 6, or 10 and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I have made sure of the ratio of the samples within each file at the time of creating them. So any combination of them will also have the same ratio.
Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is that you have to split the dataset manually yourself. I did so, in this case, using pandas, which meant that I had to load the whole dataset into memory. That might not be possible for every dataset.
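For illustration, here is one way the manual stratified split could be done. It is a sketch under assumptions, not the code actually used for this answer: it works on a plain NumPy label array rather than a pandas DataFrame, the function name assign_stratified_shards is hypothetical, and the toy labels are synthetic.

```python
import numpy as np

def assign_stratified_shards(labels, num_shards=60, seed=0):
    """Return a shard id per record so every shard keeps the overall label ratio."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    shard_ids = np.empty(len(labels), dtype=np.int64)
    for value in np.unique(labels):
        # Shuffle the records of this class, then deal them out round-robin.
        idx = rng.permutation(np.flatnonzero(labels == value))
        shard_ids[idx] = np.arange(len(idx)) % num_shards
    return shard_ids

# Toy data: 90% zeros, 10% ones.
labels = np.array([0] * 5400 + [1] * 600)
shard_ids = assign_stratified_shards(labels)

# Folds are unions of shards, so any divisor of 60 keeps the ratio.
for k in (3, 4, 5, 6, 10):
    fold_ids = shard_ids % k
    ratios = [labels[fold_ids == f].mean() for f in range(k)]
    print(k, np.round(ratios, 3))
```

In this toy case every shard holds 90 zeros and 10 ones, so every fold comes out at exactly 10% ones. With class sizes that are not multiples of 60 the ratios would only be approximately equal. The sketch needs just the label column in memory, not the full records.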
TL;DR: How can I get a subrange of a violinplot whilst keeping accurate quartile lines?
I am using seaborn violinplots to make static charts for a report, but as far as I can tell, there's no way to redraw a particular area between limits whilst retaining the 25/median/75 quartile lines of the original dataset.
Here's my example dataset as a violin. The 25/median/75 values are left side: 1.0/5.0/9.0; right side: 2.0/5.0/9.0
My data has such a long tail that all the useful info is scrunched up into a tiny area. I want to ignore (but not throw away) the tail and show a closer look at the interesting bit.
I tried to reset the ylim using ax.set(ylim=(0, upp)), but the resultant graph is not great: it's jaggy and the inner lines don't meet the violin edge.
Is there a way to reset the y-axis limits but get a better quality result?
Next I tried to cut off the tail by dropping values from the dataset. I dropped anything over the 97th centile. The violin looks way better, but the quartile lines have been recalculated for this new dataset. They're showing a median of about 4, not 5 as per the original dataset.
I'm using inner="quartile", so the code that gets called in Seaborn is _ViolinPlotter::draw_quartiles
def draw_quartiles(self, ax, data, support, density, center, split=False):
    """Draw the quartiles as lines at width of density."""
    q25, q50, q75 = np.percentile(data, [25, 50, 75])
    self.draw_to_density(ax, center, q25, support, density, split,
                         linewidth=self.linewidth,
                         dashes=[self.linewidth * 1.5] * 2)
As you can see, it assumes (understandably) that one wants to draw the quartile lines at percentiles 25, 50 and 75. It'd be amazeballs if there was a way I could call draw_to_density with my own values (is there?).
At the moment, I am attempting to manually adjust the position of the lines. It's trivial to figure out & set the y-values:
for l in ax.lines:
    l.set_ydata(<get correct quartile value from original dataset>)
but I'm finding it hard to figure out the limits for x, i.e. the density of the distribution at the quartiles. It seems to involve gaussian kde, and tbh it's getting hacky and inelegant at this point. Is there an easy way to calculate how long each line should be?
What do you suggest?
Thanks for your help
Lnr
With thanks to @JohanC:
I added gridsize=1000 to the params of the violinplot and used ax.set(ylim=(0, upp)) to resize the y-axis to show the range from 0 to upp, where upp is the upper limit. The result is a much prettier graph:
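The difference between the two approaches can be seen on synthetic data. This is a sketch under assumptions: the lognormal sample stands in for the real dataset, upp is taken as the 97th percentile as in the question, and plot_cropped_violin is a hypothetical helper that needs seaborn and matplotlib only when it is called.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=1.5, sigma=1.0, size=10_000)  # synthetic long-tailed sample

upp = np.percentile(data, 97)                # upper axis limit; the tail is hidden, not dropped
q_full = np.percentile(data, [25, 50, 75])   # quartiles the plot should show
q_trim = np.percentile(data[data <= upp], [25, 50, 75])  # quartiles after dropping the tail

print("full data:", np.round(q_full, 2))
print("trimmed  :", np.round(q_trim, 2))     # all three values move downwards

def plot_cropped_violin(values, upper):
    """Draw the violin from all values, then zoom the y-axis instead of dropping data."""
    import matplotlib.pyplot as plt  # imported lazily so the numbers above need only NumPy
    import seaborn as sns
    ax = sns.violinplot(y=values, inner="quartile", gridsize=1000)
    ax.set(ylim=(0, upper))
    return ax
```

The larger gridsize presumably helps because the density curve is evaluated on a fixed number of points spread over the whole data range, so only a few of them fall inside the zoomed window unless the grid is made finer.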
I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. The column represents counts for different words in the vocabulary set. I want to get the cosine similarity of the last document with each of the other documents.
But since SVD produces an S of size (min(6, 25931),), if I use S to reduce my X, I get a 6 * 6 matrix. In this case, I feel that I will be losing too much information, since I am reducing a vector of size (25931,) to (6,).
And when you think about it, the number of documents will usually be smaller than the number of vocabulary words. In that case, using SVD to reduce dimensionality will always produce vectors of size (number of documents,).
According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.
Am I implementing LSA correctly?
If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?
P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD, which expects the input matrix to be of shape (n_samples, n_features), which is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix, which initially confused me. But now it makes sense after I remembered the math behind SVD.
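The shape argument can be checked numerically. The sketch below is an assumption-laden stand-in: it uses a random count matrix with 2000 columns instead of the real (6, 25931) matrix, and cosine_to_last is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(6, 2000)).astype(float)  # stand-in term-document counts

U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(S.shape)  # (6,)

def cosine_to_last(M):
    """Cosine similarity of the last row of M with every other row."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit[:-1] @ unit[-1]

full = cosine_to_last(X)          # rows in the original 2000-dimensional space
reduced = cosine_to_last(U * S)   # 6-dimensional document vectors (U times diag(S))
print(np.allclose(full, reduced))  # True
```

Six documents span at most a six-dimensional subspace, so keeping all six components preserves every dot product between documents and the cosine similarities are unchanged. Information is only discarded when fewer than six components are kept.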
If the objective of the exercise is to find the cosine similarity, then the following approach can help. The author is only attempting to solve for the objective and not to comment on the definition of Latent Semantic Analysis or the definition of Singular Value Decomposition mentioned by the questioner.
Let us first import all the required libraries. Please install them if they are not already available on the machine.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Let us generate some sample data for this exercise.
df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])
The first step is to get the exhaustive list of all the possible features. So collate all of the content at one place.
all_content = [' '.join(df['sentence'])]
Let us build a vectorizer and fit it now. Please note that the arguments in the vectorizer are not explained by the author as the focus is on solving the problem.
vectorizer = TfidfVectorizer(encoding = 'latin-1',norm = 'l2', min_df = 0.03, ngram_range = (1,2), max_features = 5000)
vectorizer.fit(all_content)
We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words to the vectorizer above and check that they are indeed suppressed.
print(vectorizer.vocabulary_)
Let us vectorize the sentences for us to deploy cosine similarity.
# transform expects an iterable of documents, so wrap each sentence in a list
s1Tokens = vectorizer.transform([df['sentence'].iloc[1]])
s2Tokens = vectorizer.transform([df['sentence'].iloc[2]])
Finally, the cosine similarity can be computed as follows.
cosine_similarity(s1Tokens , s2Tokens)
I'm playing with Python and SciPy to understand windowing. I made a plot to see how windowing behaves under the FFT, but the result is not what I was expecting.
The plot is:
The middle plots are the plain FFT plots; this is where I get weird things.
Then I changed the trigonometric function to provoke leakage by setting the first 300 items of the array to a constant 1. The result:
The code:
import numpy as np
import matplotlib.pyplot as plt
from numpy import pi, sqrt

sign_freq=80
sample_freq=3000
num=np.linspace(0,1,num=sample_freq)
i=0
#wave data:
sin=np.sin(2*pi*num*sign_freq)+np.sin(2*pi*num*sign_freq*2)
while i<1000:
    sin[i]=1
    i=i+1
#wave fft:
fft_sin=np.fft.fft(sin)
fft_freq_axis=np.fft.fftfreq(len(num),d=1/sample_freq)
#wave Linear Spectrum (Rms)
lin_spec=sqrt(2)*np.abs(np.fft.rfft(sin))/len(num)
lin_spec_freq_axis=np.fft.rfftfreq(len(num),d=1/sample_freq)
#window data:
hann=np.hanning(len(num))
#window fft:
fft_hann=np.fft.fft(hann)
#window fft Linear Spectrum:
wlin_spec=sqrt(2)*np.abs(np.fft.rfft(hann))/len(num)
#window + sin
wsin=hann*sin
#window + sin fft:
wsin_spec=sqrt(2)*np.abs(np.fft.rfft(wsin))/len(num)
wsin_spec_freq_axis=np.fft.rfftfreq(len(num),d=1/sample_freq)
fig=plt.figure()
ax1 = fig.add_subplot(431)
ax2 = fig.add_subplot(432)
ax3 = fig.add_subplot(433)
ax4 = fig.add_subplot(434)
ax5 = fig.add_subplot(435)
ax6 = fig.add_subplot(436)
ax7 = fig.add_subplot(413)
ax8 = fig.add_subplot(414)
ax1.plot(num,sin,'r')
ax2.plot(fft_freq_axis,abs(fft_sin),'r')
ax3.plot(lin_spec_freq_axis,lin_spec,'r')
ax4.plot(num,hann,'b')
ax5.plot(fft_freq_axis,fft_hann)
ax6.plot(lin_spec_freq_axis,wlin_spec)
ax7.plot(num,wsin,'c')
ax8.plot(wsin_spec_freq_axis,wsin_spec)
plt.show()
EDIT: As asked in the comments, I plotted the functions on a dB scale, obtaining much clearer plots. Thanks a lot, @SleuthEye!
It appears the plot which is problematic is the one generated by:
ax5.plot(fft_freq_axis,fft_hann)
resulting in the graph:
instead of the expected graph from Wikipedia.
There are a number of issues with the way the plot is constructed. The first is that this command essentially attempts to plot a complex-valued array (fft_hann). You may in fact be getting the warning ComplexWarning: Casting complex values to real discards the imaginary part as a result. To generate a graph which looks like the one from Wikipedia, you would have to take the magnitude (instead of the real part) with:
ax5.plot(fft_freq_axis,abs(fft_hann))
Then we notice that there is still a line striking through our plot. Looking at np.fft.fft's documentation:
The values in the result follow so-called “standard” order: If A = fft(a, n), then A[0] contains the zero-frequency term (the sum of the signal), which is always purely real for real inputs. Then A[1:n/2] contains the positive-frequency terms, and A[n/2+1:] contains the negative-frequency terms, in order of decreasingly negative frequency.
[...]
The routine np.fft.fftfreq(n) returns an array giving the frequencies of corresponding elements in the output.
Indeed, if we print the fft_freq_axis we can see that the result is:
[ 0. 1. 2. ..., -3. -2. -1.]
To get around this problem we simply need to swap the lower and upper parts of the arrays with np.fft.fftshift:
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(abs(fft_hann)))
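The reordering is easy to see on a tiny example. This is a minimal sketch using an 8-point frequency axis chosen purely for illustration.

```python
import numpy as np

freqs = np.fft.fftfreq(8, d=1 / 8)
shifted = np.fft.fftshift(freqs)
print(freqs)    # [ 0.  1.  2.  3. -4. -3. -2. -1.]
print(shifted)  # [-4. -3. -2. -1.  0.  1.  2.  3.]
```

With the axis in increasing order, the plotted line no longer has to jump back from the highest positive frequency to the most negative one, which is what drew the stray line through the plot.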
Then you should note that the graph on Wikipedia is actually shown with amplitudes in decibels. You would then need to do the same with:
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(20*np.log10(abs(fft_hann))))
We should then be getting closer, but the result is not quite the same as can be seen from the following figure:
This is due to the fact that the plot on Wikipedia has a higher frequency resolution and captures the value of the frequency spectrum as it oscillates, whereas your plot samples the spectrum at fewer points, and a lot of those points have near-zero amplitudes. To resolve this problem, we need to get the frequency spectrum of the window at more frequency points.
This can be done by zero padding the input to the FFT, or more simply setting the parameter n (desired length of the output) to a value much larger than the input size:
N = 8*len(num)
fft_freq_axis=np.fft.fftfreq(N,d=1/sample_freq)
fft_hann=np.fft.fft(hann, N)
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(20*np.log10(abs(fft_hann))))
ax5.set_xlim([-40, 40])
ax5.set_ylim([-50, 80])
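The effect of the padding on the frequency grid can also be checked without plotting. This sketch assumes the same sample_freq and Hann window as in the question; spectrum_db is a hypothetical helper, and the small floor inside the logarithm is an addition to avoid taking log10 of zero.

```python
import numpy as np

sample_freq = 3000
hann = np.hanning(sample_freq)

def spectrum_db(window, n):
    """Return shifted frequencies and magnitude in dB of an n-point FFT of window."""
    freqs = np.fft.fftshift(np.fft.fftfreq(n, d=1 / sample_freq))
    mag = np.abs(np.fft.fftshift(np.fft.fft(window, n)))
    return freqs, 20 * np.log10(np.maximum(mag, 1e-12))  # floor avoids log10(0)

f_coarse, db_coarse = spectrum_db(hann, len(hann))
f_fine, db_fine = spectrum_db(hann, 8 * len(hann))

print(f_coarse[1] - f_coarse[0])  # about 1 Hz between points
print(f_fine[1] - f_fine[0])      # about 0.125 Hz between points
```

Zero padding does not add information to the window; it only evaluates the same underlying spectrum on a denser grid, which is what makes the side lobes visible.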
This is a very special plotting request, but I have data I want to view in a very particular way. Here's the situation:
1) The data is binned into 25 bins, and each bin contains a different number of data points. Roughly speaking, the larger the bin value, the smaller the number of data points within it (this is just a result of the data processing which was done).
[9568, 10079, 10137, 10090, 10154, 10091, 10046, 10116, 9959, 9401, 7703, 5216, 3089, 1632, 854, 466, 221, 106, 63, 27, 12, 5, 1, 0]
2) I have access to the bin values.
[ 0.02648645 0.09996368 0.1734409 0.24691813 0.32039536 0.39387258
0.46734981 0.54082703 0.61430426 0.68778148 0.76125871 0.83473593
0.90821316 0.98169038 1.05516761 1.12864483 1.20212206 1.27559928
1.34907651 1.42255373 1.49603096 1.56950818 1.64298541 1.71646264]
I can easily produce an 'errorbar' type plot in matplotlib (the y-axis is scaled from radians to degrees below):
But, this is not particularly insightful for what I'd like to study. I'd really like to know if there are 'islands' of angle values within each bin, and to do this, I would need something like a scatterplot or an imshow/hexbin type plot, where the density of points can be represented by color (in the case of imshow/hexbin at least). The following is an example of what happens when represented by a regular scatterplot with the smallest marker size:
Would anybody know of a good way to generate this type of visualization?
EDIT: This may help clarify a couple of things. The following plot is a sample of what a histogram would look like for the first couple of bins. Data contained within bins seem to follow some sort of distribution (I mentioned 'islands' before, because I am not ruling out the possibility of multiple peaks in the distribution). I would like this distribution to be visualized for all bins simultaneously. In other words, is there a way to do a vertical temperature map for each bin and have them all shown on the same plot?
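One possible way to sketch the "vertical temperature map per bin" idea is to histogram the angle values separately for each bin and stack the results as columns of a 2D array. Everything here is an assumption for illustration: the synthetic data, the function names per_bin_density and show_density, and the choice to normalise each column so that sparsely populated bins remain visible.

```python
import numpy as np

def per_bin_density(bin_index, angles, num_bins, angle_edges):
    """Histogram angles separately for each bin; each non-empty column sums to 1."""
    density = np.zeros((len(angle_edges) - 1, num_bins))
    for b in range(num_bins):
        counts, _ = np.histogram(angles[bin_index == b], bins=angle_edges)
        if counts.sum() > 0:  # leave empty bins as all zeros
            density[:, b] = counts / counts.sum()
    return density

# Synthetic stand-in data: 24 bins with fewer points in the higher bins.
rng = np.random.default_rng(0)
num_bins = 24
bin_index = rng.geometric(0.2, size=50_000) - 1
bin_index = bin_index[bin_index < num_bins]
angles = rng.normal(loc=45 + bin_index, scale=10)

angle_edges = np.linspace(0, 120, 61)
density = per_bin_density(bin_index, angles, num_bins, angle_edges)
print(density.shape)  # (60, 24)

def show_density(density, angle_edges):
    """Draw one colour-coded column per bin (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    x_edges = np.arange(density.shape[1] + 1)
    mesh = ax.pcolormesh(x_edges, angle_edges, density)
    fig.colorbar(mesh, ax=ax, label="fraction of points in bin")
    return ax
```

Normalising per column means colours show the shape of each bin's distribution rather than its raw count, so any separate "islands" in a sparsely populated bin would still stand out.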
The violin plot mentioned in the comments was a nice solution to my problem. Here's where I found a python implementation of it - it would certainly be nice if this were included into matplotlib eventually. Overplotted is a box plot centered on the median value, and includes the 2nd and 3rd quartiles.