Spaces in file and matplotlib pyplot only plotting first 10 values - matplotlib

I am using Jupyter Notebook to plot data from a file that has many spaces between the two floats (x and y) on each line. For some reason, the plot shows only the first 10 values of that data set.
The code works just fine for another data file which does not have many spaces.
See below, the code and plot output for the file that works just fine (Ca_sr.World.avg_stddev.dat) and the code and plot output for the file that does not work well (Rep01_Cys239SG_ligandF1_dist.dat)
For the second (space-filled file) I have resorted to naming the spaces s1, s2, sn... and think that this may be the problem. Still not sure why this would only plot the first 10 lines of the data, though. To note, I have also tried to put an xlim value and it didn't work either.
Would greatly appreciate any help or suggestions!
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import glob
import gc
prefix_file='~/Desktop/'
file_name='Ca_sr.World.avg_stddev.dat'
# plot a file (standalone from dataset)
filename=prefix_file + file_name
readfile= pd.read_csv(
    filename,
    delimiter=' ',
    names=['x', 'y', 's'],
    dtype={'x': np.float64, 'y': np.float64, 's': np.float64}
)
plt.plot(readfile['x'], readfile['y'] ,label='Hake RyR efflux',color='red')
plt.show()
Successful plot
prefix_file='~/Desktop/'
file_name='Rep01_Cys239SG_ligandF1_dist.dat'
# plot a file (standalone from dataset)
filename=prefix_file + file_name
readfile= pd.read_csv(
    filename,
    delimiter=' ',
    names=['s1','s2','s3','s4','s5','s6','s7','x','s8','s9','s10','s11','s12', 'y'],
    dtype={'x': np.float64, 'y': np.float64}
)
plt.plot(readfile['x'], readfile['y'] ,label='Hake RyR efflux',color='blue')
plt.show()
Unsuccessful plot
Example of first 20 lines of Ca_sr.World.avg_stddev.dat:
0 11795 0
1e-05 11941.56 73.400861030372
2e-05 12063 60.216276869298
3e-05 12089.76 80.550247671872
4e-05 12117.32 84.542401196086
5e-05 12138.96 81.17018171718
6e-05 12182.64 63.665299810807
7e-05 12148.24 73.5615551766
8e-05 12168.32 74.333690881053
9e-05 12161.12 85.144733248745
0.0001 12165.88 64.499500773262
0.00011 12190.88 76.538784939402
0.00012 12180.44 68.184502638063
0.00013 12190.72 62.885623158239
0.00014 12174.16 85.336829095063
0.00015 12175.36 58.05024030958
0.00016 12187.4 70.30163582734
0.00017 12209.36 69.206866711331
0.00018 12186.84 70.809423101731
0.00019 12212.04 82.26857480229
Example of first 20 lines of Rep01_Cys239SG_ligandF1_dist.dat:
#Frame Dis_00003
1 12.0948
2 11.8884
3 11.8573
4 11.8988
5 12.0257
6 10.8092
7 10.6126
8 10.6221
9 10.6896
10 10.5544
11 10.0383
12 10.5199
13 10.0731
14 10.6336
15 10.6044
16 9.9472
17 10.2276
18 9.9793
19 10.4104
UPDATE
This definitely has something to do with the way I am hardcoding the lines into the names value. In short, I need a way to ignore all the NaN (space) values and process only the floats.
See the following image of the data description
Data description :
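One likely explanation, offered as a sketch rather than a confirmed diagnosis: with delimiter=' ' every single space counts as a separator, so each run of spaces produces several empty (NaN) columns, and the column a number lands in depends on how many digits it has. Once the frame number grows from one digit to two, the values shift out of the hard-coded 'x' and 'y' columns, leaving NaN there, which matplotlib skips. Treating any run of whitespace as a single separator avoids the placeholder columns altogether. The inline sample below stands in for Rep01_Cys239SG_ligandF1_dist.dat, and its exact padding is an assumption.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for Rep01_Cys239SG_ligandF1_dist.dat (the padding widths are assumed)
sample = io.StringIO(
    "#Frame      Dis_00003\n"
    "       1      12.0948\n"
    "       2      11.8884\n"
    "       9      10.6896\n"
    "      10      10.5544\n"
    "      11      10.0383\n"
)

# sep=r"\s+" treats any run of whitespace as one separator;
# comment="#" skips the "#Frame ..." header line
readfile = pd.read_csv(sample, sep=r"\s+", comment="#", names=["x", "y"])

fig, ax = plt.subplots()
ax.plot(readfile["x"], readfile["y"], color="blue")
```

For the real file, pass the file path to pd.read_csv instead of `sample`; in a script, call plt.show() afterwards.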

Related

Iterating and ploting five columns per iteration pandas

I am trying to plot five columns per iteration, but my current code plots everything five times. How can I make it plot five columns per iteration without repeating them?
n=4
for tag_1,tag_2,tag_3,tag_4,tag_5 in zip(df.columns[n:], df.columns[n+1:], df.columns[n+2:], df.columns[n+3:], df.columns[n+4:]):
    fig,ax=plt.subplots(ncols=5, tight_layout=True, sharey=True, figsize=(20,3))
    sns.scatterplot(df, x=tag_1, y='variable', ax=ax[0])
    sns.scatterplot(df, x=tag_2, y='variable', ax=ax[1])
    sns.scatterplot(df, x=tag_3, y='variable', ax=ax[2])
    sns.scatterplot(df, x=tag_4, y='variable', ax=ax[3])
    sns.scatterplot(df, x=tag_5, y='variable', ax=ax[4])
    plt.show()
You are using list slicing in the wrong way. df.columns[n:] returns all the column names from index n to the last one, and the same holds for n+1, n+2, n+3 and n+4. Zipping these offset slices yields one tuple of five consecutive columns per starting position, so consecutive tuples overlap in four columns; this causes the repetition that you are referring to. The number of figures is also determined by the behavior of zip: when the iterables have different lengths, zip stops at the shortest one (in this case df.columns[n+4:]).
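A small plain-Python sketch of the slicing behavior described above, using hypothetical column names in place of df.columns:

```python
columns = list("abcdefghij")  # hypothetical column names standing in for df.columns
n = 4

# zip over offset slices: one tuple per starting position, so the groups overlap
windows = list(zip(columns[n:], columns[n+1:], columns[n+2:], columns[n+3:], columns[n+4:]))
print(windows)
# [('e', 'f', 'g', 'h', 'i'), ('f', 'g', 'h', 'i', 'j')]

# stepping through the indexes five at a time gives non-overlapping groups
chunks = [columns[i:i + 5] for i in range(n, len(columns), 5)]
print(chunks)
# [['e', 'f', 'g', 'h', 'i'], ['j']]
```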
You can achieve what you want by adapting your code as follows:
# Imports to create sample data
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create some sample data and a sample dataframe
data = { string.ascii_lowercase[i]: [random.randint(0, 100) for _ in range(100)] for i in range(15) }
df = pd.DataFrame(data)
# Iterate in groups of five indexes
for start in range(0, len(df.columns), 5):
    # Get the next five columns. Pay attention to the case in which the number of columns is not a multiple of 5
    cols = [df.columns[idx] for idx in range(start, min(start+5, len(df.columns)))]
    # Adapt your plot and take into account that the last group can be smaller than 5
    fig,ax=plt.subplots(ncols=len(cols), tight_layout=True, sharey=True, figsize=(20,3))
    for idx in range(len(cols)):
        #sns.scatterplot(df, x=cols[idx], y='variable', ax=ax[idx])
        sns.scatterplot(df, x=cols[idx], y=df[cols[idx]], ax=ax[idx]) # In the example the values of the column are plotted
    plt.show()
In this case, the code performs the following steps:
Iterate over groups of at most five indexes ([0->4], [5->9]...)
Recover the columns positioned at those indexes. The last group may contain fewer than 5 columns (e.g., with 18 columns, the last group holds the indexes 15, 16 and 17)
Create the plot, taking into account the corner case of fewer than 5 columns
With Seaborn's object interface, available from v0.12, we might do it like this:
from numpy import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import seaborn.objects as so
sns.set_theme()
First, let's create a sample dataset, just like trolloldem's answer.
random.seed(0) # To produce the same random values across multiple runs
columns = list("abcdefghij")
sample_size = 20
df_orig = pd.DataFrame(
    {c: random.randint(100, size=sample_size) for c in columns},
    index=pd.Series(range(sample_size), name="variable")
)
Then transform the data frame into a long-form for easier processing.
df = (df_orig
    .melt(value_vars=columns, var_name="tag", ignore_index=False)
    .reset_index()
)
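To make the reshaping step concrete, here is a minimal sketch on a tiny, made-up frame (two columns, three rows) showing what the long form looks like; the names `wide` and `long_df` are only for illustration:

```python
import pandas as pd

# Tiny made-up wide frame: one column per tag, index named "variable"
wide = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]},
    index=pd.Index(range(3), name="variable"),
)

# Long form: one row per (variable, tag) pair
long_df = (
    wide
    .melt(value_vars=["a", "b"], var_name="tag", ignore_index=False)
    .reset_index()
)
print(long_df)
```

Each original column becomes a block of rows labeled by "tag", which is what lets a single faceted plot split the data per column.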
Then finally render the figures, 5 figures per row.
(
    so.Plot(df, x="value", y="variable") # Or you might do x="variable", y="value" instead
    .facet(col="tag", wrap=5)
    .add(so.Dot())
)

In matplotlib, is there a method to fix or arrange the order of x-values of a mixed type with a character and digits?

There are several Q&As about x-values in matplotlib showing that when the x values are int or float, matplotlib plots them in the right order. For example, when the values are strings, the plot shows the x values in the order
1 15 17 2 21 7 etc
but when it became int, it becomes
1 2 7 15 17 21 etc
in human order.
If the x values are mixed with character and digits such as
NN8 NN10 NN15 NN20 NN22 etc
the plot will show in the order of
NN10 NN15 NN20 NN22 NN8 etc
Is there a way to fix the order of the x values to the human order, or to the existing order in the x list, without removing 'NN' from the x-values?
In more detail, the x values are directory names. Piping the output of a shell function through sort in the Linux terminal displays the results as follows, and they can be saved in a text file.
joonho#login:~/NDataNpowN$ get_TEFrmse NN 2 | sort -n -t N -k 3
NN7 0.3311
NN8 0.3221
NN9 0.2457
NN10 0.2462
NN12 0.2607
NN14 0.2635
Without sort, the linux shell also displays in the machine order such as
NN10 0.2462
NN12 0.2607
NN14 0.2635
NN7 0.3311
NN8 0.3221
NN9 0.2457
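As a minimal plain-Python sketch of the "human" ordering asked for here: sort the labels by their numeric part while keeping the labels themselves unchanged. It assumes every label contains a single run of digits, as in NN7 or NN10; the helper name numeric_part is made up for this example.

```python
import re

labels = ["NN10", "NN12", "NN14", "NN7", "NN8", "NN9"]

def numeric_part(label):
    """Return the digits in a label as an int, e.g. 'NN10' -> 10."""
    return int(re.sub(r"\D", "", label))

# Sort by the number, but keep the original 'NN…' strings
ordered = sorted(labels, key=numeric_part)
print(ordered)
# ['NN7', 'NN8', 'NN9', 'NN10', 'NN12', 'NN14']
```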
As I said, pandas would make this task easier than dealing with base Python lists and such:
import matplotlib.pyplot as plt
import pandas as pd
#import the text file, assuming that your data are separated by whitespace, as in your example above
df = pd.read_csv("test.txt", sep=r"\s+", names=["X", "Y"])
#extracting the number in a separate column, assuming you do not have terms like NN1B3X5
df["N"] = df.X.str.replace(r"\D", "", regex=True).astype(int)
#this step is only necessary, if your file is not pre-sorted by Linux
df = df.sort_values(by="N")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
#categorical plotting
df.plot(x="X", y="Y", ax=ax1)
ax1.set_title("Evenly spaced")
#numerical plotting
df.plot(x="N", y="Y", ax=ax2)
ax2.set_xticks(df.N)
ax2.set_xticklabels(df.X)
ax2.set_title("Numerical spacing")
plt.show()
Since you asked if there is a non-pandas solution: of course. Pandas just makes some things more convenient. In this case, I would fall back on numpy. Numpy is a matplotlib dependency, so, unlike pandas, it is already installed whenever you use matplotlib:
import matplotlib.pyplot as plt
import numpy as np
import re
#read file as strings
arr = np.genfromtxt("test.txt", dtype="U15")
#strip the non-digit characters to get the numeric part of each label
Xnums = np.asarray([re.sub(r"\D", "", i) for i in arr[:, 0]], dtype=int)
#sort array
arr = arr[np.argsort(Xnums)]
#extract x-values as strings...
Xstr = arr[:, 0]
#...and y-values as float
Yvals = arr[:, 1].astype(float)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
#categorical plotting
ax1.plot(Xstr, Yvals)
ax1.set_title("Evenly spaced")
#numerical plotting
ax2.plot(np.sort(Xnums), Yvals)
ax2.set_xticks(np.sort(Xnums))
ax2.set_xticklabels(Xstr)
ax2.set_title("Numerical spacing")
plt.show()

Matplotlib line plot of Numpy array change color at specified percentage

Say you have a numpy array containing about 2000 elements of decimals ranging between 0 to 5. Using matplotlib, how would you line plot the first 75% of these decimals in this numpy array in blue and the remaining 25% elements in red?
The following code first creates some toy data with 2000 values between 0 and 5. The first 3/4th is plotted in blue, the rest in red.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(0, 5, 2000)
threefourth = len(data) * 3 // 4
plt.plot(range(threefourth+1), data[:threefourth + 1], color='dodgerblue')
plt.plot(range(threefourth, len(data)), data[threefourth:], color='crimson')
plt.show()
Depending on your data, you might get away with first plotting a blue line with all the data, and then plotting a red line in which the lower values are masked by replacing them with NaNs using np.where. See the following example. (Line segments that cross the red/blue border are problematic and will be plotted blue in this case.)
For finer control, have a look at the Multicolored lines example from the matplotlib documentation.
import numpy as np
import matplotlib.pyplot as plt
xs = np.arange(0,2000)
ys = ( 2.4
       + 2.0 * np.sin(3 * 2 * np.pi * xs / 2000)
       + 0.3 * np.random.random(size=2000)
     )
red_ys = np.where( ys < 0.75 * 5 , np.nan , ys )
plt.plot(ys, 'b')
plt.plot(red_ys, 'r')
plt.show()
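For the per-segment control mentioned above, a minimal sketch with matplotlib's LineCollection might look as follows. It colors each segment by its position in the array (first 75% blue, the rest red), as the question asks; the toy data and variable names are assumptions, not part of the original answers.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

rng = np.random.default_rng(0)
ys = rng.uniform(0, 5, 2000)      # toy data between 0 and 5
xs = np.arange(len(ys))

# Build one short segment per pair of neighboring points: shape (N-1, 2, 2)
points = np.column_stack([xs, ys]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)

# One color per segment: the first 75% blue, the remainder red
split = len(ys) * 3 // 4
colors = ["dodgerblue"] * split + ["crimson"] * (len(segments) - split)

fig, ax = plt.subplots()
ax.add_collection(LineCollection(segments, colors=colors))
ax.autoscale()  # collections do not rescale the axes on their own
```

Because every segment carries its own color, there is no gap or ambiguity at the border between the two parts.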

Replace xticks with names

I am working on the Spotify dataset from Kaggle. I plotted a barplot showing the top artists with most songs in the dataframe.
But the X-axis is showing numbers and I want to show names of the Artists.
names = list(df1['artist'][0:19])
plt.figure(figsize=(8,4))
plt.xlabel("Artists")
sns.barplot(x=np.arange(1,20),
y=df1['song_title'][0:19]);
I tried both list and Series object type but both are giving error.
How to replace the numbers in xticks with names?
Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data
Data from Spotify - All Time Top 2000s Mega Dataset
df = pd.read_csv('Spotify-2000.csv')
titles = pd.DataFrame(df.groupby(['Artist'])['Title'].count()).reset_index().sort_values(['Title'], ascending=False).reset_index(drop=True)
titles.rename(columns={'Title': 'Title Count'}, inplace=True)
# titles.head()
Artist Title Count
Queen 37
The Beatles 36
Coldplay 27
U2 26
The Rolling Stones 24
Plot
plt.figure(figsize=(8, 4))
chart = sns.barplot(x=titles.Artist[0:19], y=titles['Title Count'][0:19])
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.show()
OK, so I didn't know this, although in hindsight it seems obvious!
Pass the names (or string labels) as the argument for the x-axis.
Use plt.xticks(rotation=90) so the labels don't overlap.
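A minimal matplotlib-only sketch of the same idea, using the five artist counts listed above as data: string labels go straight onto the x-axis, and the tick labels are rotated so they do not overlap.

```python
import matplotlib.pyplot as plt

# Counts taken from the table above
artists = ["Queen", "The Beatles", "Coldplay", "U2", "The Rolling Stones"]
counts = [37, 36, 27, 26, 24]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(artists, counts)                      # string labels become the x ticks
ax.set_xlabel("Artists")
ax.tick_params(axis="x", labelrotation=90)   # rotate labels to avoid overlap
fig.tight_layout()
```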

Line graphs in matplotlib

I am plotting the below data frame using google charts.
Hour S1 S2 S3
1 174 0 811
2 166 0 221
3 213 1 1061
But with google charts, I am not able to save it to a file. Just wondering whether I can plot the dataframe in matplotlib line charts.
Any help would be appreciated.
pandas has charting method, just do:
df.plot()
where df is your pandas dataframe
matplotlib 1.5 and above supports a data kwarg
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot('S1', 'S2', data=df)
or just directly pass in the columns as input
ax.plot(df['S1'])
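Putting the pieces together, here is a sketch that rebuilds the data frame from the question, draws one line per series against Hour, and saves the figure to a file. The output file name and the use of the system temp directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "Hour": [1, 2, 3],
    "S1": [174, 166, 213],
    "S2": [0, 0, 1],
    "S3": [811, 221, 1061],
})

fig, ax = plt.subplots()
df.plot(x="Hour", y=["S1", "S2", "S3"], marker="o", ax=ax)  # one line per column

# Hypothetical output location; any writable path works
out_path = Path(tempfile.gettempdir()) / "hourly_lines.png"
fig.savefig(out_path)
```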