How to remove NaN in subtracting? - numpy

I am trying to perform a subtraction in Python. This is a simple task in Excel, but I want to do it in a Jupyter notebook.
Below is my code:
import pandas as pd
from sklearn import linear_model
import numpy as np
#Read X1 anomaly
X1= pd.read_csv (r'file\X1.csv')
X1 = pd.DataFrame(X1,columns=['Year','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
X1= X1[X1['Year'].between(1984,2020, inclusive="both")]
#X1 = X1["Mar"].describe()
#print (X1)
#Read X2 anomaly
X2= pd.read_csv (r'file\X2.csv')
X2 = pd.DataFrame(X2,columns=['Year','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
X2= X2[X2['Year'].between(1984,2020, inclusive="both")]
#X2 = X2["Mar"].describe()
#print (X2)
X1 = X1["Mar"]
X2 = X2["Mar"]
#### my goal is to transform X2 by removing its line of fit against X1
regr = linear_model.LinearRegression()
regr.fit(X1.values.reshape(-1,1), X2)
Trend=regr.coef_*X1+regr.intercept_
X3=np.subtract(X2,Trend)
print (X3)
And here is the data link. I want to remove the linear relationship between X1 and X2, so I regressed X2 on X1 and then tried to subtract the trend line from X2 to get X3. However, X3 contains a lot of NaN values. Please help me with what I should do.

I found this answer after experimenting with the code, and I want to share it in case someone else runs into a similar issue.
regr = linear_model.LinearRegression()
regr.fit(X1.values.reshape(-1,1), X2)
Trend=regr.coef_*X1+regr.intercept_
X3=X2-np.array(Trend)
print (X3)
Notice the change in the subtraction: Trend is converted to a plain NumPy array before it is subtracted from X2. Thank you.
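One plausible reason the array conversion helps (an assumption, since the data files are not shown) is that pandas aligns two Series on their index labels before subtracting. Labels present in only one operand produce NaN, while a plain NumPy array has no labels and is subtracted position by position. A minimal sketch with made-up values and index labels:

```python
import pandas as pd

# Hypothetical data: same length, but the index labels do not overlap.
x2 = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
trend = pd.Series([0.5, 1.0, 1.5], index=[5, 6, 7])

aligned = x2 - trend                 # label-based: no label matches, so every entry is NaN
positional = x2 - trend.to_numpy()   # plain array: element-by-element subtraction

print(aligned.isna().all())   # True
print(positional.tolist())    # [0.5, 1.0, 1.5]
```

If this is the cause, comparing X2.index with Trend.index should show the mismatch, and calling reset_index(drop=True) on both Series would be another way to get positional behaviour.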

Related

Histogram with Seaborn

I'd like to plot a histogram that compares two arrays of data. Basically, I want to make exactly this:
Suppose I want to make this plot using two arrays with four entries each: one with the numbers that should go to the blue areas, and the other with the ones for the blue areas. I have tried this:
x1 = np.array([0.1,0.2,0.3])
x2 = np.array([0.1,0.2,0.5])
sns.histplot(data=[x1,x2], x=['1','2','3'], multiple="dodge", hue=['a','b'], shrink=.8)
But it gives me the error “ValueError: arrays must all be same length”.
I know that I'm supposed to pass a DataFrame and not arrays, but sadly I'm not really an expert on how to use them.
How can I solve this problem? Simply put, I'm looking for a copy-and-paste solution in which I can then change the numbers and the names of the columns.
It looks like you want a barplot, not a histogram. Creating a seaborn plot from multiple columns usually involves converting them to "long form", making the process less straightforward.
Here is an example:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
x1 = np.array([0.1, 0.2, 0.3])
x2 = np.array([0.1, 0.2, 0.5])
x = ['1', '2', '3'] # or, simpler, x = np.arange(len(x1)) + 1
df = pd.DataFrame({'a': x1, 'b': x2, 'x': x})
df_long = df.melt('x')
ax = sns.barplot(data=df_long, x='x', y='value', dodge=True, hue='variable')
plt.show()
The long form looks like:
   x variable  value
0  1        a    0.1
1  2        a    0.2
2  3        a    0.3
3  1        b    0.1
4  2        b    0.2
5  3        b    0.5
See pandas' melt for additional options, such as naming the created columns.
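As a small illustration of naming the created columns, melt accepts var_name and value_name. The names 'series' and 'height' below are arbitrary choices for this sketch, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([0.1, 0.2, 0.3]),
    "b": np.array([0.1, 0.2, 0.5]),
    "x": ["1", "2", "3"],
})

# var_name / value_name replace the default 'variable' / 'value' column names.
df_long = df.melt(id_vars="x", var_name="series", value_name="height")
print(list(df_long.columns))  # ['x', 'series', 'height']
```

With these names, the barplot call would use y='height' and hue='series' in place of y='value' and hue='variable'.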

adding regression line in python using matplotlib

I have a question about drawing a regression line and determining the slope of that line. I am researching the water heights of inland lakes in Tibet with the help of satellite data. I have the data for one year of one lake in this script.
However, I want to determine the annual rise of the lake for both the reference height and the total beams. Is there someone who could help me?
This is the link towards the excel file: https://drive.google.com/file/d/12wD2ByQC6ObNCWq_yIhkXiNsV3KfDpit/view?usp=sharing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# Graph in chronological order
heights = pd.read_excel ('Qinghai_dates_heights.xlsx')
dates = (heights.loc[:,'Date'])
strong_beams = (heights.loc[:,'Strong total'])
weak_beams = (heights.loc[:,'Weak total'])
total_beams = (heights.loc[:,'Total'])
# setting the reference data from Hydrolabs
reference_dates = (heights.loc[:,'Date.1'])
reference_heights = (heights.loc[:,'Hydrolabs'])
# Set the locator
locator = mdates.MonthLocator() # every month
# Specify the format - %b gives us Jan, Feb...
fmt = mdates.DateFormatter('%b')
#plt.plot(dates,strong_beams, label='Strong Beams', marker="o")
#plt.plot(dates,weak_beams, label='Weak Beams', marker="o")
plt.plot(dates, total_beams, label='Total Beams', marker="o")
plt.plot(reference_dates, reference_heights, label='Reference height (Hydrolabs)', marker="o")
X = plt.gca().xaxis
X.set_major_locator(locator)
# Specify formatter
X.set_major_formatter(fmt)
plt.xlabel('Date [months]')
plt.ylabel('elevation [m]')
plt.title("Water-Height Qinghai from November 2018 - November 2019 ")
plt.legend()
plt.show()
Does this help? I usually use sklearn for this.
import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model, datasets
Generate a set of data
X = np.linspace(0, 10)
line_X = X[:, np.newaxis]
Y = X + 0.2*np.random.normal(size=50)
Choose your regression model (there are plenty more, depending on your needs)
lr = linear_model.LinearRegression()
Here you really do the fit
lr.fit(line_X, Y)
Here you extract the parameters, since you seem to need them ;)
slope = lr.coef_[0]
intercept = lr.intercept_
And then you plot
plt.plot(X, slope*X + intercept, ls='-', marker=' ')
plt.plot(X, Y)
plt.show()
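To get a slope in metres per year from dated measurements, one option is to convert the dates to a numeric time axis before fitting. The sketch below uses np.polyfit rather than sklearn, and the dates and heights are invented stand-ins for the 'Date' and 'Total' columns of the spreadsheet:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in values; replace with the columns read from the Excel file.
dates = pd.to_datetime(["2018-11-01", "2019-03-01", "2019-07-01", "2019-11-01"])
heights = np.array([3195.0, 3195.2, 3195.4, 3195.6])

# Express time as years since the first observation so the slope is in m/year.
years = np.asarray((dates - dates[0]).days, dtype=float) / 365.25

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(years, heights, 1)
print(f"annual rise: {slope:.2f} m/year")
```

The fitted line can then be drawn against the original dates with plt.plot(dates, slope * years + intercept). Rows with missing dates or heights would need to be dropped first.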

Multiple plots in a single window

I need to draw many such rows (for a0 .. a128) in a single window. I've searched FacetGrid, PairGrid and elsewhere but couldn't find a way. Only regplot has a similar ax argument, but it doesn't plot histograms. My data is 128 real-valued features with a label column [0, 1]. I need the graphs to be shown from my Python code as a separate application on Linux.
Also, is there a way to scale this histogram to show relative values on Y such that the right curve is not skewed?
g = sns.FacetGrid(df, col="Result")
g.map(plt.hist, "a0", bins=20)
plt.show()
Just a simple example using matplotlib. The code is not optimized (ugly, but simple plot-indexing):
import numpy as np
import matplotlib.pyplot as plt
N = 5
data = np.random.normal(size=(N*N, 1000))
f, axarr = plt.subplots(N, N) # maybe you want sharex=True, sharey=True
pi = [0,0]
for i in range(data.shape[0]):
    if pi[1] == N:
        pi[0] += 1 # next row
        pi[1] = 0 # first column again
    # density=True normalizes each histogram; recent Matplotlib releases
    # removed the older normed=True argument
    axarr[pi[0], pi[1]].hist(data[i], density=True)
    pi[1] += 1
plt.show()
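The manual row/column bookkeeping can be avoided by iterating over the flattened axes array. This is a sketch on the same random data, not the answerer's code. The Agg backend is selected only so that it runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import matplotlib.pyplot as plt

N = 5
data = np.random.normal(size=(N * N, 1000))

# sharex/sharey give every panel the same axis limits.
fig, axarr = plt.subplots(N, N, sharex=True, sharey=True)

# axarr.flat walks the N x N grid row by row, pairing each axes with one data row.
for ax, row in zip(axarr.flat, data):
    ax.hist(row, bins=20, density=True)
```

With an interactive backend, a final plt.show() displays all panels in one window.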

read three columns from a text file using matplotlib

My input file (text.txt) has three columns. The first one belongs to the x-axis, the second column represents the y-axis and the third column represents the y-axis again. When I run my code, I get "x.append(float(line.split()[0])) IndexError: list index out of range". How can I fix that error?
my code:
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
with open("text.txt", "r") as data_file:
    lines=data_file.readlines()
x=[]
y1=[]
y2=[]
counter=0
for line in lines:
    if((line[0]!='#') and (line[0]!='#')):
        x.append(float(line.split()[0]))
        y1.append(float(line.split()[0]))
        y2.append(float(line.split()[1]))
        counter+=1
plt.plot(x, y1, y2)
plt.savefig("text.png", dpi=300)
my text.txt:input
# Carbon
# Gallium
#
# title
# xaxis
1.00 2.12 14.51
2.00 4.54 18.14
3.00 6.12 45.11
4.00 9.02 89.15
5.00 6.48 49.99
6.00 8.01 92.33
7.00 7.56 95.14
8.00 5.89 96.01
You are getting the error
IndexError: list index out of range
because your data file contains empty lines. They may be at the end of the file.
You could fix it by including
for line in lines:
    if not line.strip(): continue
but instead of the code you posted, I would use NumPy's genfromtxt to parse the file this way:
import numpy as np
import matplotlib.pyplot as plt
with open("text.txt", "r") as f:
    lines = (line for line in f if not any(line.startswith(c) for c in '##'))
    x, y1, y2 = np.genfromtxt(lines, dtype=None, unpack=True)
plt.plot(x, y1, x, y2)
plt.savefig("text2.png", dpi=300)
If you want to fix your original code with minimal changes, it might look something like this:
with open("text.txt", "r") as f:
    x = []
    y1 = []
    y2 = []
    for line in f:
        if not line.strip() or line.startswith('#') or line.startswith('#'):
            continue
        row = line.split()
        x.append(float(row[0]))
        y1.append(float(row[1]))
        y2.append(float(row[2]))
plt.plot(x, y1, x, y2)
plt.savefig("text.png", dpi=300)
Tip: readlines() reads the entire file and returns a list of strings. This can require a lot of memory if the file is large. Therefore, never use lines=data_file.readlines() unless you really need the entire file converted into a list of strings. Otherwise, it requires less memory if you can process each line one-at-a-time using
for line in f:
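When every comment line starts with '#', np.loadtxt may be enough by itself, since it drops comment lines and blank lines. This sketch reads from an in-memory string standing in for text.txt so that it is self-contained:

```python
import io
import numpy as np

# In-memory stand-in for text.txt, including a blank line in the middle.
text = io.StringIO(
    "# Carbon\n"
    "# title\n"
    "1.00 2.12 14.51\n"
    "2.00 4.54 18.14\n"
    "\n"
    "3.00 6.12 45.11\n"
)

# comments="#" skips the header lines; unpack=True returns one array per column.
x, y1, y2 = np.loadtxt(text, comments="#", unpack=True)
print(x.tolist())   # [1.0, 2.0, 3.0]
print(y2.tolist())  # [14.51, 18.14, 45.11]
```

For the real file, the first argument would be the file name "text.txt" in place of the StringIO object.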

inverse of FFT not the same as original function

I don't understand why the ifft(fft(myFunction)) is not the same as my function. It seems to be the same shape but a factor of 2 out (ignoring the constant y-offset). All the documentation I can see says there is some normalisation that fft doesn't do, but that ifft should take care of that. Here's some example code below - you can see where I've bodged the factor of 2 to give me the right answer. Thanks for any help - it's driving me nuts.
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:] = 0
    # make new series
    y2 = fftp.ifft(myfft).real
    # find constant y offset
    myfft[1:]=0
    c = fftp.ifft(myfft)[0]
    # remove c, apply factor of 2 and re apply c
    y2 = (y2-c)*2 + c
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()
if __name__=='__main__':
    x = np.array([float(i) for i in range(0,360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
You're removing half the spectrum when you do myfft[wn:] = 0. The negative frequencies are those in the top half of the array and are required.
You have a second fudge to get your results which is taking the real part to find y2: y2 = fftp.ifft(myfft).real (fftp.ifft(myfft) has a non-negligible imaginary part due to the asymmetry in the spectrum).
Fix it with myfft[wn:-wn] = 0 instead of myfft[wn:] = 0, and remove the fudges. So the fixed code looks something like:
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:-wn] = 0
    # make new series
    y2 = fftp.ifft(myfft)
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()
if __name__=='__main__':
    x = np.array([float(i) for i in range(0,360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
It's really worth paying attention to the interim arrays that you are creating when trying to do signal processing. Invariably, there are clues as to what is going wrong that should direct you to the problem. In this case, taking the real part masked the problem and made your task more difficult.
Just to add another quick point: Sometimes taking the real part of the resultant array is exactly the correct thing to do. It's often the case that you end up with an imaginary part to the signal output which is just down to numerical errors in the input to the inverse FFT. Typically this manifests itself as very small imaginary values, so taking the real part is basically the same array.
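The difference between the two masks can be checked numerically. The sketch below uses numpy.fft instead of scipy.fftpack (a substitution made here to keep the example dependency-light) on the same test signal:

```python
import numpy as np

n = 360
wn = 3
x = np.arange(n, dtype=float)
y = np.sin(2 * np.pi / n * x) + np.sin(2 * 2 * np.pi / n * x) + 5

spectrum = np.fft.fft(y)

one_sided = spectrum.copy()
one_sided[wn:] = 0        # also wipes the negative frequencies at the end of the array

two_sided = spectrum.copy()
two_sided[wn:-wn] = 0     # keeps the first wn bins and the last wn bins

y_one = np.fft.ifft(one_sided)
y_two = np.fft.ifft(two_sided)

print(np.abs(y_one.imag).max() > 0.5)    # True: large imaginary part
print(np.abs(y_two.imag).max() < 1e-9)   # True: only rounding noise
print(np.allclose(y_two.real, y))        # True: the original signal is recovered
```

With the symmetric mask the imaginary part is only rounding noise, so taking .real before plotting is harmless there. With the one-sided mask, .real silently discards half of the signal.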
You are killing the negative frequencies between 0 and -wn.
I think what you mean to do is to set myfft to 0 for all frequencies outside [-wn, wn].
Change the following line:
myfft[wn:] = 0
to:
myfft[wn:-wn] = 0