How can I get an interpolated value from a Pandas data frame?

I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd

data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
s = s.set_index('Angle')  # Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]  # .at works only for exact index labels
print(result)

You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
# 59.6

You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
# 59.6
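If you would rather stay within pandas, here is a minimal sketch (assuming the 'Angle'-indexed frame s from the first snippet): add the target angle to the index, interpolate on the index values, then look the value up.
target = 3.4
s2 = s.reindex(s.index.union([target])).interpolate(method="index")
print(s2.at[target, "rff"])
# 59.6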

Related

Plotting dataframe with NAs with linearly joined points

I have a dataframe where each column has many missing values. How can I make a plot where the datapoints in each column are joined with lines, i.e. NAs are ignored, instead of having a choppy plot?
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"
d = pd.DataFrame(data=np.random.choice([np.nan] + list(range(7)), size=(10, 3)))
d.plot(markers=True)
One way is to use this for each column:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, name="linear",  # x, y: one column's points with NaNs dropped
                         line_shape='linear'))
Are there any better ways to accomplish this?
You can use pandas interpolate. Have demonstrated using plotly express and chained use so underlying data is not changed.
Post comments have amended answer so that markers are not shown for interpreted points.
import numpy as np
import pandas as pd
import plotly.express as px
d = pd.DataFrame(data=np.random.choice([np.nan] + list(range(7)), size=(10, 3)))
px.line(d).update_traces(mode="lines+markers").add_traces(
    px.line(d.interpolate(limit_direction="both")).update_traces(showlegend=False).data
)
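Note that limit_direction="both" also fills leading and trailing NaNs with the nearest valid value, so each interpolated line spans the full x-range instead of starting at the first valid point.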

Why does my Price column show in e-notation (scientific notation)?

I used this method to clean the "£" and "," characters out of the currency column of my data, and converted the non-string values to NaN.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
### Reading the excel file with dtype
df = pd.read_excel("Housing Market B16+5Miles.xlsx", dtype={"Price" : str})
df.loc[df['Price'] == 'POA','Price'] = np.nan
House_Price = df["Price"].str.replace(",","").str.replace("£","").astype("float")
del df['Price']
df["Price"] = House_Price
df
df.describe()
After describing the dataframe, the "Price" column is shown entirely as decimals with an e-value at the end. Why did this happen, and will it affect my analysis moving forward?
Your pandas settings might be set to display large numbers in scientific notation. You can change that using:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
This only changes how the numbers are displayed; the stored floats are unchanged, so it will not affect your analysis.
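A minimal sketch of the effect, using made-up numbers:
import pandas as pd

s = pd.Series([1.5e6, 2.75e7])
print(s.describe())  # mean shown as 1.450000e+07

# Changes the display only; the stored floats are untouched.
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(s.describe())  # mean shown as 14500000.000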

How to plot correlation heatmap when using pyspark+databricks

I am studying pyspark in databricks. I want to generate a correlation heatmap. Let's say this is my data:
myGraph = spark.createDataFrame([(1.3, 2.1, 3.0),
                                 (2.5, 4.6, 3.1),
                                 (6.5, 7.2, 10.0)],
                                ['col1', 'col2', 'col3'])
And this is my code:
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from ggplot import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.mllib.stat import Statistics
myGraph = spark.createDataFrame([(1.3, 2.1, 3.0),
                                 (2.5, 4.6, 3.1),
                                 (6.5, 7.2, 10.0)],
                                ['col1', 'col2', 'col3'])
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=['col1', 'col2', 'col3'],
                            outputCol=vector_col)
myGraph_vector = assembler.transform(myGraph).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col)
matrix.collect()[0]["pearson({})".format(vector_col)].values
Up to here, I can get the correlation matrix values (a flat array).
Now my problems are:
How to convert the matrix to a data frame? I have tried the methods of How to convert DenseMatrix to spark DataFrame in pyspark? and How to get correlation matrix values pyspark, but they do not work for me.
How to generate a correlation heatmap from it?
I have only just started studying pyspark and databricks, so ggplot or matplotlib are both OK for my problem.
I think the point where you get confused is:
matrix.collect()[0]["pearson({})".format(vector_col)].values
Calling .values on a DenseMatrix gives you a flat list of all values, but what you are actually looking for is a list of lists representing the correlation matrix.
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

columns = ['col1', 'col2', 'col3']
myGraph = spark.createDataFrame([(1.3, 2.1, 3.0),
                                 (2.5, 4.6, 3.1),
                                 (6.5, 7.2, 10.0)],
                                columns)
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=columns,
                            outputCol=vector_col)
myGraph_vector = assembler.transform(myGraph).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col)
Until now it was basically your code. Instead of calling .values you should use .toArray().tolist() to get a list of lists representing the correlation matrix:
matrix = Correlation.corr(myGraph_vector, vector_col).collect()[0][0]
corrmatrix = matrix.toArray().tolist()
print(corrmatrix)
Output:
[[1.0, 0.9582184104641529, 0.9780872729407004], [0.9582184104641529, 1.0, 0.8776695567739841], [0.9780872729407004, 0.8776695567739841, 1.0]]
The advantage of this approach is that you can easily turn a list of lists into a dataframe:
df = spark.createDataFrame(corrmatrix, columns)
df.show()
Output:
+------------------+------------------+------------------+
| col1| col2| col3|
+------------------+------------------+------------------+
| 1.0|0.9582184104641529|0.9780872729407004|
|0.9582184104641529| 1.0|0.8776695567739841|
|0.9780872729407004|0.8776695567739841| 1.0|
+------------------+------------------+------------------+
To answer your second question: below is just one of many ways to plot a heatmap; a seaborn version is sketched after the function.
def plot_corr_matrix(correlations, attr, fig_no):
    fig = plt.figure(fig_no)
    ax = fig.add_subplot(111)
    ax.set_title("Correlation Matrix for Specified Attributes")
    ax.set_xticklabels([''] + attr)
    ax.set_yticklabels([''] + attr)
    cax = ax.matshow(correlations, vmax=1, vmin=-1)
    fig.colorbar(cax)
    plt.show()

plot_corr_matrix(corrmatrix, columns, 234)
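For comparison, here is a seaborn sketch (assuming seaborn is available on the cluster), reusing corrmatrix and columns from above:
import seaborn as sns
import matplotlib.pyplot as plt

# Annotated heatmap of the correlation list-of-lists computed above.
sns.heatmap(corrmatrix, annot=True,
            xticklabels=columns, yticklabels=columns,
            vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Correlation Matrix for Specified Attributes")
plt.show()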

Distribution probabilities for each column of a data frame, in one plot

I am creating probability distributions for each column of my data frame with distplot from the seaborn library, sns.distplot(). For one plot I do:
x = df['A']
sns.distplot(x);
I am trying to use FacetGrid and map to get the plots for all columns at once, in this way, but it doesn't work at all:
g = sns.FacetGrid(df, col = 'A','B','C','D','E')
g.map(sns.distplot())
I think you need to use melt to reshape your dataframe to long format; see this MVCE:
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
dfm = df.melt(var_name='columns')
g = sns.FacetGrid(dfm, col='columns')
g = g.map(sns.distplot, 'value')
Output: a facet grid with one distribution plot per column.
Since seaborn 0.11.2 it is not recommended to use FacetGrid directly. Instead, use sns.displot for figure-level plots:
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(2022)
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
dfm = df.melt(var_name='columns')
g = sns.displot(data=dfm, x='value', col='columns', col_wrap=3, common_norm=False, kde=True, stat='density')
You're getting this wrong on two levels.
Python syntax.
FacetGrid(df, col = 'A','B','C','D','E') is invalid: col gets set to 'A', and the remaining strings are interpreted as further positional arguments. Since positional arguments cannot follow keyword arguments, this is invalid Python syntax.
Seaborn concepts.
Seaborn expects a single column name as input for the col or row argument. This means that the dataframe needs to be in a format that has one column which determines to which plot column or row the respective datum belongs.
You do not call the function to be used by map; the idea is, of course, that map itself calls it (so pass sns.distplot, not sns.distplot()).
Solutions:
Loop over columns:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(np.random.randn(14, 5), columns=list("ABCDE"))

fig, axes = plt.subplots(ncols=5)
for ax, col in zip(axes, df.columns):
    sns.distplot(df[col], ax=ax)

plt.show()
Melt dataframe
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randn(14,5), columns=list("ABCDE"))
g = sns.FacetGrid(df.melt(), col="variable")
g.map(sns.distplot, "value")
plt.show()
You can use the following:
# listing dataframes types
list(set(df.dtypes.tolist()))
# include only float and integer
df_num = df.select_dtypes(include = ['float64', 'int64'])
# display what has been selected
df_num.head()
# plot
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);
I think the easiest approach is to just loop over the columns and create a plot for each.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df.columns:
    hist = df[col].hist(bins=10)
    print("Plotting for column {}".format(col))
    plt.show()

Applying a random distribution to each row of a data frame

I have the following dataframe:
import numpy as np
import pandas as pd
import scipy as sc
import scipy.stats as sct
d = {'col1': [1, 2, 5, 0.6], 'col2': [3, 4, 1, 0.8]}
df = pd.DataFrame(data=d)
I want to add two new columns to that dataframe, where the elements of the new columns are Poisson random draws using col1 and col2 as the rates.
I used the following code to generate the new columns (col3 and col4):
df['col3'] = int(sct.poisson.rvs(df.col1, size=1))
df['col4'] = int(sct.poisson.rvs(df.col2, size=1))
This is the closest example to my dataframe, which is quite large: it contains 3,800,000 rows.
I can generate the columns using a for loop, but it takes too long.
How can I generate Poisson-distributed values based on the dataframe without using a loop?
Thanks,
Zep
Try just using:
df['col3'] = sct.poisson.rvs(df.col1)
df['col4'] = sct.poisson.rvs(df.col2)
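sct.poisson.rvs is vectorized: given a Series of rates, it returns one draw per row, so no Python loop is needed. If plain NumPy is acceptable (an assumption on my part), its Poisson generator does the same and is typically faster at 3.8 million rows:
import numpy as np

# One draw per row, with the rate taken from each row's value.
df['col3'] = np.random.poisson(df.col1)
df['col4'] = np.random.poisson(df.col2)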