What is the difference between doing a regression with a dataframe and ndarray? - pandas

I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same result for the intercept and coefficients when I do not convert it?
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# If I use the dataframe (train[['ENGINESIZE']] for x and
# train[['CO2EMISSIONS']] for y) below, I get the same result
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
Thank you very much!

So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a one-column dataframe (I believe train['ENGINESIZE'], with single brackets, would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way the fit is done with arrays derived from the dataframe.
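As a quick check (a minimal sketch on synthetic stand-in data, since the point doesn't depend on the fuel-consumption CSV), both input styles produce identical coefficients:
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for the fuel-consumption data
rng = np.random.default_rng(0)
df = pd.DataFrame({'ENGINESIZE': rng.uniform(1, 8, 50),
                   'CO2EMISSIONS': rng.uniform(100, 400, 50)})

regr_df = linear_model.LinearRegression()
regr_df.fit(df[['ENGINESIZE']], df[['CO2EMISSIONS']])        # dataframe input

regr_arr = linear_model.LinearRegression()
regr_arr.fit(df[['ENGINESIZE']].to_numpy(),
             df[['CO2EMISSIONS']].to_numpy())                # ndarray input

# check_array converts both to the same arrays before fitting
print(np.allclose(regr_df.coef_, regr_arr.coef_))            # True
print(np.allclose(regr_df.intercept_, regr_arr.intercept_))  # True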

Related

How to most efficiently use Pandas UDF in Spark with multiple Series as inputs

I have some PySpark code that aims to run a machine learning model trained in sklearn on a PySpark dataframe. It looks like this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
X = np.random.rand(1000, 100)
y = np.random.randint(2, size=1000)
tree = RandomForestRegressor(n_jobs=4)
tree.fit(X, y)
pdf = pd.DataFrame(X)
df = spark.createDataFrame(pdf)
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double')
def pandas_plus_one(*args):
    # Input/output are both pandas.Series of doubles
    return pd.Series(tree.predict(pd.concat([args[i] for i in range(100)], axis=1)))

df = df.withColumn('result', pandas_plus_one(*[df[i] for i in range(100)]))
My question is: is this the most efficient way to do things with PySpark? In particular, I would like to avoid having to do pd.concat, which involves copying all the Series (which were probably adjacent in memory anyway) into a new pandas DataFrame inside the UDF function. The ideal solution would be for the Pandas UDF to accept a DataFrame as input, but I haven't found a way to make it work.
Note: I am not looking for solutions that involve SparkML, scikit-spark, etc.
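One way to sidestep the pd.concat entirely (a sketch, assuming Spark 3.0+ where mapInPandas is available; the column names and types below mirror the toy example above) is to let Spark hand the function whole pandas DataFrames in batches:
from pyspark.sql.types import StructType, StructField, DoubleType

def predict_batches(pdf_iter):
    # mapInPandas passes an iterator of pandas DataFrames, one per batch,
    # so the 100 feature columns arrive already assembled and no pd.concat is needed
    for pdf in pdf_iter:
        out = pdf.copy()
        out['result'] = tree.predict(pdf)  # 'tree' is the fitted sklearn model
        yield out

# Output schema: the original columns ('0'..'99') plus the prediction
schema = StructType([StructField(str(i), DoubleType()) for i in range(100)]
                    + [StructField('result', DoubleType())])
df_pred = df.mapInPandas(predict_batches, schema=schema)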

matplotlib - seaborn - the numbers on the correlation plots are not readable

The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How is it possible to show only 5 or 6 most important columns and not all of them with very low importance?
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T,
            annot=True, cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the highest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use numpy's np.r_ index helper and write .iloc[np.r_[1:4, -3:0]] (a sketch follows the example below).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
            annot=True, annot_kws={'rotation': 90, 'size': 20},
            cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
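For the combined variant mentioned above (top positive plus most negative correlations in one plot, reusing the same df), the np.r_ slice would look like this:
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False)
              .iloc[np.r_[1:4, -3:0]].T,
            annot=True, annot_kws={'rotation': 90, 'size': 20},
            cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()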
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
                  columns=list(ascii_letters[26:]))
importance_index = 5  # hide the annotations from this column index onwards
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str)  # make a str-copy
labels.iloc[0, importance_index:] = ''  # blank out the less important columns
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31,
            fmt='', annot_kws={'rotation': 90})
plt.show()
The output on some random data: [heatmap with only the most important values annotated]
This works but it has its limits, particularly with setting fmt='' (you can no longer use it to conveniently format decimals; that has to be done manually now). I would also question whether your approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate whether the heatmap labels can be rotated (I've included that above) or left out completely, since they are technically redundant given the color-coding. Alternatively, you could only plot the cells with the "important" values, as sketched below.
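That last alternative could look like this (a sketch reusing data from the snippet above; the 0.15 cutoff is an arbitrary, assumed threshold):
# Keep only the columns whose absolute correlation with 'A' clears a threshold
strong = data.loc[:, data.abs().iloc[0] > 0.15]  # 0.15 is an assumed cutoff
sns.heatmap(strong, annot=True, cmap='Spectral_r', vmax=0.9, vmin=-0.31,
            annot_kws={'rotation': 90})
plt.show()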

FFT of exponentially decaying sinusoidal function

I have a set of simulation data on which I want to perform an FFT. I am using scipy for the FFT and matplotlib for plotting. However, the FFT looks strange, so I may be missing something in my code. I would appreciate any help.
Original data: [plot of the time-varying signal]
FFT: [plot of the resulting spectrum]
Code for the FFT calculation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack as fftpack

data = pd.read_csv('table.txt', header=0, sep="\t")
fig, ax = plt.subplots()
mz_res=data[['mz ()']].to_numpy()
time=data[['# t (s)']].to_numpy()
ax.plot(time[:300],mz_res[:300])
ax.set_title("Time-varying mz component")
ax.set_xlabel('time')
ax.set_ylabel('mz amplitude')
fft_res=fftpack.fft(mz_res[:300])
power=np.abs(fft_res)
frequencies=fftpack.fftfreq(fft_res.size)
fig2, ax_fft=plt.subplots()
ax_fft.plot(frequencies[:150], power[:150])  # taking just half of the frequency range
I am just plotting the first 300 datapoints because the rest is not important.
Am I doing something wrong here? I was expecting single frequency peaks, not what I got. Thanks!
Link for the input file:
Pastebin
EDIT
Turns out the mistake was in the conversion of the dataframe to a numpy array. Selecting with double brackets, data[['mz ()']], keeps a one-column DataFrame, so .to_numpy() returns a 2-D array of shape (n, 1): each element of the resulting array is itself an array containing a single element. scipy.fftpack.fft transforms along the last axis by default, and here that axis has length 1, so every "transform" is just the sample itself, which is why no single frequency peaks appeared. When I change the code to:
mz_res=data['mz ()'].to_numpy()
so that it is a conversion from a pandas Series to a 1-D numpy array, the FFT behaves as expected and I get single frequency peaks.
So I just put this here in case someone else finds it useful. Lesson learned: converting a pandas Series yields a 1-D array, while converting a one-column DataFrame yields a 2-D array of shape (n, 1).
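A quick way to see the shape difference (a minimal sketch with a throwaway frame):
import pandas as pd

df = pd.DataFrame({'mz ()': [0.1, 0.2, 0.3]})
print(df['mz ()'].to_numpy().shape)    # (3,)   Series -> 1-D array
print(df[['mz ()']].to_numpy().shape)  # (3, 1) one-column DataFrame -> 2-D array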
Solution:
Using the conversion from pandas series to numpy array instead of pandas dataframe to numpy array.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack as fftpack
data = pd.read_csv('table.txt', header=0, sep="\t")
fig, ax = plt.subplots()
mz_res=data['mz ()'].to_numpy() #series to array
time=data[['# t (s)']].to_numpy() #dataframe to array
ax.plot(time,mz_res)
ax.set_title("Time-varying mz component")
ax.set_xlabel('time')
ax.set_ylabel('mz amplitude')
fft_res=fftpack.fft(mz_res)
power=np.abs(fft_res)
frequencies=fftpack.fftfreq(fft_res.size)
indices=np.where(frequencies>0)
freq_pos=frequencies[indices]
power_pos=power[indices]
fig2, ax_fft=plt.subplots()
ax_fft.plot(freq_pos, power_pos)  # plot only the positive half of the spectrum
ax_fft.set_title("FFT")
ax_fft.set_xlabel('Frequency (Hz)')
ax_fft.set_ylabel('FFT Amplitude')
ax_fft.set_yscale('linear')
Yields: [the time-dependence plot and an FFT with clear single-frequency peaks]

How to plot data from two columns of an Excel spreadsheet using matplotlib and numpy to make a line graph/plot

Link to the Spreadsheet: https://docs.google.com/spreadsheets/d/1c2hItirdrnvz2emJ4peJHaWrQlzahoHeVqetgHHAXvI/edit?usp=sharing
I am new to Python and am very keen to learn this since I like statistics and computer programming. Any help would be appreciated!
I used matplotlib and numpy, but don't know how to graph this spreadsheet as a line graph.
If the data is in a common csv (comma-separated values) format, it can easily be read into Python. (Here I downloaded the file from the link in the question via File/Download as/comma-separated values.)
Using pandas and matplotlib
You can then read in data in pandas using pandas.read_csv(). This creates a DataFrame. Usually pandas automatically understands that the first row contains the column names. You can then access the columns of the DataFrame via their names.
Plotting can easily be performed with the DataFrame.plot(x, y) method, where x and y can simply be the column names to plot.
import pandas as pd
import matplotlib.pyplot as plt
# reading in the dataframe from the question text
df = pd.read_csv("data/1880-2016 Temperature Data Celc.csv")
# make Date a true Datetime
df["Year"] = pd.to_datetime(df["Year"], format="%Y")
# plot dataframe
ax = df.plot("Year", "Temperature in C")
ax.figure.autofmt_xdate()
plt.show()
In case one wants a scatterplot, use
df.plot(x="Year", y="Temperature in C", marker="o", linestyle="")
Using numpy and matplotlib
The same can be done with numpy. Reading in the data works with numpy.loadtxt, where one has to provide a little more information about the data, e.g. excluding the first row and using comma as the separator. The unpacked columns can be plotted with pyplot.plot(year, temp).
import numpy as np
import matplotlib.pyplot as plt
# reading in the data
year, temp = np.loadtxt("data/1880-2016 Temperature Data Celc.csv",
                        skiprows=1, unpack=True, delimiter=",")
#plotting with pyplot
plt.plot(year, temp, label="Temperature in C")
plt.xlabel("Year")
plt.ylabel("Temperature in C")
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
The result looks roughly the same as in the pandas case (because pandas simply uses matplotlib internally).
In case one wants a scatterplot, there are two options:
plt.plot(year, temp, marker="o", ls="", label="Temperature in C")
or
plt.scatter(year, temp, label="Temperature in C")

Too many values to unpack using NLTK and Pandas in Python

I am trying out different things to make NLTK's Naive Bayes classifier work using the NLTK and pandas modules, but I am getting a "too many values to unpack" error.
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import nltk
### Remove cases with missing name or missing ethnicity information
def read_file():
    data = pd.read_csv("C:\sample.csv", encoding="utf-8")
    frame = DataFrame(data)
    frame.columns = ["Name", "Gender"]
    return frame
#read_file()

def gender_features(word):
    return {'last_letter': word[-1]}
#gender_features()

frame = read_file()
featuresets = [(gender_features(n), gender) for (n, gender) in frame]
train_set, test_set = features[500:], featuresets[:500]
classifier = nltkNaiveBayesClassifier.train(train_set)
I suspect you are trying to do something bigger than name classification when using pandas.DataFrame, because the DataFrame object is normally used when you have limited RAM and want to make use of disk space as you iterate through the data to extract features:
a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table,
or a dict of Series objects. It is generally the most commonly used
pandas object. Like Series, DataFrame accepts many different kinds of
input:
Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame
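For instance, a minimal sketch of two of the input kinds listed above:
import numpy as np
import pandas as pd

# From a dict of lists
df1 = pd.DataFrame({'name': ['Neo', 'Trinity'], 'gender': ['m', 'f']})

# From a 2-D numpy ndarray
df2 = pd.DataFrame(np.array([['Neo', 'm'], ['Trinity', 'f']]),
                   columns=['name', 'gender'])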
I suggest you go through the pandas tutorial to learn about the library first: http://pandas.pydata.org/pandas-docs/dev/tutorials.html
And then learn about the NLTK classification from http://www.nltk.org/book/ch06.html
Firstly, there are several things wrong with how you access the pandas.DataFrame object.
To iterate through the rows of the dataframe, you should do this:
# Read file into pandas dataframe
df = DataFrame(pd.read_csv('sample.csv'))
df.columns = ['name', 'gender']

for index, row in df.iterrows():
    print(row['name'], row['gender'])
Next, to train a classifier, you should do this:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from nltk.corpus import names
from nltk.classify import NaiveBayesClassifier as nbc

# Create a sample.csv file
male_names = [','.join([i, 'm']) for i in names.words('male.txt')]
female_names = [','.join([i, 'f']) for i in names.words('female.txt')]  # 'f', not 'm'
with open('sample.csv', 'w') as fout:
    fout.write('\n'.join(male_names + female_names))

# Feature extractor function.
def gender_features(word):
    return {'last_letter': word[-1]}

# Read file into pandas dataframe
df = DataFrame(pd.read_csv('sample.csv'))
df.columns = ['name', 'gender']

# Extract features.
featuresets = [(gender_features(name), gender) for index, (name, gender) in df.iterrows()]

# Split train and test set
train_set, test_set = featuresets[500:], featuresets[:500]

# Train a classifier
classifier = nbc.train(train_set)

# Test classifier on "Neo"
print(classifier.classify(gender_features('Neo')))
[out]:
m