Cannot plot a graph using matplotlib, shows an error - matplotlib

Exception has occurred: ImportError
dlopen(/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/PIL/_imaging.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '_xcb_connect'
File "/Users/showrov/Desktop/Machine learning/Preprosessing/import_dataset.py", line 2, in <module>
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sys
print(sys.version)
data=pd.read_csv('Data_customer.csv')
print(data)
plt.plot(data[:2],data[:2])

data[:2] returns the first two rows, not columns. To plot, you need to pass columns.
Refer to a column directly by name, like data['columnName'], or use the iloc method for positional access.
For example, data.iloc[:, 1:2] selects the second column.
For more information about indexing operations, see the pandas indexing documentation.
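As a minimal sketch of the difference between row slicing and column selection: the column names Age and Salary and the values below are made up for illustration, since the contents of Data_customer.csv are not shown. The Agg backend is only an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for Data_customer.csv
data = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "Salary": [40000, 52000, 81000, 90000],
})

# data[:2] slices ROWS: this is still a 2-row DataFrame, not a column
first_two_rows = data[:2]

# Select columns instead, by name or by position
x = data["Age"]        # column by name
y = data.iloc[:, 1]    # second column by position (a Series)

fig, ax = plt.subplots()
(line,) = ax.plot(x, y)
ax.set_xlabel("Age")
ax.set_ylabel("Salary")
```

In an interactive session, calling plt.show() afterwards displays the figure.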

create line chart in pyspark jupyter notebook on emr

I am running PySpark on AWS EMR. I have a Jupyter notebook running in JupyterHub on AWS EMR. I have read data into a Spark dataframe named clusters_df. I'm now trying to create a simple line chart with k as the x axis and score as the y axis. I tried converting the dataframe to a pandas dataframe, since I don't think Spark has built-in data visualization. When I try to display the chart in the Jupyter notebook I get the messages below. I've also tried matplotlib. Both code examples are below, with the messages that get returned. Can anyone suggest how to create a line chart in a Jupyter notebook running PySpark on EMR?
libraries imported:
import pyspark
##### running on emr
## function to create all tables
from pyspark.sql.types import *
from pyspark.context import SparkContext
from pyspark.sql import Window
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import first
import pyspark.sql.functions as func
from pyspark.sql.functions import lit,StringType,coalesce,lag,trim, upper, substring
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.functions import round, explode,row_number,udf, length, min, when, format_number
from pyspark.sql.functions import hour, year, month, dayofmonth, date_add, to_date,datediff,dayofyear, weekofyear, date_format, unix_timestamp
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import MinMaxScaler, PCA
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import StandardScaler
import traceback
import sys
import time
import math
import datetime
import numpy as np
import pandas as pd
Update: To clarify, the two code examples below show two attempts to create a line chart visualization in a Jupyter notebook running Spark on EMR, both of which fail to produce a chart.
The pandas example just returns the text shown. The matplotlib example returns the error shown because it doesn't seem to recognize Spark anymore once the magic command is run in the cell to import matplotlib.
importing dataframe:
clusters_df=sqlContext.read.parquet("path")
code:
clusters_df.toPandas().plot.line(x="k",y="score");
output:
<AxesSubplot:xlabel='k'>
code:
%matplotlib inline
import matplotlib.pyplot as plt
pnds_df=clusters_df.toPandas()
plt.plot(pnds_df['k'],pnds_df['score'])
plt.show()
output:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-5e7649bc56fb> in <module>
3 import matplotlib.pyplot as plt
4
----> 5 pnds_df=clusters_df.toPandas()
6
7 plt.plot(pnds_df['k'],pnds_df['score'])
NameError: name 'clusters_df' is not defined

Finding FFT gives KeyError: 'Aligned' in pandas

I have time series data.
I am trying to compute its FFT, but it gives KeyError: 'Aligned' when trying to get the value.
My data looks like below.
This is the code:
import datetime
import numpy as np
import scipy as sp
import scipy.fftpack
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
temp_fft = sp.fftpack.fft(data3)
It looks like your data is a pandas Series. fft works with NumPy arrays rather than Series.
An easy resolution is to convert your Series into a NumPy array, either via
data3.values
or
np.array(data3)
You can then pass that array into the fft function. So the end result is:
temp_fft = sp.fftpack.fft(data3.values)
This should work for you now.
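A self-contained sketch of the conversion, assuming data3 is a Series with a non-integer (here datetime) index. The signal is synthetic, since the original data is not shown: a sine wave that completes 5 cycles over 100 samples.

```python
import numpy as np
import pandas as pd
import scipy.fftpack

# Hypothetical time series: 100 one-minute samples of a sine wave
# that completes exactly 5 cycles (0.05 cycles per sample)
index = pd.date_range("2020-01-01", periods=100, freq="min")
t = np.arange(100) / 100.0
data3 = pd.Series(np.sin(2 * np.pi * 5 * t), index=index)

# Convert the Series to a plain NumPy array before calling fft
values = data3.to_numpy()  # data3.values works as well
temp_fft = scipy.fftpack.fft(values)

# Frequency axis in cycles per sample (sample spacing d=1.0)
freqs = scipy.fftpack.fftfreq(len(values), d=1.0)

# Dominant frequency: the bin with the largest magnitude
peak = abs(freqs[np.argmax(np.abs(temp_fft))])
```

Here peak comes out at 0.05 cycles per sample, matching the 5 cycles in 100 samples of the synthetic input.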

Need to run cell twice for the changed code to show output

I've run into an issue where I need to run the same cell twice after making a change. I've included a gif and the code.
In the gif I first change the seaborn style to darkgrid and run it, this should show the output as changed to the specified style on the first run, but I need to run it twice in order for the output to change.
Here is the code:
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,14,100)
for i in range(1,5):
    plt.plot(x,np.sin(x+i*0.5)*(7-i))
sns.set_style("white", {'axes.axisbelow': False})
plt.show()
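I have tried moving the import lines to a previous cell, but the problem persists.
Set your style before you plot anything. The style only applies to axes created after it is set, so in your code the first run still plots with the old style and only the second run picks up the new one. Move the sns.set_style line before the for loop, and it should work.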
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
sns.set_style("darkgrid", {'axes.axisbelow': False})
x = np.linspace(0,14,100)
for i in range(1,5):
    plt.plot(x,np.sin(x+i*0.5)*(7-i))
plt.show()

Why am I getting a "name 'pd' is not defined" error?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
dataset = pd.read_csv('homeprice.csv')
print(dataset)
Output
NameError Traceback (most recent call last)
----> 1 dataset = pd.read_csv('homeprice.csv')
2 print(dataset)
NameError: name 'pd' is not defined
You mention that you are using a Jupyter notebook, so you may have two code cells:
First is with the imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
And the second is with the functionality:
dataset = pd.read_csv('homeprice.csv')
print(dataset)
In Jupyter notebooks you can run each cell separately. If that is what you do, remember to run the first cell before you execute the second one for the first time. This makes sure the imports are available in the current context for your second cell.
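The cell-order behaviour described above can be imitated in plain Python. This is only an illustrative sketch, not how Jupyter is implemented: the dictionary stands in for the kernel's shared namespace, run_cell is a hypothetical helper, and the standard-library math module stands in for pandas so the snippet has no dependencies.

```python
# Simulate a notebook kernel: every cell executes in one shared namespace.
kernel_namespace = {}

imports_cell = "import math as m"   # stands in for `import pandas as pd`
work_cell = "result = m.sqrt(16)"   # stands in for `pd.read_csv(...)`


def run_cell(source):
    """Run one cell's source in the shared namespace (hypothetical helper)."""
    exec(source, kernel_namespace)


# Running the second cell first fails: the alias was never imported.
try:
    run_cell(work_cell)
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Running the imports cell first makes the alias available to later cells.
run_cell(imports_cell)
run_cell(work_cell)
result = kernel_namespace["result"]
```

The same applies after a kernel restart: the namespace is emptied, so the imports cell has to be run again.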
!pip install tensorflow-gpu==2.2.0.0rc2
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Or just use the latest TensorFlow version available in Colab.

How to resample a time series Pandas dataframe?

I am trying to resample 1 minute based data to day. I have tried the following code on IPython
import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Panel
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("DATALOG_22_01_2014.csv",\
names = ['DATE','TIME','HUM1','TMP1','HUM2','TMP2','HUM3','TMP3','WS','WD'])
data.set_index(['DATE','TIME'])
data.resample('D',how=mean)
But I got the following error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-75-aa63b6b16877> in <module>()
----> 1 data.resample('D', how=mean)
NameError: name 'mean' is not defined
Could you help me?
Thank you
Hugo
Try
data.resample('D', how='mean')
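instead. Right now you're asking Python to pass the mean object to the resample method as the how argument, but you don't have one defined. Note that the how argument was later deprecated and removed from pandas; in current versions, call the aggregation on the resampler instead: data.resample('D').mean()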
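A minimal runnable sketch of daily resampling with the current pandas API. The readings are synthetic, since the CSV is not shown; only the column name TMP1 is borrowed from the question. Note that resample needs a datetime-like index, which the sketch builds directly with pd.date_range.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute readings spanning two full days
index = pd.date_range("2014-01-22", periods=2 * 24 * 60, freq="min")
data = pd.DataFrame({"TMP1": np.arange(len(index), dtype=float)}, index=index)

# Downsample to one row per day, averaging the minute-level values
daily = data.resample("D").mean()
```

With real data, the DATE and TIME columns would first have to be combined and parsed into a DatetimeIndex, and the result of set_index assigned back to a variable, since it does not modify the dataframe in place by default.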