Error while reading a text file into a PySpark DataFrame

I am running the basic PySpark program below on PySpark 1.6.0, but I get an error. According to the PySpark documentation (https://spark.apache.org/docs/1.6.0/sql-programming-guide.html), the syntax seems to be correct, so I am not sure why it says that the 'SQLContext' object has no attribute 'textFile'.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
if __name__ == '__main__':
    conf = SparkConf().setAppName('TestingDF')
    sc = SparkContext(conf=conf)
    sqlc = SQLContext(sc)
    lines = sqlc.textFile('/user/cloudera/practice4/question3/customers').map(lambda x: x.split(','))
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SQLContext' object has no attribute 'textFile'
'/user/cloudera/practice4/question3/customers' is a SQL table that I imported from MySQL into HDFS with a sqoop command.
The Python version is 2.6.6 (I am testing all of this on the Cloudera QuickStart VM 5.13).
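A possible explanation, offered as a sketch rather than a confirmed answer: textFile is a method of SparkContext, not of SQLContext, so the file would be read with sc.textFile(...) and the resulting RDD then handed to sqlc.createDataFrame(...). The helper names parse_line and load_customers are made up for illustration, and the column names must be supplied by the caller because the question does not list them.

```python
def parse_line(line, sep=','):
    """Split one delimited text line into a list of string fields."""
    return line.split(sep)


def load_customers(sc, sqlc, path, columns):
    """Read a delimited text file via the SparkContext and build a DataFrame.

    sc is a SparkContext and sqlc is a SQLContext created from it, as in the
    question. columns is a list with one column name per field in the file
    (an assumption: the real names depend on the imported table).
    """
    rows = sc.textFile(path).map(parse_line)    # textFile lives on SparkContext
    return sqlc.createDataFrame(rows, columns)  # SQLContext builds the DataFrame


# Example call, assuming sc and sqlc exist as in the question and that
# my_columns (hypothetical) holds one name per comma-separated field:
# df = load_customers(sc, sqlc, '/user/cloudera/practice4/question3/customers', my_columns)
```

All fields arrive as strings with this approach, so numeric columns would still need casting afterwards.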

Related

cannot import name 'int' from 'numpy'

I was just getting started with PyCharm and python for statistics.
And I got this error:
ImportError: cannot import name 'int' from 'numpy' (/home/tetiana/.local/lib/python3.8/site-packages/numpy/__init__.py)
Full traceback looks like this:
Traceback (most recent call last):
File "/home/tetiana/forVScode/python/first/first_try.py", line 1, in <module>
from scipy import stats
File "/usr/lib/python3/dist-packages/scipy/stats/__init__.py", line 379, in <module>
from .stats import *
File "/usr/lib/python3/dist-packages/scipy/stats/stats.py", line 180, in <module>
import scipy.special as special
File "/usr/lib/python3/dist-packages/scipy/special/__init__.py", line 643, in <module>
from .basic import *
File "/usr/lib/python3/dist-packages/scipy/special/basic.py", line 19, in <module>
from . import orthogonal
File "/usr/lib/python3/dist-packages/scipy/special/orthogonal.py", line 81, in <module>
from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,
ImportError: cannot import name 'int' from 'numpy' (/home/tetiana/.local/lib/python3.8/site-packages/numpy/__init__.py)
Process finished with exit code 1
How can I fix it?
Here is my code:
from scipy import stats
import pandas as pd
state = pd.read_csv('state_murder_rate_test_table.csv')
state['Population'].mean()
stats.trim_mean(state['Population'], 0.1)
state['Population'].median()
I checked whether the Python versions in the OS and in the project match, and they do. I have Python 3.8.10 and my OS is Ubuntu 20.04.
According to the current NumPy documentation, there is no type called numpy.int that you can import. I believe the type you want to import is numpy.integer or numpy.int_.
The code you provided does not contain a statement like from numpy import int. If you could provide the full traceback, it would be easier to see where the error originates.
I hope this answer will be somewhat useful.
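The traceback in the question shows scipy being loaded from /usr/lib/python3/dist-packages while numpy comes from ~/.local/lib/python3.8/site-packages, so one thing worth checking is whether the two packages come from different installations (numpy.int was a deprecated alias of the built-in int and was removed in NumPy 1.24, which an older scipy may still try to import). Below is a minimal sketch using only the standard library. The helper name install_locations is made up for illustration, and it only reports file locations; it does not decide which version is the right one.

```python
import importlib.util


def install_locations(names):
    """Map each top-level module name to the file it would be imported from.

    importlib.util.find_spec locates a top-level module without importing it.
    Modules that are not installed map to None.
    """
    locations = {}
    for name in names:
        spec = importlib.util.find_spec(name)
        locations[name] = spec.origin if spec is not None else None
    return locations


if __name__ == "__main__":
    for name, origin in install_locations(["numpy", "scipy", "pandas"]).items():
        print(name, "->", origin)
```

If the paths point at different site directories, installing both packages into the same environment is the usual way to bring their versions back in line.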

How to fix pandas callback error in Jupyter notebook or lab

I keep getting an AttributeError when trying to open a large dataset with pandas in Jupyter Notebook and in JupyterLab. I have tried many approaches, but I want to use pandas only.
I have uninstalled all Jupyter modules and all Python packages, and I updated all of my Anaconda packages.
Here is my code:
import pandas as pd
df = pd.read_excel('CalEnviroScreenSailor.xls')
print(df)
Here is the error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/y5/m3s9mlh90hj78cpkj8ytxzkc0000gn/T/ipykernel_21302/2008694909.py in <module>
----> 1 df = pd.read_excel('CalEnviroScreenSailor.xls')
2 print(df)
AttributeError: module 'pandas' has no attribute 'read_excel'

import pandas as pd NameError: name 'null' not defined on jupyter notebook

Hello, I'm currently taking a data analyst bootcamp course on Udemy, and I'm using Jupyter Notebook with Python 3.9. I'm learning how to use the pandas library. I installed it on my computer and even upgraded it to version 1.1.4. When I run
import pandas as pd
and execute the cell I get this error message
NameError Traceback (most recent call last)
<ipython-input-1-7dd3504c366f> in <module>
----> 1 import pandas as pd
~\pandas.py in <module>
25 {
26 "cell_type": "code",
---> 27 "execution_count": null,
28 "metadata": {},
29 "outputs": [],
NameError: name 'null' is not defined
I tried restarting the kernel, and also restarting and clearing the output, but it still gives me this error.
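You may have a local file named pandas.py; the traceback points at ~\pandas.py, and the lines it shows look like notebook JSON rather than Python. Delete or rename that file and rerun the cell. That should resolve it.
The import statement is importing your local file instead of the pandas library.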
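To check for this kind of name clash, a short script can list files in the working directory that share a name with a library you import. This is only a sketch; the function name find_shadowing_files and the list of module names are illustrative choices.

```python
import os


def find_shadowing_files(directory, module_names):
    """Return paths of .py files in `directory` whose names match `module_names`.

    A file such as pandas.py next to the script or notebook is found before
    the installed pandas package, because that directory is searched first.
    """
    hits = []
    for name in module_names:
        candidate = os.path.join(directory, name + ".py")
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    print(find_shadowing_files(os.getcwd(), ["pandas", "numpy", "token"]))
```

An empty list means none of the named libraries is shadowed by a file in that directory.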

Unable to use seaborn.countplot

I'm trying to plot some graphs using the latest version of PyCharm as my Python IDE.
As an interpreter, I'm using Anaconda with Python 3.4.3-0.
Using conda install, I have installed the newest versions of pandas (0.17.0), seaborn (0.6.0), numpy (1.10.1), matplotlib (1.4.3), and ipython (4.0.1).
Inside the nesarc_pds.csv I have this:
IDNUM,S1Q2I
39191,1
39787,1
40082,1
40189,1
40226,1
40637,1
41306,1
41627,1
41710,1
42113,1
42120,1
42720,1
42909,1
43092,1
7,2
15,2
25,2
40,2
46,2
49,2
57,2
63,2
68,2
100,2
104,2
116,2
125,2
136,2
137,2
145,2
168,2
3787,9
6554,9
7616,9
11686,9
12431,9
14889,9
17694,9
19440,9
20141,9
21540,9
22476,9
24207,9
25762,9
29045,9
29731,9
So, that being said, this is my code:
import pandas as pd
import numpy
import seaborn as snb
import matplotlib.pyplot as plt
data = pd.read_csv("nesarc_pds.csv", low_memory=False)
#converting variable to numeric
pd.to_numeric(data["S1Q2I"], errors='coerce')
#setting a new dataset...
sub1=data[(data["S1Q2I"]==1) & (data["S3BQ1A5"]==1)]
sub2 = sub1.copy()
#setting the missing data 9 = unknown into NaN
sub2["S1Q2I"] = sub2["S1Q2I"].replace(9, numpy.nan)
#setting date to categorical type
sub2["S1Q2I"] = sub2["S1Q2I"].astype('category')
#plotting
snb.countplot(x="S1Q2I", data=sub2)
plt.xlabel("blablabla")
plt.title("lalala")
And this is the error:
Traceback (most recent call last):
File "C:/Users/LPForGE_1/PycharmProjects/guido/haha.py", line 49, in <module>
snb.countplot(x="S1Q2I", data=sub2)
File "C:\Anaconda3\lib\site-packages\seaborn\categorical.py", line 2544, in countplot
errcolor)
File "C:\Anaconda3\lib\site-packages\seaborn\categorical.py", line 1263, in __init__
self.establish_colors(color, palette, saturation)
File "C:\Anaconda3\lib\site-packages\seaborn\categorical.py", line 300, in establish_colors
l = min(light_vals) * .6
ValueError: min() arg is an empty sequence
Any help would be really nice. I pretty much exhausted my intelligence trying to understand how to solve this.
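One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: pd.to_numeric returns a new Series and does not modify the DataFrame in place, so the call in the code above leaves data["S1Q2I"] unchanged. If the column was read as strings, comparing it with 1 matches nothing, the subset is empty, and countplot has nothing to draw. The sketch below reproduces that with a made-up four-row frame (the values are illustrative and are not taken from nesarc_pds.csv).

```python
import pandas as pd

# Made-up sample: both columns hold strings, including blanks, as they might
# after reading a CSV that contains non-numeric entries.
raw = pd.DataFrame(
    {"S1Q2I": ["1", "1", "9", " "], "S3BQ1A5": ["1", " ", "1", "1"]},
    dtype=object,
)

# The result is discarded here, so the column still contains strings.
pd.to_numeric(raw["S1Q2I"], errors="coerce")
unconverted = raw[(raw["S1Q2I"] == 1) & (raw["S3BQ1A5"] == 1)]

# Assigning the result back converts the columns, so the filter can match.
converted = raw.copy()
for col in ["S1Q2I", "S3BQ1A5"]:
    converted[col] = pd.to_numeric(converted[col], errors="coerce")
subset = converted[(converted["S1Q2I"] == 1) & (converted["S3BQ1A5"] == 1)]

print(len(unconverted), len(subset))
```

Printing len(sub2) just before the countplot call in the original script would show whether the subset is in fact empty.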

error importing numpy

I get a strange error when I try to import numpy:
Traceback (most recent call last):
File "/home/timo/malltul/mafet/src/mafet/core/pattern.py", line 7, in <module>
import numpy as np
File "/usr/lib/python2.6/dist-packages/numpy/__init__.py", line 147, in <module>
import ma
File "/usr/lib/python2.6/dist-packages/numpy/ma/__init__.py", line 44, in <module>
import core
File "/usr/lib/python2.6/dist-packages/numpy/ma/core.py", line 4850, in <module>
all = _frommethod('all')
File "/usr/lib/python2.6/dist-packages/numpy/ma/core.py", line 4824, in __init__
self.__doc__ = self.getdoc()
File "/usr/lib/python2.6/dist-packages/numpy/ma/core.py", line 4830, in getdoc
signature = self.__name__ + get_object_signature(meth)
File "/usr/lib/python2.6/dist-packages/numpy/ma/core.py", line 109, in get_object_signature
import inspect
File "/usr/lib/python2.6/inspect.py", line 39, in <module>
import tokenize
File "/usr/lib/python2.6/tokenize.py", line 38, in <module>
COMMENT = N_TOKENS
NameError: name 'N_TOKENS' is not defined
It seems that the cause of the problem is that my script is in my own package named core, and whenever I try to import numpy there, I get the error. Importing works fine elsewhere.
The only solution I have found so far is to rename my 'core' package to something else. Why does this matter? Am I doing something wrong?
I'm using Python 2.6 on Ubuntu 10.14. The NumPy version is 1.3.0.
EDIT: Actually, renaming my package does not fix it. Renaming token.py in my package fixes it. Sorry for the error.
I doubt this has anything to do with your core module or with numpy.
From the stack trace, it appears that the problem is with the tokenize module, which is part of Python, not part of numpy. The tokenize module does from token import * and then uses N_TOKENS, which is defined in token.py.
First of all, I'd check that there's no stray module called token on your PYTHONPATH:
>>> import token
>>> token.__file__
'/usr/lib/python2.6/token.pyc'
If this picks up the above file yet you still get the problem, I'd suggest reinstalling Python.
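The same check can be scripted. The sketch below assumes a modern Python 3 interpreter (the sysconfig module is not available in Python 2.6), and the helper names is_under and loaded_from_stdlib are made up for illustration.

```python
import os
import sysconfig


def is_under(path, root):
    """Return True if `path` is located inside the directory `root`."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    try:
        return os.path.commonpath([path, root]) == root
    except ValueError:  # e.g. paths on different drives on Windows
        return False


def loaded_from_stdlib(module):
    """Check whether an imported module's file sits in the standard library."""
    module_file = getattr(module, "__file__", None)
    if module_file is None:  # built-in modules have no file on disk
        return True
    parts = os.path.realpath(module_file).split(os.sep)
    # Third-party directories usually live below the stdlib directory.
    if "site-packages" in parts or "dist-packages" in parts:
        return False
    return is_under(module_file, sysconfig.get_paths()["stdlib"])


if __name__ == "__main__":
    import token
    print(token.__file__, loaded_from_stdlib(token))
```

If the second value printed is False, the token module being imported is not the one shipped with Python, which matches the situation described in the question's EDIT.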