How to read a CSV file in PySpark? - apache-spark-sql

I am trying to read a CSV file using PySpark, but it raises an error.
What is the correct way to read a CSV file?
Python code:
from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
I also tried the following:
sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
error:
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined
and
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'

The name spark is predefined only in the pyspark shell; in a plain Python session you first need to create a SparkSession yourself, like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()
With this setup your CSV file needs to be on HDFS; then you can use spark.read.csv:
df = spark.read.csv('/tmp/data.csv', header=True)
where /tmp/data.csv is a path on HDFS.
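For a file on the local disk, as in the question, running Spark in local mode and passing an explicit file URI is one way to avoid both the HDFS requirement and the backslash escapes in the Windows path. The following is only a sketch under assumptions: the helper names local_csv_uri and read_local_csv are hypothetical, master("local[*]") assumes there is no cluster, and the URI is built without percent-encoding, so it only suits paths without spaces or special characters.

```python
from pathlib import PureWindowsPath


def local_csv_uri(windows_path: str) -> str:
    """Turn an absolute Windows path into a file:// URI with forward slashes.

    Hypothetical helper; it does not percent-encode special characters.
    """
    return "file:///" + PureWindowsPath(windows_path).as_posix()


def read_local_csv(windows_path: str):
    """Read a local CSV with Spark in local mode (hypothetical helper)."""
    # Imported lazily so the URI helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # assumption: no YARN cluster, run on this machine
        .appName("MyApp")
        .getOrCreate()
    )
    return spark.read.csv(local_csv_uri(windows_path), header=True, inferSchema=True)


# The raw string (r"...") keeps the backslashes literal.
uri = local_csv_uri(r"D:\Users\SPate233\Downloads\iMedical\query1.csv")
print(uri)  # file:///D:/Users/SPate233/Downloads/iMedical/query1.csv
```

Calling read_local_csv with the same raw string would then return the DataFrame, provided pyspark is installed and the file exists.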

The simplest way to read a CSV file in PySpark is to use Databricks' spark-csv module (from Spark 2.0 onwards the CSV reader is built in, so spark.read.csv works without the extra package).
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
You can also read the file as plain text and split each line on your separator:
reader = sc.textFile("file.csv").map(lambda line: line.split(","))
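One caveat with splitting on the separator: a plain split(",") also splits inside quoted fields. The sketch below uses Python's standard csv module to show the difference; the sample line is made up for illustration.

```python
import csv

# Made-up sample row whose second field contains a comma inside quotes.
line = 'id,"Smith, John",42'

naive = line.split(",")            # splits on every comma, quoted or not
parsed = next(csv.reader([line]))  # honours the quoting rules

print(naive)   # ['id', '"Smith', ' John"', '42']
print(parsed)  # ['id', 'Smith, John', '42']
```

The same parser could be applied per line in the map step, for example sc.textFile("file.csv").map(lambda line: next(csv.reader([line]))), as long as no field contains an embedded newline.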

Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

I am converting a pandas DataFrame to a Polars DataFrame, but pyarrow throws an error.
My code:
import polars as pl
import pandas as pd
if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(
                excelfile, sheet_name=x, header=0)
            for x in sheetnames
        ], axis=0)
    df_pl = pl.from_pandas(df)
Error:
File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
I tried changing the pandas DataFrame dtype to str and that solved the problem, but I don't want to change dtypes. Is this a bug in pyarrow, or am I missing something?
Edit: Polars 0.13.42 and later
Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.
Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).
Polars: prior to 0.13.42
I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.
For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
pandas_to_pydf(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
arrow_dict = {
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
str(col): _pandas_series_to_arrow(
File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
There are several lengthy discussions on this issue, such as these:
to_parquet can't handle mixed type columns #21228
pyarrow.lib.ArrowTypeError: "Expected a string or bytes object, got a 'int' object" #349
This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column:
https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116
How to approach this depends on your data and its use, so I can't recommend a blanket solution (i.e., fixing your source Excel file, or changing the dtype to str).
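Before deciding between fixing the Excel file and casting to str, it can help to find out which columns actually mix types. The following is a minimal sketch with made-up data; the function name mixed_type_columns is hypothetical and not part of pandas or Polars.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each column holding more than one Python type to those type names."""
    mixed = {}
    for col in df.columns:
        # Ignore missing values so that NaN does not count as an extra type.
        kinds = {type(value).__name__ for value in df[col].dropna()}
        if len(kinds) > 1:
            mixed[col] = sorted(kinds)
    return mixed


# Made-up frame: "code" mixes numbers and text, "qty" is purely numeric.
df = pd.DataFrame({"code": [1, "A2", 3], "qty": [10, 20, 30]})
report = mixed_type_columns(df)
print(report)  # {'code': ['int', 'str']}
```

Only the columns reported here would need cleaning or an explicit cast before calling pl.from_pandas.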
I solved my problem by saving the pandas DataFrame to a CSV file and then reading that file with Polars.
import os
import polars as pl
import pandas as pd
if __name__ == "__main__":
    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat([pd.read_excel(excelfile, sheet_name=x, header=0)
                    for x in sheetnames
                    ], axis=0)
    df.to_csv("temp.csv", index=False)
    # read_csv loads eagerly; a lazy scan_csv would still need temp.csv after it is removed
    df_pl = pl.read_csv("temp.csv")
    os.remove("temp.csv")

TypeError: _any() missing 1 required keyword-only argument: 'where'

I am trying to read the file using pandas but it is showing me a type error. I am not able to discern why. Can someone help me?
Below is my code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#prepare the files
df = pd.read_csv("~/Downloads/Boston.csv") # for doing modifications
Traceback (most recent call last):
File "", line 1, in
df = pd.read_csv("~/Downloads/Boston.csv") # for doing modifications
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
low_memory=_c_parser_defaults["low_memory"],
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
iterator = kwds.get("iterator", False)
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1148, in read
names : iterable of names
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
d = {'col1': [1, 2], 'col2': [3, 4]}
File "/Users/nikhiladiga/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 233, in init_dict
datelike_vals = maybe_infer_to_datetimelike(values)
TypeError: _any() missing 1 required keyword-only argument: 'where'
It could be that read_csv has trouble parsing your file without further hints.
Try passing additional keyword arguments such as sep or usecols.
Refer to the documentation for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
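As an illustration of those keyword arguments, here is a small sketch that reads made-up, semicolon-separated data from an in-memory buffer. It only shows how sep and usecols behave; it is not a diagnosis of the traceback above.

```python
import io

import pandas as pd

# Made-up sample standing in for a file that does not use commas.
raw = "id;name;price\n1;apple;0.5\n2;pear;0.75\n"

# sep sets the delimiter; usecols keeps only the listed columns.
df = pd.read_csv(io.StringIO(raw), sep=";", usecols=["id", "price"])
print(df)
```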

Loading .txt file from Google Cloud Storage into a Pandas DF

I'm trying to load a .txt file from a GCS bucket into a pandas DataFrame via pd.read_csv. When I run this code on my local machine (reading the .txt file from a local directory), it works perfectly. However, when I run the code in a Cloud Function, accessing the same .txt file from a GCS bucket, I get 'TypeError: cannot use a string pattern on a bytes-like object'.
The only difference is that I'm accessing the .txt file via the GCS bucket, so it's a bucket object (Blob) instead of a normal file. Do I need to download the blob as a string or as a file-like object before calling pd.read_csv? The code is below:
def stage1_cogs_vfc(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    start_bucket = 'my_bucket'
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(start_bucket)
    df = pd.DataFrame()
    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 20, in stage1_cogs_vfc
df = pd.read_csv(file_path,skiprows=12, encoding ='utf-8', error_bad_lines= False, warn_bad_lines= False , header = None ,sep = '\s+|\^+',engine='python')
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1132, in _make_engine
self._engine = klass(self.f, **self.options)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2238, in __init__
self.unnamed_cols) = self._infer_columns()
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2614, in _infer_columns
line = self._buffered_line()
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2689, in _buffered_line
return self._next_line()
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2791, in _next_line
next(self.data)
File "/env/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2379, in _read
yield pat.split(line.strip())
TypeError: cannot use a string pattern on a bytes-like object
I found a similar situation here.
I also noticed that on the line:
source_bucket = storage_client.bucket(source_bucket)
you are using "source_bucket" as both the variable name and the argument. I would suggest renaming one of them.
For further questions about the API itself, see the documentation: Storage Client - Google Cloud Storage API
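Regarding the TypeError in the question: the last traceback frame applies the compiled separator regex to a line that is still bytes, and the message can be reproduced without pandas or GCS. The sketch below uses made-up bytes in place of the blob's contents and shows that wrapping them in a text stream makes the same regex work; the variable names are illustrative only.

```python
import io
import re

pattern = re.compile(r"\s+|\^+")  # same separator regex as in the question
raw = b"a  b^^c\n"                # made-up bytes standing in for the blob's contents

error_message = None
try:
    pattern.split(raw.strip())    # a str pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)

# Decoding the bytes through a text wrapper yields str lines instead.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
fields = pattern.split(text_stream.readline().strip())

print(error_message)  # cannot use a string pattern on a bytes-like object
print(fields)         # ['a', 'b', 'c']
```

A decoded stream like text_stream is a file-like object, so it could be passed to pd.read_csv in place of the gs:// path.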
Building on the points from #K_immer, here is my updated code, which reads the file into a Dask DataFrame:
def stage1_cogs_vfc(data, context):
    from google.cloud import storage
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt
    start_bucket = 'my_bucket'
    destination_path = 'gs://my_bucket/ddf-*_cogs_vfc.csv'
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(start_bucket)
    blob = bucket.get_blob('SCE_Var_Fact_Costs.txt')
    df0 = pd.DataFrame()
    file_path = 'gs://my_bucket/SCE_Var_Fact_Costs.txt'
    df0 = dd.read_csv(file_path, skiprows=12, dtype=object, encoding='utf-8', error_bad_lines=False, warn_bad_lines=False, header=None, sep=r'\s+|\^+', engine='python')
    df7 = df0.compute() # converts dask df to pandas df
    # then do your heavy ETL stuff here using pandas...

Write Pandas Dataframe to CSV with a variable name in pathway

I've written a python script that takes in a file and matches some columns in another file. I would like to write this to a csv with the name "[original file name]_matched". E.g. I have a bunch of files (xaa, xab, ...) and after running the script on each file I would also have (xaa_matched, xab_matched, etc...) This is what I've tried based on this solution: Set File_Path for to_csv() in Pandas
import sys
import os
filename = sys.argv[1]
# some code
path = r'/Users/mdong/dataScience/movie_representation/fuzzy_match_dir/'
input_file.to_csv(os.path.join(path,'match_' + filename), index = False)
However, I get back this error
Traceback (most recent call last):
File "movie_matching.py", line 29, in <module>
input_file.to_csv(os.path.join(path,filename), index = False)
File "/Users/mdong/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 1413, in to_csv
formatter.save()
File "/Users/mdong/anaconda/lib/python3.6/site-packages/pandas/io/formats/format.py", line 1568, in save
compression=self.compression)
File "/Users/mdong/anaconda/lib/python3.6/site-packages/pandas/io/common.py", line 382, in _get_handle
f = open(path_or_buf, mode, errors='replace')
FileNotFoundError: [Errno 2] No such file or directory: '/Users/mdong/dataScience/movie_representation/fuzzy_match_dir/fuzzy_match_dir/xaa.csv'
I'm not sure what's going wrong or how to troubleshoot it; any pointers would be appreciated!
I would use pathlib in this situation.
from pathlib import Path
p = Path('/Path/to/your/folder/')
input_file.to_csv(Path(p, 'match_' + filename + '.csv'), index=False)
I would also check that your filename variable is what you expect it to be, which you can do with pathlib as well:
>>> p = Path('/Path/To/Thing.csv')
>>> p.stem
'Thing'
>>> p.name
'Thing.csv'
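The duplicated fuzzy_match_dir in the error message suggests that the filename passed on the command line already contained a directory part. A minimal sketch of building the "[original file name]_matched" path from the stem alone follows. The function name matched_path is hypothetical, the ".csv" suffix is an assumption, and PurePosixPath is used only so the example behaves the same on every OS (a real script would use Path).

```python
from pathlib import PurePosixPath


def matched_path(out_dir: str, filename: str) -> PurePosixPath:
    """Return out_dir/<stem>_matched.csv, ignoring any directory in filename."""
    stem = PurePosixPath(filename).stem  # 'fuzzy_match_dir/xaa.csv' -> 'xaa'
    return PurePosixPath(out_dir) / f"{stem}_matched.csv"


out = matched_path(
    "/Users/mdong/dataScience/movie_representation/fuzzy_match_dir/",
    "fuzzy_match_dir/xaa.csv",  # assumed value of sys.argv[1]
)
print(out)  # /Users/mdong/dataScience/movie_representation/fuzzy_match_dir/xaa_matched.csv
```

The resulting path can be passed straight to input_file.to_csv(out, index=False).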

Pandas HDF5 append time series fails

Going through the documentation of pandas HDF5 usability (http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) the given example raises an error:
import pandas as pd
import numpy as np
store = pd.HDFStore('store.h5')
np.random.seed(1234)
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index)
store['df'] = df
df1 = df[0:4]
df2 = df[4:]
store.append('df', df1)
store.append('df', df2)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-225-ef7f2e059c6a>", line 1, in <module>
store.append('df', df1)
File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 919, in append
**kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1252, in _write_to_group
raise ValueError('Can only append to Tables')
ValueError: Can only append to Tables
Has something changed here? Or am I doing something wrong?
Assigning with store['df'] = df writes the frame in the default fixed format, which cannot be appended to. To make the store use the appendable table format by default, set the following option at the beginning:
pd.set_option('io.hdf.default_format','table')
Docs