I am learning PySpark from an online site. I googled around and found how to read a CSV file into a Spark DataFrame using the following code:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.csv('my_file.csv', header=True)
pandas_df = spark_df.toPandas()
However, the online site I am learning from somehow loads the csv file into the SparkSession without telling the audience how. That is, when I type (in the site's browser console)
print(spark.catalog.listTables())
the following output is returned:
[Table(name='my_file', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
When I tried to print the catalog the same way on my own machine, I got an empty list back.
Is there any way to register the csv file with the SparkSession like that? I have tried to google for this, but most of what I found is how to load a csv into a Spark DataFrame, as I showed above.
Thanks very much.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('my_app').getOrCreate()  # use any app name here
df = spark.read.csv('invoice.csv', inferSchema=True, header=True)
It turns out the online site covers how to do this much later than where it is first needed.
sdf = spark.read.csv('my_file.csv', header=True)
pdf = sdf.toPandas()
spark_temp = spark.createDataFrame(pdf)
spark_temp.createOrReplaceTempView('my_file')
print(spark.catalog.listTables())
[Table(name='my_file', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
One question remains, though. I cannot use pd.read_csv('my_file.csv') directly; it resulted in some merge error or something similar.
This can work:
df = my_spark.read.csv("my_file.csv",inferSchema=True,header=True)
df.createOrReplaceTempView('my_file')
print(my_spark.catalog.listTables())
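Once the temporary view is registered, it can also be queried with plain SQL through the same session; for example (using the view name from above):
result = my_spark.sql("SELECT * FROM my_file LIMIT 5")
result.show()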
I want to import several xml files from a bucket on GCS and then parse them into a pandas DataFrame. I found the pandas.read_xml function to do this, which is great. Unfortunately
I keep getting the error:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I checked the xml files and they look fine.
This is the code:
from google.cloud import storage
import pandas as pd

# importing the data
client = storage.Client()
bucket = client.get_bucket('bucketname')
df = pd.DataFrame()

# parsing the data into pandas df
for blob in bucket.list_blobs():
    print(blob)
    split = str(blob.name).split("/")
    country = split[0]
    data = pd.read_xml(blob.open(mode='rt', encoding='iso-8859-1', errors='ignore'), compression='gzip')
    df["country"] = country
    print(country)
    df.append(data)
When I print out the blob it gives me:
<Blob: textkernel, DE/daily/2020/2020-12-19/jobs.0.xml.gz, 1612169959288959>
Maybe it has something to do with the pandas function trying to read the filename and not the content? Does someone have an idea why this could be happening?
thank you!
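Not a confirmed answer, just a sketch of something worth checking: since the blob name ends in .xml.gz, the bytes may still be gzip-compressed by the time they reach read_xml, which would explain the missing '<' at line 1, column 1. Decompressing explicitly before parsing would rule that out (download_as_bytes and gzip.decompress are assumptions here, not taken from the question):
import gzip
import io
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucketname')

for blob in bucket.list_blobs():
    raw = blob.download_as_bytes()        # raw gzip-compressed bytes
    xml_bytes = gzip.decompress(raw)      # decompress before parsing
    data = pd.read_xml(io.BytesIO(xml_bytes))
    print(blob.name, data.head())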
I am new to Python and I have a question regarding genfromtxt(). I have the following code:
import numpy as np
Myfile = "C:\\Users\\suntzu\\Desktop\\winequality-red.csv"
ds = np.genfromtxt(Myfile,names=True, delimiter=',')
I am trying to redirect this output to a new file. I searched on Google for some time and I can't seem to figure out how to do this.
See if this helps:
To save as CSV using numpy, try this:
np.savetxt("save.csv",ds, delimiter=",")
To save as a NumPy binary (.npy) file, try this:
np.save("save.npy",ds)
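If it helps, the .npy file can be loaded back later with np.load:
ds_loaded = np.load("save.npy")   # restores the structured array written by np.save
print(ds_loaded.dtype.names)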
Does anyone have an idea how to run a pandas program on a Spark standalone cluster machine (Windows)? The program was developed using PyCharm and pandas.
Here the issue is: I am able to run it from the command prompt using spark-submit --master spark://sparkcas1:7077 project.py and I get results, but I do not see any worker activity, nor any Running Application or Completed Application status, in the Spark web UI at :7077.
In the pandas program I included only one Spark-related statement: from pyspark import SparkContext
import pandas as pd
from pyspark import SparkContext

# reading the Excel workbook from a local path
workbook_loc = "c:\\2020\\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
print(df)
What could be the issue?
Pandas code runs only on the driver, and no workers are involved in it, so there is no point in running plain pandas code through Spark.
If you are using Spark 3.0 you can run your pandas-style code distributed by converting the Spark DataFrame to a Koalas DataFrame.
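A minimal sketch of what that could look like, assuming the databricks koalas package is installed; the CSV filename here is made up for illustration:
import databricks.koalas as ks

# pandas-like API, but execution is distributed across the Spark cluster
kdf = ks.read_csv('invoice.csv')
print(kdf.head())

# an existing Spark DataFrame can also be converted:
# kdf = spark_df.to_koalas()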
I'm writing an airflow job to read a gzipped file from s3.
First I get the key for the object, which works fine
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
obj looks fine, something like this:
path/to/file/data_1.csv.gz
Now I want to read the contents into a pandas dataframe. I've tried a number of things but this is my current iteration:
import pandas as pd
df = pd.read_csv(obj['Body'], compression='gzip')
This returns the following error:
TypeError: 's3.Object' object is not subscriptable
What am I doing wrong? I feel like I need to do something with StringIO or BytesIO... I was able to read it in as bytes, but I thought there was a more straightforward way to get to a dataframe.
Just in case it matters, one row of the data looks like this when I unzip and open in CSV:
9671211|ddc9979d5ff90a4714fec7290657c90f|2138|2018-01-30 00:00:12|2018-01-30 00:00:16.069048|42b32863522dbe52e963034bb0aa68b6|1909705|8803795|collect|\\N|0||0||0|
Figured it out:
obj = self.s3_hook.get_key(key, bucket_name=self.s3_bucket)
df = pd.read_csv(obj.get()['Body'], compression='gzip', header=None, sep='|')
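For reference, the BytesIO route mentioned in the question works as well; this is just an equivalent sketch that reads the whole object into memory first:
import io
import pandas as pd

raw = obj.get()['Body'].read()  # raw gzip-compressed bytes from S3
df = pd.read_csv(io.BytesIO(raw), compression='gzip', header=None, sep='|')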
I am trying to read a bunch of CSV files from Google Cloud Storage into pandas dataframes as explained in Read csv from Google Cloud storage to pandas dataframe
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs(prefix=prefix)
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename+'.csv', encoding='utf-8')
list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)
It shows the following error message while importing gcsfs. The packages 'dask' and 'gcsfs' have already been installed on my machine; however, I cannot get rid of the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\gcsfs\dask_link.py", line
121, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
It seems there is some error or conflict between the gcsfs and dask packages. In fact, the dask library is not needed for your code to work. The minimal configuration for your code to run is to install the following libraries (I am posting their latest versions):
google-cloud-storage==1.14.0
gcsfs==0.2.1
pandas==0.24.1
Also, the filename already contains the .csv extension, so change the pd.read_csv line to this:
temp = pd.read_csv('gs://' + bucket_name + '/' + filename, encoding='utf-8')
With these changes I ran your code and it works. I suggest you create a virtual env, install the libraries there, and run the code in it.
This has been tested and seen to work elsewhere, whether reading directly from GCS or via Dask. You may wish to import gcsfs and dask yourself, check that _filesystems exists, and inspect its contents:
In [1]: import dask.bytes.core
In [2]: dask.bytes.core._filesystems
Out[2]: {'file': dask.bytes.local.LocalFileSystem}
In [3]: import gcsfs
In [4]: dask.bytes.core._filesystems
Out[4]:
{'file': dask.bytes.local.LocalFileSystem,
'gcs': gcsfs.dask_link.DaskGCSFileSystem,
'gs': gcsfs.dask_link.DaskGCSFileSystem}
As of https://github.com/dask/gcsfs/pull/129, gcsfs behaves better if it is unable to register itself with Dask, so updating may solve your problem.
A few things to point out in the code above:
bucket_name and prefix need to be defined.
The loop over the filenames should append each dataframe to the list on every iteration; otherwise only the last one gets concatenated.
from google.cloud import storage
import pandas as pd
storage_client = storage.Client()
buckets_list = list(storage_client.list_buckets())
bucket_name='my_bucket'
bucket = storage_client.bucket(bucket_name)
blobs = bucket.list_blobs()
list_temp_raw = []
for file in blobs:
    filename = file.name
    temp = pd.read_csv('gs://'+bucket_name+'/'+filename, encoding='utf-8')
    print(filename, temp.head())
    list_temp_raw.append(temp)
df = pd.concat(list_temp_raw)