I can't get the Google example to work:
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example
PySpark
There are a few mistakes in the code, I think. For example:
# Output Parameters
'mapred.bq.project.id': '',
Should be: 'mapred.bq.output.project.id': '',
and
# Write data back into new BigQuery table.
# BigQueryOutputFormat discards keys, so set key to None.
(word_counts
 .map(lambda pair: None, json.dumps(pair))
 .saveAsNewAPIHadoopDataset(conf))
will give an error message. If I change it to:
(word_counts
 .map(lambda pair: (None, json.dumps(pair)))
 .saveAsNewAPIHadoopDataset(conf))
I get the error message:
org.apache.hadoop.io.Text cannot be cast to com.google.gson.JsonObject
And whatever I try, I cannot make this work.
A dataset is created in BigQuery with the name I gave it in the 'conf', with '_hadoop_temporary_job_201512081419_0008' appended, and a table is created with '_attempt_201512081419_0008_r_000000_0' on the end. But they are always empty.
Can anybody help me with this?
Thanks
You're right, the docs are incorrect in this case, and we are working to update them. Sorry about that! While we fix the docs, I wanted to get you a reply ASAP.
Casting problem
The most important problem you mention is the casting issue. Unfortunately, PySpark cannot currently use the BigQueryOutputFormat to create Java GSON objects. The solution (workaround) is to save the output data into Google Cloud Storage (GCS) and then load it into BigQuery manually with the bq command.
Code example
Here is a code sample which exports to GCS and then loads the data into BigQuery. You could also use subprocess and Python to execute the bq command programmatically (a sketch of that follows the sample).
#!/usr/bin/python
"""BigQuery I/O PySpark example."""
import json
import pprint
import pyspark

sc = pyspark.SparkContext()

# Use the Google Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Google Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

# Perform word count.
word_counts = (
    table_data
    .map(lambda pair: json.loads(pair[1]))
    .map(lambda x: (x['word'].lower(), int(x['word_count'])))
    .reduceByKey(lambda x, y: x + y))

# Display 10 results.
pprint.pprint(word_counts.take(10))

# Stage data formatted as newline-delimited JSON in Google Cloud Storage.
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
partitions = range(word_counts.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]

(word_counts
 .map(lambda pair: json.dumps({'word': pair[0], 'word_count': pair[1]}))
 .saveAsTextFile(output_directory))

# Manually clean up the input_directory, otherwise there will be BigQuery export
# files left over indefinitely.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)

print("""
###########################################################################
# Finish uploading data to BigQuery using a client e.g.
bq load --source_format NEWLINE_DELIMITED_JSON \
    --schema 'word:STRING,word_count:INTEGER' \
    wordcount_dataset.wordcount_table {files}
# Clean up the output
gsutil -m rm -r {output_directory}
###########################################################################
""".format(
    files=','.join(output_files),
    output_directory=output_directory))
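As mentioned above, the final bq load step can also be driven from Python with subprocess. A minimal sketch, assuming the bq and gsutil command-line tools are installed and authenticated on the machine running the script, and reusing output_files and output_directory from the sample (wordcount_dataset.wordcount_table is a placeholder table you would create yourself):
import subprocess

# Load the staged newline-delimited JSON into BigQuery via the bq CLI.
# 'wordcount_dataset.wordcount_table' is a placeholder; create the dataset first.
subprocess.check_call([
    'bq', 'load',
    '--source_format', 'NEWLINE_DELIMITED_JSON',
    '--schema', 'word:STRING,word_count:INTEGER',
    'wordcount_dataset.wordcount_table',
    ','.join(output_files)])

# Clean up the staged output in GCS once the load has succeeded.
subprocess.check_call(['gsutil', '-m', 'rm', '-r', output_directory])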
Related
I'm having trouble finding an example to follow online for this simple use case:
Load a CSV file from an S3 object location into a Julia DataFrame.
Here is what I tried that didn't work:
using AWSS3, DataFrames, CSV
filepath = S3Path("s3://muh-bucket/path/data.csv")
CSV.File(filepath) |> DataFrames # fails
# but I am able to stat the file
stat(filepath)
#=
Status( mode = -rw-rw-rw-,
...etc
size = 2141032 (2.0M),
blksize = 4096 (4.0K),
blocks = 523,
mtime = 2021-09-01T23:55:26,
...etc
=#
I can also read the file to a string object locally:
data_as_string = String(AWSS3.read(filepath));
#"column_1\tcolumn_2\tcolumn_3\t...etc..."
My AWS config is in order; I can access the object from Julia locally.
How do I get this into a DataFrame?
Thanks to help from the nice people on the Julia Slack channel (#data):
bytes = AWSS3.read(S3Path("s3://muh-bucket/path/data.csv"))
typeof(bytes)
# Vector{UInt8} (alias for Array{UInt8, 1})
df = CSV.read(bytes, DataFrame)
Bingo, I'm in business. The CSV.jl maintainer mentions that S3Path types used to work when passed to CSV.read, so perhaps this will be even simpler in the future.
Helpful SO post for getting AWS configs in order
The following code snippet is from a Google tutorial; it simply prints the names of the files in a given GCP bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')
Now from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick for opening files that relies on StringIO, but it doesn't support Excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data as bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None,
                  usecols=None, squeeze=False, dtype=None, engine=None,
                  converters=None, true_values=None, false_values=None,
                  skiprows=None, nrows=None, na_values=None,
                  keep_default_na=True, na_filter=True, verbose=False,
                  parse_dates=False, date_parser=None, thousands=None,
                  comment=None, skipfooter=0, convert_float=True,
                  mangle_dupe_cols=True, storage_options=None)

Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
What you can do is get the blob object and use download_as_bytes() to read the object into bytes:
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
Test done (output screenshot omitted).
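As a follow-up, the same bytes-based pattern works for CSV objects as well. A minimal sketch, reusing the bucket object from above; 'SampleData.csv' is a hypothetical object name:
import io

# Download the CSV object as bytes and hand it to pandas as a file-like object.
csv_blob = bucket.blob('SampleData.csv')
df_csv = pd.read_csv(io.BytesIO(csv_blob.download_as_bytes()))
print(df_csv.head())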
I'm trying to read a partitioned dataset in AWS S3. It looks like:
MyDirectory/code=1/file.parquet
MyDirectory/code=2/another.parquet
MyDirectory/code=3/another.parquet
I created a file_list containing the paths to all the files in the directory, then executed:
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
Everything works, except that the partition column code doesn't exist in the DataFrame df.
I also tried using a single path to MyDirectory instead of file_list, but got the error
"Found files in an intermediate directory: s3://bucket/Mydirectoty", and I can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools, you can give dask a try. Assuming all the data you want to read is in s3://folder, you can just use:
import dask.dataframe as dd

storage_options = {
    'key': your_key,
    'secret': your_secret}

df = dd.read_parquet("s3://folder",
                     storage_options=storage_options)
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
An IOPub error usually occurs when you try to print a large amount of data to the console. Check your print statements: if you're trying to print a file that exceeds 10 MB, that is likely what caused the error. Try to read smaller portions of the file/data, as in the sketch below.
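For example, with pandas you can read a large CSV in chunks and print only a small preview instead of the whole file. A minimal sketch; 'large_file.csv' is a placeholder path:
import pandas as pd

# Read the file in 100,000-row chunks rather than loading and printing it all.
reader = pd.read_csv('large_file.csv', chunksize=100000)
first_chunk = next(reader)
print(first_chunk.head())  # preview only the first few rows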
I faced this issue while reading a file from Google Drive into Colab.
I used this link: https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb
and the problem was in this block of code:
# Download the file we just uploaded.
#
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1uBtlaggVyWshwcyP6kEI-y_W3P8D26sz
file_id = 'target_file_id'
import io
from googleapiclient.http import MediaIoBaseDownload
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
#Remove this print statement
#print('Downloaded file contents are: {}'.format(downloaded.read()))
I had to remove the last print statement since it exceeded the 10MB limit in the notebook - print('Downloaded file contents are: {}'.format(downloaded.read()))
Your file will still be downloaded, and you can read it in smaller chunks or read only a portion of it, as sketched below.
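For instance, instead of printing the entire buffer, you could preview just the first few kilobytes. A minimal sketch; downloaded is the io.BytesIO object from the snippet above:
# Print only the first 4 KB of the downloaded file instead of all of it.
downloaded.seek(0)
preview = downloaded.read(4096).decode('utf-8', errors='replace')
print(preview)
downloaded.seek(0)  # rewind before handing the buffer to another reader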
The above answer is correct: I just commented out the print statement and the error went away. I'm keeping this here in case someone finds it useful. Suppose you are reading a CSV file from Google Drive: just import pandas and add pd.read_csv(downloaded), and it will work just fine.
import io
import pandas as pd
from googleapiclient.http import MediaIoBaseDownload

file_id = 'FILEID'
request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()
downloaded.seek(0)
df = pd.read_csv(downloaded)
Maybe this will help (via sv1997):
IOPub Error on Google Colaboratory in Jupyter Notebook
The IOPub error occurs in Colab because you are trying to display very large output on the console itself (e.g. with print() statements).
The IOPub error is probably related to a print call, so delete or comment out the print statement; that may resolve the error.
%cd darknet
!sed -i 's/OPENCV=0/OPENCV=1/' Makefile
!sed -i 's/GPU=0/GPU=1/' Makefile
!sed -i 's/CUDNN=0/CUDNN=1/' Makefile
!sed -i 's/CUDNN_HALF=0/CUDNN_HALF=1/' Makefile
!apt update
!apt-get install libopencv-dev
It's important to update your Makefile, and also to keep your input file name correct.
I have set up a Spark cluster with a master and 2 slaves (I'm using Spark Standalone). The cluster works well with some of the examples, but not with my application. The application workflow is: read the CSV -> extract each line in the CSV along with the header -> convert to JSON -> save to S3. Here is my code:
def upload_func(row):
    f = row.toJSON()
    f.saveAsTextFile("s3n://spark_data/" + row.name + ".json")
    print(f)
    print(row.name)

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL data source example") \
        .getOrCreate()
    df = spark.read.csv("sample.csv", header=True, mode="DROPMALFORMED")
    df.rdd.map(upload_func)
I have also exported the AWS_Key_ID and AWS_Secret_Key into the EC2 environment. However, with the above code, my application does not work. Below are the issues:
The JSON files are not saved in S3. I have tried running the application a few times and reloading the S3 page, but there is no data. The application completes without any error in the log. Also, print(f) and print(row.name) are not printed in the log. What do I need to fix to get the JSON saved to S3, and is there any way for me to print to the log for debugging purposes?
Currently I need to put the CSV file on the worker node so the application can read it. How can I put the file somewhere else, say the master node, and have the application split the CSV across all the worker nodes when it runs, so they can do the upload in parallel as a distributed system?
Help is really appreciated. Thanks for your help in advance.
UPDATED
After adding a logger to debug, I have identified the issue: the map function upload_func() is not being called, or the application cannot get inside this function (the logger printed messages before and after the function call). Please help if you know why.
You need to force the map to be evaluated; Spark only executes work on demand, so a transformation like map does nothing until an action is called.
df.rdd.map(upload_func).count() should do it.
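Alternatively, since upload_func is called only for its side effects, foreach is the more idiomatic action to force the work; a minimal sketch, still relying on upload_func behaving as intended for each row:
# foreach is an action, so it makes Spark actually run upload_func on every
# row instead of leaving the map transformation lazily unevaluated.
df.rdd.foreach(upload_func)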