Retrieving data from s3 bucket in pyspark - amazon-s3

I am reading data from s3 bucket in pyspark . I need to parallelize read operation and doing some transformation on the data. But its throwing error. Below is the code.
s3 = boto3.resource('s3',aws_access_key_id=access_key,aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)
prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix = prefix)
keys=[k.key for k in files]
pkeys = sc.parallelize(keys)
I have a global variable d which is an empty list. And I am appending deviceId data into this.
applying flatMap on the keys
pkeys.flatMap(map_func)
This the function
def map_func(key):
print "in map func"
for line in key.get_contents_as_string().splitlines():
# parse one line of json
content = json.loads(line)
d.append(content['deviceID'])
But the above code gives me error.
Can anyone help!

You have two issues that I can see. The first is you are trying to manually read data from S3 using boto instead of using the direct S3 support built into spark and hadoop. It looks like you are trying to read text files containing json records per line. If that is case, you can just do this in spark:
df = spark.read.json('s3://my-bucket/path/to/json/files/')
This will create a spark DataFrame for you by reading in the JSON data with each line as a row. DataFrames require a rigid pre-defined schema (like a relational database table) which spark try to determine will determine by sampling some of your JSON data. After you have the DataFrame all you need to do to get your column is select it like this:
df.select('deviceID')
The other issue worth pointing out is you are attempting to use a global variable to store data computed across your spark cluster. It is possible to send data from your driver to all of the executors running on spark workers using either broadcast variables or implicit closures. But there is no way in spark to write to a variable in your driver from an executor! To transfer data from executors back to the driver you need to use spark's Action methods intended for exactly this purpose.
Actions are methods that tell spark you want a result computed so it needs to go execute the transformations you have told it about. In your case you would probably either want to:
If the results are large: use DataFrame.write to save the results of your tranformations back to S3
If the results are small: DataFrame.collect() to download them back to your driver and do something with them

Related

In BigQuery, query to get GCS metadata (filenames in GCS)

We have a GCS bucket with a subfolder at url https://storage.googleapis.com/our-bucket/path-to-subfolder. This sub-folder contains files:
file_1_3.png
file_7_4.png
file_3_2.png
file_4_1.png
We'd like to create a table in BigQuery with a column number1 with values 1,7,3,4 (first number in filename) and a column number2 with the second numbers. String splitting is easy, once the data (a column with filenames) is in BigQuery. How can the filenames be retrieved? Is it possible to query a GCS bucket for metadata on files?
EDIT: want to do this
Updating the answer to reflect the question of how do you retrieve GCS Bucket metadata on files.
There are two options you can have here depending on the use case:
Utilize a cloud function on a cron schedule to perform a read of metadata (like in the example you shared) then using the BQ Client library perform an insert. Then perform the regex listed below.
This option utilizes a feature (remote function) in preview so you may not have the functionality needed, however may be able to request it. This option would get you the latest data on read. It involves the following:
Create a Cloud Function that returns an array of blob names, see code below.
Create a connection resource in BigQuery (overall process is listed here however since the remote function portion is in preview the documentation and potentially your UI may not reflect the necessary options (it did not in mine).
Create a remote function (third code block in link)
Call the function from your code then manipulate as needed with regexp.
Example CF for option 2:
from google.cloud import storage
def list_blobs(bucket_name):
"""Lists all the blobs in the bucket."""
storage_client = storage.Client()
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
blob_array = []
for blob in blobs:
blob_array.append()
return blob_array
Example remote function from documentation:
CREATE FUNCTION mydataset.remoteMultiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION us.myconnection
OPTIONS(endpoint="https://us-central1-myproject.cloudfunctions.net/multiply");
Once its in it will return the full gcs path of the file. From there you can use REGEX like the following regexp_extract(_FILE_NAME, 'file_(.+)_') to extract the important information.
Now that BQ Remote Function (RF) is GA as well as JSON, I thought of sharing a way to get any property of blobs in a bucket, right from BQ SQL.
!! Make sure to carefully read the official documentation first on how to set up RF as it's easy to miss a step. There are slight differences if you rather use 2nd Gen Function or Cloud run
Create following storage Cloud Function (here Python) - 1st gen good enough:
import json
from google.cloud import storage
storage_client = storage.Client()
def list_blobs(request):
print(request_json := request.get_json()) # print for debugging
calls = request_json['calls']
bucket_name = calls[0][0]
blobs = storage_client.list_blobs(bucket_name)
reply = [b._properties for b in blobs]
return json.dumps({'replies': [reply]})
Create BQ remote function (assumes fns dataset, us.api connection and my_project_id):
CREATE FUNCTION fns.list_blobs(bucket STRING)
RETURNS JSON
REMOTE WITH CONNECTION us.api
OPTIONS(endpoint="https://us-central1-my_project_id.cloudfunctions.net/storage")
The trick to return multiples values for a single request is to use JSON type
SELECT whatever properties you want
SELECT STRING(blob.name), STRING(blob.size), CAST(STRING(blob.updated) AS TIMESTAMP)
FROM
UNNEST(
JSON_EXTRACT_ARRAY(
fns.list_blobs('my_bucket')
)
) blob
The JSON is converted to an ARRAY, and UNNEST() pivots to multiple rows - unfortunately not columns too.
Voila ! I wish there was a easier way to fully parse a JSON array to a table, populating all columns at once, but as of this writing, you must explicitly extract the properties you want:
You can do many more cool things by extending the functions (cloud and remote) so you don't have to leave SQL, like,
generate and return signed URL to display/download right from a query result (e.g. BI tool)
use user_defined_context and branch logic in the CF code, to perform other operations like delete blobs or do other stuff
Object tables are read-only tables containing metadata index over the unstructured data objects in a specified Cloud Storage bucket. Each row of the table corresponds to an object, and the table columns correspond to the object metadata generated by Cloud Storage, including any custom metadata.
With Object tables we can get the file names and do operations on top of that in BigQuery itself.
https://cloud.google.com/bigquery/docs/object-table-introduction

How to get an array from JSON in the Azure Data Factory?

My actual (not properly working) setup has two pipelines:
Get API data to lake: for each row in metadata table in SQL calling the REST API and copy the reply (json-files) to the Blob datalake.
Copy data from the lake to SQL: For Each file auto create table in SQL.
The result is the correct number of tables in SQL. Only the content of the tables is not what I hoped for. They all contain 1 column named odata.metadata and 1 entry, the link to the metadata.
If I manually remove the metadata from the JSON in the datalake and then run the second pipeline, the SQL table is what I want to have.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]
I tried to add $.['value'] in the API call. The result then was no odata.metadata line, but the array started with {value: which resulted in an error copying to SQL
I also tried to use mapping (in sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but only goes well for the dataset with the same number of column in the array. I don't want to manually do the mapping for 170 calls...
Does anyone know how handle this in ADF? For now I feel like the only solution is to add a Python step in the pipeline, but I hope for a somewhat standard ADF way to do this!
You can add another pipeline with dataflow to remove the content from JSON file before copying data to SQL, using flatten formatters.
Before flattening the JSON file:
This is what I see when JSON data copied to SQL database without flattening:
After flattening the JSON file:
Added a pipeline with dataflow to flatten the JSON file to remove 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the Input array
After selecting value object from input array, you can see only the values under value in Flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as Input to SQL.
Note: If your Input file schema is not constant, you can enable Allow schema drift to allow schema changes
Reference: Schema drift in mapping data flow

Glue create_dynamic_frame.from_catalog return empty data

I'm debugging issue which create_dynamic_frame.from_catalog return no data, despite I'm able to view the data through Athena.
The Data Catelog is pointed to S3 folder and there are multiple files with same structure. The file type is csv, delimiter is space " ", consists of two column (string and json string), with no header.
This is CSV format file.
This is Athena query using crawler generated.
No result returned from dataframe when debug, any thought?
Take a look if you have enabled the Bookmark for this job. If you are running it multiple times, you need to reset the Bookmark or disable it.
Other thing to check is the logs. Maybe you can find some AccessDenied, the role that is running the job might have no access to this bucket.

is Parquet predicate pushdown works on S3 using Spark non EMR?

Just wondering if Parquet predicate pushdown also works on S3, not only HDFS. Specifically if we use Spark (non EMR).
Further explanation might be helpful since it might involve understanding on distributed file system.
I was wondering this myself so I just tested it out. We use EMR clusters and Spark 1.6.1 .
I generated some dummy data in Spark and saved it as a parquet file locally as well as on S3.
I created multiple Spark jobs with different kind of filters and column selections. I ran these tests once for the local file and once for the S3 file.
I then used the Spark History Server to see how much data each job had as input.
Results:
For the local parquet file: The results showed that the column selection and filters were pushed down to the read as the input size was reduced when the job contained filters or column selection.
For the S3 parquet file: The input size was always the same as the Spark job that processed all of the data. None of the filters or column selections were pushed down to the read. The parquet file was always completely loaded from S3. Even though the query plan (.queryExecution.executedPlan) showed that the filters were pushed down.
I will add more details about the tests and results when I have time.
Yes. Filter pushdown does not depend on the underlying file system. It only depends on the spark.sql.parquet.filterPushdown and the type of filter (not all filters can be pushed down).
See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
Here's the keys I'd recommend for s3a work
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
For committing the work. use the S3A "zero rename committer" (hadoop 3.1+) or the EMR equivalent. The original FileOutputCommitters are slow and unsafe
Recently I tried this with Spark 2.4 and seems like Pushdown predicate works with s3.
This is the spark sql query:
explain select * from default.my_table where month = '2009-04' and site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html' limit 100;
And here is the part of output:
PartitionFilters: [isnotnull(month#6), (month#6 = 2009-04)], PushedFilters: [IsNotNull(site), EqualTo(site,http://jdnews.com/sports/game_1997_jdnsports__article.html/play_ra...
Which clearly stats that PushedFilters is not empty.
Note: The used table was created on top of AWS S3
Spark uses the HDFS parquet & s3 libraries so the same logic works.
(and in spark 1.6 they've added even a faster shortcut for flat schema parquet files)

Incrementally add data to Parquet tables in S3

I would like to keep a copy of my log data in in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark and that only seems to offer operations to read and write whole tables via SQLContext.parquetFile() and SQLContext.saveAsParquetFile().
Is there any way to add data to and existing Parquet table
without writing a whole new copy of it
particularly when it is stored in S3?
I know I can create separate tables for the updates and in Spark I can form the union of the corresponig DataFrames in Spark at query time but I have my doubts about the scalability of that.
I can use something other than Spark if needed.
The way to append to a parquet file is using SaveMode.Append
`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
You don't need to union DataFrames after creating them separately, just supply all the paths related to your query to the parquetFile(paths) and get one DataFrame. Just as the signature of reading parquet file: sqlContext.parquetFile(paths: String*) suggests.
Under the hood, in newParquetRelation2, all the .parquet files from all the folders you supply, as well as all the _common_medata and _metadata would be filled into a single list and regard equally.