What are the SparkSQL options for com.amazonaws.services.glue.writeDynamicFrame? - apache-spark-sql

In this documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-parquet
it mentions: "any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter."
However, how can I find out what those options are? There isn't a clear mapping between the Glue code and the SparkSQL code.
(Specifically, I want to figure out how to control the size of the resulting parquet files)

SparkSQL options for the various data sources can be looked up in the DataFrameWriter documentation (in the Scala or PySpark docs). The data source for writing parquet seems to take only a compression parameter. For SparkSQL options when reading data, have a look at the DataFrameReader class.
To control the size of your output files you should play with parallelism, as @Yuri Bondaruk commented, for example by using the coalesce function.
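A rough sketch of what that could look like in a PySpark Glue job; glueContext, the DynamicFrame dyf, the partition count of 10, the S3 path and the snappy compression are all illustrative assumptions, not taken from the question:
from awsglue.dynamicframe import DynamicFrame

# Reduce the number of partitions so Spark writes fewer, larger parquet files
# (coalesce avoids a full shuffle).
coalesced = DynamicFrame.fromDF(dyf.toDF().coalesce(10), glueContext, "coalesced")

glueContext.write_dynamic_frame.from_options(
    frame=coalesced,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    format_options={"compression": "snappy"})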

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediate and end results were previously stored in an SQL database using SQLAlchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part (the binary file parsing) work, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do the same for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make this work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package methods themselves. Is the solution perhaps to rewrite the entire package and its parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example, but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in serialization when producing output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
import pandas as pd
from pyspark.sql import functions as F

def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        if isinstance(value, pd.Series):
            result[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
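For completeness, a hypothetical sketch of how this UDF might be applied and the result written out as a Delta stream; parsed_files, the file_dict column and the paths are made-up names, not from the original post:
import pyspark.sql.functions as F

# Apply the UDF to the struct column produced by the file parser (illustrative names).
spectra = parsed_files.withColumn("spectrum", spectrumparser(F.col("file_dict")))

# Writing (like show()/display()/save()) is the action that triggers serialization.
query = (spectra.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/spectra")
         .start("/tmp/delta/spectra"))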

How to filter by tag in Jaeger

When trying to filter by tag, a small popup appears pointing at the logfmt syntax.
I have been looking around for logfmt, but all I can find is the key=value format.
My questions are:
Is there a way to do something more sophisticated (starts_with, not equal, contains, etc.)?
I am trying to filter by URL using http.url="http://example.com?bla=bla&foo=bar". I am pretty sure the value exists because I am copy/pasting it from my trace, yet I am getting no results. Do I need to escape characters or do something else for this to work?
I did some research around logfmt as well. Based on the documentation of the original implementation and on the Python implementation of the parser (and its tests), I would say that it doesn't support anything more sophisticated (like starts_with, not equal, contains). This is because the output of the parser is a simple dictionary, with no regexes involved in the values.
As for the second question, using the same Python parser mentioned above, I was able to double-check that your filter looks fine:
from logfmt import parse_line
parse_line('http.url="http://example.com?bla=bla&foo=bar"')
Output:
{'http.url': 'http://example.com?bla=bla&foo=bar'}
This makes me suspect an issue on the Jaeger side, but this is as far as I could go.
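For what it's worth, the Tags search box in the Jaeger UI expects space-separated, exact key=value pairs (values containing whitespace must be quoted), so a working filter looks something like the line below; the tag names are just common examples, not taken from your trace:
error=true http.status_code=500 http.method=GET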

How to serialize data in example-in-example format for tensorflow-ranking?

I'm building a ranking model with tensorflow-ranking. I'm trying to serialize a data set in the TFRecord format and read it back at training time.
The tutorial doesn't show how to do this. There is some documentation here on an example-in-example data format, but it's hard for me to understand: I'm not sure what the serialized_context or serialized_examples fields are, how they fit into examples, or what the Serialize() function in the code block is.
Concretely, how can I write and read data in example-in-example format?
The context is a map from feature name to tf.train.Feature. The examples list is a list of maps from feature name to tf.train.Feature. Once you have these, the following code will create an "example-in-example":
context = {...}
examples = [{...}, {...}, ...]

serialized_context = tf.train.Example(features=tf.train.Features(feature=context)).SerializeToString()

serialized_examples = tf.train.BytesList()
for example in examples:
    tf_example = tf.train.Example(features=tf.train.Features(feature=example))
    serialized_examples.value.append(tf_example.SerializeToString())

example_in_example = tf.train.Example(features=tf.train.Features(feature={
    'serialized_context': tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized_context])),
    'serialized_examples': tf.train.Feature(bytes_list=serialized_examples)
}))
To read the examples back, you may call
tfr.data.parse_from_example_in_example(
    example_pb,
    context_feature_spec=context_feature_spec,
    example_feature_spec=example_feature_spec)
where context_feature_spec and example_feature_spec are maps from feature name to tf.io.FixedLenFeature or tf.io.VarLenFeature.
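For illustration, a minimal sketch of what such feature specs might look like; the feature names, types and shapes below are made up and must match whatever you put into context and examples above:
import tensorflow as tf

# Hypothetical feature specs; adapt names, dtypes and shapes to your own features.
context_feature_spec = {
    "query_tokens": tf.io.VarLenFeature(tf.string),
}
example_feature_spec = {
    "document_tokens": tf.io.VarLenFeature(tf.string),
    "relevance": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
}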
First of all, I recommend reading this article to ensure that you know how to create a tf.Example as well as a tf.SequenceExample (which, by the way, is the other data format supported by TF-Ranking):
Tensorflow Records? What they are and how to use them
In the second part of that article, you will see that a tf.SequenceExample has two components: 1) context and 2) sequence (or examples). This is the same idea that example-in-example is trying to implement. Basically, the context is the set of features that are independent of the items you want to rank (a search query in the case of search, or user features in the case of a recommendation system), and the sequence part is a list of items (aka examples). This could be a list of documents (in search) or movies (in recommendation).
Once you are comfortable with tf.Example, Example-in-Example will be easier to understand. Take a look at this piece of code for how to create an EIE instance:
https://www.gitmemory.com/issue/tensorflow/ranking/95/518480361
1) Bundle the context features together in a tf.Example object and serialize it.
2) Bundle the sequence (example) features (each of which could contain a list of values) in another tf.Example object and serialize this one too.
3) Wrap these inside a parent tf.Example.
4) (If you're writing to TFRecords) serialize the parent tf.Example object and write it to your TFRecord file, as in the sketch below.
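For step 4, a minimal sketch, assuming the example_in_example object built in the snippet from the first answer (the file name is a placeholder):
import tensorflow as tf

# Serialize the parent tf.Example and append it to a TFRecord file.
with tf.io.TFRecordWriter("train.tfrecords") as writer:
    writer.write(example_in_example.SerializeToString())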

Create new column from existing column in Dataset - Apache Spark Java

I am new to Spark ML and got stuck on a task which requires some data normalization, and there is very little documentation available online for Spark ML with Java. Any help is much appreciated.
Problem description:
I have a Dataset that contains an encoded URL in a column (ENCODED_URL), and I want to add a new column (DECODED_URL) to the existing Dataset that contains the decoded version of ENCODED_URL.
For example:
Current Dataset
ENCODED_URL
https%3A%2F%2Fmywebsite
New Dataset
ENCODED_URL | DECODED_URL
https%3A%2F%2Fmywebsite | https://mywebsite
I tried using withColumn but had no clue what I should pass as the 2nd argument:
Dataset<Row> newDs = ds.withColumn("new_col",?);
After reading the Spark documentation I got the idea that it may be possible with SQLTransformer, but I couldn't figure out how to customize it to decode the URL.
This is how I read the information from the CSV:
Dataset<Row> urlDataset = s_spark.read().option("header", true).csv(CSV_FILE).persist(StorageLevel.MEMORY_ONLY());
A Spark primer
The first thing to know is that Spark Datasets are effectively immutable. Whenever you do a transformation, a new Dataset is created and returned. Another thing to keep in mind is the difference between actions and transformations: actions cause Spark to actually start crunching numbers and compute your DataFrame, while transformations add to the definition of a DataFrame but are not computed unless an action is called. An example of an action is DataFrame#count, while an example of a transformation is DataFrame#withColumn. See the full list of actions and transformations in the Spark Scala documentation.
A solution
withColumn allows you to either create a new column or replace an existing column in a Dataset (if the first argument is an existing column's name). The docs for withColumn tell you that the second argument is supposed to be a Column object. Unfortunately, the Column documentation only describes methods available on Column objects but does not link to other ways of creating Column objects, so it's not your fault that you're at a loss for what to do next.
The thing you're looking for is org.apache.spark.sql.functions#regexp_replace. Putting it all together, your code should look something like this:
...
import org.apache.spark.sql.functions;

Dataset<Row> ds = ... // reading from your csv file
ds = ds.withColumn(
    "decoded_url",
    functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://"));
regexp_replace requires a Column object as its first value, but nothing requires that the column even exist on any Dataset, because Column objects are basically instructions for how to compute a column; they don't actually contain any real data themselves. To illustrate this principle, we could write the above snippet as:
...
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

Dataset<Row> ds = ... // reading from your csv file
Column myColExpression = functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://");
ds = ds.withColumn("decoded_url", myColExpression);
If you wanted, you could reuse myColExpression on other datasets that have an encoded_url column.
Suggestion
If you haven't already, you should familiarize yourself with the org.apache.spark.sql.functions class. It's a util class that's effectively the Spark standard lib for transformations.

GeoTIFF software generation

I have some data about the height of a set of points (for example, from the Google Elevation API). The task is to save this data in GeoTIFF format and then use it in osgEarth (via GDAL). How can this be done? It does not matter in what language.
A quick search on the Internet only gave me the answer to the reverse question (How do I open GeoTIFF images with GDAL in Python?).
I would be very grateful for any help.
I would do this with GDAL from Python (you could also use rasterio, which is a nice wrapper around GDAL for raster file handling).
You should put your data in a numpy array; let us call it some_nparray.
Then create the tif dataset with gtiffDriver.Create(). Here you provide the name of your file, the dimensions of your image in numbers of columns and rows, the number of bands (here 1), and the datatype. Here I used float32; however byte, int16, etc. could also work, depending on your data (you can check it with height_data_array.dtype).
Next you should set the geotransform, which holds the information about the corner coordinates and pixel resolution, and you should set the projection you are using. This is done with dataset.SetGeoTransform and dataset.SetProjection. How these are created is not within the scope of this question, I believe. If you do not need them, I guess you can even skip that part.
Finally, write your array to the file with WriteArray and close the file.
Your code should look something like this. Here I use the convention that variables prefixed with some_ should be provided by you.
from osgeo import gdal

height_data_array = some_nparray

# Create a single-band float32 GeoTIFF with the array's dimensions.
gtiffDriver = gdal.GetDriverByName('GTiff')
dataset = gtiffDriver.Create('result.tif',
                             height_data_array.shape[1],   # number of columns
                             height_data_array.shape[0],   # number of rows
                             1,                            # number of bands
                             gdal.GDT_Float32)

# Georeferencing: corner coordinates / pixel size and the spatial reference system.
dataset.SetGeoTransform(some_geotrans)
dataset.SetProjection(some_projection)

# Write the height data to band 1, then close the file by releasing the dataset.
dataset.GetRasterBand(1).WriteArray(height_data_array)
dataset = None
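If you do need the geotransform and projection, here is a minimal sketch of one way to build them, assuming a regular north-up grid in WGS84 lat/lon; min_lon, max_lat, pixel_size_x and pixel_size_y are placeholders you would derive from your points:
from osgeo import osr

# Geotransform: (top-left x, pixel width, row rotation, top-left y, column rotation, pixel height).
# The pixel height is negative for north-up rasters.
some_geotrans = (min_lon, pixel_size_x, 0.0, max_lat, 0.0, -pixel_size_y)

# WGS84 geographic coordinates (EPSG:4326) exported as a WKT string.
srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)
some_projection = srs.ExportToWkt()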