What is DataFilter in pyspark?

I am seeing something called as DataFilter in my query execution plan:
FileScan parquet [product_id#12,price#14] Batched: true, DataFilters: [isnotnull(product_id#12)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet_dtc], PartitionFilters: [], PushedFilters: [IsNotNull(product_id)], ReadSchema: struct<product_id:int,price:int>
It contains:
PartitionFilters: []
PushedFilters: [IsNotNull(product_id)]
DataFilters: [isnotnull(product_id#12)]
I understand PartitionFilters and PushedFilters. But what is the DataFilter that is showing up here? There is an answer to a similar question here; however, the definition of DataFilter given there is exactly what I think PushedFilter is (also, that answer has 1 downvote). So, is my understanding of PushedFilter wrong? If not, then what is DataFilter?

This explanation is for Spark's latest version at the time of this post (3.3.1).
PushedFilters are essentially a subset of DataFilters; you can see this in DataSourceScanExec.scala. They are the DataFilters whose predicates we can push down to filter on the metadata of the file you're trying to read, instead of against the data itself. Filtering against the metadata is of course much quicker than filtering against the data itself, because you might be able to skip reading large chunks of data that way.
So to structure everything, we have:
PartitionFilters: Filters on partition columns. They enable you to skip entire directories of your partitioned parquet output.
DataFilters: Filters on non-partition columns.
PushedFilters: The DataFilters whose predicates we can push down to the file's metadata.
So when a filter is a DataFilter but not a PushedFilter, it means that we can't push down the predicate to filter on the underlying file's metadata.
Example
Let's take this example of parquet files (not all file formats support predicate pushdown, but parquet files do):
import org.apache.spark.sql.functions.col

val df = Seq(
  (1, 2, 3),
  (2, 2, 3),
  (3, 20, 300),
  (1, 24, 299)
).toDF("colA", "colB", "colC")

df.write.partitionBy("colA").mode("overwrite").parquet("datafilter.parquet")
So we're simply writing a parquet file that is partitioned by the colA column. The file structure looks like this:
datafilter.parquet/
├── colA=1
│   ├── part-00000-55cb3320-f145-4d64-8cba-55a72111c0c8.c000.snappy.parquet
│   └── part-00003-55cb3320-f145-4d64-8cba-55a72111c0c8.c000.snappy.parquet
├── colA=2
│   └── part-00001-55cb3320-f145-4d64-8cba-55a72111c0c8.c000.snappy.parquet
├── colA=3
│   └── part-00002-55cb3320-f145-4d64-8cba-55a72111c0c8.c000.snappy.parquet
└── _SUCCESS
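Since the question is tagged pyspark, here is roughly the same setup in Python; a minimal sketch mirroring the Scala example above (same column names and output path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same toy data as in the Scala example above
df = spark.createDataFrame(
    [(1, 2, 3), (2, 2, 3), (3, 20, 300), (1, 24, 299)],
    ["colA", "colB", "colC"],
)

# Partition by colA so that PartitionFilters can prune whole directories
df.write.partitionBy("colA").mode("overwrite").parquet("datafilter.parquet")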
Let's have a look at the 3 filter types:
PartitionFilter
spark.read.parquet("./datafilter.parquet").filter(col("colA") < 10).explain
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [colB#165,colC#166,colA#167] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:somePath/spark-tests/datafilter.parquet], PartitionFilters: [isnotnull(colA#167), (colA#167 < 10)], PushedFilters: [], ReadSchema: struct<colB:int,colC:int>
Here you see that our filter is a PartitionFilter: since our data is partitioned by colA, we can simply filter on the directories.
PushedFilter
spark.read.parquet("./datafilter.parquet").filter(col("colB") < 10).explain
== Physical Plan ==
*(1) Filter (isnotnull(colB#172) AND (colB#172 < 10))
+- *(1) ColumnarToRow
+- FileScan parquet [colB#172,colC#173,colA#174] Batched: true, DataFilters: [isnotnull(colB#172), (colB#172 < 10)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:somePath/spark-tests/datafilter.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(colB), LessThan(colB,10)], ReadSchema: struct<colB:int,colC:int>
Here you see that our filter (colB < 10) is part of the DataFilters, because colB is not a partition column.
It is also part of the PushedFilters, because it is a predicate we can push down. Parquet files store the minimum and maximum values of each chunk as metadata, so if the minimum value of a chunk is larger than 10, we know we can skip reading that chunk.
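You can inspect this min/max metadata yourself with pyarrow; a small sketch, assuming pyarrow is installed and using an illustrative part-file path from the layout above (your file name will differ):

import pyarrow.parquet as pq

# Illustrative path to one of the part files written above
pf = pq.ParquetFile("datafilter.parquet/colA=1/part-00000.snappy.parquet")

# colA lives only in the directory name, so column 0 inside the file is colB
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(0).statistics
    print(rg, stats.min, stats.max)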
Non PushedFilters
spark.read.parquet("./datafilter.parquet").filter(col("colB") < col("colC")).explain
== Physical Plan ==
*(1) Filter ((isnotnull(colB#179) AND isnotnull(colC#180)) AND (colB#179 < colC#180))
+- *(1) ColumnarToRow
+- FileScan parquet [colB#179,colC#180,colA#181] Batched: true, DataFilters: [isnotnull(colB#179), isnotnull(colC#180), (colB#179 < colC#180)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:somePath/spark-tests/datafilter.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(colB), IsNotNull(colC)], ReadSchema: struct<colB:int,colC:int>
This filter is more complicated: colB < colC is not a filter that we can push down to the metadata of the parquet file, which means we need to read the full data and filter afterwards in memory.
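For completeness, the same three explain calls in PySpark look like this; a sketch reusing the spark session from the Python setup above, and the resulting plans should show the same PartitionFilters/DataFilters/PushedFilters fields:

from pyspark.sql.functions import col

base = spark.read.parquet("./datafilter.parquet")

# PartitionFilter: colA is the partition column
base.filter(col("colA") < 10).explain()

# DataFilter that is also a PushedFilter: colB is a data column with a pushable predicate
base.filter(col("colB") < 10).explain()

# DataFilter that cannot be pushed down: it compares two data columns
base.filter(col("colB") < col("colC")).explain()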

Related

how to save a parquet with pandas using the same header as hadoop spark parquet?

I have a couple of files (csv, ...) and am using pandas and pyarrow (0.17)
to save them as parquet on disk (parquet version 1.4).
Columns:
id : string
val : string
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df)
pq.write_table(table, "df.parquet", version='1.0', flavor='spark', write_statistics=True)
However, Hive and Spark do not recognize the parquet version:
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*) )?\(build ?(.*)\)
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
Just wondering how to save as "spark parquet" with snappy compression
without launching Spark (i.e. a bit of overkill for this).
The metadata seems to be missing.
EDIT based on Pace's comments:
The issue was in an older version of parquet-cpp and is now fixed:
https://issues.apache.org/jira/browse/PARQUET-349
Older versions of Hive and Spark still have the issue:
https://issues.apache.org/jira/browse/HIVE-19464
pyarrow does not expose the parquet version:
https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360
Besides Pace's excellent comments/insights, a workaround for this issue is to use fastparquet, as follows:
import fastparquet as fp

df = df.to_pandas()
fp.write('yourdir', df, fixed_text=None,
         compression='SNAPPY', file_scheme='hive')
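To check which writer string ends up in the footer (the created_by field that the old VersionParser chokes on), you can inspect one of the written part files with pyarrow. A small sketch; the part-file name below is illustrative, the actual name inside 'yourdir' may differ:

import pyarrow.parquet as pq

# Illustrative path; fastparquet's hive scheme writes part files plus a _metadata file
meta = pq.ParquetFile('yourdir/part.0.parquet').metadata
print(meta.created_by)               # the writer string Hive/Spark tries to parse
print(meta.num_rows, meta.num_row_groups)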

Wikimedia pageview compression not working

I am trying to analyze monthly wikimedia pageview statistics. Their daily dumps are OK but monthly reports like the one from June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2) seem broken:
[radim#sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
[radim#sandbox2 pageviews]$ file pageviews-202106-user.bz2
pageviews-202106-user.bz2: Par archive data
Any idea how to extract the data? What encoding is used here? Can it be Parquet file from their Hive analytics cluster?
These files are not bzip2 archives. They are Parquet files. Parquet-tools can be used to inspect them.
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main schema /tmp/pageviews-202106-user.bz2 2>/dev/null
{
  "type" : "record",
  "name" : "hive_schema",
  "fields" : [ {
    "name" : "line",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
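Since the schema is a single string column called line, any Parquet reader can extract the data despite the misleading .bz2 extension. A minimal sketch using pandas with the pyarrow engine (assuming both are installed; pyarrow reads the Parquet footer, so the file extension does not matter):

import pandas as pd

df = pd.read_parquet("pageviews-202106-user.bz2", engine="pyarrow")
print(df.head())   # one 'line' column containing the raw pageview records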

Importing multi-level JSON arrays into an RDBMS by using jq to create SQL INSERT statements

Problem
I am looking at some 2-5 million events of a time series, batch-queried daily from a REST API, to feed into rows of a PostgreSQL database. I use an RDBMS because I want to have the time series pre-structured for faster analysis and exploration later on, and the input schema does not really change. So Cassandra or MongoDB are not an option, although they would happily accept JSON as is.
The database is running on a cloud infrastructure. PostgreSQL is accessible through the network. The database itself may be residing on a network-attached file system, so I/O suffers from the respective latency and limited bandwidth.
Also, if running the crawler/importer as a serverless function, any solution on top of the JVM or Python has a significant overhead.
Objective
As the task itself is rather simple and linear, I was looking for a very small footprint solution - at best without the dependency of any runtime or interpreter in the first place.
So building a small bash script cascading httpie / curl, jq and the native psql client like so sounds promising:
#!/bin/bash
DATE=`date +%Y-%m-%d`
http POST https://example.net/my/api/v2 date=$DATE | \
  jq -r '<jq json filter>' | \
  psql -U myuser -d mydb
Question
To do just that: how do I convert multi-level JSON including arrays into a prepared SQL statement using just jq?
Expected SQL statement:
INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES
(2020-09-10,4170,1,0,-1,0,0),
(2020-09-10,4170,1,0,-1,300,0),
....
(2020-09-10,8888,2,0,-1,0,0)
JSON input:
{
"date": "2020-09-10",
"events": [
{
"sensor": 1,
"intervals": [
{
"prop1": 0,
"prop2": -1,
"prop3": 0,
"prop4": 0
},
{
"prop1": 0,
"prop2": -1,
"prop3": 300,
"prop4": 0
}],
"node": 4170
},
{
"sensor": 2,
"intervals": [
{
"prop1": 0,
"prop2": -1,
"prop3": 0,
"prop4": 0
}],
"node": 8888
}]
}
I'm going to answer my own question here with what I've learned from people on #jq on LiberaChat since I think it might be well worth documenting for a broader audience.
Solution
Essentially, I've found two solutions.
The first "elegant" one incorporates a two-fold join. This will work with a smaller amount of data but performance will take a hit due to it's non-linear, two-pass join. 500.000 events take several minutes.
The second one results in linear, almost sequential processing but is a lot less jq-like. 500.000 events take a couple of seconds on my machine.
a) The elegant way
jq -r '
("INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES "),
([ "(\(.date as $date | .events[] | [.node, .sensor] as $d | .intervals[] | [$date, $d[], values[]] | join(",")))"] | join(",\n")),
("ON CONFLICT DO NOTHING;")
'
with the jq result as:
INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES
(2020-09-10,4170,1,0,-1,0,0),
(2020-09-10,4170,1,0,-1,300,0),
(2020-09-10,8888,2,0,-1,0,0)
ON CONFLICT DO NOTHING;
JQ Play: https://jqplay.org/s/oU8mZUHHTm
b) The sequential but less elegant way
jq -r '
("INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES "),
([.date as $date | .events[] | [.node, .sensor] as $d | .intervals[] | [ $date, $d[], values[] ]] | ( .[:-1][] | "(" + join(",") + "),"), (last | "(" + join(",") + ")")),
("ON CONFLICT DO NOTHING;")
'
JQ Play: https://jqplay.org/s/pkVxWbBwC3
Both suggested solutions go to some lengths to avoid a trailing , at the end of the values list for SQL.
Why jq?
It is written in C, yielding a very small footprint while offering quite advanced and well-proven capabilities for filtering and restructuring JSON.
It works stream-based, so I'm expecting mostly sequential reads and writes (?).
The output can be adapted to the RDBMS used and to the application itself, which makes this fairly portable.
JQ Github Page
Alternative Solutions (Discussion)
If the RDBMS supports JSON natively (e.g. PostgreSQL), import it directly.
Variations of this approach are described here on SO: (How can I import a JSON file into PostgreSQL?)
While Postgres' JSON parser is arguably well proven and quite efficient, this approach is not suitable for millions of rows at a time imho. Without having conducted performance experiments, I'd argue that this might perform a little less efficiently, as the JSON blob is not indexed, the Postgres query planner will end up performing random seeks on it (I may be wrong), and there is an additional copy involved.
Create CSV from JSON and COPY into RDBMS
Most likely this already involves jq to create the CSV file. Maybe Python's pandas module is of help here.
While this adds another layer of conversion, importing the CSV into the RDBMS is blazingly fast, as it is merely a raw COPY of the rows with the index built on the fly. This may be a good choice, yet, as opposed to the previous approaches, a COPY does not honour any TRIGGER that would act on INSERTs/UPDATEs if configured for an existing table.
If you were to filter out duplicates in a time series during import using ... ON CONFLICT DO NOTHING, this is not supported by COPY without an intermediate copy of the dataset: How Postgresql COPY TO STDIN With CSV do on conflic do update?
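If you do go the CSV-plus-COPY route, pandas can do the flattening in a few lines. A minimal sketch against the input schema shown above (file names are illustrative; json_normalize expands the intervals array and carries node/sensor along, and the date column is added afterwards):

import json
import pandas as pd

with open("events.json") as fh:          # illustrative input file name
    payload = json.load(fh)

rows = pd.json_normalize(
    payload["events"],
    record_path="intervals",             # one output row per interval
    meta=["node", "sensor"],             # repeated for every interval row
)
rows.insert(0, "date", payload["date"])

# Column order matching the INSERT statement above
rows = rows[["date", "node", "sensor", "prop1", "prop2", "prop3", "prop4"]]
rows.to_csv("events.csv", index=False, header=False)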

Python unit tests for Foundry's transforms?

I would like to set up tests on my transforms in Foundry, passing test inputs and checking that the output is the expected one. Is it possible to call a transform with dummy datasets (a .csv file in the repo), or should I create functions inside the transform to be called by the tests (data created in code)?
If you check your platform documentation under Code Repositories -> Python Transforms -> Python Unit Tests, you'll find quite a few resources there that will be helpful.
The sections on writing and running tests in particular are what you're looking for.
// START DOCUMENTATION
Writing a Test
Full documentation can be found at https://docs.pytest.org
Pytest finds tests in any Python file that begins with test_.
It is recommended to put all your tests into a test package under the src directory of your project.
Tests are simply Python functions that are also named with the test_ prefix and assertions are made using Python’s assert statement.
PyTest will also run tests written using Python’s builtin unittest module.
For example, in transforms-python/src/test/test_increment.py a simple test would look like this:
def increment(num):
    return num + 1

def test_increment():
    assert increment(3) == 5
Running this test will cause checks to fail with a message that looks like this:
============================= test session starts =============================
collected 1 item
test_increment.py F [100%]
================================== FAILURES ===================================
_______________________________ test_increment ________________________________
def test_increment():
> assert increment(3) == 5
E assert 4 == 5
E + where 4 = increment(3)
test_increment.py:5: AssertionError
========================== 1 failed in 0.08 seconds ===========================
Testing with PySpark
PyTest fixtures are a powerful feature that enables injecting values into test functions simply by adding a parameter of the same name. This feature is used to provide a spark_session fixture for use in your test functions. For example:
def test_dataframe(spark_session):
    df = spark_session.createDataFrame([['a', 1], ['b', 2]], ['letter', 'number'])
    assert df.schema.names == ['letter', 'number']
// END DOCUMENTATION
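To answer the "functions inside the transform" part of your question: one common pattern (a sketch, not from the platform docs; module and function names are illustrative) is to factor the transform's logic into a plain function that takes and returns DataFrames, so a test can call it directly with the spark_session fixture:

# transforms-python/src/myproject/clean.py  (illustrative module)
from pyspark.sql import functions as F

def add_total(df):
    # Pure logic, independent of the Foundry Input/Output wiring
    return df.withColumn("total", F.col("price") * F.col("quantity"))


# transforms-python/src/test/test_clean.py
from myproject.clean import add_total

def test_add_total(spark_session):
    df = spark_session.createDataFrame([(2.0, 3), (5.0, 1)], ["price", "quantity"])
    result = add_total(df).collect()
    assert [row.total for row in result] == [6.0, 5.0]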
If you don't want to specify your schemas in code, you can also read in a file in your repository by following the instructions in documentation under How To -> Read file in Python repository
// START DOCUMENTATION
Read file in Python repository
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, In your python repository edit setup.py:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.yaml', '*.csv']
    }
)
This tells Python to bundle the yaml and csv files into the package. Then place a config file (for example config.yaml, but it can also be csv or txt) next to your python transform (e.g. read_yml.py, see below):
- name: tbl1
  primaryKey:
    - col1
    - col2
  update:
    - column: col3
      with: 'XXX'
You can read it in your transform read_yml.py with the code below:
from transforms.api import transform_df, Input, Output
from pkg_resources import resource_stream
import yaml
import json
@transform_df(
    Output("/Demo/read_yml")
)
def my_compute_function(ctx):
    stream = resource_stream(__name__, "config.yaml")
    docs = yaml.load(stream)
    return ctx.spark_session.createDataFrame([{'result': json.dumps(docs)}])
So your project structure would be:
some_folder/
├── config.yaml
└── read_yml.py
This will output in your dataset a single row with one column "result" with content:
[{"primaryKey": ["col1", "col2"], "update": [{"column": "col3", "with": "XXX"}], "name": "tbl1"}]
// END DOCUMENTATION
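Combining the two ideas, you could also keep a small .csv fixture inside the test package and feed it to the same logic function. A sketch, assuming package_data includes '*.csv' as above and that a file fixtures/dummy.csv with price and quantity columns sits next to the test (all names are illustrative):

# transforms-python/src/test/test_with_csv.py  (illustrative)
from pkg_resources import resource_filename
from myproject.clean import add_total   # the pure logic function from the earlier sketch

def test_add_total_from_csv(spark_session):
    path = resource_filename(__name__, "fixtures/dummy.csv")
    df = spark_session.read.csv(path, header=True, inferSchema=True)
    out = add_total(df)
    assert "total" in out.columns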

(InternalError) when calling the SelectObjectContent operation in boto3

I have a series of files that are in JSON that need to be split into multiple files to reduce their size. One issue is that the files are extracted using a third party tool and arrive as a JSON object on a single line.
I can use S3 select to process a small file (say around 300Mb uncompressed) but when I try and use a larger file - say 1Gb uncompressed (90Mb gzip compressed) I get the following error:
[ERROR] EventStreamError: An error occurred (InternalError) when calling the SelectObjectContent operation: We encountered an internal error. Please try again.
The query that I am trying to run is:
select count(*) as rowcount from s3object[*][*] s
I can't run the query from the console because the file is larger than 128 MB, but the code that is performing the operation is as follows:
def execute_select_query(bucket, key, query):
    """
    Runs a query against an object in S3.
    """
    if key.endswith("gz"):
        compression = "GZIP"
    else:
        compression = "NONE"
    LOGGER.info("Running query |%s| against s3://%s/%s", query, bucket, key)
    return S3_CLIENT.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={"JSON": {"Type": "DOCUMENT"}, "CompressionType": compression},
        OutputSerialization={'JSON': {}},
    )
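For context, select_object_content returns an event stream rather than a plain body, so the caller typically consumes the result like this (a usage sketch, assuming bucket and key are already defined; error handling omitted):

response = execute_select_query(bucket, key, "select count(*) as rowcount from s3object[*][*] s")

# The Payload is a boto3 EventStream; Records events carry the query output as bytes
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
    elif "Stats" in event:
        details = event["Stats"]["Details"]
        print("scanned:", details["BytesScanned"], "processed:", details["BytesProcessed"])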