I have a Spark DataFrame. I attempt to drop a column, but in some situations the column appears to still be there.
my_range = spark.range(1000).toDF("number")
new_range = my_range.withColumn('num2', my_range.number*2).drop('number')
# can still sort by "number" column
new_range.sort('number')
Is this a bug? Or am I missing something?
Spark version is v3.3.1
Python 3
I'm on a MacBook Pro M1 (2021)
I called .explain(True) on your sample DataFrame; let's take a look at the output:
== Parsed Logical Plan ==
'Sort ['number ASC NULLS FIRST], true
+- Project [num2#61L]
+- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
+- Project [id#57L AS number#59L]
+- Range (0, 1000, step=1, splits=Some(8))
The Parsed Logical Plan is the first, "raw" version of the query plan. Here you can see Project [num2#61L] before the sort - this is your drop.
But at the next stage (the Analyzed Logical Plan) it's different:
== Analyzed Logical Plan ==
num2: bigint
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
+- Project [num2#61L, number#59L]
+- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
+- Project [id#57L AS number#59L]
+- Range (0, 1000, step=1, splits=Some(8))
Spark was smart enough to figure out that it still needs this column for the sort, so the Project that runs before the sort now includes it. To stay consistent with your code, a new Project [num2#61L] is added after the sort.
Now the last stage, the Optimized Logical Plan:
== Optimized Logical Plan ==
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
+- Project [(id#57L * 2) AS num2#61L, id#57L AS number#59L]
+- Range (0, 1000, step=1, splits=Some(8))
In my opinion it's not a bug but Spark's design. Keep in mind that your code is executed within the same action, so thanks to Spark's lazy evaluation it is free to adjust/optimize the plan during analysis and planning.
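A quick way to see this from the PySpark shell (a minimal sketch against the same toy DataFrame): the dropped column is gone from the schema, yet the analyzer can still resolve the sort through the plan's lineage, while a column that never existed fails immediately.
my_range = spark.range(1000).toDF("number")
new_range = my_range.withColumn("num2", my_range.number * 2).drop("number")
print(new_range.columns)          # ['num2'] -- 'number' is not part of the schema
new_range.sort("number").show(3)  # still resolves against the lineage, as explained above
# new_range.sort("no_such_col").show()  # a truly unknown column raises AnalysisException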
The first answer is obviously correct, and whether or not Spark's approach is a good implementation is open to debate - I think it is.
As an embellishment: a checkpoint, if used, cuts the lineage and therefore results in an error:
spark.sparkContext.setCheckpointDir("/foo2/bar")
new_range = new_range.checkpoint()
new_range.sort('number').show()
returns:
AnalysisException: Column 'number' does not exist. Did you mean one of the following? [num2];
'Sort ['number ASC NULLS FIRST], true
+- LogicalRDD [num2#69L], false
I am executing the query given below, which contains a UNION. My intention is to reuse the Exchange between both query branches by disabling the PushDownPredicate optimizer rule in spark-shell.
With PushDownPredicate enabled, Spark pushes the filter condition close to the source and hence inserts one Exchange (shuffle) on each branch of the query - resulting in 2 Exchanges in total.
However, the expectation is that with PushDownPredicate disabled, Spark will not push the filter close to the source and will instead keep the filter condition at its original place in the query, i.e. after the group by clause. This would let Spark use only a single Exchange (shared by both branches), thereby saving one shuffle.
Unfortunately I have not been able to produce this behaviour with Spark SQL. The query and the commands are given below:
Configuration:
spark-sql> SET "spark.sql.optimizer.excludeRules", org.apache.spark.sql.catalyst.optimizer.PushDownPredicate;
Query:
select prodId, count(*) as cnt
from test_db.product
group by 1
having count(*) > 1000
and prodId = '1234'
union all
select prodId, count(*) as cnt
from test_db.product
group by 1
having count(*) < 100;
Physical Plan:
Union
:- *(2) Project [prodId#5539, count(1)#5576L]
: +- *(2) Filter (count(1)#5579L > 1000)
: +- *(2) HashAggregate(keys=[prodId#5539], functions=[count(1)])
: +- Exchange hashpartitioning(prodId#5539, 200)
: +- *(1) HashAggregate(keys=[prodId#5539], functions=[partial_count(1)])
: +- *(1) Project[prodId#5539]
: +- *(1) Filter (isnotnull(prodId#5539) && (prodId#5539 = 1234))
: +- *(1) FileScan parquet testdb.product[prodId#5539,eff_dt#5546] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://path], PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(prodId), EqualTo(ProdId, 1234)], ReadSchema: struct<prodId:string>
+- *(4) Project[prodId#5568, count(1)#5577L]
+- *(4) Filter (count(1)#5581L < 100)
+- *(4) HashAggregate(keys=[prodId#5568], functions=[count(1)])
+- Exchange hashpartitioning(prodId#5568, 200)
+- *(3) HashAggregate(keys=[prodId#5568], functions=[partial_count(1)])
+- *(3) Project [prodId#5568]
+- *(3) FileScan parquet testdb.product[prodId#5568,eff_dt#5575] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://path], PartitionCount: 1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<prodId:string>
As seen from the above query plan, Spark is using 2 Exchanges - one on each query branch. My intention is to reduce these 2 Exchanges to 1 shared Exchange by using the configuration setting mentioned above.
I am using Spark v2.4.0.
Can anyone please help me figure out where I am going wrong? Am I setting the configuration properly?
Any help is appreciated.
Thanks
You can try this instead - the property is spark.sql.optimizer.excludedRules (note the "d") and it is set with key=value syntax:
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.PushDownPredicate
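Outside of spark-sql, the same exclusion can be set on a SparkSession; a minimal PySpark sketch, assuming an existing session named spark:
spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate",
)
print(spark.conf.get("spark.sql.optimizer.excludedRules"))  # verify what is excluded
The value is a comma-separated list of fully qualified optimizer rule names.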
We have a big customer table with 7 million records and we are trying to process transaction data (500K messages per batch) coming from a Kafka stream.
During the processing, we need to join the transaction data with the customer data. This currently takes around 10 s and the requirement is to bring it down to 5 s. Since the customer table is too large, we cannot use a broadcast join. Is there any other optimization we can make?
== Parsed Logical Plan ==
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Join Inner, Some((custId#110 = rowkey#0))
:- Subquery custProfile
: +- Project [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4]
: +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
: +- Subquery jz_view_sub_cust_profile
: +- Project [rowkey#0,thrd_party_ads_opto_flag#4,no_mkt_opto_flag#5]
: +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
+- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166
== Analyzed Logical Plan ==
count: bigint
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Join Inner, Some((custId#110 = rowkey#0))
:- Subquery custProfile
: +- Project [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4]
: +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
: +- Subquery jz_view_sub_cust_profile
: +- Project [rowkey#0,thrd_party_ads_opto_flag#4,no_mkt_opto_flag#5]
: +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
+- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166
== Optimized Logical Plan ==
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Project
+- Join Inner, Some((custId#110 = rowkey#0))
:- Project [rowkey#0]
: +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
: +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
+- Project [custId#110]
+- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#119L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#122L])
+- Project
+- SortMergeJoin [rowkey#0], [custId#110]
:- Sort [rowkey#0 ASC], false, 0
: +- TungstenExchange hashpartitioning(rowkey#0,200), None
: +- Project [rowkey#0]
: +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
: +- HiveTableScan [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4], MetastoreRelation db_localhost, ext_sub_cust_profile, None
+- Sort [custId#110 ASC], false, 0
+- TungstenExchange hashpartitioning(custId#110,200), None
+- Project [custId#110]
+- Scan ExistingRDD[key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118]
Assuming the customer data is constant across micro-batches, partition it on customerId using a hash partitioner and cache it as an RDD/DataFrame.
Since the transaction data is coming from Kafka, it can also be partitioned on the same key, using a hash partitioner, when it is published into Kafka:
https://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html
This should reduce the time spent joining the two datasets; the only condition is that the partition key must be the same in both datasets (transaction data and customer data).
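A minimal PySpark sketch of the cached, pre-partitioned customer side (table and column names are taken from the plan above; the partition count and the per-batch join are assumptions to adapt):
from pyspark import StorageLevel
from pyspark.sql import functions as F

num_partitions = 200  # should match the partitioning used for the Kafka topic

cust_profile = (
    spark.table("db_localhost.ext_sub_cust_profile")
         .where((F.col("no_mkt_opto_flag") == "N") & (F.col("thrd_party_ads_opto_flag") == "N"))
         .select("rowkey")
         .repartition(num_partitions, "rowkey")   # hash-partition on the join key
         .persist(StorageLevel.MEMORY_AND_DISK)   # keep it around across micro-batches
)
cust_profile.count()  # materialize the cache once, up front

# inside each micro-batch, txn_df is the DataFrame built from the Kafka messages:
# joined = txn_df.join(cust_profile, txn_df["custId"] == cust_profile["rowkey"])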
I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows).
I'd like to filter out all the rows of the largeDataFrame whose some_identifier column matches one of the rows in the smallDataFrame.
Here's an example:
largeDataFrame
some_identifier,first_name
111,bob
123,phil
222,mary
456,sue
smallDataFrame
some_identifier
123
456
desiredOutput
111,bob
222,mary
Here is my ugly solution.
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame.join(broadcast(smallDataFrame2), Seq("some_identifier"), "left").filter($"is_bad".isNull).drop("is_bad")
Is there a cleaner solution?
You'll need to use a left_anti join in this case.
The left anti join is the opposite of a left semi join: it keeps only the rows of the left table that have no match in the right table on the given key:
largeDataFrame
.join(smallDataFrame, Seq("some_identifier"),"left_anti")
.show
// +---------------+----------+
// |some_identifier|first_name|
// +---------------+----------+
// | 222| mary|
// | 111| bob|
// +---------------+----------+
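For reference, the same anti join with the PySpark DataFrame API (assuming the two DataFrames already exist; the row order of show may differ):
result = largeDataFrame.join(smallDataFrame, on="some_identifier", how="left_anti")
result.show()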
A version in pure Spark SQL (using PySpark as an example, but with small changes the same applies to the Scala API):
def string_to_dataframe(df_name, csv_string):
    rdd = spark.sparkContext.parallelize(csv_string.split("\n"))
    df = spark.read.option('header', 'true').option('inferSchema', 'true').csv(rdd)
    df.registerTempTable(df_name)
string_to_dataframe("largeDataFrame", '''some_identifier,first_name
111,bob
123,phil
222,mary
456,sue''')
string_to_dataframe("smallDataFrame", '''some_identifier
123
456
''')
anti_join_df = spark.sql("""
select *
from largeDataFrame L
where NOT EXISTS (
select 1 from smallDataFrame S
WHERE L.some_identifier = S.some_identifier
)
""")
print(anti_join_df.take(10))
anti_join_df.explain()
will output mary and bob, as expected:
[Row(some_identifier=222, first_name='mary'),
Row(some_identifier=111, first_name='bob')]
and the physical execution plan will also show that it is using a SortMergeJoin:
== Physical Plan ==
SortMergeJoin [some_identifier#252], [some_identifier#264], LeftAnti
:- *(1) Sort [some_identifier#252 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(some_identifier#252, 200)
: +- Scan ExistingRDD[some_identifier#252,first_name#253]
+- *(3) Sort [some_identifier#264 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(some_identifier#264, 200)
+- *(2) Project [some_identifier#264]
+- Scan ExistingRDD[some_identifier#264]
Note that a Sort Merge Join is best suited to joining/anti-joining datasets that are of roughly the same size.
Since you have mentioned that the small DataFrame is indeed smaller, we should make sure the Spark optimizer chooses a Broadcast Hash Join, which will be much more efficient in this scenario.
I will change the NOT EXISTS to a NOT IN clause for this:
anti_join_df = spark.sql("""
select *
from largeDataFrame L
where L.some_identifier NOT IN (
select S.some_identifier
from smallDataFrame S
)
""")
anti_join_df.explain()
Let's see what that gives us:
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, ((some_identifier#302 = some_identifier#314) || isnull((some_identifier#302 = some_identifier#314)))
:- Scan ExistingRDD[some_identifier#302,first_name#303]
+- BroadcastExchange IdentityBroadcastMode
+- Scan ExistingRDD[some_identifier#314]
Notice that the Spark optimizer actually chose a Broadcast Nested Loop Join and not a Broadcast Hash Join. That is fine here, since we have just two records to exclude from the left side.
Also notice that both execution plans contain LeftAnti, so this is similar to eliasah's answer, but implemented in pure SQL. Plus it shows that you can have more control over the physical execution plan.
PS: Also keep in mind that if the right DataFrame is much smaller than the left one but bigger than just a few records, you do want a Broadcast Hash Join rather than a Broadcast Nested Loop Join or a Sort Merge Join. If that doesn't happen, you may need to raise spark.sql.autoBroadcastJoinThreshold (it defaults to 10 MB); it has to be larger than the size of the smallDataFrame.
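If the optimizer still refuses to broadcast, a hedged way to force it from the DataFrame API is the explicit broadcast hint (left_anti keeps the anti-join semantics):
from pyspark.sql.functions import broadcast

large_df = spark.table("largeDataFrame")   # the temp views registered above
small_df = spark.table("smallDataFrame")

anti_join_df = large_df.join(
    broadcast(small_df),   # hint: build a broadcast hash table from the small side
    on="some_identifier",
    how="left_anti",
)
anti_join_df.explain()     # expected to show BroadcastHashJoin ... LeftAnti, BuildRight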
I have to extract a DB to an external DB server for a piece of licensed software.
The DB has to be Postgres and I cannot change the select query issued by the application (I cannot change its source code).
The table (it has to be a single table) holds around 6.5M rows and has unique values in the main column (prefix).
All requests are reads, no inserts/updates/deletes, and there are ~200k selects/day with peaks of 15 TPS.
Select query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
AND company = 0 and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC
LIMIT 1;
Explain analyze shows the following:
Limit (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
-> Sort (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on table (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms
The slowest part of the query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
which takes about 1.6 s on its own (I tested only this part of the query).
The explain output for this part, tested separately:
Seq Scan on table (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
Filter: ('004366491997142'::text ~~ (prefix)::text)
Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms
About the data itself:
the values in the "prefix" column share the same first few digits (the first 5) and the rest are different, unique digits.
Postgres version is 9.5
I've changed the following Postgres settings:
random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB
I have tried several index types (unique, gin, gist, hash), but in all cases the indexes are not used (as shown in the explain above) and the resulting speed is the same.
I've also run the following, with no visible improvement:
vacuum analyze verbose table
Please recommend DB settings and/or an index configuration to speed up the execution time of this query.
The current HW is
an i5 with an SSD and 16GB RAM on Win7, but I have the option to buy stronger HW.
As I understand it, for read-dominant workloads (no inserts/updates), faster CPU cores are much more important than the number of cores or disk speed - please confirm.
Add-on 1:
After adding 9 different indexes, none of them is used either.
Add-on 2:
1) I found the reason the index is not used: the operand order in the LIKE part of the query. If the query were:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
it uses the index.
Notice the difference between:
... WHERE '00436641997142%' like prefix ...
and the query which uses the index correctly:
... WHERE prefix like '00436641997142%' ...
Since I cannot change the query itself, any idea how to overcome this? I can change the data and the Postgres settings, but not the query.
2) Also, I installed Postgres 9.6 in order to use parallel seq scans. In this case, the parallel scan is used only if the last part of the query is omitted. So this query:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
uses parallel mode.
Any idea how to force the original query (which I cannot change):
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
to use parallel seq. scan?
It's too hard to make an index for queries of the form string LIKE pattern, because the wildcards (% and _) can appear anywhere.
I can suggest one risky solution:
Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width - for example char(32), or any length sufficient for the task. Also add one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:
select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
            prefix_low            |            prefix_high           | prefix_length
----------------------------------+----------------------------------+---------------
 00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
Make an index over these values:
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
Check the modified query against the table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY prefix_length DESC
LIMIT 1
Check how well it works with the indexes and try to tune it - add/remove an index on prefix_length, add it to the BETWEEN index, and so on.
Now you need to rewrite the queries sent to the database. Install PgBouncer with the PgBouncer-RR patch. It allows you to rewrite queries on the fly with simple Python code, for example:
import re

def rewrite_query(username, query):
    # match only the fixed query shape issued by the application
    q1 = r"^SELECT .*'(?P<id>\d+)'.* ORDER BY position\('%' in prefix\) ASC, char_length\(prefix\) DESC LIMIT "
    if not re.match(q1, query):
        return query  # nothing to do with other queries
    else:
        new_query = query  # ... build the rewritten query here, e.g. using the captured id
        return new_query
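As a more concrete - and entirely hypothetical - illustration of such a rewrite for this specific query shape, turning the application's fixed SELECT into the indexable BETWEEN form from above (the regular expression and the rewritten SQL are assumptions that would need to be validated against the real traffic):
import re

# hypothetical: matches only the application's fixed query shape
APP_QUERY = re.compile(
    r"^SELECT (?P<cols>.+?) FROM (?P<table>\S+)\s+"
    r"WHERE '(?P<number>\d+)' LIKE prefix\s+AND (?P<rest>.+?)\s+"
    r"ORDER BY .+ LIMIT 1;?\s*$",
    re.IGNORECASE | re.DOTALL,
)

def rewrite_query(username, query):
    m = APP_QUERY.match(query)
    if not m:
        return query  # leave every other query untouched
    return (
        "SELECT {cols} FROM {table} "
        "WHERE '{number}' BETWEEN prefix_low AND prefix_high "
        "AND {rest} "
        "ORDER BY prefix_length DESC LIMIT 1"
    ).format(**m.groupdict())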
Run PgBouncer and connect it to the DB. Try to issue different queries the way your application does and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all the required queries and rewrite them properly.
When the proxy is ready and debugged, reconnect your application to PgBouncer.
Pros:
no changes to the application
no changes to the basic structure of the DB
Cons:
extra maintenance - you need triggers to keep the new columns in sync with the actual data
extra tools to support
the rewrite uses regexps, so it is closely tied to the particular queries issued by your app. You will need to run it for a while and build robust rewrite rules.
Further development:
hijack the parsed query tree in PostgreSQL itself: https://wiki.postgresql.org/wiki/Query_Parsing
If I understand your problem correctly, creating a proxy server which rewrites queries could be a solution here.
Here is an example from another question.
Then you could change the "LIKE" to "=" in your query, and it would run a lot faster.
You should change your index by adding the proper operator class, according to the documentation:
The operator classes text_pattern_ops, varchar_pattern_ops, and
bpchar_pattern_ops support B-tree indexes on the types text, varchar,
and char respectively. The difference from the default operator
classes is that the values are compared strictly character by
character rather than according to the locale-specific collation
rules. This makes these operator classes suitable for use by queries
involving pattern matching expressions (LIKE or POSIX regular
expressions) when the database does not use the standard "C" locale.
As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
I observed that running Model.where(*conditions on three indexed integer columns*).first takes too much time on big tables.
first adds sorting by id, so it probably has to fetch and sort all 1.5 million records:
Game.where(private: 0, status: 0).first
DEBUG -- : Game Load (1278.6ms)
SELECT "games".* FROM "games" WHERE "games"."private" = 0 AND "games"."status" = 0
ORDER BY "games"."id" ASC LIMIT 1
Removing first makes things much faster:
Game.where(private: 0, status: 0)
DEBUG -- : Game Load (68.0ms)
SELECT "games".* FROM "games" WHERE "games"."private" = 0 AND "games"."status" = 0
However, if I manually remove the sorting, things are still not that fast:
Game.where(private: 0, status: 0).order(nil).first
DEBUG -- : Game Load (323.7ms)
SELECT "games".* FROM "games" WHERE "games"."private" = 0 AND "games"."status" = 0 LIMIT 1
Does anyone know what the cause of this is? For now I'm considering using scope.to_a.first, which seems to be much faster.
The explain plan for the first query is:
Limit (cost=0.43..59.68 rows=1 width=59)
  -> Index Scan using games_pkey on games (cost=0.43..90007.49 rows=1519 width=59)
       Filter: ?
UPD
It's strange, but today I see different results for the second query (now it executes almost instantly):
Game.where(private: 0, status: 0).order(nil).first
DEBUG -- : Game Load (2.5ms)
SELECT "games".* FROM "games" WHERE "games"."private" = 0 AND "games"."status" = 0 LIMIT 1
Without any details of the query plan it is very hard to debug the issue. There may be several factors causing a temporary slowdown of the query, factors connected not with the code itself but with the way the database stores the data.
Relying only on the execution time of the query may not truly reveal whether a query is efficient or not.
Generally speaking, a LIMIT requires more resources when a sort condition is involved, as the database has to sort the data internally in order to extract the N records. Of course, if the attribute used in the sort clause is not indexed, the query will be even more inefficient.
ActiveRecord exposes both first and take. If you run
Game.where(private: 0, status: 0).first
then ActiveRecord will sort the records by primary key (unless you specify a sorting column) whereas if you use
Game.where(private: 0, status: 0).take
ActiveRecord will query the database and just take the first record it returns, without adding any ordering. Which solution is better? Well, it depends. In the second case the result is unpredictable, because the database will return the data in whatever order it wants.
Generally, applying a sort condition is quite cheap. But again, you need to check the query plan.
From the console, just append .explain to dump the query plan of a specific query. For example:
puts Game.where(private: 0, status: 0).explain
puts Game.where(private: 0, status: 0).order(:id).explain