Why do the persist() and cache() methods shorten the DataFrame plan in Spark?

I am working with Spark version 3.0.1. I generate a large DataFrame, and at the end of the calculations I save the DataFrame's plan in JSON format. I need this plan.
But there is one problem: if I persist a DataFrame, its plan in JSON format is completely truncated. That is, all data lineage disappears.
For example, I do this:
val myDf: DataFrame = ???
val myPersistDf = myDf.persist
// the toJSON method truncates my plan
val jsonPlan = myPersistDf.queryExecution.optimizedPlan.toJSON
As a result, only information about the current columns remains.
But if you use Spark version 3.1.2, there is no such problem: the plan is not truncated.
It is also worth noting that if you do not call the toJSON method, the plan is not truncated:
// Plan is not being cut.
val textPlan = myPersistDf.queryExecution.optimizedPlan.toString
I made a small test project to make sure of this:
https://github.com/MinorityMeaning/CutPlanDataFrame
Please help me figure it out.
I need to get the full plan in json format.
UPD(1):
Now I'm trying to convert each node to JSON separately. It doesn't work perfectly yet, but I think this is the right direction.
import org.apache.spark.sql.catalyst.trees.TreeNode
import org.json4s.jackson.JsonMethods
val jsonPlan = s"[${getJson(result_df.queryExecution.optimizedPlan).mkString(",")}]"
// Serialize this node on its own, then recurse into its children.
def getJson(lp: TreeNode[_]): Seq[String] = {
  val children = (lp.innerChildren ++ lp.children.map(c => c.asInstanceOf[TreeNode[_]])).distinct
  // toJSON returns an array for the whole subtree; element 0 is the current node.
  JsonMethods.compact(JsonMethods.render(JsonMethods.parse(lp.toJSON)(0))) +:
    children.flatMap(t => getJson(t))
}
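The per-node idea above can be illustrated outside Spark. The following is only a sketch in Python: PlanNode, node_json and flatten_json are hypothetical names, and PlanNode is a stand-in for a Catalyst TreeNode, not a Spark class. It shows the same pre-order walk, where each node is serialized without its subtree and the pieces are joined into one JSON array.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """Hypothetical stand-in for a plan node: a name, some fields, and children."""
    name: str
    fields: Dict[str, str] = field(default_factory=dict)
    children: List["PlanNode"] = field(default_factory=list)


def node_json(node: PlanNode) -> str:
    """Serialize a single node on its own, without its subtree."""
    payload = {"class": node.name, "num-children": len(node.children), **node.fields}
    return json.dumps(payload)


def flatten_json(node: PlanNode) -> List[str]:
    """Pre-order walk: this node's JSON first, then each child's subtree."""
    out = [node_json(node)]
    for child in node.children:
        out.extend(flatten_json(child))
    return out


# A tiny made-up plan: Project -> Filter -> Relation.
plan = PlanNode("Project", {"projectList": "id"}, [
    PlanNode("Filter", {"condition": "id > 1"}, [
        PlanNode("Relation", {"table": "t"}),
    ]),
])

# Join the per-node strings into one JSON array, as in the Scala snippet.
json_plan = "[" + ",".join(flatten_json(plan)) + "]"
```

Because every node is rendered separately, a node whose own serializer drops its children cannot hide the rest of the tree.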
UPD(2):
OK, here's how I finally solved this problem.
I downloaded the Spark 3.0.1 source from GitHub, replaced the TreeNode class in that project with the file from Spark 3.1.2, and recompiled the project.
As a result, I got a spark-catalyst_2.12-3.0.1.jar package, which I used to replace the original one.
Switching to another version of Spark is not an option for me, and I have not found any other solution to the problem.
Thank you all for the hints. Your advice helped.

You can cherry-pick the two commits below into Spark 3.0.1 to fix this issue.
* 1603775934 - [SPARK-35411][SQL][FOLLOWUP] Handle Currying Product while serializing TreeNode to JSON (8 months ago) <Tengfei Huang>
* 9804f07c17 - [SPARK-35411][SQL] Add essential information while serializing TreeNode to json (9 months ago) <Tengfei Huang>

Well, you already kinda have your answer.
TLDR: Upgrade.
You could look into the GitHub repo at the logical plan code and look for differences in how logical plans are calculated between 3.0.1 and 3.1.2. (Or look at the code for persist and see how it has changed.) You could then backport a patch to 3.0.1. But you would still need to build a new version of Spark and deploy it so that the full plan is returned. If you are doing all that work, why not upgrade to 3.1.2, since you know it works? (Or to some later version of Spark?)
(You must have some dependency on a sub-component that is only compatible with 3.0.1?)

Related

Directly passing pandas data into zipline

I am currently looking for a way to directly pass in a pandas dataframe or csv file to zipline for simple backtesting WITHOUT having to ingest a data bundle. The reason is that I am planning to generate new data outside of the existing bundle during a backtest and it seems very inefficient to ingest a new bundle for every handle_data call.
I have been looking for this everywhere, including the zipline source code. I found that an older version of zipline has a 'data' param in the run_algo function call where you could pass in a df directly, but I can't find that old version at the moment. Is anyone attempting the same thing? Is there any way other than ingesting data bundles on the command line every time?
I'm using zipline 1.3.0 and it actually does have a data param. This comment is from zipline's run_algo.py file:
data : pd.DataFrame, pd.Panel, or DataPortal, optional
    The ohlcv data to run the backtest with.
    This argument is mutually exclusive with:
    ``bundle``
    ``bundle_timestamp``
Hope it helped

Enable Impala Impersonation on Superset

Is there a way to make queries on Impala run as the user who is logged in to Superset?
I tried enabling the "Impersonate the logged on user" option on Databases, but with no success: all the queries still run on Impala as the superset user.
I'm trying to achieve the same! This will not completely answer the question, since it still does not work, but I want to share my research in case it helps another soul who is trying to use this tool outside very basic use cases.
I went deep into the code and found out that impersonation is not implemented for Impala, so you cannot achieve this from the UI. I found this PR, https://github.com/apache/superset/pull/4699, which for whatever reason was never merged into the codebase, and tried to copy and paste its code into my Superset version (1.1.0), but it didn't work. After adding some logs I can see that the configuration with the impersonation is updated, but the actual Impala query still runs as the user I used to start the process.
As you can imagine, I am a complete noob at this. However, I found out that impersonation happens when you create a cursor, and there is a constructor parameter in which you can pass the impersonation configuration.
I managed to correctly (at least to my understanding) implement impersonation for the SQL Lab part.
In sql_lab.py you have to add the following lines in the execute_sql_statements method
with closing(engine.raw_connection()) as conn:
    # closing the connection closes the cursor as well
    cursor = conn.cursor(**database.cursor_kwargs)
where cursor_kwargs is defined in db_engine_specs/impala.py as the following
@classmethod
def get_configuration_for_impersonation(cls, uri, impersonate_user, username):
    logger.info(
        'Passing Impala execution_options.cursor_configuration for impersonation')
    return {'execution_options': {
        'cursor_configuration': {'impala.doas.user': username}}}
@classmethod
def get_cursor_configuration_for_impersonation(cls, uri, impersonate_user,
                                               username):
    logger.debug('Passing Impala cursor configuration for impersonation')
    return {'configuration': {'impala.doas.user': username}}
Finally, in models/core.py you have to add the following bit in the get_sqla_engine def
params = extra.get("engine_params", {})  # already there, shown so you can find the line
self.cursor_kwargs = self.db_engine_spec.get_cursor_configuration_for_impersonation(
    str(url), self.impersonate_user, effective_username)  # this is the line I added
...
params.update(self.get_encrypted_extra())  # already there
# new stuff
configuration = {}
configuration.update(
    self.db_engine_spec.get_configuration_for_impersonation(
        str(url),
        self.impersonate_user,
        effective_username))
if configuration:
    params.update(configuration)
As you can see, I just shamelessly pasted the code from the PR. However, as I already said, this only kind of works for SQL Lab. Dashboards query Impala in an entirely different way that I have not yet figured out.
This means that dashboard queries are handled differently, and there is nothing like this:
with closing(engine.raw_connection()) as conn:
    # closing the connection closes the cursor as well
    cursor = conn.cursor(**database.cursor_kwargs)
My gut (and debugging) feeling is that you first need to understand the SQLAlchemy part and then write a new ImpalaEngine class that uses a custom cursor with the impersonation conf, or something like that. However, it is not as simple (if we want to call it simple) as the sql_lab part. So the trick is to find out where the query is executed and create a cursor with the impersonation configuration. Easy, isn't it?
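To make the cursor step concrete, here is a minimal sketch of building the impersonation configuration and passing it when the cursor is created. StubConnection and StubCursor are hypothetical stand-ins so the snippet runs without an Impala server, and impersonation_cursor_kwargs and open_cursor are illustrative helper names, not Superset or impyla functions. Only the 'impala.doas.user' key and the conn.cursor(**kwargs) call shape come from the code above.

```python
from contextlib import closing


def impersonation_cursor_kwargs(username, impersonate_user=True):
    """Build keyword arguments for conn.cursor(); empty when impersonation is off."""
    if not impersonate_user or not username:
        return {}
    return {"configuration": {"impala.doas.user": username}}


class StubCursor:
    """Hypothetical cursor that just records the configuration it was given."""

    def __init__(self, configuration=None):
        self.configuration = configuration or {}


class StubConnection:
    """Hypothetical connection standing in for engine.raw_connection()."""

    def __init__(self):
        self.closed = False

    def cursor(self, **kwargs):
        return StubCursor(**kwargs)

    def close(self):
        self.closed = True


def open_cursor(conn, username, impersonate_user=True):
    """Create the cursor with the impersonation configuration attached."""
    return conn.cursor(**impersonation_cursor_kwargs(username, impersonate_user))


# Usage: the logged-in user's name ends up in the cursor configuration.
with closing(StubConnection()) as conn:
    cursor = open_cursor(conn, "alice")
```

Wherever the dashboard queries create their cursor, the same keyword arguments would have to be passed there as well.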
I hope this sheds some light for you and for others who have this issue. Let me know if you find another way to solve it, or if this comment was useful.
Update: something really useful
A colleague of mine successfully implemented impersonation with Impala without touching anything in Superset, working directly with the impyla lib instead. A PR was opened with the code to change. You can apply the patch directly to the impyla source used by Superset. You have to edit both dbapi.py and hiveserver2.py.
As a reminder: we are still testing this, and we do not know whether it works with different accounts using the same Superset instance.

Wait.on(signals) use in Apache Beam

Is it possible to write to a second BigQuery table after the write to the first has finished, in a batch pipeline, using the Wait.on() method (a new feature in Apache Beam 2.4)? The example given in the Apache Beam documentation is:
PCollection<Void> firstWriteResults = data.apply(ParDo.of(...write to first database...));
data.apply(Wait.on(firstWriteResults))
// Windows of this intermediate PCollection will be processed no earlier than when
// the respective window of firstWriteResults closes.
.apply(ParDo.of(...write to second database...));
But why would I write to a database from within a ParDo? Can we not do the same using the I/O transforms provided in Dataflow?
Thanks.
Yes this is possible, although there are some known limitations and there is currently some work being done to further support this.
In order to make this work you can do something like the following:
WriteResult writeResult = data.apply(BigQueryIO.write()
    ...
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
);
data.apply(Wait.on(writeResult.getFailedInserts()))
    .apply(...some transform which writes to second database...);
It should be noted that this only works with streaming inserts and won't work with file loads. At the same time, there is some work being done currently to better support this use case that you can follow here
Helpful references:
http://moi.vonos.net/cloud/beam-send-pubsub/
http://osdir.com/apache-beam-users/msg02120.html

Spark issue with the class generated from avro schema

I have a piece of code written in Spark that loads data from HDFS into Java classes generated from Avro IDL. On the RDD created this way I execute a simple operation whose result depends on whether I cache the RDD beforehand.
That is, if I run the code below
val loadedData = loadFromHDFS[Data](path,...)
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
the program prints 200000. On the other hand, executing the following code
val loadedData = loadFromHDFS[Data](path,...).cache()
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
results in 1 being printed to stdout.
When I inspect values of the fields after reading cached data it seems
I am pretty sure that the root cause of the described problem is an issue with serialization of the classes generated from Avro IDL, but I do not know how to resolve it. I tried using Kryo, registering the generated class (Data), and registering different serializers from chill_avro for that class (SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of those ideas helped.
How can I solve this problem?
Link to minimal, complete, and verifiable example.
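Try the code below. It caches the mapped values instead of the loaded objects (the cache() call cannot be chained onto println):
val loadedData = loadFromHDFS[Data](path,...)
val keys = loadedData.map(x => x.getUserId + x.getDate).cache()
println(keys.distinct().count())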
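One possible explanation for the 200000 versus 1 difference, assumed here and not confirmed by the question, is that the reader behind loadFromHDFS reuses a single mutable object for every record. A cache that stores references would then hold the same object many times, all showing the last record's values. The Python sketch below reproduces that effect with a hypothetical Record class and read_records generator. It is an illustration of the mechanism, not Spark or Avro code.

```python
import copy


class Record:
    """Hypothetical mutable record, standing in for the generated Data class."""

    def __init__(self):
        self.user_id = None
        self.date = None


def read_records(rows):
    """Hypothetical reader that reuses ONE Record instance for every row."""
    shared = Record()
    for user_id, date in rows:
        shared.user_id, shared.date = user_id, date
        yield shared


rows = [(f"u{i}", "2020-01-01") for i in range(5)]

# Streaming: each record is consumed before the reader overwrites it.
streamed_keys = {r.user_id + r.date for r in read_records(rows)}

# "Caching" the references: the list holds the same object five times.
cached = list(read_records(rows))
cached_keys = {r.user_id + r.date for r in cached}

# Copying each record (or mapping it to an immutable value) before caching.
copied = [copy.copy(r) for r in read_records(rows)]
copied_keys = {r.user_id + r.date for r in copied}

print(len(streamed_keys), len(cached_keys), len(copied_keys))
```

If this is what happens in the Spark job, copying each record or mapping it to plain values before calling cache() would avoid the problem.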

Ruby mongodb: Three newly created objects don't appear to exist during testing

I'm using mongodb to store some data. I have a function that gets the object with the latest timestamp and the one with the oldest. I haven't experienced any issues with this method during development or in production, but when I try to write a test for it, the test fails approximately 20% of the time. I'm using rspec to test this method and I'm not using mongoid or mongomapper. I create three objects with different timestamps but get a nil response, because my dataset contains 0 objects. I have read a lot of articles about write_concern and how "unsafe writes" might be the problem, but I have tried almost every combination of these parameters (w, fsync, j, wtimeout) without any success. Does anyone have any idea how to solve this issue? Perhaps I have focused too much on the write_concern track and the problem lies somewhere else.
This is the method that fetches the latest and oldest timestamp.
def first_and_last_timestamp(customer_id, system_id)
  last = collection(customer_id).
    find({sid:system_id}).
    sort(["t",Mongo::DESCENDING]).
    limit(1).next()
  first = collection(customer_id).
    find({sid:system_id}).
    sort(["t",Mongo::ASCENDING]).
    limit(1).next()
  { min: first["t"], max: last["t"] }
end
I'm inserting data using this method, where data is a JSON object.
def insert(customer_id, data)
  collection(customer_id).insert(data)
end
I have reverted back to use the default for setting up my connection
Mongo::MongoClient.new(mongo_host, mongo_port)
I'm using the gem mongo (1.10.2). I'm not using any fancy setup for my mongo database. I've just installed mongo using brew on my mac and started it. The version of my mongo database is v2.6.1.