Passing a DataFrame from one notebook to another with PySpark

I am trying to use a DataFrame that I created in notebook1 from my notebook2, in Databricks Community Edition with PySpark. I tried this code: dbutils.notebook.run("notebook1", 60, {"dfnumber2"})
but it shows this error:
py4j.Py4JException: Method _run([class java.lang.String, class java.lang.Integer, class java.util.HashSet, null, class java.lang.String]) does not exist
Any help, please?

The actual problem is that you pass the last parameter ({"dfnumber2"}) incorrectly: with this syntax it's a set, not a map. You need to use the syntax {"table_name": "dfnumber2"} to represent it as a dict/map.
But if you look into the documentation of dbutils.notebook.run, you will see the following phrase:
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run, the dbutils.notebook.run() method starts a new job to run the notebook.
But jobs aren't supported on the Community Edition, so it won't work anyway.
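The set-versus-dict distinction described in this answer can be checked in plain Python, without Databricks. This is a minimal sketch; the variable names wrong_args and right_args are made up for illustration. A set literal is consistent with the java.util.HashSet that appears in the error message.

```python
# A brace literal without "key: value" pairs is a set, not a dict.
wrong_args = {"dfnumber2"}
# With "key: value" pairs the same braces build a dict (a map).
right_args = {"table_name": "dfnumber2"}

print(type(wrong_args).__name__)
print(type(right_args).__name__)
```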

Create a global temp view and pass the view name as an argument to your next notebook.
Dfnumber2.createOrReplaceGlobalTempView("dfnumber2")
dbutils.notebook.run("notebook1", 60, {"table_name": "dfnumber2"})
In your notebook1 you can do
table_name = dbutils.widgets.get("table_name")
Dfnumber2 = spark.sql("select * from global_temp." + table_name)

Related

Can we pass dataframes between different notebooks in databricks and sequentially run multiple notebooks? [duplicate]

I have a notebook which processes a file and creates a data frame in a structured format.
Now I need to import that data frame into another notebook, but the problem is that I need to run the first notebook only in some scenarios, so I have to validate a condition before running it.
Usually, to import all data structures, we use %run. But in my case it should be a combination of an if clause and then the notebook run:
if "dataset" in path: %run ntbk_path
This gives an error: "path not exist".
if "dataset" in path: dbutils.notebook.run(ntbk_path)
With this one I cannot get all the data structures.
Can someone help me to resolve this error?
To implement it correctly you need to understand how things are working:
%run is a separate directive that should be put into its own notebook cell; you can't mix it with Python code. Plus, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything that is defined in that notebook (variables, functions, etc.) is available in the caller notebook.
dbutils.notebook.run is a function that takes a notebook path plus parameters and executes the notebook as a separate job on the current cluster. Because it's executed as a separate job, it doesn't share the context with the current notebook, and everything that is defined in it won't be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small max length). One of the problems with dbutils.notebook.run is that scheduling a job takes several seconds, even if the code is very simple.
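As a rough analogy in plain Python (this is not the Databricks API, just an illustration of shared versus isolated context), the difference between the two can be sketched with exec. The names notebook_source and run_isolated are hypothetical.

```python
# Stand-in for the source code of a called notebook (hypothetical content).
notebook_source = "df_name = 'dfnumber2'\nresult = 'done'"

# %run-like behaviour: the code is evaluated in the caller's own namespace,
# so every name it defines is visible to the caller afterwards.
caller_namespace = {}
exec(notebook_source, caller_namespace)
shared_name = caller_namespace["df_name"]


def run_isolated(source: str) -> str:
    """dbutils.notebook.run-like behaviour: private namespace, string result only."""
    private_namespace = {}
    exec(source, private_namespace)
    # Only a string comes back; nothing else defined by the code leaks out.
    return str(private_namespace.get("result", ""))


returned = run_isolated(notebook_source)
print(shared_name, returned)
```

In the isolated case the caller never sees df_name, which is why the answers here hand data over through a named view instead.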
How you can implement what you need?
If you use dbutils.notebook.run, then in the called notebook you can register a temp view, and the caller notebook can read data from it (examples are adapted from this demo).
Called notebook (Code1; it requires two parameters: name for the view name, and n for the number of entries to generate):
name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
df = spark.range(0, n)
df.createOrReplaceTempView(name)
Caller notebook (let's call it main):
if "dataset" in path:
    view_name = "some_name"
    dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
    df = spark.sql(f"select * from {view_name}")
    # ... work with data
It's even possible to do something similar with %run, but it requires a kind of "magic". The foundation of it is the fact that you can pass arguments to the called notebook by using $arg_name="value", and you can even refer to the values specified in the widgets. But in any case, the check of the value will happen in the called notebook.
The called notebook could look as follows:
flag = dbutils.widgets.get("generate_data")
dataframe = None
if flag == "true":
    dataframe = ..... create dataframe
and the caller notebook could look as follows:
------ cell in python
if "dataset" in path:
    gen_data = "true"
else:
    gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)
------- cell for %run
%run ./notebook_name $generate_data=$gen_data
------ again in python
dbutils.widgets.remove("gen_data") # remove widget
if dataframe is not None: # dataframe was created by the called notebook
    # do something with dataframe

splitting columns with str.split() not changing the outcome

I have to use str.split() for an exercise. I have a column called title and it looks like this:
I need to split it into two columns, Name and Season. The following code does not throw an error, but it doesn't seem to be doing anything either when I test it with df.head():
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
Any help as to why?
The code you have in your question is correct, and should be working. The issue could be coming from the execution order of your code though, if you're using Jupyter Notebook or some method that allows for unclear ordering of code execution.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing those lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
print(df.columns)
Alternatively you could add an assertion step in your code to ensure it's working as well:
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"
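To see the split working in isolation, here is a minimal self-contained sketch. The sample titles are made up, since the original column contents are not shown in the question; the extra strip step is an addition to remove the space left after the colon.

```python
import pandas as pd

# Hypothetical sample data; the real "title" values are not shown in the question.
df = pd.DataFrame({"title": ["Show A: Season 1", "Show B: Season 2"]})

# Split each title on the first ':' only and spread the parts over two columns.
df[["Name", "Season"]] = df["title"].str.split(":", n=1, expand=True)

# The part after the colon keeps its leading space, so strip it.
df["Season"] = df["Season"].str.strip()

print(df.columns.tolist())
```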

Using Apache Beam JsonTimePartitioning to create time-partitioned tables in BigQuery

I have tried using the JsonTimePartitioning class in the Apache Beam Java SDK to write data to dynamic tables in BigQuery, but I get "cannot find symbol" for the class JsonTimePartitioning.
This is how I try to import the class:
import com.google.api.services.bigquery.model.JsonTimePartitioning;
and this is how i try to use it in my pipeline
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withJsonTimePartitioningTo(new JsonTimePartitioning().setType("DAY")));
I can't seem to find JsonTimePartitioning anywhere. Can you point to the example that you are trying to follow? The existing methods on BigQueryIO accept either an instance of TimePartitioning or a value provider for a String that is actually a JSON-serialized instance of the same TimePartitioning. In fact, when you call the TimePartitioning version of the method, it still ends up being serialized into a string internally. You can find an example of how it's used here:
Loading historical data into time-partitioned BigQuery tables
To load historical data into a time-partitioned BigQuery table, specify BigQueryIO.Write.withTimePartitioning(com.google.api.services.bigquery.model.TimePartitioning) with a field used for column-based partitioning. For example:
PCollection<Quote> quotes = ...;
quotes.apply(BigQueryIO.write()
    .withSchema(schema)
    .withFormatFunction(quote -> new TableRow()
        .set("timestamp", quote.getTimestamp())
        .set(..other columns..))
    .to("my-project:my_dataset.my_table")
    .withTimePartitioning(new TimePartitioning().setField("time")));

Write to a dynamic BigQuery table through Apache Beam

I am getting the BigQuery table name at runtime and I pass that name to the BigQueryIO.write operation at the end of my pipeline to write to that table.
The code that I've written for it is:
rows.apply("write to BigQuery", BigQueryIO
.writeTableRows()
.withSchema(schema)
.to("projectID:DatasetID."+tablename)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
With this syntax I always get an error,
Exception in thread "main" java.lang.IllegalArgumentException: Table reference is not in [project_id]:[dataset_id].[table_id] format
How to pass the table name with the correct format when I don't know before hand which table it should put the data in? Any suggestions?
Thank You
Very late to the party on this, however:
I suspect the issue is that you were passing in a string, not a table reference.
If you create a table reference, I suspect you'd have no issues with the above code.
import com.google.api.services.bigquery.model.TableReference;

TableReference table = new TableReference()
    .setProjectId(projectID)
    .setDatasetId(DatasetID)
    .setTableId(tablename);
rows.apply("write to BigQuery", BigQueryIO
    .writeTableRows()
    .withSchema(schema)
    .to(table)
    .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
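To check whether a string built at runtime has the shape named in the error message, a small standalone helper can split it into the same three parts that the table reference above sets explicitly. This is a sketch in Python with a deliberately simplified, hypothetical pattern; the actual validation rules in Beam and BigQuery may accept or reject more than this, and parse_table_spec is not a Beam function.

```python
import re

# Simplified, hypothetical pattern for "[project_id]:[dataset_id].[table_id]".
# Assumption: no part may contain ':', '.' or whitespace. The real rules differ.
TABLE_SPEC = re.compile(
    r"^(?P<project>[^:.\s]+):(?P<dataset>[^:.\s]+)\.(?P<table>[^:.\s]+)$"
)


def parse_table_spec(spec: str) -> dict:
    """Split 'project:dataset.table' into its parts or raise ValueError."""
    match = TABLE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"not in project:dataset.table form: {spec!r}")
    return match.groupdict()


parts = parse_table_spec("my-project:my_dataset.my_table")
print(parts)
```

Printing the runtime string with repr before passing it on can reveal stray whitespace or unexpected characters in the table name.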

SnappyData - Error creating Kafka streaming table

I'm seeing an issue when creating a Spark streaming table using Kafka from the snappy shell.
The exception is: 'Invalid input 'C', expected dmlOperation, insert, withIdentifier, select or put (line 1, column 1):'
Reference: http://snappydatainc.github.io/snappydata/streamingWithSQL/#spark-streaming-overview
Here is my sql:
CREATE STREAM TABLE if not exists sensor_data_stream
(sensor_id string, metric string)
using kafka_stream
options (
storagelevel 'MEMORY_AND_DISK_SER_2',
rowConverter 'io.snappydata.app.streaming.KafkaStreamToRowsConverter',
zkQuorum 'localhost:2181',
groupId 'streamConsumer',
topics 'test:01');
The shell seems to not like the script at the first character 'C'. I'm attempting to execute the script using the following command:
snappy> run '/scripts/my_test_sensor_script.sql';
any help appreciated!
There is an inconsistency between the documentation and the actual syntax. The correct syntax is:
CREATE STREAM TABLE sensor_data_stream if not exists (sensor_id string,
metric string) using kafka_stream
options (storagelevel 'MEMORY_AND_DISK_SER_2',
rowConverter 'io.snappydata.app.streaming.KafkaStreamToRowsConverter',
zkQuorum 'localhost:2181',
groupId 'streamConsumer', topics 'test:01');
One more thing you need to do is write a row converter for your data.
Mike, you need to create your own rowConverter class by implementing the following trait:
trait StreamToRowsConverter extends Serializable {
  def toRows(message: Any): Seq[Row]
}
and then specify that rowConverter's fully qualified class name in the DDL.
The rowConverter is specific to a schema.
'io.snappydata.app.streaming.KafkaStreamToRowsConverter' is just a placeholder class name, which should be replaced by your own rowConverter class.