Using apache beam JsonTimePartitioning to create time partitioned tables in bigqiery - google-bigquery

I have tried using the JsonTimePartitioning class in apache beam JAVA sdk to write data to dynamic tables in bigquery but i get "cannot find symbol" for the class JsonTimePartitioning.
this is how i try to import the class
import com.google.api.services.bigquery.model.JsonTimePartitioning;
and this is how i try to use it in my pipeline
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withJsonTimePartitioningTo(new JsonTimePartitioning().setType("DAY")));

I can't seem to find the JsonTimePartitioning anywhere. Can you point to an example that you are trying to follow? The existing methods on BigQueryIO either accept an instance of TimePartiotioning, or a value-provider for a String that is actually a JSON-serialized instance of the same TimePartitioning. And in fact, when calling the TimePartitioning version of the method, you still end up just serializing it into string internally:. You can find an example of how it's used here:
Loading historical data into time-partitioned BigQuery tables To load
historical data into a time-partitioned BigQuery table, specify
BigQueryIO.Write.withTimePartitioning(com.google.api.services.bigquery.model.TimePartitioning)
with a field used for column-based partitioning. For example:
PCollection<Quote> quotes = ...;
quotes.apply(BigQueryIO.write()
.withSchema(schema)
.withFormatFunction(quote -> new TableRow()
.set("timestamp", quote.getTimestamp())
.set(..other columns..))
.to("my-project:my_dataset.my_table")
.withTimePartitioning(new TimePartitioning().setField("time"))); ```

Related

Passing DataFrame from notebook to another with pyspark

i'am trying to call a DataFrame that i created in notebook1 to use it in my notebook2 in Databricks Community addition with pyspark and i tried this code dbutils.notebook.run("notebook1", 60, {"dfnumber2"})
but it shows this error.
py4j.Py4JException: Method _run([class java.lang.String, class java.lang.Integer, class java.util.HashSet, null, class java.lang.String]) does not exist
any help please?
The actual problem is that you pass last parameter ({"dfnumber2"}) incorrectly - with this syntax it's a set, not the map type. You need to use syntax: {"table_name": "dfnumber2"} to represent it as a dict/map.
But if you look into documentation of dbutils.notebook.run, you will see following phrase:
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run, the dbutils.notebook.run() method starts a new job to run the notebook.
But jobs aren't supported on the Community Edition, so it won't work anyway.
Create a global temp view and pass the table name as argument to your next notebook.
Drnumber2.createOrReplaceGlobalTempView("dfnumber2")
dbutils.notebook.run("notebook1", 60, {table_name:"dfnumber2"})
In your notebook1 you can do
table_name= dbutils.widgets.get("table_name")
Dfnumber2 = spark.sql("select * from global_temp."+table_name)

PySpark map function - send n rows instead of one to build a list

I am using Spark 3.x in Python. I have some data (in millions) in CSV files that I have to index in Apache Solr.
I have deployed pysolr module for this purpose
import pysolr
def index_module(row ):
...
solr_client = pysolr.Solr(SOLR_URI)
solr_client.add(row)
...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")
df.toJSON().map(index_module).count()
index_module module simply get one row of data frame as json and then index in Solr via pysolr module. Pysolr support to index list of documents instead of one. I have to update my logic so that instead of sending one document in each request, I'll send a list of document. Definatelty, it will improve the performance.
How can I achieve this in PySpark ? Is there any alternative or best approach instead of map and toJSON ?
Also, My all activities are completed in transformation functions. I am using count just to start the job. Is there any alternative dummy function (of action type) in spark to do the same?
Finally, I have to create Solr Object each time, is there any alternative for this ?

Write to a dynamic BigQuery table through Apache Beam

I am getting the BigQuery table name at runtime and I pass that name to the BigQueryIO.write operation at the end of my pipeline to write to that table.
The code that I've written for it is:
rows.apply("write to BigQuery", BigQueryIO
.writeTableRows()
.withSchema(schema)
.to("projectID:DatasetID."+tablename)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
With this syntax I always get an error,
Exception in thread "main" java.lang.IllegalArgumentException: Table reference is not in [project_id]:[dataset_id].[table_id] format
How to pass the table name with the correct format when I don't know before hand which table it should put the data in? Any suggestions?
Thank You
Very late to the party on this however.
I suspect the issue is you were passing in a string not a table reference.
If you created a table reference I suspect you'd have no issues with the above code.
com.google.api.services.bigquery.model.TableReference table = new TableReference()
.setProjectId(projectID)
.setDatasetId(DatasetID)
.setTableId(tablename);
rows.apply("write to BigQuery", BigQueryIO
.writeTableRows()
.withSchema(schema)
.to(table)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));

Converting map into tuple

I am loading hbase table using pig.
product = LOAD 'hbase://product' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:*', '-loadKey true') AS (id:bytearray, a:map[])
The relation product has a tuple that has map in it. I want to convert the map data into tuple
Here is the sample..
grunt>dump product;
06:177602927,[cloud_service#true,wvilnk#true,cmpgeo#true,cmplnk#true,webvid_standard#true,criteria_search#true,typeahead_search#true,aasgbr#true,lnkmin#false,aasdel#true,aasmcu#true,aasvia#true,lnkalt#false,aastlp#true,cmpeel#true,aasfsc#true,aasser#true,aasdhq#true,aasgbm#true,gboint#true,lnkupd#true,aasbig#true,webvid_basic#true,cmpelk#true]
06:177927527,[cloud_service#true,wvilnk#true,cmpgeo#true,cmplnk#true,webvid_standard#true,criteria_search#true,typeahead_search#true,aasgbr#false,lnkmin#false,aasdel#false,aasmcu#false,aasvia#false,lnkalt#false,aastlp#true,cmpeel#true,aasfsc#false,aasser#false,aasdhq#true,aasgbm#false,gboint#true,lnkupd#true,aasbig#false,webvid_basic#true,cmpelk#true,blake#true]
I want to convert each tuple into individual records like below
177602927,cloud_service,true
177602927,wvilnk,true
177602927,cmpgeo,true
177602927,cmpgeo,true
I am pretty new to pig and perhaps this is my first time to do something with Pig Latin. Any help is much appreciated.
I was able to find a fix for my problem.
I used a UDF called MapEntriesToBag which will convert all the maps into bags.
Here is my code.
>register /your/path/to/this/Jar/Pigitos-1.0-SNAPSHOT.jar
>DEFINE MapEntriesToBag pl.ceon.research.pigitos.pig.udf.MapEntriesToBag();
>product = LOAD 'hbase://product' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:*', '-loadKey true') AS (id:bytearray, a:map[])
>b = foreach product generate flatten(SUBSTRING($0,3,12)), flatten(MapEntriesToBag($1));
The UDF is available in the Jar Pigitos-1.0-SNAPSHOT.jar. You can download this jar from here
For more information you can refer to this link. It has more interesting UDF's related to Map datatype.

WEKA instances from SQL query data

I am wondering how to create the instance in WEKA without actually using CSV or ARRF file. i mean if ihave result of SQL query having four string fields, than how can i used that result to create instance.
Assuming you are not using Weka Explorer and are coding this yourself, something like this should help you make an instance (in this code str1-str4 is your SQL result strings)! Note: This code requires you to load an .arff file with at least a couple of instances that are the same as what your new instances will be like!
Instance inst = new DenseInstance(4);
inst.setValue(attr1, str1);
inst.setValue(attr2, str2);
inst.setValue(attr3, str3);
inst.setValue(attr4, str4);
inst.setDataset(instanceList);
System.out.println("New instance: " + inst);