How to define TransferConfig for BigQuery Data Transfer Service

The source projectId and source datasetId are supposed to be defined in the Params. But I am not sure how to set Params properly.
TransferConfig transferConfig = TransferConfig.newBuilder()
.setDisplayName(jobName)
.setDestinationDatasetId(dstDatasetId)
.setParams(Struct.newBuilder().build())
.build();
The corresponding CLI command looks like this:
bq mk --transfer_config \
--data_source="cross_region_copy" \
--display_name=copy-cli-display-name \
--project_id=play \
--target_dataset=copy_dataset_cli \
--params='{"source_project_id": "tough-talent", "source_dataset_id": "billing"}'

Google BigQuery Data Transfer Service provides a number of client libraries that offer a flexible way to build data transfer capabilities on top of the BigQuery Data Transfer API.
Assuming your aim is to create the transfer with the BigQuery Data Transfer Java client library, you are looking for the TransferConfig.Builder class to create the data transfer configuration. This class contains dedicated methods for declaring the destination BigQuery dataset, while the source transfer settings are enclosed in the setParams() method, which takes a Struct for propagating key-value structured parameters; Struct comes from the google.protobuf package.
Since, as the documentation states:
Struct represents a structured data value, consisting of fields
which map to dynamically typed values.
you can use either the putAllFields() or the putFields() method on Struct.newBuilder() to map key-value parameters, as demonstrated in this example:
Struct struct1 = Struct.newBuilder().putFields(
"some-key", Value.newBuilder().setStringValue("some-value").build()
).build();
Find more related examples here.
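Putting it together for the cross_region_copy case in the question, a sketch along these lines should work (parameter keys taken from the bq call above; jobName and dstDatasetId as in your snippet):
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;

// Source project and dataset go into the params Struct.
Struct params = Struct.newBuilder()
    .putFields("source_project_id",
        Value.newBuilder().setStringValue("tough-talent").build())
    .putFields("source_dataset_id",
        Value.newBuilder().setStringValue("billing").build())
    .build();

TransferConfig transferConfig = TransferConfig.newBuilder()
    .setDisplayName(jobName)
    .setDataSourceId("cross_region_copy")
    .setDestinationDatasetId(dstDatasetId)
    .setParams(params)
    .build();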

Related

How do I launch a SmartSim orchestrator without models?

I'm trying to prototype using the SmartRedis Python client to interact with the SmartSim Orchestrator. Is it possible to launch the orchestrator without any other models in the experiment? If so, what would be the best way to do so?
It is entirely possible to do that. A SmartSim Experiment can contain different types of 'entities', including Models, Ensembles (i.e. groups of Models), and the Orchestrator (i.e. the Redis-backed database). None of these entities, however, is required to be in the Experiment.
Here's a short script that creates an experiment which includes only a database.
from smartsim import Experiment  # note: the package name is lowercase

NUM_DB_NODES = 3

exp = Experiment("Database Only")
# create_database builds an Orchestrator with the requested number of shards
db = exp.create_database(db_nodes=NUM_DB_NODES)
exp.generate(db)
exp.start(db)
After this, the Orchestrator (with the number of shards specified by NUM_DB_NODES) will have been spun up. You can then connect the Python client with the following lines:
import smartredis

client = smartredis.Client(db.get_address()[0], NUM_DB_NODES > 1)

How to read ABAP code using a java client

I have a requirement where I need to read ABAP code written by SAP developers. I want to write my own client in Java/Python that can integrate with the SAP system and fetch the ABAP code for me.
As I understand it, ABAP code is stored in the SAP database (HANA, MySQL, etc.). So is there a way provided by SAP to read the code, like we can with Git/SVN, etc.?
I've used the RFC calls RPY_FUNCTIONMODULE_READ and RPY_FUNCTIONMODULE_READ_NEW through the Perl NWRFC wrapper/library to retrieve ABAP code.
You can access the tables with the techniques below:
Using SAP Connectors via RFC (RFC_READ_TABLE)
Using a SOAP web service with the same function (RFC_READ_TABLE)
Using custom web services that wrap existing functions for reading reports, function modules, etc.
You can use either Java or Python for RFC; there is already a GitHub repo for Python. A hedged Java sketch follows below.
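On the Java side, a minimal sketch with the SAP Java Connector (JCo 3) calling RFC_READ_TABLE might look like the following. The destination name ABAP_AS is a placeholder for your own *.jcoDestination configuration, TRDIR (the report directory) is just an example table, and keep in mind that RFC_READ_TABLE returns rows as flat 512-character strings:
import com.sap.conn.jco.JCoDestination;
import com.sap.conn.jco.JCoDestinationManager;
import com.sap.conn.jco.JCoException;
import com.sap.conn.jco.JCoFunction;
import com.sap.conn.jco.JCoTable;

public class ReadTableViaRfc {
    public static void main(String[] args) throws JCoException {
        // "ABAP_AS" refers to an ABAP_AS.jcoDestination properties file with the connection data.
        JCoDestination dest = JCoDestinationManager.getDestination("ABAP_AS");

        JCoFunction fn = dest.getRepository().getFunction("RFC_READ_TABLE");
        fn.getImportParameterList().setValue("QUERY_TABLE", "TRDIR"); // example: report directory
        fn.getImportParameterList().setValue("DELIMITER", "|");

        // The WHERE clause goes into the OPTIONS table, 72 characters per row.
        JCoTable options = fn.getTableParameterList().getTable("OPTIONS");
        options.appendRow();
        options.setValue("TEXT", "NAME LIKE 'Z%'");

        fn.execute(dest);

        // Each result row comes back as a single WA string.
        JCoTable data = fn.getTableParameterList().getTable("DATA");
        for (int i = 0; i < data.getNumRows(); i++) {
            data.setRow(i);
            System.out.println(data.getString("WA"));
        }
    }
}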
If you choose to read directly from the database tables, you need to know the structure of the saved data. SAP has its own storage mechanism for OOP objects. Daniel Berlin tried to implement a binary parser in C++ in the sap-reposrc-decompressor project. Never forget that this source format depends on the SAP version.
I think using the ADT (ABAP Development Tools) plugin is a good option for up-to-date systems. There is already an Eclipse plugin for ADT, but ADT does not exist on old systems.
If you are planning to use your solution on an older system (7.01 or later), you can build your own solution with abapGit and custom web services.
NOTE: Keep in mind that reports and data elements (variables, tables, types) are saved in separate tables. Dynpro objects (screens, etc.) and Smart Forms are hard things to decompile.
Before you re-invent the wheel, take a look at:
abapGit: https://docs.abapgit.org/
or the old SAPlink: https://wiki.scn.sap.com/wiki/display/ABAP/SAPlink
If you want JUST the source code, you could expose a very simple REST service/endpoint in SAP.
This service would just read the raw code and return it as plain text.
Any ABAPer could create this for you.
BUT this is the raw source only. There is much more to a complete development object,
which is why tools like abapGit exist.
In SICF, create a new endpoint/service,
e.g. ZCODE_MONKEY, with the class below as the handler.
Now activate the service.
Call the endpoint:
http://server:PORT/zcode_monkey?name=ZCODE_MONKEY
Sample implementation
CLASS zcode_monkey DEFINITION
  PUBLIC
  CREATE PUBLIC.

  PUBLIC SECTION.
    INTERFACES: if_http_extension.
ENDCLASS.

CLASS zcode_monkey IMPLEMENTATION.

  METHOD if_http_extension~handle_request.
    DATA: lo_src     TYPE REF TO cl_oo_source,
          l_name     TYPE string,
          l_repname  TYPE c LENGTH 30,
          l_clskey   TYPE seoclskey,
          l_source   TYPE rswsourcet,
          resultcode TYPE string.
    FIELD-SYMBOLS: <line> TYPE LINE OF rswsourcet.

    l_name    = server->request->get_form_field( name = 'NAME' ).
    l_clskey  = l_name.
    l_repname = l_name.

    CREATE OBJECT lo_src
      EXPORTING
        clskey             = l_clskey
      EXCEPTIONS
        class_not_existing = 1
        OTHERS             = 2.
    IF sy-subrc <> 0.
      " Not a class: fall back to reading a plain report
      READ REPORT l_repname INTO l_source.
    ELSE.
      lo_src->read( ).
      lo_src->if_oo_clif_source~get_source( IMPORTING source = l_source ).
    ENDIF.

    LOOP AT l_source ASSIGNING <line>.
      CONCATENATE resultcode
                  cl_abap_char_utilities=>cr_lf
                  <line>
             INTO resultcode RESPECTING BLANKS. " always show respect ;)
    ENDLOOP.

    server->response->set_content_type( content_type = 'text/plain' ).
    server->response->set_cdata( EXPORTING data = resultcode ).
    server->response->set_status(
      EXPORTING
        code   = 200
        reason = 'this is a 3.50 piece of code. Dont ask...its a demo' ).
  ENDMETHOD.

ENDCLASS.

How to avoid duplicates in BigQuery by streaming with Apache Beam IO?

We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (for BigQuery and Postgres) and then inserted into both sinks.
But we are seeing duplicates in both sinks (Postgres was kind of fixed with a unique constraint and an "ON CONFLICT... DO NOTHING").
At first we trusted the "insertId" UUID that Apache Beam/BigQuery supposedly creates.
Then we added a "unique_label" attribute to each message before queueing them into PubSub, using data from the JSON itself, which gives them uniqueness (a device_id + a reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want since we want near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LAST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)

var mappings = ""

// Value only available at runtime
if (options.schemaFile.isAccessible) {
    mappings = readCloudFile(options.schemaFile.get())
}

val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)

val pubsubMessages =
    pipeline
        .apply("ReadPubSubMessages",
            PubsubIO
                .readMessagesWithAttributes()
                .withIdAttribute("id_label")
                .fromTopic(options.pubSubInput))

pubsubMessages
    .apply("AckPubSubMessages", ParDo.of(object : DoFn<PubsubMessage, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
            context.output("")
        }
    }))

val disarmedMessages =
    pubsubMessages
        .apply("DisarmedPubSubMessages",
            DisarmPubsubMessage(tableRowMapper, postgresMapper)
        )

disarmedMessages
    .get(TupleTags.readingErrorTag)
    .apply("LogDisarmedErrors", ParDo.of(object : DoFn<String, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info(context.element())
            context.output("")
        }
    }))

disarmedMessages
    .get(TupleTags.tableRowTag)
    .apply("WriteToBigQuery",
        BigQueryIO
            .writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
            .to(options.bigQueryOutput)
    )

pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to produce TableRow and ReadingsInputFlatten (our own class for Postgres) outputs.
We expect zero duplicates, or at least a "best effort" (we can add some cleaning cron job); we paid for these products to run statistics and big data analysis...
[UPDATE 1]
I even appended a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
new flow with AckPubSubMessages step
Thanks!!
Looks like you are using the global window. One technique would be to window this into an N-minute window, then process the keys in each window and drop any items with duplicate keys.
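A minimal sketch of that idea with the Beam Java SDK, assuming rows is a PCollection<TableRow> and that each row carries its unique key in a hypothetical unique_id field (the window length and field name are illustrative):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<TableRow> deduped = rows
    // Replace the global window with fixed windows so grouping can fire.
    .apply("FixedWindows", Window.<TableRow>into(FixedWindows.of(Duration.standardMinutes(5))))
    // Key every row by its unique identifier.
    .apply("KeyByUniqueId", WithKeys.of(new SerializableFunction<TableRow, String>() {
      @Override
      public String apply(TableRow row) {
        return (String) row.get("unique_id"); // hypothetical field: device_id + timestamp
      }
    }))
    // Group duplicates together...
    .apply("GroupByUniqueId", GroupByKey.<String, TableRow>create())
    // ...and keep only one element per key.
    .apply("TakeFirstPerKey", ParDo.of(new DoFn<KV<String, Iterable<TableRow>>, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().getValue().iterator().next());
      }
    }));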
The officially supported SDK languages are Python and Java; your code appears to be Kotlin, and as far as I know that is not officially supported. I strongly recommend using Java to avoid relying on unsupported features.
In addition, I would recommend the following approaches for dealing with duplicates; option 2 could meet your near-real-time requirement:
message_id. You have probably already read the FAQ on duplicates, which points to deprecated documentation. However, if you check the PubsubMessage object you will notice that messageId is still available, and it will be populated by the server if not set by the publisher:
"ID of this message, assigned by the server when the message is
published ... It must not be populated by the publisher in a
topics.publish call"
BigQuery Streaming. To deal with duplicates while loading the data, you can create a UUID right before inserting into BQ. Please refer to the section Example sink: Google BigQuery.
Try the Dataflow template PubSubToBigQuery and validate that there are no duplicates in BQ.

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo.
How can I do this using the Apache Beam BigQueryIO API?
This is possible using a feature recently added to BigQueryIO in Apache Beam.
PCollection<Foo> foos = ...;
foos.apply(BigQueryIO.write().to(new SerializableFunction<ValueInSingleWindow<Foo>, TableDestination>() {
  @Override
  public TableDestination apply(ValueInSingleWindow<Foo> value) {
    Foo foo = value.getValue();
    // Also available: value.getWindow(), getTimestamp(), getPane()
    String tableSpec = ...;
    String tableDescription = ...;
    return new TableDestination(tableSpec, tableDescription);
  }
}).withFormatFunction(new SerializableFunction<Foo, TableRow>() {
  @Override
  public TableRow apply(Foo foo) {
    return ...;
  }
}).withSchema(...));
Depending on whether the input PCollection<Foo> is bounded or unbounded, under the hood this will either create multiple BigQuery import jobs (one or more per table depending on amount of data), or it will use the BigQuery streaming inserts API.
The most flexible version of the API uses DynamicDestinations, which allows you to write different values to different tables with different schemas, and even allows you to use side inputs from the rest of the pipeline in all of these computations.
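For illustration, a hedged sketch of the DynamicDestinations form, assuming a hypothetical Foo.getType() method is used to pick the table; the table spec and schema here are made up:
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import java.util.Collections;

foos.apply(BigQueryIO.<Foo>write()
    .to(new DynamicDestinations<Foo, String>() {
      @Override
      public String getDestination(ValueInSingleWindow<Foo> element) {
        // Any serializable value can act as the destination key.
        return element.getValue().getType(); // hypothetical accessor
      }

      @Override
      public TableDestination getTable(String type) {
        return new TableDestination(
            "my-project:my_dataset.table_" + type, "Table for type " + type);
      }

      @Override
      public TableSchema getSchema(String type) {
        // Each destination may return its own schema; here they all share one field.
        return new TableSchema().setFields(Collections.singletonList(
            new TableFieldSchema().setName("payload").setType("STRING")));
      }
    })
    .withFormatFunction(foo -> new TableRow().set("payload", foo.toString())));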
Additionally, BigQueryIO has been refactored into a number of reusable transforms that you can combine yourself to implement more complex use cases - see the files in the source directory.
This feature will be included in the first stable release of Apache Beam and into the next release of Dataflow SDK (which will be based on the first stable release of Apache Beam). Right now you can use this by running your pipeline against a snapshot of Beam at HEAD from github.
As of Beam 2.12.0, this feature is available in the Python SDK as well. It is marked as experimental, so you will have to pass --experiments use_beam_bq_sink to enable it. You'd do something like so:
def get_table_name(element):
    if meets_some_condition(element):
        return 'mytablename1'
    else:
        return 'mytablename2'

p = beam.Pipeline(...)
my_input_pcoll = p | ReadInMyPCollection()
my_input_pcoll | beam.io.gcp.bigquery.WriteToBigQuery(table=get_table_name)
The new sink supports a number of other options, which you can review in the pydoc

Inferring topics with mallet, using the saved topic state

I've used the following command to generate a topic model from some documents:
bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
I have not, however, used the --output-model option to generate a serialized topic trainer object. Is there any way I can use the state file to infer topics for new documents? Training is slow, and it'll take a couple of days for me to retrain, if I have to create the serialized model from scratch.
We did not use the command line tools shipped with mallet; we just used the mallet API to create the serialized model for inference on new documents. Two points need special attention:
You need to serialize the pipes you used right after you finish training (in my case, a SerialPipes).
Of course the model also needs to be serialized after you finish training (in my case, a ParallelTopicModel); a sketch of both steps follows the Javadoc links below.
Please check the Javadoc:
http://mallet.cs.umass.edu/api/cc/mallet/pipe/SerialPipes.html
http://mallet.cs.umass.edu/api/cc/mallet/topics/ParallelTopicModel.html
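For reference, a rough sketch of that flow with the mallet API; the file names, the inference parameters, and the way you build the training InstanceList are placeholders for whatever your pipeline actually uses:
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import java.io.File;

public class TopicModelPersistence {

    // Call this right after training finishes.
    static void saveArtifacts(InstanceList training, ParallelTopicModel model) {
        // Saving the InstanceList also preserves the (Serial)Pipes it was built with.
        training.save(new File("training.instances"));
        model.write(new File("topic-model.mallet"));
    }

    // Call this later to infer topics for a new document.
    static double[] inferNewDocument(String text) throws Exception {
        InstanceList training = InstanceList.load(new File("training.instances"));
        ParallelTopicModel model = ParallelTopicModel.read(new File("topic-model.mallet"));

        // New documents must go through the SAME pipes used for the training data.
        InstanceList newDocs = new InstanceList(training.getPipe());
        newDocs.addThruPipe(new Instance(text, null, "new-doc", null)); // data, target, name, source

        TopicInferencer inferencer = model.getInferencer();
        // numIterations = 100, thinning = 10, burnIn = 10 (illustrative values)
        return inferencer.getSampledDistribution(newDocs.get(0), 100, 10, 10);
    }
}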
Restoring a model from the state file appears to be a new feature in mallet 2.0.7 according to the release notes.
Ability to restore models from gzipped "state" files. From the new
TopicTrainer, use the --input-state [filename] argument. Note that you
can manually edit this file. Any token with topic set to -1 will be
immediately resampled upon loading.
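So, assuming mallet 2.0.7, something along these lines should resume from the saved state and write out the serialized model this time (only options already mentioned above are used):
bin/mallet train-topics --input topic-input.mallet --num-topics 100 --input-state topic-state.gz --output-model topic-model.mallet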
If you mean you want to see how new documents fit into a previously trained topic model, then I'm afraid there is no simple command you can use to do it right.
The class cc.mallet.topics.LDA in mallet 2.0.7's source code provides such a utility; try to understand it and use it in your program.
P.S., If my memory serves, there is some problem with the implementation of the function in that class:
public void addDocuments(InstanceList additionalDocuments,
int numIterations, int showTopicsInterval,
int outputModelInterval, String outputModelFilename,
Randoms r)
You have to rewrite it.