Apache Beam: use runtime options to read BigQuery table data

I am trying to build an Apache Beam Dataflow pipeline that reads data both from a BigQuery table and from a BigQuery SQL query, and applies some transformations to it.
I am trying to build a template with a few runtime parameters, but the template build fails with the following error:
java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=tableDate, default=null}
at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.get (ValueProvider.java:254)
The pipeline options are defined as:
public interface BeamOptions extends DataflowPipelineOptions {
    void setTableDate(ValueProvider<String> value);
    ValueProvider<String> getTableDate();
    void setTableSuffix(ValueProvider<String> value);
    ValueProvider<String> getTableSuffix();
}
The pipeline is defined as follows:
public static void main(String[] args) {
    BeamOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(BeamOptions.class);
    Pipeline pipeline = Pipeline.create(options);

    String someQuery = "select * from table1 where date = replace_date";
    // here trying to use a date passed in from the runtime options
    String updatedQuery = someQuery.replaceAll("replace_date", options.getTableDate().get());

    // throws the error above while building the template
    PCollection<TableRow> sqlRows = pipeline.apply("extract-bq-data",
        BigQueryIO.readTableRows().fromQuery(updatedQuery));

    // using the runtime tableSuffix also throws the error while building the template
    TableReference tableSpec = new TableReference()
        .setProjectId("xyz")
        .setDatasetId("dataset")
        .setTableId("tableSuffix".concat(options.getTableSuffix().get()));

    PCollection<TableRow> tableRows = pipeline.apply("Read from BigQuery table",
        BigQueryIO.readTableRows().from(tableSpec));
}
Is there any way to pass runtime options and use them to read data from BigQuery?
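For reference, BigQueryIO also accepts ValueProvider arguments directly, which avoids calling get() at template-construction time. Below is a minimal sketch of that pattern, assuming Beam's NestedValueProvider; the query string and transform name are taken from the code above, and withoutValidation() is a common addition when building templates.
// Sketch only: defer reading the runtime value until the job actually runs.
PCollection<TableRow> sqlRows = pipeline.apply("extract-bq-data",
    BigQueryIO.readTableRows()
        .fromQuery(ValueProvider.NestedValueProvider.of(
            options.getTableDate(),
            (String date) -> "select * from table1 where date = " + date))
        .withoutValidation());
BigQueryIO.readTableRows().from(...) likewise accepts a ValueProvider<String> table spec, so the tableSuffix case can be handled the same way.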

Related

Java - Insert a single row at a time into Google BigQuery?

I am creating an application where, every time a user clicks on an article, I need to capture the article data and the user data to calculate the reach of every article and be able to run analytics on the reach data.
My application is on App Engine.
When I check the documentation for inserts into BQ, most of it points towards bulk inserts in the form of jobs or streams.
Question:
Is it even good practice to insert into BigQuery one row at a time every time a user action is initiated? If so, could you point me to some Java code to effectively do this?
There are limits on the number of load jobs and DML queries (1,000 per day), so you'll need to use streaming inserts for this kind of application. Note that streaming inserts are different from loading data from a Java stream.
TableId tableId = TableId.of(datasetName, tableName);
// Values of the row to insert
Map<String, Object> rowContent = new HashMap<>();
rowContent.put("booleanField", true);
// Bytes are passed in base64
rowContent.put("bytesField", "Cg0NDg0="); // 0xA, 0xD, 0xD, 0xE, 0xD in base64
// Records are passed as a map
Map<String, Object> recordsContent = new HashMap<>();
recordsContent.put("stringField", "Hello, World!");
rowContent.put("recordField", recordsContent);
InsertAllResponse response =
bigquery.insertAll(
InsertAllRequest.newBuilder(tableId)
.addRow("rowId", rowContent)
// More rows can be added in the same RPC by invoking .addRow() on the builder
.build());
if (response.hasErrors()) {
// If any of the insertions failed, this lets you inspect the errors
for (Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
// inspect row error
}
}
(From the example at https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery-stream-data-java)
Note especially that a failed insert does not always throw an exception. You must also check the response object for errors.
Is it even a good practice to insert into big Query one row at a time every time a user action is initiated ?
Yes, it's pretty typical to stream event data to BigQuery for analytics. You could get better performance if you buffer multiple events into the same streaming insert request to BigQuery, but one row at a time is definitely supported.
A simplified version of Google's example.
Map<String, Object> row1Data = new HashMap<>();
row1Data.put("booleanField", true);
row1Data.put("stringField", "myString");
Map<String, Object> row2Data = new HashMap<>();
row2Data.put("booleanField", false);
row2Data.put("stringField", "myOtherString");
TableId tableId = TableId.of("myDatasetName", "myTableName");
InsertAllResponse response =
bigQuery.insertAll(
InsertAllRequest.newBuilder(tableId)
.addRow("row1Id", row1Data)
.addRow("row2Id", row2Data)
.build());
if (response.hasErrors()) {
// If any of the insertions failed, this lets you inspect the errors
for (Map.Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
// inspect row error
}
}
You can use the Cloud Logging API to write one row at a time.
https://cloud.google.com/logging/docs/reference/libraries
Sample code from the documentation:
public class QuickstartSample {
/** Expects a new or existing Cloud log name as the first argument. */
public static void main(String... args) throws Exception {
// Instantiates a client
Logging logging = LoggingOptions.getDefaultInstance().getService();
// The name of the log to write to
String logName = args[0]; // "my-log";
// The data to write to the log
String text = "Hello, world!";
LogEntry entry =
LogEntry.newBuilder(StringPayload.of(text))
.setSeverity(Severity.ERROR)
.setLogName(logName)
.setResource(MonitoredResource.newBuilder("global").build())
.build();
// Writes the log entry asynchronously
logging.write(Collections.singleton(entry));
System.out.printf("Logged: %s%n", text);
}
}
In this case you need to create a sink from the logs. The messages will then be redirected to the BigQuery table.
https://cloud.google.com/logging/docs/export/configure_export_v2
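The sink can also be created programmatically. A minimal sketch, assuming the google-cloud-logging Java client; the sink and dataset names are illustrative, and the export can equally be configured in the console as described at the link above.
// Sketch: create a log sink whose destination is a BigQuery dataset in the current project.
Logging logging = LoggingOptions.getDefaultInstance().getService();
Sink sink = logging.create(
    SinkInfo.of("my-bq-sink", SinkInfo.Destination.DatasetDestination.of("my-dataset")));
Log entries matching the sink (all of them, unless a filter is configured) are then exported to tables in that dataset.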

How to catch any exceptions thrown by BigQueryIO.Write and rescue the data which is failed to output?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each element contains a table ID indicating where the data itself should be saved.
There are various reasons why writing to BigQuery can fail:
The table ID format is wrong.
The dataset does not exist.
The dataset does not allow the pipeline access.
Network failure.
When one of these failures occurs, the streaming job retries the task and stalls. I tried using WriteResult.getFailedInserts() in order to rescue the bad data and avoid stalling, but it did not work well. Is there a good way to handle this?
Here is my code:
public class StarterPipeline {
private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);
public class MyData implements Serializable {
String table_id;
}
public interface MyOptions extends PipelineOptions {
@Description("PubSub topic to read from, specified as projects/<project_id>/topics/<topic_id>")
@Validation.Required
ValueProvider<String> getInputTopic();
void setInputTopic(ValueProvider<String> value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<MyData> input = p
.apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("ParseJSON", MapElements.into(TypeDescriptor.of(MyData.class))
.via((String text) -> new Gson().fromJson(text, MyData.class)));
WriteResult writeResult = input
.apply("WriteToBigQuery", BigQueryIO.<MyData>write()
.to(new SerializableFunction<ValueInSingleWindow<MyData>, TableDestination>() {
@Override
public TableDestination apply(ValueInSingleWindow<MyData> input) {
MyData myData = input.getValue();
return new TableDestination(myData.table_id, null);
}
})
.withSchema(new TableSchema().setFields(new ArrayList<TableFieldSchema>() {{
add(new TableFieldSchema().setName("table_id").setType("STRING"));
}}))
.withFormatFunction(new SerializableFunction<MyData, TableRow>() {
@Override
public TableRow apply(MyData myData) {
return new TableRow().set("table_id", myData.table_id);
}
})
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));
writeResult.getFailedInserts()
.apply("LogFailedData", ParDo.of(new DoFn<TableRow, TableRow>() {
@ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
LOG.info(row.get("table_id").toString());
}
}));
p.run();
}
}
There is no easy way to catch exceptions when writing to output in a pipeline definition. I suppose you could do it by writing a custom PTransform for BigQuery. However, there is no way to do it natively in Apache Beam. I also recommend against this because it undermines Cloud Dataflow's automatic retry functionality.
In your code example, you have the failed insert retry policy set to never retry. You can set the policy to always retry. This is only effective during something like an intermittent network failure (4th bullet point).
.withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
If the table ID format is incorrect (1st bullet point), then the CREATE_IF_NEEDED create disposition configuration should allow the Dataflow job to automatically create a new table without error, even if the table ID is incorrect.
If the dataset does not exist or there is an access permission issue to the dataset (2nd and 3rd bullet points), then my opinion is that the streaming job should stall and ultimately fail. There is no way to proceed under any circumstances without manual intervention.
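If stalling on bad rows is the main concern, one additional pattern (not from the answer above; the dead-letter table name is hypothetical) is to keep the neverRetry policy and route the failed inserts into a separate BigQuery table instead of only logging them:
// Sketch: persist rows that failed to insert into a dead-letter table for later inspection.
writeResult.getFailedInserts()
    .apply("WriteFailedRows", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.failed_inserts")  // hypothetical dead-letter table
        .withSchema(new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("table_id").setType("STRING"))))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The failed rows have the shape produced by the format function above, so the dead-letter schema only needs the table_id field.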

BigQuery in Dataflow fails to load data from Cloud Storage: JSON object specified for non-record field

I have a Dataflow pipeline running locally on my machine that writes to BigQuery. BigQuery in this batch job requires a temporary location, and I have provided one in my Cloud Storage bucket. The relevant parts are:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
.setTempLocation("gs://folder/temp");
Pipeline p = Pipeline.create(options);
....
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("uuid").setType("STRING"));
fields.add(new TableFieldSchema().setName("start_time").setType("TIMESTAMP"));
fields.add(new TableFieldSchema().setName("end_time").setType("TIMESTAMP"));
TableSchema schema = new TableSchema().setFields(fields);
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.to("myproject:db.table"));
Where for FormatAsTableRowFn I have:
static class FormatAsTableRowFn extends DoFn<KV<String, String>, TableRow>
implements RequiresWindowAccess{
@Override
public void processElement(ProcessContext c) {
TableRow row = new TableRow()
.set("uuid", c.element().getKey())
// include a field for the window timestamp
.set("start_time", ((IntervalWindow) c.window()).start().toInstant()) //NOTE: I tried both with and without
.set("end_time", ((IntervalWindow) c.window()).end().toInstant()); // .toInstant receiving the same error
c.output(row);
}
}
If I print out row.toString() I will get legit timestamps:
{uuid=00:00:00:00:00:00, start_time=2016-09-22T07:34:38.000Z, end_time=2016-09-22T07:39:38.000Z}
When I run this code, Java says: Failed to create the load job beam_job_XXX
Manually inspecting the temp folder in GCS, the objects look like:
{"mac":"00:00:00:00:00:00","start_time":{"millis":1474529678000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false},"end_time":{"millis":1474529978000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false}}
Looking at the failed job report in BigQuery, the Error says:
JSON object specified for non-record field: start_time (error code: invalid)
This is very strange, because I am pretty sure I said this is a TIMESTAMP, and I am 100% sure my schema in BigQuery conforms with the TableSchema in the SDK. (NOTE: setting withCreateDisposition to CREATE_IF_NEEDED yields the same result.)
Could someone please tell me how I need to remedy this to get the data inside BigQuery?
Don't use Instant objects. Try using milliseconds/seconds.
https://cloud.google.com/bigquery/data-types
A positive number specifies the number of seconds since the epoch
So, something like this should work:
.getMillis() / 1000
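Applied to the DoFn above, that suggestion looks roughly like the following sketch; BigQuery's TIMESTAMP type accepts epoch seconds as a numeric value.
// Sketch: emit epoch seconds instead of Joda Instant objects.
TableRow row = new TableRow()
    .set("uuid", c.element().getKey())
    .set("start_time", ((IntervalWindow) c.window()).start().getMillis() / 1000)
    .set("end_time", ((IntervalWindow) c.window()).end().getMillis() / 1000);
c.output(row);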

Custom Liquibase executor combining JdbcExecutor and LoggingExecutor

I'm looking for a way to record and write to an output file all the SQL statements that get
executed while running a Liquibase migration against an empty target database.
The idea behind this is to speed up the initialization phase of integration tests
against a test database by simply reading and executing the SQL statements from the
generated file for subsequent tests.
I had no luck using updateSQL due to different handling of changesets with
pre-conditions (e.g. changeSetExecuted resolves to true for "update" but false for
"updateSQL").
Another approach was to run the Liquibase migration first, then write a temporary changelog file
using GenerateChangeLogCommand, which is finally used by another Liquibase instance to produce an
SQL update file.
While this approach works, it a) feels a bit hackish, and b) the end result is not the same as
running the migration directly.
Anyway, what I've come up with is a custom implementation of JdbcExecutor which incorporates
a LoggingExecutor. The implementation looks as follows:
@LiquibaseService(skip = true)
public class LoggingJdbcExecutor extends JdbcExecutor {
private LoggingExecutor loggingExecutor;
public LoggingJdbcExecutor(Database database, Writer writer) {
loggingExecutor = new LoggingExecutor(this, writer, database);
setDatabase(database);
}
@Override
public void execute(SqlStatement sql, List<SqlVisitor> sqlVisitors) throws DatabaseException {
super.execute(sql, sqlVisitors);
loggingExecutor.execute(sql, sqlVisitors);
}
@Override
public int update(SqlStatement sql, List<SqlVisitor> sqlVisitors) throws DatabaseException {
final int result = super.update(sql, sqlVisitors);
loggingExecutor.update(sql, sqlVisitors);
return result;
}
@Override
public void comment(String message) throws DatabaseException {
super.comment(message);
loggingExecutor.comment(message);
}
}
This executor gets injected into Liquibase before update() is invoked as follows:
final String path = configuration.getUpdateSqlExportFile();
ExecutorService.getInstance().setExecutor(liquibase.getDatabase(), new LoggingJdbcExecutor(
liquibase.getDatabase(), new FileWriter(path)
));
Is this approach reasonable and future-proof? While it seems to work, I'm not sure whether I'm
missing something or whether there's a better way.
Thanks

Bigquery: Extract Job does not create file

I am working on a Java application which uses BigQuery as the analytics engine. I was able to run query jobs (and get results) using the code from "Insert a Query Job". I had to modify the code to use a service account, based on this comment on Stack Overflow.
Now I need to run an extract job to export a table to a bucket on Google Cloud Storage. Based on "Exporting a Table", I was able to modify the Java code to insert extract jobs (code below). When run, the extract job's status changes from PENDING to RUNNING to DONE. The problem is that no file is actually uploaded to the specified bucket.
Info that might be helpful:
The createAuthorizedClient function returns a Bigquery instance and works for query jobs, so there are probably no issues with the service account, private key, etc.
I also tried creating and running the insert job manually in Google's API explorer, and the file is successfully created in the bucket. I am using the same values for project, dataset, table and destination URI as in the code, so these should be correct.
Here is the code (pasting the complete file in case somebody else finds this useful):
import java.io.File;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.List;
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.Bigquery.Jobs.Insert;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationExtract;
import com.google.api.services.bigquery.model.JobReference;
import com.google.api.services.bigquery.model.TableReference;
public class BigQueryJavaGettingStarted {
private static final String PROJECT_ID = "123456789012";
private static final String DATASET_ID = "MY_DATASET_NAME";
private static final String TABLE_TO_EXPORT = "MY_TABLE_NAME";
private static final String SERVICE_ACCOUNT_ID = "123456789012-...@developer.gserviceaccount.com";
private static final File PRIVATE_KEY_FILE = new File("/path/to/privatekey.p12");
private static final String DESTINATION_URI = "gs://mybucket/file.csv";
private static final List<String> SCOPES = Arrays.asList(BigqueryScopes.BIGQUERY);
private static final HttpTransport TRANSPORT = new NetHttpTransport();
private static final JsonFactory JSON_FACTORY = new JacksonFactory();
public static void main (String[] args) {
try {
executeExtractJob();
} catch (Exception e) {
e.printStackTrace();
}
}
public static final void executeExtractJob() throws IOException, InterruptedException, GeneralSecurityException {
Bigquery bigquery = createAuthorizedClient();
//Create a new Extract job
Job job = new Job();
JobConfiguration config = new JobConfiguration();
JobConfigurationExtract extractConfig = new JobConfigurationExtract();
TableReference sourceTable = new TableReference();
sourceTable.setProjectId(PROJECT_ID).setDatasetId(DATASET_ID).setTableId(TABLE_TO_EXPORT);
extractConfig.setSourceTable(sourceTable);
extractConfig.setDestinationUri(DESTINATION_URI);
config.setExtract(extractConfig);
job.setConfiguration(config);
//Insert/Execute the created extract job
Insert insert = bigquery.jobs().insert(PROJECT_ID, job);
insert.setProjectId(PROJECT_ID);
JobReference jobId = insert.execute().getJobReference();
//Now check to see if the job has successfully completed (optional for extract jobs?)
long startTime = System.currentTimeMillis();
long elapsedTime;
while (true) {
Job pollJob = bigquery.jobs().get(PROJECT_ID, jobId.getJobId()).execute();
elapsedTime = System.currentTimeMillis() - startTime;
System.out.format("Job status (%dms) %s: %s\n", elapsedTime, jobId.getJobId(), pollJob.getStatus().getState());
if (pollJob.getStatus().getState().equals("DONE")) {
break;
}
//Wait a second before rechecking job status
Thread.sleep(1000);
}
}
private static Bigquery createAuthorizedClient() throws GeneralSecurityException, IOException {
GoogleCredential credential = new GoogleCredential.Builder()
.setTransport(TRANSPORT)
.setJsonFactory(JSON_FACTORY)
.setServiceAccountScopes(SCOPES)
.setServiceAccountId(SERVICE_ACCOUNT_ID)
.setServiceAccountPrivateKeyFromP12File(PRIVATE_KEY_FILE)
.build();
return Bigquery.builder(TRANSPORT, JSON_FACTORY)
.setApplicationName("My Reports")
.setHttpRequestInitializer(credential)
.build();
}
}
Here is the output:
Job status (337ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: PENDING
...
Job status (9186ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: PENDING
Job status (10798ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: RUNNING
...
Job status (53952ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: RUNNING
Job status (55531ms) job_dc08f7327e3d48cc9b5ba708efe5b6b5: DONE
It is a small table (about 4 MB), so the job taking about a minute seems OK. I have no idea why no file is created in the bucket, or how to go about debugging this. Any help would be appreciated.
As Craig pointed out, I printed the status.errorResult() and status.errors() values:
getErrorResults(): {"message":"Backend error. Job aborted.","reason":"internalError"}
getErrors(): null
It looks like there was an access denied error writing to the path: gs://pixalate_test/from_java.csv. Can you make sure that the user that was performing the export job has write access to the bucket (and that the file doesn't already exist)?
I've filed an internal bigquery bug on this issue ... we should give a better error in this situation.
I believe the problem is with the bucket name you're using -- mybucket above is just an example, you need to replace that with a bucket you actually own in Google Storage. If you've never used GS before, the intro docs will help.
Your second question was how to debug this -- I'd recommend looking at the returned Job object once the status is set to DONE. Jobs that end in an error still make it to the DONE state; the difference is that they have an error result attached, so job.getStatus().hasErrorResult() should be true. (I've never used the Java client libraries, so I'm guessing at that method name.) You can find more information in the jobs docs.
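In the generated Java model classes the accessor is getErrorResult() rather than hasErrorResult(); a sketch of that check, fitted into the polling loop above:
// Sketch: once the job reaches DONE, inspect the status for an attached error.
if (pollJob.getStatus().getState().equals("DONE")) {
    if (pollJob.getStatus().getErrorResult() != null) {
        System.out.println("Job failed: " + pollJob.getStatus().getErrorResult().getMessage());
        System.out.println("All errors: " + pollJob.getStatus().getErrors());
    }
    break;
}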
One more difference I notice is that you are not passing the job type, as in config.setJobType(JOB_TYPE);
where the constant is private static final String JOB_TYPE = "extract";.
Also, for JSON output you need to set the destination format as well.
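A sketch of those two settings applied to the job configuration built above; setDestinationFormat belongs to JobConfigurationExtract, while the jobType field is generally informational, so setting it may not strictly be required.
// Sketch: mark the job as an extract job and request JSON output (the default is CSV).
config.setJobType("extract");
extractConfig.setDestinationFormat("NEWLINE_DELIMITED_JSON");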
I had the same problem, but it turned out that I had typed the name of the table wrong. However, Google did not generate an error message saying "the table does not exist", which would have helped me locate my problem.
Thanks!