How to work with Stackdriver logs exported from Google Cloud projects into BigQuery

I have created an "export" from my Stackdriver Logging page in my Google Cloud project. I configured the export to go to a BigQuery dataset.
When I go to BigQuery, I see the dataset.
There are no tables in my dataset, since Stackdriver export created the BigQuery dataset for me.
How do I see the data that was exported? Since there are no tables I cannot perform a "select * from X". I could create a table but I don't know what columns to add nor do I know how to tell Stackdriver logging to write to that table.
I must be missing a step.
Google has a short one-minute video on exporting to BigQuery, but it stops exactly at the point where I am in the process.

When a new Stackdriver export is defined, it will then start to export newly written log records to the target sink (BQ in this case). As per the documentation found here:
https://cloud.google.com/logging/docs/export/
it states:
"Since exporting happens for new log entries only, you cannot export log entries that Logging received before your sink was created."
If you want to export existing logs to a file, you can use gcloud (or the API) as described here:
https://cloud.google.com/logging/docs/reference/tools/gcloud-logging#reading_log_entries
The output of this "dump" of existing log records can then be used in whatever manner you see fit. For example, it could be imported into a BQ table.

To export logs from Stackdriver to BigQuery, you have to create a logging sink, either in code or through the GCP Logging UI.
When you create the sink, add a filter.
https://cloud.google.com/logging/docs/export/configure_export_v2
Then write log entries to Stackdriver from code:
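import com.google.cloud.MonitoredResource;
import com.google.cloud.logging.LogEntry;
import com.google.cloud.logging.Payload;
import com.google.cloud.logging.Severity;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// `logging` (a com.google.cloud.logging.Logging client) and `limitMap` are
// defined elsewhere in the class and are not shown here.
public static void writeLog(Severity severity, String logName, Map<String, String> jsonMap) {
    List<Map<String, String>> maps = limitMap(jsonMap);
    // Each map returned by limitMap is written as its own JSON log entry.
    for (Map<String, String> map : maps) {
        LogEntry logEntry = LogEntry.newBuilder(Payload.JsonPayload.of(map))
            .setSeverity(severity)
            .setLogName(logName)
            .setResource(monitoredResource)
            .build();
        logging.write(Collections.singleton(logEntry));
    }
}

// Entries are attached to the "global" monitored resource of the current project.
private static MonitoredResource monitoredResource =
    MonitoredResource.newBuilder("global")
        .addLabel("project_id", logging.getOptions().getProjectId())
        .build();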
https://cloud.google.com/bigquery/docs/writing-results

Related

Incremental loading and BigQuery

I'm writing an incremental loading pipeline to load data from MySQL into BigQuery, using Google Cloud Datastore as a metadata repo.
My current pipeline is written this way:
PCollection<TableRow> tbRows =
    pipeline.apply("Read from MySQL",
        JdbcIO.<TableRow>read()
            // The data source configuration is closed here; query, coder and
            // row mapper are set on the JdbcIO read itself.
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration
                    .create("com.mysql.cj.jdbc.Driver", connectionConfig)
                    .withUsername(username)
                    .withPassword(password))
            .withQuery(query)
            .withCoder(TableRowJsonCoder.of())
            .withRowMapper(JdbcConverters.getResultSetToTableRow()))
        .setCoder(NullableCoder.of(TableRowJsonCoder.of()));

tbRows.apply("Write to BigQuery",
    BigQueryIO.writeTableRows().withoutValidation()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .to(outputTable));

// Independent branch: find the latest timestamp and store it in Datastore.
tbRows.apply("Getting timestamp column",
        MapElements.into(TypeDescriptors.strings())
            .via((final TableRow row) -> (String) row.get(fieldName)))
    .setCoder(NullableCoder.of(StringUtf8Coder.of()))
    .apply("Max", Max.globally())
    .apply("Updating Datastore", ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(final ProcessContext c) {
            DatastoreConnector.udpate(table, c.element());
        }
    }));
The problem I am facing is that when the BigQuery write step fails, the Datastore is still updated. Is there any way to wait for the BigQuery write to finish before updating the Datastore?
Thanks!
Currently this cannot be done in the same pipeline with BigQueryIO.writeTableRows() since it produces a terminal output (PDone). I have some suggestions though.
I suspect a BigQuery write failing is a rare occurrence. In that case, can you delete the corresponding Datastore data from a secondary job/process?
Have you considered a CDC solution, which is better suited to writing incremental change data? For example, see the Dataflow template here.

Cloud Dataflow: Step to read csv file on AWS S3 (TextIO.read) sometimes get stuck

Example code is below.
// Java
// Apache Beam SDK version: 2.16.0
final TupleTag<TableRow> successTag = new TupleTag<TableRow>() {};
final TupleTag<TableRow> deadLetterTag = new TupleTag<TableRow>() {};
Pipeline p = Pipeline.create(dataflowOptions);
PCollection<String> input = p.apply("ReadS3File", TextIO.read().from("s3://sourceBucket/sourceFilename.csv"));
PCollectionTuple outputTuple = input.apply("StringToBigQueryTableRow",
    ParDo.of(new DoFn<String, TableRow>() { /**/ })
        .withOutputTags(successTag, TupleTagList.of(deadLetterTag)));
The ReadS3File step sometimes gets stuck.
I have been reading the Dataflow documentation (Common error guidance) and examining the thread dump; the step appears to be stuck at com.amazonaws.internal.SdkFilterInputStream.read
I've tried to determine the root cause of the issue by analysing the thread dump, but I'm afraid it's not enough. I recommend opening a support case with Google Cloud Platform, because this requires more information from you that shouldn't be shared publicly here.

How to avoid duplicates in BigQuery by streaming with Apache Beam IO?

We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (for BigQuery and Postgres) and then inserted into both sinks.
But we are seeing duplicates in both sinks (Postgres was kinda fixed with a unique constraint and an "ON CONFLICT... DO NOTHING").
At first we trusted the supposed "insertId" UUID that Apache Beam/BigQuery creates.
Then we added a "unique_label" attribute to each message before queueing it into PubSub, using data from the JSON itself that makes it unique (a device_id + a reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want, since we need near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LAST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)
var mappings = ""
// Value only available at runtime
if (options.schemaFile.isAccessible) {
    mappings = readCloudFile(options.schemaFile.get())
}
val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)
val pubsubMessages =
    pipeline
        .apply("ReadPubSubMessages",
            PubsubIO
                .readMessagesWithAttributes()
                .withIdAttribute("id_label")
                .fromTopic(options.pubSubInput))
pubsubMessages
    .apply("AckPubSubMessages", ParDo.of(object : DoFn<PubsubMessage, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
            context.output("")
        }
    }))
val disarmedMessages =
    pubsubMessages
        .apply("DisarmedPubSubMessages",
            DisarmPubsubMessage(tableRowMapper, postgresMapper)
        )
disarmedMessages
    .get(TupleTags.readingErrorTag)
    .apply("LogDisarmedErrors", ParDo.of(object : DoFn<String, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info(context.element())
            context.output("")
        }
    }))
disarmedMessages
    .get(TupleTags.tableRowTag)
    .apply("WriteToBigQuery",
        BigQueryIO
            .writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
            .to(options.bigQueryOutput)
    )
pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to produce TableRow and ReadingsInputFlatten (our own class for Postgres) objects.
We expect zero duplicates, or at least "best effort" (to which we would add a cleaning cron job). We paid for these products to run statistics and big data analysis...
[UPDATE 1]
I even added a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
[Image: new flow with AckPubSubMessages step]
Thanks!!
It looks like you are using the global window. One technique would be to window the stream into N-minute windows, then process the keys within each window and drop any items with duplicate keys.
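The windowing idea above can be sketched in plain Java, outside Beam, to show the logic only. This is a minimal sketch under assumptions: the Message class, the dedupe method and the one-minute window are all hypothetical names and values chosen for illustration, and none of this is Beam API. Each message is assigned to a fixed window by its timestamp, and only the first message per (window, key) pair is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WindowedDedup {

    /** Hypothetical message shape: a uniqueness key plus an event time in epoch milliseconds. */
    static final class Message {
        final String key;
        final long timestampMillis;

        Message(String key, long timestampMillis) {
            this.key = key;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Keeps the first message seen for each (window, key) pair and drops later ones. */
    static List<Message> dedupe(List<Message> messages, long windowMillis) {
        Set<String> seen = new HashSet<>();
        List<Message> unique = new ArrayList<>();
        for (Message m : messages) {
            // Align the timestamp to the start of its fixed window.
            long windowStart = Math.floorDiv(m.timestampMillis, windowMillis) * windowMillis;
            if (seen.add(windowStart + "|" + m.key)) {
                unique.add(m);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Message> input = Arrays.asList(
            new Message("device-1@t0", 1_000L),
            new Message("device-1@t0", 2_000L),   // same key, same window: dropped
            new Message("device-2@t0", 3_000L),
            new Message("device-1@t0", 61_000L)); // same key, next window: kept
        System.out.println(dedupe(input, 60_000L).size());
    }
}
```

Note the trade-off this makes visible: a duplicate that arrives after its window has closed is not caught, so the window length bounds both the added latency and the deduplication horizon.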
The supported programming languages are Python and Java. Your code seems to be Kotlin, and as far as I know it is not supported. I strongly recommend using Java to avoid running into any feature that is unsupported in the language you use.
In addition, I would recommend the following approaches for dealing with duplicates; option 2 could meet your need for near-real-time data:
1. message_id. You have probably already read the FAQ entry on duplicates, which points to a deprecated doc. However, if you check the PubsubMessage object you will notice that messageId is still available, and it will be populated if not set by the publisher:
"ID of this message, assigned by the server when the message is
published ... It must not be populated by the publisher in a
topics.publish call"
2. BigQuery streaming. To guard against duplicates while loading the data, you can create a UUID right before inserting into BQ. Please refer to the section "Example sink: Google BigQuery".
3. Try the Dataflow template PubSubToBigQuery and check whether duplicates still appear in BQ.
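For the BigQuery streaming suggestion above, a randomly generated UUID differs on every retry, so the sketch below swaps in a deterministic, name-based UUID derived from the device_id and reading timestamp that the question says identify a reading. This is an illustration under assumptions: InsertIds and insertIdFor are hypothetical names, and how the resulting ID is attached to the row as its insert ID is not shown here. BigQuery's insert-ID deduplication is best effort, so this reduces duplicates rather than guaranteeing none.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class InsertIds {

    /**
     * Derives a name-based (type 3) UUID from the reading's natural key,
     * so the same device and timestamp always produce the same ID.
     */
    static String insertIdFor(String deviceId, long readingTimestampMillis) {
        String naturalKey = deviceId + "|" + readingTimestampMillis;
        return UUID.nameUUIDFromBytes(naturalKey.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Two deliveries of the same reading map to the same ID.
        String first = insertIdFor("device-1", 1_575_000_000_000L);
        String retry = insertIdFor("device-1", 1_575_000_000_000L);
        System.out.println(first.equals(retry));
    }
}
```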

Nexus 3 Repository Manager Create (Or Run Pre-generated) Task Without Using User Interface

This question arose when I was trying to reboot my Nexus3 container on a weekly schedule and connect to an S3 bucket I have. I have my container set up to connect to the S3 bucket just fine (it creates a new [A-Z,0-9]-metrics.properties file each time), but the previous artifacts are not found when looking through the UI.
I used the Repair - Reconcile component database from blob store task from the UI settings and it works great!
But... all the previous steps are done automatically through scripts and I would like the same for the final step of Reconciling the blob store.
Connecting to the S3 blob store is done with reference to examples from nexus-book-examples. As below:
Map<String, String> config = new HashMap<>()
config.put("bucket", "nexus-artifact-storage")
blobStore.createS3BlobStore('nexus-artifact-storage', config)
AWS credentials are provided during the docker run step, so the above is all that is needed for the blob store setup. It is called by a modified version of provision.sh, which is a script from the nexus-book-examples git page.
Is there a way to either:
Create a task with a groovy script? or,
Reference one of the task types and run the task that way with a POST?
Depending on the specific version of repository manager that you are using, there may be REST endpoints for listing and running scheduled tasks. These were introduced in 3.6.0 according to this ticket: https://issues.sonatype.org/browse/NEXUS-11935. For more information about the REST integration in 3.x, check out the following: https://help.sonatype.com/display/NXRM3/Tasks+API
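As a rough sketch of calling such an endpoint from a script, the Java 11+ snippet below builds (but does not send) a POST request that asks the repository manager to run an existing task. The path /service/rest/v1/tasks/{id}/run is an assumption based on the Tasks API page linked above and may differ on older versions (early 3.x releases exposed these endpoints under a beta path); NexusTaskRunner, the base URL, task ID and credentials are placeholders.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class NexusTaskRunner {

    /** Builds a POST request that triggers the task with the given ID, using HTTP basic auth. */
    static HttpRequest runTaskRequest(String baseUrl, String taskId, String user, String password) {
        String credentials = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        // Assumed endpoint path; check the Tasks API docs for your Nexus version.
        URI uri = URI.create(baseUrl + "/service/rest/v1/tasks/" + taskId + "/run");
        return HttpRequest.newBuilder(uri)
            .header("Authorization", "Basic " + credentials)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values; nothing is sent over the network here.
        HttpRequest request = runTaskRequest("http://localhost:8081", "my-task-id", "admin", "admin123");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually trigger the task, the request could be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding()). The task ID itself would come from listing the tasks first.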
To create a scheduled task, you will have to add some Groovy code. Perhaps the following would be a good start:
import org.sonatype.nexus.scheduling.TaskConfiguration
import org.sonatype.nexus.scheduling.TaskInfo
import org.sonatype.nexus.scheduling.TaskScheduler
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Shape of the JSON passed to the script as `args`
class TaskXO
{
    String typeId
    Boolean enabled
    String name
    String alertEmail
    Map<String, String> properties
}

TaskXO task = new JsonSlurper().parseText(args)
TaskScheduler scheduler = container.lookup(TaskScheduler.class.name)

// Build the task configuration from the parsed arguments
TaskConfiguration config = scheduler.createTaskConfigurationInstance(task.typeId)
config.enabled = task.enabled
config.name = task.name
config.alertEmail = task.alertEmail
task.properties?.each { key, value -> config.setString(key, value) }

// Register the task with a manual schedule and return its info as JSON
TaskInfo taskInfo = scheduler.scheduleTask(config, scheduler.scheduleFactory.manual())
JsonOutput.toJson(taskInfo)

Dataproc + BigQuery examples - any available?

According to the Dataproc docs, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job), then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" The reason is that we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.
Are there any Dataproc + BigQuery examples available? I can't find any.
To begin, as noted in this question the BigQuery connector is preinstalled on Cloud Dataproc clusters.
Here is an example on how to read data from BigQuery into Spark. In this example, we will read data from BigQuery to perform a word count.
You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD. The Spark documentation has more information about using SparkContext.newAPIHadoopRDD.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
"[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"
val conf = sc.hadoopConfiguration
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
fullyQualifiedOutputTableId, outputTableSchema)
val fieldName = "word"
val tableData = sc.newAPIHadoopRDD(conf,
classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(),entry._2.toString())).take(10)
You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.
The above example doesn't show how to write data to an output table. To do that, you need to call this on the RDD you want to save:
.saveAsNewAPIHadoopFile(
    hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
    classOf[String],
    classOf[JsonObject],
    classOf[BigQueryOutputFormat[String, JsonObject]], hadoopConf)
where the key (a String) is actually ignored.