Java - Insert a single row at a time into Google BigQuery?

I am creating an application where every time a user clicks on an article, I need to capture the article data and the user data to calculate the reach of every article and be able to run analytics on the reached data.
My application is on App Engine.
When I check documentation for inserts into BQ, most of them point towards bulk inserts in the form of Jobs or Streams.
Question:
Is it even good practice to insert into BigQuery one row at a time, every time a user action is initiated? If so, could you point me to some Java code that does this effectively?

There are limits on the number of load jobs and DML queries (1,000 per day), so you'll need to use streaming inserts for this kind of application. Note that streaming inserts are different from loading data from a Java stream.
TableId tableId = TableId.of(datasetName, tableName);
// Values of the row to insert
Map<String, Object> rowContent = new HashMap<>();
rowContent.put("booleanField", true);
// Bytes are passed in base64
rowContent.put("bytesField", "Cg0NDg0="); // 0xA, 0xD, 0xD, 0xE, 0xD in base64
// Records are passed as a map
Map<String, Object> recordsContent = new HashMap<>();
recordsContent.put("stringField", "Hello, World!");
rowContent.put("recordField", recordsContent);
InsertAllResponse response =
bigquery.insertAll(
InsertAllRequest.newBuilder(tableId)
.addRow("rowId", rowContent)
// More rows can be added in the same RPC by invoking .addRow() on the builder
.build());
if (response.hasErrors()) {
// If any of the insertions failed, this lets you inspect the errors
for (Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
// inspect row error
}
}
(From the example at https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery-stream-data-java)
Note especially that a failed insert does not always throw an exception. You must also check the response object for errors.
Is it even a good practice to insert into BigQuery one row at a time every time a user action is initiated?
Yes, it's pretty typical to stream events to BigQuery for analytics. You could get better performance if you buffer multiple events into the same streaming insert request to BigQuery, but one row at a time is definitely supported.
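To make the buffering idea concrete, here is a minimal sketch of a row buffer. RowBuffer and its sink callback are hypothetical names, not part of the BigQuery client library; in a real application the sink would presumably build one InsertAllRequest from the batch (one addRow call per row) and pass it to insertAll. The batch size is an assumption you would tune yourself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Collects rows and hands them to a sink in batches (hypothetical helper, not a BigQuery API). */
final class RowBuffer {
    private final int maxRows;
    private final Consumer<List<Map<String, Object>>> sink;
    private final List<Map<String, Object>> pending = new ArrayList<>();

    RowBuffer(int maxRows, Consumer<List<Map<String, Object>>> sink) {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be >= 1");
        }
        this.maxRows = maxRows;
        this.sink = sink;
    }

    /** Adds one row and flushes automatically once maxRows rows are pending. */
    synchronized void add(Map<String, Object> row) {
        pending.add(row);
        if (pending.size() >= maxRows) {
            flush();
        }
    }

    /** Sends all pending rows to the sink as one batch; does nothing when the buffer is empty. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<Map<String, Object>> batch = new ArrayList<>(pending);
        pending.clear();
        // Assumption: a real sink would turn this batch into a single streaming insert request.
        sink.accept(batch);
    }
}
```

Rows that are still buffered are lost if the instance dies before flush() runs, so you would likely also call flush() on a timer and at shutdown.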

A simplified version of Google's example.
Map<String, Object> row1Data = new HashMap<>();
row1Data.put("booleanField", true);
row1Data.put("stringField", "myString");
Map<String, Object> row2Data = new HashMap<>();
row2Data.put("booleanField", false);
row2Data.put("stringField", "myOtherString");
TableId tableId = TableId.of("myDatasetName", "myTableName");
InsertAllResponse response =
bigQuery.insertAll(
InsertAllRequest.newBuilder(tableId)
.addRow("row1Id", row1Data)
.addRow("row2Id", row2Data)
.build());
if (response.hasErrors()) {
// If any of the insertions failed, this lets you inspect the errors
for (Map.Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
// inspect row error
}
}

You can use the Cloud Logging API to write one row at a time.
https://cloud.google.com/logging/docs/reference/libraries
Sample code from the documentation:
public class QuickstartSample {
/** Expects a new or existing Cloud log name as the first argument. */
public static void main(String... args) throws Exception {
// Instantiates a client
Logging logging = LoggingOptions.getDefaultInstance().getService();
// The name of the log to write to
String logName = args[0]; // "my-log";
// The data to write to the log
String text = "Hello, world!";
LogEntry entry =
LogEntry.newBuilder(StringPayload.of(text))
.setSeverity(Severity.ERROR)
.setLogName(logName)
.setResource(MonitoredResource.newBuilder("global").build())
.build();
// Writes the log entry asynchronously
logging.write(Collections.singleton(entry));
System.out.printf("Logged: %s%n", text);
}
}
In this case you need to create a sink from the Dataflow logs. The messages will then be redirected to the BigQuery table.
https://cloud.google.com/logging/docs/export/configure_export_v2

Related

Quartz scheduler for long running tasks skips jobs

This is my job. It takes about 3 to 5 minutes to complete each time:
[DisallowConcurrentExecution]
[PersistJobDataAfterExecution]
public class UploadNumberData : IJob
{
private readonly IServiceProvider serviceProvider;
public UploadNumberData(IServiceProvider serviceProvider)
{
this.serviceProvider = serviceProvider;
}
public async Task Execute(IJobExecutionContext context)
{
var jobDataMap = context.MergedJobDataMap;
string flattenedInput = jobDataMap.GetString("FlattenedInput");
string applicationName = jobDataMap.GetString("ApplicationName");
var parsedFlattenedInput = JsonSerializer.Deserialize<List<NumberDataUploadViewModel>>(flattenedInput);
var parsedApplicationName = JsonSerializer.Deserialize<string>(applicationName);
using (var scope = serviceProvider.CreateScope())
{
//Run Process
}
}
}
This is the function that calls the job:
try
{
var flattenedInput = JsonSerializer.Serialize(Input.NumData);
var triggerKey = Guid.NewGuid().ToString();
IJobDetail job = JobBuilder.Create<UploadNumberData >()
.UsingJobData("FlattenedInput", flattenedInput)
.UsingJobData("ApplicationName", flattenedApplicationName)
.StoreDurably()
.WithIdentity("BatchNumberDataJob", $"GP_BatchNumberDataJob")
.Build();
await scheduler.AddJob(job, true);
ITrigger trigger = TriggerBuilder.Create()
.ForJob(job)
.WithIdentity(triggerKey, $"GP_BatchNumberDataJob")
.WithSimpleSchedule(x => x.WithMisfireHandlingInstructionFireNow())
.StartNow()
.Build();
await scheduler.ScheduleJob(trigger);
}
catch(Exception e)
{
//log
}
Each job consists of 300 rows of data with the total count being about 14000 rows divided into 47 jobs.
This is the configuration:
NameValueCollection quartzProperties = new NameValueCollection
{
{"quartz.serializer.type","json" },
{"quartz.jobStore.type","Quartz.Impl.AdoJobStore.JobStoreTX, Quartz" },
{"quartz.jobStore.dataSource","default" },
{"quartz.dataSource.default.provider","MySql" },
{"quartz.dataSource.default.connectionString","connectionstring"},
{"quartz.jobStore.driverDelegateType","Quartz.Impl.AdoJobStore.MySQLDelegate, Quartz" },
{"quartz.jobStore.misfireThreshold","3600000" }
};
The problem now is that when I hit the function/API, only the first and last jobs get inserted into the database. Strangely, the last job repeats itself multiple times as well.
I tried changing the Job Identity name to something different but I then get foreign key errors as my data is being inserted into the database.
Example sequence should be:
300,300,300,...,102
However, the sequence ends up being:
300,102,102,102
EDIT:
When I set the threads to 1 and changed the Job Identity to be dynamic, it works. However, does this defeat the purpose of DisallowConcurrentExecution?
I reproduced your problem and found how you should rewrite your code to get the expected behaviour, as I understand it.
Make job identity unique
First of all, I see you use the same identity for every job you execute. The duplication happens because every job has the same identity and the 'replace' flag is 'true' in the AddJob method call, so each call overwrites the previously stored job.
You were on the right track when you decided to use dynamic identity generation for each job; it could be a new GUID or an incrementing integer for each identity. Something like this:
// 'i' variable is a job counter (0, 1, 2 ...)
.WithIdentity($"BatchNumberDataJob-{i}", $"GP_BatchNumberDataJob")
// or
.WithIdentity(Guid.NewGuid().ToString(), $"GP_BatchNumberDataJob")
// You may also want to set the 'replace' flag to 'false'
// so that an 'already exists' error is raised if a collision occurs.
// You may want to handle such cases.
await scheduler.AddJob(job, false);
After that you can remove the [DisallowConcurrentExecution] attribute from the job: it is based on the job key, so it no longer has any effect with such dynamic identities.
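To illustrate why a shared identity makes the last batch repeat, here is a toy model in Java. It is not Quartz's real implementation; ToyJobStore and its methods are hypothetical. It only models triggers that fire after the later AddJob calls have already replaced the stored job data, which is roughly what happens while the first long-running job blocks the others.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model of a job store keyed by job identity (hypothetical, not Quartz code). */
final class ToyJobStore {
    private final Map<String, String> jobData = new HashMap<>();      // job key -> payload
    private final Queue<String> pendingTriggers = new ArrayDeque<>(); // job keys waiting to fire

    /** Stores a job; with replace=true an existing job with the same key is overwritten. */
    void addJob(String jobKey, String payload, boolean replace) {
        if (!replace && jobData.containsKey(jobKey)) {
            throw new IllegalStateException("Job already exists: " + jobKey);
        }
        jobData.put(jobKey, payload);
    }

    /** Queues a trigger that refers to the job only by its key. */
    void scheduleTrigger(String jobKey) {
        pendingTriggers.add(jobKey);
    }

    /** Fires every queued trigger and returns the payload each execution saw. */
    List<String> runAll() {
        List<String> seen = new ArrayList<>();
        while (!pendingTriggers.isEmpty()) {
            seen.add(jobData.get(pendingTriggers.poll()));
        }
        return seen;
    }
}
```

With one shared key, three queued triggers all read the payload of the last addJob call; with a unique key per batch, each trigger reads its own payload.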
Concurrency
Basically, you have a few options for how to execute your jobs; it really depends on what you are trying to achieve.
Parallel execution
This is the fastest way to execute your code. Each job is completely separate from the others.
To do so, you should prepare your database for this case (because, as you said, you get foreign key errors when you try to achieve that behaviour).
It is hard to say exactly what you should change in the database to support this behaviour because you say nothing about your database.
If your jobs need to run in a particular order, this method is not for you.
Ordered execution
The other way is to use ordered execution. If (for some reason) you are not able to prepare your database to handle parallel job execution, you could use this method. It is much slower than parallel execution, but the order in which the jobs execute is deterministic.
You can achieve this behaviour in two ways:
use job chaining. See this question.
set the max concurrency for the scheduler:
var quartzProperties = new NameValueCollection
{
{"quartz.threadPool.maxConcurrency","1" },
};
So the jobs will be executed in the order you trigger them, completely without parallelism.
Summary
It really depends on what you are trying to achieve. If your priority is speed, you should rework your database and your job to support completely independent job execution, regardless of the order in which the jobs run. If your priority is ordering, you should use non-parallel methods for job execution. It is up to you.

Apache Ignite Continuous Queries : How to get the field names and field values in the listener updates when there are dynamic fields?

I am working on a POC to decide whether or not we should go ahead with Apache Ignite for both commercial and enterprise use. There is one use case, though, that we are trying to find an answer for.
Preconditions
Dynamic creation of tables, i.e. new fields may arrive that have to be put into the cache. This means there is no precompiled POJO (model) defining the attributes of the table/cache.
Use case
I would like to write a SELECT continuous query where it gives me the results that are modified. So I wrote that query but the problem is that when the listener gets a notification, I am not able to find all the field names that are modified from any method call. I would like to be able to get all the field names and field values in some sort of Map, which I can use and then submit to other systems.
You could track all modified field values using binary object and continuous query:
IgniteCache<Integer, BinaryObject> cache = ignite.cache("person").withKeepBinary();
ContinuousQuery<Integer, BinaryObject> query = new ContinuousQuery<>();
query.setLocalListener(events -> {
for (CacheEntryEvent<? extends Integer, ? extends BinaryObject> event : events) {
BinaryType type = ignite.binary().type("Person");
if (event.getOldValue() != null && event.getValue() != null) {
HashMap<String,Object> oldProps = new HashMap<>();
HashMap<String,Object> newProps = new HashMap<>();
for (String field : type.fieldNames()) {
oldProps.put(field,event.getOldValue().field(field));
newProps.put(field,event.getValue().field(field));
}
com.google.common.collect.MapDifference<Object, Object> diff = com.google.common.collect.Maps.difference(oldProps, newProps);
System.out.println(diff.entriesDiffering());
}
}
});
cache.query(query);
cache.put(1, ignite.binary().builder("Person").setField("name","Alice").build());
cache.put(1, ignite.binary().builder("Person").setField("name","Bob").build());
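If you would rather not depend on Guava, or you also want fields that were newly added (Guava's entriesDiffering() only reports keys present in both maps), the comparison can be sketched in plain Java. FieldDiff and changedFields are hypothetical names; the two maps are assumed to be built exactly like oldProps and newProps in the listener above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class FieldDiff {
    /** Returns field name -> new value for every field whose value differs between the two maps. */
    static Map<String, Object> changedFields(Map<String, Object> oldProps,
                                             Map<String, Object> newProps) {
        // Union of field names, sorted so the result has a stable order.
        Set<String> names = new TreeSet<>(oldProps.keySet());
        names.addAll(newProps.keySet());

        Map<String, Object> changed = new LinkedHashMap<>();
        for (String name : names) {
            Object before = oldProps.get(name);
            Object after = newProps.get(name);
            if (!Objects.equals(before, after)) {
                // A null value here means the field was removed or set to null.
                changed.put(name, after);
            }
        }
        return changed;
    }
}
```

The returned map can be handed to other systems as-is; unchanged fields are simply absent from it.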

How to catch any exceptions thrown by BigQueryIO.Write and rescue the data which is failed to output?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each message contains the ID of the table where the data itself will be saved.
Writing to BigQuery can fail for various reasons:
The table ID format is wrong.
The dataset does not exist.
The pipeline is not allowed to access the dataset.
Network failure.
When one of the failures occurs, a streaming job will retry the task and stall. I tried using WriteResult.getFailedInserts() in order to rescue the bad data and avoid stalling, but it did not work well. Is there any good way?
Here is my code:
public class StarterPipeline {
private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);
public class MyData implements Serializable {
String table_id;
}
public interface MyOptions extends PipelineOptions {
@Description("PubSub topic to read from, specified as projects/<project_id>/topics/<topic_id>")
@Validation.Required
ValueProvider<String> getInputTopic();
void setInputTopic(ValueProvider<String> value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<MyData> input = p
.apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("ParseJSON", MapElements.into(TypeDescriptor.of(MyData.class))
.via((String text) -> new Gson().fromJson(text, MyData.class)));
WriteResult writeResult = input
.apply("WriteToBigQuery", BigQueryIO.<MyData>write()
.to(new SerializableFunction<ValueInSingleWindow<MyData>, TableDestination>() {
@Override
public TableDestination apply(ValueInSingleWindow<MyData> input) {
MyData myData = input.getValue();
return new TableDestination(myData.table_id, null);
}
})
.withSchema(new TableSchema().setFields(new ArrayList<TableFieldSchema>() {{
add(new TableFieldSchema().setName("table_id").setType("STRING"));
}}))
.withFormatFunction(new SerializableFunction<MyData, TableRow>() {
@Override
public TableRow apply(MyData myData) {
return new TableRow().set("table_id", myData.table_id);
}
})
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));
writeResult.getFailedInserts()
.apply("LogFailedData", ParDo.of(new DoFn<TableRow, TableRow>() {
@ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
LOG.info(row.get("table_id").toString());
}
}));
p.run();
}
}
There is no easy way to catch exceptions when writing to output in a pipeline definition. I suppose you could do it by writing a custom PTransform for BigQuery. However, there is no way to do it natively in Apache Beam. I also recommend against this because it undermines Cloud Dataflow's automatic retry functionality.
In your code example, you have the failed insert retry policy set to never retry. You can set the policy to always retry. This is only effective during something like an intermittent network failure (4th bullet point).
.withFailedInsertRetryPolicy(InsertRetryPolicy.alwaysRetry())
If the table ID format is incorrect (1st bullet point), then the CREATE_IF_NEEDED create disposition configuration should allow the Dataflow job to automatically create a new table without error, even if the table ID is incorrect.
If the dataset does not exist or there is an access permission issue to the dataset (2nd and 3rd bullet points), then my opinion is that the streaming job should stall and ultimately fail. There is no way to proceed under any circumstances without manual intervention.

BigQuery in Dataflow fails to load data from Cloud Storage: JSON object specified for non-record field

I have a Dataflow pipeline running locally on my machine and writing to BigQuery. In this batch job, BigQuery requires a temporary location. I have provided one in my Cloud Storage. The relevant parts are:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
.setTempLocation("gs://folder/temp");
Pipeline p = Pipeline.create(options);
....
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("uuid").setType("STRING"));
fields.add(new TableFieldSchema().setName("start_time").setType("TIMESTAMP"));
fields.add(new TableFieldSchema().setName("end_time").setType("TIMESTAMP"));
TableSchema schema = new TableSchema().setFields(fields);
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.to("myproject:db.table"));
Where for FormatAsTableRowFn I have:
static class FormatAsTableRowFn extends DoFn<KV<String, String>, TableRow>
implements RequiresWindowAccess{
@Override
public void processElement(ProcessContext c) {
TableRow row = new TableRow()
.set("uuid", c.element().getKey())
// include a field for the window timestamp
.set("start_time", ((IntervalWindow) c.window()).start().toInstant()) //NOTE: I tried both with and without
.set("end_time", ((IntervalWindow) c.window()).end().toInstant()); // .toInstant receiving the same error
c.output(row);
}
}
If I print out row.toString() I will get legit timestamps:
{uuid=00:00:00:00:00:00, start_time=2016-09-22T07:34:38.000Z, end_time=2016-09-22T07:39:38.000Z}
When I run this code, Java reports: Failed to create the load job beam_job_XXX
Manually inspecting the temp folder in GCS, the objects look like:
{"mac":"00:00:00:00:00:00","start_time":{"millis":1474529678000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false},"end_time":{"millis":1474529978000,"chronology":{"zone":{"fixed":true,"id":"UTC"}},"zone":{"fixed":true,"id":"UTC"},"afterNow":false,"beforeNow":true,"equalNow":false}}
Looking at the failed job report in BigQuery, the Error says:
JSON object specified for non-record field: start_time (error code: invalid)
This is very strange, because I am pretty sure I said this is a TIMESTAMP, and I am 100% sure my schema in BigQuery conforms with the TableSchema in the SDK. (NOTE: setting withCreateDisposition to CREATE_IF_NEEDED yields the same result.)
Could someone please tell me how I need to remedy this to get the data inside BigQuery?
Don't use Instant objects. Try using milliseconds/seconds.
https://cloud.google.com/bigquery/data-types
A positive number specifies the number of seconds since the epoch
So, something like this should work:
.getMillis() / 1000
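As a small worked example of that conversion, the sketch below turns the millis value from the temp file in the question (1474529678000) into whole seconds. TimestampFields is a hypothetical helper, and java.time.Instant is used here only as a stand-in to print the result; the question's code uses the Joda-Time Instant from the Dataflow SDK.

```java
import java.time.Instant;

final class TimestampFields {
    /** Converts epoch milliseconds to whole epoch seconds (floor division, also valid before 1970). */
    static long toEpochSeconds(long epochMillis) {
        return Math.floorDiv(epochMillis, 1000L);
    }

    /** Renders the truncated value as an ISO-8601 UTC string, to check it is the expected instant. */
    static String toIsoString(long epochMillis) {
        return Instant.ofEpochSecond(toEpochSeconds(epochMillis)).toString();
    }
}
```

For 1474529678000 this gives 1474529678 seconds, which corresponds to the start_time 2016-09-22T07:34:38Z printed in the question. In the DoFn you would then set the field to a plain number of this kind rather than to the Instant object.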

Delaying writes to SQL Server

I am working on an app, and need to keep track of how many views a page has. Almost like how SO does it. It is a value used to determine how popular a given page is.
I am concerned that writing to the DB every time a new view needs to be recorded will impact performance. I know this is borderline premature optimization, but I have experienced the problem before. Anyway, the value doesn't need to be real time; it is OK if it is delayed by 10 minutes or so. I was thinking that caching the data and doing one large write every X minutes should help.
I am running on Windows Azure, so the Appfabric cache is available to me. My original plan was to create some sort of compound key (PostID:UserID), and tag the key with "pageview". Appfabric allows you to get all keys by tag. Thus I could let them build up, and do one bulk insert into my table instead of many small writes. The table looks like this, but is open to change.
int PageID | guid userID | DateTime ViewTimeStamp
The website would still get the value from the database, writes would just be delayed, make sense?
I just read that the Windows Azure Appfabric cache does not support tag based searches, so it pretty much negates my idea.
My question is, how would you accomplish this? I am new to Azure, so I am not sure what my options are. Is there a way to use the cache without tag based searches? I am just looking for advice on how to delay these writes to SQL.
You might want to take a look at http://www.apathybutton.com (and the Cloud Cover episode it links to), which talks about a highly scalable way to count things. (It might be overkill for your needs, but hopefully it gives you some options.)
You could keep a queue in memory and on a timer drain the queue, collapse the queued items by totaling the counts by page and write in one SQL batch/round trip. For example, using a TVP you could write the queued totals with one sproc call.
That of course doesn't guarantee the view counts get written, since they are held in memory and written later, but page counts shouldn't be critical data and crashes should be rare.
You might want to have a look at how the "diagnostics" feature in Azure works. Not because you would use diagnostics for what you are doing at all, but because it is dealing with a similar problem and may provide some inspiration. I am just about to implement a data auditing feature and I want to log that to table storage so also want to delay and bunch the updates together and I have taken a lot of inspiration from diagnostics.
Now, the way Diagnostics in Azure works is that each role starts a little background "transfer" thread. So, whenever you write any traces then that gets stored in a list in local memory and the background thread will (by default) bunch all the requests up and transfer them to table storage every minute.
In your scenario, I would let each role instance keep track of a count of hits and then use a background thread to update the database every minute or so.
I would probably use something like a static ConcurrentDictionary (or one hanging off a singleton) on each web role, with each hit incrementing the counter for the page identifier. You'd need some thread-handling code to allow multiple requests to update the same counter in the dictionary. Alternatively, just allow each "hit" to add a new record to a shared thread-safe list.
Then, have a background thread once per minute increment the database with the number of hits per page since last time and reset the local counter to 0 or empty the shared list if you are going with that approach (again, be careful about the multi threading and locking).
The important thing is to make sure your database update is atomic: if you read the current count from the database, increment it and then write it back, two different web role instances may do this at the same time and one update is lost.
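For the counter-per-page variant described above, a minimal sketch could look like the following. It is written in Java with ConcurrentHashMap standing in for ConcurrentDictionary, and PageHitCounter, logHit and drain are hypothetical names. The idea is that each request calls logHit and the background thread calls drain once per interval.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PageHitCounter {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    /** Records one hit; merge() is atomic per key, so concurrent requests do not lose increments. */
    void logHit(String pageId) {
        counts.merge(pageId, 1L, Long::sum);
    }

    /** Removes and returns the counts accumulated so far; hits logged afterwards start a new count. */
    Map<String, Long> drain() {
        Map<String, Long> snapshot = new HashMap<>();
        for (String pageId : counts.keySet()) {
            // remove() atomically returns the value it removed, so no hit is counted twice or dropped.
            Long hits = counts.remove(pageId);
            if (hits != null) {
                snapshot.put(pageId, hits);
            }
        }
        return snapshot;
    }
}
```

Each drained total would then be applied with a single increment statement per page (count = count + n), so that updates from several instances add up instead of overwriting each other.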
EDIT:
Here is a quick sample of how you could go about this.
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Threading;
using System;
using System.Collections.Generic;
using System.Linq;
class Program
{
static void Main(string[] args)
{
// You would put this in your Application_start for the web role
Thread hitTransfer = new Thread(() => HitCounter.Run(new TimeSpan(0, 0, 1))); // You'd probably want the transfer to happen once a minute rather than once a second
hitTransfer.Start();
//Testing code - this just simulates various web threads being hit and adding hits to the counter
RunTestWorkerThreads(5);
Thread.Sleep(5000);
// You would put the following line in your Application shutdown
HitCounter.StopRunning(); // You could do some cleverer stuff with aborting threads, joining the thread etc but you probably won't need to
Console.WriteLine("Finished...");
Console.ReadKey();
}
private static void RunTestWorkerThreads(int workerCount)
{
Thread[] workerThreads = new Thread[workerCount];
for (int i = 0; i < workerCount; i++)
{
workerThreads[i] = new Thread(
(tagname) =>
{
Random rnd = new Random();
for (int j = 0; j < 300; j++)
{
HitCounter.LogHit(tagname.ToString());
Thread.Sleep(rnd.Next(0, 5));
}
});
workerThreads[i].Start("TAG" + i);
}
foreach (var t in workerThreads)
{
t.Join();
}
Console.WriteLine("All threads finished...");
}
}
public static class HitCounter
{
private static System.Collections.Concurrent.ConcurrentQueue<string> hits;
private static object transferlock = new object();
private static volatile bool stopRunning = false;
static HitCounter()
{
hits = new ConcurrentQueue<string>();
}
public static void LogHit(string tag)
{
hits.Enqueue(tag);
}
public static void Run(TimeSpan transferInterval)
{
while (!stopRunning)
{
Transfer();
Thread.Sleep(transferInterval);
}
}
public static void StopRunning()
{
stopRunning = true;
Transfer();
}
private static void Transfer()
{
lock(transferlock)
{
var tags = GetPendingTags();
var hitCounts = from tag in tags
group tag by tag
into g
select new KeyValuePair<string, int>(g.Key, g.Count());
WriteHits(hitCounts);
}
}
private static void WriteHits(IEnumerable<KeyValuePair<string, int>> hitCounts)
{
// NOTE: I don't usually use sql commands directly and have not tested the below
// The idea is that the update should be atomic so even though you have multiple
// web servers all issuing similar update commands, potentially at the same time,
// they should all commit. I do urge you to test this part as I cannot promise this code
// will work as-is
//using (SqlConnection con = new SqlConnection("xyz"))
//{
// foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
// {
// var cmd = con.CreateCommand();
// cmd.CommandText = "update hits set count = count + @count where tag = @tag";
// cmd.Parameters.AddWithValue("@count", hitCount.Value);
// cmd.Parameters.AddWithValue("@tag", hitCount.Key);
// cmd.ExecuteNonQuery();
// }
//}
Console.WriteLine("Writing....");
foreach (var hitCount in hitCounts.OrderBy(h => h.Key))
{
Console.WriteLine(String.Format("{0}\t{1}", hitCount.Key, hitCount.Value));
}
}
private static IEnumerable<string> GetPendingTags()
{
List<string> hitlist = new List<string>();
var currentCount = hits.Count();
for (int i = 0; i < currentCount; i++)
{
string tag = null;
if (hits.TryDequeue(out tag))
{
hitlist.Add(tag);
}
}
return hitlist;
}
}