Longer than expected upload times for big query insert

Longer than expected upload times for big query insert - google-bigquery

I am uploading data to big query as csv format with JSON schemas. What I am seeing is the very long times to load into big query. I take the start and ending load times from the pollJob.getStatistics() when the load is DONE and compute a delta time as (startTime - endTime)/1000. Then I look at the number of bytes loaded. The data is from files stored in google cloud storage that I reprocess in app engine to do some reformatting. I convert the string into a byte stream and then load as the contents of the load as follows:
public static void uploadFileToBigQuerry(TableSchema tableSchema,String tableData,String datasetId,String tableId,boolean formatIsJson,int waitSeconds,String[] fileIdElements) {
/* Init diagnostic */
String projectId = getProjectId();
if (ReadAndroidRawFile.testMode) {
String s = String.format("My project ID at start of upload to BQ:%s datasetID:%s tableID:%s json:%b \nschema:%s tableData:\n%s\n",
projectId,datasetId,tableId,formatIsJson,tableSchema.toString(),tableData);
log.info(s);
}
else {
String s = String.format("Upload to BQ tableID:%s tableFirst60Char:%s\n",
tableId,tableData.substring(0,60));
log.info(s);
}
/* Setup the data each time */
Dataset dataset = new Dataset();
DatasetReference datasetRef = new DatasetReference();
datasetRef.setProjectId(projectId);
datasetRef.setDatasetId(datasetId);
dataset.setDatasetReference(datasetRef);
try {
bigquery.datasets().insert(projectId, dataset).execute();
} catch (IOException e) {
if (ReadAndroidRawFile.testMode) {
String se = String.format("Exception creating datasetId:%s",e);
log.info(se);
}
}
/* Set destination table */
TableReference destinationTable = new TableReference();
destinationTable.setProjectId(projectId);
destinationTable.setDatasetId(datasetId);
destinationTable.setTableId(tableId);
/* Common setup line */
JobConfigurationLoad jobLoad = new JobConfigurationLoad();
/* Handle input format */
if (formatIsJson) {
jobLoad.setSchema(tableSchema);
jobLoad.setSourceFormat("NEWLINE_DELIMITED_JSON");
jobLoad.setDestinationTable(destinationTable);
jobLoad.setCreateDisposition("CREATE_IF_NEEDED");
jobLoad.setWriteDisposition("WRITE_APPEND");
jobLoad.set("Content-Type", "application/octet-stream");
}
else {
jobLoad.setSchema(tableSchema);
jobLoad.setSourceFormat("CSV");
jobLoad.setDestinationTable(destinationTable);
jobLoad.setCreateDisposition("CREATE_IF_NEEDED");
jobLoad.setWriteDisposition("WRITE_APPEND");
jobLoad.set("Content-Type", "application/octet-stream");
}
/* Setup the job config */
JobConfiguration jobConfig = new JobConfiguration();
jobConfig.setLoad(jobLoad);
JobReference jobRef = new JobReference();
jobRef.setProjectId(projectId);
Job outputJob = new Job();
outputJob.setConfiguration(jobConfig);
outputJob.setJobReference(jobRef);
/* Convert input string into byte stream */
ByteArrayContent contents = new ByteArrayContent("application/octet-stream",tableData.getBytes());
int timesToSleep = 0;
try {
Job job = bigquery.jobs().insert(projectId,outputJob,contents).execute();
if (job == null) {
log.info("Job is null...");
throw new Exception("Job is null");
}
String jobIdNew = job.getId();
//log.info("Job is NOT null...id:");
//s = String.format("job ID:%s jobRefId:%s",jobIdNew,job.getJobReference());
//log.info(s);
while (true) {
try{
Job pollJob = bigquery.jobs().get(jobRef.getProjectId(), job.getJobReference().getJobId()).execute();
String status = pollJob.getStatus().getState();
String errors = "";
String workingDataString = "";
if ((timesToSleep % 10) == 0) {
String statusString = String.format("Job status (%dsec) JobId:%s status:%s\n", timesToSleep, job.getJobReference().getJobId(), status);
log.info(statusString);
}
if (pollJob.getStatus().getState().equals("DONE")) {
status = String.format("Job done, processed %s bytes\n", pollJob.getStatistics().toString()); // getTotalBytesProcessed());
log.info(status); // compute load stats with this string
if ((pollJob.getStatus().getErrors() != null)) {
errors = pollJob.getStatus().getErrors(). toString();
log.info(errors);
}
The performance I get is as follows: the median upload of BYTES/(deltaTime) is 17 BYTES/sec! Yes, bytes, not kilo or mega...
Worse is that sometimes for only a few hundred bytes, just one row, it takes up to 5 minutes. I generally have no errors, but I am thinking that with this performance, I will not be able to upload each app before more data arrives. I am processing with a task queue in a backends instance. This task queue gets a time-out after about an hour of processing.
Is this poor performance because of the contents method?

A couple of things:
If you are loading a small amount of data, you may be better off using TableData.insertAll() rather than a load job, which lets you post the data and have it be available immediately.
Load jobs are Batch oriented jobs. That is, you can insert (more or less) as many as you'd like and they'll be processed when there are resources to do so. Sometimes you create a job and the worker pool is resizing so you have to wait. Sometimes the worker pool is full.
If you provide a project & Job ID we can look into the performance of individual jobs to see what's taking so long.
Load jobs process in parallel; that is, once they start executing they should go very quickly, but the time to start executing may take a long time.
There are three time fields in the job statistics. createTime, startTime, and endTime.
createTime is the moment the BigQuery server receives your request.
startTime is when BigQuery actually starts working on your job
endTime is when the job is completely done
I'd expect that most of the time is being spent between create and start. If that is not the case for small jobs, then it means that something is strange is going on, and a Job ID would help diagnose the issue.

Related

Cosmos DB - changeFeedWorker not working properly after Cosmos db reach 50 gb size and splitted db into 2 partitions

Just want to know if someone already get some issues with changeFeedWorker after that cosmos db reach 50Gb size limit for one partition and automatically split it into 2 partitions.
Since this split we have observed that after adding many new items to the db, changefeedworkers was not always triggered and our view that should get changes is partially updated.
In the lease container we can see now that we moved to 2 "LeaseToken" (1 and 2) was 0 before.
If someone has an idea where to look as it was working fine before the db split.
Here is how I start my worker:
async Task IChangeFeedWorker.StartAsync()
{
await _semaphoreSlim.WaitAsync();
string operationName = $"{nameof(CosmosChangeFeedWorker)}.{nameof(IChangeFeedWorker.StartAsync)}";
using var _ = BeginChangeFeedWorkerScope(_logger, operationName, _processorName, _instanceName);
try
{
if (Active)
{
return;
}
Container eventContainer = _cosmosClient.GetContainer(_databaseId, _eventContainerId);
Container leaseContainer = _cosmosClient.GetContainer(_databaseId, _leaseContainerId);
_changeFeedProcessor = eventContainer
.GetChangeFeedProcessorBuilder<CosmosEventChange>(_processorName, HandleChangesAsync)
.WithInstanceName(_instanceName)
.WithLeaseContainer(leaseContainer)
.WithStartTime(_startTimeOfTrackingChanges)
.Build();
await _changeFeedProcessor.StartAsync();
Active = true;
_logger.LogInformation(
"Change feed processor instance has been started.",
_processorName, _instanceName);
}
catch (Exception e)
{
_logger.LogError(e,
"Starting of change feed processor instance has failed.",
_processorName, _instanceName);
_changeFeedProcessor = null;
throw;
}
finally
{
_semaphoreSlim.Release();
}
}

Hangfire executes job twice

I am using Hangfire.AspNetCore 1.7.17 and Hangfire.MySqlStorage 2.0.3 for software that is currently in production.
Now and then, we get a report of jobs being executed twice, despite the usage of the [DisableConcurrentExecution] attribute with a timeout of 30 seconds.
It seems that as soon as those 30 seconds have passed, another worker picks up that same job again.
The code is fairly straightforward:
public async Task ProcessPicking(HttpRequest incomingRequest)
{
var filePath = await StoreStreamAsync(incomingRequest, TriggerTypes.Picking);
var picking = await XmlHelper.DeserializeFileAsync<Picking>(filePath);
// delay with 20 minutes so outbound-out gets the chance to be send first
BackgroundJob.Schedule(() => StartPicking(picking), TimeSpan.FromMinutes(20));
}
[TriggerAlarming("[IMPORTANT] Failed to parse picking message to **** object.")]
[DisableConcurrentExecution(30)]
public void StartPicking(Picking picking)
{
var orderlinePickModels = picking.ToSalesOrderlinePickQuantityRequests().ToList();
var orderlineStatusModels = orderlinePickModels.ToSalesOrderlineStatusRequests().ToList();
var isParsed = DateTime.TryParse(picking.Order.UnloadingDate, out var unloadingDate);
for (var i = 0; i < orderlinePickModels.Count; i++)
{
// prevents bugs with usage of i in the background jobs
var index = i;
var id = BackgroundJob.Enqueue(() => SendSalesOrderlinePickQuantityRequest(orderlinePickModels[index], picking.EdiReference));
BackgroundJob.ContinueJobWith(id, () => SendSalesOrderlineStatusRequest(
orderlineStatusModels.First(x=>x.SalesOrderlineId== orderlinePickModels[index].OrderlineId),
picking.EdiReference, picking.Order.PrimaryReference, isParsed ? unloadingDate : DateTime.MinValue));
}
}
[TriggerAlarming("[IMPORTANT] Failed to send order line pick quantity request to ****.")]
[AutomaticRetry(Attempts = 2)]
[DisableConcurrentExecution(30)]
public void SendSalesOrderlinePickQuantityRequest(SalesOrderlinePickQuantityRequest request, string ediReference)
{
var audit = new AuditPostModel
{
Description = $"Finished job to send order line pick quantity request for item {request.Itemcode}, part of ediReference {ediReference}.",
Object = request,
Type = AuditTypes.SalesOrderlinePickQuantity
};
try
{
_logger.LogInformation($"Started job to send order line pick quantity request for item {request.Itemcode}.");
var response = _service.SendSalesOrderLinePickQuantity(request).GetAwaiter().GetResult();
audit.StatusCode = (int)response.StatusCode;
if (!response.IsSuccessStatusCode) throw new TriggerRequestFailedException();
audit.IsSuccessful = true;
_logger.LogInformation("Successfully posted sales order line pick quantity request to ***** endpoint.");
}
finally
{
Audit(audit);
}
}
It schedules the main task (StartPicking) that creates the objects required for the two subtasks:
Send picking details to customer
Send statusupdate to customer
The first job is duplicated. Perhaps the second job as well, but this is not important enough to care about as it just concerns a statusupdate. However, the first job causes the customer to think that more items have been picked than in reality.
I would assume that Hangfire updates the state of a job to e.g. in progress, and checks this state before starting a job. Is my time-out on the disabled concurrent execution too low? Is it possible in this scenario that the database connection to update the state takes about 30 seconds (to be fair, it is running on a slow server with ~8GB Ram, 6 vCores) due to which the second worker is already picking up the job again?
Or is this a Hangfire specific issue that must be tackled?

Libgit2sharp Push Performance Degrading With Many Commits

The project I am working on uses GIT in a weird way. Essentially it writes and pushes one commit at a time. The project could result in one branch having hundreds of thousands of commits. When testing we found that after only about 500 commits the performance of the GIT push started to degrade. Upon further investigation using a process monitor we believe that the degradation is due to a walk of the entire tree for the branch being pushed. Since we are only ever pushing one new commit at any given time is there any way to optimize this?
Alternatively is there a way to limit the commit history to be something like 50 commits to reduce this overhead?
I am using LibGit2Sharp Version 0.20.1.0
Update 1
To test I wrote the following code:
void Main()
{
string remotePath = #"E:\GIT Test\Remote";
string localPath = #"E:\GIT Test\Local";
string localFilePath = Path.Combine(localPath, "TestFile.txt");
Repository.Init(remotePath, true);
Repository.Clone(remotePath, localPath);
Repository repo = new Repository(localPath);
for(int i = 0; i < 2000; i++)
{
File.WriteAllText(localFilePath, RandomString((i % 2 + 1) * 10));
repo.Stage(localFilePath);
Commit commit = repo.Commit(
string.Format("Commit number: {0}", i),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now));
Stopwatch pushWatch = Stopwatch.StartNew();
Remote defaultRemote = repo.Network.Remotes["origin"];
repo.Network.Push(defaultRemote, "refs/heads/master:refs/heads/master");
pushWatch.Stop();
Trace.WriteLine(string.Format("Push {0} took {1}ms", i, pushWatch.ElapsedMilliseconds));
}
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
char[] buffer = new char[size];
for (int i = 0; i < size; i++)
{
buffer[i] = Characters[Random.Next(Characters.Length)];
}
return new string(buffer);
}
And ran the process monitor found here:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
The time for each push ended up being generally low with large spikes in time increasing both in frequency and in latency. When looking at the output from the process monitor I believe these spikes lined up with a long stretch where objects in the .git\objects folder were being accessed. For some reason occasionally on a pull there are large reads of the objects which when looked at closer appears to be a walk through the commits and objects.
The above flow is a condensed version of the actual flow we were actually doing in the project. In our actual flow we would first create a new branch "Temp" from "Master", make a commit to "Temp", push "Temp", merge "Temp" with "Master" then push "Master". When we timed each part of that flow we found the push was by far the longest running operation and it was increasing in elapsed time as the commits piled up on "Master".
Update 2
I recently updated to use libgit2sharp version 0.20.1.0 and this problem still exists. Does anyone know why this occurs?
Update 3
We change some of our code to create the temporary branch off of the first commit ever on the "Master" branch to reduce the commit tree traversal overhead but found it still exists. Below is an example that should be easy to compile and run. It shows the tree traversal happens when you create a new branch regardless of the commit position. To see the tree traversal I used the process monitor tool above and command line GIT Bash to examine what each object it opened was. Does anyone know why this happens? Is it expected behavior or am I just doing something wrong? It appears to be the push that causes the issue.
void Main()
{
string remotePath = #"E:\GIT Test\Remote";
string localPath = #"E:\GIT Test\Local";
string localFilePath = Path.Combine(localPath, "TestFile.txt");
Repository.Init(remotePath, true);
Repository.Clone(remotePath, localPath);
// Setup Initial Commit
string newBranch;
using (Repository repo = new Repository(localPath))
{
CommitRandomFile(repo, 0, localFilePath, "master");
newBranch = CreateNewBranch(repo, "master");
repo.Checkout(newBranch);
}
// Commit 1000 times to the new branch
for(int i = 1; i < 1001; i++)
{
using(Repository repo = new Repository(localPath))
{
CommitRandomFile(repo, i, localFilePath, newBranch);
}
}
// Create a single new branch from the first commit ever
// For some reason seems to walk the entire commit tree
using(Repository repo = new Repository(localPath))
{
CreateNewBranch(repo, "master");
}
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Generate and commit a random file to the specified branch
/// </summary>
public static void CommitRandomFile(Repository repo, int seed, string rootPath, string branch)
{
File.WriteAllText(rootPath, RandomString((seed % 2 + 1) * 10));
repo.Stage(rootPath);
Commit commit = repo.Commit(
string.Format("Commit: {0}", seed),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now),
new Signature("TestAuthor", "TestEmail#Test.com", System.DateTimeOffset.Now));
Stopwatch pushWatch = Stopwatch.StartNew();
repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + branch + ":refs/heads/" + branch);
pushWatch.Stop();
Trace.WriteLine(string.Format("Push {0} took {1}ms", seed, pushWatch.ElapsedMilliseconds));
}
/// <summary>
/// Create a new branch from the specified source
/// </summary>
public static string CreateNewBranch(Repository repo, string sourceBranch)
{
Branch source = repo.Branches[sourceBranch];
string newBranch = Guid.NewGuid().ToString();
repo.Branches.Add(newBranch, source.Tip);
Stopwatch pushNewBranchWatch = Stopwatch.StartNew();
repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + newBranch + ":refs/heads/" + newBranch);
pushNewBranchWatch.Stop();
Trace.WriteLine(string.Format("Push of new branch {0} took {1}ms", newBranch, pushNewBranchWatch.ElapsedMilliseconds));
return newBranch;
}
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
char[] buffer = new char[size];
for (int i = 0; i < size; i++)
{
buffer[i] = Characters[Random.Next(Characters.Length)];
}
return new string(buffer);
}

Why batch jobs working on one server but not on other one?

I have same code on two servers.
I have job for adding batch job to the queue
static void Job_ScheduleBatch2(Args _args)
{
BatchHeader batHeader;
BatchInfo batInfo;
RunBaseBatch rbbTask;
str sParmCaption = "My Demonstration (b351) 2014-22-09T04:09";
SysRecurrenceData sysRecurrenceData = SysRecurrence::defaultRecurrence();
;
sysRecurrenceData = SysRecurrence::setRecurrenceStartDateTime(sysRecurrenceData, DateTimeUtil::utcNow());
sysRecurrenceData = SysRecurrence::setRecurrenceUnit(sysRecurrenceData, SysRecurrenceUnit::Minute,1);
rbbTask = new Batch4DemoClass();
batInfo = rbbTask .batchInfo();
batInfo .parmCaption(sParmCaption);
batInfo .parmGroupId(""); // The "Empty batch group".
batHeader = BatchHeader ::construct();
batHeader .addTask(rbbTask);
batHeader.parmRecurrenceData(sysRecurrenceData);
//batHeader.parmAlerts(NoYes::Yes,NoYes::Yes,NoYes::Yes,NoYes::Yes,NoYes::Yes);
batHeader.addUserAlerts(curUserId(),NoYes::No,NoYes::No,NoYes::No,NoYes::Yes,NoYes::No);
batHeader .save();
info(strFmt("'%1' batch has been scheduled.", sParmCaption));
}
and I have a batch job
class Batch4DemoClass extends RunBaseBatch
{
int methodVariable1;
int metodVariable2;
#define.CurrentVersion(1)
#localmacro.CurrentList
methodVariable1,
metodVariable2
endmacro
}
public container pack()
{
return [#CurrentVersion,#CurrentList];
}
public void run()
{
// The purpose of your job.
info(strFmt("epeating batch job Hello from Batch4DemoClass .run at %1"
,DateTimeUtil ::toStr(
DateTimeUtil ::utcNow())
));
}
public boolean unpack(container _packedClass)
{
Version version = RunBase::getVersion(_packedClass);
switch (version)
{
case #CurrentVersion:
[version,#CurrentList] = _packedClass;
break;
default:
return false;
}
return true;
}
On one server it do runs and in batch history I can see it is saving messages to the batch log and on the other server it does nothing. One server is R2 (running) and R3 (not running).
Both codes are translated to CIL. Both servers are Batch server. I can see the batch jobs in USMF/System administration/Area page inquries. Only difference I can find is that jobs on R2 has filled AOS instance name in genereal and in R3 it is empty. In R2 I can see that info messages in log and batch history but there is nothing in R3.
Any idea how to make batch jobs running?

See how to Configure an AOS instance as a batch server.
There are 3 preconditions:
Marked as a batch server
Time interval set correctly
Batch group set correctly
Update:
Deleting the user of a batch job may make the batch job never complete. Once the execute queue has filled, no further progress will happen.
Deleting the offending batch jobs is problematic, as only the owner can do so! Then consider changing BatchJob.aosValidateDelete.

Yes, worth a try to clear the tables, also if you could provide a screenshot of the already running batchjobs and their status, that might be of help further.

[play-framework : 2.1.2-java : thread pool configuration and AsyncResult

We have a typical web-service which serves JSON data read from a remote database. I was trying out returning Result and AsyncResult, each with the following configuration:
play {
akka {
event-handlers = ["akka.event.slf4j.Slf4jEventHandler"]
loglevel = WARNING
actor {
default-dispatcher = {
fork-join-executor {
parallelism-factor = 1.0
parallelism-max = 1
}
}
}
}
}
and one with
parallelism-factor = 1.0
parallelism-max = 5
Following are observations where the time taken to complete 500 requests is given (average of 5 readings):
1. parallelism-max=1 and parallelism-factor=1.0
Result :
Completion time = 291662 ms.
AsyncResult:
Completion time = 55601 ms
2. parallelism-max=5 and parallelism-factor=1.0
Result :
Completion time = 46419 ms.
AsyncResult:
Completion time = 46977 ms
We can see that with parallelism-max=1, AsyncResult clearly takes very less time compare to Result. However, with parallelism-max=5, Result and AsyncResult give very similar timings.
Shouldn't the time required become less as the number of threads increases, for AsyncResult also ?
Requesting for help to understand the reasons behind this observation.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Longer than expected upload times for big query insert - google-bigquery

Related

Cosmos DB - changeFeedWorker not working properly after Cosmos db reach 50 gb size and splitted db into 2 partitions

Hangfire executes job twice

Libgit2sharp Push Performance Degrading With Many Commits

Why batch jobs working on one server but not on other one?

[play-framework : 2.1.2-java : thread pool configuration and AsyncResult

Categories

Resources