The project I am working on uses Git in an unusual way: it writes and pushes one commit at a time, so a single branch could end up with hundreds of thousands of commits. When testing, we found that after only about 500 commits the performance of the Git push started to degrade. After further investigation with a process monitor, we believe the degradation is due to a walk of the entire commit tree for the branch being pushed. Since we only ever push one new commit at a time, is there any way to optimize this?
Alternatively, is there a way to limit the commit history to something like 50 commits to reduce this overhead?
I am using LibGit2Sharp version 0.20.1.0.
Update 1
To test I wrote the following code:
void Main()
{
    string remotePath = @"E:\GIT Test\Remote";
    string localPath = @"E:\GIT Test\Local";
    string localFilePath = Path.Combine(localPath, "TestFile.txt");
    Repository.Init(remotePath, true);
    Repository.Clone(remotePath, localPath);
    Repository repo = new Repository(localPath);
    for(int i = 0; i < 2000; i++)
    {
        File.WriteAllText(localFilePath, RandomString((i % 2 + 1) * 10));
        repo.Stage(localFilePath);
        Commit commit = repo.Commit(
            string.Format("Commit number: {0}", i),
            new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now),
            new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now));
        Stopwatch pushWatch = Stopwatch.StartNew();
        Remote defaultRemote = repo.Network.Remotes["origin"];
        repo.Network.Push(defaultRemote, "refs/heads/master:refs/heads/master");
        pushWatch.Stop();
        Trace.WriteLine(string.Format("Push {0} took {1}ms", i, pushWatch.ElapsedMilliseconds));
    }
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
    char[] buffer = new char[size];
    for (int i = 0; i < size; i++)
    {
        buffer[i] = Characters[Random.Next(Characters.Length)];
    }
    return new string(buffer);
}
I then ran Process Monitor, available here:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
The time for each push was generally low, with large spikes that increased in both frequency and duration as the commits accumulated. Looking at the Process Monitor output, I believe these spikes line up with long stretches where objects in the .git\objects folder were being accessed. For some reason, a push occasionally triggers large reads of the objects which, on closer inspection, appear to be a walk through the commits and objects.
The above flow is a condensed version of what we actually do in the project. In the real flow we first create a new branch "Temp" from "Master", make a commit to "Temp", push "Temp", merge "Temp" into "Master", and then push "Master". When we timed each part of that flow, we found that the push was by far the longest-running operation and that its elapsed time increased as the commits piled up on "Master".
Update 2
I recently updated to use libgit2sharp version 0.20.1.0 and this problem still exists. Does anyone know why this occurs?
Update 3
We changed some of our code to create the temporary branch off the very first commit on the "Master" branch to reduce the commit tree traversal overhead, but found that the traversal still happens. Below is an example that should be easy to compile and run. It shows that the tree traversal happens when you create a new branch, regardless of the commit position. To see the tree traversal I used the Process Monitor tool above and command-line Git Bash to examine each object that was opened. Does anyone know why this happens? Is it expected behavior, or am I doing something wrong? It appears to be the push that causes the issue.
void Main()
{
    string remotePath = @"E:\GIT Test\Remote";
    string localPath = @"E:\GIT Test\Local";
    string localFilePath = Path.Combine(localPath, "TestFile.txt");
    Repository.Init(remotePath, true);
    Repository.Clone(remotePath, localPath);
    // Setup Initial Commit
    string newBranch;
    using (Repository repo = new Repository(localPath))
    {
        CommitRandomFile(repo, 0, localFilePath, "master");
        newBranch = CreateNewBranch(repo, "master");
        repo.Checkout(newBranch);
    }
    // Commit 1000 times to the new branch
    for(int i = 1; i < 1001; i++)
    {
        using(Repository repo = new Repository(localPath))
        {
            CommitRandomFile(repo, i, localFilePath, newBranch);
        }
    }
    // Create a single new branch from the first commit ever
    // For some reason seems to walk the entire commit tree
    using(Repository repo = new Repository(localPath))
    {
        CreateNewBranch(repo, "master");
    }
}
private const string Characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static readonly Random Random = new Random();
/// <summary>
/// Generate and commit a random file to the specified branch
/// </summary>
public static void CommitRandomFile(Repository repo, int seed, string rootPath, string branch)
{
    File.WriteAllText(rootPath, RandomString((seed % 2 + 1) * 10));
    repo.Stage(rootPath);
    Commit commit = repo.Commit(
        string.Format("Commit: {0}", seed),
        new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now),
        new Signature("TestAuthor", "TestEmail@Test.com", System.DateTimeOffset.Now));
    Stopwatch pushWatch = Stopwatch.StartNew();
    repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + branch + ":refs/heads/" + branch);
    pushWatch.Stop();
    Trace.WriteLine(string.Format("Push {0} took {1}ms", seed, pushWatch.ElapsedMilliseconds));
}
/// <summary>
/// Create a new branch from the specified source
/// </summary>
public static string CreateNewBranch(Repository repo, string sourceBranch)
{
    Branch source = repo.Branches[sourceBranch];
    string newBranch = Guid.NewGuid().ToString();
    repo.Branches.Add(newBranch, source.Tip);
    Stopwatch pushNewBranchWatch = Stopwatch.StartNew();
    repo.Network.Push(repo.Network.Remotes["origin"], "refs/heads/" + newBranch + ":refs/heads/" + newBranch);
    pushNewBranchWatch.Stop();
    Trace.WriteLine(string.Format("Push of new branch {0} took {1}ms", newBranch, pushNewBranchWatch.ElapsedMilliseconds));
    return newBranch;
}
/// <summary>
/// Get a Random string of the specified length
/// </summary>
public static string RandomString(int size)
{
    char[] buffer = new char[size];
    for (int i = 0; i < size; i++)
    {
        buffer[i] = Characters[Random.Next(Characters.Length)];
    }
    return new string(buffer);
}
I am trying to write to an S3 sink.
private static StreamingFileSink<String> createS3SinkFromStaticConfig(
final Map<String, Properties> applicationProperties
) {
Properties sinkProperties = applicationProperties.get(SINK_PROPERTIES);
String s3SinkPath = sinkProperties.getProperty(SINK_S3_PATH_KEY);
return StreamingFileSink
.forRowFormat(
new Path(s3SinkPath),
new SimpleStringEncoder<String>(StandardCharsets.UTF_8.toString())
)
.build();
}
The following code works and I can see the results in S3
input.map(value -> { // Parse the JSON
JsonNode jsonNode = jsonParser.readValue(value, JsonNode.class);
return new Tuple2<>(jsonNode.get("ticker").asText(), jsonNode.get("price").asDouble());
}).returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
.keyBy(0) // Logically partition the stream per stock symbol
.timeWindow(Time.seconds(10), Time.seconds(5)) // Sliding window definition
.min(1) // Calculate minimum price per stock over the window
.setParallelism(3) // Set parallelism for the min operator
.map(value -> value.f0 + ": ----- " + value.f1.toString() + "\n")
.addSink(createS3SinkFromStaticConfig(applicationProperties));
But the following doesn't write anything to S3.
KeyedStream<EnrichedMetric, EnrichedMetricKey> input = env.addSource(new EnrichedMetricSource())
.assignTimestampsAndWatermarks(
WatermarkStrategy.<EnrichedMetric>forMonotonousTimestamps()
.withTimestampAssigner(((event, l) -> event.getEventTime()))
).keyBy(new EnrichedMetricKeySelector());
DataStream<String> statsStream = input
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new PValueStatisticsWindowFunction());
statsStream.addSink(createS3SinkFromStaticConfig(applicationProperties));
PValueStatisticsWindowFunction is a ProcessWindowFunction as below.
@Override
public void process(EnrichedMetricKey enrichedMetricKey,
                    Context context,
                    Iterable<EnrichedMetric> in,
                    Collector<String> out) throws Exception {
    int count = 0;
    for (EnrichedMetric m : in) {
        count++;
    }
    out.collect("Count: " + count);
}
When I run the Flink app locally, statsStream.print() prints the results to log/flink-*-taskexecutor-*.out.
In the cluster, I can see in the Flink dashboard that checkpointing is enabled, along with the checkpoint history. I also made sure the S3 path is in the format s3a://<bucket>.
Not sure what I am missing here.
My understanding of Ignite persistent storage is that the data is not only kept in memory but also written to disk.
When the node is restarted, it should read the data from disk back into memory.
So I am using this example to test it out, but I changed it slightly because I don't want to use XML.
This is my slightly updated code.
public class PersistentIgniteExpr {
/**
* Organizations cache name.
*/
private static final String ORG_CACHE = "CacheQueryExample_Organizations";
/** */
private static final boolean UPDATE = true;
public void test(String nodeId) {
// Apache Ignite node configuration.
IgniteConfiguration cfg = new IgniteConfiguration();
// Ignite persistence configuration.
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
// Enabling the persistence.
storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
// Applying settings.
cfg.setDataStorageConfiguration(storageCfg);
List<String> addresses = new ArrayList<>();
addresses.add("127.0.0.1:47500..47502");
TcpDiscoverySpi tcpDiscoverySpi = new TcpDiscoverySpi();
tcpDiscoverySpi.setIpFinder(new TcpDiscoveryMulticastIpFinder().setAddresses(addresses));
cfg.setDiscoverySpi(tcpDiscoverySpi);
try (Ignite ignite = Ignition.getOrStart(cfg.setIgniteInstanceName(nodeId))) {
// Activate the cluster. Required to do if the persistent store is enabled because you might need
// to wait while all the nodes, that store a subset of data on disk, join the cluster.
ignite.active(true);
CacheConfiguration<Long, Organization> cacheCfg = new CacheConfiguration<>(ORG_CACHE);
cacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
cacheCfg.setBackups(1);
cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
cacheCfg.setIndexedTypes(Long.class, Organization.class);
IgniteCache<Long, Organization> cache = ignite.getOrCreateCache(cacheCfg);
if (UPDATE) {
System.out.println("Populating the cache...");
try (IgniteDataStreamer<Long, Organization> streamer = ignite.dataStreamer(ORG_CACHE)) {
streamer.allowOverwrite(true);
for (long i = 0; i < 100_000; i++) {
streamer.addData(i, new Organization(i, "organization-" + i));
if (i > 0 && i % 10_000 == 0)
System.out.println("Done: " + i);
}
}
}
// Run SQL without explicitly calling to loadCache().
QueryCursor<List<?>> cur = cache.query(
new SqlFieldsQuery("select id, name from Organization where name like ?")
.setArgs("organization-54321"));
System.out.println("SQL Result: " + cur.getAll());
// Run get() without explicitly calling to loadCache().
Organization org = cache.get(54321L);
System.out.println("GET Result: " + org);
}
}
}
The first run works as intended.
After that run, I assume the data has been written to disk, since the code is about persistent storage.
For the second run, I commented out this part:
if (UPDATE) {
System.out.println("Populating the cache...");
try (IgniteDataStreamer<Long, Organization> streamer = ignite.dataStreamer(ORG_CACHE)) {
streamer.allowOverwrite(true);
for (long i = 0; i < 100_000; i++) {
streamer.addData(i, new Organization(i, "organization-" + i));
if (i > 0 && i % 10_000 == 0)
System.out.println("Done: " + i);
}
}
}
That is the part where the data is written. When the SQL query is executed on the second run, the result is null. Does that mean the data was not written to disk?
I am also not very clear about TcpDiscoverySpi. Can someone explain it as well?
Thanks in advance.
Do you have any exceptions at node startup?
Most likely you don't have the IGNITE_HOME environment variable configured, so the work directory for persistence is chosen differently each time you run a node.
You can either set the IGNITE_HOME environment variable or add a line of code to set the work directory explicitly: cfg.setWorkDirectory("C:\\workDirectory");
TcpDiscoverySpi provides a way to discover remote nodes in a grid, so that a starting node can join the cluster. It is better to use TcpDiscoveryVmIpFinder if you know the list of IPs. TcpDiscoveryMulticastIpFinder broadcasts UDP messages to the network to discover other nodes and does not require a list of IPs at all.
Please see https://apacheignite.readme.io/docs/cluster-config for more details.
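As a rough illustration of the work-directory suggestion, the sketch below picks one stable, absolute directory on every run: it prefers IGNITE_HOME when the variable is set and otherwise falls back to a fixed folder. The class name WorkDirResolver, the method resolveWorkDir, and the "ignite-work" fallback are hypothetical names chosen for this example, and the "work" subfolder under IGNITE_HOME reflects my understanding of Ignite's default layout, so check it against your version.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class WorkDirResolver {

    /**
     * Returns an absolute directory that stays the same across restarts.
     * igniteHome is the value of the IGNITE_HOME variable (may be null or blank);
     * fallbackDir is a hypothetical fixed folder used when the variable is missing.
     */
    static Path resolveWorkDir(String igniteHome, String fallbackDir) {
        if (igniteHome != null && !igniteHome.trim().isEmpty()) {
            // Assumption: Ignite keeps its work files under <IGNITE_HOME>/work.
            return Paths.get(igniteHome, "work").toAbsolutePath().normalize();
        }
        return Paths.get(fallbackDir).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        Path workDir = resolveWorkDir(System.getenv("IGNITE_HOME"), "ignite-work");
        System.out.println("Work directory: " + workDir);
        // With Ignite on the classpath, the result would be passed on as:
        // cfg.setWorkDirectory(workDir.toString());
    }
}
```

Because the same path is resolved on the first and the second run, the node started without the data streamer should look for its persisted files in the place where the first run wrote them.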
I'm still fairly new to the whole libgit2 and libgit2sharp codebases, but I've been trying to tackle the issue of In-Memory repositories and Refdb storage of references.
I've gotten everything almost working in my new branch:
Create the In-Memory repository and attach Refdb and Odb instance to it.
Create initial commit and set the reference for refs/heads/master.
Create the second commit...
The problem I run into now is updating refs/heads/master to the second commit. The UpdateTarget call to refs/heads/master runs into an error in git_reference_set_target:
LibGit2Sharp.NameConflictException : config value 'user.name' was not found
at LibGit2Sharp.Core.Ensure.HandleError(Int32 result) in \libgit2sharp\LibGit2Sharp\Core\Ensure.cs:line 154
at LibGit2Sharp.Core.Ensure.ZeroResult(Int32 result) in \libgit2sharp\LibGit2Sharp\Core\Ensure.cs:line 172
at LibGit2Sharp.Core.Proxy.git_reference_set_target(ReferenceHandle reference, ObjectId id, String logMessage) in \libgit2sharp\LibGit2Sharp\Core\Proxy.cs:line 2042
at LibGit2Sharp.ReferenceCollection.UpdateDirectReferenceTarget(Reference directRef, ObjectId targetId, String logMessage) in \libgit2sharp\LibGit2Sharp\ReferenceCollection.cs:line 476
at LibGit2Sharp.ReferenceCollection.UpdateTarget(Reference directRef, ObjectId targetId, String logMessage) in \libgit2sharp\LibGit2Sharp\ReferenceCollection.cs:line 470
at LibGit2Sharp.ReferenceCollection.UpdateTarget(Reference directRef, String objectish, String logMessage) in \libgit2sharp\LibGit2Sharp\ReferenceCollection.cs:line 498
at LibGit2Sharp.ReferenceCollection.UpdateTarget(String name, String canonicalRefNameOrObjectish, String logMessage) in \libgit2sharp\LibGit2Sharp\ReferenceCollection.cs:line 534
at LibGit2Sharp.ReferenceCollection.UpdateTarget(String name, String canonicalRefNameOrObjectish) in \libgit2sharp\LibGit2Sharp\ReferenceCollection.cs:line 565
at LibGit2Sharp.Tests.RepositoryFixture.CanCreateInMemoryRepositoryWithBackends() in \libgit2sharp\LibGit2Sharp.Tests\RepositoryFixture.cs:line 791
I have not been able to debug down into the libgit2 level, but from what I can tell the issue is in the call from git_reference_create_matching to git_reference__log_signature.
Is there a call I am missing that can update a reference in a bare repo without requiring a signature? If any libgit2 guys know how to do this in libgit2, I can implement it on the C# side.
As a test, I created two unit tests that perform the same actions In-Memory and on disk, and the In-Memory fails when calling UpdateTarget after creating the second commit. This follows the code on the wiki:
private Commit CreateCommit(Repository repository, string fileName, string content, string message = null)
{
if (message == null)
{
message = "i'm a commit message :)";
}
Blob newBlob = repository.ObjectDatabase.CreateBlobFromContent(content);
// Put the blob in a tree
TreeDefinition td = new TreeDefinition();
td.Add(fileName, newBlob, Mode.NonExecutableFile);
Tree tree = repository.ObjectDatabase.CreateTree(td);
// Committer and author
Signature committer = new Signature("Auser", "auser@example.com", DateTime.Now);
Signature author = committer;
// Create binary stream from the text
return repository.ObjectDatabase.CreateCommit(
author,
committer,
message,
tree,
repository.Commits,
true);
}
[Fact]
public void CanCreateRepositoryWithoutBackends()
{
SelfCleaningDirectory scd = BuildSelfCleaningDirectory();
Repository.Init(scd.RootedDirectoryPath, true);
ObjectId commit1Id;
using (var repository = new Repository(scd.RootedDirectoryPath))
{
Commit commit1 = CreateCommit(repository, "filePath.txt", "Hello commit 1!");
commit1Id = commit1.Id;
repository.Refs.Add("refs/heads/master", commit1.Id);
Assert.Equal(1, repository.Commits.Count());
Assert.NotNull(repository.Refs.Head);
Assert.Equal(1, repository.Refs.Count());
}
using (var repository = new Repository(scd.RootedDirectoryPath))
{
Commit commit2 = CreateCommit(repository, "filePath.txt", "Hello commit 2!");
Assert.Equal(commit1Id, commit2.Parents.First().Id);
repository.Refs.UpdateTarget("refs/heads/master", commit2.Sha);
Assert.Equal(2, repository.Commits.Count());
Assert.Equal(1, repository.Refs.Count());
Assert.NotNull(repository.Refs.Head);
Assert.Equal(commit2.Sha, repository.Refs.Head.ResolveToDirectReference().TargetIdentifier);
}
}
[Fact]
public void CanCreateInMemoryRepositoryWithBackends()
{
OdbBackendFixture.MockOdbBackend odbBackend = new OdbBackendFixture.MockOdbBackend();
RefdbBackendFixture.MockRefdbBackend refdbBackend = new RefdbBackendFixture.MockRefdbBackend();
ObjectId commit1Id;
using (var repository = new Repository())
{
repository.Refs.SetBackend(refdbBackend);
repository.ObjectDatabase.AddBackend(odbBackend, 5);
Commit commit1 = CreateCommit(repository, "filePath.txt", "Hello commit 1!");
commit1Id = commit1.Id;
repository.Refs.Add("refs/heads/master", commit1.Id);
Assert.Equal(1, repository.Commits.Count());
Assert.NotNull(repository.Refs.Head);
Assert.Equal(commit1.Sha, repository.Refs.Head.ResolveToDirectReference().TargetIdentifier);
// Emulating Git, repository.Refs enumerable does not include the HEAD.
// Thus, repository.Refs.Count will be 1 and refdbBackend.References.Count will be 2.
Assert.Equal(1, repository.Refs.Count());
Assert.Equal(2, refdbBackend.References.Count);
}
using (var repository = new Repository())
{
repository.Refs.SetBackend(refdbBackend);
repository.ObjectDatabase.AddBackend(odbBackend, 5);
Commit commit2 = CreateCommit(repository, "filePath.txt", "Hello commit 2!");
Assert.Equal(commit1Id, commit2.Parents.First().Id);
//repository.Refs.UpdateTarget(repository.Refs["refs/heads/master"], commit2.Id);
//var master = repository.Refs["refs/heads/master"];
//Assert.Equal(commit1Id.Sha, master.TargetIdentifier);
repository.Refs.UpdateTarget("refs/heads/master", commit2.Sha); // fails at LibGit2Sharp.Core.Proxy.git_reference_set_target(ReferenceHandle reference, ObjectId id, String logMessage)
//repository.Refs.Add("refs/heads/master", commit2.Id); // fails at LibGit2Sharp.Core.Proxy.git_reference_create(RepositoryHandle repo, String name, ObjectId targetId, Boolean allowOverwrite, String logMessage)
Assert.Equal(2, repository.Commits.Count());
Assert.Equal(1, repository.Refs.Count());
Assert.NotNull(repository.Refs.Head);
Assert.Equal(commit2.Sha, repository.Refs.Head.ResolveToDirectReference().TargetIdentifier);
}
}
I have found some suggestions on how to add a block to a page, but can't get it to work the way I want, so perhaps someone can help out.
What I want to do is have a scheduled job that reads through a file and creates new pages of a certain page type, adding some blocks to a content property on each new page. The blocks' fields will be filled with data from the file being read.
I have the following code in the scheduled job, but it fails at
repo.Save((IContent) newBlock, SaveAction.Publish);
giving the error
The page name must contain at least one visible character.
This is my code:
public override string Execute()
{
//Call OnStatusChanged to periodically notify progress of job for manually started jobs
OnStatusChanged(String.Format("Starting execution of {0}", this.GetType()));
//Create Person page
PageReference parent = PageReference.StartPage;
//IContentRepository contentRepository = EPiServer.ServiceLocation.ServiceLocator.Current.GetInstance<IContentRepository>();
//IContentTypeRepository contentTypeRepository = EPiServer.ServiceLocation.ServiceLocator.Current.GetInstance<IContentTypeRepository>();
//var repository = EPiServer.ServiceLocation.ServiceLocator.Current.GetInstance<IContentRepository>();
//var slaegtPage = repository.GetDefault<SlaegtPage>(ContentReference.StartPage);
IContentRepository contentRepository = EPiServer.ServiceLocation.ServiceLocator.Current.GetInstance<IContentRepository>();
IContentTypeRepository contentTypeRepository = EPiServer.ServiceLocation.ServiceLocator.Current.GetInstance<IContentTypeRepository>();
SlaegtPage slaegtPage = contentRepository.GetDefault<SlaegtPage>(parent, contentTypeRepository.Load("SlaegtPage").ID);
if (slaegtPage.MainContentArea == null) {
slaegtPage.MainContentArea = new ContentArea();
}
slaegtPage.PageName = "001 Kim";
//Create block
var repo = ServiceLocator.Current.GetInstance<IContentRepository>();
var newBlock = repo.GetDefault<SlaegtPersonBlock1>(ContentReference.GlobalBlockFolder);
newBlock.PersonId = "001";
newBlock.PersonName = "Kim";
newBlock.PersonBirthdate = "01 jan 1901";
repo.Save((IContent) newBlock, SaveAction.Publish);
//Add block
slaegtPage.MainContentArea.Items.Add(new ContentAreaItem
{
ContentLink = ((IContent) newBlock).ContentLink
});
slaegtPage.URLSegment = UrlSegment.CreateUrlSegment(slaegtPage);
contentRepository.Save(slaegtPage, EPiServer.DataAccess.SaveAction.Publish);
_stopSignaled = true;
//For long running jobs periodically check if stop is signaled and if so stop execution
if (_stopSignaled) {
return "Stop of job was called";
}
return "Change to message that describes outcome of execution";
}
You can set the Name by
((IContent) newBlock).Name = "MyName";
I am uploading data to BigQuery in CSV format with JSON schemas. What I am seeing is very long load times. I take the start and end times from pollJob.getStatistics() when the load is DONE and compute a delta time as (endTime - startTime)/1000. Then I look at the number of bytes loaded. The data comes from files stored in Google Cloud Storage that I reprocess in App Engine to do some reformatting. I convert the string into a byte stream and then pass it as the contents of the load, as follows:
public static void uploadFileToBigQuerry(TableSchema tableSchema,String tableData,String datasetId,String tableId,boolean formatIsJson,int waitSeconds,String[] fileIdElements) {
/* Init diagnostic */
String projectId = getProjectId();
if (ReadAndroidRawFile.testMode) {
String s = String.format("My project ID at start of upload to BQ:%s datasetID:%s tableID:%s json:%b \nschema:%s tableData:\n%s\n",
projectId,datasetId,tableId,formatIsJson,tableSchema.toString(),tableData);
log.info(s);
}
else {
String s = String.format("Upload to BQ tableID:%s tableFirst60Char:%s\n",
tableId,tableData.substring(0,60));
log.info(s);
}
/* Setup the data each time */
Dataset dataset = new Dataset();
DatasetReference datasetRef = new DatasetReference();
datasetRef.setProjectId(projectId);
datasetRef.setDatasetId(datasetId);
dataset.setDatasetReference(datasetRef);
try {
bigquery.datasets().insert(projectId, dataset).execute();
} catch (IOException e) {
if (ReadAndroidRawFile.testMode) {
String se = String.format("Exception creating datasetId:%s",e);
log.info(se);
}
}
/* Set destination table */
TableReference destinationTable = new TableReference();
destinationTable.setProjectId(projectId);
destinationTable.setDatasetId(datasetId);
destinationTable.setTableId(tableId);
/* Common setup line */
JobConfigurationLoad jobLoad = new JobConfigurationLoad();
/* Handle input format */
if (formatIsJson) {
jobLoad.setSchema(tableSchema);
jobLoad.setSourceFormat("NEWLINE_DELIMITED_JSON");
jobLoad.setDestinationTable(destinationTable);
jobLoad.setCreateDisposition("CREATE_IF_NEEDED");
jobLoad.setWriteDisposition("WRITE_APPEND");
jobLoad.set("Content-Type", "application/octet-stream");
}
else {
jobLoad.setSchema(tableSchema);
jobLoad.setSourceFormat("CSV");
jobLoad.setDestinationTable(destinationTable);
jobLoad.setCreateDisposition("CREATE_IF_NEEDED");
jobLoad.setWriteDisposition("WRITE_APPEND");
jobLoad.set("Content-Type", "application/octet-stream");
}
/* Setup the job config */
JobConfiguration jobConfig = new JobConfiguration();
jobConfig.setLoad(jobLoad);
JobReference jobRef = new JobReference();
jobRef.setProjectId(projectId);
Job outputJob = new Job();
outputJob.setConfiguration(jobConfig);
outputJob.setJobReference(jobRef);
/* Convert input string into byte stream */
ByteArrayContent contents = new ByteArrayContent("application/octet-stream",tableData.getBytes());
int timesToSleep = 0;
try {
Job job = bigquery.jobs().insert(projectId,outputJob,contents).execute();
if (job == null) {
log.info("Job is null...");
throw new Exception("Job is null");
}
String jobIdNew = job.getId();
//log.info("Job is NOT null...id:");
//s = String.format("job ID:%s jobRefId:%s",jobIdNew,job.getJobReference());
//log.info(s);
while (true) {
try{
Job pollJob = bigquery.jobs().get(jobRef.getProjectId(), job.getJobReference().getJobId()).execute();
String status = pollJob.getStatus().getState();
String errors = "";
String workingDataString = "";
if ((timesToSleep % 10) == 0) {
String statusString = String.format("Job status (%dsec) JobId:%s status:%s\n", timesToSleep, job.getJobReference().getJobId(), status);
log.info(statusString);
}
if (pollJob.getStatus().getState().equals("DONE")) {
status = String.format("Job done, processed %s bytes\n", pollJob.getStatistics().toString()); // getTotalBytesProcessed());
log.info(status); // compute load stats with this string
if ((pollJob.getStatus().getErrors() != null)) {
errors = pollJob.getStatus().getErrors().toString();
log.info(errors);
}
The performance I get is as follows: the median upload of BYTES/(deltaTime) is 17 BYTES/sec! Yes, bytes, not kilo or mega...
Worse is that sometimes for only a few hundred bytes, just one row, it takes up to 5 minutes. I generally have no errors, but I am thinking that with this performance, I will not be able to upload each app before more data arrives. I am processing with a task queue in a backends instance. This task queue gets a time-out after about an hour of processing.
Is this poor performance because of the contents method?
A couple of things:
If you are loading a small amount of data, you may be better off using TableData.insertAll() rather than a load job, which lets you post the data and have it be available immediately.
Load jobs are Batch oriented jobs. That is, you can insert (more or less) as many as you'd like and they'll be processed when there are resources to do so. Sometimes you create a job and the worker pool is resizing so you have to wait. Sometimes the worker pool is full.
If you provide a project & Job ID we can look into the performance of individual jobs to see what's taking so long.
Load jobs process in parallel; that is, once they start executing they should go very quickly, but the time to start executing may take a long time.
There are three time fields in the job statistics. createTime, startTime, and endTime.
createTime is the moment the BigQuery server receives your request.
startTime is when BigQuery actually starts working on your job.
endTime is when the job is completely done.
I'd expect that most of the time is being spent between create and start. If that is not the case for small jobs, then something strange is going on, and a Job ID would help diagnose the issue.
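To check where the time goes, the three timestamps can be split into queue wait and execution time. The sketch below assumes they are available as epoch milliseconds (which is how I believe the Java client reports them, but verify against your client version). The class JobTiming, its method names, and the sample numbers are made up for illustration.

```java
import java.time.Duration;

class JobTiming {

    /** Time between the server receiving the request and work actually starting. */
    static Duration queueWait(long createTimeMs, long startTimeMs) {
        return Duration.ofMillis(startTimeMs - createTimeMs);
    }

    /** Time BigQuery spent actually working on the job. */
    static Duration execution(long startTimeMs, long endTimeMs) {
        return Duration.ofMillis(endTimeMs - startTimeMs);
    }

    /** Throughput over the execution phase only; NaN if the duration is zero. */
    static double bytesPerSecond(long bytes, Duration execution) {
        double seconds = execution.toMillis() / 1000.0;
        return seconds > 0 ? bytes / seconds : Double.NaN;
    }

    public static void main(String[] args) {
        // Hypothetical timestamps (epoch milliseconds) for a single small load job.
        long createTimeMs = 1_000L;
        long startTimeMs = 291_000L;
        long endTimeMs = 293_000L;
        long bytesLoaded = 400L;

        Duration wait = queueWait(createTimeMs, startTimeMs);
        Duration run = execution(startTimeMs, endTimeMs);
        System.out.println("Queue wait: " + wait.getSeconds() + " s");
        System.out.println("Execution:  " + run.toMillis() + " ms");
        System.out.println("Throughput: " + bytesPerSecond(bytesLoaded, run) + " bytes/s");
    }
}
```

If the queue wait dominates, the low bytes-per-second figure says more about scheduling than about the upload itself.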