BigQuery Streaming Insertion takes time in minutes for write-heavy application - google-bigquery

I have a write-heavy Spring Boot application integrated with BigQuery under heavy load, and some entries take up to 10 minutes to insert. Here is my configuration:
Number of Entries Stored: 1 Million/min
Number of pods : 100
Insertion Type : Streaming Data (using JsonStreamWriter)
Deployed Cloud : Azure
Average time taken for insertion : 650 ms
Max time taken : 22 mins (for a single insert)
Number of Threads Per Pod : 15 threads
Each pod has a BigQuery connection and tries to insert into BigQuery. Because 10% of the inserts take minutes, we are facing a lot of timeouts and performance issues. Is there an efficient way to write data to BigQuery under such large loads?
We use the following Google client libraries:
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-storage</artifactId>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-bigquerystorage</artifactId>
</dependency>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-bigquery</artifactId>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>libraries-bom</artifactId>
<version>25.4.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
private void updateRequestMetadataOperations(JSONArray requestMetaDataArr){
JSONArray firstObjArr = new JSONArray();
JSONObject firstTableJsonObj = new JSONObject();
firstTableJsonObj.put("firstColumn",firstColumnVal);
firstTableJsonObj.put("secondColumn",secondColumnVal);
firstTableJsonObj.put("thirdColumn",thirdColumnVal);
firstTableJsonObj.put("fourthColumn",fourthColumnVal);
firstTableJsonObj.put("fifthColumn",fifthColumnVal);
firstTableJsonObj.put("sixthColumn",sixthColumnVal);
.
.
.
firstTableJsonObj.put("twentyColumn",twentyColumnVal);
firstObjArr.put(firstTableJsonObj);
}
public void insertIntoBigQuery(String tableName, JSONArray jsonArr) throws Exception{
if(jsonArr.length()==0){
return;
}
JsonStreamWriter jsonStreamWriter = JsonStreamWriterUtil.getWriteStreamMap(tableName);
if(jsonStreamWriter!=null) {
jsonStreamWriter.append(jsonArr);
}
}
public JsonStreamWriter createWriteStream(String table) throws IOException, Descriptors.DescriptorValidationException, InterruptedException {
BigQueryWriteClient bqClient = BigQueryWriteClient.create();
WriteStream stream = WriteStream.newBuilder().setType(WriteStream.Type.COMMITTED).build();
TableName tableName = TableName.of("ProjectId", "DataSet", table);
CreateWriteStreamRequest createWriteStreamRequest =
CreateWriteStreamRequest.newBuilder()
.setParent(tableName.toString())
.setWriteStream(stream)
.build();
WriteStream writeStream = bqClient.createWriteStream(createWriteStreamRequest);
JsonStreamWriter jsonStreamWriter = JsonStreamWriter
.newBuilder(writeStream.getName(), writeStream.getTableSchema())
.build();
return jsonStreamWriter;
}

In general, BigQuery streaming inserts are meant for small, real-time updates and are light and fast. Batch loading, on the other hand, accepts file uploads and is meant for larger, heavier updates. The BigQuery Storage Write API is a unified data-ingestion API for BigQuery that combines streaming ingestion and batch loading into a single high-performance API.
Is there an efficient way to write data in BigQuery with such large loads?
The insert process is optimized for bulk operations, resulting in much higher performance and lower loading times. Also note that a large payload can lead to a slow insert, especially if it comes from an outside network with additional latency.
If you need more speed, you can opt for an asynchronous approach. You can also consider using a message bus like Pub/Sub with Dataflow to write into BigQuery. The Storage Write API is a gRPC API that uses bidirectional connections. The AppendRows method creates a connection to a stream. Generally, a single connection supports at least 1 MB/s of throughput. The upper bound depends on several factors, such as network bandwidth, the schema of the data, and server load, but can exceed 10 MB/s. If you require more throughput, create more connections.
You can also refer to this document for quotas and limits.
Your createWriteStream method seems fine as per the GCP code example.
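To illustrate the asynchronous approach mentioned above, here is a minimal sketch that caps the number of batches in flight and reacts to completion in a callback instead of blocking per insert. It deliberately uses only the JDK: CompletableFuture stands in for the future that JsonStreamWriter.append returns, and the class name BoundedAsyncAppender, the sink function, and the demo numbers are all hypothetical. Treat it as a pattern to adapt, not as the client library's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: limits how many append calls are pending at once.
class BoundedAsyncAppender {
    private final Semaphore inFlight;
    // Stand-in for the real writer: takes a batch, returns a future.
    private final Function<List<String>, CompletableFuture<Void>> sink;
    private final AtomicInteger failures = new AtomicInteger();

    BoundedAsyncAppender(int maxInFlight,
                         Function<List<String>, CompletableFuture<Void>> sink) {
        this.inFlight = new Semaphore(maxInFlight);
        this.sink = sink;
    }

    // Blocks only while maxInFlight batches are already pending.
    CompletableFuture<Void> append(List<String> batch) {
        inFlight.acquireUninterruptibly();
        CompletableFuture<Void> result;
        try {
            result = sink.apply(batch);
        } catch (RuntimeException e) {
            inFlight.release();
            failures.incrementAndGet();
            throw e;
        }
        // Free the slot and record failures when the append finishes.
        return result.whenComplete((ok, err) -> {
            inFlight.release();
            if (err != null) {
                failures.incrementAndGet();
            }
        });
    }

    int failureCount() {
        return failures.get();
    }

    // Sends `batches` batches through a fake sink; returns rows written.
    static int demo(int batches, int rowsPerBatch, int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger written = new AtomicInteger();
        BoundedAsyncAppender appender = new BoundedAsyncAppender(
                maxInFlight,
                batch -> CompletableFuture.runAsync(
                        () -> written.addAndGet(batch.size()), pool));
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int b = 0; b < batches; b++) {
            List<String> rows = new ArrayList<>();
            for (int r = 0; r < rowsPerBatch; r++) {
                rows.add("row-" + b + "-" + r);
            }
            pending.add(appender.append(rows));
        }
        // Wait for every outstanding batch before reporting.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        return written.get();
    }

    public static void main(String[] args) {
        System.out.println("rows written: " + demo(20, 50, 5));
    }
}
```

The in-flight cap provides backpressure: when the backend slows down, callers wait for a free slot rather than piling up unbounded pending requests that later time out.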

Related

Infinispan clustered lock performance does not improve with more nodes?

I have a piece of code that is essentially executing the following with Infinispan in embedded mode, using version 13.0.0 of the -core and -clustered-lock modules:
@Inject
lateinit var lockManager: ClusteredLockManager
private fun getLock(lockName: String): ClusteredLock {
lockManager.defineLock(lockName)
return lockManager.get(lockName)
}
fun createSession(sessionId: String) {
tryLockCounter.increment()
logger.debugf("Trying to start session %s. trying to acquire lock", sessionId)
Future.fromCompletionStage(getLock(sessionId).lock()).map {
acquiredLockCounter.increment()
logger.debugf("Starting session %s. Got lock", sessionId)
}.onFailure {
logger.errorf(it, "Failed to start session %s", sessionId)
}
}
I take this piece of code and deploy it to Kubernetes. I then run it in six pods distributed over six nodes in the same region. The code exposes createSession with random GUIDs through an API. This API is called and creates sessions in chunks of 500, using a k8s service in front of the pods, which means the load gets balanced over the pods. I notice that the execution time to acquire a lock grows linearly with the number of sessions. In the beginning it's around 10ms; when there are about 20_000 sessions it takes about 100ms, and the trend continues in a stable fashion.
I then take the same code and run it, but this time with twelve pods on twelve nodes. To my surprise, the performance characteristics are almost identical to when I had six pods. I've been digging into the code but still haven't figured out why this is. Is there a good reason why Infinispan doesn't seem to perform better here with more nodes?
For completeness the configuration of the locks are as follows:
val global = GlobalConfigurationBuilder.defaultClusteredBuilder()
global.addModule(ClusteredLockManagerConfigurationBuilder::class.java)
.reliability(Reliability.AVAILABLE)
.numOwner(1)
and looking at the code, the clustered lock module uses DIST_SYNC, which should spread the load of the cache across the different nodes.
UPDATE:
The two counters in the code above are simply micrometer counters. It is through them and prometheus that I can see how the lock creation starts to slow down.
It's correctly observed that there's one lock created per session id; this is by design. Our use case is that we want to ensure that a session is running in at least one place. Without going too deep into detail, this can be achieved by ensuring that at least two pods are trying to acquire the same lock. The Infinispan library is great in that it tells us directly when the lock holder dies without any extra chattiness between pods, which means that we have a "cheap" way of ensuring that execution of the session continues when one pod is removed.
After digging deeper into the code I found the following in CacheNotifierImpl in the core library:
private CompletionStage<Void> doNotifyModified(K key, V value, Metadata metadata, V previousValue,
Metadata previousMetadata, boolean pre, InvocationContext ctx, FlagAffectedCommand command) {
if (clusteringDependentLogic.running().commitType(command, ctx, extractSegment(command, key), false).isLocal()
&& (command == null || !command.hasAnyFlag(FlagBitSets.PUT_FOR_STATE_TRANSFER))) {
EventImpl<K, V> e = EventImpl.createEvent(cache.wired(), CACHE_ENTRY_MODIFIED);
boolean isLocalNodePrimaryOwner = isLocalNodePrimaryOwner(key);
Object batchIdentifier = ctx.isInTxScope() ? null : Thread.currentThread();
try {
AggregateCompletionStage<Void> aggregateCompletionStage = null;
for (CacheEntryListenerInvocation<K, V> listener : cacheEntryModifiedListeners) {
// Need a wrapper per invocation since converter could modify the entry in it
configureEvent(listener, e, key, value, metadata, pre, ctx, command, previousValue, previousMetadata);
aggregateCompletionStage = composeStageIfNeeded(aggregateCompletionStage,
listener.invoke(new EventWrapper<>(key, e), isLocalNodePrimaryOwner));
}
The lock library registers a clustered listener on the entry-modified event, with a filter so that it is only notified when the key for its lock is modified. It seems to me that the core library still has to check this condition against every registered listener, and that list becomes very large as the number of sessions grows. I suspect this is the reason. If so, it would be great if the core library supported a kind of key filter, so that it could look listeners up in a hash map instead of iterating over the whole list of listeners.
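The hash-map idea can be sketched in plain Java as follows. This is not Infinispan code: KeyedListenerRegistry and its methods are hypothetical names, and the sketch only shows that indexing listeners by key makes the cost of a notification depend on the listeners for that key rather than on the total number registered.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical registry: listeners are indexed by the key they care about.
class KeyedListenerRegistry<K, V> {
    private final Map<K, List<Consumer<V>>> byKey = new ConcurrentHashMap<>();

    void register(K key, Consumer<V> listener) {
        byKey.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Invokes only the listeners registered for `key`; returns how many ran.
    int notifyModified(K key, V value) {
        List<Consumer<V>> listeners =
                byKey.getOrDefault(key, Collections.<Consumer<V>>emptyList());
        for (Consumer<V> listener : listeners) {
            listener.accept(value);
        }
        return listeners.size();
    }

    // Registers one listener per session, then modifies a single key.
    static int demo(int sessions) {
        KeyedListenerRegistry<String, String> registry = new KeyedListenerRegistry<>();
        for (int i = 0; i < sessions; i++) {
            registry.register("session-" + i, value -> { });
        }
        return registry.notifyModified("session-42", "ACQUIRED");
    }

    public static void main(String[] args) {
        // With 20_000 registered listeners, one modification touches one of them.
        System.out.println("listeners invoked: " + demo(20_000));
    }
}
```

With a flat list, the same notification would evaluate the filter of all 20_000 listeners, which matches the linear growth described above.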
I believe you are creating a clustered lock per session id. Is this what you need? What is the acquiredLockCounter? We are about to deprecate the "lock" method in favour of "tryLock" with a timeout, since the lock method will block forever if the clustered lock is never acquired. Do you ever unlock the clustered lock in another piece of code? If you could share a complete reproducer, it would be very helpful for us. Thanks!

Why Apache Ignite Cache.replace-K-V-V api call performing slow?

We are running an Ignite cluster with 12 nodes on Ignite 2.7.0, OpenJDK 1.8, on RHEL.
We are seeing heavy CPU time spent in https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#replace-K-V-V-
One of our processes is slow, and when we drilled into it by profiling the JVM, the main culprit (taking ~78% of total time) seems to be the Ignite cache.replace(K,V,V) API call.
Out of the 77.9% spent in replace, 39% is taken by GridCacheAdapter.equalVal and 38.5% by GridCacheAdapter.put.
The cache is PARTITIONED and ATOMIC, with readThrough, writeThrough, and writeBehindEnabled set to true.
I am attaching the profiling snapshot of one node (the results on other nodes are similar). Can someone please check and suggest what could be the cause, or whether there is a known performance issue in this Ignite version related to the cache.replace(K,V,V) API?
JVM Profiling Snapshot of one node
I guess it may be related to the following issue:
https://issues.apache.org/jira/browse/IGNITE-5003
The problem there relates to operations on a key that arrive before the previous batch of updates containing that key has been stored in the database.
As far as I can see, the fix should be included in Ignite 2.8.
Update:
I tested the putAll operation. From the next two pictures you can see that putAll waits for GridCacheWriteBehindStore.write (in two different threads), which contains updateCache:
public void write(Entry<? extends K, ? extends V> entry) {
try {
if (log.isDebugEnabled())
log.debug(S.toString("Store put",
"key", entry.getKey(), true,
"val", entry.getValue(), true));
updateCache(entry.getKey(), entry, StoreOperation.PUT);
}
So the issue above can affect your put operations (and replace as well).

TestNg DataProvider not running all tests simultaneously but as batch

I am using Sauce Labs to run a Selenium TestNG Java script in which a single @Test method accepts 250 distinct values from a TestNG @DataProvider as input. Expected: 250 browser sessions are spawned in parallel in Sauce Labs and the @Test method executes 250 times in parallel.
Actual: I can see only a maximum of 10-12 sessions at a time; the remaining sessions follow as the running batch completes.
Please find my code below.
POM.XML snippet:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.12.4</version>
<configuration>
<parallel>methods</parallel>
<threadCount>250</threadCount>
<data-provider-thread-count>250</data-provider-thread-count>
<redirectTestOutputToFile>false</redirectTestOutputToFile>
</configuration>
</plugin>
DataProvider Code:
@DataProvider(name="SearchData", parallel=true)
public Object[][] GetSearchData() {
//Returning 2D array of Test Data
Object[][] arrayObject = readFromExcel("C:/Test_Workspace/TestData/ICJ-DataProvider.xls","Sheet1");
return arrayObject;
}
@Test(dataProvider = "SearchData")
public void TestE2E(String hocn, String username, String password, Method method)
throws MalformedURLException, InvalidElementStateException, UnexpectedException {
this.createDriver("chrome", "54.0", "Windows 10", method.getName());
WebDriver driver = this.getWebDriver();
Service.visitPage(driver, hocn, username, password);
}
As you can see, I am passing threadCount=250 and data-provider-thread-count=250 from pom.xml. Still, it runs in batches of 10 to complete the 250 rows from the data provider.
Image showing only 10 instances at a time instead of 250
Can someone please guide me in getting all 250 sessions up at the same time?
The problem has got nothing to do with TestNG.
You are being throttled by SauceLabs.
Quoting the SauceLabs documentation.
Checking Your Concurrency Limit
Each Sauce Labs account has a set maximum number of concurrent
sessions. You can find your concurrency limit on the My Account page
(at https://saucelabs.com/beta/users/username). If this number does
not match your subscription or invoiced contract, please contact
Support.
Subaccounts may have had their concurrency limit lowered by their
parent account. To access higher concurrency levels, you will need to
ask the person responsible for the parent account to increase your
limit.
For more information, please refer to the below documentation on SauceLabs portal.
Why am I not getting the parallelism/concurrency I expected?
Understanding Concurrency Limits and Team Accounts
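As a local analogy for this kind of throttling (not a description of how Sauce Labs is implemented), the sketch below runs 250 tasks on 250 threads while a semaphore plays the role of an account-level concurrency limit of 10. The class name ConcurrencyCapDemo and all numbers are assumptions for illustration. However many client threads exist, the number of tasks actually running at once never exceeds the limit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: a remote-side limit caps effective parallelism.
class ConcurrencyCapDemo {

    // Returns the highest number of tasks that were running simultaneously.
    static int maxObserved(int tasks, int threads, int remoteLimit) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Semaphore remoteSlots = new Semaphore(remoteLimit); // stand-in for the account limit
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                // Every thread is ready immediately, but must wait for a slot.
                remoteSlots.acquireUninterruptibly();
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // simulated browser session
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    remoteSlots.release();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + maxObserved(250, 250, 10));
    }
}
```

This is why raising threadCount or data-provider-thread-count on the client side does not help once the service-side limit is reached.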

Solr DataImportHandler - batchSize="-1" does not work

I am using Apache Solr 6.4.1.
Because I am using a really big database (over 3 million rows), I would like to add batchSize="-1" in db-data-config.xml.
But if I do this, it does not work. Without batchSize I can get the first 2k rows, then I get a "java.lang.RuntimeException: java.lang.StackOverflowError" error.
In Solrconfig.xml
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
In db-data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://***:1433;integratedSecurity=true;
Initial Catalog=***;"
batchSize="-1"/>
...
Why is batchSize="-1" not working? (batchSize="200" or another value works.)
UPDATE
If I set debug to false in the DataImportHandler, then it works!
I don't think setting batchSize to '-1' would help in your situation. This is what the Solr DataImportHandler source code does:
if (batchSize == -1)
batchSize = Integer.MIN_VALUE;
[... omissis ...]
Statement statement = c.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
statement.setFetchSize(batchSize);
So double-check which values the MS JDBC driver accepts for the setFetchSize method.
setFetchSize - Gives the JDBC driver a hint as to the number of rows
that should be fetched from the database when more rows are needed for
ResultSet objects generated by this
Statement. If the value specified is zero, then the hint
is ignored. The default value is zero.
So the driver is free to ignore this hint; maybe it is just reading in the whole table. You could also try changing the version of your JDBC driver.
I think you should first adapt the value to the network latency and the number of records you want to retrieve in each round trip.
Indexing performance and MS SQL Server load depend on the batch size. Try starting with a small size and then gradually increase it.
If this does not work, try switching to a different JDBC driver altogether.
Returning to the batchSize parameter, there are only a few cases where you don't need it. Generally this is the behaviour the method should have:
if you have configured your JVM with enough memory to read the entire table
if your JDBC driver would raise an exception when invoking the setFetchSize() method
if you're dealing with the MySQL JDBC driver, which has a known bug

Hive: acquire explicit exclusive lock

Configuration (hortonworks)
hive: BUILD hive-1.2.1.2.3.0.0
Hadoop 2.7.1.2.3.0.0-2557
I'm trying to execute
lock table event_metadata EXCLUSIVE;
Hive response:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Current transaction manager does not support explicit lock requests. Transaction manager: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
In the code there is obvious place where explicit locks are disabled:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.2.0/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java#DbTxnManager
@Override
public boolean supportsExplicitLock() {
return false;
}
Questions:
How can I make explicit locks work? In what version of Hive do they appear?
Here is an example for Cloudera showing that explicit locks work: http://www.ericlin.me/how-table-locking-works-in-hive
You may set the concurrency parameter on the fly:
set hive.support.concurrency=true;
After this you may try executing your command
Hive includes a locking feature that uses Apache Zookeeper for locking. Zookeeper implements highly reliable distributed coordination. Other than some additional setup and configuration steps, Zookeeper is invisible to Hive users.
In the $HIVE_HOME/hive-site.xml file, set the following properties:
<property>
<name>hive.zookeeper.quorum</name>
<value>zk1.site.pvt,zk1.site.pvt,zk1.site.pvt</value>
<description>The list of zookeeper servers to talk to.
This is only needed for read/write locks.
</description>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
<description>Whether Hive supports concurrency or not.
A Zookeeper instance must be up and running for the default Hive lock manager to support read-write locks.</description>
</property>
After restarting hive, run the command
hive> lock table event_metadata EXCLUSIVE;
Reference: Programming Hive, O'Reilly
EDIT:
DummyTxnManager.java, which provides default Hive behavior, has
@Override
public boolean supportsExplicitLock() {
return true;
}
DummyTxnManager replicates pre-Hive-0.13 behavior and doesn't support transactions,
whereas DbTxnManager.java, which stores the transactions in the metastore database, has:
@Override
public boolean supportsExplicitLock() {
return false;
}
Try the following:
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;
unlock table tablename;