BigQuery streaming: missing data after new table created

We recently noticed that, within a short period after a new table is created, data that was streamed in without any exceptions or errors simply went missing. Is there any known grace period that streaming should wait for?

I finally figured out what happened by printing out trace info step by step. Multi-threading helped cover up the issue for a long time.
This is the original 'missing data' code used to create a table:
insert = sBIGQUERY.tables().insert(mProjectId, mDataset, table);
// the execute() call below never returned, so nothing after this line ran
logger.info("Table " + tid.toString() + " is created at "
        + new Date(insert.execute().getCreationTime()));
where insert.execute().getCreationTime() never returned (I don't know why), and thus the rest of my process (putting the data back on the sending queue to wait for the next stream) didn't execute.
After I changed it to:
sBIGQUERY.tables().insert(mProjectId, mDataset, table).execute();
logger.info("Table " + tid.toString()+" is created");
It runs properly and we get all the data up to BQ.
@Jordan Tigani, do you know why getCreationTime() never comes back (or takes longer than I can wait for)?

There is a 'warm up' time of a few seconds after streaming first occurs on a table before it is available for querying. There is a similar warm up time if you stop streaming to the table for more than 24 hours and then start again.
See the docs here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
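For anyone hitting this from Python, here is a minimal sketch of the same flow with the google-cloud-bigquery client, assuming a hypothetical project, dataset, table and schema; it simply retries the streaming insert for a short while so the warm-up window described above (and any table-creation propagation delay) doesn't silently drop rows:

import time
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

client = bigquery.Client(project="my-project")   # hypothetical project id
table = bigquery.Table(
    "my-project.my_dataset.my_table",            # hypothetical table id
    schema=[bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("ts", "TIMESTAMP")],
)
table = client.create_table(table)               # returns once the API call completes

rows = [{"name": "a", "ts": "2019-05-14T12:00:00Z"}]
for attempt in range(10):
    try:
        errors = client.insert_rows_json(table, rows)   # streaming insert
        if not errors:
            break                                       # all rows accepted
    except NotFound:
        pass   # the freshly created table may not be visible to streaming yet
    time.sleep(5)                                       # back off and retry
else:
    raise RuntimeError("rows never accepted; re-queue them instead of dropping")

The point is the same as in the fix above: don't let the bookkeeping call fail silently; keep the rows on your own queue until the insert call actually reports success.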

Related

How to correctly add a labor transaction record in a Maximo automation script

Hi, I'm trying to add a labor transaction from an action automation script in Maximo, with ASSIGNMENT as the script's object. I am currently trying the code below.
# imports needed for the script to run (Jython)
from psdi.server import MXServer
from java.text import SimpleDateFormat
from java.util import Date

# 'ui' is the UserInfo of the current user; 'userLabor', 'assignWonum' and
# 'firstDate' are set earlier in the script
labTransSet = MXServer.getMXServer().getMboSet("LABTRANS", ui)
labTrans = labTransSet.add()
labTrans.setValue("laborcode", userLabor)
labTrans.setValue("wonum", assignWonum)
sds1 = SimpleDateFormat("hh.mm aa").format(firstDate)
sds2 = SimpleDateFormat("hh.mm aa").format(Date())
labTrans.setValue("STARTTIME", sds1)
labTrans.setValue("FINISHTIME", sds2)
labTransSet.save()
labTransSet.close()
userLabor is the username of the current user
assignWonum is the assignment work order number
firstDate is the scheduled date field from the assignment
The labor record is being added correctly with the right data, but when I go to route my workflow after the script is called from a button, I get the warning "BMXAA8229W WOACTIVITY has been updated by another user" and the work order does not route. I am under the impression that this is happening because the assignment object for the script is being queried at the same time I try to add and save a labor record. Does anyone know if my guess is correct, or what else the problem is and how I can fix it? Thanks
That error occurs because Maximo already has one version of the record loaded into memory when the record in the database is modified independently. Maximo then tries to work with the in-memory object and sees it doesn't match what is in the database and throws that error. Timing doesn't really have anything to do with it (other than that an edit happened at some point after the record was loaded into memory).
What you need to do is make sure you are modifying the exact same task/assignment/labtrans record that has already been loaded into memory. That "MXServer.getMXServer().getMboSet" stuff is guaranteed to use a new object. That is how you start a new transaction in Maximo; how you make sure you are not using anything already loaded into memory. I suspect you want to get your set off of the implicit "mbo" object the script will give to you.
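A minimal sketch of that approach inside the automation script (Jython), assuming there is a relationship from the assignment to LABTRANS; the relationship name "LABTRANS" and the attributes set here are assumptions taken from the question, not a confirmed Maximo configuration:

# 'mbo' is the implicit ASSIGNMENT Mbo that the automation script receives
labTransSet = mbo.getMboSet("LABTRANS")   # assumed relationship name; use whatever
                                          # relationship links your assignment to LABTRANS
labTrans = labTransSet.add()
labTrans.setValue("laborcode", userLabor)
labTrans.setValue("wonum", assignWonum)
# no explicit labTransSet.save(): letting the owning mbo's transaction save the child
# set should avoid the "updated by another user" conflict described above

Because the new LABTRANS record is created through the mbo that is already in memory, it should be saved in the same transaction when the workflow routes, instead of competing with it.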

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like the following, running in a Zeppelin notebook:
from pyspark.sql.functions import window, struct, collect_list, count, sum, min

orderStream = spark.readStream.format("kafka").option("startingOffsets", "earliest").....
orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2", ...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)
debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start()
)
After that, I would expect data to appear on the debug query so that I can select from it (once the late-arrival window of 20 seconds has expired). But no data ever appears on the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see is that there seems to be an issue with the timezone. My Spark program runs in CET (currently UTC+2). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST__MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the micro-batch report after that "new" message has been produced onto the input topic says:
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow lines up with the time in the input message, but it is 2 hours off: the UTC offset has been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected the watermark to be 20 seconds older than the max event time, but apparently it is 4 min 14 s older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3, and there I get the results of the groupBy immediately (well, of course only after the watermark lateThreshold has expired, but "in the expected timeframe").
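For anyone debugging something similar, a convenient way to watch the event-time watermark advance, instead of grepping the MicroBatchExecution log, is to poll the streaming query's progress; a small sketch, assuming the debug query started above (the keys under "eventTime" only appear once an event-time aggregation is running):

import json
import time

# 'debug' is the StreamingQuery returned by .start() above
for _ in range(30):
    progress = debug.lastProgress            # dict with the last micro-batch's metrics, or None
    if progress and "eventTime" in progress:
        # prints avg/max/min/watermark, the same block quoted in the question
        print(json.dumps(progress["eventTime"], indent=2))
    time.sleep(10)

Watching it this way also makes the behaviour described in the edit visible: the watermark only moves forward after a micro-batch has actually seen newer event times, which is why the old backlog alone never released any append-mode output.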

getgroup() is very slow

I am using the GetGroups() function to read all of the groups of a user in Active Directory.
I'm not sure if I'm doing something wrong, but it is very, very slow. Each time it arrives at this point, it takes several seconds. I'm also accessing the rest of Active Directory using the built-in AccountManagement functions, and those execute instantly.
Here's the code:
For y As Integer = 0 To AccountCount - 1
    ' GetGroups() is the line that takes several seconds per user
    Dim UserGroupArray As PrincipalSearchResult(Of Principal) = UserResult(y).GetGroups()
    UserInfoGroup(y) = New String(UserGroupArray.Count - 1) {}
    For i As Integer = 0 To UserGroupArray.Count - 1
        UserInfoGroup(y)(i) = UserGroupArray(i).ToString()
    Next
Next
Later on...:
AccountChecker_Listview.Groups.Add(New ListViewGroup(Items(y, 0), HorizontalAlignment.Left))
For i As Integer = 0 To UserInfoGroup(y).Count - 1
    AccountChecker_Listview.Items.Add(UserInfoGroup(y)(i)).Group = AccountChecker_Listview.Groups(y)
Next
Items(,) contains my normal Active Directory data that I display; Items(y, 0) contains the username.
y ranges over the user accounts in AD. I also have some other code for the other information in this loop, but it's not the issue here.
Does anyone know how to make this go faster, or if there is another solution?
I'd recommend trying to find out where the time is spent. One option is to use a profiler, either the one built into Visual Studio or a third-party profiler like Redgate's ANTS Profiler or the YourKit .NET Profiler.
Another is to trace the time taken using the System.Diagnostics.Stopwatch class and use the results to guide your optimization efforts. For example, time the function that retrieves data from Active Directory and separately time the function that populates the view, to narrow down where the bottleneck is.
If the bottleneck is in the Active Directory lookup, you may want to consider running the operation asynchronously so that the window is not blocked and populates as new data is retrieved. If it's in the ListView, you may want to consider, for example, inserting the data in a batch operation.
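To illustrate the stopwatch idea, here is a tiny timing sketch (written in Python only to show the bracketing pattern; in VB.NET you would wrap the same sections with System.Diagnostics.Stopwatch). The two functions are hypothetical stand-ins for the GetGroups() loop and the ListView population:

import time

def fetch_groups_for_all_users():      # hypothetical stand-in for the GetGroups() loop
    time.sleep(0.5)
    return ["Group A", "Group B"]

def populate_view(groups):             # hypothetical stand-in for filling the ListView
    time.sleep(0.1)

start = time.perf_counter()            # same bracketing as Stopwatch.StartNew()
groups = fetch_groups_for_all_users()
print("Directory lookup took %.2f s" % (time.perf_counter() - start))

start = time.perf_counter()
populate_view(groups)
print("View population took %.2f s" % (time.perf_counter() - start))

Whichever bracket dominates tells you where to spend the optimization effort.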

Load Runner Session ID Changes Indefinitely

Good day
I'm trying to perform load testing with LoadRunner 11. Here's an issue:
I've got an automatically generated script after recording my actions.
I need to capture the session ID. I do it with web_reg_save_param() in the following way:
web_reg_save_param("S_ID",
"LB=Set-Cookie: JSESSIONID=",
"RB=; Path=/app/;",
LAST);
web_add_cookie("S_ID; DOMAIN={host}");
I capture the ID from the response (Tree View):
D2B6F5B05A1366C395F8E86D8212F324
Comparing it with the Replay Log, I see:
"S_ID = 75C78912AE78D26BDBDE73EBD9ADB510".
Comparing the two IDs above with the ID in the next request, I see a third ID (Tree View):
80FE367101229FA34EB6429F4822E595
Why do I have 3 different IDs?
Let me know if I have to provide extra information.
You should use "Search=All", as in the code below, provided your right and left boundaries are correct:
web_reg_save_param("S_ID",
"LB=Set-Cookie: JSESSIONID=",
"RB=; Path=/app/;",
"Search=All",
LAST);
web_add_cookie("{S_ID}; DOMAIN={host}");
For details, refer to the HP manual for the web_reg_save_param function.
I do not see what the conflict or controversy is here. Yes, items related to state or session will definitely change from user to user and from one recording session to the next. They may even change from one request to the next. You may need to record several times to identify the pattern of change and use: when you need to collect data from a response and when you need to reuse that collected data in a subsequent request.
Take a listen to this podcast; it should help:
http://www.perfbytes.com/dynamic-data-correlation

Restart my delta loading after deleting the InfoPackage in the PSA by mistake

I have an issue here; can someone please help me resolve it?
I was trying to extract some data with DataSource 0FI_AP_6...
Then in the InfoPackage monitor I can see the following:
-->Requests (messages): Everything OK
-->Extraction (messages): Everything OK
-->Transfer (IDocs and TRFC): Missing messages or warnings
     Info IDoc 2 : sent, not arrived; IDoc ready for dispatch (ALE service)
     Data Package 1 : 23752 Records arrived in BW
     Data Package 2 : 15216 Records arrived in BW
     Request IDoc : Application document posted
     Info IDoc 1 : Application document posted
     Info IDoc 3 : Application document posted
     Info IDoc 4 : Application document posted
-->Processing (data packet): Everything OK
     Data Package 1 ( 38672 Records ) : Everything OK
In the Status menu I get a message like the following:
Missing data packages for PSA Table
Diagnosis
Data packets are missing from PSA Table. BI processing does not return any errors. The data transport from the source system to BI was probably incorrect.
Procedure
Check the tRFC overview in the source system. You access this log using the wizard or following the menu path "Environment -> Transact. RFC -> Source System".
Error handling:
If the tRFC is incorrect, resolve the errors listed there. Check that the source system is connected properly to BI. In particular, check the remote user authorizations in BI.
Please suggest how I can resolve this issue.
Thanks in advance for your help; a quick reply is much appreciated.
The worst thing is that I deleted the InfoPackage in the PSA by mistake.
Normally, if I repeated the process, the delta load would be fine, but now the delta load keeps failing.
So, gurus:
1. How can I restart my delta loading correctly?
2. I want to modify the timestamp in the delta table, but how do I do it?
Go to T-Code RSA7 in the source system. This will tell you the date/timestamp that the delta is set to. If the date was changed to a range that no longer works, then you will need to re-initialize the DataSource on the BW side. However, the delta date may still be fine, because it may never have been changed when you first tried your load due to the connection issues.
You can create a new InfoPackage and set the update mode to Initialize Datasource with Data Transfer. This will essentially run a full load from the DataSource and then reset the delta pointer date/timestamp to when you ran it. That way you will capture all the data that you needed, and anything that was already in the PSA should be overwritten.
Also note that you should delete, or set the request status to red on, the previous request that may contain bad data in the PSA.
From the original error it seems like you are having an RFC connection issue between the DataSource and BW. Contact your BASIS support and have them check the connection to make sure it is good. To check that your DataSource is extracting properly, you can run t-code RSA3 on it in the source system.