Librato composite error: what does "Unable to execute composite: ["error": "Requested MD data from SD endpoint"]" mean?

I want to create an alert that triggers whenever one of the following counter statistics is not zero:
a.b.c.failed
a.b.e.failed
I already use these statistics separately on a dashboard page, but as they occur rarely, I'd like an alert.
It appears I have to make a sum composite so that I can trigger the alert when the sum is above zero. I think the composite would look something like:
sum(series("a.b.*.failed",{}))
However, every attempt I make gives the error:
Unable to execute composite: ["error": "Requested MD data from SD endpoint"]
There is another thread that suggested replacing the {} with "*" (including the quotes). This no longer gives an error, but produces a bizarre result: it's above zero all the time, even though the 'failed' statistics are only very rarely above zero.

The correct expression for my case is:
sum(derive(series("a.b.*.failed","*")))
Using "*" works to select the source.
Derive gives the change of each statistic instead of the cumulative total (but I'm not sure why the cumulative total was showing up - it is not shown normally for these statistics).
Sum adds the change of the different statistics.
I don't understand why {} doesn't work; I think that is related to the mystery of what the error message means, with its undocumented terminology (MD and SD endpoints). Librato's documentation of its composite metrics language is minimal, offering few examples and little explanation of its terms or technical foundations.

Related

How to get around a GEOS error when doing st_union?

I have a big layer with lines, and a view that needs to calculate the length of these lines without counting their overlaps.
A working query that does half the job (it does not account for the overlap, so it overestimates the result):
select name, sum(st_length(t.geom)) from mytable t where st_isvalid(t.geom) group by name
The intended query, which returns SQL Error [XX000]: ERROR: GEOSUnaryUnion: TopologyException: found non-noded intersection between LINESTRING (446659 422287, 446661 422289) and LINESTRING (446659 422288, 446660 422288) at 446659.27944086661 422288.0015405959:
select name,st_length(st_union(t.geom)) from mytable t where st_isvalid(t.geom) group by name
The thing is that the latter works fine for the first 200 rows; it's only when I try to export the entire view that I get the error.
Would there be a way to use the preferred query first, and if it returns an error on a row use the other one? Something like:
case when st_length(st_union(t.geom)) = error then sum(st_length(t.geom))
else st_length(st_union(t.geom)) end
Make sure your geometries are valid before the union by wrapping them in ST_MakeValid(). You can also query their individual validity using select id, ST_IsValid(geom) from mytable; to filter out or correct the affected ones. This helps in cases where one of your geometries is itself invalid, but it will still leave cases where the invalidity only appears after combining multiple valid geometries.
See if ST_UnaryUnion(ST_Collect(ST_MakeValid(t.geom))) changes anything. It will try to dissolve and node the component linestrings.
When really desperate, you can make a PL/pgSQL wrapper around both of your calculations and switch to the backup one in the exception block; see the sketch after these suggestions.
At the expense of some precision, and with the benefit of somewhat higher performance, you could try snapping the geometries to a grid with ST_Union(ST_SnapToGrid(t.geom,1e-7)), gradually increasing the grid size to 1e-6, then 1e-5. Some geometries might not actually intersect but lie so close together that PostGIS cannot tell them apart at the precision it operates at. You can also apply this only to your problematic geometries, if you can pinpoint them.
As @dr_jts pointed out, PostGIS 3.1.0 includes a new overlay engine, so if your select postgis_full_version(); shows anything below that and GEOS 3.9.0, it might be worth upgrading. The upcoming PostGIS 3.2.0 with GEOS 3.10.1 should also provide some improvement in validity checks.
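To illustrate the PL/pgSQL fallback idea, here is a minimal sketch under the assumption that you aggregate each name's geometries into an array first; safe_union_length is a made-up helper name, so adjust it to your schema:

CREATE OR REPLACE FUNCTION safe_union_length(geoms geometry[])
RETURNS double precision AS $$
BEGIN
    -- Preferred path: dissolve the overlaps, then measure.
    RETURN ST_Length(ST_Union(geoms));
EXCEPTION WHEN OTHERS THEN
    -- Backup path: sum the raw lengths (overestimates where lines overlap).
    RETURN (SELECT sum(ST_Length(g)) FROM unnest(geoms) AS g);
END;
$$ LANGUAGE plpgsql;

-- Usage: aggregate the geometries per name and let the wrapper decide.
SELECT name, safe_union_length(array_agg(geom)) AS total_length
FROM mytable
WHERE ST_IsValid(geom)
GROUP BY name;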
Here's a related thread.

BQ Switching to TIMESTAMP Partitioned Table

I'm attempting to migrate from ingestion-time (_PARTITIONTIME) partitioned tables to TIMESTAMP partitioned tables in BQ. In doing so, I also need to add several required columns. However, when I flip the switch and redirect my dataflow to the new TIMESTAMP partitioned table, it breaks. Things to note:
Approximately two million rows (likely one batch) are successfully inserted. The job continues to run but doesn't insert anything after that.
The job runs in batches.
My project is entirely in Java
When I run it as streaming, it appears to work as intended. Unfortunately, it's not practical for my use case and batch is required.
I've been investigating the issue for a couple of days and have tried to break the transition down into the smallest steps possible. It appears that the step responsible for the error is introducing REQUIRED fields (it works fine when the same fields are NULLABLE). To avoid any possible parsing errors, I've set default values for all of the REQUIRED fields.
At the moment, I get the following combination of errors and I'm not sure how to address any of them:
The first error repeats infrequently, but usually in groups:
Profiling Agent not found. Profiles will not be available from this worker
The second occurs a lot, in large groups:
Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner
Then there appears to be one very large group of these:
Aborting Operations. java.lang.RuntimeException: Unable to read value from state
Towards the end, this error appears every 5 minutes, surrounded only by the mild parsing errors described below:
Processing stuck in step BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 20m00s without outputting or completing in state finish
Due to the sheer volume of data my project parses, there are several parsing errors such as Unexpected character. They're rare but shouldn't break data insertion. If they do, I have a bigger problem as the data I collect changes frequently and I can adjust the parser only after I see the error, and therefore, see the new data format. Additionally, this doesn't cause the ingestiontime table to break (or my other timestamp partition tables to break). That being said, here's an example of a parsing error:
Error: Unexpected character (',' (code 44)): was expecting double-quote to start field name
EDIT:
Some relevant sample code:
public PipelineResult streamData() {
    try {
        GenericSection generic = new GenericSection(options.getBQProject(), options.getBQDataset(), options.getBQTable());
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
            .apply(options.getWindowDuration() + " Windowing", generic.getWindowDuration(options.getWindowDuration()))
            .apply(generic.getPubsubToString())
            .apply(ParDo.of(new CrowdStrikeFunctions.RowBuilder()))
            .apply(new BigQueryBuilder().setBQDest(generic.getBQDest())
                .setStreaming(options.getStreamingUpload())
                .setTriggeringFrequency(options.getTriggeringFrequency())
                .build());
        return pipeline.run();
    }
    catch (Exception e) {
        LOG.error(e.getMessage(), e);
        return null;
    }
}
Writing to BQ. I did try to set the partitioning field here directly, but it didn't seem to affect anything:
BigQueryIO.writeTableRows()
    .to(BQDest)
    .withMethod(Method.FILE_LOADS)
    .withNumFileShards(1000)
    .withTriggeringFrequency(this.triggeringFrequency)
    .withTimePartitioning(new TimePartitioning().setType("DAY"))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
}
After a lot of digging, I found the error. I had parsing logic (a try/catch) that returned nothing (essentially a null row) whenever there was a parsing error. This broke BigQuery, as my schema had several REQUIRED fields.
Since my job ran in batches, even one null row would cause the entire batch job to fail and not insert anything. This also explains why streaming inserted just fine. I'm surprised that BigQuery didn't throw an error claiming that I was attempting to insert a null into a required field.
In reaching this conclusion, I also realized that the partition field needed to be set in my code, not just in the schema. It can be done using
.setField(partitionField)
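For reference, a minimal sketch of the two changes described above, shown as two fragments in the style of the snippets earlier (parse() is a hypothetical helper; BQDest, triggeringFrequency and partitionField come from the surrounding code):

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TimePartitioning;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.transforms.DoFn;

// Parsing step: only output a row when parsing succeeds, so no null/empty
// rows reach BigQuery and violate the REQUIRED fields.
public class RowBuilder extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        try {
            TableRow row = parse(c.element()); // hypothetical parser
            c.output(row);
        } catch (Exception e) {
            // Log and skip the bad record instead of emitting a null row.
        }
    }

    private TableRow parse(String json) { /* ... */ return new TableRow(); }
}

// Write step: set the partition field on TimePartitioning, not only in the schema.
BigQueryIO.writeTableRows()
    .to(BQDest)
    .withMethod(Method.FILE_LOADS)
    .withNumFileShards(1000)
    .withTriggeringFrequency(this.triggeringFrequency)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField(partitionField))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);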

Prometheus: how to rate a sum of the same counter from different machines?

I have a Prometheus counter whose rate I want to compute over a time range (the real goal is to sum the rates, and sometimes to use histogram_quantile on that sum for histogram metrics).
However, I've got multiple machines running that kind of job, and each one sets its own instance label. This causes inc operations on this counter on different machines to create different entities of the counter, as each combination of label values is unique.
The problem is that rate() works separately on each such counter entity.
The result is that counter entities with unique label combinations aren't properly taken into account by rate().
For example, if I've got:
mycounter{aaa="1",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:7777",job="job1"} value: 1
mycounter{aaa="1",instance="5.5.5.5:6666",job="job1"} value: 1
All counter entities are unique, so they get values of 1.
If the counter label combinations are always unique (they come from different machines), rate(mycounter[5m]) yields values of 0 in this case,
and sum(rate(mycounter[5m])) would get 0, which is not what I need!
I want to ignore the instance label so that these mycounter inc operations are treated as if they were made on the same counter entity.
In other words, I expect to have only 2 entities (they can share a common instance value or have no instance label at all):
mycounter{aaa="1", job="job1"} value: 2
mycounter{aaa="2", job="job1"} value: 2
In such a case, an inc operation on a new machine (with an existing aaa value) would increase an existing counter entity instead of adding a new entity with a value of 1, and rate() would compute real rates for each entity, so we could sum() them.
How do I do that?
I made several attempts to solve this, but all failed:
Doing a rate() of the sum() fails because of a type mismatch...
Removing the automatic instance label using metric_relabel_configs with action: labeldrop in the configuration works, but then Prometheus assigns the default address value.
Changing all instance values to a common one using metric_relabel_configs with replacement, but it seems that one of the entities overwrites all the others, so it doesn't help...
Any suggestions?
Prometheus version: 2.3.2
Thanks in Advance!
If the other labels (aaa, etc.) have a limited set of possible combinations, it's better to expose your counters at 0 on application start. That way the rate() function works correctly at the bottom level and sum() will give you correct results.
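As an illustration, assuming the Prometheus Java simpleclient and the metric/label names from the question, touching each expected label combination at startup exports the series with a value of 0 before the first real increment:

import io.prometheus.client.Counter;

public class Metrics {
    // Register the counter once per process.
    static final Counter MY_COUNTER = Counter.build()
        .name("mycounter")
        .help("Example counter pre-initialised for every known label value.")
        .labelNames("aaa")
        .register();

    // Call on application start: labels(...) creates each child series
    // at 0, so rate() has a baseline before the first increment.
    public static void initCounters() {
        MY_COUNTER.labels("1");
        MY_COUNTER.labels("2");
    }
}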
If you have to do a rate() of the sum(), read this first:
Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
If you can tolerate this (or if the instances reset their counters at the same time), there's a workaround. Define a recording rule such as:
record: job:mycounter:sum
expr: sum without(instance) (mycounter)
and then this expression works:
sum(rate(job:mycounter:sum[5m]))
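For reference, in a Prometheus 2.x rules file that rule sits inside a group, roughly like this (the group name is arbitrary):

groups:
  - name: mycounter_aggregation
    rules:
      - record: job:mycounter:sum
        expr: sum without(instance) (mycounter)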
The obvious query rate(sum(...)) won't work in most cases, since the resulting sum(...) may hide the resets to zero of the individual time series that are passed to sum. So usually the correct answer is to use sum(rate(...)) instead. See this article for details.
Unfortunately, Prometheus may miss some increases on slow-changing counters when calculating rate(), as shown in the original question above. The same applies to increase() calculations. See this issue, this comment and this article for details. Prometheus developers are going to fix these issues - see this design doc.
In the meantime, try VictoriaMetrics if you need exact values from rate() and increase() over slow-changing counters (and distributed counters).

Is it worth introducing "incorrect" results to avoid crashing a program?

In my organisation, I see a lot of places where code has been put inside monitor blocks (RPG's version of try..except) to prevent raising exceptions on arithmetic errors. For instance:
Monitor;
  Pxxhour = Bctime/60;
  PxxMin = %Rem(Bctime:60);
On-Error;
  Pxxhour = 0;
  PxxMin = 0;
EndMon;
Pxxhour and Pxxmin are screen fields that will be displayed to users. So if there is an error in the operations, these get a value of 0. Though this prevents the program from crashing, how does it help? Users keep seeing the wrong values on the screen. Similarly, I see code which assigns the highest possible value for a given variable rather than allowing an overflow exception. Though this will prevent the program from blowing up, how does it help in the long run? Wouldn't calculations have wrong values and result in wrong business data?
The answers given below by @jmarkmurphy and @Charles successfully address the question from an RPG and IBM midrange perspective, which is what I was after.
There are two use cases for a MONITOR block...
Expected errors
Unexpected errors
For expected errors, replacing bad or invalid data with an accepted value is a valid solution in some cases. The trick is knowing which cases, and that is something your business people would need to help decide. It depends on what the program is doing and which data has the problem.
For instance, given some sort of internal sales report, you might have something like so:
dcl-c DIVIDE_BY_ZERO const(00102);
dcl-c RESULT_TO_LARGE const(00103);

monitor;
  averageSale = totalSalesAmount / numberSales;
on-error DIVIDE_BY_ZERO;
  averageSale = 0;
on-error RESULT_TO_LARGE;
  averageSale = *HIVAL;
endmon;
What's important about the above is that I'm expecting one of two possible errors and I've decided to handle them a certain way. The business people don't care that, technically, averageSale is undefined when numberSales is *ZERO. They just want a zero to appear on the report. They also understand that there's only so much room on the page, and that if the number is all nines, the actual value might be bigger.
An unexpected error, such as a decimal data error, would not be caught by this MONITOR block.
For an unexpected error caught by a MONITOR block via an ON-ERROR with *ALL or no error code specified, I'd expect to see some sort of logging of the issue, followed by either skipping the problem record or cleanly shutting down, depending on what the program is doing in the first place.
It appears that your code is expecting certain errors, but without explicitly defining which error codes it's willing to handle. This is lazy and not a good practice.
As for whether or not the handling of those expected errors is valid: only you and your users can decide that.
You might want to take a look at Chapter 7 - Exception and error handling of the IBM Redbook Who Knew You Could Do That with RPG IV? Modern RPG for the Modern Programmer
What Should I Do When I Have Errors in my Calculations?
Programs that blow up on users are bad, even if it is the user's fault. It makes the user believe that the program is buggy, and then anything unexpected that happens becomes the program's fault; something to be fixed. Things can get really out of hand in this manner causing help desk calls for ordinary occurrences that just appear a little odd, even when the outcome is actually correct.
One option is to validate the user input to prevent calculation errors, but what do you do when you can't really prevent all of them? In our world, one of these situations is invoicing. 5250 screens have limited real estate and you can't always make the fields big enough to hold all eventualities, so there are tradeoffs. Maybe you need to be able to sell thousands of some small item on a single invoice, but the largest total invoice you have ever had is $100K. So you size your fields like this:
dcl-s quantity Packed(5:0);
dcl-s unitPrice Packed(7:2);
dcl-s amount Packed(9:2);
All are odd because they take up the same space on disk as the next lower even precision. You don't sell fractional quantities, and the maximum value in each field is:
quantity = 99,999;
unitPrice = $99,999.99;
amount = $9,999,999.99;
Now you can see that these maximums should easily handle all valid invoices, but they also leave plenty of potential for calculation errors. If the user keys in maximum numbers for quantity and unitPrice, the resulting number would require a Packed(12:2) field, and that would cause an overflow. When the unit price is stored in the invoice detail, we can add an edit, at the time the quantity and unit price are entered, that checks for an extended-amount overflow and sends an appropriate error message. But what if unit prices are not stored in the invoice detail, but instead in a pricing table? Then there is no good way, if a price is changed for example, to ensure that none of the existing invoices will be affected adversely.
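A rough sketch of such an edit check, reusing the RESULT_TO_LARGE status constant from the earlier example (the message handling is simplified and the field names are illustrative):

dcl-s errorMsg varchar(80);

monitor;
  // Recompute the extended amount when quantity or unit price is entered.
  amount = quantity * unitPrice;
on-error RESULT_TO_LARGE;
  // Reject the entry instead of silently capping the amount.
  errorMsg = 'Extended amount exceeds the invoice amount field';
  amount = 0;
endmon;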
So what do you do about a decimal overflow, or any other calculation error, be it a data problem or something else, and what happens when the error occurs? Blowing up the program is not a good option. Another option, the one that seems to be taken in the question, is to apply some default value that the users will quickly recognize as out of the ordinary. It will appear in reports and on screens, and when the users see those excessively large or small numbers, they will know to go back and check the data.

Endeca returning zero-count refinements that do not appear in the reference app?

I am using the Endeca 3.1.2 Assembler API. When I execute an Endeca query, it gives me a bunch of refinements, some with zero counts and some with positive counts.
Example:
category
**category1(0)**
category2(25)
**category3(0)**
That is the result I am getting. When I run the same query in the jspref application, I do not get any refinements with a zero count.
I don't want those zero-count refinements included in the available refinements.
Please help me resolve this.
You might have disabled refinements enabled in your query.
Check whether the Ndr parameter appears in the Dgraph request log file.
If so, ensure your code doesn't call the ENEQuery.setNavDisabledRefinementsConfig() method.
Endeca also has a feature called implicit dimensions, and it might be that implicit dimensions are being returned to the front end; Endeca provides them as part of the query response.
The following code is used to get an implicit dimension:
Navigation.getCompleteDimensions().getDimension(dimensionid)