Uptime in GCP: Measure the % uptime where site is available from _any_ location

Our site is running on Google App Engine, and we've set up monitoring and uptime alerts.
We want to be able to report on site uptime as a %, where the site is considered up if it can be accessed from any of the 6 locations:
If location A cannot access the site, but locations B, C, D, E and F can, the site is up.
If locations A, B, C, D and E cannot access the site, but location F can, the site is up.
If locations A, B, C, D, E and F cannot access the site, the site is down.
Currently the % calculation is: 1 - (Total number of failed checks) / (Total number of checks). This unfortunately means that the uptime is affected by a single location being unable to access the site.
Is it possible to get the uptime calculation we're after?

You can create an uptime chart for that.
If you group the data by app and set the aggregator to fraction true, the graph will reach zero only when all of the uptime checks fail.
The query will look something like this (this one is for a VM instance):
fetch gce_instance
| metric 'monitoring.googleapis.com/uptime_check/check_passed'
| filter (metric.check_id == 'uptime-1')
| group_by 1m, [value_check_passed_count_true: count_true(value.check_passed)]
| every 1m
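To make the "up if any location succeeds" definition from the question concrete, here is a minimal Python sketch of the arithmetic. It assumes the check results have already been exported as (interval, location, passed) tuples; the function name and the sample data are hypothetical and not part of Cloud Monitoring.

```python
from collections import defaultdict


def any_location_uptime(checks):
    """Return uptime % where an interval counts as up if any location passed.

    `checks` is an iterable of (interval, location, passed) tuples.
    """
    up_by_interval = defaultdict(bool)
    for interval, _location, passed in checks:
        # An interval is up as soon as one location reports success.
        up_by_interval[interval] = up_by_interval[interval] or passed
    if not up_by_interval:
        raise ValueError("no checks supplied")
    up = sum(1 for is_up in up_by_interval.values() if is_up)
    return 100.0 * up / len(up_by_interval)


# Hypothetical sample: four intervals, two locations each.
sample_checks = [
    (0, "A", False), (0, "B", True),   # one location fails: still up
    (1, "A", False), (1, "B", False),  # every location fails: down
    (2, "A", True), (2, "B", True),    # all pass: up
    (3, "A", True), (3, "B", False),   # one location fails: still up
]
uptime_percent = any_location_uptime(sample_checks)
```

With the per-check formula from the question the same sample would score 50% (4 failures out of 8 checks), whereas counting per interval gives 3 up intervals out of 4.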

Is it possible to create log source health alerts in Azure Sentinel?

I am attempting to create an alert that lets me know if a data source stops providing logs to Sentinel. While I know it displays anomalies in log data on the dashboard, I am hoping to receive alerts if a source stops providing logs for an extended period of time.
Something like creating a rule with the following query (CEF in this case):
CommonSecurityLog
| where TimeGenerated > ago(24h)
| summarize count() by DeviceVendor, DeviceProduct, DeviceName, DeviceExternalID
| where count_ == 0
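One thing worth noting about the query above: a source that has sent nothing in the window produces no rows at all, so there is no group for which count_ could be 0. The usual alternative is to compare each source's most recent log time against a threshold. A minimal Python sketch of that logic, assuming the last-seen times have already been collected into a dict (the function name, source names and timestamps are hypothetical):

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, now, max_silence=timedelta(hours=24)):
    """Return the sources whose newest log is older than `max_silence`."""
    return sorted(
        source
        for source, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Hypothetical data: one healthy source, one silent for 27 hours.
now = datetime(2024, 1, 2, 12, 0)
last_seen = {
    "firewall-1": datetime(2024, 1, 2, 11, 30),
    "proxy-1": datetime(2024, 1, 1, 9, 0),
}
stale = silent_sources(last_seen, now)
```

The same idea needs a lookback window longer than the silence threshold, otherwise the silent source has no "last seen" row to compare against.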

How to build an ongoing alert that catches sudden spikes for a certain http error code?

I could really use an ongoing alert that catches a sudden rise (spike) in a certain error code (such as 404 or 502 etc...)
I tried giving this some thought on how to achieve that, and... Well... I could really use your help with the script :-)
From my understanding the search query should "know" or, "sense" the normal traffic (not sure for how long, maybe for 1hr, 2hrs) and alert when there is a spike in the error code compared to 1-2 hours ago.
I think the error code spike threshold should be more than 5% of total traffic, while occurring for longer than 90 seconds.
Here is the Splunk query I use today; I would appreciate your help tuning it to what I described above:
tag=NginxLogs host=www1 OR host=www2 |stats count by status|eventstats sum(count) as total|eval perc=round((count/total)*100,2)|where status="404" AND perc>5
The top command automatically provides the count and percent.
http://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Top
tag=NginxLogs host=www1 OR host=www2
| top status
| search percent > 5 AND status > 399
If you have the URL, HTTP request method and user in your Splunk logs, you can add them to this alert. Example:
tag=NginxLogs host=www1 OR host=www2
| eventstats distinct_count(userid) as NoOfUsersAffected by requestUri,status,httpmethod
| top status,httpmethod,NoOfUsersAffected by requestUri
| search NoOfUsersAffected > 2 AND ((status>499 AND percent > 5) OR (status=400 AND percent > 95))
You can use the following alert message:
$result.percent$ % ($result.count$ calls) has StatusCode $result.status$ for
$result.requestUri$ - $result.httpmethod$.
$result.NoOfUsersAffected$ users were affected
You will get an alert like:
21.19 % (850 calls) has StatusCode 500 for https://app.test.com/hello - GET.
90 users were affected
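For readers less familiar with SPL, the count/percent filtering in the first query above amounts to roughly the following. This is a plain Python sketch of the arithmetic only, not of how Splunk evaluates the search; the function name and the sample status list are hypothetical.

```python
from collections import Counter


def error_statuses_over_threshold(statuses, min_percent=5.0, min_status=400):
    """Return (status, count, percent) rows for error codes above the threshold."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return []
    rows = []
    for status, count in counts.most_common():
        percent = round(count / total * 100, 2)
        # Mirrors `search percent > 5 AND status > 399`.
        if percent > min_percent and status >= min_status:
            rows.append((status, count, percent))
    return rows


# Hypothetical traffic sample: 90 OK responses, 8 not-found, 2 bad-gateway.
sample_statuses = [200] * 90 + [404] * 8 + [502] * 2
alerts = error_statuses_over_threshold(sample_statuses)
```

Only 404 crosses the 5% line here; 502 stays at 2% and is not reported.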

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on the 8-core machine, giving 7 cores to the worker. I have created another slave on the 4-core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for the 8-core and 4-core machines respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hour-wise files in day-wise directories in an S3 bucket; each file is around 100 MB, e.g.
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
The get_tags function creates a Redis connection on each invocation, queries Redis, and yields a (user, item, rate) tuple
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days of data (around 2.5 hours) and completely gives up on 14 days.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I avoid Python, given the better Spark APIs in Scala? Would that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
from pyspark import StorageLevel

def main(sc):
    f = get_files()
    # Parentheses let the chained calls span several lines.
    a = (sc.textFile(f, 15)
         .coalesce(7 * sc.defaultParallelism)
         .map(lambda line: line.split(","))
         .filter(lambda line: len(line) > 0)
         .map(lambda line: (line[18], line[2], line[13], line[15]))
         .map(scoring)
         .map(lambda line: ((line[0], line[1]), line[2]))
         .persist(StorageLevel.MEMORY_ONLY_SER))
    b = a.reduceByKey(lambda x, y: x + y).map(aggregate)
    b.persist(StorageLevel.MEMORY_ONLY_SER)
    # The original post had `taggings` here; `b` matches the description above.
    c = b.mapPartitions(get_tags)
    c.saveAsTextFile("f")
    a.unpersist()
    b.unpersist()
The get_tags function is
import json
import redis

def get_tags(partition):
    # One Redis connection per partition, shared by all of its elements.
    rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
    for element in partition:
        user = element[0]
        song = element[1]
        rating = element[2]
        tags = rh.hget(settings['REDIS_HASH'], song)
        if tags:
            tags = json.loads(tags)
        else:
            # Nothing stored in the hash for this song: fall back to scrape().
            tags = scrape(song, rh)
        if tags:
            for tag in tags:
                yield (user, tag, rating)
The get_files function is:
def get_files():
    paths = get_path_from_dates(DAYS)
    # s3n URLs embed credentials as access_key:secret_key@bucket
    base_path = 's3n://acc_key:sec_key@bucket/'
    files = list()
    for path in paths:
        fle = base_path + path + '/file_format.*'
        files.append(fle)
    return ','.join(files)
The get_path_from_dates(DAYS) is
from datetime import timedelta

def get_path_from_dates(last):
    # `today` is a date defined elsewhere in the script.
    days = list()
    t = 0
    while t <= last:
        d = today - timedelta(days=t)
        path = d.strftime('%Y-%m') + '/' + d.strftime('%Y-%m-%d')
        days.append(path)
        t += 1
    return days
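To make the resulting path layout concrete, here is a self-contained variant of the path builder. Taking the reference date as a parameter is my assumption (the code in the post relies on a module-level `today`), and the function name is hypothetical.

```python
from datetime import date, timedelta


def paths_for_last_days(today, last):
    """Build 'YYYY-MM/YYYY-MM-DD' paths for `today` and the `last` days before it."""
    paths = []
    for offset in range(last + 1):  # inclusive, like `while t <= last`
        d = today - timedelta(days=offset)
        paths.append(d.strftime('%Y-%m') + '/' + d.strftime('%Y-%m-%d'))
    return paths


example_paths = paths_for_last_days(date(2015, 4, 9), 2)
```

Note that the loop is inclusive, so asking for the last 2 days yields 3 paths (today plus the 2 days before it).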
As a small optimization, I have created two separate tasks: one to read from S3 and compute the additive sum, and a second to read the transformations from Redis. The first task has a high number of partitions since there are around 2300 files to read. The second has far fewer partitions to limit Redis connection latency, and there is only one file to read, which is on the EC2 cluster itself. This is only a partial fix; I am still looking for suggestions to improve ...
I had a similar use case: doing coalesce on an RDD with 300,000+ partitions. The difference is that I was using s3a (SocketTimeoutException from S3AFileSystem.waitAysncCopy). In the end the issue was resolved by setting a larger fs.s3a.connection.timeout (in Hadoop's core-site.xml). Hopefully this gives you a clue.

Summarize multiple values sent to Graphite at the same time

I'm trying to display the sum of several values sent to Graphite (carbon-cache) for the same timestamp.
Sent values are like :
test.nb 10 1421751600
test.nb 11 1421751600
test.nb 12 1421751600
test.nb 13 1421751600
and I would like Graphite to display the value "46" for timestamp 1421751600.
Only the last value "13" is displayed on Graphite.
Here are configuration files :
storage-aggregation.conf
[test_sum]
pattern = ^test\.*
xFilesFactor = 0.1
aggregationMethod = sum
storage-schemas.conf
[TEST]
pattern = ^test\.
retentions = 10s:30d
Is there a way to do this with Graphite/Carbon ?
Thx.
The storage-aggregation.conf file defines how to aggregate data into lower-precision retentions. Since you have only one retention defined (10s for 30 days), it is not needed here.
In order to do this with the Graphite daemons, you will have to use carbon-aggregator.py, which runs in front of carbon-cache.py to buffer metrics over time. Check the [aggregator] section in the config file. By default, carbon-aggregator listens for plaintext metrics on port 2023, so you will have to send data points to this port and not to the carbon-cache port (2003 by default for the plaintext protocol; 2004 is the pickle port).
Also, you will have to specify the aggregation rule in aggregation-rules.conf, which will allow you to add several metrics together as they come in. You can find a detailed explanation here.
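As an illustration of the result the question is after, the sketch below sums plaintext datapoints that share a metric name and timestamp. It only shows the arithmetic of a "sum" rule; it is not how carbon-aggregator is implemented (which buffers per interval), and the function name is hypothetical.

```python
from collections import defaultdict


def sum_by_metric_and_timestamp(lines):
    """Sum plaintext 'metric value timestamp' datapoints per (metric, timestamp)."""
    totals = defaultdict(float)
    for line in lines:
        metric, value, timestamp = line.split()
        totals[(metric, int(timestamp))] += float(value)
    return dict(totals)


# The four datapoints from the question.
datapoints = [
    "test.nb 10 1421751600",
    "test.nb 11 1421751600",
    "test.nb 12 1421751600",
    "test.nb 13 1421751600",
]
totals = sum_by_metric_and_timestamp(datapoints)
```

Without an aggregation step like this in front of it, carbon-cache simply keeps the last value written for a given timestamp, which is why only 13 shows up.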

GWT-RPC, Apache, Tomcat server data size checking

Following up on this GWT-RPC question (and answer #1) re. field size checking, I would like to know the right way to check pre-deserialization for max data size sent to server, something like if request data size > X then abort the request. Valuing simplicity and based on answer on aforementioned question/answer, I am inclined to believe checking for max overall request size would suffice, finer grained checks (i.e., field level checks) could be deferred to post-deserialization, but I am open to any best-practice suggestion.
Tech stack of interest: GWT-RPC client-server communication with Apache-Tomcat front-end web-server.
I suppose a first step would be to globally limit the size of any request (LimitRequestBody in httpd.conf and/or others?).
Are there finer-grained checks like something that can be set per RPC request? If so where, how? How much security value do finer grain checks bring over one global setting?
To frame the question more specifically with an example, let's suppose we have the two following RPC request signatures on the same servlet:
public void rpc1(A a, B b) throws MyException;
public void rpc2(C c, D d) throws MyException;
Suppose I approximately know the following max sizes:
a: 10 kB
b: 40 kB
c: 1 MB
d: 1 kB
Then I expect the following max sizes:
rpc1: 50 kB
rpc2: 1 MB
In the context of this example, my questions are:
Where/how to configure the max size of any request -- i.e., 1 MB in my above example? I believe it is LimitRequestBody in httpd.conf but not 100% sure whether it is the only parameter for this purpose.
If possible, where/how to configure max size per servlet -- i.e., max size of any rpc in my servlet is 1 MB?
If possible, where/how to configure/check max size per rpc request -- i.e., max rpc1 size is 50 kB and max rpc2 size is 1 MB?
If possible, where/how to configure/check max size per rpc request argument -- i.e., a is 10 kB, b is 40 kB, c is 1 MB, and d is 1 kB. I suspect it makes practical sense to do post-deserialization, doesn't it?
For practical purposes based on cost/benefit, what level of pre-deserialization checking is generally recommended -- 1. global, 2. servlet, 3. rpc, 4. object-argument? Stated differently, what is roughly the cost-complexity on one hand and the added value on the other hand of each of the above pre-deserialization level checks?
Thanks much in advance.
Based on what I have learned since I asked the question, my own answer and strategy until someone can show me better is:
First line of defense and check is Apache's LimitRequestBody set in httpd.conf. It is the overall max for all rpc calls across all servlets.
Second line of defense is servlet pre-deserialization by overriding GWT AbstractRemoteServiceServlet.readContent. For instance, one could do it as shown further below I suppose. This was the heart of what I was fishing for in this question.
Then one can further check each rpc call argument post-deserialization. One could conveniently use JSR 303 validation on both the server and the client side -- see references StackOverflow and gwt re. client side.
Example on how to override AbstractRemoteServiceServlet.readContent:
@Override
protected String readContent(HttpServletRequest request) throws ServletException, IOException
{
    final int contentLength = request.getContentLength();
    // _maxRequestSize should be large enough to be applicable to all rpc calls within this servlet.
    if (contentLength > _maxRequestSize)
        throw new IOException("Request too large");
    final String requestPayload = super.readContent(request);
    return requestPayload;
}
See this question in case the max request size is > 2 GB.
From a security perspective, this strategy seems quite reasonable to me to control the size of data users send to server.
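One edge case in the guard above is worth spelling out: HttpServletRequest.getContentLength() returns -1 when the length is unknown (or does not fit in an int), and -1 passes a plain "greater than max" comparison. The decision logic can be sketched in Python as below; rejecting unknown lengths is my assumption about the desired policy, and the function name and the 1 MB constant (taken from the example sizes in the question) are only illustrative.

```python
MAX_REQUEST_SIZE = 1024 * 1024  # 1 MB, the largest rpc in the question's example


def check_request_size(content_length, max_request_size=MAX_REQUEST_SIZE):
    """Reject a request by its declared size, before the body is read."""
    if content_length < 0:
        # Unknown length: a bare `content_length > max` test would let this through.
        raise ValueError("Content-Length missing or unknown")
    if content_length > max_request_size:
        raise ValueError("Request too large")
    return content_length


accepted = check_request_size(50 * 1024)  # a 50 kB rpc1-sized request passes
```

Whether to reject or to count bytes while reading in the unknown-length case is a design choice; the point is only that the case should be handled explicitly.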