How to change the Splunk Forwarder formatting of my logs?

I am using our enterprise's Splunk forwarder, which seems to be logging events in Splunk like this, making the logs a bit difficult to read:
{"log":"[https-jsse-nio-8443-exec-5] 19 Jan 2021 15:30:57,237+0000 UTC INFO rdt.damien.services.CaseServiceImpl CaseServiceImpl :: showCase :: Case Created \n","stream":"stdout","time":"2021-01-19T15:30:57.24005568Z"}
However, other orgs in our sibling enterprise log to Splunk like this, which is far more readable. (There is no technical relationship between us and them, so we cannot leverage their tech support to triage this.)
[http-nio-8443-exec-7] 15 Jan 2021 21:08:49,511+0000 INFO DaoOImpl [{applicationSystemCode=dao-app, userId=ANONYMOUS, webAnalyticsCorrelationId=|}]: This is a sample log
Please note the difference in logs (mine vs other):
{"log":"[https-jsse-nio-8443-exec-5]..
vs
[http-nio-8443-exec-7]...
Our enterprise team is struggling to determine what causes this. I checked my app.log, which looks OK (logged using Log4j) and doesn't have the aforementioned {"log": ...} wrapper.
[https-jsse-nio-8443-exec-5] 19 Jan 2021 15:30:57,237+0000 UTC INFO rdt.damien.services.CaseServiceImpl CaseServiceImpl:: showCase :: Case Created
Could someone guide me as to where the problem or configuration might lie that is causing the Splunk forwarder to send logs to Splunk in the {"log": ...} format? I thought it had something to do with JSON vs. RAW sourcetypes, which I also don't fully understand; if that is the cause, what configs are driving it?

Over the course of my investigation I found that it is not Splunk doing this but rather the Docker container. Docker defaults to the json-file logging driver, which writes container output to the /var/lib/docker/containers folder in files with a -json suffix, containing the logs in the {"log": ...} format.
I need to figure out how to change the Docker logging driver so that it writes in a non-JSON format.
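What I'm looking at now is changing the default logging driver in /etc/docker/daemon.json, roughly like this (a minimal sketch only; the local driver and the rotation limits shown are just one possible choice, not something confirmed for our enterprise forwarder setup, and existing containers keep their old driver until they are re-created):
{
  "log-driver": "local",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}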

Janusgraph not using index in production

Problem
When performing queries in my production environment, the index is not being used and a full scan is performed, but my development environment works fine and uses the index.
After looking deeper at the problem in production, it also seems that the index information is being saved to the storage backend, but the data is not; the data is being stored locally instead. I have no idea why this is...
I will explain the architecture now:
Environments
The following describes my two environments. Importantly, the index in question is a composite index and as such uses the storage backend, but I have still included the index backend (Elasticsearch) in the architecture description.
Both the local and production environments run the same versions, i.e.:
Janusgraph: 0.5.2
ScyllaDB: 0.5.2
Elasticsearch: 7.13.1
Local Environment
Services run in docker-compose, consisting of a single JanusGraph instance, a single ScyllaDB instance, and a single Elasticsearch instance.
Production Environment
Running on AWS, in a Kubernetes cluster managed with EKS. I have multiple JanusGraph deployments that connect to a ScyllaDB cluster (in the same k8s cluster, deployed via Scylla For Kubernetes, https://operator.docs.scylladb.com/stable/) and to an Elasticsearch cluster.
Setup
The following is the simplest example I can give that exhibits the problems I describe.
I pre-create the indexes with the JanusGraph management system, for example:
// management.groovy
import org.janusgraph.graphdb.database.management.ManagementSystem

// remote connection used for traversals
cluster = Cluster.open("/opt/janusgraph/my_scripts/gremlin.yaml")
client = cluster.connect()

// local JanusGraph instance used for schema/index management
graph = JanusGraphFactory.open("/opt/janusgraph/my_scripts/env.properties")
g = graph.traversal().withRemote(DriverRemoteConnection.using(client, "g"))

// define the property, label and composite index, then commit the schema changes
m = graph.openManagement()
uid_property = m.makePropertyKey("uid").dataType(String).make()
user_label = m.makeVertexLabel("User").make()
m.buildIndex("index::User::uid", Vertex.class).addKey(uid_property).indexOnly(user_label).buildCompositeIndex()
m.commit()
Upon inspection with m.printSchema() I can see that the indexes are ENABLED, in both my local environment and my production environment.
I then proceed to import all the data that needs to exist on the graph; both the local and production environments import it OK.
Performing Queries
The following outlines what happens when I run a query.
Local Environment
What we see here is a simple lookup just to check that the query is using the index:
gremlin> g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 1.837 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=true
\_query=multiKSQ[1]#4000
\_index=index::User::uid
\_orders=[]
\_isOrdered=true
optimization 0.038
optimization 0.497
backend-query 1 0.901
\_query=index::User::uid:multiKSQ[1]#4000
\_limit=4000
>TOTAL - - 1.837 -
Production Environment
Again, we run the query to see if it is using the index (it is not):
g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 11296.568 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=false
\_query=[]
\_orders=[]
\_isOrdered=true
optimization 0.025
optimization 0.102
scan 0.000
\_condition=VERTEX
\_query=[]
\_fullscan=true
>TOTAL - - 11296.568 -
What happened? So far, my best guess:
The storage backend is NOT being used for storing data, but it is being used for storing information about the indexes.
Update (Aug 16 2021): after digging around some more, I found out something interesting.
It is now clear that the data is actually not being saved to the storage backend at all.
In my local environment I set the storage.directory variable to /var/lib/janusgraph/data, which mounts onto an empty directory, and this directory remains empty. Any vertex/edge updates get saved to the ScyllaDB storage backend, and the data persists between JanusGraph instance restarts.
In my production environment, this directory (/var/lib/janusgraph/data) is populated with files:
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.lck
-rw-r--r-- 1 janusgraph janusgraph 9650 Aug 16 05:46 je.config.csv
-rw-r--r-- 1 janusgraph janusgraph 450 Aug 16 05:46 je.info.0
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.info.0.lck
drwxr-xr-x 2 janusgraph janusgraph 118 Aug 16 05:46 .
-rw-r--r-- 1 janusgraph janusgraph 7533 Aug 16 05:46 00000000.jdb
drwx------ 1 janusgraph janusgraph 75 Aug 16 05:53 ..
-rw-r--r-- 1 janusgraph janusgraph 19951 Aug 16 06:09 je.stat.csv
Any subsequent updates to the graph seem to be reflected here; the updates do not get written to the storage backend, and other JanusGraph instances on Kubernetes cannot see changes that other instances make, leading me to the conclusion that the storage backend is not being used for storing data.
The domain names used for storage.hostname and index.hostname both resolve to IP addresses, confirmed using nslookup.
The endpoints must also work, as the janusgraph keyspace is created with the replication factor I defined, and it retains the index information across JanusGraph instance restarts.
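For reference, the env.properties that both environments are meant to point at looks roughly like this (a minimal sketch; the backend names are the standard JanusGraph options, while the hostnames are placeholders rather than my real values):
# env.properties (sketch; hostnames are placeholders)
storage.backend=cql
storage.hostname=scylla-client.scylla.svc
index.search.backend=elasticsearch
index.search.hostname=elasticsearch-master.es.svc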
Idea 1 (Index is not enabled)
This was disproved by running m.printSchema(), which shows that all the indexes are ENABLED.
Idea 2 (Storage backends have different data)
I looked at the data stored in ScyllaDB and got a summary with nodetool cfstats; this does show something different:
# Local
Keyspace : janusgraph
Read Count: 1688328
Read Latency: 2.5682805710738673E-5 ms
Write Count: 1055210
Write Latency: 1.702409946835227E-5 ms
...
Memtable cell count: 126411
Memtable data size: 345700491
Memtable off heap memory used: 480247808
# Production
Keyspace : janusgraph
Read Count: 6367
Read Latency: 2.1203078372860058E-5 ms
Write Count: 21
Write Latency: 0.0 ms
...
Memtable cell count: 4
Memtable data size: 10092
Memtable off heap memory used: 131072
Although I don't know how to explain the difference, it is clear that both backends contain all the data, verified with various count() queries over labels, such as g.V().hasLabel("User").count(), for which both environments report the same result.
Idea 3 (Elasticsearch Warnings)
When launching a gremlin console session, there is a difference in that the production environment shows:
07:27:09 WARN org.elasticsearch.client.RestClient - request [PUT http://*******<i_removed_the_domain>******:9200/_cluster/settings] returned 3 warnings: [299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.data] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.master] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.ml] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."]
but as my problem involves composite indexes, I believe we can disregard the Elasticsearch warnings.
Idea 4 (ScyllaDB cluster node resources)
Another idea I had was increasing the node resources; even with 7 GB of RAM, the problem still persists.
Finally...
I don't know what to try next in order to solve this problem; this is my first time pushing JanusGraph into production and perhaps I have missed something important. I have been stuck on this problem for quite a while, hence asking the community here for help.
Thank you very much for reading this far, and hopefully helping me to solve this problem.
I solved the problem myself. I realised that in the K8s Deployment .yaml file I use for deploying, all environment variables needed to have the prefix janusgraph.; because of the missing prefix, the JanusGraph server was starting with all default settings rather than my selected ones.
Every time I created a gremlin shell session (which connected to its localhost server), even though I was specifying all the correct endpoints and configuration, it was still saving the data according to the default JanusGraph settings. Even in this case, though, I don't know why the indexes were successfully created on my specified backend.
Nonetheless, the solution was to make sure the environment variables have the janusgraph. prefix.
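For illustration, the relevant part of the Deployment ends up looking roughly like this (a minimal sketch assuming the official JanusGraph Docker image, which maps janusgraph.-prefixed environment variables onto server properties; the image tag, service names and replica count are placeholders, not my exact manifest):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: janusgraph
spec:
  replicas: 2
  selector:
    matchLabels:
      app: janusgraph
  template:
    metadata:
      labels:
        app: janusgraph
    spec:
      containers:
        - name: janusgraph
          image: janusgraph/janusgraph:0.5.2   # placeholder tag
          env:
            # Without the janusgraph. prefix these are ignored and the server
            # starts with its default settings.
            - name: janusgraph.storage.backend
              value: cql
            - name: janusgraph.storage.hostname
              value: scylla-client.scylla.svc      # placeholder service name
            - name: janusgraph.index.search.backend
              value: elasticsearch
            - name: janusgraph.index.search.hostname
              value: elasticsearch-master.es.svc   # placeholder service name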

Presto cannot access the web page while using SSL

My Presto version is 0.240.
What I want to do: enable SSL so that Presto can be accessed over HTTPS.
So I changed my config, referring only to this URL: https://trino.io/docs/current/security/internal-communication.html
But I can't access the Presto address https://192.168.100.142:9999/
I don't know which step I did wrong.
What should I do to implement HTTPS for Presto?
This is my config:
A cluster of two machines.
Node 1 (192.168.100.142), hostname: sbider-dev-01
/opt/presto-server-0.240/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
query.max-memory=7.5GB
query.max-memory-per-node=3.5GB
query.max-total-memory-per-node=3.5GB
experimental.reserved-pool-enabled=false
memory.heap-headroom-per-node=0.5GB
#experimental.spill-enabled=true
#experimental.max-spill-per-node=8GB
#experimental.query-max-spill-per-node=8GB
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
#http-server.http.port=9999
#discovery-server.enabled=true
#discovery.uri=http://192.168.100.142:9999
internal-communication.shared-secret="8HRJWX41DwtuYZcNw8uMbshA8wDLoLS78tT3UVL+Z+m0xG7KCygGurE9SXEbGy2bLtPLza1MhAnWJp2mJp/S+j9EFWWuztXz7cHJhSz9QFiVxYCs1Wzn+IVKgHD5z+iGbdKjwRtgUjwNvS4MIfqwqwKlVZiEtGgEDv7j/kAgpOYPvFCRJfb/U/+b7qPpwPNDA6kXu3Dj5p1Q81+kmbFO59WSh6c4QwqdbFHAaY8XFWo8tIogxpmwQQqV3BvICmesxlIhBH/pOGgoyl86QQ/TaAMaWjaddNcgO5keTGhhOj/juGZ/gbOL/PHGNs1ENSPRnjvIGLHFQPDrm36YenhfTH5L7X0Q9HwwnEpEoYkDJsmMEV+elPZK767nZXHryuvDvHGs0PhYSRO8ekOgC3CaE1tfiGh5M9H5C2fnyeGRQ0iwtgXh83kRDuPzVrRx5yj2cHQJOZu+CcXCJ3aa1Tijxq56RfdcEz9Frr8n8aXaNMtRlchcXn3+B4biByS9duq28VHHBDlyYQQ6VSKbLDt1GBi5oOQICtrGuOY+/MD+rnV5uxPUQcSIh9KmA1WjahJEz0ItDKpB66JgVkTrVDWEJPeozKTvHRLG9sBudRhQ5abJGEAhx9b78dUbTcEkRlPuvUN1WjwVlUzjyUDKd14ocuhpoOBzjV9kFhTqQZ4zgNo="
http-server.http.enabled=false
#node.internal-address-source=FQDN
node.internal-address=sbider-dev-01,sbider-dev-02
http-server.https.enabled=true
http-server.https.port=9999
# full path to the jks file
http-server.https.keystore.path=/ceshi/keystore.jks
http-server.https.keystore.key=123456
discovery.uri=https://192.168.100.142:9999
internal-communication.https.required=true
internal-communication.https.keystore.path=/ceshi/keystore.jks
internal-communication.https.keystore.key=123456
Node 2 (192.168.100.143): cat /opt/presto-server-0.240/etc/config.properties
coordinator=flase
query.max-memory=7.5GB
query.max-memory-per-node=3.5GB
query.max-total-memory-per-node=3.5GB
experimental.reserved-pool-enabled=false
memory.heap-headroom-per-node=0.5GB
#experimental.spill-enabled=true
#experimental.max-spill-per-node=8GB
#experimental.query-max-spill-per-node=8GB
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
#discovery.uri=http://192.168.100.142:9999
internal-communication.shared-secret="8HRJWX41DwtuYZcNw8uMbshA8wDLoLS78tT3UVL+Z+m0xG7KCygGurE9SXEbGy2bLtPLza1MhAnWJp2mJp/S+j9EFWWuztXz7cHJhSz9QFiVxYCs1Wzn+IVKgHD5z+iGbdKjwRtgUjwNvS4MIfqwqwKlVZiEtGgEDv7j/kAgpOYPvFCRJfb/U/+b7qPpwPNDA6kXu3Dj5p1Q81+kmbFO59WSh6c4QwqdbFHAaY8XFWo8tIogxpmwQQqV3BvICmesxlIhBH/pOGgoyl86QQ/TaAMaWjaddNcgO5keTGhhOj/juGZ/gbOL/PHGNs1ENSPRnjvIGLHFQPDrm36YenhfTH5L7X0Q9HwwnEpEoYkDJsmMEV+elPZK767nZXHryuvDvHGs0PhYSRO8ekOgC3CaE1tfiGh5M9H5C2fnyeGRQ0iwtgXh83kRDuPzVrRx5yj2cHQJOZu+CcXCJ3aa1Tijxq56RfdcEz9Frr8n8aXaNMtRlchcXn3+B4biByS9duq28VHHBDlyYQQ6VSKbLDt1GBi5oOQICtrGuOY+/MD+rnV5uxPUQcSIh9KmA1WjahJEz0ItDKpB66JgVkTrVDWEJPeozKTvHRLG9sBudRhQ5abJGEAhx9b78dUbTcEkRlPuvUN1WjwVlUzjyUDKd14ocuhpoOBzjV9kFhTqQZ4zgNo="
http-server.http.enabled=false
#node.internal-address-source=FQDN
node.internal-address=sbider-dev-01,sbider-dev-02
http-server.https.enabled=true
http-server.https.port=9999
http-server.https.keystore.path=/ceshi/keystore.jks
http-server.https.keystore.key=123456
discovery.uri=https://192.168.100.142:9999
internal-communication.https.required=true
internal-communication.https.keystore.path=/ceshi/keystore.jks
internal-communication.https.keystore.key=123456
Server log on sbider-dev-01: cat /opt/presto-server-0.240/var/log/server.log
Companion catalogs: catalog_name1=catalog_name2,catalog_name3=catalog_name4,...
2021-01-12T12:41:09.766+0800 INFO main Bootstrap transaction.idle-check-interval 1.00m 1.00m Time interval between idle transactions checks
2021-01-12T12:41:09.766+0800 INFO main Bootstrap transaction.idle-timeout 5.00m 5.00m Amount of time before an inactive transaction is considered expired
2021-01-12T12:41:09.767+0800 INFO main Bootstrap transaction.max-finishing-concurrency 1 1 Maximum parallelism for committing or aborting a transaction
2021-01-12T12:41:09.767+0800 WARN main Bootstrap UNUSED PROPERTIES
2021-01-12T12:41:09.767+0800 WARN main Bootstrap internal-communication.shared-secret
2021-01-12T12:41:09.767+0800 WARN main Bootstrap
2021-01-12T12:41:11.037+0800 ERROR main com.facebook.presto.server.PrestoServer Unable to create injector, see the following errors:
1) Configuration property 'internal-communication.shared-secret' was not used
at com.facebook.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:238)
1 error
com.google.inject.CreationException: Unable to create injector, see the following errors:
1) Configuration property 'internal-communication.shared-secret' was not used
at com.facebook.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:238)
1 error
at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:543)
at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:159)
at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
at com.google.inject.Guice.createInjector(Guice.java:87)
at com.facebook.airlift.bootstrap.Bootstrap.initialize(Bootstrap.java:245)
at com.facebook.presto.server.PrestoServer.run(PrestoServer.java:131)
at com.facebook.presto.server.PrestoServer.main(PrestoServer.java:77)
You're following Trino (formerly Presto SQL) documentation for securing internal communication, but you got your Presto binary from Facebook's fork of the project (prestodb).
Go to https://trino.io/download.html to get the latest Trino release.
The alternative (using prestodb's documentation with prestodb's binary) is NOT a safe, viable option, due to known security issues that are not fixed in the prestodb code base.

Tracking down S3 Costs (S3FS)

During a transition, our S3 costs jumped a lot due to ListBucket and HeadObject calls. We are trying to figure out how to debug the sudden increase. We made some changes that should NOT have affected it, but the major changes seem to be:
10-20X increase in HeadObject calls
Sudden appearance of ListBucket calls
I have attached a chart showing the jump between April 10, 2018 and April 14, 2018. On the dates in between, we made the following changes:
Changed from (debian 8) S3FS v1.61 (super old from 2012, not even in Github) to v1.84 (latest)
https://github.com/s3fs-fuse/s3fs-fuse
Moved from N. Virginia to N. California AZ (10% higher cost)
The giant yellow bars show the files being moved using the Amazon CLI (April 11 to 13).
In order to try to calm this down, we added the following to the mount options in /etc/fstab:
noatime,stat_cache_expire=3600,enable_noobj_cache
The bars that look uneven starting Apr 14 are now stable around $25/day
The options below were already there from the start (no change):
_netdev,allow_other,use_cache=/tmp,umask=0000,use_path_request_style,ensure_diskfree=10240
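For completeness, the full /etc/fstab entry now looks roughly like this (a sketch; the bucket name and mount point are placeholders for our real ones):
mybucket /mnt/s3fs fuse.s3fs _netdev,allow_other,use_cache=/tmp,umask=0000,use_path_request_style,ensure_diskfree=10240,noatime,stat_cache_expire=3600,enable_noobj_cache 0 0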
We have done the following to try to debug this:
Enabled S3 logging
Dumped the logs into Athena and then exported a CSV into MySQL
These logs are just one day's worth
Screenshot "query 1" shows that there are 4.8M hits to a path ... basically, we think it is traversing the entire directory tree (most likely about 100k files) to check whether a file exists
Screenshot "query 2" shows the same thing (kind of), where it is also going down a path
Not really sure what else to do, but our normal bill of about $5/day (including other services) is now about $25/day (a 5x increase). With the /etc/fstab changes it is down to $13/day, but we are still trying to get back to $5/day, which means getting back to zero ListBucket calls and about 20% of the current HeadObject calls.
Any ideas on what to try are greatly appreciated.
The ListBucket and HeadObject API calls were being made by updatedb (used by locate).
Solution: add your mount point (in my case /mnt/s3fs) to PRUNEPATHS in /etc/updatedb.conf so that updatedb does not include it when it scans.
https://linux.die.net/man/5/updatedb.conf
https://github.com/s3fs-fuse/s3fs-fuse/issues/193#issuecomment-109617253
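On a typical /etc/updatedb.conf this just means appending the mount point to the existing PRUNEPATHS list, something like the following (the other paths are whatever your distribution already ships; only /mnt/s3fs is the addition):
PRUNEPATHS="/tmp /var/spool /media /mnt/s3fs"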

Need help in Apache Camel multicast/parallel/concurrent processing

I am trying to achieve concurrent/parallel processing for my requirement, but I have not found the right approach in my multiple attempts so far.
I have 5 remote directories (which may be added or removed) that contain log files. I want to download them every 15 minutes to my local directory and perform Lucene indexing after the FTP transfer job completes. I also want to add routes dynamically.
Since all those remote machines are different endpoints with different routes, I don't have any particular endpoint to kick all of this off.
Start
<parallel>
<download remote dir from: sftp1>
<download remote dir from: sftp2>
....
</parallel>
<After above task complete>
<start Lucene indexing>
<end>
Repeat the above every 15 minutes.
I want to download all the folders in parallel. Kindly suggest a solution if anybody has worked on a similar requirement.
I would like to know how these multiple routes (one per remote directory) should be kick-started when I don't have a starting endpoint. I would like to start all FTP operations in parallel and, once they complete, start indexing. Thanks for taking the time to read this post; I really appreciate your help.
I tried something like this:
from("bean:foo?method=start").multicast().to("direct:a").to("direct:b")...
from("direct:a").from("sftp:xxx").to("file:localdir")
from("direct:b").from("sftp:xxx").to("file:localdir")
camel-ftp supports periodic polling via the consumer.delay property.
Add camel-ftp consumer routes dynamically for each server, as shown in this unit test.
You can then aggregate your results based on a size or timeout value to initiate the Lucene indexing, etc.
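A rough, untested sketch of how this can be wired together (assuming Camel 3.x; hosts, credentials, the local directory, the timeout and the indexing step are all placeholders to replace with your own):
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class FtpIndexingRoutes extends RouteBuilder {

    // Placeholder endpoints -- one entry per remote directory; add or remove as needed.
    // delay=900000 polls every 15 minutes (older releases use consumer.delay instead).
    private static final String[] SFTP_URIS = {
        "sftp://user@host1/logs?password=secret&delay=900000",
        "sftp://user@host2/logs?password=secret&delay=900000"
    };

    @Override
    public void configure() {
        // One consumer route per remote directory; Camel polls them concurrently.
        for (int i = 0; i < SFTP_URIS.length; i++) {
            from(SFTP_URIS[i])
                .routeId("sftp-download-" + i)
                .to("file:/data/local/logs")   // download into the local directory
                .to("seda:downloaded");        // hand each file off for aggregation
        }

        // Collect downloaded files and fire once no new file has arrived for 60s,
        // i.e. the polling cycle has finished, then kick off the indexing step.
        from("seda:downloaded")
            .aggregate(constant(true), (oldExchange, newExchange) ->
                newExchange != null ? newExchange : oldExchange)
            .completionTimeout(60000)
            .to("direct:luceneIndexing");

        from("direct:luceneIndexing")
            .log("Batch complete, starting Lucene indexing")
            .process(exchange -> {
                // call your Lucene indexing code here
            });
    }

    public static void main(String[] args) throws Exception {
        Main main = new Main();
        main.configure().addRoutesBuilder(new FtpIndexingRoutes());
        main.run(args);
    }
}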

Restart my delta loading after deleting the InfoPackage in the PSA by mistake

Here I have one issue; can someone please help me to resolve it?
I was trying to extract some data to DataSource 0FI_AP_6...
Then in the InfoPackage Monitor I can see the following:
-->Requests (messages): Everything OK
-->Extraction (messages): Everything OK
-->Transfer (IDocs and TRFC): Missing messages or warnings
-->Info IDoc 2 : sent, not arrived ; IDoc ready for dispatch (ALE service)
Data Package 1 : 23752 Records arrived in BW
Data Package 2 : 15216 Records arrived in BW
Request IDoc : Application document posted
Info IDoc 1 : Application document posted
Info IDoc 3 : Application document posted
Info IDoc 4 : Application document posted
-->Processing (data packet): Everything OK
Data Package 1 ( 38672 Records ) : Everything OK
In the Status menu I get a message like:
Missing data packages for PSA Table
Diagnosis
Data packets are missing from PSA Table. BI processing does not return any errors. The data transport from the source system to BI was probably incorrect.
Procedure
Check the tRFC overview in the source system. You access this log using the wizard or following the menu path "Environment -> Transact. RFC -> Source System".
Error handling: If the tRFC is incorrect, resolve the errors listed there.
Check that the source system is connected properly to BI. In particular, check the remote user authorizations in BI.
Please suggest how I can resolve this issue...
Thanks in advance for your help; a quick reply is much appreciated.
The worst thing is that I deleted the InfoPackage request in the PSA by mistake.
Normally, if I repeated the process again, the delta load would be OK, but now the delta load keeps failing.
So, gurus:
1. How can I restart my delta loading correctly?
2. I want to modify the timestamp in the delta table, but how do I do it?
Go to T-Code RSA7 in the source system. This will tell you the date/timestamp that the delta is set to. If the date was changed to a range that no longer works, then you will need to re-initialize the DataSource on the BW system side. However, the delta date may still be fine, because it may never have been changed when you first tried to do your load, due to the connection issues.
You can create a new InfoPackage and set the update mode to Initialize DataSource with Data Transfer. This will essentially run a full load from the DataSource and then reset the delta pointer date/timestamp to when you ran it. This way you will capture all the data that you need, and anything that was already in the PSA should be overwritten.
Also note that you should delete, or set the request status to red on, the previous request that may contain bad data in the PSA.
From the original error it seems like you are having an RFC connection issue between the DataSource and BW. Contact your BASIS support and have them check the connection to make sure it is good. To verify that your DataSource is extracting properly, you can run t-code RSA3 on it in the source system; this will confirm that the extraction of data is working properly.