Manual AWS X-Ray traces not showing even though they are sent - aws-xray

I'm sending X-Ray data from Python manually (no Django, Flask, etc.). I can see the segments being sent in the logs, for example:
Jan 24 16:50:17 ip-172-16-7-143 python3[10700]: DEBUG:sending: {"format":"json","version":1}
Jan 24 16:50:17 ip-172-16-7-143 python3[10700]: {"aws": {"xray": {"sdk": "X-Ray for Python", "sdk_version": "2.4.3"}}, "end_time": 1579884617.5194468, "id": "c59efdf40abecd22", "in_progress": false, "name": "handle request", "service": {"runtime": "CPython", "runtime_version": "3.6.9"}, "start_time": 1579884515.5117097, "trace_id": "1-5e2b1fe3-c1c3cbc802cae49e9c364371"} to 127.0.0.1:2000.
But nothing shows up in the console. I've tried all the different filters and time frames, but nothing shows up. Where should I be looking?
UPDATE:
Adding the X-Ray daemon logs:
2020-01-24T01:50:35Z [Info] Initializing AWS X-Ray daemon 3.2.0
2020-01-24T01:50:35Z [Info] Using buffer memory limit of 9 MB
2020-01-24T01:50:35Z [Info] 144 segment buffers allocated
2020-01-24T01:50:35Z [Info] Using region: us-east-2
2020-01-24T01:50:35Z [Info] HTTP Proxy server using X-Ray Endpoint : https://xray.us-east-2.amazonaws.com
2020-01-24T01:50:35Z [Info] Starting proxy http server on 127.0.0.1:2000

From the log it looks like your X-Ray daemon never received any trace segments; otherwise there would be a log line like "[Info] Successfully sent batch of 1 segments (0.100 seconds)".
Are you using the official X-Ray Python SDK? How does the "manual sending" work? Please verify that the daemon and your application are running in the same network environment. For example, if the daemon is running in a container, make sure its UDP port 2000 is exposed, and vice versa.
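To check whether anything reaches the daemon at all, it can help to send one segment by hand. The following is only a minimal sketch: it assumes the daemon listens for segments on UDP 127.0.0.1:2000 (its default), reuses the header line and field names visible in the question's log, and the function names are made up for illustration.

```python
import json
import secrets
import socket

# Header line that precedes the segment document (copied from the log in the question).
HEADER = '{"format":"json","version":1}'


def build_datagram(name, start_time, end_time):
    """Return the bytes of one UDP datagram: header, newline, segment JSON."""
    segment = {
        "name": name,
        "id": secrets.token_hex(8),  # 16 hex digits
        # Trace ID: version 1, 8 hex digits of epoch seconds, 24 random hex digits.
        "trace_id": "1-{:08x}-{}".format(int(start_time), secrets.token_hex(12)),
        "start_time": start_time,
        "end_time": end_time,
    }
    return (HEADER + "\n" + json.dumps(segment)).encode("utf-8")


def send_segment(datagram, address=("127.0.0.1", 2000)):
    """Send one datagram over UDP. UDP is fire-and-forget, so a missing
    listener does not raise an error here."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)
```

Calling send_segment(build_datagram("handle request", start, end)) with two epoch timestamps should, if the daemon is reachable, eventually produce the "Successfully sent batch" line mentioned above. If it does not, the datagram is probably not arriving at the daemon's UDP port.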

X-Ray daemon doesn't receive any data from Envoy

I have a service running a task definition with three containers:
service itself
envoy
x-ray daemon
I want to trace and monitor how my services interact with each other using X-Ray.
But I don't see any data in X-Ray.
I can see the request logs and everything else in the Envoy logs, and there are no error messages about a missing connection to the X-Ray daemon.
Envoy container has three env variables:
APPMESH_VIRTUAL_NODE_NAME = mesh/mesh-name/virtualNode/service-virtual-node
ENABLE_ENVOY_XRAY_TRACING = 1
ENVOY_LOG_LEVEL = trace
The X-Ray daemon container is pretty plain and has just a name and an image (amazon/aws-xray-daemon:1).
But the logs of the X-Ray daemon contain only the following:
2022-05-31T14:48:05.042+02:00 2022-05-31T12:48:05Z [Info] Initializing AWS X-Ray daemon 3.0.0
2022-05-31T14:48:05.042+02:00 2022-05-31T12:48:05Z [Info] Using buffer memory limit of 76 MB
2022-05-31T14:48:05.042+02:00 2022-05-31T12:48:05Z [Info] 1216 segment buffers allocated
2022-05-31T14:48:05.051+02:00 2022-05-31T12:48:05Z [Info] Using region: eu-central-1
2022-05-31T14:48:05.788+02:00 2022-05-31T12:48:05Z [Error] Get instance id metadata failed: RequestError: send request failed
2022-05-31T14:48:05.788+02:00 caused by: Get http://169.254.169.254/latest/meta-data/instance-id: dial tcp xxx.xxx.xxx.254:80: connect: invalid argument
2022-05-31T14:48:05.789+02:00 2022-05-31T12:48:05Z [Info] Starting proxy http server on 127.0.0.1:2000
As far as I read, the error you can see in these logs doesn't affect the functionality (https://repost.aws/questions/QUr6JJxyeLRUK5M4tadg944w).
I'm pretty sure I'm missing a configuration or access right.
It's already running on staging, but I set that up several weeks ago and I can't find any differences between the configurations.
Thanks in advance!
In my case, it was a copy-paste mistake: I had copied a trailing line break into the name of the environment variable ENABLE_ENVOY_XRAY_TRACING. It wasn't visible in the overview, only inside the text field.
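A hidden character like this is easy to catch by printing the variable names with repr(). Here is a small sketch; the dictionary is a hypothetical stand-in for the container's environment section, and the helper name is made up.

```python
def suspicious_names(env):
    """Return env var names with leading/trailing whitespace or embedded line breaks."""
    return [name for name in env if name != name.strip() or "\n" in name or "\r" in name]


# Hypothetical container environment containing the copy-paste mistake described above.
container_env = {
    "APPMESH_VIRTUAL_NODE_NAME": "mesh/mesh-name/virtualNode/service-virtual-node",
    "ENABLE_ENVOY_XRAY_TRACING\n": "1",
    "ENVOY_LOG_LEVEL": "trace",
}

for name in suspicious_names(container_env):
    print(repr(name))  # repr() makes the hidden "\n" visible
```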

400 bad request when attempting connection to AWS Neptune with IAM enabled

I am unable to connect to a Neptune instance that has IAM enabled. I have followed the AWS documentation (and corrected a few of my own silly errors on the way), but without luck.
When I connect via my Java application using the SigV4Signer, and when I use the Gremlin console, I get a 400 Bad Request websocket error.
o.a.t.g.d.Handler$GremlinResponseHandler : Could not process the response
io.netty.handler.codec.http.websocketx.WebSocketHandshakeException: Invalid handshake response getStatus: 400 Bad Request
at io.netty.handler.codec.http.websocketx.WebSocketClientHandshaker13.verify(WebSocketClientHandshaker13.java:267)
at io.netty.handler.codec.http.websocketx.WebSocketClientHandshaker.finishHandshake(WebSocketClientHandshaker.java:302)
at org.apache.tinkerpop.gremlin.driver.handler.WebSocketClientHandler.channelRead0(WebSocketClientHandler.java:69)
When I run com.amazon.neptune.gremlin.driver.example.NeptuneGremlinSigV4Example (from my machine over port-forwarding AND from the EC2 jumphost) I get:
java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists
I am able to connect to my Neptune instance using the older, deprecated certificate mechanism. I am using a jumphost EC2 instance and port-forwarding.
I believe that the SigV4 part is working, because in the Neptune audit logs I can see attempts to connect with the aws_access_key:
1584098990319, <jumphost_ip>:47390, <db_instance_ip>:8182, HTTP_GET, [unknown], [unknown], "HttpObjectAggregator$AggregatedFullHttpRequest(decodeResult: success, version: HTTP/1.1, content: CompositeByteBuf(ridx: 0, widx: 0, cap: 0, components=0)) GET /gremlin HTTP/1.1 upgrade: websocket connection: upgrade sec-websocket-key: g44zxck9hTI9cZrq05V19Q== sec-websocket-origin: http://localhost:8182 sec-websocket-version: 13 Host: localhost:8182 X-Amz-Date: 20200313T112950Z Authorization: AWS4-HMAC-SHA256 Credential=<my_access_key>/20200313/eu-west-2/neptune-db/aws4_request, SignedHeaders=host;sec-websocket-key;sec-websocket-origin;sec-websocket-version;upgrade;x-amz-date, Signature=<the_signature> content-length: 0", /gremlin
But when I look
This is the policy that I created:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "neptune-db:*"
      ],
      "Resource": [
        "arn:aws:neptune-db:eu-west-2:<my_aws_account>:*/*"
      ]
    }
  ]
}
I have previously tried with a policy that references my cluster resource id.
I created a new API user with this policy attached as its only permission. (I've tried this twice.)
IAM is showing me that the graph-user I created has not successfully logged in (duh).
It seems the issue is with the IAM set-up somewhere along the line. Is it possible to get more information out of AWS about why the connection attempt is failing?
I am using the most recent release of Neptune and the 3.4.3 Gremlin Driver and console. I am using Java 8 when running the NeptuneGremlinSigV4Example and building the libraries to deploy to the console.
thanks
It appears from the audit log output that the SigV4 Signature that is being created is using localhost as the Host header. This is most likely due to the fact that you're using a proxy to connect to Neptune. By default, the NeptuneGremlinSigV4Example assumes that you're connecting directly to a Neptune endpoint and reuses the endpoint as the Host header in creating the Signature.
To get around this, you can use the following example code that overrides this process and allows you to use a proxy and still sign the request properly.
https://github.com/aws-samples/amazon-neptune-samples/tree/master/gremlin/gremlin-java-client-demo
I was able to get this to work using the following.
Create an SSH tunnel from your local workstation to your EC2 jumphost:
ssh -i <key-pem-file> -L 8182:<neptune-endpoint>:8182 ec2-user@<ec2-jumphost-hostname>
Set the following environment variables:
export AWS_ACCESS_KEY_ID=<access_key>
export AWS_SECRET_ACCESS_KEY=<secret_key>
export SERVICE_REGION=<region_id>  # e.g. us-west-2
Once the tunnel is up and your environment variables are set, use the following format with the Gremlin-Java-Client-Demo:
java -jar target/gremlin-java-client-demo.jar --nlb-endpoint localhost --lb-port 8182 --neptune-endpoint <neptune-endpoint> --port 8182 --enable-ssl --enable-iam-auth
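To see why the Host header matters, recall that SigV4 hashes a "canonical request" that includes every signed header, host among them. The sketch below is a simplified illustration, not the Neptune driver's code: it signs only host and x-amz-date (the real request in the audit log signs more headers), stops at the canonical-request hash instead of computing the full signature, and uses a placeholder Neptune hostname.

```python
import hashlib

# SHA-256 of an empty body, as used for a GET request without payload.
EMPTY_PAYLOAD_HASH = hashlib.sha256(b"").hexdigest()


def canonical_request_hash(host, amz_date):
    """Hash of a simplified SigV4 canonical request for GET /gremlin."""
    canonical_request = "\n".join([
        "GET",                     # HTTP method
        "/gremlin",                # canonical URI
        "",                        # canonical query string (empty)
        "host:" + host,            # canonical headers, one per line...
        "x-amz-date:" + amz_date,
        "",                        # ...followed by a blank line
        "host;x-amz-date",         # signed header names
        EMPTY_PAYLOAD_HASH,        # hash of the request body
    ])
    return hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()


# What the client signed when going through the tunnel (Host taken from the audit log).
via_tunnel = canonical_request_hash("localhost:8182", "20200313T112950Z")
# Same request signed for the real endpoint (placeholder hostname, an assumption).
direct = canonical_request_hash("<neptune-endpoint>:8182", "20200313T112950Z")

print(via_tunnel == direct)  # prints False
```

Because the two hashes differ, a signature computed for one Host value cannot be valid for the other. That is consistent with the explanation above that the client must sign with the Neptune endpoint while connecting to localhost.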

Erlang fails to resolve IPv6 addresses using parameters from RabbitMQ

I'm running a RabbitMQ cluster in k8s, which has only IPv6 addresses. inet returns an nxdomain error when resolving the k8s service name.
The parameters passed to Erlang from RabbitMQ are:
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc' -proto_dist inet6_tcp"
RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp"
erl_inetrc: |-
{inet6, true}.
When RabbitMQ uses its plugin rabbit_peer_discovery_k8s to call the k8s API:
2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-10-15 07:33:55.000 [debug] <0.238.0> GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/tazou/endpoints/zt4-crmq
2019-10-15 07:33:55.015 [debug] <0.238.0> Response: {error,{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}}
2019-10-15 07:33:55.015 [debug] <0.238.0> HTTP Error {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.015 [info] <0.238.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdomain}]}
2019-10-15 07:33:55.016 [error] <0.237.0> CRASH REPORT Process <0.237.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167 in application_master:init/4 line 138
2019-10-15 07:33:55.016 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n
In the k8s console, the address can be resolved:
[rabbitmq]# nslookup -type=AAAA kubernetes.default.svc.cluster.local
Server: 2019:282:4000:2001::6
Address: 2019:282:4000:2001::6#53
kubernetes.default.svc.cluster.local has AAAA address fd01:abcd::1
inet itself can also return the IPv6 address:
kubectl exec -ti zt4-crmq-0 rabbitmqctl eval 'inet:gethostbyname("kubernetes.default.svc.cluster.local").'
{ok,{hostent,"kubernetes.default.svc.cluster.local",[],inet6,16,
[{64769,43981,0,0,0,0,0,1}]}}
As far as I know, the plugin calls httpc:request to invoke the k8s API. I don't know what the difference is between httpc:request and inet:gethostbyname, nor what httpc:request uses to resolve the hostname.
I asked on the RabbitMQ plugin's issue tracker and was told that the plugin is not aware of how Erlang resolves the address: https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/issues/55.
Is there anything else I could set in erl_inetrc so that Erlang can resolve the IPv6 address? What configuration did I miss? Or how could I debug this from the Erlang side? I'm new to Erlang.
B.R,
Tao

Duplicate field 'Status' when I try to run 'pio status'

Whenever I run pio status, I get the following error:
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.13.0 is installed at /Users/prvns/tools/PredictionIO-0.13.0
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at /Users/prvns/tools/PredictionIO-0.13.0/vendors/spark-2.3.1-bin-hadoop2.7
[INFO] [Management$] Apache Spark 2.3.1 detected (meets minimum requirement of 1.6.3)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.
PUT http://localhost:9200/pio_meta/_mapping/engine_instances: HTTP/1.1 400 Bad Request
{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to parse content to map"}],"type":"parse_exception","reason":"Failed to parse content to map","caused_by":{"type":"json_parse_exception","reason":"Duplicate field 'status'\n at [Source: org.elasticsearch.common.compress.DeflateCompressor$1@6b496f00; line: 1, column: 462]"}},"status":400} (org.apache.predictionio.shaded.org.elasticsearch.client.ResponseException)
Dumping configuration of initialized storage backend sources.
Please make sure they are correct.
Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME -> /usr/local/Cellar/elasticsearch/6.2.4/, HOSTS -> localhost, PORTS -> 9200, SCHEMES -> http, TYPE -> elasticsearch
My pio-env.sh looks like this:
SPARK_HOME=$PIO_HOME/vendors/spark-2.3.1-bin-hadoop2.7
POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/Cellar/elasticsearch/6.2.4/
Why is this not working?
I was using Elasticsearch 6.x. I replaced it with Elasticsearch 5.x and it worked.
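If you are not sure which Elasticsearch version is actually answering on port 9200, you can read it from the node's root document. This is a rough sketch: the helper names are made up, and the sample dictionary is abbreviated to the one field that matters here.

```python
import json
import urllib.request


def major_version(root_info):
    """Extract the major version from the JSON document Elasticsearch serves at '/'."""
    return int(root_info["version"]["number"].split(".")[0])


def fetch_root_info(base_url="http://localhost:9200"):
    """Fetch the root document from a running Elasticsearch node."""
    with urllib.request.urlopen(base_url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


# Abbreviated sample of what a 6.2.4 node returns (other fields omitted).
sample = {"version": {"number": "6.2.4"}}
print(major_version(sample))  # prints 6
```

Against a live node, major_version(fetch_root_info()) gives the same information. A result of 6 would match the failing setup in the question.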

Error running topology in production cluster with Apache Storm 1.0.0, topology does not start

I have a topology that runs well on a local cluster.
But when I try to run it on a production cluster, the following happens:
Nimbus is up
The Storm UI is up
The two workers I use are up
ZooKeeper is up
I run Storm with
storm jar myjar.jar MyClass
Nimbus submits the topology
The topology and the workers appear in the Storm UI
BUT:
The topology does not start, even though its status is ACTIVE
The topology's log file does not appear on the workers.
I see the following in supervisor.log on the worker:
2016-04-15 13:18:19.831 o.a.s.d.supervisor [WARN] There was a connection problem with nimbus. #error {
:cause jobs-rec-storm-nimbus
:via
[{:type java.lang.RuntimeException
:message org.apache.storm.thrift.transport.TTransportException: java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.security.auth.TBackoffConnect retryNext TBackoffConnect.java 64]}
{:type org.apache.storm.thrift.transport.TTransportException
:message java.net.UnknownHostException: jobs-rec-storm-nimbus
:at [org.apache.storm.thrift.transport.TSocket open TSocket.java 226]}
{:type java.net.UnknownHostException
:message jobs-rec-storm-nimbus
:at [java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]}]
:trace
[[java.net.AbstractPlainSocketImpl connect AbstractPlainSocketImpl.java 184]
[java.net.SocksSocketImpl connect SocksSocketImpl.java 392]
[java.net.Socket connect Socket.java 589]
[org.apache.storm.thrift.transport.TSocket open TSocket.java 221]
[org.apache.storm.thrift.transport.TFramedTransport open TFramedTransport.java 81]
[org.apache.storm.security.auth.SimpleTransportPlugin connect SimpleTransportPlugin.java 103]
[org.apache.storm.security.auth.TBackoffConnect doConnectWithRetry TBackoffConnect.java 53]
[org.apache.storm.security.auth.ThriftClient reconnect ThriftClient.java 99]
[org.apache.storm.security.auth.ThriftClient <init> ThriftClient.java 69]
[org.apache.storm.utils.NimbusClient <init> NimbusClient.java 106]
[org.apache.storm.utils.NimbusClient getConfiguredClientAs NimbusClient.java 78]
[org.apache.storm.utils.NimbusClient getConfiguredClient NimbusClient.java 41]
[org.apache.storm.blobstore.NimbusBlobStore prepare NimbusBlobStore.java 268]
[org.apache.storm.utils.Utils getClientBlobStoreForSupervisor Utils.java 462]
[org.apache.storm.daemon.supervisor$fn__9590 invoke supervisor.clj 942]
[clojure.lang.MultiFn invoke MultiFn.java 243]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351$fn__9369 invoke supervisor.clj 582]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9351 invoke supervisor.clj 581]
[org.apache.storm.event$event_manager$fn__8903 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-04-15 13:18:19.831 o.a.s.d.supervisor [INFO] Finished downloading code for storm id jobs-KafkaMigration-topology-3-1460740616
2016-04-15 13:18:19.850 o.a.s.d.supervisor [INFO] Missing topology storm code, so can't launch worker with assignment ...(some more numbers)
So I assume that I have a connection problem with nimbus, but the properties file on the worker is:
storm.zookeeper.servers:
- "192.168.22.209"
- "192.168.22.216"
- "192.168.22.217"
storm.local.dir: "/app/home/storm"
storm.zookeeper.root: "/storm-prod"
#
nimbus.seeds: ["192.168.120.96"]
And if I ping the nimbus IP from the workers, it succeeds.
Where is the error, and how can I fix it?
Thanks!
What appears to happen here is that the Storm supervisor resolves nimbus from whatever is configured in storm.yaml (seeds/host) the first time, and from then on uses the nimbus host name to download the topology artifacts.
If that is correct, DNS is mandatory for a cluster setup. This is far from ideal, especially when using containers in an orchestrated environment like Kubernetes.
The workaround I'm currently using is adding
storm.local.hostname: "<local.ip.value>"
to the storm.yaml
Thanks to @bastien, who provided the tip on the Storm user mailing list.
I ran into a similar issue. It turned out my firewall rules were blocking the supervisor ports. Make sure the supervisor and nimbus are able to talk to each other.
I found that the hostnames of the boxes need to match the names I was using for them in the /etc/hosts file.
In the hosts file I had
xxx.xxx.xxx.xxx nimbus
but the host name on the box was different, and Storm was pulling the hostname from the OS.
Changing the host name in the OS of the nimbus server resolved my issue.
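Since the answers above all come down to whether the supervisor can resolve the name that nimbus advertises, a quick check on each worker is to try resolving that name directly. A minimal sketch follows; the function name is made up, and the resolver parameter exists only so the check can be exercised without real DNS.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None if resolution fails."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results
```

Running resolve_all(["jobs-rec-storm-nimbus"]) on a supervisor box and getting None back would match the UnknownHostException in the question, even though pinging the nimbus IP works.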