Kafka Connect RegexRouter transform exiting with unrecoverable exception (Amazon S3 sink)

I have built a Kafka pipeline to copy a SQL Server table to S3.
During the sink phase, I'm trying to transform the topic names by dropping the prefix with the RegexRouter transform:
"transforms":"dropPrefix",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"SQLSERVER-TEST-(.*)",
"transforms.dropPrefix.replacement":"$1"
The sink fails with the following message:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:188)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:564)
... 10 more
If I remove the transform, the pipeline works fine.
The problem can be reproduced with this docker-compose:
version: '2'
services:
smtproblem-zookeeper:
image: zookeeper
container_name: smtproblem-zookeeper
ports:
- "2181:2181"
smtproblem-kafka:
image: confluentinc/cp-kafka:5.0.0
container_name: smtproblem-kafka
ports:
- "9092:9092"
links:
- smtproblem-zookeeper
- smtproblem-minio
environment:
KAFKA_ADVERTISED_HOST_NAME : localhost
KAFKA_ZOOKEEPER_CONNECT: smtproblem-zookeeper:2181/kafka
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://smtproblem-kafka:9092
KAFKA_CREATE_TOPICS: "_schemas:3:1:compact"
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
smtproblem-schema_registry:
image: confluentinc/cp-schema-registry:5.0.0
container_name: smtproblem-schema-registry
ports:
- "8081:8081"
links:
- smtproblem-kafka
- smtproblem-zookeeper
environment:
SCHEMA_REGISTRY_HOST_NAME: http://smtproblem-schema_registry:8081
SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://smtproblem-kafka:9092
SCHEMA_REGISTRY_GROUP_ID: schema_group
smtproblem-kafka-connect:
image: confluentinc/cp-kafka-connect:5.0.0
container_name: smtproblem-kafka-connect
command: bash -c "wget -P /usr/share/java/kafka-connect-jdbc http://central.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar && /etc/confluent/docker/run"
ports:
- "8083:8083"
links:
- smtproblem-zookeeper
- smtproblem-kafka
- smtproblem-schema_registry
- smtproblem-minio
environment:
CONNECT_BOOTSTRAP_SERVERS: smtproblem-kafka:9092
CONNECT_REST_PORT: 8083
CONNECT_GROUP_ID: "connect_group"
CONNECT_OFFSET_FLUSH_INTERVAL_MS: 1000
CONNECT_CONFIG_STORAGE_TOPIC: "connect_config"
CONNECT_OFFSET_STORAGE_TOPIC: "connect_offsets"
CONNECT_STATUS_STORAGE_TOPIC: "connect_status"
CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
CONNECT_KEY_CONVERTER: "io.confluent.connect.avro.AvroConverter"
CONNECT_VALUE_CONVERTER: "io.confluent.connect.avro.AvroConverter"
CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: "http://smtproblem-schema_registry:8081"
CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: "http://smtproblem-schema_registry:8081"
CONNECT_INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
CONNECT_INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
CONNECT_REST_ADVERTISED_HOST_NAME: "smtproblem-kafka_connect"
CONNECT_LOG4J_ROOT_LOGLEVEL: INFO
CONNECT_LOG4J_LOGGERS: org.reflections=ERROR
CONNECT_PLUGIN_PATH: "/usr/share/java"
AWS_ACCESS_KEY_ID: localKey
AWS_SECRET_ACCESS_KEY: localSecret
smtproblem-minio:
image: minio/minio:edge
container_name: smtproblem-minio
ports:
- "9000:9000"
entrypoint: sh
command: -c 'mkdir -p /data/datalake && minio server /data'
environment:
MINIO_ACCESS_KEY: localKey
MINIO_SECRET_KEY: localSecret
volumes:
- "./minioData:/data"
smtproblem-sqlserver:
image: microsoft/mssql-server-linux:2017-GA
container_name: smtproblem-sqlserver
environment:
ACCEPT_EULA: "Y"
SA_PASSWORD: "Azertyu&"
ports:
- "1433:1433"
Create a database in the SQL Server container:
$ sudo docker exec -it smtproblem-sqlserver bash
# /opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'Azertyu&'
Create a test database:
create database TEST
GO
use TEST
GO
CREATE TABLE TABLE_TEST (id INT, name NVARCHAR(50), quantity INT, cbMarq INT NOT NULL IDENTITY(1,1), cbModification smalldatetime DEFAULT (getdate()))
GO
INSERT INTO TABLE_TEST VALUES (1, 'banana', 150, 1); INSERT INTO TABLE_TEST VALUES (2, 'orange', 154, 2);
GO
exit
exit
Create a source connector:
curl -X PUT http://localhost:8083/connectors/sqlserver-TEST-source-bulk/config -H 'Content-Type: application/json' -H 'Accept: application/json' -d '{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.password": "Azertyu&",
"validate.non.null": "false",
"tasks.max": "3",
"table.whitelist": "TABLE_TEST",
"mode": "bulk",
"topic.prefix": "SQLSERVER-TEST-",
"connection.user": "SA",
"connection.url": "jdbc:sqlserver://smtproblem-sqlserver:1433;database=TEST"
}'
Create the sink connector:
curl -X PUT http://localhost:8083/connectors/sqlserver-TEST-sink/config -H 'Content-Type: application/json' -H 'Accept: application/json' -d '{
"topics": "SQLSERVER-TEST-TABLE_TEST",
"topics.dir": "TABLE_TEST",
"s3.part.size": 5242880,
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"tasks.max": 3,
"schema.compatibility": "NONE",
"s3.region": "us-east-1",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"s3.bucket.name": "datalake",
"store.url": "http://smtproblem-minio:9000",
"flush.size": 1,
"transforms":"dropPrefix",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"SQLSERVER-TEST-(.*)",
"transforms.dropPrefix.replacement":"$1"
}'
The error can be seen in the Kafka Connect UI, or with a curl status command:
curl -X GET http://localhost:8083/connectors/sqlserver-TEST-sink/status
Thanks for your help

So, if we debug, we can see what it is trying to do.
There is a HashMap keyed by the original topic name (SQLSERVER-TEST-TABLE_TEST-0), but the transform has already been applied to the record (TABLE_TEST-0), so when it looks up the "new" topic name it cannot find the S3 writer for that TopicPartition.
The map therefore returns null, and the subsequent .buffer(record) throws an NPE.
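To illustrate the mismatch (an illustration only, not the actual Confluent source -- in S3SinkTask the map holds TopicPartitionWriter instances rather than plain Objects):
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;

public class LookupMismatchDemo {
    public static void main(String[] args) {
        // Writers get registered under the ORIGINAL topic name when the task opens its partitions.
        Map<TopicPartition, Object> writers = new HashMap<>();
        writers.put(new TopicPartition("SQLSERVER-TEST-TABLE_TEST", 0), new Object());

        // put() builds its lookup key from the record's topic, which RegexRouter has already renamed.
        Object writer = writers.get(new TopicPartition("TABLE_TEST", 0));
        System.out.println(writer); // null -> the subsequent writer.buffer(record) throws the NPE
    }
}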
I had a similar use case before -- writing more than one topic into a single S3 path -- and I ended up having to write a custom partitioner, e.g. class MyPartitioner extends DefaultPartitioner (see the sketch below).
If you build a JAR with custom code like that, put it under /usr/share/java/kafka-connect-storage-common, and then point partitioner.class in the connector config at it, it should work as expected.
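For example, here is a minimal sketch of such a partitioner, assuming the kafka-connect-storage-common DefaultPartitioner API; the package name and the hard-coded prefix are purely illustrative:
package com.example.connect;

import io.confluent.connect.storage.partitioner.DefaultPartitioner;

public class MyPartitioner<T> extends DefaultPartitioner<T> {
    // Strip the JDBC topic prefix from the S3 object path instead of renaming the topic
    // itself, so the sink task still finds its writers under the original topic name.
    @Override
    public String generatePartitionedPath(String topic, String encodedPartition) {
        String cleaned = topic.replaceFirst("^SQLSERVER-TEST-", "");
        return cleaned + "/" + encodedPartition; // assumes "/" as the directory delimiter
    }
}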
I'm not really sure this is a "bug", per se, because further up the call stack there is no way to get a reference to the regex transform at the time the topicPartitionWriters are created with the source topic name(s).
If anything, the storage connector configuration should allow a separate regex transform that can edit the encodedPartition (the path where it writes the files).

Related

RabbitMQ in Kubernetes - Create User as part of Statefulset deployment kind

I am new to Kubernetes and learning by experimenting. I have created a RabbitMQ StatefulSet and it's working. However, the issue I am facing is the way I use its admin portal.
By default RabbitMQ provides the guest/guest credential, but that works only with localhost. This makes me think I'm supposed to create another user for the admin portal, as well as for the connection string my API uses to access RabbitMQ (currently, as a bad practice, the API side also uses guest:guest#....).
I'd like to change this but I don't know how. I can manually log in to the RabbitMQ admin portal (after deployment, using the guest:guest credential) and create a new user, but I'd like to automate that as part of the Kubernetes StatefulSet deployment.
I have tried adding a Kubernetes post-start lifecycle hook, but that did not work well. I have the following items:
rabbitmq-configmap:
rabbitmq.conf: |
## Clustering
#cluster_formation.peer_discovery_backend = k8s
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname
cluster_partition_handling = autoheal
#cluster_formation.k8s.hostname_suffix = rabbitmq.${NAMESPACE}.svc.cluster.local
#cluster_formation.node_cleanup.interval = 10
#cluster_formation.node_cleanup.only_log_warning = true
rabbitmq-serviceaccount:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: rabbitmq
rules:
- apiGroups: [""]
resources: ["endpoints"]
verbs:
- get
- list
- watch
rabbitmq-statefulset:
initContainers:
- name: "rabbitmq-config"
image: busybox
volumeMounts:
- name: rabbitmq-config
mountPath: /tmp/rabbitmq
- name: rabbitmq-config-rw
mountPath: /etc/rabbitmq
command:
- sh
- -c
# the newline is needed since the Docker image entrypoint script appends to the config file
- cp /tmp/rabbitmq/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf && echo '' >> /etc/rabbitmq/rabbitmq.conf;
cp /tmp/rabbitmq/enabled_plugins /etc/rabbitmq/enabled_plugins;
containers:
- name: rabbitmq
image: rabbitmq
ports:
- containerPort: 15672
Any help?
There are multiple ways to do it.
You can use the RabbitMQ CLI to add the user.
Or add these environment variables to change the default username/password instead of guest:
image: rabbitmq:management-alpine
environment:
RABBITMQ_DEFAULT_USER: user
RABBITMQ_DEFAULT_PASS: password
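For the StatefulSet in the question, the equivalent is to set those variables on the rabbitmq container, roughly like this (plain values shown for brevity; a Secret would normally be used):
containers:
  - name: rabbitmq
    image: rabbitmq:management-alpine
    env:
      - name: RABBITMQ_DEFAULT_USER
        value: user
      - name: RABBITMQ_DEFAULT_PASS
        value: password
    ports:
      - containerPort: 15672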
Passing arguments to the image:
https://www.rabbitmq.com/cli.html#passing-arguments
Or mount a configuration file into the RabbitMQ volume.
rabbitmq.conf file:
auth_mechanisms.1 = PLAIN
auth_mechanisms.2 = AMQPLAIN
loopback_users.guest = false
listeners.tcp.default = 5672
#default_pass = admin
#default_user = admin
hipe_compile = false
#management.listener.port = 15672
#management.listener.ssl = false
management.tcp.port = 15672
management.load_definitions = /etc/rabbitmq/definitions.json
#default_pass = admin
#default_user = admin
definitions.json
{
"users": [
{
"name": "user",
"password_hash": "password",
"hashing_algorithm": "rabbit_password_hashing_sha256",
"tags": "administrator"
}
],
"vhosts":[
{"name":"/"}
],
"queues":[
{"name":"qwer","vhost":"/","durable":true,"auto_delete":false,"arguments":{}}
]
}
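To use this with the StatefulSet from the question, definitions.json has to end up under /etc/rabbitmq next to rabbitmq.conf, for example by reusing the question's ConfigMap/initContainer pattern as sketched below. Note that password_hash expects a real salted hash (e.g. generated with rabbitmqctl hash_password), not the literal password.
initContainers:
  - name: rabbitmq-config
    image: busybox
    volumeMounts:
      - name: rabbitmq-config
        mountPath: /tmp/rabbitmq
      - name: rabbitmq-config-rw
        mountPath: /etc/rabbitmq
    command:
      - sh
      - -c
      # copy rabbitmq.conf as before, plus the definitions file referenced by management.load_definitions
      - cp /tmp/rabbitmq/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf &&
        echo '' >> /etc/rabbitmq/rabbitmq.conf &&
        cp /tmp/rabbitmq/definitions.json /etc/rabbitmq/definitions.json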
Another option
Dockerfile
FROM rabbitmq
# Define environment variables.
ENV RABBITMQ_USER user
ENV RABBITMQ_PASSWORD password
ADD init.sh /init.sh
EXPOSE 15672
# Define default command
CMD ["/init.sh"]
init.sh
#!/bin/sh
# Create Rabbitmq user
( sleep 5 ; \
rabbitmqctl add_user $RABBITMQ_USER $RABBITMQ_PASSWORD 2>/dev/null ; \
rabbitmqctl set_user_tags $RABBITMQ_USER administrator ; \
rabbitmqctl set_permissions -p / $RABBITMQ_USER ".*" ".*" ".*" ; \
echo "*** User '$RABBITMQ_USER' with password '$RABBITMQ_PASSWORD' completed. ***" ; \
echo "*** Log in the WebUI at port 15672 (example: http:/localhost:15672) ***") &
# $# is used to pass arguments to the rabbitmq-server command.
# For example if you use it like this: docker run -d rabbitmq arg1 arg2,
# it will be as you run in the container rabbitmq-server arg1 arg2
rabbitmq-server $#
You can read more here

Unable to invoke another service with Dapr

I'm having major problems getting Dapr up and running with my microservices. Every time I try to invoke another service, it returns a 500 error with the message
client error: the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection
The services and dapr sidecars are currently running in docker-compose on our dev machines but will run in Kubernetes when it is deployed properly.
When I look at the logs for the dapr containers in Docker for Windows, I can see the application being discovered on port 443 and a few initialisation messages but nothing else ever gets logged after that, even when I make my invoke request.
I have a container called clients, in which I'm calling an API called test, and this in turn tries to call Microsoft's example weather forecast API in another container called simpleapi.
I'm using Swagger UI to call the APIs. The test API returns 200, but when I put a breakpoint on the invoke, I can see the response is 500.
If I call the weatherforecast API directly using Swagger UI, it returns 200 with the expected payload.
I have the Dapr dashboard running in a container and it doesn't show any applications.
Docker-Compose.yml
version: '3.4'
services:
clients:
image: ${DOCKER_REGISTRY-}clients
container_name: "Clients"
build:
context: .
dockerfile: Clients/Dockerfile
ports:
- "50002:50002"
depends_on:
- placement
- database
networks:
- platform
clients-dapr:
image: "daprio/daprd:edge"
container_name: clients-dapr
command: [
"./daprd",
"-app-id", "clients",
"-app-port", "443",
"-placement-host-address", "placement:50006",
"-dapr-grpc-port", "50002"
]
depends_on:
- clients
network_mode: "service:clients"
simpleapi:
image: ${DOCKER_REGISTRY-}simpleapi
build:
context: .
dockerfile: SimpleAPI/Dockerfile
ports:
- "50003:50003"
depends_on:
- placement
networks:
- platform
simpleapi-dapr:
image: "daprio/daprd:edge"
container_name: simpleapi-dapr
command: [
"./daprd",
"-app-id", "simpleapi",
"-app-port", "443",
"-placement-host-address", "placement:50006",
"-dapr-grpc-port", "50003"
]
depends_on:
- simpleapi
network_mode: "service:simpleapi"
placement:
image: "daprio/dapr"
container_name: placement
command: ["./placement", "-port", "50006"]
ports:
- "50006:50006"
networks:
- platform
dashboard:
image: "daprio/dashboard"
container_name: dashboard
ports:
- "8080:8080"
networks:
- platform
networks:
platform:
Test controller from the Clients API.
[Route("api/[controller]")]
[ApiController]
public class TestController : ControllerBase
{
[HttpGet]
public async Task<ActionResult> Get()
{
var httpClient = DaprClient.CreateInvokeHttpClient();
var response = await httpClient.GetAsync("https://simpleapi/weatherforecast");
return Ok();
}
}
This is a major new project for my company and it's looking like we're going to have to abandon Dapr and implement everything ourselves if we can't get this working soon.
I'm hoping there's some glaringly obvious problem here.
It actually turned out to be quite simple.
I needed to tell Dapr to use SSL.
clients-dapr needed the -app-ssl parameter, so it should have been as follows (simpleapi-dapr needs the same parameter added too):
clients-dapr:
image: "daprio/daprd:edge"
container_name: clients-dapr
command: [
"./daprd",
"-app-id", "clients",
"-app-port", "443",
"-app-ssl",
"-placement-host-address", "placement:50006",
"-dapr-grpc-port", "50002"
]
depends_on:
- clients
network_mode: "service:clients"
You can also run your service on its specific port without Docker and check that Dapr works as expected; you can specify the HTTP port and gRPC port:
dapr run `
--app-id serviceName `
--app-port 5139 `
--dapr-http-port 3500 `
--dapr-grpc-port 50001 `
--components-path ./dapr-components
If the above setup works, then you can set it up with Docker using the solution above.
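As a quick check while dapr run is active, the invocation can also be exercised directly against the sidecar's HTTP API (using the --dapr-http-port from above; serviceName and the method path are placeholders for your own app-id and route):
# call <app-id>/<method> through the local Dapr sidecar listening on port 3500
curl http://localhost:3500/v1.0/invoke/serviceName/method/weatherforecast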

Assign variable within kubernetes yaml job

I would like to run a command within a YAML file for Kubernetes.
Here is the part of the YAML file that I use.
The idea is to calculate a percent value based on the mapped and unmapped counts. mapped and unmapped are set properly, but the percent line fails.
I think the problem comes from the single quotes in the BEGIN statement of the awk command, which I guess need to be escaped?
If mapped=8 and unmapped=7992,
then percent is (8 / (8 + 7992)) * 100 = 0.1%.
command: ["/bin/sh","-c"]
args: ['
...
echo "Executing command" &&
map=${grep -c "^#" outfile.mapped.fq} &&
unmap=${grep -c "^#" outfile.unmapped.fq} &&
percent=$(awk -v CONVFMT="%.10g" -v map="$map" -v unmap="$unmap" "BEGIN { print ((map/(unmap+map))*100)}") &&
echo "finished"
']
Thanks to the community comments from Ed Morton & david.
For the files containing the data, first create a ConfigMap:
outfile.mapped.fq
outfile.unmapped.fq
kubectl create configmap config-volume --from-file=/path_to_directory_with_files/
Create pod:
apiVersion: v1
kind: Pod
metadata:
name: awk-ubu
spec:
containers:
- name: awk-ubuntu
image: ubuntu
workingDir: /test
command: [ "/bin/sh", "-c" ]
args:
- echo Executing_command;
map=$(grep -c "^#" outfile.mapped.fq);
unmap=$(grep -c "^#" outfile.unmapped.fq);
percent=$(awk -v CONVFMT="%.10g" -v map="$map" -v unmap="$unmap" "BEGIN { print ((map/(unmap+map))*100)}");
echo $percent;
echo Finished;
volumeMounts:
- name: special-config
mountPath: /test
volumes:
- name: special-config
configMap:
# Provide the name of the ConfigMap containing the files you want
# to add to the container
name: config-volume
restartPolicy: Never
Once it has completed, verify the result:
kubectl logs awk-ubu
Executing_command
53.3333
Finished
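For a quick local sanity check of the arithmetic itself (outside Kubernetes), the same calculation can be run in any POSIX shell with the example numbers from the question:
# 8 mapped reads out of 8000 total -> 0.1 percent
map=8
unmap=7992
percent=$(awk -v CONVFMT="%.10g" -v map="$map" -v unmap="$unmap" 'BEGIN { print (map / (unmap + map)) * 100 }')
echo "$percent"   # prints 0.1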

Selenium isn't able to reach a docker container with docker-compose run

I have the following docker-compose.yml which starts a chrome-standalone container and a nodejs application:
version: '3.7'
networks:
selenium:
services:
selenium:
image: selenium/standalone-chrome-debug:3
networks:
- selenium
ports:
- '4444:4444'
- '5900:5900'
volumes:
- /dev/shm:/dev/shm
user: '7777:7777'
node:
image: node_temp:latest
build:
context: .
target: development
args:
UID: '${USER_UID}'
GID: '${USER_GID}'
networks:
- selenium
env_file:
- .env
ports:
- '8090:8090'
volumes:
- .:/home/node
depends_on:
- selenium
command: >
sh -c 'yarn install &&
yarn dev'
I'm running the containers as follows:
docker-compose up -d selenium
docker-compose run --service-ports node sh
and starting the e2e tests from within the shell.
When running the e2e tests, selenium can be reached from the node container (through http://selenium:4444), but node isn't reachable from the selenium container.
I have tested this by VNC'ing into the selenium container and pointing the browser to http://node:8090. (The node container is reachable on the host, however, through http://localhost:8090.)
I first thought that docker-compose run doesn't add the running container to the proper network; however, running docker network inspect test_app gives the following:
[
{
"Name": "test_app_selenium",
"Id": "df6517cc7b6446d1712b30ee7482c83bb7c3a9d26caf1104921abd6bbe2caf68",
"Created": "2019-06-30T16:08:50.724889157+02:00",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.31.0.0/16",
"Gateway": "172.31.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"8a76298b237790c62f80ef612debb021549439286ce33e3e89d4ee2f84de3aec": {
"Name": "test_app_node_run_78427bac2fd1",
"EndpointID": "04310bc4e564f831e5d08a0e07891d323a5953fa936e099d20e5e384a6053da8",
"MacAddress": "02:42:ac:1f:00:03",
"IPv4Address": "172.31.0.3/16",
"IPv6Address": ""
},
"ef087732aacf0d293a2cf956855a163a081fc3748ffdaa01c240bde452eee0fa": {
"Name": "test_app_selenium_1",
"EndpointID": "24a597e30a3b0b671c8b19fd61b9254bea9e5fcbd18693383d93d3df789ed895",
"MacAddress": "02:42:ac:1f:00:02",
"IPv4Address": "172.31.0.2/16",
"IPv6Address": ""
}
},
"Options": {},
"Labels": {
"com.docker.compose.network": "selenium",
"com.docker.compose.project": "test_app",
"com.docker.compose.version": "1.24.1"
}
}
]
This shows both containers running on the "selenium" network. I'm not sure, however, whether the node container is properly aliased on the network and whether this is the expected behaviour.
Am I missing some config here?
It seems that docker-compose run names the container differently so that it does not collide with the service name defined in docker-compose.yml; http://node:8090 was therefore not resolvable.
I solved this by adding a --name flag as follows:
docker-compose run --service-ports --name node node sh
EDIT:
It took me a while to notice, but I was overcomplicating the implementation by a lot. The above docker-compose.yml can be simplified by adding host networking. This simply exposes all running containers on localhost and makes them reachable on localhost by their specified ports. Considering that I don't need any encapsulation (it's meant for dev), the following docker-compose.yml sufficed:
version: '3.7'
services:
selenium:
image: selenium/standalone-chrome:3
# NOTE: port definition is useless with network_mode: host
network_mode: host
user: '7777:7777'
node:
image: node_temp:latest
build:
context: .
target: development
args:
UID: '${USER_UID}'
GID: '${USER_GID}'
network_mode: host
env_file:
- .env
volumes:
- .:/home/node
command: >
sh -c 'yarn install &&
yarn dev'

Solr indexing custom JSON creates empty documents

I am quite new to Solr. I currently have it running in cloud mode using docker-compose (my configuration can be seen at the end of the question).
I created a collection called audittrail using the default configuration. The idea is that I'll send event-logging info from another app to Solr. It has a convenient-looking schema full of dynamic fields by default. (I know I shouldn't just use default settings in production; right now I'm looking for a proof of concept.)
Now I'm following this document in an attempt to index some of my data: https://lucene.apache.org/solr/guide/7_2/transforming-and-indexing-custom-json.html#mapping-parameters
> curl 'http://0.0.0.0:8983/api/collections/audittrail/update/json'\
'?split=/events&'\
'f=action_kind_s:/action_kind_s&'\
'f=time_dt:/events/time_dt'\
'&echo=true' \ ########## NOTE this means we're running in debug mode: Solr returns the documents it would create
-H 'Content-type:application/json' -d '{
"action_kind_s": "task_exec",
"events": [
{
"event_kind_s": "start",
"in_transaction_b": false,
"time_dt": "2018-03-09T12:57:07Z"
},
{
"event_kind_s": "start_txn",
"in_transaction_b": true,
"time_dt": "2018-03-09T12:57:07Z"
},
{
"event_kind_s": "diff",
"in_transaction_b": true,
"key_s": "('MerchantWorkerProcess', 5819715045818368L)",
"property_s": "claim_time",
"time_dt": "2018-03-09T12:57:07Z",
"value_dt": "2018-03-09T12:57:07Z"
},
],
"final_status_s": "COMPLETE",
"request_s": "1dfda9955dac6f3cfd76fbedee98b15f6edc0db",
"task_name_s": "0p5k20100CcnMVxaxoWl32WlfPixjV1OFKgv0k1KZ0m_acc_work"
}'
# response:
{
"responseHeader":{
"status":0,
"QTime":1},
"docs":[{},
{},
{}]}
That's three empty documents...
So I thought maybe it was because I wasn't specifying an id. So I gave each event a unique id and tried again with the added &f=id:/events/id. Same result
Originally I tried using wildcards (&f=/**) with the same effect.
There is obviously something missing in my understanding.
So my question is:
What should I do to get my documents populated correctly?
EDIT
Also, my Solr node logs aren't turning up any errors. Here's a sample:
2018-03-09 14:30:50.770 INFO (qtp257895351-21) [c:audittrail s:shard2 r:core_node4 x:audittrail_shard2_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [audittrail_shard2_replica_n2] webapp=null path=/update/json params={split=/events}{add=[78953602-6b02-4948-8443-fd1ebc340921 (1594470800573857792)]} 0 3
2018-03-09 14:31:05.770 INFO (commitScheduler-14-thread-1) [c:audittrail s:shard2 r:core_node4 x:audittrail_shard2_replica_n2] o.a.s.u.DirectUpdateHandler2 start commit{_version_=1594470816305643520,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2018-03-09 14:31:05.770 INFO (commitScheduler-14-thread-1) [c:audittrail s:shard2 r:core_node4 x:audittrail_shard2_replica_n2] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter#13d117d6 commitCommandVersion:1594470816305643520
2018-03-09 14:31:05.918 INFO (commitScheduler-14-thread-1) [c:audittrail s:shard2 r:core_node4 x:audittrail_shard2_replica_n2] o.a.s.s.SolrIndexSearcher Opening [Searcher#4edc35b0[audittrail_shard2_replica_n2] realtime]
2018-03-09 14:31:05.921 INFO (commitScheduler-14-thread-1) [c:audittrail s:shard2 r:core_node4 x:audittrail_shard2_replica_n2] o.a.s.u.DirectUpdateHandler2 end_commit_flush
docker-compose.yml
version: '3'
services:
zookeeper:
image: zookeeper:3.4.11
ports:
- "2181:2181"
hostname: "zookeeper"
container_name: "zookeeper"
solr1:
image: solr:7.2.1
ports:
- "8983:8983"
container_name: solr1
links:
- zookeeper:ZK
command: /opt/solr/bin/solr start -f -z zookeeper:2181
solr2:
image: solr:7.2.1
ports:
- "8984:8983"
container_name: solr2
links:
- zookeeper:ZK
command: /opt/solr/bin/solr start -f -z zookeeper:2181
Here are the exact steps I go through to index some data.
This does not actually index anything, and I want to know why.
docker-compose up
Create the collection:
curl -X POST 'http://0.0.0.0:8983/solr/admin/collections?action=CREATE&name=audittrail&numShards=2'
{
"responseHeader":{
"status":0,
"QTime":6178},
"success":{
"172.24.0.3:8983_solr":{
"responseHeader":{
"status":0,
"QTime":3993},
"core":"audittrail_shard1_replica_n1"},
"172.24.0.4:8983_solr":{
"responseHeader":{
"status":0,
"QTime":4399},
"core":"audittrail_shard2_replica_n2"}},
"warning":"Using _default configset. Data driven schema functionality is enabled by default, which is NOT RECOMMENDED for production use. To turn it off: curl http://{host:port}/solr/audittrail/config -d '{\"set-user-property\": {\"update.autoCreateFields\":\"false\"}}'"}
curl to create some data (this is the same curl as in the main question, but not in debug mode):
curl 'http://0.0.0.0:8983/api/collections/audittrail/update/json?split=/events&f=action_kind_s:/action_kind_s&f=time_dt:/events/time_dt' -H 'Content-type:application/json' -d '{ "action_kind_s": "task_exec", "events": [{"event_kind_s": "start","in_transaction_b": false, "time_dt": "2018-03-09T12:57:07Z"},{"event_kind_s": "start_txn", "in_transaction_b": true,"time_dt": "2018-03-09T12:57:07Z"},{"event_kind_s": "diff", "in_transaction_b": true,"key_s": "('MerchantWorkerProcess', 5819715045818368L)","property_s": "claim_time","time_dt": "2018-03-09T12:57:07Z","value_dt": "2018-03-09T12:57:07Z"},], "final_status_s": "COMPLETE", "request_s": "xxx", "task_name_s": "xxx"}'
{
"responseHeader":{
"status":0,
"QTime":126}}
Do the query:
curl 'http://0.0.0.0:8983/solr/audittrail/select?q=*:*'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":12,
"params":{
"q":"*:*"}},
"response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
}}
It seems that it's only the echo parameter that doesn't do what you expect it to do. Remove that, and add commit=true to your URL to make Solr commit the documents to the index as soon as possible before returning. You can then find the documents (by searching for *:* in the admin interface under collection -> query) with your fields present in the index:
{
"action_kind_s":"task_exec",
"time_dt":"2018-03-09T12:57:07Z",
"id":"b56100f5-ff61-45e7-8d6b-8072bac6c952",
"_version_":1594486636806144000},
{
"action_kind_s":"task_exec",
"time_dt":"2018-03-09T12:57:07Z",
"id":"f49fc3cb-eac6-4d02-bcdf-b7c1a34782e3",
"_version_":1594486636807192576}