Elasticsearch - Extracting PDF content and encoding with base64 - pdf

I want to be able to extract content from a PDF file and to be able to search within that content using ElasticSearch.
I did install elasticsearch/elasticsearch-mapper-attachments/2.6.0
I have created a new index named "docs".
I did create a file named "tmp.json" with that content :
{"title": "file.pdf", "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="}
I did execute the following :
curl -X PUT "http://localhost:9200/docs/attachment/_mapping" -d '{
"attachment": {
"properties" : {
'file" : {
"type" : "attachment",
"fields" : {
"title" : {"store":"yes"},
"file":{
"type":"string",
"term_vector":"with_positions_offsets",
"store":"yes"}
}
}
}
}
}'
and the following :
curl -X POST "http://localhost:9200/docs/attachment" -d #tmp.json
The problem is that the content is stored as it is in the file.
I was expecting the content to be decoded, like so :
base64.b64decode("IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==")
That gives :
b'"God Save the Queen" (alternatively "God Save the King"'
To encode in base64, here what I do :
import json, base64
file64 = base64.b64encode(open('file.pdf', "rb").read()).decode('ascii')
f = open('tmp.json', 'w')
data = {"file":file64, "title":fname}
json.dump(data,f)
f.close()
I would like to be able to see the content using kibana (but for now I see only the base64 data ...)

This didn't work :
curl -X PUT "http://localhost:9200/docs/attachment/_mapping" -d '{
"attachment": {
"properties" : {
"content" : {
"type" : "attachment",
"fields" : {
"title" : {"store":"yes"},
"content":{
"type":"string",
"term_vector":"with_positions_offsets",
"store":"yes"}
}
}
}
}
}'
This worked, and I can see the content of the PDF through Kibana :
curl -X PUT "http://localhost:9200/docs" -d '{
"mappings" : {
"attachment" : {
"properties" : {
"content" : {
"type" : "attachment",
"fields" : {
"content" : { "store" : "yes" },
"author" : { "store" : "yes" },
"title" : { "store" : "yes"},
"date" : { "store" : "yes" },
"keywords" : { "store" : "yes", "analyzer" : "keyword" },
"name" : { "store" : "yes" },
"content_length" : { "store" : "yes" },
"content_type" : { "store" : "yes" }
}
}
}
}
}
}'

Related

Mapping ElasticSearch apache module field

I am new to ES and I am facing a little problem I am struggling with.
I integrated metricbeat apache module with ES and the it works fine.
The problem is that metricbeat apache module reports the KB of web traffic of apache (field apache.status.total_kbytes), instead I would like to create my own field, the name of which would be "apache.status.total_mbytes).
I am trying to create a new mapping via Dev Console using the followind api commands:
PUT /metricbeat-7.2.0/_mapping
{
"settings":{
},
"mappings" : {
"apache.status.total_mbytes" : {
"full_name" : "apache.status.total_mbytes",
"mapping" : {
"total_mbytes" : {
"type" : "long"
}
}
}
}
}
Still ES returns the following error:
{
"error" : {
"root_cause" : [
{
"type" : "mapper_parsing_exception",
"reason" : "Root mapping definition has unsupported parameters: [settings : {}] [mappings : {apache.status.total_mbytes={mapping={total_mbytes={type=long}}, full_name=apache.status.total_mbytes}}]"
}
],
"type" : "mapper_parsing_exception",
"reason" : "Root mapping definition has unsupported parameters: [settings : {}] [mappings : {apache.status.total_mbytes={mapping={total_mbytes={type=long}}, full_name=apache.status.total_mbytes}}]"
},
"status" : 400
}
FYI
The following may shed some light
GET /metricbeat-*/_mapping/field/apache.status.total_kbytes
Returns
{
"metricbeat-7.9.2-2020.10.06-000001" : {
"mappings" : {
"apache.status.total_kbytes" : {
"full_name" : "apache.status.total_kbytes",
"mapping" : {
"total_kbytes" : {
"type" : "long"
}
}
}
}
},
"metricbeat-7.2.0-2020.10.05-000001" : {
"mappings" : {
"apache.status.total_kbytes" : {
"full_name" : "apache.status.total_kbytes",
"mapping" : {
"total_kbytes" : {
"type" : "long"
}
}
}
}
}
}
What am I missing? Is the _mapping command wrong?
Thanks in advance,
A working example:
Create new index
PUT /metricbeat-7.2.0
{
"settings": {},
"mappings": {
"properties": {
"apache.status.total_kbytes": {
"type": "long"
}
}
}
}
Then GET metricbeat-7.2.0/_mapping/field/apache.status.total_kbytes will result in (same as your example):
{
"metricbeat-7.2.0" : {
"mappings" : {
"apache.status.total_kbytes" : {
"full_name" : "apache.status.total_kbytes",
"mapping" : {
"total_kbytes" : {
"type" : "long"
}
}
}
}
}
}
Now if you want to add a new field to an existing mapping use the API this way:
Update an existing index
PUT /metricbeat-7.2.0/_mapping
{
"properties": {
"total_mbytes": {
"type": "long"
}
}
}
Then GET metricbeat-7.2.0/_mapping will show you the updated mapping:
{
"metricbeat-7.2.0" : {
"mappings" : {
"properties" : {
"apache" : {
"properties" : {
"status" : {
"properties" : {
"total_kbytes" : {
"type" : "long"
}
}
}
}
},
"total_mbytes" : {
"type" : "long"
}
}
}
}
}
Also, take a look at Put Mapping Api

restart jobtracker through cloudera manager API

I am trying to restart Mapreduce Jobtracker through Cloudera Manager API. Stats for Jobtracker is as follows :
local-iMac-399:$ curl -u 'admin:admin' 'http://hadoop-namenode.dev.com:7180/api/v6/clusters/Cluster%201/services/mapreduce/roles/mapreduce-JOBTRACKER-0675ebab2b87e3869e0d90167cf4bf86'
{
"name" : "mapreduce-JOBTRACKER-0675ebab2b87e3869e0d90167cf4bf86",
"type" : "JOBTRACKER",
"serviceRef" : {
"clusterName" : "cluster",
"serviceName" : "mapreduce"
},
"hostRef" : {
"hostId" : "24259373-7e71-4089-8251-faf055e42ad7"
},
"roleUrl" : "http://hadoop-namenode.dev.com:7180/cmf/roleRedirect/mapreduce-JOBTRACKER-0675ebab2b87e3869e0d90167cf4bf86",
"roleState" : "STARTED",
"healthSummary" : "GOOD",
"healthChecks" : [ {
"name" : "JOB_TRACKER_FILE_DESCRIPTOR",
"summary" : "GOOD"
}, {
"name" : "JOB_TRACKER_GC_DURATION",
"summary" : "GOOD"
}, {
"name" : "JOB_TRACKER_HOST_HEALTH",
"summary" : "GOOD"
}, {
"name" : "JOB_TRACKER_LOG_DIRECTORY_FREE_SPACE",
"summary" : "GOOD"
}, {
"name" : "JOB_TRACKER_SCM_HEALTH",
"summary" : "GOOD"
}, {
"name" : "JOB_TRACKER_UNEXPECTED_EXITS",
"summary" : "GOOD"
}, {
"name" : "JOB_TRACKER_WEB_METRIC_COLLECTION",
"summary" : "GOOD"
} ],
"configStalenessStatus" : "STALE",
"haStatus" : "ACTIVE",
"maintenanceMode" : false,
"maintenanceOwners" : [ ],
"commissionState" : "COMMISSIONED",
"roleConfigGroupRef" : {
"roleConfigGroupName" : "mapreduce-JOBTRACKER-BASE"
}
}
local-iMac-399:$
Dont know How do I use API to restart just Jobtracker ?
I tried to restart Hive service using following command but got some error
local-iMac-399:$curl -X POST -u 'admin:admin' 'http://hadoop-namenode.dev.com:7180/api/v6/clusters/Cluster%201/services/hive/roleCommands/restart'
{
"message" : "No content to map due to end-of-input\n at [Source: org.apache.cxf.transport.http.AbstractHTTPDestination$1#4169c499; line: 1, column: 1]"
}
I would appreciate if someone help in understanding how to use Cloudera Manager API
Based on the information provided, this is how you'd invoke the CM API JobTracker restart
curl -u 'admin:admin' -X POST -H "Content-Type:application/json" -d '{"items":["mapreduce-JOBTRACKER-0675ebab2b87e3869e0d90167cf4bf86"]}' 'http://hadoop-namenode.dev.com:7180/api/v6/clusters/Cluster%201/services/mapreduce/roleCommands/restart'

Elastic Search Index not analyzed

I'm new to elastic search, and I'm having a hard time with the analyzers.
I am creating an index like this (to replicate my problem, you can copy and paste the follwoing code in your console directly.)
Please read comments in the script for my problem and questions.
#!/bin/bash
# fails if the index doesn't exist but that's OK
curl -XDELETE 'http://localhost:9200/movies/'
# creating the index that will allow type wrapper, and generate _id automatically from the path
curl -XPOST http://localhost:9200/movies -d '{
"settings" : {
"number_of_shards" : 1,
"mapping.allow_type_wrapper" : true,
"analysis": {
"analyzer": {
"en_std": {
"type":"standard",
"stopwords": "_english_"
}
}
}
},
"mappings" : {
"movie" : {
"_id" : {
"path" : "movie.id"
}
}
}
}'
# inserting some data
curl -XPOST http://localhost:9200/movies/movie -d '{
"movie" : {
"id" : 101,
"title" : "Bat Man",
"starring" : {
"firstname" : "Christian",
"lastname" : "Bale"
}
}
}'
#trying to get by ID ... \m/ works!!!
curl -XGET http://localhost:9200/movies/movie/101
# tryign to search using query_string ... \m/ works
curl -XPOST http://localhost:9200/movies/movie/_search -d '{
"query" : {
"query_string" : {
"query" : "bat"
}
}
}'
# when i try to search in a paricular field it fails. returns 0 hits
curl -XPOST http://localhost:9200/movies/_search -d '{
"query" : {
"query_string" : {
"query" : "bat",
"fields" : ["movie.title"]
}
}
}'
#I thought the analyzer was the problem, so i checked.
curl 'http://localhost:9200/movies/movie/_search?pretty=true' -d '{
"query" : {
"query_string" : {
"query" : "bat"
}
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "movie.title"
}
}
}
}'
# The field wasn't analyzed.
# the follwoing is the result
#{
# "took" : 1,
# "timed_out" : false,
# "_shards" : {
# "total" : 1,
# "successful" : 1,
# "failed" : 0
# },
# "hits" : {
# "total" : 1,
# "max_score" : 0.13424811,
# "hits" : [ {
# "_index" : "movies",
# "_type" : "movie",
# "_id" : "101",
# "_score" : 0.13424811,
# "fields" : {
# "terms" : [ "Bat Man" ]
# }
# } ]
# }
#}
# So i even tried the term as such... Nope didn't work :( 0 hits.
curl -XPOST http://localhost:9200/movies/_search -d '{
"query" : {
"query_string" : {
"query" : "Bat Man",
"fields" : ["movie.title"]
}
}
}'
Can anyone point out what i'm doing wrong?
You should insert a sleep 1 command right after inserting the doc and everything will work.
Elasticsearch provides search in near real-time (read this). Everytime you index a document, the Lucene index is not updated (refreshed in terms of Elasticsearch) immediately. How frequently your index is refreshed is configurable on Index level. You can also forcefully refresh the index by passing the query parameter refresh=true with every HTTP request, which will make ES update the index. But you may start suffering on the performance because of that depending upon your requirement.
There is a Refresh API as well.

Meteor: meteor-collectionapi PUT overwrites the entire object instead of just updating the field I want

I'm trying to use the meteor-collectionapi package to update my database. I've set up a basic collection to test out the functionality.
I'm starting with this data:
{ "name" : "Darrell David", "age" : "18", "gender" : "Male", "_id" : "8BW9Yg2oKByBGdnSa" }
{ "name" : "Julie Smith", "age" : "21", "gender" : "Female", "_id" : "fAaFwCEXLzrmejnJK" }
{ "name" : "Todd Davis", "age" : "32", "gender" : "Male", "_id" : "ixKjhkTmjrNte2DjP" }
Now, I want to update the gender of the first player to "Female" so I call this using CURL:
curl -H "X-Auth-Token: 97f0ad9e24ca5e0408a269748d7fe0a0" -X PUT -d "{\"$set\":{\"gender\":\"Female\"}}" http://localhost:3000/collectionapi/players/8BW9Yg2oKByBGdnSa
And what I wind up with is this:
{ "_id" : "8BW9Yg2oKByBGdnSa", "" : { "gender" : "Female" } }
{ "name" : "Julie Smith", "age" : "21", "gender" : "Female", "_id" : "fAaFwCEXLzrmejnJK" }
{ "name" : "Todd Davis", "age" : "32", "gender" : "Male", "_id" : "ixKjhkTmjrNte2DjP" }
The first player has been completely overwritten and the name and age fields have been lost.
What am I missing here? When I execute this command in the MongoDB console it works perfectly:
db.players.update(
{ _id: "8BW9Yg2oKByBGdnSa" },
{ $set: { gender: "Female" } }
)
I'm guessing that bash is replacing "$set" with an empty environment variable
eg. echo "$set" vs echo "\$set"
so update your PUT command to:
curl -H "X-Auth-Token: 97f0ad9e24ca5e0408a269748d7fe0a0" -X PUT -d "{\"\$set\":{\"gender\":\"Female\"}}" http://localhost:3000/collectionapi/players/8BW9Yg2oKByBGdnSa
By default Collection.update() will replace a document if no modifiers are present ($set, $unset, $push, $pull etc). So the command being sent to the server is to replace the document with {"":{"gender":"Female"}}

how to reduce amount of data in neo4j query?

when requesting data over Neo4j, as in, say,
curl -i -XPOST -d'{ "query" : "start n=node(*) return n" }'
-H "accept:application/json;stream=true"
-H content-type:application/json
http://localhost:7474/db/data/cypher
i get, as documented, a response like this:
{
"columns" : [ "n" ],
"data" : [ [ {
"outgoing_relationships" : "http://localhost:7474/db/data/node/0/relationships/out",
"data" : {
},
"traverse" : "http://localhost:7474/db/data/node/0/traverse/{returnType}",
"all_typed_relationships" : "http://localhost:7474/db/data/node/0/relationships/all/{-list|&|types}",
"property" : "http://localhost:7474/db/data/node/0/properties/{key}",
"self" : "http://localhost:7474/db/data/node/0",
"properties" : "http://localhost:7474/db/data/node/0/properties",
"outgoing_typed_relationships" : "http://localhost:7474/db/data/node/0/relationships/out/{-list|&|types}",
"incoming_relationships" : "http://localhost:7474/db/data/node/0/relationships/in",
"extensions" : {
},
"create_relationship" : "http://localhost:7474/db/data/node/0/relationships",
"paged_traverse" : "http://localhost:7474/db/data/node/0/paged/traverse/{returnType}{?pageSize,leaseTime}",
"all_relationships" : "http://localhost:7474/db/data/node/0/relationships/all",
"incoming_typed_relationships" : "http://localhost:7474/db/data/node/0/relationships/in/{-list|&|types}"
} ], [ {
"outgoing_relationships" : "http://localhost:7474/db/data/node/1/relationships/out",
"data" : {
"glyph" : "δΈ€",
"~isa" : "glyph"
},
"traverse" : "http://localhost:7474/db/data/node/1/traverse/{returnType}",
"all_typed_relationships" : "http://localhost:7474/db/data/node/1/relationships/all/{-list|&|types}",
"property" : "http://localhost:7474/db/data/node/1/properties/{key}",
"self" : "http://localhost:7474/db/data/node/1",
"properties" : "http://localhost:7474/db/data/node/1/properties",
"outgoing_typed_relationships" : "http://localhost:7474/db/data/node/1/relationships/out/{-list|&|types}",
"incoming_relationships" : "http://localhost:7474/db/data/node/1/relationships/in",
"extensions" : {
},
"create_relationship" : "http://localhost:7474/db/data/node/1/relationships",
"paged_traverse" : "http://localhost:7474/db/data/node/1/paged/traverse/{returnType}{?pageSize,leaseTime}",
"all_relationships" : "http://localhost:7474/db/data/node/1/relationships/all",
"incoming_typed_relationships" : "http://localhost:7474/db/data/node/1/relationships/in/{-list|&|types}"
} ], [ {
"outgoing_relationships" : "http://localhost:7474/db/data/node/2/relationships/out",
"data" : {
"~isa" : "LPG",
"LPG" : "1"
},
"traverse" : "http://localhost:7474/db/data/node/2/traverse/{returnType}",
"all_typed_relationships" : "http://localhost:7474/db/data/node/2/relationships/all/{-list|&|types}",
"property" : "http://localhost:7474/db/data/node/2/properties/{key}",
"self" : "http://localhost:7474/db/data/node/2",
"properties" : "http://localhost:7474/db/data/node/2/properties",
"outgoing_typed_relationships" : "http://localhost:7474/db/data/node/2/relationships/out/{-list|&|types}",
"incoming_relationships" : "http://localhost:7474/db/data/node/2/relationships/in",
"extensions" : {
},
"create_relationship" : "http://localhost:7474/db/data/node/2/relationships",
"paged_traverse" : "http://localhost:7474/db/data/node/2/paged/traverse/{returnType}{?pageSize,leaseTime}",
"all_relationships" : "http://localhost:7474/db/data/node/2/relationships/all",
"incoming_typed_relationships" : "http://localhost:7474/db/data/node/2/relationships/in/{-list|&|types}"
} ], [ {
and so on and on. the URLs delivered with each node are certainly well meant, but they also occupy a major portion of the data transmitted. they're also highly redundant and not what i', after with my query. is there any way to drop all of that traverse,
all_typed_relationships,
property,
self,
properties,
outgoing_typed_relationships,
incoming_relationships,
extensions,
create_relationship,
paged_traverse,
all_relationships,
incoming_typed_relationships
jazz?
The only way is to specify the properties you want returned in the return statement. Like:
return id(n), n.glyph;