Indexer IOException: job fails while indexing Nutch crawled data in "Bluemix" Solr

I'm trying to index the Nutch crawled data with Bluemix Solr. I used the following command in my command prompt:
bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*
But it fails to finish the indexing. The result is as follows:
Indexer: starting at 2016-06-16 16:31:50
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 153 documents
Indexing 153 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I guess it has something to do with the solr.server.url address, maybe the end of it. I changed it in different ways,
e.g.
"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update"
(since that is the URL used for indexing JSON/CSV/... files with Bluemix Solr),
but no luck so far.
Does anyone know how I can fix it? And if the problem is what I guessed, does anyone know what exactly solr.server.url should be?
By the way, "example_collection" is my collection's name, and I'm working with Nutch 1.11.

As far as I know, indexing Nutch crawled data in Bluemix R&R with the index command provided by Nutch itself (bin/nutch index ...) is not possible.
I realized that to index Nutch crawled data in the Bluemix Retrieve and Rank service, one should:
Crawl seeds with Nutch, e.g.
$:bin/crawl -w 5 urls crawl 25
You can check the status of the crawl with:
bin/nutch readdb crawl/crawldb/ -stats
Dump the crawled data as files:
$:bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
Post the files that can be posted as-is (e.g. XML files) to the Solr collection on Retrieve and Rank:
Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' %(solr_cluster_id, solr_collection_name)
cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_xml, solr_credentials, Post_url, myfilename)
subprocess.call(cmd,shell=True)
Convert the rest to json with Bluemix Doc-Conv service:
doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
cmd ='''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' %(doc_conv_credentials, myfilename, doc_conv_url)
process = subprocess.Popen(cmd, shell= True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
and then save these JSON results in a JSON file.
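A minimal sketch of that saving step, continuing from the Popen call above (the output file name is a placeholder of mine; Path_jsonFile is the variable reused in the posting step below):
import json

# Capture the Doc-Conv response from the Popen call above and persist it as JSON.
stdout, stderr = process.communicate()
answer_units = json.loads(stdout)            # raises if the conversion did not return JSON
Path_jsonFile = myfilename + '.json'         # placeholder output path
with open(Path_jsonFile, 'w') as f:
    json.dump(answer_units, f)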
Post this json file to the collection:
Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
subprocess.call(cmd,shell=True)
Send Queries:
pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
results = pysolr_client.search(Query_term)
print(results.docs)
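For reference, the retrieve_and_rank object used above is assumed to come from the (now retired) Watson Developer Cloud Python SDK; a sketch of the setup it relies on (credentials are placeholders):
from watson_developer_cloud import RetrieveAndRankV1

# Assumed client setup for the query snippet above.
retrieve_and_rank = RetrieveAndRankV1(username='USERNAME', password='PASSWORD')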
The code snippets are in Python.
For beginners: you can run the curl commands directly in your CMD. I hope this helps others.

Like this:
nutch index -Dsolr.server.url=http://username:password@localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20170816191100/ -filter -normalize -deleteGone
It works.

Related

import status lingers after GraphDB repository deleted

GraphDB Free/9.4.1, RDF4J/3.3.1
I'm working on using the /rest/data/import/server/{repo-id} endpoint to initiate the importing of an RDF/XML file.
Steps:
put SysML.owl in the ${graphdb.workbench.importDirectory} directory.
chmod a+r SysML.owl
create repository test1 (in Workbench - using all defaults except RepositoryID := "test1")
curl http://127.0.0.1:7200/rest/data/import/server/test1 => as expected:
[{"name":"SysML.owl","status":"NONE"..."timestamp":1606848520821,...]
curl -XPOST --header 'Content-Type: application/json' --header 'Accept: application/json' -d ' { "fileNames":[ "SysML.owl" ] }' http://127.0.0.1:7200/rest/data/import/server/test1 => SC==202
after 60 seconds, curl http://127.0.0.1:7200/rest/data/import/server/test1 =>
[{"name":"SysML.owl","status":"DONE","message":"Imported successfully in 7s.","context":null,"replaceGraphs":[],"baseURI":
"file:/home/steve/graphdb-import/SysML.owl", "forceSerial":false,"type":"file","format":null,"data":null,"timestamp":
1606848637716, [...other json content deleted]
Repository test1 now has the 263,119 (824 inferred) statements from SysML.owl loaded
BUT if I then
delete the repository using the Workbench page at http://localhost:7200/repository, wait 180 seconds
curl http://127.0.0.1:7200/rest/data/import/server/test => same as in step 5 above, despite repository having been deleted.
curl -X GET --header 'Accept: application/json' 'http://localhost:7200/rest/repositories' => test1 not shown.
create the repository again, using the Workbench - same settings as previously. wait 60 seconds. Initial 70 statements present.
curl http://127.0.0.1:7200/rest/data/import/server/test1 =>
The same output as from the earlier usage, when I was using the prior repository instance: "status":"DONE", with the same timestamp, which is prior to the time at which I deleted and recreated the test1 repository.
The main-2020-12-01.log shows the INFO messages pertaining to the repository test1, plugin registrations, etc. Nothing indicating why the prior repository instance's import status is lingering.
And this is of concern because I was expecting to use some polling of the status to determine when the data is loaded so my processing can proceed. Some good news - I can issue the import server file request again and after waiting 60 seconds, the 263,119 statements are present. But the timestamp on the import is the earlier repo instance's timestamp. It was not reset via the latest import request.
I'm probably missing some cleanup step(s), am hoping someone knows which.
Thanks,
-Steve
The status is simply for your reference and doesn't represent the actual presence of data in the repository. You could achieve a similar thing simply by clearing all data in the repository without recreating it.
If you really need to rely on those status records, you can clear the status for a given file once you've polled it and determined it's done (or prior to starting an import) with this curl:
curl -X DELETE http://127.0.0.1:7200/rest/data/import/server/test1/status \
-H 'content-type: application/json' -d '["SysML.owl"]'
Note that this is an undocumented API and it may change without notice.
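For example, a minimal polling-and-cleanup sketch in Python (assuming the requests library, a local Workbench at 127.0.0.1:7200, and the test1 repository from above):
import time
import requests

BASE = "http://127.0.0.1:7200/rest/data/import/server/test1"  # local Workbench, repository test1

def wait_for_import(file_name, timeout=120, interval=5):
    # Poll the server-files import status until the given file reports DONE.
    deadline = time.time() + timeout
    while time.time() < deadline:
        statuses = requests.get(BASE, headers={"Accept": "application/json"}).json()
        entry = next((s for s in statuses if s["name"] == file_name), None)
        if entry and entry["status"] == "DONE":
            return entry
        time.sleep(interval)
    raise TimeoutError("%s did not finish importing within %ss" % (file_name, timeout))

def clear_status(file_name):
    # Reset the lingering status record (undocumented API, may change without notice).
    requests.delete(BASE + "/status",
                    headers={"Content-Type": "application/json"},
                    json=[file_name]).raise_for_status()

wait_for_import("SysML.owl")
clear_status("SysML.owl")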

Using gcloud, how can I find the Virtual Machine Name?

I am trying to find the Google Cloud Compute Virtual Machine Name, using the gcloud command while logged in to the VM.
Searching the documentation didn't yield a result...
Thanks!
See Metadata service.
Specifically:
curl \
--header "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/name
The Metadata service is a well-implemented API: you can navigate up and down the tree of resources. For example, dropping the final name from the above URL enumerates all the resources under instance:
curl \
--header "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/
returns:
attributes/
cpu-platform
description
disks/
guest-attributes/
hostname
id
image
legacy-endpoint-access/
licenses/
machine-type
maintenance-event
name
network-interfaces/
preempted
remaining-cpu-time
scheduling/
service-accounts/
tags
virtual-clock/
zone
You can then pick any of the above and append it to the URL to continue browsing.
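If you want the same lookup from code rather than curl, a minimal sketch in Python (assuming the requests library; it only works from inside a Compute Engine VM):
import requests

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"

def metadata(path):
    # Fetch one metadata value, e.g. 'instance/name' or 'instance/zone'.
    resp = requests.get(METADATA_ROOT + "/" + path,
                        headers={"Metadata-Flavor": "Google"})
    resp.raise_for_status()
    return resp.text

print(metadata("instance/name"))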
The documentation in the link is similarly comprehensive and, in the case of instance metadata, as expected, reflects the response from GET'ing
https://compute.googleapis.com/compute/v1/.../instances/...
i.e.:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/get#response-body

How to check the content of a NooBaa bucket

I am able to check the status of a NooBaa bucket using the noobaa bucket status <bucket> command.
$ noobaa bucket status XYZ
INFO[0005] ✅ Exists: NooBaa "noobaa"
INFO[0005] ✅ Exists: Service "noobaa-mgmt"
INFO[0006] ✅ Exists: Secret "noobaa-operator"
INFO[0006] ✅ Exists: Secret "noobaa-admin"
INFO[0008] ✈️ RPC: bucket.read_bucket() Request: {Name:XYZ}
INFO[0010] ✅ RPC: bucket.read_bucket() Response OK: took 14.3ms
Bucket status:
Bucket : XYZ
OBC Namespace : xyz-namespace
OBC BucketClass : default-bucket-class
Type : REGULAR
Mode : OPTIMAL
ResiliencyStatus : OPTIMAL
QuotaStatus : QUOTA_NOT_SET
Num Objects : 1
Data Size : 3.000 B
Data Size Reduced : 5.000 B
Data Space Avail : 1.000 PB
But I am not able to check the content present inside the NooBaa bucket.
How can we check the content of a NooBaa bucket, using the NooBaa CLI or any other way?
Your question made me realize that the noobaa CLI should have a noobaa object list command, so I opened a new issue for this enhancement on the operator GitHub repo. Thanks :)
Until this is added, there are several ways we use to list objects:
run noobaa ui - notice that it opens the browser quickly, but on the terminal it prints the credentials for you to use for login. You can probably find the buckets and then drill down to the objects in the UI on your own, and you can also check out some recorded videos that navigate the UI - for example this video.
Take the admin S3 credentials and endpoint from noobaa status and then use your favorite s3 client - I currently use aws-cli or rclone:
alias s3='AWS_ACCESS_KEY_ID=$NOOBAA_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$NOOBAA_SECRET_KEY aws --endpoint $NOOBAA_S3_ENDPOINT --no-verify-ssl s3'
and then:
s3 ls XYZ
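If you prefer a script to a shell alias, here is a rough boto3 sketch of the same listing (boto3 is just another S3 client; the endpoint and keys are the same NOOBAA_* values used above):
import os
import boto3

# Same credentials/endpoint as the aws-cli alias above, read from environment variables.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["NOOBAA_S3_ENDPOINT"],
    aws_access_key_id=os.environ["NOOBAA_ACCESS_KEY"],
    aws_secret_access_key=os.environ["NOOBAA_SECRET_KEY"],
    verify=False,  # equivalent to --no-verify-ssl for self-signed endpoints
)

for obj in s3.list_objects_v2(Bucket="XYZ").get("Contents", []):
    print(obj["Key"], obj["Size"])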
Not many noticed, but the NooBaa system CR contains a useful Readme text in its status, with commands to "Test S3 client" - ready to copy-paste to set up your aws-cli, including kubectl port-forward to support secure networks and reading the credentials from secrets. Check it out with kubectl describe noobaa. This 40-second YouTube video shows this briefly. BTW, the Readme text is generated for the system, but it does not contain actual secrets, only kubectl commands to read those secrets if permitted.
$ kubectl describe noobaa
...
Phase: Ready
Readme:
Welcome to NooBaa!
-----------------
NooBaa Core Version: 5.3.0-9f579d9
NooBaa Operator Version: 2.1.0
Lets get started:
1. Connect to Management console:
Read your mgmt console login information (email & password) from secret: "noobaa-admin".
kubectl get secret noobaa-admin -n backup-service -o json | jq '.data|map_values(@base64d)'
Open the management console service - take External IP/DNS or Node Port or use port forwarding:
kubectl port-forward -n backup-service service/noobaa-mgmt 11443:443 &
open https://localhost:11443
2. Test S3 client:
kubectl port-forward -n backup-service service/s3 10443:443 &
NOOBAA_ACCESS_KEY=$(kubectl get secret noobaa-admin -n backup-service -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
NOOBAA_SECRET_KEY=$(kubectl get secret noobaa-admin -n backup-service -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')
alias s3='AWS_ACCESS_KEY_ID=$NOOBAA_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$NOOBAA_SECRET_KEY aws --endpoint https://localhost:10443 --no-verify-ssl s3'
s3 ls
...
The last option, which should have been mentioned first but which I just saw is unfortunately broken in the current version v2.1.0 (opened a new issue), is to use the generic noobaa api command to call the object_api list_objects method, like so:
noobaa api object list_objects '{ "bucket": "first.bucket" }'
I hope that helps, feel free to open github issues with suggestions/issues.
Thanks!
(NooBaa CTO)

Is there a way to add additional configurable settings in OpsCenter 6.0.2 Lifecycle Manager config profiles?

I would really like to add the following settings to our spark-defaults.conf using OpsCenter 6.0.2 in order to avoid configuration drift. Is there a way to add these config items to the config profile template?
spark.cores.max 4
spark.driver.memory 2g
spark.executor.memory 4g
spark.python.worker.memory 2g
NOTE: As Mike Lococo has pointed out in the comments for this answer -- this answer may work to update the config profile values but will not result in those values being written to spark-defaults.conf.
The following is not a solution!
You can; you have to update the config profile via the LCM Config Profile API (https://docs.datastax.com/en/opscenter/6.0/api/docs/lcm_config_profile.html#lcm-config-profile).
First, identify the config profile that needs updating:
$ curl http://localhost:8888/api/v1/lcm/config_profiles
Get the href for the specific config profile that needs updating, request it, and save the response body to a file:
$ curl http://localhost:8888/api/v1/lcm/config_profiles/026fe8e3-0bb8-49c1-9888-8187b1624375 > profile.json
Now, in the profile.json file you just saved, add or edit the key at json > spark-defaults-conf to include the following keys:
"spark-defaults-conf": {
"spark-cores-max": 4,
"spark-python-worker-memory": "2g",
"spark-ssl-enabled": false,
"spark-drivers-memory": "2g",
"spark-executor-memory": "4g"
}
Save the updated profile.json. Finally, execute an HTTP PUT to the same config profile URL, using the edited file as the request data:
$ curl -X PUT http://localhost:8888/api/v1/lcm/config_profiles/026fe8e3-0bb8-49c1-9888-8187b1624375 -d @profile.json
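If you'd rather script the whole round trip than hand-edit profile.json, here is a rough sketch with Python's requests (the profile ID and key names are the examples shown above):
import requests

BASE = "http://localhost:8888/api/v1/lcm/config_profiles"
url = BASE + "/026fe8e3-0bb8-49c1-9888-8187b1624375"  # example profile ID from above

profile = requests.get(url).json()

# Merge the desired keys under json > spark-defaults-conf, as in the JSON shown earlier.
profile.setdefault("json", {}).setdefault("spark-defaults-conf", {}).update({
    "spark-cores-max": 4,
    "spark-python-worker-memory": "2g",
    "spark-drivers-memory": "2g",
    "spark-executor-memory": "4g",
})

requests.put(url, json=profile).raise_for_status()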

Couchdb bad_utf8_character_code

I am using CouchDB through CouchRest in a Ruby on Rails application. When I try to use Futon it alerts with a message box saying bad_utf8_character_code. If I try to access records from the Rails console using Model.all, it raises either a 500 internal server error,
RestClient::ServerBrokeConnection: Server broke connection: or Errno::ECONNREFUSED: Connection refused - connect(2).
Could anyone help me sort out this issue?
I ran into this issue. I tried various curl calls to delete, modify, and even just view the offending document. None of them worked. Finally, I decided to pull the documents down to my local machine one at a time, skip the "bad" one, and then replicate from my local out to production.
Disable app (so no more records are being written to the db)
Delete and recreate local database (run these commands in a shell):
curl -X DELETE http://127.0.0.1:5984/mydb
curl -X PUT http://127.0.0.1:5984/mydb
Pull down the documents from live to local using this Ruby script:
require 'bundler'
require 'json'
all_docs = JSON.parse(`curl http://server.com:5984/mydb/_all_docs`)
docs = all_docs['rows']
ids = docs.map{|doc| doc['id']}
bad_ids = ['196ee4a2649b966b13c97672e8863c49']
good_ids = ids - bad_ids
good_ids.each do |curr_id|
curr_doc = JSON.parse(`curl http://server.com:5984/mydb/#{curr_id}`)
curr_doc.delete('_id')
curr_doc.delete('_rev')
data = curr_doc.to_json.gsub("\\'", "\'").gsub('"','\"')
cmd = %Q~curl -X PUT http://127.0.0.1:5984/mydb/#{curr_id} -d "#{data}"~
puts cmd
`#{cmd}`
end
Destroy (delete) and recreate production database (I did this in futon)
Replicate
curl -X POST http://127.0.0.1:5984/_replicate -d '{"target":"http://server.com:5984/mydb", "source":"http://127.0.0.1:5984/mydb"}' -H "Content-Type: application/json"
Restart app