Elasticsearch Parse Exception error when attempting to index PDF

I'm just getting started with Elasticsearch. We need to index thousands of PDF files, and I'm having a hard time getting even ONE of them to index successfully.
I installed the Attachment Type plugin and got the response: Installed mapper-attachments.
I followed the Attachment Type in Action tutorial, but the process hangs and I don't know how to interpret the error message. I also tried the gist, which hangs in the same place.
$ curl -X POST "localhost:9200/test/attachment/" -d json.file
{"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]]","status":400}
More details:
The json.file contains an embedded Base64 PDF file (as per instructions). The first line of the file appears correct (to me anyway): {"file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8...
I'm not sure whether json.file is invalid or whether Elasticsearch just isn't set up to parse PDFs properly.
Encoding - Here's how we're encoding the PDF into json.file (as per tutorial):
coded=`cat fn6742.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file
I also tried:
coded=`openssl base64 -in fn6742.pdf`
log:
[2012-06-07 12:32:16,742][DEBUG][action.index ] [Bailey, Paul] [test][0], node[AHLHFKBWSsuPnTIRVhNcuw], [P], s[STARTED]: Failed to execute [index {[test][attachment][DauMB-vtTIaYGyKD4P8Y_w], source[json.file]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:147)
at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:50)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:437)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:290)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:210)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
I'm hoping someone can help me see what I'm missing or doing wrong.

The following error points to the source of the problem:
Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
Those byte values are the UTF-8 codes for the string "json.file", which means you are indexing the literal file name instead of the contents of the file.
To index the contents of the file, prefix the file name with "@" so curl reads the file:
curl -X POST "localhost:9200/test/attachment/" -d @json.file
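Putting the pieces together, a minimal end-to-end sketch of the corrected workflow might look like this (assuming the mapper-attachments plugin is installed and fn6742.pdf is the file from the question; the slurp-mode Perl invocation is an assumption that keeps the Base64 on one line so the JSON stays valid):
# encode the whole PDF in one pass, with no embedded newlines in the Base64
coded=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' fn6742.pdf)
# wrap it in the JSON document the attachment mapper expects
echo "{\"file\":\"${coded}\"}" > json.file
# note the @ so curl sends the file contents, not the file name
curl -X POST "localhost:9200/test/attachment/" -d @json.file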

It turns out it's necessary to export ES_JAVA_OPTS=-Djava.awt.headless=true before running a Java app on a 'headless' server... who would have thought!?

Related

What is the recommended architecture for using Amazon Neptune in a scalable way?

I am building an application backed by a Neptune database. Because I want the application to be scalable, I am using AWS Lambda + API gateway to build a REST API to interact with the database. This seems to be a reasonable idea based on the fact that this use case is documented in the Neptune docs.
The Neptune docs recommend reusing the websocket connection to the database across the entire execution context of the function, which is what I am doing at the moment. The docs also recommend resetting the connection and retrying upon errors (see here), which I am also using. However, I am seeing exceptions every now and then (perhaps every 20 requests on average). One of the exceptions I get is
ConnectionResetError: Cannot write to closing transport
which seems to be the same as this issue.
The other one is:
Traceback (most recent call last):
File "/var/task/chalice/app.py", line 1685, in _get_view_function_response
response = view_function(**function_args)
File "/var/task/app.py", line 57, in resource
return Resource(app.current_request, g).process()
File "/var/task/backoff/_sync.py", line 94, in retry
ret = target(*args, **kwargs)
File "/var/task/chalicelib/handlers/resource.py", line 106, in get
values = resources.valueMap().with_(WithOptions.tokens).toList()
File "/var/task/gremlin_python/process/traversal.py", line 57, in toList
return list(iter(self))
File "/var/task/gremlin_python/process/traversal.py", line 47, in __next__
self.traversal_strategies.apply_strategies(self)
File "/var/task/gremlin_python/process/traversal.py", line 548, in apply_strategies
traversal_strategy.apply(traversal)
File "/var/task/gremlin_python/driver/remote_connection.py", line 63, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/var/task/gremlin_python/driver/driver_remote_connection.py", line 60, in submit
results = result_set.all().result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/task/gremlin_python/driver/resultset.py", line 90, in cb
f.result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/lang/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/var/task/gremlin_python/driver/connection.py", line 82, in _receive
data = self._transport.read()
File "/var/task/gremlin_python/driver/aiohttp/transport.py", line 104, in read
raise RuntimeError("Connection was already closed.")
RuntimeError: Connection was already closed.
In case it is relevant, I am using gremlinpython==3.5.1.
It seems to me that these issues are all ultimately a consequence of using AWS Lambda, namely due to the mismatch between the longevity of websocket connections and the ephemeral nature of lambda execution contexts. The question then is: Am I doing the wrong thing by trying to use AWS lambda for my API? Would it be more appropriate to setup an EC2 instance and deal with the scalability in some other way?
P.S. Previously I created and closed a connection in every function execution (as the Neptune docs used to recommend), which worked fine but was naturally slow.
The latest version of Neptune only supports Gremlin 3.4.11 (https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.5.1.html). I would start by using gremlin-python 3.4.11 and see if that resolves your issue. Gremlin-python 3.5 replaced Tornado with aiohttp (ref) for websocket connections, and I suspect that change may be causing a slight change in behavior that a future release supporting Gremlin 3.5 will address.
I wonder whether the 'Connection was already closed' error message is not being treated as a retriable error by the retry logic?
What happens if you add this error message to the list of retriable_error_msgs in the Python example in the docs?
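As a rough illustration of that suggestion (a sketch only, not the docs' exact example; reset_connection() is a hypothetical helper standing in for whatever closes and reopens your DriverRemoteConnection):
import backoff

retriable_error_msgs = [
    "ConnectionResetError",
    "Connection was already closed.",   # the message from the traceback above
]

def is_retriable(exc):
    # retry only when the exception text matches one of the known messages
    return any(msg in str(exc) for msg in retriable_error_msgs)

def reset_connection():
    # hypothetical: drop the stale websocket connection and build a new one
    pass

def on_retry(details):
    reset_connection()

@backoff.on_exception(backoff.expo,
                      (RuntimeError, ConnectionResetError),
                      max_tries=5,
                      giveup=lambda e: not is_retriable(e),
                      on_backoff=on_retry)
def run_query(g):
    # g is the gremlin-python traversal source reused across invocations
    return g.V().limit(1).toList()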

redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first

I've recently started to use Redis and RQ to run background processes. I built a Dash app which works fine on Heroku and used to work locally as well. Recently, I tried to test the same app locally again and I keep getting the following error - although I'm using exactly the same code hosted on Heroku:
redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.
In my requirements.txt and virtualenv on Ubuntu 18.04 I have redis 3.0.1 and rq 0.13.0.
When I run redis-server on my terminal I see that Redis 4.0.9 is used (that's also confusing to me).
I tried to google for two days looking for a solution with no avail.
Has anyone an idea of what might have happened and how to solve this error?
Here is the full relevant traceback:
File "/home/tom/dashenv/pb101_models/pages/cumulative_culture.py", line 1026, in stop_or_start_update
job = q.fetch_job(job_id)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/rq/queue.py", line 142, in fetch_job
self.remove(job_id)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/rq/queue.py", line 186, in remove
return self.connection.lrem(self.key, 1, job_id)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/client.py", line 1580, in lrem
return self.execute_command('LREM', name, count, value)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/client.py", line 754, in execute_command
connection.send_command(*args)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/connection.py", line 619, in send_command
self.send_packed_command(self.pack_command(*args))
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/connection.py", line 659, in pack_command
for arg in imap(self.encoder.encode, args):
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/connection.py", line 124, in encode
"byte, string or number first." % typename)
redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.
Thanks in advance for any suggestion/hint.
All best,
Tom
Check this link: redis 3.0
It says that redis-py 3.0 no longer accepts None values.
Try json.dumps(None); that worked for me.
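A minimal sketch of the workaround, under the assumption that the None is a missing job_id (which is what the traceback above suggests): guard before calling into RQ, and serialize optional values explicitly so redis-py 3.x never receives a raw None.
import json
from redis import Redis
from rq import Queue

redis_conn = Redis()
q = Queue(connection=redis_conn)

job_id = None   # e.g. the callback fired before any job was enqueued

# fetch_job(None) ends up passing None to LREM, which redis-py 3.x rejects
job = q.fetch_job(job_id) if job_id is not None else None

# if you genuinely need to store "no value" in Redis, encode it first
redis_conn.set("last_job_id", json.dumps(job_id))   # stores the string "null"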

Issues with bitbake for building Angstrom

I'm trying to build an Angstrom image from scratch using bitbake (since Angstrom is now Yocto compatible), but I've run into an error the moment I run bitbake systemd-image:
Traceback (most recent call last):
File "/usr/bin/bitbake", line 234, in <module>
ret = main()
File "/usr/bin/bitbake", line 197, in main
server = ProcessServer(server_channel, event_queue, configuration)
File "/usr/lib/pymodules/python2.7/bb/server/process.py", line 78, in __init__
self.cooker = BBCooker(configuration, self.register_idle_function)
File "/usr/lib/pymodules/python2.7/bb/cooker.py", line 76, in __init__
self.parseConfigurationFiles(self.configuration.file)
File "/usr/lib/pymodules/python2.7/bb/cooker.py", line 510, in parseConfigurationFiles
data = _parse(os.path.join("conf", "bitbake.conf"), data)
TypeError: getVar() takes exactly 3 arguments (2 given)
ERROR: Error evaluating '${TARGET_OS}:${TRANSLATED_TARGET_ARCH}:build-${BUILD_OS}:pn-${PN}:${MACHINEOVERRIDES}:${DISTROOVERRIDES}:${CLASSOVERRIDE}:forcevariable${@bb.utils.contains("TUNE_FEATURES", "thumb", ":thumb", "", d)}${@bb.utils.contains("TUNE_FEATURES", "no-thumb-interwork", ":thumb-interwork", "", d)}'
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 116, in expandWithRefs
s = __expand_var_regexp__.sub(varparse.var_sub, s)
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 60, in var_sub
var = self.d.getVar(key, 1)
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 260, in getVar
return self.expand(value, var)
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 132, in expand
return self.expandWithRefs(s, varname).value
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 117, in expandWithRefs
s = __expand_python_regexp__.sub(varparse.python_sub, s)
TypeError: getVar() takes exactly 3 arguments (2 given)
ERROR: Error evaluating '${@bb.parse.BBHandler.vars_from_file(d.getVar('FILE'),d)[0] or 'defaultpkgname'}'
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 117, in expandWithRefs
s = __expand_python_regexp__.sub(varparse.python_sub, s)
File "/usr/lib/pymodules/python2.7/bb/data_smart.py", line 76, in python_sub
value = utils.better_eval(codeobj, DataContext(self.d))
File "/usr/lib/pymodules/python2.7/bb/utils.py", line 387, in better_eval
return eval(source, _context, locals)
File "PN", line 1, in <module>
TypeError: getVar() takes exactly 3 arguments (2 given)
I've been at this for a while now, searching on different sites. Originally I tried following the guide at the developer section on the Angstrom site, but once I got some errors (prior to this one I'm putting here), I found Derek Molloy's site http://derekmolloy.ie/building-angstrom-for-beaglebone-from-source/ which solved those errors and gave a little more detail into the process.
Eventually I stumbled onto another forum post which described my problem, but unfortunately the answers weren't really clear (for me anyway): http://comments.gmane.org/gmane.linux.distributions.angstrom.devel/7431. I'm at a loss on what could be wrong, and I'm pretty much new to the Yocto project, so I'm unsure if there are any steps missing or something implicit I have overlooked. I would deeply appreciate anyone who could point me in the right direction on this.
As a side note, I've been thinking it could have something to do with the environment-angstrom-... file that I have, since mine is environment-angstrom-v2013.12 and all the other examples use previous versions; I'm wondering if there's a new step involved when working with this.
Is there a reason why you are using a system-wide bitbake instead of the one that is compatible with that release of Angstrom?
Don't use a system-wide bitbake, as the bitbake API can and does change over time. Use the corresponding bitbake for that release of angstrom.
(This is breaking because your bitbake requires getVar to take three arguments but your angstrom layers are only passing two)
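A sketch of what that looks like in practice, assuming the Angstrom setup-scripts checkout lives in ~/setup-scripts and contains the environment file mentioned in the question (the path and MACHINE value are assumptions, adjust to your setup):
cd ~/setup-scripts
. ./environment-angstrom-v2013.12      # puts the bundled bitbake on PATH
which bitbake                          # should resolve inside the checkout, not /usr/bin
MACHINE=beaglebone bitbake systemd-image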

POSKeyError during migration

Migration from Plone 3.3.2 to Plone 4.2.1 fails with POSKeyError. I've tried the recipes from this article: http://plonechix.blogspot.com/2009/12/definitive-guide-to-poskeyerror.html.
I've run the error_finder snippet, but it didn't give me any exceptions. I've also tried to fetch the object in the debugger using app.mysite._p_jar[p64(oid)], also without success; it fails with the same error.
How can I delete the broken object, or at least get more info about it (e.g. its class name or location)?
Full traceback:
POSKeyError('\x00\x00\x00\x00\x00\x0ey=',)
(Also, the following error occurred while attempting to render the standard error message, please see the event log for full details:
An operation previously failed, with traceback:
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZPublisher/Publish.py", line 443, in publish_module
environ, debug, request, response)
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZPublisher/Publish.py", line 237, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZPublisher/Publish.py", line 134, in publish
transactions_manager.commit()
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/Zope2/App/startup.py", line 301, in commit
transaction.commit()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
return self.get().commit()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 336, in commit
t, v, tb = self._saveAndGetCommitishError()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
self._commitResources()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 443, in _commitResources
rm.commit(self)
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/Connection.py", line 572, in commit
oid, serial, transaction)
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/BaseStorage.py", line 416, in checkCurrentSerialInTransaction
committed_tid = self.getTid(oid)
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/FileStorage/FileStorage.py", line 770, in getTid
with self._lock:
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/FileStorage/FileStorage.py", line 403, in _lookup_pos
raise POSKeyError(oid)
POSKeyError: 0x0e793d
You can use fsrefs.py to find the bad object.
A very short article on using it is: http://nathanvangheem.com/news/fixing-broken-zodb-object-references
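A minimal sketch of running it, assuming your buildout provides a zopepy interpreter and the ZODB3 egg path matches the traceback above (adjust both paths to your installation):
bin/zopepy buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/scripts/fsrefs.py var/filestorage/Data.fs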
I believe this is the same issue I just ran into, which happens if a savepoint that included adding an object to the catalog is rolled back. I think this is a bug in the ZODB, but you can work around it by addressing whatever is rolling back a savepoint; in this case that's the migration of files and images to blobs. So if you fix what's keeping those files or images from successfully migrating to BLOBs (or just delete them), it should succeed.

VARCHAR requires a length when rendered on MySQL

I have a buildout instance with pas.plugins.sqlalchemy. It appears in the installation list, but installing it results in an error.
Here's the ZCML definition:
<configure xmlns="http://namespaces.zope.org/zope" xmlns:db="http://namespaces.zope.org/db">
<include package="z3c.saconfig" file="meta.zcml" />
<db:engine xmlns="http://namespaces.zope.org/db" name="pas" url="mysql://webadmin:password@rcs-mysql-dev/estep" />
<db:session xmlns="http://namespaces.zope.org/db" name="pas.plugins.sqlalchemy" engine="pas" />
</configure>
Traceback is:
Traceback (innermost last):
Module ZPublisher.Publish, line 127, in publish
Module ZPublisher.mapply, line 77, in mapply
Module ZPublisher.Publish, line 47, in call_object
Module Products.CMFQuickInstallerTool.QuickInstallerTool, line 575, in installProducts
Module Products.CMFQuickInstallerTool.QuickInstallerTool, line 512, in installProduct
- __traceback_info__: ('pas.plugins.sqlalchemy',)
Module Products.GenericSetup.tool, line 330, in runAllImportStepsFromProfile
- __traceback_info__: profile-pas.plugins.sqlalchemy:install
Module Products.GenericSetup.tool, line 1085, in _runImportStepsFromContext
Module Products.GenericSetup.tool, line 999, in _doRunImportStep
- __traceback_info__: pas.plugins.sqlalchemy.install
Module pas.plugins.sqlalchemy.setuphandlers, line 46, in install
Module sqlalchemy.schema, line 2148, in create_all
Module sqlalchemy.engine.base, line 1698, in create
Module sqlalchemy.engine.base, line 1740, in _run_visitor
Module sqlalchemy.sql.visitors, line 83, in traverse_single
Module sqlalchemy.engine.ddl, line 42, in visit_metadata
Module sqlalchemy.sql.visitors, line 83, in traverse_single
Module sqlalchemy.engine.ddl, line 58, in visit_table
Module sqlalchemy.engine.base, line 1191, in execute
Module sqlalchemy.engine.base, line 1241, in _execute_ddl
Module sqlalchemy.sql.expression, line 1413, in compile
Module sqlalchemy.engine.base, line 702, in compile
Module sqlalchemy.engine.base, line 715, in process
Module sqlalchemy.sql.visitors, line 54, in _compiler_dispatch
Module sqlalchemy.sql.compiler, line 1152, in visit_create_table
Module sqlalchemy.dialects.mysql.base, line 1282, in get_column_specification
Module sqlalchemy.engine.base, line 761, in process
Module sqlalchemy.sql.visitors, line 54, in _compiler_dispatch
Module sqlalchemy.sql.compiler, line 1450, in visit_string
Module sqlalchemy.dialects.mysql.base, line 1520, in visit_VARCHAR
InvalidRequestError: VARCHAR requires a length when rendered on MySQL
You need to specify lengths for all String columns. That means String(n) instead of simply String. So, in pas.plugins.sqlalchemy.model.User,
login = Column(String, unique=True)
becomes
login = Column(String(100), unique=True)
etc...
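A small, self-contained sketch (not the plugin's actual model) showing the difference when the MySQL dialect renders the column; dropping the length and compiling the CREATE TABLE raises the same "VARCHAR requires a length" error:
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects import mysql
from sqlalchemy.schema import CreateTable

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    login = Column(String(100), unique=True)   # plain String (no length) fails on MySQL

# renders VARCHAR(100) in the MySQL DDL
print(CreateTable(User.__table__).compile(dialect=mysql.dialect()))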
I don't have a MySQL installation, and pas.plugins.sqlalchemy works fine on PostgreSQL for me, but it would seem that the authors have made an assumption about varchars. Assuming it's not something that SQLAlchemy should be handling itself (it would be really nice if the MySQL dialect for SQLAlchemy would select an appropriate maximum size for unbounded varchars), I'll see if I can commit a fix this evening.
A quick glance at the code shows that all "String" (treated as varchar by the database) fields have maximum lengths except Login, name and password in the User table and name in the Group table, and there's no good reason why these should be different.
Update: Check out https://svn.plone.org/svn/collective/PASPlugins/pas.plugins.sqlalchemy/branches/auspex from subversion. It's my version of pas.plugins.sqlalchemy with support for the IGroupCapability interface (lets users be added to and removed from groups that are also stored in the rdb), and I've also added lengths to all unbounded String fields.
If you don't know how to use subversion checkouts in buildout, see: http://pypi.python.org/pypi/mr.developer/