How to query documents in MarkLogic and process results

I've been working off of the tutorial pages but seem to have a fundamental disconnect in my thinking while transitioning away from RDBMS systems. I'm using MarkLogic and handling this database interaction through the Java API, focusing on search access via the POJO methods outlined in the tutorial documentation.
My reference up to this point has come from here principally: http://developer.marklogic.com/learn/java/processing-search-results
My scenario is this:
I have a series of documents. We'll call them 'books' for simplicity. I'm writing these books into my DB like this:
jsonDocMgr.write("/" + book.getID() + "/",
    new StringHandle(
        "{\"name\": \"" + book.getID() + "\"," +
        "\"chaps\": " + book.getNumChaps() + "," +
        "\"pages\": " + book.getNumPages() + "}"));
What I want is to execute the following type of operation:
- Query all documents with a name matching "book*" (IDs are book0, book1, book2, etc.) where chaps > 3. For these documents only, I want to halve the number of pages.
In an RDBMS, I'd use something like jdbcTemplate and get a result set for me to iterate through. For each iteration I'd know I was working with a single record (aka a book), parse the field values from the result set, make a note of the ID, then update the DB accordingly.
With MarkLogic, I'm awash in a sea of different handlers and managers... none of which seems to follow the pattern of the ResultSet with its cursor abstraction. Ultimately I want to do a two-step operation: check the chapter count, then update the pages field for that specific URI.
What's the most common approach to this? It seems like the most basic of operations...

Try the high-level Java API and see if it works for you. Create a multi-statement transaction with a query by example, then use document operations.
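A minimal sketch of that flow with the Java Client API (4.x style), assuming a REST instance on localhost:8000 with digest authentication and a range index on chaps; the host, credentials, and property names are placeholders taken from the question:

import com.fasterxml.jackson.databind.node.ObjectNode;
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.Transaction;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.JacksonHandle;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.query.MatchDocumentSummary;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.RawQueryByExampleDefinition;

public class HalvePages {
    public static void main(String[] args) {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("user", "password"));
        QueryManager queryMgr = client.newQueryManager();
        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        // Query by example: books with more than 3 chapters.
        // ($gt needs a range index on "chaps" in the database configuration.)
        RawQueryByExampleDefinition query = queryMgr.newRawQueryByExampleDefinition(
                new StringHandle("{\"$query\": {\"chaps\": {\"$gt\": 3}}}")
                        .withFormat(Format.JSON));

        // Multi-statement transaction: the search and the updates run as one unit.
        Transaction txn = client.openTransaction();
        try {
            // For brevity this reads only the first page of results (10 by default).
            SearchHandle results = queryMgr.search(query, new SearchHandle(), txn);
            for (MatchDocumentSummary match : results.getMatchResults()) {
                String uri = match.getUri();
                // Read, modify, and write back each matching document.
                ObjectNode book = (ObjectNode) docMgr.read(uri, new JacksonHandle(), txn).get();
                book.put("pages", book.get("pages").asInt() / 2);
                docMgr.write(uri, new JacksonHandle(book), txn);
            }
            txn.commit();
        } catch (RuntimeException e) {
            txn.rollback();
            throw e;
        } finally {
            client.release();
        }
    }
}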
At a lower level, the closest match to a ResultSet is the ResultSequence class. The examples at http://docs.marklogic.com/javadoc/xcc/overview-summary.html are pretty good. For updates the interaction model between Java and MarkLogic is a bit different from JDBC and SQL. There is no SELECT... FOR UPDATE syntax.
The most efficient low-level technique is to select and update in one XQuery transaction, something like a stored procedure. However this requires good knowledge of XQuery. The other low-level approach is to use an XCC multi-statement transaction, which requires a little less knowledge of XQuery.
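For example, a minimal XCC sketch of that cursor-style iteration, assuming an XCC app server on port 8010 and a range index on chaps; the connection URI and the cts:uris query are placeholders:

import com.marklogic.xcc.ContentSource;
import com.marklogic.xcc.ContentSourceFactory;
import com.marklogic.xcc.Request;
import com.marklogic.xcc.ResultItem;
import com.marklogic.xcc.ResultSequence;
import com.marklogic.xcc.Session;

public class ListBookUris {
    public static void main(String[] args) throws Exception {
        ContentSource source = ContentSourceFactory.newContentSource(
                java.net.URI.create("xcc://user:password@localhost:8010/Documents"));
        Session session = source.newSession();
        try {
            // Ad hoc XQuery returning the URI of every book with more than 3 chapters.
            Request request = session.newAdhocQuery(
                    "cts:uris((), (), cts:json-property-range-query('chaps', '>', 3))");
            ResultSequence rs = session.submitRequest(request);
            while (rs.hasNext()) {                    // cursor-style iteration, like a ResultSet
                ResultItem item = rs.next();
                System.out.println(item.asString());  // the document URI to read/update next
            }
            rs.close();
        } finally {
            session.close();
        }
    }
}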

A minor issue in your code ... you definitely do NOT want to end your JSON document URIs with "/" as you do in your sample code. End them with ".json", some other extension, or no extension at all, but definitely not "/", as that is treated specially by the server.


How to skip the SQL(part of the SQL) parsing in antlr4?

Sorry, this question was closed and cannot be reopened. And sorry for my poor English; it was indeed translated by a website. :)
https://stackoverflow.com/questions/70035964/how-to-skip-sql-parsing-in-antlr4
@BartKiers Thanks for being interested in this question; let me give a detailed example.
There are lots of SQL queries, such as select * from user or update user set field1 = 'value1' where condition = 'value', etc. Let's call them the original SQL queries.
There is a Java program which intercepts all the original SQL queries, parses them into parse tree nodes with ANTLR4, and then rewrites them (the rewrite depends on the parse phase). So an original SQL query may be parsed and rewritten as
select field1, field1_encrypted, field1_digest, field2 from user
or
update user
set field1 = value1,
field1_encrypted = encrypt_algorithm(value1),
field1_digest = digest_algorithm(value1)
where condition_digest = digest_algorithm(values)
etc.
Once the rewrite phase has finished, the queries are executed as SQLStatement objects: a SELECT is executed as a SelectSQLStatement, while an UPDATE is executed as an UpdateSQLStatement.
Now I would like some of the original SQL queries to skip the parse phase (and likewise the rewrite phase), yet still be executed exactly as written.
I thought of marking those with a comment, as in
/* PARSE_PHASE_SKIPPED=TRUE */ originalSQL
or with a SKIP prefix, as in
SKIP originalSQL
I wish to capture the whole marked-but-original SQL query as a parse tree node via ANTLR4, and execute it as a ParsePhaseSkippedSQLStatement.
Can ANTLR4 support this situation, and how should the grammar be written? Thanks in advance.
====================
Thank you for your reply @Mike Cargal. Yes, almost.
Let me say it again and give a more detailed example.
There is a Java system, let's call it X. X has lots of SQL queries that the developers write and guarantee can be executed correctly by iBatis / JPA etc. Let's name those SQL queries the original SQL queries.
Using below original SQL queries as examples:
insert into user (username, id_no) values ('xyz', '123456')
select username, id_no from user u where u.id_no = '123456'
We say the column id_no on table user holds sensitive data, so we should store ciphertext instead of plaintext. The original SQL queries would therefore be parsed by ANTLR and rewritten by Java code as below. Let's name these the rewritten SQL queries; they too must be executed correctly by iBatis / JPA etc.
insert
into user (username, id_no, id_no_cipher, id_no_digest)
values ('xyz', '', 'encrypted_123456', 'digest_123456')
select username, id_no_cipher as id_no
from user u
where u.id_no_digest = 'digest_123456'
In this case:
1. We see that the rewrite phase depends on the parse phase: original SQL queries need to be parsed correctly before they can be rewritten by the Java code.
2. All original SQL queries are parsed, but only the few matching the sensitive-data rules are rewritten into rewritten SQL queries.
But there are lots of original SQL queries that we know for certain need no rewriting, and hence no parsing; parsing them may even raise exceptions in various complex situations, yet they should still be executed correctly by iBatis / JPA etc.
So I planned to use an SQL comment / customized keyword annotation to "turn off" the parse phase for them.
If I understand your question correctly, you wish to use some sort of comment/annotation to "turn off" execution of the following SQL statement.
(NOTE: You can't really skip "parsing" part of the input. This will address ways in which you could skip processing part of the parsed input, which I believe is what you're ultimately wanting to accomplish.)
This would not be an ANTLR concern. ANTLR's responsibility is to parse your input stream and produce a parse tree (not technically an AST) that correctly represents the structure of your input.
Executing the SQL is not what ANTLR does. It does, however, generate utility Listener and Visitor classes that can be used to cleanly and efficiently navigate the resulting parse tree. There can be a lot of code involved in actually executing the SQL from the parse tree. Often, the first step is to produce an AST from the parse tree to make it easier to deal with.
You have a couple of alternatives (as you hint at).
1 - Using the current grammar and putting these annotations inside comments (/* PARSE_PHASE_SKIPPED=TRUE */)
This can be done, but it's a bit "messy". It's most likely that COMMENT tokens are handled with -> skip (or perhaps sent to -> channel(HIDDEN)). This makes it MUCH easier to write the parser rules, as you don't have to include optional COMMENTs everywhere a comment could appear. That said, if you send COMMENT tokens to the HIDDEN channel, they are still in the token stream even though they are ignored by the parser. The COMMENT tokens won't be in the rule Context objects that the listeners/visitors deal with, but you can look backwards/forwards in the token stream for COMMENT tokens, as in the sketch below.
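A minimal sketch of that token-stream lookup, assuming a generated SqlParser with a statement rule and COMMENT tokens routed to the hidden channel (all of these names are illustrative):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

public class SkipMarkerListener extends SqlParserBaseListener {
    private final BufferedTokenStream tokens;
    private final Set<SqlParser.StatementContext> skipped = new HashSet<>();

    public SkipMarkerListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }

    @Override
    public void enterStatement(SqlParser.StatementContext ctx) {
        // Inspect hidden-channel tokens immediately to the left of the statement.
        List<Token> hidden = tokens.getHiddenTokensToLeft(
                ctx.getStart().getTokenIndex(), Token.HIDDEN_CHANNEL);
        if (hidden == null) return;
        for (Token t : hidden) {
            if (t.getText().contains("PARSE_PHASE_SKIPPED=TRUE")) {
                skipped.add(ctx); // the rewrite phase can consult this set and leave the statement alone
            }
        }
    }

    public boolean isSkipped(SqlParser.StatementContext ctx) {
        return skipped.contains(ctx);
    }
}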
2 - you could introduce some new syntax for annotations (similar to your SKIP idea). To do this you'll have to extend the syntax in the grammar to recognize these annotations. They'd have to be distinguishable from valid SQL, so a simple SKIP is probably not going to work.
The benefit of this approach is that, when you extend the grammar to recognize annotations, you can be very specific about where annotations are allowed. You'd be able to include them in your parse tree, making them easier to locate.
With either of these approaches, you would use a visitor or listener to go through your parse tree looking for the comment/annotation and then mark the subsequent statement with an indicator that you don't want to execute it. (You might use the information to simply skip the parse-tree-to-AST transformation of the "skipped" nodes.)
Let me see if I understand your question correctly. In your environment you run SQL queries (not "SQLs", btw.), which may contain data that must not reach the server as is. It doesn't matter if that is sensitive data or what else. All what matters is that you want to replace the text in the queries.
For that you parse the queries and rewrite them, before sending them to the server. However, you don't want to do that for all queries, but only for specific ones. And you came up with the idea to mark queries (or query parts) that must not be transformed, with a special comment. Does this match your intention so far?
Now I wonder why you want to accomplish that at the query (parsing) level. It's not the parsing you want to modify but the semantic handling of the parse result (here the parse tree, as Mike Cargal already mentioned). So, in my opinion you don't need to introduce special markup for your queries; instead, define criteria that indicate which data must be replaced.
When you think about that, you will probably realise that data for specific fields (columns) in specific tables needs to be replaced. You can actually keep a list of schema/table/column tuples which tells your rewriter whether a value must be rewritten (a sketch follows). Everything else stays as it is.
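A minimal sketch of that lookup, with illustrative names; the tuples are flattened to schema.table.column strings for brevity:

import java.util.Set;

public class RewriteRules {
    // One entry per column whose values must be rewritten.
    private static final Set<String> SENSITIVE = Set.of("app.user.id_no");

    public static boolean mustRewrite(String schema, String table, String column) {
        return SENSITIVE.contains(schema + "." + table + "." + column);
    }
}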
ANTLR4 has nothing to do in this process. It's all to be done in the semantic phase (the processing of the parse tree using a parse tree listener). In this phase you have to collect all column references that are used in a query. Then you compare that list with the list for the rewriter. If a column reference matches, the rewriter has to rewrite the text for it in the query.
That task is however non-trivial, because of nested queries (subqueries, where inner queries can reference tables from an outer query). This is btw. pretty similar to the way code completion works, where you have to provide a possible column list for all mentioned tables in a query. That's why I have written (C++) code to collect such references in MySQL Workbench's SQL code completion.

RESTful API Design OR Predicates

I'm designing a RESTful API and I'm trying to work out how I could represent a predicate with an OR operator when querying for a resource.
For example if I had a resource Foo with a property Name, how would you search for all Foo resources with a name matching "Name1" OR "Name2"?
This is straightforward when it's an AND operator, as I could do the following:
http://www.website.com/Foo?Name=Name1&Age=19
The other approach I've seen is to post the search in the body.
You will need to pick your own approach, but I can name a few that seem pretty logical (although not without disadvantages):
Option 1.: Using | operator:
http://www.website.com/Foo?Name=Name1|Name2
Option 2.: Using modified query param to allow selection by one of the values from the set (list of possible comma-separated values):
http://www.website.com/Foo?Name_in=Name1,Name2
Option 3.: Using PHP-like notation to provide list instead of single string:
http://www.website.com/Foo?Name[]=Name1&Name[]=Name2
All of the above mentioned options have one huge advantage: they do not interfere with other query params.
But as I mentioned, pick your own approach and be consistent about it across your API.
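Server-side, Option 2 reduces to splitting the parameter; a minimal sketch assuming Spring MVC and Spring Data (FooRepository, Foo, and findByNameIn are illustrative):

import java.util.Arrays;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

interface FooRepository extends JpaRepository<Foo, Long> {
    List<Foo> findByNameIn(List<String> names); // derived query: WHERE name IN (...)
}

@RestController
public class FooController {
    private final FooRepository fooRepository;

    public FooController(FooRepository fooRepository) {
        this.fooRepository = fooRepository;
    }

    @GetMapping("/Foo")
    public List<Foo> find(@RequestParam(name = "Name_in", required = false) String nameIn) {
        if (nameIn == null) {
            return fooRepository.findAll();
        }
        // Name_in=Name1,Name2 becomes an IN (...) predicate, i.e. an OR over the names.
        return fooRepository.findByNameIn(Arrays.asList(nameIn.split(",")));
    }
}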
Well, one quick way of fixing that is to add an additional parameter that identifies the relationship between your parameters, whether they're an AND or an OR, for example:
http://www.website.com/Foo?Name=Name1&Age=19&or=true
Or, for much more complex queries, keep a single parameter and put your whole query in it, making up your own little query language; on the server side you would parse the whole string and extract the information and the statement.

What must be escaped in SQL?

When using SQL in conjunction with another language, what data must be escaped? I was just reading the question here, and my understanding was that only data from the user must be escaped.
Also, must all SQL statements be escaped, e.g. INSERT, UPDATE and SELECT?
EVERY type of query in SQL must be properly escaped. And not only "user" data. It's entirely possible to inject YOURSELF if you're not careful.
e.g. in pseudo-code:
$name = sql_get_query("SELECT lastname FROM sometable");
sql_query("INSERT INTO othertable (badguy) VALUES ('$name')");
That data never touched the 'user', it was never submitted by the user, but it's still a vulnerability - consider what happens if the user's last name is O'Brien.
Most programming languages provide code for connecting to databases in a uniform way (for example JDBC in Java and DBI in Perl). These provide automatic techniques for doing any necessary escaping using Prepared Statements.
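For instance, a minimal JDBC sketch of the prepared-statement version of the pseudo-code above; the connection details and table names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SafeInsert {
    public static void copyLastName(String url, String user, String password) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement select = conn.prepareStatement("SELECT lastname FROM sometable");
             ResultSet rs = select.executeQuery()) {
            if (rs.next()) {
                String name = rs.getString(1); // may well be O'Brien
                try (PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO othertable (badguy) VALUES (?)")) {
                    insert.setString(1, name); // the driver handles all escaping
                    insert.executeUpdate();
                }
            }
        }
    }
}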
All SQL queries should be properly sanitized and there are various ways of doing it.
You need to prevent the user from trying to exploit your code using SQL Injection.
Injections can be made in various ways, for example through user input, server variables and cookie modifications.
Given a query like:
"SELECT * FROM tablename WHERE username= <user input> "
If the user input is not escaped, the user could do something like
' or '1'='1
Executing the query with this input will actually make it always true, possibly exposing sensitive data to the attacker. But there are many other, much worse scenarios injection can be used for.
You should take a look at the OWASP SQL Injection Guide. They have a nice overview of how to prevent those situations and various ways of dealing with it.
I also think it largely depends on what you consider 'user data' to be, or indeed where it originates from. I personally consider user data to be data available in the public domain (even if only through exploitation), i.e. data that can be changed by 'a' user even if it's not 'the' user.
Marc B makes a good point, however, that in certain circumstances you may dirty your own data, so I guess it's always better to be safe than sorry in regards to SQL data.
I would note that, in regards to direct user input (i.e. from web forms, etc.), you should always have an additional layer of server-side validation before the data even gets near a SQL query.

Case-insensitive search using Hibernate

I'm using Hibernate for ORM of my Java app to an Oracle database (not that the database vendor matters, we may switch to another database one day), and I want to retrieve objects from the database according to user-provided strings. For example, when searching for people, if the user is looking for people who live in 'fran', I want to be able to give her people in San Francisco.
SQL is not my strong suit, and I prefer Hibernate's Criteria-building code to hard-coded strings as it is. Can anyone point me in the right direction about how to do this in code, and if that's impossible, how the hard-coded SQL should look?
Thanks,
Yuval =8-)
For the simple case you describe, look at Restrictions.ilike(), which does a case-insensitive search.
Criteria crit = session.createCriteria(Person.class);
crit.add(Restrictions.ilike("town", "%fran%"));
List results = crit.list();
Or, passing a MatchMode instead of embedding the wildcards yourself:
Criteria crit = session.createCriteria(Person.class);
crit.add(Restrictions.ilike("town", "fran", MatchMode.ANYWHERE));
List results = crit.list();
If you use Spring's HibernateTemplate to interact with Hibernate, here is how you would do a case insensitive search on a user's email address:
getHibernateTemplate().find("from User where upper(email)=?", emailAddr.toUpperCase());
You also do not have to put in the '%' wildcards. You can pass MatchMode (docs for previous releases here) in to tell the search how to behave. START, ANYWHERE, EXACT, and END matches are the options.
The usual approach to ignoring case is to convert both the database values and the input value to upper or lower case; the resulting SQL would contain something like
select f.name from f where UPPER(f.name) like '%FRAN%'
In Hibernate Criteria: Restrictions.like(...).ignoreCase()
I'm more familiar with NHibernate, so the syntax might not be 100% accurate.
For some more info, see the Pro Hibernate 3 extract and the Hibernate docs, section 15.2, "Narrowing the result set".
This can also be done using the criterion Example, in the org.hibernate.criterion package.
public List findLike(Object entity, MatchMode matchMode) {
    Example example = Example.create(entity);
    example.enableLike(matchMode);
    example.ignoreCase();
    return getSession().createCriteria(entity.getClass())
            .add(example)
            .list();
}
Just another way that I find useful to accomplish the above.
Since Hibernate 5.2, session.createCriteria is deprecated. Below is a solution using the JPA 2 CriteriaBuilder; it uses like and upper:
CriteriaBuilder builder = session.getCriteriaBuilder();
CriteriaQuery<Person> criteria = builder.createQuery(Person.class);
Root<Person> root = criteria.from(Person.class);
Expression<String> upper = builder.upper(root.get("town"));
criteria.where(builder.like(upper, "%FRAN%"));
session.createQuery(criteria.select(root)).getResultList();
Most default database collations are not case-sensitive, but in the SQL Server world it can be set at the instance, the database, and the column level.
You could look at using Compass, a wrapper on top of Lucene.
http://www.compass-project.org/
By adding a few annotations to your domain objects, you can achieve this kind of thing.
Compass provides a simple API for working with Lucene. If you know how to use an ORM, then you will feel right at home with Compass, with simple operations for save, delete and query.
From the site itself:
"Building on top of Lucene, Compass simplifies common usage patterns of Lucene such as google-style search, index updates as well as more advanced concepts such as caching and index sharding (sub indexes). Compass also uses built in optimizations for concurrent commits and merges."
I have used this in the past and I find it great.

Converting SQL Result Sets to XML

I am looking for a tool that can serialize and/or transform SQL result sets into XML. Getting dumbed-down XML generation from SQL result sets is simple and trivial, but that's not what I need.
The solution has to be database-neutral and accept only regular SQL query results (no DB XML support used). A particular challenge of this tool is to produce nested XML matching an arbitrary schema from row-based results. Intermediate steps are too slow and wasteful; this needs to happen in one single step: no RS->object->XML, and preferably no RS->XML->XSLT->XML. It must support streaming, due to large result sets and big XML.
Anything out there for this?
With SQL Server you really should consider using the FOR XML construct in the query.
If you're using .Net, just use a DataAdapter to fill a dataset. Once it's in a dataset, just use its .WriteXML() method. That breaks your DB->object->XML rule, but it's really how things are done. You might be able to work something out with a datareader, but I doubt it.
Not that I know of. I would just roll my own. It's not that hard to do, maybe something like this:
#!/usr/bin/env jruby
require 'java'
java_import java.sql.DriverManager
# TODO some magic to load the driver
conn = DriverManager.get_connection(ARGV[0], ARGV[1], ARGV[2])
res = conn.create_statement.execute_query(ARGV[3])
puts "<result>"
meta = res.meta_data
while res.next
  puts "<row>"
  for n in 1..meta.column_count
    column = meta.get_column_name(n)
    puts "<#{column}>#{res.get_string(n)}</#{column}>"
  end
  puts "</row>"
end
puts "</result>"
Disclaimer: I just made all of that up, I'm not even bothering to pretend that it works. :-)
In .NET you can fill a dataset from any source and then it can write that out to disk for you as XML with or without the schema. I can't say what performance for large sets would be like. Simple :)
Another option, depending on how many schemas you need to output, and/or how dynamic this solution is supposed to be, would be to actually write the XML directly from the SQL statement, as in the following simple example...
SELECT
'<Record>' ||
'<name>' || name || '</name>' ||
'<address>' || address || '</address>' ||
'</Record>'
FROM
contacts
You would have to prepend and append the document element, but I think this example is easy enough to understand.
dbunit (www.dbunit.org) does go from SQL to XML and vice versa; you might be able to modify it to better fit your needs.
Technically, converting a result set to an XML file is straightforward and doesn't need any tool, unless you have a requirement to convert the data structure to fit a specific export schema. In general, the result set becomes the top-level element of the XML file, and you then produce a number of record elements containing attributes, which effectively are the fields of a record.
When it comes to Java, for example, you just need an appropriate JDBC driver for interfacing with the DBMS of your choice (usually provided by the DBMS vendor), which addresses the database-independence requirement, and a few lines of code (see the sketch below) to read a result set and print out an XML string per record, per field. Not a difficult task for an average Java developer, in my opinion.
Anyway, the more concrete a purpose you state, the more concrete an answer you get.
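For instance, a minimal streaming sketch of those "few lines of code" with JDBC plus StAX, writing each row as it is fetched instead of buffering the whole result set; element names are illustrative, and column labels are assumed to be valid XML names:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class ResultSetToXml {
    public static void write(Connection conn, String sql, java.io.OutputStream out)
            throws SQLException, XMLStreamException {
        XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out, "UTF-8");
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            xml.writeStartDocument();
            xml.writeStartElement("result");
            while (rs.next()) {                          // rows stream through one at a time
                xml.writeStartElement("row");
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    xml.writeStartElement(meta.getColumnLabel(i));
                    String value = rs.getString(i);
                    if (value != null) xml.writeCharacters(value);
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.flush();
        }
    }
}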
In Java, you may just fill an object with the data (like an entity bean) and then use XMLEncoder to turn it into XML. From there you may use XSLT for further conversion, or XMLDecoder to bring it back to an object.
Greetz, GHad
PS: See http://ghads.wordpress.com/2008/09/16/java-to-xml-to-java/ for an example for the Object to XML part... From DB to Object multiple more way are possible: JDBC, Groovy DataSets or GORM. Apache Common Beans may help to fill up JavaBeans via Reflection-like methods.
I created a solution to this problem by using the equivalent of a mail merge using the resultset as the source, and a template through which it was merged to produce the desired XML.
The template was standard XML, with a Header element, a Footer element and a Body element. Using a CDATA block in the Body element allowed me to include a complete XML structure that acted as the template for each row. In order to include fields from the result set in the template, I used markers that looked like this: <[FieldName]>. The template was then pre-parsed to isolate the markers, such that in operation the template requests each of the fields from the result set as the Body is being produced.
The Header and Footer elements are output only once at the beginning and end of the output set. The body could be any XML or text structure desired. In your case, it sounds like you might have several templates, one for each of your desired schemas.
All of the above was encapsulated in a Template class, such that after loading the Template, I merely called merge() on the template passing the resultset in as a parameter.
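A minimal sketch of just the marker-substitution step of that design; the Template class itself is the answerer's own, so everything here is illustrative:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowMerger {
    private static final Pattern MARKER = Pattern.compile("<\\[(\\w+)\\]>");

    // Replaces every <[FieldName]> marker in the body template with the
    // corresponding column value from the current result-set row.
    public static String renderRow(String bodyTemplate, ResultSet rs) throws SQLException {
        Matcher m = MARKER.matcher(bodyTemplate);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = rs.getString(m.group(1));
            m.appendReplacement(sb, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}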