I'm working with Apache Solr and the apachesolr module for Drupal 7.
Some of our queries are very custom, so I have been digging through the Solr documentation and explanations on Stack Overflow.
I have come up with this query:
/select?q=&start=0&rows=20&fq=bundle:(message)&fq=sm_hashtags:(hashtags)&fq=(is_uid:(1 OR 2 OR 37 OR 38 OR 50 OR 166 OR 174 OR 198 OR 431 OR 499 OR 640 OR 642) AND is_privacy:(0)) AND -is_uid:(177 OR 189) AND is_status:(1)&fq=entity_id:{* TO 2666}&fl=tus_message_object,sm_hashtags,content,ts_search,is_privacy,is_status,is_uid&sort=entity_id+desc&wt=json
but this returns NULL. I have tried a few variations, such as:
/select?q=&start=0&rows=20&fq=bundle:(message)&fq=sm_hashtags:(hashtags)&fq=((is_uid:(1+OR+2+OR+37+OR+38+OR+50+OR+166+OR+174+OR+198+OR+431+OR+499+OR+640+OR+642)+is_privacy:(0))-is_uid:(177+OR+189)+is_status:(1))&fq=entity_id:{*+TO+2666}&fl=tus_message_object,sm_hashtags,content,ts_search,is_privacy,is_status,is_uid&sort=entity_id+desc&wt=json
But I am not sure this is correct.
I need a filter that matches the users in the id list (is_uid) together with everyone whose privacy is 0, excludes the users in the blocked id list (-is_uid), and only returns documents whose status is 1.
You're using a lot of clauses, so it is very hard to tell from here what the cause could be. I can give you the following hints:
a) Debug mode
Copy the whole query string, go to the Solr admin console, and execute that query with an additional debug=true or debugQuery=true parameter. In the response, Solr will append an extra section where it explains how it "sees" the query you entered.
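For example, taking a shortened form of the request above, the only change is the extra parameter at the end (q=*:* is used here just so the main query matches something):
/select?q=*:*&start=0&rows=20&fq=bundle:(message)&fq=sm_hashtags:(hashtags)&wt=json&debugQuery=true
The "debug" section of the response shows the parsed form of q and of every fq, which usually makes an escaping or precedence mistake obvious.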
b) Investigate each fq one by one
If there's a problem, it is almost certainly in the filter queries, so I suggest you try them one by one. But before that, see point c).
c) fq design
Filter queries are great for their caching capabilities. Although some of these fq clauses probably won't be reused much (the is_uid filters), I suggest you split them up in order to get better reusable (i.e. cached) results; something like:
fq=bundle:message // You already have this
fq=sm_hashtags:hashtags // You already have this
fq=is_privacy:0 // this will cache (for further reuse) all documents with no privacy
fq=is_status:1 // this will cache (for further reuse) all documents with status 1
fq=is_uid:(1 OR 921 OR 9...) // Most probably these query results won't benefit much from caching. If so, this clause could also be part of the main query
fq=-is_uid:(8 OR 99) // Same here: probably little caching benefit, so it could also move to the main query
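Assembled into a full request, a sketch of the reworked query string could look like this (uid lists shortened; q=*:* so the main query matches everything and the filters do the narrowing):
/select?q=*:*&start=0&rows=20&fq=bundle:message&fq=sm_hashtags:hashtags&fq=is_privacy:0&fq=is_status:1&fq=is_uid:(1 OR 2 OR 37)&fq=-is_uid:(177 OR 189)&fq=entity_id:{* TO 2666}&sort=entity_id desc&wt=json
One caveat: separate fq clauses are intersected, so if you actually need "is_uid in the list OR is_privacy:0", keep those two conditions together inside a single fq.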
With q= you are querying for nothing on no fields. Try q=*:* (everything), or any term on any field. The best way to debug Solr is to construct your query manually in the Solr admin query interface, usually at http://localhost:8983/solr, and see what comes back.
I would like to optimise a mapping developed by one of my colleagues, where the "loading part" (into a flat file) is really slow: about 12 rows per second.
Currently it takes about 2 hours to reach the point where I start writing to the file, so I would like to know where to look first; otherwise I will need at least 2 hours between each improvement attempt, which is not really efficient.
OK, so to describe simply what is done:
Oracle source table (with a big query inside; it takes about 2 hours to get a result)
Source Qualifier (SQ)
2 Lookups on reference tables (should not be heavy)
Update Strategy
1 transformation
2 Lookups (on big tables; that should be one optimisation point, I guess: change them to Joiners)
6 Stored Procedures (these also seem a bit heavy, what do you think?)
another transformation
load into the flat file
Can you confirm that either the Lookups or the Stored Procedures could be the reason why it is so slow?
Do you think I should look somewhere else to optimise? I was also thinking of using maybe only 1 transformation.
First, check the session logs carefully and look at the timestamps. That should give you an initial idea of which part causes the delay.
Lookups to big tables are not recommended. Joiners are a better way, but they still need to cache data. Can you limit the data for cache, perhaps? It'll be very hard to advise without seeing it.
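For example, if a lookup only ever needs a couple of columns and the active rows, a narrower lookup SQL override (hypothetical table and column names) keeps the cache small:
SELECT ref_id,
       ref_value
FROM   big_ref_table
WHERE  active_flag = 'Y'
The fewer rows and columns the lookup has to cache, the faster the session gets past the cache-building phase.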
Which leads us to the Stored Procedures: it's simply impossible to tell anything about them just like that.
So: first collect the stats and do the log analysis. Next, read some tuning guides on the net; there are plenty. Here's a more comprehensive one, though it's large, so you might also want to look for shorter ones:
PowerCenter Performance Tuning Guide
For my application, I need an alphabetical index on a set with millions of rows.
When I use a sorted set and give all members the same score, the result looks perfect.
Performance is also great: with a test set of 2 million rows, the last third does not perform noticeably worse than the first third of the set.
However, I need to query those results. For example, get the first (at most) 100 items that start with "goo". I played around with ZSCAN and SORT, but they do not give me a working, performant result.
Since Redis is very fast when inserting a new member into the sorted set, it must be technically possible to get to the right memory location immediately (well, very quickly). As far as I understand, Redis uses a skip list under the hood to accomplish this.
But I don't seem to be able to get at that ordering when I just want to query the data, not write to it.
We use replicated slaves for read actions, and we prefer the (default) read-only config switch. So creating a dummy key and deleting it afterwards (however inelegant) is not really an option.
I'm a bit stuck, and I'm thinking about writing a ZLEX command in redis-server itself, which I could use like this:
HELP "ZLEX" -> (ZLEX set score startswith)
-- Query the lexicographical index of a sorted set, supplying a 'startswith' string.
127.0.0.1:12345> ZLEX myset 0 goo LIMIT 0 100
1) goo
2) goof
3) goons
4) goozer
What are your thoughts? Am I missing something in the standard redis commands?
We're using Redis 2.8.4 x64 on Debian.
Kind regards, TW
Edits:
Related issue: indexing-using-redis-sorted-sets -> At least the name I gave to ZLEX seems to conform to Antirez' (Salvatore's) standards. As of 24-1-2014, I'm working on implementing ZLEX. It seems to be the easiest and most straightforward solution for this use case, and Antirez could merge it into the main branch for everyone's benefit.
I've implemented ZLEX.
Here are the full specs.
You can grab the new functionality from here: github tw-bert
I also posted a pull request to Antirez here.
Kind regards, TW
Have you had a look at this?
It can be useful depending on the length of the field by which you sort: this method requires b*(a^2) keys, where a is the length of the field and b is the number of rows for this field.
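As an illustration, a minimal Java/Jedis sketch of the prefix-key idea: every prefix of a name becomes its own key, so a prefix search is a single range read. The key naming (idx:<prefix>) and the one-sorted-set-per-prefix layout are my assumptions, not necessarily the exact scheme of the linked method:
import redis.clients.jedis.Jedis;

public class PrefixIndex {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Index a name under every one of its prefixes:
    // "goo" is added to idx:g, idx:go and idx:goo.
    public void add(String name) {
        for (int i = 1; i <= name.length(); i++) {
            jedis.zadd("idx:" + name.substring(0, i), 0, name);
        }
    }

    // Print at most 'limit' names that start with 'prefix'.
    public void printStartsWith(String prefix, int limit) {
        for (String hit : jedis.zrange("idx:" + prefix, 0, limit - 1)) {
            System.out.println(hit);
        }
    }
}
It trades memory for speed: each of the b names is stored under up to a prefix keys, which is where the key explosion mentioned above comes from.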
I have a simple statement, pasted below, that runs against an Oracle database. The result set contains names of businesses, but it has 24,000 rows, and these are displayed in a drop-down list.
I am looking for ideas on ways to reduce the result set to speed up the data returned to the user interface, maybe something like Google's search or a completely different idea. I am open to whatever thoughts and any direction is welcome.
SELECT BusinessName FROM MyTable ORDER BY BusinessName;
Idea:
SELECT BusinessName FROM MyTable WHERE BusinessName LIKE 'A%';
I know all about how LIKE clauses can be unwise to use, but like I said, this is a LARGE result set. Maybe something along the lines of a binary search?
The last query can perform horribly. String comparisons inside the database can be very slow, and depending on the number of "hits" they can be a huge drag on performance. If that doesn't concern you, that's fine. This is especially true if the company data isn't normalized into its own db table.
As long as the user knows the company they're looking up, an existing component from some popular JavaScript library that provides a search text field with a dynamic drop-down of matching results would be an effective mechanism. But you might want to use '%A%' if they might search for part of a name. For example, if I'm looking for IBM Rational, LLC, do I want it to show up in the results when I search for "Rational"?
Either way, watch your performance, and if it makes sense, cache that data in the company look-up service that sits on the server in front of the DB. Also, make sure you don't hit the server on every keystroke; use a timeout of 500 ms or so, to let the user type several characters before you go to the server and search. And I would NOT recommend bringing all of the company names to the client: we're always looking to reduce the size and frequency of round trips from the browser to the server. Waiting for 24k company names to come down to the client when the form loads (or even behind the scenes), when shorter, very specific queries will perform perfectly well, seems inefficient to me. Again, test it and identify the performance characteristics that fit your use case best.
These are techniques I've used on projects with large data sets, like searching for a user from a base of 100,000+ users. Our code was a custom Dojo widget (dijit); I'm not seeing how to do it directly with the dijit code, but jQuery UI provides an autocomplete widget.
Also, cap the number of rows this query returns for the text field, so that the drop-down only shows a subset of all the matches, forcing the user to refine the query further. Note that Oracle has no LIMIT keyword; ROWNUM on an ordered subquery does the capping:
SELECT BusinessName FROM
  (SELECT BusinessName FROM MyTable ORDER BY BusinessName)
WHERE ROWNUM <= 10;
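Behind such an autocomplete field, the look-up service boils down to a single bound query. A minimal JDBC sketch (class and method names are mine; MyTable/BusinessName as in the question):
import java.sql.*;
import java.util.*;

public class CompanyLookup {
    private final Connection conn;

    public CompanyLookup(Connection conn) {
        this.conn = conn;
    }

    // Return at most 'limit' business names starting with 'prefix'.
    public List<String> findByPrefix(String prefix, int limit) throws SQLException {
        String sql =
            "SELECT BusinessName FROM " +
            "  (SELECT BusinessName FROM MyTable " +
            "   WHERE BusinessName LIKE ? ORDER BY BusinessName) " +
            "WHERE ROWNUM <= ?";
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, prefix + "%"); // bound variable also guards against SQL injection
            ps.setInt(2, limit);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
            }
        }
        return names;
    }
}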
We're developing an application based on Neo4j and PHP, with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for each entity type (users, companies) which holds the nodes with that property. So the users index contains 130k nodes, and the companies index holds the rest.
With Cypher, we query like this:
START u=node:users('id:*')
RETURN count(u)
And the result is:
Returned 1 row. Query took 4080 ms
The server is configured with mostly defaults plus a few tweaks, but 4 seconds is far too slow for our needs. Consider that the database will grow by about 20k nodes a month, so we really need this query to perform well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution:
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
    new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
Mmh, this is really about the performance of that Lucene index. If you mostly need just this single figure, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and, for good measure, refresh it every night by running the query above?
You could instead keep a property on a specific node up to date with the number of such nodes, with updates guarded by write locks:
Transaction tx = db.beginTx();
try {
    // ... create or delete the user nodes in this same transaction ...
    // Take a write lock on the counting node so concurrent
    // increments cannot lose updates.
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
            ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or, if you are using 2.0:
company1:COMPANY
By the way, the second option would also allow your index to be updated automatically in a separate background thread, IMO one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" generally takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities (see the sketch after the queries below).
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)
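For completeness, populating that 1.9 legacy entity index with the embedded Java API could look like the sketch below. It assumes db is your GraphDatabaseService and companyEntity is the shared node that company nodes point at via IS_ENTITY:
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

Transaction tx = db.beginTx();
try {
    Index<Node> entityIndex = db.index().forNodes( "entityindex" );
    // Makes node:entityindex(value='company') in the START clause resolvable.
    entityIndex.add( companyEntity, "value", "company" );
    tx.success();
} finally {
    tx.finish();
}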
I have approximately 2,500 documents in my test database, and searching the XPath /path/to/@attribute takes approximately 2.4 seconds. Doing distinct-values(/path/to/@attribute) takes 3.0 seconds.
I've been able to speed up queries on /path/to[@attribute='value'] to tens or hundreds of milliseconds by adding a path value index on /path/to[@attribute<STRING>], but no index I can think of gets picked up for the more general query.
Anybody know what indexes I should be using?
The index you propose (/path/to[@attribute]) is the correct one, but unfortunately the xDB optimizer currently doesn't recognize this specific case, since the 'target node' stored in the index is always an element, never an attribute. If /path/to/@attribute has few results, you can optimize this by slightly modifying your query to: distinct-values(/path/to[@attribute]/@attribute). With this query the optimizer recognizes that there is an index it can use to get to the 'to' element, but it still has to access the target document to retrieve the attribute for the @attribute step. This is precisely why it will only benefit cases with few hits: each hit will likely access a different data page.
What you can also do is access the keys in the index directly through the API: XhiveIndexIf.getKeys(). This will be very fast, but clearly it is not very user friendly (and should be done by the optimizer instead).
Clearly the optimizer could handle this. I will add it to the bug tracker.