Can regular expressions be used in Splunk generating commands?

I created a regular expression that filters results based on a field "id" value:
index=some_index | regex id="222[1-3]{2}00"
Unfortunately, this search takes a very long time to run because it first retrieves a huge volume of data and only then filters it.
Can you tell me whether it is possible to use a regular expression inside a generating command to reduce the execution time?

None of the generating commands support regular expressions. Go to ideas.splunk.com to make a case for it.
In the meantime, filter with a wildcard pattern in the search command first, then refine the results with a regex.
index=some_index id="222*" | regex id="222[1-3]{2}00"
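The two-stage idea above (a cheap wildcard filter first, then the regex on what is left) can be illustrated outside Splunk. The following is only a minimal Python sketch: the `events` list is a hypothetical in-memory stand-in for indexed events, and nothing here is Splunk's API.

```python
import re

# Hypothetical stand-in for indexed events; not Splunk's API.
events = [
    {"id": "2221200"},
    {"id": "2224500"},
    {"id": "2223300"},
    {"id": "9991200"},
]

# Same pattern as in the search; compiled once and reused.
pattern = re.compile(r"222[1-3]{2}00")

# Stage 1: cheap prefix test, analogous to id="222*" in the search command.
candidates = [e for e in events if e["id"].startswith("222")]

# Stage 2: run the regex only on the survivors, analogous to "| regex".
matches = [e["id"] for e in candidates if pattern.search(e["id"])]

print(matches)
```

The regex still decides the final result; the prefix test only shrinks the number of values it has to examine.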

Related

How to get all the records matching a regex in Aerospike?

I have millions of records in a set. I would like to retrieve all the records that match the same pattern.
For example, I may have:
id=4444?mode=mode1?fieldA=abc
id=4444?mode=mode1?fieldA=azerty
id=4444?mode=mode1?fieldA=qwerty
id=4444?mode=mode1?fieldA=foo
id=4444?mode=mode1?fieldA=bar
Is it possible to write a query that returns all of the above records without knowing the value of fieldA in advance? Something like this regex:
id=4444?mode=mode1?fieldA=[\w]*
Thanks for your time.
Yes, it can be done. You would first query by a secondary index to narrow the result set to a manageable size, then write a filter in Lua that discards the records you don't want. This filter could take the regex you want to match against (passed in dynamically) and return only those records that match.
Whilst this would work, it would not be as performant as the key-value operations in Aerospike. You would definitely want to benchmark such a solution before putting it into production.
Predicate filtering was added in release 3.12 on March 15. You can use the stringRegex method of the PredExp class of the Java client to build complex filters such as the one you mentioned. It also currently exists for the C, C# and Go clients.
There's a similar example in the Aerospike Java client:
Statement stmt = new Statement();
stmt.setNamespace(params.namespace);
stmt.setSetName(params.set);
stmt.setFilter(Filter.range(binName, begin, end));
stmt.setPredExp(
    PredExp.stringBin("bin3"),
    PredExp.stringValue("prefix.*suffix"),
    PredExp.stringRegex(RegexFlag.ICASE | RegexFlag.NEWLINE)
);
The RegexFlag class in com.aerospike.client.query defines the flags that control how the regular expression is interpreted (for example, ICASE and NEWLINE above).

Examine lucene.net custom query after analyzer tokenizes

I'm using Examine in Umbraco to query Lucene index of content nodes. I have a field "completeNodeText" that is the concatenation of all the node properties (to keep things simple and not search across multiple fields).
I'm accepting user-submitted search terms. When the search term is multiple words (i.e., "firstterm secondterm"), I want the resulting query to be an OR query: bring me back results where completeNodeText contains firstterm OR secondterm.
I want:
{+completeNodeText:"firstterm ? secondterm"}
but instead, I'm getting:
{+completeNodeText:"firstterm secondterm"}
If I search for "firstterm OR secondterm" instead of "firstterm secondterm", then the generated query is correctly: {+completeNodeText:"firstterm ? secondterm"}
I'm using the following API calls:
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.Field("completeNodeText", term).Compile();
Is there an easy way to force Examine to generate this "OR" query? Or do I have to bypass the entire Examine fluent query API and construct the raw query manually, by calling the StandardAnalyzer to tokenize the user input and concatenating a query from the tokens?
I don't think that question mark means what you think it means.
It looks like you are generating a PhraseQuery, but you want two disjoint TermQueries. In Lucene query syntax, a phrase query is enclosed in quotes.
"firstterm secondterm"
A phrase query looks for precisely that phrase, with the two terms appearing consecutively and in order. Placing an OR within a phrase query does not perform any sort of boolean logic; it is treated as the word "OR". The question mark is a placeholder used in PhraseQuery.toString() to represent a removed stop word (see LUCENE-1396). You are still performing a phrase query, but now it expects a three-word phrase: firstterm, followed by a removed stop word, followed by secondterm.
To simply search for two separate terms, get rid of the quotes.
firstterm secondterm
This will search for any document containing either of those terms (with a higher score given to documents containing both).
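If the raw query does have to be assembled by hand, the idea of one unquoted term per word can be sketched as follows. This is a Python illustration of the string-building step only, not Examine or Lucene.Net code: `build_or_query` is a hypothetical helper, and plain whitespace splitting is assumed in place of the StandardAnalyzer's tokenization.

```python
import re

# Characters that carry meaning in classic Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def build_or_query(field, user_input):
    """Turn 'a b' into 'field:a OR field:b' (no quotes, so no phrase query)."""
    # Assumption: whitespace splitting stands in for real analysis.
    terms = user_input.split()
    # Backslash-escape query syntax characters inside each term.
    escaped = [_SPECIAL.sub(r"\\\1", t) for t in terms]
    return " OR ".join(f"{field}:{t}" for t in escaped)

print(build_or_query("completeNodeText", "firstterm secondterm"))
```

Because no quotes are emitted, each word becomes its own term clause rather than part of a phrase.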

Using regular expressions for syntax highlight

I have a piece of code that assigns attributes to an NSAttributedString depending on whether certain keywords are present in the string or not. In other words, syntax highlight.
To find if a certain string has those keywords I am currently using regular expressions to find the location of those words with "\\bKEYWORD\\b". The problem is, obviously, performance.
I first tried NSRegularExpression, but it was so slow that scrolling my text view was nearly impossible. I then tried Oniguruma and things improved, but it's still noticeably slow. I may try PCRE, but I don't think I'd gain much.
So, my question is: how can I speed up regular expression searches? Maybe caching the compiled expression?
It sounds like you're searching for each word individually. I would create an array of search words, then join them together with the regex alternation symbol |.
Given search words like: alpha, bravo, charlie, delta, echo
Resulting compiled regex: \b(?:alpha|bravo|charlie|delta|echo)\b
The non-capturing group construct (?:...) is a bit faster than the capturing syntax (...).
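The alternation approach can be sketched in Python, with the expression compiled once and reused for every scan, which is the caching the question asks about. The helper names are hypothetical; an NSRegularExpression version would follow the same shape.

```python
import re

def compile_keyword_regex(keywords):
    """Build one \\b(?:k1|k2|...)\\b pattern instead of one regex per keyword."""
    # re.escape guards against keywords containing regex metacharacters.
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternation + r")\b")

KEYWORDS = ["alpha", "bravo", "charlie", "delta", "echo"]

# Compiled once at load time and reused for every highlight pass.
KEYWORD_RE = compile_keyword_regex(KEYWORDS)

def keyword_spans(text):
    """Return (start, end, keyword) for each whole-word keyword in text."""
    return [(m.start(), m.end(), m.group()) for m in KEYWORD_RE.finditer(text)]

print(keyword_spans("alpha beta echo alphabet"))
```

A single pass over the text finds every keyword, so the cost no longer grows with one full scan per keyword.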

Efficient method for storing simple regular expressions

I have a list of simple regular expressions:
ABC.+DE.+FHIJ.+
.+XY.+Z.+AB
.+KLM.+NO.+J.+
QRST.+UV
They all have alternating patterns of .+ and some text (which I will call "words") repeated some number of times. A pattern may or may not begin or end with .+. These regular expressions are all mutually exclusive. When another regex is added, I want to remove any existing regular expressions that it matches and add one regular expression that combines the added one with all of its matches. For example, adding:
.+J.+
would match,
ABC.+DE.+FHIJ.+
.+KLM.+NO.+J.+
and thus, these would be removed and replaced with the added regular expression, resulting in:
.+J.+
.+XY.+Z.+AB
QRST.+UV
I need to store these patterns efficiently, either in some data structure or (preferably) in a database. I first tried a tree of dictionaries, only to realize that when a regex starts with .+ it has to search the entire tree for the next word, which is O(2^n). Unfortunately (unless I am mistaken), neither SQLite (which I am using) nor any other relational database that I have used supports "regular expression" as a data type. My question is: is there an efficient method for storing and retrieving such simple regular expressions? If there is no canned method, is there some data structure that would be relatively efficient (say, at worst amortized polynomial time)?
Could you please explain what you are using these regular expressions for? That would make it easier to provide a better answer. In particular, when I see the way you are splitting your regular expressions, I wonder whether a trie or a directed acyclic word graph would be a better fit.
From there you may find that your answer is as simple as better normalization, or an alternative NoSQL database made specifically for your problem area.
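For reference, the replacement step from the question can be written as a naive linear scan. This is a minimal sketch, not the efficient structure being asked for. It assumes that "matching" means the new regex fully matches the text of a stored pattern, which reproduces the question's example but is not a general test for one regex subsuming another. `add_pattern` is a hypothetical helper.

```python
import re

def add_pattern(stored, new):
    """Drop every stored pattern whose text the new regex matches, then add it."""
    matcher = re.compile(new)
    # Assumption: each stored pattern is treated as a plain string here.
    kept = [p for p in stored if not matcher.fullmatch(p)]
    return [new] + kept

stored = [
    "ABC.+DE.+FHIJ.+",
    ".+XY.+Z.+AB",
    ".+KLM.+NO.+J.+",
    "QRST.+UV",
]

print(add_pattern(stored, ".+J.+"))
```

Each insertion costs one regex match per stored pattern, so this is mainly useful as a baseline to compare a trie-based design against.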

Search literal within a word

Is there a way to perform a FULLTEXT search which returns literals found within words?
I have been using MATCH(col) AGAINST('+literal*' IN BOOLEAN MODE) but it fails if the text is like:
blah,blah,literal,blah
blahliteralblah
blah,blah,literal
Please note that there is no space after the commas.
I want all three cases above to be returned.
I think it would be better to fetch the entries and then do the text processing (in this case, the search) on the fetched data.
Text manipulation and complex queries consume more database resources, and if your database contains a lot of data, the query becomes too slow. Moreover, if you are running your query on a shared server, the performance problems get worse.
Once you have fetched the data from the database, you can easily accomplish what you are trying to do with a regex.
UPDATE: My suggestion is the same even if you are running your script on a dedicated server. However, if you want to perform a full-text search for the word "literal" in BOOLEAN MODE as you described, you can remove the + operator (because you are searching for only one word) and construct the query as follows:
SELECT listOfColumnNames
FROM tableName
WHERE MATCH (colName)
AGAINST ('literal*' IN BOOLEAN MODE);
However, even if you keep the + (AND) operator, your query works fine: tested on Apache Server with MySQL 5.1.
I suggest you read the documentation about full-text search in boolean mode.
The only problem with this query is that it doesn't match the word "literal" when it is a substring inside another word, for example: "textliteraltext".
As you noticed, you can't use the * operator at the beginning of a word.
So, to accomplish what you are trying to do, the fastest and easiest way is to follow Paul's suggestion and use the % wildcard:
SELECT listOfColumnNames
FROM tableName
WHERE colName LIKE '%literal%';
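The substring behaviour of LIKE '%literal%' can be checked quickly against the three sample rows from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only to keep it self-contained. The table name `docs` and the helper `find_substring` are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (col TEXT)")
rows = [
    "blah,blah,literal,blah",
    "blahliteralblah",
    "blah,blah,literal",
    "no match here",
]
conn.executemany("INSERT INTO docs (col) VALUES (?)", [(r,) for r in rows])

def find_substring(conn, needle):
    """Return rows whose col contains needle anywhere, in insertion order."""
    # Escape LIKE wildcards so the needle is matched literally.
    escaped = (
        needle.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cur = conn.execute(
        "SELECT col FROM docs WHERE col LIKE ? ESCAPE '\\' ORDER BY rowid",
        (f"%{escaped}%",),
    )
    return [r[0] for r in cur]

print(find_substring(conn, "literal"))
```

All three cases from the question are returned. The trade-off is that a pattern with a leading % generally cannot use an ordinary index, so it is worth measuring on a large table.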