Hive: accented characters to their non-accented counterparts

How can I replace non-ASCII characters with their ASCII counterparts in a SELECT query sent to Hive? That is, have accents removed (é, ê, è => e) and have other non-alphanumeric characters (``) removed.
I know I can use regexp_replace(), but I'd have to deal with every accented/non-accented pair there is. Surely there is something more practical?

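It seems that you want to use java.text.Normalizer:
String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
as described in
Replace non ASCII character from string
Note that NFD only decomposes each accented letter into a base letter followed by combining marks, so the non-ASCII characters still have to be stripped afterwards.
I tried using reflect but couldn't make it work because of the Normalizer.Form enum parameter.
So it seems that you have to define a very small UDF:
import java.text.Normalizer;
import org.apache.hadoop.hive.ql.exec.UDF;

public class NormalizerUDF extends UDF {
    // Decompose accented characters (NFD), then drop everything that is not ASCII.
    public String evaluate(String in) {
        if (in == null) return null;
        return Normalizer.normalize(in, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }
}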
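To check the normalization logic outside Hive, here is a minimal plain-Java sketch. The class and method names (AccentStripper, strip) are made up for illustration, and keeping whitespace while dropping every other non-alphanumeric character is an assumption about what the question wants.

```java
import java.text.Normalizer;

public class AccentStripper {
    // Hypothetical helper, not part of Hive or of the original answer.
    static String strip(String in) {
        if (in == null) return null;
        // Step 1: NFD splits "é" into "e" followed by a combining acute accent.
        String decomposed = Normalizer.normalize(in, Normalizer.Form.NFD);
        // Step 2: remove the combining marks (Unicode category M).
        String unaccented = decomposed.replaceAll("\\p{M}+", "");
        // Step 3 (assumption): drop anything that is not an ASCII letter, digit or whitespace.
        return unaccented.replaceAll("[^\\p{Alnum}\\s]", "");
    }

    public static void main(String[] args) {
        // "Crème brûlée!" written with Unicode escapes to avoid source-encoding issues.
        System.out.println(strip("Cr\u00e8me br\u00fbl\u00e9e!"));
    }
}
```

The same three steps could go into the body of the UDF's evaluate() method if the non-alphanumeric cleanup should also happen inside Hive.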

Parsing Request Payload string

Can anyone help me parse data from a Request Payload string like the following one:
7|0|5|https://www.bosscapital.com/app/Basic/|B8CC86B6E3BFEAF758DE5845F8EBEA08|com.optionfair.client.common.services.TradingService|getAssetDailyTicks|J|1|2|3|4|2|5|5|CB|U9mc4GQ|
Thanks & Regards
Ajay
You can call the Split() method on strings to split them at a certain character. Alternatively, you can use Regex.Split(value, "<pattern>"); for splitting, e.g. if you have multiple characters you want to split at. <pattern> is a string here, so you can provide more than one character (e.g. "\r\n" to find line breaks).
using System;

class Program {
    static void Main() {
        string s = "7|0|5|https://www.bosscapital.com/app/Basic/|B8CC86B6E3BFEAF758DE5845F8EBEA08|com.optionfair.client.common.services.TradingService|getAssetDailyTicks|J|1|2|3|4|2|5|5|CB|U9mc4GQ|";
        // Split string at pipe character
        string[] parts = s.Split('|');
        // Process segments
        foreach (string segment in parts) {
            Console.WriteLine(segment);
            // Use the segmented data...
        }
    }
}

How to implement just some basic keywords highlighting in text editor?

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that words like "cat", "dog", "hamster", "rabbit" and "bird" are highlighted when they appear in an XML file (it's just for learning purposes). Can anyone give me some implementation tips or suggestions? I am clueless. (But I am carrying out my own research on this as well; I'm not being lazy. You have my word.) Thanks in advance.
You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class which scans the plain text needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
    IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));

    WordRule words = new WordRule(new WordDetector());
    words.addWord("cat", procInstr);
    words.addWord("dog", procInstr);
    // TODO add more words here

    IRule[] rules = new IRule[] {
        // Add rule for processing instructions
        new SingleLineRule("<?", "?>", procInstr),
        // Add generic whitespace rule.
        new WhitespaceRule(new XMLWhitespaceDetector()),
        // Words rules
        words
    };
    setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
The WordRule constructor requires an IWordDetector implementation; we can use a very simple detector here:
class WordDetector implements IWordDetector
{
    @Override
    public boolean isWordStart(final char c)
    {
        return Character.isLetter(c);
    }

    @Override
    public boolean isWordPart(final char c)
    {
        return Character.isLetter(c);
    }
}
This is just accepting letters in words.
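To see what a letters-only detector combined with a word list ends up matching, here is a plain-Java sketch that does not depend on Eclipse. It only imitates the idea; KeywordScan and findKeywords are hypothetical names, and the real WordRule works on a character scanner and returns tokens rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class KeywordScan {
    // Returns "offset:word" for every run of letters that is one of the keywords.
    static List<String> findKeywords(String text, Set<String> keywords) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetter(text.charAt(i))) {          // plays the role of isWordStart
                int start = i;
                while (i < text.length()
                        && Character.isLetter(text.charAt(i))) { // plays the role of isWordPart
                    i++;
                }
                String word = text.substring(start, i);
                if (keywords.contains(word)) {
                    hits.add(start + ":" + word);
                }
            } else {
                i++; // not a letter, skip it
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "catfish" is one run of letters, so it is not reported as "cat".
        System.out.println(findKeywords("my cat, a dog and a catfish", Set.of("cat", "dog")));
    }
}
```

Because a word is read as a whole run of letters before it is looked up, a keyword embedded in a longer word is not highlighted.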

Handling uuid pk column in yii

I'm using UUIDs as PKs in my tables. They're stored in a BINARY(16) MySQL column, and I find that they're being mapped to the string type in Yii. The CRUD code I generate breaks, though, because these binary columns are being HTML-encoded in the views. Example:
<?php echo
CHtml::link(CHtml::encode($data->usr_uuid), /* This is my binary uuid field */
array('view', 'id'=>$data->usr_uuid)); ?>
To work around this problem, I overrode afterFind() and beforeSave() in my model where I convert the values to/from hex using bin2hex and hex2bin respectively. See this for more details.
This takes care of the view problems.
However, now the search on PK when accessing a url of the form:
http://myhost.com/mysite/user/ec12ef8ebf90460487abd77b3f534404
results in User::loadModel($id) being called which in turn calls:
User::model()->findByPk($id);
This doesn't work, since the SQL being generated (on account of the column being mapped to the PHP string type) is
select ... where usr_uuid='EC12EF8EBF90460487ABD77B3F534404'
What would have worked is if I could, for these uuid fields change the condition to:
select ... where usr_uuid=unhex('EC12EF8EBF90460487ABD77B3F534404')
I was wondering how to take care of this problem cleanly. I see one possibility: extend CMysqlColumnSchema and override the necessary methods to special-case binary(16) columns and handle them as a uuid type.
This doesn't seem neat as there's no support for uuid natively either in php (where it is treated as string) or in mysql (where I have it as binary(16) column).
Does anyone have any recommendation?
If you plan to use it only within your own code, then I'd create a base AR class of my own:
class ActiveRecord extends CActiveRecord
{
    // ...
    public function findByUUID($uuid)
    {
        // Let MySQL convert the hex string back to the binary(16) value
        return $this->find('usr_uuid=unhex(:uuid)', array(':uuid' => $uuid));
    }
}
If it's about using generated code etc. then customizing schema a bit may be a good idea.
I used the following method to make working with uuid (binary(16)) columns using Yii/MySQL possible and efficient. I mention efficient, because I could have just made the column a CHAR(32) or (36) with dashes, but that would really chuck efficient out of the window.
I extended CActiveRecord and added a virtual attribute uuid to it. I also overrode two of the base class methods, getPrimaryKey and setPrimaryKey. With these changes most of Yii is happy.
// Must be declared abstract because getPrimaryKeyColumn() is abstract
abstract class UUIDActiveRecord extends CActiveRecord
{
    public function getUuid()
    {
        $pkColumn = $this->primaryKeyColumn;
        return UUIDUtils::bin2hex($this->$pkColumn);
    }

    public function setUuid($value)
    {
        $pkColumn = $this->primaryKeyColumn;
        $this->$pkColumn = UUIDUtils::hex2bin($value);
    }

    public function getPrimaryKey()
    {
        return $this->uuid;
    }

    public function setPrimaryKey($value)
    {
        $this->uuid = $value;
    }

    // Each model returns the name of its binary(16) primary key column
    abstract public function getPrimaryKeyColumn();
}
Now I get/set UUID using this virtual attribute:
$model->uuid = generateUUID(); // return UUID as 32 char string without the dashes (-)
The last bit is about how I search. That is accomplished using:
$criteria = new CDbCriteria();
$criteria->addCondition('bkt_user = unhex(:value)');
$criteria->params = array(':value'=>Yii::app()->user->getId()); //Yii::app()->user->getId() returns id as hex string
$buckets = Bucket::model()->findAll($criteria);
A final note, though: parameter logging, i.e. the following setting in main.php,
'db'=>array(
    ...
    'enableParamLogging' => true,
);
still doesn't work, because once again Yii tries to HTML-encode binary data (not a good idea). I haven't found a workaround for it, so I have disabled it in my config file.

NSPredicateEditorRowTemplate, specifying of Key Path with spaces?

As per a previous question, I have reluctantly given up on using IB/Xcode4 to edit an NSPredicateEditor and done it purely in code.
In the GUI way of editing the fields, key paths can be specified with spaces, like 'field name', and it makes them work as 'fieldName'-style key paths, while still displaying them in the UI with spaces. How do I do this in code? When I specify them with spaces, they don't work. When I specify them in camelCase, they work but display in camelCase. I'm just adding a bunch of NSExpressions like this:
[NSExpression expressionForKeyPath:@"original filename"]
The proper way to get human readable strings in the predicate editor's row views is to use the localization capabilities of NSRuleEditor and NSPredicateEditor.
If you follow the instructions in this blog post, you'll have everything you need to localize the editor.
As an example, let's say your key path is fileName, you support 2 operators (is and contains), and you want the user to enter a string. You'll end up with a strings file that looks like this:
"%[fileName]@ %[is]@ %@" = "%1$[fileName]@ %2$[is]@ %3$@";
"%[fileName]@ %[contains]@ %@" = "%1$[fileName]@ %2$[contains]@ %3$@";
You can use this file to put in human-readable stuff, and even reorder things:
"%[fileName]@ %[is]@ %@" = "%1$[original filename]@ %2$[is]@ %3$@";
"%[fileName]@ %[contains]@ %@" = "%3$@ %2$[is contained in]@ %1$[original filename]@";
Once you've localized the strings file, you hand that file back to the predicate editor, and it'll pull out the translated values, do its magic, and everything will show up correctly.
If you don't want to localize everything and just want to map the key paths, consider overriding value(forKey:) in your evaluated object like this:
class Match: NSObject {
    var date: Date?
    var fileName: String?

    override func value(forKey key: String) -> Any? {
        // Alternatively use static dictionary for mapping key paths
        super.value(forKey: camelCasedKeyPath(forKey: key))
    }

    // Turns a key such as "file name" into "fileName"
    private func camelCasedKeyPath(forKey key: String) -> String {
        key.components(separatedBy: .whitespaces)
            .enumerated()
            .map { $0.offset > 0 ? $0.element.capitalized : $0.element.lowercased() }
            .joined()
    }
}

What analyzer should I use for a URL in lucene.net?

I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http itself, then the hostname. The issue seems to come with the forward slashes. Surely it should consider them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer that extends or modifies the code in StandardTokenizer to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // MyUrlTokenizer is the custom tokenizer you write yourself
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result);
        result = new SynonymFilter(result);
        return result;
    }
}
Here the URL tokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
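As a rough illustration of the splitting such a tokenizer would perform, here is a plain-Java sketch without any Lucene dependency. The delimiter set (/, -, _ and :) and the names UrlSplit and split are assumptions chosen for the example; a real Lucene tokenizer would emit tokens with offsets rather than a list of strings.

```java
import java.util.Arrays;
import java.util.List;

public class UrlSplit {
    // Splits on runs of '/', '-', '_' and ':' (an assumed delimiter set).
    static List<String> split(String url) {
        return Arrays.asList(url.split("[/\\-_:]+"));
    }

    public static void main(String[] args) {
        String url = "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm";
        for (String token : split(url)) {
            System.out.println(token);
        }
    }
}
```

With this delimiter set the host stays in one piece (news.bbc.co.uk) because dots are not split on; add '.' to the character class if the host and the file extension should be broken up as well.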
Note that if you have a distinct fieldName for URLs, you can modify the above code to use the UrlTokenizer for that field only and the StandardTokenizer for everything else.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    // ... apply the same filters as above ...
    return result;
}
You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
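A minimal sketch of that parse-it-yourself approach, written in Java with java.net.URI (on .NET, System.Uri plays a similar role). The class and method names are hypothetical, and adding the resulting strings to the index as un-analyzed keyword fields is left out because the field API differs between Lucene versions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class UrlParts {
    // Returns the host followed by each non-empty path segment.
    static List<String> parts(String url) {
        URI uri = URI.create(url);
        List<String> out = new ArrayList<>();
        out.add(uri.getHost());
        for (String segment : uri.getPath().split("/")) {
            if (!segment.isEmpty()) { // the path starts with '/', so the first piece is empty
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parts("http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm"));
    }
}
```

Each returned element could then be stored as its own keyword value, so a search for "football" or for the exact host matches without the analyzer ever seeing the raw URL.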