Store and retrieve string arrays in HBase - serialization

I've read this answer (How to store complex objects into hadoop Hbase?) regarding the storing of string arrays with HBase.
There it is said to use the ArrayWritable Class to serialize the array. With WritableUtils.toByteArray(Writable ... writable) I'll get a byte[] which I can store in HBase.
When I now try to retrieve the rows again, I get a byte[] which I have somehow to transform back again into an ArrayWritable.
But I can't find a way to do this. Do you know how, or am I doing something fundamentally wrong in the way I serialize my String[]?

You can use the following method to get the ArrayWritable back (taken from an earlier answer of mine):
public static <T extends Writable> T asWritable(byte[] bytes, Class<T> clazz)
        throws IOException {
    T result = null;
    DataInputStream dataIn = null;
    try {
        // Writables are created reflectively, so the class needs a public no-arg constructor.
        result = clazz.newInstance();
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        dataIn = new DataInputStream(in);
        // Let the Writable populate itself from the serialized bytes.
        result.readFields(dataIn);
    }
    catch (InstantiationException | IllegalAccessException e) {
        // Should not happen for a well-formed Writable; fail loudly instead of returning null.
        throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
    }
    finally {
        IOUtils.closeQuietly(dataIn);
    }
    return result;
}
This method just deserializes the byte array to the correct object type, based on the provided class type token.
E.g:
Let's assume you have a custom ArrayWritable:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
}
Now you issue a single HBase get:
...
Get get = new Get(row);
Result result = htable.get(get);
byte[] value = result.getValue(family, qualifier);
TextArrayWritable tawReturned = asWritable(value, TextArrayWritable.class);
Text[] texts = (Text[]) tawReturned.toArray();
for (Text t : texts) {
System.out.print(t + " ");
}
...
Note:
You may have already found the readCompressedStringArray() and writeCompressedStringArray() methods in WritableUtils,
which seem suitable if you have your own String array-backed Writable class.
Before using them, be warned that they can cause a serious performance hit because of
the overhead of gzip compression and decompression.
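If you only need a byte[] for a String[] and do not care about the Writable format, the round trip can also be done with plain java.io streams. This is only a minimal sketch of that alternative, not the ArrayWritable approach above: the class name StringArrayCodec and the wire format (an int count followed by each element written with writeUTF) are my own assumptions, and bytes written this way cannot be read back with readFields().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Hadoop or HBase.
final class StringArrayCodec {
    private StringArrayCodec() {}

    // Writes the element count, then every element in modified UTF-8.
    static byte[] toBytes(String[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(values.length);
            for (String value : values) {
                out.writeUTF(value); // each element is limited to 65535 encoded bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the count, then exactly that many elements.
    static String[] fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String[] values = new String[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readUTF();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The byte[] returned by toBytes can be stored as a cell value, and the byte[] returned by result.getValue(family, qualifier) can be handed to fromBytes.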

ServiceStack Redis client Get<T>(key) removes quotes from string data

I am using ServiceStack.Redis library to work with Redis. To start with, I have implemented this solution. The get/set methods work fine for plain text/string.
Now when I save a string containing escaped quotes, it is saved properly (I verified this in redis-cli), but the Get method returns the string with all the double quotes removed.
For example, the string "TestSample" is saved and retrieved as expected.
Saving "TestSample \"with\" \"quotes\"" also works and shows up unchanged in redis-cli, but the output of the Get method becomes "TestSample with quotes".
public bool SetDataInCache<T>(string cacheKey, T cacheData)
{
try
{
using (_redisClient = new RedisClient(_cacheConfigs.RedisHost))
{
_redisClient.As<T>().SetValue(cacheKey, cacheData, new TimeSpan(0,0,300));
}
return true;
}
catch (Exception ex)
{
return false;
}
}
public T GetDataFromCacheByType<T>(string cacheKey)
{
T retVal = default(T);
try
{
using (_redisClient = new RedisClient(_cacheConfigs.RedisHost))
{
if (_redisClient.ContainsKey(cacheKey))
{
var wrapper = _redisClient.As<T>();
retVal = wrapper.GetValue(cacheKey);
}
return retVal;
}
}
catch (Exception ex)
{
return retVal;
}
}
Usage:
cacheObj.SetDataInCache("MyKey1","TestSample");
cacheObj.SetDataInCache("MyKey2","TestSample \"with\" \"quotes\"");
string result1 = Convert.ToString(cacheObj.GetDataFromCacheByType<string>("MyKey1"));
string result2 = Convert.ToString(cacheObj.GetDataFromCacheByType<string>("MyKey2"));
Actual : "TestSample with quotes"
Expected : "TestSample \"with\" \"quotes\""
The Typed Generic API is only meant for creating a generic Redis Client for serializing complex types. If you're implementing a generic cache you should use the IRedisClient APIs instead, e.g:
_redisClient.Set(cacheKey, cacheData, new TimeSpan(0,0,300));
Then retrieve back with:
var retVal = _redisClient.Get<T>(cacheKey);
Alternatively for saving strings or if you want to serialize the POCOs yourself you can use the IRedisClient SetValue/GetValue string APIs, e.g:
_redisClient.SetValue(cacheKey, cacheData.ToJson());
var retVal = _redisClient.GetValue(cacheKey).FromJson<T>();
Note: calling IRedisClient.ContainsKey() performs an additional, unnecessary Redis I/O operation. Since you're returning default(T) anyway, you should just call _redisClient.Get<T>(), which returns the default value for non-existing keys.

Pig - passing Databag to UDF constructor

I have a script which is loading some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create a UDF whose constructor accepts the venues relation.
So I tried to define the UDF like this:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
And here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
private static final String ALLCHARS = "(.*)";
private ArrayList<String> venues;
private String regex;
public GenerateVenues(DataBag venuesBag) {
Iterator<Tuple> it = venuesBag.iterator();
venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
String current = "";
regex = "";
while (it.hasNext()){
Tuple t = it.next();
try {
current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
venues.add((String) t.get(0));
} catch (ExecException e) {
throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
}
regex += current + (it.hasNext() ? "|" : "");
}
}
@Override
public Tuple exec(Tuple tuple) throws IOException {
// expect one string
if (tuple == null || tuple.size() != 2) {
throw new IllegalArgumentException(
"BagTupleExampleUDF: requires two input parameters.");
}
try {
String tweet = (String) tuple.get(0);
for (String venue: venues)
{
if (tweet.matches(ALLCHARS + venue + ALLCHARS))
{
Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
return output;
}
}
return null;
} catch (Exception e) {
throw new IOException(
"BagTupleExampleUDF: caught exception processing input.", e);
}
}
}
When executed, the script raises an error in the DEFINE statement, just before (venues);:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong. Can you help me figure out what?
Is it that the UDF cannot accept the venues relation as a parameter? Or is the relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.
As @WinnieNicklaus already said, you can only pass strings to UDF constructors.
Having said that, the solution to your problem is to use the distributed cache. Override public List<String> getCacheFiles() to return a list of filenames that will be made available via the distributed cache. You can then read the file as a local file and build your table.
The downside is that Pig has no initialization function, so you have to implement something like
private void init() {
if (!this.initialized) {
// read table
}
}
and then call that as the first thing from exec.
You can't use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments, and if they are really of another type, you will have to parse them out in the constructor.
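The lazy initialization described above can be tried out without Pig. The following is only a sketch in plain Java under my own assumptions: LazyVenueMatcher is a hypothetical stand-in for the UDF, the file is assumed to hold one venue name per line, and matching uses String.contains instead of the regex from the question. In a real UDF the class would extend EvalFunc, the constructor argument would be the string passed in DEFINE, and the file would be the one listed by getCacheFiles().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the UDF; not a Pig class.
final class LazyVenueMatcher {
    private final Path venuesFile;
    private List<String> venues; // stays null until the first exec() call

    // Only a string is passed in, mirroring what a UDF constructor may receive.
    LazyVenueMatcher(String venuesFileName) {
        this.venuesFile = Paths.get(venuesFileName);
    }

    // Loads the lookup table once; later calls return immediately.
    private void init() {
        if (venues != null) {
            return;
        }
        try {
            List<String> loaded = new ArrayList<>();
            for (String line : Files.readAllLines(venuesFile, StandardCharsets.UTF_8)) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    loaded.add(name);
                }
            }
            venues = loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the first venue mentioned in the tweet, or null if there is none.
    String exec(String tweet) {
        init(); // first thing in exec, as suggested above
        if (tweet == null) {
            return null;
        }
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

Because the table is built on the first call rather than in the constructor, constructing the object stays cheap and never touches the file system.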

How to customize data serialization based on content in WCF?

Trying to serialize a union-like data-type. There is an enum field indicating the type of data stored in the union, and a variety of possible field types.
The desired result is DataContractSerializer produced XML which contains just the enum, and the relevant field.
Possible solutions, none of which have been attempted yet, are:
Use a custom serializer and mark the union properties with a custom attribute, similar to this question. The custom serializer would strip out the members not required.
Use ISerializationSurrogate and serialize a different object which just contains the relevant data.
Don't use separate fields in the union, use one object field (this could be used as part of the implementation of the ISerializationSurrogate approach).
Other... ?
For example:
[DataContract]
public class WCFTestUnion
{
public enum EUnionType
{
[EnumMember]
Bool,
[EnumMember]
String,
[EnumMember]
Dictionary,
[EnumMember]
Invalid
};
EUnionType unionType = EUnionType.Invalid;
bool boolValue = true;
string stringValue = "Hello";
IDictionary<object, object> dictionaryValue = null;
// Could use custom attribute here ?
[DataMember]
public bool BoolValue
{
get { return this.boolValue; }
set { this.boolValue = value; }
}
// Could use custom attribute here ?
[DataMember]
public string StringValue
{
get { return this.stringValue; }
set { this.stringValue = value; }
}
// Could use custom attribute here ?
[DataMember]
public IDictionary<object, object> DictionaryValue
{
get { return this.dictionaryValue; }
set { this.dictionaryValue = value; }
}
[DataMember]
public EUnionType UnionType
{
get { return this.unionType; }
set { this.unionType = value; }
}
} // Ends class WCFTestUnion
Test
class TestSerializeUnion
{
internal static void Test()
{
Console.WriteLine("===TestSerializeUnion.Test()===");
WCFTestUnion u = new WCFTestUnion();
u.UnionType = WCFTestUnion.EUnionType.Dictionary;
u.DictionaryValue = new Dictionary<object, object>();
u.DictionaryValue[1] = "one";
u.DictionaryValue["two"] = 2;
System.Runtime.Serialization.DataContractSerializer serialize = new System.Runtime.Serialization.DataContractSerializer(typeof(WCFTestUnion));
System.IO.Stream stream = new System.IO.MemoryStream();
serialize.WriteObject(stream, u);
stream.Seek(0, System.IO.SeekOrigin.Begin);
byte[] buffer = new byte[stream.Length];
int length = checked((int)stream.Length);
int read = stream.Read(buffer, 0, length);
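while (read < length)
{
// write at the current offset so earlier bytes are not overwritten
read += stream.Read(buffer, read, length - read);
}
// DataContractSerializer writes UTF-8 to the stream
string xml = Encoding.UTF8.GetString(buffer);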
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.LoadXml(xml);
System.Xml.XmlTextWriter xmlwriter = new System.Xml.XmlTextWriter(Console.Out);
xmlwriter.Formatting = System.Xml.Formatting.Indented;
doc.WriteContentTo(xmlwriter);
xmlwriter.Flush();
Console.WriteLine();
}
} // Ends class TestSerializeUnion
Output:
<WCFTestUnion xmlns="http://schemas.datacontract.org/2004/07/WCFTestServiceContracts" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<BoolValue>true</BoolValue>
<DictionaryValue xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">1</a:Key>
<a:Value i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">one</a:Value>
</a:KeyValueOfanyTypeanyType>
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">two</a:Key>
<a:Value i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">2</a:Value>
</a:KeyValueOfanyTypeanyType>
</DictionaryValue>
<StringValue>Hello</StringValue>
<UnionType>Dictionary</UnionType>
</WCFTestUnion>
Desired Output (only field being used is serialized, along with enum):
<WCFTestUnion xmlns="http://schemas.datacontract.org/2004/07/WCFTestServiceContracts" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<DictionaryValue xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">1</a:Key>
<a:Value i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">one</a:Value>
</a:KeyValueOfanyTypeanyType>
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">two</a:Key>
<a:Value i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">2</a:Value>
</a:KeyValueOfanyTypeanyType>
</DictionaryValue>
<UnionType>Dictionary</UnionType>
</WCFTestUnion>
You have several options here. Which one you use depends on the complexity of the scenario: where else you need something like this, how often and in what ways you serialize this data, performance, and so on. Take a look at the options below and ask if you have more questions, but mostly I recommend experimenting with several of them before picking one or a hybrid solution.
Use a data contract resolver. It provides a mechanism for dynamically mapping types to and from wire representations during serialization and deserialization, which lets you support far more types than you can out of the box.
Use IObjectReference. You can have a class that implements it and returns a reference to a different object after it has been deserialized.
Use a data contract surrogate. This is different from the serialization surrogates you're referring to, although similar. I think these might work out nicely for you.

Jackson vector serialization exception

I have the following code with a simple class and a method for writing and then reading:
ObjectMapper mapper = new ObjectMapper();
try {
    DataStore testOut = new DataStore();
    DataStore.Checklist ch1 = testOut.addChecklist();
    ch1.SetTitle("Checklist1");
    String output = mapper.writeValueAsString(testOut);
    JsonNode rootNode = mapper.readValue(output, JsonNode.class);
    Map<String,Object> userData = mapper.readValue(output, Map.class);
} catch (IOException e) {
    // Jackson's processing and mapping exceptions all extend IOException
    e.printStackTrace();
}
public class DataStore {
public static class Checklist
{
public Checklist()
{
}
private String _title;
public String GetTitle()
{
return _title;
}
public void SetTitle(String title)
{
_title = title;
}
}
//Checklists
private Vector<Checklist> _checklists = new Vector<Checklist>();
public Checklist addChecklist()
{
Checklist ch = new Checklist();
ch.SetTitle("New Checklist");
_checklists.add(ch);
return ch;
}
public Vector<Checklist> getChecklists()
{
return _checklists;
}
public void setChecklists(Vector<Checklist> checklists)
{
_checklists = checklists;
}
}
The line:
String output = mapper.writeValueAsString(testOut);
causes an exception that has had me baffled for hours and about to abandon using this at all.
Any hints are appreciated.
Here is the exception:
No serializer found for class DataStore$Checklist and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS) ) (through reference chain: DataStore["checklists"]->java.util.Vector[0])
There are multiple ways to fix this, but I will start with what you are doing wrong: your getter and setter names do not follow the Java bean convention. Java accessors use camel case with a lowercase prefix, so the methods should be named getTitle and setTitle, not GetTitle and SetTitle. Because of this, no properties are found.
Besides renaming the methods to use Java-style names, there are alternatives:
You can annotate GetTitle() with @JsonProperty("title"), so that the property is recognized.
If you don't want the wrapper object, you could instead add @JsonValue to GetTitle(), in which case the return value of that method is used as the value for the whole object.
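To see the naming rule in action without Jackson on the classpath, here is a small sketch that uses the JDK's own java.beans.Introspector as a stand-in. That is an assumption on my part: Jackson does its own introspection rather than calling Introspector, but both rely on the same lowercase get/set prefix convention. BeanNamingDemo, PascalCase and CamelCase are hypothetical names used only for this illustration.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

final class BeanNamingDemo {
    // Accessors named like the ones in the question.
    public static class PascalCase {
        private String title;
        public String GetTitle() { return title; }
        public void SetTitle(String title) { this.title = title; }
    }

    // The same bean with conventional accessor names.
    public static class CamelCase {
        private String title;
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
    }

    // Lists the bean properties the JDK introspector discovers, excluding those of Object.
    static List<String> propertyNames(Class<?> beanClass) {
        try {
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor descriptor
                    : Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
                names.add(descriptor.getName());
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("PascalCase: " + propertyNames(PascalCase.class));
        System.out.println("CamelCase:  " + propertyNames(CamelCase.class));
    }
}
```

Only the class with getTitle/setTitle exposes a title property. The other one looks like an empty bean to the introspector, which is the situation the exception message in the question describes.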
The answer seems to be: You can't do that with Json. I've seen comments in the Gson tutorial as well, that state that some serialization just doesn't work. I downloaded XStream and spat it out with XML in a few minutes of work and a lot less construction around what I really wanted to persist. In the process, I was able to delete a lot of code.

how to parse non-string values in Opencsv HeaderColumnNameMappingStrategy

I'm using a HeaderColumnNameMappingStrategy to map a csv file with a header into a JavaBean. String values parse fine but any "true" or "false" value in csv doesn't map to JavaBean and I get the following exception from the PropertyDescriptor:
java.lang.IllegalArgumentException: argument type mismatch
The code where it occurs is in CsvToBean, line 64:
protected T processLine(MappingStrategy<T> mapper, String[] line) throws
IllegalAccessException, InvocationTargetException, InstantiationException, IntrospectionException {
T bean = mapper.createBean();
for(int col = 0; col < line.length; col++) {
String value = line[col];
PropertyDescriptor prop = mapper.findDescriptor(col);
if (null != prop) {
Object obj = convertValue(value, prop);
// this is where exception is thrown for a "true" value in csv
prop.getWriteMethod().invoke(bean, new Object[] {obj});
}
}
return bean;
}
protected PropertyEditor getPropertyEditor(PropertyDescriptor desc) throws
InstantiationException, IllegalAccessException {
Class<?> cls = desc.getPropertyEditorClass();
if (null != cls) return (PropertyEditor) cls.newInstance();
return getPropertyEditorValue(desc.getPropertyType());
}
I can confirm (via debugger) that the setter method is correctly retrieved at this point.
The problem occurs in desc.getPropertyEditorClass(), since it returns null. I assumed primitive types and their wrappers were supported. Are they not?
I've run into this same issue. The cleanest way is probably to override getPropertyEditor like pritam did above and return a custom PropertyEditor for your particular object. The quick and dirty way would be to override convertValue in anonymous class form, like this:
CsvToBean<MyClass> csvToBean = new CsvToBean<MyClass>() {
    @Override
    protected Object convertValue(String value, PropertyDescriptor prop) throws InstantiationException, IllegalAccessException {
        if (prop.getName().equals("myWhatever")) {
            // return a custom object based on the incoming value
            return new MyWhatever(value);
        }
        return super.convertValue(value, prop);
    }
};
This is working fine for me with OpenCSV 2.3. Good luck!
I resolved this by extending CsvToBean and adding my own PropertyEditors. Turns out opencsv just supports primitive types and no wrappers.
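For illustration, the kind of conversion such an override has to perform can be written with nothing but the JDK. This is a sketch under my own assumptions, not opencsv code: WrapperAwareConverter and convert are hypothetical names, and only a few target types are handled. The idea is to look at the property type and parse the CSV text into the matching primitive or wrapper value before the setter is invoked.

```java
// Hypothetical helper; an overridden convertValue could delegate to it
// with prop.getPropertyType() as the target type.
final class WrapperAwareConverter {
    private WrapperAwareConverter() {}

    // Converts raw CSV text to the given target type; unknown types get the text unchanged.
    static Object convert(String value, Class<?> type) {
        if (value == null) {
            return null;
        }
        String text = value.trim();
        if (type == boolean.class || type == Boolean.class) {
            return Boolean.parseBoolean(text); // boxed automatically
        }
        if (type == int.class || type == Integer.class) {
            return Integer.parseInt(text);
        }
        if (type == long.class || type == Long.class) {
            return Long.parseLong(text);
        }
        if (type == double.class || type == Double.class) {
            return Double.parseDouble(text);
        }
        return value;
    }
}
```

Since the setter is invoked reflectively with an Object argument, a boxed Boolean is accepted by both a boolean and a Boolean parameter, so one branch covers the primitive and its wrapper.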
Pritam's answer is great and this is a sample for dealing with datetime format.
PropertyEditorManager.registerEditor(java.util.Date.class, DateEditor.class);
You should write your own editor class like this:
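public class DateEditor extends PropertyEditorSupport {
    public static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    @Override
    public void setAsText(String text) {
        try {
            setValue(sdf.parse(text));
        } catch (ParseException e) {
            // setAsText cannot throw checked exceptions, so report bad input this way
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }
}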