Does the "Cartridge" pattern exist? - naming-conventions

AndroMDA uses the term "cartridge" (e.g. for out-of-the-box NHibernate support).
As I understand it, a cartridge takes an API/component, wraps it, adds no new features, and simplifies it, often taking away "the full power" but working well for most cases.
My questions:
Is the term widely used?
Can one properly define it?
Should the suffix "Cartridge" be used in class/method names?
An example: is the following Base64 helper a cartridge for Base64 conversion?
You give away all the options for performance tuning, but if you simply want to decode a simple (and small) string, it works fine:
Usage:
Base64StringCartridge.Decode(input);
Implementation
public static string Decode(string data)
{
    try
    {
        System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
        System.Text.Decoder utf8Decode = encoder.GetDecoder();
        byte[] todecode_byte = System.Convert.FromBase64String(data);
        int charCount = utf8Decode.GetCharCount(todecode_byte, 0, todecode_byte.Length);
        char[] decoded_char = new char[charCount];
        utf8Decode.GetChars(todecode_byte, 0, todecode_byte.Length, decoded_char, 0);
        string result = new String(decoded_char);
        return result;
    }
    catch
    {
        return "";
    }
}
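For comparison, the same "cartridge" can be collapsed to a thin wrapper over the framework APIs it hides; this is only a sketch of an equivalent shortcut with the same error-swallowing behaviour:

// Equivalent sketch using the same framework calls directly
public static string Decode(string data)
{
    try { return System.Text.Encoding.UTF8.GetString(System.Convert.FromBase64String(data)); }
    catch { return ""; }
}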

It's called the Facade Pattern. Presumably the AndroMDA folks are big fans of old video game machines...


How to enable parallelism for a custom U-SQL Extractor

I’m implementing a custom U-SQL Extractor for our internal file format (binary serialization). It works well in the "Atomic" mode:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class BinaryExtractor : IExtractor
If I switch off the "Atomic" mode, it looks like U-SQL is splitting the file at a random place (I guess just into 250MB chunks). This is not acceptable for me. The file format has a special row delimiter. Can I define a custom row delimiter in my Extractor and enable parallelism for it? Technically I can change our row delimiter to a new one if that would help.
Could anyone help me with this question?
The file is indeed split into chunks (I think it is 1 GB at the moment, but the exact value is implementation defined and may change for performance reasons).
If the file is indeed row delimited, and assuming your raw input data for the row is less than 4MB, you can use the input.Split() function inside your UDO to do the splitting into rows. The call will automatically handle the case if the raw input data spans the chunk boundary (assuming it is less than 4MB).
Here is an example:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // this._row_delim = this._encoding.GetBytes(row_delim); in class ctor
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            int num = 0;
            string[] array = streamReader.ReadToEnd().Split(new string[] { this._col_delim }, StringSplitOptions.None);
            for (int i = 0; i < array.Length; i++)
            {
                // DO YOUR PROCESSING
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
Please note that you cannot read across chunk boundaries yourself and you should make sure your data is indeed splittable into rows.
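For completeness, a rough sketch of how the surrounding extractor class could be set up once parallel processing is enabled; the constructor arguments, default delimiters and encoding choice here are assumptions for illustration, not part of the original answer:

// Sketch only: turn off atomic processing so U-SQL may split the input into chunks,
// and build the delimiters the Extract method above relies on.
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class BinaryExtractor : IExtractor
{
    private readonly System.Text.Encoding _encoding;
    private readonly byte[] _row_delim;
    private readonly string _col_delim;

    public BinaryExtractor(string row_delim = "\r\n", string col_delim = "\t")
    {
        this._encoding = System.Text.Encoding.UTF8;           // assumption; use whatever your format requires
        this._row_delim = this._encoding.GetBytes(row_delim);
        this._col_delim = col_delim;
    }

    // public override IEnumerable<IRow> Extract(...) as shown above
}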

Google diff-match-patch : How to unpatch to get Original String?

I am using the Google diff-match-patch Java library to create a patch between two JSON strings and store the patch in a database.
diff_match_patch dmp = new diff_match_patch();
LinkedList<Patch> diffs = dmp.patch_make(latestString, originalString);
String patch = dmp.patch_toText(diffs); // Store patch to DB
Now, is there any way to use this patch to re-create the originalString by passing in the latestString?
I googled about this and found this very old comment on the Google diff-match-patch Wiki saying,
Unpatching can be done by just looping through the diff, swapping
DIFF_INSERT with DIFF_DELETE, then applying the patch.
But I did not find any useful code that demonstrates this. How could I achieve this with my existing code? Any pointers or code reference would be appreciated.
Edit:
The problem I am facing is: in the front-end I am showing a revisions module that shows all the transactions of a particular fragment (take for example an employee's details), like which user has updated which details etc. Now I am recreating the fragment JSON by reverse-applying each patch to get the data for the current transaction and showing it as a table (using http://marianoguerra.github.io/json.human.js/). But some of the recreated JSON is not valid JSON and I am getting a JSON.parse error.
I was looking to do something similar (in C#), and what is working for me with a relatively simple object is the patch_apply method. This use case seems somewhat missing from the documentation, so I'm answering here. The code is C# but the API is cross-language:
static void Main(string[] args)
{
    var dmp = new diff_match_patch();
    string v1 = "My Json Object";
    string v2 = "My Mutated Json Object";
    var v2ToV1Patch = dmp.patch_make(v2, v1);
    var v2ToV1PatchText = dmp.patch_toText(v2ToV1Patch); // Persist text to db
    string v3 = "Latest version of JSON object";
    var v3ToV2Patch = dmp.patch_make(v3, v2);
    var v3ToV2PatchTxt = dmp.patch_toText(v3ToV2Patch); // Persist text to db
    // Time to re-hydrate the objects
    var altV3ToV2Patch = dmp.patch_fromText(v3ToV2PatchTxt);
    var altV2 = dmp.patch_apply(altV3ToV2Patch, v3)[0].ToString(); // .get(0) in Java I think
    var altV2ToV1Patch = dmp.patch_fromText(v2ToV1PatchText);
    var altV1 = dmp.patch_apply(altV2ToV1Patch, altV2)[0].ToString();
}
I am attempting to retrofit this as an audit log, where previously the entire JSON object was saved. As the audited objects have become more complex, the storage requirements have increased dramatically. I haven't yet applied this to large, complex objects, but it is possible to check whether a patch was applied successfully by checking the second element of the array returned by patch_apply. This is an array of boolean values, all of which should be true if the patch worked correctly. You could write some code to check this, which would help verify that the object can be successfully re-hydrated from the JSON rather than just getting a parsing error. My prototype C# method looks like this:
private static bool ValidatePatch(object[] patchResult, out string patchedString)
{
    patchedString = patchResult[0] as string;
    var successArray = patchResult[1] as bool[];
    foreach (var b in successArray)
    {
        if (!b)
            return false;
    }
    return true;
}
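A hypothetical call site, tying ValidatePatch to the re-hydration code above (reusing the names from the earlier example):

// Sketch: reverse-apply one stored patch and validate the result before showing it.
var patch = dmp.patch_fromText(v3ToV2PatchTxt);      // patch text loaded back from the DB
object[] patchResult = dmp.patch_apply(patch, v3);   // apply to the newer version
string altV2Checked;
if (ValidatePatch(patchResult, out altV2Checked))
{
    // altV2Checked should now equal v2 and be safe to parse/display in the revisions module
}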

replace string in PDF document (ITextSharp or PdfSharp)

We use a non-managed DLL that has a function to replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to managed solution (ITextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general-purpose PDF library; it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        byte[] contentBytes = reader.GetPageContent(1);
        string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
        contentString = contentString.Replace(origText, replaceText);
        reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
WARNING: Just like in the case of the Debenu function, for most documents this code wouldn't have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
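For reference, a hypothetical call of the page-1-only sketch above (file names and search strings are placeholders):

VerySimpleReplaceText("original.pdf", "result.pdf", "Moby", "Mary");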
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if during your move to a managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
For PdfSharp users, here's a somewhat usable function. I copied it from my project; it uses a utility method which is consumed by other methods, hence the unused result.
It ignores whitespace created by kerning, and therefore may mess up the result (all characters in the same space) depending on the source material.
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
    ModifyPdfContentStreams(contentPage, stream =>
    {
        if (!stream.TryUnfilter())
            return false;
        var search = string.Join("\\s*", source.Select(c => c.ToString()));
        var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
        if (!Regex.IsMatch(stringStream, search))
            return false;
        stringStream = Regex.Replace(stringStream, search, target);
        stream.Value = Encoding.Default.GetBytes(stringStream);
        stream.Zip();
        return false;
    });
}

public static void ModifyPdfContentStreams(PdfPage contentPage, Func<PdfDictionary.PdfStream, bool> Modification)
{
    for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
        if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
            return;
    var resources = contentPage.Elements?.GetDictionary("/Resources");
    var xObjects = resources?.Elements.GetDictionary("/XObject");
    if (xObjects == null)
        return;
    foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
    {
        var stream = (item.Value as PdfDictionary)?.Stream;
        if (stream != null)
            if (Modification(stream))
                return;
    }
}
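A hypothetical way to wire the PdfSharp function above into a document round-trip (the file names and the Modify open mode are assumptions, shown only to indicate where the call sits):

// Sketch: open a PDF for modification, replace text on the first page, save a copy.
var document = PdfSharp.Pdf.IO.PdfReader.Open("input.pdf", PdfSharp.Pdf.IO.PdfDocumentOpenMode.Modify);
ReplaceTextInPdfPage(document.Pages[0], "Moby", "Mary");
document.Save("output.pdf");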

Performant Entity Serialization: BSON vs MessagePack (vs JSON)

Recently I've found MessagePack, an alternative binary serialization format to Google's Protocol Buffers and JSON which also outperforms both.
Also there's the BSON serialization format that is used by MongoDB for storing data.
Can somebody elaborate on the differences and the advantages/disadvantages of BSON vs MessagePack?
Just to complete the list of performant binary serialization formats: there are also Gobs, which are intended to be the successor of Google's Protocol Buffers. However, in contrast to all the other formats mentioned, Gobs are not language-agnostic and rely on Go's built-in reflection, although there are Gobs libraries for at least one language other than Go.
// Please note that I'm author of MessagePack. This answer may be biased.
Format design
Compatibility with JSON
In spite of its name, BSON's compatibility with JSON is not so good compared with MessagePack.
BSON has special types like "ObjectId", "Min key", "UUID" or "MD5" (I think these types are required by MongoDB). These types are not compatible with JSON. That means some type information can be lost when you convert objects from BSON to JSON, but of course only when these special types are in the BSON source. It can be a disadvantage to use both JSON and BSON in a single service.
MessagePack is designed to be transparently converted from/to JSON.
MessagePack is smaller than BSON
MessagePack's format is less verbose than BSON's. As a result, MessagePack can serialize objects to smaller output than BSON.
For example, a simple map {"a":1, "b":2} is serialized in 7 bytes with MessagePack, while BSON uses 19 bytes.
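As an illustration of where those numbers come from, here is a manual byte-level breakdown of {"a":1, "b":2} under both specifications (my own working, not part of the original answer):

MessagePack (7 bytes):  82 a1 61 01 a1 62 02
                        fixmap(2), fixstr "a", 1, fixstr "b", 2
BSON (19 bytes):        13 00 00 00  10 61 00 01 00 00 00  10 62 00 02 00 00 00  00
                        int32 document length, int32 element "a"=1, int32 element "b"=2, terminator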
BSON supports in-place updating
With BSON, you can modify part of a stored object without re-serializing the whole object. Let's suppose a map {"a":1, "b":2} is stored in a file and you want to update the value of "a" from 1 to 2000.
With MessagePack, 1 uses only 1 byte but 2000 uses 3 bytes. So "b" must be shifted back by 2 bytes, even though "b" itself is not modified.
With BSON, both 1 and 2000 use 5 bytes. Because of this verbosity, you don't have to move "b".
MessagePack has RPC
MessagePack, Protocol Buffers, Thrift and Avro support RPC. But BSON doesn't.
These differences imply that MessagePack was originally designed for network communication while BSON is designed for storage.
Implementation and API design
MessagePack has type-checking APIs (Java, C++ and D)
MessagePack supports static typing.
Dynamic typing, as used with JSON or BSON, is useful for dynamic languages like Ruby, Python or JavaScript, but troublesome for static languages: you must write boring type-checking code.
MessagePack provides a type-checking API. It converts dynamically-typed objects into statically-typed objects. Here is a simple example (C++):
#include <msgpack.hpp>

class myclass {
private:
    std::string str;
    std::vector<int> vec;
public:
    // This macro enables this class to be serialized/deserialized
    MSGPACK_DEFINE(str, vec);
};

int main(void) {
    // serialize
    myclass m1 = ...;
    msgpack::sbuffer buffer;
    msgpack::pack(&buffer, m1);

    // deserialize
    msgpack::unpacked result;
    msgpack::unpack(&result, buffer.data(), buffer.size());

    // you get dynamically-typed object
    msgpack::object obj = result.get();

    // convert it to statically-typed object
    myclass m2 = obj.as<myclass>();
}
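For C# readers, a rough analogue of the example above using the MessagePack-CSharp library (the same library used in the benchmark further down); this is a sketch, and the class and member names are illustrative:

// Attributes mark which members are serialized and in what order.
[MessagePack.MessagePackObject]
public class MyClass
{
    [MessagePack.Key(0)] public string Str;
    [MessagePack.Key(1)] public System.Collections.Generic.List<int> Vec;
}

// Serialize to a byte[] and deserialize back into a statically-typed object.
byte[] bytes = MessagePack.MessagePackSerializer.Serialize(
    new MyClass { Str = "hi", Vec = new System.Collections.Generic.List<int> { 1, 2, 3 } });
MyClass roundTripped = MessagePack.MessagePackSerializer.Deserialize<MyClass>(bytes);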
MessagePack has IDL
Related to the type-checking API, MessagePack also supports an IDL. (The specification is available from: http://wiki.msgpack.org/display/MSGPACK/Design+of+IDL)
Protocol Buffers and Thrift require an IDL (they don't support dynamic typing) and provide more mature IDL implementations.
MessagePack has streaming API (Ruby, Python, Java, C++, ...)
MessagePack supports streaming deserializers. This feature is useful for network communication. Here is an example (Ruby):
require 'msgpack'
# write objects to stdout
$stdout.write [1,2,3].to_msgpack
$stdout.write [1,2,3].to_msgpack
# read objects from stdin using streaming deserializer
unpacker = MessagePack::Unpacker.new($stdin)
# use iterator
unpacker.each {|obj|
  p obj
}
I think it's very important to mention that it depends on what your client/server environments look like.
If you are passing bytes around multiple times without inspection, such as with a message queue system or when streaming log entries to disk, then you may well prefer a binary encoding to emphasize the compact size. Otherwise it's a case-by-case issue with different environments.
Some environments can have very fast serialization and deserialization to/from msgpack/protobuf, others not so much. In general, the more low-level the language/environment, the better binary serialization will work. In higher-level languages (node.js, .Net, JVM) you will often see that JSON serialization is actually faster. The question then becomes: is your network overhead more or less constrained than your memory/CPU?
With regards to msgpack vs BSON vs Protocol Buffers... msgpack produces the fewest bytes of the group, with Protocol Buffers being about the same. BSON defines broader native types than the other two, and may be a better match to your object model, but this makes it more verbose. Protocol Buffers have the advantage of being designed to stream... which makes it a more natural format for a binary transfer/storage format.
Personally, I would lean towards the transparency that JSON offers directly, unless there is a clear need for lighter traffic. Over HTTP with gzipped data, the difference in network overhead is even less of an issue between the formats.
A key difference not yet mentioned is that BSON contains size information in bytes for the entire document and further nested sub-documents.
document ::= int32 e_list
This has two major benefits for restricted environments (e.g. embedded) where size and performance are important.
You can immediately check if the data you're going to parse represents a complete document or if you're going to need to request more at some point (be it from some connection or storage). Since this is most likely an asynchronous operation you might already send a new request before parsing.
Your data might contain entire sub-documents with information that is irrelevant to you. BSON allows you to easily traverse to the next object past the sub-document by using the size information of the sub-document to skip it. msgpack, on the other hand, contains the number of elements inside what's called a map (similar to BSON's sub-documents). While this is undoubtedly useful information, it doesn't help the parser: you'd still have to parse every single object inside the map and can't just skip it. Depending on the structure of your data, this might have a huge impact on performance.
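A minimal sketch of that skip-by-length idea in C#, assuming a BinaryReader positioned at the start of an embedded BSON document:

// The leading int32 of a BSON (sub-)document is its total size in bytes, including
// the 4 length bytes themselves and the trailing 0x00 terminator, so skipping it is one seek.
static void SkipBsonSubdocument(System.IO.BinaryReader reader)
{
    int size = reader.ReadInt32();
    reader.BaseStream.Seek(size - 4, System.IO.SeekOrigin.Current);
}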
Well, as the author said, MessagePack was originally designed for network communication while BSON is designed for storage.
MessagePack is compact while BSON is verbose.
MessagePack is meant to be space-efficient while BSON is designed for CRUD (time-efficient).
Most importantly, MessagePack's type system (prefix) follows Huffman encoding; here I drew a Huffman tree of MessagePack (click the link to see the image).
A quick test shows that minified JSON is deserialized faster than binary MessagePack. In the tests, Article.json is 550 KB of minified JSON and Article.mpack is the 420 KB MessagePack version of it. It may be an implementation issue of course.
MessagePack:
//test_mp.js
var msg = require('msgpack');
var fs = require('fs');
var article = fs.readFileSync('Article.mpack');
for (var i = 0; i < 10000; i++) {
    msg.unpack(article);
}
JSON:
// test_json.js
var msg = require('msgpack');
var fs = require('fs');
var article = fs.readFileSync('Article.json', 'utf-8');
for (var i = 0; i < 10000; i++) {
    JSON.parse(article);
}
So times are:
Anarki:Downloads oleksii$ time node test_mp.js
real 2m45.042s
user 2m44.662s
sys 0m2.034s
Anarki:Downloads oleksii$ time node test_json.js
real 2m15.497s
user 2m15.458s
sys 0m0.824s
So space is saved, but faster? No.
Tested versions:
Anarki:Downloads oleksii$ node --version
v0.8.12
Anarki:Downloads oleksii$ npm list msgpack
/Users/oleksii
└── msgpack@0.1.7
I made a quick benchmark to compare the encoding and decoding speed of MessagePack vs BSON. BSON is faster, at least if you have large binary arrays:
BSON writer: 2296 ms (243487 bytes)
BSON reader: 435 ms
MESSAGEPACK writer: 5472 ms (243510 bytes)
MESSAGEPACK reader: 1364 ms
Using C# Newtonsoft.Json and MessagePack by neuecc:
public class TestData
{
    public byte[] buffer;
    public bool foobar;
    public int x, y, w, h;
}
static void Main(string[] args)
{
    try
    {
        int loop = 10000;
        var buffer = new TestData();
        TestData data2;
        byte[] data = null;
        int val = 0, val2 = 0, val3 = 0;
        buffer.buffer = new byte[243432];
        var sw = new Stopwatch();

        sw.Start();
        for (int i = 0; i < loop; i++)
        {
            data = SerializeBson(buffer);
            val2 = data.Length;
        }
        var rc1 = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < loop; i++)
        {
            data2 = DeserializeBson(data);
            val += data2.buffer[0];
        }
        var rc2 = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < loop; i++)
        {
            data = SerializeMP(buffer);
            val3 = data.Length;
            val += data[0];
        }
        var rc3 = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < loop; i++)
        {
            data2 = DeserializeMP(data);
            val += data2.buffer[0];
        }
        var rc4 = sw.ElapsedMilliseconds;

        Console.WriteLine("Results: {0}", val);
        Console.WriteLine("BSON writer: {0} ms ({1} bytes)", rc1, val2);
        Console.WriteLine("BSON reader: {0} ms", rc2);
        Console.WriteLine("MESSAGEPACK writer: {0} ms ({1} bytes)", rc3, val3);
        Console.WriteLine("MESSAGEPACK reader: {0} ms", rc4);
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }
    Console.ReadLine();
}
static private byte[] SerializeBson(TestData data)
{
    var ms = new MemoryStream();
    using (var writer = new Newtonsoft.Json.Bson.BsonWriter(ms))
    {
        var s = new Newtonsoft.Json.JsonSerializer();
        s.Serialize(writer, data);
        return ms.ToArray();
    }
}

static private TestData DeserializeBson(byte[] data)
{
    var ms = new MemoryStream(data);
    using (var reader = new Newtonsoft.Json.Bson.BsonReader(ms))
    {
        var s = new Newtonsoft.Json.JsonSerializer();
        return s.Deserialize<TestData>(reader);
    }
}

static private byte[] SerializeMP(TestData data)
{
    return MessagePackSerializer.Typeless.Serialize(data);
}

static private TestData DeserializeMP(byte[] data)
{
    return (TestData)MessagePackSerializer.Typeless.Deserialize(data);
}

What analyzer should I use for a URL in lucene.net?

I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good: http itself, then the hostname, but the issue seems to come with the forward slashes. Surely it should consider them as separate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognises emails and treats them as one token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer, which extends/modifies the code in StandardTokenizer, to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result);
        result = new SynonymFilter(result);
        return result;
    }
}
Where the UrlTokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
Note that if you have a distinct fieldName for URLs then you can modify the above code to use the StandardTokenizer by default, and the UrlTokenizer only for the url field.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    return result;
}
You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
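A minimal sketch of that idea in C#, assuming a Lucene.Net 3.x-style API: System.Uri pulls the URL apart, and each element is indexed as a non-analyzed keyword field (the field names here are illustrative):

// Parse the URL and add its parts as keyword (not analyzed) fields.
var uri = new Uri("http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm");

var doc = new Lucene.Net.Documents.Document();
doc.Add(new Lucene.Net.Documents.Field("url", uri.AbsoluteUri,
    Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
doc.Add(new Lucene.Net.Documents.Field("host", uri.Host,
    Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
foreach (string segment in uri.AbsolutePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
{
    doc.Add(new Lucene.Net.Documents.Field("pathSegment", segment,
        Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
}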