How common is it to use both a GUID and an int as unique identifiers for a table? - sql

I think a Guid is generally the preferred unique table row identifier from a DBA perspective. But I'm working on a project where the developers and managers appear to want a way to reference things by an int value. I can understand their perspective, because they want a simple and easy way to reference different entities.
I was thinking about using a pattern for my tables where each table would have an int Id column as the PK, but would also include a Guid column as a globally unique identifier. How common is it to use this type of pattern?

In the vast majority of cases you'll want to use either an INT or BIGINT for your primary key/foreign key. For the most part you are looking to make sure the table can be joined to and that there is an easy way to select a single unique row. In theory using GUIDs all over the place gets you there too, if you were a robot and could quickly ask a colleague, "Hey can you check out ROW_ID FD229C39-2074-4B04-8A50-456402705C02" vs "Hey can you check out ROW_ID 523". But we are human. I don't think there is a really good reason to include another column that is simply a GUID in addition to your PK (which should be an INT or BIGINT).
It can also be nice to have your PK values in order; that seems to come in handy. GUIDs won't be in order. However, a case for using a GUID would be if you have to expose this value to a customer. You may not want them to know they are customer #6. Being customer #B8D44820-DF75-44C9-8527-F6AC7D1D259B isn't too great if they have to call in and identify themselves, but it might be fine for writing code against (say a web service or some kind of API). SQL is a lot of art with the science!
In addition, do you really need a globally unique id for a row? Probably not. If you are designing a system that could use up more than what INT can handle (say the total number of tweets of all time), then use BIGINT. If you can use up all the BIGINTs, wow. I'd be interested in hearing how, and would like to subscribe to your newsletter.
A question I ask myself when writing stuff: "If I'm wrong, how hard will it be to do it the other way?" If you really need a GUID later, add it. If you put it in now and just 1 person uses it, you can never take it out and it will have to be maintained... job security? Nah, don't think that way :) Don't over-engineer it.

I would not say GUID is generally preferred from a DBA perspective. It is larger (16 bytes rather than 4 for int or 8 for bigint) and the random variety introduces fragmentation and causes much more IO with large tables due to lower page life expectancy. This is especially a problem with spinning media and limited RAM.
When a GUID is actually needed, some of these issues can be avoided by using a sequential version for the GUID value rather than introducing another surrogate key. The value can be assigned by SQL Server with a NEWSEQUENTIALID() default constraint on a column, or generated in application code with the bytes ordered properly for SQL Server. Below is a Windows C# example of the latter technique.
using System;
using System.Runtime.InteropServices;

public class Example
{
    [DllImport("rpcrt4.dll", CharSet = CharSet.Auto)]
    public static extern int UuidCreateSequential(ref Guid guid);

    /// sequential guid for SQL Server
    public static Guid NewSequentialGuid()
    {
        const int S_OK = 0;
        const int RPC_S_UUID_LOCAL_ONLY = 1824;

        Guid oldGuid = Guid.Empty;
        int result = UuidCreateSequential(ref oldGuid);
        if (result != S_OK && result != RPC_S_UUID_LOCAL_ONLY)
        {
            throw new ExternalException("UuidCreateSequential call failed", result);
        }

        byte[] oldGuidBytes = oldGuid.ToByteArray();
        byte[] newGuidBytes = new byte[16];
        oldGuidBytes.CopyTo(newGuidBytes, 0);

        // swap low timestamp bytes (0-3)
        newGuidBytes[0] = oldGuidBytes[3];
        newGuidBytes[1] = oldGuidBytes[2];
        newGuidBytes[2] = oldGuidBytes[1];
        newGuidBytes[3] = oldGuidBytes[0];

        // swap middle timestamp bytes (4-5)
        newGuidBytes[4] = oldGuidBytes[5];
        newGuidBytes[5] = oldGuidBytes[4];

        // swap high timestamp bytes (6-7)
        newGuidBytes[6] = oldGuidBytes[7];
        newGuidBytes[7] = oldGuidBytes[6];

        // remaining 8 bytes are unchanged (8-15)
        return new Guid(newGuidBytes);
    }
}
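For illustration, here is one way the generated value might be used when inserting rows. This is a sketch of mine, not part of the original answer: the table dbo.Widget, its columns, and connectionString are placeholders.

using System;
using System.Data;
using System.Data.SqlClient;

public static class SequentialGuidDemo
{
    // Hypothetical usage of Example.NewSequentialGuid() above.
    public static void InsertWidget(string connectionString, string name)
    {
        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO dbo.Widget (WidgetUID, Name) VALUES (@uid, @name);", cn))
        {
            // Sequential values keep a clustered index on WidgetUID appending at
            // the end rather than splitting pages all over the table.
            cmd.Parameters.Add("@uid", SqlDbType.UniqueIdentifier).Value = Example.NewSequentialGuid();
            cmd.Parameters.Add("@name", SqlDbType.NVarChar, 50).Value = name;
            cn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}

If you would rather not generate the value in application code, a NEWSEQUENTIALID() default constraint on the column gives the same server-side effect.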


EMV TLV length restriction to overcome

We have code to interrogate the values from various EMV TLVs.
However, in the case of PED serial number, the spec for tag "9F1E" at
http://www.emvlab.org/emvtags/
has:
Name: Interface Device (IFD) Serial Number
Description: Unique and permanent serial number assigned to the IFD by the manufacturer
Source: Terminal
Format: an 8
Template: (none)
Tag: 9F1E
Length: 8
P/C: primitive
But the above gives a length limit of 8, while we have VeriFone PEDs with 9-character serial numbers.
So sample code relying on the documented length for tag "9F1E" cannot retrieve the full serial number.
int GetPPSerialNumber()
{
    int rc = -1;
    rc = GetTLV("9F1E", &resultCharArray);
    return rc;
}
In the above, GetTLV() is written to take a tag argument and populate the value into a char array.
Have any developers found a nice way to retrieve the full 9 characters?
You're correct -- there is a mis-match here. The good thing about TLV is that you don't really need a specification to tell you how long the value is going to be. Your GetTLV() is imposing this restriction itself; the obvious solution is to relax this.
We actually don't even look at the documented lengths at the TLV-parsing level. Each tag is mapped to an associated entity in the BL (sometimes more than one, thanks to the schemes going their own routes for contactless), and we get to choose which entities we want to impose a length restriction on at that level.
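For illustration, the length can be decoded from the L byte(s) actually present in the data rather than taken from the spec's table. This is a rough C# sketch of my own (not the poster's GetTLV), just to show the idea:

public static class TlvLength
{
    // BER-TLV length decoding: the data itself says how long the value is,
    // so a 9-byte serial number under 9F1E comes back in full even though
    // the tag dictionary documents a length of 8.
    public static int ReadLength(byte[] data, ref int pos)
    {
        int first = data[pos++];
        if (first < 0x80)
        {
            return first;                    // short form: this byte is the length
        }
        int numLengthBytes = first & 0x7F;   // long form: low 7 bits = count of length bytes
        int length = 0;
        for (int i = 0; i < numLengthBytes; i++)
        {
            length = (length << 8) | data[pos++];
        }
        return length;
    }
}

Sizing the output buffer from the decoded length instead of from the dictionary entry is exactly the "relaxing" being suggested.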

Is it possible to add SignalR messages directly to the SQL Backplane?

I'd like to know if I can add SignalR messages directly to the SignalR SQL Backplane (from SQL) so I don't have to use a SignalR client to do so.
My situation is that I have an activated stored procedure for a SQL Service Broker queue, and when it fires, I'd like to post a message to SignalR clients. Currently, I have to receive the message from SQL Service Broker in a separate process and then immediately re-send the message with a SignalR hub.
I'd like my activated stored procedure to basically move the message directly onto the SignalR SQL Backplane.
Yes and no. I set up a small experiment on my localhost to determine if it's possible - and it is, if the message is formatted properly.
Now, a word on the [SignalR] schema. It generates three tables:
[SignalR].[Messages_0]
-- this holds the list of all messages, with the columns [PayloadId], [Payload], and [InsertedOn]
[SignalR].[Messages_0_Id]
-- this holds one record with one field: the last Id value in the [Messages_0] table
[SignalR].[Schema]
-- no idea what this is for; it's a 1-column (SchemaVersion), 1-record (value of 1) table
Right, so, I duplicated the last row, except I incremented the PayloadId (for the new record and in [Messages_0_Id]) and put in GETDATE() as the value for InsertedOn. Immediately after adding the record, a new message came into the connected client. Note that PayloadId is not an identity column, so you must manually increment it, and you must copy that incremented value into the only record in [Messages_0_Id], otherwise your SignalR clients will be unable to connect due to SignalR SQL errors.
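For illustration, here is a sketch of that experiment run from code. The table names come from the description above, but the column name in [Messages_0_Id] is my assumption, so check your own schema before using anything like this:

using System.Data.SqlClient;

public static class BackplaneExperiment
{
    // Re-sends the latest payload by duplicating the last row, bumping the
    // PayloadId, and keeping [Messages_0_Id] in sync (assumed column: PayloadId).
    public static void DuplicateLastMessage(string connectionString)
    {
        const string sql = @"
BEGIN TRAN;
DECLARE @newId BIGINT = (SELECT PayloadId FROM [SignalR].[Messages_0_Id]) + 1;

INSERT INTO [SignalR].[Messages_0] (PayloadId, Payload, InsertedOn)
SELECT @newId, Payload, GETDATE()          -- duplicate the latest payload
FROM   [SignalR].[Messages_0]
WHERE  PayloadId = @newId - 1;

UPDATE [SignalR].[Messages_0_Id] SET PayloadId = @newId;  -- keep the Id table in sync
COMMIT;";

        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, cn))
        {
            cn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}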
Now, the trick is populating the [Payload] column properly. A quick look at the table shows that it's probably binary serialized. I'm no expert at SQL, but I'm pretty sure doing that binary serialization in pure SQL is up there in difficulty. If I'm right, this is the source code for the binary serialization, located inside Microsoft.AspNet.SignalR.Messaging.ScaleoutMessage:
public byte[] ToBytes()
{
    using (MemoryStream memoryStream = new MemoryStream())
    {
        BinaryWriter binaryWriter = new BinaryWriter((Stream) memoryStream);
        binaryWriter.Write(this.Messages.Count);
        for (int index = 0; index < this.Messages.Count; ++index)
            this.Messages[index].WriteTo((Stream) memoryStream);
        binaryWriter.Write(this.ServerCreationTime.Ticks);
        return memoryStream.ToArray();
    }
}
With WriteTo:
public void WriteTo(Stream stream)
{
    BinaryWriter binaryWriter = new BinaryWriter(stream);
    string source = this.Source;
    binaryWriter.Write(source);
    string key = this.Key;
    binaryWriter.Write(key);
    int count1 = this.Value.Count;
    binaryWriter.Write(count1);
    ArraySegment<byte> arraySegment = this.Value;
    byte[] array = arraySegment.Array;
    arraySegment = this.Value;
    int offset = arraySegment.Offset;
    arraySegment = this.Value;
    int count2 = arraySegment.Count;
    binaryWriter.Write(array, offset, count2);
    string str1 = this.CommandId ?? string.Empty;
    binaryWriter.Write(str1);
    int num1 = this.WaitForAck ? 1 : 0;
    binaryWriter.Write(num1 != 0);
    int num2 = this.IsAck ? 1 : 0;
    binaryWriter.Write(num2 != 0);
    string str2 = this.Filter ?? string.Empty;
    binaryWriter.Write(str2);
}
So, re-implementing that in a stored procedure with pure SQL will be near impossible. If you need to do it on the SQL Server, I suggest using SQL CLR functions. One thing to mention, though: it's easy enough to use a class library, but if you want to reduce hassle over the long term, I'd suggest creating a SQL Server project in Visual Studio. This will allow you to automagically deploy CLR functions with much more ease than manually re-copying the latest class library to the SQL Server. This page talks more about how to do that.
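As a rough sketch of what such a CLR function could look like, the following mirrors the ToBytes()/WriteTo() layout quoted above for a single message. The class/function names and parameters are mine, and what SignalR expects in Source/Key/Value depends on the exact version you run, so treat this as a starting point rather than a drop-in implementation:

using System;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

public static class SignalRPayloadFunctions
{
    // Builds a one-message payload following the layout of
    // ScaleoutMessage.ToBytes() / Message.WriteTo() shown above.
    [SqlFunction(IsDeterministic = false)]
    public static SqlBytes BuildPayload(SqlString source, SqlString key, SqlBytes value)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            BinaryWriter writer = new BinaryWriter(ms);
            writer.Write(1);                          // message count
            writer.Write(source.Value);               // Source (length-prefixed string)
            writer.Write(key.Value);                  // Key
            byte[] data = value.Value;
            writer.Write(data.Length);                // Value length (int32)
            writer.Write(data, 0, data.Length);       // Value bytes
            writer.Write(string.Empty);               // CommandId
            writer.Write(false);                      // WaitForAck
            writer.Write(false);                      // IsAck
            writer.Write(string.Empty);               // Filter
            writer.Write(DateTime.UtcNow.Ticks);      // ServerCreationTime.Ticks
            return new SqlBytes(ms.ToArray());
        }
    }
}

You would still need to register the assembly and create the function in SQL Server, which is the part the Visual Studio SQL Server project automates; the activated stored procedure could then call it to populate [Payload].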
I was inspired by this post and wrote up a version in a SQL stored proc. It works really slick and wasn't hard.
I hadn't done much work with varbinary before, but SQL Server makes it pretty easy to work with, and you can just concatenate the sections together. The format given by James Haug above is accurate. Most of the strings are just "length as byte, then the string content" (with the string content just being convert(varbinary, string)). The exception is the payload string, which instead is "length as int32, then string content". Numbers are written out least significant byte first. I'm not sure whether you can do a conversion like this natively; I found it easy enough to write it myself as a recursive function (something like numToBinary(val, bytesRemaining), returning varbinary).
If you take this route, I'd still write a parser first (in .NET or another non-SQL language) and test it on some packets generated by SignalR itself. That gives you a better place to work out the kinks in your SQL - and learn the right formatting of the payload package and what not.

How can I ask Lucene to do simple, flat scoring?

Let me preface by saying that I'm not using Lucene in a very common way, so let me explain how my question makes sense. I'm using Lucene to do searches in structured records. That is, each document that is indexed is a set of fields with short values from a given set. Each field is analysed and stored, the analysis usually producing no more than 3 and in most cases just 1 normalised token. As an example, imagine files for each of which we store two fields: the path to the file and a user rating from 1 to 5. The path is tokenized with a PathHierarchyTokenizer and the rating is just stored as-is. So, if we have a document like
path: "/a/b/file.txt"
rating: 3
This document will have for its path field the tokens "/a", "/a/b" and "/a/b/file.txt", and for its rating field the token "3".
I wish to score this document against a query like "path:/a path:/a/b path:/a/b/different.txt rating:1" and get a value of 2 - the number of terms that match.
My understanding and observation is that the score of a document depends on various term metrics, and with many documents, each with many fields, I am most definitely not getting simple integer scores.
Is there some way to make Lucene score documents in the outlined fashion? The queries that are run against the index are not generated by the users, but are built by the system and have an optional filter attached, meaning they all have a fixed form of several TermQuerys joined in a BooleanQuery with nothing like any fuzzy textual searches. Currently I don't have the option of replacing Lucene with something else, but suggestions are welcome for a future development.
I doubt there's something ready to use, so most probably you will need to implement your own scorer and use it when searching. For complicated cases you may want to play around with queries, but for a simple case like yours it should be enough to override DefaultSimilarity, setting the tf factor to the raw frequency (the number of the specified terms in the document in question) and all other components to 1. Something like this:
public class MySimilarity extends DefaultSimilarity {

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1;
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1;
    }

    @Override
    public float tf(float freq) {
        return freq;
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }
}
(Note that tf() is the only method that returns something other than 1.)
And then just set the similarity on your IndexSearcher, e.g. with searcher.setSimilarity(new MySimilarity()).

Why are these hash codes equal?

This test is failing:
var hashCode = new
{
    CustomerId = 3354,
    ServiceId = 3,
    CmsThematicId = (int?)605,
    StartDate = (DateTime?)new DateTime(2013, 1, 5),
    EndDate = (DateTime?)new DateTime(2013, 1, 6)
}.GetHashCode();

var hashCode2 = new
{
    CustomerId = 1210,
    ServiceId = 3,
    CmsThematicId = (int?)591,
    StartDate = (DateTime?)new DateTime(2013, 3, 31),
    EndDate = (DateTime?)new DateTime(2013, 4, 1)
}.GetHashCode();

Assert.AreNotEqual(hashCode, hashCode2);
Can you tell me why?
It's kinda amazing you found this coincidence.
Anonymous types have a generated GetHashCode() method that produces a hash code by combining the hash codes of all the properties.
The calculation is basically this:
public override int GetHashCode()
{
    return -1521134295 *
        (-1521134295 *
            (-1521134295 *
                (-1521134295 *
                    (-1521134295 *
                        1170354300 +
                        CustomerId.GetHashCode()) +
                    ServiceId.GetHashCode()) +
                CmsThematicId.GetHashCode()) +
            StartDate.GetHashCode()) +
        EndDate.GetHashCode();
}
If you change any of the values of any of the fields, the hash code does change. The fact that you found two different sets of values that happen to get the same hash codes is a coincidence.
Note that hash codes are not necessarily unique. It's impossible to say hash codes would always be unique since there can be more objects than hash codes (although that is a lot of objects). Good hash codes provide a random distribution of values.
NOTE: The above is from .NET 4. Different versions of .NET may differ, and Mono differs as well.
If you want to actually compare two objects for equality then use .Equals(). For anonymous objects it compares each field. An even better option is to use an NUnit constraint that compares each field and reports back which field differs. I posted a constraint here:
https://stackoverflow.com/a/2046566/118703
Your test is not valid.
Because hash codes are not guaranteed to be unique (see other answers for a good explanation), you should not test for uniqueness of hash codes.
When writing your own GetHashCode() method, it is a good idea to test for even distribution of random input, just not for uniqueness. Just make sure that you use enough random input to get a good test.
The MSDN spec on GetHashCode specifically states:
For the best performance, a hash function must generate a random
distribution for all input.
This is all relative, of course. A GetHashCode() method that is being used to put 100 objects in a dictionary doesn't need to be nearly as random as a GetHashCode() that puts 10,000,000 objects in a dictionary.
Did you run into this when processing a fairly large amount of data?
Welcome to the wonderful world of hash codes. A hash code is not a "unique identifier." It can't be. There is an essentially infinite number of possible different instances of that anonymous type, but only 2^32 possible hash codes. So it's guaranteed that if you create enough of those objects, you're going to see some duplicates. In fact, if you generate 70,000 of those objects randomly, the odds are better than 50% that two of them will have the same hash code.
See Birthdays, Random Numbers, and Hash Codes, and the linked Wikipedia article for more info.
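To make that concrete, here is a small sketch of my own (not from the question) that counts how many randomly built anonymous instances it takes before two share a hash code; it typically terminates after tens of thousands of objects:

using System;
using System.Collections.Generic;

public static class HashCollisionDemo
{
    public static void Main()
    {
        var rng = new Random();
        var seen = new HashSet<int>();
        int count = 0;
        while (true)
        {
            var hash = new
            {
                CustomerId = rng.Next(),
                ServiceId = rng.Next(1, 10),
                CmsThematicId = (int?)rng.Next(),
                StartDate = (DateTime?)new DateTime(2013, 1, 1).AddDays(rng.Next(365)),
                EndDate = (DateTime?)new DateTime(2013, 1, 2).AddDays(rng.Next(365))
            }.GetHashCode();
            count++;
            if (!seen.Add(hash))
            {
                break;  // Add returns false when this hash code was already seen
            }
        }
        // With only 2^32 possible hash codes, this is usually a number in the
        // tens of thousands, not billions.
        Console.WriteLine("First duplicate hash code after {0} objects", count);
    }
}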
As for why some people didn't see a duplicate and others did, it's likely that they ran the program on different versions of .NET. The algorithm for generating hash codes is not guaranteed to remain the same across versions or platforms:
The GetHashCode method for an object must consistently return the same
hash code as long as there is no modification to the object state that
determines the return value of the object's Equals method. Note that
this is true only for the current execution of an application, and
that a different hash code can be returned if the application is run
again.
Jim suggested to me (in the chat room) that I store my parameters, so that when I display my parameters I select the unused ones, and then when someone registers I flag that set as used. But it's a big PITA to generate all the parameters.
So my solution is to build an Int64 hash code like this:
const long i = -1521134295;
return -i * (-i * (-i * (-i * -117147284 + customerId.GetHashCode()) + serviceId.GetHashCode()) + cmsThematicId.GetHashCode()) + startDate.GetHashCode();
I removed the end date because its value depended on serviceId and startDate, so I shouldn't have added it to the hash code in the first place.
I copy/pasted it from a decompilation of the generated class. I got no collisions when I tested with 300,000 different combinations.

Hacky Sql Compact Workaround

So, I'm trying to use ADO.NET to stream file data stored in an image column in a SQL Compact database.
To do this, I wrote a DataReaderStream class that takes a data reader, opened for sequential access, and represents it as a stream, redirecting calls to Read(...) on the stream to IDataReader.GetBytes(...).
One "weird" aspect of IDataReader.GetBytes(...), when compared to the Stream class, is that GetBytes requires the client to increment an offset and pass that in each time it's called. It does this even though access is sequential, and it's not possible to read "backwards" in the data reader stream.
The SqlCeDataReader implementation of IDataReader enforces this by incrementing an internal counter that identifies the total number of bytes it has returned. If you pass in a number either less than or greater than that number, the method will throw an InvalidOperationException.
The problem with this, however, is that there is a bug in the SqlCeDataReader implementation that causes it to set the internal counter to the wrong value. This results in subsequent calls to Read on my stream throwing exceptions when they shouldn't be.
I found some information about the bug on this MSDN thread.
I was able to come up with a disgusting, horribly hacky workaround, that basically uses reflection to update the field in the class to the correct value.
The code looks like this:
public override int Read(byte[] buffer, int offset, int count)
{
    m_length = m_length ?? m_dr.GetBytes(0, 0, null, offset, count);
    if (m_fieldOffSet < m_length)
    {
        var bytesRead = m_dr.GetBytes(0, m_fieldOffSet, buffer, offset, count);
        m_fieldOffSet += bytesRead;
        if (m_dr is SqlCeDataReader)
        {
            //BEGIN HACK
            //This is a horrible HACK.
            m_field = m_field ?? typeof(SqlCeDataReader).GetField("sequentialUnitsRead", BindingFlags.NonPublic | BindingFlags.Instance);
            var length = (long)(m_field.GetValue(m_dr));
            if (length != m_fieldOffSet)
            {
                m_field.SetValue(m_dr, m_fieldOffSet);
            }
            //END HACK
        }
        return (int)bytesRead;
    }
    else
    {
        return 0;
    }
}
For obvious reasons, I would prefer to not use this.
However, I do not want to buffer the entire contents of the blob in memory either.
Does any one know of a way I can get streaming data out of a SQL Compact database without having to resort to such horrible code?
I contacted Microsoft (through the SQL Compact Blog) and they confirmed the bug, and suggested I use OLEDB as a workaround. So, I'll try that and see if that works for me.
Actually, I decided to fix the problem by just not storing blobs in the database to begin with.
This eliminates the problem (I can stream data from a file), and also fixes some issues I might have run into with Sql Compact's 4 GB size limit.