Using MD5 in SQL Server 2005 to checksum a file stored in a varbinary column

I'm trying to do an MD5 check on a file uploaded to a varbinary field in MSSQL 2005.
I uploaded the file and using
SELECT DATALENGTH(thefile) FROM table
I get the same number of bytes that the file has.
But using MD5 Calculator (from Bullzip) I get this MD5:
20cb960d7b191d0c8bc390d135f63624
and using SQL I get this MD5:
44c29edb103a2872f519ad0c9a0fdaaa
Why are they different if the field has the same length and, I assume, the same bytes?
My SQL code to do that was:
DECLARE @HashThis varbinary;
DECLARE @md5text varchar(250);
SELECT @HashThis = thefile FROM CFile WHERE id=1;
SET @md5text = SUBSTRING(sys.fn_sqlvarbasetostr(HASHBYTES('MD5', @HashThis)), 3, 32)
PRINT @md5text;
Maybe it's the data type conversion?
Any tip will be helpful.
Thanks a lot :)

Two options:
A VARBINARY declared without a size modifier defaults to VARBINARY(1), so you are hashing only the first byte of the file; SELECT DATALENGTH(@HashThis) after the assignment will return 1.
If you use varbinary(MAX) instead, keep in mind that HASHBYTES hashes only the first 8000 bytes of its input.
If you want to hash more than 8000 bytes, write your own CLR hash function. For example, the code below is from my SQL Server project; it produces the same results as other hash functions outside of SQL Server:
using System;
using System.Data.SqlTypes;
using System.IO;

namespace ClrHelpers
{
    public partial class UserDefinedFunctions {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static Guid HashMD5(SqlBytes data) {
            System.Security.Cryptography.MD5CryptoServiceProvider md5 = new System.Security.Cryptography.MD5CryptoServiceProvider();
            md5.Initialize();
            int len = 0;
            byte[] b = new byte[8192];
            Stream s = data.Stream;
            do {
                len = s.Read(b, 0, 8192);
                md5.TransformBlock(b, 0, len, b, 0);
            } while(len > 0);
            md5.TransformFinalBlock(b, 0, 0);
            Guid g = new Guid(md5.Hash);
            return g;
        }
    };
}
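For a sanity check outside SQL Server, a minimal sketch along these lines (the file path is illustrative) hashes the raw file bytes so you can compare the result against the HASHBYTES output or the CLR function above:
// Minimal cross-check sketch (path is illustrative): hash the original file outside SQL Server
// and compare the hex string with SELECT HASHBYTES('MD5', thefile) FROM CFile WHERE id = 1.
// Note: the Guid returned by HashMD5 above reorders some bytes when shown as a string,
// so compare the underlying bytes rather than the Guid's default string form.
using System;
using System.IO;
using System.Security.Cryptography;

class Md5Check
{
    static void Main()
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(@"C:\temp\thefile.bin"))
        {
            byte[] hash = md5.ComputeHash(stream);
            Console.WriteLine(BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant());
        }
    }
}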

It could be that MD5 Calculator is computing the MD5 hash of the file content plus other properties (e.g. author, last modified date, etc.). You could try altering those properties and hashing again to see whether the result changes (before vs. after, using only MD5 Calculator).
Another possibility concerns what you are really saving in SQL Server.
So it's quite clear that MD5 Calculator and SQL Server are hashing different things. What exactly? Congrats to whoever answers that :)

Related

Reading a CSV file with 50M lines, how to improve performance

I have a data file in CSV (Comma-Separated-Value) format that has about 50 million lines in it.
Each line is read into a string, parsed, and then used to fill in the fields of an object of type FOO. The object then gets added to a List(of FOO) that ultimately has 50 million items.
That all works, and fits in memory (at least on an x64 machine), but it's SLOW. It takes about 5 minutes to load and parse the file into the list every time. I would like to make it faster. How can I make it faster?
The important parts of the code are shown below.
Public Sub LoadCsvFile(ByVal FilePath As String)
    Dim s As IO.StreamReader = My.Computer.FileSystem.OpenTextFileReader(FilePath)
    'Find header line
    Dim L As String
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Exit While
    End While
    'Parse data lines
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Dim T As FOO = FOO.FromCSV(L)
        Add(T)
    End While
    s.Close()
End Sub

Public Class FOO
    Public time As Date
    Public ID As UInt64
    Public A As Double
    Public B As Double
    Public C As Double

    Public Shared Function FromCSV(ByVal X As String) As FOO
        Dim T As New FOO
        Dim tokens As String() = X.Split(",")
        If Not DateTime.TryParse(tokens(0), T.time) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid ISO 8601 timestamp")
        End If
        If Not UInt64.TryParse(tokens(1), T.ID) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid ID")
        End If
        If Not Double.TryParse(tokens(2), T.A) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for A")
        End If
        If Not Double.TryParse(tokens(3), T.B) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for B")
        End If
        If Not Double.TryParse(tokens(4), T.C) Then
            Throw New Exception("Could not convert CSV to FOO: Invalid Format for C")
        End If
        Return T
    End Function
End Class
I did some benchmarking and here are the results.
The complete algorithm above took 314 seconds to load the whole file and put the objects into the list.
With the body of FromCSV() reduced to just returning a new object of type FOO with default field values, the whole process took 84 seconds. Therefore it appears that processing the line of text into the object fields is taking 230 seconds (73% of the total time).
Doing everything but parsing the ISO 8601 date string takes 175 seconds. Therefore it appears that processing the date string takes 139 seconds, which is 60% of the text processing time, just for that one field.
Just reading the lines in the file without any processing or object creating takes 41 seconds.
Using StreamReader.ReadBlock to read the whole file in chunks of about 1KB takes 24s, but it's a minor improvement in the grand scheme of things and probably not worth the added complexity. In order to use TryParse I would then need to manually create the temporary strings rather than using String.Split().
At this point the only path I see is to just display status to the user every few seconds so they don't wonder if the program is frozen or something.
UPDATE
I created two new functions. One saves the dataset from memory into a binary file using System.IO.BinaryWriter. The other loads that binary file back into memory using System.IO.BinaryReader. The binary versions were considerably faster than the CSV versions, and the binary files take up much less space.
Here are the benchmark results (same dataset for all tests):
LOAD CSV: 340s
SAVE CSV: 312s
SAVE BIN: 29s
LOAD BIN: 41s
CSV FILE SIZE: 3.86GB
BIN FILE SIZE: 1.63GB
I have a lot of experience with CSV, and the bad news is that you aren't going to be able to make this a whole lot faster. CSV libraries aren't going to be of much assistance here. The difficult problem with CSV that libraries attempt to handle is dealing with fields that contain embedded commas or newlines, which require quoting and escaping. Your dataset doesn't have this issue, since none of the columns are strings.
As you have discovered, the bulk of the time is spent in the parse methods. Andrew Morton had a good suggestion: using TryParseExact for DateTime values can be quite a bit faster than TryParse. My own CSV library, Sylvan.Data.Csv (which is the fastest available for .NET), uses an optimization where it parses primitive values directly out of the stream read buffer without converting to string first (only when running on .NET Core), which can also speed things up a bit. However, I wouldn't expect it to be possible to cut the processing time in half while sticking with CSV.
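For reference, the TryParseExact change on its own would look something like this in C# (the format string is an assumption based on the ISO 8601 mention; adjust it to whatever the file actually contains):
// Hedged sketch: parse the timestamp with an explicit format instead of general TryParse.
// "yyyy-MM-ddTHH:mm:ss" is an assumed layout, not taken from the actual file.
using System;
using System.Globalization;

static bool TryParseIsoTimestamp(string token, out DateTime value)
{
    return DateTime.TryParseExact(
        token,
        "yyyy-MM-ddTHH:mm:ss",
        CultureInfo.InvariantCulture,
        DateTimeStyles.None,
        out value);
}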
Here is an example of using my library, Sylvan.Data.Csv to process the CSV in C#.
static List<Foo> Read(string file)
{
    // estimate of the average row length based on Andrew Morton's 4GB/50m
    const int AverageRowLength = 80;
    var textReader = File.OpenText(file);
    // specifying the DateFormat will cause TryParseExact to be used.
    var csvOpts = new CsvDataReaderOptions { DateFormat = "yyyy-MM-ddTHH:mm:ss" };
    var csvReader = CsvDataReader.Create(textReader, csvOpts);
    // estimate number of rows to avoid growing the list.
    var estimatedRows = (int)(textReader.BaseStream.Length / AverageRowLength);
    var data = new List<Foo>(estimatedRows);
    while (csvReader.Read())
    {
        if (csvReader.RowFieldCount < 5) continue;
        var item = new Foo()
        {
            time = csvReader.GetDateTime(0),
            ID = csvReader.GetInt64(1),
            A = csvReader.GetDouble(2),
            B = csvReader.GetDouble(3),
            C = csvReader.GetDouble(4)
        };
        data.Add(item);
    }
    return data;
}
I'd expect this to be somewhat faster than your current implementation, so long as you are running on .NET Core. Running on .NET Framework the difference, if any, wouldn't be significant. However, I don't expect this to be acceptably fast for your users; it will still likely take tens of seconds, or minutes, to read the whole file.
Given that, my advice would be to abandon CSV altogether, which means you can abandon parsing, which is what is slowing things down. Instead, read and write the data in binary form. Your data records have a nice property in that they are fixed width: each record contains 5 fields that are 8 bytes (64 bits) wide, so each record requires exactly 40 bytes in binary form. 50m x 40 = 2GB. So, assuming Andrew Morton's estimate of 4GB for the CSV is correct, moving to binary will halve the storage needs. Immediately, that means there is half as much disk IO needed to read the same data. But beyond that, you won't need to parse anything; the binary representation of each value will essentially be copied directly to memory.
Here are some examples of how to do this in C# (don't know VB very well, sorry).
static List<Foo> Read(string file)
{
    var stream = File.OpenRead(file);
    // the exact number of records can be determined by looking at the length of the file.
    var recordCount = (int)(stream.Length / 40);
    var data = new List<Foo>(recordCount);
    var br = new BinaryReader(stream);
    for (int i = 0; i < recordCount; i++)
    {
        var ticks = br.ReadInt64();
        var id = br.ReadInt64();
        var a = br.ReadDouble();
        var b = br.ReadDouble();
        var c = br.ReadDouble();
        var f = new Foo()
        {
            time = new DateTime(ticks),
            ID = id,
            A = a,
            B = b,
            C = c,
        };
        data.Add(f);
    }
    return data;
}

static void Write(List<Foo> data, string file)
{
    // dispose the stream and writer so the data is flushed to disk
    using (var stream = File.Create(file))
    using (var bw = new BinaryWriter(stream))
    {
        foreach (var item in data)
        {
            bw.Write(item.time.Ticks);
            bw.Write(item.ID);
            bw.Write(item.A);
            bw.Write(item.B);
            bw.Write(item.C);
        }
    }
}
This should almost certainly be an order of magnitude faster than a CSV-based solution. The question then becomes: is there some reason that you must use CSV? If the source of the data is out of your control and you must use CSV, I would then ask: will the data file change every time, or will it only be appended to with new data? If it is appended to, I would investigate a solution where, each time the app starts, you convert only the newly appended section of CSV data and add it to a binary file, then load everything from the binary file. Then you only have to pay the cost of processing the new CSV data each time, and will load everything quickly from the binary form (a rough sketch of this idea follows below).
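A rough sketch of that incremental idea, assuming the CSV is strictly append-only and using a hypothetical ParseFooLine helper in place of the existing FOO.FromCSV parsing:
// Hypothetical sketch: convert only the newly appended portion of the CSV to binary.
// It tracks the converted byte offset in a small side file; ParseFooLine is a stand-in
// for the existing FOO.FromCSV parsing, and Foo matches the binary examples above.
using System;
using System.IO;

static void AppendNewCsvToBinary(string csvFile, string binFile, string stateFile)
{
    long converted = File.Exists(stateFile) ? long.Parse(File.ReadAllText(stateFile)) : 0;
    long csvLength = new FileInfo(csvFile).Length;   // length at the time of this run
    if (csvLength <= converted) return;              // nothing new since the last run

    using (var csv = File.OpenRead(csvFile))
    using (var bw = new BinaryWriter(new FileStream(binFile, FileMode.Append)))
    {
        csv.Seek(converted, SeekOrigin.Begin);       // the previous offset ended on a line boundary
        using (var reader = new StreamReader(csv))
        {
            bool skipHeader = converted == 0;        // first run: skip the CSV header line
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;                      // discard blank lines
                if (skipHeader) { skipHeader = false; continue; }
                Foo foo = ParseFooLine(line);                        // hypothetical parse helper
                bw.Write(foo.time.Ticks);
                bw.Write(foo.ID);
                bw.Write(foo.A);
                bw.Write(foo.B);
                bw.Write(foo.C);
            }
        }
    }
    // record progress; assumes nothing is appended to the CSV while this runs
    File.WriteAllText(stateFile, csvLength.ToString());
}
The first full conversion still pays the CSV parsing cost once; after that, startup only parses whatever was appended.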
This could be made even faster by creating a fixed-layout struct (Foo), allocating an array of them, and using span-based trickery to read the array data directly from the FileStream. This can be done because all of your data elements are "blittable". This would be the absolute fastest way to load this data into your program. Start with the BinaryReader/Writer approach, and if you find that still isn't fast enough, then investigate this; a hedged sketch of the idea follows.
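Here is one possible shape of that span-based approach (FooRecord is illustrative and mirrors the 40-byte layout; reading the whole file at once assumes it fits in a single byte array):
// Hypothetical sketch: reinterpret the fixed-width binary records as an array of a
// blittable struct, skipping per-field reads entirely. Assumes the 40-byte layout
// written by the BinaryWriter example above and a file under the 2 GB array limit.
using System;
using System.IO;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FooRecord
{
    public long Ticks;   // DateTime.Ticks
    public long Id;
    public double A;
    public double B;
    public double C;
}

static FooRecord[] ReadRecords(string file)
{
    byte[] bytes = File.ReadAllBytes(file);
    // Cast the raw bytes to FooRecord values; ToArray copies them into a managed array.
    return MemoryMarshal.Cast<byte, FooRecord>(bytes.AsSpan()).ToArray();
}
For files larger than the single-array limit, the same cast can be applied to fixed-size chunks read into a reusable buffer.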
If you find this solution to work, I'd love to hear the results.

how common is it to use both a guid and an int as unique identifiers for a table?

I think a Guid is generally the preferred unique table row identifier from a DBA perspective. But I'm working on a project where the developers and managers appear to want a way to reference things by an int value. I can understand their perspective, because they want a simple and easy way to reference different entities.
I was thinking about using a pattern for my tables where each table would have an int Id column representing the PK column but then it would also include a Guid column as a globally unique identifier. How common is it to use this type of pattern?
In the vast majority of cases you'll want to use either an INT or a BIGINT for your primary key/foreign key. For the most part you are looking to make sure that the table can be joined to and that there is a way to easily select a single unique row. In theory using GUIDs all over the place gets you there too, if you were a robot and could quickly ask a colleague, "Hey, can you check out ROW_ID FD229C39-2074-4B04-8A50-456402705C02?" vs "Hey, can you check out ROW_ID 523?". But we are human. I don't think there is a really good reason to include another column that is simply a GUID in addition to your PK (which should be an INT or BIGINT).
It can also be nice to have your PK values in order; that seems to come in handy. GUIDs won't be in order. However, a case for using a GUID would be if you have to expose this value to a customer. You may not want them to know they are customer #6. But being customer #B8D44820-DF75-44C9-8527-F6AC7D1D259B isn't too great if they have to call in and identify themselves, though it might be fine for writing code against (say a web service or some kind of API). SQL is a lot of art with the science!
In addition, do you really need a globally unique id for a row? Probably not. If you are designing a system that could use up more than an INT can handle (say, the total number of tweets of all time), then use BIGINT. If you can use up all the BIGINTs, wow. I'd be interested in hearing how, and would like to subscribe to your newsletter.
A question I ask myself when writing stuff is, "If I'm wrong, how hard will it be to do it the other way?" If you really need a GUID later, add it. If you put it in now and just one person uses it, you can never take it out and it will have to be maintained... job security? Nah, don't think that way :) Don't over-engineer it.
I would not say a GUID is generally preferred from a DBA perspective. It is larger (16 bytes rather than 4 for int or 8 for bigint), and the random variety introduces fragmentation and causes much more IO with large tables due to lower page life expectancy. This is especially a problem with spinning media and limited RAM.
When a GUID is actually needed, some of these issues can be avoided by using a sequential version for the GUID value rather than introducing another surrogate key. The value can be assigned by SQL Server with a NEWSEQUENTIALID() default constraint on the column, or generated in application code with the bytes ordered properly for SQL Server. Below is a Windows C# example of the latter technique.
using System;
using System.Runtime.InteropServices;

public class Example
{
    [DllImport("rpcrt4.dll", CharSet = CharSet.Auto)]
    public static extern int UuidCreateSequential(ref Guid guid);

    /// sequential guid for SQL Server
    public static Guid NewSequentialGuid()
    {
        const int S_OK = 0;
        const int RPC_S_UUID_LOCAL_ONLY = 1824;

        Guid oldGuid = Guid.Empty;
        int result = UuidCreateSequential(ref oldGuid);
        if (result != S_OK && result != RPC_S_UUID_LOCAL_ONLY)
        {
            throw new ExternalException("UuidCreateSequential call failed", result);
        }

        byte[] oldGuidBytes = oldGuid.ToByteArray();
        byte[] newGuidBytes = new byte[16];
        oldGuidBytes.CopyTo(newGuidBytes, 0);

        // swap low timestamp bytes (0-3)
        newGuidBytes[0] = oldGuidBytes[3];
        newGuidBytes[1] = oldGuidBytes[2];
        newGuidBytes[2] = oldGuidBytes[1];
        newGuidBytes[3] = oldGuidBytes[0];

        // swap middle timestamp bytes (4-5)
        newGuidBytes[4] = oldGuidBytes[5];
        newGuidBytes[5] = oldGuidBytes[4];

        // swap high timestamp bytes (6-7)
        newGuidBytes[6] = oldGuidBytes[7];
        newGuidBytes[7] = oldGuidBytes[6];

        // remaining 8 bytes are unchanged (8-15)
        return new Guid(newGuidBytes);
    }
}
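A possible usage sketch (the parameter and column names are made up for illustration):
// Hypothetical usage: generate the key client-side so inserts stay roughly sequential.
Guid rowGuid = Example.NewSequentialGuid();
// then bind it to the INSERT, e.g. command.Parameters.AddWithValue("@RowGuid", rowGuid);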

How to read 8-byte integers in GMS 2.x?

I need to read 8-byte integers from a stream. I could not find any documentation on how to read 8-byte integers in DM. It would be something similar to a long long integer.
Is there a trick for streaming 8-byte integers from a file in GMS 2.x?
We can use the "Stream" object to read/import data of various kinds. Please refer to the DM Help > Scripting > File Input and Output:
Other examples can also be found at DM-Script-Database :
Read-Ser (http://donation.tugraz.at/dm/source_codes/127)
JEMS_.ems file reader (http://donation.tugraz.at/dm/source_codes/108)
Hope this helps.
I used the following (stupid) method to do so:
number readint32(object s) {
    number stream_byte_order = 2
    number result = 0
    TagGroup tg = NewTagGroup();
    tg.TagGroupSetTagAsLong( "SInt32_0", 0 )
    TagGroupReadTagDataFromStream( tg, "SInt32_0", s, stream_byte_order );
    tg.TagGroupGetTagAsLong( "SInt32_0", result)
    return result
}

number readint64(object s) {
    // new for reading 8-byte integers in TIA ver > 3.7
    // DM automatically converts the result to floating point when the second (high) 4 bytes are > 1
    number result = readint32(s) + (readint32(s) * 4294967296)
    // 4294967296 is 2^32 (0x100000000 in hex)
    return result
}
It works for reading .ser files < 2 GB, but not for larger files. I still have not figured that out...
Update (09-04-2016):
I now have a solution to the data offset problem in .ser files:
Void b_readint64(object s, number &lo, number &hi) {
    // new for reading 8-byte (64-bit) integers in TIA ver > 3.7
    // read the low and high sections individually and later work with them
    // via the StreamSetPos32Signed and StreamSetPos64 functions
    lo = b_readint32(s)
    hi = b_readint32(s)
}

Void StreamSetPos32Signed(object s, number base, number lo) {
    if (lo > 0) StreamSetPos(s, base, lo)
    else StreamSetPos(s, base, 4294967296 + lo)
}

Void StreamSetPos64(object s, number base, number lo, number hi) {
    if (hi != 0) {
        StreamSetPos(s, base, 0)
        for (number i = 0; i < hi; i++) StreamSetPos(s, 1, 4294967296)
        StreamSetPos32Signed(s, 1, lo)
    } else StreamSetPos32Signed(s, base, lo)
}
BTW, I just uploaded this upgraded script to
http://portal.tugraz.at/portal/page/portal/felmi/DM-Script/DM-Script-Database
There is nothing like an 8-byte integer in DigitalMicrograph. You can use streaming to read in two successive 4-byte sections as integers (see the answer above) and then display them as binary using binary() or hexadecimal using hex(), but you will have to do the maths yourself for the "meaning" of the 8-byte integer (storing it as a real number). You can use the binary operators & | ^ for bitwise arithmetic when needed.

Is it possible to add SignalR messages directly to the SQL Backplane?

I'd like to know if I can add SignalR messages directly to the SignalR SQL Backplane (from SQL) so I don't have to use a SignalR client to do so.
My situation is that I have an activated stored procedure for a SQL Service Broker queue, and when it fires, I'd like to post a message to SignalR clients. Currently, I have to receive the message from SQL Service Broker in a separate process and then immediately re-send the message with a SignalR hub.
I'd like my activated stored procedure to basically move the message directly onto the SignalR SQL Backplane.
Yes and no. I set up a small experiment on my localhost to determine whether it's possible - and it is, if the message is formatted properly.
Now, a word on the [SignalR] schema. It generates three tables:
[SignalR].[Messages_0]
--this holds a list of all messages with the columns of
--[PayloadId], [Payload], and [InsertedOn]
[SignalR].[Messages_0_Id]
--this holds one record of one field - the last Id value in the [Messages_0] table
[SignalR].[Schema]
--No idea what this is for; it's a 1-column (SchemaVersion), 1-record (value of 1) table
Right, so, I duplicated the last record, except I incremented the PayloadId (both for the new record and in [Messages_0_Id]) and put in GETDATE() as the value for InsertedOn. Immediately after adding the record, a new message came into the connected client. Note that PayloadId is not an identity column, so you must manually increment it, and you must copy that incremented value into the only record in [Messages_0_Id]; otherwise your SignalR clients will be unable to connect due to SignalR SQL errors.
Now, the trick is populating the [Payload] column properly. A quick look at the table shows that it's probably binary serialized. I'm no expert at SQL, but I'm pretty sure that reproducing that binary serialization in pure SQL is up there in difficulty. If I'm right, this is the source code for the binary serialization, located inside Microsoft.AspNet.SignalR.Messaging.ScaleoutMessage:
public byte[] ToBytes()
{
    using (MemoryStream memoryStream = new MemoryStream())
    {
        BinaryWriter binaryWriter = new BinaryWriter((Stream) memoryStream);
        binaryWriter.Write(this.Messages.Count);
        for (int index = 0; index < this.Messages.Count; ++index)
            this.Messages[index].WriteTo((Stream) memoryStream);
        binaryWriter.Write(this.ServerCreationTime.Ticks);
        return memoryStream.ToArray();
    }
}
With WriteTo:
public void WriteTo(Stream stream)
{
    BinaryWriter binaryWriter = new BinaryWriter(stream);
    string source = this.Source;
    binaryWriter.Write(source);
    string key = this.Key;
    binaryWriter.Write(key);
    int count1 = this.Value.Count;
    binaryWriter.Write(count1);
    ArraySegment<byte> arraySegment = this.Value;
    byte[] array = arraySegment.Array;
    arraySegment = this.Value;
    int offset = arraySegment.Offset;
    arraySegment = this.Value;
    int count2 = arraySegment.Count;
    binaryWriter.Write(array, offset, count2);
    string str1 = this.CommandId ?? string.Empty;
    binaryWriter.Write(str1);
    int num1 = this.WaitForAck ? 1 : 0;
    binaryWriter.Write(num1 != 0);
    int num2 = this.IsAck ? 1 : 0;
    binaryWriter.Write(num2 != 0);
    string str2 = this.Filter ?? string.Empty;
    binaryWriter.Write(str2);
}
So, re-implementing that in a stored procedure with pure SQL will be near impossible. If you need to do it on the SQL Server, I suggest using SQL CLR functions. One thing to mention, though - It's easy enough to use a class library, but if you want to reduce hassle over the long term, I'd suggest creating a SQL Server project in Visual Studio. This will allow you to automagically deploy CLR functions with much more ease than manually re-copying the latest class library to the SQL Server. This page talks more about how to do that.
I was inspired by this post and wrote up a version in a SQL stored proc. It works really slick, and wasn't hard.
I hadn't done much work with varbinary before, but SQL Server makes it pretty easy to work with, and you can just concatenate the sections together. The format given by James Haug above is accurate. Most of the strings are just "length as a byte, then the string content" (with the string content just being convert(varbinary, string)). The exception is the payload string, which instead is "length as int32, then the string content". Numbers are written out least significant byte first. I'm not sure whether you can do a conversion like this natively; I found it easy enough to write it myself as a recursive function (something like numToBinary(val, bytesRemaining)... returns varbinary).
If you take this route, I'd still write a parser first (in .NET or another non-SQL language) and test it on some packets generated by SignalR itself. That gives you a better place to work out the kinks in your SQL - and learn the right formatting of the payload package and what not.
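For that parser, a minimal .NET sketch that mirrors the ToBytes/WriteTo format shown above might look like this (the layout is only as accurate as the decompiled code; verify against payloads your installed SignalR version actually writes):
// Minimal parser sketch mirroring ScaleoutMessage.ToBytes/Message.WriteTo above.
// Useful for comparing a hand-built [Payload] value against one generated by SignalR itself.
using System;
using System.IO;

static void DumpScaleoutPayload(byte[] payload)
{
    using (var ms = new MemoryStream(payload))
    using (var reader = new BinaryReader(ms))
    {
        int messageCount = reader.ReadInt32();         // ToBytes writes Messages.Count first
        for (int i = 0; i < messageCount; i++)
        {
            string source = reader.ReadString();       // BinaryWriter strings are length-prefixed
            string key = reader.ReadString();
            int valueLength = reader.ReadInt32();      // the value is "int32 length + raw bytes"
            byte[] value = reader.ReadBytes(valueLength);
            string commandId = reader.ReadString();
            bool waitForAck = reader.ReadBoolean();
            bool isAck = reader.ReadBoolean();
            string filter = reader.ReadString();
            Console.WriteLine("{0} / {1}: {2} bytes", source, key, value.Length);
        }
        long serverCreationTicks = reader.ReadInt64(); // ServerCreationTime.Ticks comes last
    }
}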

Hacky Sql Compact Workaround

So, I'm trying to use ADO.NET to stream file data stored in an image column in a SQL Compact database.
To do this, I wrote a DataReaderStream class that takes a data reader, opened for sequential access, and represents it as a stream, redirecting calls to Read(...) on the stream to IDataReader.GetBytes(...).
One "weird" aspect of IDataReader.GetBytes(...), when compared to the Stream class, is that GetBytes requires the client to increment an offset and pass that in each time it's called. It does this even though access is sequential, and it's not possible to read "backwards" in the data reader stream.
The SqlCeDataReader implementation of IDataReader enforces this by incrementing an internal counter that identifies the total number of bytes it has returned. If you pass in a number either less than or greater than that number, the method will throw an InvalidOperationException.
The problem with this, however, is that there is a bug in the SqlCeDataReader implementation that causes it to set the internal counter to the wrong value. This results in subsequent calls to Read on my stream throwing exceptions when they shouldn't be.
I found some information about the bug on this MSDN thread.
I was able to come up with a disgusting, horribly hacky workaround that basically uses reflection to update the field in the class to the correct value.
The code looks like this:
public override int Read(byte[] buffer, int offset, int count)
{
    m_length = m_length ?? m_dr.GetBytes(0, 0, null, offset, count);
    if (m_fieldOffSet < m_length)
    {
        var bytesRead = m_dr.GetBytes(0, m_fieldOffSet, buffer, offset, count);
        m_fieldOffSet += bytesRead;
        if (m_dr is SqlCeDataReader)
        {
            //BEGIN HACK
            //This is a horrible HACK.
            m_field = m_field ?? typeof (SqlCeDataReader).GetField("sequentialUnitsRead", BindingFlags.NonPublic | BindingFlags.Instance);
            var length = (long)(m_field.GetValue(m_dr));
            if (length != m_fieldOffSet)
            {
                m_field.SetValue(m_dr, m_fieldOffSet);
            }
            //END HACK
        }
        return (int) bytesRead;
    }
    else
    {
        return 0;
    }
}
For obvious reasons, I would prefer to not use this.
However, I do not want to buffer the entire contents of the blob in memory either.
Does any one know of a way I can get streaming data out of a SQL Compact database without having to resort to such horrible code?
I contacted Microsoft (through the SQL Compact Blog) and they confirmed the bug, and suggested I use OLEDB as a workaround. So, I'll try that and see if that works for me.
Actually, I decided to fix the problem by just not storing blobs in the database to begin with.
This eliminates the problem (I can stream data from a file), and also avoids some issues I might have run into with SQL Compact's 4 GB size limit.