How to show the public key for a website certificate - ssl

I've created a Go program to connect to a website and get the certificates it uses. I'm not sure how to get the correct representation of the public key.
I can fetch the certificate and I can type check on Certificate.PublicKey. Once I understand it's rsa.PublicKey or ecdsa.PublicKey I'd need to print the hex representation of it.
switch cert.PublicKey.(type) {
case *rsa.PublicKey:
logrus.Error("this is RSA")
// TODO: print hex representation of key
case *ecdsa.PublicKey:
logrus.Error("this is ECDSA")
// TODO: print hex representation of key
default:
fmt.Println("it's something else")
}
I'd expect it to print something like:
04 4B F9 47 1B A8 A8 CB A4 C6 C0 2D 45 DE 43 F3 BC F5 D2 98 F4 25 90 6F 13 0D 78 1A AC 05 B4 DF 7B F6 06 5C 80 97 9A 53 06 D0 DB 0E 15 AD 03 DE 14 09 D3 77 54 B1 4E 15 A8 AF E3 FD DC 9D AD E0 C5

It seems you are asking for the SHA-1 fingerprint of the certificates involved.
Here is a working example that takes a host:port argument and prints the fingerprints of the certificates the server presents:
package main
import (
"crypto/sha1"
"crypto/tls"
"fmt"
"log"
"os"
)
func main() {
if len(os.Args) != 2 {
log.Panic("call with argument of host:port")
}
log.SetFlags(log.Lshortfile)
conf := &tls.Config{
//InsecureSkipVerify: true,
}
fmt.Printf("dialing:%s\n", os.Args[1])
conn, err := tls.Dial("tcp", os.Args[1], conf)
if err != nil {
log.Println(err)
return
}
defer conn.Close()
for i, v := range conn.ConnectionState().PeerCertificates {
//edit: use %X for uppercase hex printing
fmt.Printf("cert %d sha1 fingerprint:%x \n", i, sha1.Sum(v.Raw))
}
}
run as:
./golang-tls www.google.com:443
dialing:www.google.com:443
cert 0 sha1 fingerprint:34781c3be98cf958f514aecb1ae2e4e866effe34
cert 1 sha1 fingerprint:eeacbd0cb452819577911e1e6203db262f84a318
For general notions on SSL, I have found this Stack Exchange answer to be extremely valuable.

Related

Convert hex to latitude longitude from dashcam .MOV file

I have an OEM dashcam that saves videos in .MOV format. I want to extract the embedded GPS data from the video files programmatically. Upon opening the .mov file in a hex editor I found data packets with freeGPS headers, and I can confirm that my 7-second sample video contains 7 packets, so I know this is where the GPS data comes from.
I already found the date and time but I got stuck with converting the hex values to latitude longitude. Below are the hex values and their equivalent coordinates when extracted using Registrator Viewer.
273108AC1C7996404E,0D022B873EA3C74045 - 14.637967,121.041475
516B9A771C7996404E,0D022B873EA3C74045 - 14.637963,121.041475
B9FC87F41B7996404E,52499D803EA3C74045 - 14.637955,121.041472
B9FC87F41B7996404E,52499D803EA3C74045 - 14.637955,121.041472
B459F5B91A7996404E,C442AD693EA3C74045 - 14.637935,121.041460
1DEBE2361A7996404E,ACADD85F3EA3C74045 - 14.637927,121.041455
08CE19511A7996404E,4FD1915C3EA3C74045 - 14.637928,121.041453
The trailing bytes 4E and 45 translate directly to the ASCII characters N and E, so I think they are not part of the conversion. I already tried the answers below but did not succeed in getting the correct coordinates.
How to convert GPS Longitude and latitude from hex
How to convert my binary (hex) data to latitude and longitude?
I already sent an email to the dashcam provider asking for their protocol documentation but it does not look like they have one since they sent Registrator Viewer when I asked for their own video player.
I will also include the first freeGPS packet in case I am looking at the wrong place.
00 00 80 00 66 72 65 65 47 50 53 20 98 00 00 00 78 2E 78 78 00 00 00 00 00 00 00 00 00 00 00 00 30 30 30 30 30 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 03 00 00 00 27 00 00 00 41 00 00 00 27 31 08 AC 1C 79 96 40 4E 00 00 00 00 00 00 00 0D 02 2B 87 3E A3 C7 40 45 00 00 00 00 00 00 00 8F C2 F5 28 5C 8F E2 3F 48 E1 7A 14 AE 07 68 40 11 00 00 00 06 00 00 00 14 00 00 00 76 00 00 00 88 00 00 00 DE 00 00 00 4D 00 00 00 49 00 00 00 4F 00 00 00 2D 00 00 00 2B 00 00 00 00 00 00 00
Data in bold extracted in order : freeGPS, time, latitude#N?, longitude#E?, date
I can confirm that time and date are correct. The speed is supposed to be 1 km/h, but I can't find that either.
Thanks in advance for those who can help.
EDIT:
Here is the link for the test video. Test Video
Not sure how helpful this is but if you convert the hex string (including the 0x40 at the end) into a little endian IEEE-754 double precision real, then you get approximately 100X the lat/lon.
In C, I did something like
printf( "%f\n", *(double*)"\x27\x31\x08\xAC\x1C\x79\x96\x40" );
printf( "%f\n", *(double*)"\x0D\x02\x2B\x87\x3E\xA3\xC7\x40" );
and got out
1438.278000
12102.488500
Update thanks to @TDG
If the 1438.278 is interpreted as 14 degrees 38.278 minutes then you get a decimal value of 14.6379666666666667. If 12102.4885 is interpreted as 121 degrees and 2.4885 minutes, the decimal equivalent is 121.041475.
Some example C code to do this
#include<stdio.h>
double convert( double input ) {
int i = input/100;
return ( input - i*100 ) / 60 + i;
}
int main(){
printf( "%f\n", convert( *(double*)"\x27\x31\x08\xAC\x1C\x79\x96\x40" ) );
printf( "%f\n", convert( *(double*)"\x0D\x02\x2B\x87\x3E\xA3\xC7\x40" ) );
}
Found this explanation of GPS data format, worked for me.
public static ViofoGpsPoint Parse(uint offset, uint size, byte[] file)
{
byte[] data = new byte[size];
Array.Copy(file, offset, data, 0, size);
uint pos = 0;
uint size1 = Box.ReadUintBE(data, pos); pos += 4;
string type = Encoding.ASCII.GetString(data, (int)pos, 4); pos += 4;
string magic = Encoding.ASCII.GetString(data, (int)pos, 4); pos += 4;
if (size != size1 || type != "free" || magic != "GPS ")
return null;
ViofoGpsPoint gps = new ViofoGpsPoint();
//# checking for weird Azdome 0xAA XOR "encrypted" GPS data.
//This portion is a quick fix.
uint payload_size = 254;
if (data[pos] == 0x05)
{
if (size < 254)
payload_size = size;
byte[] payload = new byte[payload_size];
pos += 6; //???
for (int i = 0; i < payload_size; i++)
{
payload[i] = (byte)(file[pos + i] ^ 0xAA);
}
}
else if ((char)data[pos] == 'L')
{
const uint OFFSET_V2 = 48, OFFSET_V1 = 16;
pos = OFFSET_V2;
//# Datetime data
int hour = (int)Box.ReadUintLE(data, pos); pos += 4;
int minute = (int)Box.ReadUintLE(data, pos); pos += 4;
int second = (int)Box.ReadUintLE(data, pos); pos += 4;
int year = (int)Box.ReadUintLE(data, pos); pos += 4;
int month = (int)Box.ReadUintLE(data, pos); pos += 4;
int day = (int)Box.ReadUintLE(data, pos); pos += 4;
try { gps.Date = new DateTime(2000 + year, month, day, hour, minute, second); }
catch (Exception err) { Debug.WriteLine(err.ToString()); return null; }
//# Coordinate data
char active = (char)data[pos]; pos++;
gps.IsActive = (active == 'A');
gps.Latitude_hemisphere = (char)data[pos]; pos++;
gps.Longtitude_hemisphere = (char)data[pos]; pos++;
gps.Unknown = data[pos]; pos++;
float lat = Box.ReadFloatLE(data, pos); pos += 4;
gps.Latitude = FixCoordinate(lat, gps.Latitude_hemisphere);
float lon = Box.ReadFloatLE(data, pos); pos += 4;
gps.Longtitude = FixCoordinate(lon, gps.Longtitude_hemisphere);
gps.Speed = Box.ReadFloatLE(data, pos); pos += 4;
gps.Bearing = Box.ReadFloatLE(data, pos); pos += 4;
return gps;
}
return null;
}
/// <summary>
/// # Novatek stores coordinates in odd DDDmm.mmmm format
/// </summary>
/// <param name="coord"></param>
/// <param name="hemisphere"></param>
/// <returns></returns>
private static double FixCoordinate(double coord, char hemisphere)
{
double minutes = coord % 100.0;
double degrees = coord - minutes;
double coordinate = degrees / 100.0 + (minutes / 60.0);
if (hemisphere == 'S' || hemisphere == 'W')
return -1 * (coordinate);
else
return (coordinate);
}

Converting Hex String to UInt8 Array in Objective-C

The string to convert to bytes[]:
NSString *strData = @"ca 20 fe c1 04 03 03 07 01 ac";
The expected result is:
UInt8 bytes[]= {0xca,0x20,0xfe,0xc1,0x04,0x03,0x03,0x07,0x01,0xac};
I have the following Swift code. How do I convert it to Objective-C?
let data = NSString(string : "ca 20 fe c1 04 03 03 07 01 ac");
let dataArr = data.componentsSeparatedByString(" ")
var bytes : [UInt8] = [];
for item in dataArr {
let byte = UInt8(item, radix: 16)
bytes.append(byte!);
}
let hexData = NSData(bytes: bytes, length: 10)

TI C6x DSP intrinsics for optimising C code

I want to use C66x intrinsics to optimise my code.
Below is some C code that I want to optimise using DSP intrinsics.
I am new to DSP intrinsics, so I do not know which intrinsic to use for the logic below.
uint8_t const src[40] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40};
uint32_t width = 8;
uint32_t axay1_6 = 112345;
uint32_t axay2_6 = 123456;
uint32_t axay3_6 = 134567;
uint32_t axay4_6 = 145678;
C code:
uint8_t const *cLine = src;
uint8_t const *nLine = cLine + width;
uint32_t res = 0;
const uint32_t a1 = (*cLine++) * axay1_6;
const uint32_t a3 = (*nLine++) * axay3_6;
res = a1 + a3;
const uint32_t a2 = (*cLine) * axay2_6;
const uint32_t a4 = (*nLine) * axay4_6;
res += a2 + a4;
C66x intrinsics:
const uint8_t *Ix00, *Ix01, *Iy00,*Iy01;
uint32_t in1,in2;
uint64_t l1, l2;
__x128_t axay1_6 = _dup32_128(axay1_6); //112345 112345 112345 112345
__x128_t axay2_6 = _dup32_128(axay2_6); //123456 123456 123456 123456
__x128_t axay3_6 = _dup32_128(axay3_6); //134567 134567 134567 134567
__x128_t axay4_6 = _dup32_128(axay4_6); //145678 145678 145678 145678
Ix00 = src ;
Ix01 = Ix00 + 1 ;
Iy00 = src + width;
Iy01 = Iy00 + 1;
int64_t I_00 = _mem8_const(Ix00); //00 01 02 03 04 05 06 07
int64_t I_01 = _mem8_const(Ix01); //01 02 03 04 05 06 07 08
int64_t I_10 = _mem8_const(Iy00); //10 11 12 13 14 15 16 17
int64_t I_11 = _mem8_const(Iy01); //11 12 13 14 15 16 17 18
in1 = _loll(I_00); //00 01 02 03
l1 = _unpkbu4(in1); //00 01 02 03 (16x4)
in2 = _hill(I_00); //04 05 06 07
l2 = _unpkbu4(in2); //04 05 06 07 (16x4)
Here I want an __x128_t register holding four 32-bit values with the data "00 01 02 03", so that I can multiply one __x128_t register by another and get an __x128_t result. At present I am planning to use _qmpy32.
Can you tell me which intrinsic is suitable for getting an __x128_t register with 32x4 values containing 00 01 02 03?
(In other words, how do I widen 16-bit values to 32-bit using DSP intrinsics?)
Use the _unpkhu2 instruction to expand the 16x4 to 32x4.
__x128_t src1_128, src2_128;
src1_128 = _llto128(_unpkhu2(_hill(l1)), _unpkhu2(_loll(l1)));
src2_128 = _llto128(_unpkhu2(_hill(l2)), _unpkhu2(_loll(l2)));
Be careful: Little-endian/Big-endian settings can make these sorts of things come out in a way you didn't expect.
Also, I wouldn't recommend naming a variable l1. In some fonts, lower-case L and the number 1 are indistinguishable.

Using pyOpenSSL to generate p12 / pfx containers

I have just started using the pyOpenSSL library to generate certificates and to read existing certs. However, I want to generate a p12/pfx bundle in my program instead of the standard PEM files. I wasn't able to find the appropriate API for this, only one for dumping PKCS12 objects. Can anyone let me know how to do this?
Thanks
Using the example PEM private key data in privkeydata and certificate data in certdata (which I moved to the bottom of the answer for better readability), I think the following is what you are looking for:
>>> from OpenSSL import crypto
>>> cert = crypto.load_certificate(crypto.FILETYPE_PEM, certdata)
>>> privkey = crypto.load_privatekey(crypto.FILETYPE_PEM, privkeydata)
>>> pfx = crypto.PKCS12Type()
>>> pfx.set_privatekey(privkey)
>>> pfx.set_certificate(cert)
>>> pfxdata = pfx.export('passphrase')
>>> with open('test.pfx', 'wb') as pfxfile:
...     pfxfile.write(pfxdata)
...
>>>
Checking the result by invoking openssl in the shell:
$ openssl pkcs12 -info -in test.pfx -passin pass:passphrase -passout pass:otherpassphrase
MAC Iteration 1
MAC verified OK
PKCS7 Encrypted data: pbeWithSHA1And3-KeyTripleDES-CBC, Iteration 2048
Certificate bag
Bag Attributes
localKeyID: 97 AD B9 5B EC 5B BA 6D BC F7 D3 06 EA CC 12 A1 52 AE 90 7B
subject=/C=nl/ST=Noord-Holland/O=Mobilefish.com/L=Zaandam/OU=Marketing/CN=www.mobilefish.com/emailAddress=contact@mobilefish.com
issuer=/C=nl/ST=Noord-Holland/O=Mobilefish.com/L=Zaandam/OU=Marketing/CN=www.mobilefish.com/emailAddress=contact@mobilefish.com
-----BEGIN CERTIFICATE-----
MIID0zCCAzygAwIBAgIBADANBgkqhkiG9w0BAQQFADCBqDELMAkGA1UEBhMCbmwx
FjAUBgNVBAgTDU5vb3JkLUhvbGxhbmQxFzAVBgNVBAoTDk1vYmlsZWZpc2guY29t
MRAwDgYDVQQHEwdaYWFuZGFtMRIwEAYDVQQLEwlNYXJrZXRpbmcxGzAZBgNVBAMT
End3dy5tb2JpbGVmaXNoLmNvbTElMCMGCSqGSIb3DQEJARYWY29udGFjdEBtb2Jp
bGVmaXNoLmNvbTAeFw0xNTExMTQwMjAyNDlaFw0xNjExMTMwMjAyNDlaMIGoMQsw
CQYDVQQGEwJubDEWMBQGA1UECBMNTm9vcmQtSG9sbGFuZDEXMBUGA1UEChMOTW9i
aWxlZmlzaC5jb20xEDAOBgNVBAcTB1phYW5kYW0xEjAQBgNVBAsTCU1hcmtldGlu
ZzEbMBkGA1UEAxMSd3d3Lm1vYmlsZWZpc2guY29tMSUwIwYJKoZIhvcNAQkBFhZj
b250YWN0QG1vYmlsZWZpc2guY29tMIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKB
gQC2Yw+5xKhhelVmH7Weu9eMhreuRvQXuNsyi5SA0sBXboOybox5oJZAWbL84KN5
gX1qN7U62szotl3K49bRlzbKu/TmcVdJYlRlnwusL5XQJDKv+uERlUU0QDXeswEu
M93UxkeN/j0vKfjp8k/Ny4qc5pNOT/dqNRyx01pVFV8NFwIDAQABo4IBCTCCAQUw
HQYDVR0OBBYEFKEXjyTmz/vOVxHbtJCJUraUZhxsMIHVBgNVHSMEgc0wgcqAFKEX
jyTmz/vOVxHbtJCJUraUZhxsoYGupIGrMIGoMQswCQYDVQQGEwJubDEWMBQGA1UE
CBMNTm9vcmQtSG9sbGFuZDEXMBUGA1UEChMOTW9iaWxlZmlzaC5jb20xEDAOBgNV
BAcTB1phYW5kYW0xEjAQBgNVBAsTCU1hcmtldGluZzEbMBkGA1UEAxMSd3d3Lm1v
YmlsZWZpc2guY29tMSUwIwYJKoZIhvcNAQkBFhZjb250YWN0QG1vYmlsZWZpc2gu
Y29tggEAMAwGA1UdEwQFMAMBAf8wDQYJKoZIhvcNAQEEBQADgYEAanK63a/8Emwl
v4i8XI57hkt3Iq0NbMveGT01DrBiRUJ/Uf7jpS+j4blcaUUJ6JuOk+wrwYZIZqZE
9mHfiPKMNps22OYXoHkaZPcxtofpyTGE2tnW2ReauTKCVPSczQPqn7mhBG2t6TJs
YBpp0s2I/q7a4bVbowibPbO3RK1kBcA=
-----END CERTIFICATE-----
PKCS7 Data
Shrouded Keybag: pbeWithSHA1And3-KeyTripleDES-CBC, Iteration 2048
Bag Attributes
localKeyID: 97 AD B9 5B EC 5B BA 6D BC F7 D3 06 EA CC 12 A1 52 AE 90 7B
Key Attributes: <No Attributes>
-----BEGIN ENCRYPTED PRIVATE KEY-----
MIICxjBABgkqhkiG9w0BBQ0wMzAbBgkqhkiG9w0BBQwwDgQIQ4sDzexzf6gCAggA
MBQGCCqGSIb3DQMHBAjmWBnhSdfEJgSCAoCQMrLa0Y+V3zrgRtjesa6Er/dJFz40
rpN2unNBpdrFMkuEIcCAnlNoLKJpe3x20ly4QrYaDG7sxMbdxnr3jqf4Jy0TxgnC
nC5x8hDhIV+M7gnXQiiGTK2VPDeJ2n3/hmmIEgleBOSdbz39O1Ik52+E47Fee+pB
W9b2au/p8NUE66v7JgN+VQVG6EcXCsyFkFivl1O+eokcTwa9q3sqPW+xTiPJ43LH
yKAvjT7vWOYark6QK8Gcth4Y8FdKMA6kHNim/LAtl4Vc1Af5qHMubBO1C+Avw0HE
Qt3DP/mkdwLYjisBbqjpAFkTsdEuMIwyhuExCSu0w+QfxjVAezyC6y+7IWfBfRpG
j9+MNy9qe0DqKIQ/P09GeoXJH8Yy0RQiA1XpQBcGSuRHj6B3lWUlxtTlGlTmxlzO
yPDJXxaUmMNTCNQlYu7CBj2FOXXewAuGi0nv8/bbZpWxSgyZcVcJlCtYZq+9NmYv
RhGwfhWuNsQZQmtFDgtpg/GYD8TFV6oc6mmTurBkLEL2KGCnPWVRH8xyJeb87/EF
/H/2gA5P9aS/K3cN3OsgC5uUi38jgFZ2p69TPNLjxBHK5HakaCgh1Txdx9dcAoMt
lA/GRBu/CoqA48O4vV3RyrB0ZNSYyAYTuVRjJ+50d427InaUwrwaYCakpbxXKrlH
jvb2gKtXnvIpNnE32N1whORBGU+srEO8tz/Il5AYrZ21ESIixX9pftAgIiEMc7Xw
WmV3NexkHZGvyCG1vq62LzNxgEBN3Ng013gYdLXbO1y/pXcSRHGRdidvIwYefBbs
Yo6yvsUgdtfeAwlCC+ojgB6rTKhlbk2Yex6y9sxRCSMHibiwnveuNez+
-----END ENCRYPTED PRIVATE KEY-----
The example PEMs are created on and copy/pasted from mobilefish:
>>> certdata = """-----BEGIN CERTIFICATE-----
... MIID0zCCAzygAwIBAgIBADANBgkqhkiG9w0BAQQFADCBqDELMAkGA1UEBhMCbmwx
... FjAUBgNVBAgTDU5vb3JkLUhvbGxhbmQxFzAVBgNVBAoTDk1vYmlsZWZpc2guY29t
... MRAwDgYDVQQHEwdaYWFuZGFtMRIwEAYDVQQLEwlNYXJrZXRpbmcxGzAZBgNVBAMT
... End3dy5tb2JpbGVmaXNoLmNvbTElMCMGCSqGSIb3DQEJARYWY29udGFjdEBtb2Jp
... bGVmaXNoLmNvbTAeFw0xNTExMTQwMjAyNDlaFw0xNjExMTMwMjAyNDlaMIGoMQsw
... CQYDVQQGEwJubDEWMBQGA1UECBMNTm9vcmQtSG9sbGFuZDEXMBUGA1UEChMOTW9i
... aWxlZmlzaC5jb20xEDAOBgNVBAcTB1phYW5kYW0xEjAQBgNVBAsTCU1hcmtldGlu
... ZzEbMBkGA1UEAxMSd3d3Lm1vYmlsZWZpc2guY29tMSUwIwYJKoZIhvcNAQkBFhZj
... b250YWN0QG1vYmlsZWZpc2guY29tMIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKB
... gQC2Yw+5xKhhelVmH7Weu9eMhreuRvQXuNsyi5SA0sBXboOybox5oJZAWbL84KN5
... gX1qN7U62szotl3K49bRlzbKu/TmcVdJYlRlnwusL5XQJDKv+uERlUU0QDXeswEu
... M93UxkeN/j0vKfjp8k/Ny4qc5pNOT/dqNRyx01pVFV8NFwIDAQABo4IBCTCCAQUw
... HQYDVR0OBBYEFKEXjyTmz/vOVxHbtJCJUraUZhxsMIHVBgNVHSMEgc0wgcqAFKEX
... jyTmz/vOVxHbtJCJUraUZhxsoYGupIGrMIGoMQswCQYDVQQGEwJubDEWMBQGA1UE
... CBMNTm9vcmQtSG9sbGFuZDEXMBUGA1UEChMOTW9iaWxlZmlzaC5jb20xEDAOBgNV
... BAcTB1phYW5kYW0xEjAQBgNVBAsTCU1hcmtldGluZzEbMBkGA1UEAxMSd3d3Lm1v
... YmlsZWZpc2guY29tMSUwIwYJKoZIhvcNAQkBFhZjb250YWN0QG1vYmlsZWZpc2gu
... Y29tggEAMAwGA1UdEwQFMAMBAf8wDQYJKoZIhvcNAQEEBQADgYEAanK63a/8Emwl
... v4i8XI57hkt3Iq0NbMveGT01DrBiRUJ/Uf7jpS+j4blcaUUJ6JuOk+wrwYZIZqZE
... 9mHfiPKMNps22OYXoHkaZPcxtofpyTGE2tnW2ReauTKCVPSczQPqn7mhBG2t6TJs
... YBpp0s2I/q7a4bVbowibPbO3RK1kBcA=
... -----END CERTIFICATE-----"""
>>> privkeydata = """-----BEGIN RSA PRIVATE KEY-----
... MIICXAIBAAKBgQC2Yw+5xKhhelVmH7Weu9eMhreuRvQXuNsyi5SA0sBXboOybox5
... oJZAWbL84KN5gX1qN7U62szotl3K49bRlzbKu/TmcVdJYlRlnwusL5XQJDKv+uER
... lUU0QDXeswEuM93UxkeN/j0vKfjp8k/Ny4qc5pNOT/dqNRyx01pVFV8NFwIDAQAB
... AoGBAIzWW/tYV6nGHJHapJWpeZ4DHW2PTsfOsD0MuaTsmSgqp7muUf1Nuxh/644I
... LVQTYPQXhnOnJ5n/0NduLqD0ApMk2IAdP0w224Yk3HJaMTu/KgOMj7gyDJvUOncY
... GNoxRZ9Fz/ByNUdL+OmZdECaSbcVR/PftYlduEFdy5PEcGBBAkEA8ab14UgMz7Tw
... 5zy32QWljTlmLBAuFZ73tbxNpDlX4WtP3ye1eAGm2usNVjf9vtfpfXspicgPI9z8
... Va2en2q1twJBAME3SZw/pmhijjn8+0FLO7ieooHfnEJ7XZWeEVnPU9cW66fe6EqN
... foToJadmU6avWFiIRYPazRECCgzOxkDrY6ECQCXzBmIeooRr8fkee/DFBj6raPQ6
... hkI2+Me9jqPfrYFlDOIKpmD2QXHXv/xuRpcV6UEfemJ83IPRTH9YCLUYWPkCQEu8
... eT0m8fquzyNJ188DR3iZrgeMeDrTEp7oI9L5YtrH4D2gMZuvlO1R9hiFErsetlmV
... qPIDXSiSjQ/yKWIfIqECQH8Q7WuTIpNbJjoMOoLZ18NqTDPFOG/L0BFeb/ovMZ06
... LNLN9K1eJ0ZQUHy447A3auCeMhJLG8JfBG7Kjk4wul4=
... -----END RSA PRIVATE KEY-----"""

Optimization using NEON assembly

I am trying to optimize some parts of OpenCV code using NEON. Here is the original code block I work on. (Note: If it is of any importance, you can find the full source at "opencvfolder/modules/video/src/lkpyramid.cpp". It is an implementation of an object tracking algorithm.)
for( ; x < colsn; x++ )
{
deriv_type t0 = (deriv_type)(trow0[x+cn] - trow0[x-cn]);
deriv_type t1 = (deriv_type)((trow1[x+cn] + trow1[x-cn])*3 + trow1[x]*10);
drow[x*2] = t0; drow[x*2+1] = t1;
}
In this code, deriv_type is 2 bytes in size.
And here is the NEON assembly I have written. With the original code I measure 10-11 fps. With NEON it is worse: I only get 5-6 fps. I don't really know much about NEON, so there are probably lots of mistakes in this code. What am I doing wrong? Thanks
for( ; x < colsn; x+=4 )
{
__asm__ __volatile__(
"vld1.16 d2, [%2] \n\t" // d2 = trow0[x+cn]
"vld1.16 d3, [%3] \n\t" // d3 = trow0[x-cn]
"vsub.i16 d9, d2, d3 \n\t" // d9 = d2 - d3
"vld1.16 d4, [%4] \n\t" // d4 = trow1[x+cn]
"vld1.16 d5, [%5] \n\t" // d5 = trow1[x-cn]
"vld1.16 d6, [%6] \n\t" // d6 = trow1[x]
"vmov.i16 d7, #3 \n\t" // d7 = 3
"vmov.i16 d8, #10 \n\t" // d8 = 10
"vadd.i16 d4, d4, d5 \n\t" // d4 = d4 + d5
"vmul.i16 d10, d4, d7 \n\t" // d10 = d4 * d7
"vmla.i16 d10, d6, d8 \n\t" // d10 = d10 + d6 * d8
"vst2.16 {d9,d10}, [%0] \n\t" // drow[x*2] = d9; drow[x*2+1] = d10;
//"vst1.16 d4, [%1] \n\t"
: //output
:"r"(drow+x*2), "r"(drow+x*2+1), "r"(trow0+x+cn), "r"(trow0+x-cn), "r"(trow1+x+cn), "r"(trow1+x-cn), "r"(trow1) //input
:"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10" //registers
);
}
EDIT
This is the version with intrinsics. It is almost the same as before, and it is still slow.
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x+=8 )
{
int16x8x2_t loaded;
int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
loaded.val[0] = vsubq_s16(t0a, t0b); // t0 = (trow0[x + cn] - trow0[x - cn])
loaded.val[1] = vld1q_s16(&trow1[x + cn]);
int16x8_t t1b = vld1q_s16(&trow1[x - cn]);
int16x8_t t1c = vld1q_s16(&trow1[x]);
loaded.val[1] = vaddq_s16(loaded.val[1], t1b);
loaded.val[1] = vmulq_s16(loaded.val[1], vk3);
loaded.val[1] = vmlaq_s16(loaded.val[1], t1c, vk10);
}
You're creating a lot of pipeline stalls due to data hazards. For example these three instructions:
"vadd.i16 d4, d4, d5 \n\t" // d4 = d4 + d5
"vmul.i16 d10, d4, d7 \n\t" // d10 = d4 * d7
"vmla.i16 d10, d6, d8 \n\t" // d10 = d10 + d6 * d8
They each only take 1 cycle to issue, but there are several-cycle stalls between them because the results are not ready (NEON instruction scheduling).
Try unrolling the loop a few times and interleaving the instructions of the unrolled iterations. The compiler might do this for you if you use intrinsics. It's not impossible to beat the compiler at instruction scheduling etc., but it is quite hard and not often worth it (this might fall under not optimizing prematurely).
EDIT
Your intrinsic code is reasonable, I suspect the compiler is just not doing a very good job. Take a look at the assembly code it's producing (objdump -d) and you will probably see that it's also creating a lot of pipeline hazards. A later version of the compiler may help, but if it doesn't you might have to modify the loop yourself to hide the latency of the results (you will need the instruction timings). Keep the current code around, as it is correct and should be optimisable by a clever compiler.
You might end up with something like:
// do step 1 of first iteration
// ...
for (int i = 0; i < n - 1; i++) {
// do step 1 of (i+1)th
// do step 2 of (i)th
// with their instructions interleaved
// ...
}
// do step 2 of (n-1)th
// ...
You can also split the loop into more than 2 steps, or unroll the loop a few times (e.g. change i++ to i+=2, double the body of the loop, changing i to i+1 in the second half). I hope this answer helps, let me know if anything is unclear!
There is some loop invariant stuff there that needs to be moved outside the for loop - this may help a little.
You could also consider using full-width SIMD operations, so that you can process 8 points per loop iteration rather than 4.
Most importantly though, you should probably be using intrinsics rather than raw asm, so that the compiler can take care of peephole optimisation, register allocation, instruction scheduling, loop unrolling, etc.
E.g.
// constants - init outside loop
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x += 8)
{
int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
int16x8_t t0 = vsubq_s16(t0a, t0b); // t0 = (trow0[x + cn] - trow0[x - cn])
// ...
}