I heard that PSM is a library supporting tag matching.
What is a tag-matching interface? Why is tag matching important for performance in the context of MPI?
A short intro to tag matching for MPI: https://www.hpcwire.com/2006/08/18/a_critique_of_rdma-1/ , section "Matching":
MPI is a two-sided interface with a large matching space: a MPI Recv is associated with a MPI Send according to several criteria such as Sender, Tag, and Context, with the first two possibly ignored (wildcard). Matching is not necessarily in order and, worse, a MPI Send can be posted before the matching MPI Recv ... MPI requires 64 bits of matching information, and MX, Portals, and QsNet provide such a matching capability.
InfiniBand Verbs and other RDMA-based APIs do not support matching at all
So, it sounds like PSM is a way to add fast matching to InfiniBand-style network adapters (the first versions do the matching in software, with the possibility of moving part of it into hardware).
I can't find public documentation of PSM (there are no details in the User Guide, http://storusint.com/pdf/qlogic/InfiniPath%20User%20Guide%202_0.pdf).
But the sources of the library are available: https://github.com/01org/psm
Some details are listed in the PSM2 presentation https://www.openfabrics.org/images/eventpresos/2016presentations/304PSM2Features.pdf:
What is PSM?
Matched Queue (MQ) component
• Semantically matched to the needs of MPI using tag matching
• Provides calls for communication progress guarantees
• MQ completion semantics (standard vs. synchronized)
PSM API
• Global tag matching API with 64-bit tags
• Scale up to 64K processes per job
• MQ APIs provide point-to-point message passing between endpoints
• e.g. psm_mq_send, psm_mq_irecv
• No “recvfrom” functionality – needed by some applications
So, there are 64-bit tags. Every message carries a tag, and every receive posted to the Matched Queue carries a tag (some tag-matching implementations also have a tag mask). According to the source psm_mq_internal.h, mq_req_match() (https://github.com/01org/psm/blob/67c0807c74e9d445900d5541358f0f575f22a630/psm_mq_internal.h#L381), there is a mask in PSM as well:
struct psm_mq_req {
    ...
    /* Tag matching vars */
    uint64_t tag;
    uint64_t tagsel; /* used for receives */
    ...
};
/* note: psm_mq_req_t is a pointer typedef: typedef struct psm_mq_req *psm_mq_req_t; */
psm_mq_req_t
mq_req_match(struct mqsq *q, uint64_t tag, int remove)
{
    psm_mq_req_t *curp;
    psm_mq_req_t cur;

    for (curp = &q->first; (cur = *curp) != NULL; curp = &cur->next) {
        if (!((tag ^ cur->tag) & cur->tagsel)) { /* match! */
            if (remove) {
                if ((*curp = cur->next) == NULL) /* fix tail */
                    q->lastp = curp;
                cur->next = NULL;
            }
            return cur;
        }
    }
    return NULL; /* no match */
}
So a match occurs when the incoming tag is XORed with the tag of a receive posted to the MQ and the result is ANDed with that receive's tagsel. If only zero bits remain, a match is found; otherwise the next posted receive is tried.
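A quick self-contained illustration of that rule with made-up values:

#include <stdint.h>
#include <stdio.h>

/* The PSM match rule: ((incoming_tag ^ rtag) & rtagsel) == 0.
 * Bits set in rtagsel must match exactly; zero bits are wildcards. */
int main(void)
{
    uint64_t rtag = 0x00AB, rtagsel = 0x00FF; /* only the low byte must match */
    printf("%d\n", !((0x12AB ^ rtag) & rtagsel)); /* 1: low bytes equal */
    printf("%d\n", !((0x12AC ^ rtag) & rtagsel)); /* 0: low bytes differ */
    return 0;
}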
Comment from psm_mq.h on the psm_mq_irecv() function, https://github.com/01org/psm/blob/4abbc60ab02c51efee91575605b3430059f71ab8/psm_mq.h#L206:
/* Post a receive to a Matched Queue with tag selection criteria
*
* Function to receive a non-blocking MQ message by providing a preposted
* buffer. For every MQ message received on a particular MQ, the tag and
* tagsel parameters are used against the incoming message's send tag as
* described in tagmatch.
*
* [in] mq Matched Queue Handle
* [in] rtag Receive tag
* [in] rtagsel Receive tag selector
* [in] flags Receive flags (None currently supported)
* [in] buf Receive buffer
* [in] len Receive buffer length
* [in] context User context pointer, available in psm_mq_status_t
* upon completion
* [out] req PSM MQ Request handle created by the preposted receive, to
* be used for explicitly controlling message receive
* completion.
*
* [post] The supplied receive buffer is given to MQ to match against incoming
* messages unless it is cancelled via psm_mq_cancel before any
* match occurs.
*
* The following error code is returned. Other errors are handled by the PSM
* error handler (psm_error_register_handler).
*
* [retval] PSM_OK The receive buffer has successfully been posted to the MQ.
*/
psm_error_t
psm_mq_irecv(psm_mq_t mq, uint64_t rtag, uint64_t rtagsel, uint32_t flags,
void *buf, uint32_t len, void *context, psm_mq_req_t *req);
Example of encoding data into tag:
uint64_t tag = ( ((context_id & 0xffff) << 48) |
                 ((my_rank & 0xffff) << 32) |
                 ((send_tag & 0xffffffff)) );
With the tagsel mask we can express "match anything" (tagsel all zeros), "match exactly" (tagsel all ones), or "match some bits or bytes exactly and ignore the rest".
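For example, here is a minimal sketch of how an MPI implementation could build the rtag/rtagsel pair for psm_mq_irecv() on top of that layout; the helper and its wildcard handling are illustrative, not taken from PSM or any MPI library:

#include <stdint.h>

/* Illustrative only: receive-side tag and selector for the layout
 * (context_id << 48 | rank << 32 | tag) shown above. */
void build_match(uint64_t context_id, uint64_t src_rank, uint64_t mpi_tag,
                 int any_source, int any_tag,
                 uint64_t *rtag, uint64_t *rtagsel)
{
    *rtag = ((context_id & 0xffff) << 48) |
            ((src_rank & 0xffff) << 32) |
            (mpi_tag & 0xffffffff);
    *rtagsel = ~(uint64_t)0;                  /* match every bit exactly */
    if (any_source)
        *rtagsel &= ~0x0000ffff00000000ULL;   /* ignore the rank field */
    if (any_tag)
        *rtagsel &= ~0x00000000ffffffffULL;   /* ignore the MPI tag field */
}

The resulting pair is then passed straight to psm_mq_irecv(mq, rtag, rtagsel, ...) as in the signature above.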
There is a newer PSM2 API, also open source: https://github.com/01org/opa-psm2, with a programmer's guide published at http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf.
In PSM2 the tags are longer, and the matching rule is stated explicitly (stag is the "Message Send Tag", i.e. the tag value sent in the message, and rtag is the tag of the receive request):
https://www.openfabrics.org/images/eventpresos/2016presentations/304PSM2Features.pdf#page=7
Tag matching improvement
• Increased tag size to 96 bits
• Fundamentally ((stag ^ rtag) & rtagsel) == 0
• Supports wildcards such as MPI_ANY_SOURCE or MPI_ANY_TAG using zero bits in rtagsel
• Allows for practically unlimited scalability
• Up to 64M processes per job
PSM2 TAG MATCHING
#define PSM_MQ_TAG_ELEMENTS 3

typedef struct psm2_mq_tag {
    union {
        uint32_t tag[PSM_MQ_TAG_ELEMENTS] __attribute__((aligned(16)));
        struct {
            uint32_t tag0;
            uint32_t tag1;
            uint32_t tag2;
        };
    };
} psm2_mq_tag_t;
• Application fills ‘tag’ array or ‘tag0/tag1/tag2’ and passes to PSM
• Both tag and tag mask use the same 96 bit tag type
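A sketch of what filling the 96-bit tag might look like on the receive side; the assignment of MPI tag, source rank, and context id to tag0/tag1/tag2 is one plausible layout chosen for illustration, not something mandated by PSM2:

#include <stdint.h>

/* Illustrative layout only: tag0 = MPI tag, tag1 = source rank,
 * tag2 = context id. Zero bits in rtagsel act as wildcards. */
void fill_tags(uint32_t mpi_tag, uint32_t src_rank, uint32_t context_id,
               int any_source, int any_tag,
               psm2_mq_tag_t *rtag, psm2_mq_tag_t *rtagsel)
{
    rtag->tag0 = mpi_tag;
    rtag->tag1 = src_rank;
    rtag->tag2 = context_id;
    rtagsel->tag0 = any_tag    ? 0 : 0xffffffff; /* MPI_ANY_TAG:    ignore */
    rtagsel->tag1 = any_source ? 0 : 0xffffffff; /* MPI_ANY_SOURCE: ignore */
    rtagsel->tag2 = 0xffffffff;                  /* context always matches */
}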
And indeed, the source peer address sits right next to the matching variables in the psm2_mq_req struct: https://github.com/01org/opa-psm2/blob/master/psm_mq_internal.h#L180
/* Tag matching vars */
psm2_epaddr_t peer;
psm2_mq_tag_t tag;
psm2_mq_tag_t tagsel; /* used for receives */
And here is the software list scan for a match, mq_list_scan(), called from mq_req_match() (https://github.com/01org/opa-psm2/blob/85c07c656198204c4056e1984779fde98b00ba39/psm_mq_recv.c#L188); the time_threshold argument appears to let the caller scan several candidate lists and keep only the earliest-posted match:
psm2_mq_req_t
mq_list_scan(struct mqq *q, psm2_epaddr_t src, psm2_mq_tag_t *tag, int which, uint64_t *time_threshold)
{
psm2_mq_req_t *curp, cur;
for (curp = &q->first;
((cur = *curp) != NULL) && (cur->timestamp < *time_threshold);
curp = &cur->next[which]) {
if ((cur->peer == PSM2_MQ_ANY_ADDR || src == cur->peer) &&
!((tag->tag[0] ^ cur->tag.tag[0]) & cur->tagsel.tag[0]) &&
!((tag->tag[1] ^ cur->tag.tag[1]) & cur->tagsel.tag[1]) &&
!((tag->tag[2] ^ cur->tag.tag[2]) & cur->tagsel.tag[2])) {
*time_threshold = cur->timestamp;
return cur;
}
}
return NULL;
}
Related
There are two types of messages on the CAN bus: broadcast messages and default messages. Currently I'm using fifo0 for both message types (which works perfectly fine), but I would like to use fifo1 specifically for broadcast messages. Below is my initialization code:
uint8 BspCan_RxFilterConfig(uint32 filterId, uint32 filterMask, uint8 filterBankId, uint8 enableFlag, uint8 fifoAssignment)
{
///\todo Add method for calculating filter on the fly
CAN_FilterTypeDef canBusFilterConfig;
FunctionalState filterEnableFlag = ENABLE;
if(enableFlag == 0)
{
filterEnableFlag = DISABLE;
}
else
{
filterEnableFlag = ENABLE;
}
/*Define filter used to determine if application needs to handle message on the CAN bus or if it should
ignore it. If the selected rx FIFO is changed, the rx functions in this module must also be updated.
Using mask mode with all bits set to "don't care"*/
canBusFilterConfig.FilterBank = filterBankId; //Identification of which of the filter banks to define.
canBusFilterConfig.FilterMode = CAN_FILTERMODE_IDMASK; //Sets whether to filter out messages based on a specific id or a list
canBusFilterConfig.FilterScale = CAN_FILTERSCALE_32BIT; //Sets the width of the filter, 32-bit width means filter applies to full range of std id, extended id, IDE, and RTR bits
canBusFilterConfig.FilterIdHigh = (0xFFFF0000 & filterId)>>16; //For upper 16 bits, dominant bit is expected (logic 0)
canBusFilterConfig.FilterIdLow = 0x0000FFFF & filterId; //For Lower 16 bits, dominant bit is expected (logic 0)
canBusFilterConfig.FilterMaskIdHigh = (0xFFFF0000 & filterMask)>>16; //Upper 16 bits are don't care
canBusFilterConfig.FilterMaskIdLow = 0x0000FFFF & filterMask; //Lower 16 bits are don't care
//canBusFilterConfig.FilterFIFOAssignment = CAN_FILTER_FIFO0; //Sets which rx FIFO to which to apply the filter settings
canBusFilterConfig.FilterActivation = filterEnableFlag;
canBusFilterConfig.SlaveStartFilterBank = 1; //Bank for the defined filter. Arbitrary value.
if (fifoAssignment == 0)
{
canBusFilterConfig.FilterFIFOAssignment = CAN_FILTER_FIFO0;
}
else
{
canBusFilterConfig.FilterFIFOAssignment = CAN_FILTER_FIFO1;
}
//Only fails if CAN peripheral is not in ready or listening state
if (HAL_CAN_ConfigFilter(&gCanBusH, &canBusFilterConfig) != HAL_OK)
{
return(ERR_CAN_INIT_FAILED);
}
else
{
return(SZW_NO_ERROR);
}
}//end BspCan_RxFilterConfig
When initializing, fifo0 works perfectly but fifo1 doesn't. If I initialize fifo1 for both types of messages, it doesn't generate the interrupt. What am I doing wrong here? How do I initialize fifo1 to make it work and generate the interrupt? I also tried without using digital filters; still no luck.
Thanks in advance,
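One thing worth checking (a sketch, assuming the reworked ST HAL CAN driver that provides HAL_CAN_ConfigFilter(), and a bxCAN device where FIFO1 has its own interrupt line): the FIFO1 message-pending notification and its NVIC interrupt are enabled separately from FIFO0's, and FIFO1 delivers messages through its own callback. The helper name and priority values below are examples, and the exact IRQn name depends on the MCU:

/* Hypothetical helper, called e.g. after HAL_CAN_Start() in the init code.
   gCanBusH is the handle from the question. */
void BspCan_EnableFifo1Irq(void)
{
    HAL_CAN_ActivateNotification(&gCanBusH, CAN_IT_RX_FIFO1_MSG_PENDING);
    HAL_NVIC_SetPriority(CAN1_RX1_IRQn, 5, 0); /* example priority */
    HAL_NVIC_EnableIRQ(CAN1_RX1_IRQn);
}

/* FIFO1 has a callback distinct from the FIFO0 one */
void HAL_CAN_RxFifo1MsgPendingCallback(CAN_HandleTypeDef *hcan)
{
    CAN_RxHeaderTypeDef header;
    uint8_t data[8];

    if (HAL_CAN_GetRxMessage(hcan, CAN_RX_FIFO1, &header, data) == HAL_OK) {
        /* handle broadcast message here */
    }
}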
CONTEXT
I'm using code written to work with a GPS module that connects to the Arduino through serial communication. The module starts each packet with a header (0xb5, 0x62), continues with the information you requested, and ends with two bytes of checksum, CK_A and CK_B. I don't understand the code that calculates that checksum. More info about the checksum algorithm (the 8-bit Fletcher algorithm) is in the module protocol spec (https://www.u-blox.com/sites/default/files/products/documents/u-blox7-V14_ReceiverDescriptionProtocolSpec_%28GPS.G7-SW-12001%29_Public.pdf), page 74 (87 with index).
MORE INFO
I just wanted to understand the code; it works fine. The UBX protocol spec I mentioned also contains a piece of code that explains how it works (it isn't written in C++).
struct NAV_POSLLH {
//Here goes the struct
};
NAV_POSLLH posllh;
void calcChecksum(unsigned char* CK) {
memset(CK, 0, 2);
for (int i = 0; i < (int)sizeof(NAV_POSLLH); i++) {
CK[0] += ((unsigned char*)(&posllh))[i];
CK[1] += CK[0];
}
}
In the link you provided, you can find a link to RFC 1145, which contains that 8-bit Fletcher algorithm as well and explains:
It can be shown that at the end of the loop A will contain the 8-bit
1's complement sum of all octets in the datagram, and that B will
contain (n)*D[0] + (n-1)*D[1] + ... + D[n-1].
n = sizeof byte D[];
Quote adjusted to C syntax
Try it with a couple of bytes, pen and paper, and you'll see :)
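To make the pen-and-paper exercise concrete, here is a minimal standalone sketch of the same loop over a three-byte buffer (made-up data, not a UBX packet). Note that this variant relies on unsigned char overflow, i.e. arithmetic mod 256:

#include <stdio.h>

int main(void)
{
    unsigned char d[] = { 0x01, 0x02, 0x03 };
    unsigned char ck_a = 0, ck_b = 0;
    for (unsigned i = 0; i < sizeof d; i++) {
        ck_a += d[i]; /* A = D[0] + D[1] + ... + D[n-1]         (mod 256) */
        ck_b += ck_a; /* B = n*D[0] + (n-1)*D[1] + ... + D[n-1] (mod 256) */
    }
    /* Expect A = 1+2+3 = 6 and B = 3*1 + 2*2 + 1*3 = 10 = 0x0a */
    printf("CK_A=0x%02x CK_B=0x%02x\n", ck_a, ck_b);
    return 0;
}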
I'm having trouble calculating the MAC of the Finished message. The RFC gives the formula:
HMAC_hash(MAC_write_secret, seq_num + TLSCompressed.type +
TLSCompressed.version + TLSCompressed.length +
TLSCompressed.fragment));
But the TLSCompressed (TLSPlaintext in this case, because no compression is used) does not contain version information (hex dump):
14 00 00 0c 2c 93 e6 c5 d1 cb 44 12 bd a0 f9 2d
the first byte is the tlsplaintext.type, followed by uint24 length.
The full message, with the MAC and padding appended and before encryption is
1400000c2c93e6c5d1cb4412bda0f92dbc175a02daab04c6096da8d4736e7c3d251381b10b
I have tried to calculate the HMAC with the following parameters (complying with the RFC), but it does not work:
uint64 seq_num
uint8 tlsplaintext.type
uint8 tlsplaintext.version_major
uint8 tlsplaintext.version_minor
uint16 tlsplaintext.length
opaque tlsplaintext.fragment
I have also tried omitting the version and using a uint24 length instead. No luck.
My hmac_hash() function cannot be the problem because it has worked thus far. I am also able to compute the verify_data and verify it.
Because this is the first message sent under the new connection state, the sequence number is 0.
So, what exactly are the parameters for the calculation of the MAC for the finished message?
Here's the relevant source from Forge (JS implementation of TLS 1.0):
The HMAC function:
var hmac_sha1 = function(key, seqNum, record) {
/* MAC is computed like so:
HMAC_hash(
key, seqNum +
TLSCompressed.type +
TLSCompressed.version +
TLSCompressed.length +
TLSCompressed.fragment)
*/
var hmac = forge.hmac.create();
hmac.start('SHA1', key);
var b = forge.util.createBuffer();
b.putInt32(seqNum[0]);
b.putInt32(seqNum[1]);
b.putByte(record.type);
b.putByte(record.version.major);
b.putByte(record.version.minor);
b.putInt16(record.length);
b.putBytes(record.fragment.bytes());
hmac.update(b.getBytes());
return hmac.digest().getBytes();
};
The function that creates the Finished record:
tls.createFinished = function(c) {
// generate verify_data
var b = forge.util.createBuffer();
b.putBuffer(c.session.md5.digest());
b.putBuffer(c.session.sha1.digest());
// TODO: determine prf function and verify length for TLS 1.2
var client = (c.entity === tls.ConnectionEnd.client);
var sp = c.session.sp;
var vdl = 12;
var prf = prf_TLS1;
var label = client ? 'client finished' : 'server finished';
b = prf(sp.master_secret, label, b.getBytes(), vdl);
// build record fragment
var rval = forge.util.createBuffer();
rval.putByte(tls.HandshakeType.finished);
rval.putInt24(b.length());
rval.putBuffer(b);
return rval;
};
The code to handle a Finished message is a bit lengthier and can be found here. I see that I have a comment in that code that sounds like it might be relevant to your problem:
// rewind to get full bytes for message so it can be manually
// digested below (special case for Finished messages because they
// must be digested *after* handling as opposed to all others)
Does this help you spot anything in your implementation?
Update 1
Per your comments, I wanted to clarify how TLSPlainText works. TLSPlainText is the main "record" for the TLS protocol. It is the "wrapper" or "envelope" for content-specific types of messages. It always looks like this:
struct {
ContentType type;
ProtocolVersion version;
uint16 length;
opaque fragment[TLSPlaintext.length];
} TLSPlaintext;
So it always has a version. A Finished message is a type of handshake message. All handshake messages have a content type of 22. A handshake message looks like this:
struct {
HandshakeType msg_type;
uint24 length;
body
} Handshake;
A Handshake message is yet another envelope/wrapper for other messages, like the Finished message. In this case, the body will be a Finished message (HandshakeType 20), which looks like this:
struct {
opaque verify_data[12];
} Finished;
To actually send a Finished message, you have to wrap it up in a Handshake message envelope, and then like any other message, you have to wrap it up in a TLS record (TLSPlainText). The ultimate result looks/represents something like this:
struct {
ContentType type=22;
ProtocolVersion version=<major, minor>;
uint16 length=<length of fragment>;
opaque fragment=<struct {
HandshakeType msg_type=20;
uint24 length=<length of finished message>;
body=<struct {
opaque verify_data[12]>;
} Finished>
} Handshake>
} TLSPlainText;
Then, before transport, the record may be altered. You can think of these alterations as operations that take a record and transform its fragment (and fragment length). The first operation compresses the fragment. After compression you compute the MAC, as described above and then append that to the fragment. Then you encrypt the fragment (adding the appropriate padding if using a block cipher) and replace it with the ciphered result. So, when you're finished, you've still got a record with a type, version, length, and fragment, but the fragment is encrypted.
So, just so we're clear, when you're computing the MAC for the Finished message, imagine passing in the above TLSPlainText (assuming there's no compression as you indicated) to a function. This function takes this TLSPlainText record, which has properties for type, version, length, and fragment. The HMAC function above is run on the record. The HMAC key and sequence number (which is 0 here) are provided via the session state. Therefore, you can see that everything the HMAC function needs is available.
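Concretely, under the assumptions in the question (TLS 1.0, no compression, seq_num = 0) and taking the fragment from the hex dump above, the buffer handed to HMAC-SHA1 would be laid out like this; note the length field covers only the 16-byte fragment, not the whole MAC input:

unsigned char mac_input[] = {
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* seq_num (uint64)      */
    0x16,                                    /* type = 22 (handshake) */
    0x03,0x01,                               /* version = TLS 1.0     */
    0x00,0x10,                               /* length = 16           */
    /* fragment: the Finished handshake message from the hex dump */
    0x14,0x00,0x00,0x0c,0x2c,0x93,0xe6,0xc5,
    0xd1,0xcb,0x44,0x12,0xbd,0xa0,0xf9,0x2d
};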
In any case, hopefully this better explains how the protocol works and that will maybe reveal what's going wrong with your implementation.
I have used the ICMP example provided in the Asio documentation to create a simple ping utility. However, the example covers IPv4 only, and I am having a hard time making it work with IPv6.
Upgrading the ICMP header class to support IPv6 requires a minor change: the only difference between the ICMP and ICMPv6 headers is the different enumeration of message types. However, I have a problem computing the checksum that needs to be incorporated in the ICMPv6 header.
For IPv4 the checksum is based on the ICMP header and payload. For IPv6, however, the checksum must also cover the IPv6 pseudo-header that precedes the ICMPv6 header and payload. The ICMPv6 checksum function therefore needs to know the source and destination addresses that will go into the IPv6 header, yet we have no control over what goes into the IPv6 header. How can this be done in Boost.Asio?
For reference please find below the function for IPv4 checksum calculation.
template <typename Iterator>
void compute_checksum(icmp_header& header, Iterator body_begin, Iterator body_end)
{
unsigned int sum = (header.type() << 8) + header.code()
+ header.identifier() + header.sequence_number();
Iterator body_iter = body_begin;
while (body_iter != body_end)
{
sum += (static_cast<unsigned char>(*body_iter++) << 8);
if (body_iter != body_end)
sum += static_cast<unsigned char>(*body_iter++);
}
sum = (sum >> 16) + (sum & 0xFFFF);
sum += (sum >> 16);
header.checksum(static_cast<unsigned short>(~sum));
}
[EDIT]
What are the consequences if the checksum is not calculated correctly? Will the target host send echo reply if the echo request has invalid checksum?
If the checksum is incorrect, a typical IPv6 implementation will drop the packet. So, it is a serious issue.
If you insist on crafting the packet yourself, you'll have to do it completely. This includes finding the source IP address, to put it in the pseudo-header before computing the checksum. Here is how I do it in C, by calling connect() for my intended destination address (even when I use UDP, so it should work for ICMP):
/* Get the source IP address chosen by the system (for verbose display,
 * and for checksumming) */
if (connect(sd, destination->ai_addr, destination->ai_addrlen) < 0) {
fprintf(stderr, "Cannot connect the socket: %s\n", strerror(errno));
abort();
}
source = malloc(sizeof(struct addrinfo));
source->ai_addr = malloc(sizeof(struct sockaddr_storage));
source_len = sizeof(struct sockaddr_storage);
if (getsockname(sd, source->ai_addr, &source_len) < 0) {
fprintf(stderr, "Cannot getsockname: %s\n", strerror(errno));
abort();
}
then, later:
sockaddr6 = (struct sockaddr_in6 *) source->ai_addr;
op6.ip.ip6_src = sockaddr6->sin6_addr;
and:
op6.udp.check =
checksum6(op6.ip, op6.udp, (u_int8_t *) & message, messagesize);
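For reference, here is a minimal sketch of what the pseudo-header part of such a checksum6() might look like. The function name and argument layout are illustrative, but the pseudo-header contents (16-byte source and destination addresses, upper-layer packet length, next-header value 58 for ICMPv6) follow RFC 2460; the checksum field inside msg must be zero while summing:

#include <stdint.h>
#include <stddef.h>

uint16_t icmp6_checksum(const uint8_t src[16], const uint8_t dst[16],
                        const uint8_t *msg, size_t len)
{
    uint32_t sum = 0;
    /* IPv6 pseudo-header: source + destination addresses ... */
    for (int i = 0; i < 16; i += 2) {
        sum += (src[i] << 8) | src[i + 1];
        sum += (dst[i] << 8) | dst[i + 1];
    }
    sum += (uint32_t)len; /* ... upper-layer packet length ... */
    sum += 58;            /* ... and next header = ICMPv6 */
    /* ICMPv6 header and payload, summed as big-endian 16-bit words */
    for (size_t i = 0; i < len; i++)
        sum += (i & 1) ? msg[i] : (uint32_t)msg[i] << 8;
    while (sum >> 16) /* fold the carries */
        sum = (sum >> 16) + (sum & 0xffff);
    return (uint16_t)~sum;
}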
Consider this code running on my microcontroller unit (MCU):
while (1) {
    do_stuff;
    if (packet_from_PC)
        send_data_via_gpio(new_packet);     // send via general-purpose I/O pins
    else
        send_data_via_gpio(default_packet);
    do_other_stuff;
}
The MCU is also interfaced to a PC via a UART. Whenever the PC sends data to the MCU, the new_packet is sent; otherwise the default_packet is sent. Each packet can be 5 or more bytes with a predefined packet structure.
My question is:
1. Should I receive the entire packet from the PC inside the UART interrupt service routine (ISR)? In this case, I have to implement a state machine inside the ISR to assemble the packet (which can get lengthy, with if-else or switch-case blocks).
OR
2. Have the PC send some sort of REQUEST command (one byte), detect it in my ISR, set a flag, disable the UART interrupt alone, and assemble the packet in my while(1) loop by checking for the flag and polling the UART? In this case the UART interrupt would be re-enabled in the while(1) loop after the entire packet is assembled.
Those are not the only two choices, and the second one seems suboptimal.
My first approach would be to use a simple circular queue: push bytes into it from the ISR and read bytes from it in your main loop. That way you have a small and simple ISR, and you can do the processing in your main loop without disabling interrupts.
The first choice is possible assuming you can code the ISR sensibly. You probably want to have timeouts when dealing with constructing packets; you need to be able to handle that correctly in your ISR. It depends on the line speed, the speed of your MCU and what else you need to do.
Update:
Doing it in the ISR is certainly reasonable. However, using a circular queue is pretty straightforward with a standard implementation in your bag of tricks. Here is a circular queue implementation; readers and writers can operate independently.
#include <stddef.h> /* for size_t */

#ifndef ARRAY_ELEMENTS
#define ARRAY_ELEMENTS(x) (sizeof(x) / sizeof(x[0]))
#endif
#define QUEUE_DEFINE(name, queue_depth, type) \
struct queue_type__##name { \
volatile size_t m_in; \
volatile size_t m_out; \
type m_queue[queue_depth]; \
}
#define QUEUE_DECLARE(name) struct queue_type__##name name
#define QUEUE_SIZE(name) ARRAY_ELEMENTS((name).m_queue)
#define QUEUE_CALC_NEXT(name, i) \
(((name).i == (QUEUE_SIZE(name) - 1)) ? 0 : ((name).i + 1))
#define QUEUE_INIT(name) (name).m_in = (name).m_out = 0
#define QUEUE_EMPTY(name) ((name).m_in == (name).m_out)
#define QUEUE_FULL(name) (QUEUE_CALC_NEXT(name, m_in) == (name).m_out)
#define QUEUE_NEXT_OUT(name) ((name).m_queue + (name).m_out)
#define QUEUE_NEXT_IN(name) ((name).m_queue + (name).m_in)
#define QUEUE_PUSH(name) ((name).m_in = QUEUE_CALC_NEXT((name), m_in))
#define QUEUE_POP(name) ((name).m_out = QUEUE_CALC_NEXT((name), m_out))
Use it like this:
QUEUE_DEFINE(bytes_received, 64, unsigned char);
QUEUE_DECLARE(bytes_received);
void isr(void)
{
    unsigned char c;

    /* Move the received byte into 'c' (hardware-specific UART read) */

    /* This code enqueues the byte, or drops it if the queue is full */
    if (!QUEUE_FULL(bytes_received)) {
        *QUEUE_NEXT_IN(bytes_received) = c;
        QUEUE_PUSH(bytes_received);
    }
}
void main(void)
{
QUEUE_INIT(bytes_received);
for (;;) {
other_processing();
if (!QUEUE_EMPTY(bytes_received)) {
unsigned char c = *QUEUE_NEXT_OUT(bytes_received);
QUEUE_POP(bytes_received);
/* Use c as you see fit ... */
}
}
}