Replace dropped frame - frame

I'm doing a Cloud Gaming solution that works kind of "good".
At the moment our servers runs a game, encode the video using VCE (AMD hardware encoding), chunk the video frames and send it in UDP to the player. The player receives the packets, rebuid the data and decode it. So we have no problems if there is no packet loss.
In the case of a wired connexion everything is smooth, but people like to use Wifi (5ghz, we can't handle 2Ghz). Even if you have a good Wifi, you may experience packet loss. We have a redundancy plan that works "okish" but it will take too much network.
Here is a small explanation:
Original encoding (only P frames):
F1 - F2 - F3 - F4 - F5
What we do at the moment if we lose F2:
F1 - empty - F3(ugly) - F4(ugly) - F5(ugly)
What we want to do, replace F2:
F1 - F1' - F3 - F4 - F5
Would it work if the third frame refers to F1' (thinking it is F2)? At least I think it better than doing nothing. Is there a way to change the reference of F3 (so it refers to F1 and not F2), or creating F1' with the "header" of F2?

Your solution would be largely ineffective. You should adopt the same solution as the others in your space. That being periodic intra refresh, reference frame invalidation and FEC.

Related

measuring time between two rising edges in beaglebone

I am reading sensor output as square wave(0-5 volt) via oscilloscope. Now I want to measure frequency of one period with Beaglebone. So I should measure the time between two rising edges. However, I don't have any experience with working Beaglebone. Can you give some advices or sample codes about measuring time between rising edges?
How deterministic do you need this to be? If you can tolerate some inaccuracy, you can probably do it on the main Linux OS; if you want to be fancy pants, this seems like a potential use case for the BBB's PRU's (which I unfortunately haven't used so take this with substantial amounts of salt). I would expect you'd be able to write PRU code that just sits with an infinite outerloop and then inside that loop, start looping until it sees the pin shows 0, then starts looping until the pin shows 1 (this is the first rising edge), then starts counting until either the pin shows 0 again (this would then be the falling edge) or another loop to the next rising edge... either way, you could take the counter value and you should be able to directly convert that into time (the PRU is states as having fixed frequency for each instruction, and is a 200Mhz (50ns/instruction). Assuming your loop is something like
#starting with pin low
inner loop 1:
registerX = loadPin
increment counter
jump if zero registerX to inner loop 1
# pin is now high
inner loop 2:
registerX = loadPin
increment counter
jump if one registerX to inner loop 2
# pin is now low again
That should take 3 instructions per counter increment, so you can get the time as 3 * counter * 50 ns.
As suggested by Foon in his answer, the PRUs are a good fit for this task (although depending on your requirements it may be fine to use the ARM processor and standard GPIO). Please note that (as far as I know) both the regular GPIOs and the PRU inputs are based on 3.3V logic, and connecting a 5V signal might fry your board! You will need an additional component or circuit to convert from 5V to 3.3V.
I've written a basic example that measures timing between rising edges on the header pin P8.15 for my own purpose of measuring an engine's rpm. If you decide to use it, you should check the timing results against a known reference. It's about right but I haven't checked it carefully at all. It is implemented using PRU assembly and uses the pypruss python module to simplify interfacing.

What does TDO on 4th bit in ICSP SendCommand header mean? (PIC32MX, ICSP 2-wire 4-phase)

Right now I'm trying to implement the flash programming specification for PIC32MX. I'm working with a PIC32MX512L and a PIC32MX512H. The PIC32MX512L must eventually transfer a program to the two wires PGEC2 and PGED2 of the PIC32MX512H.
Right now I'm trying to execute the check device operation. As specified, I'm entering the programming mode by MCLR-juggling and executing SetMode (6b011111) on the TMS clock while the TDI clock stays low. The TAP controller replies with zeroes (every TDO is low).
After that I must execute SendCommand( MTAP_SW_MTAP ) to select the MTAP controller. The sequence to be shifted is
(header) 01 01 00 00_ | (data) 00 00 10 00 00 | (most sign. bit) 01 | (footer) 01 00
The first bit of each pairs is the TDI and the second -- TMS. I write TDI on the first clock, TMS on the second clock and read TDO during the third and the fourth clock. This sequence is feeded from the left to the right. Shifted bits hold their value during each clock fall.
The issue
After shifting the first 4 pairs, the TDO line goes on the fourth pair high (on the third clock) and low at the end of that 4-phase part (on the fourth clock). I've marked this spot with an underscore in the sequence above. After that the controller ignores any further commands. On the next SendCommand( MTAP_COMMAND ), the TDO stays low and later on for XferData( MCHP_STATUS ) TDO still stays low, no matter how often I send the command.
I've done a small screenshot from my oscilloscope. The blue line is the clock, the green one is the data. The hop on the right is what I mean.
The question
Does anyone know what the TAP controller is trying to tell me with that TDO high on the fourth phase?
Thank you in advance!
Well, I've fixed it. Generally the last TDO of the prologue is the first least significant bit of the output. For SendCommand it has no meaning, but for XferData and XferFastData it is important.
For XferFastData it is the PrAacc bit according to the spec. If the bit is zero, you should repeat the whole operation. But beware: the MCU implementation doesn't follow the spec. If you really restart the whole operation for FastData if PrAcc is zero, it won't work. Instead just ignore the bit and proceed writing. I've found it out eventually by trial and error and by comparing my XferFastData implementation against pic32prog.

State based testing(state charts) & transition sequences

I am really stuck with some state based testing concepts...
I am trying to calculate some checking sequences that will cover all transitions from each state and i have the answers but i dont understand them:
alt text http://www.gam3r.co.uk/1m.jpg
Now the answers i have are :
alt text http://www.gam3r.co.uk/2m.jpg
I dont understand it at all. For example say we want to check transition a/x from s1, wouldnt we do ab only? As we are already in s1, we do a/x to test the transition to s2, then b to check we are in the previous right state(s1)? I cant understand why it is aba or even bb for s1...
Can anyone talk me through it?
Thanks
There are 2 events available in each of 4 states, giving 8 transitions, which the author has decided to test in 8 separate test sequences. Each sequence (except the S1 sequences - apparently the machine's initial state is S1) needs to drive the machine to the target state and then perform either event a or event b.
The sequences he has chosen are sufficient, in that each transition is covered. However, they are not unique, and - as you have observed - not minimal.
A more obvious choice would be:
a b
ab aa
aaa aab
ba bb
I don't understand the author's purpose in adding superfluous transitions at the end of each sequence. The system is a Mealy machine - the behaviour of the machine is uniquely determined by the current state and event. There is no memory of the path leading to the current state; therefore the author's extra transitions give no additional coverage and serve only to confuse.
You are also correct that you could cover the all transitions with a shorter set of paths through the graph. However, I would be disinclined to do that. Clarity is more important than optimization for test code.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:
Validate the bootsector of the host harddrive and make sure it was not infected.
Validate the floppy bootsector and
make sure that it was not infected.
Code to remove the virus from the
harddrive if it was infected.
Code to duplicate the antivirus
bootsector to another floppy if a
special key was pressed.
Code to boot the harddrive if all was
well, and no infections was found.
This was done in the program space of a bootsector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In c/c++ the code is: (sourced from Wikipedia)
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1); // Now this is what you call a real magic number
x = *(float*)&i;
x = x*(1.5f - xhalf*x*x);
return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching in a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here is the actual code fragments
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.
The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)

AVSampleBufferDisplayLayer renders half of each frame when using high preset

I am using AVSampleBufferDisplayLayer to display video that is being streamed over the network. On the sending side an AVCaptureSession is used to capture CMSampleBuffers which are serialized into NAL units and streamed to the receiver, which then turns them back into CMSampelBuffers and feeds them to AVSampleBufferDisplayLayer (as is described for instance here). It works quite well - I can see the video and it streams more or less smoothly.
If I set the capture session's sessionPreset to AVCaptureSessionPresetHigh the video shown on the receiving side is cut in half - the top half displays the video from the sender while the bottom half is a solid dark green. If I use any other preset (e.g. AVCaptureSessionPresetMedium or AVCaptureSessionPreset1280x720) the video displays in its entirety.
Has anyone encountered such an issue, or has any idea what might cause it?
I tried examining the data at the source as well as the data at the destination, to see if I can determine where the image is being chopped off, but I have not been successful. It occurred to me that perhaps the high quality frame is being split into more than one NALUs and I am not putting it together correctly - is that possible? How does such splitting look like on the elementary-stream level (if possible at all)?
Thank you
Amos
The problem turned out to be that at the preset AVCaptureSessionPresetHigh one frame would get split into more than one type 5 (or type 1) NALU. On the receiving side I was combining the SPS, PPS and type 1 (or type 5) NAL units into a CMSampleBuffer, but ignoring the second part of the frame if they were split, which caused the problem.
In order to recognize if two successive NAL units belong to the same frame it is necessary to parse the slice header of the picture NAL units. This requires delving into the specification, but goes more or less like this: the first field of the slice header is first_mb_in_slice which is encoded in Golomb encoding. Next come slice_type and the pic_aprameter_set_id, also in Golomb encoding, and finally the frame_number, as an unsigned integer of length (log2_max_frame_num_minus_4 + 4) bits (to get the value of log2_max_frame_num_minus_4 it is necessary to parse the PPS corresponding to this frame). If two consecutive NAL units have the same frame_num they are part of the same frame and should be put into the same CMSampleBuffer.