How fast could you read/write to floppy disks, both 3 1/2 and 5 1/4? - hardware

Does anyone know/remember the actual read/write speed of floppy disks? I want to use this as a tidbit when arguing how painfully slow our SharePoint server is, but none of the websites with information about the disks seem to list the actual speeds they worked at.

IIRC,
The controllers at the end were rated at 500 kbps to 1 Mbps for most modern floppy controllers; older ones handled about 250 kbps.
The actual disks always maxed out at around 100-250 kbps, and I've never seen a floppy go above 250 kbps.
Read speeds could be somewhat higher, but I've never seen anything close to a controller's maximum. Not the best answer perhaps, but some insight for you.

It depends. In the late 1970s, take an 8-bit Intel 8271 floppy disk controller paired with an Intel 8257 DMA controller, in a machine clocked at say 1 MHz, reading a 40-track, single-sided, single-density, 5.25", 3 KByte-per-track disk. An access would incur a ~400 ms spin-up cost as the drive accelerated from 0 to 300 rpm, then wait the >50 ms necessary for the head to seek and settle on the desired track before a sector operation could even be initiated, which itself incurred a ~60 ms penalty while the device waited for the desired sector to come under the head. Tracks were typically partitioned into 128, 256 or 512 byte sectors (blocks). So reading or writing a >3 KByte file would incur over a second of spin-up and seek overhead (two tracks having to be sought, and numerous sectors read, to assemble the file) before adding the cost of the bytes transferred, roughly 65 microseconds per byte (a 3 KByte track passing under the head every 200 ms at 300 rpm). Nowhere near achieving the 1 MByte/sec the controllers were theoretically able to crunch.
Add to that the likes of Commodore opting, in the early 1980s, to put a serial bus managing only around 300 bytes per second between their 1540/1541 external disk drives and the computer, making roughly 0.4 KBytes a second the theoretical maximum transfer rate, while Atari opted for a somewhat faster serial interface of 19,200 bits (2,400 bytes) per second in the same period.
By the late 1980s, High Density, double-sided, 80-track, 360 rpm rated, 3.5" disks, notionally able to hold ~2 MB and typically partitioned into 512-byte sectors, would more than quadruple transfer rates as seek times improved. The disks were physically smaller, so the heads had a shorter distance to travel, and they spun faster, reducing the wait for sectors to come under the head, while still requiring less energy and time to spin up. The larger sector sizes also reduced the number of individual block transactions, with their associated set-up costs. Still nowhere near the 1 MByte per second that even an 8-bit, 1 MHz controller could in theory shift, let alone what could be pushed over the >50 MByte/sec data buses of the (almost) 32-bit systems the drives were sold with.
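To make the arithmetic concrete, here is a back-of-the-envelope sketch in C that plugs in the rough figures from this answer; every constant (spin-up, seek, per-sector rotational penalty, per-byte time, sector size, tracks sought) is an assumption taken from the estimates above, not a measurement:

    #include <stdio.h>

    /* Back-of-the-envelope estimate of reading a 3 KB file from a
     * single-density 5.25" drive, using the rough figures quoted above.
     * All constants are assumptions for illustration, not measurements. */
    int main(void)
    {
        const double spin_up_ms     = 400.0;  /* 0 -> 300 rpm                 */
        const double seek_settle_ms = 50.0;   /* per track sought             */
        const double rotational_ms  = 60.0;   /* wait for sector, per sector  */
        const double per_byte_ms    = 0.065;  /* ~3 KB/track, 200 ms per rev  */
        const int    file_bytes     = 3 * 1024;
        const int    tracks_sought  = 2;
        const int    sectors_read   = file_bytes / 256;  /* 256-byte sectors  */

        double total_ms = spin_up_ms
                        + tracks_sought * seek_settle_ms
                        + sectors_read  * rotational_ms
                        + file_bytes    * per_byte_ms;

        printf("~%.0f ms to read %d bytes -> ~%.1f KB/s effective\n",
               total_ms, file_bytes,
               (file_bytes / 1024.0) / (total_ms / 1000.0));
        return 0;
    }

Under these assumptions the effective rate comes out at a couple of KBytes per second, which is the point of the answer: the mechanics, not the controller, set the pace.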

Related

How do GPU/PPU timings work on the Gameboy?

I'm writing a Game Boy emulator and I've come to implementing the graphics. However, I can't quite figure out how it works with the CPU as far as timing/clock cycles go. Does the CPU execute a certain number of cycles (if so, how many) and then hand off to the GPU? Or is the Game Boy always in an hblank/vblank state and the GPU uses the CPU in between them? I can't find any information that helps me with this, only how to use the control registers.
This has been answered at https://forums.nesdev.com/viewtopic.php?f=20&t=17754&p=225009#p225009
It turns out I had it completely wrong and they are completely different.
Here is the post:
The Game Boy CPU and PPU run in parallel. The 4.2 MHz master clock is also the dot clock. It's divided by 2 to form the PPU's 2.1 MHz memory access clock, and divided by 4 to form a multi-phase 1.05 MHz clock used by the CPU.
Each scanline is 456 dots (114 CPU cycles) long and consists of mode 2 (OAM search), mode 3 (active picture), and mode 0 (horizontal blanking). Mode 2 is 80 dots long (2 for each OAM entry), mode 3 is about 168 plus about 10 more for each sprite on a given line, and mode 0 is the rest. After 144 scanlines are drawn, there are 10 lines of mode 1 (vertical blanking), for a total of 154 lines or 70224 dots per screen. The CPU can't see VRAM (writes are ignored and reads are $FF) during mode 3, but it can during other modes. The CPU can't see OAM during modes 2 and 3, but it can during blanking modes (0 and 1).
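For the implementation side, here is a minimal sketch in C that derives the current PPU mode from a running dot counter, using the numbers from the quoted post (the fixed 168-dot mode 3 is a simplification; its real length depends on the sprites on that line):

    /* Minimal sketch: derive the current PPU mode from a dot counter,
     * using the timings from the quoted post. Mode 3's real length grows
     * with the sprites on the line; the 168-dot base figure is used here. */
    #define DOTS_PER_LINE   456
    #define LINES_PER_FRAME 154
    #define VISIBLE_LINES   144

    typedef enum {
        MODE_HBLANK = 0, MODE_VBLANK = 1, MODE_OAM = 2, MODE_DRAW = 3
    } ppu_mode;

    ppu_mode ppu_mode_at(unsigned frame_dot)
    {
        unsigned line = (frame_dot / DOTS_PER_LINE) % LINES_PER_FRAME;
        unsigned dot  = frame_dot % DOTS_PER_LINE;

        if (line >= VISIBLE_LINES) return MODE_VBLANK;  /* lines 144-153  */
        if (dot < 80)              return MODE_OAM;     /* OAM search     */
        if (dot < 80 + 168)        return MODE_DRAW;    /* active picture */
        return MODE_HBLANK;                             /* rest of line   */
    }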
The link gives more of a general answer instead of implementation specifics, so I want to give my 2 cents.
The CPU is usually the main part of your emulator and the thing that actually counts cycles. Each time your CPU does something that takes some number of cycles, you pass that number of cycles to the other components of your emulator so that they can synchronize themselves.
For example, some CPU instructions read and write memory as part of a single instruction. That means it would take the Game Boy CPU 4 (read) + 4 (write) cycles to complete the instruction. So in the emulator you do the read, pass 4 cycles to the GPU, do the write, pass 4 cycles to the GPU. You do the same for other components that run in parallel with the CPU, like timers and sound.
It's actually important to do it that way instead of emulating the whole instruction and then synchronizing everything else. I don't know about real ROMs, but there are test ROMs that verify this exact behaviour. 8 cycles is a long time, and in the middle of multiple memory accesses some other Game Boy component might make a change.
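As a concrete illustration of that per-access synchronization, here is a minimal sketch in C; the gb struct and the step functions are hypothetical stand-ins for whatever your emulator already has:

    #include <stdint.h>

    /* Hypothetical sketch of per-memory-access synchronization; the gb
     * struct and the *_step() functions stand in for your own code. */
    struct gb { int ppu_dots; int timer_cycles; uint8_t mem[0x10000]; };

    static void ppu_step(struct gb *gb, int cycles)   { gb->ppu_dots     += cycles; }
    static void timer_step(struct gb *gb, int cycles) { gb->timer_cycles += cycles; }

    static void tick(struct gb *gb, int cycles)
    {
        ppu_step(gb, cycles);     /* PPU advances in lock-step with the CPU */
        timer_step(gb, cycles);   /* ...and so do timers, APU, serial, etc. */
    }

    static uint8_t cpu_read(struct gb *gb, uint16_t addr)
    {
        uint8_t v = gb->mem[addr];
        tick(gb, 4);              /* the read itself took 4 clocks */
        return v;
    }

    static void cpu_write(struct gb *gb, uint16_t addr, uint8_t v)
    {
        gb->mem[addr] = v;
        tick(gb, 4);              /* and so did the write */
    }

    /* e.g. LD (HL),A is an opcode fetch (cpu_read) plus a data write
     * (cpu_write): 8 clocks total, accounted for as each access happens. */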

In general, how expensive is calling an external program?

I know external programs can be called, but I don't know how expensive it is compared to, say, calling a subroutine. By the cost of calling, I mean the overhead of starting the program, rather than the cost of executing the program's code itself. I know the cost probably varies greatly depending on the language and operating system used and other factors, but I would appreciate some ballpark estimates.
I am asking in order to gauge the plausibility of emulating self-modifying code in languages that don't allow it, by having processes modify other processes.
Like I said in my comment above, perhaps it would be best if you simply tried it and did some benchmarking (a rough sketch of such a benchmark follows at the end of this answer). I'd expect this to depend primarily on the OS you're using.
That being said, starting a new process generally is many orders of magnitude slower than calling a subroutine (I'm tempted to say something like "at least a million times slower", but I couldn't back up such a claim with any measurements).
Possible reasons why starting a process is much slower:
Disk I/O (the OS has to load the process image file into memory) — this is going to be a big factor because I/O is many orders of magnitude slower than a simple CPU jump/call instruction.
To give you a rough idea of the orders of magnitude involved, let me quote this 2011 blog article (which is about memory access vs HDD access, not CPU jump instruction vs HDD access):
"Disk latency is around 13ms, but it depends on the quality and rotational speed of the hard drive. RAM latency is around 83 nanoseconds. How big is the difference? If RAM was an F-18 Hornet with a max speed of 1,190 mph (more than 1.5x the speed of sound), disk access speed is a banana slug with a top speed of 0.007 mph."
You do the math: 13 ms against 83 ns is a factor of roughly 150,000, i.e. about five orders of magnitude.
allocations of memory & other kernel data structures
laying out the process image in memory & performing relocations
creation of a new OS thread
context switches
etc.
Apparently, all of the above points mean that your OS is likely to perform lots of internal subroutine calls to start a new process, so doing just one subroutine call yourself instead of having the OS do hundreds of these is bound to be comparatively super-cheap.
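If you want a number for your own machine, here is a rough benchmark sketch in C, assuming a POSIX system; it times a trivial function call against fork + exec of /bin/true and is only meant to show the order of magnitude:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Rough benchmark sketch (POSIX): cost of a plain function call vs.
     * spawning /bin/true as a child process. Results will vary wildly
     * by OS, filesystem cache state and hardware. */

    static volatile int sink;
    static void subroutine(void) { sink++; }

    static double now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void)
    {
        const int N = 1000;

        double t0 = now_ns();
        for (int i = 0; i < N; i++)
            subroutine();
        double t1 = now_ns();

        for (int i = 0; i < N; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                execl("/bin/true", "true", (char *)NULL);
                _exit(127);                 /* exec failed */
            }
            waitpid(pid, NULL, 0);
        }
        double t2 = now_ns();

        printf("function call:  %.0f ns each\n", (t1 - t0) / N);
        printf("fork+exec+wait: %.0f ns each\n", (t2 - t1) / N);
        return 0;
    }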

Logging 16-bit data to an SD card at the rate of 44 kHz

I am using the STM32F4 microcontroller with a microSD card. I am capturing analogue data via DMA.
I am using a double buffer, taking 1280 samples (10 × 128, i.e. enough for 10 FFTs) at a time.
When one buffer is full I set a flag, then take 128 samples at a time and run an FFT calculation on them. All of this is running well.
The data is being sampled at the rate I want and the FFT calculation is as I would expect. If I let the program run for one second, I see that it runs the FFT approximately 343 times (44,000 / 128).
But the problem is I would like to save 64 values from this FFT to the SD card.
I am using the HCC fat file system library.
On each loop of the FFT calculation I copy the 64 values into an array.
After every 10 calculations I write the contents of this array to file and start again.
The array stores 640 float_32 values (10*64).
This works perfectly for a one-second test run; I get about 22,000 values stored to the SD card.
But as I increase the run time I start losing samples, as it takes the SD card longer to write. I need the SD card to sustain over 87 KB/s (4 bytes × 64 × 343 = 87,808 bytes/s). I have tried increasing the DMA buffer sample size and changing how often it writes, but didn't find it helped.
I am using an 8G microSD card, class 4. I formatted the SD card to the default FAT32 allocation unit size 2048.
How should I organize the buffering of data to allow for this? I thought using fewer writes might help. Would a queue help? How would I implement this and would anyone have an example?
I saw that clifford had a similar problem and was using a queue: "How can I use an SD card for logging 16-bit data at 48 ksamples/s?".
In my case I got it to work by trying a large number of different cards - they vary a great deal. If I had enough RAM available for a longer buffer that would have worked too.
If you are not using an RTOS, the queue buffering option may not be available to you, or at least would be non-trivial to implement.
Using an RTOS queue, I suggest that you create a queue of messages each of length 64*sizeof(float_32); the number of messages in the queue will be determined by the amount of card latency you need to deal with. A depth of 343, for example, will sustain a card stall of up to 1 second and will require about 88 KB of RAM. The application then has a high-priority thread performing the FFT and placing data in the queue, while a low-priority thread takes data from the queue and writes it to the file.
You might improve performance further by accumulating multiple message blocks in your DMA buffer before initiating a write, and there may be some benefit in carefully selecting an optimum DMA buffer length.
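As an illustration of that producer/consumer split, here is a minimal sketch assuming FreeRTOS (the answer doesn't name a specific RTOS); run_fft(), sd_write_block() and wait_for_dma_half_buffer() are hypothetical placeholders for the existing FFT code, the HCC file-system write call, and the DMA double-buffer handling:

    #include <stdint.h>
    #include <stddef.h>
    #include "FreeRTOS.h"
    #include "queue.h"
    #include "task.h"

    /* Sketch only: assumes FreeRTOS. The extern functions below are
     * placeholders for code the question already has. */
    #define FRAME_FLOATS  64
    #define QUEUE_DEPTH   343            /* ~1 s of frames, ~88 KB of RAM */

    typedef struct { float bins[FRAME_FLOATS]; } fft_frame_t;

    static QueueHandle_t frame_q;

    extern void run_fft(const int16_t *samples, float *out_bins);
    extern void sd_write_block(const void *data, size_t len);
    extern const int16_t *wait_for_dma_half_buffer(void);   /* 128 samples */

    /* High priority: FFT producer, never blocks on the SD card. */
    static void fft_task(void *arg)
    {
        (void)arg;
        fft_frame_t frame;
        for (;;) {
            const int16_t *samples = wait_for_dma_half_buffer();
            run_fft(samples, frame.bins);
            /* Drop the frame rather than stall if the queue is full. */
            xQueueSend(frame_q, &frame, 0);
        }
    }

    /* Low priority: drains the queue and writes to the card. */
    static void sd_task(void *arg)
    {
        (void)arg;
        fft_frame_t frame;
        for (;;) {
            if (xQueueReceive(frame_q, &frame, portMAX_DELAY) == pdTRUE)
                sd_write_block(frame.bins, sizeof frame.bins);
        }
    }

    void start_logger(void)
    {
        frame_q = xQueueCreate(QUEUE_DEPTH, sizeof(fft_frame_t));
        xTaskCreate(fft_task, "fft", 512, NULL, tskIDLE_PRIORITY + 3, NULL);
        xTaskCreate(sd_task,  "sd",  512, NULL, tskIDLE_PRIORITY + 1, NULL);
    }

The key design point is that the FFT thread never waits on the card; any card stall shorter than the queue depth is absorbed by RAM.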
Flash is very, very sensitive to overwrites. Writing 3 kB and then a further 3 kB may count as an overwrite of the first 4 kB. In your case, there's no good reason to use such small writes anyway. I'd advise 16 kB writes (64 frames/write × 64 samples/frame × 4 bytes/sample). You'd need 5 or 6 writes per second, which should be well within spec for any old SD card.
Now it's quite likely that you'd get another 1280 samples in while writing; you'll have to deal with that on another thread. That should be no problem, as the writing should block without using the CPU (it's a low-level flash delay).
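A minimal sketch of that batching, with hypothetical names (sd_write() stands in for the library's write call): accumulate 64 frames of 64 floats and hand them to the file system in a single 16 kB write.

    #include <stdint.h>
    #include <string.h>

    /* Sketch of batching writes into 16 kB chunks (64 frames of 64 floats).
     * sd_write() is a placeholder for the HCC FAT write call. */
    #define FRAME_FLOATS     64
    #define FRAMES_PER_WRITE 64                 /* 64 * 64 * 4 = 16384 bytes */

    extern void sd_write(const void *data, uint32_t len);

    static float    write_buf[FRAMES_PER_WRITE * FRAME_FLOATS];
    static uint32_t frames_buffered;

    /* Call once per FFT result (~343 times/s); flushes ~5-6 times/s. */
    void log_frame(const float bins[FRAME_FLOATS])
    {
        memcpy(&write_buf[frames_buffered * FRAME_FLOATS], bins,
               FRAME_FLOATS * sizeof(float));
        if (++frames_buffered == FRAMES_PER_WRITE) {
            sd_write(write_buf, sizeof write_buf);   /* one 16 kB write */
            frames_buffered = 0;
        }
    }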
The most probable cause of the problem is the way you are interfacing with the card through the library.
SD cards over the SPI protocol (which I assume is being used here) can be read or written in 512-byte sector units, and some SD commands make it possible to stream (to perform sequential sector access faster). An important element of the SD card SPI protocol is its various delays, where you have to poll the card to find out whether you can start an operation (such as writing data to a sector).
You should read the library's API to discover how its writing process works. You will need to perform some regular action which ultimately polls the card to find out whether the writing process can continue. Some cards might require a set number of accesses before becoming ready for an operation; others might use timeouts for state transitions. It might not work well to call that function relatively rarely (such as once every 2-3 milliseconds) in the hope that the card becomes ready in the meantime; you have to keep nagging it to ask whether it has completed.
Just from my own experience with SD interfacing.

Does quad-core perform substantially better than a dual-core for web development?

First, I could not ask this on most hardware forums, because they are mostly populated by
gamers. Additionally, it is difficult to get an opinion from sysadmins, because they have a fairly different perspective as well.
So perhaps, amongst developers, I might be able to deduce a realistic trend.
What I want to know is: if I regularly fire up NetBeans/Eclipse, MySQL Workbench, 3 to 5 browsers with multiple tabs, along with Apache/PHP/MySQL running in the background, and perhaps GIMP/Adobe Photoshop from time to time, does a quad core perform considerably faster than a dual core, assuming the quad has a slower clock speed (~2.8 GHz vs a 3.2 GHz dual core)?
My only relevant experience is that my old Core 2 Duo at 2.8 GHz with 4 GB of RAM performed considerably slower than my new Core i5 quad core at 2.8 GHz (both desktops). That is only one data point, so I can't tell whether it holds true for everyone.
The end purpose of all this is to help me decide on buying a new laptop (there is quite a difference between 4-core and 2-core models at the moment).
http://www.intel.com/content/www/us/en/processor-comparison/comparison-chart.html
I did a comparison for you using that chart.
Here the quad core is 2.20 GHz while the dual core is 2.30 GHz.
Now check out the "Max Turbo Frequency" in that comparison. You will notice that even though the quad core has a lower base clock, when it hits turbo it passes the dual core.
The second thing to consider is cache size, which does make a huge difference. The quad core will generally have more cache; in this example it has 6 MB, and some have up to 8 MB.
Third is max memory bandwidth: the quad core has 25.6 GB/s vs the dual core's 21.3 GB/s, which means faster memory access on the quad core.
The fourth factor is graphics: the graphics base frequency is 650 MHz on the quad and 500 MHz on the dual.
Fifth, the graphics max dynamic frequency is 1.30 GHz for the quad and 1.10 GHz for the dual.
The bottom line is that if you can afford it, the quad not only gives you more punch but also allows you to add more memory later, since the max memory size with the quad is 16 GB while the dual restricts you to 8 GB. Just to be future-proof, I would go with the quad.
One more thing to add: simultaneous threads are 4 on the dual core and 8 on the quad, which does make a difference.
The problem with multi-processors/multi-core processors has been, and still is, memory bandwidth. Most applications in daily use have not been written to economize on memory bandwidth. This means that for typical, everyday use you'll run out of bandwidth when your apps are doing something (i.e. not waiting for user input).
Some applications - such as games and parts of operating systems - attempt to address this. Their parallelism loads a chunk of data into a core, spends some time processing it - without accessing memory further - and finally writes the modified data back to memory. During the processing itself the memory bus is free and other cores can load and store data.
In a well-designed, parallel code essentially any number of cores can be working on different parts of the same task so long as the total amount of processing - number of cores * processing time - is less than or equal to the total time doing memory work - number of cores * (read time + write time).
A code designed and balanced for a specific number of cores will be efficient for fewer but not for more cores.
Some processors have multiple data buses to increase the overall memory bandwidth. This works up to a certain point, after which the next-higher memory - the L3 cache - becomes the bottleneck.
Even at equivalent clock speeds, the quad core can execute twice as many instructions per cycle as the dual core, and 0.4 GHz isn't going to make a huge difference.

What makes a modern commodity cluster?

What would be the most cost-effective way of implementing a terabyte distributed memory cache using commodity hardware these days? And what would count as a piece of commodity hardware?
Commodity hardware is considered hardware that
Is off the shelf (nothing custom)
Is available in substantially similar versions from many manufacturers.
There are many motherboards that can hold 8 or 16 GB of RAM. Fewer server motherboards can hold 32 and even 64GB.
But they still fit the definition of commodity, and can therefore be made into very large clusters, for a very large sum of money.
Note, however, that for many access patterns a striped RAID HD array isn't much slower than a gigabit Ethernet link, so a RAM cluster might not be a significant improvement (except in latency), depending on how you're actually using it.
-Adam