How to connect a 16-bit SDRAM to a 32-bit processor on the DE10-Standard FPGA kit - hardware

I have a project to design a RISC-V processor on a DE10 kit, and I've already created the Verilog files for the processor.
The processor has a 32-bit data bus, but the available external SDRAM is only 16 bits wide. How do I connect them together?

The DE10-Standard is a Cyclone V SoC board, and its SDRAM is controlled by the ARM HPS, so you don't need to talk to it directly. The easiest way is to talk to it via the Avalon bus (which can be up to 128 bits wide). You'll need to enable a port with a U-Boot script; see an example here.
The HPS clock frequency is higher than what you can get in your logic, so a 128-bit bus is still an efficient way to talk to the DDR.
Note that you'll be connecting your 32-bit data bus to a wider data bus, not a narrower one. For reading, you should be using a cache anyway, with a line width that is a multiple of 128 bits.
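To make the width adaptation concrete, here is a minimal C model (all names are hypothetical, not an actual HAL) of how a 32-bit read can be served from a 128-bit cache line fetched in a single wide-bus transaction:

```c
#include <stdint.h>

#define LINE_WORDS 4   /* a 128-bit line holds four 32-bit words */

/* Hypothetical model of one cache line backed by the 128-bit HPS port. */
typedef struct {
    uint32_t tag;                /* line-aligned address of the cached data */
    uint32_t words[LINE_WORDS];  /* the 16 bytes of line data */
    int      valid;
} cache_line_t;

/* Placeholder for a single 128-bit burst read over the Avalon port. */
extern void sdram_read_line(uint32_t line_addr, uint32_t words[LINE_WORDS]);

/* A 32-bit CPU read: fetch the whole 128-bit line once, then slice it. */
uint32_t cpu_read32(cache_line_t *line, uint32_t addr)
{
    uint32_t line_addr = addr & ~(uint32_t)(LINE_WORDS * 4 - 1);

    if (!line->valid || line->tag != line_addr) {
        sdram_read_line(line_addr, line->words);  /* one wide transaction */
        line->tag   = line_addr;
        line->valid = 1;
    }
    return line->words[(addr >> 2) & (LINE_WORDS - 1)];  /* pick one word */
}
```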

Related

What is the maximum speed of the STM32 USB CDC?

I'm using an STM32L151 to communicate with a PC using USB CDC. I used the STM32 HAL libraries to create my project.
I found that the USB sends data in 1 ms intervals, and each time 64 bytes are being sent. So, is the maximum speed of the USB CDC 64 kbyte/s? That is much lower than the USB Full Speed data rate of 12 Mbit/s. How can I reach this speed, or at least a fraction of this speed?
Nope. If your code is "fast enough", the maximum CDC speed is about 1 MByte/s. This may require a big (>1 KB) FIFO on the device side. Oh, and the PC side must be able to read the data fast enough, e.g. with big buffers.
The 64 KByte/s limit applies to USB HID, which uses interrupt endpoints. The USB CDC interface uses faster bulk endpoints.
A USB FS frame is 1 ms, so if you put 64 bytes into the buffer (using the HAL function), it will send those 64 bytes in the next frame, and it will not send any more data until the next 1 ms frame.
How to increase this speed: aggregate your data into larger chunks and send more data in one transaction (up to 8 kB using the HAL libraries).
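As a sketch of that aggregation using the ST-generated CDC glue code (`CDC_Transmit_FS()` from `usbd_cdc_if.c`; the chunk size here is illustrative):

```c
#include <string.h>
#include <stdint.h>
#include "usbd_cdc_if.h"   /* ST-generated CDC interface: CDC_Transmit_FS() */

#define CHUNK_SIZE 4096    /* illustrative chunk size, well above 64 bytes */

static uint8_t tx_buf[CHUNK_SIZE];

/* Send one aggregated chunk instead of many 64-byte writes; retry while
   the previous transfer is still in flight. */
void send_chunk(const uint8_t *data, uint16_t len)
{
    memcpy(tx_buf, data, len);
    while (CDC_Transmit_FS(tx_buf, len) == USBD_BUSY) {
        /* previous transfer still busy; spin or yield here */
    }
}
```

The stack still splits the buffer into 64-byte packets on the wire, but bulk packets can complete back-to-back within a frame, which is where the ~1 MByte/s figure comes from.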

Atmel AVR SRAM vs register

In the Atmel AVR architecture, registers and SRAM are in the same data memory space (for example, 0x0000 to 0x001F would be registers, and 0x0300 would be internal SRAM). How is that implemented? Is it the same principle as virtual memory?
It could be that, or they could be separate RAMs, or several separate RAM blocks. It starts in the processor core: especially with that core being a Harvard architecture, instruction fetches and data accesses split into at least two buses, and on the data bus you then have some sort of address decoder to isolate peripherals from RAM, and perhaps registers from SRAM.
It may very well be that the registers are simply part of the generic SRAM. Or they could be their own bank of RAM, closer to the processor, that just happens to be addressable, and that address decoding may happen in the core and never make it to the edge of the processor where the SRAM and peripheral decoding would happen.
If they are split, then yes, it may feel a little like virtual memory in that a region of one address space maps to some other thing. But unlike virtual memory, there is no MMU doing it, especially not one you can reprogram or that can check permissions.
Addressing the registers through the data space is a feature of some other 8-bit processors (the 8051, I want to say), so the AVR may have been designed with a feature like that as well. But, like BCD math instructions, it is a feature that has gone by the wayside; you are much more likely not to see it than to see it.
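As a small demonstration for a classic AVR (a sketch for avr-gcc, assuming a part such as the ATmega328P, where the register file is mapped at data addresses 0x00-0x1F; newer AVR families dropped this mapping):

```c
#include <avr/io.h>
#include <stdint.h>

int main(void)
{
    uint8_t via_memory;

    /* Load a known value into r16, then read it back through the data
       address space: on classic AVRs, r16 is visible at data address 0x10. */
    asm volatile(
        "ldi r16, 0xAB  \n\t"   /* write 0xAB into working register r16  */
        "lds %0, 0x0010 \n\t"   /* load from data address 0x10, i.e. r16 */
        : "=r"(via_memory)
        :
        : "r16");

    PORTB = via_memory;  /* write it somewhere observable (should be 0xAB) */
    for (;;)
        ;
}
```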

Can a 32-bit MCU have more than 32 data lines?

I was wondering what the reason is behind branding an MCU as 32-bit or 64-bit. In simple Harvard or von Neumann architectures it used to be the width of the data bus. But on the market I have seen MCUs which have 64 data lines and yet are marketed as 32-bit MCUs. Can somebody explain?
It is not true that the bit width of a processor was defined by the data bus width. The Intel 8088 (used in the original IBM PC) was a 16-bit device with an 8-bit data bus, and the Motorola 68008 (Sinclair QL) was a 32-bit device with an 8-bit bus.
It is primarily defined by the nature of the instruction set (the width of operands) and the register width (necessarily the same).
When most devices had matching bus and instruction/register widths (i.e. prior to about 1980), there was no need for a distinction, and the fact that it was unclear whether the label referred to bus or register/instruction width was of little consequence. When narrow-bus versions of wide instruction/register devices were introduced, it presented a marketing dilemma. The QL was widely advertised as having a 32-bit processor despite its 8-bit bus, while the 8088 was sometimes referred to as an 8/16-bit part. The 68008 could trivially perform 32-bit operations in a single instruction - the fact that it took four bus cycles to fetch the operand was transparent to software, and the total number of instruction and data fetch cycles was still far fewer than it would take an 8-bit processor to perform the same 32-bit operation.
Another interesting architecture in this context is ARM architecture v4, which supports a 16-bit instruction set known as "Thumb" in addition to the 32-bit ARM instruction set. In Thumb mode the instructions are 16 bits wide (the registers remain 32-bit), which gives higher code density than ARM mode. Where an external memory interface is used, most ARM v4 parts support either a 16- or 32-bit external bus - either ARM or Thumb may be used with either, but when a 16-bit bus is implemented, Thumb mode generally runs more efficiently than the 32-bit instruction set due to the single bus cycle per instruction or operand fetch.
Given the increasing variety of architectures, instruction/register sets, and bus widths, it makes sense now to characterise an architecture by its instruction/register set.

Configuration registers for LPC bus in Poulsbo System Controller Hub (US15W)

We have a system based around an Atom Z510/Intel SCH US15W Q7 card (running Debian Linux). We need to transfer blocks of data from a device on the Low Pin Count (LPC) bus. As far as I know this chipset does not provide DMA facilities, meaning the processor has to read the data out a byte at a time in a software loop. (The device driver actually implements this using the x86 "rep insb" instruction, so the loop is actually executed by the CPU itself, if I understand correctly.)
This is far from optimal, but it should be possible to hit a transfer rate of 14 Mbit/s. Instead we can barely manage 4 Mbit/s, with transactions on the bus no closer than 2 µs apart, even though each read from the slave device completes in 560 ns. I don't believe other traffic on the bus is to blame, but I am still investigating.
My question is:
Does anyone know if there are any configuration registers on the SCH that could affect the LPC bus timing?
I cannot find any useful information on the device on the Intel website, nor have I spotted anything in the Linux kernel code that appears to be fiddling with any such registers (but I'm a noob when it comes to Linux kernel stuff).
I'm not an x86 expert, so any other factors that might come into play, or any other 'war stories' relating to this device, would be good to know about too.
Edit: I have found the datasheet. I've not seen anything in it that explains this behaviour, but I am investigating the possibility of mapping our device as a firmware device as the firmware bus cycles don't seem to suffer the same delays.
For the record, the solution was to modify the FPGA firmware so that the chip's data in/out register was mapped at four adjacent addresses, and to modify the driver to use 32-bit inl/outl instructions. Although the SCH does not implement 32-bit LPC read/write operations, the result is four back-to-back 8-bit operations followed by the same dead time as I was getting previously with a single byte, meaning it averages about 1 µs per byte. Not ideal, but still a doubling in throughput.
It transpires the firmware cycles were quicker because the SCH transfers 64 bytes at a time from the firmware flash - after 64 bytes there is the same 1.4 µs gap, indicating this is the per-transaction latency of the device. Exploiting this might have been slightly quicker than the above solution, but the trade-off is that it is limited to 64-byte chunks and each byte takes longer (680 ns, IIRC) due to the additional cycles required to do a firmware read.
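For illustration, a minimal user-space sketch of the two access patterns using the x86 port I/O helpers from `<sys/io.h>` (the port address and block size are hypothetical; the real driver would do this in kernel space):

```c
#include <stdint.h>
#include <sys/io.h>   /* insb/insl, ioperm - x86 Linux port I/O helpers */

#define DEV_PORT 0x300          /* hypothetical LPC I/O port of the device */

/* Baseline: one LPC transaction per byte ("rep insb" under the hood). */
static void read_block_bytewise(uint8_t *buf, unsigned count)
{
    insb(DEV_PORT, buf, count);
}

/* Improved: with the register mapped at four adjacent addresses, read 32
   bits at a time; the SCH splits each 32-bit read into four back-to-back
   8-bit LPC cycles, so the dead time is paid once per 4 bytes. */
static void read_block_32bit(uint8_t *buf, unsigned count)
{
    insl(DEV_PORT, buf, count / 4);
}

int main(void)
{
    static uint8_t buf[4096];
    if (ioperm(DEV_PORT, 4, 1) != 0)  /* need the 4 consecutive ports; root */
        return 1;
    read_block_bytewise(buf, sizeof buf);
    read_block_32bit(buf, sizeof buf);
    return 0;
}
```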

Control stepper motors via USB

I'm building a USB device to control stepper motors. I've done this before using a parallel port, but because those ports no longer exist on current motherboards, I decided to implement USB communication between my device and the PC (host).
To achieve my objective, I equipped the device with a Freescale microcontroller that has a 12 Mbps USB module.
My USB device must receive 4 bytes (one for each motor driver) at a given time, because every byte is a step that should move a motor.
On the PC (host), a user application processes a text file with the trajectory information and sends the bytes at a certain rate for each motor (the timing is critical to achieve the desired acceleration and speed of the motors).
Using the parallel port this was an easy task, because each byte was sent sequentially at a time determined by the user app.
Doing a little research on the full-speed USB protocol, I understood that a frame is sent every 1 ms.
So I can send 4 bytes, or many more, every 1 ms, but I cannot manage the timing the way I did with the parallel port.
My microcontroller can send up to 64 bytes per frame (depending on the transfer type: control, bulk, interrupt, isochronous).
Question 1:
How can I send 4-byte packets more often than every 1 ms?
Question 2:
What type of transfer would you advise for this kind of device?
Thanks.
Like Ricardo said, USB-serial will suffice.
As for the type of transfer, try implementing a CDC stack and use your SCI receiver to listen for PC commands. That will give you a receive buffer which will meet your needs.
Initialize your SCI (baud rate, etc.)
Enable the receiver and its interrupt
On data receive, move the byte into your 4-byte command buffer
Clear the receive buffer and wait for more
When you have all 4 bytes, fire off the steppers! Four bytes should only take a few µs - see the sketch below.
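A minimal sketch of that receive flow (the byte-source and stepper-output names are hypothetical; the real ones depend on the specific Freescale part and its CDC stack):

```c
#include <stdint.h>

#define CMD_LEN 4

static volatile uint8_t cmd_buf[CMD_LEN];  /* one byte per motor driver */
static volatile uint8_t cmd_idx;

/* Hypothetical hooks: byte source from the CDC/SCI stack, stepper output. */
extern uint8_t sci_read_byte(void);
extern void step_motors(const volatile uint8_t cmd[CMD_LEN]);

/* Called from the SCI receive interrupt for every byte that arrives. */
void sci_rx_isr(void)
{
    cmd_buf[cmd_idx++] = sci_read_byte();
    if (cmd_idx == CMD_LEN) {       /* all four motor bytes collected */
        step_motors(cmd_buf);       /* fire off the steppers */
        cmd_idx = 0;                /* wait for the next command */
    }
}
```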
Check with Freescale to see if your processor is supported.
http://cache.freescale.com/files/microcontrollers/doc/support_info/USB_STACK_RELEASE_NOTES_V4.1.1.pdf?fpsp=1
There might even be some sample code to get you started.
-Cheers
I am achieving the same goal (driving/controlling CNC machines) like this:
The USB device is just a synchronous parallel I/O port, using continuous bulk transfers with one pipe as input and one as output. This way I was able to achieve synchronous 64-bit parallel communication with a ~70 kHz sample rate. That uses around 4.27 Mbit/s in + 4.27 Mbit/s out of traffic, which is the limit for my MCU and code. Higher speeds cause jitter on the output due to USB event interrupts.
How to do it (on the MCU side):
I have two FIFOs, one for ingoing and one for outgoing data, and a timer interrupt occurring at the sample-rate frequency. In it, I read the inputs and feed them into the first FIFO, and I read data from the other FIFO and send it to the outputs.
On top of that, the USB task is called (inside the same interrupt), checking the FIFOs for data to send over USB and for incoming data from USB, and handling the transfer itself - see the sketch below.
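A hedged sketch of that interrupt structure (all names are hypothetical placeholders; the real I/O and USB calls depend on the AT32UC3A peripheral library in use):

```c
#include <stdint.h>

#define FIFO_SIZE 1024  /* power of two, so masking the indices works */

typedef struct {
    uint64_t buf[FIFO_SIZE];           /* one 64-bit parallel sample per slot */
    volatile uint16_t head, tail;      /* free-running, masked on access */
} fifo_t;

static fifo_t in_fifo, out_fifo;       /* device->host and host->device */

/* Hypothetical hardware/USB hooks. */
extern uint64_t read_parallel_inputs(void);
extern void write_parallel_outputs(uint64_t sample);
extern void usb_task(fifo_t *to_host, fifo_t *from_host);  /* bulk IN/OUT */

/* Timer interrupt, fired at the sample rate (~70 kHz in the setup above). */
void timer_isr(void)
{
    /* Sample the inputs into the ingoing FIFO. */
    in_fifo.buf[in_fifo.head++ & (FIFO_SIZE - 1)] = read_parallel_inputs();

    /* Drive the outputs from the outgoing FIFO, if data is available. */
    if (out_fifo.tail != out_fifo.head)
        write_parallel_outputs(out_fifo.buf[out_fifo.tail++ & (FIFO_SIZE - 1)]);

    /* Service the USB bulk pipes from the same interrupt. */
    usb_task(&in_fifo, &out_fifo);
}
```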
I chose Atmel AT32UC3A chips for this task. After long and painful research I settled on these MCUs because they have enough memory for both FIFOs and the program, so no additional IC is needed. They come in a QFP package, which is usable for me (BGA is not an option), they have high-speed USB (most USB MCUs have only full speed, like yours), they run at 66 MHz, and they support many interesting features (I did some interesting projects with them in the past). And of course I have experience with Atmel MCUs.
So if you want to achieve something similar:
Start with the bulk transfer (PC -> USB -> MCU -> output).
Add a FIFO if needed.
I do not know the sample rate you need. The old LPT ports could handle 80-196 kHz depending on the manufacturer. The modern ones are much, much slower (which is silly and sad).
Measure the critical sample rate.
You need an oscilloscope (or very good hearing) for this. The output data must be synchronous: no holes in it, no jitter, etc.
If any of these are present, you have to lower the sample rate. My setup could handle even a 1 MHz sample rate, but USB jitter was present (sometimes a USB event froze the sending for longer than one sample), so I achieved only 70 kHz of stable output.
If you also need inputs, then add them,
but only if the output is working as it should. Do not forget to lower the sample rate after this, too. Use separate bulk pipes and FIFOs for input and output.