What would cause an LBC-managed region access to hang? - embedded

I'm trying to write a CompactFlash driver for the RouterBoard 800 under FreeBSD, and I'm running into problems. The CF slot is managed by the Local Bus Controller (LBC) of the CPU (an MPC8544E) through the User Programmable Machine (UPM) module, and any access to the memory region the CF card is mapped at hangs the thread (the CPU can still be interrupted, but the thread never continues). Even the dummy accesses made while programming or reading the UPM hang. So the question is: what would cause the thread to hang when accessing the UPM-managed region, even for a dummy access, which should not actually assert the bus?
I know the CF card and slot themselves work, because the kernel itself boots from the card, loaded by the RouterBoard boot loader.

For posterity: the MPC8544E (and probably most of the MPC85xx series) has the concept of Local Access Windows (LAWs). If an address does not fall inside any of the 10 windows that have been set up, the access is silently dropped on the floor: it never completes, so the thread hangs, with no exception thrown and no garbage data returned. This is true for all address regions, including external RAM.
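A minimal sketch of what checking for and adding a LAW can look like, assuming a bare-metal or early-boot context where the CCSR block is already mapped. The register offsets, field encodings and the LBC target ID below are what I recall from the MPC8544E reference manual, so treat them as illustrative and double-check them against the manual for your silicon revision:

    /* Sketch: make sure the CF/UPM region is covered by an enabled LAW
     * before touching it.  LAWBARn/LAWARn offsets and encodings should be
     * verified against the MPC8544E reference manual. */
    #include <stdint.h>

    #define LAW_COUNT        10
    #define LAWBAR(n)        (0x0c08 + (n) * 0x20)   /* holds base address >> 12 */
    #define LAWAR(n)         (0x0c10 + (n) * 0x20)
    #define LAWAR_EN         0x80000000u
    #define LAWAR_TRGT_LBC   (0x04u << 20)            /* local bus controller target */
    #define LAWAR_SIZE_64K   0x0fu                    /* size = 2^(SIZE+1) */

    static inline uint32_t ccsr_read(volatile uint8_t *ccsr, uint32_t off)
    {
        return *(volatile uint32_t *)(ccsr + off);
    }

    static inline void ccsr_write(volatile uint8_t *ccsr, uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(ccsr + off) = val;
    }

    /* Return 1 if 'addr' falls inside an enabled LAW, else 0. */
    static int law_covers(volatile uint8_t *ccsr, uint32_t addr)
    {
        for (int n = 0; n < LAW_COUNT; n++) {
            uint32_t ar = ccsr_read(ccsr, LAWAR(n));
            if (!(ar & LAWAR_EN))
                continue;
            uint32_t base = ccsr_read(ccsr, LAWBAR(n)) << 12;
            uint32_t size = 1u << ((ar & 0x3f) + 1);
            if (addr >= base && addr - base < size)
                return 1;
        }
        return 0;
    }

    /* Program a free LAW so accesses to the CF region actually reach the LBC. */
    static void law_add_lbc(volatile uint8_t *ccsr, int n, uint32_t base)
    {
        ccsr_write(ccsr, LAWBAR(n), base >> 12);
        ccsr_write(ccsr, LAWAR(n), LAWAR_EN | LAWAR_TRGT_LBC | LAWAR_SIZE_64K);
        (void)ccsr_read(ccsr, LAWAR(n));   /* read back so the write takes effect before use */
    }

If a check like law_covers() returns 0 for the CF window's physical address, that matches the hang described above: the transaction simply never completes.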

Related

How to bypass Microsoft .NET Framework Unhandled Exception?

My company recently started installing IO modules on production machines to capture data into our database. To do so, we outsourced the work to a third-party programmer, who created IO-pulling software that draws data from the IO modules into a database on a collector PC.
The IO pulling software is running on the collector PC and will continue to draw data from machines into its local database for as long as the software is left open. No data will come in if the collector PC is shut down or the software is turned off.
The program works well, but there was a bit of an oversight: if a machine gets unplugged or disconnected from the network (connections are made via Ethernet cables), an unhandled exception error appears in the pulling software for that particular machine.
Clearly it's due to there being no connection, and clicking Continue makes the dialog go away, but I'm afraid it will affect the quality of the data captured. Also, our contract with the programmer who made this has expired, so I'm the one who has to fix it.
The program was written in VB, and I was thinking of adding a procedure to the code to handle this error automatically, or at least have the program select Continue and move on without human intervention. As I'm new to programming, I'm not sure whether my approach is even possible. Any ideas?
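The approach is possible; the usual fix is to wrap the network call in an exception handler (Try/Catch in VB) that logs the failure and retries after a delay, instead of letting the dialog appear. A language-neutral sketch of that pattern, written here in C with a hypothetical poll_machine() stub since the actual VB source isn't shown:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the real network poll: returns 0 on success,
     * -1 when the machine is unreachable (simulated here at random). */
    static int poll_machine(const char *machine_id)
    {
        (void)machine_id;
        return (rand() % 4 == 0) ? -1 : 0;
    }

    /* Keep polling; on failure, log and retry after a delay instead of
     * stopping and waiting for someone to click Continue. */
    static void poll_loop(const char *machine_id)
    {
        for (;;) {
            if (poll_machine(machine_id) != 0) {
                fprintf(stderr, "machine %s unreachable, retrying in 30 s\n",
                        machine_id);
                sleep(30);      /* back off, then try again automatically */
                continue;
            }
            /* ...store the sample in the database here... */
            sleep(1);           /* normal polling interval */
        }
    }

    int main(void)
    {
        poll_loop("machine-01");
        return 0;
    }

In the VB program the equivalent would be a Try/Catch around the code that reads from the module, with the Catch branch logging the outage (so the gap in the data is visible later) and scheduling a retry.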

Can VMs on Google Compute detect when they've been migrated?

Is it possible to notify an application running on a Google Compute VM when the VM migrates to different hardware?
I'm a developer for an application (HMMER) that makes heavy use of vector instructions (SSE/AVX/AVX-512). The version I'm working on probes its hardware at startup to determine which vector instructions are available and picks the best set.
We've been looking at running our program on Google Compute and other cloud engines, and one concern is that, if a VM migrates from one physical machine to another while running our program, the new machine might support different instructions, causing our program to either crash or execute more slowly than it could.
Is there a way to notify applications running on a Google Compute VM when the VM migrates? The only relevant information I've found is that you can set a VM to perform a shutdown/reboot sequence when it migrates, which would kill any currently-executing programs but would at least let the user know that they needed to restart the program.
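For context, the startup probe described above typically comes down to a few compiler builtins; a minimal sketch assuming GCC or Clang on x86 (this is exactly the check that goes stale if the VM ever ends up on a host with a different feature set):

    #include <stdio.h>

    /* Pick the best vector dispatch level available on the current CPU.
     * __builtin_cpu_init()/__builtin_cpu_supports() are GCC/Clang builtins. */
    int main(void)
    {
        __builtin_cpu_init();

        if (__builtin_cpu_supports("avx512f"))
            puts("using AVX-512 kernels");
        else if (__builtin_cpu_supports("avx2"))
            puts("using AVX2 kernels");
        else if (__builtin_cpu_supports("sse4.1"))
            puts("using SSE4.1 kernels");
        else
            puts("using portable scalar kernels");
        return 0;
    }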
We ensure that your VM instances never live migrate between physical machines in a way that would cause your programs to crash the way you describe.
However, for your use case you probably want to specify a minimum CPU platform version. You can use this to ensure that e.g. your instance has the new Skylake AVX instructions available. See the documentation on Specifying the Minimum CPU Platform for further details.
As per the Live Migration docs:
Live migration does not change any attributes or properties of the VM itself. The live migration process just transfers a running VM from one host machine to another. All VM properties and attributes remain unchanged, including things like internal and external IP addresses, instance metadata, block storage data and volumes, OS and application state, network settings, network connections, and so on.
Google does provide a few controls for setting instance availability policies, which also let you control aspects of live migration. Here they also mention what you can look for to determine when a live migration has taken place.
Live migrate
By default, standard instances are set to live migrate, where Google Compute Engine automatically migrates your instance away from an infrastructure maintenance event, and your instance remains running during the migration. Your instance might experience a short period of decreased performance, although generally most instances should not notice any difference. This is ideal for instances that require constant uptime, and can tolerate a short period of decreased performance.
When Google Compute Engine migrates your instance, it reports a system event that is published to the list of zone operations. You can review this event by performing a gcloud compute operations list --zones ZONE request or by viewing the list of operations in the Google Cloud Platform Console, or through an API request. The event will appear with the following text:
compute.instances.migrateOnHostMaintenance
In addition, you can detect directly on the VM when a maintenance event is about to happen.
Getting Live Migration Notices
The metadata server provides information about an instance's scheduling options and settings, through the scheduling/ directory and the maintenance-event attribute. You can use these attributes to learn about a virtual machine instance's scheduling options, and use this metadata to notify you when a maintenance event is about to happen through the maintenance-event attribute. By default, all virtual machine instances are set to live migrate so the metadata server will receive maintenance event notices before a VM instance is live migrated. If you opted to have your VM instance terminated during maintenance, then Compute Engine will automatically terminate and optionally restart your VM instance if the automaticRestart attribute is set. To learn more about maintenance events and instance behavior during the events, read about scheduling options and settings.
You can learn when a maintenance event will happen by querying the maintenance-event attribute periodically. The value of this attribute will change 60 seconds before a maintenance event starts, giving your application code a way to trigger any tasks you want to perform prior to a maintenance event, such as backing up data or updating logs. Compute Engine also offers a sample Python script to demonstrate how to check for maintenance event notices.
You can use the maintenance-event attribute with the waiting for updates feature to notify your scripts and applications when a maintenance event is about to start and end. This lets you automate any actions that you might want to run before or after the event. The following Python sample provides an example of how you might implement these two features together.
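The Python sample mentioned in the quote is not reproduced here, but the underlying mechanism is just an HTTP GET against the metadata server with the Metadata-Flavor: Google header; adding ?wait_for_change=true makes the request block until the value changes. A minimal sketch in C using libcurl (build with -lcurl); the buffer handling is illustrative only:

    #include <stdio.h>
    #include <string.h>
    #include <curl/curl.h>

    /* Append the response body into a small fixed buffer. */
    static size_t collect(char *data, size_t size, size_t nmemb, void *userp)
    {
        char *buf = userp;
        size_t n = size * nmemb;
        size_t used = strlen(buf);
        if (used + n >= 128)
            n = 127 - used;
        memcpy(buf + used, data, n);
        buf[used + n] = '\0';
        return size * nmemb;
    }

    int main(void)
    {
        char value[128];
        struct curl_slist *hdrs = NULL;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        hdrs = curl_slist_append(hdrs, "Metadata-Flavor: Google");
        curl_easy_setopt(curl, CURLOPT_URL,
            "http://metadata.google.internal/computeMetadata/v1/instance/"
            "maintenance-event?wait_for_change=true");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, value);

        for (;;) {
            value[0] = '\0';
            if (curl_easy_perform(curl) != CURLE_OK)
                break;
            /* Documented values: "NONE" when nothing is pending, otherwise
             * MIGRATE_ON_HOST_MAINTENANCE or TERMINATE_ON_HOST_MAINTENANCE,
             * depending on the instance's availability policy. */
            if (strcmp(value, "NONE") != 0)
                printf("maintenance event pending: %s\n", value);
        }

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }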
You can also choose to terminate and optionally restart your instance.
Terminate and (optionally) restart
If you do not want your instance to live migrate, you can choose to terminate and optionally restart your instance. With this option, Google Compute Engine will signal your instance to shut down, wait for a short period of time for your instance to shut down cleanly, terminate the instance, and restart it away from the maintenance event. This option is ideal for instances that demand constant, maximum performance, and your overall application is built to handle instance failures or reboots.
Look at the Setting availability policies section for more details on how to configure this.
If you use an instance with a GPU, or a preemptible instance, be aware that live migration is not supported:
Live migration and GPUs
Instances with GPUs attached cannot be live migrated. They must be set to terminate and optionally restart. Compute Engine offers a 60 minute notice before a VM instance with a GPU attached is terminated. To learn more about these maintenance event notices, read Getting live migration notices.
To learn more about handling host maintenance with GPUs, read Handling host maintenance on the GPUs documentation.
Live migration for preemptible instances
You cannot configure a preemptible instance to live migrate. The maintenance behavior for preemptible instances is always set to TERMINATE by default, and you cannot change this option. It is also not possible to set the automatic restart option for preemptible instances.
As Ramesh mentioned, you can specify a minimum CPU platform to ensure you are only ever migrated to hardware that has at least the minimum CPU platform you specified. At a high level it looks like:
In summary, when you specify a minimum CPU platform:
Compute Engine always uses the minimum CPU platform where available.
If the minimum CPU platform is not available or the minimum CPU platform is older than the zone default, and a newer CPU platform is available for the same price, Compute Engine uses the newer platform.
If the minimum CPU platform is not available in the specified zone and there are no newer platforms available without extra cost, the server returns a 400 error indicating that the CPU is unavailable.

Google Compute Engine VM constantly crashes

On a Compute Engine VM in us-west-1b, I run 16 vCPUs at close to 99% usage. After a few hours, the VM crashes on its own. This is not a one-time incident, and I have to restart the VM manually.
There are a few instances of CPU usage suddenly dropping to around 30%, then bouncing back to 99%.
There are no logs for the VM at the time of the crash. Is there any other way to get the error logs?
How do I prevent VMs from crashing?
CPU usage graph
This could be the process manager saying that your processes are out of resources. You might want to look into kernel tuning, where you can increase the limits on the number of active processes on your VM/OS and the resources available to them, or you can try using a bigger machine with more physical resources. In short, your machine is falling short on resources, so in order to keep the OS up the process manager shuts processes down; SSH is one of those processes. Once you reset the machine, everything comes back to normal.
How the process manager/kernel decides to kill a process varies. It could simply be that a process has stayed up long enough to consume too many resources. Also, one thing to note is that the OS images you use to create a VM on GCP are custom-hardened by Google to make sure they can limit the malicious capabilities of processes running on such machines.
One of the best ways to tackle this is:
increase the resources of your VM
then go back to the code and find out whether something is leaking processes or memory
if all else fails, you might want to do some kernel tuning to make sure your processes have higher priority than other system processes. Though this is a bad idea, since you could end up creating a zombie VM.
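As a small, concrete illustration of the "increase the limits" advice: per-process limits on a Linux guest can be inspected and raised with getrlimit()/setrlimit() (system-wide ceilings live in sysctl). The sketch below only touches the open-file limit as an example; which limit your workload is actually hitting has to come from the logs (dmesg, journalctl) after a crash.

    #include <stdio.h>
    #include <sys/resource.h>

    /* Inspect and raise the per-process open-file limit.  Raising the soft
     * limit up to the hard limit needs no privileges; raising the hard
     * limit itself requires root. */
    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;      /* lift the soft limit to the hard cap */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }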

Is it possible to "wake up" a Linux kernel process from user space without a system call?

I'm trying to modify a kernel module that manages a special piece of hardware.
The user-space process performs 2 ioctl() system calls per millisecond to talk to the module. This doesn't meet my real-time requirements, because the 2 syscalls sometimes take too long to execute and overrun my time slot.
I know that with mmap I could share a memory area, and this is great, but how can I synchronize the data exchange with the module without ioctl()?
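One common pattern, sketched below from the user-space side only: mmap a command ring exported by the module and have a kernel thread in the module poll it, so the steady-state path needs no syscall at all; the only requirement is a memory barrier so the payload is visible before the index that publishes it. The device node /dev/mydev, the struct cmd_ring layout and the kernel-side consumer are all hypothetical here, and an eventfd or a plain write() can be added as a cheap doorbell if busy-polling in the module is not acceptable.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define RING_SLOTS 64

    /* Layout agreed with the (hypothetical) kernel module. */
    struct cmd_ring {
        volatile uint32_t head;     /* written by user space */
        volatile uint32_t tail;     /* written by the module's kthread */
        uint64_t cmd[RING_SLOTS];
    };

    static int push_cmd(struct cmd_ring *r, uint64_t cmd)
    {
        uint32_t head = r->head;

        if (head - r->tail == RING_SLOTS)
            return -1;                  /* ring full */
        r->cmd[head % RING_SLOTS] = cmd;
        __sync_synchronize();           /* publish the payload before the index */
        r->head = head + 1;             /* consumer sees this when it polls */
        return 0;
    }

    int main(void)
    {
        int fd = open("/dev/mydev", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct cmd_ring *r = mmap(NULL, sizeof(*r), PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (r == MAP_FAILED) { perror("mmap"); return 1; }

        push_cmd(r, 0x1234);            /* no syscall on the fast path */

        munmap(r, sizeof(*r));
        close(fd);
        return 0;
    }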

USB EHCI: Help with transaction errors

One of our current milestones on our (open source) project at the moment is to complete USB support, and as such we're working hard on drivers at the moment. Our current development focuses on EHCI on both x86 and ARM (OMAP35xx SoC specifically, EHCI-only in the silicon of the board). We have mostly everything running smoothly in a variety of emulators - VMware (free and non-free versions), QEMU, and VirtualBox.
When we do testing on real hardware however, we get absolutely nowhere. The basic routine for device enumeration in our system goes something like this:
Turn on port power (if the option is available) and wait for power to stabilise to the device
Perform a port reset (held for 50 ms) and then wait as long as needed for the reset to complete (while loop)
If the port has a device present, and is enabled, notify the system that a new USB device is available for initialisation.
Send the SET ADDRESS command to assign an address to the device. This is where we run into problems everywhere:
The SETUP transaction for this command completes without error
The zero-length IN transaction (status phase) throws a transaction error, halts the qTD, and disables the port.
Our timing delays are basically the same as Linux's driver (if anything, longer).
According to the USB 2.0 specification, this behaviour is a "Port Error" (section 11.8) but to be completely honest I don't see how to translate its description of a port error into a working solution for our driver. As we are an open source project we also don't have the money to go out and purchase a proper hardware USB protocol analyser to investigate exactly what's going on on the line either.
Has anyone faced a similar problem and knows a solution?
We identified the cause of this problem as a timing issue, but in our case the issue was too much delay.
By modifying our qTD/QH creation code to create a single QH with multiple linked qTDs associated with it, we've been able to get successful runs on physical hardware.
We also had to use the EHCI 64-bit data structures, which had not been implemented previously.
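For anyone hitting the same wall: the change amounts to building the SETUP, DATA and STATUS stages as one chain of qTDs hanging off a single QH instead of a QH per stage. A hedged sketch of the qTD layout and the linking, with the fields as I read them from section 3.5 of the EHCI 1.0 specification (the 64-bit variant adds the extended buffer pointers); the phys[] helper is hypothetical:

    #include <stdint.h>

    /* EHCI qTD, 64-bit data structure variant (EHCI 1.0 spec, section 3.5).
     * Must be 32-byte aligned.  This is a sketch, not a drop-in driver. */
    struct ehci_qtd {
        volatile uint32_t next;         /* next qTD pointer, bit 0 = Terminate */
        volatile uint32_t alt_next;     /* alternate next qTD pointer */
        volatile uint32_t token;        /* status, PID, error counter, length, IOC */
        volatile uint32_t buffer[5];    /* buffer pointer list */
        volatile uint32_t buffer_hi[5]; /* upper 32 bits, 64-bit structures only */
    } __attribute__((aligned(32)));

    #define QTD_TERMINATE    0x00000001u
    #define QTD_TOKEN_ACTIVE 0x00000080u    /* Active bit in the status byte */

    /* Chain all stages of a control transfer so a single QH carries them.
     * 'phys' holds the bus addresses of the same qTDs (however your driver
     * obtains those); the last qTD terminates the chain. */
    static void qtd_chain(struct ehci_qtd **qtds, const uint32_t *phys, int count)
    {
        for (int i = 0; i < count; i++) {
            qtds[i]->next     = (i + 1 < count) ? phys[i + 1] : QTD_TERMINATE;
            qtds[i]->alt_next = QTD_TERMINATE;
            qtds[i]->token   |= QTD_TOKEN_ACTIVE;   /* hand the qTD to the HC */
        }
    }

The first qTD's bus address is then written once into the QH's overlay next-qTD pointer, rather than re-initialising the QH between stages.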