Wednesday, May 26, 2010

Atomic collections are a powerful toolbox for real-time embedded systems

If you're unfamiliar with how atomic collections compare to traditional collections and data structures, read on...

A typical wait-free/lock-free doubly-linked list (or deque) uses a DCAS or CAS2 (double compare-and-swap).

In general, just working with atomic collections opens up some major challenges... Take a look at existing implementations, the ABA problem, hazard pointers, memory barriers, and the many forum threads discussing CAS, CMPXCHG, DCAS, CAS2, etc.  Because DCAS/CAS2 is not always supported on a particular embedded architecture, it would be nice to have an "atomic API" that lets the collection implementation fall back to CAS.

Is this possible?

Libraries that currently employ atomics layer the ISA-specific assembly so that it all sits behind a few C routines that can effectively be swapped out for different architectures.  We've been working toward a non-traditional atomic API that enables intrinsics like DCAS to fall back to CAS without the need to change the data-structure implementation.  For now though, I'll just focus on why atomic collections are powerful.
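
To make the layering idea concrete, here is a minimal sketch of what such an atomic API might look like (the names and the spinlock-based DCAS fallback are my own illustration, not the framework's actual code).  CAS maps directly to a compiler builtin; where the target has no native DCAS/CAS2, this sketch emulates it behind the same signature — a blocking stand-in, whereas a truly lock-free fallback would swap in a CAS-only algorithm instead:

```c
#include <stdbool.h>
#include <stddef.h>

/* Single-word CAS: maps straight to a GCC/Clang builtin. */
static inline bool atomic_cas_ptr(void **addr, void *expected, void *desired)
{
    return __atomic_compare_exchange_n(addr, &expected, desired, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Hypothetical DCAS fallback: when no native double-width CAS exists,
 * guard the two-word update with a CAS-built spinlock.  The point is
 * that the API stays stable while the implementation changes per target. */
static int dcas_lock = 0;

static inline bool atomic_dcas(void **a1, void *e1, void *d1,
                               void **a2, void *e2, void *d2)
{
    bool ok = false;
    while (__atomic_exchange_n(&dcas_lock, 1, __ATOMIC_ACQUIRE))
        ;                              /* spin until the lock is free */
    if (*a1 == e1 && *a2 == e2) {      /* both words must still match  */
        *a1 = d1;
        *a2 = d2;
        ok = true;
    }
    __atomic_store_n(&dcas_lock, 0, __ATOMIC_RELEASE);
    return ok;
}
```

A data structure written against atomic_dcas() then compiles unchanged whether the target provides a real double-width CAS or only this emulation.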

A general API for a doubly linked list in our framework would accept an arbitrary "element type".  The C API might look something like this:
  • void List_Create(list_t* pList);
  • void List_AddHead(list_t* pList, elem_t* pElem);
  • void List_AddTail(list_t* pList, elem_t* pElem);
  • void List_AddAfter(list_t* pList, elem_t* pAfterElem, elem_t* pElem);
  • elem_t* List_RemoveHead(list_t* pList);
  • elem_t* List_RemoveTail(list_t* pList);
  • void List_Remove(list_t* pList, elem_t* pElem);
  • void List_Destroy(list_t* pList);
NOTE: All operations are, of course, O(1)
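
To show why every operation can be O(1), here is a minimal, non-concurrent sketch of how the API above might be implemented with intrusive links and a sentinel node (field names are my own guesses, not the framework's).  Embedding elem_t inside the user's structure is what avoids allocation and makes each operation constant-time:

```c
#include <stddef.h>

typedef struct elem { struct elem *prev, *next; } elem_t;
typedef struct { elem_t head; } list_t;   /* sentinel node, never removed */

void List_Create(list_t *pList)
{
    pList->head.prev = pList->head.next = &pList->head; /* empty circular list */
}

void List_AddHead(list_t *pList, elem_t *pElem)
{
    pElem->prev = &pList->head;
    pElem->next = pList->head.next;
    pElem->next->prev = pElem;
    pList->head.next = pElem;
}

void List_AddTail(list_t *pList, elem_t *pElem)
{
    pElem->next = &pList->head;
    pElem->prev = pList->head.prev;
    pElem->prev->next = pElem;
    pList->head.prev = pElem;
}

void List_Remove(list_t *pList, elem_t *pElem)
{
    (void)pList;                       /* the element's links suffice */
    pElem->prev->next = pElem->next;
    pElem->next->prev = pElem->prev;
}

elem_t *List_RemoveHead(list_t *pList)
{
    elem_t *e = pList->head.next;
    if (e == &pList->head)
        return NULL;                   /* list is empty */
    List_Remove(pList, e);
    return e;
}
```

This version is exactly the kind of code that needs protection when an interrupt can fire mid-update — which is the problem discussed next.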

So, in a typical implementation, if the main execution context started List_RemoveHead but could be interrupted by an interrupt handler that calls List_AddTail, we would want to protect the shared list by either:
 1. disabling interrupts around the critical sections of the Add/Remove code, or
 2. using some sort of primitive lock or mutex to similarly protect these sections

...otherwise, the list pointers would get corrupted by arbitrary pre-emption of the main execution context.

Both of these solutions are really undesirable from a performance perspective.  As old as the "disable-all-interrupts" technique is, we *still* see it used quite frequently inside client libraries, kernels, and sample software from various embedded vendors.  It threatens the system's ability to service interrupts in a timely manner; it can of course be tuned so that it doesn't matter for a specific real-time application, but it is certainly not a safe, generalizable solution.

The use of mutex locks typically implies the side effect of a priority inversion for the interrupt if it attempts List_AddTail when a lower-priority main context has already acquired the lock.  These effects may be "tunable" and non-critical on a single-microprocessor system, but as the number of cores increases, these design choices impact performance significantly.

One last performance point: the trivial approach of locking the entire list_t structure for access to any element is common — it's what you typically see, for example, when wrapping an STL container for thread safety.  This means we cannot leverage concurrency even when two different elements have no pointers in common.  We would cumbersomely lock the entire structure in every case.

When we introduce atomic algorithms, we resolve many of these issues:
  •   We NEVER disable interrupts
  •   Locks are NOT held on the structure, so..
  •   We NEVER priority invert or force a context-switch to modify the data structure
  •   Although not perfect, we can calculate a worst-case (wait-free) bound on the time it takes an interrupt to execute its List_AddTail call, thereby enabling hard real-time constraints on the software, thus..
  •   The highest priority interrupt accessing the common structure always "wins" (gains access to the data structure without a traditional "lock"), and..
  •   Multiple cores (or threads) can modify different parts of the data structure concurrently when there are no common pointers between the elements involved
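
The doubly-linked case needs DCAS or a more elaborate algorithm, but a single-CAS LIFO shows the "no locks, no disabled interrupts" pattern in miniature.  This is an illustrative Treiber-style stack, not the framework's code, and the pop side is ABA-unsafe as written — exactly the class of hazard mentioned earlier:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node { struct node *next; } node_t;

void stack_push(node_t **top, node_t *n)
{
    node_t *old;
    do {
        old = __atomic_load_n(top, __ATOMIC_ACQUIRE);
        n->next = old;
        /* Retry if another context (e.g. an ISR) changed *top meanwhile. */
    } while (!__atomic_compare_exchange_n(top, &old, n, false,
                                          __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}

node_t *stack_pop(node_t **top)        /* NOTE: ABA-unsafe as written */
{
    node_t *old;
    do {
        old = __atomic_load_n(top, __ATOMIC_ACQUIRE);
        if (old == NULL)
            return NULL;               /* stack is empty */
    } while (!__atomic_compare_exchange_n(top, &old, old->next, false,
                                          __ATOMIC_RELEASE, __ATOMIC_RELAXED));
    return old;
}
```

An interrupt handler can call stack_push at any point — even mid-way through a main-context push — and the CAS retry loop resolves the race without disabling anything.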
Fun stuff...

Atomics and collections in C for embedded systems

I have a lot of catching up to do on many technical topics here at Fuel7...

We have a set of software which provides collections in C.  I found this FOSS software which is kind of similar to what we did: http://code.google.com/p/c-generic-library/

There are a number of notable differences in our implementations:
  • We have implemented some of the more complex data structures like left-leaning RB trees, etc.
  • We have a more generalized concept of "elements" within a collection.  Element comparison, hashing, keying, destruction, etc. are handled through collection callback functions.  In the case of searching on a tuple, an element structure contains both the key and the value.
  • Our model does not rely on heap allocation/de-allocation; most of our implementations use a "static memory model."  Allocation is user-specified, so elements can be dynamically or statically allocated.
  • "Elements" can efficiently be moved from one collection to another.
  • We have generalized iterators.
  • The collections support thread concurrency.
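
One plausible shape for the callback-driven element model described above is a per-collection table of function pointers (the names here are my guesses, not the library's actual API) — the collection never looks inside an element directly, it calls back into user code:

```c
#include <stddef.h>

/* Hypothetical per-collection callback table. */
typedef struct {
    int         (*compare)(const void *a, const void *b); /* ordering / equality  */
    size_t      (*hash)(const void *elem);                /* for hashed collections */
    const void *(*key)(const void *elem);                 /* extract key from element */
    void        (*destroy)(void *elem);                   /* user-defined teardown */
} elem_ops_t;

/* Example element carrying both key and value, as in the tuple-search case. */
typedef struct {
    int key;
    int value;
} kv_elem_t;

static int kv_compare(const void *a, const void *b)
{
    const kv_elem_t *ea = a;
    const kv_elem_t *eb = b;
    return (ea->key > eb->key) - (ea->key < eb->key);
}

/* Unused callbacks stay NULL; a collection only requires the ones it needs. */
static const elem_ops_t kv_ops = { kv_compare, NULL, NULL, NULL };
```

Because the key lives inside the element, a collection can search on a stack-allocated "probe" element with only the key filled in — no heap involvement, consistent with the static memory model above.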

I think we've solved iteration w/ concurrency in a unique way as well.  When using an iterator on a collection, the iterator is thread-safe with respect to the element its cursor points to.  This is why there is a callback for "destruction" — destroying an element.  Most other software instead follows a reference-count/release policy.

The goals for this software have already been met.  These collections provide the foundation for a resource constrained, high performance ANSI C framework that we use to bootstrap work for our clients.  I have not yet decided to open source the entire library, but it's always on the back of my mind.

The topic I'm most interested in is thread concurrency.  One of the most critical limiting factors in performance scalability w/ respect to multi-core is the use of thread locking: concurrency w/ synchronization kills performance (see the interesting & relevant article from Nvidia earlier this month!).  The main idea behind using atomics is that they enable non-blocking synchronization.  We avoid all kinds of locking mechanisms at a per-element level, meaning:
  1. there is improved performance because priority inversion does not occur
  2. concurrent (other-core) access to a different element in the same collection is uninhibited

http://www.audiomulch.com/~rossb/code/lockfree/ has a pretty good list of open and commercial solutions on the market.  Some of the challenges here are problems that have been solved well by other people, so we need to take a breadth-first approach to understand which techniques are currently employed.  There are some higher-level libraries like Boost & TBB that put these techniques to good use, but they are tied to certain data structures (e.g. a heap), to C++ language semantics that are not as portable as C, or to a specific processor architecture.  We've been working on atomic implementations of the higher-level data structures for some time and are closing in on a generalizable solution that can be integrated into our software framework and fall back to a "thread model" when the intrinsics are not available.

This has enabled lock-free scheduling, algorithm partitioning, task dispatching, high-performance network servers, concurrent heap allocation, among other use cases at Fuel7.

FFplay or vlc for playback of RTP/SDP?

Just completing a demo platform for the DM6467T with video capture and h.264 RTP streaming.  These employ freely licensable codecs distributed by TI rather than commercial codecs.  We're looking forward to testing latencies over dedicated Ethernet, as they will likely be very aggressive.

We should have a baseline SDK w/ Linux 2.6.32+, DVSDK 3.10, 720P @ 60fps, or 1080P/1080i @ 30 fps ready shortly.

We're currently working through the syntactic compatibility of the TI h.264 with other codecs.  I've used ffplay to play back RTP streams in the past, but AFAIK, FFplay doesn't have extensive command line options to specify the stream type.  After putting together an SDP file to specify the metadata for the video coming out of the EVM, it seems rather odd to me that no simple RTSP server exists.

I looked around at a couple of options for RTSP:



...but these are clients.  I just need my player to open up the RTP payload, and an SDP file seems like the easiest way to provide that info.  A quick look at the RTSP protocol (http://forum.videolan.org/viewtopic.php?f=3&t=38586) shows that there are only a few commands the server needs to respond to:

  •  OPTIONS, DESCRIBE, SETUP

...because FFplay won't require the entire command set. (http://www.ietf.org/rfc/rfc2326.txt)
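
For reference, a minimal SDP file describing a raw H.264-over-RTP stream looks roughly like the following, per RFC 4566 and the H.264 RTP payload format (the address, port, and payload type here are placeholders — substitute whatever the EVM actually sends):

```
v=0
o=- 0 0 IN IP4 192.0.2.10
s=DM6467T H.264 stream
c=IN IP4 192.0.2.10
t=0 0
m=video 5004 RTP/AVP 96
a=rtpmap:96 H264/90000
a=fmtp:96 packetization-mode=1
```

Pointing ffplay at such a file (e.g. `ffplay stream.sdp`) is usually enough for it to join the RTP session described, with no RTSP server involved at all.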





Sunday, November 30, 2008

Tuesday, July 1, 2008

ATC 2008

Just got back from a week at the Hilton for TI's annual ATC 2008. This was a very informative conference. We highly recommend these conferences for anyone seriously working with TI products. They are heavily seeded w/ professionals in the industry, which made for a great platform for networking and some creative discussions on ultra-low-power wireless applications. The value of the gifted EVM modules and development software alone makes up for the cost of the conference.

More to come on the conference, and on progress toward a viable development toolchain under Linux, since we ran into Steve Underwood (author of msp430-gdbproxy) on Thursday.

Using this blog to populate our updated design template


So hopefully sometime this month we'll have a renovated site up with some more active and dynamic content. News and press postings will come through here...