mtapi.tex 23.4 KB
Newer Older
Michael Schmid committed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327
\chapter{MTAPI}
\label{cha:mtapi}

Leveraging the power of multicore processors requires to split computations into fine-grained tasks that can be executed in parallel. Threads are usually too heavy-weight for that purpose, since context switches consume a significant amount of time. Moreover, programming with threads is complex and error-prone due to typical pitfalls such as race conditions and deadlocks. To solve these problems, efficient task scheduling techniques have been developed which dynamically distribute the available tasks among a fixed number of worker threads. To reduce overhead, there is usually exactly one worker thread for each processor core.

While task schedulers are nowadays widely employed, especially in desktop and server applications, they are typically limited to a single operating system running on a homogeneous multicore processor. System-wide task management in heterogeneous embedded systems must be realized explicitly with low-level communication mechanisms. MTAPI~\cite{MTAPI} addresses those issues by providing an API which allows parallel embedded software to be designed in a straightforward way, covering homogeneous and heterogeneous multicore architectures, as well as acceleration units. It abstracts from the hardware details and lets software developers focus on the application. Moreover, MTAPI takes into account typical requirements of embedded systems such as real-time constraints and predictable memory consumption.

The remainder of this chapter is structured as follows: The next section explains the basic terms and concepts of MTAPI as given in the specification~\cite{MTAPI}. Section~\ref{sec:mtapi_c_interface} describes the C API using a simple example taken from~\cite{MTAPI}. Finally, Section~\ref{sec:mtapi_cpp_interface} outlines the use of MTAPI in C++ applications. Note that the C++ interface is provided by \embb for convenience but it is not part of the standard.

\section{Foundations}

\subsection{Domains}

An MTAPI system is composed of one or more MTAPI domains. An MTAPI domain is a unique system global entity. Each MTAPI domain comprises a set of MTAPI nodes. An MTAPI node may only belong to one MTAPI domain, while an MTAPI domain may contain one or more MTAPI nodes. This allows the programmer to use MTAPI domains as namespaces for all kinds of IDs (e.g., nodes, actions, queues, etc.).

\subsection{Nodes}

An MTAPI node is an independent unit of execution, such as a process, thread, thread pool, processor, hardware accelerator, or instance of an operating system. A given MTAPI implementation specifies what constitutes a node for that implementation.

The intent is to avoid a mixture of node definitions in the same implementation (or in different domains within an implementation). If a node is defined as a unit of execution with its private address space (like a process), then a core with a single unprotected address space OS is equivalent to a node, whereas a core with a virtual memory OS can host multiple nodes.

On a shared memory SMP processor, a node can be defined as a subset of cores. A quad-core processor, for example, could be divided into two nodes, one node representing three cores and one node representing the fourth core reserved exclusively for certain tasks. The definition of a node is flexible because this allows applications to be written in the most portable fashion supported by the underlying hardware, while at the same time supporting more general-purpose multicore and many-core devices.

The definition allows portability of software at the interface level (e.g., the functional interface between nodes). However, the software implementation of a particular node cannot (and often should not) necessarily be preserved across a multicore SoC product line (or across product lines from different silicon providers) because a given node's functionality may be provided in different ways, depending on the chosen multicore SoC.

\subsection{Tasks}

A task represents the computation associated with the data to be processed. A task is executed concurrently to the code starting the task. The main API functions are \lstinline|mtapi_task_start()| and \lstinline|mtapi_task_wait()|. The semantics are similar to the corresponding thread functions (e.g. \lstinline|pthread_create|$/$\lstinline|pthread_join| in Pthreads). The lifetime of a task is limited; it can be started only once.

\subsection{Actions}

In order to cope with heterogeneous systems and computations implemented in hardware, a task is not directly associated with an entry function as it is done in other task-parallel APIs. Instead, it is associated with at least one action object representing the calculation. The association is indirect: one or more actions implement a job, one job is associated with a task. If the action is implemented in software, this is either a function on the same node (which can represent the same processor or core) or a function implemented on a different node that does not share memory with the core starting the task.

Starting a task consists of three steps:
\begin{enumerate}
  \item Create the action object with a job ID (software-implemented actions only).
  \item Obtain a job reference.
  \item Start the task using the job reference.
\end{enumerate}

\subsection{Synchronization}

The basic synchronization mechanism provided with in MTAPI is waiting for task completion. Calling \lstinline|mtapi_task_wait()| with a task handle blocks the current thread or task until the task referenced by the handle has completed. Depending on the implementation, the calling thread can be used for executing other tasks while waiting for the task to be completed. In order to synchronize with a set of tasks, every task can be associated with a task group. The methods \lstinline|mtapi_group_wait_all()| and \lstinline|mtapi_group_wait_any()| wait for a group of tasks or completion of any task in the group, respectively.
%MTAPI only provides synchronization on task granularity. Synchronization inside a task implementation can be done by MCAPI messages, MRAPI synchronization primitives, and the MRAPI memory primitives. If MCAPI or MRAPI implementations are not available, synchronization mechanisms provided by the operating system or a threading library must be used. In this case, the MTAPI implementation must define the consequences of using those mechanisms in the task context.

\subsection{Queues}

Queues are used for guaranteeing sequential order of execution of tasks. A common use case is packet processing in the communication domain: for every connection all packets must be processed sequentially, while the packets of different connections can be processed in parallel to each other.

Sequential execution is accomplished by using a queue for every connection and queuing all packets of one connection into the same queue. In some systems, queues are implemented in hardware, otherwise MTAPI implements software queues. MTAPI is designed for handling thousands of queues that are processed in parallel.

The procedure for setting up and using a queue is as follows:
\begin{enumerate}
  \item Create the action object (software-implemented actions only).
  \item Obtain a job reference.
  \item Create a queue object and attach the job to the queue (software-implemented queues only).
  \item Obtain a queue handle if the queue was created on a different node, or if the queue is hardware-implemented.
  \item Use the queue: enqueue the work using the queue.
\end{enumerate}
Another important purpose of queues is that different queues can express different scheduling attributes for the same job. For example, in contrast to order-preserving queues, non-order-preserving queues can be used for load-balancing purposes between different computation nodes. In this case, the queue must be associated with more than one action implementing the same task on different nodes (i.e., different processors or cores implementing different instruction set architectures). If a queue is configured this way, the order will not be preserved.

\subsection{Attributes}

Attributes are provided as a means to extend the API. Different implementations may define and support additional attributes beyond those predefined by the API. To promote portability and implementation flexibility, attributes are maintained in an opaque data object that may not be directly examined by the user. Each object (e.g., task, action, queue) has an attributes data object associated with it, and many attributes have a small set of predefined values that must be supported by MTAPI implementations. The user may initialize, get, and set these attributes. For default behavior, it is not necessary to call the initialize, get, and set attribute functions. However, to get non-default behavior, the typical four-step process is:
\begin{enumerate}
  \item Declare an attributes object of the \lstinline|mtapi_<object>_attributes_t| data type.
  \item \lstinline|mtapi_<object>attr_init()|: Returns an attributes object with all attributes set to their default values.
  \item \lstinline|mtapi_<object>attr_set()|: (Repeat for all attributes to be set). Assigns a value to the specified attribute of the specified attributes object.
  \item \lstinline|mtapi_<object>_create()|: Passes the attributes object modified in the previous step as a parameter when creating the object.
\end{enumerate}
At any time, the user can call \lstinline|mtapi_<object>_get_attribute()| to query the value of an attribute. After an object has been created, some objects allow to change attributes by calling \lstinline|mtapi_<object>_set_attribute()|.

\section{C Interface}
\label{sec:mtapi_c_interface}

The calculation of Fibonacci numbers is a simple example for a recursive algorithm that can easily be parallelized. Listing~\ref{lst:mtapi_fibonacci_sequential} shows a sequential version.

\begin{lstlisting}[frame=none,caption={Sequential program for computing Fibonacci numbers},label={lst:mtapi_fibonacci_sequential}]
int fib(int n) {
  int x,y;
  if (n < 2) {
    return n;
  } else {
    x = fib(n - 1);
    y = fib(n - 2);
    return x + y;
  }
}

int fibonacci(int n) {
  return fib(n);
}

void main(void) {
  int n = 6;
  int result = fibonacci(n);
  printf("fib(%i) = %i\n", n, result);
}
\end{lstlisting}

This algorithm can be parallelized by spawning a task for one of the recursive calls (\lstinline|fib(n - 1)|, for example). When doing this with MTAPI, an action function that represents \lstinline|fib(int n)| is needed. It has the following signature: 
%
\\\inputlisting{../examples/mtapi/mtapi_c_action_signature-snippet.h}
%
Within the action function, the arguments should be checked, since the user might supply a buffer that is too small:
%
\\\inputlisting{../examples/mtapi/mtapi_c_validate_arguments-snippet.h}
%
Here, \lstinline|mtapi_context_status_set()| is used to report errors. The error code will be returned by \lstinline|mtapi_task_wait()|. Also, care has to be taken when using the result buffer. The user might not want to use the result and supply a NULL pointer or accidentally a buffer that is too small:
%
\\\inputlisting{../examples/mtapi/mtapi_c_validate_result_buffer-snippet.h}
%
At this point, calculation of the result can commence. First, the terminating condition of the recursion is checked:
%
\\\inputlisting{../examples/mtapi/mtapi_terminating_condition-snippet.h}
%
After that, the first part of the computation is launched as a task using \lstinline|mtapi_task_start()| (the action function is registered with the job \lstinline|FIBONACCI_JOB| in the \lstinline|fibonacci()| function and the resulting handle is stored in the global variable \lstinline|mtapi_job_hndl_t fibonacciJob|):
%
\\\inputlisting{../examples/mtapi/mtapi_c_calc_task-snippet.h}
%
The second part can be executed directly:
%
\\\inputlisting{../examples/mtapi/mtapi_c_calc_direct-snippet.h}
%
Then, completion of the MTAPI task has to be waited for by calling \lstinline|mtapi_task_wait()|:
%
\\\inputlisting{../examples/mtapi/mtapi_c_wait_task-snippet.h}
%
Finally, the results can be added and written into the result buffer:
%
\\\inputlisting{../examples/mtapi/mtapi_write_back-snippet.h}
% 

The \lstinline|fibonacci()| function gets a bit more complicated now. The MTAPI runtime has to be initialized first by (optionally) initializing node attributes and then calling \lstinline|mtapi_initialize()|:
%
\\\inputlisting{../examples/mtapi/mtapi_c_initialize-snippet.h}
%
Then, the action function needs to be associated to a job. By calling \lstinline|mtapi_action_create()|, the action function is registered with the job \lstinline|FIBONACCI_JOB|. The job handle of this job is stored in the global variable \lstinline|mtapi_job_hndl_t fibonacciJob| so that it can be accessed by the action function later on:
%
\\\inputlisting{../examples/mtapi/mtapi_c_register_action-snippet.h}
%
Now that the action is registered with a job, the root task can be started with \lstinline|mtapi_task_start()|:
%
\\\inputlisting{../examples/mtapi/mtapi_c_start_task-snippet.h}
%
%The started task has to be waited for before the result can be returned.
After everything is done, the action is deleted (\lstinline|mtapi_action_delete()|) and the runtime is shut down (\lstinline|mtapi_finalize()|):
%
\\\inputlisting{../examples/mtapi/mtapi_c_finalize-snippet.h}
%

\section{C++ Interface}
\label{sec:mtapi_cpp_interface}

\embb provides C++ wrappers for the MTAPI C interface. The signature of the action function for the C++ interface is the same as in the C interface:
%
\\\inputlisting{../examples/mtapi/mtapi_c_action_signature-snippet.h}
%
Checking argument and result buffer sizes is the same as in the C example. Also, the terminating condition of the recursion still needs to be checked:
%
\\\inputlisting{../examples/mtapi/mtapi_terminating_condition-snippet.h}
%
After that, the first part of the computation is launched as an MTAPI task using \lstinline|embb::mtapi::Node::Start()| (the action function is registered with the job \lstinline|FIBONACCI_JOB| in the \lstinline|fibonacci()| function and the resulting handle is stored in the global variable \lstinline|embb::mtapi::Job fibonacciJob|):
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_calc_task-snippet.h}
%
The second part can be executed directly:
%
\\\inputlisting{../examples/mtapi/mtapi_c_calc_direct-snippet.h}
%
Then, completion of the MTAPI task has to be waited for using \lstinline|embb::mtapi::Task::Wait()|:
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_wait_task-snippet.h}
%
Finally, the two parts can be added and written into the result buffer:
%
\\\inputlisting{../examples/mtapi/mtapi_write_back-snippet.h}
% 
Note that there is no need to do error checking everywhere, since errors are reported as exceptions. In this example there is only a single try/catch block in the main function:
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_main-snippet.h}
%

The \lstinline|fibonacci()| function is about the same as in the C version. The MTAPI runtime needs to be initialized first:
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_initialize-snippet.h}
%
Then the node instance can to be fetched:
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_get_node-snippet.h}
%
After that, the action function needs to be associated to a job. By instancing an \lstinline|embb::mtap::Action| object, the action function is registered with the job \lstinline|FIBONACCI_JOB|. The job is stored in the global variable \lstinline|embb::mtapi::Job fibonacciJob| so that it can be accessed by the action function later on:
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_register_action-snippet.h}
%
Not that the action is registered and the job is initialized, the root task can be started:
%
\\\inputlisting{../examples/mtapi/mtapi_cpp_start_task-snippet.h}
%
Again, the started task has to be waited for (using \lstinline|embb::mtapi::Task::Wait()|) before the result can be returned.

The registered action will be unregistered when it goes out of scope.
The runtime needs to be shut down by calling:
\\\inputlisting{../examples/mtapi/mtapi_cpp_finalize-snippet.h}

\section{Plugins}

The \embb implementation of MTAPI provides an extension to allow for custom actions that are not executed by the scheduler for software actions as detailed in the previous sections.
Two plugins are delivered with \embb, one for supporting distributed systems through TCP/IP networking and the other to allow for transparently using OpenCL accelerators.

\subsection{Plugin API}

The plugin API consists of a single function named \lstinline|mtapi_ext_plugin_action_create()| contained in the mtapi\_ext.h header file. It is used to associate the plugin action with a specific job ID:
\begin{lstlisting}
mtapi_action_hndl_t mtapi_ext_plugin_action_create(
  MTAPI_IN mtapi_job_id_t job_id,
  MTAPI_IN mtapi_ext_plugin_task_start_function_t task_start_function,
  MTAPI_IN mtapi_ext_plugin_task_cancel_function_t task_cancel_function,
  MTAPI_IN mtapi_ext_plugin_action_finalize_function_t action_finalize_function,
  MTAPI_IN void* plugin_data,
  MTAPI_IN void* node_local_data,
  MTAPI_IN mtapi_size_t node_local_data_size,
  MTAPI_IN mtapi_action_attributes_t* attributes,
  MTAPI_OUT mtapi_status_t* status
);
\end{lstlisting}
The plugin action is implemented through 3 callbacks, task start, task cancel and action finalize.

\lstinline|task_start_function| is called when the user requests execution of the plugin action by calling \lstinline|mtapi_task_start()| or \lstinline|mtapi_task_enqueue()|. To those functions the fact that they operate on a plugin action is transparent, they only require the job handle of the job the action was registered with.

\lstinline|task_cancel_function| is called when the user requests cancelation of a tasks by calling \lstinline|mtapi_task_cancel()| or by calling \lstinline|mtapi_queue_disable()| on a non-retaining queue.

\lstinline|action_finalize_function| is called when the node is finalized and the action is deleted, or when the user explicitly deletes the action by calling \lstinline|mtapi_action_delete()|.

For illustration our example plugin will provide a no-op action. The task start callback in that case looks like this:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_task_start_cb-snippet.h}
%
The scheduling operation is responsible for bringing the task to execution, this might involve instructing some hardware to execute the task or pushing the task into a queue for execution by a separate worker thread. Here however, the task is executed directly:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_task_schedule-snippet.h}
%
Since the task gets executed right away, it cannot be canceled and the task cancel callback implementation is empty:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_task_cancel_cb-snippet.h}
%
The plugin action did not acquire any resources so the action finalize callback is empty as well:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_action_finalize_cb-snippet.h}
%

Now that the callbacks are in place, the action can be registered with a job after the node was initialized using \lstinline|mtapi_initialize()|:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_action_create-snippet.h}
%
The job handle can now be obtained the normal MTAPI way. The fact that there is a plugin working behind the scenes is transparent by now:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_get_job-snippet.h}
%
Using the job handle tasks can be started like normal MTAPI tasks:
%
\\\inputlisting{../examples/mtapi/mtapi_c_plugin_task_start-snippet.h}
%
This call will lead to the invocation of then \lstinline|plugin_task_start| callback function, where the plugin implementor is responsible for bringing the task to execution.

\subsection{Network}

The MTAPI network plugin provides a means to distribute tasks over a TCP/IP network. As an example the following vector addition action is used:
%
\\\inputlisting{../examples/mtapi/mtapi_network_c_action_function-snippet.h}
%
It adds two float vectors and a float from node local data and writes the result into the result float vector. In the example code the vectors will hold \lstinline|kElements| floats each.

To use the network plugin, its header file needs to be included first:
%
\\\inputlisting{../examples/mtapi/mtapi_network_c_header-snippet.h}
%
After initializing the node using \lstinline|mtapi_initialize()|, the plugin itself needs to be initialized:
%
\\\inputlisting{../examples/mtapi/mtapi_network_c_plugin_initialize-snippet.h}
%
This will set up a listening socket on the localhost interface (127.0.0.1) at port 12345. The socket will allow a maximum of 5 connections and have a maximum transfer buffer size of \lstinline|kElements * 4 * 3 + 32|. This buffer size needs to be big enough to fit at least the argument and result buffer sizes at once. The example uses 3 vectors of \lstinline|kElements| floats using \lstinline|kElements * sizeof(float) * 3| bytes.

Since the example connects to itself on localhost, the "remote" action needs to be registered with the \lstinline|NETWORK_REMOTE_JOB|:
%
\\\inputlisting{../examples/mtapi/mtapi_network_c_remote_action_create-snippet.h}
%
After that, the local network action is created, that maps \lstinline|NETWORK_LOCAL_JOB| to \lstinline|NETWORK_REMOTE_JOB| through the network:
%
\\\inputlisting{../examples/mtapi/mtapi_network_c_local_action_create-snippet.h}
%
Now, \lstinline|NETWORK_LOCAL_JOB| can be used to execute tasks by simply calling \lstinline|mtapi_task_start()|. Their parameters will be transmitted through a socket connection and are consumed by the network plugin worker thread. The thread will start a task using the \lstinline|NETWORK_REMOTE_JOB|. When this task is finished, the results will be collected and sent back through the network. Again the network plugin thread will receive the results, provide them to the \lstinline|NETWORK_LOCAL_JOB| task and mark that task as finished.

When all work is done, the plugin needs to be finalized. This will stop the plugin worker thread and close the sockets:
%
\\\inputlisting{../examples/mtapi/mtapi_network_c_plugin_finalize-snippet.h}
%
Then the node may be finalized by calling \lstinline|mtapi_finalize()|.

\subsection{OpenCL}

The MTAPI OpenCL plugin allows the user to incorporate the computational power of an OpenCL accelerator, if one is available in the system.

The vector addition example from the network plugin is used again. However, the action function is an OpenCL kernel now:
%
\\\inputlisting{../examples/mtapi/mtapi_opencl_c_kernel-snippet.h}
%
The OpenCL plugin header file needs to be included first:
%
\\\inputlisting{../examples/mtapi/mtapi_opencl_c_header-snippet.h}
%
As with the network plugin, the OpenCL plugin needs to be initialized after the node has been initialized:
%
\\\inputlisting{../examples/mtapi/mtapi_opencl_c_plugin_initialize-snippet.h}
%
Then the plugin action can be registered with the \lstinline|OPENCL_JOB|:
%
\\\inputlisting{../examples/mtapi/mtapi_opencl_c_action_create-snippet.h}
%
The kernel source and the name of the kernel to use (AddVector) need to be specified while creating the action. The kernel will be compiled using the OpenCL runtime and the provided node local data transferred to accelerator memory. The local work size is the number of threads that will share OpenCL local memory, in this case 32. The element size instructs the OpenCL plugin how many bytes a single element in the result buffer consumes, in this case 4, as a single result is a single float. The OpenCL plugin will launch \lstinline|result_buffer_size/element_size| OpenCL threads to calculate the result.

Now the \lstinline|OPENCL_JOB| can be used like a normal MTAPI job to start tasks.

After all work is done, the plugin needs to be finalized. This will free all memory on the accelerator and delete the corresponding OpenCL context:
%
\\\inputlisting{../examples/mtapi/mtapi_opencl_c_plugin_finalize-snippet.h}
%