System containing a Graphical Processing Unit

We are now going to illustrate the modeling of a complex application using the MARTE Modelio module. This first example describes a complete system containing an application and an execution platform, where a vector multiplication application is executed partly on a host machine and partly on a graphical processing unit (GPU). The example is inspired by NVIDIA's next-generation Fermi architecture. Details on this architecture can be found at:

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

MARTE concepts and packages

This example uses the concepts of the following MARTE profiles: GCM, GRM, SRM, HRM, Time, Alloc, RSM, VSL, NFP, and Clock Handling packages.

UML diagrams

The Class, Sequence, State machine, and Deployment UML diagrams have been used in this example.

Detailed Description

The system globally consists of an application and an execution platform, as indicated in the GPU Allocation figure. This example shows a distributed embedded system modeled with the MARTE profile.

Figure 8 GPU Allocation

The application VectorMultiplicationExample consists of four main tasks, which are modeled as classes; the main application thus contains four instances of these tasks. The first two tasks, VectorGeneratorTask1 and VectorGeneratorTask2, generate vectors that are subsequently taken as input by the Kernel, which sends the result of the vector multiplication to the Result task.
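To make the data flow concrete, the four-task pipeline can be sketched in plain C++. This is only an illustrative sketch of the behavior the model describes; the function names and the generator's contents are assumptions, and only the element-wise multiplication and the vector dimension of 5000 come from the model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "shaped" dimension of the vectors, taken from the model.
constexpr std::size_t kDim = 5000;

// Stands in for VectorGeneratorTask1 / VectorGeneratorTask2; the
// actual generated values are not specified in the model, so a simple
// seeded ramp is used here as a placeholder.
std::vector<double> generatorTask(double seed) {
    std::vector<double> v(kDim);
    for (std::size_t i = 0; i < kDim; ++i) {
        v[i] = seed + static_cast<double>(i);
    }
    return v;
}

// Element-wise vector multiplication performed by the Kernel task; the
// output vector is what the Result task consumes.
std::vector<double> kernelTask(const std::vector<double>& a,
                               const std::vector<double>& b) {
    std::vector<double> out(kDim);
    for (std::size_t i = 0; i < kDim; ++i) {
        out[i] = a[i] * b[i];
    }
    return out;
}
```

In the modeled system this last step runs on the GPU, while the generators and the result consumer run on the terminal.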

All the instances of the tasks are stereotyped with the GRM stereotype resourceUsage to determine the resources consumed by a task: for example, how much memory a task allocates and what its execution time is. Here the task instances task1, task2 and result execute in 50 ms, while the kernel instance executes in 25 ms. Additionally, as these tasks are allocated onto the execution platform, they are stereotyped as allocated, with the kind attribute set to application; we thus also make use of the MARTE Alloc package.

It should be mentioned that all the classes representing the application tasks are themselves modeled, as illustrated in the GPU Elementary Components figure. All the ports of the application tasks are stereotyped as FlowPort, as specified in the MARTE GCM package, since they indicate a flow of data. Depending on the task, the port direction is set to either out or in. Additionally, all the ports of the tasks are stereotyped as shaped (based on the MARTE RSM package) to indicate that the task produces a vector with a dimension of 5000. The first two instances, task1 and task2, each produce a vector of dimension 5000, which are taken as input by the kernel; the kernel in turn produces a vector of dimension 5000, which is taken by the result instance. These properties can be verified in the GPU Elementary Components figure illustrated later in the document.

We now turn to the modeling of the hardware architecture using the MARTE HRM package. The execution platform, SystemArchitecture, is stereotyped as HwResource and consists of three main modules: a Terminal connected to a GraphicalProcessingUnit via a PCI-Bus. The three respective instances of these modules, i.e. terminal, pci-bus and gpu, are stereotyped using HRM and Alloc concepts. The terminal and gpu are stereotyped as HwComputingResource, HwComponent (covering both logical and physical characteristics of the hardware) and Allocated (with kind set to executionPlatform), while the pci-bus is stereotyped as HwBus and HwComponent.

Additionally, as in the application, the ports of all the components are stereotyped as FlowPort, with direction set to inout. Finally, a UserInput port with the stereotype HwEndPoint allows a user to send input to the terminal. The components themselves are also modeled and shown in the GPU Elementary Components figure.

Once the application and the architecture are modeled, we can move on to the allocation phase, using the MARTE allocation mechanisms. Here, only the high-level allocation has been specified, but several views are possible at different granularity levels. In this case, we allocate task1, task2 and result to the terminal, with the allocation's nature attribute set to timeScheduling, while the kernel task is allocated to the gpu with nature set to spatialDistribution.

Figure 9 GPU Architecture

We now take a look at the structure of the GraphicalProcessingUnit, as shown in the GPU Architecture figure. As indicated in the NVIDIA Fermi documentation, a GPU executes a kernel in parallel across a set of parallel threads, which are organized in thread blocks; these thread blocks are in turn organized in grids. A grid thus consists of several thread blocks, while a thread block itself consists of several threads.
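The grid / thread-block / thread hierarchy described above determines which vector element each thread handles. A minimal sketch of this mapping, using the usual CUDA-style index formula (the function names here are illustrative, not part of the model):

```cpp
#include <cassert>
#include <cstddef>

// Global index of a thread within a kernel launch: the thread's block
// index times the number of threads per block, plus the thread's index
// inside its block. This is the element of the vector it processes.
std::size_t globalThreadIndex(std::size_t blockIdx,
                              std::size_t threadIdx,
                              std::size_t blockDim) {
    return blockIdx * blockDim + threadIdx;
}

// Number of thread blocks needed to cover n elements when each block
// holds blockDim threads (rounded up so no element is left out).
std::size_t gridDimFor(std::size_t n, std::size_t blockDim) {
    return (n + blockDim - 1) / blockDim;
}
```

For the 5000-element vectors of this example, a block size of 256 threads would need 20 blocks, with the last block partially idle.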

The architecture of the GPU is created accordingly. At the highest level, a GridWrapper is connected to a Global Memory by means of a bus, Bus-Level 0. The GridWrapper itself contains several grids, as indicated by the grid instance with a shaped stereotype of value 16.

This instance has one output port with a shaped value of 1 or {}, which connects to Port_1 of the GridWrapper (shaped value of 16) by means of a connector stereotyped tiler, found in the MARTE RSM package. This connector connects the ports of the various repetitions of the grid to the port of the GridWrapper.

A Grid itself contains an SMWrapper, a Bus-Level 1 and a SharedMemory module. The SMWrapper contains 32 repetitions of the StreamingMultiprocessor.

When we zoom into the StreamingMultiprocessor, we see that at the lowest hierarchical level, it contains a Core, a Bus-Level 2 and a PrivateMemory.

All the hardware components are stereotyped accordingly via the HRM package of MARTE. For example, the memories are stereotyped as HwRAM, while the Core module is stereotyped as HwProcessor.

It should be evident that the elementary components of the application/architecture need to be modeled before continuing with the modeling of the aforementioned aspects. While a separate class diagram can be created just for the elementary components, in this example they are modeled within the GPU Architecture figure; they are illustrated separately in the GPU Elementary Components figure only for the sake of clarity.

Figure 10 GPU Elementary Components

We now look at the internal structure of the Terminal module in the Terminal internal structure figure. Using the UML object diagram, we are able to model the modules of the Terminal, which are as follows:

· HwProcessor Controller (with an NFPConstraint from the MARTE NFP package stating that if the processor is utilized at 100 percent, its internal clock frequency is set to 100 MHz, otherwise to 50 MHz),

· HwRAM Memory,

· HwSupport, HwPowerSupply Battery (the two stereotypes are used to show a mixed logical/physical view of the battery, as they have different attributes; while the same could be done for all the other modules, this is a modeling choice we have not implemented yet),

· HwBus Bus,

· HwClock Clock (A global system clock with ClockConstraints),

· HwI_O UARTController,

· HwI_O Display.
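The NFPConstraint attached to the Controller above is simple enough to express as a tiny function. This is a hedged sketch of the stated rule only; the function name and the integer-percent interface are assumptions, not part of the MARTE model:

```cpp
#include <cassert>

// NFPConstraint on the Controller, from the Terminal model: at 100%
// utilization the internal clock runs at 100 MHz, otherwise at 50 MHz.
int controllerClockMHz(int utilizationPercent) {
    return (utilizationPercent >= 100) ? 100 : 50;
}
```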

Figure 11 Terminal internal structure

While it is also possible to add GRM stereotypes here, the HRM stereotypes are sufficient, as they contain all the attributes of the GRM package. Additionally, the application shown earlier could be illustrated here as well, to show that the three tasks of the application are actually allocated to the Controller in the Terminal; however, this step has not been carried out.

This diagram also contains some platform-based tasks (or system drivers) which are allocated to the hardware resources. While several tasks are possible, we have only modeled two for the sake of visibility. The TerminalApplication package here contains two tasks: a UART Task stereotyped SwSchedulableResource from the MARTE SRM package, and a Scheduler stereotyped Scheduler from the MARTE GRM package. These tasks are allocated to their respective hardware resources using the allocation concepts.

We now take a look at the Controller State Machine to determine the behavior of the Controller module of the Terminal.

Figure 12 Controller State Machine

As shown in the figure Controller State Machine, the state machine SuperVisingBehaviour-Controller determines the behavior of the Controller, which can be in the Executing, Idle or Sleep state. The TimedProcessing stereotype should be applied to the state machine so that it can refer to the ideal clock, allowing it to express the associated timing concepts. The initial state is Executing, which contains a DO activity execute, itself stereotyped as TimedProcessing to determine the time taken for this action/behavior. This is not shown in the figure, but we can specify its value as 15 ms. When the controller does not receive data for computation for 20 ms, it goes into the Idle state, where it remains for about 5 ms (not carried out due to Modelio limitations). When computation is required, it returns to the Executing state. If, however, there is no activity for about 35 ms, the system goes into Sleep mode, where it remains for about 20 ms; if there is still no activity, the controller is deactivated. Similarly, in the Idle state, if there is no activity or computation event, the system goes into Sleep mode.
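The transitions just described can be sketched as a small transition function. The 20 ms and 35 ms thresholds come from the text; the event encoding (a computation-request flag plus the elapsed inactivity time) and the wake-up transition from Sleep are assumptions of this sketch:

```cpp
#include <cassert>

enum class State { Executing, Idle, Sleep, Deactivated };

// One step of the SuperVisingBehaviour-Controller state machine.
// idleMillis is the time elapsed without incoming data/activity.
State next(State s, bool computationRequested, int idleMillis) {
    switch (s) {
        case State::Executing:
            if (idleMillis >= 35) return State::Sleep;  // long inactivity
            if (idleMillis >= 20) return State::Idle;   // no data for 20 ms
            return State::Executing;
        case State::Idle:
            if (computationRequested) return State::Executing;
            return State::Sleep;            // still no activity or event
        case State::Sleep:
            if (computationRequested) return State::Executing;  // assumed wake-up
            return State::Deactivated;      // no activity after ~20 ms asleep
        case State::Deactivated:
            return State::Deactivated;      // terminal state
    }
    return s;
}
```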

Finally the overall scenario is described using the UML sequence diagram as shown in Computation Sequence.

Here, a User initially sends data to the Terminal to start the execution. Once the data is received by the Terminal, it carries out two activities in parallel: it sends the data to its display to show the Starting Execution message and, at the same time, executes task1 and task2. The result of the first two tasks is sent to the GraphicalProcessingUnit via the PCI-Bus. When the GraphicalProcessingUnit receives the data, it carries out a Kernel Computation, which is referenced by another sequence diagram. The result of the vector multiplication is then sent back to the Terminal via the PCI-Bus. Once the Terminal receives the result, it executes the result task and displays the result on the screen.

We should be able to apply the TimedConstraints stereotypes on the sequence diagram to determine the timing constraints. For example, we can specify a constraint that the whole execution, from the user's input to the final display on screen, should not exceed 100 ms. Similarly, another constraint can be related to the GraphicalProcessingUnit: the whole execution on the GPU, from receiving the input data to sending the calculated result, should not take more than 15 ms.
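The two constraints can be restated as a simple predicate over measured durations. The 100 ms and 15 ms bounds come from the text above; the struct and field names are illustrative:

```cpp
#include <cassert>

// Measured durations for one run of the Computation Sequence scenario.
struct ScenarioTiming {
    int endToEndMs;  // user input -> result displayed on screen
    int gpuMs;       // GPU receives data -> result sent back
};

// True if the run satisfies both TimedConstraints from the model:
// end-to-end within 100 ms, GPU segment within 15 ms.
bool satisfiesConstraints(const ScenarioTiming& t) {
    return t.endToEndMs <= 100 && t.gpuMs <= 15;
}
```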

Figure 13 Computation Sequence
