6.3 MPU
The Cortex®-A15 microprocessor unit (MPU) subsystem serves the applications processing role by running the high-level operating system (HLOS) and application code.
The MPU subsystem incorporates one Cortex-A15 MPU core (MPU_C0) with its own level 1 (L1) instruction and data caches, a level 2 (L2) cache (MPU_L2CACHE) shared between them, and various other shared peripherals. To aid software development, the processor core can be kept cache-coherent with the L2 cache.
The MPU subsystem provides a computing platform with high peak performance and low memory latency.
The MPU subsystem supports the following key features:
- Arm® Cortex-A15 MP Core (MPU_CLUSTER)
- One Cortex-A15 MPU core (revision r2p2) which has the following features:
- Superscalar, dynamic multi-issue technology
- Out-of-order (OoO) instruction dispatch and completion
- Dynamic branch prediction with branch target buffer (BTB), global history buffer (GHB), and 48-entry return stack
- Continuous fetch and decoding of three instructions per clock cycle
- Dispatch of up to four instructions and completion of eight instructions per clock cycle
- Provides optimal performance from binaries compiled for previous Arm processors
- Five execution units handle simple instructions, branch instructions, Neon and floating point instructions, multiply instructions, and load and store instructions.
- Simple instructions take two cycles from dispatch, while complex instructions take up to 11 cycles.
- Can issue two simple instructions in a cycle
- Can issue a load and a store instruction in the same cycle
- Integrated Neon processing engine that provides Arm Neon Advanced SIMD (single instruction, multiple data) support for accelerated media and signal processing computation
- Includes VFPv4-compatible hardware to support single- and double-precision add, subtract, divide, multiply and accumulate, and square root operations
- Extensive support to accelerate virtualization using a hypervisor
- 32-KiB L1 instruction (L1I) and 32-KiB L1 data (L1D) cache:
- 64-byte line size
- 2-way set associative
- Memory management unit (MMU):
- Two-level translation lookaside buffer (TLB) organization
- First level is a 32-entry, fully associative micro-TLB implemented for each of instruction fetch, load, and store.
- Second level is a unified, 4-way associative, 512-entry main TLB
- Supports hardware TLB table-walk for backward-compatible and new 64-bit entry page table formats
- New page table format can produce 40-bit physical addresses
- Two-stage translation where the first stage is HLOS-controlled and the second stage may be controlled by a hypervisor. The second stage always uses the new page table format.
- Integrated L2 cache (MPU_L2CACHE) and snoop control unit (SCU):
- 1 MiB of unified (instruction and data) cache organized as 16 ways of 1024 sets of 64-byte lines
- Redundant L1 data (cache) tags to perform snoop filtering (L1 instruction cache tags are not duplicated)
- Operates at Cortex-A15 MPU core clock rate
- Integrated L2 cache controller (MPU_L2CACHE_CTRL):
- Sixteen 64-byte line buffers that handle evictions, line fills and snoop transfers
- One 128-bit AMBA4 Coherent Bus (AXI4-ACE) port
- Auto-prefetch buffer that tracks up to 16 streams and detects forward and backward strides
- Generic interrupt controller (GIC, also referred to as MPU_INTC): An interrupt controller supplied by Arm. The single GIC in the MPU_CLUSTER routes interrupts to the MPU core. The GIC supports:
- Number of shared peripheral interrupts (SPI): 160
- Number of software generated interrupts (SGI): 16
- Number of CPU interfaces: 1
- Virtual CPU interface for virtualization support. This allows the majority of guest operating system (OS) interactions with the GIC to be handled in hardware, but with physical interrupts still requiring hypervisor intervention to assign them to the appropriate virtual machine.
- Integrated timer counter and one timer block
- Arm CoreSight debug and trace modules. For more information, see chapter On-Chip Debug Support of the Device TRM.
- MPU_AXI2OCP bridge (local interconnect):
- Connected to Memory Adapter (MPU_MA), which routes the non-EMIF address space transactions to MPU_AXI2OCP
- Single request multiple data (SRMD) protocol on L3_MAIN port
- Multiple targets:
- 64-bit port to the L3_MAIN interconnect. Interface frequency is 1/4 or 1/8 of core frequency
- MPU_ROM
- Internal MPU subsystem peripheral targets, including Memory Adapter LISA Section Manager (MA_LSM), wake-up generator (MPU_WUGEN), watchdog timer (MPU_WD_TIMER), and local PRCM module (MPU_PRCM) configuration
- Internal AXI target, CoreSight System Trace Module (CS_STM)
- Memory adapter (MPU_MA): Helps decrease the latency of accesses between the MPU_L2CACHE and the external memory interface (EMIF1) by providing a direct path between the MPU subsystem and EMIF1:
- Connected to 128-bit AMBA4 interface of MPU_CLUSTER
- Direct 128-bit interface to EMIF1
- Interface speed between MPU_CLUSTER and MPU_MA is at half-speed of the MPU core frequency
- Quarter-speed interface to EMIF
- Uses firewall logic to check access rights of incoming addresses
- Local PRCM (MPU_PRCM):
- Handles MPU_C0 power domain
- Supports SR3-APG (SmartReflex3 Automatic Power Gating) power management technology inside the MPU_CLUSTER
- MPU subsystem has five power domains
- Wake-up generator (MPU_WUGEN)
- Responsible for waking up the MPU core
- Standby controller: Handles the power transitions inside the MPU subsystem
- Realtime (master) counter (COUNTER_REALTIME): Produces the count used by the private timer peripheral in the MPU_CLUSTER
- Watchdog timer (MPU_WD_TIMER): Used to generate a chip-level watchdog reset request to global PRCM
- On-chip boot ROM (MPU_ROM): The MPU_ROM size is 48 KiB, and the address range is from 0x4003 8000 to 0x4004 3FFF. For more information about booting from this memory, see chapter Initialization of the Device TRM.
- Interfaces:
- 128-bit interface to EMIF1
- 64-bit master port to the L3_MAIN interconnect
- 32-bit slave port from the L4_CFG_EMU interconnect (debug subsystem) for configuration of the MPU subsystem debug modules
- 32-bit slave port from the L4_CFG interconnect for memory adapter firewall (MPU_MA_NTTP_FW) configuration
- 32-bit ATB output for transmitting debug and trace data
- 160 peripheral interrupt inputs
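The cache figures in the list above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and the tag/index/offset split assumes a plain power-of-two modulo mapping, which the feature list does not specify for the actual hardware.

```python
# Cache geometry figures quoted in the feature list above.
KIB = 1024
MIB = 1024 * KIB


def cache_size_bytes(ways: int, sets: int, line_bytes: int) -> int:
    """Total capacity of a set-associative cache."""
    return ways * sets * line_bytes


def sets_for(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets implied by a capacity, associativity, and line size."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


def split_address(addr: int, sets: int, line_bytes: int) -> tuple:
    """Split an address into (tag, set index, byte offset).

    ASSUMPTION: plain modulo indexing with power-of-two set count and
    line size; the real indexing scheme is not described in the source.
    """
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L2: 16 ways x 1024 sets x 64-byte lines.
L2_SIZE = cache_size_bytes(ways=16, sets=1024, line_bytes=64)

# L1I / L1D: 32 KiB, 2-way, 64-byte lines.
L1_SETS = sets_for(32 * KIB, ways=2, line_bytes=64)
```

Under these assumptions, the L2 organization multiplies out to exactly 1 MiB, and each 32-KiB, 2-way L1 cache with 64-byte lines has 256 sets.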
For more information, see section Arm Cortex-A15 Subsystem in chapter Processors and Accelerators of the Device TRM.
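As a quick consistency check on the MPU_ROM entry, the quoted address range can be compared against the quoted size. This is a minimal Python sketch; the constant and function names are hypothetical and are not taken from any TI header or driver.

```python
# MPU_ROM address range as quoted in this section (end address inclusive).
MPU_ROM_START = 0x4003_8000
MPU_ROM_END = 0x4004_3FFF


def rom_size_bytes(start: int = MPU_ROM_START, end: int = MPU_ROM_END) -> int:
    """Size in bytes of an inclusive address range."""
    return end - start + 1


def in_mpu_rom(addr: int) -> bool:
    """True if addr falls inside the MPU_ROM range."""
    return MPU_ROM_START <= addr <= MPU_ROM_END


if __name__ == "__main__":
    print(f"MPU_ROM size: {rom_size_bytes() // 1024} KiB")
```

The range spans 0xC000 bytes, which matches the stated 48 KiB.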