system_health_monitor.txt

Introduction
============

System Health Monitor (SHM) passively monitors the health of the
peripherals connected to the application processor. Software components
in the application processor that experience communication failure can
request SHM to perform a system-wide health check. If any failures are
detected during the health check, a subsystem restart is triggered for
the failed subsystem.
Hardware description
====================

SHM is solely a software component and interfaces with peripherals
through QMI communication. SHM does not control any hardware blocks;
it uses subsystem_restart to restart any peripheral.
Software description
====================

SHM hosts a QMI service in the kernel that is connected to the Health
Monitor Agents (HMAs) hosted in the peripherals. The HMAs are initialized
along with other critical services in the peripherals, so the connection
between SHM and the HMAs is established during the early stages of the
peripheral boot-up procedure. Software components within the application
processor, either user-space or kernel-space, identify a communication
failure with a peripheral by a lack of response and report that failure
to SHM. SHM then checks the health of the entire system through the HMAs
connected to it. If all the HMAs respond in time, the reported failure is
ignored. If any HMA does not respond in time, SHM restarts the concerned
peripheral. Figure 1 shows a high-level design diagram and Figure 2 shows
a flow diagram of the design.
Figure 1 - System Health Monitor Overview:

+------------------------------------+          +----------------------+
|        Application Processor       |          |     Peripheral 1     |
|                  +--------------+  |          |  +----------------+  |
|                  | Applications |  |          |  | Health Monitor |  |
|                  +------+-------+  |   +---->|  |    Agent 1     |  |
|   User-space            |          |   |     |  +----------------+  |
+-------------------------|----------+   |     +----------------------+
|   Kernel-space          v          |   | QMI            .
|  +---------+    +---------------+  |   |              .
|  | Kernel  |--->| System Health |<-----+              .
|  | Drivers |    |    Monitor    |  |   |
|  +---------+    +---------------+  |   | QMI +----------------------+
|                                    |   |     |     Peripheral N     |
|                                    |   |     |  +----------------+  |
|                                    |   +---->|  | Health Monitor |  |
|                                    |         |  |    Agent N     |  |
+------------------------------------+         |  +----------------+  |
                                               +----------------------+
Figure 2 - System Health Monitor Message Flow with 2 peripherals:

+-----------+        +-------+           +-------+           +-------+
|Application|        |  SHM  |           | HMA 1 |           | HMA 2 |
+-----+-----+        +---+---+           +---+---+           +---+---+
      |                  |                   |                   |
      |                  |                   |                   |
      |  check_system    |                   |                   |
      |----------------->|                   |                   |
      |  _health()       |  Report_          |                   |
      |                  |------------------>|                   |
      |                  |  health_req(1)    |                   |
      |                  |                   |                   |
      |                  |  Report_          |                   |
      |                  |-------------------------------------->|
      |                 +-+ health_req(2)    |                   |
      |                 |T|                  |                   |
      |                 |i|                  |                   |
      |                 |m|                  |                   |
      |                 |e|  Report_         |                   |
      |                 |o|<-----------------|                   |
      |                 |u|  health_resp(1)  |                   |
      |                 |t|                  |                   |
      |                 +-+                  |                   |
      |                  |  subsystem_       |                   |
      |                  |-------------------------------------->|
      |                  |  restart(2)       |                   |
      +                  +                   +                   +
HMAs can be extended to monitor the health of individual software
services executing in their concerned peripherals, and can restore
unresponsive services to a responsive state.

Design
======

The design goals of SHM are to:
* Restore an unresponsive peripheral to a responsive state.
* Restore unresponsive software services in a peripheral to a
  responsive state.
* Perform power-efficient monitoring of the system health.
An alternative design that was considered is sending keep-alive messages
in the IPC protocols at the transport layer. That approach requires
rolling out the protocol update to all the peripherals together and
hence involves considerable coupling, unless a suitable feature
negotiation algorithm is implemented. It also requires every transport
layer IPC protocol to be updated, duplicating the effort. Furthermore,
there are multiple link-layer protocols, and adding keep-alive at the
link layer does not solve issues at the client layer, which SHM does
solve. Restoring a peripheral or a remote software service from within
an IPC protocol has not been standard industry practice; industry
standard IPC protocols only terminate the connection upon a
communication failure and rely on other mechanisms to restore the
system to full operation.
Power Management
================

This driver ensures that health monitor messages are sent only upon
request, and hence does not wake up the application processor or any
peripheral unnecessarily.
SMP/multi-core
==============

This driver uses standard kernel mutexes and wait queues to achieve any
required synchronization.
Security
========

A Denial of Service (DoS) attack by an application that keeps requesting
health checks at a high rate can be throttled by SHM to minimize the
impact of the misbehaving application.
Interface
=========

Kernel-space APIs:
------------------

/**
 * kern_check_system_health() - Check the system health
 *
 * @return: 0 on success, standard Linux error codes on failure.
 *
 * This function is used by kernel drivers to initiate a system health
 * check. It in turn triggers SHM to send a QMI message to all the HMAs
 * connected to it.
 */
int kern_check_system_health(void);
User-space Interface:
---------------------

This driver provides a character device interface
(/dev/system_health_monitor) to user-space. A wrapper API library will
be provided to user-space applications in order to initiate the system
health check. The API in turn interfaces with the driver through this
device node.
/**
 * check_system_health() - Check the system health
 *
 * @return: 0 on success, -1 on failure.
 *
 * This function is used by user-space applications to initiate a system
 * health check. It in turn triggers SHM to send a QMI message to all
 * the HMAs connected to it.
 */
int check_system_health(void);
The above mentioned interface function works by opening the device node
provided by SHM, performing an ioctl operation, and then closing the
device node. The concerned ioctl command (CHECK_SYS_HEALTH_IOCTL) does
not take any argument. The function initiates the health check, then
handles the response and timeout asynchronously.
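As a minimal sketch, the wrapper described above could be implemented
roughly as follows. The ioctl request number used here is an assumption
for illustration only; the real value comes from the driver's UAPI
header.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Assumed ioctl number for illustration; the driver defines the real one. */
#define CHECK_SYS_HEALTH_IOCTL _IO('H', 0x01)

int check_system_health(void)
{
	int fd, rc;

	/* Open the SHM device node provided by the driver. */
	fd = open("/dev/system_health_monitor", O_RDWR);
	if (fd < 0)
		return -1;	/* driver not present or no access */

	/* Kick off the health check; SHM handles responses and
	 * timeouts asynchronously, so this returns immediately. */
	rc = ioctl(fd, CHECK_SYS_HEALTH_IOCTL);
	close(fd);
	return rc < 0 ? -1 : 0;
}
```

Note that the call only initiates the check; the caller gets no direct
notification of the outcome, since a failed peripheral is restarted by
SHM itself via subsystem_restart.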
Driver parameters
=================

The time duration for which SHM waits for a response from the HMAs can
be configured using a module parameter. This parameter should be used
only for debugging purposes. The default SHM health check timeout is
2 s, which can be overridden by the timeout provided by an HMA during
connection establishment.
Config options
==============

This driver is enabled through kernel config option
CONFIG_SYSTEM_HEALTH_MONITOR.
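For illustration, a Kconfig entry for this option might look like the
sketch below. The dependency symbol names are assumptions based on the
Dependencies section, not confirmed Kconfig symbols.

```
config SYSTEM_HEALTH_MONITOR
	bool "System Health Monitor for peripherals"
	depends on MSM_QMI_INTERFACE && MSM_SUBSYS_RESTART
	help
	  Enable the System Health Monitor (SHM). SHM hosts a QMI
	  service that Health Monitor Agents in the peripherals connect
	  to, and restarts a peripheral whose agent stops responding
	  to a health check in time.
```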
Dependencies
============

This driver depends on the following kernel modules for its complete
functionality:
* Kernel QMI interface
* Subsystem Restart support
User space utilities
====================

Any user-space or kernel-space module that experiences communication
failure with a peripheral will interface with this driver. Some of
these modules include:
* RIL
* Location Manager
* Data Services
Other
=====

SHM provides a debug interface to enumerate information regarding recent
health checks. The debug information includes, but is not limited to:
* the application name that triggered the health check.
* the time of the health check.
* the status of the health check.