Tuesday, December 1, 2015

#AirAsia #QZ8501: The cost of an Organizational Failure



Almost exactly 11 months ago, on 28 Dec 2014, #AirAsia #QZ8501 crashed, killing all 162 on board. Today, 01 Dec 2015, the Indonesian National Committee on Transportation Safety, Komite Nasional Keselamatan Transportasi (KNKT), published its final report on the investigation into the accident. The original report can be downloaded here in English:

The report presents some very shocking information. The story has gone viral in the press, with even major news channels like CNN publishing headlines such as “Pilot response led to AirAsia crash into Java Sea”.

So, was this a Human Error accident? Did The Erring Human strike once again to kill 162? The answer is, purely and simply, yes. The human did fail and the aircraft was lost due to that failure. However, as I have stated several times before in this blog, Human Error is the starting point, and not the end, of an investigation. All human performance happens inside the boundaries of an organization's policies and procedures. A flawed organization will, sooner or later, cause even the best-intentioned human to fail. Quoting Justice Moshansky from the Dryden accident report, “Mistakes made by the designer, manufacturer, regulator and the executive gradually reduce the safety margins available to the aircrew and leave them in the end with no opportunity to correct their mistakes, especially in critical phases of the flight … While the aircrew must accept responsibility for their actions and inactions, it is amply clear that the Civil Aviation System failed them, by allowing them to be placed in a situation where they did not have all the support that they needed to complete the flight safely.”

This is yet another case that proves this statement to be as true in 2015 as it was in 1992! Without going into the details of the 206-page report, which can be downloaded and read by those interested in such details, let's examine how the mistakes made by the designers, the manufacturers, the regulators and the executive eroded the safety margins and caused the human to fail.

http://www.theerringhuman.blogspot.it/2014/12/ending-year-with-bang.html
1.  The accident aircraft, registered as PK-AXC, presented 23 occurrences related to the Rudder Travel Limiter Unit (RTLU) in the one year preceding this accident (Jan to Dec 2014), as follows:
    • AUTO FLT RUD TRV LIM 1 - 11 occurrences (Failure of RTLU unit 1)
    • AUTO FLT RUD TRV LIM 2 - 3 occurrences (Failure of RTLU unit 2)
    • AUTO FLT RUD TRV LIM SYS - 9 occurrences (Failure of complete RTLU system)


    2.  The failure of the RTLU does not render the aircraft inoperable. Within the protections provided by the Flight Control Unit (FCU), the aircraft can still be operated safely and landed at its destination, even with both RTLUs inoperative. However, the alarms associated with this failure are both audible and visual and present a considerable distraction to the pilots.

    3.  The A320 cockpit has an EMER CANC (emergency cancel) button and a CLR (clear) button. The EMER CANC button cancels (stops) an aural warning for as long as the failure condition continues and extinguishes the master warning lights. Activating this button does not affect the message display of malfunctions other than the one that has been cancelled. The message, however, remains active until acted upon by the pilot. The CLR button clears the message without performing any other action. However, the Flight Crew Operating Manual (FCOM) stated that EMER CANC should only be used to suppress spurious master cautions, and it certainly is not prudent to CLEAR a failure alarm without addressing the failure condition. There are no other approved procedures for cancelling multiple and repetitive cautions in the A320 FCOM that was being used by Air Asia and certified by the Indonesian regulator. Therefore the crew had no option but to follow the full RTLU reset procedure each time, just to avoid being irritated and distracted by the constant sound of the alarm.

    4.  The accident aircraft presented multiple and repetitive failures of the RTLU, each within less than 3 seconds of resetting the previous one. The same crew had also experienced this failure on multiple earlier occasions on this same aircraft, the most recent being a mere 3 days before, on 25 Dec 2014, when the No. 2 RTLU was replaced.

    5.  The aircraft is designed with a degree of resilience built in. This resilience comes from duplicating critical systems, and is one of the reasons why two RTLUs exist. The design logic is that upon failure of unit no. 1, the aircraft should still be operable with unit no. 2. Therefore, failure of any one unit should generate a failure alarm for only that specific unit, and not for the entire RTLU system. However, in this case, while there were 11 occurrences of No. 1 unit failure and 3 occurrences of No. 2 unit failure, there were also 9 occurrences of entire system failure. But there was never an occasion when both RTLUs had failed at the same time! Further, the failures of RTLU 2 did not recur after that unit was replaced on 25 Dec 2014. The table below presents the recorded sequence of RTLU failures presented to crews on this aircraft during just the last 10 days. Note that the failure of 25 Dec is not recorded here, for reasons that will become clear as you read on.


    No | Date        | Flight Number | Message               | Remarks
    ---|-------------|---------------|-----------------------|--------
    1. | 19 Dec 2014 | 7684          | RTLU-1 and RTLU-2 off | 9 RTLU fault cycles
       |             | 7689          | RTLU-1 and RTLU-2 off | 13 RTLU fault cycles
    2. | 20 Dec 2014 | 7693          | RTLU-1 and RTLU-2 off | RTLU fault during descent
    3. | 21 Dec 2014 | 8501          | RTLU-1 and RTLU-2 off | 1 RTLU fault cycle, 1 partial RTLU fault cycle (YD1 reset)
    4. | 22 Dec 2014 | 7685          | RTLU-1 and RTLU-2 off | 1 RTLU fault cycle partial reset (YD1 reset)
       |             | 7684          | RTLU-1 and RTLU-2 off | Partial RTLU fault (RTLU1 failed for entire flight)
       |             | 7689          | RTLU-1 Off            | RTLU1 fault during taxi at the end of the flight
       |             | 7681          | RTLU-1 Off            | RTLU1 fault during approach, not reset until end of next flight
    5. | 23 Dec 2014 | 7680          | RTLU-1 Off            | RTLU1 fault present for entire flight
       |             | 387           | RTLU-1 and RTLU-2 off | 1 RTLU fault cycle during climb and 1 RTLU1 fault and reset during cruise
       |             | 7620          | RTLU-1 Off            | RTLU1 fault and reset during descent
    6. | 24 Dec 2014 | 323           | RTLU-1 Off            | RTLU1 fault during climb not reset for entire flight
    7. | 27 Dec 2014 | 7683          | RTLU-1 Off            | RTLU1 fault in descent RTLU2 fault and master caution during taxi in
    8. | 28 Dec 2014 |               |                       | Accident flight


    6.  On each of the above occasions, the system was fixed by resetting the circuit breakers. Despite the fact that 23 occurrences were recorded, 11 of them unit no. 1 failures, no further investigation or maintenance action was considered necessary by the airline! The actual culprit, all along, was unit no. 1, which, as we now know after the crash, had cracked solder in its electrical contacts, as can be seen in the second picture below!



    The RTLU

    The cracked solder in electrical contact, which ultimately caused 162 people to die!

    What is surprising here is that while the defect was only in unit no. 1, it presented as an entire system failure, thereby giving the pilots a false sense of urgency and alarm to address the matter, on the assumption that both units had failed and the system redundancy had been lost. Clearly a design defect in the A320 system.

    7.  The Airbus A320 is equipped with a Centralized Fault Display System (CFDS) that provides information on current or historical problems arising during operation of the aircraft. Maintenance personnel can access the data through the display system or through a printed Post Flight Report (PFR). Airbus also provides maintenance personnel with a Trouble Shooting Manual (TSM), which contains information to troubleshoot the affected system stated in the PFR and identifies the suspected defective part. The Airbus TSM states that the PFR is the main source of information to use for initiating troubleshooting and for deciding on the required maintenance action.

    8.  However, Air Asia management had a company policy of treating the pilot report, or Maintenance Report 1 (MR1), as the main source for defect handling, with the maintenance action performed being recorded in the Aircraft Technical Log (ATL). Any maintenance that resulted from a PFR was not recorded in the ATL.

    9.  ICAO Annex 6 states that one of the duties of the pilot in command is to report all known or suspected defects in the aircraft after completion of the flight. This requirement has not been implemented in the Indonesian CASRs. In fact, not all pilots reported the defects that occurred during flight. Consequently, while there were 23 occurrences of RTLU failure between Jan and Dec 2014, only 4 Pilot Reports were registered for them!

    10.  Since there was no requirement for the Line Maintenance Personnel to record PFR-based rectifications in the technical log, the RTLU problems, including their rectification by replacement of unit No. 2 on 25 Dec 2014, were not recorded there. As a result, the line maintenance personnel were not aware of the recurring nature of this problem, and they repeated similar maintenance actions that had already failed to solve it. In addition, the problem itself was not recorded as a repetitive problem. None of the issues reported was identified as meeting the repetitive defect definition, which would have triggered further maintenance actions had the company followed the Airbus-recommended maintenance procedure of using the PFR, and not the technical log, as the main source for defect handling.
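To put a number on how much of this history stayed invisible, here is a rough sketch of the arithmetic using only the figures quoted above. The names `RECORDED_OCCURRENCES`, `PILOT_REPORTS` and `reporting_gap` are my own and purely illustrative, and the sketch assumes that the 23 recorded occurrences represent the full fault history and that the 4 pilot reports are all that reached the technical log.

```python
# Occurrences for PK-AXC, Jan to Dec 2014, as quoted in this post.
RECORDED_OCCURRENCES = {
    "AUTO FLT RUD TRV LIM 1": 11,
    "AUTO FLT RUD TRV LIM 2": 3,
    "AUTO FLT RUD TRV LIM SYS": 9,
}

# Pilot reports registered for these faults, as quoted in this post.
PILOT_REPORTS = 4


def reporting_gap(occurrences, pilot_reports):
    """Return (total recorded, never reported, never-reported share in %).

    Illustrative helper only: it assumes every pilot report corresponds
    to exactly one recorded occurrence.
    """
    total = sum(occurrences.values())
    unreported = total - pilot_reports
    share = 100.0 * unreported / total
    return total, unreported, share


total, unreported, share = reporting_gap(RECORDED_OCCURRENCES, PILOT_REPORTS)
print(f"Recorded occurrences: {total}")
print(f"Never reported by pilots: {unreported} ({share:.1f}%)")
```

On these assumptions, roughly four out of every five occurrences never entered the record that maintenance was actually working from.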

    So, from the above summary, it is clear that,

    1.  The Designer, Airbus, has failed in meeting the essential redundancy criteria through a design defect that permitted failure of one unit to trigger a complete system failure alarm, thereby making the other (serviceable) unit also inoperative.

    2.  The manufacturer, also Airbus, has failed to ensure the reliability and redundancy of its systems by failing in its basic human engineering requirement of clearly establishing the procedures for handling, and cancellation of various alarms.

    3.  The regulator has failed in many aspects: improper reporting procedures, certification of manuals and operator procedures not commensurate with the level of safety expected in public operations, issues with pilot training requirements, and an inability to highlight Air Asia's maintenance flaws in its surveillance audits.

    4.  The executive, Air Asia itself, is unfortunately the biggest culprit! The FCOM did not include correct procedures and information. What was included had not been trained properly and consequently was not being applied across the fleet in a standardized manner. Maintenance procedures were not aligned to the manufacturer's recommendations. The Continuing Airworthiness management did not meet the minimum standards expected of a public service.

    5.  Lastly, the crew. The mistakes of the designer, the manufacturer, the regulator and the executive had ensured that they had neither the complete information nor the support that was necessary for safe completion of the flight. They had not been trained properly. The reference manuals that they were using had errors. The aircraft had not been maintained properly. The regulator had failed to regulate the system and the executive had failed to manage the system. Faced with these odds, they made a fatal mistake…and paid for it with their lives, and with the lives of the 160 other passengers and crew.

    Once again I, The Erring Human, was able to strike. Once again it has been proved that human error is merely a symptom. The real disease is Poor Organizational Management.

    Stay Safe,

    The Erring Human.
