Almost exactly 11 months ago, on
28 Dec 2014, #AirAsia #QZ8501 crashed, killing all 162 on board. Today, 01 Dec
2015, the Indonesian National Committee on Transportation Safety, called Komite
Nasional Keselamatan Transportasi (KNKT), published its final report on the
investigation into the accident. The original report can be downloaded here in
English:
So, was this a Human Error
accident? Did The Erring Human Strike once again to kill 162? The answer is,
purely and simply, yes. The human did fail and the aircraft was lost due to
that failure. However, as I have stated several times before in this blog, Human Error is the starting point, and not the end, of an investigation.
All human performance happens inside the boundaries of an organization's
policies and procedures. A flawed organization will, sooner or later, cause
even the best-intentioned human to fail. Quoting Justice Moshansky from the
Dryden accident report, “Mistakes made by
the designer, manufacturer, regulator and the executive gradually reduce the
safety margins available to the aircrew and leave them in the end with no
opportunity to correct their mistakes, especially in critical phases of the
flight … While the aircrew must accept responsibility for their actions and
inactions, it is amply clear that the Civil Aviation System failed them, by
allowing them to be placed in a situation where they did not have all the
support that they needed to complete the flight safely.”
This is yet another case that
proves this statement to be as true in 2015 as it was back in 1992! Without
going into the details of the 206-page report, which can be downloaded and read
by those interested in such details, let's examine how the mistakes made by
the designers, the manufacturers, the regulators and the executive eroded the
safety margins and caused the human to fail.
1. The accident aircraft, registered as PK-AXC, presented 23 occurrences related
to the Rudder Travel Limiter Unit (RTLU) in the one year preceding this accident
(Jan to Dec 2014), as follows:
- AUTO FLT RUD TRV LIM 1 - 11 occurrences (Failure of RTLU unit 1)
- AUTO FLT RUD TRV LIM 2 - 3 occurrences (Failure of RTLU unit 2)
- AUTO FLT RUD TRV LIM SYS - 9 occurrences (Failure of complete RTLU system)
2. The failure of the RTLU does not render the aircraft inoperable. Within the
protections provided by the Flight Control Unit (FCU), the aircraft can still
be operated safely and landed at its destination, even with both RTLUs
inoperative. However, the alarms associated with this failure are both audible
and visual, and present a considerable distraction to the pilots.
3. The A320 cockpit does have an EMER CANC (emergency cancel) button and a CLR (clear)
button available. The EMER CANC button cancels (stops) an aural warning for
as long as the failure condition continues and extinguishes the master warning lights.
Activation of this button will not affect the message display of a malfunction
other than the system that has been cancelled. The message, however, will remain
active until acted upon by the pilot. Activation of the CLR button will clear
the message without performing any other action. However, the Flight Crew
Operating Manual (FCOM) stated that EMER CANC should
only be used to suppress spurious master cautions, and it certainly
is not a prudent action to clear a failure alarm without addressing the failure
condition. There were no other approved procedures for cancelling
multiple and repetitive cautions in the A320 FCOM that
was being used by Air Asia and certified by the Indonesian regulator.
Therefore the crew had no option but to follow the full RTLU reset procedure
each time, just to avoid being irritated and distracted by the constant sound
of the alarm.
4. On the accident flight, the crew was presented with multiple and repetitive
failures of the RTLU, each within less than 3 seconds of resetting the previous
one. The same crew had also experienced this failure on multiple occasions
earlier on this same aircraft, the most recent being a mere 3 days before, on
25 Dec 2014, when the No. 2 RTLU was replaced.
5. The aircraft is designed with a degree of resilience built into the design. This
resilience comes from duplicating critical systems, and is one of the reasons why
two RTLUs exist. The logic of the design is that upon failure of unit no. 1, the
aircraft should still be operable with unit no. 2. Therefore, failure of any
one unit should generate a failure alarm for only that specific unit, and not
for the entire RTLU system. However, in this case, while there were 11
occurrences of No. 1 unit failure and 3 occurrences of No. 2 unit failure,
there were also 9 occurrences of entire system failure. But there was
never an occasion when both the RTLUs had failed at the same time! Further, the
failures of RTLU 2 did not recur after that unit was replaced on 25 Dec
2014. The table below lists the occasions on which crews were presented with
this failure on the same aircraft during just the last 10 days. Note that the failure of 25 Dec is not recorded here, for reasons that will become clear as you read on.
| No | Date | Flight Number | Message | Remarks |
|----|------|---------------|---------|---------|
| 1. | 19 Dec 2014 | 7684 | RTLU-1 and RTLU-2 off | 9 RTLU fault cycles |
|    |             | 7689 | RTLU-1 and RTLU-2 off | 13 RTLU fault cycles |
| 2. | 20 Dec 2014 | 7693 | RTLU-1 and RTLU-2 off | RTLU fault during descent |
| 3. | 21 Dec 2014 | 8501 | RTLU-1 and RTLU-2 off | 1 RTLU fault cycle, 1 partial RTLU fault cycle (YD1 reset) |
| 4. | 22 Dec 2014 | 7685 | RTLU-1 and RTLU-2 off | 1 RTLU fault cycle partial reset (YD1 reset) |
|    |             | 7684 | RTLU-1 and RTLU-2 off | Partial RTLU fault (RTLU1 failed for entire flight) |
|    |             | 7689 | RTLU-1 Off | RTLU1 fault during taxi at the end of the flight |
|    |             | 7681 | RTLU-1 Off | RTLU1 fault during approach, not reset until end of next flight |
| 5. | 23 Dec 2014 | 7680 | RTLU-1 Off | RTLU1 fault present for entire flight |
|    |             | 387 | RTLU-1 and RTLU-2 off | 1 RTLU fault cycle during climb and 1 RTLU1 fault and reset during cruise |
|    |             | 7620 | RTLU-1 Off | RTLU1 fault and reset during descent |
| 6. | 24 Dec 2014 | 323 | RTLU-1 Off | RTLU1 fault during climb not reset for entire flight |
| 7. | 27 Dec 2014 | 7683 | RTLU-1 Off | RTLU1 fault in descent RTLU2 fault and master caution during taxi in |
| 8. | 28 Dec 2014 |  | Accident flight |  |
6. On each of the above occasions, the system was fixed by resetting the circuit
breakers. Despite the fact that 23 occurrences were recorded, 11 of them
being of unit no. 1 failure, no further investigation or maintenance action was
considered necessary by the airline! The actual culprit, all along, was
unit no. 1, which, as we now know after the crash, had a solder failure in its electrical
contacts, as can be seen in the second picture below!
What is surprising here is that while the defect
was only in unit no. 1, it presented as an entire system failure, thereby
giving the pilots a false sense of urgency and alarm to address the matter,
assuming that both units had failed and the system redundancy had been lost.
Clearly a design defect in the A320 system.
7. The Airbus A320 is equipped with a Centralized Fault Display System (CFDS) that
provides information on current or historical problems arising during operation of the aircraft. The maintenance personnel can access the data
through the display system or through a printed Post Flight Report (PFR). Airbus also
provides the maintenance personnel with a Trouble Shooting Manual (TSM), which
contains information to troubleshoot the affected system stated in the PFR and
identifies the suspected defective part. The
Airbus TSM states that the PFR is the main source of information to use for
initiating troubleshooting and for deciding on the required maintenance action.
8. However, Air Asia management had a company policy of treating the pilot report,
or Maintenance Report 1 (MR1), as the main source for defect handling, with the
maintenance action performed being recorded in the Technical Log. Any maintenance that resulted from a PFR was not recorded in the Aircraft Technical Log (ATL).
9. ICAO Annex 6 states that one of the duties of the pilot in command is to
report all known or suspected defects in the aircraft after completion of the
flight. This requirement has not been implemented in the Indonesian CASRs. In fact, not all pilots reported the
defects that occurred during flight. Consequently, while there were 23 occurrences of
RTLU failure between Jan and Dec 2014, only 4 Pilot Reports were registered for
the event!
10. Since there was no requirement for
the Line Maintenance Personnel to record on the technical log the rectifications
based on the PFR, the RTLU problems, including their rectification by replacement of unit No. 2 on 25 Dec 2014, were not recorded on the technical
log. As a result, the line maintenance personnel were not aware of the recurring
nature of this problem, and they repeated similar maintenance actions that
had failed to solve the problem earlier. In addition, the problem itself was
not recorded as a repetitive problem. None
of the issues reported was identified as meeting the repetitive defect
definition, which would have triggered further maintenance actions had the
company followed the Airbus recommended maintenance procedure of using the PFR, and
not the technical log, as the main source for defect handling.
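To make the reporting gap concrete, here is a minimal Python sketch that tallies the occurrence counts quoted in point 1 and compares them with the 4 Pilot Reports from point 9. The threshold of 3 occurrences is purely a hypothetical value chosen for illustration; it is not the repetitive defect definition used by Airbus, Air Asia or the regulator, and the function names are my own.

```python
from collections import Counter

# Occurrence counts for PK-AXC, Jan to Dec 2014, as quoted in point 1 above.
PFR_OCCURRENCES = Counter({
    "AUTO FLT RUD TRV LIM 1": 11,
    "AUTO FLT RUD TRV LIM 2": 3,
    "AUTO FLT RUD TRV LIM SYS": 9,
})

# Pilot Reports registered for the same events, as quoted in point 9 above.
PILOT_REPORTS = 4

# HYPOTHETICAL threshold, for illustration only. It is NOT taken from the
# accident report or from any manual.
REPETITIVE_THRESHOLD = 3


def is_repetitive(occurrences, threshold=REPETITIVE_THRESHOLD):
    """Return True once a message has been seen at least `threshold` times."""
    return occurrences >= threshold


def flagged_messages(counts, threshold=REPETITIVE_THRESHOLD):
    """Return the messages that meet the (hypothetical) repetitive threshold."""
    return sorted(msg for msg, n in counts.items() if is_repetitive(n, threshold))


total_pfr = sum(PFR_OCCURRENCES.values())
unreported = total_pfr - PILOT_REPORTS
flagged_from_pfr = flagged_messages(PFR_OCCURRENCES)

print(f"Occurrences recorded by the aircraft: {total_pfr}")
print(f"Occurrences with no Pilot Report:     {unreported}")
print(f"Messages flagged from PFR data:       {flagged_from_pfr}")
```

Under that assumed threshold, a tally built on the PFR data would flag all three messages, while a process built on Pilot Reports alone only ever sees 4 of the 23 events.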
So, from the above summary, it is
clear that,
1. The
Designer, Airbus, has failed in meeting the essential redundancy criteria through a
design defect that permitted failure of one unit to trigger a complete system
failure alarm, thereby making the other (serviceable) unit also inoperative.
2. The
manufacturer, also Airbus, has failed to ensure the reliability and redundancy of its systems
by failing in its basic human engineering requirement of clearly establishing
the procedures for handling and cancelling various alarms.
3. The
regulator has failed in many aspects, from improper reporting procedures and
certification of manuals and operator procedures not commensurate with the level of
safety expected in public operations, to issues with pilot training requirements
and an inability to highlight the maintenance flaws of Air Asia in its surveillance
audits.
4. The
executive, Air Asia itself, is unfortunately the biggest culprit! The FCOM did
not include correct procedures and information. What was included had not been
trained properly and consequently was not being applied across the fleet in a standardized manner. Maintenance
procedures were not aligned with the manufacturer's recommendations. The Continuing Airworthiness
management did not meet the minimum standards expected in a public service.
5. Lastly,
the crew. The mistakes of the designer, the manufacturer, the regulator and the
executive had ensured that they had neither the complete information, nor the support
that was necessary for safe completion of the flight. They had not been trained
properly. The reference manuals that they were using had errors. The aircraft
had not been maintained properly. The regulator had failed to regulate the
system and the executive had failed to manage the system. Faced with these
odds, they made a fatal mistake…and paid for it with their lives, along with the lives of the 160 other passengers and crew.
Once again I, The Erring Human,
was able to strike. Once again it has been proved that human error is merely a
symptom. The real disease is Poor Organizational Management.
Stay Safe,
The Erring Human.