PR Numbers: 1xxx=UCB, 2xxx=Caltech/JPL, 3xxx=UMd, 4xxx=GSFC/SEP, 5xxx=GSFC/Mag, 6xxx=CESR, 7xxx=Keil, 8xxx=ESTEC, 9xxx=MPAe

SubAssembly: SEP LVPS Top Board

Assembly: IMPACT SEP

**Component/Part Number:**

Serial Number: FM1

**Originator:** David Curtis

**Organization:** U.C. Berkeley

Phone: 510-642-5998

Email: dwc@ssl.berkeley.edu

**Failure Occurred During (Check one √):**

√ Functional test □ Qualification test □ S/C Integration □ Launch operations

**Environment when failure occurred:**

√ Ambient □ Vibration □ Acoustic □ Shock □ Thermal-Vacuum □ EMI/EMC □ Thermal □ Vacuum

**Problem Description**

On October 15, during IMPACT Suite I&T, the SEP system stopped generating data or responding to commands. This condition persisted for a few hours, during which some diagnostics were performed (see attached). It recovered on its own after a ~20 minute power-off. A similar problem happened a few days later, and again at the EMC facility. It has not happened since, in spite of ~2 weeks of near continuous operations.

**Analyses Performed to Determine Cause**

See attached. Diagnostics seem to indicate a problem with a common element, since all systems crash. An intermittent in the power converter is a likely suspect (it would have to be an open rather than a short, since power drops when it happens). It does not seem to be related to the flex strip problem we had with the power converter, since systems powered by different flex strips are simultaneously affected.

Reference IMPACT PFR 1032: A similar problem was also seen on the IDPU Flight units during Thermal Vac. A tantalum capacitor (C88, CWR06NH335KC) was found to be installed backwards on the top SEP LVPS board. Solid tantalum capacitors are sensitive to reverse bias voltage and will degrade over time.
In addition, measurements were taken on a spare SEP LVPS top board. The voltage on the capacitor was 3.8V when pulled up to 5V through 20k, which indicates increased leakage current (caused by the capacitor being reverse biased), almost enough to cause the supply to shut down. On the flight unit, the voltage was measured across the capacitor and a transient was captured, confirming that the current decreased, causing a supply shutdown. The ability of tantalum capacitors to heal themselves could explain why the SEP failures were intermittent.

The tantalum capacitors were installed incorrectly because the polarity is not indicated on the schematic. The original design used a ceramic capacitor. When the capacitor value was increased, the part was changed to a tantalum without the part symbol changing, resulting in an ambiguous input to the layout program and incorrect polarity indications on the silk screen. The part is a CWR06NH335KC 3.3uF 50V part, LDC P017196332.

**Corrective Action/ Resolution**

√ Rework □ Repair □ Use As Is □ Scrap

Replaced the (C88) CWR06NH335KC tantalum capacitors on the SEP LVPS top board. Retest (board level testing, CPT, vibration, T/V). Redlined drawing includes polarity identifiers. The circuit was not stressed due to this error.

Retest Results: Success, passed tests noted above.

Date Action Taken: 1/24/2005

Corrective Action Required/Performed on other Units: FM2, completed and tested same as FM1

**Closure Approvals**

| Subsystem Lead: | Date: |
|-----------------|-------|
| IMPACT Project Manager: | Date: |
| IMPACT QA: | Date: |
| NASA IMPACT Instrument Manager: | Date: |
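
The spare-board measurement quoted in the analysis (3.8V on a capacitor pulled up to 5V through 20k) can be turned into a rough leakage estimate. This is only a sketch: it assumes the 20k pull-up is the sole current path into the capacitor node and that the reading is steady-state, neither of which is stated in the report.

$$
I_{\text{leak}} \approx \frac{5\,\text{V} - 3.8\,\text{V}}{20\,\text{k}\Omega} = \frac{1.2\,\text{V}}{20\,\text{k}\Omega} = 60\,\mu\text{A}
$$

Under those assumptions roughly 60 µA flows through the pull-up, which is consistent with the report's reading of increased leakage in the reverse-biased part.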

#### Dave Curtis' Notes:

The IMPACT FM1 Suite was integrated and operating for the first time at Caltech on October 14. The suite had previously operated independently without trouble, though the SEP LVPS had recently been fixed to solve an intermittent problem.

On October 15, 2004, at about 10:40 AM, SEP started crashing periodically. Each time it seemed to reboot. To diagnose the problem, SEP was disconnected from the IDPU and connected to Caltech GSE (the interface was disabled but power was not disconnected during this transfer). The problem persisted. Power was briefly cycled, but this did not restore operation. See Rick Cook's notes below for more details on the diagnostics performed in this configuration. His tests seem to rule out a software problem. Power was removed for ~20 minutes to remove SIT so secondary voltages could be monitored on the SEP Central to SIT connector. On powering back up, the problem was no longer evident. Power was returned to its original configuration and left operating overnight.

On October 15 at 23:30 PDT, SEP stopped sending telemetry and reverted to a low power state (slightly above its normal power-on level). See the attached timeline extracted from the data logger that tracks currents and temperatures. When we came in the next morning, October 16, we performed the following diagnostics:

- Sent hardware reset command to SEP: no effect.
- Cycled the SEP/IDPU interface enable off/on and resent the SEP reset: no effect.
- Powered off SEP, disconnected SIT so we could monitor secondary voltages on its interface, and powered back on. Still no SEP data. We had the wrong pinout for the SIT connector at this time, and so could not determine the state of the voltages.
- Powered off for a longer interval. On restoring power, the instrument started normally.
- Warmed various parts of the instrument with a heat gun to attempt to invoke the problem, but it continued to operate normally.
- Operated fine through the night. Broke for transfer to the EMC facility on the 17th.



2004-10-15 - 2004-10-16 Data Logger

On October 18 we set up in the EMC facility. At about 4PM there were a series of spontaneous SEP reboots, similar to the first event. We powered off to install a break-out box in the SEP Central to SIT harness so we could monitor SIT telemetry. On restoring power, SEP Central seemed to operate normally, but there was some trouble getting the instruments functioning, which may have been operator error.

Later on October 18 a new version of the software was loaded. There have been no crashes since, in spite of ~2 weeks of nearly continuous operation. At one point we reloaded the old software, but it continued to operate normally.

#### **Rick Cook's Notes:**

On 10/15/04, following several SEP Central reboots, we attached my notebook PC to allow better debugging, and a log file of the notebook PC interaction with SEP Central was recorded. During the session two types of crashes occurred. One type was initiated by sending certain command strings through the IDPU to SEP Central. This type of crash was completely reproducible and caused SEP Central to reboot, but LET, HET and SIT would continue running ok. This crash type was traced to a SEP Central software bug which was fixed later in the day on 10/15/04 and has never recurred. The second type of crash also resulted in a SEP Central reboot, but seemed to occur at random times, and on at least several occasions (every time we checked) LET, HET and SIT also "crashed" -- as evidenced by lack of response to commands and lack of routine packet transfers.

During the session with my notebook PC connected on 10/15/04, SEP Central crashed repeatedly, about 10 times, staying up for only a few minutes at a time, in contrast to experience in preceding days of uninterrupted error-free operation extending over many hours at a time. Between reboots I was able to send commands and see normal response. At each reboot SEP Central printed out the correct checksums for EEPROM, code and command tables, and these checksum printouts appeared correctly in both the command response packets and on the notebook PC. After one of the reboots I was able to send the command string "FRESH ZFILL" and verify its proper execution. This command string disables all routine SEP Central processing and discards all but the basic Forth operating system. The operating system is put into a non-interrupt-driven mode (all interrupts are disabled) and execution degenerates to a single tight loop that polls the status of the UART that services the notebook PC. Execution leaves the tight loop only to pick up and buffer characters from the UART. That the SEP Central S/W was indeed properly executing the tight loop and servicing the UART correctly was verified by sending the command string "HERE." which properly printed the address of the end of the standard Forth operating system at "2580". The Forth system appeared entirely operational, and I was able to define a short word that could print the interrupt status register contents. The contents, as expected, verified that indeed all interrupts were off. The next reboot occurred while the SEP Central code was executing the tight loop that polls the UART. That is, the reboot occurred at a time when no typing or command execution was occurring. The code for this tight loop includes only four instructions and has been stable for several years. Hence, I have not been able to construct any scenario in which this particular SEP Central crash could be due to "software".

Several other anomalies occurred on a subsequent day. In one case SEP Central crashed without rebooting, and was unresponsive to a hardware reset command from the IDPU. Power was turned off for a few minutes and then back on and SEP Central still did not boot. Power was turned off for a longer time (20 minutes?) and then on again and SEP Central then booted (with valid checksums?). This sequence of anomalous behavior is also difficult to ascribe solely to a software problem. But it might be remotely possible that an initial software related crash somehow placed the system into a state from which it was unable to be recovered with either a hardware reset command or short power cycle.

SEP Central also crashed without rebooting during one of the first few days at EMC. Again, it would not respond to a hardware reset command. (Is my recollection correct here?) It did reboot upon power cycling (correct?). Later that same day I installed new S/W into LET and SEP (with some stack errors flagged by Stephan's program corrected). There have been no crashes since that time, suggesting the possibility that all previous crashes were somehow software related. (Although, as mentioned above I can't see how that is possible.)

Today I have performed some experiments with our EM setup, which includes an EM SEP Central, analog board, bias supply and LET. After booting SEP and LET, I disconnected a power supply connection, reconnected it after a few seconds, and observed whether SEP would crash or reboot and, if it rebooted, whether LET had been affected. I found:

1. SEP 3.3V only => SEP crash, no reboot
2. SEP 2.5V only => SEP reboot, LET ok
3. SEP 5.1V only => SEP reboot, LET ok
4. SEP+LET 3.3V => SEP reboot, LET crash
5. SEP+LET 2.5V => SEP reboot, LET crash
6. SEP+LET 5.1V => SEP reboot, LET ok
7. SEP+LET 2.5+3.3V => SEP reboot, LET crash

Only cases 4, 5, and 7 mimic the as yet unexplained SEP reboots. However, case 1 mimics the unexplained SEP crashes which were not associated with a reboot. Note that during these tests an IDPU simulator was connected, but the interface disabled. We should repeat the tests later when the IDPU simulator can be operated.
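The case selection above can be checked mechanically. The sketch below encodes the seven supply-interruption results in the order they are listed and filters them by two signatures: "SEP reboots and LET crashes" for the unexplained reboots, and "SEP crashes without rebooting" for the other anomaly. The names `DropoutCase`, `CASES` and the two predicate functions are illustrative and do not come from any project software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DropoutCase:
    number: int          # case number, in the order listed in the notes
    units: str           # unit(s) whose supply was interrupted
    rails: str           # supply rail(s) that were interrupted
    sep: str             # observed SEP result: "crash" or "reboot"
    let: Optional[str]   # observed LET result: "ok", "crash", or None if not recorded


CASES = [
    DropoutCase(1, "SEP", "3.3V", "crash", None),
    DropoutCase(2, "SEP", "2.5V", "reboot", "ok"),
    DropoutCase(3, "SEP", "5.1V", "reboot", "ok"),
    DropoutCase(4, "SEP+LET", "3.3V", "reboot", "crash"),
    DropoutCase(5, "SEP+LET", "2.5V", "reboot", "crash"),
    DropoutCase(6, "SEP+LET", "5.1V", "reboot", "ok"),
    DropoutCase(7, "SEP+LET", "2.5+3.3V", "reboot", "crash"),
]


def mimics_unexplained_reboot(case: DropoutCase) -> bool:
    """Signature of the unexplained events: SEP reboots and LET also crashes."""
    return case.sep == "reboot" and case.let == "crash"


def mimics_crash_without_reboot(case: DropoutCase) -> bool:
    """Signature of the other anomaly: SEP crashes and does not reboot."""
    return case.sep == "crash"


reboot_like = [c.number for c in CASES if mimics_unexplained_reboot(c)]
crash_like = [c.number for c in CASES if mimics_crash_without_reboot(c)]

print(reboot_like)  # [4, 5, 7]
print(crash_like)   # [1]
```

All three reboot-like cases involve the 3.3V and/or 2.5V rails with LET on the interrupted supply, which is the pattern the notes go on to point at.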

These tests suggest looking for a power supply problem that could bring down the SEP, LET, HET and SIT 3.3 and/or 2.5 volt supplies.

When it becomes convenient we should probably install break-out boxes that allow probing the various 2.5, 3.3 and 5.1 volt supply voltages, monitor them with digital scopes set to trigger on any dropouts and run for extended periods both with the new and old software versions.
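If the scope captures are exported as sampled voltages, the dropout check could also be run offline. The following is a minimal sketch under assumptions not found in the notes: the function name `find_dropouts`, the 5% tolerance, and the sample trace are all made up for illustration and are not measured data.

```python
def find_dropouts(samples, nominal, tolerance=0.05):
    """Return (start_index, end_index) spans where the voltage stays below
    nominal * (1 - tolerance). The 5% default is an assumed threshold."""
    threshold = nominal * (1.0 - tolerance)
    spans = []
    start = None
    for i, volts in enumerate(samples):
        if volts < threshold and start is None:
            start = i                      # dropout begins
        elif volts >= threshold and start is not None:
            spans.append((start, i - 1))   # dropout ended on the previous sample
            start = None
    if start is not None:                  # trace ended while still in a dropout
        spans.append((start, len(samples) - 1))
    return spans


# Synthetic 3.3V trace with one brief dropout (illustrative values only).
trace = [3.30, 3.31, 3.29, 2.10, 0.40, 0.35, 3.28, 3.30]
dropouts = find_dropouts(trace, nominal=3.3)
print(dropouts)  # [(3, 5)]
```

The same function could be applied to each of the 2.5, 3.3 and 5.1 volt rails by changing `nominal`; a real setup would choose the threshold from the regulators' specified tolerances.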