Saturday, April 15, 2017

Painless Remote Access to monit...NOT!


Installing monit Was Easy.  Configuring & Setting Up Remote Access To It Wasn't...


Help!  Help!  My Server is Down...Or Is It?

For the past while, an "interesting" problem has challenged one of my servers.  Every once in a while, at random intervals, the machine appeared to crash.  There was no rhyme or reason to the crash - it just stopped being available.  Investigating, I discovered that it hadn't crashed; instead, the machine had become totally bogged down and was operating at a near-halt, with sky-high CPU utilization and fully allocated memory.  You could log in, but the process was agonizingly slow (5+ minutes).

Seeing as this machine was using the LAMP stack on a cloud-based Virtual Private Server (VPS), there were many potential fault-points. I looked into the situation carefully, but was never able to really pin down the reason for the failure.  So I just scheduled a restart of the suspected systems with crontab -e and a few scripts and moved on.  No longer.
 

A Need For Better Configuration Monitoring & Management

The potential reasons for this problem were myriad.  Because of the way this server was provisioned, there were many candidate problem areas throughout the solution stack involved.  As a frame of reference, there may have been problems with:
 
1) The Hardware Layer
2) The Virtualization Layer
3) The Operating System Layer
4) The Application Layer

Seeking a richer troubleshooting model than the above, I eventually decided to use the OSI model to structure my troubleshooting.

Troubleshooting Via The OSI Reference Model

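The OSI model has been around for a long, long time.  It describes networked computer systems in terms of the following seven layers:

1) The Physical Layer
2) The Data Link Layer
3) The Network Layer
4) The Transport Layer
5) The Session Layer
6) The Presentation Layer
7) The Application Layer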

(https://en.wikipedia.org/wiki/OSI_model)

To be able to eliminate as many potential sources of the "freezing" as possible, I needed a tool to help me gather information on those layers; as much information as possible.  It would also be nice if that tool could tell me if my server was heading towards trouble, and maybe even help to manage the trouble when it arrived.  For those needs, monit seemed to fit the bill.

monit:  A Server Monitoring & Troubleshooting System


According to its own man page, monit is:

       monit is a utility for managing and monitoring processes, files,
       directories and devices on a Unix system. Monit conducts automatic
       maintenance and repair and can execute meaningful causal actions in
       error situations. E.g. monit can start a process if it does not run,
       restart a process if it does not respond and stop a process if it uses
       too much resources. You may use monit to monitor files, directories and
       devices for changes, such as timestamps changes, checksum changes or
       size changes.


Installing monit

Installing monit was pretty easy on CentOS 7.  I did it with the following command:

[root@vps2]# yum install monit

Configuring monit

The monit configuration file is located at /etc/monit.conf.  Working with the file was "OK", but I think the developers could explain the innards of the file a little better - a little bit of documentation (with examples) goes a long way in Linux, as the Linux Documentation Project (LDP), of which I was a charter member, proved.  Many of the supplied values in monit.conf (checksum, file paths, logging strategy) were wrong for CentOS 7, so I had to change them - turning what could have been a 20-minute exercise into a half-day exercise.

Anyway, after some fiddling around, I discovered that the following configuration worked for my use case, which was to monitor the overall machine, as well as the web server status, from a remote location.  Settings you will need to customize for your implementation appear in <angle brackets>:

###############################################################################
## Monit control file
###############################################################################
##
## Comments begin with a '#' and extend through the end of the line. Keywords
## are case insensitive. All path's MUST BE FULLY QUALIFIED, starting with '/'.
##
## Below you will find examples of some frequently used statements. For
## information about the control file and a complete list of statements and
## options, please have a look in the Monit manual.
##
##
###############################################################################
## Global section
###############################################################################
##
## Start Monit in the background (run as a daemon):
#
set daemon  60              # check services at 1-minute intervals
#   with start delay 240    # optional: delay the first check by 4-minutes (by
#                           # default Monit check immediately after Monit start)
#
#
## Set syslog logging with the 'daemon' facility. If the FACILITY option is
## omitted, Monit will use 'user' facility by default. If you want to log to
## a standalone log file instead, specify the full path to the log file
#
# set logfile syslog facility log_daemon
set logfile /var/log/monit.log
#
#
## Set the location of the Monit id file which stores the unique id for the
## Monit instance. The id is generated and stored on first Monit start. By
## default the file is placed in $HOME/.monit.id.
#
set idfile /var/monit/id
#
## Set the location of the Monit state file which saves monitoring states
## on each cycle. By default the file is placed in $HOME/.monit.state. If
## the state file is stored on a persistent filesystem, Monit will recover
## the monitoring state across reboots. If it is on temporary filesystem, the
## state will be lost on reboot which may be convenient in some situations.
#
set statefile /var/monit/state
#
## Set the list of mail servers for alert delivery. Multiple servers may be
## specified using a comma separator. If the first mail server fails, Monit
# will use the second mail server in the list and so on. By default Monit uses
# port 25 - it is possible to override this with the PORT option.
#
set mailserver <mailserver.domain>,          # primary mailserver
#                backup.bar.baz port 10025,  # backup mailserver on port 10025
                 localhost                   # fallback relay
#
#
## By default Monit will drop alert events if no mail servers are available.
## If you want to keep the alerts for later delivery retry, you can use the
## EVENTQUEUE statement. The base directory where undelivered alerts will be
## stored is specified by the BASEDIR option. You can limit the maximal queue
## size using the SLOTS option (if omitted, the queue is limited by space
## available in the back end filesystem).
#
# set eventqueue
#     basedir /var/monit  # set the base directory where events will be stored
#     slots 100           # optionally limit the queue size
#
#
## Send status and events to M/Monit (for more information about M/Monit
## see http://mmonit.com/). By default Monit registers credentials with
## M/Monit so M/Monit can smoothly communicate back to Monit and you don't
## have to register Monit credentials manually in M/Monit. It is possible to
## disable credential registration using the commented out option below.
## Though, if safety is a concern we recommend instead using https when
## communicating with M/Monit and send credentials encrypted.
#
# set mmonit http://monit:monit@192.168.1.10:8080/collector
#     # and register without credentials     # Don't register credentials
#
#
## Monit by default uses the following format for alerts if the mail-format
## statement is missing:
## --8<--
## set mail-format {
##      from: monit@$HOST
##   subject: monit alert --  $EVENT $SERVICE
##   message: $EVENT Service $SERVICE
##                 Date:        $DATE
##                 Action:      $ACTION
##                 Host:        $HOST
##                 Description: $DESCRIPTION
##
##            Your faithful employee,
##            Monit
## }
## --8<--
##
## You can override this message format or parts of it, such as subject
## or sender using the MAIL-FORMAT statement. Macros such as $DATE, etc.
## are expanded at runtime. For example, to override the sender, use:
#
# set mail-format { from: monit@foo.bar }
#
#
## You can set alert recipients who will receive alerts if/when a
## service defined in this file has errors. Alerts may be restricted on
## events by using a filter as in the second example below.
#
set alert <you@mailserver.domain>                # receive all alerts
# set alert manager@foo.bar only on { timeout }  # receive just service-
#                                                # timeout alert
#
#
## Monit has an embedded web server which can be used to view status of
## services monitored and manage services from a web interface. See the
## Monit Wiki if you want to enable SSL for the web server.
#
set httpd port 2812                  # bind internal webserver to specified port
# use address <URL>                  # bind webserver to specific IP or URL
                                     # (commented binds to all interfaces)
 allow 0.0.0.0/0.0.0.0               # allow any machine to connect to the server
 allow <user>:<pass>                 # require specified user/pass
#    allow @monit                    # allow users of group 'monit' to connect (rw)
#    allow @users readonly           # allow users of group 'users' to connect readonly

###############################################################################
## Services
###############################################################################
##
## Check general system resources such as load average, cpu and memory
## usage. Each test specifies a resource, conditions and the action to be
## performed should a test fail.
#
check system <IP or URL>
     if loadavg (1min) > 4 then alert
     if loadavg (5min) > 2 then alert
     if memory usage > 75% then alert
     if swap usage > 25% then alert
     if cpu usage (user) > 70% then alert
     if cpu usage (system) > 30% then alert
     if cpu usage (wait) > 20% then alert
#
#
## Check if a file exists, checksum, permissions, uid and gid. In addition
## to alert recipients in the global section, customized alert can be sent to
## additional recipients by specifying a local alert handler. The service may
## be grouped using the GROUP option. More than one group can be specified by
## repeating the 'group name' statement.
#
  check file apache_bin with path /usr/sbin/httpd
#    if failed checksum and expect the sum <checksum> then unmonitor
     if failed permission 755 then unmonitor
     if failed uid root then unmonitor
     if failed gid root then unmonitor
     alert graham.leach@yougrow.net on { checksum, permission, uid, gid } with the mail-format { subject: Alarm! }
     group server
#
#
## Check that a process is running, in this case Apache, and that it respond
## to HTTP and HTTPS requests. Check its resource usage such as cpu and memory,
## and number of children. If the process is not running, Monit will restart
## it by default. In case the service is restarted very often and the
## problem remains, it is possible to disable monitoring using the TIMEOUT
## statement. This service depends on another service (apache_bin) which
## is defined above.
#
  check process apache with pidfile /var/run/httpd/httpd.pid
    start program = "/etc/init.d/httpd start" with timeout 60 seconds
    stop program  = "/etc/init.d/httpd stop"
    if cpu > 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if totalmem > 1800.0 MB for 5 cycles then restart
    if children > 250 then restart
    if loadavg(5min) greater than 10 for 8 cycles then restart
#    if failed host www.tildeslash.com port 80 protocol http and request "/somefile.html" then restart
#    if failed port 443 type tcpssl protocol http with timeout 15 seconds then restart
#    if 3 restarts within 5 cycles then timeout
    depends on apache_bin
    group server
#
#
## Check filesystem permissions, uid, gid, space and inode usage. Other services,
## such as databases, may depend on this resource and an automatically graceful
## stop may be cascaded to them before the filesystem will become full and data
## lost.
#
#  check filesystem datafs with path /dev/sdb1
#    start program  = "/bin/mount /data"
#    stop program  = "/bin/umount /data"
#    if failed permission 660 then unmonitor
#    if failed uid root then unmonitor
#    if failed gid disk then unmonitor
#    if space usage > 80% for 5 times within 15 cycles then alert
#    if space usage > 99% then stop
#    if inode usage > 30000 then alert
#    if inode usage > 99% then stop
#    group server
#
#
## Check a file's timestamp. In this example, we test if a file is older
## than 15 minutes and assume something is wrong if its not updated. Also,
## if the file size exceed a given limit, execute a script
#
#  check file database with path /data/mydatabase.db
#    if failed permission 700 then alert
#    if failed uid data then alert
#    if failed gid data then alert
#    if timestamp > 15 minutes then alert
#    if size > 100 MB then exec "/my/cleanup/script" as uid dba and gid dba
#
#
## Check directory permission, uid and gid.  An event is triggered if the
## directory does not belong to the user with uid 0 and gid 0.  In addition,
## the permissions have to match the octal description of 755 (see chmod(1)).
#
#  check directory bin with path /bin
#    if failed permission 755 then unmonitor
#    if failed uid 0 then unmonitor
#    if failed gid 0 then unmonitor
#
#
## Check a remote host availability by issuing a ping test and check the
## content of a response from a web server. Up to three pings are sent and
## connection to a port and an application level network check is performed.
#
#  check host myserver with address 192.168.1.1
#    if failed icmp type echo count 3 with timeout 3 seconds then alert
#    if failed port 3306 protocol mysql with timeout 15 seconds then alert
#    if failed url http://user:password@www.foo.bar:8080/?querystring
#       and content == 'action="j_security_check"'
#       then alert
#
#
###############################################################################
## Includes
###############################################################################
##
## It is possible to include additional configuration parts from other files or
## directories.
#
include /etc/monit.d/* 
# 
 

Testing monit Locally

Seeing as monit runs its own web server, testing from the local machine was pretty straightforward.  I simply fired up links and pointed it at the loopback URL:

[root@vps2]# links 127.0.0.1:2812

And here's what I saw:


So far so good!


Testing monit Remotely

But the problems started when I tried to connect to monit remotely. Here's what I saw when I tried to access it from my PC:




Help! Help!  I Am Unable To Access monit Remotely

Perplexed, I decided to see if monit was listening on the right interface for public access, which would be the one carrying its public IP address, not the loopback address.  To check if it was indeed listening on the right interface (ethN), I used the netstat command with a few parameters thrown in.  Here's what I saw:


[root@vps2]# netstat -tldpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      1351/sshd
tcp        0      0 0.0.0.0:25                  0.0.0.0:*                   LISTEN      1586/master
tcp        0      0 0.0.0.0:2812                0.0.0.0:*                   LISTEN      24250/monit
tcp        0      0 :::80                       :::*                        LISTEN      16000/httpd
tcp        0      0 :::443                      :::*                        LISTEN      16000/httpd 


A Quick Rundown Of Popular Linux Ports/Servers:
Here's a quick rundown of what ports/servers were open on this machine:

Port 22:    Secure Shell (sshd, to enable remote log in)
Port 25:    Mail (postfix, to enable sending & receiving mail)
Port 2812:  Server Status (monit, to help me diagnose problems & maintain uptime)
Port 80:    HTTP (apache, main purpose of server)
Port 443:   HTTP/S (apache, main purpose of server)


The netstat output indicated that monit was listening on ALL interfaces, so I shouldn't have had a problem accessing it...but I was.  Something was getting in the way.  That something was iptables, which needed to be told about monit.
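The bind-address check above can be scripted.  What follows is a minimal sketch, not a definitive tool: listen_address is a hypothetical helper name (it is not part of netstat), and it assumes the column layout shown in the netstat output above.  A result of 0.0.0.0 (or :: for IPv6) means "all interfaces"; 127.0.0.1 would mean loopback only.

```shell
#!/bin/sh
# listen_address PORT: read `netstat -tln` style output on stdin and print
# the local address each listening TCP socket on PORT is bound to.
# (listen_address is a hypothetical helper name, not part of net-tools.)
listen_address() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # $4 is the "Local Address" column, e.g. 0.0.0.0:2812 or :::80
            n = split($4, parts, ":")
            if (parts[n] == port) {
                # Strip the trailing ":PORT" to leave just the address
                print substr($4, 1, length($4) - length(port) - 1)
            }
        }'
}

# Example, fed with two lines taken from the netstat output above:
printf '%s\n' \
    'tcp        0      0 0.0.0.0:2812                0.0.0.0:*                   LISTEN      24250/monit' \
    'tcp        0      0 :::80                       :::*                        LISTEN      16000/httpd' |
    listen_address 2812
# prints: 0.0.0.0
```

On a live machine the same function can be fed directly, as in netstat -tln | listen_address 2812.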


What is iptables?


According to its own man page, iptables is:

       Iptables  is  used  to  set  up, maintain, and inspect the tables of IP
       packet filter rules in the Linux kernel.  Several different tables  may
       be  defined.   Each  table contains a number of built-in chains and may
       also contain user-defined chains.


Configuring iptables for Remote Access to monit

Opening a port in iptables is usually a pretty trivial affair.  I have done it many times.  So for monit, I just entered the following command at the CLI to tell iptables to allow remote access to the monit port specified in /etc/monit.conf:


[root@vps2]# iptables -A INPUT -p tcp -m tcp --dport 2812 -j ACCEPT


But I still got this on the PC:



As it turns out, the fix was more subtle than I originally thought.  iptables evaluates the rules in a chain from top to bottom and acts on the first one that matches, so a rule appended with -A lands after any existing catch-all REJECT rule and is never reached.  This particular rule needed to appear earlier in the rule set for things to work right.  So I ended up manually editing the iptables configuration file, located at /etc/sysconfig/iptables, and adding the rule as early as possible.

# Generated by iptables-save v1.4.7 on Sat Apr 15 09:12:50 2017
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [30125:21061390]
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 2812 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 25 -m state --state NEW,ESTABLISHED -j ACCEPT
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
-A OUTPUT -o eth0 -p tcp -m tcp --sport 25 -m state --state ESTABLISHED -j ACCEPT
COMMIT
# Completed on Sat Apr 15 09:12:50 2017 
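
Since rule order is what matters here, a quick sanity check on a saved rule set is to confirm that the ACCEPT rule for a port comes before the first REJECT in the INPUT chain.  The following is a minimal sketch under stated assumptions: check_rule_order is a hypothetical helper, it reads an iptables-save style file, and it treats any REJECT rule in INPUT as a catch-all, which is a simplification.  The sample rules are made up for illustration.

```shell
#!/bin/sh
# check_rule_order FILE PORT
# Report whether the ACCEPT rule for PORT precedes the first REJECT rule
# in the INPUT chain of an iptables-save style FILE.
# Exit status: 0 = ok, 1 = shadowed by a REJECT, 2 = no ACCEPT rule found.
# (check_rule_order is a hypothetical helper, not an iptables command.)
check_rule_order() {
    awk -v port="$2" '
        /^-A INPUT / {
            n++   # position of this rule within the INPUT chain
            if (!accept && /-j ACCEPT/ && $0 ~ ("--dport " port " ")) accept = n
            if (!reject && /-j REJECT/) reject = n
        }
        END {
            if (!accept) { print "no ACCEPT rule for port " port; exit 2 }
            if (reject && reject < accept) {
                print "shadowed: REJECT at " reject ", ACCEPT at " accept
                exit 1
            }
            print "ok: ACCEPT at " accept
        }' "$1"
}

# Hypothetical sample: the port 2812 rule sits after a catch-all REJECT.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A INPUT -p tcp -m tcp --dport 2812 -j ACCEPT
EOF
check_rule_order "$sample" 2812 || echo "rule needs to move up"
# prints: shadowed: REJECT at 2, ACCEPT at 3
#         rule needs to move up
rm -f "$sample"
```

On a running firewall, iptables -L INPUT -n --line-numbers shows each rule's position, and iptables -I INPUT <n> inserts a rule at position n instead of appending it; the right value of n depends on your own rule set.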

Configuration completed, I restarted iptables:



[root@vps2]# service iptables restart
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Unloading modules:                               [  OK  ]
iptables: Applying firewall rules:                         [  OK  ]
[root@server sysconfig]#

I then double-checked the iptables rule set, and monit appeared where it should.  The only catch is that iptables -L translates port numbers into the service names listed in /etc/services, where port 2812 is mapped to atmtcp, a legacy protocol that almost nobody uses (or even knows about) any more.  So the monit rule shows up as dpt:atmtcp:

[root@vps2]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ndmp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:atmtcp
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:http
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:smtp state NEW,ESTABLISHED

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:smtp state ESTABLISHED

Sadly, the developers of monit chose a port that was already mapped to another service (atmtcp).  This can lead to terrible confusion.  If, like me, you really don't like your servers being misidentified, you can either change the port assignment for monit in the /etc/monit.conf file, or change the port mapping in /etc/services.  I have no intention of ever implementing atmtcp on this machine, so I changed the mapping in /etc/services:
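
One way to make that kind of edit is sketched below.  It is a hedged illustration, not a record of the exact change made on this server: rename_service is a hypothetical helper, and the sample line assumes the stock entry has the form "atmtcp 2812/tcp" (the spacing and any trailing comment in your file may differ).  The function only prints the modified file, so the result can be reviewed before anything is overwritten.

```shell
#!/bin/sh
# rename_service FILE OLD NEW PORT/PROTO
# Print FILE with the service name OLD replaced by NEW on the line whose
# second field is PORT/PROTO. All other lines pass through unchanged.
# (rename_service is a hypothetical helper name.)
rename_service() {
    awk -v old="$2" -v new="$3" -v pp="$4" '
        $1 == old && $2 == pp { sub(/^[^ \t]+/, new) }  # swap the name only
        { print }' "$1"
}

# Illustration on a made-up one-line services file:
sample=$(mktemp)
printf 'atmtcp          2812/tcp\n' > "$sample"
rename_service "$sample" atmtcp monit 2812/tcp
rm -f "$sample"

# On a real system, work on a copy and review it before installing as root:
#   rename_service /etc/services atmtcp monit 2812/tcp > /tmp/services.new
#   diff /etc/services /tmp/services.new
```

If all you want is to avoid the misleading name, iptables -L -n prints numeric ports and skips the /etc/services lookup entirely.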


 

Now the iptables output makes a bit more sense:

[root@vps2]# iptables -L | grep monit
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:monit

A Successful Remote Connection to monit

After reconfiguring iptables, I refreshed my browser (by pressing CTRL-F5), and here's what I saw:


Finally, it's working!  Now, at least the initial stage of troubleshooting this problem is out of the way.  In an upcoming article, I will discuss the root cause(s) of the "freezing" problem, and how I solved that situation as well.  (HINT:  It's all about the logfile located at /var/log/monit.log.)  That logfile should look like this:


[HKT Apr 15 08:10:34] info     : Starting monit daemon with http interface at [*:2812]
[HKT Apr 15 08:10:34] info     : Starting monit HTTP server at [*:2812]
[HKT Apr 15 08:10:34] info     : monit HTTP server started
[HKT Apr 15 08:10:34] info     : '<domain>' Monit started

not this:


[HKT Apr 14 20:39:36] error    : 'apache' total mem amount of 1027608kB matches resource limit [total mem amount>1024003kB]
[HKT Apr 14 20:40:36] error    : 'apache' total mem amount of 1027608kB matches resource limit [total mem amount>1024003kB]
[HKT Apr 14 20:41:36] error    : 'apache' total mem amount of 1128548kB matches resource limit [total mem amount>1024003kB]
[HKT Apr 14 20:42:36] error    : 'apache' total mem amount of 1135176kB matches resource limit [total mem amount>1024003kB]
[HKT Apr 14 20:43:36] error    : 'apache' total mem amount of 1237728kB matches resource limit [total mem amount>1024003kB]
[HKT Apr 14 20:43:41] info     : 'apache' trying to restart
[HKT Apr 14 20:43:41] info     : 'apache' stop: /etc/init.d/httpd
[HKT Apr 14 20:43:41] info     : 'apache' start: /etc/init.d/httpd
[HKT Apr 14 20:44:42] info     : 'apache' 'apache' total mem amount check succeeded [current total mem amount=178220kB]
[HKT Apr 14 21:48:47] error    : 'apache' total mem amount of 1231992kB matches resource limit [total mem amount>1024003kB]
[HKT Apr 14 21:48:52] info     : 'apache' trying to restart
[HKT Apr 14 21:48:52] info     : 'apache' stop: /etc/init.d/httpd
[HKT Apr 14 21:48:53] info     : 'apache' start: /etc/init.d/httpd
[HKT Apr 14 21:49:53] error    : 'apache' process is not running
[HKT Apr 14 21:49:58] info     : 'apache' trying to restart
[HKT Apr 14 21:49:58] info     : 'apache' start: /etc/init.d/httpd
[HKT Apr 14 21:50:58] error    : 'apache' failed to start
[HKT Apr 14 21:52:03] error    : 'apache' process is not running
[HKT Apr 14 21:52:03] info     : 'apache' trying to restart
[HKT Apr 14 21:52:03] info     : 'apache' start: /etc/init.d/httpd
[HKT Apr 14 21:52:56] info     : 'apache' started
[HKT Apr 14 21:54:01] error    : 'apache' service restarted 3 times within 3 cycles(s) - unmonitor
[HKT Apr 14 21:57:08] info     : Shutting down monit HTTP server
[HKT Apr 14 21:57:08] info     : monit HTTP server stopped
[HKT Apr 14 21:57:08] info     : monit daemon with pid [13452] killed
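
A log like the second one is easier to digest when the error lines are tallied per service.  Here is a minimal sketch, assuming the log line layout shown above (timestamp in square brackets, then the level, a colon, and the quoted service name); monit_error_counts is a hypothetical helper name.

```shell
#!/bin/sh
# monit_error_counts: read a monit log on stdin and print "COUNT SERVICE"
# for every service that logged at least one error, busiest first.
# Field positions assume lines such as:
#   [HKT Apr 14 21:49:53] error    : 'apache' process is not running
# (monit_error_counts is a hypothetical helper name.)
monit_error_counts() {
    awk '$5 == "error" { count[$7]++ }
         END { for (svc in count) print count[svc], svc }' |
        sort -rn
}

# Example, fed with three lines taken from the log above:
printf '%s\n' \
    "[HKT Apr 14 21:49:53] error    : 'apache' process is not running" \
    "[HKT Apr 14 21:49:58] info     : 'apache' trying to restart" \
    "[HKT Apr 14 21:50:58] error    : 'apache' failed to start" |
    monit_error_counts
# prints: 2 'apache'
```

Against the real file, the call would be monit_error_counts < /var/log/monit.log.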

finis (for now)

REFERENCES


https://askubuntu.com/questions/640150/custom-port-names-for-netstat

https://crm.vpscheap.net/knowledgebase.php?action=displayarticle&id=29

https://en.wikipedia.org/wiki/OSI_model

https://www.centos.org/forums/viewtopic.php?t=9059



Wednesday, April 5, 2017

Bing! Bing! Bing! Zzzzzzz - Fixing a Dying Pinball Machine

Don't let your Pinball fun be ruined by rebooting problems...

Indiana's Dying

A little while ago, a friend mentioned to me that his pinball machine was dying every time he pressed both flipper buttons.  I asked to see this happen in action, and he was kind enough to allow me to operate the game.  It behaved very strangely.  As soon as both flippers were activated, the machine rebooted.  Not being much of a pinball fan, I made some polite noises and moved the conversation on to the real reason I had come to visit, which revolved around picking up a microwave oven.  I think what I said at the time was that it was probably the solenoids operating the flippers.  My guess at the time was that they were probably drawing too much current from the power supply, causing it to under-volt the central processing unit, which, starved for power, was acting like the machine had just been turned on.  In any case, I pointed out that the problem was very likely related to a power supply issue.


I Hate Boredom

The microwave turned out to be a dud.  As time passed, I often thought about the pinball machine, which was Indiana Jones themed, and wondered if it would ever work properly again.  After a while, I contacted my friend and asked how he was doing.  Unsurprisingly, he once again asked me if I could fix his machine.  There was nothing on the workbench, so I agreed to give it a try:


My "On Again, Off Again" Relationship With Pinball

The last time I played pinball with any seriousness was in the early 1980s.  My family used to go to Old Orchard Beach in Maine for a month over the summer holidays, back in the days when taking a month-long summer holiday didn't raise eyebrows.  At the beach, near the pier, there was an arcade that looked like this:



My sister, who was about 10 years old at that time, liked to play video games, especially Pacman and Ms. Pacman.  So I would go to the arcade with her to pass the time.  Yes, I played with the games, but never really got a charge from them.  I still don't play games.

Microsoft Pinball

But I did love the fact that Microsoft included Space Cadet as a free element of its early GUI Windows lineup:


I loved the look and feel of Microsoft Pinball, and always admired Microsoft for hiring Cinematronics to make it.  Out of the box pinball in Windows is gone now, a victim of Microsoft's transition to 64-bit, but I think Microsoft Pinball will be forever remembered, considering how many articles have been written about it.

UPDATE:  Here's how to download and install Microsoft Pinball on 64-bit

But Why Was Indiana Jones Dying?

As I mentioned at the top, my instincts were telling me that the machine was dying due to an under-supply of electricity to the CPU.  Now, WHY the under-supply was happening was a bit of a mystery.  Seeing as the event only happened when both flippers were activated, my first guess was that the flippers were simply drawing too much power, and thereby dragging the machine into the netherworld.  Considering that flippers are electromechanical devices (basically, solenoids) that are subject to wear and tear, my guess was that they just needed a service.  But that guess was pretty weak.  Surely this problem wasn't unique?

The Internet:  An Ocean of Pinball Related Information


Due to my total lack of background in pinball machines, other than as a (very) occasional user, I decided to do a little bit of research.  Thank god for the Internet.  The simplest Google search turns up a cornucopia of articles, how-to's and videos on how to fix pinball machines.  Some very dedicated aficionados have even covered my exact problem, albeit on different machines than the one I was working on:




In fact, this problem is becoming so common, people are even making a business of fixing it with their own specially crafted add-on components:




Pinball Machines Are Dying Like Flies

I found out something a little troubling while performing my research.  The pinball machines that had occupied much of my youth are dying out.  When I was growing up, they were everywhere.  Now, they're going extinct and becoming collector's items.  Besides the fact that everything is going digital these days, pinball machines occupy a lot of space and make a lot of noise.  Compounding this is the fact that after X years, their most vulnerable internal electronic components (mostly capacitors and diodes) start to give up the ghost, accelerating their voyage towards "electronics heaven".  The situation with pinball machines is very similar to that of vintage solid-state amplifiers like my Yamaha DSP-A3090, which was first released in 1995, and even younger units like my Antique Sound Lab Hurricane, which was released in 2005.  Both units have had failures related to component aging - and it's mostly because of the amount of heat they generate.  They are literally cooking themselves to death!

A Primer On Common Component Failures


When Diodes Go Bad


Normally very reliable components, diodes go bad when they experience an over-voltage, or "spike".  The tolerances for most diodes are very tight these days, and transient voltages can easily damage their delicate junctions.

The following picture features a blown diode in an Atari Asteroids Deluxe machine:
 
(http://www.aaarpinball.com/AsteroidsDeluxe/AsteroidsDeluxe.htm)

Sometimes diodes are ganged together in a specialized package known as a bridge rectifier.  A Bridge Rectifier uses four diodes hooked together in a special way so as to help convert Alternating Current (AC) to Direct Current (DC).  They are commonplace in power supply circuits.   

Here's a picture of a blown bridge rectifier from a Conrad Johnson PV-7 pre-amplifier:


(http://www.lencoheaven.net/forum/index.php?topic=4279.0)

When Capacitors Go Bad

Electrolytic capacitors are less reliable than diodes because of how they are constructed.  Basically a "Swiss roll" of aluminum foil and liquid-impregnated paper, they go bad when they dry out.

This picture is of a blown capacitor on a computer motherboard:

(https://en.wikipedia.org/wiki/Capacitor_plague)


The Good News and the Bad News

The Good News

The good news is that when it comes to repairing equipment with broken diodes and capacitors, cost is normally not a big consideration.  Diodes and capacitors are very inexpensive, widely available and small.  They can easily be either picked up at an electronics store or ordered from a very wide range of sources including eBay, Amazon, TaoBao, Mouser and Active.  When I was young, before there was an Internet, I would bicycle over to Radio Shack to buy electronics components and books like this from Forrest M. Mims III:


 (http://hackaday.com/2017/01/18/forrest-mims-radio-shack-and-the-notebooks-that-launched-a-thousand-careers/)

Sadly, this avenue of learning about electronics is no more.  Radio Shack stopped selling this kind of stuff a long time ago.  On the positive side, there are now innumerable online sources of electronics tutelage, including Instructables, for whom I have written a couple of articles, including one on making a 3D-Printer from old disc drives, which I worked on with a man named Mark Rogivue.  We initially called it the Infinity, then the Curiosity.  As a kit, the unit was designed to cost under USD100.00, but the build volume was pretty small, at 3 cubic centimeters:

 

The Bad News

There are a few pieces of bad news when it comes to doing this stuff:

Danger
One of the worst pieces of bad news is that messing around with capacitors (or electronics in general) can be painful or even lethal if you do not know what you are doing - especially if you are working in power supplies.  I am not going to even offer a basic electronics safety course as part of this article.  I don't want to assume any liability and I don't have the room.  Please, if you choose to work with electronics, take some training on how to do it safely, or connect with someone who can show you how to do it safely.  You have been warned.  Proceed with caution and with the understanding that if you end up hurting yourself, someone else or property, you have only yourself to blame.

Time
Fiddling with broken electronics can be very time-consuming.  Personally speaking, I do it as a means of relaxation, especially when I am thinking about something else that's complicated.  Believe it or not, I often need to be in the process of solving one problem to make any progress on solving some other, usually more difficult, problem.  So, I often keep my hands busy on a minor challenge while settling a major challenge.  I know this sounds strange, but it's true.  If you are like me, you might want to give it a try.  The process and sense of accomplishment that comes from fixing something often translates somehow to the bigger stuff.

Failure
I should be honest here and state that not all electronics problems are trivial and not all of them can be fixed easily - or at all.  Some electronics are so horribly tiny that they're impossible to work on (almost all mobile phones).  Some electronics feature Integrated Circuits and other proprietary or specialized parts that you just can't get a replacement for.  Some electronics problems are just beyond your skill or troubleshooting level.  Some problems require tools that are too expensive.  Be aware that not everything can be repaired within an acceptable cost window.  These units are often labeled as being Beyond Economic Repair (BER).

Getting Back to Fixing Indiana Jones

Now that we know what we are in for, let's talk about the Indiana Jones situation.  Seeing as it was pretty clearly a Power Supply issue, the question then became one of supply or demand.  My initial reaction was demand - I figured the aging solenoids in the flippers were drawing too much power.  After doing some research, it became clear to me that it was actually a supply problem.  The Power Supply capacitors were very frequently the culprit.  After finding a very similar characterization online, I posted the following message:

 (https://pinside.com/pinball/forum/topic/machine-powering-down-after-hitting-both-flippers)


In terms of this fix, I owe a large debt to Robin, the founder of www.pinside.com, for the following post:




Armed with this information, I started to ask my friend about the Power Supply section of his Indiana Jones machine.  He sent me this photo:



Upper Right
On the right side, the section of the board that is of interest is the AC-DC section, where the incoming voltage is converted from Alternating Current (AC) to Direct Current (DC).  This is accomplished with a square-shaped bridge rectifier and a pair of smoothing capacitors:


Upper Left
On the left side, there's what looks like another bridge rectifier and a single smoothing capacitor:
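To see why those smoothing capacitors matter, here's a rough back-of-the-envelope sketch in Python.  It uses the textbook approximation for the ripple left over after a full-wave bridge rectifier: ripple ≈ load current / (2 × mains frequency × capacitance).  The 5 A load, the 60 Hz mains frequency and the "aged" capacitor that has lost half of its capacitance are all made-up values for illustration, not measurements from this machine.  The 15,000μF figure is the rating of the capacitors found on this board.

```python
def ripple_voltage(load_current_a, mains_hz, capacitance_f):
    """Approximate peak-to-peak ripple (volts) after a full-wave bridge rectifier."""
    # A full-wave bridge tops the capacitor up twice per mains cycle,
    # hence the factor of 2 in the denominator.
    return load_current_a / (2 * mains_hz * capacitance_f)


LOAD_CURRENT_A = 5.0  # hypothetical load, not measured on the machine
MAINS_HZ = 60.0       # assumed mains frequency

healthy = ripple_voltage(LOAD_CURRENT_A, MAINS_HZ, 15000e-6)
aged = ripple_voltage(LOAD_CURRENT_A, MAINS_HZ, 7500e-6)  # hypothetical worn-out part

print(f"healthy capacitor ripple: {healthy:.2f} V")
print(f"aged capacitor ripple:    {aged:.2f} V")
```

With these assumed numbers, the ripple doubles when the capacitance halves.  That kind of sag under load is one way a tired capacitor can turn into a supply problem.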


So it looked like there were five (5) capacitors to replace - but what size?  Capacitors act a bit like cups that hold hot water.  The volume of the cup is measured in a unit called the farad, named after the famous physicist Michael Faraday.  There's one more rating, which is the voltage rating.  You can think of it as the temperature of the water.  A plastic cup melts if boiling hot water is poured into it.  The same thing happens with a capacitor - too much voltage and it will fail.  The capacitors we found in the Indiana Jones Pinball machine were rated at 15,000μF @ 35v, which means they had a capacitance of 15,000 microfarads and could safely be charged up to 35 volts:
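To put rough numbers on the cup analogy, the sketch below uses the standard formulas Q = C × V for stored charge and E = ½ × C × V² for stored energy.  It assumes the capacitor is charged all the way up to its 35 volt rating, which is the upper limit printed on the part rather than the voltage the circuit actually runs at.

```python
def stored_charge_coulombs(capacitance_f, voltage_v):
    """Charge held by a capacitor: Q = C * V."""
    return capacitance_f * voltage_v


def stored_energy_joules(capacitance_f, voltage_v):
    """Energy held by a capacitor: E = 1/2 * C * V^2."""
    return 0.5 * capacitance_f * voltage_v ** 2


CAPACITANCE_F = 15000e-6  # 15,000 microfarads, from the part's label
VOLTAGE_V = 35.0          # assumption: charged to the full rated voltage

charge = stored_charge_coulombs(CAPACITANCE_F, VOLTAGE_V)
energy = stored_energy_joules(CAPACITANCE_F, VOLTAGE_V)

print(f"charge: {charge:.3f} C")
print(f"energy: {energy:.2f} J")
```

That works out to about half a coulomb of charge and roughly 9 joules of energy per capacitor at the rated voltage.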


So the next task for us was to source five (5) capacitors of equal or better rating.  Sourcing electronics is pretty easy.  I found the following replacement units on www.taobao.com, for about USD1.25 each:
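As a minimal sketch of what "equal or better rating" can mean in practice, the check below treats a candidate as acceptable when both its capacitance and its voltage rating are at least as high as the original's.  That rule is a simplifying assumption for smoothing capacitors like these, and it ignores physical size, lead spacing and temperature rating.  The two candidate parts are hypothetical examples, not the units that were actually ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacitorRating:
    capacitance_uf: float  # capacitance in microfarads
    voltage_v: float       # maximum rated voltage in volts


def is_equal_or_better(original, candidate):
    """True when the candidate meets or exceeds both ratings of the original."""
    return (candidate.capacitance_uf >= original.capacitance_uf
            and candidate.voltage_v >= original.voltage_v)


installed = CapacitorRating(capacitance_uf=15000, voltage_v=35)

# Hypothetical candidates, for illustration only.
higher_voltage = CapacitorRating(capacitance_uf=15000, voltage_v=50)
lower_voltage = CapacitorRating(capacitance_uf=15000, voltage_v=25)

print(is_equal_or_better(installed, higher_voltage))
print(is_equal_or_better(installed, lower_voltage))
```

The first candidate passes because nothing is lower than the original.  The second fails because a 25 volt part would be the "plastic cup" in a circuit that expects a 35 volt rating.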





Disassembly

In preparation, I asked my friend to remove the power board. Pinball power boards have a lot of connectors, all of which need to be carefully tagged and documented before the board can be safely removed.  

As we moved through the board removal process some very interesting things were revealed. For example, we found some strangely marked plugs, and a damaged connector:


Here's what the cabinet looked like with the power board removed:   


Once the board was removed, it looked like this.  Our focus areas are highlighted:


For the first time, we were able to do a visual inspection of the back of the board.  By the looks of it, the capacitors in question may have already been replaced once before:



Not the best quality soldering work!

This also raises the specter of the capacitors being the wrong size or rating, so I had to go find out whether or not the installed capacitors actually matched the recommended ones.  Luckily, I found an online version of the Indiana Jones Pinball Machine Operations Manual.

Inside the manual, I was able to find the schematics for the power circuit:



As well as the loner on the other side of the board:




So now we knew we needed to replace C5, C6, C7, C11, and C30.  Consulting the parts list in the Indiana Jones Pinball Machine Operating Manual, the rating of these capacitors is clearly listed:




So the capacitors selected as a replacement for what had been installed would be OK.  In fact, more than OK.  But a question was begging to be asked.  If the capacitors had already been replaced, and were therefore new - why was the machine still resetting?  The only way to really find out was to do one of these two things to eliminate them as a source of the problem:

1a) Take all of the capacitors out
1b) Put in a new set of capacitors (cost: USD10.00)

2a) Take all of the old capacitors out
2b) Test all of the old capacitors with an ESR meter (cost: USD50.00), and then probably...
2c) Put in a new set of capacitors (cost: USD10.00)


What is ESR?  How Do We Measure It?

ESR stands for Equivalent Series Resistance.  This approach to troubleshooting capacitors adopts the perspective that capacitors act like a resistor under certain conditions, and can be measured as such.  With age and use, the equivalent series resistance of a capacitor tends to creep upwards, sometimes to spectacularly high values.  This can cause other components in the circuit to malfunction, or even fail.  This is especially true of high frequency circuits that rely on resistor-capacitor combinations, or "RC Circuits" (https://en.wikipedia.org/wiki/RC_circuit) which are very common.

Under ideal circumstances, a capacitor should offer zero resistance to an electrical circuit, but in reality, that's never true.  Capacitors always offer some resistance; how much depends on the type of capacitor.  What an ESR meter does is measure and display the equivalent series resistance of any given capacitor so it can be compared with the nominal value(s) for that capacitor type.  If the value is wildly out of tolerance, the part is considered defective.  Here's a really inexpensive ESR tester in action:
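The comparison an ESR reading feeds into can be sketched in a few lines of Python.  Everything numeric here is hypothetical: the 0.05 ohm limit, the "marginal" band and the factor of two that separates marginal from defective are placeholders for illustration.  Real limits depend on the capacitance and voltage rating of the part and should come from a published ESR chart or the manufacturer's datasheet.

```python
def classify_esr(measured_ohms, max_expected_ohms, fail_factor=2.0):
    """Compare a measured ESR against the expected maximum for the part.

    fail_factor is a hypothetical cut-off: readings above
    max_expected_ohms * fail_factor are treated as wildly out of tolerance.
    """
    if measured_ohms <= max_expected_ohms:
        return "ok"
    if measured_ohms <= max_expected_ohms * fail_factor:
        return "marginal"
    return "defective"


MAX_EXPECTED_OHMS = 0.05  # hypothetical limit, not taken from any chart

# Hypothetical meter readings in ohms.
for reading in (0.03, 0.08, 1.50):
    print(f"{reading:.2f} ohms -> {classify_esr(reading, MAX_EXPECTED_OHMS)}")
```

The point is only that the meter gives a number, and the decision comes from comparing that number with what a healthy part of the same type should read.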




Why We Bother With Capacitor ESR Testing

In the hobby world, only uber-geeks bother to test out capacitors.  Most hobbyists just swap the capacitors out and have done with it.  At USD1.25 for even a large (15,000μF @ 35v) part, why bother?  But in the professional world, knowing exactly what part failed really matters, because clients are paying for the repair and want to know where things went wrong, or the company is paying an in-house tech to do the repair.  It also makes sense when you are contemplating changing thousands of units on an assembly line because you may have sourced substandard capacitors, or are dealing with a large recall program as a warranty support measure.  In those circumstances, rooting out the exact cause of the problem makes sense.  Here are some common ESR values:



(http://peakelec.co.uk/downloads/esrguide-en.pdf)


Six Sigma Isn't Always Your Friend

In the hobbyist world, changing out all of the capacitors makes sense because of the way things are made.  Usually, all of the capacitors in a given piece of electronics were purchased at more or less the same time - so they are going to fail at more or less the same time.  This is especially true in an age of universal quality control systems, where deviations from the norm are not tolerated.  The application of ISO type process controls in manufacturing creates the potential for "clustered" failures because variation is kept to a minimum.  Six sigma quality means your capacitors are likely to fail at around the same time as each other, because they are all basically "clones":
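The "clustered" failure idea can be illustrated with a toy simulation.  The sketch below draws lifetimes for a batch of five capacitors from a normal distribution and reports the gap between the first and last failure.  The 20,000 hour average life and both spread values are invented purely for illustration, and real capacitor wear-out does not have to follow a normal distribution.

```python
import random


def failure_spread_hours(mean_hours, sigma_hours, count, seed=1):
    """Gap between the first and last failure in one simulated batch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    lifetimes = [rng.gauss(mean_hours, sigma_hours) for _ in range(count)]
    return max(lifetimes) - min(lifetimes)


MEAN_HOURS = 20000  # hypothetical average lifetime

# Hypothetical spreads: tightly controlled parts versus loosely controlled parts.
tight = failure_spread_hours(MEAN_HOURS, sigma_hours=200, count=5)
loose = failure_spread_hours(MEAN_HOURS, sigma_hours=4000, count=5)

print(f"tight process: failures span about {tight:.0f} hours")
print(f"loose process: failures span about {loose:.0f} hours")
```

Because both batches use the same random draws, the only difference is the variation in the process.  The tighter the process, the closer together the five failures land, which is why swapping the whole set at once is a sensible habit.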


(http://lablean.blogspot.hk/2015/02/six-sigma.html)


The Replacement Capacitors




The ESR Meter




Prepping for Surgery

With the board out of the machine, and the operating theatre ready to go, all we needed to swap out the capacitors were the following things:


Soldering Iron


Solder Sucker


Solder