Little Nybbles of Development Wisdom

Terence Parr, October 31, 2002. Updated October 14, 2010.

Software is more an art or skill than a science or engineering discipline. The most effective means of becoming a great programmer is through an apprenticeship (even if self-directed). There is no substitute for coding a big system that evolves over time. It seems to take about 2 to 3 years before somebody absorbs the important lessons. You can read books and papers in an effort to avoid common mistakes, but talking to and working with other programmers still seems to be the best (if slow) approach. As Chris Brooks says, becoming a commercial programmer is like becoming an architect; being a junior associate for a while is part of the process.

In this document, I have tried to remember and distill my hard-fought 3-year experience as I evolved into a programmer capable of building a commercial product, http://www.jguru.com (for more information on the evolution and design of the jGuru server, you can check out this lecture). Naturally this is a not complete list of programming advice, but rather what I learned on this project.

Hardware, networks, logs

Use as few machines and system components as possible. System complexity made our first system extremely unstable.
All machines of a certain class (web or db etc...) must be identical down to the exact version of Linux. Reproducibility is important. You must be certain that your test and live environments are identical if you want a chance of finding bugs.
To go from raw linux box in a known state to fully configured system ready to bring live must be completely automated. You should be able to install a few RPMs or tar balls, push your software, and go live. Reproducibility!
Hardware fails a lot more than you would expect in commercial settings. Make sure that not only your backups work but that you can easily reconstruct a system. If you don't have kickstart, make a human script to quickly follow like a pilots checklist to get moving.
Avoid system components that force GUI or webpage initialization / configuration. Automation is the goal and a GUI configuration tool destroys any hope of configuring a box by unzipping or installing RPMs.
Machines where you deploy or test software must be READONLY. You are not tempted to tweak the config or software on the live system (even if you use a repository to deploy).
Lock down your systems tightly; no unnecessary ports open. POP,sendmail,bind (DNS) are gaping holes. Your system is constantly being swept by "target acquisition radar".
Verify that your backup strategy works (i.e., you can bring back data) and that it continues to operate. Back up onto hard drives if you can and then onto tape (shudder) or DVD-RAMs.
Use a hosting service like Rackspace if you can. You can often avoid pesky and surly sys admins <wink> plus get better, cheaper service.
Log generously. It's extremely useful for examining the events leading up to a crash or bug.
Collect all of your logs on one machine if possible. Makes it much easier to back up and you don't risk clobbering log files as you try to back them up onto a single disk somewhere.

Programming

Dealing with change

Your software design decays over time as you add features and modify existing features. Rewriting and cleaning it up (refactoring) is as important as adding new features. Seriously. I'm not kidding. Heh, write this down! jGuru's first system did not get refactored at all. With 4 or 5 coders we had lots of decay and the system was insanely fragile.
There is a constant battle between writing code quickly (yielding brittle code) and writing code that is flexible. The future may render all of your costly-to-write flexibility irrelevant so be careful not to overdesign. Refactoring can fix some problems later. If you have to write brittle code, try to isolate it in a method or via an interface so clients don't have to change.
User code should ask a high level service. For example, have a FAQManager for content and let it worry about where the persistence layer is. The db might be on another machine or move as your system evolves. You want to avoid large scale changes in your user code when services change location. For example, web pages (user code) should never directly make SQL queries. To be able to swap a service out, put a "switch" between your services and your user code so you can swap them out (even dynamically) without having to change code that references that service. You'll need Java interfaces for this.
Specify as much as you can in an array, a property file, or a configuration file. Changing data is much easier and safer than changing code.
Coding a complete system a second time is easy, fast, and accurate because you have few (if any) coding or design decisions to make. It just seems to fall out of your head. This fact has huge implications for refactoring. Managers hate throwing out huge swaths of code because they paid for it and they fear it will take the same amount of time to recode. In reality, writing something a second time is dramatically faster and results in vastly cleaner code. I've seen compression rates of months for iteration 1 down to a week for iteration 2. When you know all the issues, you code with confidence and know there won't be any surprises. Surprises like, "oh! I never thought of that security hole. How can we avoid that?", are the primary speed and cleanliness impediments. If you are recoding only a piece of some software, unit tests are crucial to ensure your new software fits within the old structure.

Robustness

If something can go wrong, make sure you design the software so that it can only work the right way. For example, what if I launch the notification system (emails 10,000 people) twice at once? What if I launch it from the test server?
Automate anything that you might screw up like "is this the live server or a test server?" Don't make somebody specify it--at 3am after the server has crashed, you'll make a mistake. For example, jGuru uses file $HOSTNAME.family to get its list of what sites to host upon startup.
Never leave an enemy at your back (unless you are trying to collect more data on it). I.e., don't leave a strange bug hoping it will go away. It will return in the most horrible way like relatives coming to visit for 3 weeks. If you think there might be a problem with a component, there is.
Build unit tests and functional testing procedures. This includes building or using a load tester to check boundary conditions and possibly to reproduce infrequently-occurring bugs. Automate as many tests as you can even if you have to buy a test harness for a GUI etc... When you find a bug, add a test case for it.
Always build quality in! Don't just test for trouble later to see how bad it is and try to fix it.
If you are only one that knows the server, it will break on vacation or the day you are supposed to leave. The day we launched the 2nd version of jGuru, I flew across the country only to hear the server had crashed. My business partner had to "become my hands" over the phone to debug a system he had never looked at before! Another time, the hard drive on our main live server died a miserable death the day I was to leave for Paris. Rackspace.com had to replace the drive, copy any surviving data, and I had to run through my "human scripts" to launch a new system. I almost missed my flight.
When something goes wrong think about what is different or what has changed. I know this sounds obvious, but it is a very powerful focusing technique. It is really tempting to freak out and try all kinds of fixes when the system becomes totally unstable. After our system crash in the bullet point above (before my trip to Paris), jGuru became super slow and unstable. The system was launching 700 threads, bringing the machine to a grinding halt. I kept thinking "what's changed?", but couldn't think of anything. The software was the same, I said! So, I started building thread debugging tools. Anyway, turns out I did change something. Ah ha, I thought. I did make a minor change when trying to get the server back up after the crash--it was causing portal.init() to be executed twice. The system seemed ok for a few hours, but then was right back to the huge number of threads. Finally, I realized that I had specifically code the system so it could only be initialized once. It couldn't have been that. Using the "what has changed" focusing lens, I convinced myself that the software was the same (confirmed by revision control system). Therefore, no matter how unlikely, there must be a data problem. Given that the server crashed, I would normally be suspicious of this immediately, but our database naturally has transactions and recovers nicely from power outages and so on. Well, it turns out the search database, which is different, got caught in the middle of a locked operation when the system died (leaving a file called commit.lock) around. I copied this search database with the freeze-dried lock to the new drive, making the search database freak out. The search library waits like 3 seconds to see if the lock will free up before timing out. With all of the searches initiated on jGuru, this queued up a HUGE number of threads. Problem was solved literally by removing that lock file. The number of threads dropped before my eyes.
Don't code after drinking. ;)

Design tactics

Don't be too clever. Being able to keep a really complicated design and/or implementation in your head means you may not search for a simpler, more elegant solution. Others will not be able to modify nor maintain your code. You will not be able to figure it out yourself after 6 months. First make it simple and make it work. THEN, if it's too slow, trade complexity for speed.
Only keep one long-term copy of objects related to database entities so you only have one object to update. Go further than only keeping one copy--keep only one pointer to that object. You only want one pointer to, say, a person record laying around so that, when you need to swap out the person record with an updated version, you can change just one pointer. This implies that your data indices must keep symbolic references not actual pointers to objects. For example, I always have one table called personIDToPersonMap, which holds an actual pointer to a Person. All other indices such as superUserIDList track IDs not pointers to the Person so I can do whatever I want to the Person objects w/o screwing up a single index.
If you can afford it, don't store the results of computations. The computation or algorithm may change in future and then you have legacy results to change, possibly with both legacy and new data available in the system. For example, don't store when somebody needs to pay or reregister. Store the account created date and then have an algorithm decide when to ask them to pay when they log in next or whenever. The algorithm will change as you change your business model.
Don't intermingle a computation within another unless it's too slow otherwise. It's too hard to read/modify. Better to see a set of smaller, more encapsulated computations than one giant blob that computes and saves results for later use. E.g., ANTLR grammar analysis tried to track too many statistics rather than simply walking structures later to get computations it needed. I wasn't sure stats were correct.
Nested or recursive structures and related algorithms are the natural solution for many tasks. Unfortunately, recursive thought seems to be a very difficult concept. Don't resist it; practice will unleash its power. Examples of nested and recursive techniques:
1. grammars and languages
2. languages written in themselves
3. hashtable of vectors (in practice, you can use this structure to sort with roughly linear performance for data sets with many repeated keys)
4. hashtable of hashtables
5. trees / walking
Learn about languages, their design and implementation. Skill with computer languages is the single most useful weapon you can acquire because it covers just about every application of computing. As the primary developer of ANTLR, a popular parser/translator generator, I receive questions from an amazingly broad group of users: biologists doing DNA pattern recognition, NASA scientists automatically building communication libraries from deep space probe specification RTF documents, people building configuration files for every conceivable kind of program, and so on. The jGuru.com portal uses many languages and parsers from object-schema specifications to HTML sanitizers. The point is that computer language skills enable you to produce extremely flexible and powerful software, not just compilers for new programming languages.

Performance

Don't worry about writing super efficient code until you know there is or will be a speed problem. Use a profiler to know rather than deduce where the inefficient hot spots are. The relationship between source code and efficient CPU instruction execution is now so distant that you should not try to guess what will be efficient at that level (pipelines, branch prediction, caches, ...). Worry more about algorithmic complexity (i.e., speed/space) and use a profiler.
Do expensive operations either up front or in the background. (load data, snoop or search other sites, sort, ...). This is a good use of threads.
Use memory if you have it. If your sizeof(database) < sizeof(RAM), cache the whole damn thing. There is a lot of resistance to this idea from database experts, but you'll never beat a fetch from your cache with a database fetch (even using database caching). I often load everything upon start up of the server (with simple "SELECT * FROM xxx" queries) and then use a write-through cache strategy. When you add a record, such as add a new forum entry, write to the database and then update the cache (including any indices you may have).
Cache pages that don't change or change infrequently to reduce server load.

3rd Party Software

Do not rely on anybody else's software for your core application unless you really trust and have tested the library or service. If you have to use other software for a critical component, make sure you get the source.

One time with epicentric, I had to email the chief architect and their programmers the exact lines of offensive code before they believed me that their software went to the db every time it wanted an int property. Pages were rendering in 30 seconds a piece.

A really smart friend told me about the excellent object to RDMS mapping library he used. I asked him about the caching policy and then how to change the policy. Turns out you could not really tune it and it wasn't clear how it cached. That is unusable for a real product. Control is crucial.
Most systems don't need the power of oracle. Use something simpler as it may not be worth the hassle of oracle.

Project management

Sometimes just picking a path is better than wasting months and months trying to find an optimal path. You probably won't find it. You must pick something, learn from it, and then decide later for system II. Picking Epicentric was the right idea--we had to get started.
Making robust software is very hard; particularly with lots of coders. There are 3 kinds of dangerous programmers:
1. very lazy programmers; "can't we get a tester?" or "well, this software has problems but testing will find it."
2. a programmer that is so impressed with himself or herself that they "don't need to test that much".
3. a programmer that thinks they are good but isn't totally secure; they don't want to test their code for fear they'll find evidence of poor skills.
  
  Drive the concepts of quality, testing, robustness into your coders.
Programmers are curious beasts, which is normally a good thing. However, watch out that they don't find new technology X and demand to use it because "it's so cool." At the same time, don't let management force X on you to make your software buzzword compliant.
Document everything you can including software design, system configuration, your experiments, and your thoughts. I have note files on many topics and then when I'm ready to implement that topic, I have a good start on a feature list and design. You should be able to refer new programmers to a wealth of information about your system. Even though it was a hassle at jGuru and we were swamped with work, making notes was crucial.
Writing software is about accepting imperfection and incompleteness to make deadlines. Accept failure as part of the job to reduce stress. Prioritize so you know what is reasonable to ignore. Perfectionism forces some employees to become mired and unable to complete anything, some to work 3x too hard, some to freak out thinking you are posing impossible problems.

Herding cats

Most of this I learned from CEO Tom Burns.

There is no such thing as a good, busy manager. A busy manager can only react not act. Further a busy manager has no time to think about how he/she is affecting employees. Example: The CEO and I switched responsibility for getting a doc done. A sales guy sent me something, but I ignored it to do it faster by myself. It turns out he had worked for a week on his version but I didn't know. I sent a very bad signal that he was (incorrectly) irrelevant.
The only realistic definition of loyalty is "our interests are aligned". Works for both the employer and employee. Here are some important related points:
1. Any time a company talks about being loyal to employees, they are lying or are being idealistic at best. A company usually does not have a choice when laying off employees--the company has a responsibility to their shareholders to make money not provide jobs. One could design a system where everybody had a job (even if the government had to lie about it), but I'm pretty sure those countries have all collapsed now in favor of capitalism. Anyway, when jGuru quit being a training company and became a web portal for java developers, we needed coders not trainers. We layed off half the company the day we made the final decision to switch directions.
  
  Remember those companies with "no-fire" policies like DEC and HP? They too have succumbed to reality, dropping their early idealism with their first massive layoffs.
2. Only an employee can choose to be loyal as he/she can't really be forced to quit by an external source...only a company fires people. An employee may have a better opportunity. At that point, their interests are no longer aligned with their current company. Why should he/she be loyal to a company that cannot offer them loyalty?
3. Many individuals and whole cultures will find the underlying facilitating concept of "at will" employment distasteful. But, from a management perspective, being able to lay off people means you are not afraid to hire people. If hiring someone is like getting married for life, a company will be very reluctant to hire new people in order to satisfy a new demand. The company must be agile during difficult and good times. They might actually stick around to keep some people employed or hire more people in the future.
When an employee calls to say he/she screwed up, thank them for their willingness to inform you and then work on a solution. Otherwise, if you yell, the employee will learn not to tell you when things are bad.
Extremely important to see truth even if you don't like it. See it as early as possible. Must know if somebody can't do something or something will be delayed.
You must give decision authority to employees otherwise you have to make all decisions and employees are locked waiting for you to make the decision. Plus, it costs lots of time to make all decisions. You end up with a company of automata if you don't give them authority to act. Along these lines, give someone a task and either be satisfied or not. Don't tell them to do something and then barge in and tell them how to do it.
All employees have faults--do not lightly toss them away. At least you know what their problems are. A new employee will have unknown problems.
Ask people what they want to do. Bribe them if necessary to get them to do icky things by offering good things to do afterwards (unless you can find somebody that doesn't mind doing the icky things). Recognizing an employee's control over themselves is important. Try to let them choose to do the icky thing.
You cannot control anybody or anything. You can only nudge or influence.
Management is a low position as it is all about making sure your employees are productive (like a conductor in an orchestra). You do no real work. Only getting stuff for employees and insulating them from crap above and from outside.
No matter how carefully you phrase something, someone will misinterpret it and be upset.

Summary

Use as few system components as you can and only those that you trust, for which you have source code, and that can be automatically installed and configured. Use a hosting service if you can.

Be paranoid when you write software. Assume you have lots of bugs and make your system tolerant of them. Try to find these bugs aggressively. Continuously groom (refactor) your software as you add features.

Notes

Apologies to Professor Dave Meyer at Purdue University for deriving this document's title from his excellent class notes: Little Bits of Digital Wisdom.