Tuesday, April 29, 2008

Not Using Source Code Control is Insane!

"Like skydiving without checking your parachute or having sex with a stranger without using a condom." -- me, earlier today
I've worked for a few places that didn't use source code control properly or even at all before I started. But today, I met a new PHP programmer who didn't know what I was talking about when I mentioned source code control. He recently migrated over from writing and graphic design and has no background in computer science or software engineering so it isn't his fault really. The fault lies with those that brought him into the fold.

As far as I can tell, I'm somewhat strange - I'm passionate about source code control (SCC from here on) and have tried a few new ones just to see if I like them. Process is important. I don't believe in a crazy ISO-9000 level of process (at which point most of your work goes into maintaining the process rather than producing value), but you need some or there is only chaos. Go check out "The Joel Test"; I'll wait.

I don't necessarily believe that all of them are necessary or sufficient. But what is #1? That's right, SCC. SCC is so important that many word processors (like Word) embed version tracking into the software.

Why do people insist on using nothing or tarballs/zipfiles to ensure the safety of business/personally critical data? One place I worked used zip files but also required you to comment out changed code with a reason, plus add comments for the new code. This resulted in files with pages of commented-out code and only occasional live code (it was unreadable). SCC systems do this for you.

What is SCC?

In essence, an SCC system tracks changes made to files and directories and allows users to (among other things):
  • Checkout (aka get) the current version or some version from the past of those files/directories.
  • Check-in (aka save) your current changes to the SCC to be saved for posterity.
  • Revert changes you don't like.
  • Find out who made changes and why.
  • Diff exactly what has changed, either between past versions or against what you have done since the last check-in.
  • Manage multiple people working on the same files and directories.
  • Give the ability to mark and later retrieve files/directories related to some important event like a release.
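To make the list above concrete, here is a toy sketch in Python. It is not how any real SCC tool stores data; `ToyRepo` and its method names are made up for illustration, and it assumes a single file kept as full snapshots in memory. It only shows check-in, checkout of an old version, who/why history, and a diff between versions.

```python
import difflib


class ToyRepo:
    """Hypothetical in-memory store: every check-in keeps a full snapshot of one file."""

    def __init__(self):
        self.history = []  # list of (author, message, text), oldest first

    def check_in(self, author, message, text):
        """Save a new version and return its 1-based revision number."""
        self.history.append((author, message, text))
        return len(self.history)

    def checkout(self, revision=None):
        """Return the text of a revision (the latest one when revision is None)."""
        if revision is None:
            revision = len(self.history)
        return self.history[revision - 1][2]

    def log(self):
        """Return (revision, author, message) for every check-in: who changed it and why."""
        return [(number + 1, author, message)
                for number, (author, message, _) in enumerate(self.history)]

    def diff(self, old, new):
        """Return unified-diff lines describing what changed between two revisions."""
        before = self.checkout(old).splitlines()
        after = self.checkout(new).splitlines()
        return list(difflib.unified_diff(before, after, f"r{old}", f"r{new}", lineterm=""))


repo = ToyRepo()
repo.check_in("alice", "Initial version", "score = 0\nlives = 3\n")
repo.check_in("bob", "Bug 12: start with five lives", "score = 0\nlives = 5\n")

changes = repo.diff(1, 2)      # includes the lines "-lives = 3" and "+lives = 5"
old_text = repo.checkout(1)    # "reverting" is just checking out the older snapshot
```

Real tools add the hard parts this toy skips: many files, directories, multiple users, tags, and efficient storage.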
There are several main paradigms for SCC:
  • Locking vs Merging - Some SCC tools require you to mark a file for editing such that it is locked and no other user can commit changes until you release the lock. Other SCC tools don't require locking but will require (or do it for you if simple) merging if there are conflicts. I have found the locking model interrupts my workflow because I may not know all the files I'll change ahead of time, stopping to lock a file gets in the way, and someone might lock the file I'll need and then need a file I have locked (classic deadlock problems). Merging might sometimes be a pain, but it is better than the alternative.
  • Client/Server vs Peer to Peer - Most SCC has a server that holds the "one true version" and all clients must submit to that server. A few systems use a peer-to-peer structure where the only "one true version" is decided by social convention not technology.
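The merging model can be sketched with a tiny three-way merge. This is a deliberately simplified illustration, not the algorithm any particular SCC tool uses: the function name `three_way_merge` is made up, and it assumes both users only edited lines in place (no lines added or removed), so the three versions have the same number of lines.

```python
def three_way_merge(base, mine, theirs):
    """Merge two edited copies of a file against their common ancestor.

    Each argument is a list of lines. Returns (merged_lines, conflict_line_numbers).
    Toy assumption: lines were only edited in place, never inserted or deleted.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this toy merge only handles in-place line edits")

    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t or t == b:
            merged.append(m)          # same edit on both sides, or only I changed it
        elif m == b:
            merged.append(t)          # only the other user changed it
        else:
            merged.append(m)          # both changed it differently: keep mine, flag it
            conflicts.append(number)
    return merged, conflicts


base = ["width = 10", "height = 20", "color = red"]
mine = ["width = 15", "height = 20", "color = red"]
theirs = ["width = 10", "height = 20", "color = blue"]

merged, conflicts = three_way_merge(base, mine, theirs)
# merged is ["width = 15", "height = 20", "color = blue"] and conflicts is empty
```

Only when both sides change the same line differently does a human have to step in, which is why merging is usually less painful than it sounds.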
For more information check out:

SCC and the Individual Developer

Even if you work by yourself for work or fun you should use SCC. The few times I haven't, I often regretted it. The first reason is that this helps develop good habits. Secondly, this makes it so much easier to undo mistakes should they happen.

Imagine you are making some changes to a game you wrote. The changes, while simple, are invasive and touch numerous files over the course of a day and you were interrupted numerous times. Sadly you made an error that causes saved data to become corrupted but the error could have been introduced anywhere and you can't remember exactly what you changed. SCC allows you to quickly find out what files changed, what the changes were, and if necessary revert back to good code and start over. Without SCC, you might be debugging for a long time.

SCC Tool Recommendations

This is based on tools I've used for long periods of time for production code with the exception of Git (many tools are omitted):
  • Perforce - This is a commercial product I used long ago. It can use locking or merging. I don't use it now because it costs money, I'm cheap, and there are great open source alternatives.
  • Visual Source Safe (VSS) - This pile of steaming... rubbish is given away by Microsoft. It uses a locking model by default. The problems with VSS are that it often corrupts large codebases and Visual Studio integration (up to 2005 I believe, or beyond) puts little snippets in your project files to handle integration. If you are ever kidnapped to a tropical island by a criminal mastermind with a claw and a cat and are told you must use VSS or be given to a bunch of cannibals, choose the cannibals because it will be less painful. I have heard from some insiders that Microsoft doesn't use VSS. If the producer of software doesn't use their product (and they could), then I don't feel it is wise using it either.
  • Vault - A commercial clone of VSS (plus bonuses). While this tool never corrupted my code like VSS did on a regular basis, I don't much like how Vault works, so I don't like using it.
  • CVS - An open source merging SCC used for years by many open source products.
  • Subversion - A newish SCC that is much like CVS but it offers atomic commits (everything goes or nothing - so you don't have partial commits), versioning directories, better branching/merging, and other improvements. If you want to use a client/server SCC, I'd recommend this (despite what Linus says).
  • Git - This is the only distributed/peer-to-peer tool on my list. I've only been using it for a few weeks, but thus far I like it. Git was quick and easy to set up and does everything I expect in an SCC with the added bonus of being really fast (partially because there is no server to talk to). One of the selling points is better branching/merging but I have yet to do much of that.
Closing Comments

I'll end with a few recommended best practices:
  • Use source code control
  • Check-in as often as possible without breaking the build.
  • Keep your commits small.
  • Add a useful (but short) message to each commit on why you made the commit (not what you changed since the SCC handles that). This should include a bug number if available.
  • Tag all your releases in SCC
This brings my SCC rant to a close. I feel much better now.

Sunday, April 20, 2008

Using ANTLR to create an Excel Like Formula Parser in Flex

Introduction
One of my first Flex programming projects was to write an Excel-like expression parser. This parser was written using the “shunting yard” algorithm by Edsger Dijkstra to transform infix notation to a reverse-polish notation (postfix) stack which is easily processed. That code isn't publicly available but the algorithms are described in a paper I wrote here.
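For readers who haven't seen the shunting yard algorithm, here is a minimal Python sketch of the idea. It is not the original Flex code (which isn't public); the function names are made up, and it assumes well-formed input with only numbers, parentheses, and the four left-associative binary operators (no functions, variables, or unary minus).

```python
import operator
import re

# Operator -> (precedence, implementation). Higher precedence binds tighter.
OPERATORS = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "*": (2, operator.mul),
    "/": (2, operator.truediv),
}


def to_postfix(expression):
    """Convert an infix string to a list of tokens in postfix (RPN) order."""
    tokens = re.findall(r"\d+\.?\d*|[-+*/()]", expression)
    output, stack = [], []
    for token in tokens:
        if token in OPERATORS:
            # Pop operators that bind at least as tightly (left associativity).
            while (stack and stack[-1] != "("
                   and OPERATORS[stack[-1]][0] >= OPERATORS[token][0]):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the matching "("
        else:
            output.append(token)  # a number goes straight to the output
    while stack:
        output.append(stack.pop())
    return output


def evaluate_postfix(postfix):
    """Evaluate a postfix token list with a simple value stack."""
    stack = []
    for token in postfix:
        if token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token][1](left, right))
        else:
            stack.append(float(token))
    return stack.pop()


postfix = to_postfix("(1+2)*(9-8)")   # ['1', '2', '+', '9', '8', '-', '*']
result = evaluate_postfix(postfix)    # 3.0
```

The postfix form is "easily processed" because evaluation never needs to look ahead: each operator just consumes the top two values on the stack.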

I decided to re-write the expression evaluator using ANTLR. ANTLR is a parser generator that takes BNF-like notation and generates code for a target language. The default target language is Java, but there are many other target languages including an ActionScript target in the upcoming 3.1 version.

Flex Formula Evaluator
The actual Flex application is available here with source available. This is an example of a formula that can be evaluated; feel free to copy and paste it into the evaluator below (variables are noted in "[variable_name]"):
(1+2)*sum(9-8,5,17.5,[numarray])/if(or([string]="hello",1>5),2,99)

How Does it Work?

The formula parser begins with the ANTLR grammar found in Formula.g and FormulaTree.g (in src com.arcanearcade.antlr). ANTLR has 3 types of grammar rules:
  • Lexer - Used to transform the character stream into a series of tokens. Lexer rule names are written in all capital letters. For example, the following code describes how to recognize variables, which are any characters except [ or ] found between [ and ].
    VARIABLE : '[' ~('[' | ']')+ ']';
  • Parser - Walks through the tokens to form sentences in the grammar. Parser rules begin with lower case letters. The following example shows how to recognize a percent token, which is a number token followed by the percent sign (both described in the parser grammar). The '^' in the grammar is a special operator that marks which token becomes the head (root) of the created tree. So "50%" is transformed into a tree with an s-expression notation of "(% 50)" and will be evaluated to the value "0.5" by the tree grammar.
    percent: NUMBER PERCENT^;
  • Tree - This optional grammar takes the tree that can be created by the parser grammar and, with the help of native code snippets, interprets the tree to produce output. While you can use code snippets directly in the parser, using a tree grammar allows you to reuse the lexer and parser for other languages and write a (usually simpler) tree parser for each target language. The following tree grammar shows how to evaluate addition, where "6 + 7" was turned into "(+ 6 7)" by the parser and is evaluated to the number "13" by the ActionScript code between the {}.
    operation returns [Object value]
    : ^(ADD a=expression b=expression)
    { $value = Number($a.value) + Number($b.value); }
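As a rough analogy for the three rule types, here is a small Python sketch. It is not ANTLR output and not the code in Formula.g; the regular expression mirrors the VARIABLE lexer rule, and the nested tuples are a stand-in I chose for the trees the parser builds, such as "(+ 6 7)" and "(% 50)".

```python
import re

# Same shape as the VARIABLE lexer rule: '[' then one or more non-bracket chars then ']'.
VARIABLE = re.compile(r"\[[^\[\]]+\]")


def find_variables(formula):
    """Return every [variable] token found in a formula string."""
    return VARIABLE.findall(formula)


def evaluate(tree):
    """Walk a tree written as nested tuples: (head, child, child, ...)."""
    if not isinstance(tree, tuple):
        return float(tree)                 # a leaf is just a number
    head, *children = tree
    values = [evaluate(child) for child in children]
    if head == "+":
        return values[0] + values[1]       # (+ 6 7) -> 13
    if head == "%":
        return values[0] / 100             # (% 50) -> 0.5
    raise ValueError(f"unknown node: {head}")


names = find_variables('sum(1,[numarray])/if([string]="hello",2,99)')
total = evaluate(("+", 6, 7))      # 13.0
half = evaluate(("%", 50))         # 0.5
```

The tree walker only has to know what each head token means, which is why a tree grammar tends to be simpler than the parser that produced the tree.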
The ANTLR tools parse the grammar files to produce native code as shown in the diagram below:
Code in FlexFormula.mxml evaluateFormula() takes a string, feeds it into the lexer, takes the output of the lexer and feeds it into the parser, and then finally feeds that into the tree to evaluate the result. FormulaTree#addFunction() adds new spreadsheet functions like "sum" and "if", and FormulaTree#lookupVariable sets the function callback to retrieve variables.
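The function registry and variable callback can be sketched in Python as below. This is only an analogy for FormulaTree#addFunction() and FormulaTree#lookupVariable, not the ActionScript API: `ToyFormulaTree`, `add_function`, and the `("call", ...)` / `("var", ...)` node shapes are all made up, and the tree is built by hand instead of coming from a lexer and parser.

```python
class ToyFormulaTree:
    """Hypothetical tree evaluator with pluggable functions and variable lookup."""

    def __init__(self, lookup_variable):
        self.lookup_variable = lookup_variable   # callback: name -> value
        self.functions = {}                      # lower-cased name -> callable

    def add_function(self, name, implementation):
        """Register a spreadsheet-style function such as "sum" or "if"."""
        self.functions[name.lower()] = implementation

    def evaluate(self, node):
        if not isinstance(node, tuple):
            return node                          # literal number, string, or boolean
        kind, name, *args = node
        if kind == "var":
            return self.lookup_variable(name)
        if kind == "call":
            values = [self.evaluate(arg) for arg in args]
            return self.functions[name.lower()](*values)
        raise ValueError(f"unknown node kind: {kind}")


def formula_sum(*values):
    """Add numbers, flattening any list arguments (like a cell range)."""
    total = 0
    for value in values:
        total += sum(value) if isinstance(value, list) else value
    return total


variables = {"numarray": [1, 2, 3], "string": "hello"}
tree = ToyFormulaTree(lookup_variable=variables.__getitem__)
tree.add_function("sum", formula_sum)
# Note: both branches are evaluated before "if" picks one; fine for a sketch.
tree.add_function("if", lambda condition, when_true, when_false:
                  when_true if condition else when_false)

total = tree.evaluate(("call", "sum", 1, 5, 17.5, ("var", "numarray")))   # 29.5
choice = tree.evaluate(("call", "if", True, 2, 99))                       # 2
```

Keeping functions and variables outside the evaluator means the host application decides what "sum" does and where "[numarray]" comes from.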

What Next?
The expression parser isn't as friendly as it could be (it is just a demo). To improve the formula engine to the point it could be used in a spreadsheet, the FormulaLexer, FormulaParser, and FormulaTree should be wrapped in a class that would:
  • Add all the common functions.
  • Setup binding to watch for changes to the formula string and only call the lexer and parser when the formula changes.
  • Setup a public and bindable variable for the result value.
  • Add binding for all variables in an expression so when they change the tree can be re-evaluated.
  • Improve error handling
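One possible shape for such a wrapper, sketched in Python rather than Flex: the class name `BoundFormula` and its methods are made up, plain setter methods stand in for Flex data binding, and the parse and evaluate steps are passed in as functions so the sketch doesn't pretend to be the real lexer, parser, or tree.

```python
class BoundFormula:
    """Hypothetical wrapper: re-parse only when the formula text changes,
    re-evaluate whenever the formula or a variable changes."""

    def __init__(self, parse, evaluate):
        self._parse = parse          # formula string -> tree
        self._evaluate = evaluate    # (tree, variables) -> value
        self._formula = None
        self._tree = None
        self.variables = {}
        self.result = None           # the value a UI would bind to

    def set_formula(self, formula):
        if formula != self._formula:             # skip lexing/parsing if unchanged
            self._formula = formula
            self._tree = self._parse(formula)
        self._refresh()

    def set_variable(self, name, value):
        self.variables[name] = value
        if self._tree is not None:
            self._refresh()                      # reuse the cached tree

    def _refresh(self):
        try:
            self.result = self._evaluate(self._tree, self.variables)
        except Exception as error:               # surface errors as a cell value
            self.result = f"#ERROR: {error}"


# Stand-in parser/evaluator for "a+b" style formulas, just to exercise the wrapper.
parse_calls = []


def toy_parse(formula):
    parse_calls.append(formula)
    return formula.split("+")


def toy_evaluate(names, variables):
    return sum(variables[name] for name in names)


cell = BoundFormula(toy_parse, toy_evaluate)
cell.set_variable("a", 1)
cell.set_variable("b", 2)
cell.set_formula("a+b")      # parses once; result is 3
cell.set_variable("a", 10)   # no re-parse; result is 12
cell.set_formula("a+b")      # unchanged text, so still only one parse
```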