Liam Kaufman

Software Developer and Entrepreneur

Smiley Faces in Linux Source Code and Token Statistics

GitHub uses Linguist, a Ruby library, to help detect which programming language is in a given file. Recently, an issue was filed indicating that Linguist incorrectly classifies Mercury (a programming language) files as Objective-C, since both languages use the same extension (.m). Linguist's primary method for language detection is a file's extension - a method that fell short for Mercury. If Mercury were added to Linguist, there would be two languages with the same extension - and this is where things get interesting. If two languages share the same extension, or the file does not have an extension, Linguist has 3 methods for guessing the language. First, it checks if the file has a shebang (e.g. #!/bin/sh). If there is no shebang, the second method it uses is a set of heuristics. For instance, if the file includes the ":-" token it concludes that the contents are Prolog code, or if "defun" is present it's Common Lisp. If it still hasn't found a match, the third method it uses is a Bayesian classifier. Roughly speaking, the classifier iterates over all of a file's tokens and, for each token, determines the probability that it is present in each programming language. Subsequently, it sums all those probabilities, sorts the results, and returns an array of language-probability pairs (e.g. [['Ruby', 0.8], ['Python', 0.2]]).
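
Conceptually, the scoring step looks something like the following (a rough JavaScript sketch - Linguist itself is Ruby and works with log probabilities to avoid underflow; tokenProbabilities is a hypothetical language -> token -> probability map):

function classify(tokens, tokenProbabilities){
  var scores = [];

  Object.keys(tokenProbabilities).forEach(function(language){
    var score = 0;
    tokens.forEach(function(token){
      // unseen tokens get a tiny probability so a single miss
      // doesn't zero out the whole language
      score += Math.log(tokenProbabilities[language][token] || 1e-9);
    });
    scores.push([language, score]);
  });

  // best guess first, e.g. [['Ruby', -120.3], ['Python', -140.8]]
  return scores.sort(function(a, b){ return b[1] - a[1]; });
}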

I wondered how logistic regression, support vector machines or even clustering algorithms would fare at classifying a given file. As I dove into the data I realized that the descriptive statistics on tokens, and even ASCII faces, would be nearly as interesting as their predictive power. Thus, this post will summarize the descriptive statistics, while in my next post I'll cover using tokens to predict a file's programming language.

Methods

Fetching The Code

Using GitHub's API I fetched the 10 most popular repositories for each of 10 languages (C, Haskell, Go, JavaScript, Java, Lua, Objective-C, Ruby, Python and PHP). Those 10 were chosen for their popularity, differing paradigms (e.g. Haskell vs Java), differing syntax (Haskell vs Go) and overlapping syntax (C, JavaScript and Java). After retrieving a list of 100 repositories I downloaded the zipball for each repo.

Tokenizing The Code

First, a list of common programming tokens (e.g. ; , . ( ), etc.) was created - tokens that would be found in many of the 10 languages of interest. Using those tokens I created a tokenizer that outputs an object whose keys are tokens and whose values are the number of times each token occurred in the file. Base 10 numbers, hex numbers, strings (double quotes) and characters (single quotes) were treated as 4 different token types. This was done so that the number 8 was not treated as a different token from the number 44 (both are tokenized as "Numbers").
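
A simplified version of such a tokenizer might look like this (a sketch; the regular expressions are illustrative, not the exact ones used):

function tokenize(source){
  var counts = {},
      classes = [
        ['Hex Numbers',  /\b0x[0-9a-fA-F]+\b/g],
        ['Numbers',      /\b\d+(?:\.\d+)?\b/g],
        ['Strings (DQ)', /"(?:[^"\\]|\\.)*"/g],
        ['Strings (SQ)', /'(?:[^'\\]|\\.)*'/g]
      ],
      simpleTokens = [';', ',', '.', '(', ')', '{', '}', '[', ']',
                      '=', '*', '/', ':', '->', '#', '<'];

  // count and strip the four token classes first, so that a comma
  // inside a string, say, isn't counted twice
  classes.forEach(function(pair){
    counts[pair[0]] = (source.match(pair[1]) || []).length;
    source = source.replace(pair[1], ' ');
  });

  simpleTokens.forEach(function(token){
    counts[token] = source.split(token).length - 1;
  });

  return counts; // e.g. { Numbers: 12, ',': 40, ';': 18, ... }
}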

Each of the 100 repositories was traversed and non-binary files were tokenized, with each file's total token count stored in Redis' sorted set data structure (sorted by number of occurrences). Using a sorted set made it trivial to retrieve the 1,000 most common tokens across all 100 repositories. Each file was then re-tokenized, but this time only tokens present in the 1,000 most common token list were counted. The resulting data set includes information on 65,804 files from 100 different repos. Along with the token data, the following was also recorded: 1) the file's extension, 2) its path within the repository, 3) its shebang, if present, and 4) the token count for the 250 most common tokens (I decided to limit the first round of analysis to a smaller number of tokens). Finally, the counts for each token were converted to the ratio of that token's occurrences to the total occurrences of all tokens (e.g. # of periods/total number of all tokens). The absolute number of tokens per file would be skewed by large files, which have more tokens.
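
The aggregation step leaned on Redis' sorted set commands, along these lines (a sketch using the node redis client; the key name is made up):

var redis = require('redis').createClient();

// for every tokenized file, bump each token's global count
Object.keys(counts).forEach(function(token){
  redis.zincrby('token-counts', counts[token], token);
});

// later: the 1,000 most common tokens across all repositories
redis.zrevrange('token-counts', 0, 999, 'WITHSCORES', function(err, top){
  // top alternates [token, score, token, score, ...]
});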

Counting Smiley Faces

To ease analysis I focused on C and JavaScript files (in C and JavaScript repositories): both languages have identical single and multi-line comment syntax. First, text from comments was separated from code. Second, the number of times an ASCII face appeared in a given file's comments was counted. The following "faces" were counted: :( :) :-) :-D :p ;) ;-).
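
The separating and counting steps might look roughly like this (a sketch; the comment regex is simplified and can be fooled by comment-like sequences inside strings):

var faces = [':(', ':)', ':-)', ':-D', ':p', ';)', ';-)'];

function countFaces(source){
  // pull out /* ... */ and // ... comments
  // (C and JavaScript share this syntax)
  var comments = (source.match(/\/\*[\s\S]*?\*\/|\/\/.*/g) || []).join('\n'),
      counts = {};

  faces.forEach(function(face){
    counts[face] = comments.split(face).length - 1;
  });

  return counts;
}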

Statistics

All statistical analyses were carried out in R. Welch two sample t-tests were used to compare groups.

Results

Tokens

While the top 1,000 tokens, by occurrence, were recorded, only the top 20 are presented in Table 1 (see here for the top 1,000). Not surprisingly, numbers are the most prevalent token, with commas a very close second. Interestingly - and this is what sparked my interest in ASCII smiley faces - there are slightly more right parentheses than left. While the contents of strings were ignored, the contents of comments were not. Seeing as most (all?) of the analyzed languages require parentheses to be balanced, I presumed that the imbalance might be caused by ASCII smiley faces in comments.

Interestingly, hexadecimal numbers were the 6th most common token despite rarely being used outside of C. While hex numbers are used extensively in CSS, I only classified numbers that started with 'x' as hex, which precluded the hex numbers in CSS from being counted.

Table 1. Top 20 Tokens By Occurrence

Token                          Occurrences
Numbers                         19,640,325
,                               19,597,223
)                               10,446,695
(                               10,425,221
;                                7,882,261
Hex Numbers                      6,261,887
*                                6,205,697
.                                5,978,092
=                                5,841,844
Strings (Double Quotes - DQ)     4,336,520
}                                3,310,463
{                                3,305,033
/                                2,939,954
:                                2,640,872
->                               2,425,261
#                                2,423,779
[                                2,004,437
]                                2,002,711
<                                1,591,276
Strings (Single Quotes - SQ)     1,578,686

The top 20 tokens, by occurrence in aggregate (across 65,804 files).

Table 2 shows the top 20 tokens and their ratios (token/all tokens in a given file) in 16 different file types. Not surprisingly, JSON files lead the pack for double quoted strings, curly brackets, colons and commas. Likewise, Clojure leads with the highest proportion of parentheses. The right arrow -> occurred most often in PHP, C and Haskell. Finally, square brackets were very prevalent in Objective-C.

Table 2. Top 20 Tokens (Scroll Right For Full Table)

File Type  # Files  Numbers  ,        )        (        ;        Hex      *        .        =        Str(DQ)  }        {        /        :        ->       #        [        ]        <        Str(SQ)
.php       10289    0.0106   0.0280   0.0634   0.0632   0.0406   0.0000   0.0463   0.0257   0.0172   0.0054   0.0219   0.0218   0.0187   0.0077   0.0239   0.0004   0.0042   0.0042   0.0126   0.0375
.json      1002     0.0183   0.1510   0.0001   0.0001   0.0000   0        0.0001   0.0011   0.0001   0.3575   0.0927   0.0930   0.0000   0.2163   0        0.0001   0.0179   0.0179   0.0001   0.0001
.md        1541     0.0317   0.0238   0.0250   0.0248   0.0030   0.0001   0.0146   0.0559   0.0063   0.0108   0.0039   0.0038   0.0012   0.0251   0.0006   0.0251   0.0127   0.0127   0.0066   0.0064
.hs        6133     0.0416   0.0224   0.0426   0.0424   0.0014   0.0002   0.0025   0.0253   0.0407   0.0121   0.0123   0.0123   0.0006   0.0051   0.0179   0.0229   0.0090   0.0091   0.0019   0.0060
.html      1579     0.0211   0.0091   0.0117   0.0116   0.0154   0.0000   0.0007   0.0252   0.0463   0.0482   0.0295   0.0295   0.0033   0.0099   0.0003   0.0270   0.0016   0.0016   0.1150   0.0052
.css       701      0.0811   0.0212   0.0083   0.0083   0.0676   0.0000   0.0219   0.0564   0.0023   0.0050   0.0392   0.0394   0.0169   0.0779   0        0.0194   0.0011   0.0011   0.0001   0.0018
.js        9089     0.0290   0.0681   0.0646   0.0639   0.0436   0.0005   0.0075   0.0574   0.0239   0.0328   0.0226   0.0220   0.0043   0.0203   0.0000   0.0004   0.0086   0.0086   0.0013   0.0251
.py        3081     0.0211   0.0483   0.0539   0.0537   0.0008   0.0001   0.0024   0.0670   0.0364   0.0213   0.0018   0.0018   0.0003   0.0278   0.0000   0.0294   0.0092   0.0093   0.0008   0.0376
.c         25043    0.0536   0.0560   0.0573   0.0573   0.0555   0.0077   0.0469   0.0248   0.0276   0.0110   0.0156   0.0155   0.0174   0.0050   0.0194   0.0120   0.0053   0.0053   0.0078   0.0017
.xml       1788     0.0127   0.0055   0.0022   0.0022   0.0026   0.0000   0.0005   0.0146   0.0925   0.0993   0.0014   0.0014   0.0191   0.0315   0.0000   0.0060   0.0005   0.0005   0.1242   0.0006
.clj       83       0.0183   0.0064   0.0946   0.0944   0.0529   0        0.0014   0.0462   0.0034   0.0239   0.0049   0.0049   0.0001   0.0189   0.0026   0.0015   0.0339   0.0339   0.0001   0.0005
.rb        8527     0.0229   0.0479   0.0264   0.0263   0.0013   0.0001   0.0018   0.0539   0.0148   0.0510   0.0062   0.0062   0.0009   0.0446   0.0002   0.0166   0.0086   0.0087   0.0141   0.0530
.java      12293    0.0157   0.0262   0.0488   0.0488   0.0433   0.0001   0.0525   0.0930   0.0104   0.0118   0.0186   0.0185   0.0137   0.0032   0.0000   0.0009   0.0018   0.0018   0.0050   0.0005
.go        1259     0.0265   0.0569   0.0684   0.0684   0.0043   0.0029   0.0108   0.0688   0.0111   0.0370   0.0340   0.0339   0.0009   0.0104   0.0000   0.0002   0.0106   0.0106   0.0014   0.0017
.lua       3935     0.0342   0.0783   0.0526   0.0525   0.0020   0.0008   0.0017   0.0463   0.0634   0.0448   0.0173   0.0171   0.0008   0.0166   0.0001   0.0013   0.0158   0.0157   0.0010   0.0193
.m         742      0.0333   0.0261   0.0530   0.0530   0.0580   0.0001   0.0201   0.0327   0.0246   0.0181   0.0190   0.0190   0.0049   0.0353   0.0003   0.0120   0.0345   0.0345   0.0018   0.0005

The ratio of specific tokens, relative to all tokens in a file, by file type (Hex = hex numbers; Str(DQ)/Str(SQ) = double/single quoted strings). Only the top 16 file types are present in this table - there is a very long tail of file types. I restricted this table to file types that are relatively abundant in this dataset.

ASCII Faces

To examine the discrepancy between left and right parentheses I created a set of scripts to separate comments from code in C and JavaScript files. I then analyzed the comments and counted the number of ASCII faces that appeared. I focused on 6 different types of smiley faces and 1 type of frown (see Table 3 for the types of ASCII faces and their counts).

While there were more frowns in JavaScript files, the difference wasn't statistically significant. Furthermore, there was no statistically significant difference in the total number of smiley faces between '.c' and '.js' files. However, there were more smiley faces in files that were in JavaScript GitHub repos. For instance, Node is a JavaScript repo but includes both JavaScript and C files. It makes sense that the project, with its distinct maintainers, rules and conventions, is more important than the language in determining the number of smiley faces.

Eighty percent of the C files analyzed were found in the Linux repository, so it made sense to focus on Linux specifically. In Linux C comments I found 631 smiley faces and 73 frowns. In Linux the most prevalent smiley face was `:-)` followed by `:)` (see Table 3).

Table 3. ASCII Faces

Face              Linux C files (20,060)  C files (24,542)  JavaScript files (6,743)
All Smiley Faces  0.0315 (631)            0.0577 (1415)     0.0721 (486)
Frowns :(         0.0036 (73)             0.0051 (124)      0.0249 (168)
:)                0.0088 (172)            0.0081 (198)      0.0027 (18)
:-)               0.0088 (176)            0.0284 (697)      0.0001 (1)
:-D               0.0001 (2)              0.0001 (3)        0.0006 (4)
:p                0.0051 (102)            0.0120 (295)      0.0475 (320)
;)                0.0044 (89)             0.0048 (117)      0.0212 (143)
;-)               0.0044 (90)             0.0043 (105)      0.0000 (0)

The first value is the number of times the ASCII face appears, relative to other tokens, while the value in parentheses is the total number of times it appears in all files. Linux C files are a subset of the C files.

Discussion

Shortcomings

Despite including 100 different repositories, Linux source files represented 30% of all files in this analysis. Ideally, the number of files from each repository, and language, would be balanced. One approach would be to randomly select a numerically identical subset of files from each language. While this approach might be statistically valid, it wouldn't produce descriptive statistics on each repository, just on a subset of files within each repository. Alternatively, Linux could be excluded from the analysis since the number of files it contains is an outlier relative to the other repositories.

Across all files there were 21,474 more right parentheses than left. Given that C and JavaScript files represent nearly half of all files in this analysis, and they only had 1,901 smiley faces, it's unlikely that the other half of the files had nearly 20,000 smiley faces - enough to account for the parenthesis imbalance. Future analysis could attempt to locate the source of this difference (presumably within comments).

Conclusions

It is not surprising that ratios of token types can differ dramatically between languages; however, I was surprised that several paired tokens (parentheses, square and curly brackets) did not occur equally. While smiley faces can account for part of this discrepancy, they most likely do not account for all of it.

The biggest surprise was that the number of smiley faces per file was not statistically different between JavaScript and C. Since C is low level, I presumed that C code would be more serious, with fewer ASCII faces. Interestingly, I was wrong: C code has a similar number of smiley faces to JavaScript.

Viewing Table 2 we can start to see patterns and differences in token ratios that might help predict a file's language. For instance, JSON has very different token ratios than C. In the next article I will explore the power that tokens have in predicting which programming language is used in a given file.

If you’re interested in replicating the analysis, or obtaining the dataset, please see the links below:

Language Statistics

Language Statistics Data

Understanding AngularJS Directives Part 2: ngView

In a previous article I explored ng-repeat, its implementation, and how to create a custom repeater. In this article I'll first delve into the inner workings of ngView and then walk through the creation of an "ngMultiView" directive. To get the most out of this article you'll need an intermediate understanding of creating directives (via the previous article on ng-repeat and the AngularJS directive guide).

Starting with AngularJS 1.2, the ngView directive and the $route service were moved into a separate module called ngRoute. As a result, to use routes and ngView your app must explicitly declare ngRoute as a dependency. Furthermore, the syntax for the "otherwise" statement is slightly different from older versions of AngularJS. Below is a complete example of a tiny AngularJS 1.2 app with two routes and a default:
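
Something along these lines (a minimal sketch; the controller and template names are illustrative, and the page hosts the current route with a <div ng-view></div>):

var app = angular.module('myApp', ['ngRoute']); // ngRoute is now an explicit dependency

app.config(['$routeProvider', function($routeProvider){
  $routeProvider
    .when('/',      { template : '<p>Home</p>',  controller : 'HomeCtrl' })
    .when('/about', { template : '<p>About</p>', controller : 'AboutCtrl' })
    .otherwise({ redirectTo : '/' }); // the default route
}]);

app.controller('HomeCtrl',  ['$scope', function($scope){ }]);
app.controller('AboutCtrl', ['$scope', function($scope){ }]);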

Undocumented features of ngView

In the process of understanding the code behind ngView I came across two undocumented attributes: "onload" and "autoscroll". The onload attribute can take any Angular expression and will execute it every time the view changes. Autoscroll uses the $anchorScroll service and scrolls to a specific element based on the current value of $location.hash(). Finally, near the very end of the directive, after link(currentScope) is called, the '$viewContentLoaded' event is emitted within the current scope - an event which you can use within your controller. Below is a revised version of the above example that includes the onload attribute.
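
For instance (an illustrative counter; AppCtrl is assumed to initialize loadCount to 0 on its scope):

<body ng-controller="AppCtrl">
  <!-- onload evaluates its expression on every view change -->
  <div ng-view onload="loadCount = loadCount + 1"></div>
  <p>The view has loaded {{loadCount}} time(s).</p>
</body>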

How ngView Works

In order to understand ngView, I think it's useful to create a simplified version of it. Below is ngViewLite, a version of ngView that does not include scope cleanup or animations, but is otherwise identical.
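
A minimal sketch, assuming the standard $route, $compile and $controller services (the walkthrough below follows this code):

app.directive('ngViewLite', ['$route', '$compile', '$controller',
  function($route, $compile, $controller){
    return {
      restrict : 'A',
      transclude : 'element',
      compile : function(element, attr, linker){
        return function($scope, $element, $attr){
          var previousElement;

          // re-render on every successful route change, and once on startup
          $scope.$on('$routeChangeSuccess', update);
          update();

          function update(){
            var current = $route.current,
                template = current && current.locals.$template;

            if(template === undefined){ return; }

            var newScope = $scope.$new();

            linker(newScope, function(clone){
              clone.html(template);                // swap in the route's template
              $element.parent().append(clone);     // add the clone to the DOM
              if(previousElement){ previousElement.remove(); } // drop the old view

              var link = $compile(clone.contents()); // compile the new contents
              if(current.controller){
                current.locals.$scope = newScope;    // instantiate the route's controller
                $controller(current.controller, current.locals);
              }
              link(newScope);                        // inject the new scope
              previousElement = clone;
            });
          }
        };
      }
    };
  }]);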

First, we bind a function, update, to the $routeChangeSuccess event; when the route changes, update will be called. Right after attaching the function to the event, we immediately call update() to load the initial contents into the view.

The update function checks if there is a defined template for the current route; if so, it proceeds by calling the linker function, passing in a new scope and a callback function. The callback function's only parameter is the cloned element, whose html will be replaced with the route's current template. The cloned element is then appended to the div with the ng-view-lite attribute, after which we remove the previous contents from the view.

Finally, the template must be compiled ($compile(clone.contents())) and a new scope is injected into it (link(newScope)). In between those two steps we check if the current route has an associated controller: if so we instantiate the controller with the newScope and the local variables from the current route.

Making an ngMultiView

ngView works well, but what if you want multiple views to change according to the url? According to the documentation, ngView can only be used once within an application. To accomplish our ngMultiView we'll slightly modify ngView and create an Angular value (MultiViewPaths) to hold the mapping between urls, views, controllers and templates.

In ngMultiView, a parameter is passed into the directive (<div ng-multi-view="secondaryContent"></div>); in the directive this attribute will be called "panel". Instead of binding to the '$routeChangeSuccess' event, we'll bind to '$locationChangeSuccess' to make our directive completely independent of ngRoute. ngMultiView will work the following way (see the sketch after this list):

  1. A url change will trigger '$locationChangeSuccess', which in turn will call update()
  2. Within update: grab the portion of the URL after the hash (in the code this portion is just called url).
  3. Using the url variable, and the panel, we can lookup the corresponding controller and template from the MultiViewPaths value.
  4. Once we have the controller and template, ngMultiView works almost identically to ngView.
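
Here's a sketch of the directive plus a minimal MultiViewPaths value (the urls, templates and controller names are illustrative):

app.value('MultiViewPaths', {
  '/' : {
    mainContent      : { template : '<p>Welcome!</p>' },
    secondaryContent : { template : '<p>Log in</p>' }
  },
  '/books' : {
    mainContent      : { template : '<p>All books</p>', controller : 'BooksCtrl' },
    secondaryContent : { template : '<p>Popular books</p>' }
  }
});

app.directive('ngMultiView', ['$location', '$compile', '$controller', 'MultiViewPaths',
  function($location, $compile, $controller, MultiViewPaths){
    return {
      restrict : 'A',
      transclude : 'element',
      compile : function(element, attr, linker){
        return function($scope, $element, $attr){
          var panel = $attr.ngMultiView, // e.g. "secondaryContent"
              previousElement;

          $scope.$on('$locationChangeSuccess', update);
          update();

          function update(){
            var url = $location.path(), // the portion of the URL after the hash
                view = MultiViewPaths[url] && MultiViewPaths[url][panel];

            if(!view){ return; }

            var newScope = $scope.$new();

            linker(newScope, function(clone){
              clone.html(view.template);
              $element.parent().append(clone);
              if(previousElement){ previousElement.remove(); }

              var link = $compile(clone.contents());
              if(view.controller){
                $controller(view.controller, { $scope : newScope });
              }
              link(newScope);
              previousElement = clone;
            });
          }
        };
      }
    };
  }]);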

Our ngMultiView is very basic: it doesn't take into account parameters being passed through urls, nor does it deal with scope cleanup or animations. If you need more functionality I'd recommend starting with the $route service and modifying it to accommodate multiple views.

Conclusion

Creating custom directives can be intimidating at first. There’s a lot of jargon to overcome, and many little nuances. However, once those are overcome it becomes relatively easy to alter existing directives or create your own.

Using AngularJS Promises

In my previous article I discussed the benefits of using dependency injection to make code more testable and modular. In this article I'll focus on using promises within an AngularJS application. This article assumes some prior knowledge of promises (a good intro on promises and AngularJS' official documentation).

Promises can be used to unnest asynchronous functions, allowing one to chain multiple functions together - increasing readability and making individual functions, within the chain, more reusable.

Standard Callbacks (no promises)
function fetchData(id, cb){
  getDataFromServer(id, function(err, result){
    if(err){
      cb(err, null);
    }else{
      transformData(result, function(err, transformedResult){
        if(err){
          cb(err, null);
        }else{
          saveToIndexDB(result, function(err, savedData){
            cb(err, savedData);
          });
        }
      });
    }
  });
}

Once getDataFromServer(), transformData() and saveToIndexDB() are converted to return promises we can refactor the above code to:

With Promises
function fetchData(id){
  return getDataFromServer(id)
          .then(transformData)
          .then(saveToIndexDB);
}

In addition to increasing readability, promises can help with error handling, progress updates, and AngularJS templates.

Handling Errors

If fetchData is called and an exception is raised in transformData() or saveToIndexDB(), it will trigger the final error callback.

fetchData(1)
  .then(function(result){

  }, function(error){
    // exceptions in transformData, or saveToIndexDB
    // will result in this error callback being called.
  });

Unfortunately, if an exception is raised in getDataFromServer() it will not trigger the final error callback. This happens because transformData() and saveToIndexDB() are called within the context of .then(), which uses try-catch, and automatically calls .reject() on an exception. To bring this behaviour to the first function we can introduce a try-catch block like:

getDataFromServer()
function getDataFromServer(id){
  var deferred = $q.defer();

  try{
    // asynchronous function, which calls
    // deferred.resolve() on success
  }catch(e){
    deferred.reject(e);
  }

  return deferred.promise;
}

While adding try-catch made getDataFromServer() less elegant, it makes it more robust and easier to use as the first in a chain of promises.

Using Notify for Progress Updates

A promise can only be resolved, or rejected, once. To provide progress updates, which may happen zero or more times, a promise also includes a notify callback (introduced in AngularJS 1.2+). Notify could be used to provide incremental progress updates on a long running asynchronous task. Below is an example of a long running function, processLotsOfData(), that uses notify to provide progress updates.

function processLotsOfData(data){
  var output = [],
      deferred = $q.defer(),
      percentComplete = 0;

  for(var i = 0; i < data.length; i++){
    output.push(processDataItem(data[i]));
    percentComplete = (i+1)/data.length * 100;
    deferred.notify(percentComplete);
  }

  deferred.resolve(output);

  return deferred.promise;
};


processLotsOfData(data)
  .then(function(result){
    // success
  }, function(error){
    // error
  }, function(percentComplete){
    $scope.progress = percentComplete;
  });

Using the notify function, we can make many updates to the $scope’s progress variable before processLotsOfData is resolved (finished), making notify ideal for progress bars.

Unfortunately, using notify in a chain of promises is cumbersome, since calls to notify do not bubble up. Every function in the chain would have to manually bubble up notifications, making the code a little more difficult to read.
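
For example, if transformData wanted to surface progress from a long-running step, it would have to forward each notification on its own deferred (a sketch reusing processLotsOfData from above):

function transformData(data){
  var deferred = $q.defer();

  processLotsOfData(data).then(
    function(result){ deferred.resolve(result); },
    function(error){ deferred.reject(error); },
    function(progress){ deferred.notify(progress); } // manually bubble up
  );

  return deferred.promise;
}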

Templates

AngularJS templates used to understand promises, delaying rendering until they were resolved or rejected. Newer versions of AngularJS no longer resolve promises in templates - they must be resolved in the controller before they're assigned to the scope. For instance, let's say our template looks like:

<p>{{bio}}</p>

We could do the following in our controller:

function getBio(){
  var deferred = $q.defer();
  // async call, resolved after ajax request completes
  return deferred.promise;
};

getBio().then(function(bio){
  $scope.bio = bio;
});

The view renders normally, and when the promise is resolved AngularJS automatically updates the view to include the value resolved in getBio.

Limitations of Promises in AngularJS

When a promise is resolved asynchronously, "in a future turn of the event loop", the .resolve() call must be wrapped in $scope.$apply() so that AngularJS is notified of the change. In the contrived example below, a user would click a button triggering goodbye(), which should update the $scope's greeting attribute.

app.controller('AppCtrl',
[   '$scope',
    '$q',
    function AppCtrl($scope, $q){
      $scope.greeting = "hello";

       var updateGreeting = function(message){
          var deferred = $q.defer();

          setTimeout(function(){
              deferred.resolve(message);
          }, 5);

          return deferred.promise;
       };
      $scope.goodbye = function(){
          $scope.greeting = updateGreeting('goodbye');
      }
    }
]);

Unfortunately, it doesn't work as expected, since the asynchronous event runs outside of AngularJS' event loop. The fix for this (besides using AngularJS' $timeout service) is to wrap the deferred's resolve in $scope.$apply to trigger the digest cycle and update the $scope accordingly:

setTimeout(function(){
  $scope.$apply(function(){
    deferred.resolve(message);
  });
}, 5)

Jim Hoskins goes into more detail on using $apply: http://jimhoskins.com/2012/12/17/angularjs-and-apply.html

Conclusions

Using promises is an important part of writing an AngularJS app idiomatically and should help make your code more readable. Understanding their shortcomings, and their strengths, makes them much easier to work with.

How AngularJS Made Me a Better Node.js Developer

Over the past 6 years I’ve used Ruby on Rails, Backbone.js, Node and AngularJS. RoR reinforced my knowledge of Model View Controller (MVC) while Backbone.js did the same for my knowledge of Publish/Subscribe. Like many who made the switch to Node, my first instinct was to try and apply MVC to my Node.js apps - however, it felt unnatural. Taking a “class”-based approach, using CoffeeScript, didn’t feel entirely natural either.

While I enjoyed developing in JavaScript, I always felt I was missing something - that is until I started developing with AngularJS. AngularJS uses both dependency injection and promises extensively, both of which have greatly improved my code. In this article, I’ll focus on dependency injection, and discuss promises in my next article.

Dependency Injection

“Dependency injection is also a core to AngularJS. This means that any component which does not fit your needs can easily be replaced.” - angularjs.org

AngularJS doesn't just pay lip service to dependency injection; it's a design pattern that the framework uses extensively and that builders of AngularJS apps use as well. Wikipedia defines dependency injection thusly:

“Dependency injection is a software design pattern that allows the removal of hard-coded dependencies and makes it possible to change them, whether at run-time or compile-time.”

So, how has dependency injection (DI) improved my Node.js apps? Traditionally I might write a task queue like so:

makeThumbnail.js
var db = require('./database.js');

module.exports = {
  start : function(input){
    // makeThumbnail
    // save timestamp
    db.save({...});
  }
};
uploadToS3.js
var db = require('./database.js');

module.exports = {
  start : function(input){
    // upload thumb
    // perhaps save bucket name
    db.save({...});
  }
};

Using dependency injection I’d change that to this:

makeThumbnail.js
module.exports = {
  start : function(input, db){
    // makeThumbnail
    // save timestamp
    db.save({...});
  }
};
uploadToS3.js
module.exports = {
  start : function(input, db){
    // upload thumb
    // perhaps save bucket name
    db.save({...});
  }
};
taskRunner.js
var db = require('./database.js'),
    makeThumbnail = require('./tasks/makeThumbnail'),
    uploadToS3 = require('./tasks/uploadToS3'),
    taskToRun = process.argv[2],
    taskRunner;

taskRunner = function(task){
  task.start(process.argv, db);
};

if(taskToRun === 'uploadToS3'){
  taskRunner(uploadToS3);
}else{
  taskRunner(makeThumbnail);
}

While the DI example above requires more code than the original, it makes testing easier and, I'd argue, better. It becomes trivial to test each unit in isolation from other units of code. With the first approach, each unit test would require database calls. With the second approach, we can inject a mock database object like so:

makeThumbnail.test.js
var assert = require('assert'),
    makeThumbnail = require('./tasks/makeThumbnail'); // adjust the path as needed

describe('Make Thumbnail', function(){
  var database = {};

  it('should make a thumbnail, and call db.save', function(done){
    var input = {
      imageId : 1
    };

    database.save = function(obj){
      assert.equal(obj.id, input.imageId);
      done();
    };

    makeThumbnail.start(input, database);
  });
});

This speeds testing up significantly and ensures that if a unit test fails, it's not failing because of issues with the database code. Ultimately, this makes localizing bugs much quicker. In other words, we can test just the creation of thumbnails, and not our database (which we'd test separately).

DI forces one to think more rigorously about how code will be divided into modules, and which modules will be injected into other modules. This requires more diligence upfront, but leads to greater flexibility down the line. For instance, the database object is only require()'d and injected in a single spot in the code, making it much easier to swap the database from, say, MySQL to PostgreSQL.

Why not just use require?

On a post detailing the magic behind AngularJS' DI, tjholowaychuk (of Express.js, Jade and Mocha fame) asks: "why not just require() what you want? seems odd to me".

Despite the question being asked 6 months ago, no one has replied, leaving readers pondering why. As the example above shows, requiring dependencies at the top of each file makes mocking more difficult. One could write a wrapper around each dependency, serving the real version in development and production and a mocked version in the test environment, but at that point why not consider DI?

Conclusion

Just as learning new programming languages makes us better developers, so does learning new frameworks. Learning a new framework helps us learn, and reinforce our knowledge of, design patterns. Qes, on programmers.stackexchange.com, sums up his experiences with DI:

A quote about the importance of dependency injection.

Extra Reading

  1. Dependency Injection (Wikipedia)
  2. How DI In AngularJS Works
  3. AngularJS Best Practices [questions the necessity of DI]

Understanding AngularJS Directives Part 1: Ng-repeat and Compile

My first impression of Angular.js was one of amazement: a small amount of code could do a lot. My worry with Angular, and other magical frameworks, is that initially you are productive, but eventually you hit a dead end that requires a full understanding of how the magic works. In my quest to master Angular.js, I wanted to learn everything about creating custom directives - a goal that I hoped would ease the learning curve. Egghead.io does a good job exploring basic and intermediate examples of custom directives, but it still wasn't clear when to use the compile parameter in a custom directive.

Miško Hevery, the creator of AngularJS, gave a talk about directives and explained that compile is rarely needed for custom directives, and it is only required for directives like ng-repeat and ng-view. So the next question: how does ng-repeat work?

How does ng-repeat work?

In my quest to understand the compile function, I started examining ng-repeat. Reading the source code was difficult until I walked through an example using the Chrome debugger. After stepping through ng-repeat it became clear that most of its 150 lines of code are related to optimizing, error handling, and handling objects or arrays. In order to really understand ng-repeat, and specifically compile, I set out to implement my own version of ng-repeat, which I will call lk-repeat, with just the bare minimum of code. When possible I tried to use the same variable names that ng-repeat uses, and I also used its regular expression for matching passed-in attributes.

Transclusion

Before going further it’s important to review the transclude option. Transclude has two options: 1) true or 2) 'element'. First let’s examine transclude : true.

DIV using the person directive
<div person>Ted</div>
Defining the person directive
app.directive('person', function(){
  return {
    transclude : true,
    template: '<h1>A Person</h1><div ng-transclude></div>',
    link : function($scope, $element, $attr){
      // some code
    }
  }
});
Result
<h1>A Person</h1><div ng-transclude><span class="ng-scope">Ted</span></div>

In the above example transclude : true tells Angular to take the contents of the DOM element using this directive and insert them into the person's template. To specify where in the template the HTML will be transcluded, include ng-transclude in the template. The span, with class ng-scope, is inserted by AngularJS.

In contrast to the above example, ng-repeat does not have a template and transcludes the element that calls ng-repeat. Hence, ng-repeat uses transclude : 'element' to denote that the DOM element that called ng-repeat will be used for transclusion.

lk-repeat

Below lk-repeat is used the same way ng-repeat would be used.

<ul>
  <li lk-repeat="name in names">{{name}}</li>
</ul>
var app = angular.module('myApp',[]);

app.directive('lkRepeat', function(){
  return {
    transclude : 'element',
    compile : function(element, attr, linker){
      return function($scope, $element, $attr){
        var myLoop = $attr.lkRepeat,
            match = myLoop.match(/^\s*(.+)\s+in\s+(.*?)\s*(\s+track\s+by\s+(.+)\s*)?$/),
            indexString = match[1],
            collectionString = match[2],
            parent = $element.parent(),
            elements = [];

        // $watchCollection is called every time the collection is modified
        $scope.$watchCollection(collectionString, function(collection){
          var i, block, childScope;

          // check if elements have already been rendered
          if(elements.length > 0){
            // if so remove them from DOM, and destroy their scope
            for (i = 0; i < elements.length; i++) {
              elements[i].el.remove();
              elements[i].scope.$destroy();
            };
            elements = [];
          }

          for (i = 0; i < collection.length; i++) {
            // create a new scope for every element in the collection.
            childScope = $scope.$new();
            // pass the current element of the collection into that scope
            childScope[indexString] = collection[i];

            linker(childScope, function(clone){
              // clone the transcluded element, passing in the new scope.
              parent.append(clone); // add to DOM
              block = {};
              block.el = clone;
              block.scope = childScope;
              elements.push(block);
            });
          };
        });
      }
    }
  }
});

Above you'll note that I'm removing all the elements from the DOM, along with their scopes, every time the collection updates. While this makes the code easier to understand, it is extremely inefficient to remove everything and then add it again. In the real version of ng-repeat, only elements that are removed from the collection are removed from the DOM. Furthermore, if an item moves within the collection (e.g. from 2nd to 4th place) it doesn't need a new scope, but it does need to be moved in the DOM. Reading ng-repeat's code gives me confidence that the team behind AngularJS has created a good, well tested and efficient framework.

In part 2 I examine ngView, its implementation, hidden features and creating your own ngMultiView.

Adding Real-Time to a RESTful Rails App

After rewriting Understoodit several times I've spent a lot of time thinking about building real-time web applications. While I elected to rewrite 100% of Understoodit in Node, there are many existing Rails and Sinatra applications that can't be completely rewritten, but could still benefit from the addition of real-time updates. The tutorial below starts with a traditional web app written in Backbone and Ruby on Rails (RoR). Of course, the modifications could easily be applied to any (Backbone|Angular|Ember) and (Rails|Sinatra|Django|Pylons) app.

Between the overview below, and the code on GitHub, you should be able to follow along and, in less than 50 lines of code, add real-time updates to your Rails app.

Adding Real-Time on Github

Starting Point

In a traditional web app, if a user creates a new model, other users must refresh their page to see that content. Alternatively, you could poll the server every 30 seconds and refetch all the content. With both approaches you end up fetching all the content, and in the first case the markup as well.

Figure 1. Traditional RESTful Rails app

In Figure 1, User 1 creates a new book, but User 2 will not see that new book unless they refresh their page.

Adding Real-Time With Redis And Socket.IO

When User 1 creates a new book, we’d like that new book to be pushed to User 2 in real-time. I’m going to cover one method that requires only a few modifications to your existing app and uses Redis, Node and Socket.IO.

How It Will Work

Figure 2. Traditional RESTful Rails app with real-time
  1. When User 1 creates a new book, an “after_create” callback publishes that new book to Redis on the “rt-change” channel.
  2. On the Node server, each client subscribing to “rt-change” receives that new book.
  3. The new book is pushed to the client using Socket.IO.
  4. Within the browser, Socket.IO receives that new book and “publishes” that change to our Backbone.js App.
  5. The Backbone.js books collection, listening for changes to books, adds the new book to itself.

The advantage of this approach is that it requires only tiny modifications to a Rails model, and if your Node server crashes, your application will work as it always has (without real-time). Thus, I'd consider this a real-time enhancement that gracefully degrades to a conventional Rails RESTful web app.

Socket.IO Connection

First, ensure that socket.io.js has been added to lib/assets/javascripts, and referenced in app/assets/javascripts/application.js. In the web app create a new module, called realtime, that includes the Socket.IO connection code. When the application initializes it calls app.realtime.connect() to setup the Socket.IO connection.

window.app.realtime = {
  connect : function(){
    window.app.socket = io.connect("http://0.0.0.0:5001");

    window.app.socket.on("rt-change", function(message){
      // publish the change on the client side, the channel == the resource
      window.app.trigger(message.resource, message);
    });
  }
}

Node Server & Pub/Sub

In the root of the Rails app create a new folder called 'realtime', where the Node server will reside. Don't forget to create a package.json file that lists socket.io and redis in its dependencies. Finally, remember to run npm install.

realtime/realtime-server.js
var io = require('socket.io').listen(5001),
    redis = require('redis').createClient();

redis.subscribe('rt-change');

io.sockets.on('connection', function(socket){
  redis.on('message', function(channel, message){
    socket.emit('rt-change', JSON.parse(message));
  });
});

Rails Models

Assuming you have Redis installed, add redis to your Gemfile. Next, create a file called redis.rb in your initializers with the following content:

config/initializers/redis.rb
#make sure redis has been added to your Gemfile
$redis = Redis.new(:host => 'localhost', :port=> 6379)

The Rails app now has access to Redis through the $redis global variable. Below, we publish changes to Redis whenever a model is created, updated or destroyed. Changes are published to “rt-change”, which our Node.js connections are listening to (see above).

app/models/book.rb
class Book < ActiveRecord::Base
  attr_accessible :num_pages, :title
  after_create {|book| book.message 'create' }
  after_update {|book| book.message 'update' }
  after_destroy {|book| book.message 'destroy' }

  def message action
    msg = { resource: 'books',
            action: action,
            id: self.id,
            obj: self }

    $redis.publish 'rt-change', msg.to_json
  end
end

Listen For Changes in The Backbone App

In the Books collection, we add the code to listen for 'books' events, along with a handler for those events. For create, we simply add the new object (obj) to the collection. For update, we update the existing model, while for destroy we remove the object from the collection.

app.collections.Books = Backbone.Collection.extend({
  model : app.models.Book,
  url : '/books',

  initialize: function(){
    app.on('books', this.handle_change, this);
  },

  handle_change : function(message){
    var model;

    switch(message.action){
      case 'create':
        this.add(message.obj);
        break;
      case 'update':
        model = this.get(message.id);
        model.set(message.obj);
        break;
      case 'destroy':
        this.remove(message.obj);
    }
  }
});

Caveats

In production there are many edge cases to consider. For instance, if someone views your app on their mobile phone and then puts the phone in their pocket, the screen saver goes on and Socket.IO will disconnect. When the user takes the phone out of their pocket and views the app, Socket.IO will reconnect. However, during the period of disconnection the data in the client-side app may have become out of date. An easy fix is just to fetch the data on reconnect (see the sketch below). With lots of connections, or lots of data, fetching everything becomes problematic and requires a more clever method for fetching data (e.g. just fetch the new, or changed, data).
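
The easy fix is only a few lines on the client (a sketch; it assumes the Books collection above is instantiated somewhere as window.app.books):

window.app.socket.on('reconnect', function(){
  // events may have been missed while disconnected,
  // so refetch the collection from the REST API
  window.app.books.fetch();
});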

Another issue arises if two people are editing the same item: if person 1 clicks save, that will replace what person 2 is editing. To solve this you can present person 2 with a message saying that the book they are editing has been updated by someone else, and prevent the version of the book they are editing from being replaced. This isn't an ideal solution, but it would be fine if the chances of two people editing the same model were minimal.

In the code above there is only one channel, 'rt-change', meaning every connected client will get every real-time change. You may want to scope your channels by user (e.g. rt-change/[USERID]). Furthermore, you'd want to create one redis client for every Socket.IO connection (currently there's one redis client for all connections). In other words, the .createClient() and redis.subscribe('...') calls would have to take place within the Socket.IO 'connection' callback.
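
Scoping channels per user might look like this (a sketch; how the user id reaches the socket - here via the handshake query string - is an assumption):

var io = require('socket.io').listen(5001),
    redis = require('redis');

io.sockets.on('connection', function(socket){
  var sub = redis.createClient(), // one redis client per connection
      userId = socket.handshake.query.userId; // hypothetical: sent by the client

  sub.subscribe('rt-change/' + userId); // per-user channel

  sub.on('message', function(channel, message){
    socket.emit('rt-change', JSON.parse(message));
  });

  socket.on('disconnect', function(){
    sub.quit(); // clean up the per-connection client
  });
});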

Alternatives To The Above

[update July 2014]: Realtime Rails Gem

Mike Atlas created a realtime gem which does all the above with minimal setup.

SockJS

Socket.IO could be swapped for SockJS, which uses a similar API to websockets. I’ve heard from several individuals that it’s significantly more stable than the current version of Socket.IO and it’s currently used by Meteor.

Engine.IO

Guillermo Rauch, the creator of Socket.IO, has publicly stated that Socket.IO's approach of starting with websockets and falling back to polling creates issues. As a result, he's been working on Engine.IO, which will power Socket.IO version 1.0 and should provide a much more stable experience. I suspect Socket.IO v1.0 will be released in the next few months.

Rails 4.0

Rails 4.0, which is due to be released soon, will include streaming. Using a combination of Rails 4 streaming and Puma, you could potentially remove Node and Socket.IO and use Rails for real-time. Of course, you'd have to take care of some of what Socket.IO does, such as reconnects and heartbeats.

RabbitMQ/ZeroMQ

Redis’ Pub/Sub functionality could be replaced by either RabbitMQ or ZeroMQ. I ended up using Redis, since I was using it for caching, and it has an extremely simple API for pub/sub. While RabbitMQ and ZeroMQ appear more complex, they do offer many more features for messaging.

Commercial Options

If you’re not keen on tinkering with Node, or waiting for Rails 4, there are commercial options such as Pusher and PubNub, that deal with real-time connections for you. While both options can be pricey, especially with many concurrent connections, they do save you the hassle of building the infrastructure yourself.

Conclusions

Adding real-time updates to your Ruby on Rails RESTful app has never been easier. Over the next few months Rails 4, or Socket.IO v1.0, will make the process even more painless. As Google’s services make users more accustomed to real-time updates, it becomes even more important to provide a similar experience in your webapps.

Adding Real-Time on Github

Three Important Conversion Metrics You Should Watch

You’ve created an app that your beta testers use and your launch was a huge success. Every day the number of registered users increases and you feel like you’re making progress and moving closer to your goals.

It would be a mistake to hunker down and focus singularly on hammering out new features. Now that you’ve launched you have significantly more people visiting your homepage, creating accounts and using your app. You have much more data to help you prioritize what to do next.

I'll focus on three conversion metrics that are important at any time, but especially so after launching: 1) visitors creating an account, 2) users becoming active users and 3) users becoming paying users. While there are exceptions, the order is important: a user can't become an active user without creating an account. Likewise, paying users are most likely active users.

Create An Account

What percent of visitors create accounts? There are many variables that will influence the percent of visitors who create an account, including:

Relevant Keywords

Maybe visitors are arriving at your site searching for something else. If that’s the case you might have a high bounce rate (e.g. users coming to your site and immediately leaving). To fix this issue investigate ways to improve your SEO and ensure that your keywords are relevant to your product.

Clear Selling Point

Have you made it as clear as possible why your product is needed? Ask colleagues unfamiliar with your product to evaluate your site's copy and its clarity.

Call-to-Action Button

Have you made it obvious how they can create an account? To test this you can use a service like Optimizely to A/B test different variants of your front page and your call-to-action button.

Pricing

Maybe your pricing page is too complicated, or your prices are too high, and as a result users don’t even bother creating an account. This is an issue that you should look into before you launch. Ideally you should speak with an appropriate number of potential customers to figure out if your price is within a reasonable range.

Become An Active User

What percent of users become active users (e.g. use your app every day/week)? Your definition of an active user depends on your service. If you're creating a new email app, or a social network, an active user might be someone who uses your app multiple times per day. If you're creating a tax app, an active user might be someone who uses it once a year for a week. In order to measure active users, you must first define who they are (duh!).

Problem: users create an account and then never use your app again. It could be that you have a very interesting idea, but your implementation is off. Likewise you could be attracting a lot of curious users that have no intention of actually using your app, but just want to see what the fuss is about.

I'd recommend following up with a user if they haven't used your app within a week of registering. Ask them why they didn't end up using it, and what *one* feature they'd need in order to use it. Your response rate will likely be between 5 - 10%, but hopefully that should be enough to pick up trends. If possible I'd automate this step to ensure that emails are sent out consistently and you don't waste your time sending out copy-and-pasted emails.

Problem: they use your app for a couple of weeks and then never use it again. While still bad, at least you have more data to work with. Check your database and determine what parts of your app they were using. For an email app, were they sending emails, but not creating contacts? Maybe creating contacts was too tricky and they gave up and stopped using your app.

Analyze your data carefully and see if you can pick out patterns that point to areas that need to be improved. Alternatively, there could be temporal patterns; for instance, people who use your app only once a week might quickly stop using it relative to those who use it several times a week. Finding patterns of use, and comparing active users to non-active users, can shed further light on potential problems.

Become A Paying User

What percent of your active users become paying users? Comparing patterns of use between users who’ve converted, and those who have not, in the same cohort is useful for elucidating causality in conversions.

Unlike the previous step, you should have much more data to conduct your analysis. More data means that it should be easier to find statistically significant patterns, but it may be more challenging and time consuming to do the analysis. I'd recommend generating fewer than five hypotheses before you start your analysis. This will both limit the complexity of the analysis and reduce the multiple comparison problem (e.g. with enough comparisons you're bound to get significant differences that are due to chance alone).

Conclusions

Analyzing what your users are doing, why some are creating accounts, why some are becoming paying users and why others are not, is extremely important. If you love coding, and adding new features, it can seem like a big time waster; however, doing the above will help you prioritize better.

Although I haven't tried either myself, both Kissmetrics and Google Analytics' Conversion feature should help you with the above. It's crucial to be able to quickly determine why some users never become active, or never become paying users. Use hypotheses, data and outcomes to determine how to spend your time efficiently. Time is something you have little of; use it wisely.

Are You Ready to Launch?

Last Week: Before Launching, Build Software People Use

You've created a product that people want to use, and now you're eager to launch. From my experience launching Understoodit in May 2012, I've compiled a set of steps that helped Understoodit get on several big sites including TechCrunch, the Toronto Star, and BetaKit. If you have a large budget, hiring a PR firm might be your best bet; otherwise, the steps below will help you get started.

A Unique Angle

If you're going to catch people's attention in an app-saturated environment, it's important to communicate what's unique about your app. If you've created a todo app, is it for doctors or engineers, or does it do something truly unique? If you can't figure out why your product is unique, it's going to be a tough sell, and it will certainly be more difficult to get the press interested on your launch day.

The unique angle for Understoodit was focusing on its confusion feedback feature (students can click confused, and in real-time the teacher can see what percentage of students are confused). If I had said Understoodit was a “classroom response system”, I would have had a much harder time competing for attention.

Press Release

An important step in preparing for a launch is crafting a press release. If you’ve never written one, or you aren’t a strong writer, I’d recommend hiring someone to write it with you (Vicki So helped me).

My familiarity with Understoodit made it difficult to write about it objectively. When you're focused on the technical side of your product it can be easy to lose focus on what's important to prospective users. By asking good questions, Vicki was able to tease out of me what was important about Understoodit and why educators might be interested. She was able to turn a product launch into an interesting story about how I started Understoodit.

Reporters read many press releases each day, make sure yours is interesting, tells a story, and is well crafted.

Reporters & Bloggers

One of the most important tips Vicki gave me was: send the press release to a specific reporter, not a newspaper or website in general. A reporter who covers education is potentially more interested in Understoodit than the average person handling general enquiries. In preparation for launch I made a long list of reporters that cover education, and on launch day I emailed each of them a quick note with a press release attached. I’d also recommend adding reporters who cover small business and startups to your list, they may also be interested.

In addition to reporters, I contacted a couple of local tech bloggers and asked if I could give them a face-to-face demo. This approach allowed me to pitch a reporter at BetaKit, a Toronto-based website that covers startups. They ended up doing an article about Understoodit a day after the launch.

Social Media

I’m no social media expert, but I can say it played an important role in the early success of Understoodit. Facebook and Twitter were huge sources of traffic, as were social news sites such as Hacker News. Depending on your product’s niche you might have more luck with other social networks such as Pinterest or Instagram. However, for you to have a big impact on sites like Twitter, or Pinterest, it helps to have a lot of followers. Gaining followers takes time and is something you should consider long before you launch.

Friends & Family

I owe a lot of Understoodit’s launch success to friends that not only helped me with the press release, but also tweeted, liked and voted up Understoodit on launch day. Friends and family are also critical in getting you over the trough of despair - so be kind to them!

Luck

No matter how prepared you are there is a strong component of luck to a successful launch. Was your press release the first that a reporter read, or did they read it after reading 5 others? Was there a major news event on the day of your launch? Did a writer on TechCrunch see your launch on Hacker News? All those things are mostly out of your control but can greatly affect your success. Preparation can mitigate some of those issues. For instance, it’s always a good idea to see if there might be any important news events, or tech announcements, that could overshadow your launch.

Conclusions

After following the above actions, mixed with a healthy dose of luck, I was on the front page of Hacker News, and later that day on TechCrunch. Over a 24 hour period Understoodit received hundreds of registrations. Ultimately, that initial burst of excitement made it possible to get an article in the Toronto Star and the Chronicle of Higher Education.

Next Week: I will cover 3 important metrics that you need to watch after launching your product.

Before Launching Build Software People Use

You’re a talented developer and have a great idea for a startup. You’ve read The Lean Startup, you’ve attended entrepreneur events, and you read Hacker News. At this point you’re confident that you’ll be able to build a compelling product while avoiding common startup mistakes.

Unfortunately, that pretty much summed up my (immodest) perception of myself prior to launching Understoodit.com.

While the May 2012 launch of Understoodit was more successful than I anticipated, there were certainly things I could have placed more focus on prior to launching. Below I’ve informally divided the pre-launch process into 5 stages: 1) finding customers, 2) the one feature, 3) early beta testing, 4) engaged users, and 5) time to launch.

Stage 1: Finding Customers

It’s an increasingly common sentiment that you should find potential customers before you start building a product. If you have difficulty finding customers at this stage it may not get easier when you’re coding 12 hours a day. In my experience it’s easier to get help from potential customers at this stage since you aren’t necessarily selling anything yet. At this stage you’re simply doing research and building relationships with potential customers. When contacting professors for Understoodit, I found they were happy to give feedback and provide constructive criticism. If initially I had tried to sell them something, they may not have been so forthcoming with help and criticism. If all goes well, those first few relationships will turn into paying customers, so be kind and accept their feedback without becoming defensive!

Stage 2: The One Feature

Assuming you've mastered the previous step, getting feedback will be easy. In fact, you'll likely end up with a laundry list of features. Some features are nice to have, while others will be critical. Unfortunately, teasing apart the critical features from the nice-to-haves is not always easy [1]. I'd recommend asking potential users: "If we added only one feature, what would be the most important?" This forces your users to prioritize what they think is the most important feature. A single user may not get this right, but the intuition of multiple users should converge on a single feature. That single feature will become your minimum viable product (MVP). Furthermore, if your users thought that feature was very important, it's likely that others in the same demographic will also think it's important.

Stage 3: Early Beta Testing

At this stage you’ve created an MVP, with one critical feature, and you’re eager to start early beta testing. There is a lot of good advice on user testing online, so I’ll focus on one method that I’ll dub “passive watching”. Passive watching involves sitting with potential users and passively watching them create an account, use your app, perform various actions, etc. It’s important not to help them at this point. Rather, what you want to do is see where they’re getting stuck, where they’re getting frustrated, and where things are working smoothly. Don’t be defensive if they dislike the user interface, or its flow, just listen at this stage. After testing with 10 - 20 potential users, you’ll get a very good idea of what needs to be improved, removed and what needs to be added.

Stage 4: Engaged Users?

During early beta testing you hopefully received a lot of positive feedback. (And thanks to the many who gave me feedback, including readers of this blog.) But don't conflate positive feedback with engagement. Just because a user says your app is great, it doesn't guarantee that they will actually use it, let alone pay for it. However, if your early beta testers keep using it, and start telling their friends, that's a good sign. On the other hand, if they stop using it, it's critical to find out why. If you can't engage your early beta testers (that is, users who've invested a lot of time already), it will be difficult to engage new users after launch. This stage is important! So don't fool yourself into thinking you've created a great product unless you have engaged users using your app regularly (self awareness and introspection are important attributes for entrepreneurs [2]). If your users aren't engaged you have to decide whether to go back to early beta testing, pivot, or scrap the idea entirely.

Stage 5: Time to launch

You’ve built a strong MVP that is used regularly by your beta testers. In turn those testers are telling friends and colleagues about your app. You’ve validated both your idea and your execution and it’s time to launch and grow the number of users.

Next week I’ll cover some of my experiences that helped get Understoodit featured on TechCrunch, Discovery.com News, The Chronicle of Higher Education, and the Toronto Star. Following next week’s article, on “The Launch,” I will focus on metrics that will help you decide if your startup is succeeding or floundering.

Extra Reading:

  1. Tactics for Better Customer Interviews
  2. Secret Ingredient for Success (The New York Times)

Why You Should Be Nice to Your Customers

Being nice to your customers seems like a no brainer. It makes perfect business sense: happy customers are less likely to stop using your service and more likely to refer your service to their friends. However, there is another significant reason that being nice pays off.

Running a startup can be a tough slog. You’re unsure if people like what you’re building enough for it to be successful. If you get some traction there will be people out there that will belittle your idea, or your execution. With all the unknowns, and the not-always-constructive criticism, it’s incredibly refreshing to interact with a nice customer that loves your product.

I’ve noticed that the nicer I am to customers the nicer they are back. This virtuous circle means that we get more and more people who are willing to provide thoughtful criticism and who are willing to meet with us and give us feedback. We have one customer that loves Understoodit so much that he just finished writing a blog post for us (to be posted in the next week).

After a difficult day in startup land it’s really gratifying to interact with a nice customer. It puts a face to an email address and makes the whole process of building software that much more interesting and fulfilling.

Make sure you’re nice to your customers, they may just be nice back.