Liam Kaufman

Software Developer and Entrepreneur

Smiley Faces in Linux Source Code and Token Statistics

GitHub uses Linguist, a Ruby library, to help detect which programming language is in a given file. Recently, an issue was filed indicating that Linguist incorrectly classifies Mercury (a programming language) files as Objective-C, since both use the same extension (.m). Linguist’s primary method for language detection is a file’s extension - a method that fell short for Mercury. If Mercury were added to Linguist, there would be two languages with the same extension - and this is where things get interesting.

If two languages share the same extension, or the file does not have an extension, Linguist has three methods for guessing the language. First, it checks whether the file has a shebang (e.g. #!/bin/sh). If there is no shebang, the second method is a set of heuristics. For instance, if the file includes the “:-” token it concludes that the contents are Prolog code, or if “defun” is present it’s Common Lisp. If it still hasn’t found a match, the third method is a Bayesian classifier. Roughly speaking, the classifier iterates over all of a file’s tokens and, for each token, determines the probability that it is present in each programming language. It then sums those probabilities, sorts the results, and returns an array of language-probability pairs (e.g. [['Ruby', 0.8], ['Python', 0.2]]).
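To make the three-step fallback concrete, here is a minimal Python sketch. It is not Linguist's actual code (Linguist is written in Ruby): the function names, the tiny TRAINING table and the add-one smoothing are all assumptions for illustration, and the classifier is a textbook naive Bayes that adds log-probabilities rather than Linguist's exact scoring.

```python
import math
import re
from collections import Counter

# Hypothetical per-language token counts; a real classifier is trained on many sample files.
TRAINING = {
    "Ruby": Counter({"def": 40, "end": 60, "puts": 10, "(": 20, ")": 20}),
    "Python": Counter({"def": 40, ":": 50, "print": 10, "(": 30, ")": 30}),
}
VOCABULARY = set().union(*TRAINING.values())


def shebang_interpreter(source):
    """Step 1: return the interpreter named on a leading shebang line, if any."""
    lines = source.splitlines()
    match = re.match(r"#!\s*(\S+)(?:\s+(\S+))?", lines[0]) if lines else None
    if match is None:
        return None
    program = match.group(1).rsplit("/", 1)[-1]
    # "#!/usr/bin/env python" names the interpreter in the second word.
    if program == "env" and match.group(2):
        program = match.group(2)
    return program


def heuristic_language(source):
    """Step 2: the two example heuristics mentioned in the text."""
    if ":-" in source:
        return "Prolog"
    if "defun" in source:
        return "Common Lisp"
    return None


def bayesian_ranking(tokens):
    """Step 3: rank languages by naive Bayes score with add-one smoothing."""
    log_scores = {}
    for language, counts in TRAINING.items():
        total = sum(counts.values()) + len(VOCABULARY)
        log_scores[language] = sum(math.log((counts[t] + 1) / total) for t in tokens)
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    norm = sum(weights.values())
    ranked = [[lang, weight / norm] for lang, weight in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def guess_language(source, tokens):
    """Try each step in order and stop at the first one that gives an answer."""
    return (
        shebang_interpreter(source)
        or heuristic_language(source)
        or bayesian_ranking(tokens)[0][0]
    )
```

Note that the shebang step in this sketch returns the interpreter name (such as "sh"), which a real implementation would still need to map to a language name.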

I wondered how logistic regression, support vector machines or even clustering algorithms would help in classifying a given file. As I dived into the data I realized that the descriptive statistics on tokens, and even ASCII faces, would be nearly as interesting as their predictive power. Thus, this post summarizes the descriptive statistics, while my next post will cover using tokens to predict a file’s programming language.

Methods

Fetching The Code

Using GitHub’s API I fetched the 10 most popular repositories for each of 10 languages (C, Haskell, Go, JavaScript, Java, Lua, Objective-C, Ruby, Python and PHP). Those 10 were chosen for their popularity, differing paradigms (e.g. Haskell vs Java), differing syntax (Haskell vs Go) and overlapping syntaxes (C, JavaScript and Java). After retrieving the list of 100 repositories I downloaded the zipball for each repo.
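For readers who want to fetch a similar list, the sketch below builds request URLs for GitHub's repository search endpoint and for a repository's zipball. Ranking by star count is an assumption about what "most popular" means here, and the helper names are hypothetical; the scripts used for this post may have worked differently.

```python
from urllib.parse import urlencode

LANGUAGES = ["c", "haskell", "go", "javascript", "java",
             "lua", "objective-c", "ruby", "python", "php"]


def search_url(language, per_page=10):
    """URL for the top repositories of one language (assumes sorting by stars)."""
    query = urlencode({
        "q": f"language:{language}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    })
    return f"https://api.github.com/search/repositories?{query}"


def zipball_url(full_name):
    """URL for a repository archive; full_name looks like "owner/repo"."""
    return f"https://api.github.com/repos/{full_name}/zipball"


if __name__ == "__main__":
    for language in LANGUAGES:
        print(search_url(language))
```

The sketch only builds URLs. Actually sending the requests, handling rate limits and unpacking the archives are left out.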

Tokenizing The Code

First, a list of common programming tokens (e.g. ; , . ( ) etc.) was created - tokens that would be found in many of the 10 languages of interest. Using those tokens I created a tokenizer that outputs an object whose keys are tokens and whose values are the number of times each token occurred in the file. Base 10 numbers, hex numbers, strings (double quotes) and ‘characters’ (single quotes) were treated as 4 different token types. This was done so that, for example, the number 8 was not treated as a different token from the number 44 (both are tokenized as “numbers”).
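As a rough illustration of this kind of tokenizer, the Python sketch below counts tokens with a regular expression. The symbol list is only a small sample, the names TOKEN_SPEC and tokenize are made up for this example, and the hex rule assumes a 0x prefix, so it should be read as an approximation of the approach rather than the tokenizer used for this post.

```python
import re
from collections import Counter

# Order matters: strings first, then hex before plain numbers, then symbols.
TOKEN_SPEC = [
    ("string_dq", r'"(?:\\.|[^"\\\n])*"'),      # double-quoted strings
    ("string_sq", r"'(?:\\.|[^'\\\n])*'"),      # single-quoted 'characters'
    ("hex_number", r"\b0[xX][0-9a-fA-F]+\b"),   # assumes a 0x prefix
    ("number", r"\b\d+(?:\.\d+)?\b"),           # base 10 numbers
    ("symbol", r"->|[;,.(){}\[\]=*/:#<>]"),     # a small sample of shared symbols
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Return a Counter mapping each token (or token type) to its count."""
    counts = Counter()
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        # Symbols are counted individually; literals are counted by type.
        counts[match.group() if kind == "symbol" else kind] += 1
    return counts


if __name__ == "__main__":
    print(tokenize('x = foo(8, 44, 0xFF, "hi", \'c\');'))
```

In the example line, 8 and 44 both land in the same "number" bucket, while 0xFF is counted separately as a hex number.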

Each of the 100 repositories was traversed and non-binary files were tokenized, with each file’s token counts added to a Redis sorted set (sorted by number of occurrences). Using a sorted set made it trivial to retrieve the 1,000 most common tokens from all 100 repositories. Each file was then re-tokenized, but only tokens that were present in the list of the 1,000 most common tokens were counted. A data set was created that included information on 65,804 files from 100 different repos. Along with the token data, the following data was also recorded: 1) the file’s extension, 2) its path within the repository, 3) its shebang, if present and 4) the token counts for the 250 most common tokens (I decided to limit the first round of analysis to a smaller number of tokens). Finally, the count for each token was converted to a ratio: that token’s occurrences relative to the total occurrences of all tokens in the file (e.g. # of periods/total number of all tokens). Ratios were used because absolute token counts would be skewed by large files, which have more tokens.
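The counting and normalizing steps can be sketched in Python as follows. Here collections.Counter stands in for the Redis sorted set, the function names are hypothetical, and the ratio is taken over the tokens that were actually counted for the file, which is one reading of the description above.

```python
from collections import Counter


def most_common_tokens(per_file_counts, limit):
    """Merge per-file counts and return the `limit` most frequent tokens."""
    totals = Counter()
    for counts in per_file_counts:
        totals.update(counts)
    return [token for token, _ in totals.most_common(limit)]


def token_ratios(counts, vocabulary):
    """Convert one file's counts to ratios, keeping only vocabulary tokens."""
    kept = {token: counts.get(token, 0) for token in vocabulary}
    total = sum(kept.values())
    if total == 0:
        # An empty file (or one with no known tokens) gets all-zero ratios.
        return {token: 0.0 for token in vocabulary}
    return {token: count / total for token, count in kept.items()}


if __name__ == "__main__":
    files = [Counter({",": 3, ".": 1}), Counter({",": 1, ";": 4, "@": 1})]
    vocabulary = most_common_tokens(files, 3)
    for counts in files:
        print(token_ratios(counts, vocabulary))
```

Working with ratios means a 10-line file and a 10,000-line file with the same style produce similar rows.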

Counting Smiley Faces

To ease analysis I focused on C and JavaScript files (in C and JavaScript repositories): both languages have identical single and multi-line comment syntax. First, text from comments was separated from code. Second, the number of times an ASCII face appeared in a given file’s comments was counted. The following “faces” were counted: :( :) :-) :-D :p ;) ;-)
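A minimal sketch of these two steps in Python is shown below. The regular expression and function names are assumptions made for this example. It skips over string literals so that comment markers inside strings are ignored, but it does not handle every edge case (JavaScript regex literals, for instance), so it only approximates what a careful comment extractor would do.

```python
import re
from collections import Counter

FACES = [":(", ":)", ":-)", ":-D", ":p", ";)", ";-)"]

# String literals are matched (and discarded) so that "//" or "/*" inside
# a string is not mistaken for the start of a comment.
COMMENT_RE = re.compile(
    r'"(?:\\.|[^"\\\n])*"'
    r"|'(?:\\.|[^'\\\n])*'"
    r"|(?P<line>//[^\n]*)"
    r"|(?P<block>/\*.*?\*/)",
    re.DOTALL,
)


def extract_comments(source):
    """Return the text of every // and /* */ comment in C-style source."""
    comments = []
    for match in COMMENT_RE.finditer(source):
        text = match.group("line") or match.group("block")
        if text:
            comments.append(text)
    return comments


def count_faces(comments):
    """Count how often each ASCII face appears across the given comments."""
    counts = Counter()
    for text in comments:
        for face in FACES:
            counts[face] += text.count(face)
    return counts


if __name__ == "__main__":
    sample = (
        "int x = 1; // works now :)\n"
        'char *s = "// not a comment :(";\n'
        "/* multi\n line ;-) :-) */\n"
    )
    print(count_faces(extract_comments(sample)))
```

None of the seven faces is a substring of another, so simple substring counting does not double count them.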

Statistics

All statistics were carried out using R. Welch Two Sample t-tests were used to compare groups.
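For readers unfamiliar with the test, Welch's version of the t-test does not assume that the two groups have equal variances (it is the variant that R's t.test runs by default). The Python sketch below computes only the t statistic and the Welch-Satterthwaite degrees of freedom. The function name and sample data are made up for illustration, and it does not compute a p-value.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical per-file face counts for two groups of files.
    print(welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```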

Results

Tokens

While the top 1,000 tokens by occurrence were recorded, only the top 20 are presented in Table 1 (see here for the top 1,000). Not surprisingly, numbers are the most prevalent token, with commas coming a very close second. Interestingly, and what sparked my interest in ASCII smiley faces, there are slightly more right parentheses than left. While the contents of strings were ignored, the contents of comments were not. Seeing as most (all?) of the analyzed languages require parentheses to be balanced, I presumed that the imbalance might be caused by ASCII smiley faces in comments.

Interestingly, hexadecimal numbers were the 6th most common token despite rarely being used outside of C. While hex numbers are used extensively in CSS, I only classified numbers that started with ‘x’ as being hex, which precluded the hex numbers in CSS from being included.

Table 1. Top 20 Tokens By Occurrence

Token                         Occurrences
Numbers                        19,640,325
,                              19,597,223
)                              10,446,695
(                              10,425,221
;                               7,882,261
Hex Numbers                     6,261,887
*                               6,205,697
.                               5,978,092
=                               5,841,844
Strings (Double Quotes - DQ)    4,336,520
}                               3,310,463
{                               3,305,033
/                               2,939,954
:                               2,640,872
->                              2,425,261
#                               2,423,779
[                               2,004,437
]                               2,002,711
<                               1,591,276
Strings (Single Quotes - SQ)    1,578,686

The top 20 tokens, by occurrence in aggregate (across 65,804 files).

Table 2 shows the top 20 tokens and their ratios (token/all tokens in a given file) in 16 different file types. Not surprisingly, JSON files lead the pack for double-quoted strings, curly brackets, colons and commas. Likewise, Clojure leads by having the highest proportion of parentheses. The right arrow -> occurred most often in PHP, C and Haskell. Finally, square brackets were very prevalent in Objective-C.

Table 2. Top 20 Tokens

File Type  # Files  Numbers       ,       )       (       ;  Hex Numbers       *       .       =  Strings (DQ)       }       {       /       :      ->       #       [       ]       <  Strings (SQ)
.php         10289   0.0106  0.0280  0.0634  0.0632  0.0406       0.0000  0.0463  0.0257  0.0172        0.0054  0.0219  0.0218  0.0187  0.0077  0.0239  0.0004  0.0042  0.0042  0.0126        0.0375
.json         1002   0.0183  0.1510  0.0001  0.0001  0.0000            0  0.0001  0.0011  0.0001        0.3575  0.0927  0.0930  0.0000  0.2163       0  0.0001  0.0179  0.0179  0.0001        0.0001
.md           1541   0.0317  0.0238  0.0250  0.0248  0.0030       0.0001  0.0146  0.0559  0.0063        0.0108  0.0039  0.0038  0.0012  0.0251  0.0006  0.0251  0.0127  0.0127  0.0066        0.0064
.hs           6133   0.0416  0.0224  0.0426  0.0424  0.0014       0.0002  0.0025  0.0253  0.0407        0.0121  0.0123  0.0123  0.0006  0.0051  0.0179  0.0229  0.0090  0.0091  0.0019        0.0060
.html         1579   0.0211  0.0091  0.0117  0.0116  0.0154       0.0000  0.0007  0.0252  0.0463        0.0482  0.0295  0.0295  0.0033  0.0099  0.0003  0.0270  0.0016  0.0016  0.1150        0.0052
.css           701   0.0811  0.0212  0.0083  0.0083  0.0676       0.0000  0.0219  0.0564  0.0023        0.0050  0.0392  0.0394  0.0169  0.0779       0  0.0194  0.0011  0.0011  0.0001        0.0018
.js           9089   0.0290  0.0681  0.0646  0.0639  0.0436       0.0005  0.0075  0.0574  0.0239        0.0328  0.0226  0.0220  0.0043  0.0203  0.0000  0.0004  0.0086  0.0086  0.0013        0.0251
.py           3081   0.0211  0.0483  0.0539  0.0537  0.0008       0.0001  0.0024  0.0670  0.0364        0.0213  0.0018  0.0018  0.0003  0.0278  0.0000  0.0294  0.0092  0.0093  0.0008        0.0376
.c           25043   0.0536  0.0560  0.0573  0.0573  0.0555       0.0077  0.0469  0.0248  0.0276        0.0110  0.0156  0.0155  0.0174  0.0050  0.0194  0.0120  0.0053  0.0053  0.0078        0.0017
.xml          1788   0.0127  0.0055  0.0022  0.0022  0.0026       0.0000  0.0005  0.0146  0.0925        0.0993  0.0014  0.0014  0.0191  0.0315  0.0000  0.0060  0.0005  0.0005  0.1242        0.0006
.clj            83   0.0183  0.0064  0.0946  0.0944  0.0529            0  0.0014  0.0462  0.0034        0.0239  0.0049  0.0049  0.0001  0.0189  0.0026  0.0015  0.0339  0.0339  0.0001        0.0005
.rb           8527   0.0229  0.0479  0.0264  0.0263  0.0013       0.0001  0.0018  0.0539  0.0148        0.0510  0.0062  0.0062  0.0009  0.0446  0.0002  0.0166  0.0086  0.0087  0.0141        0.0530
.java        12293   0.0157  0.0262  0.0488  0.0488  0.0433       0.0001  0.0525  0.0930  0.0104        0.0118  0.0186  0.0185  0.0137  0.0032  0.0000  0.0009  0.0018  0.0018  0.0050        0.0005
.go           1259   0.0265  0.0569  0.0684  0.0684  0.0043       0.0029  0.0108  0.0688  0.0111        0.0370  0.0340  0.0339  0.0009  0.0104  0.0000  0.0002  0.0106  0.0106  0.0014        0.0017
.lua          3935   0.0342  0.0783  0.0526  0.0525  0.0020       0.0008  0.0017  0.0463  0.0634        0.0448  0.0173  0.0171  0.0008  0.0166  0.0001  0.0013  0.0158  0.0157  0.0010        0.0193
.m             742   0.0333  0.0261  0.0530  0.0530  0.0580       0.0001  0.0201  0.0327  0.0246        0.0181  0.0190  0.0190  0.0049  0.0353  0.0003  0.0120  0.0345  0.0345  0.0018        0.0005

The ratio of each token, relative to all tokens in a file, by file type. Only the top 16 file types are shown in this table - there is a very long tail of file types, so I restricted the table to those that are relatively abundant in this dataset.

ASCII Faces

To examine the discrepancy between left and right parentheses, I first created a set of scripts to separate comments from code in C and JavaScript files. I then analyzed the comments and counted the number of ASCII faces that appeared. I focused on 6 different types of smiley faces and included 1 type of frown (see Table 3 for the types of ASCII faces and the number found).

While there were more frowns in JavaScript files, the difference wasn’t statistically significant. Furthermore, there was no statistically significant difference in total smiley faces between ‘.c’ and ‘.js’ files. However, there were more smiley faces in files that were in “JavaScript” GitHub repos (grouping by repository rather than by file extension). For instance, Node is a JavaScript repo but includes both JavaScript and C files. It makes sense that the project, with its distinct maintainers, rules and conventions, is more important than the language in determining the number of smiley faces.

Eighty percent of the C files analyzed were found in the Linux repository, so it made sense to focus on Linux specifically. In Linux C comments I found 631 smiley faces and 73 frowns. In Linux the most prevalent smiley face was `:-)` followed by `:)` (see Table 3).

Table 3. ASCII Faces

Face              Linux C files (20,060)  C files (24,542)  JavaScript files (6,743)
All Smiley Faces  0.0315 (631)            0.0577 (1415)     0.0721 (486)
Frowns :(         0.0036 (73)             0.0051 (124)      0.0249 (168)
:)                0.0088 (172)            0.0081 (198)      0.0027 (18)
:-)               0.0088 (176)            0.0284 (697)      0.0001 (1)
:-D               0.0001 (2)              0.0001 (3)        0.0006 (4)
:p                0.0051 (102)            0.0120 (295)      0.0475 (320)
;)                0.0044 (89)             0.0048 (117)      0.0212 (143)
;-)               0.0044 (90)             0.0043 (105)      0.0000 (0)

The first value is the average number of times the ASCII face appears per file (total occurrences divided by the number of files in that column), while the value in brackets is the total number of times it appears in all files. Linux C files are a subset of the C files.

Discussion

Shortcomings

Despite including 100 different repositories, Linux source files represented 30% of all files in this analysis. Ideally, the number of files from each repository, and language, would be balanced. One approach would be to randomly select a numerically identical subset of files from each language. While this approach might be valid statistically it wouldn’t produce descriptive statistics on each repository, just a subset of files within each repository. Alternatively Linux could be excluded from the analysis since the number of files it contains is an outlier, relative to the other repositories.

In all files there were 21,474 more right parentheses than left. Given that C and JavaScript files represent nearly half of all files in this analysis, and they only had 1,901 smiley faces, it’s unlikely that the other half of the files had nearly 20,000 smiley faces - or enough to account for the difference between left and right parentheses. Future analysis could attempt to locate the source of this difference (presumably within comments).
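The arithmetic behind these figures can be reproduced directly from Tables 1 and 3. The short Python sketch below uses only numbers that appear in those tables; the variable names are just for illustration.

```python
# Aggregate counts from Table 1: (right-hand token count, left-hand token count).
PAIRS = {
    "parentheses": (10_446_695, 10_425_221),      # ")" vs "("
    "curly brackets": (3_310_463, 3_305_033),     # "}" vs "{"
    "square brackets": (2_002_711, 2_004_437),    # "]" vs "["
}

# Total smiley faces from Table 3.
C_SMILEYS = 1_415
JS_SMILEYS = 486

surplus = {name: right - left for name, (right, left) in PAIRS.items()}
known_smileys = C_SMILEYS + JS_SMILEYS
unexplained = surplus["parentheses"] - known_smileys

for name, difference in surplus.items():
    print(f"{name}: {difference:+,} more right than left")
print(f"smiley faces in C and JavaScript files: {known_smileys:,}")
print(f"right parentheses left unexplained: {unexplained:,}")
```

Note that square brackets lean the other way: there are slightly more left brackets than right ones.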

Conclusions

It is not surprising that ratios of token types can differ dramatically between languages. However, I was surprised that several paired tokens (parentheses, square and curly brackets) did not occur equally. While smiley faces can account for part of this discrepancy, they most likely do not account for all of it.

The biggest surprise was that the number of smiley faces per file was not statistically different between JavaScript and C. Because C is low level, I presumed that C code would be more serious, with fewer ASCII faces. I was wrong: C code has a similar number of smiley faces to JavaScript.

Viewing Table 2 we can start to see some patterns and differences in token ratios that might help to predict a file’s language. For instance, JSON has very different token ratios than C. In the next article I will explore the power that tokens have in predicting which programming language is being used in a given file.

If you’re interested in replicating the analysis, or obtaining the dataset, please see the links below:

Language Statistics

Language Statistics Data
