Simple stats in an Awk-ward fashion

Ever had an encounter with the most…awkward programming language?

Sure, awk may have a slightly awkward name and perhaps a bit of awkward syntax at times, but overall it's far from being an awkward language. In fact, it is one of the most powerful tools any programmer should possess for nailing simple or more complex operations on a Unix-based platform.

Today, I’m going to quickly show you how to extract some simple statistical metrics from a numerical list stored in a file (steps.txt), and these are: min, max and mean.

Let’s say you have a file with the number of steps you’ve walked during a week starting on Monday (the order of days doesn’t really matter here, and for simplicity let’s assume the file doesn’t actually contain the ‘#’ comment part with the day names):

5101    # Mon
10418   # Tue
4127    # Wed
8912    # Thu
11100   # Fri
1309    # Sat -- looks like too much Netflix over the weekend 😀
1124    # Sun

Normally, you would write a Python or R script for these tasks, and in Python it would look something like this:

import numpy as np
import pandas as pd

df = pd.read_csv('steps.txt', header=None)
steps = df.iloc[:, 0]

# min
print(np.min(steps))

# max
print(np.max(steps))

# mean
print(np.mean(steps))

That’s pretty neat of course, but there may be cases where you want to do this instantly, right from your favourite shell. Besides, when your input file is very large (e.g. 20GB) and you don’t have enough physical memory to hold all of its contents, it may actually be much more straightforward to calculate the min, max and mean using awk instead of Python. You can still do it with Python, but then you would have to read your file in chunks that fit into your machine’s physical memory, do the calculations within each chunk and then merge the results.

When it comes to awk though, you can just start parsing your file line by line – so no need to store any big data into memory – and do the calculations as below:

# min
awk 'NR == 1 || $1 < min {min = $1} END {print min}' steps.txt
# max
awk 'NR == 1 || $1 > max {max = $1} END {print max}' steps.txt
# mean
awk '{sum += $1} END {print sum / NR}' steps.txt
or you can combine them all in a single line:
$ awk 'NR == 1 || $1 < min {min = $1}; NR == 1 || $1 > max {max = $1}; {sum+=$1} END {print "Min: " min; print "Max: " max; print "Mean: " sum / NR}' steps.txt
For further usability you can add it as a function in your ~/.bashrc:
function awks {
    awk 'NR == 1 || $1 < min {min = $1}; NR == 1 || $1 > max {max = $1}; {sum+=$1} END {print "Min: " min; print "Max: " max; print "Mean: " sum / NR}' "$1"
}
Source your bashrc (. ~/.bashrc) and then you’ll be able to call it like this:
$ awks steps.txt
and…voilà!
Min: 1124
Max: 11100
Mean: 6013


Getting some other stats like the median or other percentiles with awk does get a bit more complicated and probably not very efficient. But just for demonstration purposes, to calculate the median you should first sort the values in your file, then parse it – storing each value into an array – and eventually pick the median based on whether the total length of your list is odd or even:
$ sort -n --parallel=[num_of_cores] steps.txt | awk '{vals[NR] = $1} END{if (NR % 2) {print vals[(NR + 1) / 2]} else { print(vals[(NR / 2)] + vals[(NR / 2) + 1]) / 2.0}}'
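The same sort-then-index trick extends to other percentiles. Here is a hedged sketch for an arbitrary percentile p (90 in this example), using the simple nearest-rank definition — note that tools like numpy interpolate by default, so their results may differ slightly:

```shell
sort -n steps.txt | awk -v p=90 '
    {vals[NR] = $1}
    END {
        # nearest-rank percentile: index = ceil(p/100 * N)
        idx = int(p / 100 * NR)
        if (idx < p / 100 * NR) idx++
        if (idx < 1) idx = 1
        print vals[idx]
    }'
```

For the steps file above this picks the 7th of 7 sorted values, i.e. the largest one.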
Hope you have an aw…esome experience with awk! 🙂
