In this article I would like to concentrate on four main measures of statistical dispersion: the range (together with the smallest and the largest value), average deviation, variance and standard deviation. In Python, each of them can be computed with a short function.
To compute the range, we have to find the smallest and the largest value in the data set. The range is the difference between them.
def range_min_max(abclist):
    # Assume the first value is both the smallest and the largest.
    smallest = abclist[0]
    largest = abclist[0]
    range_of_values = 0
    # Compare every remaining value with the current extremes.
    for item in abclist[1:]:
        if item < smallest:
            smallest = item
        elif item > largest:
            largest = item
    range_of_values = largest - smallest
    return smallest, largest, range_of_values
This function returns the smallest value, the largest value and the range. We start by assuming that the first value in the collection (abclist) is both the smallest and the largest. At this stage range_of_values is 0. Next, for each remaining item in the list we check whether it is less than the currently stored smallest value; if so, we save it as the smallest. Otherwise, we check whether the item is greater than the current largest; if so, we save it as the largest. All other items are skipped. Finally, we calculate range_of_values and return all three values.
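As a cross-check, Python's built-in min() and max() give the same three numbers in a couple of lines. This is only a minimal sketch, not a replacement for the loop above; the name range_builtin and the sample list are made up for illustration.

```python
def range_builtin(values):
    # Assumes `values` is a non-empty sequence of numbers;
    # min() and max() raise ValueError on an empty sequence.
    smallest = min(values)
    largest = max(values)
    return smallest, largest, largest - smallest


sample = [4, 9, 2, 7]  # illustrative data, not from the article
print(range_builtin(sample))  # (2, 9, 7)
```

Note that min() and max() each scan the list once, so this version makes two passes where the hand-written loop makes one.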
Average Absolute Deviation, Variance and Standard Deviation
These three measures are interrelated, so I present them as a “cascade” of functions. All of them rely on a function that computes the mean.
def mean(datalist):
    total = 0
    mean = 0
    for item in datalist:
        total += item
    mean = total / float(len(datalist))  # float() avoids integer division in Python 2
    return mean

def avg_dev(thislist):
    average = mean(thislist)
    sum_of_dev = 0
    avg_dev = 0
    for item in thislist:
        sum_of_dev += abs(average - item)  # absolute distance from the mean
    avg_dev = sum_of_dev / len(thislist)
    return avg_dev

def variance(thatlist):
    average = mean(thatlist)
    sum_of_sq_dev = 0
    variance = 0
    for item in thatlist:
        sum_of_sq_dev += (average - item) ** 2  # squared distance from the mean
    variance = sum_of_sq_dev / len(thatlist)
    return variance

def std_dev(anotherlist):
    std_dev = variance(anotherlist) ** 0.5  # square root of the variance
    return std_dev
Average deviation is the arithmetic mean of the absolute differences between the values and the mean. From each value we subtract the arithmetic mean of the collection (or vice versa, since we take the absolute value), and then we sum all the differences; both steps happen in the statement sum_of_dev += abs(average - item) inside the loop of avg_dev. Finally, we divide this sum by the number of values and return the result.
The variance is the arithmetic mean of the squared deviations of the values from the mean. It is calculated like the average deviation, except that the differences between the values and the mean are squared instead of being taken as absolute values.
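In symbols, the two definitions so far can be written as below. The notation is introduced here for illustration and does not appear in the code: x_1, ..., x_n are the values, \bar{x} is their mean, AD is the average deviation and \sigma^2 is the variance.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\mathrm{AD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

Both measures divide by n, the number of values, which is what avg_dev and variance do with len().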
Computing the standard deviation is then trivial: it is simply the square root of the variance, written here as the variance raised to the power 0.5.
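If Python 3 is available (an assumption, since the code in this article also runs under Python 2), the standard library's statistics module can serve as a sanity check. The sketch below restates the variance calculation in a compact, hypothetical helper called pop_variance and compares it with statistics.pvariance and statistics.pstdev on made-up data.

```python
import math
import statistics


def pop_variance(values):
    # Compact restatement of the article's variance(): divide by n.
    average = sum(values) / float(len(values))
    return sum((average - item) ** 2 for item in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, mean is 5

# Population formulas (divide by n) agree with the hand-written version.
print(math.isclose(pop_variance(data), statistics.pvariance(data)))      # True
print(math.isclose(pop_variance(data) ** 0.5, statistics.pstdev(data)))  # True

# Sample formulas (divide by n - 1) give a larger value for the same data.
print(statistics.variance(data) > statistics.pvariance(data))            # True
```

The comparison matters because statistics.variance and statistics.stdev use the sample formulas, so they would not match the functions in this article, which divide by the number of values.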
Interpretation of Results
After printing the results of all the functions for the list crater_diameter:
crater_diameter = [46, 51, 49, 82, 74, 63, 49, 70, 48, 47,
                   79, 48, 52, 55, 49, 51, 58, 82, 72, 45]

# The parenthesised form of print works in both Python 2 and Python 3.
print(range_min_max(crater_diameter))
print(avg_dev(crater_diameter))
print(variance(crater_diameter))
print(std_dev(crater_diameter))
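we should see the tuple (45, 82, 37), that is the smallest value, the largest value and the range, followed by the average deviation, the variance and the standard deviation, one per line.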
Measures of dispersion tell us how spread out the values in a data set are. Our collection of crater diameters has a range of 37 km. The average deviation from the mean diameter is 11.25 km and the standard deviation is about 12.71 km. The variance amounts to 161.45 km², but it is hard to interpret because of the squared unit, so it is useful mainly in comparative studies of data sets.