Statistics in Python: Central Tendency

Three measures of central tendency are used most often: the mean (arithmetic average), the median, and the mode. The sections below show how to calculate each of them by hand in Python.

Mean

The arithmetic mean is the sum of a collection of numbers divided by how many numbers the collection contains. The mean() function below computes it manually: it iterates through a list, adds up all the numbers, divides that sum by the length of the list, and returns the result.

def mean(datalist):
    # Add up every item in the list.
    total = 0
    for item in datalist:
        total += item
    # float() forces true division, also under Python 2.
    return total / float(len(datalist))
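As a quick check, the same calculation can be written more compactly with the built-in sum(). This is only a sketch: mean_compact is a hypothetical name that does not appear elsewhere in this post, and it assumes the list is not empty.

```python
def mean_compact(datalist):
    # Assumes datalist is non-empty; an empty list raises ZeroDivisionError.
    return sum(datalist) / float(len(datalist))

print(mean_compact([2, 4, 9]))  # 15 / 3.0 -> 5.0
```

The explicit loop and sum() give the same result; the loop simply makes each step visible.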

Median

The median, the middle value of a sorted sequence of numbers, is a little more complicated to compute. There are two cases, depending on whether the sequence has an odd or an even number of elements. If the count is odd, the median is the single middle element; if it is even, the median is the average of the two middle elements.

def median(datalist):
    # Work on a sorted copy so the caller's list is left unchanged.
    numsort = sorted(datalist)
    # Floor division keeps mid an integer, so it can be used as an index.
    mid = len(numsort) // 2
    if len(numsort) % 2 == 0:
        # Even count: average the two middle values.
        median = (numsort[mid - 1] + numsort[mid]) / 2.0
    else:
        # Odd count: take the single middle value.
        median = numsort[mid]
    return median

The function median() creates a new, sorted list named numsort. Next, it creates a variable mid, the length of the list divided by 2 and rounded down, which marks the middle of the list.
If the number of elements in the list is even, the median is the sum of the two values at the middle indices, divided by 2.0 so that the result is a float. If the number of elements is odd, the median is the value at the middle index.
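To see which positions count as "the middle" in each case, here is a small sketch. The helper middle_indices() is hypothetical and only illustrates the index arithmetic; it is not one of the functions used in the rest of this post.

```python
def middle_indices(n):
    # Hypothetical helper: positions of the middle element(s)
    # in a sorted list of length n.
    mid = n // 2
    if n % 2 == 0:
        return [mid - 1, mid]   # even length: two middle positions
    return [mid]                # odd length: one middle position

odd = sorted([7, 1, 5])        # [1, 5, 7]
even = sorted([7, 1, 5, 3])    # [1, 3, 5, 7]

print(middle_indices(len(odd)))    # [1]    -> median is odd[1] = 5
print(middle_indices(len(even)))   # [1, 2] -> median is (3 + 5) / 2.0 = 4.0
```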

Mode

The mode is the value that occurs most often in the list. To compute it, we can reuse the function frequency_distribution(), which collects the counts we need. Instead of printing the frequency distribution, we create a function mode() that iterates through the keys of the dictionary, looks for the highest count, and returns the corresponding key:

def frequency_distribution(datalist):
    freqs = dict()
    for item in datalist:
        if item not in freqs:
            freqs[item] = 1
        else:
            freqs[item] += 1
    return freqs

def mode(datalist):
    d = frequency_distribution(datalist)
    most_often = 0
    mode = 0
    for item in d.keys():
        if d[item] > most_often:
            most_often = d[item]
            mode = item
    return mode
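Note that mode() returns a single value, so when several values share the highest count only one of them is reported. If all of them are needed, one possible approach is sketched below using collections.Counter from the standard library. The name all_modes is hypothetical, and the sketch assumes a non-empty list.

```python
from collections import Counter

def all_modes(datalist):
    # Count how often each value occurs.
    counts = Counter(datalist)
    # Assumes datalist is non-empty; max() fails on an empty sequence.
    highest = max(counts.values())
    # Keep every value whose count equals the highest count.
    return sorted(value for value, count in counts.items() if count == highest)

print(all_modes([1, 2, 2, 3, 3]))  # [2, 3]
print(all_modes([4, 4, 5]))        # [4]
```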

Interpretation of Results

After printing the result of each function for the list crater_diameter:

crater_diameter = [46, 51, 49, 82, 74, 63, 49, 70, 48, 47, 79, 48, 52, 55, 49, 51, 58, 82, 72, 45]

print(mean(crater_diameter))
print(median(crater_diameter))
print(mode(crater_diameter))

we should get the following output:

58.5
51.5
49

Measures of central tendency identify the central values in a data set. In our collection of crater diameters, the most frequent value is 49 km and the average diameter is 58.5 km. Half of the values are less than 51.5 km and half are greater.
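These results can be cross-checked against the standard library. The sketch below assumes Python 3.4 or newer, where the statistics module is available; it is not part of the manual approach shown above.

```python
import statistics

crater_diameter = [46, 51, 49, 82, 74, 63, 49, 70, 48, 47,
                   79, 48, 52, 55, 49, 51, 58, 82, 72, 45]

# Each call should agree with the corresponding hand-written function.
print(statistics.mean(crater_diameter))    # 58.5
print(statistics.median(crater_diameter))  # 51.5
print(statistics.mode(crater_diameter))    # 49
```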
