A journey into the unknown with Tech Leaders

In my previous post I mentioned that I got into the Tech Leaders program. Now, after a few months, I would like to share my experience of doing something I had never thought I would be able to do.

Tech Leaders is a mentoring program for women who want to change or develop their career path in IT. In every edition you choose a mentor you will work with for the next 4 months. Mentors are professionals who can help you with your career problem, teach you something, show you what their work looks like, and much more. But it is you who should define your expectations and the framework of the collaboration, so that you achieve and learn as much as possible. For more information you can visit the following sites – and I strongly encourage you to follow this idea of mentoring:

techleaders.eu

womenintechnology.pl

facebook.com/techleadersWIT

I worked together with Monika, who is an analyst, big data system designer and Python developer with 10 years of experience in IT. She was the perfect choice for my aims, because I wanted to learn more about the data analyst profession and I had already started learning to use Python for data analysis.

I developed a project around data collected from Twitter. The aim was to check which ski brand was the most popular in users’ tweets during the last winter season. I identified 14 downhill ski brands. I found the Python libraries tweepy and peewee and used them to write a program that tracks all tweets containing defined keywords – for example “Rossignol”, “Salomon”, “Atomic” or “Head”, simply the names of the ski brands – and saves them in a database. I also had to define additional helper keywords connected to skiing, because “atomic” can refer, for example, to a bomb, and “blizzard” can be understood as a snowstorm. You can’t even imagine how many tweets per second you get for the keyword “head”! I stored the collected tweets in a database built in SQLite, and in the end I prepared a report in Excel using various formulas. So, as you can see, I got to use many different tools: Python as the programming language, PyCharm as the IDE, SQLite for the database, Datum as a program for working with databases, and Excel for data presentation. A rough sketch of how the tracking program fits together is shown below.
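For the curious, here is a minimal sketch of how such a tracker can be wired together. This is only my illustration, not the project’s actual code: the credentials are placeholders, the field names are my own, and I assume the 2016-era tweepy streaming API (StreamListener) together with peewee’s SQLite backend.

import tweepy
from peewee import SqliteDatabase, Model, CharField, TextField

# Placeholder credentials -- obtained by registering an app with Twitter.
CONSUMER_KEY, CONSUMER_SECRET = 'xxx', 'xxx'
ACCESS_TOKEN, ACCESS_SECRET = 'xxx', 'xxx'

# The tracked keywords: ski brand names (4 of the 14 shown here).
BRANDS = ['Rossignol', 'Salomon', 'Atomic', 'Head']

db = SqliteDatabase('tweets.db')

class Tweet(Model):
    # One row per captured tweet: its text and the brand it matched.
    text = TextField()
    brand = CharField()

    class Meta:
        database = db

class BrandListener(tweepy.StreamListener):
    def on_status(self, status):
        # Store each incoming tweet under the first brand it mentions.
        for brand in BRANDS:
            if brand.lower() in status.text.lower():
                Tweet.create(text=status.text, brand=brand)
                break

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on rate limiting (420).
        return status_code != 420

db.connect()
db.create_tables([Tweet], safe=True)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth, BrandListener())
stream.filter(track=BRANDS)  # blocks and saves matching tweets as they arrive

Once the database has filled up, a simple grouped count over the brand column (e.g. with peewee’s fn.COUNT) is enough to rank the brands by popularity.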

Apart from learning a lot about collecting and analysing data, I picked up a few rules of thumb. A database is not a pharmacy: it has no ready-made remedies for all your problems, and sometimes you have to find a solution on your own. Programming is not only writing code; it is also reading library documentation – a lot of it. And never delete data: always work on copies and always keep the original data, because you may regret it if you don’t.

In the next post I will present my findings, which are really puzzling and instructive. But before that, I would like to share these step-by-step instructions for the Twitter streaming project. Thank you Adil, this is great!

Posted in learning

I’ve got a new job #2

After several general comments I would like to present a small report on the whole recruitment process. I sent 77 applications over 1.5 months, which gives about 1.8 applications daily. This result can be improved at the next opportunity. Applications were sent once a week: every Thursday I searched for new offers from the previous 7 days, on Friday I sent the applications, and from Monday on I answered phone calls. I believe this system worked well and next time I will do it this way again.

I applied for various job positions. They can be grouped as follows: Data Analyst, Business Analyst / Project Manager (IT), Supply Chain (which is what I have been doing so far), IT SAP System Analyst, and Customer & Communication (roles related to e-commerce, SEO and content analysis). The percentage distribution of applications across those roles is given in the chart below.

[Chart: percentage distribution of applications by role]

I decided that a success is any positive contact from a company. In practice, I counted only phone calls and invitations to an interview. At the beginning I also counted email answers, but mostly these were automatically generated notifications that my application had been received or hadn’t made it to the next stage. As I mentioned, it’s a nice habit, but it has the same consequences as a lack of response.

‘No’ answers and lack of an answer are marked in red in the charts. Blue marks any positive contact from the company. Additionally, ‘yes’ answers are distinguished in the first chart.

[Charts: application outcomes per role – red for ‘no’ or no answer, blue for positive contact, with ‘yes’ answers distinguished in the first chart]

Honestly, it’s difficult to compare the results with each other, because the numbers of applications sent in each job group differ substantially. There is a lot of coincidence here, and the response rates don’t mean much. Next time I will prepare such a report again, though, and then we will see whether anything has changed once I have created a new CV and gained more experience in data analysis.

Regarding the ‘yes’ answers: in the field of data analysis I got into Tech Leaders, a mentoring programme for women. I’m going to work on an analytical project with my mentor. It’s not a job in the strict sense, but it certainly lets me spread my wings. In the field of supply chain I got a regular job. It’s not exactly the direction in which I want to develop, but at least, when it comes to working in Excel, there will be a lot to learn.

Posted in career

I’ve got a new job!

For the last 2 months I was looking for a new job. This time was full of motivating and instructive experiences. I’ve noticed that over the past few years, while I had a job and didn’t take part in recruitment processes, the culture of recruitment moved forward significantly. Many companies now have their own recruitment services, through which you can apply on your own. If your qualifications aren’t suitable for the role, you get a notification that your application won’t be processed this time. That way you don’t waste your time thinking and hoping for an interview.

But there were also things I didn’t like very much – for example, the working hours of HR departments. Regrettably, all interviews were scheduled during normal business hours. Do they really think people will take holiday leave from their current job to go to an interview that guarantees nothing? It’s absurd!

Unfortunately, a lot of companies are looking for candidates to start asap. It’s not my first job, I already have professional experience, and I am really surprised that at this stage I am still treated like a student who can start work immediately. People have notice periods, though! If you have been working in one place for longer, it can be as much as 3 months. Maybe you should change jobs every 3 years…

On the other hand, a lot of job offers hang on job websites for weeks or months. You have applied, but there is no answer. You don’t know whether someone just forgot about the offer and it renews itself automatically every month, or whether it’s like a friend once told me: her team is terribly understaffed and they can hardly cope, but the manager believes no one is good enough and keeps waiting for the perfect candidate. Yeah, every reasonably savvy person would already have become a perfect employee after all these months.

These are a few of my general comments on the whole recruitment experience. More details and more positive findings can be found in my report on the applications I sent, which I encourage you to read! 🙂

Posted in career

My Top Statistics Blogs

I have many blogs in my Reeder application (which I recommend!). Most of them are about new technologies, outdoor sports, cosmetics for babies and mothers… But I also want to read about statistics regularly. This is a list of my personal favorites. Although some of them are no longer updated, they are still worth reading.

Stats Blogs
Simply Statistics
Bad Science
Flowing Data
Blog about Stats
Information is Beautiful
The Analysis Factor
Stats With Cats
dataists
Information Aesthetics
Lies and Stats
Permutations
Radford Neal’s Blog
Stats Make Me Cry

Posted in data analysis, learning

Statistics in Python: Measures of Dispersion

In this article I would like to concentrate on 4 main measures of statistical dispersion: range (together with the smallest and the largest value), average absolute deviation, variance and standard deviation. In Python, we can easily compute them with a few functions.

Range

To compute the range, we have to determine the smallest and the largest value in the data set. The range is the difference between them.

def range_min_max(abclist):
    smallest = abclist[0]
    largest = abclist[0]
    range_of_values = 0
    for item in abclist[1:]:
        if item < smallest:
            smallest = item
        elif item > largest:
            largest = item
    range_of_values = largest - smallest
    return smallest, largest, range_of_values

This function returns the smallest number, the largest number and the range. We assume that the first value in the collection (abclist[0]) is both the smallest and the largest value; at this stage the variable range_of_values is 0. Then, for each remaining item in the list, we check whether it is smaller than the currently stored smallest value – if so, we save it as the smallest. Next, we check whether the item is greater than the current largest – if so, we save it as the largest. All other cases we simply skip. Finally, we calculate range_of_values and return all three values.
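As an aside – my addition, since the point of this article is computing the measures by hand – Python’s built-in min() and max() can be used to confirm the result for the crater_diameter list used later in this article:

print min(crater_diameter), max(crater_diameter), max(crater_diameter) - min(crater_diameter)   # 45 82 37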

Average Absolute Deviation, Variance and Standard Deviation

These three measures are interrelated, so I present them as a “cascade” of functions. To calculate them, we also need the function that computes the mean.

def mean(datalist):
    total = 0
    mean = 0
    for item in datalist:
        total += item
    mean = total / float(len(datalist))
    return mean

def avg_dev(thislist):
    average = mean(thislist)
    sum_of_dev = 0
    avg_dev = 0
    for item in thislist:
        sum_of_dev += abs((average - item))
    avg_dev = sum_of_dev / len(thislist)
    return avg_dev

def variance(thatlist):
    average = mean(thatlist)
    sum_of_sqrt_dev = 0
    variance = 0
    for item in thatlist:
        sum_of_sqrt_dev += (average - item) ** 2
    variance = sum_of_sqrt_dev / len(thatlist)
    return variance

def std_dev(anotherlist):
    std_dev = variance(anotherlist) ** 0.5
    return std_dev

The average absolute deviation is the arithmetic average of the absolute differences between the values and the mean. From each value we subtract the arithmetic mean of the collection (or vice versa – it doesn’t matter, because we take the absolute value), then we sum all the differences; both actions are represented by the statement sum_of_dev += abs(average - item). We then take the arithmetic average of that sum and return the result.

The variance is the arithmetic average of the squared deviations of the values from the mean. It is calculated similarly to the average deviation, with the difference that the deviations from the mean are squared.

Thus, computing the standard deviation is a trivial operation: its value is simply the square root of the variance, here expressed as the variance raised to the power of 0.5.

Interpretation of Results

After calling all of the functions with the list crater_diameter and printing the results:

crater_diameter = [46, 51, 49, 82, 74, 63, 49, 70, 48, 47, 79, 48, 52, 55, 49, 51, 58, 82, 72, 45]

print range_min_max(crater_diameter)
print avg_dev(crater_diameter)
print variance(crater_diameter)
print std_dev(crater_diameter)

we should get the following output:

(45, 82, 37)
11.25
161.45
12.7062976...

Measures of dispersion tell us how spread out the values in a data set are. Our collection of crater diameters has a range of 37 km. The average deviation from the mean diameter is 11.25 km and the standard deviation is about 12.71 km. The variance amounts to 161.45 km²; however, the interpretation of variance is a problem because of the squared unit – the variance is useful mainly in comparative studies of data sets.
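As a quick sanity check – my addition, using numpy, which earlier articles already relied on – the library’s functions reproduce the same population measures:

import numpy as np

data = np.array(crater_diameter)
print data.min(), data.max(), data.max() - data.min()  # 45 82 37
print np.mean(np.abs(data - data.mean()))              # 11.25 (average absolute deviation)
print data.var()                                       # 161.45 (population variance, ddof=0)
print data.std()                                       # 12.706... (population standard deviation)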

Posted in data analysis, python

Statistics in Python: Central Tendency

There are 3 most common measures of central tendency: the mean (arithmetic average), the median and the mode. Their manual calculation in Python is presented below.

Mean

The arithmetic mean is the sum of a collection of numbers divided by the count of numbers in the collection. We compute the average manually with the mean() function, which iterates through a list, sums all the numbers in it, divides the sum by the length of the list and returns the result.

def mean(datalist):
    total = 0
    mean = 0
    for item in datalist:
        total += item
    mean = total / float(len(datalist))
    return mean

Median

The median, as the middle value of a sequence of numbers, is a little more complicated to compute. There are 2 different ways to get the result, depending on whether the number of elements in the sequence is even or odd. If the collection consists of an even number of elements, the median is the average of the two middle elements; if it is odd, the median is simply the middle element.

def median(datalist):
    numsort = sorted(datalist)   # work on a sorted copy
    mid = len(numsort) // 2      # index of the middle element
    if len(numsort) % 2 == 0:
        # even number of elements: average the two middle values
        median = (numsort[mid - 1] + numsort[mid]) / 2.0
    else:
        # odd number of elements: take the single middle value
        median = numsort[mid]
    return median

The function median() creates a new, sorted list named numsort. Next, it computes the variable mid – the length of the list divided by 2 with integer division – which points at the middle of the list.
If the number of elements in the list is even, the median is the sum of the two middle values divided by 2.0 (so that the result is a float). If the number of elements is odd, the median is simply the element with the middle index mid.

Mode

The mode is the most frequently occurring number in the list. To compute it, we can reuse the function frequency_distribution(), which already collects the data we need. Instead of printing the frequency distribution, we create a function mode() which iterates through the keys of the dictionary, looks for the highest frequency and returns its key:

def frequency_distribution(datalist):
    freqs = dict()
    for item in datalist:
        if item not in freqs:
            freqs[item] = 1
        else:
            freqs[item] += 1
    return freqs

def mode(datalist):
    d = frequency_distribution(datalist)
    most_often = 0
    mode = 0
    for item in d.keys():
        if d[item] > most_often:
            most_often = d[item]
            mode = item
    return mode   # note: if several values tie, only one of them is returned
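As an aside (my addition), the same lookup can be written more compactly with the built-in max() and a key function:

def mode(datalist):
    d = frequency_distribution(datalist)
    return max(d, key=d.get)   # the key with the highest count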

Interpretation of Results

After calling all of the functions with the list crater_diameter and printing the results:

crater_diameter = [46, 51, 49, 82, 74, 63, 49, 70, 48, 47, 79, 48, 52, 55, 49, 51, 58, 82, 72, 45]

print mean(crater_diameter)
print median(crater_diameter)
print mode(crater_diameter)

we should get the following output:

58.5
51.5
49

Measures of central tendency identify the central values of a data set. In our collection of crater diameters, the most frequent value is 49 km, the average diameter is 58.5 km, and half of the values are less than 51.5 km while half are greater.
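As a sanity check – my addition – numpy confirms two of the three values (numpy has no mode function, so that one stays manual):

import numpy as np

print np.mean(crater_diameter)    # 58.5
print np.median(crater_diameter)  # 51.5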

Posted in data analysis, python

Statistics in Python: Frequency Distribution

Previous articles concentrated on managing and visualizing data with the numpy and matplotlib.pyplot libraries. Now it is time to compute more statistics – manually, without built-in functions. To begin, let’s write some code to show the frequency distribution of the variable crater_diameter.


# the data set from the previous articles
crater_diameter = [46, 51, 49, 82, 74, 63, 49, 70, 48, 47, 79, 48, 52, 55, 49, 51, 58, 82, 72, 45]

def frequency_distribution(datalist):
    freqs = dict()
    for item in datalist:
        if item not in freqs:
            freqs[item] = 1     # first occurrence of this value
        else:
            freqs[item] += 1    # value seen before: increase its count
    return freqs

def print_dict(freqs):
    for item in freqs:
        print item, freqs[item]

result = frequency_distribution(crater_diameter)
print_dict(result)

Our variable crater_diameter is a list of values. We want the program to iterate through this list and output each value together with its frequency in the list. To do that, I defined the function frequency_distribution(), which takes a list as an argument. Inside it, I created the variable freqs, which is a dictionary – a dictionary is actually the best way to map values to their occurrences. Then, for each element of the list: if the element is not in the dictionary yet, we set its frequency to 1; otherwise, if the element has appeared before, its frequency increases by 1. At the end the function returns the dictionary freqs.

We also want to print the dictionary not as a standard dictionary, but in two columns. For that we need another function, print_dict(), which prints the key and the value for every key in the dictionary (the variable names can differ; here they are the same only to make the code easier to follow):

[Screenshot of the output: each distinct diameter printed next to its count – 49 occurs 3 times; 48, 51 and 82 occur twice; the remaining diameters occur once]
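As an aside (my addition), the standard library’s collections.Counter builds the same mapping in one line, which makes a handy cross-check of our manual function:

from collections import Counter

freqs = Counter(crater_diameter)
for item in sorted(freqs):
    print item, freqs[item]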

Posted in data analysis, python