Are My Experiences Similar to Those of Others?
As I set out to complete my fourth project under the Udacity Data Science for Enterprise course, I was not sure where to start. I had multiple example datasets provided to me as well as the ability to choose any dataset I could find. The possibilities seemed limitless.
However, I decided to stick with the base Stack Overflow Survey dataset from 2017. After reviewing the metadata more deeply I saw that there were some very interesting questions asked in the survey which made me think back to my own experiences as a software developer. I was very curious to see whether the data was in line with my own experiences or not.
What Stories Does the Data Hold?
I decided to see if I could use the art of data science to answer three questions related to my experience:
- What level of education do parents of self-taught professional developers have?
- How well can we predict an individual’s highest level of formal education? Which survey answers correlate well with it?
- How well can we predict whether an individual sees the world in “black and white” or not? Which survey answers correlate well with it?
Note that I recognize the dataset itself may be limited in its ability to help me answer these questions and may even be biased towards a certain type of person because it comes from a single data source (Stack Overflow). I will talk through other risks and assumptions with the data throughout this piece but will set these overall concerns aside, as they are not something I can influence through data science on this single dataset.
Through the remainder of this post I will walk through why these questions were interesting to me and my evaluation of what I was able to learn via the data science process.
Does Parents’ Education Level Influence Self-Taught Developers?
I am a self-taught (former) professional software developer. I started developing with BASIC, Visual Basic and HTML circa 1993–1994 (~8–9 years old). I was able to do this because I had access to technology that many others in my area did not:
- I had access to a desktop computer because my mother needed one for her job (elementary school teacher) and a laptop computer because my father needed one for his job (in the insurance industry).
- I had access to the Internet because my mother was provided with home access for free as part of her job.
I have long assumed that my ability to become a self-taught developer was completely driven by the circumstances of my parents’ occupations at that time as access to those items had not spread to the majority of people.
NOTE: I recognize that education level is not the (sole) driver of occupation and that stating otherwise would be a fallacy. However, being a school teacher in the United States at that time did require a master’s degree (this may not have been true earlier and may change in the future), so in my case there was a direct link between the two.
Therefore, I had two key reasons for asking this question:
- With computers and Internet access spread to many more households now than they were when I was young, are there more self-taught developers whose parents have lower education levels?
- Is there any data to show that my assumption was wrong all along and I was incorrect to make it when looking at a dataset of size 1 (just myself)?
After looking at the data for self-taught developers (graph to the left) and a similar graph for all respondents, I found that the overall shape is very similar but that there were two differences I wanted to dig into.
In the self-taught responses…
- The % of those whose parents have a bachelor’s degree or master’s degree (higher education) is higher for self-taught developers than for all respondents.
- The % of those whose parents have primary / elementary school education (basic formal education) is also higher for self-taught developers than for all respondents.
It *seems* that these two observations tie into my reasons for asking the question in the first place. To my first reason, the increase in the % of those whose parents have primary / elementary school education only would speak to the availability of computers and Internet access for the majority in recent years. To my second reason, the higher % of those whose parents have a bachelor’s degree or master’s degree would validate my assumption that parents with higher education (or certain occupations) provide an easier environment for self-taught developers to start.
Now, this all could be confirmation bias on my part and so I would be interested to hear thoughts from others :)
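For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two distributions could be put side by side with pandas. The column names `EducationTypes` and `HighestEducationParents`, the answer strings, and the tiny stand-in DataFrame are all assumptions made for illustration; they should be checked against the survey's schema file. Restricting the data to professional developers would be one extra filter of the same kind.

```python
import pandas as pd

# Tiny stand-in for the survey; column names and answer strings are assumptions.
survey = pd.DataFrame({
    "EducationTypes": [
        "Self-taught; Online course",
        "Bootcamp",
        "Self-taught",
        "Online course",
        None,
        "Self-taught; Hackathon",
    ],
    "HighestEducationParents": [
        "A master's degree",
        "High school",
        "Primary/elementary school",
        "A bachelor's degree",
        "High school",
        "A master's degree",
    ],
})


def parent_education_share(df: pd.DataFrame) -> pd.Series:
    """Share of respondents at each parental education level (sums to 1)."""
    return df["HighestEducationParents"].value_counts(normalize=True)


# Respondents who list "Self-taught" among their (multi-select) education types.
is_self_taught = survey["EducationTypes"].str.contains("Self-taught", na=False)

comparison = pd.DataFrame({
    "self_taught": parent_education_share(survey[is_self_taught]),
    "all_respondents": parent_education_share(survey),
}).fillna(0.0)

# Positive values mean the level is over-represented among the self-taught.
comparison["difference"] = comparison["self_taught"] - comparison["all_respondents"]
print(comparison.sort_values("difference", ascending=False))
```

Comparing shares rather than raw counts matters here because the self-taught group is only a subset of all respondents, so its counts will be smaller at every level.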
What Self-Supplied Information About Developers Would Tell Us More About Them?
My next two questions were focused on trying to determine if we could use other data points in the survey to understand the respondents better: in terms of their highest education level and whether they see the world in a purely “right / wrong” context.
I was really interested in the third question as, when I was younger and at the height of my software development period, I typically saw the world in “right / wrong” and “if then, else” context. Is this a standard mindset of developers or are there specific criteria that can be used to determine if this will be a developer’s mindset or not?
I first spent some time looking at the data and manipulating it where necessary.
- There were no null values in the FormalEducation column (I assume this was a required field) but there were many null values in the RightWrongWay column. Therefore, when trying to answer the third question, I dropped all rows from the dataset that had null for the RightWrongWay column.
- For the other columns (those I was not trying to solve for at the time) I filled the numeric ones with the mean of that column and dummied the categorical ones.
- Finally, I dropped the columns with more than 10 null values (I played around with this threshold but it did not affect the accuracy and r2 scores much).
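The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code behind the results: the helper `prepare`, the toy DataFrame and its `Country` and `MostlyEmpty` columns are hypothetical, and the sparse columns are dropped before the mean fill so that the null counts refer to the raw data.

```python
import numpy as np
import pandas as pd


def prepare(df: pd.DataFrame, target: str, max_nulls: int = 10):
    """Return (X, y) for modelling; a hypothetical helper for illustration."""
    # 1. Keep only rows where the target question was actually answered.
    df = df.dropna(subset=[target])
    y = df[target]
    X = df.drop(columns=[target])

    # 2. Drop feature columns with more than `max_nulls` missing values.
    X = X.loc[:, X.isnull().sum() <= max_nulls].copy()

    # 3. Fill the remaining numeric gaps with each column's mean.
    numeric_cols = X.select_dtypes(include="number").columns
    X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].mean())

    # 4. One-hot encode ("dummy") every non-numeric column.
    categorical_cols = X.columns.difference(numeric_cols)
    X = pd.get_dummies(X, columns=list(categorical_cols))
    return X, y


# Toy data; only RightWrongWay and Salary are names taken from the survey.
toy = pd.DataFrame({
    "RightWrongWay": ["Agree", None, "Disagree", "Agree", "Strongly agree"],
    "Salary": [50000.0, 60000.0, np.nan, 70000.0, 90000.0],
    "Country": ["US", "DE", "US", None, "IN"],
    "MostlyEmpty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# A threshold of 1 stands in for the 10 used on the full dataset.
X, y = prepare(toy, target="RightWrongWay", max_nulls=1)
print(X)
```

One caveat with this approach: a row with a missing categorical answer simply gets zeros in all of that column's dummy columns, so "did not answer" is not modelled as its own category.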
For both questions, I was not able to get a model that predicted the values well.
- For the second question, my model’s accuracy was ~40% and the r2 score was negative.
- For the third question, my model’s accuracy was ~30% and the r2 score was also negative.
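A negative r2 score can look odd next to a non-zero accuracy, so here is a small sketch of how the two can diverge. It assumes the answer categories were encoded as ordered integers before scoring, which may not match how the scores above were produced; the `r2` and `accuracy` helpers and the sample values are purely illustrative.

```python
import numpy as np


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true label exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Hypothetical ordinal codes for an education level (0 = lowest, 4 = highest).
y_true = np.array([0, 1, 2, 3, 4])
y_pred = np.array([4, 1, 2, 3, 0])  # three exact hits, two large misses
baseline = np.full_like(y_true, y_true.mean(), dtype=float)

print("accuracy:", accuracy(y_true, y_pred))
print("r2 of predictions:", r2(y_true, y_pred))
print("r2 of mean baseline:", r2(y_true, baseline))
```

The r2 score drops below zero whenever the squared error of the predictions exceeds that of always guessing the mean, so a few badly wrong predictions can push it negative even when a fair share of the labels are exactly right.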
As in most situations, there are two reasons that the scores could be so low:
- I built poor models.
- The other survey answers simply do not correlate strongly enough with the answers I was trying to predict.
I assume that in this situation it is a mix of the two: a professional data scientist could probably build a better model than I did but there is probably also a strong element (given the models’ very poor scores) of the data points not being able to actually predict the answers.
I did go into this a bit deeper and look at the coefficients / features that impacted each model the most to see if there was anything interesting there.
For the second question, the top two coefficients were ExpectedSalary and Salary (current salary). This *seems* to make sense as those with more education would expect, and probably receive, higher salaries. Beyond that, the coefficients seem to be rather meaningless.
For the third question, the top coefficient is once again ExpectedSalary and then the remainder seem to be rather meaningless. I do not know what I can read into this so I won’t make any conclusions ;)
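As a rough illustration of ranking features by coefficient, the sketch below fits an ordinary least squares model with NumPy and sorts the coefficients by absolute size. Everything in it is an assumption: the data is synthetic and generated on the spot, the `Noise` column is made up, and the model may differ from the one used above. The features are standardized first because raw coefficients are only comparable when the features share a scale.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; none of this comes from the survey.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "Salary": rng.normal(60000, 15000, n),
    "ExpectedSalary": rng.normal(50000, 10000, n),
    "Noise": rng.normal(0, 1, n),
})

# Made-up target: driven strongly by Salary, weakly by ExpectedSalary.
target = (
    2.0 * (features["Salary"] - 60000) / 15000
    + 0.5 * (features["ExpectedSalary"] - 50000) / 10000
    + rng.normal(0, 0.1, n)
)

# Standardize so each coefficient reads as "effect of one standard deviation".
standardized = (features - features.mean()) / features.std()

# Ordinary least squares with an intercept column of ones.
design = np.column_stack([np.ones(n), standardized.to_numpy()])
solution, *_ = np.linalg.lstsq(design, target.to_numpy(), rcond=None)

# Drop the intercept and order the rest by absolute magnitude.
coefficients = pd.Series(solution[1:], index=features.columns)
order = coefficients.abs().sort_values(ascending=False).index
ranking = coefficients.reindex(order)
print(ranking)
```

Even with standardized features, a large coefficient in a model that scores poorly overall says little, which is a good reason to treat the rankings above with caution.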
What Does it All Mean?
This was a great opportunity to put the data science process to work on a dataset that was interesting to me and one where I came up with the questions myself.
As is normal, the “data does not lie” but my interpretations of it and the way my models were set up may be incorrect or biased. That point tempers my conclusions and should be kept in mind when reading the conclusions below.
What I believe I saw from these results is that…
- For question one, my assumption appears to be a valid one, and it also appears that more people whose parents are not highly educated now have the access needed to become self-taught developers.
- For question two, it does not appear that the survey questions offer great data points for determining the developer’s highest education level. However, the Salary and ExpectedSalary of the respondent may be the strongest indicators in the dataset.
- For question three, it does not appear that the survey questions offer great data points for determining whether the developer sees the world from a purely “right / wrong” perspective, which means that seeing the world this way is probably not inherent to software developers.
Thanks for reading and I would love to hear feedback from the experts!