Project Two: Tech Industry Survey Data Exploratory Analysis
| Pandas | Numpy | Plotly |
Stack Overflow does an extensive annual survey of its users, providing substantial data on developers. The data includes self-reported stats for Education Level, Compensation, Country of residence, Employment Status, and other information such as preferences related to AI and favorite programming languages. This developer survey is cleaned by Stack Overflow and released as public information, downloadable here.
You can check out this project on GitHub.
Key Insights and Takeaways:
📉 There is a negative correlation between education level and unemployment rate
📈 There is a positive correlation between education level and median salary
📈 There is a strong positive correlation between age and median salary
📈 The median salary for remote workers is 68% higher than that of in-person workers
🏆 The median salary for developers is highest in the United States
🧪 Data Scientists in the US make roughly 59% more per year than Data Analysts
Project Brief:
As someone entering the data analytics/science field, I wanted to ensure that the thousands of hours I’m investing in these skills are well worth it. Coming from a financial background, I’m keen to determine if this switch will be economical and provide enough reward for the risk of switching industries. In particular, I’m curious what variables maximize earnings and where to focus my studies to efficiently plan out a career path that will benefit my family. End goal, I would love to be a data scientist and innovate in the field of AI, but I’m curious specifically what some stepping stones are that would lend a helping hand along the way.
Questions I would like answered:
How is higher education compensated? Is it worth pursuing my master’s or Ph.D?
Does education guarantee a job? What are the unemployment rates among different degrees?
Which professions within the tech industry have the highest compensation, and where do data jobs fall?
In which country are developers paid the most?
Are there specific jobs that appear to pay more for doing less? (The American Dream?)
Data Overview:
The dataset comprises 65,437 respondents and 113 questions. The survey collected data on the developer’s age, years of experience, salary, and employment status (Employed, Student, Independent Contractor, Full-time, Part-time), along with their industry, job title, company size, country of origin, developer type, education level, and preferences regarding certain languages and the use of AI. My tools of choice for this project are Pandas and Plotly, as I wasn’t concerned with creating an interactive dashboard, and I was fine with creating a static report.
df.shape
(65437, 114)
Data Cleaning:
Thanks to Stack Overflow, the data was mostly prepped out of the box. One small thing I had to add was a new column called “Employed”, derived from the Employment column, to make the data easier to work with when calculating the unemployment rate. The raw responses in the Employment column are hard to work with directly because respondents could select many combinations, such as “Employed, Full-Time”, “Student, Full-Time”, and “Independent contractor, freelancer, or self-employed”.
To solve this, I created a function that searches each Employment response for keywords indicating employment. If a response contains any of them, I mark it as 1; if not, I mark it as 0.
import pandas as pd

def is_employed(status):
    """
    Determines if a respondent is employed based on their
    'Employment' status.
    Returns 1 if employed, 0 otherwise.
    Parameters:
        status (str): The employment status string from the 'Employment'
            column.
    Returns:
        int: 1 if employed, 0 if not employed.
    """
    if pd.isnull(status):
        # If the status is missing, treat as not employed.
        return 0
    # List of keywords that indicate employment.
    employed_keywords = [
        "Employed, full-time",
        "Employed, part-time",
        "Independent contractor, freelancer, or self-employed"
    ]
    # Check if any employed keyword is present in the status string.
    for keyword in employed_keywords:
        if keyword in status:
            return 1
    # If none of the keywords are found, treat as not employed.
    return 0
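Applying the function to build the new column is a one-liner with `Series.apply`. The sketch below is self-contained, so it re-declares a compact version of `is_employed`. It assumes multiple selections are stored in a single string per respondent, and the three sample rows are made up for illustration rather than taken from the survey.

```python
import pandas as pd

EMPLOYED_KEYWORDS = (
    "Employed, full-time",
    "Employed, part-time",
    "Independent contractor, freelancer, or self-employed",
)

def is_employed(status):
    """Return 1 if any employment keyword appears in the status string, else 0."""
    if pd.isnull(status):
        return 0
    return int(any(keyword in status for keyword in EMPLOYED_KEYWORDS))

# Illustrative rows only (not survey data).
sample = pd.DataFrame({
    "Employment": [
        "Employed, full-time;Student, part-time",
        "Student, full-time",
        None,
    ]
})

# Derive the binary 'Employed' column from the free-form 'Employment' column.
sample["Employed"] = sample["Employment"].apply(is_employed)
print(sample["Employed"].tolist())
```

Note that anything without an employment keyword, including students and blank answers, ends up as 0.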
Analysis:
How is education compensated?
An important consideration for me is the cost-benefit analysis of higher education. If I’m spending $30k or more per year to obtain a master’s, does the marginal increase in salary justify that? I’m not only considering the tuition as the cost as that is truly only a portion of the whole opportunity cost. While investing the time into getting higher degrees, you could be working within the field, potentially earning raises and entering into seniority quicker. But that would require a present value of lifetime earnings calculation, which we won’t be getting into in this project, so we’ll be doing a surface level analysis and gut feelings.
To visualize this, I used Plotly to create a bar chart depicting the median salaries grouped by education type.
import plotly.express as px

# Remove rows with missing compensation or education level
df_median = df.dropna(subset=['ConvertedCompYearly', 'EdLevel'])

# Group by 'EdLevel' and calculate the median compensation
median_comp = (
    df_median.groupby('EdLevel')['ConvertedCompYearly']
    .median()
    .reset_index()
    .rename(columns={'ConvertedCompYearly': 'MedianComp'})
    .sort_values('MedianComp', ascending=False)
)

# Define custom colors to match my website's palette
custom_bar_color = '#A89F91'  # Taupe/gray
background_color = '#E2DED3'  # Soft beige
font_color = '#333333'        # Dark gray for text

# Creating the bar chart
fig = px.bar(
    median_comp,
    x='MedianComp',
    y='EdLevel',
    orientation='h',
    color_discrete_sequence=[custom_bar_color],
    title='Median Yearly Compensation by Education Level',
    labels={
        'MedianComp': 'Median of Converted Compensation (Yearly)',
        'EdLevel': 'Education Level'
    },
    text='MedianComp'
)

# Format the bar labels as whole numbers with thousands separators
fig.update_traces(
    texttemplate='%{text:,.0f}',
    textposition='outside',
    textfont=dict(size=12, color=font_color)
)

# Update layout for aesthetics and theme matching
fig.update_layout(
    yaxis=dict(automargin=True, tickfont=dict(size=14, color=font_color)),
    xaxis=dict(tickfont=dict(size=14, color=font_color), title='Median of Converted Compensation (Yearly)'),
    title=dict(x=0.5, font=dict(size=22, color=font_color)),
    plot_bgcolor=background_color,
    paper_bgcolor=background_color,
    bargap=0.3,
    coloraxis_showscale=False,
    height=500,
    font=dict(family='Arial, sans-serif', color=font_color)
)

# Finally, show the chart
fig.show()
This visualization shines light on a couple of important notes:
Pursuing a Master’s degree purely for the pay bump is hard to justify in the development industry: the median salary is only about $1,000 higher than with a Bachelor’s, which is not worth the additional expense in both tuition and time spent out of the market.
Developers with a Ph.D or equivalent enjoy a median salary roughly $10,000 higher than those with a Bachelor’s or Master’s. The downside is that a Ph.D in computer science usually takes 4-7 years on top of a Master’s, so the marginal salary increase leaves something to be desired.
At this point in my life, I am comfortable pursuing a career without a Master’s or Ph.D. I would love to join academia later in life and will likely pursue a Ph.D then, perhaps after putting my kids through college and becoming financially comfortable. I do not believe a degree is a golden ticket into the tech industry; practical projects showcase an individual’s passion and skills far more effectively. For these reasons, I will hold off on higher education and focus on honing my skills with real-world applications.
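To put rough numbers on the Master’s trade-off, here is a simple payback sketch using the figures quoted in this section. The two-year program length is my own assumption, not something from the survey, and the calculation deliberately ignores forgone wages and discounting, both of which would only lengthen the payback period.

```python
# Figures quoted in this section.
tuition_per_year = 30_000        # "$30k or more per year"
salary_uplift_per_year = 1_000   # approximate Master's vs. Bachelor's median gap

# Assumption (not from the survey): a two-year program.
program_years = 2

total_tuition = tuition_per_year * program_years
payback_years = total_tuition / salary_uplift_per_year

print(f"Tuition: ${total_tuition:,}; simple payback: {payback_years:.0f} years")
```

Even under these generous simplifications, the tuition alone would take decades to recover from the median salary difference.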
Education Level vs. Unemployment:
Another aspect to consider is whether a degree provides job security, which could be a measure of how sought-after degrees are from a company’s perspective. Using the Employed column that I derived from the Employment responses, I grouped the data by education level, counted the number of respondents not marked as employed, and divided by the total for each group.
My calculation of the unemployment rate by education level is as follows:
edlevel_counts = df.groupby('EdLevel').agg(
    total=('Employed', 'count'),
    unemployed=('Employed', lambda x: (x == 0).sum())
).reset_index()
edlevel_counts['Unemployment Rate (%)'] = 100 * edlevel_counts['unemployed'] / edlevel_counts['total']
edlevel_counts = edlevel_counts.sort_values('Unemployment Rate (%)', ascending=False)
This is how I created the graph (not including styling for my website):
fig = px.bar(
    edlevel_counts,
    x='Unemployment Rate (%)',
    y='EdLevel',
    orientation='h',
    color_discrete_sequence=[custom_bar_color],
    hover_data=['total', 'unemployed'],
    title='Unemployment Rate by Education Level (Global)',
    text='Unemployment Rate (%)'
)
Here is the result:
Surprisingly, this chart shows that the education level with the lowest unemployment rate is a Master's rather than a Ph.D, which contradicts my initial assumptions. This could be due to various reasons, such as those with a Ph.D being more involved in research, still studying, or struggling to find jobs that suit their skill set. For the 10.9% unemployment rate among Bachelor's degree holders, my hypothesis is that there are simply more individuals with that education level, leading to naturally higher competition. The story gets even more confusing when looking at only the United States…
This chart suggests that people with Bachelor’s degrees are the most employed, which runs completely counter to intuition. Taken at face value, it would mean companies value a Ph.D less than a Bachelor’s, so there must be an additional variable at play. Overall, 6-11% unemployment seems a bit high and surpasses the United States’ target unemployment rate of 4-6%.
I believe these unemployment rates are inflated in part by the advancement of AI and by companies replacing lower-performing developers. The year 2024 brought rapid advances in artificial intelligence, pushing companies to adopt it as quickly as possible or risk being left behind. Personally, I’ve heard many developers describe being required to integrate AI into their coding workflow, on the premise that it lets even a junior programmer work like a “10x engineer”. While unemployment may be elevated in the short run as companies harvest this increased efficiency from their workforce, I believe job opportunities across the tech industry will expand rapidly in the long run. Additionally, because this data comes from Stack Overflow, the respondent pool may skew toward a larger unemployed population than would be observed in the wider industry.
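The definition of “unemployed” used above is another mechanical contributor to the high figures: `is_employed` returns 0 for every response without an employment keyword, so full-time students, retirees, and blank answers are all counted as unemployed. A rough sketch of a narrower, labor-force-style rate is below. It assumes job seekers are labeled “Not employed, but looking for work” (verify the exact wording in your copy of the data); the function name and the sample rows are made up for illustration.

```python
import pandas as pd

# Assumed label for job seekers; verify against the actual survey values.
LOOKING = "Not employed, but looking for work"
EMPLOYED_KEYWORDS = (
    "Employed, full-time",
    "Employed, part-time",
    "Independent contractor, freelancer, or self-employed",
)

def labor_force_unemployment(df):
    """Unemployment rate (%) per EdLevel among people working or looking for work."""
    status = df["Employment"].fillna("")
    employed = status.apply(
        lambda s: any(k in s for k in EMPLOYED_KEYWORDS)
    ).astype(bool)
    looking = status.str.contains(LOOKING, regex=False) & ~employed
    in_labor_force = employed | looking
    return (
        df.assign(looking=looking)[in_labor_force]
        .groupby("EdLevel")["looking"]
        .mean()          # share of the labor force that is looking for work
        .mul(100)
        .rename("Unemployment Rate (%)")
    )

# Illustrative rows only (not survey data).
sample = pd.DataFrame({
    "EdLevel": ["Bachelor's", "Bachelor's", "Bachelor's", "Master's", "Master's"],
    "Employment": [
        "Employed, full-time",
        "Student, full-time",                  # excluded: not in the labor force
        "Not employed, but looking for work",
        "Employed, part-time",
        "Retired",                             # excluded: not in the labor force
    ],
})
rates = labor_force_unemployment(sample)
print(rates)
```

Students and retirees drop out of the denominator here, which is closer to how official unemployment rates are usually defined.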
Professions by Salary:
One of my most anticipated questions is “How do data job wages compare to the rest of the industry?” To visualize this, I took the median income and grouped it by profession:
# Remove rows with missing compensation or DevType
df_median = df.dropna(subset=['ConvertedCompYearly', 'DevType'])

# Group by 'DevType' and calculate the median compensation
median_comp = (
    df_median.groupby('DevType')['ConvertedCompYearly']
    .median()
    .reset_index()
    .rename(columns={'ConvertedCompYearly': 'MedianComp'})
    .sort_values('MedianComp', ascending=False)
)

fig = px.bar(
    median_comp,
    x='MedianComp',
    y='DevType',
    orientation='h',
    color_discrete_sequence=[custom_bar_color],
    title='Median Yearly Compensation by DevType (Global)',
    labels={
        'MedianComp': 'Median of Converted Compensation (Yearly)',
        'DevType': 'DevType'
    },
    text='MedianComp'
)
(You can zoom in on the chart to reveal more labels)
We can see that the median salaries for Data Analysts and Data Scientists are $54,507 and $73,036, respectively. As a reminder, these are global stats across all countries. Let’s take a look at just the United States. (Full code on GitHub!)
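The US-only view comes from filtering on `Country` before grouping. Here is a minimal sketch of that step, assuming the country label is “United States of America”; the helper name, the DevType labels, and the salary values in the sample frame are made up so the snippet runs on its own.

```python
import pandas as pd

def median_comp_by_devtype(df, country):
    """Median yearly compensation per DevType for one country, highest first."""
    subset = df[df["Country"] == country].dropna(
        subset=["ConvertedCompYearly", "DevType"]
    )
    return (
        subset.groupby("DevType")["ConvertedCompYearly"]
        .median()
        .sort_values(ascending=False)
    )

# Illustrative rows only (not survey data).
sample = pd.DataFrame({
    "Country": ["United States of America"] * 3 + ["Germany"],
    "DevType": ["Data scientist", "Data scientist", "Data analyst", "Data analyst"],
    "ConvertedCompYearly": [150_000, 170_000, 100_000, 60_000],
})

us = median_comp_by_devtype(sample, "United States of America")
print(us)
```

The resulting Series can be passed to `reset_index()` and plotted exactly like the global chart.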
We can see that the median salaries for data analysts and scientists are now $100,000 and $159,000, respectively. This tells us that the country you live in plays a key role in your earning potential, which we’ll explore in more detail momentarily. Something to also note is that the highest reported salary for data analysts in the US was $259,000, while data scientists boast $810,000, with the mode being $200,000. It’s clear that the earning potential for data scientists is higher than data analysts, but rightfully so. Data analyst positions are usually seen as good stepping stones if you desire to get into data science, since many of the required skills for analyzing data overlap with the duties of a data scientist.
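The “roughly 59% more” figure in the key insights follows directly from these two US medians. A quick check of the arithmetic:

```python
analyst_median = 100_000     # US median for data analysts, from the chart
scientist_median = 159_000   # US median for data scientists, from the chart

# Relative difference, expressed as a percentage of the analyst median.
pct_more = (scientist_median - analyst_median) / analyst_median * 100
print(f"Data scientists earn about {pct_more:.0f}% more than data analysts")
```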
Compared to other professions, data jobs stack up well. Data analysts make more than Designers, System administrators, and Educators, and fall behind Game devs, embedded device devs, and full-stack developers. For what is an entry-level position at most companies, this serves as a very good starting point for a career. Data scientists hold their own as well, earning more than the median R&D Role, Data engineer, and Product manager, while falling just behind the median AI Developer, Cloud infrastructure engineer, and back-end dev. Overall, data scientists rank as the 11th highest paid tech position in the United States, which supports its viability as a career path.
Compensation by Country:
As we’ve seen in the previous section, the country you live in can greatly affect your compensation. Let’s visualize just how much of a difference this variable makes:
# Remove rows with missing compensation or country
df_median = df.dropna(subset=['ConvertedCompYearly', 'Country'])

# Group by 'Country' and calculate the median compensation
median_comp = (
    df_median
    .groupby('Country')['ConvertedCompYearly']
    .median()
    .reset_index()
    .rename(columns={'ConvertedCompYearly': 'MedianComp'})
    .sort_values('MedianComp', ascending=False)
)

# Excluding outlier from the country Gabon
median_comp = median_comp[median_comp['Country'] != 'Gabon']

fig = px.bar(
    median_comp,
    x='MedianComp',
    y='Country',
    orientation='h',
    color_discrete_sequence=[custom_bar_color],
    title='Median Yearly Compensation by Country',
    labels={
        'MedianComp': 'Median of Converted Compensation (Yearly)',
        'Country': 'Country'
    },
    text='MedianComp'
)
As anticipated, the United States of America has the highest compensation for tech-related jobs in the world. We can see a few surprising countries ranking among the top, such as Haiti, Andorra, and Antigua and Barbuda. These are likely outliers: the number of respondents from these countries is small, so the confidence interval around the median is very wide and the result carries little statistical weight. For countries with a substantial number of responses, such as the US, Israel, and Switzerland, we can have more faith in the figures.
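One way to guard against small-sample outliers in general, rather than excluding countries by name, is to require a minimum number of salary responses per country. The sketch below shows the idea; the function name and the default threshold of 50 are arbitrary choices of mine, and the sample rows are made up.

```python
import pandas as pd

def median_comp_by_country(df, min_responses=50):
    """Median compensation per country, keeping only well-sampled countries."""
    stats = (
        df.dropna(subset=["ConvertedCompYearly", "Country"])
        .groupby("Country")["ConvertedCompYearly"]
        .agg(MedianComp="median", Responses="count")
    )
    # Drop countries with too few salary responses to trust the median.
    kept = stats[stats["Responses"] >= min_responses]
    return kept.sort_values("MedianComp", ascending=False)

# Illustrative rows only (not survey data).
sample = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "ConvertedCompYearly": [50_000, 60_000, 70_000, 900_000],
})

result = median_comp_by_country(sample, min_responses=2)
print(result)
```

In the toy data, country B has a single very high salary and is filtered out, while country A keeps its median of 60,000.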
The American Dream:
Looking at the Median Yearly Compensation by DevType (US), we can pick out a few professions that pay very well relative to the amount of work or knowledge required! Both the QA/Test developer and the security professional make $130k/yr, and both roles have comparatively light requirements in formal education and technical know-how. For a QA/Test developer, you usually get a 3-month training course, and you’re on the job. For a security professional, however, there are some additional steps to consider, such as getting the CompTIA A+ and Security+ certifications. The upside is that you can achieve these certifications in high school and not have to worry about going to college, although a college education looks good on a resume and, depending on the program, would help you get these certifications. Another solid option is being a designer, which pays $84k/yr. The designer’s primary goals are to give software a premium feel and to make the frontend developer’s life easier, using tools such as Figma, Adobe Suite, and other no-code tools. Designers need to work closely with marketing teams, understand advanced design philosophy, and be creative.