Analyzing questions asked to the troubleshooter of all tech teams — Stack Overflow

By Agrim Gupta, Darren Tsang, Nora Liu, Yuan Shen

Introduction

Every developer, student, and anyone else who writes code or uses software tools on the job knows they have the safety net of Stack Overflow to fall back on whenever they are stuck on a problem. The website serves as a platform for users to ask and answer questions. Through membership and active participation, users can vote questions and answers up or down and edit them, much like on wikis and other Q&A platforms such as Reddit. Our team decided to take a look at the numerous ways people ask questions on the website and to explore Stack Overflow posts further.

Stack Overflow publishes its posts data annually and makes it publicly available on Google's BigQuery platform. The company now employs several data scientists of its own to help it make decisions such as when a question should be flagged as low quality, or when the asker should be prompted to add more details. Despite that, we thought it would be fun to take it upon ourselves to answer some questions we felt had been inadequately answered so far.
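For readers who want to pull the same data, here is a minimal sketch of querying it from Python. It assumes the public `bigquery-public-data.stackoverflow` dataset and configured Google Cloud credentials; table and column names should be verified in the BigQuery console.

```python
# A hedged sketch of fetching Stack Overflow questions from BigQuery.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT id, title, body, tags, creation_date
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE EXTRACT(YEAR FROM creation_date) = 2020
    LIMIT 100000
"""
questions = client.query(query).to_dataframe()
print(questions.head())
```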

We started off by asking the following questions:

Before we dive into our analysis and findings, a bit of background about Stack Overflow posts is essential. Note that we will use the words post and question interchangeably in this article. Questions on the forum may only be asked by users with verified accounts and must have descriptive titles. Here are examples of good and bad titles from Stack Overflow themselves:

Further, every Stack Overflow post must contain at least one tag, and Stack Overflow recommends including all relevant tags in a question. Tags increase the reach of a question and connect askers to potential answerers sooner and more efficiently.

Analyzing response time as a factor

Another factor that seemed important to us while exploring Stack Overflow posts was a question's response time. For a particular tag, a shorter average response time would indicate a larger community of people answering questions under that tag. Within the same tag, we assumed that a lower response time indicates the question was asked in a better way, and was therefore of higher quality. We explored response times through a series of graphs:

In the above graph, we visualized the average response time, the raw number of questions, and the percentage of questions answered for each tag in a single plot. The tags we plotted are the most popular data visualization tags on Stack Overflow (we could have excluded Python and R from this plot, but we included them since several popular data visualization libraries are built in or written in those two languages). The smaller the circle, the shorter that tag's average response time. Comparing matplotlib and ggplot2, we find that ggplot2 has a far higher percentage of questions answered, while both tags have a fairly similar number of questions and average response times. Plotly, while clearly the most popular in terms of the number of times it is tagged, has a poor average response rate and response time.
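Under the hood, a per-tag summary like the one plotted above can be assembled with a single groupby. Here is a minimal pandas sketch; the column names (`tags`, `creation_date`, `first_answer_date`) are assumptions, not the exact schema we used.

```python
import pandas as pd

# Hypothetical input: one row per question, 'tags' pipe-separated as in the
# BigQuery export, 'first_answer_date' NaT for unanswered questions.
df = pd.read_csv("questions.csv", parse_dates=["creation_date", "first_answer_date"])

df["response_hours"] = (
    df["first_answer_date"] - df["creation_date"]
).dt.total_seconds() / 3600

# Explode to one row per (question, tag) pair so each tag gets credit.
df["tag"] = df["tags"].str.split("|")
per_tag = (
    df.explode("tag")
      .groupby("tag")
      .agg(
          n_questions=("response_hours", "size"),
          pct_answered=("first_answer_date", lambda s: 100 * s.notna().mean()),
          avg_response_hours=("response_hours", "mean"),
      )
      .sort_values("n_questions", ascending=False)
)
print(per_tag.head(10))
```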

Next, we observed how the average response time and the percentage of questions answered changed annually. The percentage of questions answered has remained fairly consistent every year since Stack Overflow began. However, as the community has grown, the time in which questions get answered has steadily decreased. A bigger circle signifies a greater raw number of questions asked in that year.

‘How’ questions take a lot longer to get answered than ‘why’ and ‘what’ questions. This matched our team's personal experience. The size of each circle is proportional to the number of questions asked with that keyword.
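Bucketing titles by their leading question word can be done in a few lines; the helper below is a hypothetical simplification that only checks the first word of the title.

```python
import pandas as pd

KEYWORDS = ("how", "why", "what")

def question_word(title):
    """Return the leading question word of a title, or 'other'."""
    words = title.strip().lower().split()
    return words[0] if words and words[0] in KEYWORDS else "other"

titles = pd.Series([
    "How do I reverse a list in Python?",
    "Why is my loop infinite?",
    "What does this error mean?",
])
print(titles.map(question_word).value_counts())
```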

In the same part of our analysis, we used the NLTK library in Python to tokenize our posts into lists of words. We built a list of all Stack Overflow tags and checked, for every word in a post's body, whether it was also a recognized tag. Only 26% of all posts contained body words that were also recognized Stack Overflow tags. However, when we compared the response times and response rates of the two categories, we found no significant difference.
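A minimal sketch of that check with NLTK follows; the tag set here is a tiny stand-in for the full list of Stack Overflow tags.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, needed once

# Stand-in for the full set of recognized Stack Overflow tags.
all_tags = {"python", "pandas", "matplotlib", "regex"}

def body_mentions_tag(body, tags):
    """Return True if any token in the post body is also a known tag."""
    tokens = {t.lower() for t in word_tokenize(body)}
    return bool(tokens & tags)

body = "I tried pandas but the groupby call keeps failing."
print(body_mentions_tag(body, all_tags))  # True
```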

Analyzing the quality of questions: Part 1

We found an open-source project online whose author had classified Stack Overflow questions into three ‘quality’ categories. Here's a description of that dataset:

This dataset contained only 60k questions, a tiny fraction of the more than 2 million questions asked on Stack Overflow. We took a prediction model (87% accuracy) built by a Kaggle community user and applied it to the remaining roughly 2 million questions.

For the quality classification dataset, we began with some basic EDA. We filtered for some of the most popular tags and compared their proportions of high-quality questions (HQ), low-quality questions closed without a single edit (LQC), and low-quality questions that remained open after edits (LQE).

This barplot shows the frequency of certain tags (chosen based on what we felt was popular and relevant among students), split by HQ, LQC, and LQE. The purpose of this graph is to see whether certain tags are associated with high-quality or low-quality questions.
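For reference, the counts behind such a barplot can be assembled with a few pandas operations. The sketch below assumes hypothetical column names and quality labels; adjust them to the dataset's actual schema.

```python
import pandas as pd

# Hypothetical layout: one row per question, pipe-separated 'tags' and a
# 'quality' label in {'HQ', 'LQ_CLOSE', 'LQ_EDIT'} (adjust to the real labels).
df = pd.read_csv("labeled_questions.csv")
df["tag"] = df["tags"].str.split("|")

popular = {"python", "javascript", "java", "r", "pandas"}
counts = (
    df.explode("tag")
      .query("tag in @popular")
      .groupby(["quality", "tag"])
      .size()
      .unstack(fill_value=0)
)
print(counts)
# counts.plot(kind="bar") reproduces a grouped barplot like the one above.
```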

As we can see in the plot, the Python tag is more common among LQC questions than in any other category (followed by LQE). A possible explanation is that Python is a very common first programming language, so many learners ask questions about it, and when they do, the right terminology might not be used, which results in a low-quality question. We can also see overall trends among tags across all classes: R is pretty low compared to Python or JavaScript, no matter the category of question.

The low-quality questions are dominated by the top 5 tags, while the high-quality questions have a more diverse distribution (see the size of the ‘other’ slice). A possible cause is that most beginners, who ask many of the low-quality questions, start with the most popular tools, while the distribution of tags becomes far more diverse once people are familiar with and exposed to a greater number of tools.

Finding families of packages and libraries in R and Python by analyzing tags

As we dove deeper and began to compare ecosystems of tools that are considered parallel to one another (e.g. Kotlin vs Swift, and ggplot2 vs matplotlib), we realized that R and Python were similarly popular for beginner-level data analysis tasks. We tried to find the families of tools within those two languages. For example, the tidyverse family in R encapsulates numerous closely related packages.

We used the igraph and tidygraph packages to draw network plots. The darker the edge between two nodes, the more strongly they relate to one another. From this network plot, we can visualize the different ecosystems that exist within the R community. Chains of three and four tags occur much more commonly in the R network plot than in the Python one below. The correlation is based on how many questions two tags appeared in together; for this, we had to discard Stack Overflow posts containing only a single tag.
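While we drew the plots themselves with igraph and tidygraph in R, the counting underneath can be sketched in a few lines. The sketch below uses raw pair counts rather than the exact correlation measure, and the column names are assumptions.

```python
from collections import Counter
from itertools import combinations
import pandas as pd

df = pd.read_csv("questions.csv")
tag_lists = df["tags"].str.split("|").dropna()

# Count how often each pair of tags appears in the same question;
# single-tag posts contribute no pairs and drop out naturally.
pair_counts = Counter()
for tags in tag_lists:
    for a, b in combinations(sorted(set(tags)), 2):
        pair_counts[(a, b)] += 1

edges = pd.DataFrame(
    [(a, b, n) for (a, b), n in pair_counts.items()],
    columns=["from", "to", "weight"],
).sort_values("weight", ascending=False)
print(edges.head())  # the strongest pairs, i.e. the darkest edges
```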

As explained before, the network plot for Python tags is far more sprawling. While R is mostly associated with data science and data analysis tasks, Python is used across the board, from web development to servers to algorithmic trading, so it makes sense that its network plot is as vast as it is. While there are no chains of four tags, there certainly are chains of three tags connected in triangles.

Shortcomings and future improvements

In a perfect world, we could have done several steps of our analysis differently:

Self-classification of Stack Overflow posts based on the predictability of tags

Since we were unable to verify the methodology followed by the person who classified the initial 60k questions, we decided to classify the Stack Overflow posts into low quality and high quality ourselves.

The basic approach of our classification model was to try to predict the tags of Stack Overflow posts: the greater the certainty with which we could predict a post's tag, the greater the quality score we gave it. In the end, we calculated the median quality score and labeled the half of the questions below it as low quality and the other half as high quality. While this addressed several shortcomings of our previous approach (such as applying a prediction model trained and tested on merely 45k and 15k entries, respectively, to a dataset containing 2 million entries), it has numerous flaws of its own. First and foremost, we are treating our tag-predicting algorithm as the holy grail of tag prediction. Other choices, like using percentiles to label quality, could potentially lead to meaningless results as well.

The following was our methodology to predict Stack Overflow question tags:
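As an illustrative sketch only, not our exact pipeline: a common baseline for this task pairs TF-IDF features with a one-vs-rest logistic regression, and the model's confidence in its predicted tag then doubles as the quality score described above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins: real inputs would be title + body text and one tag each.
texts = ["How do I merge two dicts in Python?", "How do I centre a div with CSS?"]
labels = ["python", "html"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, labels)

# Confidence in the predicted tag doubles as a crude quality score;
# a median split then labels the low- and high-quality halves.
proba = model.predict_proba(texts).max(axis=1)
quality = np.where(proba >= np.median(proba), "high", "low")
print(list(zip(labels, proba.round(2), quality)))
```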

Results:

We were able to predict the tags with very high accuracy for the top 10 tags (android, c#, c++, html, ios, java, javascript, jquery, php, python). However, prediction quality outside the top 10 tags varied widely, with precision scores ranging from 0.05 to 0.71.
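Per-tag precision figures like these can be read straight off scikit-learn's classification report; a toy illustration with made-up labels:

```python
from sklearn.metrics import classification_report

# y_true / y_pred would come from the held-out split of the tag model.
y_true = ["python", "java", "flask", "java", "flask"]
y_pred = ["python", "java", "java", "java", "python"]
print(classification_report(y_true, y_pred, zero_division=0))
# The per-tag 'precision' column is the figure that varied so widely
# outside the top ten tags.
```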

Therefore, using this model to assign quality scores to questions outside the ten tags mentioned above would not yield meaningful results.

Final thoughts

We found the problem of giving quality scores to Stack Overflow questions rather fascinating and will remain on the lookout for solutions proposed by other people.

The work that Stack Overflow does has been phenomenal and has supported programmers in remarkable ways. While some members of our team had theories, most of us remained curious about why people choose to answer questions on Stack Overflow in the first place. It is astonishing that such a large community of answerers exists, and our team is very grateful to them, since they gave us the opportunity to analyze some of the ways in which the community functions.

The trends and graphs show that Stack Overflow is headed in the general direction of reducing response times and increasing the percentage of questions answered, helped by better labeling techniques and more reliable question flagging. The work they are doing is very interesting, and our team hopes to take a closer look at Stack Overflow again in the future.
