Analyzing questions asked to the troubleshooter of all tech teams — Stack Overflow

By Agrim Gupta, Darren Tsang, Nora Liu, Yuan Shen

Introduction

Stack Overflow publishes its posts data annually and makes it publicly available on Google’s BigQuery platform. Stack Overflow now employs several data scientists of its own to help the organization make decisions such as when a question should be flagged as low quality, or when a person who asks a question should be notified to add more details. Despite that, we thought it would be fun to take it upon ourselves to answer some questions we felt had been inadequately answered so far.

We started off by asking the following questions:

  1. How can people ask better questions (in terms of quality and descriptiveness)? This question also involved an exploration of the advice provided by Stack Overflow themselves for asking a good question.
  2. How can people get quicker responses based on their questions?
  3. How do similar ecosystems of tools that are considered parallel to one another (e.g. Kotlin vs. Swift, and ggplot2 vs. matplotlib) compare?

Before we dive into our analysis and findings, a bit of background about Stack Overflow posts is essential. Note that we will use the words post and question interchangeably in this article. Posts on the discussion forum must be asked by users with verified accounts and must have descriptive titles. Examples of good and bad titles from Stack Overflow themselves:

  • Bad: C# Math Confusion
  • Good: Why does using float instead of int give me different results when all of my inputs are integers?
  • Bad: [php] session doubt
  • Good: How can I redirect users to different pages based on session data in PHP?
  • Bad: android if else problems
  • Good: Why does str == “value” evaluate to false when str is set to “value”?

Further, every Stack Overflow post must contain at least one tag. Stack Overflow recommends including all relevant tags in a question, which increases the question’s reach and connects the asker with potential answerers sooner and more efficiently.

Analyzing response time as a factor

In the above graph, we visualized the average response time, the raw number of questions, and the percentage of questions answered for each tag in a single plot. The tags plotted are the most popular data visualization tags on Stack Overflow (we could probably have excluded Python and R, but we included them since several popular data visualization libraries and packages are built on those two languages). The smaller the circle, the shorter the average response time for that tag. Comparing matplotlib and ggplot2, ggplot2 has a far higher percentage of questions answered, while both tags have a fairly similar number of questions and similar average response times. Plotly, while clearly the most popular in terms of the number of times it is tagged in questions, has a poor average response rate and response time.
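As an illustration of this encoding (not the exact code behind our figure), here is a minimal matplotlib sketch. The DataFrame `per_tag` and its column names are hypothetical placeholders for the per-tag aggregates we computed from the BigQuery data.

```python
import matplotlib.pyplot as plt

# `per_tag` is an assumed DataFrame with one row per tag and hypothetical
# columns: tag, n_questions, pct_answered, avg_hours_to_answer.
plt.scatter(per_tag["n_questions"], per_tag["pct_answered"],
            s=per_tag["avg_hours_to_answer"] * 50,  # smaller circle = faster answers
            alpha=0.5)
for _, row in per_tag.iterrows():
    plt.annotate(row["tag"], (row["n_questions"], row["pct_answered"]))
plt.xlabel("Number of questions")
plt.ylabel("Percentage of questions answered")
plt.title("Response behaviour of data visualization tags")
plt.show()
```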

Next, we observed how the average response time and the percentage of questions answered changed year by year. The percentage of questions answered has remained fairly consistent since the beginning of Stack Overflow. However, as the community has grown, the time in which questions get answered has steadily decreased. A bigger circle signifies a greater raw number of questions asked in that particular year.

‘How’ questions take a lot longer to get answered than ‘why’ and ‘what’ questions. This matched our team’s personal experiences. The size of each circle is proportional to the number of questions asked containing that keyword.
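One simple way to compute such a grouping is to look at the first word of each title; a rough sketch follows, assuming a DataFrame of questions with hypothetical `title` and `hours_to_first_answer` columns.

```python
# `questions` is an assumed pandas DataFrame with a `title` column (string)
# and an `hours_to_first_answer` column (time until the first answer arrived).
keywords = ["how", "why", "what"]

def leading_keyword(title):
    words = title.strip().lower().split()
    first = words[0] if words else ""
    return first if first in keywords else None

questions["keyword"] = questions["title"].map(leading_keyword)

# Average time to first answer and question count per keyword.
summary = (questions.dropna(subset=["keyword"])
           .groupby("keyword")["hours_to_first_answer"]
           .agg(["mean", "count"]))
print(summary)
```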

In the same part of our analysis, we used the NLTK library in Python to parse posts into lists of words. We built a list of all the tags used on Stack Overflow and checked, for every word in a post’s body, whether it was a recognized tag. Only 26% of all posts contained words in their body that were also recognized Stack Overflow tags. However, when we compared the response times and response rates of those two categories, we found no significant difference.
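A minimal sketch of this check with NLTK, assuming a list of post bodies and a set of all Stack Overflow tag names (the variable names are illustrative):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models used by word_tokenize

# Assumed inputs: `post_bodies` is a list of question-body strings and
# `all_tags` is a set of every Stack Overflow tag name (lowercased).
def body_mentions_a_tag(body, all_tags):
    words = {w.lower() for w in word_tokenize(body)}
    return len(words & all_tags) > 0

n_with_tag_words = sum(body_mentions_a_tag(b, all_tags) for b in post_bodies)
share = n_with_tag_words / len(post_bodies)  # we observed roughly 26%
print(f"{share:.0%} of posts mention a recognized tag in their body")
```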

Analyzing the quality of questions: Part 1

The labeled quality dataset we used (an external dataset from Kaggle) contained only 60k questions, which is insignificant compared with the total number of questions asked on Stack Overflow: over 2 million. We therefore used a prediction model (87% accuracy) built by a Kaggle community user and applied it to the remaining questions.

For the quality classification dataset, we decided to begin with some basic EDA. We filtered for some of the most popular tags used in questions and compared their proportions of high-quality questions (HQ), low-quality questions closed without a single edit (LQC), and low-quality questions that remained open after edits (LQE).

This barplot shows the frequency of certain tags (chosen based on what we felt were popular and relevant among students), split by HQ, LQC, and LQE. The purpose of this graph is to see whether certain tags are associated with higher- or lower-quality questions.
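A sketch of the tabulation behind such a barplot, assuming a DataFrame with one row per (question, tag) pair and a hypothetical `quality_class` column taking the values HQ, LQC, or LQE:

```python
import pandas as pd

# `labeled` is an assumed DataFrame with columns `tag` and `quality_class`.
popular = ["python", "javascript", "java", "r", "c++"]  # illustrative tag choices
subset = labeled[labeled["tag"].isin(popular)]

# Count questions per tag and quality class, then normalize within each class.
counts = pd.crosstab(subset["tag"], subset["quality_class"])
proportions = counts / counts.sum()
print(proportions)

proportions.plot(kind="bar")  # tag frequency split by HQ / LQC / LQE
```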

As we can see in the plot, the Python tag is more common among LQC questions than in any other category (followed by LQE). A possible explanation is that Python is often someone’s first programming language, so many questions come from learners. When these questions are asked, the right terminology might not be used, which results in a low-quality question. We can also see overall trends among tags across all classes: R is fairly low compared to Python or JavaScript, no matter the category of question.

The low-quality questions are largely dominated by the top 5 tags, while the high-quality questions have a more diverse distribution (see the size of the ‘other’ slice). This may be because most beginners start with the most popular tags (hence the low-quality questions), while the distribution of tags becomes a lot more diverse once people are familiar with and exposed to a greater number of tools.

Finding families of packages and libraries in R and Python by analyzing tags

We used the igraph and tidygraph packages to draw network plots. The darker the edge between two nodes, the more strongly the two tags relate to one another. From this network plot, we can visualize the different ecosystems that exist within the R community. Chains of three and four tags occur much more commonly in the R network plot than in the Python one below. The strength of a connection is based on how many questions two tags appeared in together; for this, we had to discard Stack Overflow posts that contained only a single tag.
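The co-occurrence counts behind these plots can be computed as in the sketch below (the plots themselves were drawn with igraph and tidygraph in R); `question_tags` is an assumed list of per-question tag lists.

```python
from itertools import combinations
from collections import Counter

# `question_tags` is assumed to be a list of tag lists, one per question,
# e.g. [["r", "ggplot2", "dplyr"], ["python", "matplotlib"], ...].
edge_weights = Counter()
for tags in question_tags:
    if len(tags) < 2:  # discard posts with only a single tag
        continue
    for a, b in combinations(sorted(set(tags)), 2):
        edge_weights[(a, b)] += 1  # number of questions the pair co-occurs in

# The strongest pairs correspond to the darkest edges in the network plot.
print(edge_weights.most_common(10))
```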

As explained before, the network plot for Python tags is much larger. While R is mostly associated with data science and data analysis tasks, Python is used across the board, from web development to servers to algorithmic trading, so it makes sense that its network plot is as vast as it is. While there are no chains of four tags, there certainly are chains of three tags connected triangularly.

Shortcomings and future improvements

  1. We used a prediction model trained and tested on merely 45k and 15k entries (respectively) and applied it to a dataset containing 2 million entries. With greater computing resources, one could potentially train and test on the 2 million entries iteratively and in smaller subsets.
  2. We were unable to verify the methodology followed by the person who classified the initial 60k questions (the external dataset we used for our quality analysis). In a perfect world, had they replied, we could have used their methods to manually tag many more questions than 60k.
  3. We could instead have built our own quality classification model (we did try; see team red’s GitHub repository and the process described in the next section).
  4. We could’ve tried to compare more communities (e.g. frontend vs. backend).
  5. We should’ve provided more concrete insights into how people ask and respond to questions, followed by general advice on how to ask higher-quality questions.
  6. We could’ve correlated the length of questions and titles with the number of responses, the time taken to receive the first response, the number of upvotes and downvotes, etc.

Self-classification of Stack Overflow posts based on the predictability of tags

The basic approach of our classification model was to try to predict the tags of Stack Overflow posts. If we could predict a tag with greater certainty, we would give the question a higher quality score. Finally, we would take the median of the quality scores and label the lower half of questions as low quality and the upper half as high quality. While this addressed some of the shortcomings of our previous approach (such as applying a model trained and tested on merely 45k and 15k entries to a dataset of 2 million entries), it has numerous flaws of its own. First and foremost, we are treating our tag prediction algorithm as the ground truth of tag prediction. Other choices, such as using percentiles to label quality, could also lead to meaningless results.
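For illustration, a minimal sketch of the median-split labeling step, assuming a pandas DataFrame `df` with a hypothetical `certainty` column holding the model’s confidence in its predicted tag for each question:

```python
# `df` is assumed to have one row per question and a `certainty` column in
# [0, 1] holding the model's confidence in the tag it predicted.
threshold = df["certainty"].median()

# Above-median certainty is labeled high quality; the rest, low quality.
df["quality_label"] = (df["certainty"] >= threshold).map({True: "HQ", False: "LQ"})
```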

The following was our methodology for predicting Stack Overflow question tags (a minimal sketch of the pipeline follows the list):

  1. Removing stop words from question sentences.
  2. Split words into their building blocks, i.e. tokenization (we used word tokenization), and obtained a vocabulary size from this process.
  3. Encoded our tokens into text sequences.
  4. Embedding layer: converted text sequences into vector sequences for training. After training, words with similar meanings ended up with similar vectors.
    (This is more efficient than passing one-hot encoded vectors to the dense layer of the model.)
  5. Predicting tags using an RNN model from the TensorFlow community (consisting of a dense layer, a dropout layer, a batch normalization layer, and a dense output layer).
  6. Assigning the predicted tag to each question in our dataset, along with the certainty of the prediction.
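As a rough, self-contained sketch of steps 1 through 6: we fold tokenization and sequence encoding into a Keras TextVectorization layer for brevity, and approximate the community model with the dense / dropout / batch-normalization stack listed above on top of a pooled embedding; the original model may differ, and the input names are illustrative.

```python
import tensorflow as tf

# Assumed inputs: `texts` is a NumPy array of question strings with stop
# words already removed; `labels` is an integer array encoding each
# question's tag among the `num_tags` most common tags.
num_tags = 10
max_tokens = 20000  # assumed vocabulary size (step 2)
seq_len = 200       # assumed maximum sequence length

# Steps 2-3: tokenize and encode the text into integer sequences.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=seq_len)
vectorizer.adapt(texts)
X = vectorizer(texts)

# Steps 4-5: embedding layer followed by the dense / dropout /
# batch-normalization / dense-output stack described above.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_tokens, 64),
    tf.keras.layers.GlobalAveragePooling1D(),  # simplification: average the sequence
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=5, validation_split=0.2)

# Step 6: assign the predicted tag and its certainty to each question.
probs = model.predict(X)
predicted_tag = probs.argmax(axis=1)
certainty = probs.max(axis=1)
```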

Results:

We were able to predict tags with very high accuracy for the top 10 tags (android, c#, c++, html, ios, java, javascript, jquery, php, python). However, prediction quality outside the top 10 tags was all over the map; per-tag precision scores ranged from 0.05 to 0.71.

Therefore, using this model to assign quality scores for tags outside the ten mentioned above would not lead to many meaningful results.
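One way to inspect these per-tag precision scores is scikit-learn’s classification report; a short sketch, assuming held-out true and predicted tag labels:

```python
from sklearn.metrics import classification_report

# `y_true` and `y_pred` are assumed to be the held-out true and predicted
# tag labels; precision is reported separately for every tag.
print(classification_report(y_true, y_pred, zero_division=0))
```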

Final thoughts

The work that Stack Overflow does has been phenomenal and has supported programmers in countless ways. While some members of our team had their own theories, most of us remained curious about why people choose to answer questions on Stack Overflow in the first place. It is astonishing that such a large community of answerers exists. Our team is grateful to everyone who answers questions, as it gave us the opportunity to analyze some of the ways in which the community functions.

The trends and graphs show that Stack Overflow is moving in the direction of shorter response times and a higher percentage of questions answered, aided by better labeling techniques and more reliable flagging of questions. The work they are doing is very interesting, and our team hopes to take a closer look at Stack Overflow again in the future.
