Data Mining: MyBB Community Forums

Posted: July 15, 2016

For a long time now I've been a member (and staff member) of the MyBB Community Forums. I've seen thousands of members come and go, and read probably tens of thousands of posts. Over the last week I've been collecting a lot of data on the MyBB forums, specifically user profile data, and I've used it to put together some basic analysis.

Preface

Note: Before anyone loses their mind, I will point out that all of this data is publicly accessible. Although you need a forum account to view member profiles, none of this data is private, nor did it come from any special privileges offered by my staff account. Furthermore, the graphs presented in this article contain no identifiable information (such as user ID or user name).

Why, What, and How?

As for why I performed this analysis: "just because". I have no particular reason to collect this data; I simply wanted to see how it would look, and whether I could identify any trends in user behaviour from it.

The data itself contains 10 different metrics (username, sex, location, experience level, joined date, referral count, project count, reputation count, posts, and threads) observed from 91,665 user IDs. This represents every user who has not been deleted (in this data set I identified 16,488 deleted accounts) from the 4th of June 2004 until the 12th of July 2016 (the last date of data collection). In total, once we account for crawler metadata (such as last updated time), I had generated over 1.7 million data points.

Data was collected using a custom bot written in PHP using the Laravel framework. I chose Laravel for one reason: its command line structure. After identifying the most recent user ID, every possible user ID was queued into the crawler, which in turn crawled each profile and stored the user in a MySQL database. From there, a custom export command was written using the Laravel Excel package to export the data to CSV/Excel files. The main reason for using this package was its ability to create separate sheets for different data comparisons.
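As a rough illustration of the queue-then-crawl flow described above (this is a Python sketch, not the original Laravel code), the loop might look something like the following. The host name, the `fetch` callable, and the in-memory `store` are all assumptions made for the example; the URL pattern is MyBB's default profile link and may differ on a forum using friendly URLs.

```python
from collections import deque

BASE_URL = "https://forum.example"  # hypothetical host, not the real forum


def profile_url(uid: int) -> str:
    # Assumed default MyBB profile URL pattern.
    return f"{BASE_URL}/member.php?action=profile&uid={uid}"


def build_queue(latest_uid: int) -> deque:
    # Queue every possible user ID from 1 up to the most recent one.
    return deque(range(1, latest_uid + 1))


def crawl(queue: deque, fetch, store: dict) -> int:
    """Drain the queue, storing each profile found.

    `fetch` is any callable that takes a URL and returns a dict of
    profile fields, or None when the account no longer exists.
    Returns the number of deleted accounts encountered.
    """
    deleted = 0
    while queue:
        uid = queue.popleft()
        profile = fetch(profile_url(uid))
        if profile is None:
            deleted += 1
            continue
        store[uid] = profile
    return deleted
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to try out against canned data.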

I could potentially have exported the entire database into a single CSV and imported that into GraphPad Prism (my statistical software of choice), but that quickly runs into issues, requiring huge amounts of memory and computational power for sanitising, normalising, and adapting the data to each comparison. One example of the necessity of this method is in some of the larger scale ungrouped comparisons (like posts vs threads). With a single data point for each user, 91,665 users generate a spreadsheet that is simply too large for Excel to display, being limited to 65,536 rows (the limit of the older .xls format). I was using Excel as an intermediary for some of the column based calculations, because GraphPad would simply crash if it had to hold all of the data and every single computed value, like posts per day since joining. So exporting to 6 different spreadsheets containing over 30 separate sheets of data was the best approach I could come up with.
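A computed column such as "posts per day since joining" is simple to express in code. The sketch below is only an illustration in Python; the function name and the choice to treat same-day registrations as one day (to avoid dividing by zero) are assumptions, not details from the original spreadsheets.

```python
from datetime import date


def per_day(count: int, joined: date, collected: date) -> float:
    # Days registered, floored at 1 so same-day accounts do not divide by zero.
    days_registered = max((collected - joined).days, 1)
    return count / days_registered


# Example: 100 posts from an account that joined 10 days before collection.
rate = per_day(100, date(2016, 7, 2), date(2016, 7, 12))
```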

The majority of this process was entirely automated, with data collection and spreadsheet creation happening with only the tap of a few keys. A lot of the sanitisation was also automated, such as removing all dead or spam accounts (accounts with 0 posts or threads) prior to exporting. The graphs, and a lot of the pre-graphing calculations, were unfortunately manual, requiring quite a bit of patience - GraphPad crashed at least 5 times. The downside of this manual aspect is that it makes it rather hard to automate the process entirely.
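The sanitisation step could be sketched as a simple filter. This Python example reads "accounts with 0 posts or threads" as accounts that have neither posts nor threads, which is an assumption; the field names are also hypothetical.

```python
def is_active(user: dict) -> bool:
    # Assumption: an account counts as dead/spam only if it has no posts AND no threads.
    return user["posts"] > 0 or user["threads"] > 0


def sanitise(users: list) -> list:
    # Keep only accounts that show some activity.
    return [user for user in users if is_active(user)]
```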

Initially, I wanted to use something like Google Charts to present the data, feeding it directly from an ever-updating database that would keep up to date with all the metrics of previously crawled members as well as new members who joined. But it quickly became evident that rendering 100,000 data points in Google Charts using JavaScript in a web browser was not going to work.

The Results

Registration Date

The effect of registration year on various metrics from 2004 to 2016. Users metric is raw numbers, all other stats are on a "per user" basis.

The graphs above show the year of registration of members, and the total per-user metric for posts, threads, referrals, projects, and reputation. The graphs highlight some interesting observations.
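For readers who want to reproduce this kind of grouping, a per-user average by registration year could be computed roughly as follows. This is a Python sketch with hypothetical field names (`joined`, `posts`, and so on), not the export code used for the graphs.

```python
from collections import defaultdict
from datetime import date


def per_user_by_year(users: list, metric: str) -> dict:
    """Return {year: total of `metric` / number of users} by join year."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for user in users:
        year = user["joined"].year
        totals[year] += user[metric]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(counts)}
```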

Firstly, the referral statistics are largely as expected, with users who joined earlier bringing in more referrals per user than those who joined recently. However, the number of referrals from members who joined in 2008 is far above the expected number for that year, even though 2008 saw a typical number of user registrations. In comparison, 2011 saw a hugely increased number of registrations, with almost double the number of registrations of any other year. Yet, despite this, referrals from users in 2011 were no higher than expected. Interestingly, for 2016, the number of referrals per user is roughly the same as in 2015, which is impressive considering it is only July!

Both posts per user and threads per user were higher than expected for 2008 and 2009, signalling a possible low time in the membership of MyBB. The lower than expected posts and threads per user in 2011 are probably caused by the unusually high number of signups that year. Similarly, 2008 and 2009 are the highest in terms of reputation, with almost twice the reputation per user compared with 2007 and 2010 respectively.

With regard to projects, it appears that users who joined in 2006 are the most active contributors, with three times as many projects per user as users from any other year.

The effect of registration month on various metrics from 2004 to 2016. Users metric is raw numbers, all other stats are on a "per user" basis.

User distribution for MyBB seems to indicate that registration favours the summer months, with July being by far the most common month for registration, with about 14% more registrations than the next most popular month. On the other side of the coin, the winter months show about 15% fewer registrations than the monthly average.

The effect of registration day on various metrics from 2004 to 2016. Users metric is raw numbers, all other stats are on a "per user" basis.

Initial registration data by day seems to suggest that registrations are far less common on the 31st of any given month. However, once you account for the fact that only 7 of the 12 months have 31 days, and adjust for this factor, the number of registrations per day actually appears to increase towards the end of a month.

Adjusted number of registrations per day between 2004 and 2016.
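One way to make this kind of adjustment is to divide the raw count for each day of the month by the number of times that day actually occurred in the collection period. The Python sketch below is an assumption about how such a correction could be done, not a record of the spreadsheet formula used for the graph.

```python
from datetime import date, timedelta


def day_occurrences(start: date, end: date) -> dict:
    """Count how often each day of the month (1-31) occurs in [start, end]."""
    occurrences = {day: 0 for day in range(1, 32)}
    current = start
    while current <= end:
        occurrences[current.day] += 1
        current += timedelta(days=1)
    return occurrences


def adjusted_registrations(raw_counts: dict, start: date, end: date) -> dict:
    """Registrations per occurrence of each day of the month."""
    occurrences = day_occurrences(start, end)
    return {
        day: raw_counts.get(day, 0) / occurrences[day]
        for day in occurrences
        if occurrences[day] > 0
    }
```

Counting actual calendar occurrences also handles the 29th and 30th, which are missing from February in most or all years, rather than only correcting the 31st.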

The remaining monthly and daily statistics are purely for the sake of intrigue. Surely the month and day of registration have no long-term bearing on any metric.

The number of registrations over time.

By analysing registration data it's easy to see that MyBB experienced a large number of signups around 1900-2000 days ago, which is approximately 5 and a half years ago. Calculating this from today (12th of July 2016) gives us a date of around February 2011. When combining this with the previously shown yearly statistics, it becomes a little clearer that the MyBB forums experienced a constantly elevated registration count for almost the entire year. It is also possible that the forums experienced a huge influx of spam bots, specifically highlighted by a period of around 5 days in July 2011 that saw four times the daily average (approximately 150 signups per day compared with the usual average of 35).
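Converting "days ago" on the graph's axis back into calendar dates is plain date arithmetic. A minimal Python sketch, assuming the 12th of July 2016 as the reference date given above:

```python
from datetime import date, timedelta

COLLECTION_DATE = date(2016, 7, 12)  # last date of data collection


def days_ago_to_date(days_ago: int, reference: date = COLLECTION_DATE) -> date:
    # Step back the given number of days from the reference date.
    return reference - timedelta(days=days_ago)


window_start = days_ago_to_date(2000)
window_end = days_ago_to_date(1900)
```

With these inputs the window runs from 20 January 2011 to 30 April 2011, so the spike sits in the first few months of 2011.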

Even after discounting the unusually high registration count in 2011, the number of registrations shows a generally upwards trend, with more registrations in recent months than in the early 2000s. This, obviously, is expected, with MyBB being considerably less popular in 2004 than it is today.

The effect of registration duration on referrals per user.

The referral data shows that recent members are more likely to refer other members than those who signed up longer ago. However, it is important to note that this is perhaps due to relative inactivity over time, with veteran members eventually moving on and not actively contributing to the project.

The effect of registration duration on the number of projects per user.

This hypothesis is partly mirrored in the number of projects, with more recent members contributing a larger portion of projects per user. However, a number of members who have been members for more than 3000 days also contribute far above the average. This is possibly due to the fact that

The effect of registration duration on the amount of reputation per user.

Reputation per user is generally linear, with only a slight increase for recently registered users. The trendline shows a slight dip around 2011-2013, possibly caused by the large number of signups in 2011. There is visible thinning of the number of data points as the time since registration increases. This is almost certainly due to the fact that reputation was not enabled on the forums until some time after they opened (a number of years). Because this data set was sanitised to remove all zero-reputation points (to avoid having thousands of data points on Y=0) it is entirely plausible that a large portion of longer-registered users simply became inactive before the reputation system was enabled.

The effect of registration duration on the number of threads per user.

The effect of registration duration on the number of posts per user.

Both threads per user and posts per user show a similar trend to the other graphs, with considerably higher per-day stats for recently joined users. Overall this suggests that recently registered users are most active - an entirely expected phenomenon.

Sex Based Data

Metrics by the disclosed sex of the user.

It is immediately obvious that the majority of MyBB users do not disclose their sex, most probably for privacy reasons but also because it is not a required field on registration. This means that by and large bots simply don't fill out the field when signing up.

For posts, threads, and reputation per user, the highest averages are seen for members whose sex is "other", followed by males and then females. For posts and reputation this trend shows an approximate doubling between females and males, and again between males and "other". Threads per user show a less pronounced effect, but the same general trend applies. For all three metrics, those who do not disclose their sex show significantly less activity.

In the case of referrals, males instead show the largest rate of referral, with females showing a rate that is significantly closer to that of the "other" and male categories when compared to other metrics. Again, undisclosed users fall far behind, likely due to throw-away accounts.

Finally, project data shows a story that is exaggerated when compared to the other metrics. In this case females produce very few projects per user, and males and "other" users contribute a far larger portion. In this case, males and "other" users contribute about 8 times the number of projects per user as females.

Experience Based Data

Metrics by the declared experience of the user.

MyBB offers users the ability to declare their MyBB experience level in their profile. This is an entirely optional profile field, presumably explaining the exceptionally large proportion of users who do not declare such experience (NS).

Number of users by declared experience level, and then the same graph but excluding non-declared users.

It is more clearly visible in the above graph that the majority of users who declare an experience level declare themselves as novice users. This is quite expected if you've ever spent much time on the forums, where the majority of support threads and posts are for quite simple problems. The free nature of MyBB also tends to draw a lot of novice users who are just starting out at making websites.

All measured metrics of users by experience show a situation where the most experienced users contribute the most to the forums, a reassuring sign that members are relatively accurate in the determination of their own experience level. Obviously, metrics such as thread and post count do not directly correlate to user experience, but the projects and reputation metrics probably do - after all, members who contribute support and make plugins or themes are likely to be experienced members and are also likely to receive a lot of gratitude for their actions.

Comparative Data

Number of posts vs number of threads for MyBB users

This graph shows the number of posts compared to the number of threads for the majority of users. The data included in this graph covers all users who have fewer than 200 threads and fewer than 2000 posts. This removed 81 members from the graph (myself included), but without this pruning the graph was barely readable. So for the sake of this graph we will identify them as "outliers".

As expected, the more threads a user makes, the more posts they also make. This is relatively straightforward, but there are a number of users who have approximately equal threads and posts. These users generally make a few threads and never reply beyond that, leading to equal post and thread counts. After plotting a trendline it is possible to deduce that, on average, a user will make about 6 times as many posts as they make threads.
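A trendline of this kind can be approximated with an ordinary least-squares fit. The sketch below is a generic Python illustration and is not necessarily the fitting method GraphPad used; the field names are hypothetical, and the pruning thresholds are the ones quoted for this graph.

```python
def prune_outliers(users: list, max_threads: int = 200, max_posts: int = 2000) -> list:
    # Keep users with fewer than 200 threads and fewer than 2000 posts.
    return [u for u in users if u["threads"] < max_threads and u["posts"] < max_posts]


def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least squares: return (slope, intercept) for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def posts_per_thread_trend(users: list) -> tuple:
    # Fit posts (y) against threads (x) after pruning the outliers.
    kept = prune_outliers(users)
    return fit_line([u["threads"] for u in kept], [u["posts"] for u in kept])
```

The slope of the fitted line is the figure of interest here: a slope near 6 would correspond to roughly 6 posts for every thread.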

Number of posts vs amount of reputation for MyBB users

Number of threads vs amount of reputation for MyBB users

The above two graphs show the relationship between posts and reputation, and between threads and reputation. As expected, the more posts a user has, the more reputation they will generally also have. Users who post more are generally more likely to help other users, whereas throw-away accounts are unlikely to help others and as such are unlikely to get reputation.

However, the same cannot be said for threads vs reputation, where GraphPad was unable to plot any kind of trendline. In this case there are a number of users who have a small number of threads but a lot of reputation, and also a number of users who have a lot of threads and almost no reputation. This is possibly explained by the two types of users on the MyBB forums. There are a number of users, such as the support staff, who very rarely make threads - if at all - but do help users a lot. These users make up the left-hand high-reputation users.

Number of threads vs number of projects for MyBB users

Finally, the relationship between threads and projects is, whilst weak, quite visible. It shows that as the number of projects a user has increases, the number of threads they have also increases. This is probably explained by the fact that users who post their projects also make a thread on the forums as part of the submission process to act as the support thread for the project.

Overall Thoughts

The results of this analysis show some expected and some unexpected points. In summary:

  • A large proportion of user accounts are likely bot signups or inactive accounts (some 55,000).
  • The rate of signups is growing year on year, indicating growth for the project.
  • The majority of users do not declare either a sex or an experience level.
  • Of those who do declare a sex, "other" users are generally most productive.
  • Users who do declare an experience level appear, on average, to be relatively accurate in their self assessment.

In the future, I'd like to add more metrics to this analysis, perhaps tracking user group to determine the number of banned/inactive accounts. I'd also like to automate the system more to keep an up-to-date set of data. This would require some tweaking to keep track of active accounts but check inactive accounts less often - to avoid effectively DoS'ing the MyBB servers! I'd also like to track the last-active time of the user, as members who registered long ago are negatively impacted in any test that takes into account the number of days registered.
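The idea of checking inactive accounts less often could be sketched as a simple tiered schedule. Everything in this Python example is hypothetical: the thresholds, the intervals, and the notion of "days since the profile last changed" are placeholders for illustration, not a design that exists yet.

```python
from datetime import date, timedelta


def recrawl_interval(days_since_change: int) -> timedelta:
    # Hypothetical tiers: the longer a profile has been unchanged,
    # the less often it is re-checked.
    if days_since_change < 30:
        return timedelta(days=1)
    if days_since_change < 365:
        return timedelta(days=7)
    return timedelta(days=30)


def next_crawl(last_crawled: date, last_changed: date) -> date:
    # Schedule the next visit based on how stale the profile was at the last crawl.
    unchanged_for = (last_crawled - last_changed).days
    return last_crawled + recrawl_interval(unchanged_for)
```

Spreading the long-inactive accounts over a monthly cycle would keep the request rate low while still letting the crawler notice if an old account becomes active again.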


If you'd like to look at the graphs, in PDF or image form, use the buttons below. If you'd like to look at the data please drop me a message and I'll provide it to you.

Download the graphs (PDF)   Download the graphs (PNG)


Tags: MyBB, Data, Statistics

© 2012-2018 Tom Kent. All Rights Reserved