Open Source Tools for Learning Data Analysis, Continuous Improvement, and Machine Learning

This image from the BBC is not subject to this site’s Creative Commons license.

And now for something completely different…

I’m taking a pause from talking directly about open for a moment to share some resources I’ve recently found that have made my data life much more efficient and enjoyable. But don’t worry – there’s still a connection to open.

I spend a lot of my time working in the data generated by use of Lumen’s open courseware. In addition to regular meetings with partner schools where we share insights both surprising and mundane, this work also supports our continuous improvement efforts to make our open courseware objectively more effective term after term. As I’ve said many times:

  • “open” gives you permission to make improvements to course materials but doesn’t tell you what needs changing.
  • “learning analytics” give you information about what needs improving in your course but doesn’t give you permission to make the changes.
  • ∴ to do continuous improvement in education, you need OER (permission to change) plus analytics (info about what to change).

I’m a huge fan of R and RStudio and use them regularly for data extraction, cleaning, and analysis. If you don’t know these, R is open source software for statistical computing and visualization, and RStudio is an IDE that makes R easier to use and includes a code editor, as well as debugging and visualization tools. I love these tools because they make doing reproducible research so much easier – instead of trying to remember how I changed that Excel file and which menus I clicked on in SPSS, I can write R scripts that repeat the entire process from extraction to cleaning to analysis to reporting, so I can always repeat (or audit) my work.

I’m an especially big fan of R Markdown in RStudio, which lets you intermingle analysis code (R) with presentation instructions (Markdown) in one file so that you can plug in new data, literally push a button, and automatically create slides or HTML or PDF that contain updated findings and graphics. When you’re sharing a semester’s worth of findings with multiple schools in a short time window, this capability turns out to be an absolute lifesaver. It’s also a cinch to then combine the data from all schools and re-run what is now an omnibus analysis that gives you insight into how things are working overall, and to share those findings in support of continuous improvement efforts.
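Here’s a minimal sketch of what the “plug in new data and push a button” workflow can look like when it is scripted instead of clicked. It assumes the rmarkdown package and pandoc are installed (RStudio bundles pandoc). The template, the `school` parameter, and the school names are all made up for illustration; a real report would load each school’s data rather than summarize a built-in dataset.

```r
library(rmarkdown)

# Hypothetical parameterized template, written to a temp folder so the
# sketch is self-contained. In practice this would be your own .Rmd file.
fence <- strrep("`", 3)  # code-chunk delimiter for the template
template <- file.path(tempdir(), "term_report.Rmd")
writeLines(c(
  "---",
  "title: \"Term report\"",
  "output: html_document",
  "params:",
  "  school: \"example\"",
  "---",
  "",
  "Findings for `r params$school`.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)  # stand-in for the real analysis",
  fence
), template)

# Render one report per school from the same template.
schools <- c("school_a", "school_b")  # made-up names
outputs <- vapply(schools, function(s) {
  render(
    template,
    params = list(school = s),
    output_file = paste0(s, "_report.html"),
    output_dir = tempdir(),
    quiet = TRUE
  )
}, character(1))
```

Each call to `render()` returns the path of the file it wrote, so `outputs` ends up holding one report path per school.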

But recently I’ve found myself spending more time in the machine learning space. And this is where two new (to me) open source tools have proven to be really powerful and interesting. The first is vtreat. vtreat is a data.frame processor/conditioner that prepares “real-world” data for predictive modeling in a statistically sound manner. The main idea with vtreat is that even with a sophisticated machine learning algorithm there are many ways messy real world data can defeat the modeling process, and vtreat helps with at least ten of them.
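To make that concrete, here is a small sketch of the basic vtreat pattern for a yes/no outcome: design a treatment plan on one set of rows with `designTreatmentsC()`, then apply it to other rows with `prepare()`. The data is simulated and the column names (`hours_online`, `section`, `passed`) are invented for illustration, so treat this as a starting point under those assumptions and not a recommended modeling setup.

```r
library(vtreat)

set.seed(2024)
n <- 200

# Simulated, made-up course data with the kinds of problems vtreat targets:
# missing numeric values and a categorical column with missing entries.
d <- data.frame(
  hours_online = ifelse(runif(n) < 0.15, NA, rexp(n, rate = 0.5)),
  section = sample(c("A", "B", "C", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$passed <- ifelse(is.na(d$hours_online), 1, d$hours_online) + rnorm(n) > 1.5

# Design the treatment plan on one half of the rows...
design_rows <- seq_len(n) <= n / 2
plan <- designTreatmentsC(
  d[design_rows, ],
  varlist = c("hours_online", "section"),
  outcomename = "passed",
  outcometarget = TRUE,
  verbose = FALSE
)

# ...and apply it to the other half, so the rows used to learn the
# treatments are not the same rows that get treated.
treated <- prepare(plan, d[!design_rows, ], pruneSig = NULL)
```

The treated data frame has the same number of rows as its input, but the missing values have been replaced by clean numeric columns that a modeling function can consume directly.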

The second is Rattle (the R Analytical Tool To Learn Easily).

Rattle makes much of the data preparation (e.g., splitting into train/validate/test sets), exploration, transformation, model building, and evaluation process point-and-click. This is terrifically powerful for getting started quickly with machine learning tasks. Perhaps more importantly, Rattle logs all the R commands used throughout the process and lets you save them as standalone R scripts. Then you can further refine things from the command line (or in RStudio) and plug the scripts into whatever process you use to periodically run your analyses.
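For a sense of what the train/validate/test step amounts to once it lives in a script, here is a rough base R sketch. It is not Rattle’s actual logged output. The 70/15/15 proportions are an assumption for illustration, and the built-in `iris` data stands in for real course data.

```r
set.seed(42)  # fix the seed so the split can be reproduced later

dataset <- iris  # stand-in for your own data frame
n <- nrow(dataset)

# Shuffle the row indices once, then carve them into three groups.
idx <- sample(n)
n_train <- floor(0.70 * n)
n_validate <- floor(0.15 * n)

train <- dataset[idx[seq_len(n_train)], ]
validate <- dataset[idx[n_train + seq_len(n_validate)], ]
test <- dataset[idx[(n_train + n_validate + 1):n], ]
```

Because every row index appears exactly once in `idx`, the three sets never overlap and together cover the whole dataset.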

I know this is different from the usual iterating toward openness fare, but given how much these tools are improving my life I wanted to make sure others who could benefit from them knew about them. What tools do you use in your data-related work? Drop your favorites in the comments below.

Comments on this entry are closed.

  • siouxgeonz

    I was hoping to see stuff from your regular cohort… being a person who doesn’t have “data related work.” This post is close to jabberwocky to me, out in anecdote land. (Hey, I just had a pre-algebra kiddo figure out that with compound interest, you could do stuff to the “nt” by taking it to the n power and then to the t power, and he wanted to know why it worked! That’s my data…)
    This isn’t “completely different” because you’re often expounding on the things OER needs to do to become scalable… things the “big guys” are doing. I wonder if an issue is that lots of people in my kind of role would read this blog and say erm, data-related work? Let me get back to my LMS…