We would like to collect a list of public data sets available for research on educational learning, user modeling, and recommender systems. Below is a list of available data sets and the URL where you can get a copy of each. Please let us know (edrecsys [AT]) if you want to add your own data sets to this list.

Free Code Camp Data

Description: It’s one big JSON array. Inside that array are 103,000 subarrays — one for each camper who’s completed at least one challenge since we overhauled our schema a few months ago. Each subarray contains an individual camper’s completed challenges as a series of JavaScript objects. Each of these objects has the name of the challenge, the date they first completed it (as a Unix timestamp in milliseconds), and their solution, which will either be code or a URL for their solution on CodePen, Heroku or GitHub. It does not include data from users who have opted out of data sharing.
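As a rough illustration of the nested structure described above, the following sketch walks an array of per-camper subarrays and summarizes each one. The sample data is made up, and the key names ("name", "completedDate", "solution") are assumptions chosen for the example, not the confirmed schema of the actual file.

```python
import json
from datetime import datetime, timezone

# Tiny made-up sample in the shape the description suggests. The key names
# ("name", "completedDate", "solution") are assumptions, not the real schema.
SAMPLE = json.dumps([
    [
        {"name": "Example Challenge A", "completedDate": 1450000000000,
         "solution": "var a = 1;"},
        {"name": "Example Challenge B", "completedDate": 1450086400000,
         "solution": "https://codepen.io/example/pen/abc"},
    ],
    [
        {"name": "Example Challenge A", "completedDate": 1450172800000,
         "solution": "var a = 2;"},
    ],
])


def summarize(campers):
    """Return one summary dict per camper subarray."""
    summaries = []
    for challenges in campers:
        # Timestamps are in milliseconds, so divide by 1000 before converting.
        dates = [
            datetime.fromtimestamp(c["completedDate"] / 1000, tz=timezone.utc)
            for c in challenges
        ]
        summaries.append({
            "completed": len(challenges),
            "first": min(dates) if dates else None,
            # A solution is either code or a URL; count the URL ones.
            "url_solutions": sum(
                1 for c in challenges
                if str(c.get("solution", "")).startswith(("http://", "https://"))
            ),
        })
    return summaries


campers = json.loads(SAMPLE)
summaries = summarize(campers)
```

For the real file, the same function could be applied to the result of `json.load` on the downloaded array, after adjusting the key names to whatever the file actually uses.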

Student Performance Data Set

Description: These data address student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for more details).
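The note about G1 and G2 suggests preparing two versions of the feature set, one that keeps the earlier period grades and one that drops them. The sketch below shows one way to do that with the standard library. The sample rows are made up, and the ';' delimiter and the column names other than G1, G2, and G3 are assumptions for illustration.

```python
import csv
import io

# Made-up rows for illustration only; the real files have many more columns.
# The ';' delimiter and the non-grade column names are assumptions.
SAMPLE = """school;age;studytime;G1;G2;G3
A;16;2;10;11;11
B;17;1;8;7;8
"""

PRIOR_GRADES = ("G1", "G2")
TARGET = "G3"


def split_features(rows, use_prior_grades):
    """Separate the target G3 from the features, optionally dropping G1/G2."""
    X, y = [], []
    for row in rows:
        features = {
            key: value
            for key, value in row.items()
            if key != TARGET and (use_prior_grades or key not in PRIOR_GRADES)
        }
        X.append(features)
        y.append(int(row[TARGET]))
    return X, y


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
X_with_prior, y = split_features(rows, use_prior_grades=True)      # easier task
X_without_prior, _ = split_features(rows, use_prior_grades=False)  # harder task
```

Training the same model on both feature sets gives a direct comparison between the easier setting and the harder, more useful one.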

Educational Process Mining (EPM)

Description: The experiments were carried out with a group of 115 first-year undergraduate Engineering students at the University of Genoa. We carried out this study in a simulation environment named Deeds (Digital Electronics Education and Design Suite), which is used for e-learning in digital electronics. The environment provides learning materials to the students through specialized browsers and asks them to solve various problems with different levels of difficulty. For more information about the Deeds simulator used for this course, see [Web Link]; for the contents of the exercises in each session, see ‘exercises_info.txt’. Our data set contains the students’ time series of activities during six laboratory sessions of the digital electronics course. There are 6 folders containing the students’ data per session. Each ‘Session’ folder contains up to 99 CSV files, each dedicated to a specific student’s log during that session. The number of files in each folder varies with the number of students present in each session. Each file contains 13 features. See ‘features_info.txt’ for more details. For details of the activities performed by the students during the course, see ‘activities_info.txt’.
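The folder layout described above (one folder per session, one CSV file per student) can be indexed with a few lines of standard-library code. This is a minimal sketch: the folder and file names used in the demo are hypothetical, the demo file contents are made up, and the real logs may need a different delimiter or header handling.

```python
import csv
import tempfile
from pathlib import Path


def index_sessions(root):
    """Map each session folder name to the sorted list of its CSV files."""
    index = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        index[folder.name] = sorted(folder.glob("*.csv"))
    return index


def read_log(path):
    """Read one student's log as a list of rows (each row a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


# Demo on a throwaway directory; names and contents here are made up.
with tempfile.TemporaryDirectory() as tmp:
    for session, students in (("Session 1", (1, 2)), ("Session 2", (1,))):
        folder = Path(tmp) / session
        folder.mkdir()
        for student in students:
            (folder / f"{student}.csv").write_text("1,1,Exercise,Example\n")

    index = index_sessions(tmp)
    # Attendance differs per session, so the file counts differ too.
    counts = {name: len(files) for name, files in index.items()}
    first_log = read_log(index["Session 1"][0])
```

Pointing `index_sessions` at the unpacked data set directory should give the number of students present in each session directly from the file counts.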

Pittsburgh Science of Learning Center DataShop

Useful Info from EDM Workshop:
Description: The Pittsburgh Science of Learning Center offers DataShop, a system which you can use to conduct learning curve analysis on educational data.

LAK Data Set

Description: In collaboration with strategic partners, SoLAR is making it possible to perform computational analyses on the Learning Analytics research literature. They are making publicly available machine-readable versions of research sources for scientometrics and other analytics methodologies. The data sets were associated with the ACM International Conference on Learning Analytics and Knowledge (LAK).

Scholarly Paper Recommendation Datasets

Description: This is the basic version of the new dataset (“Dataset 2”, without link information) used in the experiments of the JCDL2013 and IJDL2015 work. If you are interested in the recommendation of scholarly papers, please try this dataset in your experiments! It can also be used for other purposes such as classification, clustering, and trend analysis.

Book-Crossing Dataset

Description: The data contain user IDs, book IDs, user demographic information, book metadata, and book ratings.