Wednesday, November 26, 2014

Competency 5.1/ Week 5 Activity

Competency 5.1: Learn to conduct prediction modeling effectively and appropriately.
I think this competency can be achieved by completing the given activity in RapidMiner. It is quite difficult for a newbie, but it's well described in the course and definitely doable :)

We were asked to build a set of detectors predicting the variable ONTASK for the given data file using RapidMiner 5.3. I had previously installed RapidMiner 6.1, so that is what I used. There were 13 questions, and you progress to the next question only after answering the current one correctly; most of them ask you to enter the kappa of the model you executed. There were a few difficulties along the way (for which I will try to give some useful tips), and it was a huge relief when I finally got this screen ;)

1) Build a decision tree (using operator W-J48 from the Weka Extension Pack) on the entire data set. What is the non-cross-validated kappa?

You can follow the RapidMiner walkthrough video to answer this question. The steps are almost the same, but in the last stage of the Excel import you need to correct any variable types that were not guessed correctly. My RapidMiner 6.1 guessed the data types correctly, but if you are using 5.3, you should probably change the types of the polynominal variables that are incorrectly guessed as integer. (You can open the input Excel file to check what kind of values each column holds.)

You should set attribute name = "ONTASK" and target role = label (the value to be predicted in the given exercise), and add the W-J48, Apply Model and Performance (Binominal Classification) operators. The rest should be fine.
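Since RapidMiner is a GUI tool, here is a rough scikit-learn sketch of the same flow, in case seeing it as code helps. This is only for intuition: the file name "ontask-data.xlsx" is a placeholder for the course download, the one-hot encoding is my assumption about the columns, and DecisionTreeClassifier is just a stand-in for Weka's W-J48 (C4.5), so don't expect the exact same kappa.

    # Rough scikit-learn equivalent of the RapidMiner flow, for intuition only.
    # "ontask-data.xlsx" is a placeholder name; use the file from the course.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import cohen_kappa_score

    data = pd.read_excel("ontask-data.xlsx")
    X = pd.get_dummies(data.drop(columns=["ONTASK"]))  # predictors, one-hot encoded
    y = data["ONTASK"]                                 # the label to predict

    model = DecisionTreeClassifier().fit(X, y)  # stand-in for W-J48, trained on all data
    # "Non-cross-validated" kappa: the model is scored on its own training data.
    print(cohen_kappa_score(y, model.predict(X)))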

2) The kappa value you just obtained is artificially high - the model is over-fitting to which student it is. What is the non-cross-validated kappa, if you build the model (using the same operator), excluding student?

There are two ways to exclude a field: delete it from the data, or use the Select Attributes operator. The latter is better for obvious reasons. For this question, add a Select Attributes operator, set attribute filter type = single and attribute = StudentID, and check "invert selection" (since we are asked to exclude the student).

3) Some other features in the data set may make your model overly specific to the current data set. Which data features would not apply outside of the population sampled in the current data set? Select all that apply.

For this question, you need to select the options which will not generalise outside your population. The system will assist you if you are wrong.

4) What is the non-cross-validated kappa, if you build the W-J48 decision tree model (using the same operator), excluding student and the variables from Question 3?

For this, we need to exclude all the variables from Q3 (those that do not apply outside our sample population) in addition to the StudentID we already excluded. In the Select Attributes operator, set attribute filter type = subset, select the attributes to be excluded in the next window, and check "invert selection".

5) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use Naive Bayes?

Replace the W-J48 decision tree operator with the Naive Bayes operator.

6) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use W-JRip?

Replace the Naive Bayes operator with the W-JRip operator.

7) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use Logistic Regression? (Hint: You will need to transform some variables to make this work; RapidMiner will tell you what to do)

Add the "Nominal to Numerical" and "Logistic Regression" operators, because Logistic Regression cannot handle polynominal attributes or labels.
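For intuition, here is a minimal sketch of the same idea outside RapidMiner, re-using the placeholder file from the earlier sketch: dummy-code the nominal predictors (the "Nominal to Numerical" step), then fit a logistic regression.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score

    data = pd.read_excel("ontask-data.xlsx")           # same placeholder file
    X = pd.get_dummies(data.drop(columns=["ONTASK"]))  # "Nominal to Numerical"
    y = data["ONTASK"]

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(cohen_kappa_score(y, clf.predict(X)))        # non-cross-validated kappa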

This is the question I spent the most time on, and I still couldn't get through. I'm not sure whether the kappa I got was wrong or the value the system expects is wrong. Anyway, since I couldn't afford more than a day on that issue and was on the verge of quitting the activity, I had to get past this question using Ryan's answer in the Quickhelper discussion forum. That's unfortunate, but I hardly had a choice :(

8) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use Step Regression (called Linear Regression)?

Just replace the Logistic Regression operator with the Linear Regression operator.

9) What is the non-cross-validated kappa, for the same set of variables you used for question 4, if you use k-NN instead of W-J48? (We will discuss the results of this test later).

Replace the Linear Regression operator with the k-NN operator.

10) What is the kappa, for the same set of variables you used for question 4, if you use W-J48, and conduct 10-fold stratified-sample cross-validation?

For cross-validating the model, you can refer to the RapidMiner walkthrough. Add an X-Validation operator, remove the W-J48, Apply Model and Performance operators from the main process, and place them inside the training and testing panels of the X-Validation operator.
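In scikit-learn terms (continuing the hypothetical sketch from Q1, and ignoring the Q4 attribute exclusions for brevity), X-Validation corresponds to something like this:

    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import cohen_kappa_score

    # 10-fold stratified CV: each fold preserves the ONTASK class balance,
    # and every prediction comes from a model that never saw that row.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    preds = cross_val_predict(DecisionTreeClassifier(), X, y, cv=cv)
    print(cohen_kappa_score(y, preds))  # cross-validated kappa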

11) Why is the kappa lower for question 10 (cross-validation) than question 4 (no cross-validation)?

Without cross-validation, the model is tested on the same data it was trained on, which inflates the kappa; under cross-validation it is tested on held-out data.

You should be able to answer this question, otherwise the system will help.

12) What is the kappa, for the same set of variables you used for question 4, if you use k-NN, and conduct 10-fold stratified-sample cross-validation?

Replace W-J48 with k-NN inside the training panel of the X-Validation operator.

13) k-NN and W-J48 got almost the same Kappa when compared using cross-validation. But the kappa for k-NN was much higher (1.000) when cross-validation was not used. Why is that?

You should be able to answer this question as well; otherwise the system will help. The key insight: k-NN predicts a point using the point itself when cross-validation is turned off, and that's bad.
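A tiny sketch of why this happens, again with the X and y from the Q1 sketch:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import cohen_kappa_score

    # With k = 1, the nearest neighbour of each training point is the point
    # itself, so (barring duplicate rows with conflicting labels) the model
    # "predicts" the training data perfectly: kappa = 1.0.
    knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(cohen_kappa_score(y, knn.predict(X)))

    # Under cross-validation the point is held out of training, the
    # self-match disappears, and the kappa drops to a realistic level.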


I wanted to post a tutorial with pictures to help newcomers, but I didn't have time for that since I haven't yet started Week 6, which is running now. Hope my tips help. All the best!


Competency 5.2/ Week 5 Reflection

A quick intro: I skipped weeks 3 and 4 for the time being since I was well behind schedule because of starting late. I jumped to Week 5 (prediction modeling) so I could participate in the discussion and the bazaar, but sadly I was still too far behind to take part. I'm aiming to complete weeks 3 and 4 later once I'm back on track :)

My Notes/ Learning:

Prediction 
- Developing a model that can infer an aspect of data (predicted variable) from a combination of other data (predictor variables)
- Inferences about the future/ present 
Two categories of prediction model:
  1. Regressors
  2. Classifiers

Regression

- A model that predicts a number (e.g., how long a student takes to answer)
- Label --> the value we want to predict

Building a Regression model:
  1. Training set - a data set where we already know the answer, used to train the model for prediction
  2. Test data - the data set for testing our model

- The basic idea of regression is to determine which features, in which combination, can predict the label's value.

Linear Regression
E.g. No. of papers = 2 + 2*(# of grad students) - 0.1*(# of grad students)^2
- fits only linear functions (but transformed features, like the squared term above, can be included as extra predictors)
- flexible, fast
- often more accurate than more complex models, once cross-validated
- the model is feasible to understand
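Note the squared term in the example: linear regression is linear in the weights, so a transformed feature just becomes another column. A toy sketch with made-up, noiseless data generated from that very formula:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    g = np.array([1.0, 2, 4, 6, 8, 10])   # number of grad students
    papers = 2 + 2 * g - 0.1 * g**2       # generate data from the formula above

    X_feat = np.column_stack([g, g**2])   # features: g and g^2
    model = LinearRegression().fit(X_feat, papers)
    print(model.intercept_, model.coef_)  # recovers roughly 2.0 and [2.0, -0.1]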

Regression Trees
E.g. if x > 3:       y = 2A + 3B
     else if x < -7: y = 2A - 3B
     else:           y = 2A + 0.5B + C
- Non-linear
- Different linear relationships between some variables, depending on other variables.
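A sketch with made-up data. (Caveat: scikit-learn's DecisionTreeRegressor predicts a constant per leaf; the example above, with a linear formula per branch, is closer to a model tree such as M5. The regime-dependent, non-linear idea is the same.)

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(-10, 10, size=(200, 1))
    # y follows a different rule on each side of x = 3
    y_vals = np.where(x[:, 0] > 3, 5 + x[:, 0], 1 - x[:, 0])

    tree = DecisionTreeRegressor(max_depth=3).fit(x, y_vals)
    print(tree.predict([[4.0], [-8.0]]))  # different regimes on each side of the split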


Classification

- what we want to predict is categorical
E.g. Correct/ Wrong (0/1), Category A/B/C/D/E/F
- Each label is associated with a set of "features" - to help in predicting the label

Classifier
- Determine which features, in what combination, can predict the label.
- Many classification algorithms available 
- Data mining packages that include them:
  • RapidMiner
  • SAS Enterprise Miner
  • Weka
  • KEEL

- Some algorithms useful for educational data mining:
  • Step regression
  • Logistic regression
  • J48/C4.5 decision trees
  • JRip decision rules
  • K* instance-based classifiers

Step Regression:
- fits a linear regression function, with an arbitrary cut-off
- for binary classification (0/1)
- selects predictor variables, assigns a weight to each, and computes a numerical value
E.g.  Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
If the cut-off is 0.5, all values below 0.5 are treated as 0 and all values >= 0.5 as 1.
- no closed-form expression
- often better in EDM because it is conservative (less prone to over-fitting)
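A worked instance of the cut-off, with made-up predictor values plugged into the formula above:

    # Plug made-up values into Y = 0.5a + 0.7b - 0.2c + 0.4d + 0.3
    a, b, c, d = 0.2, 0.9, 0.1, 0.5
    Y = 0.5 * a + 0.7 * b - 0.2 * c + 0.4 * d + 0.3
    print(Y)              # ~1.21
    print(int(Y >= 0.5))  # above the 0.5 cut-off, so classified as 1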

Logistic Regression:
- also for binary classification
- given a specific set of values of the predictor variables, fits a logistic function to the data to find the odds of a specific value of the dependent variable
E.g.  m = a0 + a1*v1 + a2*v2 + a3*v3 + ...
       p(m) = 1 / (1 + e^(-m))
- relatively conservative algorithm, due to its simple functional form
- useful when changes in the predictor variables have predictable effects on the probability of the predicted class, i.e. when there are no interaction effects
(An interaction effect it would miss: A = bad, B = bad, but A+B = good)
- automated feature distillation is available, but it is not optimal
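A quick numeric check of the logistic function from the notes:

    import numpy as np

    def p(m):
        # squashes the linear combination m into a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-m))

    print(p(0))   # 0.5   -> right at the decision boundary
    print(p(3))   # ~0.95 -> strongly predicts class 1
    print(p(-3))  # ~0.05 -> strongly predicts class 0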

Decision Trees:
- Deals with interaction effects


J48/C4.5:
- J48 is the open-source re-implementation of C4.5 in Weka/RapidMiner
- both numerical and categorical predictor variables can be used
- tries to find the optimal split points in numerical variables
- relatively conservative
- good when the data has natural split points and multi-level interactions
- good when the same construct can be arrived at in multiple ways

Decision Rules:
- set of if-then rules to check in order
- many different algorithms are available, differing in how rules are generated and selected
- the most popular sub-category (JRip, PART) repeatedly creates decision trees and distills the best rules from them
- relatively conservative - the resulting models are simpler than most decision trees
- very interpretable models, unlike other approaches
- good when multi-level interactions are common

K*
- instance based classifier
- predicts a data point from neighbouring data points (stronger weights when points are nearby)
- good when the data is very divergent:
   • no easy patterns, but there are clusters
   • different processes can lead to the same result
   • intractable to find general rules
- sometimes works when nothing else works
- drawback: the whole data set is needed at prediction time --> mainly useful for offline analysis
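K* itself ships with Weka; as a rough scikit-learn stand-in, a distance-weighted k-NN captures the "stronger weights when points are nearby" idea:

    from sklearn.neighbors import KNeighborsClassifier

    # Toy data: two clusters with different labels.
    X_toy = [[0, 0], [0, 1], [5, 5], [6, 5]]
    y_toy = ["A", "A", "B", "B"]

    # weights="distance": nearer neighbours count for more in the vote.
    knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)
    print(knn.predict([[1, 0], [5, 6]]))  # -> ['A' 'B']

    # The drawback from the notes shows up here too: fit() keeps the whole
    # training set in memory, so this suits offline analysis.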

Bagged Stumps:
- related to decision trees
- builds a lot of trees that make only a first split ("stumps") and then aggregates them
- relatively conservative
- a close variant is Random Forest (building a lot of full trees and aggregating across them)

Some less conservative (more complex) algorithms:
Support Vector Machines (SVM):
- conducts dimensionality reduction on the data space, then fits a hyperplane that splits the classes
- creates sophisticated models, great for text mining
- not optimal for other educational data (logs, grades, interactions with software)
Genetic Algorithms:
- use mutation, combination and natural selection to search the space of possible models
- can produce inconsistent answers between runs, due to randomness
Neural Networks:
- compose extremely complex relationships by combining "perceptrons" in different fashions
- complicated

Over-fitting:
Fitting to the noise as well as the signal.
An over-fit model will perform less well on new data.
Every model is over-fit to some degree; we can reduce over-fitting but never eliminate it completely.
Assessment:
Check whether the model transfers to new contexts or is over-fit to a specific context.
Test the model on unseen data: split the data into a training set and a test set.

Cross-validation:
- split the data points into N equal-sized groups
- train on all groups except one, and test on the held-out group
- repeat, so that every group serves as the test set once
K-fold: pick a number K and split the data into that many groups
Leave-one-out: every data point is its own fold (avoids stratification issues)
Variants:
Flat cross-validation - each point has an equal chance of being placed in each fold
Student-level cross-validation - the minimum requirement in EDM (tests generalisation to new students); see the sketch below
Other levels: school, lesson, demographic, software package, etc.
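Continuing the hypothetical sketch from the activity above, student-level cross-validation can be approximated with scikit-learn's GroupKFold, assuming a StudentID column (which is excluded from the predictors themselves, as in Q2):

    import pandas as pd
    from sklearn.model_selection import GroupKFold, cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import cohen_kappa_score

    data = pd.read_excel("ontask-data.xlsx")  # same placeholder file as before
    X = pd.get_dummies(data.drop(columns=["ONTASK", "StudentID"]))
    y = data["ONTASK"]

    # All rows from one student land in the same fold, so the model is always
    # tested on students it never saw during training.
    cv = GroupKFold(n_splits=10)
    preds = cross_val_predict(DecisionTreeClassifier(), X, y,
                              cv=cv, groups=data["StudentID"])
    print(cohen_kappa_score(y, preds))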

Uses of Prediction Modeling

My ideas in using prediction modeling for education:

1. Predict future career path and train accordingly:
If we are able to predict the future career path of students based on their interests in subjects, we can give more field-level training. That kind of education will be more meaningful to students in gaining the skills the industry requires. Students will also be more interested in learning what they like, rather than being forced to learn something they don't prefer.

2. Provide help for weak students:
Not all students require the same amount of help to understand a subject; some learn more easily than others. If we can predict the points where students are likely to have difficulties, we can provide help in those specific areas.

3. Identify competencies:
If we can identify the competencies students have and those they lack, we can provide more guidance in the weaker areas. For example, if a student does his work perfectly and exhibits good leadership but doesn't practise teamwork, we can guide him to develop the teamwork competency.


All materials are based on the EdX course - Data, Analytics and Learning
This work is licensed under a Creative Commons Attribution 4.0 International License.