Thursday, December 18, 2014

My reflections on DALMOOC

It's been a great experience working with the Data, Analytics and Learning MOOC (DALMOOC) from EdX at https://www.edx.org/course/data-analytics-learning-utarlingtonx-link5-10x#.VJO6xF4AA. I'm so glad my boss found it and encouraged me to take it (although I started only in Week 5 of the nine-week course). It was not easy; nevertheless, it was a very rewarding experience. I would say the estimated 5 hours/week is not sufficient to gain all the competencies, especially as the topics get tougher in the later weeks.

I liked the overall design of the course, where the content progresses smoothly from a general introduction to learning analytics and getting started with data analysis to deeper topics like social network analysis, prediction modeling and text mining. Sadly, I jumped back and forth instead of following the flow, since I started late. But that's the best part: the course is designed to be flexible, so I could directly access the topics that interested me most without needing to follow the order. I also liked that there were ways to create artifacts for future use and easy sharing, like this blog, which I created exclusively for this course!

The contents were good and aptly selected. There is deeper material in each of these topics, but this course provides the basic foundation from which we can explore further. I found Weeks 5 & 6 to be content-rich, which was necessary to understand many concepts in Weeks 7 & 8. The activities with specific points allotted in Weeks 5 & 6 were good; I wish there were more of them in the other topics to test our understanding (but better organized and with some tips). The voices from the field about the different research taking place in this area are a very useful bonus.

The ways to interact were well planned and executed. The discussion forum was very useful, and the students were eager to jump in to answer questions and help others. Connecting through Twitter made it easy to reach instructors and other course mates. I didn't have the luck to get into the collaborative chat, though. Along with the different options for connecting through social networks like Facebook and Twitter (especially in Prosolo), I would personally like to see LinkedIn and Google Groups added as well.

The instructors asked us to reflect rather than see the different activities only as homework, and not to pressure ourselves into doing everything; unfortunately, I was doing both at times :( This is what happens when you skip the initial weeks where important instructions are given. Anyway, considering that this is the first MOOC I've ever taken, I give myself credit for doing reasonably well. Yay :)

In terms of improvements I would want to see in future courses, most are related to accessibility. The available options, though explained upfront during orientation, could be better organized. For example, the different contents in EdX that were in tabbed format were not very noticeable. I used to go to Prosolo every time to see all the contents, since I thought EdX didn't have them all (how silly of me, lol). Another big difficulty when I first started using EdX was identifying which video to watch first in a chapter. I would often watch one, go to the next as given in alphabetical order, and then realize that I should have watched the second one first. It would be good if the course contents were numbered and ordered that way!

The option to download handouts was not given in all weeks, so providing that would help. Making the Bazaar assignments more accessible would also help students. Rather than linking to an external assignment bank (which didn't have all the weeks' assignments), it would be better if it were integrated within EdX or Prosolo. I discovered late that Quickhelper has two different options, Question and Discussion (a tooltip on hover could help, maybe). And I didn't know whether I should post under each week's course or the discussion forum link (I later came to know they both lead to the same place). Perhaps consider one integrated place for all discussion? Of course, as our instructors talked about future plans, it would be good to see real-time analysis showing who stands where in the network and whom we should connect to for better access.

Those were some suggestions I could think of to overcome the difficulties I faced while taking this course. I'm sure these will be looked after in future courses. For a first-time construction of such a course, it is very well done. I've learned so much in so little time (miles to go, yet). Big appreciation to all the instructors who put in so much time and effort to bring it together. Thank you so much!

After toiling for days to learn, do assignments and acquire competencies, it's now time to relax. And don't forget to connect with peers. Happy holidays everyone!

Concept Map

Competency 9.2: Integrate various course concepts through creation of a graphical representation (concept map) of the relationships between prominent course topics.

We've had a fruitful time learning about different topics in learning analytics. The next task at hand was to create a concept map by finding connections among them. With an assurance from the instructors that there is no right or wrong concept map, I started constructing mine using the CmapTools software available from http://cmap.ihmc.us/download/

If you are doing this for the first time, just like me, I would recommend reading about concept maps in a few articles first (the links given in the course were useful).

Since concept maps follow a hierarchy, I started from "Learning", since impact on learning is what all the methods aim for, though their steps differ. I noticed that text mining and prediction modeling were connected very well. All these methods also follow the same learning analytics cycle. Our course structure reminds me of an opportunity where all these methods could be applied to get a bigger picture of our learning, using analysis of social media, logs, chats, discussion forums, etc. Anyway, here it goes, my concept map integrating the different course concepts. (I know it's a bit cluttered, but bear with me, because I'm so much in a getting-ready-for-holiday mood and didn't feel like cleaning it up more!)



Tuesday, December 16, 2014

Competency 4.2


Competency 4.2: Describe and interpret the results of social network analysis for the study of learning.

I'm going to use the analysis I did by extracting Twitter data with the hashtag #DALMOOC (explained in my post for Competency 3.2) to interpret the results of modularity in the network. The density was 0.05, which means the network is relatively well connected. There are a few sub-communities in the network, but the overall network is still connected to some extent.

In the modularity report, we can see the formation of 13 communities. There are some disconnected nodes in the network, containing fewer people or even a single person, which form smaller sub-communities. We should try to connect the disconnected nodes to others in the bigger community to ensure good sharing of information.


 

The giant component algorithm produced 5 main communities, which are color-coded. From the centrality measures, we can see that dgasevic plays the role of a network broker by bridging many nodes in the network. That person's out-degree is also high, which means he is involved in conversations with many people and is willing to help.

 


    

I've kept this part on social network analysis short and tried to complete it as soon as possible, since we are in the last week of the course. I will be doing the final week's tasks from tomorrow, and hopefully I will post a reflection about the entire course soon :)

Competency 4.1


Competency 4.1: Describe and critically reflect on approaches to the use of social network analysis for the study of learning.


I've seen some great insights on how to use social network analysis to improve learning. The impact of social network analysis on educational constructs like learning design, sense of community, creative potential, social presence, academic performance and MOOC pedagogy looks promising. Possible data sources include discussion boards, course enrollments, Twitter and other social network data, self-reports and course design. Metrics like network density, degree centrality, eccentricity and modularity help us get an idea about the network as a whole and about the individuals in it.

Learning design can affect students' activities in a big way. Students who are familiar with a design are generally more comfortable using it. To see whether students in a course are learning as expected, we can monitor them using SNA and guide them as needed. By looking at the interactions, we can see at what stage the instructor's role is more important than peer facilitation and provide help to students accordingly.

Monitoring the sense of community is useful for identifying isolated groups or individuals who may not receive all the information. In such cases we can guide them to become part of larger communities. We can also advise students to join new groups for assignments to get connected to more students. These factors can improve the creative potential, social presence and academic performance of students if suitable help is provided. Students should also be made aware of the available channels of communication and their usefulness, to help them understand the distributed structure of MOOCs and get better involved.

I would think this kind of analysis should happen along the way in any course, to see how students are doing over time. Seeing the results only at the end may not be of much help to students. Instead, positive measures can be taken, such as introducing new hashtags for better connection, or maintaining a list or shared document of all students and their resources, to help students through the rest of the course. It could also be beneficial to connect different social networks to get a complete picture of each individual student.

Monday, December 15, 2014

Competency 3.2/ Assignment 79

Competency 3.2: Perform social network analysis and visualize analysis results in Gephi.

I did Assignment 79, in which we are asked to extract data from Twitter and analyse the Twitter network in Gephi. I extracted the Twitter data using NodeXL, a freely available add-in for Excel. I searched for all tweets with the hashtag #DALMOOC and used them as my data set.

I exported the data from NodeXL as a GraphML file, which I then imported into Gephi. I tried different visualizations and measures. I was pretty amazed to see what a difference a good visualization can make in analyzing and presenting data :) The snapshots should be quite self-explanatory.

It was interesting to see the network graph color-coded by geographical location (this is only a part of the whole network graph).




Here I've mapped degree centrality to color (this is only a part of the whole network graph).



The graph density was 0.050 and the diameter of the graph was 5. 





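Gephi reported these numbers directly; if you'd rather cross-check them in code, here is a minimal sketch using the Python networkx library, assuming the NodeXL export was saved as dalmooc.graphml (the file name is my placeholder):

import networkx as nx

G = nx.read_graphml("dalmooc.graphml")  # placeholder file name
G = nx.Graph(G)  # undirected view for density/diameter

print("Density:", nx.density(G))  # edges present / edges possible

# Diameter is only defined on a connected graph, so compute it
# on the largest connected component.
giant = G.subgraph(max(nx.connected_components(G), key=len))
print("Diameter:", nx.diameter(giant))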
The modularity report is shown. There were 13 communities, some containing only one or two people.





I applied the giant component algorithm to filter out the small communities. The five main communities that emerged are color-coded and shown below:



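For those who prefer scripting, a rough equivalent of Gephi's Giant Component filter in networkx might look like this (same placeholder file name as before; greedy modularity stands in for Gephi's Louvain-based modularity):

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph(nx.read_graphml("dalmooc.graphml"))  # placeholder file name

# Keep only the largest connected component (the "giant component")
giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()
print("Kept", giant.number_of_nodes(), "of", G.number_of_nodes(), "nodes")

# Detect communities on the giant component
communities = greedy_modularity_communities(giant)
print("Communities found:", len(communities))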
It was very interesting to apply different visualization techniques to our course data. Hoping to use these types of visualizations on my own research data!

Competency 3.1 - Basics of Social Network Analysis

I'm going back to Weeks 3 and 4 to learn about social network analysis, since the course is nearing completion. I will come back to the final wrap-up in Week 9 after I finish these two weeks' lessons.

Competency 3.1: Define social network analysis and its main analysis methods.

Social Network Analysis (SNA) provides insights into how different social processes unfold while learning happens in any learning environment. It helps us study the effects of interaction and social context in education. The basic network elements are actors and the relations between them.

The nodes or actors could be students, email addresses, tweets or similar entities. I would typically use SNA to look at the interaction between students, for example in a chatroom or discussion forum: who is talking to whom, who replies to whom, who is following which question, who voted for a question, and so on. Based on the interaction patterns, we can construct the network graph. From there we can see whether any measure of the network correlates with learning or performance.

Some measures in SNA for analysis are below:

Diameter:

Diameter is the longest of the shortest paths between any pair of nodes in a network. It indicates the maximum number of steps needed for any node to reach any other node.

Density:

Density is the ratio of the connections that actually exist in the network to all possible connections. It indicates the extent to which the actors in the network are talking to each other; the spread of information is very fast in a highly dense network.

Degree Centrality:

Degree centrality is a simple measure that indicates the overall number of connections each actor has in a network. In directed graphs, it splits into two more specific measures.
In-Degree Centrality:
In-degree centrality measures the number of other nodes that establish a connection to a particular node. It also reflects the popularity or prestige of a node in the network.
Out-Degree Centrality:
Out-degree centrality measures the number of nodes to which a particular node connects, i.e., how many others that node is talking to.

Betweenness Centrality:

Betweenness centrality indicates how often a node lies on the shortest paths between other nodes, in particular bridging the small sub-communities of the network. A brokerage role is best captured by this measure.

Closeness Centrality:

Closeness centrality measures how close a node is to everybody else in the network, in terms of shortest distances. It indicates how quickly a node can reach other nodes in the network.

Network Modularity:

Network modularity is used to identify sub-groups in which actors have close ties to each other and talk mainly among themselves. An algorithm for finding the giant component can be used to identify the largest set of connected nodes in the network; this filters out nodes that are not connected to the rest, making it easier to identify and analyse communities.
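To make these definitions concrete, here is a small sketch computing the measures above on a toy directed graph with networkx (the graph itself is made up):

import networkx as nx

# Toy directed graph: who replies to whom
G = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"),
                ("c", "d"), ("d", "b"), ("e", "b")])

print("In-degree:", nx.in_degree_centrality(G))       # popularity
print("Out-degree:", nx.out_degree_centrality(G))     # activity
print("Betweenness:", nx.betweenness_centrality(G))   # brokerage
print("Closeness:", nx.closeness_centrality(G))

# Density and diameter are usually reported on the undirected view
U = G.to_undirected()
print("Density:", nx.density(U))
print("Diameter:", nx.diameter(U))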


Friday, December 12, 2014

Competency 8.5

Competency 8.5: Examine texts from different categories and notice characteristics they might want to include in feature space for models and then use this reasoning to start to make tentative decisions about what kinds of features to include in their models.

I tried the Bazaar activity in Prosolo (but by myself, since I didn't get matched with a teammate) to explore advanced feature extraction in LightSIDE and see which features work well and give better performance. I first used the sentiment_sentences data set and configured stretchy patterns using the pre-defined categories positive and negative. There was a very significant improvement in performance from unigrams only to stretchy patterns.


To look at the details, I used the Explore Results pane to analyse the results.


The more indicative words that occurred more often had stronger weights (e.g. dull, too, enjoyable). Commonly occurring words like a, of and punctuation had little or no feature weight assigned to them. The stretchy patterns helped in predicting many positive and negative instances correctly, by taking into account the position and structure of the surrounding words. Examples below:
STRONG-POS [GAP] , but --> the movie is loaded with good intentions , but ---> neg
one [GAP] the STRONG-POS --> one of the best of the year --> pos
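As a toy illustration only (this is not how LightSIDE implements it), a stretchy pattern can be thought of as a pattern with a bounded gap of arbitrary words, something like this regular expression:

import re

# "one [GAP] the best", with the gap allowed to be 1-3 words
pattern = re.compile(r"\bone (?:\w+ ){1,3}the best\b")

print(bool(pattern.search("one of the best of the year")))  # True
print(bool(pattern.search("one is arguably the best")))     # True
print(bool(pattern.search("the best one")))                 # False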


In the newsgroup data set, there were overlaps between some categories, like religion & atheism or forsale & windows, due to shared words. In such cases the context should be captured more, using stretchy patterns.

In my test data set of plants classified into fruits, vegetables and flowers, the unigram features were the most predictive. The structure of the text was not of much importance, since the unigram feature space already gave a decent prediction compared with the one including bigrams and trigrams.




Thursday, December 11, 2014

Competency 8.3/ 8.4

Competency 8.3: Compare the performance of different models.

I compared two models, one from a unigram-only feature set and the other from a unigram, bigram and trigram feature set, using my test data set. I was at first using the newsgroup data set as suggested in the Prosolo assignment, but some options were not working for me in the Explore Results tab of LightSIDE. I was not sure I could make a proper analysis without feature weights, so I chose to use my small test data set instead. Below is the comparison of the two models:



Competency 8.4: Inspect models and interpret the weights assigned to different features as well as to reason about what these weights signify and whether they make sense.

I went to the Explore Results tab to do some basic error analysis. The confusion matrix of the 123-grams model was better than that of the 1-grams model. I looked in detail at specific features that predicted the wrong categories.


E.g. the term "flowering", which had a high feature influence for flowers, caused a fruit whose text contained the term to be wrongly predicted as a flower. A few terms like "genus" and "plants" did not make a correct prediction even together with their bigrams and trigrams:



The data set was very small, so it did not have enough features to train the model on. There were many wrong predictions in the case of punctuation features as well. I expect the model would do well when trained with more data using unigrams, bigrams and trigrams.



Wednesday, December 10, 2014

Competency 8.2

Competency 8.2: Build and evaluate models using alternative feature spaces.

I used the different feature spaces that I saved in the previous exercise to build models. My data set was very small, and I intended to use it just for testing. I found a significant improvement in metrics when comparing the models built from POS features vs. unigrams and bigrams. I could see from my data that the n-grams were the most predictive of the categories.



I couldn't find significant improvements in model metrics for many of the basic features. I used Naive Bayes as the classification algorithm. I also tried other algorithms, but there was not a big difference in the metric values. A few feature spaces I tried, along with the metrics for their models, are below:
Feature Space       Accuracy   Kappa
POS grams           42%        0.12
12 grams_count      58%        0.36
1 grams_pairs       61%        0.41
12 grams_length     61%        0.41
12 POS grams        65%        0.47
12 grams_no stop    69%        0.52
12 grams            73%        0.59
123 grams           73%        0.59

To test with a real data set, I tried the hands-on text feature extraction activity given in Prosolo using the sentiment_sentences data set. I extracted different feature spaces from the basic feature set and used logistic regression. There was a significant improvement when expanding the feature set.
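Outside LightSIDE, the same kind of feature-space comparison can be sketched with scikit-learn; this toy version (made-up sentences, Naive Bayes, 2-fold cross-validation) just shows the shape of the workflow:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, cohen_kappa_score

texts = ["loved it", "really good fun", "great and touching", "an excellent film",
         "dull and shallow", "a terrible waste", "really bad acting", "annoying and awful"]
labels = ["pos"] * 4 + ["neg"] * 4

for name, ngram_range in [("1 grams", (1, 1)), ("12 grams", (1, 2)), ("123 grams", (1, 3))]:
    model = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
    preds = cross_val_predict(model, texts, labels, cv=2)
    print(name, "accuracy:", accuracy_score(labels, preds),
          "kappa:", round(cohen_kappa_score(labels, preds), 2))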


Competency 8.1

Competency 8.1: Prepare data for use in LightSIDE and use LightSIDE to extract a wide range of feature types.

For the purpose of this exercise, I created a simple data set of three types of plants: vegetables, fruits and flowers. I classified text (taken from Wikipedia) into the three categories. It looked like this:


I loaded my input CSV file into LightSIDE and extracted basic features like unigrams and bigrams first. Then I checked different basic features and extracted their feature sets.


I saved all the feature sets for building models later using alternative feature spaces. 
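If you want to replicate the extraction step outside LightSIDE, a minimal sketch with pandas and scikit-learn could look like this (plants.csv and the column name "text" are my assumptions about the file layout):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("plants.csv")  # assumed file name and layout

# Unigrams + bigrams ("12 grams"), as binary presence features
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(df["text"])

print(X.shape)  # (number of documents, number of features)
print(vectorizer.get_feature_names_out()[:10])  # sample of the features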



Week 8 Activity - Data preparation

Activity: Textual data pre-processing and informal analysis

Rule 1:
I created a list of positive words (unigrams and bigrams) from the given data and used them to identify positive and negative instances.

IF (effective OR intriguing OR breathtaking OR captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR believable OR likeable OR very_successful OR enjoy OR interesting OR good OR entertaining OR great OR believable OR engaging) THEN pos ELSE neg

This rule doesn't classify all negative instances correctly, since some of them also contain positive words.

Rule 2:
This rule is based on a list of negative words from the given data:

IF (not_perfect OR dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR not_original) THEN neg ELSE pos

This rule predicts some positive instances wrongly, since a few of the negative words also occur in positive instances.

Rule 3: 
To overcome the issue of wrong predictions due to instances containing both positive and negative words, I used counts to see which side dominates.

FOR ALL(effective OR intriguing OR breathtaking OR captivated OR (NOT not)_perfect OR loved OR real_chemistry OR really_good OR charm OR enthralled OR beautifully_done OR thoughtprovoking OR poignant OR fabulous OR sweet OR true_chemistry OR so_well OR enjoy OR excellent OR well_handled OR touching OR believable OR likeable OR very_successful OR enjoy OR interesting OR good OR entertaining OR great OR believable OR engaging) Add 1 to count_pos for each occurrence

FOR ALL(not_perfect OR dull OR onedimensional OR misused OR unnatural OR lack OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR cliché OR waste OR unintentional_laughs OR silliness OR immaturity OR passionless OR false_hope OR collapse OR annoying OR undercut OR not_so_well OR disaster OR not_original) Add 1 to count_neg for each occurrence

IF count_pos > count_neg THEN pos
ELSE neg

Rule 4: 
We can see that the lists of words were hand-picked based on our sample data, so the above rule overfits our data. I removed words that may take on different contexts in different occurrences and kept only words that are predictive wherever they occur.

FOR ALL(effective OR breathtaking OR loved OR (real OR true_chemistry) OR really_good OR enthralled OR beautifully_done OR thoughtprovoking OR fabulous OR excellent OR well_handled OR very_successful) Add 1 to count_pos for each occurrence

FOR ALL(dull OR unnatural OR missmarketed OR went_wrong OR worst OR shallow OR awful OR terrible OR really_bad OR waste OR silliness OR annoying) Add 1 to count_neg for each occurrence

IF count_pos > count_neg THEN pos
ELSE neg
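Here is a minimal Python sketch of Rule 4's counting logic. The word lists are taken from the rule above (I read "(real OR true)_chemistry" as the two bigrams "real chemistry" and "true chemistry"), and plain substring matching is a simplification:

POS = ["effective", "breathtaking", "loved", "real chemistry", "true chemistry",
       "really good", "enthralled", "beautifully done", "thoughtprovoking",
       "fabulous", "excellent", "well handled", "very successful"]
NEG = ["dull", "unnatural", "missmarketed", "went wrong", "worst", "shallow",
       "awful", "terrible", "really bad", "waste", "silliness", "annoying"]

def classify(text):
    t = text.lower()
    count_pos = sum(t.count(w) for w in POS)  # occurrences of positive cues
    count_neg = sum(t.count(w) for w in NEG)  # occurrences of negative cues
    return "pos" if count_pos > count_neg else "neg"

print(classify("The leads had real chemistry and it was beautifully done."))  # pos
print(classify("A dull, shallow film and a waste of two hours."))             # neg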

Even though the above rule seems to fit okay, it may not be very predictive for instances that contain words other than the ones listed, or that use a listed word in the opposite context. Such cases can be captured to some extent by more complex rules involving the proximity of word occurrences. More features can be added and tested by cross-validation until we get a model with reasonable reliability. My takeaway is that it is not at all an easy task! :)




Tuesday, December 9, 2014

Week 6 Activity

In the activity for Week 6, we were asked to calculate the different metrics for assessing models that were discussed in Ryan Baker's unit on behavior detection and model assessment. Two data sets, classifier-data-asgn2.csv and regressor-data-asgn2.csv, were given.

I used Excel for these calculations, and for the last metric (A' or AUC) I downloaded a plugin called XLSTAT from http://www.xlstat.com/en/ since SPSS did not give the correct answer. I will detail the steps I followed to complete this activity of 11 questions. I urge you to save all your intermediate results, since you may need the answers from previous steps to continue with the next ones. To better understand the steps I've described, refer to the lecture videos :)

Q1) Using regressor-data-asgn2.csv, what is the Pearson correlation between data and predicted (model)? (Round to three significant digits; e.g. 0.24675 should be written as 0.247) (Hint: this is easy to compute in Excel)


Use the Excel function CORREL or PEARSON to calculate the Pearson correlation for the regressor model, with the two given columns as the input arrays. Round the number you get rather than truncating it, to get the correct answer.

Q2) Using regressor-data-asgn2.csv, what is the RMSE between data and predicted (model)? (Round to three significant digits; e.g. 0.24675 should be written as 0.247) (Hint: this is easy to compute in Excel)


Calculate the residuals (differences between the actual data and the predicted model) and use those values as the array in the formula below:
=SQRT(SUMSQ(A2:A1001)/COUNTA(A2:A1001))

Q3) Using regressor-data-asgn2.csv, what is the MAD between data and predicted (model)? (Round to three significant digits; e.g. 0.24675 should be written as 0.247) (Hint: this is easy to compute in Excel)


Calculate the absolute values of the residuals from the previous step (e.g. =ABS(RMSE!A2:A1001), entered as an array formula) and average them.
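If you prefer to cross-check Q1-Q3 outside Excel, a short Python sketch would be (the column names "data" and "predicted" are my assumptions, taken from the question wording):

import pandas as pd
import numpy as np

df = pd.read_csv("regressor-data-asgn2.csv")
residuals = df["data"] - df["predicted"]  # assumed column names

print("Pearson r:", df["data"].corr(df["predicted"]))
print("RMSE:", np.sqrt((residuals ** 2).mean()))
print("MAD:", residuals.abs().mean())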

Q4) Using classifier-data-asgn2.csv, what is the accuracy of the predicted (model)? Assume a threshold of 0.5. (Just give a rounded value rather than including the decimal; e.g. write 57.213% as 57) (Hint: this is easy to compute in Excel)


Compute a column of predicted labels, marking Y where the model value exceeds the 0.5 threshold. Compare it with the Y/N labels in the data to find the number of agreements. The accuracy is then the number of agreements divided by the total count.

Q5) Using classifier-data-asgn2.csv, how well would a detector perform, if it always picked the majority (most common) class? (Just give a rounded value rather than including the decimal; e.g. write 57.213% as 57) (Hint: this is easy to compute in Excel)


Calculate "= number of disagreements/total count". Use previous step values.

Q6) Is this detector’s performance better than chance, according to the accuracy and the frequency of the most common class?


Answer Yes or No by comparing the accuracy from Q4 with the majority-class frequency from Q5.

Q7) What is this detector’s value for Cohen’s Kappa? Assume a threshold of 0.5. (Just round to the first two decimal places; e.g. write 0.74821 as 0.75).


I counted the agreements and disagreements between the data and the prediction model to form the confusion matrix of True Negatives (TN), True Positives (TP), False Positives (FP) and False Negatives (FN), listed them in cells O5 to O8 as below, and then used this formula:
00 (TN) - data N, model N
11 (TP) - data Y, model Y
01 (FP) - data N, model Y
10 (FN) - data Y, model N
=((O5+O6)-((((O6+O7)*(O6+O8))/SUM(O5:O8))+(((O5+O7)*(O5+O8))/SUM(O5:O8))))/((SUM(O5:O8))-((((O6+O7)*(O6+O8))/SUM(O5:O8))+(((O5+O7)*(O5+O8))/SUM(O5:O8))))

Alternatively, you may apply the values from your confusion matrix to any online calculator for Cohen's Kappa.

Q8) What is this detector’s precision, assuming we are trying to predict “Y” and assuming a threshold of 0.5 (Just round to the first two decimal places; e.g. write 0.74821 as 0.75).


Use the formula Precision = TP / (TP + FP).

Q9) What is this detector’s recall, assuming we are trying to predict “Y” and assuming a threshold of 0.5 (Just round to the first two decimal places; e.g. write 0.74821 as 0.75).


Use the formula Recall = TP / (TP + FN).
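To sanity-check Q7-Q9 without the long spreadsheet formula, here is the same arithmetic in Python (TN, TP, FP, FN correspond to cells O5-O8 above; the counts in the example call are made up):

def kappa(tn, tp, fp, fn):
    n = tn + tp + fp + fn
    observed = tp + tn  # agreements
    # Chance agreement, summed over both classes
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n
    return (observed - expected) / (n - expected)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tn, tp, fp, fn = 600, 250, 80, 70  # made-up counts; substitute your own
print(round(kappa(tn, tp, fp, fn), 2))
print(round(precision(tp, fp), 2))
print(round(recall(tp, fn), 2))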

Q10) Based on the precision and recall, should this detector be used for strong interventions that have a high cost if mis-applied, or fail-soft interventions with low benefit and a low cost if mis-applied?


Select the correct option from the list of options.

Q11) What is this detector's value for A'? (Hint: There are some data points with the exact same detector confidence, so it is probably preferable to use a tool that computes A', such as http://www.columbia.edu/~rsb2162/computeAPrime.zip -- rather than a tool that computes the area under the ROC curve).


I used the ROC Curve feature of the XLSTAT plugin to compute the area under the curve (AUC) in Excel.
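As another alternative to XLSTAT, scikit-learn can compute the area under the ROC curve directly; note, as the question warns, that with tied confidence values ROC AUC and A' can differ slightly. The column names here are again my assumptions:

import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("classifier-data-asgn2.csv")
y_true = (df["data"] == "Y").astype(int)  # assumed column name and Y/N labels
print("AUC:", roc_auc_score(y_true, df["predicted"]))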


To compute A' without ROC curve, you may follow our co-learner's steps listed in his blog:

Hope this helps you to reach this screen! :)

All materials are based on the EdX course - Data, Analytics and Learning
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.