Task steps:
1. Create an author-to-author tweet edge file from the original data set, stocktwit_graph_input.csv.
To create a graph we need only two columns: the source (Vertex 1) and target (Vertex 2) of each edge. Select all rows (tweets) for columns K (“from_person”) and M (“to_person”), or columns J and L for numerical author IDs, and save the result as “stocktwit_from_to” or another name you prefer. A sketch of this step in R appears below.
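A minimal sketch of this extraction in R; that the file’s header labels the author-name columns “from_person” and “to_person” is an assumption:

edges <- read.csv("stocktwit_graph_input.csv", stringsAsFactors = FALSE)
# Keep only the source and target of each edge; Gephi reads Source/Target headers directly.
edge_list <- data.frame(Source = edges$from_person, Target = edges$to_person)
write.csv(edge_list, "stocktwit_from_to.csv", row.names = FALSE)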
2. Use Gephi to generate and save author (node) metrics. Select the metrics you would like to explore and use for building models later, including at least 5 different metrics. Save the metrics in a file named stocktwit_node_yourname.csv and submit this file. Include answers to the following questions in HW6_yourname.doc for submission.
a. Which three authors have the highest betweenness centrality?
b. Which three authors have the highest total degree?
c. Which three authors have the highest closeness?
3. Build the Node Table for Prediction
(1). Open the stocktwit_node.csv file in Excel and create a new variable, Expert (i.e., “suggested”). This is the target variable we aim to classify or predict.
(2). Keep the stocktwit_node.csv file open, open the stocktwit_graph_input.csv file in the same Excel session, and then switch back to stocktwit_node.csv.
(3). Note that the unit of analysis in the stocktwit_node.csv file is a node (i.e., each individual author), while the unit in the stocktwit_graph_input.csv file is a tweet (i.e., each message). So, in order to transfer the value of “suggested” from the stocktwit_graph_input table to the stocktwit_node table, we need a data transformation.
For Expert, we need to assign one value per author indicating whether they are an expert or not (1 stands for yes; 0 stands for no).
Use the VLOOKUP function to assign the value of “suggested” from the stocktwit_graph_input table to the “Expert” column in the stocktwit_node table. The function for the first row should look like this:
=VLOOKUP(A2, stocktwit_graph_input.csv!$K$1:$AB$38200, 18, FALSE),
where “A2” is the node name; “stocktwit_graph_input.csv!$K$1:$AB$38200” is the table range we look up; 18 is the column number within that range from which we return the value; and “FALSE” requests an exact match.
(4). Save the stocktwit_node.csv file. You may also delete the rows that have a missing value in Expert: these nodes appear only in the “to_person” column and have no tweets of their own. Use Excel’s filter function to remove the #N/A rows. A hedged R alternative to steps (3) and (4) is sketched below.
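For reference, an R alternative to the Excel lookup; the column names “Id” (Gephi’s default node identifier), “from_person”, and “suggested” are assumptions about the two files:

nodes  <- read.csv("stocktwit_node.csv", stringsAsFactors = FALSE)
tweets <- read.csv("stocktwit_graph_input.csv", stringsAsFactors = FALSE)
# One row per author: keep the first "suggested" value seen for each from_person.
lookup <- tweets[!duplicated(tweets$from_person), c("from_person", "suggested")]
# The default inner join drops nodes with no tweets (the rows Excel shows as #N/A).
nodes <- merge(nodes, lookup, by.x = "Id", by.y = "from_person")
names(nodes)[names(nodes) == "suggested"] <- "Expert"
write.csv(nodes, "stocktwit_node.csv", row.names = FALSE)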
4. In R, build and evaluate a classification model that uses the metrics in stocktwit_node_yourname.csv from step 2 as features to classify authors as “expert” stocktwit authors (“suggested” = 1) or not (“suggested” = 0); Expert is the target label variable.
(1). Using a seed of 100, randomly select 60% of the rows into a training set (e.g., called traindata). Divide the remaining 40% of the rows evenly into two holdout test/validation sets (e.g., called testdata1 and testdata2), as in the sketch below.
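A minimal sketch of the split, assuming the prepared node table has been read into a data frame called nodes with the target column Expert:

set.seed(100)
nodes$Expert <- as.factor(nodes$Expert)   # C5.0 requires a factor target
train_idx <- sample(nrow(nodes), size = round(0.6 * nrow(nodes)))
traindata <- nodes[train_idx, ]
holdout   <- nodes[-train_idx, ]
first_half <- seq_len(nrow(holdout)) <= nrow(holdout) / 2
testdata1 <- holdout[first_half, ]
testdata2 <- holdout[!first_half, ]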
(2). Build the tree using the C5.0 function from the C50 package with default settings, for example:
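The sketch below continues from the split above; treating “Id” as the author identifier column is an assumption:

library(C50)
# Drop the author identifier so it is not used as a predictor.
features <- setdiff(names(traindata), c("Id", "Expert"))
model <- C5.0(x = traindata[, features], y = traindata$Expert)
summary(model)   # prints the tree and the training error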
(3). Generate predictions (i.e., estimations) of the values of the target variable for the testing instances.
Generate a confusion matrix that shows the counts of true-positive, true-negative, false-positive, and false-negative predictions for both testdata1 and testdata2, treating 1 as the positive class.
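One possible way to produce the predictions and confusion matrices, continuing the sketch above:

pred1 <- predict(model, testdata1)
pred2 <- predict(model, testdata2)
# Rows are predicted labels, columns are actual labels; 1 is the positive class.
cm1 <- table(Predicted = pred1, Actual = testdata1$Expert)
cm2 <- table(Predicted = pred2, Actual = testdata2$Expert)
cm1
cm2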
Generate seven performance metrics: accuracy (the percent of all correctly classified testing instances), plus precision (the percent of instances predicted to be in a class that actually belong to it), recall (also called the true positive rate), and F-measure (also called F-score) for each of the two classes of Expert.
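A sketch of computing these metrics from a confusion matrix laid out as above; swapping the “1” and “0” labels yields the same metrics for the non-expert class:

metrics <- function(cm, positive = "1", negative = "0") {
  tp <- cm[positive, positive]; tn <- cm[negative, negative]
  fp <- cm[positive, negative]; fn <- cm[negative, positive]
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / sum(cm),
    precision = precision,
    recall    = recall,
    f_measure = 2 * precision * recall / (precision + recall))
}
metrics(cm1)                                   # expert class, testdata1
metrics(cm2)                                   # expert class, testdata2
metrics(cm1, positive = "0", negative = "1")   # non-expert class, testdata1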
(4). Would you recommend using the features from network analysis to identify experts in the Stocktwit community? Why or why not? Include your answer in HW6_yourname.doc for submission.