Project 01 - Don’t Be Sentimental
Due Date: Sept 14, 2020 at 6am
Submission Method:
- Github Classroom Repo Automatically created when you click on this link
Introduction
Have you ever read a tweet and thought, “Gee, what a positive outlook!”? Or, maybe you read a movie review and though, “Wow, why so negative, friend?” Or maybe you read a comment on Instagram and thought, “hmm, this sounds negative, but does Insta know it is negative?” In other words, can we computationally make the same determination about positivity or negativity? Sure, why not?
In Machine Learning, the task of assigning a label to a data item is called broadly classification (putting things into different classes or categories). The more specific name for what we’re going to do is sentiment analysis because you’re trying to determine the “sentiment” or attitude or a block of text based on the words in the review. Project 1 is to build a sentiment classifier! Aren’t you excited?? ( ← That would be positive sentiment!)
You’ll be given a set of Movie Reviews from IMDB that are already pre-classified as positive or negative based on their content. You’ll use word frequency analysis to find common words in the negative reviews and common words in the positive reviews.
Our Data Set
For this project, we will use a data set containing 50,000 movie reviews from IMDB.
The data set is hosted on Kaggle. You can find it here. Note that you may need to create a Kaggle account to access, but Kaggle is a great platform to be a part of. So it is worth it. Once on the Kaggle page for the data set, you should see a Download button on the right under the header image.
The downloaded zip file contains a single CSV file. CSV stands for comma separated values. The first line in the file contains the column headers. The remaining 50K lines contain the review and the pre-classified value for the review. As you can probably guess, a review written in English might contain Commas. So, each individual review is surrounded by double quotes(“…review…”).
We will consider the line of the input file upon which each review appears as the case ID or review ID.
The original data set comes from a Stanford AI Research Lab.
Building a Classifier
The goal in classification is to assign a class label to each element of a data set. Of course, we want this done with the highest accuracy possible. For us, there are only two classes or labels: positive sentiment and negative sentiment. At a high level, the process to build a classifier (and many other machine learning models) is this:
- Train
- Use a training data set with pre-classified members.
- Assume you have 10 reviews and each is pre-classified with + or - sentiment. How might you go about analyzing the words in the reviews to help in assigning + or - sentiment?
- The first 40,000 rows of the input file will be the training data set.
- Test
- Now, you give your classifier un-labeled reviews and ask it to output the class it determines. But behind the scenes, you already know what class each review actually belongs to.
- Compare the predicted class of each review (the output of your classifier) and the actual class of each review and determine the accuracy. In other words, how correct was your classifier?
- The last 10,000 rows of the input file will be the testing data set.
Output File
Your program will create one (1) output file:
- The first like of the file will contain the accuracy of your classifier for classification of the test data set. This will be a floating point number with exactly 4 decimal places of precision.
- The remaining lines of the file will contain the case IDs of the incorrectly classified cases from the Test Data Set.
Example of Output File (note that the data is fake):
0.6192
43557
48645
The above indicates an accuracy of 61.92% on the testing data set when comparing your predicted sentiment to the pre-classified sentiment present in the input file. It also indicates that the reviews on rows 43557 and 48645 were incorrectly classified by your classifier.
Training your Classification Algorithm
You’ll use word frequencies to train your classification algorithm. For the reviews in the testing data set, you’ll calculate word frequencies for the positive reviews and for the negative reviews. You might expect a few things to happen:
- certain words might appear more frequently in negative reviews. Other words might appear more frequently in positive reviews.
- certain words might appear frequently in both positive and negative reviews. Are these useful at all?
As you develop your classification algorithm, you can use the output file containing the incorrectly classified reviews to refine your classifier.
How Good is Your Classifier
Training and Testing of a classification algorithm is an iterative process. You’ll develop a training algorithm, test it, evaluate its performance, tweak the algo, retrain, retest, etc. How do you know how good your classifier is after each development iteration, though? We will use accuracy as the metric for evaluation.
total number of correctly classified cases from testing set
Accuracy = ----------------------------------------------------------------
total number of cases in the testing set
Running Your Program
Your program will have two modes controlled by command line arguments:
- Running Catch TDD Tests
- This mode will be indicated by passing no command line arguments
- Training and Testing the Classifier
- This mode will be indicated by the presence of two command line args
- Arg 1: the name of the input csv file
- Arg 2: the name of the file to which you want to write the output
- This mode will be indicated by the presence of two command line args
Implementation Requirements
You must implement your own custom string class as part of this project. You may not use the STL string class or any other available string class from the internet. Additionally, you may NOT use c-strings (null-terminated character arrays) except to implement your custom string class and as a temporary buffer when reading from a file. More on the requirements of this custom class will be in a separate handout.
You will also need to implement a suite of tests for the string class using the CATCH TDD library. More on this in a separate handout in the first week of lab. You’ll provide a testing mode (the mode with no args) which will execute your tests.
More information on the DLString Class you will be implemented can be found here.
Hopefully Helpful Thoughts
This is your opportunity to make a great first impression on the Prof/TAs in CSE 2341. We will be looking for simple, elegant, well-designed solutions to this problem. Please do not try to “wow” us with your knowledge of random, esoteric programming practices. Here is a list of things that you might want to consider while tackling this project:
- Procedural vs. Object Oriented Design
- A seemingly infinite amount of software has been designed, implemented, deployed, maintained, updated, and redeployed using both of these paradigms. One could argue for days, or week even, about which is the “better” paradigm for modern software development. Regardless of which paradigm you choose to use, the most important thing is that you produce an elegant solution following solid software development practices.
- File Input and output
- It is so important to be able to read from and write to data files. Think about some software program that doesn’t use files…
- Just the right amount of comments in just the right places
- Minimal amount of code in the driver method (main, in the case of C++)
- The code in main should be minimal and only used to “get the ball rolling.”
- Proper use of command-line arguments
- Proper memory management
Your Submission
You must have your final commit pushed to Github by the deadline - Sept 14, 2020 at 6AM CDT.
Your project submission will be evaluated using the following point break-down:
Deliverable | Points |
---|---|
DLString Class | 15 |
CATCH Test | 5 |
Training Algo Implementation | 25 |
Testing Algo Implementation | 25 |
Source code quality | 20 |
Proper use of Git and GitHub | 10 |