Project 02 - Auto Indexer

Due Date: Sept 28, 2020 at 6am

New Due Date: Oct 5, 2020 at 6am

Submission Method:

Github Classroom Repo Automatically created when you click on this link

**See FAQ & Errata Section for more information, corrections, etc. **

Introduction

Professor Jackson was just assigned to be the editor of a riveting textbook titled “Advanced Data Structure Implementation and Analysis”. She is super excited about the possibility of delving into the material and checking it for technical correctness. However, one of the more mundane tasks she must perform is creating an index for the book. Everyone has used the index at the back of a book before. An index organizes important words or phrases in alphabetical order together with a list of pages on which they can be found. But, who or what creates these indexes? Do humans create them? Do computers create them? As a comp sci prof, Jackson decides she wants to automate the process as much as possible because she knows that an automated indexer is faster and more accurate, and because it can be reused later when she finishes writing her own book. So as she is editing the book, she keeps a list of words on each page that should be included in the index. However, time is short, and she needs to get the book edited AND indexed quickly. She’s enlisted your help to write an AutoIndexer.

Your Task

You will implement a piece of software that can read in Professor Jackson’s keyword file (raw ASCII text with page indications), process the keyword data from the book, and output the complete index to a separate file. All of this must be done within specific implementation constraints described in the forthcoming sections.

Input File

The input text file will contain a list of keywords and phrases from the book separated into groups based on the page each word or phrase appears on. The end of the list of keywords will be indicated by <-1> at the end of the file. If a phrase is to be indexed, the words that comprise the phrase will be surrounded by square brackets (ex: [binary search tree]). No index word or phrase will exceed 40 30 characters in length (not including square brackets for phrases).

Some words or phrases may be followed by a parenthetical word or phrase representing the parent category. For example quicksort(sorting) means that quicksort should appear as an independent entry but also as a subordinate listing of the category sorting. If the parent category is a phrase, it will also be enclosed in square brackets as in trees([hierarchical structures]). Both the index term and the parent category may be phrases as in [binary search tree]([hierarchical data structures]). A particular index term need only be succeeded by the parent category once in the input file, and it might not be the first occurrence. So, if quicksort appears on 5 pages and in one of those entries in the input file it is succeeded by parent category (sorting), all entries for quicksort should appear under sorting as well. A parent category will never have more than 1 level of sub-entries. This means that for the following two entries [binary search tree]([binary tree]) and [binary tree](tree), binary search tree would not appear as a grandchild to tree. Note that the parent category may never appear as an independent index entry.

Here are a few things you should know about Prof. Jackson’s messy style for keeping track of the keywords. She didn’t pay attention to letter case, so you’ll need to account for that in your program. This means that ‘tree’ and ‘Tree’ should be considered as the same word. Page numbers will appear in angle brackets (ex: <8>) and will always be on their own individual line.

Page numbers will not necessarily be in order. Because of the editing process, Jackson may accidentally repeat page numbers due to re-reading the same section multiple times. This may mean she accidentally lists a word twice on the same page. In this case, there’s no need to list the word or phrase twice in the index. A (very very) simple input text file can be found in Listing 1.

Listing 1

<15>
algorithm [binary tree] analysis heap
<1>
[binary search tree] analysis 
complexity algorithm [2-3 tree]
<5>
[b+ tree](tree) [Binary Tree](tree)
<8>
graph clique Tree
<5>
tree [full binary tree] 
[complete binary tree]
<-1>

Listing 2

[2]
2-3 tree: 1
[A]
algorithm: 1, 15
analysis: 1, 15
[B]
b+ tree: 5
binary search tree: 1
binary tree: 5, 15
[C]
clique: 8
complexity: 1
complete binary tree: 5
[F]
full binary tree: 5
...
[T]
Tree: 5, 8 
  b+ tree: 5
  binary tree: 5, 15

Output Text File

The output text file will be organized in ascending order with numeric index categories appearing before alphabetic categories. Each category header will appear in square brackets followed by index entries that start with that character in ascending alphabetic or numeric order. An index entry will consist of the indexed word, a colon, then a list of page numbers where that word was found in ascending order. If an index term has sub-listings, those sub-listings should be indented 2 spaces.

No output line should be longer than 50 characters. The line should wrap before 50 characters and subsequent lines for that particular index entry should be indented 4 spaces. An example output text file can be found in Listing 2.

Implementation Requirements

You’ll implement both a templated DSVector and templated DSList class for this project. We will not dictate where or how they should be used, but you must use both of them in a meaningful way in your project.

The DSVector Class

You’ll need to implement a vector class that should minimally include the following features/functionality:

a vector shall be able to hold any data type (template your class)
a vector shall be a contiguously allocated, homogeneously typed sequential container
a vector shall grow as needed
- in other words, don’t start with an array of 500,000 elements or something like that
a vector shall minimally contain the following functionality:
- add a new item to the container
- access elements using the [] operator
- remove an element from the container
- follow rule of three There’s a great deal of other functionality that SHOULD be included, but this is the minimum amount needed. You should make sure your vector class is adequately tested using TDD with the CATCH2 framework.

The DSList Class

You’ll need to implement a linked list class that should also be templated and include functionality similar to the DSVector class. One could argue that they both inherit from the same super class List. However, this isn’t a project requirement. This class should have a test suite as well using CATCH2.

Assumptions

You may make the following simplifying assumptions in your project:

The input file will be properly formatted according to the rules above
You need to remove punctuation from the input file words. ‘Data!!!’ and ‘data’ should be considered the same word
No line of text in the input file will contain more than 80 characters
No word or phrase will be longer than 30 characters not including the brackets or parentheses
Different forms of the same word should be considered as individual entries in the index (e.g. run, runs, and running would each be considered individual words)

Running Modes

Testing - executing with no command line arguments means you should execute your CATCH tests.
Processing - 2 command line arguments
- Argument 1: the name of the input file
- Argument 2: the name of the output file

What to Submit

All of your source code to GitHub via the GitHub Classroom linked at the top of this document.

Grading

Deliverable	Points
DSVector Class	10
DSList Class	10
CATCH Test	10
Proper Templating Implementation	10
Basic Indexing	15
Handling Phrases	10
Handling Subordinate Entries	10
Proper output formatting	5
Following good coding practices*	10
Source code quality, readability, comments, documentation	10

* Example of a good coding practice would be following the rule-of-3 (or rule-of-5 if using move semantics)

FAQ and Errata Section

Errata

The max word/phrase length will be 30 characters, not 40.

FAQ

What do I do with punctuation that shows up in the middle of a word or phrase?
- When dealing with punctuation, you only need to remove trailing punctuation, meaning those characters that might come at the end of a word (if the indexable thing is a single word) or at the end of a phrase (if the indexable thing is a phrase).