Assignment 9: Digits
Deadline
16 April 2021 (Friday), 4pm.
Prerequisite
- You are able to access the CS1010 programming environment.
- You are familiar with basic UNIX CLI and using terminal-based editor
vim
. - You have gone through Exercise 1 and have already set up your
.gitconfig
. - You are familiar with the steps of accepting, downloading, and submitting your exercise and assignment.
- You are familiar with the directory structure and the common files included in your assignment skeleton.
Learning Outcomes
- Be comfortable writing C programs that involve arithmetic operations,
long
,double
,bool
, andchar
types, conditionalif
/else
statements, loops withwhile
/for
/do-while
statements, arrays (including 2D arrays), strings, dynamic arrays, assert, and struct. - Be comfortable breaking down a problem into smaller subproblems that can be resolved using functions, including reusing existing functions written for other programs (with a tweak), writing a function that calls itself, designing what should the inputs and return values/types of a function be.
- Be able to analyze the running time of your code and design algorithm that avoid duplicated or redundant work, exploiting properties of the data.
- Be able to solve problems with recursion, particular those involving non-linear recursion and backtracking.
Setup
- Click on this link to accept the assignment.
- Login to one of the hosts of CS1010 programming environment
- Run:
1 |
|
- You should see the folder
as09-<github id>
in your home directory with the assignment skeleton inside.
Solving The Assignments
- Edit the files
digits.c
to solve the corresponding question as described inquestion.md
- You should break down the problem into smaller subproblems, and write one function for each of the subproblem to solve the problem.
- To compile and run tests with the sample inputs and outputs:
1 |
|
- The test cases are given in the
inputs
andoutputs
subdirectory as well as in~cs1010/as09
. You can usecat
orless
to look at the content of these test cases. You can add more test cases or edit the given ones if needed.
Submission
When you are ready, run the following command to submit:
1 |
|
The files digits.c
will be uploaded to GitHub. You can submit multiple times, but only the last submission will be graded.
Editing Your Files in Multiple Locations
You should edit your code only on the CS1010 PE hosts. If you choose to edit your code in other places, such as directly on Github or in a second location (such as your own laptop), you need to be comfortable with various git
command to synchronize your code across the different locations, possibly needing to resolve synchronization conflicts.
Only the C files listed above will be submitted. You can create additional C files for testing different solutions, but eventually, you must put your solution into the corresponding C file provided as the skeleton.
Identifying Yourself
In every C file that you submit to CS1010, you need to identify yourself by writing your name and tutorial group. Marks will be deducted if you fail to do so. You need to edit the line:
1 |
|
and change it to something like:
1 |
|
Grading
This assignment contributes towards 3.5% of your final grade. The total mark for this assignment is 35 marks. There are four marking criteria: correctness, documentation, style, efficiency. The distribution of the marks different by question. Please refer to questions.md
for details.
- Documentation: Marks are no longer allocated for documentation. The graders reserves the right to deduct up to two marks if you do not document your code properly. Please refer to the documentation and follow the recommended format. You need to explain the purpose of every file, function, and parameter. Logic in your code that is not obvious should be explained as well.
- Style: Marks are no longer allocated for style. But the graders reserves the right to deduct up to two marks if your code blatantly violates the style guidelines.
- Correctness: Note that passing the tests does not guarantee that your code is correct. Correctness here is defined in the broad sense of using the various programming constructs in C (type, function, variable, loops, conditionals, arithmetic expressions, logical expressions) properly, not just producing the correct output and bug-free.
- Efficiency: Some question has an efficiency requirement. You need to meet the efficiency requirement to obtain the efficiency marks.
New clang-tidy
checks
We have been encouraging students to break the solutions down into small functions. Not everyone adhere to this good practice, however. From this assigment onwards, we will generate a warning if your function is too large. We use a threshold of 50 statements and 6 nesting levels. Any function that exceeds this threshold will generate a warning.
Memory checks
We will start checking how you handle the memory returned from malloc
and friends (including those returned by the CS1010 I/O library, including checks for NULL
and calling free
). Penalty applies if you do not manage them correctly.
Remember to return non-zero if your program does not run to its completion and encounters unexpected error.
Reduced test data
We have been providing lots of test data to help you test your program. We will reduce the number of test cases provided from now on. You should come up with your own test cases to test your code to make sure it covers all the cases (especially the edge cases).
We reserve the right to penalize students for using banned C syntax in the assignments. In addition, each grader at his or her own discretion can penalize students for repeating mistakes / bad practices from the student's past assignments (even if it was just a warning with no marks deducted in the earlier assignments).
Responsibility
By now, you should be familiar with how to accept, get, solve, and submit a CS1010 assignment. From this assignment onwards, you need to take full responsibility of submitting the correct version of your assignment before the deadline, or accept the consequences of your actions. We will no longer entertain appeals for marks if you submit an incorrect version of your code, forget to submit the assignment, submit late due to forgotten password, or other reasons due to unprofessional behavior.
Question
To acclimatize you to the PE2 condition, the questions for this assignment are included in the skeleton instead, in a file named questions.md
.
You should learn how to use vim
to open both the .c
file and the questions.md
side-by-side (see https://nus-cs1010.github.io/2021-s1/vim.html#splitting-vims-viewport)
Digits (35 Marks)
We started CS1010 with an assignment question on digits, so it is fitting that we end with one.
In this assignment, your goal is to write a program that can recognize handwritten digits. This is a classic problem in pattern recognition and machine learning. The state of the art techniques can now achieve a recognition rate of over 99%. We, however, will implement a simple algorithm called k-nearest neighbor algorithm, which has a lower accuracy but is simpler.
Supervised Machine Learning
The k-nearest neighbor algorithm belongs to a category of algorithm called supervised machine learning algorithms. These algorithms teach or train the machine to learn to do something, by showing the machine many samples. These are called training samples. After showing the machine a bunch of these training samples, we can then test the machine how well it has learn, by showing the machine some testing samples.
In our case, we want to train the machine to recognize handwritten digits. During the training phase, we will present a set of handwritten digits in the form of images (as 2D arrays) to the machine, together with a label indicating which digits it is (i.e., "this picture shows the digit 1, this picture shows the digit 4, etc.").
During the testing phase, we will then present a set of handwritten digits to the machine, and ask, "ok, which digit to you think this is?"
We will then compare the machine's answer with the label of the handwritten digit (also called the ground truth) to see whether the machine recognize correctly or not.
Representation of a handwritten digit
A handwritten digit is an image, represented as a 28x28 array of characters, consisting of . and #. An example is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
|
The image above contains the digit 2.
Training Data
For this assignment, we will use the data provided by the MNIST Handwritten Digit Database. The original data has been post-procesed into CSV file by Joseph Redmon, then post-processed again by yours truly into the format above.
The dataset from above contains 60,000 training samples. I have created two smaller subsets for you to test your code.
The first subset contains 10 samples for each digit -- this is given in the file train100.in
.
The second subset contains 6 samples for digits 0 and 1 -- this is given in the file train6.in
.
The full training set from MNIST is too big to be included on GitHub. You can read it from ~cs1010/as09/train60000.in
if you want to play with it.
Testing Samples
We provide two sets of testing samples. The first contains three testing samples per digit. The second contains only a single digit corresponding to the example below.
The full training dataset from MNIST contains 10000 digits. It takes a long time to test every handwritten digit in this file, so you should do this only after you have made sure that your code is fast and is correct. Again, the file is too big to be posted on GitHub. You can read it directly from ~cs1010/as09/test10000.in
.
The Algorithm
We will use the k-nearest neighbor algorithm for this assignment. The idea of this algorithm is that, given a testing sample, the machine will compare it to all the training samples, and find out which digit this testing sample is most similar to.
Let's define the distance between the two handwritten digits d(x1,x2) as the number of pixels that are different between them, i.e., how many pixels are #
in one image but is .
in the other image.
Given a test sample, q, we find the distance between q and all the available training samples and find the k training samples with the smallest distance (i.e., k nearest neighbors).
k is usually small -- we use k=5 in this assignment. The intuition is that q must be "close" to the training data that has the same labels (i.e., the same digits). So we look at the these k nearest neighbors and find the most common digit d among them. We then recognize q as containing the digit d. If there are more than one most common digits or more than k nearest neighbors , then we break ties by returning the nearest (i.e., shorter distance) digit. In the case where even the distance is the same, we return the smaller digit.
Efficiency
Suppose we have n training samples, recognizing a digit should take no more than O(kn) time (or O(n) since k is a constant). There is also an opportunity to stop early the calculation of the distance between two handwritten digits if the distance is too large, pruning away redundant work.
Example
Consider the following simple example with six training samples and a test sample.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 |
|
The test sample is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
|
The distances between the test sample and each of the training sampes are 101, 120, 162, 174, 173, 162. The k nearest neighbors belong to digits 0, 0, 0, 1, and 1. The most common digit among these neighbors is 0, so we conclude (correctly) that the test sample is digit 0.
Using struct
One of the objectives of this assignment is to see if you know how to use struct
.
For this assignment, you must define and use two struct
:
- a
struct digit
that encapsulates the 2D array that represents the handwritten digit and its label, - a
struct neighbor
that encapsulates a neighboring digit and the distance to it.
Input/Output
Write a program digits that reads, from the standard input, the following:
-
A positive integer n, corresponding to the number of training samples, then repeatedly read n handwritten digits, containing:
-
a label corresponding to the digit in the next image (a number between 0 - 9)
-
28 lines of texts, consisting of '.' and '#' only, representing a handwritten digit
-
followed by another positive integer m, corresponding to the number testing samples, then repeatedly read m handwritten digits, containing:
- a label corresponding to the digit in the next image (a number between 0 - 9)
- 28 lines of texts, consisting of '.' and '#' only, representing a handwritten digit
Then prints, to the standard output, the following:
For each testing sample,
- print the digit it is labeled as, followed by a space, followed by the digit it is recognized as.
Finally, print a double value that corresponds to the accuracy, i.e, the percentage of training testing samples correctly recognized.
We separated the training samples and testing samples into two files, so the usual way of redirecting a file into your program does not work anymore (since you need two files).
The way to run your program is to do the following:
1 |
|
We use cat
which concatenate two files into one to pass both the training samples and testing samples into the program using the pipe |
.
The name convention for the output files is different for this assignment. The name is formatted as X-Y.out where X refers to the training samples trainX.in and Y refers to the testing samples testY.in.
Sample Runs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
|
When you use the file ~cs1010/as09/train60000.in
as the training samples, you should receive 100% accuracy with test30.in
and about 96.56% accuracy with ~cs1010/as09/test10000.in
Grading
- 10 marks for efficiency. Your solution should run in O(n) time to get a full marks.
- 25 marks for correctness.
- No marks are allocated for style and documentation, but we can deduct up to 5 marks for each category.