Assignment 9: Digits

Deadline

16 April 2021 (Friday), 4pm.

Prerequisite

You are able to access the CS1010 programming environment.
You are familiar with basic UNIX CLI and using terminal-based editor vim.
You have gone through Exercise 1 and have already set up your .gitconfig.
You are familiar with the steps of accepting, downloading, and submitting your exercise and assignment.
You are familiar with the directory structure and the common files included in your assignment skeleton.

Learning Outcomes

Be comfortable writing C programs that involve arithmetic operations, long, double, bool, and char types, conditional if/else statements, loops with while/for/do-while statements, arrays (including 2D arrays), strings, dynamic arrays, assert, and struct.
Be comfortable breaking down a problem into smaller subproblems that can be resolved using functions, including reusing existing functions written for other programs (with a tweak), writing a function that calls itself, designing what should the inputs and return values/types of a function be.
Be able to analyze the running time of your code and design algorithm that avoid duplicated or redundant work, exploiting properties of the data.
Be able to solve problems with recursion, particular those involving non-linear recursion and backtracking.

Setup

Click on this link to accept the assignment.
Login to one of the hosts of CS1010 programming environment
Run:

1	`~cs1010/get-as09`

You should see the folder as09-<github id> in your home directory with the assignment skeleton inside.

Solving The Assignments

Edit the files digits.c to solve the corresponding question as described in question.md
You should break down the problem into smaller subproblems, and write one function for each of the subproblem to solve the problem.
To compile and run tests with the sample inputs and outputs:

make

The test cases are given in the inputs and outputs subdirectory as well as in ~cs1010/as09. You can use cat or less to look at the content of these test cases. You can add more test cases or edit the given ones if needed.

Submission

When you are ready, run the following command to submit:

1	`~cs1010/submit-as09`

The files digits.c will be uploaded to GitHub. You can submit multiple times, but only the last submission will be graded.

Editing Your Files in Multiple Locations

You should edit your code only on the CS1010 PE hosts. If you choose to edit your code in other places, such as directly on Github or in a second location (such as your own laptop), you need to be comfortable with various git command to synchronize your code across the different locations, possibly needing to resolve synchronization conflicts.

Only the C files listed above will be submitted. You can create additional C files for testing different solutions, but eventually, you must put your solution into the corresponding C file provided as the skeleton.

Identifying Yourself

In every C file that you submit to CS1010, you need to identify yourself by writing your name and tutorial group. Marks will be deducted if you fail to do so. You need to edit the line:

1	`@author XXXX (Group YYYY)`

and change it to something like:

1	`@author Sarah Manning (Group G02)`

Grading

This assignment contributes towards 3.5% of your final grade. The total mark for this assignment is 35 marks. There are four marking criteria: correctness, documentation, style, efficiency. The distribution of the marks different by question. Please refer to questions.md for details.

Documentation: Marks are no longer allocated for documentation. The graders reserves the right to deduct up to two marks if you do not document your code properly. Please refer to the documentation and follow the recommended format. You need to explain the purpose of every file, function, and parameter. Logic in your code that is not obvious should be explained as well.
Style: Marks are no longer allocated for style. But the graders reserves the right to deduct up to two marks if your code blatantly violates the style guidelines.
Correctness: Note that passing the tests does not guarantee that your code is correct. Correctness here is defined in the broad sense of using the various programming constructs in C (type, function, variable, loops, conditionals, arithmetic expressions, logical expressions) properly, not just producing the correct output and bug-free.
Efficiency: Some question has an efficiency requirement. You need to meet the efficiency requirement to obtain the efficiency marks.

New `clang-tidy` checks

We have been encouraging students to break the solutions down into small functions. Not everyone adhere to this good practice, however. From this assigment onwards, we will generate a warning if your function is too large. We use a threshold of 50 statements and 6 nesting levels. Any function that exceeds this threshold will generate a warning.

Memory checks

We will start checking how you handle the memory returned from malloc and friends (including those returned by the CS1010 I/O library, including checks for NULL and calling free). Penalty applies if you do not manage them correctly.

Remember to return non-zero if your program does not run to its completion and encounters unexpected error.

Reduced test data

We have been providing lots of test data to help you test your program. We will reduce the number of test cases provided from now on. You should come up with your own test cases to test your code to make sure it covers all the cases (especially the edge cases).

We reserve the right to penalize students for using banned C syntax in the assignments. In addition, each grader at his or her own discretion can penalize students for repeating mistakes / bad practices from the student's past assignments (even if it was just a warning with no marks deducted in the earlier assignments).

Responsibility

By now, you should be familiar with how to accept, get, solve, and submit a CS1010 assignment. From this assignment onwards, you need to take full responsibility of submitting the correct version of your assignment before the deadline, or accept the consequences of your actions. We will no longer entertain appeals for marks if you submit an incorrect version of your code, forget to submit the assignment, submit late due to forgotten password, or other reasons due to unprofessional behavior.

Question

~~To acclimatize you to the PE2 condition, the questions for this assignment are included in the skeleton instead, in a file named questions.md.~~

You should learn how to use vim to open both the .c file and the questions.md side-by-side (see https://nus-cs1010.github.io/2021-s1/vim.html#splitting-vims-viewport)

MNIST

Digits (35 Marks)

We started CS1010 with an assignment question on digits, so it is fitting that we end with one.

In this assignment, your goal is to write a program that can recognize handwritten digits. This is a classic problem in pattern recognition and machine learning. The state of the art techniques can now achieve a recognition rate of over 99%. We, however, will implement a simple algorithm called k-nearest neighbor algorithm, which has a lower accuracy but is simpler.

Supervised Machine Learning

The k-nearest neighbor algorithm belongs to a category of algorithm called supervised machine learning algorithms. These algorithms teach or train the machine to learn to do something, by showing the machine many samples. These are called training samples. After showing the machine a bunch of these training samples, we can then test the machine how well it has learn, by showing the machine some testing samples.

In our case, we want to train the machine to recognize handwritten digits. During the training phase, we will present a set of handwritten digits in the form of images (as 2D arrays) to the machine, together with a label indicating which digits it is (i.e., "this picture shows the digit 1, this picture shows the digit 4, etc.").

During the testing phase, we will then present a set of handwritten digits to the machine, and ask, "ok, which digit to you think this is?"

We will then compare the machine's answer with the label of the handwritten digit (also called the ground truth) to see whether the machine recognize correctly or not.

Representation of a handwritten digit

A handwritten digit is an image, represented as a 28x28 array of characters, consisting of . and #. An example is:

............................
............................
............................
..........#######...........
.........########...........
........##########..........
.......####....###..........
.......###....####..........
..............####..........
.............####...........
............#####...........
............####............
...........####.............
...........###..............
..........####..............
.........####...............
.........####...............
........####................
........###.................
........###...........####..
........###################.
........###################.
.........###########........
............................
............................
............................
............................
............................

The image above contains the digit 2.

Training Data

For this assignment, we will use the data provided by the MNIST Handwritten Digit Database. The original data has been post-procesed into CSV file by Joseph Redmon, then post-processed again by yours truly into the format above.

The dataset from above contains 60,000 training samples. I have created two smaller subsets for you to test your code.

The first subset contains 10 samples for each digit -- this is given in the file train100.in.

The second subset contains 6 samples for digits 0 and 1 -- this is given in the file train6.in.

The full training set from MNIST is too big to be included on GitHub. You can read it from ~cs1010/as09/train60000.in if you want to play with it.

Testing Samples

We provide two sets of testing samples. The first contains three testing samples per digit. The second contains only a single digit corresponding to the example below.

The full training dataset from MNIST contains 10000 digits. It takes a long time to test every handwritten digit in this file, so you should do this only after you have made sure that your code is fast and is correct. Again, the file is too big to be posted on GitHub. You can read it directly from ~cs1010/as09/test10000.in.

The Algorithm

We will use the k-nearest neighbor algorithm for this assignment. The idea of this algorithm is that, given a testing sample, the machine will compare it to all the training samples, and find out which digit this testing sample is most similar to.

Let's define the distance between the two handwritten digits d(x1,x2) as the number of pixels that are different between them, i.e., how many pixels are # in one image but is . in the other image.

Given a test sample, q, we find the distance between q and all the available training samples and find the k training samples with the smallest distance (i.e., k nearest neighbors).

k is usually small -- we use k=5 in this assignment. The intuition is that q must be "close" to the training data that has the same labels (i.e., the same digits). So we look at the these k nearest neighbors and find the most common digit d among them. We then recognize q as containing the digit d. If there are more than one most common digits or more than k nearest neighbors , then we break ties by returning the nearest (i.e., shorter distance) digit. In the case where even the distance is the same, we return the smaller digit.

Efficiency

Suppose we have n training samples, recognizing a digit should take no more than O(kn) time (or O(n) since k is a constant). There is also an opportunity to stop early the calculation of the distance between two handwritten digits if the distance is too large, pruning away redundant work.

Example

Consider the following simple example with six training samples and a test sample.

6
0
............................
............................
............................
............................
...............#####........
...............#####........
.............########.......
............##########......
...........###########......
..........########.###......
..........####.##...###.....
.........#####......###.....
........####........###.....
.......####.........###.....
.......###..........###.....
......####..........###.....
......###..........###......
......###.........####......
......###........###........
......###......####.........
......####..#######.........
......###########...........
.......########.............
........#####...............
............................
............................
............................
............................
0
............................
............................
............................
............................
.................####.......
................######......
...............#######......
.............#########......
...........###########......
..........############......
.........#############......
........#####.####.###......
.......#####..###...##......
.......####.........##......
.......##...........##......
......###...........##......
.....####..........###......
.....####..........###......
.....####.........###.......
.....####.......####........
.....##############.........
......############..........
........########............
.........######.............
............................
............................
............................
............................
0
............................
............................
............................
............................
............................
..............######........
..............########......
.............#########......
.............#########......
............##########......
..........######..####......
..........#####...####......
..........#####...####......
..........#####..#####......
..........####...#####......
.........####....####.......
........#####...#####.......
........#####...####........
........#####..#####........
........####..#####.........
........##########..........
........#########...........
........#########...........
.........######.............
..........#####.............
............................
............................
............................
1
............................
............................
............................
............................
............................
..................####......
.................#####......
.................#####......
................####........
...............#####........
...............####.........
...............####.........
..............####..........
.............####...........
............####............
............####............
...........####.............
...........####.............
..........####..............
..........####..............
.........####...............
.........####...............
........####................
........####................
.........###................
............................
............................
............................
1
............................
............................
............................
............................
............###.............
............####............
............####............
............####............
............####............
............####............
............#####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............#####..........
.............#####..........
..............####..........
..............####..........
..............####..........
...............###..........
............................
............................
............................
............................
1
............................
............................
............................
............................
............................
.............##.............
.............##.............
.............##.............
.............##.............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
..............###...........
..............##............
..............##............
..............###...........
..............##............
..............##............
............................
............................
............................

The test sample is

1
0
............................
............................
............................
............................
............................
..............#####.........
.............#######........
...........##########.......
..........####.##.####......
.........####......####.....
.........###........###.....
........####.........##.....
........###..........##.....
.......###...........##.....
.......###...........##.....
.......###...........##.....
......####...........##.....
......####..........###.....
.......###..........###.....
.......###.........###......
........###.......###.......
........####....#####.......
.........##########.........
..........#########.........
............#####...........
............................
............................
............................

The distances between the test sample and each of the training sampes are 101, 120, 162, 174, 173, 162. The k nearest neighbors belong to digits 0, 0, 0, 1, and 1. The most common digit among these neighbors is 0, so we conclude (correctly) that the test sample is digit 0.

Using struct

One of the objectives of this assignment is to see if you know how to use struct.

For this assignment, you must define and use two struct:

a struct digit that encapsulates the 2D array that represents the handwritten digit and its label,
a struct neighbor that encapsulates a neighboring digit and the distance to it.

Input/Output

Write a program digits that reads, from the standard input, the following:

A positive integer n, corresponding to the number of training samples, then repeatedly read n handwritten digits, containing:
a label corresponding to the digit in the next image (a number between 0 - 9)
28 lines of texts, consisting of '.' and '#' only, representing a handwritten digit
followed by another positive integer m, corresponding to the number testing samples, then repeatedly read m handwritten digits, containing:
- a label corresponding to the digit in the next image (a number between 0 - 9)
- 28 lines of texts, consisting of '.' and '#' only, representing a handwritten digit

Then prints, to the standard output, the following:

For each testing sample,

print the digit it is labeled as, followed by a space, followed by the digit it is recognized as.

Finally, print a double value that corresponds to the accuracy, i.e, the percentage of training testing samples correctly recognized.

We separated the training samples and testing samples into two files, so the usual way of redirecting a file into your program does not work anymore (since you need two files).

The way to run your program is to do the following:

1	`cat <training samples> <testing samples> \| ./digits`

We use cat which concatenate two files into one to pass both the training samples and testing samples into the program using the pipe |.

The name convention for the output files is different for this assignment. The name is formatted as X-Y.out where X refers to the training samples trainX.in and Y refers to the testing samples testY.in.

Sample Runs

ooiwt@pe121:~/cs1010/as09$ cat inputs/train6.in inputs/test1.in | ./digits
0 0
100.0000
ooiwt@pe121:~/cs1010/as09$ cat inputs/train100.in inputs/test30.in | ./digits
0 0
0 0
0 6
1 1
1 1
1 1
2 2
2 8
2 2
3 3
3 3
3 3
4 9
4 4
4 9
5 5
5 3
5 3
6 5
6 6
6 6
7 7
7 7
7 7
8 8
8 1
8 8
9 9
9 9
9 9
73.3333

When you use the file ~cs1010/as09/train60000.in as the training samples, you should receive 100% accuracy with test30.in and about 96.56% accuracy with ~cs1010/as09/test10000.in

Grading

10 marks for efficiency. Your solution should run in O(n) time to get a full marks.
25 marks for correctness.
No marks are allocated for style and documentation, but we can deduct up to 5 marks for each category.

Assignment 9: Digits

Deadline

Prerequisite

Learning Outcomes

Setup

Solving The Assignments

Submission

Identifying Yourself

Grading

New clang-tidy checks

Memory checks

Reduced test data

Responsibility

Question

Digits (35 Marks)

Supervised Machine Learning

Representation of a handwritten digit

Training Data

Testing Samples

The Algorithm

Efficiency

Example

Using struct

Input/Output

Sample Runs

Grading

New `clang-tidy` checks