OPTICAL MUSIC RECOGNITION

PROBLEM STATEMENT
Given a portion of a musical score (a PNG image) as input, we need to find the notes, quarter rests, and eighth rests based on the three optical symbols (pre-defined as templates) in the test_images folder, using techniques such as the Hough transform and naive and edge-detection-based template matching. The GitHub repository for the project is given below.
ALGORITHM OVERVIEW



1. The music sheet image is given as input by the user.
2. The application has a predefined catalogue of templates that it uses to detect notes, quarter rests, and eighth rests.
3. The first preprocessing step is to convert the input image and all the templates to grayscale.
4. A Hough transform is performed to find the spacing between the staff lines and the row coordinates of the first line in each set.
5. Once the spacing between the staff lines is obtained, the templates are resized to match the scale of the input image.
6. After resizing the templates, object detection is applied to identify the type of each music symbol.
7. Both naive template matching and edge-detection-based template matching have been implemented; the choice between them depends on the use case.
8. The template matching algorithms produce matching regions along with corresponding probabilities, which are then subjected to non-maximal suppression.
9. The output image is generated by drawing bounding boxes and printing pitch names.
10. A text file is generated containing the row and column of the top-left corner, the height and width of the bounding box, the type of symbol, the pitch, and the probability.
IMPLEMENTATION AND DESIGN CHOICES
The three utility function files are imported in the main file, omr.py. Once the program is run, the main method reads the name of the input music image from a command-line argument. The input PNG image and all the templates are read. The getResults() method is then called with the input image, the templates, and their corresponding template factors.
getResults() — Instances of the classes in the three utility files are created in the getResults() method.
The input image is then converted to a NumPy array for further operations. Each channel of the RGB array is multiplied by a specific weight (0.2989 for red, 0.5870 for green, and 0.1140 for blue) and the results are summed to obtain the grayscale image, using the rgbtogray() method of the kernelOperations class.
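For illustration, the weighted conversion amounts to a few lines of NumPy; this is a minimal sketch, not necessarily the exact rgbtogray() implementation.

    import numpy as np

    def rgb_to_gray(rgb):
        # rgb: H x W x 3 array; weight the channels and sum them.
        weights = np.array([0.2989, 0.5870, 0.1140])
        return rgb[..., :3] @ weights

    # Example: a single pure-red pixel maps to 0.2989 * 255 = 76.2195.
    pixel = np.array([[[255.0, 0.0, 0.0]]])
    print(rgb_to_gray(pixel))   # [[76.2195]]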
A Hough transform is then performed on the grayscale image array to detect the staves and to estimate the size of the note heads: the space between stave lines is approximately the height of a note head, and this space is used as a reference for rescaling.
hough() — This method takes the grayscale image as input. Pixel values less than 128 are replaced with 0, while the rest are replaced with 1, resulting in a binary image. Every pixel in this image is scanned and a vote dictionary named ‘votesDict’ is built, with each row index as the key; a key's value is incremented by 1 for every white pixel found in that row. Each key-value pair in the final ‘votesDict’ therefore holds a row index and the corresponding number of white pixels in that row. The rows, i.e. the keys whose counts exceed 50% of the column length (y), are then filtered from the dictionary. This filtering assumes that the staves are parallel, horizontal, continuous lines that stretch across more than 50% of the image's columns. Thus, if a row has more than 50% continuous white pixels, it can be considered a stave line.
The row indices of the stave lines are collected in the filtered list l. We know from music theory that there is a fixed space between any two stave lines in a set, and that this space is the same across all sets of staves. To find it, we take the difference between consecutive row indices in l that are not immediately adjacent (immediately adjacent indices belong to the same line drawn more than one pixel thick). This difference gives the space between two stave lines.
Next, the row coordinates of the first stave line in each set must be found. The first entry in the filtered list l is the row coordinate of the first line of the first set. Since the space and the first line's row coordinate are known, we loop through all remaining row candidates and check whether they lie within a distance of ‘space x 5’ from the first-row coordinate. If a line is found outside this bound, we can assume that row is the first row of another set of stave lines. We keep checking the list until all row candidates have been either rejected or exhausted; a sketch of this procedure is given below.
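A compact sketch of the voting and grouping described above; variable names are illustrative rather than the repository's, and the sketch assumes stave lines binarize to white, as in the description.

    import numpy as np

    def hough_staves(gray):
        # Binarize as described: pixels below 128 become 0, the rest 1.
        binary = (gray >= 128).astype(int)
        # Vote per row: the count of white pixels in that row.
        votes = binary.sum(axis=1)
        # Keep rows where more than 50% of the columns are white.
        rows = [r for r, v in enumerate(votes) if v > 0.5 * gray.shape[1]]
        # Spacing: first gap between detected rows that are not immediately
        # adjacent (adjacent indices belong to one line thicker than 1 px).
        space = next(b - a for a, b in zip(rows, rows[1:]) if b - a > 1)
        # A candidate farther than space * 5 from the current set's first
        # line starts a new set of staves.
        first_rows = [rows[0]]
        for r in rows[1:]:
            if r - first_rows[-1] > 5 * space:
                first_rows.append(r)
        return space, first_rows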
Now that the space and the row coordinates of the first stave lines of all sets have been determined by hough(), the drawLines() method can be used to visualize the predicted staff lines.
drawLines() — This method takes the grayscale image, the space, and the first-row coordinates as input. For every set of stave lines, the row coordinates of the next four lines are derived from the first line's row coordinate; this is repeated for each set. Then a copy image of the same shape as the grayscale input is initialized with all zeros, and the pixels of all the row coordinates found previously are set to 255.
When viewed in the getResults() method, this copy image is a black-and-white image containing only the two sets of stave lines, matching the input music image. It is stored as “detected_staves.png”.
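A minimal sketch of drawLines(), assuming five lines per stave and using PIL to save the result:

    import numpy as np
    from PIL import Image

    def draw_lines(gray, space, first_rows):
        # Black canvas of the same shape as the grayscale input.
        canvas = np.zeros(gray.shape, dtype=np.uint8)
        for first in first_rows:
            for k in range(5):                 # five lines per stave
                row = first + k * space
                if row < canvas.shape[0]:
                    canvas[row, :] = 255       # predicted stave line
        Image.fromarray(canvas).save("detected_staves.png")
        return canvas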
Once the stave lines are detected, a pitch dictionary is created to determine the pitch type. The space and the first-row coordinates are passed to a method named getPitchDictionary().
getPitchDictionary() — This method takes the grayscale image, the space, and the first-row coordinates as inputs. Pitch names from A through G, for both staves, are assigned in the pitch dictionary based on the space, i.e. the space between any two lines in a stave. This lookup dictionary is created for later use.
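As an illustration, the dictionary for a single stave could be built as below. The treble-clef letter sequence (top line F, then alternating spaces and lines downward) and the half-space stepping are assumptions for the sketch; the actual method covers both staves and may use a different layout.

    def get_pitch_dictionary(space, first_row):
        # Treble clef: the top line is F, and every half-space step down
        # moves one letter back: F E D C B A G F E.
        letters = ["F", "E", "D", "C", "B", "A", "G", "F", "E"]
        half = space // 2
        return {first_row + i * half: letter for i, letter in enumerate(letters)}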
resizeTemplate() — The resize() method is called with the template image and a scaling factor. The second argument is the space multiplied by a template-specific value, determined as follows: an eighth rest spans roughly twice the space between any two stave lines, so its scaling factor is 2 * space; a quarter rest spans roughly three times that space, so its scaling factor is 3 * space.
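A sketch of this resizing step, using PIL and preserving the aspect ratio; the repository's resize() may operate on arrays instead, and the file name in the usage comment is an assumption.

    from PIL import Image

    def resize_template(template, target_height):
        # Scale so the template's height equals the expected symbol height
        # (e.g. 2 * space for an eighth rest), preserving aspect ratio.
        scale = target_height / template.height
        size = (max(1, round(template.width * scale)), max(1, round(target_height)))
        return template.resize(size)

    # Usage: eighth = resize_template(Image.open("eighth_rest.png"), 2 * space)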
The resized template is returned for further use. It is then converted to a NumPy array and to grayscale, as was done for the input image, for visualization purposes. All the steps above comprise preprocessing, after which the omrApplication() method is called.
omrApplication() — The inputs to this method are the grayscale input image, the resized grayscale template, the type of template matching (naive or edge-detection-based), the name of the template to be matched in the image (filled_note, quarter_rest, or eighth_rest), the pitch dictionary, the space, and an arbitrary template-specific threshold factor.
nonMaximalSupression() — Here, we use intersection over union (IoU) to determine the overlap between bounding boxes, with an upper threshold of 0.5. When the overlap exceeds 50%, it is safe to assume that the bounding boxes contain the same object.
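A sketch of IoU-based suppression under these assumptions, with boxes given as (row, column, height, width):

    def iou(a, b):
        # Boxes as (row, col, height, width) of the top-left corner.
        r1, c1 = max(a[0], b[0]), max(a[1], b[1])
        r2 = min(a[0] + a[2], b[0] + b[2])
        c2 = min(a[1] + a[3], b[1] + b[3])
        inter = max(0, r2 - r1) * max(0, c2 - c1)
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union else 0.0

    def non_maximal_suppression(detections, threshold=0.5):
        # detections: list of (score, box); keep the highest-scoring box
        # among any group that overlaps by more than the IoU threshold.
        kept = []
        for score, box in sorted(detections, reverse=True, key=lambda d: d[0]):
            if all(iou(box, k[1]) <= threshold for k in kept):
                kept.append((score, box))
        return kept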
naiveTemplateMatching() — In this method, the image and templates are converted to their binary forms. The threshold is calculated as the product of the confidence interval (passed as input to the method) and the height and width of the template. The score evaluates how similar a region of image I is to the template, and must be computed for each m×n pixel neighborhood of the image. When the score is above the threshold, the score along with the corresponding region of the image is added to a list named ‘scorArr’, which is then subjected to non-maximal suppression to avoid duplicate detections.
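The score implied here counts the positions at which the binary image patch and the template agree, so its maximum is the template area m×n, which is why the threshold is a confidence fraction of height times width. A direct, unoptimized sketch under that assumption:

    import numpy as np

    def naive_template_matching(gray, template_gray, confidence):
        I = (gray >= 128).astype(int)
        T = (template_gray >= 128).astype(int)
        m, n = T.shape
        threshold = confidence * m * n
        matches = []
        for i in range(I.shape[0] - m + 1):
            for j in range(I.shape[1] - n + 1):
                patch = I[i:i + m, j:j + n]
                # Count pixels where patch and template agree (both 0 or both 1).
                score = np.sum(patch * T + (1 - patch) * (1 - T))
                if score > threshold:
                    matches.append((score, (i, j, m, n)))
        return matches

A nested loop like this is O(HWmn); the actual implementation may vectorize or otherwise optimize the scan.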
getEdges() — In this method, separable Sobel operators are used to find the gradients in the X and Y directions. The square root of the sum of squares of the X and Y gradients gives the edges of the template. Pixels whose gradient magnitude exceeds a threshold value are marked as edge pixels.
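A sketch using SciPy's 1-D convolutions to exploit the separability of the Sobel kernels; the threshold value here is an arbitrary placeholder.

    import numpy as np
    from scipy.ndimage import convolve1d

    def get_edges(gray, threshold=100):
        gray = np.asarray(gray, dtype=float)
        # Separable Sobel: smooth along one axis, differentiate along the other.
        smooth, diff = np.array([1, 2, 1]), np.array([-1, 0, 1])
        gx = convolve1d(convolve1d(gray, smooth, axis=0), diff, axis=1)
        gy = convolve1d(convolve1d(gray, smooth, axis=1), diff, axis=0)
        magnitude = np.sqrt(gx ** 2 + gy ** 2)
        # Pixels whose gradient magnitude exceeds the threshold are edges.
        return (magnitude > threshold).astype(int)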
edgeDetectionTemplateMatching() — In this method, we obtain edges for both the image and the template using the getEdges() method. A distance transform is then defined for the matching step; it takes the edge map returned by getEdges() as input and returns a distance matrix.
After the distance transform of the image edges is computed, the threshold is calculated as the sum of the template edge pixels multiplied by the threshold factor passed as input. Then, as in naive matching, regions of the distance matrix are scored against the template edges, and the best-scoring regions are subjected to non-maximal suppression. A sketch of this is given below.
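This sketch uses SciPy's Euclidean distance transform in place of the repository's own distance matrix. With a distance transform, the best matches are those with the smallest summed distance, so the sketch keeps regions whose score falls below the threshold.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def edge_detection_template_matching(image_edges, template_edges, factor):
        # Distance transform: each pixel holds the distance to the nearest
        # edge pixel of the image (edge pixels are zeros of 1 - image_edges).
        D = distance_transform_edt(1 - image_edges)
        m, n = template_edges.shape
        threshold = template_edges.sum() * factor
        matches = []
        for i in range(D.shape[0] - m + 1):
            for j in range(D.shape[1] - n + 1):
                # Sum the distances under the template's edge pixels; a small
                # sum means the template edges line up with image edges.
                score = np.sum(D[i:i + m, j:j + n] * template_edges)
                if score < threshold:
                    matches.append((score, (i, j, m, n)))
        return matches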
Once the matching regions are returned by the naive or edge-detection method, bounding boxes are drawn and the type of note is determined using the pitch dictionary. The confidence value is taken as the probability of the predicted note.
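A sketch of this output stage using PIL's ImageDraw; the nearest-row pitch lookup and the output file name are assumptions for illustration.

    import numpy as np
    from PIL import Image, ImageDraw

    def draw_detections(gray, detections, pitch_dict):
        # detections: (score, (row, col, height, width)) tuples that survived
        # non-maximal suppression; pitch_dict maps row coordinates to names.
        out = Image.fromarray(np.asarray(gray, dtype=np.uint8)).convert("RGB")
        draw = ImageDraw.Draw(out)
        for score, (r, c, h, w) in detections:
            draw.rectangle([c, r, c + w, r + h], outline="red")
            # Label with the pitch of the nearest row in the dictionary.
            nearest = min(pitch_dict, key=lambda row: abs(row - r))
            draw.text((c, max(0, r - 10)), pitch_dict[nearest], fill="red")
        out.save("detected.png")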
The utility file named kernel operations comprises various commonly used separable and inseparable kernels for performing convolution on grayscale images for edge detection, blurring, etc.
RESULTS
The input images can be obtained from the MusicNet Dataset.




CONCLUSION
The algorithm described above works well within the scope of the sheet music dataset. Some choices have been left to the user, such as choosing between the types of template matching and changing threshold values. However, in some cases the application needs to be robust without relying on user decisions. Hence, the next step after this implementation would be to attempt optical music recognition using advanced deep learning techniques.