%—————————————————————————–

%—————————————————————————–
% PACKAGES AND OTHER DOCUMENT CONFIGURATIONS
%—————————————————————————–

\documentclass[11pt,addpoints,answers]{exam}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath, amsfonts}
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{titling}
\usepackage{url}
\usepackage{xfrac}
\usepackage{natbib}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{paralist}
\usepackage{epstopdf}
\usepackage{tabularx}
\usepackage{longtable}
\usepackage{multirow}
\usepackage{multicol}
\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage[noend]{algpseudocode}
\usepackage{float}
\usepackage{enumerate}
\usepackage{array}
\usepackage{environ}
\usepackage{times}
\usepackage{textcomp}
\usepackage{caption}
\usepackage{parskip} % For NIPS style paragraphs.
\usepackage[compact]{titlesec} % Less whitespace around titles
\usepackage[inline]{enumitem} % For inline enumerate* and itemize*
\usepackage{datetime}
\usepackage{comment}
% \usepackage{minted}
\usepackage{lastpage}
\usepackage{color}
\usepackage{xcolor}
\usepackage[final]{listings}
\usepackage{tikz}
\usetikzlibrary{shapes,decorations}
\usepackage{framed}
\usepackage{booktabs}
\usepackage{cprotect}
\usepackage{verbatimbox}
\usepackage{multicol}
\usepackage{hyperref}
\usepackage{subcaption}
\usepackage{mathtools} % For drcases
\usepackage{cancel}
\usepackage[many]{tcolorbox}
\usepackage{soul}
\usepackage[bottom]{footmisc}
\usepackage{bm}
\usepackage{wasysym}

\usepackage{tikz}
\usetikzlibrary{shapes.geometric}
\usetikzlibrary{positioning, arrows, automata, calc}

\newtcolorbox[]{your_solution}[1][]{
% breakable,
enhanced,
nobeforeafter,
colback=white,
title=Your Answer,
sidebyside align=top,
box align=top,
#1
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Rotated Column Headers %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{adjustbox}
\usepackage{array}

%https://tex.stackexchange.com/questions/32683/rotated-column-titles-in-tabular

\newcolumntype{R}[2]{%
>{\adjustbox{angle=#1,lap=\width-(#2)}\bgroup}%
l%
<{\egroup}%
}
\newcommand*\rot{\multicolumn{1}{R{45}{1em}}}% no optional argument here, please!

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Better numbering                         %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \numberwithin{equation}{section} % Number equations within sections (i.e. 1.1, 1.2, 2.1, 2.2 instead of 1, 2, 3, 4)
% \numberwithin{figure}{section} % Number figures within sections
% \numberwithin{table}{section} % Number tables within sections

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Custom commands                         %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\vc}[1]{\boldsymbol{#1}}
\newcommand{\adj}[1]{\frac{d J}{d #1}}
\newcommand{\chain}[2]{\adj{#2} = \adj{#1}\frac{d #1}{d #2}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\blackcircle}{\tikz\draw[black,fill=black] (0,0) circle (1ex);}
\renewcommand{\circle}{\tikz\draw[black] (0,0) circle (1ex);}
\newcommand{\emptysquare}{{\LARGE $\square$}\ \ }
\newcommand{\filledsquare}{{\LARGE $\boxtimes$}\ \ }
\newcommand{\emptycircle}{{\LARGE $\fullmoon$}\ \ }
\newcommand{\filledcircle}{{\LARGE $\newmoon$}\ \ }
\newcommand{\ntset}{test}

% mathcal
\newcommand{\Ac}{\mathcal{A}} \newcommand{\Bc}{\mathcal{B}} \newcommand{\Cc}{\mathcal{C}} \newcommand{\Dc}{\mathcal{D}} \newcommand{\Ec}{\mathcal{E}} \newcommand{\Fc}{\mathcal{F}} \newcommand{\Gc}{\mathcal{G}} \newcommand{\Hc}{\mathcal{H}} \newcommand{\Ic}{\mathcal{I}}
\newcommand{\Jc}{\mathcal{J}} \newcommand{\Kc}{\mathcal{K}} \newcommand{\Lc}{\mathcal{L}} \newcommand{\Mc}{\mathcal{M}} \newcommand{\Nc}{\mathcal{N}} \newcommand{\Oc}{\mathcal{O}} \newcommand{\Pc}{\mathcal{P}} \newcommand{\Qc}{\mathcal{Q}} \newcommand{\Rc}{\mathcal{R}}
\newcommand{\Sc}{\mathcal{S}} \newcommand{\Tc}{\mathcal{T}} \newcommand{\Uc}{\mathcal{U}} \newcommand{\Vc}{\mathcal{V}} \newcommand{\Wc}{\mathcal{W}} \newcommand{\Xc}{\mathcal{X}} \newcommand{\Yc}{\mathcal{Y}} \newcommand{\Zc}{\mathcal{Z}}

% mathbb
\newcommand{\Ab}{\mathbb{A}} \newcommand{\Bb}{\mathbb{B}} \newcommand{\Cb}{\mathbb{C}} \newcommand{\Db}{\mathbb{D}} \newcommand{\Eb}{\mathbb{E}} \newcommand{\Fb}{\mathbb{F}} \newcommand{\Gb}{\mathbb{G}} \newcommand{\Hb}{\mathbb{H}} \newcommand{\Ib}{\mathbb{I}}
\newcommand{\Jb}{\mathbb{J}} \newcommand{\Kb}{\mathbb{K}} \newcommand{\Lb}{\mathbb{L}} \newcommand{\Mb}{\mathbb{M}} \newcommand{\Nb}{\mathbb{N}} \newcommand{\Ob}{\mathbb{O}} \newcommand{\Pb}{\mathbb{P}} \newcommand{\Qb}{\mathbb{Q}} \newcommand{\Rb}{\mathbb{R}}
\newcommand{\Sb}{\mathbb{S}} \newcommand{\Tb}{\mathbb{T}} \newcommand{\Ub}{\mathbb{U}} \newcommand{\Vb}{\mathbb{V}} \newcommand{\Wb}{\mathbb{W}} \newcommand{\Xb}{\mathbb{X}} \newcommand{\Yb}{\mathbb{Y}} \newcommand{\Zb}{\mathbb{Z}}

% mathbf lowercase
\newcommand{\av}{\mathbf{a}} \newcommand{\bv}{\mathbf{b}} \newcommand{\cv}{\mathbf{c}} \newcommand{\dv}{\mathbf{d}} \newcommand{\ev}{\mathbf{e}} \newcommand{\fv}{\mathbf{f}} \newcommand{\gv}{\mathbf{g}} \newcommand{\hv}{\mathbf{h}} \newcommand{\iv}{\mathbf{i}}
\newcommand{\jv}{\mathbf{j}} \newcommand{\kv}{\mathbf{k}} \newcommand{\lv}{\mathbf{l}} \newcommand{\mv}{\mathbf{m}} \newcommand{\nv}{\mathbf{n}} \newcommand{\ov}{\mathbf{o}} \newcommand{\pv}{\mathbf{p}} \newcommand{\qv}{\mathbf{q}} \newcommand{\rv}{\mathbf{r}}
\newcommand{\sv}{\mathbf{s}} \newcommand{\tv}{\mathbf{t}} \newcommand{\uv}{\mathbf{u}} \newcommand{\vv}{\mathbf{v}} \newcommand{\wv}{\mathbf{w}} \newcommand{\xv}{\mathbf{x}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}}

% mathbf uppercase
\newcommand{\Av}{\mathbf{A}} \newcommand{\Bv}{\mathbf{B}} \newcommand{\Cv}{\mathbf{C}} \newcommand{\Dv}{\mathbf{D}} \newcommand{\Ev}{\mathbf{E}} \newcommand{\Fv}{\mathbf{F}} \newcommand{\Gv}{\mathbf{G}} \newcommand{\Hv}{\mathbf{H}}
\newcommand{\Iv}{\mathbf{I}} \newcommand{\Jv}{\mathbf{J}} \newcommand{\Kv}{\mathbf{K}} \newcommand{\Lv}{\mathbf{L}} \newcommand{\Mv}{\mathbf{M}} \newcommand{\Nv}{\mathbf{N}} \newcommand{\Ov}{\mathbf{O}}
\newcommand{\Pv}{\mathbf{P}} \newcommand{\Qv}{\mathbf{Q}} \newcommand{\Rv}{\mathbf{R}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Uv}{\mathbf{U}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\Zv}{\mathbf{Z}}

% bold greek lowercase
\newcommand{\alphav}{\boldsymbol \alpha} \newcommand{\betav}{\boldsymbol \beta} \newcommand{\gammav}{\boldsymbol \gamma} \newcommand{\deltav}{\boldsymbol \delta}
\newcommand{\epsilonv}{\boldsymbol \epsilon} \newcommand{\varepsilonv}{\boldsymbol \varepsilon} \newcommand{\zetav}{\boldsymbol \zeta} \newcommand{\etav}{\boldsymbol \eta}
\newcommand{\thetav}{\boldsymbol \theta} \newcommand{\varthetav}{\boldsymbol \vartheta} \newcommand{\iotav}{\boldsymbol \iota} \newcommand{\kappav}{\boldsymbol \kappa}
\newcommand{\varkappav}{\boldsymbol \varkappa} \newcommand{\lambdav}{\boldsymbol \lambda} \newcommand{\muv}{\boldsymbol \mu} \newcommand{\nuv}{\boldsymbol \nu}
\newcommand{\xiv}{\boldsymbol \xi} \newcommand{\omicronv}{\boldsymbol \omicron} \newcommand{\piv}{\boldsymbol \pi} \newcommand{\varpiv}{\boldsymbol \varpi}
\newcommand{\rhov}{\boldsymbol \rho} \newcommand{\varrhov}{\boldsymbol \varrho} \newcommand{\sigmav}{\boldsymbol \sigma} \newcommand{\varsigmav}{\boldsymbol \varsigma}
\newcommand{\tauv}{\boldsymbol \tau} \newcommand{\upsilonv}{\boldsymbol \upsilon} \newcommand{\phiv}{\boldsymbol \phi} \newcommand{\varphiv}{\boldsymbol \varphi}
\newcommand{\chiv}{\boldsymbol \chi} \newcommand{\psiv}{\boldsymbol \psi} \newcommand{\omegav}{\boldsymbol \omega}

% bold greek uppercase
\newcommand{\Gammav}{\boldsymbol \Gamma} \newcommand{\Deltav}{\boldsymbol \Delta} \newcommand{\Thetav}{\boldsymbol \Theta} \newcommand{\Lambdav}{\boldsymbol \Lambda}
\newcommand{\Xiv}{\boldsymbol \Xi} \newcommand{\Piv}{\boldsymbol \Pi} \newcommand{\Sigmav}{\boldsymbol \Sigma}
\newcommand{\Upsilonv}{\boldsymbol \Upsilon} \newcommand{\Phiv}{\boldsymbol \Phi} \newcommand{\Psiv}{\boldsymbol \Psi} \newcommand{\Omegav}{\boldsymbol \Omega}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Code highlighting with listings          %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\definecolor{bluekeywords}{rgb}{0.13,0.13,1}
\definecolor{greencomments}{rgb}{0,0.5,0}
\definecolor{redstrings}{rgb}{0.9,0,0}
\definecolor{light-gray}{gray}{0.95}
\newcommand{\MYhref}[3][blue]{\href{#2}{\color{#1}{#3}}}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}

\lstdefinelanguage{Shell}{
keywords={tar, cd, make},
%keywordstyle=\color{bluekeywords}\bfseries,
alsoletter={+},
ndkeywords={python, py, javac, java, gcc, c, g++, cpp, .txt, octave, m, .tar},
%ndkeywordstyle=\color{bluekeywords}\bfseries,
identifierstyle=\color{black},
sensitive=false,
comment=[l]{//},
morecomment=[s]{/*}{*/},
commentstyle=\color{purple}\ttfamily,
%stringstyle=\color{red}\ttfamily,
morestring=[b]',
morestring=[b]",
backgroundcolor=\color{light-gray}
}

\lstset{columns=fixed, basicstyle=\ttfamily, backgroundcolor=\color{light-gray}, xleftmargin=0.5cm, frame=tlbr, framesep=4pt, framerule=0pt}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Custom box for highlights                %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Define box and box title style
\tikzstyle{mybox} = [fill=blue!10, very thick, rectangle, rounded corners, inner sep=1em, inner ysep=1em]

% \newcommand{\notebox}[1]{
%   \begin{tikzpicture}
%   \node [mybox] (box){%
%   \begin{minipage}{\textwidth}
%   #1
%   \end{minipage}
%   };
%   \end{tikzpicture}%
% }

\NewEnviron{notebox}{
\begin{tikzpicture}
\node [mybox] (box){
\begin{minipage}{\textwidth}
\BODY
\end{minipage}
};
\end{tikzpicture}
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Commands showing / hiding solutions      %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% To HIDE SOLUTIONS (to post at the website for students), set this value to 0:
\def\issoln{1}
% Some commands to allow solutions to be embedded in the assignment file.
\ifcsname issoln\endcsname \else \def\issoln{1} \fi
% Default to an empty solutions environ.
\NewEnviron{soln}{}{}
\if\issoln 1
% Otherwise, include solutions as below.
\RenewEnviron{soln}{
\leavevmode\color{red}\ignorespaces
%\textbf{Solution}
\BODY
}{}
\fi

%% To HIDE TAGS set this value to 0:
\def\showtags{0}
%%%%%%%%%%%%%%%%
\ifcsname showtags\endcsname \else \def\showtags{1} \fi
% Default to an empty tags environ.
\NewEnviron{tags}{}{}
\if\showtags 1
% Otherwise, include solutions as below.
\RenewEnviron{tags}{
\fbox{
\leavevmode\color{blue}\ignorespaces
\textbf{TAGS:} \texttt{\url{\BODY}}
}
\vspace{-.5em}
}{}
\fi

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Commands for customizing the assignment  %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\courseName}{10-301/10-601 Introduction to Machine Learning (Spring 2021)}
\newcommand{\hwName}{Homework 5: Neural Networks}
\newcommand{\dueDate}{Monday, March 29, 2021 11:59 PM}

\title{\textsc{\hwName}
%\thanks{Compiled on \today{} at \currenttime{}}
} % Title

\author{\courseName\\
\url{http://www.cs.cmu.edu/~mgormley/courses/10601/} \\
DUE: \dueDate{} \\
}
\date{}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Useful commands for typesetting the questions  %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand \expect {\mathbb{E}}
\newcommand \mle [1]{{\hat #1}^{\rm MLE}}
\newcommand \map [1]{{\hat #1}^{\rm MAP}}
\newcommand \argmax {\operatorname*{argmax}}
\newcommand \argmin {\operatorname*{argmin}}
\newcommand \code [1]{{\tt #1}}
\newcommand \datacount [1]{\#\{#1\}}
\newcommand \ind [1]{\mathbb{I}\{#1\}}

%%%%%%%%%%%%%%%%%%%%%%%%%%
% Document configuration  %
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Don't display a date in the title and remove the white space
\predate{}
\postdate{}
\date{}
% Don't display an author and remove the white space
%\preauthor{}
%\postauthor{}

%%%%%%%%%%%%%%%%%%
% Begin Document  %
%%%%%%%%%%%%%%%%%%

\begin{document}

\maketitle

\begin{notebox}
\paragraph{Summary} In this assignment, you will build a handwriting recognition system using a neural network. In the Written component, you will walk through an on-paper example of how to implement a neural network. Then, in the Programming component, you will implement an end-to-end system that learns to perform handwritten letter classification.
\end{notebox}

\section*{START HERE: Instructions}
\begin{itemize}

\item \textbf{Collaboration Policy}: Please read the collaboration policy here: \url{http://www.cs.cmu.edu/~mgormley/courses/10601/}

\item \textbf{Late Submission Policy:} See the late submission policy here: \url{http://www.cs.cmu.edu/~mgormley/courses/10601/}

\item \textbf{Submitting your work:} You will use Gradescope to submit answers to all questions and code. Please follow the instructions at the end of this PDF to correctly submit all your code to Gradescope.

\begin{itemize}

% COMMENT IF NOT USING CANVAS
\begin{comment}
\item \textbf{Canvas:} Canvas (\url{https://canvas.cmu.edu}) will be used for quiz-style problems (e.g. multiple choice, true / false, numerical answers). Grading is done automatically.
% You may only \textbf{submit once} on canvas, so be sure of your answers before you submit. However, canvas allows you to work on your answers and then close out of the page and it will save your progress. You will not be granted additional submissions, so please be confident of your solutions when you are submitting your assignment.
% {\color{red} The above is true for future assignments, but this one allows {\bf unlimited submissions}.}
\end{comment}

% COMMENT IF NOT USING GRADESCOPE
\begin{comment}
\item \textbf{Gradescope:} For written problems such as derivations, proofs, or plots we will be using Gradescope (\url{https://gradescope.com/}). Submissions can be handwritten, but should be labeled and clearly legible. If your writing is not legible, you will not be awarded marks. Alternatively, submissions can be written in LaTeX. Upon submission, label each question using the template provided. Regrade requests can be made; however, this gives the TA the opportunity to regrade your entire paper, meaning that if additional mistakes are found, points will be deducted.
% Each derivation/proof should be completed on a separate page.

% COMMENT IF NOT USING AUTOLAB
\item \textbf{Autolab:} You will submit your code for programming questions on the homework to Autolab (\url{https://autolab.andrew.cmu.edu/}). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM).
% When you are developing, check that the version number of the programming language environment (e.g. Python 2.7.6/3.6.8, Octave 3.8.2, OpenJDK 1.8.0, g++ 4.8.5) and versions of permitted libraries (e.g. \texttt{numpy} 1.11.1 and \texttt{scipy} 0.18.1) match those used on Autolab.
DO NOT use any 3rd party python packages (i.e. \texttt{sklearn} or \texttt{pandas}) in your implementation! Note: Also do not use \texttt{numpy.loadtxt()}.
% (Octave users: Please make sure you do not use any Matlab-specific libraries in your code that might make it fail against our tests.)
% You have a {\bf total of 10 Autolab submissions}. Use them wisely. In order to not waste Autolab submissions, we recommend debugging your implementation on the linux servers to make sure your code is running correctly first before any Autolab submission.
\end{comment}

\item \textbf{Written:} For written problems such as short answer, multiple choice, derivations, proofs, or plots, we will be using Gradescope (\url{https://gradescope.com/}). Please use the provided template. Submissions must be written in LaTeX. Regrade requests can be made; however, this gives the staff the opportunity to regrade your entire paper, meaning that if additional mistakes are found, points will be deducted. Each derivation/proof should be completed in the boxes provided. For short answer questions you \textbf{should not} include your work in your solution. If you include your work in your solutions, your assignment may not be graded correctly by our AI assisted grader.

\item \textbf{Programming:} You will submit your code for programming questions on the homework to Gradescope (\url{https://gradescope.com}). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.6.9, OpenJDK 11.0.5, g++ 7.4.0) and versions of permitted libraries (e.g. \texttt{numpy} 1.17.0 and \texttt{scipy} 1.4.1) match those used on Gradescope. You have \textbf{a total of 10 Gradescope programming submissions}. Use them wisely. To avoid wasting Gradescope submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code runs correctly before submitting it to Gradescope.

\end{itemize}

\item \textbf{Materials:} The data that you will need in order to complete this assignment is posted along with the writeup and template on Piazza.

\begin{notebox}
\paragraph{Linear Algebra Libraries} When implementing machine learning algorithms, it is often convenient to have a linear algebra library at your disposal. In this assignment, Java users may use EJML\footnote{\url{https://ejml.org}} or ND4J\footnote{\url{https://deeplearning4j.org/docs/latest/nd4j-overview}} and C++ users may use Eigen\footnote{\url{http://eigen.tuxfamily.org/}}. Details below. (As usual, Python users have NumPy.)

\begin{description}
\item[EJML for Java] EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. The autograder will use EJML version 0.38. When compiling and running your code, we will add the additional command line argument \small{\lstinline{-cp "linalg_lib/ejml-v0.38-libs/*:linalg_lib/nd4j-v1.0.0-beta7-libs/*:./"}} to ensure that all the EJML jars are on the classpath as well as your code.

\item[ND4J for Java] ND4J is a library for multidimensional tensors with an interface akin to Python's NumPy. The autograder will use ND4J version 1.0.0-beta7. When compiling and running your code, we will add the additional command line argument \small{\lstinline{-cp "linalg_lib/ejml-v0.38-libs/*:linalg_lib/nd4j-v1.0.0-beta7-libs/*:./"}} to ensure that all the ND4J jars are on the classpath as well as your code.

\item[Eigen for C++] Eigen is a header-only library, so there is no linking to worry about---just \lstinline{#include} whatever components you need. The autograder will use Eigen version 3.3.7. The command line arguments above demonstrate how we will call your code. When compiling your code, we will include the argument \lstinline{-I./linalg_lib} in order to include the \lstinline{linalg_lib/Eigen} subdirectory, which contains all the headers.
\end{description}

We have included the correct versions of EJML/ND4J/Eigen in the \lstinline{linalg_lib.zip} posted on the Piazza Resources page for your convenience. It contains the same \lstinline{linalg_lib/} directory that we will include in the current working directory when running your tests. Do {\bf not} include EJML, ND4J, or Eigen in your homework submission; the autograder will ensure that they are in place.
\end{notebox}

\end{itemize}

\clearpage

{\LARGE \bf Written Questions (44 points)}

% \section{Written Questions [44 Points]}
% \label{sec:written}
% Answer the following questions in the HW5 solutions template provided. Then upload your solutions to Gradescope. You may use \LaTeX\ or print the template and hand-write your answers then scan it in. Failure to use the template may result in a penalty.
% \textbf{Note:} For all questions which require numerical answers, round up your final answers to four decimal places. For integers, you may drop trailing zeros.
% \subsection{Example Feed Forward and Backpropagation [34 points]}

\section{Convolutional Neural Networks}
In this problem, consider only the convolutional layer of a standard implementation of a CNN as described in Lecture 12.

\begin{questions}
\question We are given image $X$ and filter $F$ below.
\begin{figure}[h]
\centering
\includegraphics[scale=0.6]{img/CNN.PNG}
\end{figure}

\begin{parts}
% Copied from 701
\part[1] Let $X$ be convolved with $F$ using no padding and a stride of 1 to produce an output $Y$. What is the value of $j$ in the output $Y$?\\
\begin{your_solution}[title=Answer,height=2cm,width=4cm]
\end{your_solution}

\part[1] Suppose you had an input feature map of size $6 \times 4$ and a filter of size $2 \times 2$. Using no padding and a stride of 2, what would be the resulting output size? Write your answer in terms of height $\times$ width.\\
\begin{your_solution}[title=Answer,height=2cm,width=4cm]
\end{your_solution}
\end{parts}

\break

\question Parameter sharing is a very important concept for CNNs because it drastically reduces the complexity of the learning problem. For the following questions, assume that there is no bias term in our convolutional layer.

\begin{parts}

\part[1] Which of the following are parameters of a convolutional layer? \\
\textbf{Select all that apply:}
% Change \emptysquare to \filledsquare for the appropriate selection
\begin{list}{}{}
\item \emptysquare stride size
\item \emptysquare padding size
\item \emptysquare image size
\item \emptysquare filter size
\item \emptysquare weights in the filter
\item \emptysquare None of the above.
\end{list}

\part[1] Which of the following are hyperparameters of a convolutional layer? \\
\textbf{Select all that apply:}
% Change \emptysquare to \filledsquare for the appropriate selection
\begin{list}{}{}
\item \emptysquare stride size
\item \emptysquare padding size
\item \emptysquare image size
\item \emptysquare filter size
\item \emptysquare weights in the filter
\item \emptysquare None of the above.
\end{list}

\part[1] Suppose for the convolutional layer, we are given black and white images of size $22 \times 22$. Using one single $4 \times 4$ filter with a stride of 2 and no padding, what is the number of parameters you are learning in this layer? \\
\begin{your_solution}[title=Answer,height=2cm,width=4cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] Suppose instead of sharing the same filter for the entire image, you learn a new filter each time you move across the image. Using $4 \times 4$ filters with a stride of 2 and no padding, what is the number of parameters you are learning in this layer? \\
\begin{your_solution}[title=Answer,height=2cm,width=4cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] Now suppose you are given a $40 \times 40$ colored image, which consists of 3 channels (so your input is a $40 \times 40 \times 3$ tensor), each representing the intensity of one primary color. Without sharing, using $4 \times 4$ filters with a stride of 2 and no padding, what is the number of parameters you are learning in this layer? \\
\begin{your_solution}[title=Answer,height=2cm,width=4cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] Parameter sharing is not usually used for fully-connected layers, but is usually used for convolutional layers. In a sentence, describe a reason why parameter sharing is a good idea for a convolutional layer, besides the reduction in problem complexity. Hint: think about applications of CNNs. \\
\begin{your_solution}[title=Answer,height=4cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}\\

\end{parts}
\end{questions}

\clearpage

\section{Backpropagation}

\begin{figure}[h]
\def\distH{3.6cm}
\def\distHTwo{0.1cm}
\def\distHThree{0.3cm}
\def\distV{0.6cm}
\def\distVTwo{0.3cm}
\centering
% \includegraphics[scale=0.8]{img/oneHL6.png}
\begin{tikzpicture}[
> = stealth, % arrow head style
shorten > = 0pt, % don’t touch arrow head to node
auto,
% node distance = 2.5cm, % distance between nodes
thick % line style
]

\tikzstyle{every state}=[
draw = black,
thick,
fill = white,
minimum size = 0.8cm
]

\node[state] (X1) {$x_1$};
\node[state] (X2) [below = \distV of X1] {$x_2$};
\node[state] (X3) [below = \distV of X2] {$x_3$};
\node[state] (Z2) [right = \distH of X3] {$z_2$};
\node[state] (Z1) [above = \distV of Z2] {$z_1$};
\node[state] (Y1) [right = \distH of Z1] {$\hat{y}_1$};
\node[state] (Y2) [below = \distV of Y1] {$\hat{y}_2$};
\path[->] (X1) edge node {} (Z1);
%\path[->] (X1) edge node {} (Z2);
\path[->] (X2) edge node {} (Z1);
\path[->] (X2) edge node {} (Z2);
%\path[->] (X3) edge node {} (Z1);
\path[->] (X3) edge node {} (Z2);
\path[->] (Z1) edge node {} (Y1);
\path[->] (Z1) edge node {} (Y2);
%\path[->] (Z2) edge node {} (Y1);
\path[->] (Z2) edge node {} (Y2);
\end{tikzpicture}
\caption{A Directed Acyclic Graph (DAG)}
\label{fig:oneHL}
\end{figure}

Consider the Directed Acyclic Graph (DAG) above. Notice that it looks different from the fully-connected neural networks that we have seen before. Recall from lecture that you can perform backpropagation on any DAG. We will work through backpropagation on this graph. \\\\
%We let $f_{i}$ denote the transformation from the relevant $x$ nodes to $z_i$ and let $g_{i}$ denote the transformation from the relevant $z$ nodes to $\hat{y}_i$.
Let $(\xv, \yv)$ be the training example under consideration, where $\xv = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}^T$ and $\yv = \begin{bmatrix} y_1 & y_2 \end{bmatrix}^T$. All the other nodes in the graph are defined as:
\begin{align*}
z_1 &= \text{ReLU}(w_{1,1}x_1+ w_{2,1}x_2)\\
z_2 &= w_{2,2}x_2^2 + w_{3,2}x_3+b_2\\
\hat{y}_1 &= \sigma(m_{1,1}z_1^3+c_1)\\
\hat{y}_2 &= m_{1,2}\sin(z_1) + m_{2,2}\cos(z_2)
\end{align*}
where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, and ReLU$(x) = \max(0,x)$. Let $\thetav$ be the set of all parameters to be learned in this graph. We have that
$$\thetav = \{ w_{1,1}, w_{2,1}, w_{2,2}, w_{3,2}, m_{1,1}, m_{1,2}, m_{2,2}, b_2, c_1 \}$$
For every input $\xv = (x_1,x_2,x_3)$, we define the objective of the problem as minimizing the loss function
$$J(\thetav) = \log\left((y_1-\hat{y}_1)^2\right)+\log\left((y_2-\hat{y}_2)^2\right)$$
Assume that you have already gone through the forward pass with inputs $\xv = (x_1, x_2, x_3)$ and stored all the relevant values. In the following questions, you will derive the backpropagation algorithm applied to the above DAG.\\
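As an optional sanity check on your derivations (this is not part of the required implementation), you can compare your analytic gradients against central finite differences computed on the forward pass defined above. The Python sketch below follows the node definitions in this section; the particular parameter values and the helper names (\code{forward}, \code{numeric\_grad}) are illustrative assumptions, not part of the assignment.

\begin{lstlisting}[language=Python]
import math

def forward(theta, x, y):
    # Forward pass through the DAG, following the node definitions above.
    w11, w21, w22, w32, m11, m12, m22, b2, c1 = theta
    x1, x2, x3 = x
    y1, y2 = y
    z1 = max(0.0, w11*x1 + w21*x2)                     # ReLU
    z2 = w22*x2**2 + w32*x3 + b2
    yhat1 = 1.0 / (1.0 + math.exp(-(m11*z1**3 + c1)))  # sigmoid
    yhat2 = m12*math.sin(z1) + m22*math.cos(z2)
    return math.log((y1 - yhat1)**2) + math.log((y2 - yhat2)**2)

def numeric_grad(theta, x, y, i, eps=1e-6):
    # Central finite difference with respect to the i-th parameter.
    hi = list(theta); hi[i] += eps
    lo = list(theta); lo[i] -= eps
    return (forward(hi, x, y) - forward(lo, x, y)) / (2*eps)

# Arbitrary illustrative values; theta is ordered as
# (w11, w21, w22, w32, m11, m12, m22, b2, c1).
theta = [0.5, -0.3, 0.8, 0.2, 0.7, -0.5, 0.4, 0.1, -0.2]
x, y = (1.0, 2.0, -1.0), (0.0, 1.0)
grads = [numeric_grad(theta, x, y, i) for i in range(len(theta))]
\end{lstlisting}

Each entry of \code{grads} should match the corresponding analytic partial derivative you derive below (up to finite-difference error), except near the ReLU kink, where the numerical estimate is unreliable.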
\begin{parts}

\part[1] First, we will derive the gradients with respect to the outputs. What is the expression for $\frac{\partial J}{\partial \hat{y}_1}$? Write your solution in terms of $y_1$ and $\hat{y}_1$.\\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\break

\part Now, we will derive the gradients associated with the last layer, i.e., nodes $\hat{y}_1, \hat{y}_2$. Note that for the full backpropagation algorithm, you would need to calculate the gradients of the loss function with respect to every parameter $(m_{1,1}, m_{1,2}, m_{2,2}, c_1)$ as well as every input of the layer $(z_1, z_2)$, but we are not asking for all of them in this part. For all of the questions in this part, you should use the chain rule, and write your solution in terms of values from the forward pass, or \textbf{gradients with respect to the outputs of this layer,} $\mathbf{\frac{\partial J}{\partial \hat{y}_1}}$, $\mathbf{\frac{\partial J}{\partial \hat{y}_2}}$, because you have already calculated these values. In addition, use the sigmoid function $\sigma(x)$ in your answer instead of its explicit form.
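As a general reminder (standard lecture material, not specific to this graph): for any node $u$ in a DAG whose immediate successors are $v_1, \dots, v_k$, the chain rule accumulates one term per outgoing edge,
\begin{align*}
\frac{\partial J}{\partial u} = \sum_{i=1}^{k} \frac{\partial J}{\partial v_i} \frac{\partial v_i}{\partial u}.
\end{align*}
In Figure~\ref{fig:oneHL}, the successors of a node are exactly the nodes it has arrows into, so a node that feeds into only one output contributes a single term.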

\begin{parts}

\part[1] What is the expression for $\frac{\partial J}{\partial z_1}$? \\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is the expression for $\frac{\partial J}{\partial z_2}$? \\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is the expression for $\frac{\partial J}{\partial m_{1,1}}$? \\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is the expression for $\frac{\partial J}{\partial m_{1,2}}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is the expression for $\frac{\partial J}{\partial c_1}$? \\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\end{parts}

\break

\part[1] Lastly, we will derive the gradients associated with the second layer, i.e., nodes $z_1, z_2$. Note that for the full backpropagation algorithm, you need to calculate the gradients of the loss function with respect to every parameter (every $w_{i,j}$ and $b_2$). However, we do not need to calculate the gradients with respect to the inputs of this layer $(x_1,x_2,x_3)$, because they are fixed inputs of the model. For all of the questions in this part, you should use the chain rule, and write your solution in terms of values from the forward pass or \textbf{gradients with respect to the outputs of this layer,} $\mathbf{\frac{\partial J}{\partial z_1}}$, $\mathbf{\frac{\partial J}{\partial z_2}}$, because you have already calculated these values.

\begin{parts}

\part[1] What is the expression for $\frac{\partial J}{\partial w_{2,2}}$? \\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is the expression for $\frac{\partial J}{\partial b_2}$? \\
\begin{your_solution}[title=Answer,height=2cm,width=14cm]
% YOUR ANSWER
\end{your_solution}\\

\part[2] Recall that ReLU$(x) = \max(x,0)$. The ReLU function is not differentiable at $x = 0$, but for backpropagation, we define its derivative as
\begin{align*}
\text{ReLU}'(x) =
\begin{cases}
0 & x < 0\\
1 & \text{otherwise}
\end{cases}
\end{align*}
Now, what is the expression for $\frac{\partial J}{\partial w_{1,1}}$? Explicitly write out the cases.\\
\begin{your_solution}[title=Answer,height=4cm,width=14cm]
% YOUR ANSWER
\end{your_solution}

\end{parts}
\end{parts}

\clearpage

\section{Backpropagation Through Time}
Consider the following one-layer many-to-many Recurrent Neural Network (RNN):
\begin{figure}[h]
\def\distH{3.6cm}
\def\distHTwo{0.1cm}
\def\distHThree{0.3cm}
\def\distV{0.6cm}
\def\distVTwo{0.3cm}
\centering
\begin{tikzpicture}[
> = stealth, % arrow head style
shorten > = 0pt, % don’t touch arrow head to node
auto,
% node distance = 2.5cm, % distance between nodes
thick % line style
]

\tikzstyle{every state}=[
draw = black,
thick,
fill = white,
minimum size = 0.8cm
]

\node[state] (O) {$o$};
\node[state] (H) [below=\distV of O] {$h$};
\node[state] (X) [below=\distV of H] {$x$};

\path[->] (H) edge [out=45, in=315, loop] node[midway,below=-0.3]{$W$} (H);
\path[->] (X) edge node {$U$} (H);
\path[->] (H) edge node {$V$} (O);
\end{tikzpicture}
\caption{A one hidden layer many-to-many Recurrent Neural Network}
\label{fig:rnn}
\end{figure}\\
Recall from lecture that this is equivalent to
\begin{figure}[h]
\centering
\def\distH{1.5cm}
\def\distHTwo{0.1cm}
\def\distHThree{0.3cm}
\def\distV{0.6cm}
\def\distVTwo{0.3cm}
\centering
\begin{tikzpicture}[
> = stealth, % arrow head style
shorten > = 0pt, % don’t touch arrow head to node
auto,
% node distance = 2.5cm, % distance between nodes
thick % line style
]

\tikzstyle{every state}=[
draw = black,
thick,
fill = white,
minimum size = 0.8cm
]
\node[state] (O1) {$o^{(1)}$};
\node[state] (H1) [below=\distV of O1] {$h^{(1)}$};
\node[state,dashed] (H0) [left=\distV of H1] {$h^{(0)}$};
\node[state] (X1) [below=\distV of H1] {$x^{(1)}$};
\node[state] (O2) [right=\distH of O1] {$o^{(2)}$};
\node[state] (H2) [below=\distV of O2] {$h^{(2)}$};
\node[state] (X2) [below=\distV of H2] {$x^{(2)}$};
\node[state] (Ot) [right=2*\distH of O2] {$o^{(n)}$};
\node[state] (Ht) [below=\distV of Ot] {$h^{(n)}$};
\node[state] (Xt) [below=\distV of Ht] {$x^{(n)}$};
\node at ($(O2)!.5!(Ot)$) {\ldots\ldots};
\node[state,dashed] at ($(H2)!.5!(Ht)$) {$h^{(\dots)}$};
\node at ($(X2)!.5!(Xt)$) {\ldots\ldots};
\coordinate[right=2.5*\distH of Ot] (E5);
\node at ($(Ot)!.5!(E5)$) {\ldots\ldots};
\coordinate[right=2.5*\distH of Ht] (E6);
\node[state,dashed] at ($(Ht)!.5!(E6)$) {$h^{(\dots)}$};
\coordinate[right=2.5*\distH of Xt] (E7);
\node at ($(Xt)!.5!(E7)$) {\ldots\ldots};

\path[->] (X1) edge node {} (H1);
\path[->] (H1) edge node {} (O1);
\path[->] (X2) edge node {} (H2);
\path[->] (H2) edge node {} (O2);
\path[->] (Xt) edge node {} (Ht);
\path[->] (Ht) edge node {} (Ot);

\coordinate[right of=H2] (E1);
\coordinate[left of=Ht] (E2);
\coordinate[right of=Ht] (E3);
\path[->] (H0) edge node {} (H1);
\path[->] (H1) edge node {} (H2);
\draw [->] (H2) to[right] node[auto] {} (E1);
\draw [->] (E2) to[right] node[auto] {} (Ht);
\draw [->] (Ht) to[right] node[auto] {} (E3);
\end{tikzpicture}
\caption{The RNN above unrolled in time, showing the computation involved in its forward pass}
\label{fig:rnn-unroll}
\end{figure}\\
Assume $h^{(0)}$ is given. To keep things simple, we consider an RNN with identity activation function and no bias terms. We also assume that $W = I$, the identity matrix. For each timestep $n \geq 1$, the hidden state and output can be expressed as
\begin{align*}
&h^{(n)}=f(h^{(n-1)},x^{(n)},U)=Ux^{(n)}+h^{(n-1)}\\
&o^{(n)}=g(h^{(n)}, V)=Vh^{(n)}
\end{align*}
where $x^{(n)}\in\mathbb{R}^a,h^{(n)}\in\mathbb{R}^b,o^{(n)}\in\mathbb{R}^c,U\in\mathbb{R}^{b\times a}$, and $V\in\mathbb{R}^{c\times b}$. Assume finite time $N$. The loss can then be expressed as
\begin{align*}
J(\mathbf{\thetav})=\sum_{t=1}^Nl(y^{(t)},o^{(t)})
\end{align*}
where $l(y^{(t)},o^{(t)})$ is the loss at time step $t$.
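Before differentiating, it can help to sanity-check the forward recurrence numerically. Below is a minimal numpy sketch of the unrolled forward pass under the simplifications above ($W = I$, identity activation, no bias); the function name \texttt{rnn\_forward} and the toy values of $U$ and $V$ are illustrative, not part of the assignment.

```python
import numpy as np

def rnn_forward(x_seq, h0, U, V):
    """Unrolled forward pass: h_n = U x_n + h_{n-1}, o_n = V h_n
    (with W = I and an identity activation, as assumed above)."""
    h, outputs = h0, []
    for x in x_seq:
        h = U @ x + h          # hidden-state recurrence
        outputs.append(V @ h)  # output at this timestep
    return outputs

# Tiny example with a = b = c = 1: h1 = 2, h2 = 4, so o1 = 6 and o2 = 12.
U = np.array([[2.0]])
V = np.array([[3.0]])
outs = rnn_forward([np.array([1.0]), np.array([1.0])], np.array([0.0]), U, V)
```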

For the following questions, express your answer in terms of $x^{(n)},h^{(n)},o^{(n)},U,V$ unless specified otherwise. Some rules of matrix calculus may be helpful: for $J\in\mathbb{R},\xv\in\mathbb{R}^n,\Xv\in\mathbb{R}^{n\times m}$,
\begin{align*}
\frac{\partial J}{\partial\xv}=\begin{bmatrix}
\frac{\partial J}{\partial x_1} \\ \frac{\partial J}{\partial x_2} \\ \vdots \\ \frac{\partial J}{\partial x_n}
\end{bmatrix}\qquad
\frac{\partial J}{\partial\Xv}=\begin{bmatrix}
\frac{\partial J}{\partial x_{11}} & \frac{\partial J}{\partial x_{12}} & \dots & \frac{\partial J}{\partial x_{1m}}\\
\frac{\partial J}{\partial x_{21}} & \frac{\partial J}{\partial x_{22}} & \dots & \frac{\partial J}{\partial x_{2m}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial J}{\partial x_{n1}} & \frac{\partial J}{\partial x_{n2}} & \dots & \frac{\partial J}{\partial x_{nm}}
\end{bmatrix}
\end{align*}

\begin{questions}
\question Assume $N=1$. For the following subparts (a)-(c), you may give your answer in terms of $\frac{\partial l}{\partial o^{(1)}}$.
\begin{parts}
\part[1] What is $\frac{\partial J}{\partial h^{(1)}}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is $\frac{\partial J}{\partial V}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is $\frac{\partial J}{\partial U}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\end{parts}

\question Assume $N=2$. For the following subparts (a)-(c), you may give your answer in terms of $\frac{\partial l}{\partial o^{(1)}}$ and $\frac{\partial l}{\partial o^{(2)}}$.
\begin{parts}
\part[1] What is $\frac{\partial J}{\partial h^{(1)}}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is $\frac{\partial J}{\partial V}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\part[1] What is $\frac{\partial J}{\partial U}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\end{parts}

\break

\question Now let’s generalize the result to an arbitrary time $N > 1$. For the following subparts (a)-(c), you may give your answer in terms of $\frac{\partial l}{\partial o^{(n)}}$, $n=1,\dots,N$.
\begin{parts}
\part[2] What is $\frac{\partial J}{\partial h^{(n)}}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\part[2] What is $\frac{\partial J}{\partial V}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\part[2] What is $\frac{\partial J}{\partial U}$?\\
\begin{your_solution}[title=Answer,height=2cm,width=8cm]
% YOUR ANSWER
\end{your_solution}\\

\end{parts}
\end{questions}\clearpage

\clearpage
\section{Empirical Questions}
\label{sec:interp}

The following questions should be completed after you work through the programming portion of this assignment.

For these questions, \textbf{use the large dataset}. Use the following values for the hyperparameters unless otherwise specified:

\begin{center}
\begin{tabular}{|c|c|}
\hline
\textbf{Parameter} & \textbf{Value} \\
\hline
Number of Hidden Units & 50 \\
\hline
Weight Initialization & {\sc Random} \\
\hline
Learning Rate & 0.01 \\
\hline
\end{tabular}
\end{center}

Please submit computer-generated plots for questions 1(a) and 2(a). Note: we expect it to take about \textbf{5 minutes} to train each of these networks.

\begin{questions}
\question Hidden Units

\begin{parts}

\part[2] Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the number of hidden units which should vary among 5, 20, 50, 100, and 200. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy (sum of the cross-entropy terms over the training dataset divided by the total number of training examples) on the y-axis vs number of hidden units on the x-axis. In the \textbf{same figure}, plot the average validation cross-entropy.

\begin{your_solution}[title=Avg. Train and Validation Cross-Entropy Loss,height=10cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\clearpage

\part[2] Examine and comment on the plots of training and validation cross-entropy. What is the effect of changing the number of hidden units?

\begin{your_solution}[title=Answer,height=4cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\end{parts}

\question Learning Rate

\begin{parts}
\part[6] Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the learning rate which should vary among 0.1, 0.01, and 0.001. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy on the y-axis vs the number of epochs on the x-axis for the mentioned learning rates. In the \textbf{same figure}, plot the average validation cross-entropy loss. Make a separate figure for each learning rate.

\begin{your_solution}[title=Plot LR 0.1,height=10cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\begin{your_solution}[title=Plot LR 0.01,height=10cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\begin{your_solution}[title=Plot LR 0.001,height=10cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\clearpage

\part[2] Examine and comment on the plots of training and validation cross-entropy. How does adjusting the learning rate affect the convergence of cross-entropy on the datasets?

\begin{your_solution}[title=Answer,height=5cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\end{parts}

\question Weight Initialization

\begin{parts}

\part[2]
For this exercise, you can work on any data set. Initialize $\alphav$ and $\betav$ to zero and print them out after the first few updates. For example, you may use the following command to begin:
\begin{lstlisting}[language=Shell]
$ python3 neuralnet.py smallTrain.csv smallValidation.csv \
smallTrain_out.labels smallValidation_out.labels \
smallMetrics_out.txt 1 4 2 0.1
\end{lstlisting}
Compare the values across rows and columns in $\alphav$ and $\betav$. Comment on what you observed. Do you think it is reasonable to use zero initialization? Why or why not?

\begin{your_solution}[title=Answer,height=5cm,width=14.5cm]
% YOUR ANSWER
\end{your_solution}

\end{parts}

\end{questions}

\newpage
{\Large Collaboration Questions}

After you have completed all other components of this assignment, report your answers to these questions regarding the collaboration policy. Details of the policy can be found \href{http://www.cs.cmu.edu/~mgormley/courses/10601/about.html}{here}.
\begin{enumerate}
\item Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
\item Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
\item Did you find or come across code that implements any part of this assignment? If so, include full details.
\end{enumerate}

\begin{your_solution}[height=6cm]
% YOUR ANSWER

\end{your_solution}
\clearpage
{\LARGE \bf Programming (56 points)}

\begin{figure}[H]
\centering
\includegraphics[scale=0.99]{lettergrid.png}
\caption{10 Random Images of Each of 10 Letters in OCR}
\label{fig:grid}
\end{figure}

\section{The Task}
Your goal in this assignment is to label images of handwritten letters by implementing a Neural Network from scratch. You will implement all of the functions needed to initialize, train, evaluate, and make predictions with the network.

The programs you write will be automatically graded using the Gradescope system. You may write your programs in \textbf{Python, Java, or C++}. However, you should use the same language for all parts below.

\section{The Datasets}
\label{sec:dataset}

\paragraph{Datasets} We will be using a subset of an Optical Character Recognition (OCR) dataset. This data includes images of all 26 handwritten letters; our subset will include only the letters “a,” “e,” “g,” “i,” “l,” “n,” “o,” “r,” “t,” and “u.” The handout contains three datasets drawn from this data: a small dataset with 60 samples \textit{per class} (50 for training and 10 for validation), a medium dataset with 600 samples per class (500 for training and 100 for validation), and a large dataset with 1000 samples per class (900 for training and 100 for validation). Figure \ref{fig:grid} shows 10 random images of each of the 10 letters in our subset.

\paragraph{File Format} Each dataset (small, medium, and large) consists of two csv files—train and validation. Each row contains 129 columns separated by commas. The first column contains the label and columns 2 to 129 represent the pixel values of a $16 \times 8$ image in a row major format. Label 0 corresponds to “a,” 1 to “e,” 2 to “g,” 3 to “i,” 4 to “l,” 5 to “n,” 6 to “o,” 7 to “r,” 8 to “t,” and 9 to “u.”
%
Because the original images are black-and-white (not grayscale), the pixel values are either 0 or 1. However, you should write your code to accept arbitrary pixel values in the range [0, 1]. The images in Figure \ref{fig:grid} were produced by converting these pixel values into .png files for visualization. Observe that no feature engineering has been done here; instead the neural network you build will \emph{learn} features appropriate for the task of character recognition.
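As a concrete illustration, the file format above can be read with a short numpy loader. This is only a sketch of the stated format (label in the first column, pixel values in the remaining columns); the function name \texttt{load\_ocr\_csv} is our own.

```python
import numpy as np

def load_ocr_csv(path):
    """Load one dataset csv: column 0 is the integer label,
    the remaining columns are pixel values in [0, 1]."""
    data = np.loadtxt(path, delimiter=",")
    y = data[:, 0].astype(int)   # labels in {0, ..., 9}
    X = data[:, 1:]              # pixel features, one row per example
    return X, y
```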

\section{Model Definition}

In this assignment, you will implement a single-hidden-layer neural network with a sigmoid activation function for the hidden layer, and a softmax on the output layer. Let the input vectors $\xv$ be of length $M$, the hidden layer $\zv$ consist of $D$ hidden units, and the output layer $\hat{\yv}$ be a probability distribution over $K$ classes. That is, each element $\hat{y}_k$ of the output vector represents the probability of $\xv$ belonging to the class $k$.

\begin{align*}
& \hat{y}_k = \frac{\exp(b_k)}{\sum_{l=1}^K \exp(b_l)}
\\
& b_k = \beta_{k,0} + \sum_{j=1}^D \beta_{kj} z_j
\\
& z_j = \frac{1}{1+\exp(-a_j)}
\\
& a_j = \alpha_{j,0} + \sum_{m=1}^M \alpha_{jm} x_m
\end{align*}

We can compactly express this model by assuming that $x_0=1$ is a bias feature on the input and that $z_0=1$ is also fixed. In this way, we have two parameter matrices $\alphav \in \Rb^{D \times (M+1)}$ and $\betav \in \Rb^{K \times (D+1)}$. The extra $0$th column of each matrix (i.e. $\alphav_{\cdot,0}$ and $\betav_{\cdot,0}$) holds the bias parameters.

\begin{align*}
& \hat{y}_k = \frac{\exp(b_k)}{\sum_{l=1}^K \exp(b_l)}
\\
& b_k = \sum_{j=0}^D \beta_{kj} z_j
\\
& z_j = \frac{1}{1+\exp(-a_j)}
\\
& a_j = \sum_{m=0}^M \alpha_{jm} x_m
\end{align*}
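These folded-bias equations translate almost line for line into numpy. The sketch below is illustrative only: the function name is ours, and the max-subtraction inside the softmax is a standard numerical-stability trick, not something the handout requires.

```python
import numpy as np

def nn_forward(x, alpha, beta):
    """Forward pass of the one-hidden-layer network with folded-in bias columns.

    x: length-M input; alpha: D x (M+1); beta: K x (D+1).
    Returns the intermediate quantities a, z, b and the softmax output y_hat."""
    x1 = np.insert(x, 0, 1.0)        # prepend the bias feature x_0 = 1
    a = alpha @ x1                   # hidden-layer pre-activations
    z = 1.0 / (1.0 + np.exp(-a))     # sigmoid activation
    z1 = np.insert(z, 0, 1.0)        # prepend z_0 = 1
    b = beta @ z1                    # output-layer pre-activations
    e = np.exp(b - b.max())          # subtract max for numerical stability
    y_hat = e / e.sum()              # softmax
    return a, z, b, y_hat
```

With all-zero parameters, every hidden unit outputs $0.5$ and the softmax is uniform, which is a handy sanity check for the {\sc Zero} initialization.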

The objective function we will use for training the neural network is the average cross entropy over the training dataset $\Dc = \{ (\xv^{(i)}, \yv^{(i)}) \}$:

\begin{equation}
\label{eq:celoss}
J(\alphav, \betav)= - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{K} y_k^{(i)} \log (\hat{y}^{(i)}_k)
\end{equation}

In Equation~\ref{eq:celoss}, $J$ is a function of the model parameters $\alphav$ and $\betav$ because $\hat{y}^{(i)}_k$ is implicitly a function of $\xv^{(i)}$, $\alphav$, and $\betav$ since it is the output of the neural network applied to $\xv^{(i)}$. Of course, $\hat{y}^{(i)}_k$ and $y_k^{(i)}$ are the $k$th components of $\hat{\yv}^{(i)}$ and $\yv^{(i)}$ respectively.

To train, you should optimize this objective function using stochastic gradient descent (SGD), where the gradient of the parameters for each training example is computed via backpropagation. Note that SGD slightly changes the objective function: the “sum” is over only the current training example, $i$:
\begin{equation}
\label{eq:sgd_celoss}
J_{SGD}(\alphav, \betav)= -\sum_{k=1}^{K} y_k^{(i)} \log (\hat{y}^{(i)}_k)
\end{equation}

\subsection{Initialization}
\label{sec:init}

In order to use a deep network, we must first initialize the weights and biases in the network. This is typically done with a random initialization, or by initializing the weights from some other training procedure. For this assignment, we will be using two possible initializations:
\begin{quote}
\begin{description}
\item[{\sc Random}] The weights are initialized randomly from a uniform distribution from -0.1 to 0.1. The bias parameters are initialized to zero.
\item[{\sc Zero}] All weights are initialized to 0.
\end{description}
\end{quote}

You must support both of these initialization schemes.
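A sketch of the two schemes is below. The function name and the fixed seed are our own choices for the sake of a reproducible example; the uniform range and the zero biases come from the description above.

```python
import numpy as np

def init_params(M, D, K, init_flag, rng=None):
    """Return (alpha, beta) with shapes D x (M+1) and K x (D+1).

    init_flag == 1 (Random): weights uniform in [-0.1, 0.1], biases zero.
    init_flag == 2 (Zero):   everything zero."""
    rng = np.random.default_rng(0) if rng is None else rng
    if init_flag == 1:
        alpha = rng.uniform(-0.1, 0.1, size=(D, M + 1))
        beta = rng.uniform(-0.1, 0.1, size=(K, D + 1))
        alpha[:, 0] = beta[:, 0] = 0.0  # bias columns start at zero in both schemes
    else:
        alpha = np.zeros((D, M + 1))
        beta = np.zeros((K, D + 1))
    return alpha, beta
```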

\section{Implementation}
\label{sec:model}
Write a program \texttt{neuralnet.\{py|java|cpp|m\}} that implements an optical character recognizer using a one hidden layer neural network with sigmoid activations. Your program should learn the parameters of the model on the training data, report the cross-entropy at the end of each epoch on both train and validation data, and at the end of training write out its predictions and error rates on both datasets.
%

Your implementation must satisfy the following requirements:

\begin{itemize}
\item Use a {\bf sigmoid} activation function on the hidden layer and {\bf softmax} on the output layer to ensure it forms a proper probability distribution.
\item Number of {\bf hidden units} for the hidden layer should be determined by a command line flag.
\item Support two different {\bf initialization strategies}, as described in Section \ref{sec:init}, selecting between them via a command line flag.
\item Use stochastic gradient descent (SGD) to optimize the parameters for one hidden layer neural network. The number of {\bf epochs} will be specified as a command line flag.
\item Set the {\bf learning rate} via a command line flag.
\item Perform stochastic gradient descent updates on the training data in the order that the data is given in the input file. Although you would typically shuffle training examples when using stochastic gradient descent, in order to autograde the assignment, we ask that you {\bf DO NOT} shuffle the training examples in this assignment.
\item In case there is a tie in the output layer $\hat{\yv}$, predict the smallest index to be the label.
\item You may assume that the input data will always have the same output label space (i.e. $\{0,1,\ldots,9\}$). Other than this, do not hard-code any aspect of the datasets into your code. We will autograde your programs on multiple data sets that include different examples.
\item Do \emph{not} use any machine learning libraries. You may use supported linear algebra packages. See Section \ref{sec:args} for more details.
\end{itemize}
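One requirement above that is easy to overlook is the tie-breaking rule. In numpy, \texttt{np.argmax} already returns the first (i.e. smallest-index) maximum, so a minimal sketch of a compliant prediction rule is simply:

```python
import numpy as np

def predict_label(y_hat):
    """np.argmax returns the first maximum, i.e. the smallest index,
    which matches the required tie-breaking rule."""
    return int(np.argmax(y_hat))
```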

Implementing a neural network can be tricky: the parameters are not just a simple vector, but a collection of many parameters; computational efficiency of the model itself becomes essential; the initialization strategy dramatically impacts overall learning quality; other aspects which we will \emph{not} change (e.g. activation function, optimization method) also have a large effect. These \emph{tips} should help you along the way:

\begin{itemize}
\item Try to “vectorize” your code as much as possible—this is particularly important for Python. For example, in Python, you want to avoid for-loops and instead rely on \lstinline{numpy} calls to perform operations such as matrix multiplication, transpose, subtraction, etc. over an entire \lstinline{numpy} array at once. Why? Because these operations are actually implemented in fast C code, which won’t get bogged down the way a high-level scripting language like Python will.
\item For low level languages such as Java/C++, the use of primitive arrays and for-loops would not pose any computational efficiency problems—however, it is still helpful to make use of a linear algebra library to cut down on the number of lines of code you will write.
\item Implement a finite difference test to check whether your implementation of backpropagation is correctly computing gradients. If you choose to do this, comment out this functionality once your backward pass starts giving correct results and before submitting to Gradescope—since it will otherwise slow down your code.
\end{itemize}
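The finite difference test mentioned in the last tip can be sketched as follows. \texttt{finite\_diff\_grad} is our own illustrative helper using centered differences; here it is checked against a function with a known gradient rather than a full network, but the same helper can be pointed at your network's loss.

```python
import numpy as np

def finite_diff_grad(loss_fn, theta, eps=1e-5):
    """Numerically approximate d loss / d theta with centered differences."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        old = theta[idx]
        theta[idx] = old + eps
        plus = loss_fn(theta)
        theta[idx] = old - eps
        minus = loss_fn(theta)
        theta[idx] = old                       # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Sanity check on f(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
g = finite_diff_grad(lambda t: float((t ** 2).sum()), theta)
```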

\subsection{Command Line Arguments}
\label{sec:args}
The autograder runs and evaluates the output from the files generated, using the following command:

\begin{tabular}{ll}
For Python: &
\begin{lstlisting}[language=Shell]
$ python3 neuralnet.py [args...]
\end{lstlisting}
\\
For Java: &
\begin{lstlisting}[language=Shell]
$ javac -cp "./lib/ejml-v0.38-libs/*:./" neuralnet.java
$ java -cp "./lib/ejml-v0.38-libs/*:./" neuralnet [args...]
\end{lstlisting}
\\
For C++: &
\begin{lstlisting}[language=Shell]
$ g++ -g -std=c++11 -I./lib neuralnet.cpp; ./a.out [args...]
\end{lstlisting}
\end{tabular}

Where above \texttt{[args\dots]} is a placeholder for nine command-line arguments: \texttt{<train\_input>}, \texttt{<validation\_input>}, \texttt{<train\_out>}, \texttt{<validation\_out>}, \texttt{<metrics\_out>}, \texttt{<num\_epoch>}, \texttt{<hidden\_units>}, \texttt{<init\_flag>}, \texttt{<learning\_rate>}. These arguments are described in detail below:
\begin{enumerate}
\item \texttt{<train\_input>}: path to the training input \texttt{.csv} file (see Section~\ref{sec:dataset})
\item \texttt{<validation\_input>}: path to the validation input \texttt{.csv} file (see Section~\ref{sec:dataset})
\item \texttt{<train\_out>}: path to output \texttt{.labels} file to which the predictions on the \emph{training} data should be written (see Section~\ref{output})
\item \texttt{<validation\_out>}: path to output \texttt{.labels} file to which the predictions on the \emph{validation} data should be written (see Section~\ref{output})
\item \texttt{<metrics\_out>}: path of the output \texttt{.txt} file to which metrics such as train and validation error should be written (see Section~\ref{metrics})
\item \texttt{<num\_epoch>}: integer specifying the number of times backpropagation loops through all of the training data (e.g., if \texttt{<num\_epoch>} equals 5, then each training example will be used in backpropagation 5 times).
\item \texttt{<hidden\_units>}: positive integer specifying the number of hidden units.
\item \texttt{<init\_flag>}: integer taking value 1 or 2 that specifies whether to use {\sc Random} or {\sc Zero} initialization (see Section~\ref{sec:init} and Section~\ref{sec:model})—that is, if \lstinline{init_flag}==1, initialize your weights randomly from a uniform distribution over the range [-0.1, 0.1] (i.e. {\sc Random}); if \lstinline{init_flag}==2, initialize all weights to zero (i.e. {\sc Zero}). For both settings, {\bf always initialize bias terms to zero}.
\item \texttt{<learning\_rate>}: float value specifying the learning rate for SGD.

\end{enumerate}
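If you are working in Python, the nine positional arguments can be unpacked in one place; this is a sketch (the local variable names are our own), typically invoked as \texttt{parse\_args(sys.argv)}.

```python
import sys

def parse_args(argv):
    """Unpack the nine positional command-line arguments, in the order
    described above, into typed values."""
    (train_input, validation_input, train_out, validation_out,
     metrics_out) = argv[1:6]
    num_epoch = int(argv[6])        # number of passes through the training data
    hidden_units = int(argv[7])     # size of the hidden layer
    init_flag = int(argv[8])        # 1 = Random, 2 = Zero
    learning_rate = float(argv[9])  # SGD step size
    return (train_input, validation_input, train_out, validation_out,
            metrics_out, num_epoch, hidden_units, init_flag, learning_rate)
```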

As an example, if you implemented your program in Python, the following command line would run your program with 4 hidden units on the small data provided in the handout for 2 epochs using zero initialization and a learning rate of 0.1.
\\
\begin{lstlisting}[language=Shell]
$ python3 neuralnet.py smallTrain.csv smallValidation.csv \
smallTrain_out.labels smallValidation_out.labels smallMetrics_out.txt \
2 4 2 0.1
\end{lstlisting}

\subsection{Output: Labels Files} \label{output}
Your program should write two output \texttt{.labels} files containing the predictions of your model on training data (\texttt{<train\_out>}) and validation data (\texttt{<validation\_out>}). Each should contain the predicted labels for each example printed on a new line. Use \lstinline{\n} to create a new line.

Your labels should exactly match those of a reference implementation — this will be checked by the autograder by running your program and evaluating your output file against the reference solution.

\textbf{Note}: You should output your predicted labels using the same \emph{integer} identifiers as the original training data. You should also insert an empty line (again using \lstinline{\n}) at the end of each sequence (as is done in the input data files). The first few lines of the predicted labels for the validation\thinspace dataset are given below.
\\

\begin{lstlisting}
6
4
8
8
\end{lstlisting}
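A minimal sketch of writing such a file (one label per line; adapt the trailing-newline details to the note above if it applies to your data):

```python
def write_labels(path, labels):
    """Write one predicted integer label per line, each line ending in \\n."""
    with open(path, "w") as f:
        for label in labels:
            f.write(f"{label}\n")
```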

\subsection{Output Metrics} \label{metrics}
Generate a file where you report the following metrics:

\begin{description}
\item[cross entropy] After each Stochastic Gradient Descent (SGD) epoch, report mean cross entropy on the training data \lstinline{crossentropy(train)} and validation\thinspace data \texttt{crossentropy(validation)} (See Equation~\ref{eq:celoss}). These two cross-entropy values should be reported at the end of each epoch and prefixed by the epoch number. For example, after the second pass through the training examples, these should be prefixed by \lstinline{epoch=2}. The total number of train losses you print out should equal \texttt{num\_epoch}—likewise for the total number of validation\thinspace losses.
\item[error] After the final epoch (i.e. when training has completed fully), report the final training error \lstinline{error(train)} and validation\thinspace error \texttt{error(validation)}.
\end{description}

%All of your reported numbers should be within 0.01 of the reference solution.
A sample output is given below. It contains the train and validation\thinspace losses for the first 2 epochs and the final error rate when using the command given above.

\begin{lstlisting}
epoch=1 crossentropy(train): 2.18506276114
epoch=1 crossentropy(validation): 2.18827302588
epoch=2 crossentropy(train): 1.90103257727
epoch=2 crossentropy(validation): 1.91363803461
error(train): 0.77
error(validation): 0.78
\end{lstlisting}

Take care that your output has the exact same format as shown above. There is an equal sign \lstinline{=} between the word \lstinline{epoch} and the epoch number, but no spaces. There should be a single space after the epoch number (e.g. a space after \lstinline{epoch=1}), and a single space after the colon preceding the metric value (e.g. a space after \lstinline{epoch=1 crossentropy(train):}). Each line should be terminated by a Unix line ending \lstinline{\n}.
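One way to keep this format correct is to build the entire metrics string in one helper. The function below is a sketch (the helper name is ours, and the exact decimal precision of the printed floats is left to your implementation; only the surrounding format is fixed):

```python
def format_metrics(train_ce, valid_ce, train_err, valid_err):
    """Render the metrics file content: per-epoch cross-entropies,
    then the final train/validation error rates."""
    lines = []
    for epoch, (tr, va) in enumerate(zip(train_ce, valid_ce), start=1):
        lines.append(f"epoch={epoch} crossentropy(train): {tr}")
        lines.append(f"epoch={epoch} crossentropy(validation): {va}")
    lines.append(f"error(train): {train_err}")
    lines.append(f"error(validation): {valid_err}")
    return "\n".join(lines) + "\n"   # Unix line endings, newline-terminated
```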

\subsection{Tiny Data Set} \label{tiny}
To help you with this assignment, we have also included a tiny data set, \texttt{tinyTrain.csv} and\\ \texttt{tinyValidation.csv}, and a reference output file \texttt{tinyOutput.txt} for you to use. The tiny dataset is in a format similar to the other datasets, but it only contains two samples with five features. The reference file contains outputs from each layer of one correctly implemented neural network, for both forward and back-propagation steps. We advise you to use this set to help you debug in case your implementation doesn’t produce the same results as in the written part.

There is more information in the \texttt{README.md} file. Do read through the README file if you plan to use it for debugging. For your reference, \texttt{tinyOutput.txt} is generated from the following command line specifications:

\begin{lstlisting}[language=Shell]
$ python3 neuralnet.py tinyTrain.csv tinyValidation.csv \
tinyTrain_out.labels tinyValidation_out.labels tinyMetrics_out.txt \
1 4 2 0.1
\end{lstlisting}
The specific output file names are not important, but be sure to keep the other arguments exactly as they are shown above.

\section{Gradescope Submission}
You should submit your \texttt{neuralnet.\{py|java|cpp\}} to Gradescope. Please do not use any other file name for your implementation, as that will prevent the autograder from correctly detecting and running your code.

Some additional tips: Make sure to read the autograder output carefully; it prints some additional information about the tests that it ran. For this programming assignment, we have anticipated some common buggy implementations, and the autograder will try its best to detect those and give you more useful feedback in Gradescope. Make wise use of the autograder's output when debugging your code.

Note: For this assignment, you may make up to 10 submissions to Gradescope before the deadline, but only your last submission will be graded.

\clearpage

\appendix
\section{Implementation Details for Neural Networks}
\label{sec:implementationdetails}

This section provides a variety of suggestions for how to efficiently and succinctly implement a neural network and backpropagation.

\subsection{SGD for Neural Networks}
\label{sec:sgd}

% $\hat{\yv} = h_{\alphav, \betav}(\xv) = \text{softmax}(\betav \sigma(\alphav \xv + \alphav_0) + \betav_0)$

Consider the neural network described in Section \ref{sec:model} applied to the $i$th training example $(\xv, \yv)$ where $\yv$ is a one-hot encoding of the true label. Our neural network outputs $\hat{\yv} = h_{\alphav, \betav}(\xv)$, where $\alphav$ and $\betav$ are the parameters of the first and second layers respectively and $h_{\alphav, \betav}(\cdot)$ is a one-hidden layer neural network with a sigmoid activation and softmax output. The loss function is the cross-entropy $J = \ell(\hat{\yv}, \yv) = - \yv^T \log(\hat{\yv})$. $J = J_{\xv, \yv}(\alphav, \betav)$ is actually a function of our training example $(\xv, \yv)$ and our model parameters $\alphav, \betav$, though we write just $J$ for brevity.

In order to train our neural network, we are going to apply stochastic gradient descent. Because we want the behavior of your program to be deterministic for testing on Gradescope, we make a few simplifications: (1) you should \emph{not} shuffle your data and (2) you will use a fixed learning rate. In the real world, you would \emph{not} make these simplifications.

SGD proceeds as follows, where $E$ is the number of epochs and $\gamma$ is the learning rate.

\begin{algorithm}[H]
\caption{Stochastic Gradient Descent (SGD) without Shuffle}
\label{alg:sgd}
\begin{algorithmic}[1] % The number tells where the line numbering should start
\Procedure{SGD}{Training data $\Dc$, \ntset\thinspace data $\Dc_t$}
\State Initialize parameters $\alphav, \betav$ \Comment{Use either {\sc Random} or {\sc Zero} from Section \ref{sec:init}}
\For{$e \in \{1, 2, \ldots, E\}$} \Comment{For each epoch}
\For{$(\xv, \yv) \in \Dc$} \Comment{For each training example (No shuffling)}
\State Compute neural network layers:
%$\hat{\yv} = h_{\alphav, \betav}(\xv)$ and loss $J = \ell(\hat{\yv}, \yv)$ storing intermediate quantities as in
\State $\ov = \texttt{object}( \xv, \av, \bv, \zv, \hat{\yv}, J) = \textproc{NNForward}(\xv, \yv, \alphav, \betav)$
\State Compute gradients via backprop:
% $\nabla_{\alphav} J, \nabla_{\betav} J$
%\State $\gv_{\alphav} = \nabla_{\alphav} J$
%\State $\gv_{\betav} = \nabla_{\betav} J$
\State
$
\begin{drcases}
\gv_{\alphav} = \frac{\partial J}{\partial \alphav} \\
\gv_{\betav} = \frac{\partial J}{\partial \betav}
\end{drcases}
= \textproc{NNBackward}(\xv, \yv, \alphav, \betav, \ov)
$
\State Update parameters:
\State $\alphav \gets \alphav - \gamma \gv_{\alphav}$
\State $\betav \gets \betav - \gamma \gv_{\betav}$
% \State Update parameters:
% \State $\alphav \gets \alphav – \gamma \nabla_{\alphav} J$
% \State $\betav \gets \betav – \gamma \nabla_{\betav} J$
\EndFor
\State Evaluate training mean cross-entropy $J_{\Dc}(\alphav, \betav)$
\State Evaluate \ntset\thinspace mean cross-entropy $J_{\Dc_t}(\alphav, \betav)$
\EndFor
\State \textbf{return} parameters $\alphav, \betav$
\EndProcedure
\end{algorithmic}
\end{algorithm}
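The control flow of Algorithm \ref{alg:sgd} can be sketched generically as below. To keep the sketch self-contained, \textproc{NNForward}/\textproc{NNBackward} are abstracted into a single \texttt{grad\_fn} callback, and the toy quadratic objective used to exercise it is ours, not the assignment's; the point is the fixed-order, fixed-learning-rate update loop.

```python
import numpy as np

def sgd(data, params, grad_fn, learning_rate, num_epoch):
    """Skeleton of Algorithm 1: fixed example order, fixed learning rate, no shuffling."""
    for _ in range(num_epoch):
        for example in data:                 # iterate in the given order (no shuffle)
            grads = grad_fn(params, example)
            for name in params:              # update every parameter matrix in place
                params[name] -= learning_rate * grads[name]
    return params

# Toy stand-in objective: l(w, x) = (w - x)^2, whose gradient in w is 2 (w - x).
params = sgd(
    data=[1.0, 2.0, 3.0],
    params={"w": np.array([0.0])},
    grad_fn=lambda p, x: {"w": 2.0 * (p["w"] - x)},
    learning_rate=0.1,
    num_epoch=100,
)
```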

%
At test time, we output the most likely prediction for each example:

\begin{algorithm}[H]
\caption{Prediction at Test Time}
\label{alg:predict}
\begin{algorithmic}[1] % The number tells where the line numbering should start
\Procedure{Predict}{Unlabeled train or \ntset\thinspace dataset $\Dc'$, Parameters $\alphav, \betav$}
\For{$\xv \in \Dc'$}
\State Compute neural network prediction $\hat{\yv} = h(\xv)$
\State Predict the label with highest probability $l = \argmax_k \hat{y}_k$
\EndFor
\EndProcedure
\end{algorithmic}
\end{algorithm}

The gradients we need above are themselves matrices of partial derivatives. Let $M$ be the number of input features, $D$ the number of hidden units, and $K$ the number of outputs.

\begin{align}
&\alphav =
\begin{bmatrix}
\alpha_{10} & \alpha_{11} & \dots & \alpha_{1M} \\
\alpha_{20} & \alpha_{21} & \dots & \alpha_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_{D0} & \alpha_{D1} & \dots & \alpha_{DM}
\end{bmatrix}
&&
\gv_{\alphav} = \frac{\partial J}{\partial \alphav} =
\begin{bmatrix}
\adj{\alpha_{10}} & \adj{\alpha_{11}} & \dots & \adj{\alpha_{1M}} \\
\adj{\alpha_{20}} & \adj{\alpha_{21}} & \dots & \adj{\alpha_{2M}} \\
\vdots & \vdots & \ddots & \vdots \\
\adj{\alpha_{D0}} & \adj{\alpha_{D1}} & \dots & \adj{\alpha_{DM}}
\end{bmatrix}
\end{align}

\begin{align}
&\betav =
\begin{bmatrix}
\beta_{10} & \beta_{11} & \dots & \beta_{1D} \\
\beta_{20} & \beta_{21} & \dots & \beta_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{K0} & \beta_{K1} & \dots & \beta_{KD}
\end{bmatrix}
&&
\gv_{\betav} = \frac{\partial J}{\partial \betav} =
\begin{bmatrix}
\adj{\beta_{10}} & \adj{\beta_{11}} & \dots & \adj{\beta_{1D}} \\
\adj{\beta_{20}} & \adj{\beta_{21}} & \dots & \adj{\beta_{2D}} \\
\vdots & \vdots & \ddots & \vdots \\
\adj{\beta_{K0}} & \adj{\beta_{K1}} & \dots & \adj{\beta_{KD}}
\end{bmatrix}
\end{align}

Observe that we have (in a rather \emph{tricky} fashion) defined the matrices such that both $\alphav$ and $\gv_{\alphav}$ are $D \times (M+1)$ matrices. Likewise, $\betav$ and $\gv_{\betav}$ are $K \times (D+1)$ matrices. The $+1$ comes from the extra columns $\alphav_{\cdot, 0}$ and $\betav_{\cdot, 0}$ which are the bias parameters for the first and second layer respectively. We will always assume $x_0 = 1$ and $z_0 = 1$. This should greatly simplify your implementation as you will see in Section \ref{sec:vectorize}.

\subsection{Recursive Derivation of Backpropagation}
\label{sec:recursive}

In class, we described a very general approach to differentiating arbitrary functions: backpropagation. One way to understand \emph{how} we go about deriving the backpropagation algorithm is to consider the natural consequence of recursive application of the chain rule.

In practice, the partial derivatives that we need for learning are $\adj{\alpha_{ij}}$ and $\adj{\beta_{kj}}$.

\subsubsection{Symbolic Differentiation}

\begin{notebox}
\paragraph{Note} In this section, we motivate backpropagation via a strawman: that is, we will work through the \emph{wrong} approach first (i.e. symbolic differentiation) in order to see why we want a more efficient method (i.e. backpropagation). Do {\bf not} use this symbolic differentiation in your code.
\end{notebox}

Suppose we wanted to find $\adj{\alpha_{ij}}$ using the method we know from high school calculus. That is, we will analytically solve for an equation representing that quantity.

\begin{enumerate}
\item Considering the computational graph for the neural network, we observe that $\alpha_{ij}$ has exactly one child $a_i = \sum_{m=0}^M \alpha_{im} x_m$. That is, $a_i$ is the \emph{first and only} intermediate quantity that uses $\alpha_{ij}$. Applying the chain rule, we obtain
\[ \chain{a_i}{\alpha_{ij}} = \adj{a_i} x_j \]
\item So far so good; now we just need to compute $\adj{a_i}$. Not a problem! We can just apply the chain rule again. $a_i$ also has exactly one child, namely $z_i = \sigma(a_i)$. The chain rule gives us that $\chain{z_i}{a_i} = \adj{z_i} z_i (1 - z_i)$. Substituting back into the equation above, we find that
\[ \adj{\alpha_{ij}} = \adj{z_i} (z_i (1-z_i)) x_j \]
\item How do we get $\adj{z_i}$? You guessed it: apply the chain rule yet again. This time, however, there are \emph{multiple} children of $z_i$ in the computation graph; they are $b_1, b_2, \ldots b_K$. Applying the chain rule gives us that $\adj{z_i} = \sum_{k=1}^K \adj{b_k}\frac{\partial b_k}{\partial z_i} = \sum_{k=1}^K \adj{b_k}\beta_{ki}$. Substituting back into the equation above gives:
\[ \adj{\alpha_{ij}} = \sum_{k=1}^K \adj{b_k}\beta_{ki} (z_i (1-z_i)) x_j \]
\item Next we need $\adj{b_k}$, which we again obtain via the chain rule: $\adj{b_k} = \sum_{l=1}^K \adj{\hat{y}_l} \frac{\partial \hat{y}_l}{\partial b_k} = \sum_{l=1}^K \adj{\hat{y}_l} \hat{y}_l (\Ib[k=l] - \hat{y}_k)$. Substituting back in above gives:
\[ \adj{\alpha_{ij}} = \sum_{k=1}^K \sum_{l=1}^K \adj{\hat{y}_l} \hat{y}_l (\Ib[k=l] - \hat{y}_k) \beta_{ki} (z_i (1-z_i)) x_j \]
\item Finally, we know that $\adj{\hat{y}_l} = - \frac{y_l}{\hat{y}_l}$, which we can again substitute back in to obtain our final result:
\[ \adj{\alpha_{ij}} = \sum_{k=1}^K \sum_{l=1}^K - \frac{y_l}{\hat{y}_l} \hat{y}_l (\Ib[k=l] - \hat{y}_k) \beta_{ki} (z_i (1-z_i)) x_j \]
Although we have successfully derived the partial derivative w.r.t. $\alpha_{ij}$, the result is far from satisfying. It is overly complicated and requires deeply nested for-loops to compute.
\end{enumerate}

The above is an example of {\bf symbolic differentiation}. That is, at the end we get an equation representing the partial derivative w.r.t. $\alpha_{ij}$. At this point, you should be saying to yourself: What a mess! Isn't there a better way? Indeed there is, and it's called backpropagation. The algorithm works just like the above symbolic differentiation except that we \textit{never} substitute the partial derivative from the previous step back in. Instead, we work ``backwards'' through the steps above, computing partial derivatives in a top-down fashion.
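To see what this top-down reuse looks like in code, here is a hedged numpy sketch for a single example (bias terms omitted for brevity, and using the standard softmax-plus-cross-entropy simplification $\gv_{\bv} = \hat{\yv} - \yv$; all names are our own, not part of the assignment interface):

```python
import numpy as np

def backprop_alpha(x, z, y, y_hat, beta):
    """Compute dJ/dalpha by reusing each adjoint once, top-down.

    x: input (M,); z: hidden activations (D,); y: one-hot label (K,);
    y_hat: softmax output (K,); beta: second-layer weights (K, D).
    """
    g_b = y_hat - y             # adjoint of b (softmax + cross-entropy cancel)
    g_z = beta.T @ g_b          # adjoint of z: sum over the K children of z_i
    g_a = g_z * z * (1 - z)     # adjoint of a: sigmoid derivative
    g_alpha = np.outer(g_a, x)  # dJ/dalpha_{ij} = g_a[i] * x[j]
    return g_alpha
```

Note that each adjoint ($\gv_{\bv}$, $\gv_{\zv}$, $\gv_{\av}$) is a small vector computed exactly once, in contrast to the doubly nested sums of the symbolic derivation.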

\subsection{Matrix / Vector Operations for Neural Networks}
\label{sec:vectorize}

Some programming languages are fast and some are slow. Below is a simple benchmark to show this concretely. The task is to compute a dot-product $\av^T \bv$ between two vectors $\av \in \Rb^{500}$ and $\bv \in \Rb^{500}$ one thousand times. Table \ref{tab:dotprod} shows the time taken for several combinations of programming language and data structure.

\begin{table}[H]
\begin{center}
\begin{tabular}{llr}
\toprule
{\bf language} & {\bf data structure} & {\bf time (ms)} \\
\midrule
Python & list & 200.99 \\
Python & \texttt{numpy} array & 1.01\\
Java & \texttt{float[]} & 4.00 \\
C++ & \texttt{vector} & 0.81\\
\bottomrule
\end{tabular}
\end{center}
\caption{Computation time required for dot-product in various languages.}
\label{tab:dotprod}
\end{table}

Notice that Java\footnote{Java would approach the speed of C++ if we had given the just-in-time (JIT) compiler inside the JVM time to ``warm up''.} and C++ with standard data structures are quite efficient. By contrast, Python's performance differs dramatically depending on which data structure you use: with a standard list object (e.g. \lstinline{a = [float(x) for x in range(500)]}) the computation time is an appallingly slow 200+ milliseconds. Simply by switching to a numpy array (e.g. \lstinline{a = np.arange(500, dtype=float)}) we obtain a 200x speedup. This is because a numpy array carries out the dot-product computation in pure C, which is just as fast as our C++ benchmark, modulo some Python overhead.

Thus, for this assignment, Java and C++ programmers could easily implement the entire neural network using standard data structures and some for-loops. However, Python programmers would find their code far too slow if they tried to do the same. As such, Python users in particular must convert all the deeply nested for-loops into efficient ``vectorized'' math via \lstinline{numpy}. Doing so will ensure efficient code. Java and C++ programmers can also benefit from linear algebra packages, since they can cut down on the total number of lines of code you need to write.
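A rough version of the Python rows of the benchmark can be reproduced with the snippet below (absolute timings will vary by machine; the 500-element vectors and 1000 repetitions follow the setup described above):

```python
import timeit
import numpy as np

n, reps = 500, 1000
a_list = [float(i) for i in range(n)]
b_list = [float(i) for i in range(n)]
a_np = np.arange(n, dtype=float)
b_np = np.arange(n, dtype=float)

# Dot product with plain Python lists vs. numpy arrays, repeated 1000 times.
t_list = timeit.timeit(
    lambda: sum(ai * bi for ai, bi in zip(a_list, b_list)), number=reps)
t_np = timeit.timeit(lambda: a_np.dot(b_np), number=reps)

print(f"list: {t_list * 1000:.2f} ms, numpy: {t_np * 1000:.2f} ms")
```

On most machines the numpy version should be one to two orders of magnitude faster, consistent with Table \ref{tab:dotprod}.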

\subsection{Procedural Method of Implementation}
\label{sec:procedural}

Perhaps the simplest way to implement a 1-hidden-layer neural network is procedurally. Note that this approach has some drawbacks that we’ll discuss below (Section \ref{sec:proceduraldrawbacks}).

The procedural method: one function computes the outputs of the neural network and all intermediate quantities, $\ov = \textproc{NNForward}(\xv, \yv, \alphav, \betav) = \texttt{object}(\xv, \av, \zv, \bv, \hat{\yv}, J)$, where the object is just some struct. Then another function computes the gradients of the parameters, $\gv_{\alphav}, \gv_{\betav} = \textproc{NNBackward}(\xv, \yv, \alphav, \betav, \ov)$, where $\ov$ is a data structure that stores all the forward computation.

One must be careful to ensure that functions are vectorized. For example, your Sigmoid function should be able to take a vector input and return a vector output with the Sigmoid function applied to all of its elements. All of these operations should avoid for-loops when working in a high-level language like Python. We can compute the softmax function in a similar vectorized manner.
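For illustration, a vectorized sigmoid and softmax might look like the following sketch (the max-subtraction in softmax is a common numerical-stability trick, not something the writeup mandates):

```python
import numpy as np

def sigmoid(a):
    # Elementwise sigmoid applied to an entire vector (or matrix) -- no loops.
    return 1.0 / (1.0 + np.exp(-a))

def softmax(b):
    # Subtracting the max avoids overflow in exp and leaves the result unchanged.
    e = np.exp(b - np.max(b))
    return e / np.sum(e)
```

Both functions consist only of whole-array numpy operations, so the inner loops run in compiled code rather than in the Python interpreter.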

\subsubsection{Drawbacks to Procedural Method}
\label{sec:proceduraldrawbacks}

As noted in Section \ref{sec:finitediff}, it is possible to use a finite difference method to check that the backpropagation algorithm is correctly computing the gradient of its corresponding forward computation. We \emph{strongly} encourage you to do this.

There is a big problem, however: what if your finite difference check informs you that the gradient is \emph{not} being computed correctly? How will you know \emph{which} part of your \textproc{NNForward}() or \textproc{NNBackward}() functions has a bug? There are two possible solutions here:
\begin{enumerate}
\item As usual, you can (and should) work through a tiny example dataset on paper. Compute each intermediate quantity and each gradient. Check that your code reproduces each number. The one that does not should indicate where to find the bug.
\item Replace your procedural implementation with a module-based one (as described in Section \ref{sec:modulebased}) and then run a finite-difference check on \emph{each} layer of the model individually. The finite-difference check that fails should indicate where to find the bug.
\end{enumerate}

Of course, rather than waiting until you have a bug in your procedural implementation, you could jump straight to the module-based version—though it increases the complexity slightly (i.e. more lines of code) it \emph{might} save you some time in the long run.

\subsection{Module-based Method of Implementation}
\label{sec:modulebased}

Module-based automatic differentiation (AD) is a technique that has long been used to develop libraries for deep learning. Dynamic neural network packages are those that allow a specification of the computation graph dynamically at runtime, such as Torch\footnote{\url{http://torch.ch/}}, PyTorch\footnote{\url{http://pytorch.org/}}, and DyNet\footnote{\url{https://dynet.readthedocs.io}}—these all employ module-based AD in the sense that we will describe here.\footnote{Static neural network packages are those that require a static specification of a computation graph which is subsequently compiled into code. Examples include Theano, Tensorflow, and CNTK. These libraries are also module-based but the particular form of implementation is different from the dynamic method we recommend here.}

The key idea behind module-based AD is to componentize the computation of the neural network into layers. Each layer can be thought of as consolidating numerous nodes in the computation graph (a subset of them) into one \emph{vector-valued} node. Such a vector-valued node should be capable of the following, and we call each one a {\bf module}:
\begin{enumerate}
\item Forward computation of output $\bv = [b_1, \ldots, b_B]$ given input $\av = [a_1, \ldots, a_A]$ via some differentiable function $f$. That is $\bv = f(\av)$.
\item Backward computation of the gradient of the input $\gv_{\av} = \frac{\partial J}{\partial \av} = [\adj{a_1}, \ldots, \adj{a_A}]$ given the gradient of output $\gv_{\bv} = \frac{\partial J}{\partial \bv} = [\adj{b_1}, \ldots, \adj{b_B}]$, where $J$ is the final real-valued output of the entire computation graph. This is done via the chain rule $\adj{a_i} = \sum_{j=1}^B \adj{b_j} \frac{\partial b_j}{\partial a_i}$ for all $i \in \{1, \ldots, A\}$.
\end{enumerate}

\subsubsection{Module Definitions}

The modules we would define for our neural network correspond to a Linear layer, a Sigmoid layer, a Softmax layer, and a Cross-Entropy layer. Each module defines a forward function $\bv = \textproc{*Forward}(\av)$ and a backward function $\gv_{\av} = \textproc{*Backward}(\av, \bv, \gv_{\bv})$. These methods accept parameters if appropriate. You'll want to pay close attention to the dimensions that you pass into and return from your modules.

% The dimensions $A$ and $B$ are specific to the module such that we have input $\av \in \Rb^A$, output $\bv \in \Rb^B$, gradient of output $\gv_{\av} \triangleq \nabla_{\av}J \in \Rb^A$, and gradient of input $\gv_{\bv} \triangleq \nabla_{\bv}J \in \Rb^B$. We have provided you the pseudocode for the Linear Module as an example.

\begin{description}
\item[Linear Module] The linear layer has two inputs: a vector $\av$ and parameters $\alphav \in \Rb^{B \times A}$. The output $\bv$ is not used by \textproc{LinearBackward}, but we pass it in for consistency of form.
\begin{algorithmic}[1]
\Procedure{LinearForward}{$\av$, $\alphav$}
\State Compute $\bv$ % $\bv = \alphav \av$
\State \textbf{return} $\bv$
\EndProcedure
\Procedure{LinearBackward}{$\av$, $\alphav$, $\bv$, $\gv_{\bv}$}
\State Compute $\gv_{\alphav}$ % $\gv_{\alphav} = \gv_{\bv} \av^T$
\State Compute $\gv_{\av}$ %$\gv_{\av} = \alphav^T \gv_{\bv}$
\State \textbf{return} $\gv_{\alphav}, \gv_{\av}$
\EndProcedure
\end{algorithmic}
\end{description}
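Filling in the commented equations from the pseudocode above, a minimal numpy version of the Linear module (ignoring the bias column for brevity; a sketch, not the required interface) could be:

```python
import numpy as np

def linear_forward(a, alpha):
    # b = alpha a
    return alpha @ a

def linear_backward(a, alpha, b, g_b):
    # g_alpha = g_b a^T (an outer product); g_a = alpha^T g_b.
    # b is unused here, but passed in for consistency of form.
    g_alpha = np.outer(g_b, a)
    g_a = alpha.T @ g_b
    return g_alpha, g_a
```

Note how the backward function returns both the gradient of the parameters (used for the update) and the gradient of the input (passed further down the chain).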

It's also quite common to combine the Cross-Entropy and Softmax layers into one. The reason for this is the cancellation of numerous terms that result from the zeros in the cross-entropy backward calculation. (Said trick is \emph{not} required to obtain a sufficiently fast implementation for Gradescope.)
%You can read more about this in a detailed writeup from your TAs on Piazza.\footnote{\url{https://piazza.com/class/jar95m88puk3v6?cid=1092}} (Said trick is \emph{not} required to obtain a sufficiently fast implementation for Gradescope.)
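A quick numerical check of this cancellation (our own sketch, with made-up values): composing the separate cross-entropy and softmax backward steps yields exactly $\hat{\yv} - \yv$.

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])      # one-hot label
y_hat = np.array([0.2, 0.5, 0.3])  # softmax output

g_yhat = -y / y_hat                      # cross-entropy backward with g_J = 1
g_b = y_hat * (g_yhat - y_hat @ g_yhat)  # softmax backward
assert np.allclose(g_b, y_hat - y)       # the combined, simplified form
```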

\subsubsection{Module-based AD for Neural Network}

Using these modules, we can re-define our functions \textproc{NNForward} and \textproc{NNBackward} as follows.

\begin{algorithm}[H]
\caption{Forward Computation}
\label{alg:forwardmodule}
\begin{algorithmic}[1] % The number tells where the line numbering should start
\Procedure{NNForward}{Training example ($\xv$, $\yv$), Parameters $\alphav$, $\betav$}
\State $\av = \textproc{LinearForward}(\xv, \alphav)$
\State $\zv = \textproc{SigmoidForward}(\av)$
\State $\bv = \textproc{LinearForward}(\zv, \betav)$
\State $\hat{\yv} = \textproc{SoftmaxForward}(\bv)$
\State $J = \textproc{CrossEntropyForward}(\yv, \hat{\yv})$
\State $\ov = \texttt{object}(\xv, \av, \zv, \bv, \hat{\yv}, J)$
\State \textbf{return} intermediate quantities $\ov$
\EndProcedure
\end{algorithmic}
\end{algorithm}

\begin{algorithm}[H]
\caption{Backpropagation}
\label{alg:backpropmodule}
\begin{algorithmic}[1] % The number tells where the line numbering should start
\Procedure{NNBackward}{Training example ($\xv$, $\yv$), Parameters $\alphav$, $\betav$, Intermediates $\ov$}
\State Bring the intermediate quantities $\xv, \av, \zv, \bv, \hat{\yv}, J$ stored in $\ov$ into scope
\State $g_J = \frac{\partial J}{\partial J} = 1$ \Comment{Base case}
\State $\gv_{\hat{\yv}} = \textproc{CrossEntropyBackward}(\yv, \hat{\yv}, g_J)$
\State $\gv_{\bv} = \textproc{SoftmaxBackward}(\bv, \hat{\yv}, \gv_{\hat{\yv}})$
\State $\gv_{\betav}, \gv_{\zv} = \textproc{LinearBackward}(\zv, \betav, \bv, \gv_{\bv})$
\State $\gv_{\av} = \textproc{SigmoidBackward}(\av, \zv, \gv_{\zv})$
\State $\gv_{\alphav}, \gv_{\xv} = \textproc{LinearBackward}(\xv, \alphav, \av, \gv_{\av})$ \Comment{We discard $\gv_{\xv}$}
\State \textbf{return} parameter gradients $\gv_{\alphav}, \gv_{\betav}$
\EndProcedure
\end{algorithmic}
\end{algorithm}

Here's the big takeaway: we can actually view these two functions as themselves defining another module! It is a 1-hidden-layer neural network module. That is, the cross-entropy of the neural network for a single training example is \emph{itself} a differentiable function, and we know how to compute the gradients of its inputs given the gradients of its outputs.
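Concretely, the two algorithms above translate almost line for line into numpy. The sketch below is hedged: bias columns are omitted, the backward bodies for sigmoid, softmax, and cross-entropy are standard derivations we fill in ourselves, and none of the names are mandated by the assignment.

```python
import numpy as np

def linear_forward(a, w):
    return w @ a

def linear_backward(a, w, b, g_b):
    return np.outer(g_b, a), w.T @ g_b

def sigmoid_forward(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_backward(a, z, g_z):
    return g_z * z * (1 - z)

def softmax_forward(b):
    e = np.exp(b - np.max(b))
    return e / e.sum()

def softmax_backward(b, y_hat, g_y_hat):
    # Jacobian-vector product of softmax, simplified to vector form.
    return y_hat * (g_y_hat - y_hat @ g_y_hat)

def cross_entropy_forward(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def cross_entropy_backward(y, y_hat, g_J):
    return g_J * (-y / y_hat)

def nn_forward(x, y, alpha, beta):
    a = linear_forward(x, alpha)
    z = sigmoid_forward(a)
    b = linear_forward(z, beta)
    y_hat = softmax_forward(b)
    J = cross_entropy_forward(y, y_hat)
    return {"x": x, "a": a, "z": z, "b": b, "y_hat": y_hat, "J": J}

def nn_backward(x, y, alpha, beta, o):
    g_J = 1.0                                            # base case
    g_y_hat = cross_entropy_backward(y, o["y_hat"], g_J)
    g_b = softmax_backward(o["b"], o["y_hat"], g_y_hat)
    g_beta, g_z = linear_backward(o["z"], beta, o["b"], g_b)
    g_a = sigmoid_backward(o["a"], o["z"], g_z)
    g_alpha, _ = linear_backward(x, alpha, o["a"], g_a)  # discard g_x
    return g_alpha, g_beta
```

Because each module is a self-contained pair of functions, each can be finite-difference checked in isolation, which is exactly the debugging advantage discussed in Section \ref{sec:proceduraldrawbacks}.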

\subsection{Testing Backprop with Numerical Differentiation}
\label{sec:finitediff}

Numerical differentiation provides a convenient method for testing
gradients computed by backpropagation. The \emph{centered}
finite difference approximation is:
\begin{align}
\frac{\partial}{\partial \theta_i} J(\vc{\theta}) \approx \frac{\left( J(\vc{\theta} + \epsilon \cdot \vc{d}_i) - J(\vc{\theta} - \epsilon \cdot \vc{d}_i) \right)}{2\epsilon}
\end{align}
where $\vc{d}_i$ is a 1-hot vector whose entries are all zero except
the $i$th, which has value 1. Unfortunately, in
practice, it suffers from issues of floating point
precision. Therefore, it is typically only appropriate to use this on
small examples with an appropriately chosen $\epsilon$.

In order to apply this technique to test the gradients of your backpropagation implementation, you will need to ensure that your code is appropriately factored. Any of the modules including \textproc{NNForward} and \textproc{NNBackward} could be tested in this way.
%

For example, you could use two functions: \lstinline{forward(x,y,theta)}, which computes the cross-entropy for a training example, and \lstinline{backprop(x,y,theta)}, which computes the gradient of the cross-entropy for a training example via backpropagation. Then \lstinline{finite_diff}, defined below, approximates the gradient by the centered finite difference method. The following pseudocode provides an overview of the entire procedure.

\begin{lstlisting}[language=Python]
def finite_diff(x, y, theta):
    epsilon = 1e-5
    grad = zero_vector(theta.length)
    for m in [1, ..., theta.length]:
        d = zero_vector(theta.length)
        d[m] = 1
        v = forward(x, y, theta + epsilon * d)
        v -= forward(x, y, theta - epsilon * d)
        v /= 2 * epsilon
        grad[m] = v
    return grad

# Compute the gradient by backpropagation
grad_bp = backprop(x, y, theta)
# Approximate the gradient by the centered finite difference method
grad_fd = finite_diff(x, y, theta)

# Check that the gradients are (nearly) the same
diff = grad_bp - grad_fd  # element-wise difference of two vectors
print(l2_norm(diff))  # this value should be small (e.g. < 1e-7)
\end{lstlisting}

\subsubsection{Limitations}

This does \emph{not} catch all bugs---the only thing it tells you is whether your backpropagation implementation is correctly computing the gradient for the forward computation. Suppose your \emph{forward} computation is incorrect, e.g. you are always computing the cross-entropy of the wrong label. If your \emph{backpropagation} is also using the same wrong label, then the check above will not expose the bug. Thus, you always want to \emph{separately} test that your forward implementation is correct.

\subsubsection{Finite Difference Checking of Modules}

Note that the above would test the gradient for the entire end-to-end computation carried out by the neural network. However, if you implement a module-based automatic differentiation method (as in Section \ref{sec:modulebased}), then you can test each individual component for correctness. The only difference is that you need to run the finite-difference check for each of the output values (i.e. a double for-loop).

\end{document}