Lesson 1 - TMB through RTMB
22 November 2023

\[ y_{i}=f(\underline{\theta}, \underline{X})+\varepsilon_{i}\\ L_{i}=L_{\infty}\left(1-e^{-K\left(a_{i}-t_{0}\right)}\right)+\varepsilon_{i} \]

Data: T. Brenden unpublished. Photo: E. Engbretson, USFWS https://commons.wikimedia.org/w/index.php?curid=3720748
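As a quick illustration of the von Bertalanffy mean function above, here is a minimal R sketch that computes and plots predicted length at age; the parameter values and ages are made up for illustration, not taken from the walleye data.

```r
# Minimal sketch: predicted length at age from the von Bertalanffy equation.
# Parameter values and ages are illustrative only, not from the walleye data.
vonB <- function(age, Linf, K, t0) {
  Linf * (1 - exp(-K * (age - t0)))
}

ages <- 1:10
pred <- vonB(ages, Linf = 600, K = 0.3, t0 = -0.5)
plot(ages, pred, type = "b", xlab = "Age", ylab = "Predicted length (mm)")
```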
\[ \begin{array}{l}
\log N_{4, y}=\log N_{4, y-1}+\varepsilon_{y}^{(R)}; \quad \varepsilon_{y}^{(R)} \sim N\left(0, \sigma_{R}^{2}\right) \\
\log N_{a, 1986}=\log N_{4,1986-(a-4)}-\sum_{\tilde{a}=4}^{a-1} \bar{Z}_{\tilde{a}}, \quad 4<a \leq 9 \\
\log N_{a, 1986}=0, \quad a>9 \\
\log N_{a, y}=\log N_{a-1, y-1}-Z_{a-1, y-1}, \quad 4 \leq a<A \\
\log N_{A, y}=\log \left(N_{A-1, y-1} e^{-Z_{A-1, y-1}}+N_{A, y-1} e^{-Z_{A, y-1}}\right) \\
Z_{a, y}=M+\sum_{G=g, t} F_{a, y, G}
\end{array} \]
\[ \begin{array}{l}
F_{a, y, G}=q_{a, y, G} E_{y, G}, \quad G=g, t \\
\log q_{y, G}=\log q_{y-1, G}+\varepsilon_{y}^{(G)}; \quad \varepsilon_{y}^{(G)} \sim N\left(0, \boldsymbol{\Sigma}_{G}\right), \quad G=g, t \\
\boldsymbol{\Sigma}_{a, \tilde{a}}=\rho^{|a-\tilde{a}|} \sigma_{a} \sigma_{\tilde{a}}, \quad 4<a \leq A, \; 4<\tilde{a} \leq A \\
B_{y}^{(\text{spawn})}=\sum_{a=4}^{A} m_{a, y} W_{a, y}^{(\text{spawn})} N_{a, y} \\
C_{a, y, G}=\frac{F_{a, y, G}}{Z_{a, y}} N_{a, y}\left(1-\exp \left(-Z_{a, y}\right)\right)
\end{array} \]
Whole books have been written about the definition and meaning of probability.
I follow a frequentist definition for intuition, while recognizing that there is some logic to Bayesian claims of degree of belief
Frequentist definition: the long-run proportion of times an event occurs under identical conditions
Statisticians sometimes distinguish outcomes from events. Outcomes are really elementary events: an event might be catching a fish whose length falls in the 7 to 8 inch bin, while an outcome would be catching a fish and measuring its exact length.
The sum of probabilities over all possible mutually exclusive events is 1.0 (so something will happen)
Probability of any given event is \(\geq\) 0 (and \(\leq\) 1)
The probability of the union of mutually exclusive events is the sum of their separate probabilities
if A and B independent \(P(A \cap B) = P(A)P(B)\)
\[ P(A \mid B)=\frac{P(A \cap B)}{P(B)} \]
“|” is read as “given” or “conditional on”: the probability of A given B
Conditional probabilities recognize that the occurrence of event B can provide information on whether event A will occur
Convince yourself that \(P(A \mid B)=P(A)\) if A and B independent
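A one-line derivation, combining the definition of conditional probability above with the product rule for independent events:

\[ P(A \mid B)=\frac{P(A \cap B)}{P(B)}=\frac{P(A)\,P(B)}{P(B)}=P(A) \]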
stats.stackexchange.com/questions/587109
The technical definition is that random variables are functions that map a probability space of events/outcomes to numeric values.
Less technically (but still techno-speak!), they describe the numeric outcome of a random process; i.e., a random variable is not a number (or a vector/matrix of numbers) but rather the process that produces one.
A random variate is a particular numeric outcome
Textbooks usually use capital letters for random variables and lower case letters for random variates.
\[ \text{suppose } p=\operatorname{Pr}(Y=1) \\ \operatorname{Pr}(Y=1)+\operatorname{Pr}(Y=0)=1 \\ \operatorname{Pr}(Y=0)=1-p \]
\[ \operatorname{Pr}(Y=y)=p^{y}(1-p)^{1-y} \]
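A quick check of this pmf in R (the value of \(p\) is an arbitrary example, not from the lesson): a Bernoulli distribution is a binomial with size 1, so `dbinom()` reproduces \(p^{y}(1-p)^{1-y}\).

```r
# Check of the Bernoulli pmf p^y (1 - p)^(1 - y) against R's dbinom()
# (a Bernoulli is a binomial with size = 1); p = 0.3 is an arbitrary example.
p <- 0.3
y <- c(0, 1)
cbind(by_hand = p^y * (1 - p)^(1 - y),
      dbinom  = dbinom(y, size = 1, prob = p))
```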
The pmf (or pdf) is a function that calculates the probability (or probability density) given the random variate (the \(y\) value) and the parameter(s) (here \(p\))
\(y\) is the observed datum
If this were a continuously distributed random variable we would use a probability density function (pdf)
\[ f(y \mid \theta) \]

In general I will provide the pmf (or pdf) expressed as a function of \(y\) and the parameters of the distribution.
For example, will use \(y\sim \mathrm{N}(\mu, \sigma^2)\) to indicate a random variable \(Y\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\)
In general regular font for scalars, bold for vectors and matrices
\[ \operatorname{Normal}(y \mid \mu, \sigma^2)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2}\right). \]
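A sketch confirming that R's `dnorm()` matches the formula above; the values of \(\mu\), \(\sigma\), and \(y\) are chosen arbitrarily.

```r
# Normal density evaluated "by hand" versus dnorm(); mu, sigma, and y are
# arbitrary example values.
mu <- 10; sigma <- 2; y <- 8.5
by_hand <- 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((y - mu) / sigma)^2)
c(by_hand = by_hand, dnorm = dnorm(y, mean = mu, sd = sigma))
```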
Discrete means the set of possible outcomes is countable with each possible value having an associated probability (calculable from the pmf).
Continuous means not countable (generally this means there are infinitely many possible values between any two other possible values). Pr(\(Y=y\)) for any particular \(y\) is 0, so we use a probability density function.
Intuition/common sense sometimes used to choose between the two. E.g., catch or CPUE often modeled as continuous
Pr(\(X=c_1\)) = Pr(\(X=c_2\)) = 0!
Area under the pdf function gives probability for interval
Pr(\(c_1<X<c_2\)) = Pr(\(X<c_2\)) - Pr(\(X<c_1\))
\(F(x)=\Pr(X \leq x)\) is the cumulative distribution function (CDF)
For continuous variables, the derivative of F(x) with respect to x is f(x) (the density)
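A sketch of these relationships in R for a standard normal random variable (the interval endpoints are arbitrary): the interval probability is a difference of CDF values, and integrating the density over the interval gives the same number.

```r
# Interval probability for X ~ N(0, 1): difference of CDF values (pnorm)
# versus integrating the density (dnorm). Endpoints are arbitrary examples.
c1 <- -1; c2 <- 1.5
pnorm(c2) - pnorm(c1)                           # Pr(c1 < X < c2) via the CDF
integrate(dnorm, lower = c1, upper = c2)$value  # same thing by integrating f(x)
```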
A vector of observed values: the elements may have the same pdf/pmf or different ones, and the corresponding random variables may or may not be independent. The joint density is \(f(x_1,x_2,\ldots,x_k)=f(\mathbf{x})\)
Special case of each element representing an independent random variable: \(f(\mathbf{x})=f_{X_1}(x_1)f_{X_2}(x_2)...f_{X_k}(x_k)\)
Special special case of independent random variables from the same distributional family (identically distributed, iid, when the parameters are also equal): \(f(\mathbf{x})=f(x_1 \mid\theta_1)f(x_2 \mid \theta_2)\cdots f(x_k \mid \theta_k)=\prod_{i=1}^{k} f\left(x_{i} \mid \theta_{i}\right)\)
These special cases very important for practical MLE work!
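For example (a sketch with simulated data), the joint density of an independent normal sample is the product of the individual densities, which is why log densities are summed in practice:

```r
# Joint density of an iid normal sample = product of the individual densities;
# equivalently, sum the log densities. Data are simulated for illustration.
set.seed(1)
x <- rnorm(6, mean = 5, sd = 1)
prod(dnorm(x, mean = 5, sd = 1))                  # product of densities
exp(sum(dnorm(x, mean = 5, sd = 1, log = TRUE)))  # same number via the log scale
```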
No new math!!!
The likelihood function is just the joint pdf re-expressed as a function of the parameters: \(L(\boldsymbol{\theta} \mid \mathbf{x}) = f(\mathbf{x} \mid \boldsymbol{\theta})\)
Adjust \(\boldsymbol{\theta}\) until \(L(\boldsymbol{\theta} \mid \mathbf{x})\) is maximized
The rest is “just” details :->
Perhaps obviously, adjusting the parameters to maximize the log of the likelihood function also maximizes the likelihood (the log is a monotonic transformation).
RTMB, like most software, minimizes the negative log-likelihood (NLL) rather than maximizing the log-likelihood (a convention)
Working on the log-scale improves numerical performance.
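A minimal sketch of a negative log-likelihood in R for the simplest case, an iid normal sample with unknown mean and standard deviation; the simulated data and parameter names are mine, not from the lesson.

```r
# Negative log-likelihood for an iid normal sample with unknown mean and
# standard deviation; the data are simulated and the parameter names are mine.
set.seed(123)
y <- rnorm(20, mean = 5, sd = 2)

nll <- function(pars) {              # pars = c(mu, log_sigma)
  mu    <- pars[1]
  sigma <- exp(pars[2])              # log scale keeps sigma positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

nll(c(5, log(2)))                    # NLL evaluated at the true values
```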
1. Specify \(\alpha\), \(\beta\), and \(\sigma^2\)
2. Calculate \(\mu_i\)
3. Calculate the NLL
4. Search over different values of \(\alpha\), \(\beta\), and \(\sigma^2\) and repeat 1-3 until you find the values that minimize the NLL (see the sketch after the model equation below)
\[ \begin{array}{l}y_{i}=\mu_{i}+\epsilon_{i}=\alpha+\beta X_{i}+\epsilon_{i} \\\epsilon_{i} \stackrel{i i d}{\sim} N\left(0, \sigma^{2}\right)\end{array} \]
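Putting steps 1-4 together for the regression model above, here is a hedged sketch using simulated data and base R's `optim()`; the data, starting values, and parameter names are illustrative, not from the lesson.

```r
# Steps 1-4 for the linear regression NLL, using simulated data and base R's
# optim(); data, starting values, and parameter names are illustrative.
set.seed(42)
X <- runif(50, 0, 10)
y <- 2 + 0.5 * X + rnorm(50, sd = 1)     # true alpha = 2, beta = 0.5, sigma = 1

nll <- function(pars) {                  # pars = c(alpha, beta, log_sigma)
  alpha <- pars[1]; beta <- pars[2]; sigma <- exp(pars[3])
  mu <- alpha + beta * X                 # step 2: calculate mu_i
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))  # step 3: the NLL
}

fit <- optim(c(0, 0, 0), nll)            # step 4: search for the minimum
fit$par                                  # estimates of alpha, beta, log(sigma)
```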
Analytical solution (involves derivatives)
Grid search
Iterative searches
Non-derivative methods
Derivative methods (such as quasi-Newton)
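Since the lesson heads toward RTMB, here is a hedged sketch of how a derivative-based (quasi-Newton) search typically looks there: `MakeADFun()` supplies exact gradients by automatic differentiation and `nlminb()` uses them. The simulated data and parameter names are my own; consult the RTMB documentation for the authoritative interface.

```r
# Sketch: derivative-based (quasi-Newton) minimization of the regression NLL,
# with the gradient supplied by RTMB's automatic differentiation.
library(RTMB)

set.seed(42)
X <- runif(50, 0, 10)
y <- 2 + 0.5 * X + rnorm(50, sd = 1)

f <- function(parms) {                   # NLL written as a function of the parameters
  mu <- parms$alpha + parms$beta * X
  -sum(dnorm(y, mean = mu, sd = exp(parms$log_sigma), log = TRUE))
}

obj <- MakeADFun(f, parameters = list(alpha = 0, beta = 0, log_sigma = 0))
opt <- nlminb(obj$par, obj$fn, obj$gr)   # quasi-Newton search using exact gradients
opt$par
```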
NLL as function of single parameter with derivatives
Derivatives of NLL with respect to parameters zero at minimum
Second derivatives of NLL with respect to parameters are positive at minimum
For the sample of five observations we used before, find the MLE of the mean by conducting a grid search, assuming the variance is known and equal to 2
Time permitting, find MLE estimates of both the mean and the variance at the same time by grid search
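A hedged sketch of the first exercise: the five values below are made-up placeholders, not the sample from class, so substitute the real observations. With \(\sigma^2\) known and equal to 2, the NLL is evaluated over a grid of candidate means and the minimizer is reported.

```r
# Grid search for the MLE of the mean with the variance known and equal to 2.
# The data vector is a made-up placeholder; substitute the five observations
# used in class.
x <- c(4.1, 5.3, 3.8, 6.0, 5.0)
sigma <- sqrt(2)

mu_grid <- seq(0, 10, by = 0.01)
nll <- sapply(mu_grid, function(mu) -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE)))
mu_grid[which.min(nll)]        # grid-search MLE; should be very close to mean(x)
```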
Terminology: the ML estimator is the rule/formula (itself a random variable); an ML estimate is the numeric value it produces for a particular data set
The ideal estimator has the lowest variance among unbiased estimators
MLE is not guaranteed to achieve this!
MLEs are consistent, meaning estimates become closer to the correct values as sample sizes increase
Errors get smaller with more data
MLE for variance of normal random sample: \(\hat{\sigma}^{2}=\sum\left(x_{i}-\hat{\mu}\right)^{2} / k\)
Expected value: \(E(\hat{\sigma}^{2})=\frac{k-1}{k} \sigma^{2}\)
Standard (unbiased) estimator: \(\hat{\sigma}_{u}^{2}=\sum\left(x_{i}-\hat{\mu}\right)^{2} /(k-1)\)
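A short simulation sketch illustrating the bias factor \((k-1)/k\); the sample size, true variance, and number of replicates are arbitrary choices.

```r
# Simulation check of E(sigma_hat^2) = ((k - 1) / k) * sigma^2 for the MLE;
# the sample size, true variance, and number of replicates are arbitrary.
set.seed(1)
k <- 5; sigma2 <- 2; nsim <- 1e5
mle_var <- replicate(nsim, {
  x <- rnorm(k, mean = 0, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / k     # MLE divides by k, not k - 1
})
c(simulated_mean = mean(mle_var), expected = (k - 1) / k * sigma2)
```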