In circular statistics, the expectation value of a random variable $Z$ with values on the circle $S$ is defined as $$m_1(Z)=\int_S z P^Z(\theta)\textrm{d}\theta$$ (see Wikipedia). This is a very natural definition, as is the definition of the variance $$\mathrm{Var}(Z)=1-|m_1(Z)|.$$ So we didn't need a second moment in order to define the variance! Nonetheless, we define the higher moments $$m_n(Z)=\int_S z^n P^Z(\theta)\textrm{d}\theta.$$ I admit that this looks rather natural as well at first sight, and very similar to the definition in linear statis...
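
As a rough illustration of the definitions above, here is a minimal sketch that estimates $m_n(Z)$ and $\mathrm{Var}(Z)=1-|m_1(Z)|$ from a sample of angles, assuming $z=e^{i\theta}$. The function names and the sample values are hypothetical and chosen only for this example.

```python
import cmath
import math

def trigonometric_moment(angles, n=1):
    """Sample estimate of m_n(Z) = E[z^n], assuming z = exp(i * theta)."""
    return sum(cmath.exp(1j * n * theta) for theta in angles) / len(angles)

def circular_variance(angles):
    """Var(Z) = 1 - |m_1(Z)|, following the definition quoted above."""
    return 1.0 - abs(trigonometric_moment(angles, 1))

# Hypothetical sample: angles in radians, clustered symmetrically around 0.
angles = [-0.2, -0.1, 0.0, 0.1, 0.2]
m1 = trigonometric_moment(angles)
print("mean direction:", cmath.phase(m1))
print("circular variance:", circular_variance(angles))
print("|m_2|:", abs(trigonometric_moment(angles, 2)))
```

A tightly clustered sample gives a variance near 0, while angles spread evenly around the circle give a variance near 1, with no second moment involved.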

Can someone provide an intuition for why the higher moments of a probability distribution $p(x)$, like the third and fourth moments, correspond to skewness and kurtosis, respectively? Specifically, why does the deviation about the mean raised to the 3rd or 4th power end up translating into a measure of skewness and kurtosis? Is there a way to relate this to the third or fourth derivatives of the function? Consider this definition of kurtosis: $\mathrm{Kurtosis}(X) = E[(X - \mu_{X})^4] / \sigma^4$. Again, it is not clear why raising $(x-\mu)$ to the fourth power gives "peakedness" or wh...
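
One way to build intuition is to compute the standardized moments on small samples. The sketch below is only an illustration, not a derivation, and the helper name and data are hypothetical. An odd power keeps the sign of each deviation, so deviations on the two sides cancel only when the sample is symmetric. The fourth power is always positive and is dominated by the largest deviations.

```python
def standardized_moment(xs, k):
    """k-th central moment divided by sigma^k (population form, dividing by n)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return sum((x - mu) ** k for x in xs) / n / var ** (k / 2)

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]    # hypothetical symmetric sample
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]  # hypothetical sample with one large value

print("skewness, symmetric:   ", standardized_moment(symmetric, 3))
print("skewness, right-skewed:", standardized_moment(right_skewed, 3))
print("kurtosis, symmetric:   ", standardized_moment(symmetric, 4))
print("kurtosis, right-skewed:", standardized_moment(right_skewed, 4))
```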

The heaviest-tailed smooth, normalizable, continuous distributions that I am familiar with are those with fat power-law tails $\frac{1}{x^{1+\alpha}}$, e.g. a Pareto with $\alpha\rightarrow 0^+$ or a Student's t with $\nu\rightarrow 0^+$, but are there distributions with heavier tails? I am curious about the worst possible case for a distribution that decreases monotonically away from a peak positive value towards a minimum of 0. I think that the heaviest possible normalizable heavy tails are indeed those asymptotic to $\frac{k}{x}$ as $x...

I am trying to understand how I should approach the problem of a Taylor approximation to the expectation of the ratio of two random variables. In my particular problem I am concerned with the following ratio, estimated using a sample of size $n$: $$\hat{\gamma}_i=\frac{x_i\sum_{j=1}^{n} y_j}{\sum_{j=1}^{n} x_j}=\frac{x_i\bar{y}}{\bar{x}}$$ We may assume for simplicity $E(x_i)=\mu_x$ and $E(y_i)=\mu_y$, but we may have $E(x_iy_i) \ne E(x_i)E(y_i)$. I am trying to find $E(\hat{\gamma}_i)$. How should I approach this problem?...

The likelihood could be defined in several ways, for instance:

- the function $L$ from $\Theta\times{\cal X}$ which maps $(\theta,x)$ to $L(\theta \mid x)$, i.e. $L:\Theta\times{\cal X} \rightarrow \mathbb{R}$
- the random function $L(\cdot \mid X)$
- we could also consider that the likelihood is only the "observed" likelihood $L(\cdot \mid x^{\text{obs}})$
- in practice the likelihood brings information on $\theta$ only up to a multiplicative constant, hence we could consider the likelihood as an equivalence class of functions rather than a function

Anot...

I know $E(aX+b) = aE(X)+b$ with $a,b$ constants, so given $E(X)$, it's easy to solve. I also know that you can't apply that when it's a nonlinear function, as in the case $E(1/X) \neq 1/E(X)$, and that in order to solve that, I have to use a Taylor approximation. So my question is: how do I solve $E(\ln(1+X))$? Do I also approximate with a Taylor expansion?...
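
A minimal sketch of the second-order Taylor approach, assuming $X > -1$ with mean $\mu$ and variance $\sigma^2$: expanding $g(x)=\ln(1+x)$ around $\mu$ gives $E[\ln(1+X)] \approx \ln(1+\mu) - \frac{\sigma^2}{2(1+\mu)^2}$. The three-point distribution below is hypothetical and is used only so that the exact expectation can be computed for comparison.

```python
import math

# Hypothetical discrete distribution for X: values and their probabilities.
values = [0.8, 1.0, 1.2]
probs = [0.25, 0.5, 0.25]

mu = sum(p * x for x, p in zip(values, probs))
var = sum(p * (x - mu) ** 2 for x, p in zip(values, probs))

# Exact expectation of ln(1 + X) for this discrete distribution.
exact = sum(p * math.log(1.0 + x) for x, p in zip(values, probs))

# Second-order Taylor approximation around mu, using g''(mu) = -1 / (1 + mu)^2.
taylor = math.log(1.0 + mu) - var / (2.0 * (1.0 + mu) ** 2)

# Naive plug-in ln(1 + E[X]) for comparison.
naive = math.log(1.0 + mu)

print("exact :", exact)
print("taylor:", taylor)
print("naive :", naive)
```

The approximation is only as good as the expansion: it tends to work when $X$ is concentrated near $\mu$ and can be poor when $X$ has heavy tails or puts mass near $-1$.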

I am doing a master's in statistics and I have been advised to learn differential geometry. I would be happier to hear about statistical applications of differential geometry, since this would motivate me. Does anyone happen to know of applications of differential geometry in statistics?...

What is the distribution of $\mathrm{tr}(AA'BB')$ where $A$ and $B$ are two random matrices of size $d \times k$ with orthonormal columns? Maybe the expected value is easier to compute? A fallback solution would be to use a simulation. What would be the most effective scheme? Typical values for $d$ would be around 2000, while $k$ ranges from ~10 to a few hundred. Below is a more detailed account of my problem and its context, how I ended up asking this question, and what I tried.

Context

I want to check if the principal components computed from a s...
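
A small simulation sketch of the fallback idea, under assumptions that the excerpt does not state: $A$ and $B$ are taken to be independent and uniformly distributed over orthonormal $k$-frames, generated here by Gram-Schmidt on Gaussian vectors. It relies on the identity $\mathrm{tr}(AA'BB') = \|A'B\|_F^2$. Under this uniformity assumption $E[BB'] = \frac{k}{d} I$, so the mean should be close to $k^2/d$. The dimensions are deliberately tiny so that the standard library suffices, and all names are hypothetical.

```python
import random

def random_orthonormal_columns(d, k, rng):
    """Return k orthonormal vectors in R^d via Gram-Schmidt on Gaussian vectors."""
    columns = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in columns:
            proj = sum(a * b for a, b in zip(u, v))
            v = [b - proj * a for a, b in zip(u, v)]
        norm = sum(b * b for b in v) ** 0.5
        columns.append([b / norm for b in v])
    return columns

def trace_statistic(A, B):
    """tr(AA'BB') computed as the squared Frobenius norm of A'B."""
    return sum(sum(a * b for a, b in zip(u, w)) ** 2 for u in A for w in B)

rng = random.Random(0)
d, k, trials = 20, 3, 2000  # toy sizes, far smaller than d ~ 2000 in the question
draws = [
    trace_statistic(random_orthonormal_columns(d, k, rng),
                    random_orthonormal_columns(d, k, rng))
    for _ in range(trials)
]
simulated_mean = sum(draws) / trials
print("simulated mean:", simulated_mean, "k^2/d:", k * k / d)
```

For realistic sizes a numerical linear algebra library would be the practical choice. This sketch only shows the structure of the simulation.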

I'm fairly mathematically inclined (I had 6 semesters of math in my undergrad), though I'm a bit out of practice and slow with, say, partial differential equations and path integrals; my concepts come back with a bit of practice. I have not had a course on mathematical proofs (mathematical thinking) or one on analysis. I also understand graduate-level probability; I have studied it formally and refreshed my knowledge lately. I have also had a couple of graduate-level courses on statistics and statistical learning. I want to, out of personal interest, s...

I have recently been looking into canonical correlation analysis (CCA) as a way to map between different spaces. As I understand it, CCA maps data from both distinct spaces to a common (possibly lower dimensional) space where they can be compared. It works in a similar way to PCA, choosing the direction from each input space which maximises the correlation between datasets, subject to the chosen directions being uncorrelated. Now, the descriptions I've seen suggest that CCA can learn any linear transformation. However, I can't see how it's poss...

Let $A$ and $B$ be random variables and $f(A,B)=\frac{A}{B}$. How should I approximate $E(f(A,B))$? I think a Taylor expansion may be in order, but I am not sure how to apply it to this function. My question comes from a practical problem in survey statistics. It may be discussed in textbooks, but I would not know where. Let a sample of size $n$ be taken from an (infinite) population. Not every sample unit may reply to the survey. Let $S$ indicate response ($S=1$) or non-response ($S=0$). The mean estimator $\hat{\mu}=\frac{1}{\sum{S_i}}\sum{...
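
A minimal sketch of the usual second-order Taylor (delta-method) approximation, expanding $f(A,B)=A/B$ around $(\mu_A,\mu_B)$ and assuming $B$ stays away from 0: $E(A/B) \approx \frac{\mu_A}{\mu_B} - \frac{\mathrm{Cov}(A,B)}{\mu_B^2} + \frac{\mu_A \mathrm{Var}(B)}{\mu_B^3}$. The four equally likely $(A,B)$ pairs below are hypothetical and serve only to compare the approximation with the exact expectation.

```python
# Hypothetical joint distribution: four equally likely (A, B) pairs.
pairs = [(1.0, 9.0), (2.0, 10.0), (3.0, 11.0), (2.0, 10.0)]
n = len(pairs)

mu_a = sum(a for a, _ in pairs) / n
mu_b = sum(b for _, b in pairs) / n
var_b = sum((b - mu_b) ** 2 for _, b in pairs) / n
cov_ab = sum((a - mu_a) * (b - mu_b) for a, b in pairs) / n

# Exact expectation of the ratio for this finite distribution.
exact = sum(a / b for a, b in pairs) / n

# First-order (plug-in) and second-order Taylor approximations.
first_order = mu_a / mu_b
second_order = first_order - cov_ab / mu_b ** 2 + mu_a * var_b / mu_b ** 3

print("exact       :", exact)
print("first order :", first_order)
print("second order:", second_order)
```

The second-order terms show where the plug-in ratio goes wrong: it ignores both the variability of the denominator and its covariance with the numerator.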

I have been struggling quite a bit with reconciling my intuitive understanding of probability distributions with the weird properties that almost all topologies on probability distributions possess. For example, consider a mixture random variable $X_n$: pick a Gaussian centered at 0 with variance 1, and with probability $\frac{1}{n}$, add $n$ to the result. A sequence of such random variables would converge (weakly and in total variation) to a Gaussian centered at 0 with variance 1, but the mean of the $X_n$ is always $1$ and the variances conve...
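
To make the example concrete, here is a minimal sketch that writes $X_n = G + nB_n$, assuming $G \sim N(0,1)$ and $B_n \sim \mathrm{Bernoulli}(1/n)$ are independent (the independence is an assumption of this sketch). Under that assumption $E(X_n) = n \cdot \frac{1}{n} = 1$ and $\mathrm{Var}(X_n) = 1 + n^2 \cdot \frac{1}{n}\left(1-\frac{1}{n}\right) = n$. The function names are hypothetical.

```python
import random

def mixture_moments(n):
    """Exact mean and variance of X_n = G + n * B, with G ~ N(0, 1), B ~ Bernoulli(1/n)."""
    p = 1.0 / n
    mean = n * p
    variance = 1.0 + n ** 2 * p * (1.0 - p)
    return mean, variance

def sample_mixture(n, size, rng):
    """Draw `size` samples of X_n."""
    return [rng.gauss(0.0, 1.0) + (n if rng.random() < 1.0 / n else 0.0)
            for _ in range(size)]

rng = random.Random(0)
n = 10
samples = sample_mixture(n, 100_000, rng)
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)

print("exact  (mean, var):", mixture_moments(n))
print("sample (mean, var):", (sample_mean, sample_var))
```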

I've been working on building some test statistics based on the KL-Divergence, \begin{equation}D_{KL}(p \| q) = \sum_i p(i) \log\left(\frac{p(i)}{q(i)}\right),\end{equation} and I ended up with a value of $1.9$ for my distributions. Note that the distributions have support of $140$K levels, so I don't think plotting out the whole distributions would be reasonable here. What I'm wondering is, is it possible to have a KL-Divergence of greater than 1? A lot of the interpretations I've seen of KL-Divergence are based on an upper bound of 1. If it can ...
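
A minimal sketch of the formula above on a two-level example, using natural logarithms. The distributions `p` and `q` are hypothetical and were picked so that the two put most of their mass on different levels. For this pair the sum evaluates to $0.98\ln 99 \approx 4.5$ nats, so the formula itself imposes no upper bound of 1.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(i) * log(p(i) / q(i)), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # terms with p(i) = 0 contribute 0 by convention
        if qi == 0.0:
            return math.inf   # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

# Hypothetical distributions on two levels.
p = [0.99, 0.01]
q = [0.01, 0.99]

print("D_KL(p || q):", kl_divergence(p, q))
print("D_KL(p || p):", kl_divergence(p, p))
```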

Let $X_1, \ldots, X_n$ be i.i.d. exponentially distributed random variables with density $$\begin{cases}\theta^{-1} e^{-x/\theta}, & x \ge 0 \\ 0, & x \lt 0\end{cases}$$ and let $Y_i = X_{(i)}$ denote the order statistics such that $Y_1 \leq \cdots \leq Y_n$. How can I show that $$2\frac{\left(\sum_{i=1}^{r}Y_i\right) + (n-r)Y_r}{\theta}$$ has a chi-square distribution with $2r$ degrees of freedom? I wrote the joint density of $(Y_1,Y_2,\ldots,Y_r)$ but nothing became apparent....
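
A simulation can serve as a sanity check of the claim, though not as a proof. A chi-square distribution with $2r$ degrees of freedom has mean $2r$ and variance $4r$, and the sketch below compares these with the simulated statistic. The values of `n`, `r` and `theta` are hypothetical. Note that `random.expovariate` takes the rate $1/\theta$, not the mean.

```python
import random

def censored_statistic(n, r, theta, rng):
    """2 * (Y_1 + ... + Y_r + (n - r) * Y_r) / theta for one exponential sample."""
    ys = sorted(rng.expovariate(1.0 / theta) for _ in range(n))
    return 2.0 * (sum(ys[:r]) + (n - r) * ys[r - 1]) / theta

rng = random.Random(0)
n, r, theta, trials = 10, 4, 3.0, 20_000  # hypothetical settings
draws = [censored_statistic(n, r, theta, rng) for _ in range(trials)]

sim_mean = sum(draws) / trials
sim_var = sum((t - sim_mean) ** 2 for t in draws) / trials

print("simulated mean:", sim_mean, "expected:", 2 * r)
print("simulated var :", sim_var, "expected:", 4 * r)
```

One possible route to a proof is through the normalized spacings $(n-i+1)(Y_i - Y_{i-1})$ with $Y_0 = 0$, which sum over $i=1,\ldots,r$ to the numerator $\sum_{i=1}^{r} Y_i + (n-r)Y_r$.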

Three components are randomly sampled, one at a time, from a large lot. As each component is selected, it is tested. If it passes the test, a success (S) occurs; if it fails the test, a failure (F) occurs. Assume that 80% of the components in the lot will succeed in passing the test. Let X represent the number of successes among the three sampled components. What are the possible values for X, and their probabilities?...
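
A minimal sketch of the computation, assuming that the three tests are independent, which is a reasonable approximation when the lot is large. Under that assumption X is binomial with 3 trials and success probability 0.8, so it takes the values 0, 1, 2 and 3. The function name is hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n_trials, p_success = 3, 0.8
pmf = {k: binomial_pmf(k, n_trials, p_success) for k in range(n_trials + 1)}

for k, prob in pmf.items():
    print(f"P(X = {k}) = {prob:.3f}")
```

Under these assumptions the probabilities are 0.008, 0.096, 0.384 and 0.512 for X = 0, 1, 2 and 3, and they sum to 1.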