UDOPIA Doctoral Students Day: Machine learning for astrophysical inverse problems

Benjamin Remy

Supervisors:
François Lanusse & J.-L. Starck





slides at b-remy.github.io/talks/UDOPIA2021

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We measure the ellipticity $e$ of galaxies as a noisy estimator of the gravitational shear $\gamma$
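As a toy illustration of why this works, the sketch below (a minimal JAX example with made-up numbers, not survey data) draws ellipticities according to $e = \gamma + e_i$ and recovers the shear by averaging: the shape noise $e_i$ dominates any single galaxy but averages out over many.

```python
# Toy shear estimation: e = gamma + e_i with e_i ~ N(0, I); values are illustrative only.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
gamma_true = jnp.array([0.02, -0.01])              # hypothetical constant shear (2 components)

n_gal = 100_000
e_i = jax.random.normal(key, (n_gal, 2))           # intrinsic shape noise, much larger than gamma
e = gamma_true + e_i                               # observed ellipticities

gamma_hat = e.mean(axis=0)                         # average over galaxies to beat down e_i
print(gamma_hat)                                   # close to [0.02, -0.01], error ~ 1/sqrt(n_gal)
```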

A Physicist's approach: let's build a model

[Forward model diagram, Lanusse et al. (2020): morphology model $g_\theta$ $\longrightarrow$ PSF $\longrightarrow$ pixelation $\longrightarrow$ noise]
Probabilistic model
$$ x \sim ? $$
$$ x \sim \mathcal{N}(z, \Sigma) \quad z \sim ? $$
latent $z$ is a denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} z, \Sigma) \quad z \sim ?$$
latent $z$ is a super-resolved and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast z), \Sigma) \quad z \sim ? $$
latent $z$ is a deconvolved, super-resolved, and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast g_\theta(z)), \Sigma) \quad z \sim \mathcal{N}(0, \mathbf{I}) $$
latent $z$ is a Gaussian sample
$\theta$ are parameters of the model



$\Longrightarrow$ Decouples the morphology model from the observing conditions.
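A minimal sketch of this generative forward model, $x \sim \mathcal{N}(\mathbf{P} (\Pi \ast g_\theta(z)), \Sigma)$ with $z \sim \mathcal{N}(0, \mathbf{I})$. The decoder `g_theta`, the PSF, the pooling factor and the noise level below are all placeholders, not the actual model of Lanusse et al. (2020):

```python
# Sketch of x ~ N(P (Pi * g_theta(z)), Sigma) with z ~ N(0, I); every component is a placeholder.
import jax
import jax.numpy as jnp

def g_theta(z):
    # Hypothetical morphology decoder: maps a latent z to a 64x64 light profile.
    # Here just a smooth round blob whose size depends on z; stands in for the neural network.
    x, y = jnp.meshgrid(jnp.linspace(-1, 1, 64), jnp.linspace(-1, 1, 64))
    return jnp.exp(-(x**2 + y**2) / (0.1 + 0.05 * jnp.tanh(z[0]) ** 2))

def convolve_psf(img, psf):
    # Pi * img: convolution with the PSF via FFT (PSF on the same grid, centered at pixel [0, 0]).
    return jnp.real(jnp.fft.ifft2(jnp.fft.fft2(img) * jnp.fft.fft2(psf)))

def pixelate(img, factor=2):
    # P: average-pool onto the coarser instrument pixel grid.
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def sample_x(key, psf, sigma_noise=0.01, latent_dim=16):
    k1, k2 = jax.random.split(key)
    z = jax.random.normal(k1, (latent_dim,))                          # z ~ N(0, I)
    clean = pixelate(convolve_psf(g_theta(z), psf))                   # P (Pi * g_theta(z))
    return clean + sigma_noise * jax.random.normal(k2, clean.shape)   # diagonal Sigma

psf = jnp.zeros((64, 64)).at[0, 0].set(1.0)   # trivial delta PSF, just for the example
x = sample_x(jax.random.PRNGKey(0), psf)
```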

Bayesian Inference a.k.a. Uncertainty Quantification

The Bayesian view of the problem: $$ p(z | x ) \propto p_\theta(x | z, \Sigma, \mathbf{\Pi}) p(z)$$ where:
  • $p( z | x )$ is the posterior
  • $p( x | z )$ is the data likelihood, which contains the physics
  • $p( z )$ is the prior
[Figure: data $x_n$, truth $x_0$, posterior samples $g_\theta(z)$ and their reconvolved observations $\mathbf{P} (\Pi \ast g_\theta(z))$, posterior median, data residuals $x_n - \mathbf{P} (\Pi \ast g_\theta(z))$, and pixel-wise standard deviation]
$\Longrightarrow$ Uncertainties are fully captured by the posterior.
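Concretely, the samplers behind such a figure only ever need the unnormalized log-posterior over $z$ (and its gradient). A minimal sketch, with `forward` a placeholder for $\mathbf{P} (\Pi \ast g_\theta(\cdot))$ and a diagonal noise covariance:

```python
# Unnormalized log p(z | x) = log N(x; forward(z), sigma^2 I) + log N(z; 0, I).
import jax
import jax.numpy as jnp

def forward(z):
    # Placeholder for P (Pi * g_theta(z)): here an arbitrary fixed linear map to a 32x32 image.
    A = jnp.ones((32 * 32, z.shape[0])) / z.shape[0]
    return (A @ z).reshape(32, 32)

def log_posterior(z, x, sigma_noise=0.01):
    log_likelihood = -0.5 * jnp.sum((x - forward(z)) ** 2) / sigma_noise**2  # data likelihood
    log_prior = -0.5 * jnp.sum(z**2)                                         # z ~ N(0, I)
    return log_likelihood + log_prior

# Gradient w.r.t. z, as used by MAP optimization or gradient-based samplers (HMC, MALA):
score = jax.grad(log_posterior)
```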

Thesis project:
Joint estimation of Galaxy Morphologies, PSF, and Cosmic Shear


Schneider et al. 2014
Build a hierarchical model, combining:
  • Physical models: instrumental response (optics)
  • Generative models: galaxy morphology modeled with a neural network (e.g. Normalizing Flow, score prior, etc.)

  • $\implies$ Joint estimation of physical parameters and neural network weights (see the sketch below)


Inference methods:

  • Markov Chain Monte Carlo
  • Variational Inference
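Putting the two ingredients together, a heavily hedged sketch of what such a hierarchical log-joint could look like. Everything below (the toy renderer, the shear parametrization, the priors) is a hypothetical placeholder, not the actual thesis model:

```python
# Hypothetical hierarchical log-joint for jointly inferring latent morphologies z_i,
# a cosmic shear gamma, a PSF parameter, and decoder weights theta. All placeholders.
import jax
import jax.numpy as jnp

def render(z_i, gamma, psf_width, theta):
    # Toy stand-in for decode -> shear -> PSF -> pixelate: a blob on sheared coordinates.
    x, y = jnp.meshgrid(jnp.linspace(-1, 1, 32), jnp.linspace(-1, 1, 32))
    x_s = (1 + gamma[0]) * x + gamma[1] * y
    y_s = gamma[1] * x + (1 - gamma[0]) * y
    width = 0.1 + 0.05 * jnp.tanh(z_i @ theta) ** 2 + psf_width**2
    return jnp.exp(-(x_s**2 + y_s**2) / width)

def log_joint(params, images, sigma_noise=0.01):
    z, gamma, psf_width, theta = params["z"], params["gamma"], params["psf"], params["theta"]
    model = jax.vmap(lambda z_i: render(z_i, gamma, psf_width, theta))(z)
    log_like = -0.5 * jnp.sum((images - model) ** 2) / sigma_noise**2
    log_prior = (-0.5 * jnp.sum(z**2)                    # z_i ~ N(0, I)
                 - 0.5 * jnp.sum(gamma**2) / 0.05**2     # broad prior on the shear
                 - 0.5 * psf_width**2                    # broad prior on the PSF parameter
                 - 0.5 * jnp.sum(theta**2))              # prior on the decoder weights
    return log_like + log_prior

# MCMC or variational inference then targets log_joint over all unknowns at once.
```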

The Weak Lensing Mass-Mapping as an Inverse Problem

[Figure: shear field $\gamma$ and convergence map $\kappa$]
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
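On the flat sky, $\mathbf{P}$ is diagonal in Fourier space (the Kaiser-Squires kernel), so applying it is an element-wise multiplication. A minimal sketch (grid size and conventions are illustrative):

```python
# gamma = P kappa on the flat sky: in Fourier space,
# gamma_hat = ((k1^2 - k2^2) + 2i k1 k2) / (k1^2 + k2^2) * kappa_hat.
import jax.numpy as jnp

def shear_from_convergence(kappa):
    n = kappa.shape[0]
    k1, k2 = jnp.meshgrid(jnp.fft.fftfreq(n), jnp.fft.fftfreq(n), indexing="ij")
    k_sq = k1**2 + k2**2
    safe = jnp.where(k_sq > 0, k_sq, 1.0)                       # avoid dividing by zero at k = 0
    kernel = jnp.where(k_sq > 0, ((k1**2 - k2**2) + 2j * k1 * k2) / safe, 0.0)
    gamma = jnp.fft.ifft2(kernel * jnp.fft.fft2(kappa))
    return jnp.real(gamma), jnp.imag(gamma)                     # gamma_1, gamma_2
```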

Illustration on the Dark Energy Survey (DES) Y3

Jeffrey et al. (2021)

Linear inverse problems

$\boxed{\gamma = \mathbf{A}\kappa + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When $\mathbf{A}$ is non-invertible or ill-conditioned, the inverse problem is ill-posed and has no unique solution $\kappa$.


The Bayesian view of the problem:
$$ p(\kappa | \gamma) \propto p(\gamma | \kappa) \ p(\kappa) $$
  • $p(\gamma | \kappa)$ is the data likelihood, which contains the physics

  • $p(\kappa)$ is the prior knowledge on the solution.


We can, for instance, estimate the Maximum A Posteriori (MAP) solution:
$$\hat{\kappa} = \arg\max\limits_\kappa \ \log p(\gamma \ | \ \kappa) + \log p(\kappa) = \arg\max\limits_\kappa \ - \frac{1}{2} \parallel \gamma - \mathbf{A} \kappa \parallel_{\mathbf{\Sigma}}^2 + \log p(\kappa)$$
Or estimate from the full posterior $p(\kappa|\gamma)$ with MCMC or Variational Inference methods.
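For instance, a minimal sketch of the MAP estimate by gradient ascent, with an isotropic Gaussian prior standing in for $\log p(\kappa)$ (the operator `A`, step size and noise levels are placeholders):

```python
# MAP estimate: gradient ascent on log p(gamma | kappa) + log p(kappa),
# with a simple Gaussian prior as a stand-in for the (learned) prior term.
import jax
import jax.numpy as jnp

def map_estimate(gamma, A, sigma_noise=0.1, sigma_prior=1.0, tau=1e-2, n_steps=500):
    def log_posterior(kappa):
        log_like = -0.5 * jnp.sum((gamma - A @ kappa) ** 2) / sigma_noise**2
        log_prior = -0.5 * jnp.sum(kappa**2) / sigma_prior**2   # placeholder prior
        return log_like + log_prior

    grad_fn = jax.grad(log_posterior)
    kappa = jnp.zeros(A.shape[1])
    for _ in range(n_steps):
        kappa = kappa + tau * grad_fn(kappa)                    # ascend the log-posterior
    return kappa
```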


$\boxed{\log p(\kappa)}$ ??

But what about learning the prior
with deep generative models?

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



The score is all you need!


  • Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $$\color{orange}{\frac{d \log p(x \,|\, y)}{dx}}$$
    • Gradient ascent (on the log-posterior): $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) $
    • Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} n_t$



  • The score of the full posterior is simply: $$\nabla_\kappa \log p(\kappa |\gamma) = \underbrace{\nabla_\kappa \log p(\gamma |\kappa)}_{\mbox{known}} \quad + \quad \underbrace{\nabla_\kappa \log p(\kappa)}_{\mbox{can be learned}}$$ $\Longrightarrow$ all we have to do is model/learn the score of the prior.
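A minimal sketch of the corresponding Langevin sampler, adding the known likelihood score to a learned prior score (`prior_score` is a placeholder for the trained network, `A` and all numbers are illustrative):

```python
# Unadjusted Langevin sampling with the posterior score
# grad log p(kappa | gamma) = grad log p(gamma | kappa) + grad log p(kappa).
import jax
import jax.numpy as jnp

def likelihood_score(kappa, gamma, A, sigma_noise=0.1):
    # Known physics: gradient of -0.5 ||gamma - A kappa||^2 / sigma^2 w.r.t. kappa.
    return A.T @ (gamma - A @ kappa) / sigma_noise**2

def langevin_sample(key, gamma, A, prior_score, tau=1e-3, n_steps=1000):
    # prior_score(kappa) stands in for the learned grad log p(kappa).
    kappa = jnp.zeros(A.shape[1])
    for _ in range(n_steps):
        key, subkey = jax.random.split(key)
        score = likelihood_score(kappa, gamma, A) + prior_score(kappa)
        kappa = kappa + tau * score + jnp.sqrt(2 * tau) * jax.random.normal(subkey, kappa.shape)
    return kappa
```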

Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \color{orange}{\nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}}$$
$$\Big( \boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) - \boldsymbol{x}' \Big)~/~\sigma^2 = \color{orange}{\nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
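A minimal sketch of this training objective, with an arbitrary placeholder in place of a real denoising network; after training, the score estimate is read off as $(r_\theta(x', \sigma) - x') / \sigma^2$:

```python
# Denoising score matching: corrupt x with Gaussian noise of scale sigma, train a
# denoiser r_theta under an l2 loss, then (r_theta(x', sigma) - x') / sigma^2
# estimates grad_x log p_sigma(x'). The "network" below is a toy placeholder.
import jax
import jax.numpy as jnp

def denoiser(theta, x_noisy, sigma):
    # Placeholder noise-conditional denoiser: a single linear layer plus a sigma-scaled bias.
    return x_noisy @ theta["W"] + sigma * theta["b"]

def dsm_loss(theta, key, x_clean, sigma):
    x_noisy = x_clean + sigma * jax.random.normal(key, x_clean.shape)   # x' = x + u
    return jnp.mean((x_clean - denoiser(theta, x_noisy, sigma)) ** 2)   # l2 denoising loss

def score_estimate(theta, x_noisy, sigma):
    return (denoiser(theta, x_noisy, sigma) - x_noisy) / sigma**2

loss_grad = jax.grad(dsm_loss)   # gradients w.r.t. theta, to be fed to any SGD optimizer
```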

Efficient sampling by Annealed HMC

  • Even with gradients, sampling in high dimension is difficult, because of:
    • Curse of dimensionality
    • Highly correlated chains
  • $\Longrightarrow$ Use a parallel annealing strategy to effectively sample from the full distribution.
  • We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ learns the score of a noise-convolved distribution, $\nabla \log p_\sigma$, where $$p_\sigma(x) = \int p_{\mathrm{data}}(x')\mathcal{N}(x|x', \sigma^2)dx', ~~~~~~~~\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
  • Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample.
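The annealing loop can be sketched as follows; for simplicity the inner updates here are Langevin steps standing in for the HMC updates used in practice, and the score function, noise schedule and step sizes are placeholders:

```python
# Annealed sampling sketch: parallel chains, decreasing sigma, noise-conditional posterior
# score r(x, sigma) ~ grad log p_sigma(x | y). Langevin steps stand in for HMC here.
import jax
import jax.numpy as jnp

def annealed_sampling(key, posterior_score, x_init, sigmas=(1.0, 0.5, 0.1, 0.01),
                      n_steps=100, tau0=1e-2):
    # x_init: (n_chains, dim) starting points; posterior_score(x, sigma) -> same shape as x.
    x = x_init
    for sigma in sigmas:                    # sigma_1 > sigma_2 > ... annealed towards 0
        tau = tau0 * sigma**2               # shrink the step size with the noise level
        for _ in range(n_steps):
            key, subkey = jax.random.split(key)
            noise = jax.random.normal(subkey, x.shape)
            x = x + tau * posterior_score(x, sigma) + jnp.sqrt(2 * tau) * noise
    return x                                # last point of each chain is kept as one sample
```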

Illustration on $\kappa$-TNG simulations

$$\nabla_\kappa \log p_\sigma(\kappa |\gamma) = \nabla_\kappa \log p_\sigma(\gamma |\kappa) \quad + \quad \color{orange}{\nabla_\kappa \log p_\sigma(\kappa)}$$

Illustration on $\kappa$-TNG simulations


[Figure: true convergence map, traditional Kaiser-Squires reconstruction, Wiener filter, posterior mean (ours), and posterior samples]

Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. (2020)


$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$
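The forward operator is just a sampling mask applied to the 2D Fourier transform of the image. A minimal sketch (the mask and noise level are placeholders):

```python
# MRI forward model y = M F x + n: undersampled, noisy k-space measurements of an image x.
import jax
import jax.numpy as jnp

def mri_forward(key, x, mask, sigma_noise=0.01):
    k_re, k_im = jax.random.split(key)
    kspace = jnp.fft.fft2(x)                                   # F x
    noise = sigma_noise * (jax.random.normal(k_re, x.shape)
                           + 1j * jax.random.normal(k_im, x.shape))
    return mask * kspace + noise                               # M F x + n
```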



$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.