Probabilistic Mass Mapping with Neural Score Estimation

Benjamin Remy

With: Francois Lanusse, Niall Jeffrey, Jia Liu, J.-L. Starck, Ken Osato and Tim Schrabback



slides at b-remy.github.io/talks/Moriond2022

Data analysis and inverse problems

Many data analysis problems in cosmology are ill-posed inverse problems, e.g. when the measurements are mixed or incomplete because of survey masks, noise corruption, ...

Examples:
  • 21 cm foreground removal
  • Weak lensing mass-map reconstruction (shear + shape noise)

Classical examples of signal priors

Sparse $$ \log p(x) = - \parallel \mathbf{W} x \parallel_1 + cst $$
Gaussian $$ \log p(x) = - \frac{1}{2}\, x^t \mathbf{\Sigma}^{-1} x + cst $$
Total Variation $$ \log p(x) = - \parallel \nabla x \parallel_1 + cst $$
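For concreteness, here is a minimal sketch of these classical log-priors (up to additive constants); the wavelet transform `W`, the inverse covariance `Sigma_inv`, and the regularization weights are user-supplied placeholders, not part of the talk.

```python
# Minimal sketch of the classical log-priors above (up to additive constants).
# `W` (a wavelet transform) and `Sigma_inv` are assumed to be supplied by the user.
import jax.numpy as jnp

def log_prior_sparse(x, W, lam=1.0):
    # Sparsity in a transformed domain: log p(x) = -lam * ||W x||_1 + cst
    return -lam * jnp.sum(jnp.abs(W(x)))

def log_prior_gaussian(x, Sigma_inv):
    # Gaussian prior: log p(x) = -1/2 x^T Sigma^{-1} x + cst
    return -0.5 * x @ Sigma_inv @ x

def log_prior_tv(x, lam=1.0):
    # Total variation: log p(x) = -lam * ||grad x||_1 + cst
    gx, gy = jnp.gradient(x)
    return -lam * jnp.sum(jnp.abs(gx) + jnp.abs(gy))
```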

But what about learning the prior
with deep generative models?

Mass-maps

Jeffrey et al. (2021) DES Y3

  • 2pt functions and higher order statistics (3pt functions, Wavelet peak counts, $\ell_1$-norm, scattering transform, neural summaries, ...)
Peel et al. (2021) Abell 520 cluster

  • Search for localized features: peaks visible in mass maps but with no optical counterpart, which requires uncertainty quantification

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$

The Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
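As an illustration, here is a minimal flat-sky sketch of the forward operator $\mathbf{P}$ applied in Fourier space (the Kaiser-Squires kernel), assuming a square map with periodic boundaries and no mask; the function name and conventions are mine, not from the talk.

```python
# Minimal sketch: gamma = P kappa on the flat sky, applied in Fourier space.
# Assumes a square convergence map with periodic boundaries (no mask handling).
import jax.numpy as jnp

def forward_shear(kappa):
    """Map a convergence map kappa to the two shear components (gamma1, gamma2)."""
    n = kappa.shape[0]
    k1, k2 = jnp.meshgrid(jnp.fft.fftfreq(n), jnp.fft.fftfreq(n), indexing="ij")
    denom = jnp.where(k1**2 + k2**2 > 0, k1**2 + k2**2, 1.0)
    # Kaiser-Squires kernel (k1^2 - k2^2 + 2i k1 k2) / (k1^2 + k2^2);
    # the k = 0 mode stays at zero (mass-sheet degeneracy).
    p_hat = (k1**2 - k2**2 + 2j * k1 * k2) / denom
    gamma = jnp.fft.ifft2(p_hat * jnp.fft.fft2(kappa))
    return jnp.real(gamma), jnp.imag(gamma)  # gamma1, gamma2
```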

Illustration on the Dark Energy Survey (DES) Y3

Different priors lead to different mass-maps...
Jeffrey, et al. (2021)

State Of The Art of mass-map reconstruction: $\texttt{DeepMass}$

Neural-network based; it estimates the posterior mean $\hat{\kappa} = \int \kappa \, p(\kappa|\gamma)\, d\kappa$
HST/ACS COSMOS field

Kaiser-Squires (1993)
DeepMass (Jeffrey et al. 2020)
Limitations:
  • Posterior mean only: no uncertainty on the reconstruction
  • Implicit likelihood: 1) no guarantee of data fidelity at inference, 2) requires a new training for any change in the lensing catalog (mask, noise)

Our contribution:

We propose to sample the full posterior distribution $p(\kappa|\gamma)$



  • Posterior Mean (DeepMass) ✅


  • Uncertainty quantification (posterior samples, moments) ✅


  • Explicit likelihood: train the network only once, data fidelity guaranteed at inference ✅

Remy et al. (2022) Posterior mean
Remy et al. (2022) Posterior samples

Bayesian Modeling

$$\boxed{\gamma = \mathbf{P}\kappa + n}$$ $\mathbf{P}$ is known and encodes our physical understanding of the problem
$\Longrightarrow$ $\mathbf{P}$ is non-invertible (survey mask, shape noise): the inverse problem is ill-posed,
with no unique solution $\kappa$


The Bayesian view of the problem:
$$ \boxed{p(\kappa | \gamma) \propto p(\gamma | \kappa) \ p(\kappa)} $$
  • $p(\gamma | \kappa)$ is the data likelihood, which contains the physics

  • $p(\kappa)$ is the prior knowledge on the solution.


[Figure: observed shear field $\gamma$ and reconstructed convergence map $\kappa$]
From this perspective we can provide point estimates (posterior mean, maximum, median, etc.)
and access the full posterior $p(\kappa|\gamma)$ with Markov Chain Monte Carlo or variational inference methods

How do you choose the prior?

Writing down the convergence map log posterior

$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no closed-form expression for the full non-Gaussian prior of the convergence.
    However:
    • We do have access to samples from the full implicit prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$

      $\kappa$TNG (Osato et al. 2021)
$\Longrightarrow$ Our strategy: learn the prior from simulations, then sample the full posterior.

Sampling method: Annealed Hamiltonian Monte Carlo

Sampling convergence maps $\kappa \sim p(\kappa|\gamma)$ is very difficult due to the high dimensionality of the space
( $360 \times 360 \approx 10^{5}$ parameters).

This is especially true for MCMC algorithms: the curse of dimensionality leads to highly correlated chains.

We need to design an efficient sampler to draw $\kappa_1, \kappa_2, \ldots, \kappa_N \sim p(\kappa| \gamma)$.


  • Hamiltonian Monte Carlo proposal for a step size $\alpha$ (a minimal sketch in code follows after this list): $$\begin{array}{lcl} \boldsymbol{m}_{t+\frac{\alpha}{2}} & = & \boldsymbol{m}_t + \dfrac{\alpha}{2} \color{orange}{\nabla_{\kappa}\log p(\boldsymbol{\kappa}_t|\gamma)} \\ \boldsymbol{\kappa}_{t+\alpha} & = & \boldsymbol{\kappa}_t + \alpha \boldsymbol{\mathrm{M}}^{-1} \boldsymbol{m}_{t+\frac{\alpha}{2}} \\ \boldsymbol{m}_{t+\alpha} & = & \boldsymbol{m}_{t+\frac{\alpha}{2}} + \dfrac{\alpha}{2} \color{orange}{\nabla_\kappa \log p (\boldsymbol{\kappa}_{t+\alpha}|\gamma)} \end{array}$$
  • Annealing: convolve the posterior with a wide Gaussian so that the sampler always remains in regions of high probability density. $$p_\sigma(x) = \int p_{\mathrm{data}}(x')\mathcal{N}(x|x', \sigma^2)dx', ~~~~~~~~\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
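Here is a minimal sketch of this annealed HMC loop, assuming a unit mass matrix and a function `score_posterior(kappa, sigma)` returning $\nabla_\kappa \log p_\sigma(\kappa|\gamma)$; the Metropolis accept/reject step of full HMC is omitted for brevity, so this is only an illustration of the proposal.

```python
# Minimal sketch of annealed HMC with a unit mass matrix (M = I).
# `score_posterior(kappa, sigma)` is assumed to return grad_kappa log p_sigma(kappa|gamma).
import jax
import jax.numpy as jnp

def leapfrog_step(kappa, m, alpha, sigma, score_posterior):
    m_half = m + 0.5 * alpha * score_posterior(kappa, sigma)           # half momentum kick
    kappa_new = kappa + alpha * m_half                                  # position update (M = I)
    m_new = m_half + 0.5 * alpha * score_posterior(kappa_new, sigma)    # second half kick
    return kappa_new, m_new

def annealed_hmc(kappa0, sigmas, n_steps, alpha, score_posterior, key):
    """Sample while decreasing the noise level sigma_1 > sigma_2 > ... > sigma_T."""
    kappa = kappa0
    for sigma in sigmas:                         # annealing loop over noise levels
        key, sub = jax.random.split(key)
        m = jax.random.normal(sub, kappa.shape)  # resample the momentum
        for _ in range(n_steps):                 # a few leapfrog steps per level
            kappa, m = leapfrog_step(kappa, m, alpha, sigma, score_posterior)
    return kappa                                 # one approximate posterior sample
```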

Sampling method: Annealed Hamiltonian Monte Carlo


[Figure: annealing scheme, from a broad initial distribution at high $\sigma_T$ down to the target posterior at $\sigma_0 \approx 0$]

The score is all you need!


  • With the HMC proposal, you only need access to the score of the posterior: $$\color{orange}{\nabla_\kappa \log p (\kappa|\gamma)}$$



  • The score of the full posterior is simply: $$\nabla_x \log p(x |y) = \underbrace{\nabla_x \log p(y |x)}_{\mbox{known}} \quad + \quad \underbrace{\nabla_x \log p(x)}_{\mbox{can be learned}}$$ $\Longrightarrow$ all we have to do is model/learn the score of the prior.
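As a sketch, the posterior score used by the sampler could be assembled as below; `forward_shear` is the Fourier-space operator sketched earlier, the noise covariance is simplified to a diagonal $\sigma_n^2 \mathbf{I}$, and `prior_score_fn` stands for the learned prior score (all of these names are my assumptions).

```python
# Minimal sketch: posterior score = analytic likelihood score + learned prior score.
# Assumes a diagonal noise covariance sigma_n^2 I and the forward_shear() sketch above.
import jax
import jax.numpy as jnp

def likelihood_score(kappa, gamma_obs, sigma_n):
    # Gradient of log p(gamma|kappa) = -1/2 ||gamma - P kappa||^2 / sigma_n^2 (+ cst)
    def log_likelihood(k):
        g1, g2 = forward_shear(k)
        residual = gamma_obs - (g1 + 1j * g2)   # gamma_obs stored as gamma1 + i gamma2
        return -0.5 * jnp.sum(jnp.abs(residual) ** 2) / sigma_n**2
    return jax.grad(log_likelihood)(kappa)

def posterior_score(kappa, gamma_obs, sigma_n, prior_score_fn, sigma):
    # grad log p(kappa|gamma) = grad log p(gamma|kappa) + grad log p_sigma(kappa)
    return likelihood_score(kappa, gamma_obs, sigma_n) + prior_score_fn(kappa, sigma)
```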

Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \color{orange}{\nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}}$$
$$\big(\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) - \boldsymbol{x}'\big)~/~\sigma^2 = \color{orange}{\nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
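In code, turning a trained denoiser into a prior score estimate is then a one-liner (the `denoiser` callable is an assumed placeholder for the trained network):

```python
# Minimal sketch: prior score recovered from a trained Gaussian denoiser r_theta.
def prior_score_from_denoiser(x_noisy, sigma, denoiser):
    # grad_x log p_{sigma^2}(x') = (r_theta(x', sigma) - x') / sigma^2
    return (denoiser(x_noisy, sigma) - x_noisy) / sigma**2
```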

Illustration on $\kappa$-TNG simulations

$$\nabla_\kappa \log p_\sigma(\kappa |\gamma) = \nabla_\kappa \log p_\sigma(\gamma |\kappa) \quad + \quad \color{orange}{\nabla_\kappa \log p_\sigma(\kappa)}$$

[Figure: true convergence map compared with the traditional Kaiser-Squires, Wiener filter, and our posterior mean reconstructions, together with posterior samples]

Reconstruction of the HST/ACS COSMOS field

1.637 square degrees, 64.2 gal/arcmin$^2$

[Figure: COSMOS mass maps, Massey et al. (2007) vs. Remy et al. (2022) posterior mean and posterior samples]

Takeaways



  • Hybrid physical/deep learning modeling:
    • Deep generative models can be used to provide data driven priors.

    • Explicit likelihood, using all of our physical knowledge.
      $\Longrightarrow$ The method can be applied for varying PSF, noise, or even different instruments!

  • We implemented a new class of mass-mapping methods, providing the full posterior
    $\Longrightarrow$ Find the highest quality convergence map of the COSMOS field online: https://zenodo.org/record/5825654

Thank you!



Training a Neural Score Estimator in practice




A standard UNet
  • We use a very standard residual UNet, and we adopt a residual score matching loss (a minimal sketch of this loss in code is given at the end of this section): $$ \mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$ $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$

  • Lipschitz regularization to improve robustness:

    [Figure: score estimates without vs. with Lipschitz regularization]
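For reference, a minimal sketch of the residual score matching loss above, assuming batched maps of shape (batch, H, W) and a network `score_net(params, x, sigma)` that directly outputs the score estimate; taking the absolute value of the sampled noise level is my assumption, not a detail from the talk.

```python
# Minimal sketch of the residual denoising score matching loss above.
# `score_net(params, x_noisy, sigma)` is assumed to output an estimate of grad log p_sigma.
import jax
import jax.numpy as jnp

def dsm_loss(params, x_batch, key, score_net, s=1.0):
    key_u, key_s = jax.random.split(key)
    u = jax.random.normal(key_u, x_batch.shape)                  # u ~ N(0, I)
    # sigma_s ~ N(0, s^2); absolute value taken so it acts as a positive noise level (assumption)
    sigma = jnp.abs(s * jax.random.normal(key_s, (x_batch.shape[0], 1, 1)))
    x_noisy = x_batch + sigma * u
    r = score_net(params, x_noisy, sigma)                        # direct score estimate
    # || u + sigma_s * r_theta(x + sigma_s u, sigma_s) ||_2^2, averaged over the batch
    return jnp.mean(jnp.sum((u + sigma * r) ** 2, axis=(1, 2)))
```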