Probabilistic Mapping of Dark Matter with Neural Score Matching

Benjamin Remy

With: Francois Lanusse, Zaccharie Ramzi, Niall Jeffrey, Jia Liu, J.-L. Starck





slides at b-remy.github.io/talks/ITA2021

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When $\mathbf{A}$ is non-invertible or ill-conditioned, the inverse problem is ill-posed, with no unique solution $x$.
Examples: deconvolution, inpainting, denoising.

The Bayesian view of the problem:

$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


With these concepts in hand, we can estimate for instance the Maximum A Posteriori solution:
$$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
For instance, if $n$ is Gaussian:
$$\hat x = \arg\max\limits_x \ - \frac{1}{2} \parallel y - \mathbf{A} x \parallel_{\mathbf{\Sigma}}^2 + \log p(x)$$

Or estimate the full posterior $p(x | y)$ with MCMC or Variational Inference methods.
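As a minimal sketch of the Gaussian-noise MAP above, the example below additionally assumes a zero-mean Gaussian prior, in which case the MAP has a closed form. The names `map_estimate`, `noise_cov`, and `prior_cov`, and the random test problem, are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def log_posterior_grad(x, y, A, noise_cov, prior_cov):
    """Gradient of log p(y|x) + log p(x) for Gaussian noise and a zero-mean Gaussian prior."""
    grad_lik = A.T @ np.linalg.solve(noise_cov, y - A @ x)
    grad_prior = -np.linalg.solve(prior_cov, x)
    return grad_lik + grad_prior

def map_estimate(y, A, noise_cov, prior_cov):
    """Closed-form MAP: solve (A^T S^-1 A + C^-1) x = A^T S^-1 y."""
    hessian = A.T @ np.linalg.solve(noise_cov, A) + np.linalg.inv(prior_cov)
    rhs = A.T @ np.linalg.solve(noise_cov, y)
    return np.linalg.solve(hessian, rhs)

# Toy problem: fewer observations than unknowns, so A alone is non-invertible.
rng = np.random.default_rng(0)
n_obs, n_dim = 5, 8
A = rng.normal(size=(n_obs, n_dim))
noise_cov = 0.1 * np.eye(n_obs)
prior_cov = np.eye(n_dim)
x_true = rng.normal(size=n_dim)
y = A @ x_true + rng.multivariate_normal(np.zeros(n_obs), noise_cov)

x_map = map_estimate(y, A, noise_cov, prior_cov)
```

The prior term is what makes the linear system solvable here: without it, $\mathbf{A}^t \mathbf{\Sigma}^{-1} \mathbf{A}$ is singular for this choice of dimensions.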

How do you choose the prior?

Classical examples of signal priors

Sparse $$ \log p(x) = - \parallel \mathbf{W} x \parallel_1 + cst $$
Gaussian $$ \log p(x) = - \frac{1}{2} x^t \mathbf{\Sigma^{-1}} x + cst $$
Total Variation $$ \log p(x) = - \parallel \nabla x \parallel_1 + cst $$

But what about learning the prior with deep generative models?

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



Not all generative models are created equal

Grathwohl et al. 2018


  • GANs and VAEs are very common and successful but do not fit our purposes.

  • We would want a model with explicit likelihood, which can evaluate $\log p_\theta(x)$.

The score is all you need!


  • Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $${\color{orange} \frac{d \log p(x | y)}{d x}}$$
    • Gradient ascent on the log posterior: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) $
    • Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} n_t$, with $n_t \sim \mathcal{N}(0, I)$



  • The score of the full posterior is simply: $$\nabla_x \log p(x |y) = \underbrace{\nabla_x \log p(y |x)}_{\mbox{known}} \quad + \quad \underbrace{\nabla_x \log p(x)}_{\mbox{can be learned}}$$ $\Longrightarrow$ all we have to do is model/learn the score of the prior.
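The Langevin update and the score decomposition above can be sketched on a 1D denoising toy problem. This is only an illustration under stated assumptions: the prior score is an analytic Gaussian stand-in for a learned network, and the function names and parameter values are hypothetical.

```python
import numpy as np

def likelihood_score(x, y, sigma_n):
    """Known term: gradient of log p(y|x) for y = x + n, n ~ N(0, sigma_n^2)."""
    return (y - x) / sigma_n**2

def prior_score(x, sigma_p):
    """Stand-in for the learned term: score of a N(0, sigma_p^2) prior."""
    return -x / sigma_p**2

def langevin(y, sigma_n, sigma_p, tau=1e-2, n_steps=2000, n_chains=5000, seed=0):
    """Unadjusted Langevin: x <- x + tau * score + sqrt(2 tau) * n_t, for many chains at once."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_chains)
    for _ in range(n_steps):
        score = likelihood_score(x, y, sigma_n) + prior_score(x, sigma_p)
        x = x + tau * score + np.sqrt(2 * tau) * rng.normal(size=n_chains)
    return x

# With sigma_n = sigma_p = 1 and y = 1 the exact posterior is N(0.5, 0.5).
samples = langevin(y=1.0, sigma_n=1.0, sigma_p=1.0)
```

Because no Metropolis correction is applied, the sample variance carries a small bias that shrinks with the step size $\tau$.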

A deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • We measure a noisy ${\color{Orchid} y}$ and we want to estimate a denoised ${\color{SkyBlue} x}$

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\hat{x} = \arg\max\limits_{x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient ascent as long as one has access to the score ${\color{orange} \frac{d \log p_\theta(x)}{d x}}$.
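A minimal sketch of this MAP denoising by gradient ascent is given below. It replaces the two-moons distribution with a two-component Gaussian mixture, chosen only because its score is available analytically; `CENTERS`, `PRIOR_STD`, and the step size are assumptions made for the illustration.

```python
import numpy as np

CENTERS = np.array([[-2.0, 0.0], [2.0, 0.0]])  # hypothetical prior modes
PRIOR_STD = 0.5                                 # isotropic width of each mode

def prior_score(x):
    """Score of an equal-weight mixture of two isotropic Gaussians."""
    diffs = CENTERS - x                                   # shape (2, 2)
    log_w = -0.5 * np.sum(diffs**2, axis=1) / PRIOR_STD**2
    w = np.exp(log_w - log_w.max())                       # stable responsibilities
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / PRIOR_STD**2

def posterior_score(x, y, sigma_n):
    """Likelihood score for y = x + n plus the prior score."""
    return (y - x) / sigma_n**2 + prior_score(x)

def map_denoise(y, sigma_n=1.0, tau=0.05, n_steps=500):
    """Gradient ascent on the log posterior, started at the noisy measurement."""
    x = y.copy()
    for _ in range(n_steps):
        x = x + tau * posterior_score(x, y, sigma_n)
    return x

y = np.array([1.0, 1.5])
x_map = map_denoise(y)
```

The estimate is pulled from the noisy point toward the nearest high-density region of the prior, which is the behaviour the data-manifold picture describes.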

Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • Suppose $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 I)$ to yield $$x^\prime = x + u$$
    • Consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ satisfies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
Figure labels: $\boldsymbol{x}'$, $\boldsymbol{x}$, $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$
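The boxed identity can be checked numerically in a case where everything is known in closed form. The sketch below assumes 1D Gaussian data, for which the $\ell_2$-optimal denoiser is linear, and fits that denoiser by least squares; the variable names and the values of `data_std` and `sigma` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data_std, sigma = 2.0, 1.0

# Clean samples x ~ N(0, data_std^2) and noisy samples x' = x + u.
x = data_std * rng.normal(size=200_000)
x_noisy = x + sigma * rng.normal(size=x.size)

# Least-squares linear denoiser r(x') = a * x', minimizing ||x - a x'||^2.
a_hat = np.dot(x_noisy, x) / np.dot(x_noisy, x_noisy)

def denoiser(x_prime):
    return a_hat * x_prime

def smoothed_score(x_prime):
    """Score of p_sigma = N(0, data_std^2 + sigma^2), the noise-convolved density."""
    return -x_prime / (data_std**2 + sigma**2)

# Compare both sides of r*(x', sigma) = x' + sigma^2 * score(x') on a small grid.
x_grid = np.linspace(-3.0, 3.0, 7)
lhs = denoiser(x_grid)
rhs = x_grid + sigma**2 * smoothed_score(x_grid)
max_gap = np.max(np.abs(lhs - rhs))
```

The remaining gap comes only from the finite number of training samples used to fit `a_hat`.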

Training a Neural Score Estimator in practice




A standard UNet
  • We use a very standard residual UNet, and we adopt a residual score matching loss: $$ \mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$ $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$

  • Lipschitz regularization to improve robustness:

    Without regularization
    With regularization
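The residual score matching loss above can be written as a Monte Carlo estimate over a batch. This is a sketch with NumPy arrays and an analytic score in place of the UNet, so it illustrates only the loss itself and not the training setup; `dsm_loss`, `true_score`, and `zero_score` are hypothetical names.

```python
import numpy as np

def dsm_loss(score_fn, x, noise_scale, rng):
    """Monte Carlo estimate of E || u + sigma_s * r(x + sigma_s * u, sigma_s) ||^2.

    x has shape (batch, dim); one noise level sigma_s ~ N(0, noise_scale^2) is drawn per sample.
    """
    u = rng.normal(size=x.shape)
    sigma = noise_scale * rng.normal(size=(x.shape[0], 1))
    residual = u + sigma * score_fn(x + sigma * u, sigma)
    return np.mean(np.sum(residual**2, axis=1))

DATA_STD = 1.5  # toy data: x ~ N(0, DATA_STD^2 I)

def true_score(x_noisy, sigma):
    """Exact score of the noise-convolved toy distribution."""
    return -x_noisy / (DATA_STD**2 + sigma**2)

def zero_score(x_noisy, sigma):
    """Uninformative baseline that ignores its input."""
    return np.zeros_like(x_noisy)

x = DATA_STD * np.random.default_rng(0).normal(size=(50_000, 2))
# Same seed for both evaluations so that they see identical noise draws.
loss_true = dsm_loss(true_score, x, noise_scale=1.0, rng=np.random.default_rng(1))
loss_zero = dsm_loss(zero_score, x, noise_scale=1.0, rng=np.random.default_rng(1))
```

The exact score attains a lower loss than the baseline, consistent with the minimizer of this loss being the score $\nabla \log p_\sigma(x)$.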

Annealed Langevin samples from DSM model in Song & Ermon (2020)

Efficient sampling by Annealed HMC

  • Even with gradients, sampling in high dimensions is difficult because of:
    • Curse of dimensionality
    • Highly correlated chains
  • $\Longrightarrow$ Use a parallel annealing strategy to sample effectively from the full distribution.
  • We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ learns the score $\nabla \log p_\sigma$ of a noise-convolved distribution, where $$p_\sigma(x) = \int p_{\mathrm{data}}(x')\mathcal{N}(x|x', \sigma^2)dx', ~~~~~~~~\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
  • Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample.
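The annealing idea can be sketched on a bimodal 1D toy distribution. Note that this example swaps HMC for the simpler unadjusted Langevin update to stay short, and uses an analytic $\nabla \log p_\sigma$ instead of a trained score network; the noise schedule `SIGMAS` and the step-size rule are assumptions.

```python
import numpy as np

MODES = np.array([-3.0, 3.0])           # toy data: two equal-weight modes
DATA_STD = 0.3                          # width of each mode
SIGMAS = np.geomspace(3.0, 0.01, 20)    # decreasing noise levels sigma_1 > sigma_2 > ...

def smoothed_score(x, sigma):
    """Score of p_sigma, the data mixture convolved with N(0, sigma^2); x has shape (n,)."""
    var = DATA_STD**2 + sigma**2
    diffs = MODES[None, :] - x[:, None]                     # shape (n, 2)
    log_w = -0.5 * diffs**2 / var
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))    # stable responsibilities
    w /= w.sum(axis=1, keepdims=True)
    return (w * diffs).sum(axis=1) / var

def annealed_langevin(n_chains=2000, steps_per_level=100, seed=0):
    """Run many chains in parallel while lowering sigma; return the last point of each chain."""
    rng = np.random.default_rng(seed)
    init_std = np.sqrt(MODES[1]**2 + DATA_STD**2 + SIGMAS[0]**2)
    x = init_std * rng.normal(size=n_chains)
    for sigma in SIGMAS:
        tau = 0.1 * (DATA_STD**2 + sigma**2)   # step size scaled to the current smoothing
        for _ in range(steps_per_level):
            noise = rng.normal(size=n_chains)
            x = x + tau * smoothed_score(x, sigma) + np.sqrt(2 * tau) * noise
    return x

samples = annealed_langevin()
frac_positive = np.mean(samples > 0)
```

At large $\sigma$ the two modes merge and chains move freely between them; as $\sigma$ shrinks, each chain settles into one mode, so the final points populate both modes.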

Example of one chain during annealing

Probabilistic Mapping of Dark Matter by Neural Score Matching

Remy, Ramzi, Lanusse et al. 2020

$\Longrightarrow$ Probe the full posterior of the convergence map, with an application to the COSMOS field.


Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$
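As a small illustration of the noise model above, the sketch below averages simulated ellipticities to estimate a constant shear. The shear value and the number of galaxies are made-up numbers, and the unit-variance intrinsic ellipticity simply follows the formula on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

gamma_true = np.array([0.03, -0.01])   # hypothetical (gamma_1, gamma_2)
n_gal = 10_000                         # hypothetical number of galaxies

# e = gamma + e_i with e_i ~ N(0, I), one row per galaxy.
intrinsic = rng.normal(size=(n_gal, 2))
e = gamma_true + intrinsic

# Averaging beats down the intrinsic shape noise as 1 / sqrt(n_gal).
gamma_hat = e.mean(axis=0)
std_err = 1.0 / np.sqrt(n_gal)
```

A single galaxy carries almost no information about $\gamma$; only the average over many galaxies does, which is why the noise term dominates the likelihood in mass mapping.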

The Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
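Taking the Fourier transform of the three relations above and eliminating $\Psi$ gives $\hat\gamma = \frac{(k_1^2 - k_2^2) + 2 i k_1 k_2}{k_1^2 + k_2^2} \hat\kappa$ for the complex shear $\gamma = \gamma_1 + i \gamma_2$. The sketch below implements this on a periodic flat-sky grid; the periodic boundaries, the axis convention (axis 0 as direction 1), and the function names are assumptions of the example.

```python
import numpy as np

def shear_kernel(shape):
    """Fourier-space kernel of P on a periodic grid; the k = 0 mode is set to zero."""
    k1 = np.fft.fftfreq(shape[0])[:, None]
    k2 = np.fft.fftfreq(shape[1])[None, :]
    k_sq = k1**2 + k2**2
    k_sq[0, 0] = 1.0                      # avoid 0/0 at the mean mode
    kernel = ((k1**2 - k2**2) + 2j * k1 * k2) / k_sq
    kernel[0, 0] = 0.0                    # the mean of kappa is not constrained by shear
    return kernel

def forward(kappa):
    """gamma = P kappa, returned as the complex field gamma_1 + i gamma_2."""
    return np.fft.ifft2(shear_kernel(kappa.shape) * np.fft.fft2(kappa))

def adjoint(gamma):
    """Apply the conjugate kernel; this inverts P on all modes except k = 0."""
    return np.fft.ifft2(np.conj(shear_kernel(gamma.shape)) * np.fft.fft2(gamma))

rng = np.random.default_rng(0)
kappa = rng.normal(size=(65, 65))
gamma = forward(kappa)
kappa_rec = adjoint(gamma).real
```

The kernel has unit modulus away from $k = 0$, so in this idealized noise-free, unmasked setting the adjoint recovers $\kappa$ up to its mean. With noise and survey masks this direct inversion is no longer adequate, which is the motivation for treating mass mapping as an inverse problem.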

Illustration on the Dark Energy Survey (DES) Y3

Jeffrey et al. (2021)

Writing down the convergence map log posterior

$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no closed-form expression for the full non-Gaussian prior of the convergence.
    However:
    • We do have access to samples from the full implicit prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: learn the prior from simulations, and then sample the full posterior.

  • We learn a hybrid denoiser: theoretical Gaussian on large scales, data-driven on small scales using N-body simulations. $$\underbrace{\nabla_{\boldsymbol{\kappa}} \log p(\boldsymbol{\kappa})}_\text{full prior} = \underbrace{\nabla_{\boldsymbol{\kappa}} \log p_{th}(\boldsymbol{\kappa})}_\text{gaussian prior} + \underbrace{\boldsymbol{r}_\theta(\boldsymbol{\kappa}, \nabla_{\boldsymbol{\kappa}} \log p_{th}(\boldsymbol{\kappa}))}_\text{learned residuals}$$
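The structure of this hybrid score can be sketched as follows. The Gaussian term assumes a stationary field on a periodic grid, so that its covariance is diagonal in Fourier space; `power_spectrum` denotes the covariance eigenvalues per Fourier mode, and `zero_residual` is a placeholder for the trained network $\boldsymbol{r}_\theta$, which is not reproduced here.

```python
import numpy as np

def gaussian_prior_score(kappa, power_spectrum):
    """Score -Sigma^{-1} kappa of a zero-mean stationary Gaussian field.

    power_spectrum holds the covariance eigenvalue of each Fourier mode, with the
    same shape as kappa and symmetric under k -> -k.
    """
    return -np.fft.ifft2(np.fft.fft2(kappa) / power_spectrum).real

def hybrid_prior_score(kappa, power_spectrum, residual_fn):
    """Gaussian score plus a residual that sees both kappa and the Gaussian score."""
    g = gaussian_prior_score(kappa, power_spectrum)
    return g + residual_fn(kappa, g)

def zero_residual(kappa, gaussian_score):
    """Placeholder for the learned residual network r_theta."""
    return np.zeros_like(kappa)

# Sanity check with a white spectrum: the covariance is s^2 I, so the score is -kappa / s^2.
rng = np.random.default_rng(0)
s = 0.5
kappa = s * rng.normal(size=(32, 32))
white_spectrum = np.full(kappa.shape, s**2)
score = hybrid_prior_score(kappa, white_spectrum, zero_residual)
```

With a zero residual the model reduces to the Gaussian prior, so the network only has to account for the non-Gaussian part of the prior.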

Illustration on $\kappa$-TNG simulations


True convergence map
Traditional Kaiser-Squires
Wiener Filter
Posterior Mean (ours)



Posterior samples

Illustration on $\kappa$-TNG simulations


Annealing HMC chain

Probabilistic Mass-Mapping of the HST COSMOS field


  • COSMOS shear data from Schrabback et al. 2010

  • Prior learned from MassiveNuS at fiducial cosmology (320x320 maps at 0.4 arcsec resolution).

  • Known massive X-ray clusters are indicated with crosses, along with their redshifts. The right panel shows cutouts of the central cluster from multiple posterior samples.


Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020


$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$
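A minimal sketch of this forward model is shown below, reading $\mathbf{M}$ as a binary k-space sampling mask and $\mathbf{F}$ as the 2D Fourier transform. That reading is the usual MRI convention but is an assumption here, as are the random mask and the function names.

```python
import numpy as np

def forward(x, mask):
    """A x = M F x: Fourier transform followed by k-space undersampling."""
    return mask * np.fft.fft2(x, norm="ortho")

def adjoint(y, mask):
    """A^H y = F^H M y: zero-filled inverse Fourier transform."""
    return np.fft.ifft2(mask * y, norm="ortho")

rng = np.random.default_rng(0)
shape = (32, 32)
mask = rng.random(shape) < 0.3            # keep roughly 30% of k-space, at random

# Adjoint (dot-product) test: <A x, y> must equal <x, A^H y>.
x = rng.normal(size=shape) + 1j * rng.normal(size=shape)
y = rng.normal(size=shape) + 1j * rng.normal(size=shape)
lhs = np.vdot(forward(x, mask), y)
rhs = np.vdot(x, adjoint(y, mask))

# Zero-filled reconstruction from noiseless undersampled data.
x_zero_filled = adjoint(forward(x, mask), mask)
```

Having a correct adjoint is what makes the likelihood score $\mathbf{A}^H (y - \mathbf{A} x) / \sigma^2$ available in closed form for Gaussian noise, so the same score-based posterior sampling applies.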



$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.

Takeaways



  • Hybrid physical/deep learning modeling:
    • Deep generative models can be used to provide data driven priors.

    • Explicit likelihood, which uses all of our physical knowledge.
      $\Longrightarrow$ The method can be applied to varying PSF, noise, or even different instruments!

  • Neural Score Estimation is a scalable approach to learn a prior score.

  • Knowledge of the posterior score is all we need for Bayesian inference, a.k.a. uncertainty quantification.

  • We implemented a new class of mass-mapping method, providing the full posterior
    $\Longrightarrow$ recovered a very high quality convergence map of the COSMOS field.




Thank you!