A central challenge in 3D scene perception via inverse graphics is robustly modeling the gap between the generative graphics model and real-world data. We propose a novel 3D neural embedding likelihood (3DNEL) that jointly models RGB and depth images, and empirically demonstrate that it enables robust 6D object pose estimation via Bayesian inverse graphics on real-world RGB-D images. 3DNEL uses neural embeddings, learned entirely from synthetic data, to predict dense 2D-3D correspondence scores from RGB, combines these scores with depth information in a principled manner, and uses a mixture model formulation to jointly model multiple objects in a scene. 3DNEL achieves new state-of-the-art (SOTA) performance in sim-to-real pose estimation on the YCB-Video dataset and is more robust than the previous SOTA, producing significantly fewer large-error pose predictions. Because it is formulated as a structured probabilistic generative model, 3DNEL can be easily adapted to object tracking in dynamic videos, which further improves the accuracy of 6D pose estimation.
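To make the idea of combining correspondence scores with depth in a mixture model more concrete, the following is a minimal sketch and not the 3DNEL likelihood itself. It assumes each observed 3D point is explained by a mixture over points rendered from the hypothesized scene, with each component weighted by a correspondence score and a Gaussian depth term, plus a uniform outlier component. The function name `mixture_log_likelihood`, the parameters `sigma`, `outlier_prob`, and `outlier_density`, and the toy score matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def mixture_log_likelihood(obs_points, scores, rendered_points,
                           sigma=0.01, outlier_prob=0.1, outlier_density=1.0):
    """Illustrative mixture log-likelihood (assumed form, not the paper's).

    obs_points:      (N, 3) observed 3D points back-projected from depth.
    scores:          (N, M) nonnegative correspondence scores between observed
                     pixel i and rendered point j (stand-in for RGB-derived scores).
    rendered_points: (M, 3) points rendered from the pose hypothesis,
                     pooled over all objects in the scene.
    """
    m = rendered_points.shape[0]
    # Squared distance between every observed and every rendered point.
    diff = obs_points[:, None, :] - rendered_points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)                      # (N, M)
    # Isotropic 3D Gaussian log-density for the depth term.
    log_gauss = -0.5 * sq_dist / sigma ** 2 - 1.5 * np.log(2 * np.pi * sigma ** 2)
    # Each mixture component: uniform weight 1/M, score, and depth term.
    log_comp = log_gauss + np.log(scores + 1e-12) - np.log(m)
    # Numerically stable log-sum-exp over the M components.
    max_comp = log_comp.max(axis=1, keepdims=True)
    log_inlier = max_comp[:, 0] + np.log(np.exp(log_comp - max_comp).sum(axis=1))
    # Blend with a uniform outlier component for robustness.
    log_point = np.logaddexp(np.log1p(-outlier_prob) + log_inlier,
                             np.log(outlier_prob) + np.log(outlier_density))
    return float(log_point.sum())

# Toy usage: the correct pose should score higher than a shifted one.
rng = np.random.default_rng(0)
model = rng.uniform(-0.1, 0.1, size=(50, 3))
observed = model + rng.normal(scale=0.002, size=model.shape)
toy_scores = np.full((50, 50), 0.01) + 0.99 * np.eye(50)     # hypothetical scores
ll_correct = mixture_log_likelihood(observed, toy_scores, model)
ll_shifted = mixture_log_likelihood(observed, toy_scores,
                                    model + np.array([0.05, 0.0, 0.0]))
```

In this toy setup the outlier component bounds how much any single badly explained point can lower the total, which is one plausible reading of why a mixture formulation can help avoid large-error pose predictions.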