G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

TL;DR: We introduce G3T, a transformer that predicts upright, gravity-aligned pointmaps regardless of input image orientation, and G3T-Long, a pipeline that leverages this uprightness to enable robust long-sequence 3D reconstruction.

What is this visualization?

📌 About: This visualization compares the uprightness of pointmaps predicted by G3T and VGGT on three self-captured scenes.
🕹️ Interact: Drag to orbit, scroll to zoom. Use the ⇌ handle to move between RGB and height-based coloring.
🔍 Focus on: Ground-parallel surfaces such as floors, stairs, and benches.
📝 Observe: Ground-parallel surfaces on G3T pointmaps show a uniform color, indicating that the pointmaps are upright, with horizontal surfaces sitting at a consistent height. The same surfaces on VGGT's pointmaps show color gradients instead, indicating that the pointmaps are not upright. Note also that G3T's ground-parallel surfaces align with the rendered grid lines.
💡 Takeaway: G3T reliably predicts upright pointmaps regardless of input image orientation.
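
The "uniform color on ground-parallel surfaces" cue can also be checked numerically. The following is a minimal sketch, not part of G3T: it assumes the vertical axis is y (consistent with the y-axis rotations used later on this page) and that a patch of floor points has already been selected by hand. The function name `tilt_from_vertical_deg` is hypothetical.

```python
import numpy as np

def tilt_from_vertical_deg(points, up=(0.0, 1.0, 0.0)):
    """Angle in degrees between the best-fit plane normal of `points` and `up`.

    Assumption: `points` is an (N, 3) array sampled from one ground-parallel
    surface (e.g. a floor patch), and `up` is the vertical axis of the frame.
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    up = np.asarray(up, dtype=float)
    up = up / np.linalg.norm(up)
    # The sign of the fitted normal is arbitrary, so compare absolute cosines.
    cos_angle = abs(float(normal @ up))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

A tilt near zero corresponds to a floor that would render in a single color under height-based coloring; a larger tilt corresponds to a visible color gradient.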

[Interactive viewer: VGGT and G3T pointmaps, each switchable between RGB and height-based coloring, shown with the input images.]

Abstract

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal.

We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another.

To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.

Model Architecture

G3T builds upon VGGT with two key modifications. First, the point head outputs pointmaps in the gravity-aligned frame of the first image, $\mathcal{X}^{G_1} = \{X^{G_1}_i\}$. Second, we replace VGGT's camera head with two new heads: the local camera head, whose outputs $\mathcal{G}^l = \{G^l_i\}$ capture the gravity-to-camera rotation and camera intrinsic parameters; and the relative camera head, whose outputs $\mathcal{G}^r = \{G^r_i\}$ capture the 1-DoF relative yaw and translation parameters. The aggregator, depth head, and point head architectures are otherwise unchanged from VGGT.
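
To make the relationship between a camera frame and a gravity-aligned frame concrete, here is a minimal sketch under stated assumptions. It does not reproduce G3T's heads or its rotation parameterization, which this page does not specify. It assumes the world "up" direction is known in camera coordinates and that the upright frame uses +y as up; the names `rotation_aligning` and `to_upright` are hypothetical.

```python
import numpy as np

def rotation_aligning(a, b):
    """Smallest rotation R with R @ a_hat == b_hat (Rodrigues' formula)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    c = float(a @ b)
    if np.isclose(c, -1.0):
        # Antiparallel vectors: rotate 180 degrees about any axis
        # perpendicular to a.
        e = np.eye(3)[np.argmin(np.abs(a))]
        axis = np.cross(a, e)
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    v = np.cross(a, b)
    k = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + k + (k @ k) / (1.0 + c)

def to_upright(points_cam, up_cam):
    """Rotate (N, 3) camera-frame points so that `up_cam` maps to +y.

    Assumption: +y is the vertical axis of the upright frame. The rotation
    axis is perpendicular to +y, so only roll and pitch are removed.
    """
    r_cam_to_upright = rotation_aligning(up_cam, (0.0, 1.0, 0.0))
    return np.asarray(points_cam, dtype=float) @ r_cam_to_upright.T, r_cam_to_upright
```

The transpose of the returned matrix plays the role of a gravity-to-camera rotation in this toy setup: it maps upright-frame directions back into the camera frame.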

Swapping Reference Image

G3T can produce upright pointmaps regardless of the orientation of the reference image, whereas VGGT's pointmaps inherit the roll and pitch of the camera that captured the reference image.

What is this visualization?

📌 About: Shows how VGGT and G3T pointmaps respond when different input images are chosen as the reference.
🕹️ Interact: Click on an input image to select it as the reference. Drag inside the viewer to orbit, scroll to zoom. Use the ⇌ handle to move between RGB and height-based coloring.
🔍 Focus on: Ground-parallel surfaces such as floors, stairs, and benches.
📝 Observe: VGGT reconstructions tilt with the reference image; G3T reconstructions stay upright.
💡 Takeaway: G3T produces gravity-aligned pointmaps regardless of which image is chosen as the reference.

[Interactive viewer: VGGT and G3T pointmaps for the selected reference image, each switchable between RGB and height-based coloring, shown with the input images.]

Robust Long Sequence Reconstruction

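Pointmaps predicted by G3T can be related to each other by a 5-DoF pose $\pi^y(s, R_{y}, t)$: 1 DoF for the scale $s$, 1 DoF for the rotation $R_y$, which is restricted to rotations about the y-axis, and 3 DoF for the translation $t$. A general similarity transform with an unrestricted rotation has 7 DoF.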
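
As an illustration, a pose of this form can be applied to a pointmap as $X' = s R_y X + t$. The sketch below assumes a right-handed frame with y as the vertical axis and a yaw angle `theta` in radians; the function names are hypothetical and are not taken from the G3T code.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation by `theta` radians about the y-axis (assumed vertical)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def apply_yaw_pose(points, scale, theta, t):
    """Apply X' = scale * R_y(theta) @ X + t to an (N, 3) array of points."""
    points = np.asarray(points, dtype=float)
    return scale * points @ yaw_rotation(theta).T + np.asarray(t, dtype=float)
```

Because $R_y$ leaves the y-axis fixed, vertical directions stay vertical under this pose, so an upright pointmap remains upright after alignment.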

We leverage this property to create G3T-Long, a submap-based reconstruction pipeline that improves upon VGGT-Long and reconstructs long video sequences with significantly improved robustness.

What is this visualization?

📌 About: This visualization compares two submap-based reconstruction methods, VGGT-Long and G3T-Long, on the same self-captured video with heavy camera roll and pitch variations.
🕹️ Interact: Drag to orbit, scroll to zoom. The chunk alignment animation can be paused. Click any video to play/pause all videos.
🔍 Focus on: The initial orientation of each chunk and the rotation needed to align them. Note: chunks are spaced out for visual clarity and thus translations are exaggerated.
📝 Observe: For G3T-Long, both individual chunks and the final reconstruction remain gravity-aligned despite chaotic camera motion, requiring only a 1-DoF yaw correction per chunk. VGGT-Long requires a full 3-DoF rotation correction.
💡 Takeaway: G3T stays upright even under severe roll and pitch, making long-sequence alignment simpler and more robust.
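
To illustrate why a yaw-only correction is simpler, the sketch below fits scale, yaw, and translation between two upright point sets in closed form. It is a constrained variant of the standard least-squares similarity fit, written here under assumptions: point correspondences between chunks are taken as given (for example from overlapping frames), y is the vertical axis, and the horizontal correlation is not degenerate. It is not claimed to be the solver used in G3T-Long, and `fit_yaw_similarity` is a hypothetical name.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation by `theta` radians about the y-axis (assumed vertical)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def fit_yaw_similarity(src, dst):
    """Least-squares (scale, theta, t) such that dst ~ scale * R_y(theta) @ src + t.

    `src` and `dst` are (N, 3) arrays of corresponding points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    p, q = src - mu_s, dst - mu_d
    # Horizontal (x, z) correlation terms determine the yaw angle.
    a = np.sum(q[:, 0] * p[:, 0] + q[:, 2] * p[:, 2])
    b = np.sum(q[:, 0] * p[:, 2] - q[:, 2] * p[:, 0])
    # The vertical correlation is unaffected by yaw and only enters the scale.
    c = np.sum(q[:, 1] * p[:, 1])
    theta = float(np.arctan2(b, a))
    scale = float((np.hypot(a, b) + c) / np.sum(p ** 2))
    t = mu_d - scale * yaw_rotation(theta) @ mu_s
    return scale, theta, t
```

The yaw angle comes from a single `arctan2` rather than from a singular value decomposition over a full 3-DoF rotation, which is one way to see how the reduced rotational freedom simplifies chunk alignment.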

[Interactive viewer: VGGT-Long and G3T-Long chunk alignment, shown with the input chunk videos.]

Uprightness Comparison

We compare G3T pointmaps against VGGT pointmaps made upright using GeoCalib (with multi-image optimization) as a baseline, alongside ground-truth gravity-aligned pointmaps.

What is this visualization?

📌 About: This visualization compares three sets of upright pointmaps: G3T, VGGT+GeoCalib, and Ground Truth.
🕹️ Interact: Drag to orbit, scroll to zoom.
🔍 Focus on: How closely the blue (G3T) and red (VGGT+GeoCalib) pointmaps align with the green ground truth.
📝 Observe: Scenes 1, 2, 3, and 7 show severe VGGT+GeoCalib failures; the remaining scenes show modest failures.
💡 Takeaway: G3T consistently produces upright pointmaps closer to ground truth than VGGT+GeoCalib, even in challenging scenes with large camera roll and pitch variations.

[Interactive viewer: VGGT + GeoCalib and G3T pointmaps, shown with the input images.]

Discussion

Limitations. G3T may not produce good upright-aware predictions in scenes with ambiguous structural cues. For example, G3T can struggle to estimate upright pointmaps from close-up images of floors and walls if additional unambiguous context is not present.

Future work. It would be interesting to explore whether gravity-aligned prediction naturally improves performance in tasks that favor uprightness, such as physically based simulation or spatial reasoning. It could also be advantageous to explore coordinate frames with even more structure, such as Manhattan frames (when appropriate), or even compositions of coordinate frames (e.g., a mixture of an overall scene coordinate frame and local object coordinate frames for items in the scene).

Acknowledgements

We thank the authors of DUSt3R, VGGT, CUT3R, and VGGT-Long for open-sourcing their projects, which our work builds upon. Additionally, we would like to thank Aditya Chetan, Haian Jin and Jay Karhade for their feedback on initial drafts of the paper. This work was funded in part by the National Science Foundation (IIS-2211259 and IIS-2212084), and benefited from the NVIDIA Academic Grant Program for compute resources.

BibTeX

@article{kani2026g3t,
  author    = {Nagoor Kani, Bharath Raj and Snavely, Noah},
  title     = {G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing},
  journal   = {arXiv preprint},
  year      = {2026},
}