
Options to do SLAM for video? Get poses and camera intrinsics? #8

Open
alik-git opened this issue Mar 3, 2024 · 17 comments

Comments

@alik-git

alik-git commented Mar 3, 2024

Hi there, congrats on the fantastic work! These are amazing results.

I'm working on 3D mapping systems for robotics, and was wondering:

Given a video, can this method help with obtaining the camera parameters and poses for each frame?

Do you guys have any scripts already for this? I see that in the example usage you have:

    # retrieve useful values from scene:
    imgs = scene.imgs
    focals = scene.get_focals()
    poses = scene.get_im_poses()

And you can do scene.get_intrinsics(), which is great, but when I run this on 12 images from the Replica dataset, scene.get_intrinsics() outputs 12 different intrinsic matrices, none of which really match the original camera intrinsics of the Replica dataset.

Am I doing something wrong? Should I specify the scale or resolution or something else about the images at some point? The Replica images are 1200x680 (w,h), but I'm assuming they get resized to 512.

Just wondering how I should go about getting the camera parameters for a monocular RGB video, or if that's not really possible to do super accurately yet with this method.

For extra detail, I'm using the following frames from the replica dataset

    image_filenames = [
        'frame000000.jpg', 'frame000023.jpg', 'frame000190.jpg', 'frame000502.jpg',
        'frame000606.jpg', 'frame000988.jpg', 'frame001181.jpg', 'frame001374.jpg',
        'frame001587.jpg', 'frame001786.jpg', 'frame001845.jpg', 'frame001928.jpg'
    ]
...
    images = load_images(images_path_list, size=512)
    pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
...
    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)
...

and the output of scene.get_intrinsics() is as follows (I'm only showing two of the 12 matrices here):

print(scene.get_intrinsics())
tensor([[[250.8425,   0.0000, 256.0000],
         [  0.0000, 250.8425, 144.0000],
         [  0.0000,   0.0000,   1.0000]],
...
        [[250.7383,   0.0000, 256.0000],
         [  0.0000, 250.7383, 144.0000],
         [  0.0000,   0.0000,   1.0000]]], device='cuda:0',
       grad_fn=<CopySlices>)

compared to the ground-truth camera params of the Replica dataset from camera_params.json:

K_given = [
    [600.0, 0, 599.5],
    [0, 600.0, 339.5],
    [0, 0, 1]
]

here is the actual camera_params.json file in case it helps

{
    "camera": {
        "w": 1200,
        "h": 680,
        "fx": 600.0,
        "fy": 600.0,
        "cx": 599.5,
        "cy": 339.5,
        "scale": 6553.5
    }
}

Also, just curious, how would I go about running this on long videos? Or is that not possible yet?

My apologies if these are too many questions! This method is really awesome, and I'm having a lot of fun using it. Thanks again for the wonderful work!

@hwanhuh

hwanhuh commented Mar 5, 2024

Although the task is a little different, I converted the dust3r result into a NeRF transform matrix and used it to train NeRF.
Below is my code.

    # imports needed to run this snippet; to_numpy lives in dust3r.utils.device in the current repo layout
    import json
    from collections import OrderedDict

    import numpy as np
    from dust3r.utils.device import to_numpy

    # OpenCV-to-OpenGL convention flip (the same matrix dust3r's visualization code uses)
    OPENGL = np.array([[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]])

    focals = scene.get_focals().cpu()
    extrinsics = scene.get_im_poses().cpu()
    intrinsics = scene.get_intrinsics().cpu()

    def make_nerf_transform_matrix_json(focals, intrinsics, extrinsics, scale=1):
        js_file = OrderedDict()

        focals = (to_numpy(focals) * scale).tolist()
        # cx/cy of the downscaled images, scaled back up and doubled to recover width/height
        scaled_width = int(intrinsics[0, :, 2][0].item() * scale * 2)
        scaled_height = int(intrinsics[0, :, 2][1].item() * scale * 2)

        # rotate each extrinsic matrix into the NeRF (OpenGL) convention
        frames = []
        for i, ext in enumerate(to_numpy(extrinsics)):
            ext[:3, 3:] *= scale
            frames.append({
                'file_path': 'images/image_%03d.png' % i,
                'transform_matrix': (ext @ OPENGL).tolist(),
                'fl_x': focals[i][0],
                'fl_y': focals[i][0],  # a single focal is assumed (square pixels)
            })

        js_file['w'] = scaled_width
        js_file['h'] = scaled_height
        js_file['cx'] = scaled_width / 2.0
        js_file['cy'] = scaled_height / 2.0
        js_file['frames'] = frames

        # dust3r does not know the distortion params
        js_file['k1'] = 0.0
        js_file['k2'] = 0.0
        js_file['p1'] = 0.0
        js_file['p2'] = 0.0

        with open('./transforms.json', 'w', encoding='utf-8') as file:
            json.dump(js_file, file, ensure_ascii=False, indent='\t')

Here, the 'scale' variable comes from the resize factor applied by the 'load_images' function in dust3r.utils.image.
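
For reference, that factor can also be recomputed directly from the original resolution (a minimal sketch, assuming load_images resized the longest side to 512, and using the Replica resolution from above as an example):

    orig_w, orig_h = 1200, 680             # original image resolution
    scale = max(orig_w, orig_h) / 512.0    # 2.34375 here; the value to pass as 'scale' above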

Unfortunately, the NeRF results with those converted camera poses are terrible.
I wonder if my code has some mistake, or if DUSt3R's recovered camera params are not suitable for image-based rendering like NeRF.

@MisterEkole

Hi @alik-git
DUSt3R isn't suitable for recovering camera params. It's true that it has functions that help recover camera params, but these params do not correspond to any real-world camera model, so it would be highly inefficient to use DUSt3R to recover camera intrinsics. It's clearly stated in their paper that their implementation doesn't enforce any geometrical constraint.

DUSt3R is efficient at reconstructing some scenes, but it wouldn't be ideal for SLAM-based tasks unless it is coupled with some classical technique, or perhaps after performing camera calibration and fine-tuning on your own data and camera model.

@Leonard-Yao

Hi, DUSt3R is a wonderful work and I want to get the pose between two real-world images. I wonder whether the data from "poses = scene.get_im_poses()" is a relative pose estimate or an absolute pose estimate? Thank you. The output is as follows:

    poses = tensor([[[ 9.9999e-01, -3.6112e-03,  3.9793e-03, -3.2458e-05],
                     [ 3.5933e-03,  9.9998e-01,  4.4953e-03, -1.0670e-03],
                     [-3.9955e-03, -4.4809e-03,  9.9998e-01, -1.4901e-04],
                     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]],
    ...
                    [[ 9.9867e-01, -4.9443e-02, -1.4822e-02, -5.3161e-04],
                     [ 4.9266e-02,  9.9871e-01, -1.2068e-02,  6.6914e-04],
                     [ 1.5399e-02,  1.1321e-02,  9.9982e-01,  1.0587e-04],
                     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]]],
                   device='cuda:0', grad_fn=<CopySlices>)

@vincent-leroy
Contributor

vincent-leroy commented Mar 8, 2024

Hello everyone,

I'll start by answering @alik-git's questions:
The scene.get_intrinsics() function returns the intrinsics evaluated for the downscaled images, i.e. with maxdim=512.

If you started with images of shape 680x1200, they get reshaped to 290x512, which is a factor of 2.34375.
If you want the intrinsics of the image at the original resolution, you thus need to multiply the optical center and focal length by this factor, which gives:

tensor([[587.9121,   0.0000, 600.0000],
        [  0.0000, 587.9121, 337.5000],
        [  0.0000,   0.0000,   1.0000]])

which looks pretty close to what you gave as GT =)
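
For anyone looking for the corresponding code, here is a minimal sketch of that rescaling (not code from the repo; it assumes load_images resized the longest side to 512 and ignores the small crop to patch-size multiples, which is why cy lands at 337.5 instead of 339.5):

    import numpy as np

    def upscale_intrinsics(K_small, orig_w, orig_h, long_side=512):
        # factor by which load_images shrank the image, e.g. 1200 / 512 = 2.34375 for Replica
        factor = max(orig_w, orig_h) / long_side
        K = np.array(K_small, dtype=float)
        K[:2, :] *= factor  # scales fx, fy, cx, cy; the last row [0, 0, 1] is untouched
        return K

    # e.g. upscale_intrinsics(to_numpy(scene.get_intrinsics())[0], 1200, 680)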

@vincent-leroy
Contributor

vincent-leroy commented Mar 8, 2024

@hwanhuh I took a quick look at your code and it seems you are mistakenly assuming that scene.get_im_poses() returns camera matrices (world2cam), whereas it returns cam2world matrices:

def get_im_poses(self): # cam to world

You should thus invert them to get camera projection matrices.
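
A minimal sketch of that inversion (assuming scene is the global aligner from the snippets above):

    import torch

    cam2world = scene.get_im_poses()          # (N, 4, 4) camera-to-world poses
    world2cam = torch.linalg.inv(cam2world)   # (N, 4, 4) world-to-camera (extrinsic) matrices
    # K_i @ world2cam[i, :3, :] then projects world points into image i (up to the global scale)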

@vincent-leroy
Contributor

@MisterEkole it is true that the implicit camera model of DUSt3R is an over-parametrization of standard camera models that enforces fewer constraints. However, we train only on data captured with "pinhole" camera models, and we show in additional experiments that the predictions closely follow the pinhole camera model even though we do not enforce it explicitly.

The experiments are as follows: we compute the average absolute error of field-of-view estimates (in degrees) and the average 2D reprojection accuracy (in %) at a threshold of 1% of the image diagonal, both measured on raw predictions for 1000 randomly sampled (unseen) test images from the Habitat, BlendedMVS and CO3D datasets. Note that Habitat and BlendedMVS are synthetically generated, so the intrinsics are perfectly known. For CO3D, we consider approximate focals estimated via COLMAP.
[image: table of FoV error and 2D reprojection accuracy on Habitat, BlendedMVS and CO3D]
In conclusion, DUSt3R predicts a 3D geometry that closely respects the pinhole camera model and allows for reliable focal length estimation, even in the monocular case. Of course, there are degenerate cases and inaccuracies, as the problem of focal estimation from one or a few views is ill-posed and one cannot always rely upon data priors.
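
For a concrete sense of what a field-of-view estimate means here, the horizontal FoV follows from the predicted focal and the image width via the standard pinhole relation (a generic formula, not code from the repo):

    import math

    def horizontal_fov_deg(fx, width):
        # horizontal field of view (in degrees) of a pinhole camera, with fx and width in pixels
        return math.degrees(2.0 * math.atan(width / (2.0 * fx)))

    # e.g. horizontal_fov_deg(250.84, 512) is roughly 91 degrees for the downscaled Replica images above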

@vincent-leroy
Contributor

@Leonard-Yao The world2cam matrices are expressed in a common reference frame although the term "absolute" might be misleading: the poses are not expressed in a metric space, because the reconstruction is performed up to an unknown scale factor.

@Leonard-Yao

@Leonard-Yao The world2cam matrices are expressed in a common reference frame although the term "absolute" might be misleading: the poses are not expressed in a metric space, because the reconstruction is performed up to an unknown scale factor.

@vincent-leroy Thank you for your patient reply. After getting the depth of every pixel in the paired images, can I figure out the unknown scale factor from the returned 3D points, e.g. factor = real depth / returned depth?
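
For reference, a minimal sketch of that idea (assuming a metric depth map is available for at least one frame; gt_depth_frame0 is a hypothetical ground-truth depth map you provide, and get_depthmaps is assumed to return the per-frame predicted depths):

    import numpy as np
    from dust3r.utils.device import to_numpy

    def estimate_scale(pred_depth, gt_depth):
        # robust global scale: median ratio of metric depth to predicted depth over valid pixels
        valid = (pred_depth > 0) & (gt_depth > 0)
        return np.median(gt_depth[valid] / pred_depth[valid])

    # pred = to_numpy(scene.get_depthmaps()[0])       # DUSt3R depth for frame 0 (up to scale)
    # scale = estimate_scale(pred, gt_depth_frame0)   # gt_depth_frame0: your metric depth map
    # multiplying DUSt3R's depths, points and pose translations by scale puts them in metric units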

@alik-git alik-git closed this as completed Mar 9, 2024
@alik-git
Author

alik-git commented Mar 9, 2024

Thanks for answering my question @vincent-leroy!

And just a follow-up question: what do you think about getting poses? Basically, do you think this can be extended into a full-fledged SLAM system? Or would that be a lot of work?

@vincent-leroy
Contributor

I think it should not be too hard to make a functional SLAM system with it, but some points need to be figured out. We might look into that in the future, but this is not the top priority.

@Mr-ZhuJun

(quoting @hwanhuh's comment above, code omitted)
Although the task is a little different, I converted the dust3r result into a NeRF transform matrix and used it to train NeRF. [...] Unfortunately, the NeRF results with those converted camera poses are terrible. I wonder if my code has some mistake, or if the recovered camera params are not suitable for image-based rendering like NeRF.

Hello, I would like to ask: if the poses estimated by DUSt3R are used in NeRF, will there be ghosting/blur in the NeRF renderings?

@andybak

andybak commented Mar 27, 2024

Not sure why this is closed. @alik-git can you reopen? It's a fruitful discussion.

@alik-git
Author

Sure, no problem. Reopening.

@alik-git alik-git reopened this Mar 29, 2024
@Bin-ze

Bin-ze commented Apr 2, 2024

(quoting @hwanhuh's NeRF-conversion comment and @Mr-ZhuJun's question above, code omitted)

I'm a NeRF researcher and I'm very interested in applying DUSt3R to NeRF modeling.
I would like to know what the results of your experiments are so far.
Can you show the render results you got using NeRF under this issue?
Maybe we can figure out a way to make it work together.
Thank you!

@PHOENIXFURY007

PHOENIXFURY007 commented Apr 4, 2024

(quoting @vincent-leroy's answer above about multiplying the downscaled intrinsics by the 1200/512 = 2.34375 factor to recover the original-resolution intrinsics)

This looks really interesting.
@vincent-leroy, can you share the code for how you got the scaled-down intrinsics?

@PHOENIXFURY007

PHOENIXFURY007 commented Apr 4, 2024

Is it possible to get the focals and intrinsics for in-the-wild samples? I'd assume the values would be close to the real values but not exact.

@ljjTYJR

ljjTYJR commented Apr 26, 2024

For a SLAM system, I wonder how to maintain scale consistency, especially when the scene changes (e.g. going from outdoor to indoor).
