
Options to do SLAM for video? Get poses and camera intrinsics? #8

Open
alik-git opened this issue Mar 3, 2024 · 17 comments

Comments

@alik-git

alik-git commented Mar 3, 2024

Hi there, congrats on the fantastic work! These are amazing results.

I'm working on 3D mapping systems for robotics, and was wondering:

Given a video, can this method help with obtaining the camera parameters and poses for each frame?

Do you guys have any scripts already for this? I see that in the example usage you have:

    # retrieve useful values from scene:
    imgs = scene.imgs
    focals = scene.get_focals()
    poses = scene.get_im_poses()

And you can do scene.get_intrinsics(), which is great, but when I run this on 12 images from the Replica dataset, scene.get_intrinsics() outputs 12 different intrinsic matrices, none of which really match the original camera intrinsics of the Replica dataset.

Am I doing something wrong? Should I specify the scale or resolution or something else about the images at some point? The Replica images are 1200x680 (w,h), but I'm assuming they get resized to 512.

Just wondering how I should go about getting the camera parameters for a monocular RGB video, or if that's not really possible to do super accurately yet with this method.

For extra detail, I'm using the following frames from the replica dataset

    image_filenames = [
        'frame000000.jpg', 'frame000023.jpg', 'frame000190.jpg', 'frame000502.jpg',
        'frame000606.jpg', 'frame000988.jpg', 'frame001181.jpg', 'frame001374.jpg',
        'frame001587.jpg', 'frame001786.jpg', 'frame001845.jpg', 'frame001928.jpg'
    ]
...
    images = load_images(images_path_list, size=512)
    pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
...
    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)
...

and the output of scene.get_intrinsics() is as follows (I'm only showing two of the 12 matrices here):

print(scene.get_intrinsics())
tensor([[[250.8425,   0.0000, 256.0000],
         [  0.0000, 250.8425, 144.0000],
         [  0.0000,   0.0000,   1.0000]],
...
        [[250.7383,   0.0000, 256.0000],
         [  0.0000, 250.7383, 144.0000],
         [  0.0000,   0.0000,   1.0000]]], device='cuda:0',
       grad_fn=<CopySlices>)

compared to the ground-truth camera params of the Replica dataset from camera_params.json:

K_given = [
    [600.0, 0, 599.5],
    [0, 600.0, 339.5],
    [0, 0, 1]
]

here is the actual camera_params.json file in case it helps

{
    "camera": {
        "w": 1200,
        "h": 680,
        "fx": 600.0,
        "fy": 600.0,
        "cx": 599.5,
        "cy": 339.5,
        "scale": 6553.5
    }
}

Also, just curious, how would I go about running this on long videos? Or is that not possible yet?

My apologies if these are too many questions! This method is really awesome, and I'm having a lot of fun using it. Thanks again for the wonderful work!

@hwanhuh

hwanhuh commented Mar 5, 2024

Although the task is a little different, I converted the dust3r result into a NeRF transform matrix and used it to train NeRF.
Below is my code.

    # imports needed to run this snippet; to_numpy lives in dust3r.utils.device in the current repo layout
    import json
    from collections import OrderedDict

    import numpy as np
    from dust3r.utils.device import to_numpy

    # OpenCV-to-OpenGL convention flip (the same matrix dust3r's visualization code uses)
    OPENGL = np.array([[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]])

    focals = scene.get_focals().cpu()
    extrinsics = scene.get_im_poses().cpu()
    intrinsics = scene.get_intrinsics().cpu()

    def make_nerf_transform_matrix_json(focals, intrinsics, extrinsics, scale=1):
        js_file = OrderedDict()

        focals = (to_numpy(focals) * scale).tolist()
        # cx/cy of the downscaled images, scaled back up and doubled to recover width/height
        scaled_width = int(intrinsics[0, :, 2][0].item() * scale * 2)
        scaled_height = int(intrinsics[0, :, 2][1].item() * scale * 2)

        # rotate each extrinsic matrix into the NeRF (OpenGL) convention
        frames = []
        for i, ext in enumerate(to_numpy(extrinsics)):
            ext[:3, 3:] *= scale
            frames.append({
                'file_path': 'images/image_%03d.png' % i,
                'transform_matrix': (ext @ OPENGL).tolist(),
                'fl_x': focals[i][0],
                'fl_y': focals[i][0],  # a single focal is assumed (square pixels)
            })

        js_file['w'] = scaled_width
        js_file['h'] = scaled_height
        js_file['cx'] = scaled_width / 2.0
        js_file['cy'] = scaled_height / 2.0
        js_file['frames'] = frames

        # dust3r does not know the distortion params
        js_file['k1'] = 0.0
        js_file['k2'] = 0.0
        js_file['p1'] = 0.0
        js_file['p2'] = 0.0

        with open('./transforms.json', 'w', encoding='utf-8') as file:
            json.dump(js_file, file, ensure_ascii=False, indent='\t')

Here, the 'scale' variable comes from the resize factor applied by the 'load_images' function in dust3r.utils.image.
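
For reference, that factor can also be recomputed directly from the original resolution (a minimal sketch, assuming load_images resized the longest side to 512, and using the Replica resolution from above as an example):

    orig_w, orig_h = 1200, 680             # original image resolution
    scale = max(orig_w, orig_h) / 512.0    # 2.34375 here; the value to pass as 'scale' above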

Unfortunately, the NeRF results with those converted camera poses are terrible.
I wonder if my code has some mistake, or if DUSt3R's recovered camera params are not suitable for image-based rendering like NeRF.

@MisterEkole

Hi @alik-git
DUSt3R isn't suitable for recovering camera params. It's true that it has functions that help recover camera params, but these params do not correspond to any real-world camera model, so it would be highly inefficient to use DUSt3R to recover camera intrinsics. It's clearly stated in their paper that their implementation doesn't enforce any geometrical constraint.

DUSt3R is efficient at reconstructing some scenes, but it wouldn't be ideal for SLAM-based tasks unless it is coupled with some classical technique, or perhaps after performing camera calibration and fine-tuning on your own data and camera model.

@Leonard-Yao

Hi, DUSt3R is a wonderful work and I want to get the pose between two real-world images. I wonder whether the data from "poses = scene.get_im_poses()" is a relative pose estimate or an absolute pose estimate? Thank you. The output is as follows:

    poses = tensor([[[ 9.9999e-01, -3.6112e-03,  3.9793e-03, -3.2458e-05],
                     [ 3.5933e-03,  9.9998e-01,  4.4953e-03, -1.0670e-03],
                     [-3.9955e-03, -4.4809e-03,  9.9998e-01, -1.4901e-04],
                     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]],
    ...
                    [[ 9.9867e-01, -4.9443e-02, -1.4822e-02, -5.3161e-04],
                     [ 4.9266e-02,  9.9871e-01, -1.2068e-02,  6.6914e-04],
                     [ 1.5399e-02,  1.1321e-02,  9.9982e-01,  1.0587e-04],
                     [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00]]],
                   device='cuda:0', grad_fn=<CopySlices>)

@vincent-leroy
Contributor

vincent-leroy commented Mar 8, 2024

Hello everyone,

I'll start by answering @alik-git's questions:
The scene.get_intrinsics() function returns the intrinsics evaluated for the downscaled images, i.e. with maxdim=512.

If you started with images of shape 680x1200, they get reshaped to 290x512, which is a factor of 2.34375.
If you want the intrinsics of the image at the original resolution, you thus need to multiply the optical center and focal length by this factor, which gives:

tensor([[587.9121,   0.0000, 600.0000],
        [  0.0000, 587.9121, 337.5000],
        [  0.0000,   0.0000,   1.0000]])

which looks pretty close to what you gave as GT =)
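
For anyone looking for the corresponding code, here is a minimal sketch of that rescaling (not code from the repo; it assumes load_images resized the longest side to 512 and ignores the small crop to patch-size multiples, which is why cy lands at 337.5 instead of 339.5):

    import numpy as np

    def upscale_intrinsics(K_small, orig_w, orig_h, long_side=512):
        # factor by which load_images shrank the image, e.g. 1200 / 512 = 2.34375 for Replica
        factor = max(orig_w, orig_h) / long_side
        K = np.array(K_small, dtype=float)
        K[:2, :] *= factor  # scales fx, fy, cx, cy; the last row [0, 0, 1] is untouched
        return K

    # e.g. upscale_intrinsics(to_numpy(scene.get_intrinsics())[0], 1200, 680)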

@vincent-leroy
Contributor

vincent-leroy commented Mar 8, 2024

@hwanhuh I took a quick look at your code and it seems you are mistakenly assuming that scene.get_im_poses() returns camera matrices (world2cam), whereas it returns cam2world matrices:

def get_im_poses(self): # cam to world

You should thus invert them to get camera projection matrices.
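
A minimal sketch of that inversion (assuming scene is the global aligner from the snippets above):

    import torch

    cam2world = scene.get_im_poses()          # (N, 4, 4) camera-to-world poses
    world2cam = torch.linalg.inv(cam2world)   # (N, 4, 4) world-to-camera (extrinsic) matrices
    # K_i @ world2cam[i, :3, :] then projects world points into image i (up to the global scale)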

@vincent-leroy
Contributor

@MisterEkole it is true that the implicit camera model of DUSt3R is an over-parametrization of standard camera models that enforces fewer constraints. However, we train only on data captured with "pinhole" camera models, and we show in additional experiments that the predictions closely follow the pinhole camera model even though we do not enforce it explicitly.

The experiments are as follows: we compute the average absolute error of field-of-view estimates (in degrees) and the average 2D reprojection accuracy (in %) at a threshold of 1% of the image diagonal, both measured on raw predictions for 1000 randomly sampled (unseen) test images from the Habitat, BlendedMVS and CO3D datasets. Note that Habitat and BlendedMVS are synthetically generated, so the intrinsics are perfectly known. For CO3D, we consider approximate focals estimated via COLMAP.
[image: table of FoV error and 2D reprojection accuracy on Habitat, BlendedMVS and CO3D]
In conclusion, DUSt3R predicts a 3D geometry that closely respects the pinhole camera model and allows for reliable focal length estimation, even in the monocular case. Of course, there are degenerate cases and inaccuracies, as the problem of focal estimation from one or a few views is ill-posed and one cannot always rely upon data priors.
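
For a concrete sense of what a field-of-view estimate means here, the horizontal FoV follows from the predicted focal and the image width via the standard pinhole relation (a generic formula, not code from the repo):

    import math

    def horizontal_fov_deg(fx, width):
        # horizontal field of view (in degrees) of a pinhole camera, with fx and width in pixels
        return math.degrees(2.0 * math.atan(width / (2.0 * fx)))

    # e.g. horizontal_fov_deg(250.84, 512) is roughly 91 degrees for the downscaled Replica images above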

@vincent-leroy
Contributor

@Leonard-Yao The world2cam matrices are expressed in a common reference frame although the term "absolute" might be misleading: the poses are not expressed in a metric space, because the reconstruction is performed up to an unknown scale factor.

@Leonard-Yao

@Leonard-Yao The world2cam matrices are expressed in a common reference frame although the term "absolute" might be misleading: the poses are not expressed in a metric space, because the reconstruction is performed up to an unknown scale factor.

@vincent-leroy Thank you for your patient reply. After getting the depth of every pixel in the paired images, can I figure out the unknown scale factor from the returned 3D points, e.g. factor = real depth / returned depth?
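
For reference, a minimal sketch of that idea (assuming a metric depth map is available for at least one frame; gt_depth_frame0 is a hypothetical ground-truth depth map you provide, and get_depthmaps is assumed to return the per-frame predicted depths):

    import numpy as np
    from dust3r.utils.device import to_numpy

    def estimate_scale(pred_depth, gt_depth):
        # robust global scale: median ratio of metric depth to predicted depth over valid pixels
        valid = (pred_depth > 0) & (gt_depth > 0)
        return np.median(gt_depth[valid] / pred_depth[valid])

    # pred = to_numpy(scene.get_depthmaps()[0])       # DUSt3R depth for frame 0 (up to scale)
    # scale = estimate_scale(pred, gt_depth_frame0)   # gt_depth_frame0: your metric depth map
    # multiplying DUSt3R's depths, points and pose translations by scale puts them in metric units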

@alik-git alik-git closed this as completed Mar 9, 2024
@alik-git
Author

alik-git commented Mar 9, 2024

Thanks for answering my question @vincent-leroy!

And just a follow-up question: what do you think about getting poses? Basically, do you think this can be extended into a full-fledged SLAM system? Or would that be a lot of work?

@vincent-leroy
Contributor

I think it should not be too hard to make a functional SLAM system with it, but some points need to be figured out. We might look into that in the future, but this is not the top priority.

@Mr-ZhuJun

(quoting @hwanhuh's comment above, code omitted)
Although the task is a little different, I converted the dust3r result into a NeRF transform matrix and used it to train NeRF. [...] Unfortunately, the NeRF results with those converted camera poses are terrible. I wonder if my code has some mistake, or if the recovered camera params are not suitable for image-based rendering like NeRF.

Hello, I would like to ask: if the poses estimated by DUSt3R are used in NeRF, will there be ghosting/blur in the NeRF renderings?

@andybak

andybak commented Mar 27, 2024

Not sure why this is closed. @alik-git can you reopen? It's a fruitful discussion.

@alik-git
Author

Sure, no problem. Reopening.

@alik-git alik-git reopened this Mar 29, 2024
@Bin-ze

Bin-ze commented Apr 2, 2024

(quoting @hwanhuh's NeRF-conversion comment and @Mr-ZhuJun's question above, code omitted)

I'm a NeRF researcher and I'm very interested in applying DUSt3R to NeRF modeling.
I would like to know what the results of your experiments are so far.
Can you show the render results you got using NeRF under this issue?
Maybe we can figure out a way to make it work together.
Thank you!

@PHOENIXFURY007

PHOENIXFURY007 commented Apr 4, 2024

(quoting @vincent-leroy's answer above about multiplying the downscaled intrinsics by the 1200/512 = 2.34375 factor to recover the original-resolution intrinsics)

This looks really interesting.
@vincent-leroy, can you share the code for how you got the scaled-down intrinsics?

@PHOENIXFURY007

PHOENIXFURY007 commented Apr 4, 2024

Is it possible to get the focals and intrinsics for in-the-wild samples? I'd assume the values would be close to the real values but not exact.

@ljjTYJR

ljjTYJR commented Apr 26, 2024

For a SLAM system, I wonder how to maintain scale consistency, especially when the scene changes (e.g. going from outdoor to indoor).
