NeRF Application for Smart Farm

Distributed Autonomous Systems Lab

2024

I explored how Neural Radiance Fields (NeRF) can redefine 3D perception for autonomous systems operating in unstructured, real-world environments. This project involved building high-fidelity spatial reconstruction pipelines that allowed robots to accurately perceive plants, obstacles, and terrain—achieving levels of environmental awareness that traditional LiDAR and stereo vision systems could not match.

Key Contributions:

Model Implementation: Built and optimized NeRF models in PyTorch to reconstruct 3D scenes, tuning positional encoding and hierarchical sampling parameters for real-time rendering performance
Data Loader Development: Built a custom data-loading pipeline to preprocess RGB images and camera poses into ray batches, enabling efficient training and visualization
Training & Testing Pipeline: Trained and validated the model on agricultural datasets, tuning parameters for real-time rendering and comparing performance across lighting and environmental conditions

This project advanced my understanding of how machine learning and computer vision frameworks can transform raw image data into structured, actionable perception for autonomous systems. By integrating NeRF-based 3D reconstruction with robotic navigation, I learned how high-fidelity scene understanding enhances both accuracy and decision-making in unstructured environments. The experience strengthened my research foundation in robotic perception and adaptive intelligence, which I aim to extend toward scalable, human-centered applications.

Motivation

Smart farms face growing labor shortages and rising operational costs, while existing crop-monitoring systems depend on expensive sensors and dense measurements that limit scalability. I explored lightweight NeRF-based 3D perception as a low-cost alternative for reconstructing detailed plant structure from sparse visual data in real agricultural environments.

Method

Dataloader: I built a NeRF-specific data loader that converts each image and its 4×4 camera pose into per-pixel camera rays (rays_o, rays_d) and RGB targets.

			
import json
import os

import torch
from torch.utils.data import Dataset


class Dataloader(Dataset):
    """Pre-computes every camera ray of a Blender-style dataset and serves fixed-size ray batches."""
    def __init__(
        self,
        datadir,
        json_dir='transforms_train.json',
        img_dir='train/',  # not used here: image paths come from the JSON metadata
        batch_size=256,
        H=400,
        W=400
    ):
        # Load Blender-style metadata (the file is closed automatically)
        with open(os.path.join(datadir, json_dir), "r") as f:
            meta = json.load(f)
        frames = meta['frames']
        camera_angle_x = meta['camera_angle_x']  # horizontal field of view in radians
        # Collect file paths and 4×4 camera-to-world poses
        file_paths, poses = [], []
        for fr in frames:
            file_paths.append(os.path.join(datadir, fr['file_path']).replace('\\', '/'))
            poses.append(fr['transform_matrix'])
        # Build global ray tensors across all images
        # read_data (defined elsewhere in the project) returns (H * W, 3) tensors per image
        rays_o_all, rays_d_all, rgb_all = [], [], []
        for img_path, pose in zip(file_paths, poses):
            rays_o, rays_d, rgb = read_data(img_path, pose, camera_angle_x, H, W)
            rays_o_all.append(rays_o)
            rays_d_all.append(rays_d)
            rgb_all.append(rgb)
        # Flatten: (N_imgs * H * W, 3)
        self.rays_o = torch.cat(rays_o_all)
        self.rays_d = torch.cat(rays_d_all)
        self.target_px_values = torch.cat(rgb_all)
        self.size = self.rays_o.shape[0]
        self.batch_size = batch_size
        self.H = H
        self.W = W

    def __len__(self):
        # number of mini-batches this dataset emits
        return (self.size + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        # Raise IndexError past the last batch so plain iteration over the dataset terminates
        if idx >= len(self):
            raise IndexError(idx)
        s = idx * self.batch_size
        e = min(s + self.batch_size, self.size)
        return {
            'rays_o': self.rays_o[s:e],                 # (B, 3)
            'rays_d': self.rays_d[s:e],                 # (B, 3)
            'target_px_values': self.target_px_values[s:e]  # (B, 3)
        }

		

For every frame, read_data(...) computes the focal length from the horizontal camera FOV (camera_angle_x), generates per-pixel directions with a pinhole model, rotates those directions into world coordinates using the camera-to-world pose, and flattens the result into (H × W, 3) tensors. The Dataloader then concatenates the rays from all images and emits contiguous ray batches during training. This avoids per-image loops in the training step and keeps the pipeline fast and simple.
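The ray-generation step described above can be sketched as follows. This is a minimal NumPy illustration, not the project's read_data (which is not shown and also loads the image): the function name get_rays is hypothetical, and the camera convention (x right, y up, camera looking along -z, as in Blender-style datasets) is an assumption.

```python
import numpy as np


def get_rays(pose, camera_angle_x, H, W):
    """Return per-pixel ray origins and unit directions, each of shape (H * W, 3)."""
    pose = np.asarray(pose, dtype=np.float64)
    # Focal length in pixels from the horizontal field of view
    focal = 0.5 * W / np.tan(0.5 * camera_angle_x)
    # Pixel grid: u runs along the width, v along the height
    u, v = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
    # Pinhole directions in the camera frame (assumed: x right, y up, looking along -z)
    dirs_cam = np.stack(
        [(u - 0.5 * W) / focal,
         -(v - 0.5 * H) / focal,
         -np.ones_like(u, dtype=np.float64)],
        axis=-1,
    )
    # Rotate into world coordinates with the upper-left 3x3 block of the pose
    dirs_world = dirs_cam @ pose[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Every ray of one image starts at the camera centre (the pose translation)
    origins = np.broadcast_to(pose[:3, 3], dirs_world.shape)
    return origins.reshape(-1, 3), dirs_world.reshape(-1, 3)


# Usage: a 4x4 image seen from a camera placed at (1, 2, 3) with a 90-degree FOV
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
rays_o, rays_d = get_rays(pose, np.pi / 2, H=4, W=4)
print(rays_o.shape, rays_d.shape)
```

Flattening in row-major order means pixel (row v, column u) ends up at index v * W + u, which is the ordering a chunked renderer needs in order to reshape its output back to (H, W, 3).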

NeRF Model: The model predicts shape and color separately and uses positional encoding to capture fine details, making NeRF training more efficient and stable.

This NeRF MLP maps a 3D point o and viewing direction d to an RGB color and a volume density σ. Both inputs are first positionally encoded with multi-frequency sin/cos terms (L=10 for positions, L=4 for directions) to capture high-frequency detail. A position trunk (block1) produces features; a density head (block2) takes those features concatenated with the encoded position (a skip connection) and outputs σ (through a ReLU) together with refined features. Those features are then fused with the direction encoding in a color head (block3 → block4), which predicts RGB in [0,1] via Sigmoid. Xavier initialization is used for all linear layers for stable training.
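To make the encoding sizes concrete, the sketch below mirrors the positional encoding in NumPy and prints the resulting feature widths. It is an illustration only; the all-zero inputs are placeholders.

```python
import numpy as np


def positional_encoding(x, L):
    """Concatenate x with sin(2^j x) and cos(2^j x) for j = 0 .. L-1 along the feature axis."""
    out = [x]
    for j in range(L):
        out.append(np.sin(2 ** j * x))
        out.append(np.cos(2 ** j * x))
    return np.concatenate(out, axis=1)


points = np.zeros((5, 3))      # five placeholder 3D positions
directions = np.zeros((5, 3))  # five placeholder viewing directions
emb_x = positional_encoding(points, L=10)
emb_d = positional_encoding(directions, L=4)
print(emb_x.shape, emb_d.shape)  # (5, 63) (5, 27)
```

Each of the 3 input coordinates contributes 2 × L sin/cos terms, and the raw input is kept as well, so the widths are 3 + 6 × 10 = 63 for positions and 3 + 6 × 4 = 27 for directions.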

			
import torch
import torch.nn as nn


class NerfModel(nn.Module):
    def __init__(self, embedding_dim_pos=10, embedding_dim_direction=4, hidden_dim=128):
        super().__init__()
        # position trunk: encoded position -> features
        self.block1 = nn.Sequential(nn.Linear(embedding_dim_pos * 6 + 3, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), )
        # density estimation (input: trunk features + encoded position as a skip connection)
        self.block2 = nn.Sequential(nn.Linear(embedding_dim_pos * 6 + hidden_dim + 3, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim + 1), )
        # color estimation (input: refined features + encoded direction)
        self.block3 = nn.Sequential(nn.Linear(embedding_dim_direction * 6 + hidden_dim + 3, hidden_dim // 2), nn.ReLU(), )
        self.block4 = nn.Sequential(nn.Linear(hidden_dim // 2, 3), nn.Sigmoid(), )
        self.embedding_dim_pos = embedding_dim_pos
        self.embedding_dim_direction = embedding_dim_direction
        self.relu = nn.ReLU()

    @staticmethod
    def positional_encoding(x, L):
        # Keep the raw input and append sin/cos at frequencies 2^0 .. 2^(L-1)
        out = [x]
        for j in range(L):
            out.append(torch.sin(2 ** j * x))
            out.append(torch.cos(2 ** j * x))
        return torch.cat(out, dim=1)

    def forward(self, o, d):
        emb_x = self.positional_encoding(o, self.embedding_dim_pos) # emb_x: [batch_size, embedding_dim_pos * 6 + 3]
        emb_d = self.positional_encoding(d, self.embedding_dim_direction) # emb_d: [batch_size, embedding_dim_direction * 6 + 3]
        h = self.block1(emb_x) # h: [batch_size, hidden_dim]
        tmp = self.block2(torch.cat((h, emb_x), dim=1)) # tmp: [batch_size, hidden_dim + 1]
        h, sigma = tmp[:, :-1], self.relu(tmp[:, -1]) # h: [batch_size, hidden_dim], sigma: [batch_size]
        h = self.block3(torch.cat((h, emb_d), dim=1)) # h: [batch_size, hidden_dim // 2]
        c = self.block4(h) # c: [batch_size, 3]
        return c, sigma

		

Training & Testing: The training loop repeatedly renders colors along camera rays, compares them to ground-truth images, and updates the NeRF model using gradient descent while periodically visualizing progress. The testing function renders full images by processing rays in small chunks for memory efficiency and saves reconstructed views to monitor NeRF training quality.

During training, the model samples batches of camera rays and renders predicted RGB values using volumetric ray marching through the NeRF network.
The predicted colors are compared with ground-truth pixel values using a mean squared error loss, and the network parameters are updated via backpropagation.
This process is repeated over multiple epochs with a learning-rate scheduler, while intermediate renderings are generated to monitor convergence and visual quality.
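The volumetric rendering step mentioned above relies on a render_rays function that is not shown here. As a rough illustration of what such a renderer computes, the sketch below implements the standard NeRF compositing rule in NumPy: each sample along a ray receives an opacity from its density, and colors are blended front to back. The function name composite and its argument layout are assumptions made for this example, not the project's API.

```python
import numpy as np


def composite(sigma, colors, deltas):
    """Alpha-composite per-sample colors along each ray (standard NeRF quadrature).

    sigma:  (N_rays, N_bins) densities
    colors: (N_rays, N_bins, 3) RGB values in [0, 1]
    deltas: (N_rays, N_bins) distances between consecutive samples
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance T_i: fraction of light that reaches sample i unabsorbed
    trans = np.cumprod(1.0 - alpha, axis=1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    weights = trans * alpha                            # (N_rays, N_bins)
    rgb = (weights[..., None] * colors).sum(axis=1)    # (N_rays, 3)
    return rgb, weights


# Usage: one ray with empty space, then an opaque red sample hiding a blue one
sigma = np.array([[0.0, 1e3, 1e3]])
colors = np.array([[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]])
deltas = np.ones((1, 3))
rgb, weights = composite(sigma, colors, deltas)
print(rgb)  # [[1. 0. 0.]]
```

Because the weights along a ray sum to at most 1 and the per-sample colors lie in [0, 1], the composited pixel also stays in [0, 1], which is the property the normalization check in the test function looks for.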

During testing, the trained NeRF renders full-resolution images by evaluating rays corresponding to each pixel in the target view.
To manage memory usage, rays are processed in small chunks and their predicted colors are accumulated to reconstruct the final image.
The rendered images are periodically saved, allowing qualitative evaluation of reconstruction quality and training progress.

			
import os

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F

# render_rays(model, rays_o, rays_d, hn, hf, nb_bins) is the project's volumetric renderer (not shown here)


def train(nerf_model, optimizer, scheduler, train_loader, test_loader,
          hn, hf, nb_bins, nb_epochs, output_dir, device):
    """Fit the NeRF on ray batches and render test views after every epoch."""
    os.makedirs(output_dir, exist_ok=True)
    training_loss = []
    for epoch in range(nb_epochs):
        epoch_loss = []
        for batch in train_loader:
            # 1. Load a batch of camera rays and their ground-truth RGB values
            rays_o = batch['rays_o'].to(device)
            rays_d = batch['rays_d'].to(device)
            rgb_gt = batch['target_px_values'].to(device)
            # 2. Predict RGB along each ray using NeRF rendering
            rgb_pred = render_rays(nerf_model, rays_o, rays_d, hn=hn, hf=hf, nb_bins=nb_bins)
            # 3. Compute loss between predicted and true pixel colors
            loss = F.mse_loss(rgb_pred, rgb_gt)
            # 4. Backpropagation and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss.append(loss.item())
        # 5. Update learning rate scheduler
        scheduler.step()
        # 6. Save the mean training loss of this epoch and render test images for visualization
        training_loss.append(float(np.mean(epoch_loss)))
        test(nerf_model, device, hn, hf, test_loader, nb_bins=nb_bins,
             epoch_idx=epoch, output=output_dir)
    return training_loss


@torch.no_grad()
def test(model, device, hn, hf, dataset, chunk_size=10, nb_bins=192, H=400, W=400, epoch_idx=0,
         output="/content/drive/MyDrive/NeRF_Data_Repository/output", lr=5e-4):
    """
    Render every view in `dataset` and save every 25th one as a PNG.

    Args:
        model: NeRF network passed to render_rays
        device: torch device used for rendering
        hn: near plane distance
        hf: far plane distance
        dataset: dataset to render; each item must hold the H * W rays of one image
        chunk_size (int, optional): image rows rendered per pass, for memory efficiency. Defaults to 10.
        nb_bins (int, optional): number of bins for density estimation. Defaults to 192.
        H (int, optional): image height. Defaults to 400.
        W (int, optional): image width. Defaults to 400.
        epoch_idx (int, optional): epoch number, used only in the file name. Defaults to 0.
        output (str, optional): directory for the saved images.
        lr (float, optional): learning rate, used only in the file name. Defaults to 5e-4.
    Returns:
        None
    """
    for idx, batch in enumerate(dataset):
        if idx == len(dataset):
            break
        ray_origins = batch['rays_o']
        ray_directions = batch['rays_d']
        data = []   # list of regenerated pixel values
        for i in range(int(np.ceil(H / chunk_size))):   # iterate over chunks of image rows
            # Get chunk of rays
            start, stop = i * W * chunk_size, (i + 1) * W * chunk_size
            ray_origins_ = ray_origins[start:stop].to(device)
            ray_directions_ = ray_directions[start:stop].to(device)
            regenerated_px_values = render_rays(model, ray_origins_, ray_directions_, hn=hn, hf=hf, nb_bins=nb_bins)
            if torch.any(regenerated_px_values > 1):
                print("Test Not Normalized")
            data.append(regenerated_px_values)
        img = torch.cat(data).cpu().numpy().reshape(H, W, 3)
        if idx % 25 == 0:
            plt.figure()
            plt.title("Test")
            plt.imshow(img)
            file_name = f'image_{idx}_epoch_{epoch_idx}_lr_{lr}.png'
            save_path = f'{output}/{file_name}'
            plt.savefig(save_path, bbox_inches='tight')
            plt.close()

		

Result

Metric	MAE [%]	RMSE [%]	PSNR [dB]
Value	1.33	3.11	29.1

The lightweight NeRF reconstructs tomato crop images with 1.1–1.6% mean absolute pixel error and 29–31 dB PSNR.

Mean Absolute Error (MAE) measures the average absolute pixel-wise difference between the ground-truth image I and the reconstructed image \hat{I}, where N is the number of pixel values.

MAE = \frac{1}{N}\sum_{i=1}^{N}\left|I_i-\hat{I}_i\right|

Root Mean Squared Error (RMSE) emphasizes outlier pixels, because squaring the differences weights larger errors more heavily.

RMSE=\sqrt{MSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(I_i-\hat{I}_i)^2}

Peak Signal-to-Noise Ratio (PSNR) is a logarithmic measure of reconstruction quality derived from the MSE; higher values indicate a better reconstruction.

PSNR = 20\log_{10}\left(\frac{MAX}{\sqrt{MSE}}\right), MAX = 255 for 8-bit pixel values (MAX = 1 if pixels are scaled to [0, 1])
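The three metrics can be computed together in a few lines. The sketch below is a NumPy illustration under stated assumptions: the function name image_metrics is hypothetical, and max_val defaults to 1.0 on the assumption that images are stored in [0, 1] (pass max_val=255 for 8-bit data).

```python
import numpy as np


def image_metrics(gt, pred, max_val=1.0):
    """Return (MAE, RMSE, PSNR in dB) for two images with values in [0, max_val]."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    err = gt - pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # PSNR is unbounded for a perfect reconstruction
    psnr = 20.0 * np.log10(max_val / rmse) if rmse > 0 else float("inf")
    return mae, rmse, psnr


# Usage: a reconstruction that is off by 0.1 at every pixel
gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
mae, rmse, psnr = image_metrics(gt, pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} PSNR={psnr:.1f} dB")  # MAE=0.100 RMSE=0.100 PSNR=20.0 dB
```

Multiplying MAE and RMSE by 100 / max_val expresses them as a percentage of full scale, which is one plausible reading of the percent units in the table.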

View Examples

Takeaways

GPU Memory Optimization:
NeRF training consumed significant GPU memory due to the large number of ray samples. I optimized batch sizes, sampling bins, and ray batching logic to prevent out-of-memory errors while maintaining model fidelity.
Color Normalization Issues:
Some input images had inconsistent RGB value ranges, which caused unstable loss behavior. I added data normalization checks and preprocessing in the data loader to ensure all pixel values were scaled to [0, 1].
Camera Pose Misalignment:
Misaligned 4×4 transformation matrices in the Blender dataset occasionally distorted 3D reconstructions. I resolved this by verifying pose consistency across frames and adjusting the pose-loading pipeline for accurate scene geometry.
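The normalization and pose checks described above are not shown in this write-up. The sketch below illustrates what such checks could look like in NumPy. Both function names are hypothetical, the rule "values above 1 mean 8-bit data" is an assumption, and the pose test only verifies that a matrix is a rigid transform, which is one possible notion of pose consistency.

```python
import numpy as np


def normalize_rgb(img):
    """Scale an image to float32 in [0, 1]; assumes 8-bit data when values exceed 1."""
    img = np.asarray(img, dtype=np.float32)
    if img.max() > 1.0:
        img = img / 255.0
    return np.clip(img, 0.0, 1.0)


def is_valid_pose(pose, tol=1e-4):
    """Check that a 4x4 camera-to-world matrix is a rigid transform."""
    pose = np.asarray(pose, dtype=np.float64)
    if pose.shape != (4, 4):
        return False
    R = pose[:3, :3]
    # A rotation matrix is orthonormal and has determinant +1
    orthonormal = np.allclose(R @ R.T, np.eye(3), atol=tol)
    proper = np.isclose(np.linalg.det(R), 1.0, atol=tol)
    # The last row of a homogeneous transform is [0, 0, 0, 1]
    last_row_ok = np.allclose(pose[3], [0.0, 0.0, 0.0, 1.0], atol=tol)
    return bool(orthonormal and proper and last_row_ok)


# Usage
pixels = normalize_rgb(np.array([[0, 128, 255]], dtype=np.uint8))
print(pixels.min(), pixels.max())
print(is_valid_pose(np.eye(4)), is_valid_pose(2.0 * np.eye(4)))
```

Running checks like these once at load time is cheap compared with training, and it surfaces a bad image or a scaled pose matrix before it shows up as an unstable loss or distorted geometry.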

Extended Goal

During my time at the Distributed Autonomous Systems Lab (DASLab), I proposed combining D-NeRF (Dynamic NeRF) and PaG-NeRF (Patch-based Generalizable NeRF) into a unified framework to improve both temporal adaptability and scene generalization. This approach aimed to extend NeRF’s capabilities from static scene reconstruction to dynamic, multi-environment perception, allowing robots to interpret and reconstruct time-varying scenes with higher efficiency and robustness.

PaG-NeRF

D-NeRF

PaG-NeRF improves generalization and training efficiency by learning NeRF representations from image patches rather than entire scenes. This allows the model to adapt quickly to new environments and viewpoints without retraining from scratch.

D-NeRF extends the original NeRF framework to handle dynamic scenes by modeling how 3D points change over time. It learns both the spatial structure and temporal deformation of objects, enabling realistic reconstruction of moving scenes from 2D images.

Skills

Programming: Python, PyTorch, NumPy, Matplotlib, PIL
Machine Learning: Deep Learning, Neural Network Implementation (MLP), Computer Vision
Tools: Google Colab, CUDA