===================== Point Cloud Encoders ===================== Flemme supports various point cloud encoders using MLP (1-D convolution), transformer and SSM as backbones. Flemme accommodates both non-hierarchical and hierarchical architectural designs for point cloud encoders and decoders. Non-Hierarchical ================ Non-hierarchical encoders and decoders follow a pipeline illustrated in following figure (similar to PointNet and DGCNN): .. image:: ../_static/en_de_pcd.png Note that only one of the components enclosed within red dashed box should be employed. PointNet -------- ``PointNetEncoder`` use **sharedMLP** as backbones and has a similar architecture to `PointNet `_. In addition, we also integrate dynamic local feature extraction (from `DGCNN `_) into ``PointNetEncoder``. To initialize a PointNet encoder, you need to specify the following parameters: .. code-block:: python :linenos: PointNetEncoder.__init__(self, point_dim=3, projection_channel = 64, time_channel = 0, ## parameter of KNN for local feature extraction. ## if num_neighbors_k is set as 0, local feature extraction will not be performed. num_neighbors_k=0, local_feature_channels = [64, 64, 128, 256], num_blocks = 2, dense_channels = [256, 256], building_block = 'dense', normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., z_count = 1, ## If vector_embedding is true, the latent embedding is a vector; ## otherwise, the latent embedding is point-wise features. vector_embedding = True, last_activation = True, **kwargs) In the above, if ``num_neighbors_k`` is specified, we will use edge convolution to extraction dynamic local features through KNN algorithm. Otherwise, local features will not be computed. If ``vector_embedding`` is set as ``True``, the encoder will return a latent vector for each inputs. Otherwise, the encoder will return point-wise features. ``PointNetDecoder`` is used to decode points from latent space through MLPs, which can used for reconstruction, segmentation, and other tasks. Typically, we dirrectly recover points from latent embeddings. In addition, we also integrate **Folding** operation (from `FoldingNet `_) into ``PointNetDecoder`` when you specify ``folding_times``. We support ``grid2d``, ``grid3d``, ``cylinder``, ``uvsphere`` and ``icosphere`` as base shapes for folding operation. Note that ``PointNetDecoder`` can be used for all point cloud encoders. .. code-block:: python :linenos: PointNetDecoder.__init__(self, point_dim=3, point_num = 2048, ## out_channel of encoder is the in channel of decoder in_channel = 256, dense_channels = [256], time_channel = 0, normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., ## if folding_times > 0, we will perform folding operations. folding_times = 0, ## channel of hidden layers in folding operations. folding_hidden_channels = [512, 512], base_shape_config = {}, num_blocks = 2, ## If folding_times = 0, we process results through a final MLPs. final_channels = [512, 512], ## Note that vector_embedding need to be True if you want to perform folding operations. vector_embedding = True, **kwargs) The aboved parameters can be defined in the config file, in which the ``in_channel`` and ``out_channel`` refer to the ``point_dim`` of encoder and decoder respectively., ``decoder_fc_channel`` refers to ``fc_channel`` of decoder. As we discussed before, you don't need to define all parameters for encoder and decoder in the configuration file. .. code-block:: yaml :linenos: encoder: name: PointNet in_channel: 3 out_channel: 46 point_num: 2048 building_block: dense # num_neighbors_k: 20 local_feature_channels: [64, 64, 128, 256, 512] dense_channels: [1024, 512, 256] final_channels: [512, 512] activation: lrelu normalization: group Supported ``building_block`` for PointNet encoder and decoder: ``[dense, res_dense, double_dense]``. PointTrans ---------- PointTrans indicates the encoders have a similar architecture as PointNet encoder but using **transformer** as backbones. To initialize a ``PointTransEncoder``, you need to specify the following parameters: .. code-block:: python :linenos: PointTransEncoder.__init__(self, point_dim=3, projection_channel = 64, time_channel = 0, num_neighbors_k=0, local_feature_channels = [64, 64, 128, 256], num_blocks = 2, dense_channels = [256, 256], building_block = 'pct_sa', normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., ### transformer parameters num_heads = 4, d_k = None, qkv_bias = True, qk_scale = None, atten_dropout = None, residual_attention = False, skip_connection = True, z_count = 1, vector_embedding = True, last_activation = True, **kwargs) Note that ``PointTransDecoder`` is just an alias of ``PointNetDecoder``. Supported ``building_block`` for PointTrans encoder: ``[pct_sa, pct_oa]``, with ``pct_sa`` denoting self-attention and ``pct_oa`` denoting offset-attention from `PCT `_, respectively. PointMamba ---------- PointMamba indicates the encoders using **state space model (mamba)** as backbones. To initialize a ``PointMambaEncoder``, you need to specify the following parameters: .. code-block:: python :linenos: PointMambaEncoder.__init__(self, point_dim=3, projection_channel = 64, time_channel = 0, num_neighbors_k=0, local_feature_channels = [64, 64, 128, 256], num_blocks = 2, dense_channels = [256, 256], building_block = 'pmamba', normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., state_channel = 64, conv_kernel_size = 4, inner_factor = 2.0, head_channel = 64, conv_bias=True, bias=False, learnable_init_states = True, chunk_size=256, dt_min=0.001, A_init_range=(1, 16), dt_max=0.1, dt_init_floor=1e-4, dt_rank = None, dt_scale = 1.0, z_count = 1, vector_embedding = True, last_activation = True, skip_connection = True, **kwargs) Note that ``PointMambaDecoder`` is just an alias of ``PointNetDecoder``. Supported ``building_block`` for PointMamba encoder: ``[pmamba, pmamba2]``. Hierarchical ============ Hierarchical point cloud encoders and decoders follow a pipeline illustrated in following figure (similar to PointNet++): .. image:: ../_static/en_de_pcd2.png In hierarchical encoders, we down-sample point cloud into different levels through the farthest-point sampling (FPS) and aggregate points through multi-scale neighbor queries. We will elaborate supported hierarchical encoders in the remainder of this article. PointNet2 ---------- PointNet2 has a hierarchical architecture with sharedMLP backbones. To define a ``PointNet2Encoder``, you need to specify the following parameters: .. code-block:: python :linenos: PointNet2Encoder.__init__(self, point_dim = 3, projection_channel = 64, time_channel = 0, ## number of fps points for each level num_fps_points = [1024, 512, 256, 64], ## number of neighbor query for each level num_neighbors_k = 32, ## max radius of radius query for each level neighbor_radius = [0.1, 0.2, 0.4, 0.8], ## number of feature channels for each level fps_feature_channels = [128, 256, 512, 1024], num_blocks = 2, ## number of scales ## the radius and number of feature channels at each scale will be assigned based on the neighbor_radius and fps_feature_channels num_scales = 2, ## concat point position embedding in feature extraction use_xyz = True, ## sort the radius query results by distance (knn query returns a sorted result by default). sorted_query = False, ## use knn query or radius query. ## if knn_query is not false, it should be one of ['xyz', 'xyz_embed', 'feature'], which determine the knn query space. knn_query = False, dense_channels = [1024], building_block = 'dense', normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., vector_embedding = True, ## if the decoder is a pointnet2-like decoder, the feature list will also be returned. is_point2decoder = False, z_count = 1, return_xyz = False, last_activation = True, ## final concatenation of sample features at different levels final_concat = False, ## enable positional embedding pos_embedding = False, **kwargs) PointTrans2 ----------- PointTrans2 has a hierarchical architecture with **transformer** backbones. To define a ``PointTrans2Encoder``, you need to specify the following parameters: .. code-block:: python :linenos: PointTrans2Encoder.__init__(self, point_dim = 3, projection_channel = 64, time_channel = 0, num_fps_points = [1024, 512, 256, 64], num_neighbors_k = 32, neighbor_radius = [0.1, 0.2, 0.4, 0.8], fps_feature_channels = [128, 256, 512, 1024], num_blocks = 2, num_scales = 2, use_xyz = True, sorted_query = False, knn_query = False, dense_channels = [1024], building_block = 'dense', normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., num_heads = 4, d_k = None, qkv_bias = True, qk_scale = None, atten_dropout = None, residual_attention = False, skip_connection = True, vector_embedding = True, is_point2decoder = False, ## Perform long range modeling on a global scale. ## Long range modeling is only suitable for sequence-modeling backbones, e.g., transformer and ssm. long_range_modeling = False, z_count = 1, return_xyz = False, last_activation = True, final_concat = False, pos_embedding = False, **kwargs) PointMamba2 ----------- PointMamba2 has a hierarchical architecture with **state space model (mamba)** backbones. A more detailed illustration of PointMamba2 Encoder with xyz¢er scanning and long-range modeling is presented in: .. image:: ../_static/pointmamba2.png To define a ``PointMamba2Encoder``, you need to specify the following parameters: .. code-block:: python :linenos: PointMamba2Encoder.__init__(self, point_dim = 3, projection_channel = 64, time_channel = 0, num_fps_points = [1024, 512, 256, 64], num_neighbors_k = 32, neighbor_radius = [0.1, 0.2, 0.4, 0.8], fps_feature_channels = [128, 256, 512, 1024], num_blocks = 2, num_scales = 2, use_xyz = True, sorted_query = False, knn_query = False, dense_channels = [1024], building_block = 'dense', flip_scan = False, normalization = 'group', num_norm_groups = 8, activation = 'lrelu', dropout = 0., vector_embedding = True, state_channel = 64, conv_kernel_size = 4, inner_factor = 2.0, head_channel = 64, conv_bias=True, bias=False, learnable_init_states = True, chunk_size=256, dt_min=0.001, A_init_range=(1, 16), dt_max=0.1, dt_init_floor=1e-4, dt_rank = None, dt_scale = 1.0, skip_connection = True, is_point2decoder = False, long_range_modeling = False, ## point serialization ## scan strategies should be a list whose elements are from ['x_order', 'y_order', 'z_order', 'center_dist', 'nonsort'] scan_strategies = None, z_count = 1, return_xyz = False, last_activation = True, final_concat = False, pos_embedding = False, **kwargs) To summarize, we support the following point cloud encoders: ======================== ==================== ============================== =========================== =========== Encoder Backbones Building Blocks Long-range Modeling Scanning ======================== ==================== ============================== =========================== =========== PointNet MLP dense, double_dense, res_dense × × PointNet2 MLP dense, double_dense, res_dense × × PointTrans transformer pct_sa, pct_oa × × PointTrans2 transformer pct_sa, pct_oa √ × PointMamba ssm (mamba) pmamba, pmamba2 × × PointMamba2 ssm (mamba) pmamba, pmamba2 √ √ ======================== ==================== ============================== =========================== ===========