A Brief Analysis of VPP Development

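Software Framework

The VPP software framework is divided into four main layers:

Plugins: a growing collection of data-plane plugins; each plugin can be thought of as a small app.
VNET: works with VPP's network interfaces (L2-4), performs session and traffic management, and cooperates with devices and the data control plane.
VLIB: the vector processing library. The VLIB layer also handles various application management functions, such as buffer, memory, and graph management, maintaining and exporting counters, thread management, and packet tracing. VLIB also implements the debug CLI.
VPP Infra: the VPP infrastructure layer, which contains the core library source code. It provides basic general-purpose function libraries, including memory management, vector operations, hash, timer, and so on.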

Core Concepts

Node

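VPP builds a Graph by chaining Nodes together, and packets flow through that graph to be processed. A Node is the smallest logical unit of packet processing: each Node represents one piece of packet-processing logic.

When chaining a Node into the graph, three things matter:

The previous node (pre_node).
The packet processing itself.
The next Node after processing completes.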
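
To make the three points above concrete, here is a minimal stand-alone sketch of a node graph. It is a toy model, not VPP code: every type and name (toy_node_t, toy_dispatch, and so on) is hypothetical, and it handles one packet at a time, whereas VPP nodes process whole vectors of packets.

```c
#include <assert.h>

/* Toy model only: none of these types or names come from VPP. */
#define TOY_MAX_NEXT 4
#define TOY_DONE (-1)

typedef struct
{
  int ethertype; /* decides which way eth-input forwards the packet */
  int hops;      /* number of nodes that have touched the packet */
} toy_packet_t;

typedef struct
{
  const char *name;
  int (*fn) (toy_packet_t *p);  /* returns a slot in next_nodes[] or TOY_DONE */
  int n_next_nodes;
  int next_nodes[TOY_MAX_NEXT]; /* slot -> index into toy_graph[] */
} toy_node_t;

enum { NODE_ETH_INPUT, NODE_IP4_INPUT, NODE_DROP, N_NODES };

static int eth_input_fn (toy_packet_t *p)
{
  p->hops++;
  return p->ethertype == 0x0800 ? 0 : 1; /* slot 0: ip4-input, slot 1: drop */
}
static int ip4_input_fn (toy_packet_t *p) { p->hops++; return TOY_DONE; }
static int drop_fn (toy_packet_t *p) { p->hops++; return TOY_DONE; }

static const toy_node_t toy_graph[N_NODES] = {
  [NODE_ETH_INPUT] = { "eth-input", eth_input_fn, 2,
                       { NODE_IP4_INPUT, NODE_DROP } },
  [NODE_IP4_INPUT] = { "ip4-input", ip4_input_fn, 0, { 0 } },
  [NODE_DROP] = { "drop", drop_fn, 0, { 0 } },
};

/* Walk the graph from `start`; return the index of the last node visited. */
static int toy_dispatch (int start, toy_packet_t *p)
{
  int cur = start;
  for (;;)
    {
      const toy_node_t *n = &toy_graph[cur];
      int slot = n->fn (p);
      if (slot == TOY_DONE)
        return cur;
      assert (slot >= 0 && slot < n->n_next_nodes);
      cur = n->next_nodes[slot];
    }
}
```

In this model, adding a new node means giving some previous node a slot that points at it, which is the same concern the static approach in VPP has to deal with.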

Key macros for registering a node:

/* Node registration macro */
VLIB_REGISTER_NODE (gtpu4_input_node) = {
    .name = "gtpu4-input",                    // name must be unique
    /* Takes a vector of packets. */
    .vector_size = sizeof (u32),

    .n_errors = GTPU_N_ERROR,
    .error_strings = gtpu_error_strings,

    .n_next_nodes = GTPU_INPUT_N_NEXT,       // number of next nodes that can be dispatched to
    .next_nodes = {
#define _(s,n) [GTPU_INPUT_NEXT_##s] = n,    // the next nodes that can be dispatched to
    foreach_gtpu_input_next
#undef _
    },

...

/* Macro that registers the node's processing function. Note: gtpu4_input_node
   must match the name used in VLIB_REGISTER_NODE. */
VLIB_NODE_FN (gtpu4_input_node) (vlib_main_t * vm,
             vlib_node_runtime_t * node,
             vlib_frame_t * from_frame)
{
    return gtpu_input(vm, node, from_frame, /* is_ip4 */ 1);
}

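How a Node gets chained into the Graph mainly comes down to its previous node (pre_node).

Static approach: modify the next_nodes of the previous node and adjust its processing function accordingly, so that packets can flow to this Node.
Built-in hook functions:

Without the Feature mechanism, the relationships between nodes are relatively static: they are fixed at compile time and cannot be changed at run time, and a node can only be inserted at the few entry points the system provides. Once a node is registered at one of these entry points, subsequent traffic is delivered to it. There may be other functions for inserting nodes; only the commonly used ones are listed here:
- L1
vnet_hw_interface_rx_redirect_to_node (vnet_main_t *vnm, u32 hw_if_index, u32 node_index) redirects the rx traffic of a hw interface to a node; node_index is the node's index.
- L2, L3
ethernet_register_input_type (vlib_main_t *vm, ethernet_type_t type, u32 node_index) inserts a node for a specific type after the ethernet-input node. Here type covers layer 2 and layer 3 protocols such as ethernet_type (0x806, ARP), ethernet_type (0x8100, VLAN), and ethernet_type (0x800, IP4). See src/vnet/ethernet/types.def for the supported protocols.
- L4
ip4_register_protocol (u32 protocol, u32 node_index) inserts a node for a specific protocol after the ip4-local node. Here protocol covers layer 4 protocols such as ip_protocol (6, TCP) and ip_protocol (17, UDP). See src/vnet/ip/protocols.def for the supported protocols.
- L5
udp_register_dst_port (vlib_main_t * vm, udp_dst_port_t dst_port, u32 node_index, u8 is_ip4) inserts a node for a specific dst_port after the ip4-udp-lookup node. Here dst_port covers layer 5 application ports such as ip_port (WWW, 80). See src/vnet/ip/ports.def for the supported ports.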
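
The hook functions above all follow the same idea: record "for this key, hand packets to this node" in a lookup table that the upstream node consults. The sketch below models that idea for UDP destination ports. It is a stand-alone toy, not VPP's implementation; all toy_ names are hypothetical, and the node index 7 used in the usage note is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NO_NODE UINT32_MAX /* marker: nobody registered for this port */
#define TOY_N_PORTS 65536

/* One slot per UDP destination port, holding the index of the node to run. */
static uint32_t toy_next_by_dst_port[TOY_N_PORTS];
static int toy_table_ready;

/* Mark every port as unclaimed; must be called once before use. */
static void toy_table_init (void)
{
  for (uint32_t i = 0; i < TOY_N_PORTS; i++)
    toy_next_by_dst_port[i] = TOY_NO_NODE;
  toy_table_ready = 1;
}

/* Plays the role of a registration hook: claim a port for a node. */
static void toy_register_dst_port (uint16_t dst_port, uint32_t node_index)
{
  assert (toy_table_ready);
  toy_next_by_dst_port[dst_port] = node_index;
}

/* What the upstream lookup node would do for each packet. */
static uint32_t toy_lookup_dst_port (uint16_t dst_port)
{
  assert (toy_table_ready);
  return toy_next_by_dst_port[dst_port];
}
```

After toy_table_init (), calling toy_register_dst_port (2152, 7) makes toy_lookup_dst_port (2152) return 7, while unregistered ports still return TOY_NO_NODE (2152 is the GTP-U port).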

Feature

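VPP's static Node framework is fairly rigid: the logical connections between Nodes are fixed. Newer versions therefore add the Feature mechanism. A Feature is still essentially a node, but one that can be switched on or off by command at run time, which changes the path the traffic takes.

A new Feature, that is, a node we create, must belong to an Arc and act on an Interface. The command set interface feature <intfc> <feature_name> arc <arc_name> [disable] enables or disables the Feature. An Arc is usually named after its start node, and enabling or disabling a Feature dynamically changes the flow of data.

If you add a node through the Feature mechanism, note the following:

VPP provides many Arcs, and we need to choose a suitable one to insert our node into: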

nsh-output
mpls-output
mpls-input
ip6-drop
ip6-punt
ip6-local
ip6-output
ip6-multicast
ip6-unicast
ip4-drop
ip4-punt
ip4-local
ip4-output
ip4-multicast
ip4-unicast
ethernet-output
interface-output
device-input
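
The per-interface on/off behaviour of a Feature on an Arc can be sketched with a small stand-alone model. This is only an illustration under simplifying assumptions: the feature order is hard-coded (VPP derives it from ordering constraints such as runs_before), FEAT_HYPOTHETICAL is an invented second feature, and none of the toy_ names exist in VPP.

```c
#include <assert.h>
#include <stdint.h>

/* Features on one toy arc, in the order packets would visit them.
   The last entry stands for the node where the arc ends and is always on. */
enum { FEAT_TEST0214, FEAT_HYPOTHETICAL, FEAT_ETHERNET_INPUT, N_FEATS };

#define TOY_N_INTERFACES 4

/* One bit per optional feature, per interface. */
static uint32_t toy_enabled[TOY_N_INTERFACES];

/* Turn one optional feature on or off for one interface. */
static void toy_feature_enable_disable (uint32_t sw_if_index, int feat,
                                        int enable)
{
  assert (sw_if_index < TOY_N_INTERFACES);
  assert (feat >= 0 && feat < FEAT_ETHERNET_INPUT);
  if (enable)
    toy_enabled[sw_if_index] |= 1u << feat;
  else
    toy_enabled[sw_if_index] &= ~(1u << feat);
}

/* Next node a packet from this interface visits after position `after`
   (pass -1 for the start of the arc). */
static int toy_next_feature (uint32_t sw_if_index, int after)
{
  assert (sw_if_index < TOY_N_INTERFACES);
  for (int f = after + 1; f < FEAT_ETHERNET_INPUT; f++)
    if (toy_enabled[sw_if_index] & (1u << f))
      return f;
  return FEAT_ETHERNET_INPUT;
}
```

With nothing enabled, toy_next_feature (1, -1) goes straight to FEAT_ETHERNET_INPUT; after toy_feature_enable_disable (1, FEAT_TEST0214, 1), packets from interface 1 visit FEAT_TEST0214 first, while other interfaces are unaffected.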

Feature-related macros:

/* Feature registration macro */
VNET_FEATURE_INIT (test0214, static) =
{
    .arc_name = "device-input",
    .node_name = "test0214",                         // the feature's node; must match the node's .name
    .runs_before = VNET_FEATURES ("ethernet-input"),
};

/* CLI command that enables or disables the feature */
VLIB_CLI_COMMAND (test0214_enable_disable_command, static) =
{
    .path = "test0214 enable-disable",
    .short_help = "test0214 enable-disable <interface-name> [disable]",
    .function = test0214_enable_disable_command_fn,
};

/* Registration of the feature's node */
VLIB_REGISTER_NODE (test0214_node) = 
{
    .name = "test0214",
    .vector_size = sizeof (u32),
    .format_trace = format_test0214_trace,
    .type = VLIB_NODE_TYPE_INTERNAL,
  
    .n_errors = ARRAY_LEN(test0214_error_strings),
    .error_strings = test0214_error_strings,
    
    .n_next_nodes = TEST0214_N_NEXT,

    /* edit / add dispositions here */
    .next_nodes = {
        [TEST0214_NEXT_INTERFACE_OUTPUT] = "interface-output",
    },
};

/* Functions that enable or disable the feature */
static clib_error_t *
test0214_enable_disable_command_fn (vlib_main_t * vm,
                                   unformat_input_t * input,
                                   vlib_cli_command_t * cmd)
{
    test0214_main_t * tmp = &test0214_main;
    u32 sw_if_index = ~0;
    int enable_disable = 1;
    int rv;
    while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
    {
        if (unformat (input, "disable"))
            enable_disable = 0;
        else if (unformat (input, "%U", unformat_vnet_sw_interface,
                         tmp->vnet_main, &sw_if_index))
            ;
        else
            break;
    }
    if (sw_if_index == ~0)
        return clib_error_return (0, "Please specify an interface...");

    rv = test0214_enable_disable (tmp, sw_if_index, enable_disable);

...
    return 0;
}

int test0214_enable_disable (test0214_main_t * tmp, u32 sw_if_index,
                                   int enable_disable)
{
    vnet_sw_interface_t * sw;
    int rv = 0;

    /* Utterly wrong? */
    if (pool_is_free_index (tmp->vnet_main->interface_main.sw_interfaces,
                          sw_if_index))
        return VNET_API_ERROR_INVALID_SW_IF_INDEX;

    /* Not a physical port? */
    sw = vnet_get_sw_interface (tmp->vnet_main, sw_if_index);
    if (sw->type != VNET_SW_INTERFACE_TYPE_HARDWARE)
        return VNET_API_ERROR_INVALID_SW_IF_INDEX;

    test0214_create_periodic_process (tmp);

    vnet_feature_enable_disable ("device-input", "test0214",
                                 sw_if_index, enable_disable, 0, 0);
...
}

Plugin

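A Plugin provides a more complete set of functionality, comparable to a small app.

Key macros for registering a Plugin:

/* Plugin init function; almost all initialization happens here, including hooking nodes into the graph */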
VLIB_INIT_FUNCTION (test0214_init);

/* The associated feature */
VNET_FEATURE_INIT (test0214, static) =
{
    .arc_name = "device-input",
    .node_name = "test0214",
    .runs_before = VNET_FEATURES ("ethernet-input"),
};

/* Register the plugin */
VLIB_PLUGIN_REGISTER () =
{
    .version = VPP_BUILD_VER,
    .description = "test0214 plugin description goes here",
};

Note: for a thorough explanation of the VLIB_INIT_FUNCTION macro, see https://blog.csdn.net/qq_39965097/article/details/103726055

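How to Create Your Own Plugin

VPP provides a script that generates the basic skeleton of a plugin, so we can use it directly.

The script relies on emacs; install it first if it is not already on your system:

sudo apt update
sudo apt install -y emacs

You need to provide two settings:

The plugin name.
The dispatch type: dual (dual-single loop pair) or qs (quad-single loop pair).

The commands are as follows: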

$ cd ./src/plugins
$ ../../extras/emacs/make-plugin.sh
<snip>
Loading /scratch/vpp-docs/extras/emacs/tunnel-c-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/tunnel-decap-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/tunnel-encap-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/tunnel-h-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/elog-4-int-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/elog-4-int-track-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/elog-enum-skel.el (source)...
Loading /scratch/vpp-docs/extras/emacs/elog-one-datum-skel.el (source)...
Plugin name: test0214
Dispatch type [dual or qs]: dual
(Shell command succeeded with no output)

OK...

I am not yet clear on how much the two dispatch types differ, so I chose dual for now; you can modify the plugin later to suit your own workload.

Generated files:

$ cd ./test0214
$ ls
CMakeLists.txt  node.c  setup.pg  test0214.api  test0214.c  test0214.h  test0214_periodic.c  test0214_test.c

Rebuild the plugin

$ cd <top-of-workspace>
$ make rebuild [or rebuild-release]

Verify that the plugin loads

vpp# show plugins test0214
 Plugin path is: /usr/lib/x86_64-linux-gnu/vpp_plugins:/usr/lib/vpp_plugins

     Plugin                                   Version                          Description
 ......
 14. test0214_plugin.so                       1.0-release                      test0214 plugin description goes here
 ......

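If your plugin appears in the output above, it has been built and loaded correctly and is ready to use.

Testing the Plugin

The generated plugin already implements the following:

It registers a process node that listens for the event that turns the plugin on or off (MYPLUGIN_EVENT_PERIODIC_ENABLE_DISABLE); the event is triggered from the command line (VLIB_CLI_COMMAND (myplugin_enable_disable_command, static)). The plugin only does work once it has been enabled here.
It registers an internal node that runs before the ethernet-input node.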

VLIB_REGISTER_NODE (test0214_node) = 
{
    .name = "test0214",
    .vector_size = sizeof (u32),
    .format_trace = format_test0214_trace,
    .type = VLIB_NODE_TYPE_INTERNAL,
  
    .n_errors = ARRAY_LEN(test0214_error_strings),
    .error_strings = test0214_error_strings,

    .n_next_nodes = TEST0214_N_NEXT,

    /* edit / add dispositions here */
    .next_nodes = {
        [TEST0214_NEXT_INTERFACE_OUTPUT] = "interface-output",
    },
};

VNET_FEATURE_INIT (test0214, static) =
{
    .arc_name = "device-input",
    .node_name = "test0214",
    .runs_before = VNET_FEATURES ("ethernet-input"),
};

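The internal node's implementation function (VLIB_NODE_FN (myplugin_node)) takes the packets received from the input node, swaps their source and destination MAC addresses, and sends them back out of the port they arrived on. The code is worth reading on your own.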

VLIB_NODE_FN (test0214_node) (vlib_main_t * vm,
		  vlib_node_runtime_t * node,
		  vlib_frame_t * frame)
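
The per-packet work described above, swapping the source and destination MAC addresses, can be sketched in isolation. This is a stand-alone illustration rather than the generated plugin's code; toy_ethernet_header_t and toy_swap_macs are hypothetical names, and the struct only mirrors the standard 14-byte Ethernet header layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standard Ethernet header layout: destination MAC, source MAC, ethertype. */
typedef struct
{
  uint8_t dst_address[6];
  uint8_t src_address[6];
  uint16_t type;
} toy_ethernet_header_t;

/* Swap source and destination MAC addresses in place. */
static void toy_swap_macs (toy_ethernet_header_t *en)
{
  uint8_t tmp[6];
  assert (en != NULL);
  memcpy (tmp, en->src_address, sizeof (tmp));
  memcpy (en->src_address, en->dst_address, sizeof (tmp));
  memcpy (en->dst_address, tmp, sizeof (tmp));
}
```

In a real node this would be applied to the header at the buffer's current data pointer for each packet in the frame.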

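How VPP Moves Packets Between Nodes

Packet flow in VPP comes down to two things:

How to get hold of the packets and process them correctly according to your own logic.
How to send the processed packets to the desired next node.

Key functions:

Getting packets:

vlib_frame_vector_args (frame): returns the start address of the vector (of buffer indices) this Node received.
vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next): returns the first free slot in the next Node's frame (to_next) and the number of free slots left (n_left_to_next).
vlib_get_buffer (vm, bi0): translates a buffer index into its vlib_buffer_t.
vlib_buffer_get_current (b0): returns a pointer to the current data inside the vlib_buffer_t.

Sending processed packets to the desired next node: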

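vlib_validate_buffer_enqueue_x1 (vm, node, next_index, to_next, n_left_to_next, bi0, next0)

    Checks the speculative enqueue against the real next node. next_index is the index of the default (speculated) next node; next0 is the index of the actual next node.

    If next0 == next_index, nothing special is needed and the packet goes to the next node as already enqueued.

    If next0 != next_index, the packet is removed from the frame for next_index and added to the frame for next0, and next_index is adjusted accordingly.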
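
The speculate-then-validate pattern can be modelled in a few lines of stand-alone C. This is a simplified sketch of the behaviour described above as I understand it, not the VPP macro itself: frames are fixed-size arrays, all toy_ names are hypothetical, and there is no handling of a full frame.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FRAME_SIZE 8
#define TOY_N_NEXT 2

typedef struct
{
  uint32_t buffers[TOY_FRAME_SIZE]; /* buffer indices queued for a next node */
  uint32_t n_vectors;               /* how many of them are committed */
} toy_frame_t;

static toy_frame_t toy_frames[TOY_N_NEXT];

/* Like vlib_get_next_frame: expose the free tail of a next node's frame. */
static void toy_get_next_frame (uint32_t next_index, uint32_t **to_next,
                                uint32_t *n_left)
{
  assert (next_index < TOY_N_NEXT);
  toy_frame_t *f = &toy_frames[next_index];
  *to_next = f->buffers + f->n_vectors;
  *n_left = TOY_FRAME_SIZE - f->n_vectors;
}

/* Like vlib_put_next_frame: commit everything written so far. */
static void toy_put_next_frame (uint32_t next_index, uint32_t n_left)
{
  assert (next_index < TOY_N_NEXT && n_left <= TOY_FRAME_SIZE);
  toy_frames[next_index].n_vectors = TOY_FRAME_SIZE - n_left;
}

/* The caller has already written bi0 to the frame for *next_index.
   If the real next node differs, take it back and enqueue it there. */
static void toy_validate_enqueue_x1 (uint32_t *next_index, uint32_t **to_next,
                                     uint32_t *n_left, uint32_t bi0,
                                     uint32_t next0)
{
  if (next0 == *next_index)
    return; /* speculation was right, nothing to do */

  /* Give back the slot used speculatively, then commit the old frame. */
  toy_put_next_frame (*next_index, *n_left + 1);

  /* Switch to the frame of the real next node and enqueue there. */
  *next_index = next0;
  toy_get_next_frame (*next_index, to_next, n_left);
  assert (*n_left > 0);
  (*to_next)[0] = bi0;
  *to_next += 1;
  *n_left -= 1;
}
```

The design bets that consecutive packets usually go to the same next node, so the common case costs a single comparison.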

Custom per-packet metadata:

STATIC_ASSERT (sizeof (upf_buffer_opaque_t) <= STRUCT_SIZE_OF (vnet_buffer_opaque2_t, unused), "upf_buffer_opaque_t too large for vnet_buffer_opaque2_t");

#define upf_buffer_opaque(b) ((upf_buffer_opaque_t *)((u8 *)((b)->opaque2) + STRUCT_OFFSET_OF (vnet_buffer_opaque2_t, unused)))
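
The overlay technique used for the custom metadata, a compile-time size check plus an accessor that points into a spare region of the buffer, can be tried out with stand-in types. The structs below are invented for illustration and do not match the real vlib_buffer_t or vnet_buffer_opaque2_t layouts; only the pattern is the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a packet buffer with an opaque scratch area. */
typedef struct
{
  uint32_t flags;
  uint32_t opaque2[14];
} toy_buffer_t;

/* Stand-in for the shared view of that area: some words are owned by
   other code, the rest is spare room a plugin may use. */
typedef struct
{
  uint32_t owned_by_others[2];
  uint32_t unused[12];
} toy_opaque2_t;

/* Hypothetical plugin metadata carried with each packet. */
typedef struct
{
  uint32_t session_index;
  uint32_t teid;
  uint16_t data_offset;
  uint8_t flags;
  uint8_t qfi;
} toy_meta_t;

/* Fail the build, not the run, if the metadata outgrows the spare room. */
_Static_assert (sizeof (toy_opaque2_t) <= sizeof (((toy_buffer_t *) 0)->opaque2),
                "toy_opaque2_t too large for opaque2");
_Static_assert (sizeof (toy_meta_t) <= sizeof (((toy_opaque2_t *) 0)->unused),
                "toy_meta_t too large for the unused area");

/* Accessor: view the unused part of opaque2 as plugin metadata. */
static toy_meta_t *toy_meta (toy_buffer_t *b)
{
  assert (b != NULL);
  return (toy_meta_t *) ((uint8_t *) b->opaque2
                         + offsetof (toy_opaque2_t, unused));
}
```

Because the metadata lives inside the buffer, whatever one node stores through toy_meta (b) is still there when a later node receives the same buffer.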
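vlib_put_next_frame (vm, node, next_index, n_left_to_next): once all packets have been processed, the next node's frame holds the buffer indices this node has handled. Calling this function records the frame in a vlib_pending_frame_t so that it is ready to be scheduled.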

always_inline uword
gtpu_input (vlib_main_t *vm, vlib_node_runtime_t *node,
            vlib_frame_t *from_frame, u8 is_ip4)
{
  u32 n_left_from, next_index, *from, *to_next;
  // flowtable_main_t* fm = &flowtable_main;
  clib_bihash_kv_16_8_t last_key4;
  clib_bihash_kv_24_8_t last_key6;
  u32 pkts_decapsulated = 0;

  if (is_ip4)
    memset (&last_key4, 0xff, sizeof (last_key4));
  else
    memset (&last_key6, 0xff, sizeof (last_key6));

  // Start of the vector of buffer indices in the received frame
  from = vlib_frame_vector_args (from_frame);
  n_left_from = from_frame->n_vectors;
  // Next-node index cached from the previous run of this node
  next_index = node->cached_next_index;

  upf_trace ("######gtpu_input#####\n");
  // Process packets one at a time
  while (n_left_from > 0)
    {
      u32 n_left_to_next;
      // to_next: first free slot in the frame of the node selected by next_index
      // n_left_to_next: number of free slots left in that frame
      vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next);

      while (n_left_from > 0 && n_left_to_next > 0)
        {
          u32 bi0;
          vlib_buffer_t *b0;
          // u32 is_reverse0;
          // next0: index of the next node for this packet
          u32 next0 = GTPU_INPUT_NEXT_DROP;
          ip4_header_t *ip4_0;
          ip6_header_t *ip6_0;
          gtpu_header_t *gtpu0;
          u32 gtpu_hdr_len0 = 0;
          u32 session_index0;
          u32 error0;
          u16 hdr_len0;
          pdu_sess_info_t *pdu_sess_info_p = NULL;
          upf_session_t *sx;

          bi0 = from[0];
          to_next[0] = bi0;
          from += 1;
          to_next += 1;
          n_left_from -= 1;
          n_left_to_next -= 1;
          // Translate the buffer index into a vlib_buffer_t pointer
          b0 = vlib_get_buffer (vm, bi0);

          /* udp leaves current_data pointing at the gtpu header */
          // Pointer to the current data in the buffer, similar to rte_pktmbuf_mtod
          gtpu0 = vlib_buffer_get_current (b0);
          hdr_len0 = is_ip4 ? sizeof (*ip4_0) : sizeof (*ip6_0);
          hdr_len0 += sizeof (udp_header_t);

          if (is_ip4)
            {
              // Rewind b0 to the outer IP header of the original packet
              vlib_buffer_advance (
                  b0, -(word) (sizeof (udp_header_t) + sizeof (ip4_header_t)));
              ip4_0 = vlib_buffer_get_current (b0);
            }
          else
            {
              vlib_buffer_advance (
                  b0, -(word) (sizeof (udp_header_t) + sizeof (ip6_header_t)));
              ip6_0 = vlib_buffer_get_current (b0);
            }

          session_index0 = ~0;
          error0 = 0;
          // PREDICT_FALSE is the equivalent of unlikely()
          if (PREDICT_FALSE ((gtpu0->ver_flags & GTPU_VER_MASK) !=
                             GTPU_V1_VER))
            {
              error0 = GTPU_ERROR_BAD_VER;
              next0 = GTPU_INPUT_NEXT_DROP;
              goto trace00;
            }
... ...
          session_index0 = upf_buffer_opaque (b0)->upf.session_index;
          if (session_index0 == ~0)
            {
              next0 = GTPU_INPUT_NEXT_DROP;
              goto trace00;
            }
          sx = pool_elt_at_index (upf_main.sessions, session_index0);

          /* Manipulate gtpu header */
          if ((gtpu0->ver_flags & GTPU_E_S_PN_BIT) != 0)
            {

              /* Manipulate Sequence Number and N-PDU Number */
              /* TBD */

              /* Manipulate Next Extension Header */
              if (gtpu0->ver_flags & 0x04)
                {
                  u8 gtp_ex_hdr_type;
                  u8 gtp_ex_hdr_len;
                  u16 total_gtp_ex_hdr_len = 0;
                  u8 *gtp_ex_hdr_p;

                  gtp_ex_hdr_type = gtpu0->next_ext_type;
                  gtp_ex_hdr_p = (u8 *)gtpu0 + sizeof (gtpu_header_t);

                  while (gtp_ex_hdr_type)
                    {
                      // printf("%s:%d gtp_ex_hdr_type:%x\n",
                      // __func__,__LINE__,gtp_ex_hdr_type);
                      gtp_ex_hdr_len = *gtp_ex_hdr_p;
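                      /* Accumulate the length of every extension header
                         (the original "= +" overwrote the running total) */
                      total_gtp_ex_hdr_len += gtp_ex_hdr_len * 4;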
#if 0                      
                      if (total_gtp_ex_hdr_len >
                          gtpu_hdr_len0 - sizeof (gtpu_header_t))
                        {
                          error0 = GTPU_ERROR_BAD_FLAGS;
                          next0 = GTPU_INPUT_NEXT_DROP;
                          goto trace00;
                        }
#endif
                      gtp_ex_hdr_p++;
                      switch (gtp_ex_hdr_type)
                        {
                        case GTP_EX_TYPE_PDU_SESS:
                          pdu_sess_info_p = (pdu_sess_info_t *)gtp_ex_hdr_p;
                          break;
                        default:
                          break;
                        }
                      u32 offset;
                      offset = gtp_ex_hdr_len * 4 - 2;
                      gtp_ex_hdr_p = gtp_ex_hdr_p + offset;
                      gtp_ex_hdr_type = *gtp_ex_hdr_p;
                      gtp_ex_hdr_p++;
                    }
                  gtpu_hdr_len0 =
                      sizeof (gtpu_header_t) + total_gtp_ex_hdr_len;
                }
            }
          else
            {
              gtpu_hdr_len0 = sizeof (gtpu_header_t) - 4;
            }

          hdr_len0 += gtpu_hdr_len0;

          upf_buffer_opaque (b0)->upf.data_offset = hdr_len0;
          upf_buffer_opaque (b0)->upf.teid =
              clib_net_to_host_u32 (gtpu0->teid);
          upf_buffer_opaque (b0)->upf.flags =
              (is_ip4) ? BUFFER_GTP_UDP_IP4 : BUFFER_GTP_UDP_IP6;
          if (NULL != pdu_sess_info_p)
            {
              upf_buffer_opaque (b0)->upf.extension_hdr_flag =
                  GTP_EXT_HDR_FLAG;
              upf_buffer_opaque (b0)->upf.pdu_type = pdu_sess_info_p->pdu_type;
              upf_buffer_opaque (b0)->upf.qfi = pdu_sess_info_p->qfi;
              upf_buffer_opaque (b0)->upf.rqi = pdu_sess_info_p->rqi;
              upf_trace ("upf_buffer_opaque (b0)->upf.qfi:%d, dir:%d",
                         upf_buffer_opaque (b0)->upf.qfi,
                         pdu_sess_info_p->pdu_type);
            }

          if (sx->pdn_type == PDN_TYPE_ETHERNET)
            {
              next0 = GTPU_INPUT_NEXT_ETH_INPUT;
            }
          /* inner IP header */
          else if (is_v4_packet (
                       (u8 *)(vlib_buffer_get_current (b0) + hdr_len0)))
            {
              ip4_0 = vlib_buffer_get_current (b0) + hdr_len0;
              if (PREDICT_FALSE (ip4_is_fragment (ip4_0)))
                {
                  vnet_buffer (b0)->ip.reass.next_index =
                      upf_main.ip4_reass_next;
                  vlib_buffer_advance (
                      b0, upf_buffer_opaque (b0)->upf.data_offset);
                  next0 = GTPU_INPUT_NEXT_IP4_REASSEMBLY;
                }
              else
                {
                  // Select the next node
                  next0 = GTPU_INPUT_NEXT_PDR_DETECT;
                }
            }
          else if (is_v6_packet (
                       (u8 *)(vlib_buffer_get_current (b0) + hdr_len0)))
            {
              ip6_0 = vlib_buffer_get_current (b0) + hdr_len0;
              if (PREDICT_FALSE (ip6_0->protocol ==
                                 IP_PROTOCOL_IPV6_FRAGMENTATION))
                {
                  vnet_buffer (b0)->ip.reass.next_index =
                      upf_main.ip6_reass_next;
                  vlib_buffer_advance (
                      b0, upf_buffer_opaque (b0)->upf.data_offset);
                  next0 = GTPU_INPUT_NEXT_IP6_REASSEMBLY;
                }
              else
                {
                  next0 = GTPU_INPUT_NEXT_PDR_DETECT;
                }
            }
          else
            {
              next0 = GTPU_INPUT_NEXT_PDR_DETECT;
            }

          pkts_decapsulated++;

        trace00:
          b0->error = error0 ? node->errors[error0] : 0;

          if (PREDICT_FALSE (b0->flags & VLIB_BUFFER_IS_TRACED))
            {
              gtpu_rx_trace_t *tr =
                  vlib_add_trace (vm, node, b0, sizeof (*tr));
              tr->next_index = next0;
              tr->error = error0;
              tr->session_index = session_index0;
              tr->teid = clib_net_to_host_u32 (gtpu0->teid);
            }

          vlib_validate_buffer_enqueue_x1 (vm, node, next_index, to_next,
                                           n_left_to_next, bi0, next0);
        }

      vlib_put_next_frame (vm, node, next_index, n_left_to_next);
    }
  /* Do we still need this now that tunnel tx stats is kept? */
  vlib_node_increment_counter (
      vm, is_ip4 ? gtpu4_input_node.index : gtpu6_input_node.index,
      GTPU_ERROR_DECAPSULATED, pkts_decapsulated);

  return from_frame->n_vectors;
}
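
The extension header loop in gtpu_input can also be exercised on its own. The sketch below walks a chain of GTP-U extension headers and returns their total length, assuming the usual layout in which each header starts with a length octet (in units of 4 octets) and ends with the type of the next header, with type 0 ending the chain. The function name and the bounds checks are my additions for illustration and are not part of the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a chain of GTP-U extension headers.
   first_type: "next extension header type" from the GTP-U header.
   ext:        first octet of the first extension header (its length octet).
   avail:      number of octets available at ext.
   Returns the total size of all extension headers in octets,
   or -1 if the chain is malformed or runs past the available data. */
static int toy_gtpu_ext_hdr_total_len (uint8_t first_type, const uint8_t *ext,
                                       size_t avail)
{
  size_t total = 0;
  uint8_t type = first_type;

  assert (ext != NULL || avail == 0);

  while (type != 0)
    {
      if (total >= avail)
        return -1; /* no length octet left to read */

      size_t len = (size_t) ext[total] * 4; /* length is in 4-octet units */
      if (len == 0 || len > avail - total)
        return -1; /* zero-length or truncated header */

      type = ext[total + len - 1]; /* last octet names the next header */
      total += len;                /* accumulate, do not overwrite */
    }
  return (int) total;
}
```

For example, a chain of two 4-octet headers, where the first is a PDU Session Container (type 0x85) and the second ends with next type 0, yields 8.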