Skip to main content

Verifying usage of data-encoding

Β· 8 min read
Foobar.land (for Nomadic Labs) & TriliTech

In the protocol of Tezos, we define the serialization of values using the data-encoding library. We make these definitions by hand to be able to audit the code and fine-tune the serialization.

The serialization and deserialization functions must be inverse. Missing this property can lead to bugs or exploits. We present a solution to formally verify the definition of data-encodings in Coq. As a result, we increase our confidence in the correctness of our encodings. We even found a mistake in a development code and fixed it before release. This mistake was found by TriliTech.

The example of verification of encoding which we will follow is the Nonce.info_encoding:

  • OCaml definition:

    type info = Revealed of Nonce.t | Missing of Nonce_hash.t | Forgotten
    let info_encoding =  let open Data_encoding in  union    [      case        (Tag 0)        ~title:"Revealed"        (obj1 (req "nonce" Nonce.encoding))        (function Revealed nonce -> Some nonce | _ -> None)        (fun nonce -> Revealed nonce);      case        (Tag 1)        ~title:"Missing"        (obj1 (req "hash" Nonce_hash.encoding))        (function Missing nonce -> Some nonce | _ -> None)        (fun nonce -> Missing nonce);      case        (Tag 2)        ~title:"Forgotten"        empty        (function Forgotten -> Some () | _ -> None)        (fun () -> Forgotten);    ]
  • full Coq proof:

    Lemma encoding_is_valid :  Data_encoding.Valid_on.t    Valid.t Alpha_services.Nonce.info_encoding.  eapply Data_encoding.Valid_on.implies.  eapply Data_encoding.Valid_on.union.  (repeat econstructor; simpl; try intuition congruence);    try (intro x; destruct x; reflexivity);    try apply Data_encoding.Valid_on.obj1.  apply Data_encoding.Valid.empty.  apply Nonce_hash.encoding_is_valid.  apply Seed_repr.Nonce.encoding_is_valid.  intros x H_x.  destruct x; tauto.Qed.

The data-encoding library#

This library defines several combinators which we can combine to define the encoding of a type. From an encoding, we get serialization and deserialization functions to binary or JSON format.

Basic combinators#

Some examples of basic combinators are:

type 'a encoding
val bool : bool encodingval string : string encodingval option : 'a encoding -> 'a option encodingval list : ?max_length:int -> 'a encoding -> 'a list encoding

The encodings bool or string directly encode their corresponding types. The combinators option and list create encodings for parametrized types, starting from the encoding on the base type.

Conversion#

To encode a type for which no combinators directly work, we can use conv:

val conv :  ('a -> 'b) ->  ('b -> 'a) ->  ?schema:Json_schema.schema ->  'b encoding ->  'a encoding

This function expects two parameters which must be bijections between the target type 'a and the base type 'b. The user of the data-encoding library is responsible for checking the correctness of the bijection.

Algebraic data types#

For algebraic data types , we use several combinators: case and union for sum types (detailed below) and obj and merge_objs for products (which we do not use in any example of this blog post and thus we do not give any details about here).

type 't case
val case :  title:string ->  ?description:string ->  case_tag ->  'a encoding ->  ('t -> 'a option) ->  ('a -> 't) ->  't case
val union :  ?tag_size:[`Uint8 | `Uint16] -> 't case list -> 't encoding

With the union primitive, we combine several case constructs. Each case construct covers one constructor of the sum type 't. The two functions given as parameters to case are the projections and injection corresponding to the constructor. We need to make sure that we cover all the constructors exactly once and that the projections/injections operations are compatible.

We will now see how we verify the definition of encodings with Coq.

Specification#

We translate the protocol of Tezos using coq-of-ocaml. During this translation we generate a list of axioms for the .mli files of the environment. The environment is the set of primitives used by the protocol and includes the data-encoding library. Thus we do not have access to the definition of the data-encoding library but can use its interface from Data_encoding.v.

We manually specify the behavior of data-encoding and suppose that its code is correct. To do so, we define the predicate Valid_on.t which we will apply on each primitive:

Module Valid_on.  Definition t {a : Set} (domain : a β†’ Prop)    (encoding : Data_encoding.t a) : Prop :=    βˆ€ x, domain x β†’    match Data_encoding.Binary.to_bytes_opt None encoding x with    | Some bytes β‡’      Data_encoding.Binary.of_bytes_opt encoding bytes = Some x    | None β‡’ False    end.End Valid_on.

This statement says that for some type a, on a specific domain domain defined as a predicate, the encoding encoding is valid. For now, we only define that serializing and then deserializing to binary format is the identity. We should also state that the reverse operation is valid (deserializing and then serializing is the identity), and that serialization in JSON format is valid. For the typical case of encodings valid on the whole domain we define the predicate Valid.t:

Module Valid.  Definition t {a : Set} (encoding : Data_encoding.t a) : Prop :=    Valid_on.t (fun _ β‡’ True) encoding.End Valid.

Basic combinators#

For each of the data-encoding combinators, we give a validity axiom with corresponding conditions and domain:

Axiom bool : Valid.t Data_encoding.bool.
Axiom string : Valid.t Data_encoding.string.
Axiom option : βˆ€ {a : Set} {domain : a β†’ Prop}  {encoding : Data_encoding.t a},  Valid_on.t domain encoding β†’  let option_domain x :=    match x with    | None β‡’ True    | Some x β‡’ domain x    end in  Valid_on.t option_domain (Data_encoding.option encoding).
Axiom list : βˆ€ {a : Set} {domain_a}  {encoding_a : Data_encoding.t a},  Valid_on.t domain_a encoding_a β†’  Valid_on.t (List.Forall domain_a)    (Data_encoding.list None encoding_a).

The primitives boolΒ and string are valid on their whole domain. On lists, the primitive list yields an encoding that is valid on lists in which each element is in the domain domain_a. We define the list domain with:

List.Forall domain_a

knowing that the List.Forall operator has type:

forall {a : Set}, (a -> Prop) -> list a -> Prop

Conversion#

For the conversion operator, we rely on the correction of the bijection given as a parameter:

Axiom conv : βˆ€ {a b : Set} {domain_b : b β†’ Prop}  {a_to_b : a β†’ b} {b_to_a} {encoding_b : Data_encoding.t b},  Valid_on.t domain_b encoding_b β†’  Valid_on.t    (fun v_a β‡’ domain_b (a_to_b v_a) ∧ b_to_a (a_to_b v_a) = v_a)    (Data_encoding.conv a_to_b b_to_a None encoding_b).

Algebraic data types#

The specification of the union combinator for algebraic data types is a little more involved. We have to define the validity of a case as an inductive predicate in Coq due to universe constraints on impredicative sets. We let the interested reader look at the full definition.

Proof scheme#

We now verify the following type encoding from alpha_services.ml:

type info = Revealed of Nonce.t | Missing of Nonce_hash.t | Forgotten
let info_encoding =  let open Data_encoding in  union    [      case        (Tag 0)        ~title:"Revealed"        (obj1 (req "nonce" Nonce.encoding))        (function Revealed nonce -> Some nonce | _ -> None)        (fun nonce -> Revealed nonce);      case        (Tag 1)        ~title:"Missing"        (obj1 (req "hash" Nonce_hash.encoding))        (function Missing nonce -> Some nonce | _ -> None)        (fun nonce -> Missing nonce);      case        (Tag 2)        ~title:"Forgotten"        empty        (function Forgotten -> Some () | _ -> None)        (fun () -> Forgotten);    ]

The type info_encoding is an algebraic data type with three constructors Revealed, Missing, and Forgotten. Its encoding relies on the encodings of Nonce.t and Nonce_hash.t.

The Coq translation of this OCaml code is in Alpha_services.v. The encoding of Nonce.t is only valid on a specific domain Seed_repr.Nonce.Valid.t. Thus we define the domain for info_encoding as:

Module Valid.  Definition t (info : Alpha_services.Nonce.info) : Prop :=    match info with    | Alpha_services.Nonce.Revealed nonce β‡’      Seed_repr.Nonce.Valid.t nonce    | _ β‡’ True    end.End Valid.

We express the correctness lemma for info_encoding in Proofs/Alpha_services.v as:

Lemma encoding_is_valid :  Data_encoding.Valid_on.t    Valid.t Alpha_services.Nonce.info_encoding.

We verify that this encoding is valid on the domain Valid.t. Like most proofs on data encodings, we start with:

eapply Data_encoding.Valid_on.implies.

The lemma Data_encoding.Valid_on.implies is of type:

forall {a : Set} {domain domain' : a β†’ Prop}  {encoding : Data_encoding.t a},  Valid_on.t t domain encoding β†’  (βˆ€ x, domain' x β†’ domain x) β†’  Valid_on.t domain' encoding.

Using eapply on implies we introduce an existential Coq variable for the domain. We say that the domain is undefined for now. We will build it as a side-effect of using the axioms about the data-encoding primitives. At the end of the proof, we will verify that the domain we built includes the domain of the lemma statement using the tauto tactic.

Then, we use the axiom on union:

eapply Data_encoding.Valid_on.union.

and eliminate administrative goals, checking for example that we do not repeat two times the same tag:

(repeat econstructor; simpl; try intuition congruence);  try (intro x; destruct x; reflexivity);  try apply Data_encoding.Valid_on.obj1.

Finally, we use previously verified lemmas for the payload of each constructor:

apply Data_encoding.Valid.empty.apply Nonce_hash.encoding_is_valid.apply Seed_repr.Nonce.encoding_is_valid.

Conclusion#

We described how we verify the data-encodings in the Tezos protocol. Our goal is to verify as many encodings as possible in order to avoid mistakes in this part of the code. Ultimately, we would like to also verify the data-encoding library itself to make sure that it implements our specification.