Gloss by Example

Bytes, Buffers, and Clojure

Posted by Derek Troy-West on October 22, 2013 · 51 mins read

Image: Miguel Carraça

Recently I’ve been building systems which attempt to process large volumes of data quickly and efficiently.

These systems are predominantly JVM/Netty based network services, so I’m comfortable working with bytes and buffers, but have found interpreting those bytes in a meaningful way to be a bit of a challenge. Enter Gloss, a Clojure DSL for describing, encoding, and decoding byte-structures. I like its composable nature, it performs well enough for my current needs, and I’ve really enjoyed learning more about Clojure.

Often the data being consumed conforms to the header field definitions of RFC-5322 (Internet Message Format) or RFC-2616 (HTTP), meaning repeated, delimiter-separated string values. That is not so complicated until you include the requirement for header folding; then parsing becomes a little more difficult.

Gloss seems naturally inclined toward static or fixed-length byte formats, but does support parsing of delimited strings and with a little coaxing parses these header types fairly well.

My extremely naive micro-benchmarks indicate the Gloss approach to be about 10x slower than Netty 4.0’s highly tuned HTTP header parsing. I’m not deterred - the Gloss solution is more extensible and the output is a Clojure data-structure, which has its own advantages.

Here are some examples of the documented facets of Gloss (v0.2.2), a few undocumented, and one potential extension. All are available as a series of expectations annotated with marginalia here.


The Basics


Gloss is a DSL for describing byte formats.

In Gloss a byte format is called a frame. Frames are compiled into codecs, which allow you to encode/decode ByteBuffers to/from a Clojure data structure.

A frame is itself just a data structure that can contain certain Gloss keywords or other codecs. Nesting codecs allows granular testing and translation.
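The REPL transcripts in this post call Gloss functions unqualified, so they presumably ran in a namespace that refers them in. The namespace name below comes from the REPL output; the require list itself is an assumption, reconstructed from the functions the examples call. The extension section near the end additionally reaches into Gloss internals, which are not covered here.

```clojure
;; Assumed namespace setup (a sketch, not the original project's ns form).
(ns by-example-gloss.core
  (:require [gloss.core :refer [compile-frame defcodec string repeated
                                ordered-map delimited-block]]
            [gloss.core.codecs :refer [identity-codec]]
            [gloss.io :refer [encode decode to-byte-buffer contiguous]]
            [clojure.string :refer [trim lower-case]]
            [clojure.walk :refer [stringify-keys keywordize-keys]]))

;; Smoke check: a single byte survives an encode/decode round trip.
(def smoke-codec (compile-frame :byte))
(def smoke-result (decode smoke-codec (encode smoke-codec 42)))
```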

frames

A frame can contain keywords representing a number of different primitive data types. The examples below use :byte, but any of these keywords can be used in its place.

;; Gloss primitive type keywords
[:byte :int16 :int32 :int64 :float32 :float64 :ubyte :uint16 :uint32 :uint64]


;; A very simple frame, a single byte
(def byte-frame :byte)
=> (var by-example-gloss.core/byte-frame)


;; Endian-ness can be declared by appending -le or -be
(def little-endian-int-frame :int32-le)
=> (var by-example-gloss.core/little-endian-int-frame)


;; Frames are just clojure data structures, this frame is a vector of two bytes
(def vector-frame [:byte :byte])
=> (var by-example-gloss.core/vector-frame)


;; This frame contains the same data, but in map form rather than a vector
(def map-frame {:first :byte :second :byte})
=> (var by-example-gloss.core/map-frame)


;; Frames can contain constants, which are not encoded but do appear in decoded output
(def map-frame-with-constant {:first "constant-value" :second :byte})
=> (var by-example-gloss.core/map-frame-with-constant)

codec

Gloss’ encode and decode functions require codecs, not frames. The following transform a frame into a codec:

  • gloss.core/compile-frame; or,
  • gloss.core/defcodec

;; A codec compiled from a simple byte frame
(def byte-codec
  (compile-frame :byte))
=> (var by-example-gloss.core/byte-codec)


;; A codec compiled using the defcodec macro
(defcodec vector-codec [:byte :byte])
=> (var by-example-gloss.core/vector-codec)


;; Gloss encodes map values in a consistent but arbitrary order
;; If you require values serialized in a particular order, use Gloss' ordered-map
(defcodec map-codec {:first :byte :second :byte})
=> (var by-example-gloss.core/map-codec)

utility functions

Gloss is designed to work on streams of data, so the core functions often emit or consume sequences of ByteBuffers. Gloss provides a couple of utility functions which turn out to be very handy when testing encoding and decoding.

  • gloss.io/to-byte-buffer: Converts a value into a ByteBuffer
  • gloss.io/contiguous: Turns a sequence of ByteBuffers into a single ByteBuffer
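As a small sketch of how these two helpers combine, assuming to-byte-buffer accepts a string (as the later examples in this post do), two buffers can be joined and read back as text. The names parts and joined are only for illustration.

```clojure
(require '[gloss.io :refer [to-byte-buffer contiguous]])

;; Two separate buffers, as Gloss might emit when encoding.
(def parts [(to-byte-buffer "ab") (to-byte-buffer "cd")])

;; contiguous merges the sequence into one ByteBuffer.
(def joined (contiguous parts))

;; Read the merged bytes back as a String for inspection.
(def joined-text (String. (.array joined)))
```

The same (String. (.array (contiguous ...))) idiom is used throughout the examples that follow to inspect encoder output.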

encoding + decoding

  • gloss.io/encode: Encodes from a data-structure into a sequence of ByteBuffers
  • gloss.io/decode: Decodes from one or more ByteBuffers into a data-structure

Encoding the single byte-codec into a ByteBuffer and decoding it back again is simple:

;; Encoding a single byte to a sequence holding a single buffer
(encode byte-codec 127)
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]>)


;; Confirming the content of that buffer
(first (.array (contiguous *1)))
=> 127


;; Decoding from a ByteBuffer into a single byte 
(decode byte-codec (to-byte-buffer 127))
=> 127

The vector-codec is similarly straightforward:

(encode vector-codec [126 127])
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]>)


(vec (.array (contiguous *1)))
=> [126 127]


(decode vector-codec (to-byte-buffer [126 127]))
=> [126 127]

Encoding the map-codec is slightly more complicated.

Gloss encodes map values in a consistent but arbitrary order (technically values are serialized to the buffer in alphabetical order of their keys, but it would be poor form to count on that always being the case).

;; Encoding happens to serialize first, then second.
(encode map-codec {:first 126 :second 127})
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]>)


(vec (.array (contiguous *1)))
=> [126 127]


;; Compiling a codec with different key names alters the serialization order.
;; (Defined under a separate name so map-codec keeps its :first/:second form.)
(defcodec zappa-map-codec {:zappa :byte :alpha :byte})
=> (var by-example-gloss.core/zappa-map-codec)


(encode zappa-map-codec {:zappa 126 :alpha 127})
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]>)


(vec (.array (contiguous *1)))
=> [127 126]


;; In either case, decoding is unchanged
(decode zappa-map-codec (to-byte-buffer [127 126]))
=> {:zappa 126, :alpha 127}

Serialization order might not be a problem if you are only encoding/decoding through your own codec, but if you’re decoding an externally defined format into a map you’ll need to use Gloss’ ordered-map which encodes/decodes map values in the order they are defined.

gloss.core/ordered-map: when serialization order is important

;; Values are serialized in the order they are defined
(defcodec ordered-map-codec 
          (ordered-map :zappa :byte
                       :alpha :byte))
=> (var by-example-gloss.core/ordered-map-codec)


(encode ordered-map-codec {:zappa 126
                           :alpha 127})
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]>)


(vec (.array (contiguous *1)))
=> [126 127]

partial decoding

Gloss is eager: it assumes that the codec you provide will consume all of the input. If that’s not the case you’ll get an error. Gloss can be instructed to ignore any remaining bytes by passing ‘false’ as the third argument to decode:

(decode vector-codec (to-byte-buffer [126 127]))
=> [126 127]


(decode vector-codec (to-byte-buffer [126 127 125]))
Exception Bytes left over after decoding frame.  gloss.io/decode (io.clj:86)


(decode vector-codec (to-byte-buffer [126 127 125]) false)
=> [126 127]

nesting codec

A codec can be defined as a data structure composed of other codecs, meaning more complex codecs can be built from simpler parts.

(defcodec nested-codec
          [vector-codec {:foo "bar"} byte-frame map-codec])
=> (var by-example-gloss.core/nested-codec)


(encode nested-codec [[123 124]
                      {:foo "bar"}
                      125
                      {:first 126 :second 127}])
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]>)


(vec (.array (contiguous *1)))
=> [123 124 125 126 127]


(decode nested-codec (to-byte-buffer '(123 124 125 126 127)))
=> [[123 124] {:foo "bar"} 125 {:second 127, :first 126}]

string frames

gloss.core/string: Supports any of the standard Java character encodings

unbound: will consume all available bytes in the character encoding specified.

;; The first argument to gloss.core/string specifies a character encoding.
(defcodec unbound-codec
          (string :utf-8))
=> (var by-example-gloss.core/unbound-codec)


(encode unbound-codec "some-string")
=> [#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=11 cap=11]>]


(decode unbound-codec *1)
=> "some-string"

fixed length: isn’t very interesting to me, but you get the idea.

;; String can be declared with a certain length
(defcodec fixed-length-codec
          (string :utf-8 :length 50))

delimited: any byte-sequence can be used as a delimiter, and several delimiters can be supplied at once.

(defcodec delimited-codec
          (string :utf-8 :delimiters ["x" "xx" \y]))
=> (var by-example-gloss.core/delimited-codec)

When decoding, Gloss searches head-first for a delimiter then decodes anything prior.

(decode delimited-codec (to-byte-buffer "derekx"))
=> "derek"


(decode delimited-codec (to-byte-buffer "derekxx"))
=> "derek"


(decode delimited-codec (to-byte-buffer "dereky"))
=> "derek"

We can instruct Gloss not to remove the delimiter from the decoded text.

;; :strip-delimiters? false
(defcodec dlm-inclusive-codec
  (string :utf-8 :delimiters ["x" "xx" \y] :strip-delimiters? false))
=> (var by-example-gloss.core/dlm-inclusive-codec)


(decode dlm-inclusive-codec (to-byte-buffer "derekx"))
=> "derekx"

When decoding with multiple delimiters:

  • the first delimiter encountered in the input is always matched; and,
  • where one delimiter is a prefix of another, the longest is always matched.

;; First delimiter found is matched
(decode dlm-inclusive-codec (to-byte-buffer "derekyx") false)
=> "dereky"


;; Where one delimiter is a prefix of the other, the longest is matched
(decode dlm-inclusive-codec (to-byte-buffer "derekxx"))
=> "derekxx"

When encoding, by default Gloss will encode the first delimiter.

;; Interesting to note Gloss encodes two buffers, content and delimiter
(encode delimited-codec "derek")
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]>)


(String. (.array (contiguous *1)))
=> "derekx"

The encoding delimiter is configurable based on the value being encoded.

;; A function returning a different delimiter dependent on the value
(defn choose-encoded-dlm [value]
  (condp = value
    "derek" ["x"]
    "kylie" ["xx"]
    ["y"]))
=> (var by-example-gloss.core/choose-encoded-dlm)


;; Provided via the :value->delimiter argument.
(defcodec dlm-selective-codec
          (string :utf-8 :delimiters ["x" "xx" \y]
                  :value->delimiter choose-encoded-dlm))
=> (var by-example-gloss.core/dlm-selective-codec)


;; Now the delimiter depends on the value being encoded.
(String. (.array (contiguous (encode dlm-selective-codec "derek"))))
=> "derekx"


(String. (.array (contiguous (encode dlm-selective-codec "kylie"))))
=> "kyliexx"


(String. (.array (contiguous (encode dlm-selective-codec "kirsty"))))
=> "kirstyy"

repeated frames

gloss.core/repeated: Supports repeating frames or codec

(def rep-byte (repeated :byte))

=> (var by-example-gloss.core/rep-byte)

By default encoded data is prefixed with a 32 bit integer which declares the number of repetitions. Gloss also supports custom prefixes, or no prefix at all - where the repeated frame would expect to consume all input.

(encode rep-byte [127])
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]>)


(vec (.array (contiguous *1)))
=> [0 0 0 1 127]
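The custom-prefix and no-prefix variants mentioned above can be sketched as follows. This assumes that repeated takes a :prefix option which accepts a primitive keyword such as :ubyte, or :none for no prefix at all, as its docstring describes. The codec names and the encoded-bytes helper are only for illustration.

```clojure
(require '[gloss.core :refer [compile-frame repeated]]
         '[gloss.io :refer [encode decode to-byte-buffer contiguous]])

;; The repetition count is stored in one unsigned byte
;; rather than the default 32-bit integer.
(def ubyte-prefixed (compile-frame (repeated :byte :prefix :ubyte)))

;; No prefix: the repeated frame expects to consume all of the input.
(def unprefixed (compile-frame (repeated :byte :prefix :none)))

;; Helper: encode a value and return the raw bytes as a vector.
(defn encoded-bytes [codec value]
  (vec (.array (contiguous (encode codec value)))))

(def prefixed-bytes (encoded-bytes ubyte-prefixed [127 126]))
(def unprefixed-values (decode unprefixed (to-byte-buffer [127 126])))
```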

Repeated frames always encode from, and decode to, vectors.

(decode rep-byte (to-byte-buffer '(0 0 0 2 127 126)))
=> [127 126]

Much like Strings, repetition can be terminated by delimiter. As it turns out, you can repeat a delimited string, terminating the repetition with a delimiter itself.

(def dlm-rep-string
  (repeated
    (string :utf-8 :delimiters ["\r"])
    :delimiters ["\n"]))
=> (var by-example-gloss.core/dlm-rep-string)


;; Interesting again, Gloss encodes separate buffers for content and delimiters
(encode dlm-rep-string ["first" "second"])
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]>)


;; Validating the output of our encoded repeated string
(String. (.array (contiguous *1)))
=> "first\rsecond\r\n"


;; And decoding repeated delimited strings from a buffer to a vector
(decode dlm-rep-string (to-byte-buffer "first\rsecond\r\n"))
=> ["first" "second"]

Gloss scans first for the delimiter of the repeated section, then provides the matched bytes to the inner frame for consumption. All bytes must be consumed or an exception is thrown.

(decode dlm-rep-string (to-byte-buffer "first\nsecond\nleft-over-bytes\0"))
=> Exception Cannot evenly divide bytes into sequence of frames.  
     gloss.core.protocols/take-all/fn--1929 (protocols.clj:93)

Both repeated and string support leaving the delimiter in the matched bytes. This turns out to be useful, as we can leave the repeated codec termination delimiter within the content passed to the inner codec, allowing that codec to consume the delimiter itself.

;;  :strip-delimiters? false
(def dlm-rep-string-x
  (repeated
    (string :utf-8 :delimiters ["\n" "\n\n"])
    :delimiters ["\n\n"]
    :strip-delimiters? false))
=> (var by-example-gloss.core/dlm-rep-string-x)


;; In this case Gloss matches the entire text as the repeated section then;
;;
;; - 'first\n' as the first string
;; - 'second\n' as the second
;; - 'third\n\n' as the third (always matches the largest delimiter)
(decode dlm-rep-string-x (to-byte-buffer "first\nsecond\nthird\n\n"))
=> ["first" "second" "third"]

transforms

When compiling a frame, we can supply functions that:

  • transform the input before encoding; and/or,
  • transform the output after decoding.

;; Converts {:name "value"} to ["name" "value"]
(defn transform-input [data]
  (first (stringify-keys data)))
=> (var by-example-gloss.core/transform-input)


;; Converts ["name" "value"] to {:name "value"}
(defn transform-output [[k v]]
  {(keyword k) v})
=> (var by-example-gloss.core/transform-output)

This codec talks in vectors, but that is only an intermediary state. The input to encode is transformed from a map to a vector, and the output from decode is transformed from a vector to a map.

(def trans-codec
  (compile-frame
    [(string :utf-8 :delimiters [": "]) (string :utf-8 :delimiters ["\n"])]
    transform-input
    transform-output))
=> (var by-example-gloss.core/trans-codec)


(encode trans-codec {:name "value"})
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=2 cap=2]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]> 
    #<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]>)


(String. (.array (contiguous *1)))
=> "name: value\n"


(decode trans-codec (to-byte-buffer "name: value\n"))
=> {:name "value"}
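The two transform functions can be checked on their own, without Gloss, to confirm that they are inverses for a single-entry map. The sketch below repeats the definitions from above so it is self-contained. Note that transform-input uses first, so a map with more than one entry would lose all but one of them; the repeated codecs later in this post address that.

```clojure
(require '[clojure.walk :refer [stringify-keys]])

;; Same definitions as above, repeated so this block runs standalone.
(defn transform-input [data]
  (first (stringify-keys data)))

(defn transform-output [[k v]]
  {(keyword k) v})

;; A single-entry map survives the round trip.
(transform-output (transform-input {:name "value"}))
;; => {:name "value"}

;; With two entries, only one key/value pair comes out of transform-input.
(count (transform-input {:a "1" :b "2"}))
;; => 2
```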

headers

Headers enable Gloss to encode and decode in a conditional manner. An initial ‘header’ frame defines how the following frame will be encoded/decoded:

[todo: info on how to mix headers and complete RFC 5322 codec to parse headers in different ways depending on header-name. memoize, etc]
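Until that section is written, here is a minimal sketch of the general idea, unrelated to RFC 5322. It assumes gloss.core/header takes a header frame, a header->body function and a body->header function, and that gloss.core/enum maps keywords to numeric values. The packet types and field names are hypothetical.

```clojure
(require '[gloss.core :refer [defcodec enum header]]
         '[gloss.io :refer [encode decode]])

;; The header: one byte naming the kind of packet that follows.
(defcodec packet-type (enum :byte :ping :data))

;; Two possible bodies; the constant :type is not written to the buffer.
(defcodec ping-body {:type :ping :seq :int16})
(defcodec data-body {:type :data :value :int32})

(defcodec packet
  (header packet-type
          {:ping ping-body :data data-body} ; header->body: pick codec by type
          :type))                           ; body->header: read type from body

;; Encode then decode a value through the header-driven codec.
(defn round-trip [value]
  (decode packet (encode packet value)))
```

A map works as header->body because Clojure maps are functions of their keys, and the keyword :type works as body->header for the same reason.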

delimited-block

gloss.core/delimited-block defines a frame which is just a byte sequence terminated by delimiters. As with string and repeated, we can indicate whether the delimiters should be stripped. It can be useful for splitting a ByteBuffer into parts based on a delimiter.

;; The boolean argument acts as :strip-delimiters?
;; rn is "\r\n", defined in the RFC-5322 section below
(defcodec dlm-block
          (delimited-block [rn] true))
=> (var by-example-gloss.core/dlm-block)


;; Delimited block allows us to slice a portion of a ByteBuffer
(decode dlm-block (to-byte-buffer "one\r\ntwo") false)
=> (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=3 cap=3]>) ;; <= "one"

identity-codec

gloss.core.codecs/identity-codec provides the ByteBuffer underlying the codec

(defcodec dlm-identity-block
          [(delimited-block [rn] true) identity-codec])
=> (var by-example-gloss.core/dlm-identity-block)

(decode dlm-identity-block (to-byte-buffer "one\r\nanother"))
=> [(#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=3 cap=3]>)   ;; <= "one"
    (#<HeapByteBuffer java.nio.HeapByteBuffer[pos=0 lim=7 cap=7]>)]  ;; <= "another"

A Basic RFC-5322 header codec


A very simple definition of RFC-5322 headers:

  • Repeated text of the format “name: value\r\n”
  • The entire section ending with “\r\n\r\n”

(def ^:const rn "\r\n")
(def ^:const rnrn "\r\n\r\n")

;; A buffer with this text value
(def basic-buf 
  (to-byte-buffer "name: value\r\nname2: value2\r\n\r\n"))
=> (var by-example-gloss.core/basic-buf)


;; will encode/decode from/to this map structure
(def basic-data {:name "value"
                 :name2 "value2"})
=> (var by-example-gloss.core/basic-data)

First, a codec which matches a single header:

(defcodec basic-header
  [(string :utf-8 :delimiters [": "]) (string :utf-8 :delimiters [rn rnrn])])
=> (var by-example-gloss.core/basic-header)


;; Decodes "name: value\r\n" to  ["name" "value"]
(decode basic-header (to-byte-buffer "name: value\r\n"))
=> ["name" "value"]


;; Decodes "name: value\r\n\r\n" to the same
(decode basic-header (to-byte-buffer "name: value\r\n\r\n"))
=> ["name" "value"]


;; Encodes the data back to the initial buffer.
(String. (.array (contiguous (encode basic-header ["name" "value"]))))
=> "name: value\r\n"

That header codec can be repeated. The repeated section leaves its delimiter in the matched bytes, to be consumed by the inner header codec.

(defcodec basic-headers
  (repeated basic-header :delimiters [rnrn]
                         :strip-delimiters? false))
=> (var by-example-gloss.core/basic-headers)

This means we can decode into a vector of vectors:

(decode basic-headers basic-buf)
=> [["name" "value"] ["name2" "value2"]]

transforms

We’re half-way to having our basic headers encode/decode into the form we want.

A post-decode transform works the output into a more practical form.

(defn output-to-map [data]
  (keywordize-keys (into {} data)))
=> (var by-example-gloss.core/output-to-map)


(output-to-map [["name" "value"] ["name2" "value2"]])
=> {:name "value", :name2 "value2"}

Similarly, a pre-encode transform breaks map-style input into an acceptable form.

(defn input-to-vector [data]
  (vec (stringify-keys data)))
=> (var by-example-gloss.core/input-to-vector)

(input-to-vector {:name "value"
                  :name2 "value2"})
=> [["name" "value"] ["name2" "value2"]]

the codec

A codec which encodes/decodes the most basic form of RFC-5322 headers:

(def basic-rfc5322-headers
  (compile-frame
    (repeated basic-header
              :delimiters [rnrn]
              :strip-delimiters? false)
    input-to-vector
    output-to-map))
=> (var by-example-gloss.core/basic-rfc5322-headers)

;; Decodes the sample data into a map correctly
(decode basic-rfc5322-headers basic-buf)
=> {:name "value", :name2 "value2"}

limitations

Though we can decode headers correctly, due to a limitation with Gloss’ repeated codec we can’t encode them.

When encoding Gloss emits the delimiter of each header, and then the delimiter of the repeated section. On decoding the repeated section delimiter is effectively consumed twice, so when encoding we end up with an extra delimiter.

(String. (.array (contiguous (encode basic-rfc5322-headers basic-data))))
=> "name: value\r\nname2: value2\r\n\r\n\r\n" ;; <= one too many \r\n

Gloss supports configurable encoding delimiters for string, but not repeated.

;; Hypothetical: repeated does not accept an :encoding-delimiter option
(def basic-headers-selective-dlm
  (compile-frame
    (repeated basic-header
              :delimiters [rnrn]
              :encoding-delimiter rn) ; <- encode this delimiter (would be nice)
    input-to-vector
    output-to-map))

A final limitation: decoding an empty set of headers fails.

(decode basic-rfc5322-headers (to-byte-buffer rnrn))
=> Exception Cannot evenly divide bytes into sequence of frames.
     gloss.core.protocols/take-all/fn--1929 (protocols.clj:93)

A better RFC-5322 header codec


Due to the limitations above, we’re only concerned with decoding.

  • Same basic definition as the Basic Headers codec.
  • Combine repeated names into a single comma separated value.
  • Normalize names, trimmed and case insensitive.
  • Normalize values, trimmed.

;; This more complicated example
(def better-buf
  (to-byte-buffer (str "name: value\r\n"
                       "name2:value2\r\n"
                       "NAME2:value3 \r\n"
                       "name3 : VALUE5 \r\n"
                       "name2 :value4\r\n\r\n")))
=> (var by-example-gloss.core/better-buf)

;; Will decode into this map-structure
(def better-data {:name "value"
                  :name2 "value2,value3,value4"
                  :name3 "VALUE5"})
=> (var by-example-gloss.core/better-data)

The header codec is similar to the initial one, but without the expectation of a space after the colon.

(defcodec better-header
          [(string :utf-8 :delimiters [":"]) (string :utf-8 :delimiters [rn rnrn])])
=> (var by-example-gloss.core/better-header)

A post-decode transform function applies most of the rules described above.

(defn output-to-merged-map [data]
  (apply merge-with #(str %1 "," %2)
         (map (fn [[k v]] {(keyword (-> k trim lower-case)) (trim v)}) data)))
=> (var by-example-gloss.core/output-to-merged-map)


(output-to-merged-map [["name" "value"] ["NAME2" " value2"] ["name2" " value3 "]])
=> {:name2 "value2,value3", :name "value"}

the codec

A slightly better RFC-5322 header codec

(def better-headers
  (compile-frame
    (repeated better-header
              :delimiters [rnrn]
              :strip-delimiters? false)
    identity
    output-to-merged-map))
=> (var by-example-gloss.core/better-headers)


;; Provides better RFC-5322 header extraction
(decode better-headers better-buf)
=> {:name3 "VALUE5", :name2 "value2,value3,value4", :name "value"}

A complete RFC-5322 header codec


In addition to previous requirements, values can be folded over several lines.

See the RFC-5322 spec for the folding definition.
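To make the folding rule concrete, here is a string-level sketch of unfolding in plain Clojure, independent of Gloss. It follows the interpretation used in this post, where a CRLF followed by a space or tab is replaced by a single space. The function name unfold-text is hypothetical, and the buffer-based unfold defined later works on ByteBuffers rather than strings.

```clojure
(require '[clojure.string :as str])

;; Replace each CRLF that is followed by a space or tab with one space.
;; A CRLF followed by anything else marks a genuine end of header.
(defn unfold-text [s]
  (str/replace s #"\r\n[ \t]" " "))

(unfold-text "name2:value2\r\n value3 \r\n\tvalue3a \r\n")
;; => "name2:value2 value3  value3a \r\n"
```

The double space before value3a appears because the folded line already ended with a space before its CRLF.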

;; This buffer includes folded headers
(def folded-buf
  (to-byte-buffer (str
                    "name: value\r\n"
                    "name2:value2\r\n"
                    " value3 \r\n"
                    "\tvalue3a \r\n"
                    "name3: value5 \r\n"
                    "name2:value4\r\n\r\n")))
=> (var by-example-gloss.core/folded-buf)


;; Will be unfolded before being decoded, with "\r\n " or "\r\n\t"
;; interpreted as a space character.
(def unfolded-buf
  (to-byte-buffer (str
                    "name: value\r\n"
                    "name2:value2 value3  value3a \r\n"
                    "name3: value5 \r\n"
                    "name2:value4 \r\n\r\n")))
=> (var by-example-gloss.core/unfolded-buf)


;; The folded buffer should eventually be decoded into this data structure.
(def unfolded-data
  {:name "value"
   :name2 "value2 value3  value3a,value4"
   :name3 "value5"})
=> (var by-example-gloss.core/unfolded-data)

gloss extension

Gloss supports transforms pre/post decode on the data being supplied/generated.

To support decoding folded headers we extend Gloss to allow a function which transforms the buffer pre-decode, allowing us to unfold the buffer into a suitable form for the codec.

(defn- compile-frame- [f]
  (cond
    (map? f) (convert-map (zipmap (keys f) (map compile-frame- (vals f))))
    (sequential? f) (convert-sequence (map compile-frame- f))
    :else f))
=> (var by-example-gloss.core/compile-frame-)


;; Rather than explicitly applying the transform to the buffer pre-decode, an
;; extended compile-frame takes a pre-decode argument.
;;
;; Now a codec can also transform the incoming buffer.
(defn compile-frame-ext
  ([frame pre-encoder pre-decoder post-decoder]
   (let [codec (compile-frame frame)
         read-codec (compose-callback
                      codec
                      (fn [x b]
                        [true (post-decoder x) b]))]
     (reify
         Reader
       (read-bytes [_ b]
         (read-bytes read-codec (pre-decoder b)))
       Writer
       (sizeof [_]
         (sizeof codec))
       (write-bytes [_ buf v]
         (write-bytes codec buf (pre-encoder v)))))))
=> (var by-example-gloss.core/compile-frame-ext)

Now it is possible to define a codec with a pre-decode method, manipulating the buffer and unfolding it before passing it to the codec.

(def ^:const rn-space "\r\n ")
(def ^:const rn-tab "\r\n\t")
(def sp-buf (to-byte-buffer " "))
(def rnrn-buf (to-byte-buffer rnrn))

;; This codec part-unfolds a ByteBuffer, removing all linear whitespace.
;;
;; - ("ab\r\nc\r\n d\r\n\t e\r\nf\r\n\r\n")
;; - becomes ("ab\r\nc" "d" " e\r\nf")
(defcodec part-unfold-codec
          (repeated
            (delimited-block [rn-space rn-tab rnrn] true)
            :delimiters [rnrn]
            :strip-delimiters? false))
=> (var by-example-gloss.core/part-unfold-codec)


;; This pre-decode transform method applies that codec, then interposes space
;; characters to give the expected unfolded form.
;;
;; - codec output ("ab\r\nc" "d" " e\r\nf")
;; - becomes ("ab\r\nc" " " "d" " " " e\r\nf" " " "\r\n\r\n")
(defn unfold [bufs]
  (let [buf-seq (decode part-unfold-codec bufs)]
    (if (> (count buf-seq) 1)
      (list (contiguous (interpose sp-buf (flatten (conj buf-seq rnrn-buf)))))
      bufs)))
=> (var by-example-gloss.core/unfold)


;; Unfolding the folded buffer gives the desired result
(String. (.array (contiguous (unfold (list folded-buf)))))
=> "name: value\r\n"
   "name2:value2 value3  value3a \r\n"
   "name3: value5 \r\n"
   "name2:value4 \r\n\r\n"

the codec

A complete RFC-5322 header codec, using the Gloss extension that allows a function to modify the ByteBuffer before decoding.

(def folding-headers
  (compile-frame-ext
    (repeated better-header
              :delimiters [rnrn]
              :strip-delimiters? false)
    identity
    unfold
    output-to-merged-map))
=> (var by-example-gloss.core/folding-headers)


(decode folding-headers folded-buf)
=> {:name3 "value5", :name2 "value2 value3  value3a,value4", :name "value"}

;; Unfortunately, empty sets of headers are not supported
(decode folding-headers (to-byte-buffer "\r\n\r\n"))
=> Exception Cannot evenly divide bytes into sequence of frames.  
     gloss.core.protocols/take-all/fn--1929 (protocols.clj:93)