RACE Parsers

Most RACE applications require parsing text messages received from external systems such as SWIM. Such messages can come at a fairly high rate (e.g. >1000 msg/sec for a collection of SWIM topics) and messages can become quite large (~100kB for single SWIM messages). Since import of messages is usually just a prerequisite for RACE applications that process objects generated from these messages, support for efficient parsing is essential to leave enough CPU cycles and memory for the main actors of the application.

The RACE parsing infrastructure is centered around several assumptions:

normally, we only process a subset of message types we receive, and only a subset of the information that is contained in processed messages
since the rate of messages can be high, messages can be large and extracted information can be spread across these messages we should avoid creating a large amount of short living objects for the sole purpose of parsing (e.g. String objects for XML tags)
our message sources are generally trusted, i.e. XML schema validation is usually not reqiured

For instance, SWIM reports en route flight positions every 12 sec, the number of simultaneous flights can exceed 4000. A flight element contains more than 60 tags of which we only parse 12 in order to generate track objects. This means we can save more than 300 temporary String allocations/sec just for this message type if XML tags are not reported/matched as Strings. Moreover, most of the extracted values are floating point numbers, i.e. are not stored as objects even in case they are relevant.

RACE uses its own PullParser infrastructure to minimize temporary objects and reduce processing time, providing explicit control over object construction for parse results.

The distinguishing factor of RACE pull parsers is that lexical constructs such as tags or values are reported as mutable, pre-allocated gov.nasa.race.common.ByteSlice objects, which represent ranges into utf-8 byte arrays and support efficient comparison by means of inlining (the RACE build process is configured to inline contents of package gov.nasa.race.common.inlined). Only in case of relevance are slice values converted into Strings or parsed into primitive values such as Double.

To further improve performance of critical parse functions RACE also provides the gov.nasa.race.tool.StringMatchGenerator tool that generates efficient slice matchers which employ prefix factorization and loop unrolling in order to speed up matching against a set of lexical constants such as tag names. While this can significantly increase performance it usually does require manual completion to extract relevant values and reduces extensibility of the respective parser.

RACE parser support is avaliable for XML, JSON (Javascript object notation) and CSV (comma separated values) text formats. The respective classes reside in the gov.nasa.race.common package of module race-core. Concrete parsers have to extend these abstract classes.

The following sections show example snippets for each of these text formats.

XML

The following example fragment is from the gov.nasa.race.air.translator.MessageCollection parser that extracts SfdpsTrack objects from SWIm SFDPS MessageCollection messages:

class MessageCollectionParser(..) extends UTF8XmlPullParser2 .. {  // (1) class definition
                                                // (2) lexical constants
  val _flight_ = ConstAsciiSlice("flight")
  val _enRoute_ = ConstAsciiSlice("enRoute")
  val _positionTime_ = ConstAsciiSlice("positionTime")
  ..

  def parseMessageCollection: Unit = { ..
      while (parseNextTag) {                     // (3) pull loop
        if (isStartTag) {
          if (tag == _flight_) {                 // (4) manual matching
            parseFlight                          // (5) element parse functions
        ..

  def parseFlight: Unit = {
    // values to extract
    var cs: String = null                        // (6) local value cache
    var date = DateTime.UndefinedDateTime
    ..

    def flight (data: Array[Byte], off: Int, len: Int): Unit = {
      //--------------------------------------- (7) value extractors provided by user
      @inline def process_flightIdentification = {
        .. cs = attrValue.intern ..                // (8) direct slice internalizer

      @inline def process_position = {
        if (tagHasParent(_enRoute_)) {             // (9) element path matcher
          if (parseAttr(_positionTime_)) {         // (10) attribute matching
             date = DateTime.parseYMDT(attrValue)  // (11) attribute value extraction
        ..


      //--------------------------------------- (12)  match harness generated by StringMatchGenerator
      @inline def match_flightIdentification = {len == 20 && data(off) == 102 && data(off+1) == 108 && ..
      @inline def match_position = {len == 8 && ..

      if (isStartTag) {
        if (match_flightIdentification) {
          process_flightIdentification
        } else {
           .. if (match_position) {
                 process_position
              ..
    }

    parseElement(flight)                          // (13) element pull loop

    //--- object creation
    if (cs != null && date.isDefined &7 ..) {
      elements += new SfdpsTrack(cs,date,..)      // (14) parse result
    ..
  }

(1):	the concrete parser class is extending either `UTF8XmlPullParser2` or `StringXmlPullParser2`, depending on if the input data is a utf-8 byte array or a String object

(2):	parser implementation usually begins with definition of lexical constants for tag and attribute names

(3):	shows an explicit pull loop for XML elements (`while (parseNextTag)..`)

(4):	shows explicit tag matching. While this is efficient for single tests it is not optimal if we match a number of tags with common prefixes

(5):	since this is a pull parser we can provide methods that parse single elements in order to improve re-usability and readability of parsers

(6):	element parse functions that generate objects usually start with local variables to cache extracted element values

(7):	value extractor (`process_X`) methods are user provided

(8):	`ByteSlice` supports value extractors that avoid temporary allocations, such as direct String internalizing from slices

(9):	value extractors can disambiguate tags by using element path queries. The XmlPullParser2 keeps track of current element paths

(10):	attribute parsing is optional

(11):	attribute value extraction is also slice based, i.e. works like element value extraction

(12):	performance critical element tag matchers can be generated by the `StringMatchGenerator` tool (the match code does not require modification)

(13):	`parseElement(<match-func>)` is a convenience method that provides a pull loop over child elements

(14):	eventually parse functions create objects (such as `SfdpsTrack`) that are the result of parsing the XML message

While this parser structure is flexible and efficient for extracting selected values it should be noted that it does not support on-the-fly XML schema validation.

JSON

The same pull parser infrastructure is also supported for JSON. The following is a snippet from gov.nasa.race.air.translator.OpenSkyParser that parses a JSON object containing an array of aircraft state vectors:

class OpenSkyParser extends UTF8JsonPullParser .. {

  val _states_ = ConstAsciiSlice("states") ..  // lexical constants

  def parse (msg: Array[Byte], lim: Int): .. = {
    def parseState: Unit = {
      matchArrayStart
      val cs = readQuotedValue.intern
      skip(1)
      val timePos = readUnQuotedValue.toLong
      skip(1)
      ..
      skipToEndOfCurrentLevel
      matchArrayEnd

      if (cs.nonEmpty && ..)
        elements += new FlightPos(..,cs,..,timePos,..)
    }

    matchObjectStart
    ..
    readMemberArray(_states_){
      parseState                             // per-element parse function
    }
    matchObjectEnd

Due to the syntactic nature of JSON there is no generic parseElement but dedicated, type specific pull methods such as match{Object,Array}Start and parse{Un}QuotedValue. Values are also returned as ByteSlice objects, i.e. don't incur heap overhead. Array element and object member parsing is supported by read{Member}Array and read{Member}Object pull loop functions that are parameterized with element/member parse functions. The JSONPullParser also supports syntactic levels and member key paths (the latter one being of less importance for JSON).

CSV

Although comma separated value formats lack the syntactic structure of XML or JSON, RACE does provide a similar pull parser that makes use of slices in order to minimize temporary object allocation, especially in the context of extracting subsets of values. The following example is from gov.nasa.race.air.SbsUpdater:

class SbsUpdater .. extends UTF8CsvPullParser {
  val _MSG_ = ConstAsciiSlice("MSG") ..
  ..
  def parse: Unit = {
    while (skipToNextRecord) {
      if (readNextValue == _MSG_) {
          ..
          skip(2)
          val icao24 = readNextValue.toHexLong     // (1) mandatory value
          ..
          skip(3)
          val alt = if (parseNextNonEmptyValue)    // (2) optional value
                   Feet(value.toInt) else UndefinedLength
          ..

          if (.. alt.isDefined ..)
          new FlightPos(icao24,..,alt,..)
      }
      skipToEndOfRecord
    }

The parser API contains a readNextValue method to return mandatory field slices (1) and a parseNextValue for optional fields (2).

StringMatchGenerator tool

A performance critical task for XML parsing is to match tag and attribute names against a set of string constants. While this can be done manually by means of a sequence of tests:

val _a1_ = ConstAsciiSlice("a1")
val _a2_ = ConstAsciiSlice("a2")
..
if (tag == _a1_) ..
else if (tag == _a2_) ..
..

this ignores common prefixes within a closed set of strings. Factoring such common prefixes is a tedious task and hence RACE includes a gov.nasa.race.tool.StringMatchGenerator application that generates respective sources. The script/smg script can be used to run the tool like this:

> script/smg gen a1 a2

which will generate the sources to match a data range against "a1" and "a2":

def matchBytes(data: Array[Byte], off: Int, len: Int): Unit = {

  //--- (1) replace with code to process respective matches
  @inline def process_a1 = println("a1")
  @inline def process_a2 = println("a2")

  //--- (2) auto generated matcher code (no need to modify)
  @inline def match_a = { len>=1 && data(off)==97 }
  @inline def match_a1 = { len==2 && data(off+1)==49 }
  @inline def match_a2 = { len==2 && data(off+1)==50 }

  if (match_a) {
    if (match_a1) {
      process_a1
    } else if (match_a2) {
      process_a2
    }
  }
}

The resulting matcher code (match_.. and if (match..).. expression) is invariant and can be directly copied into clients, but semantic match actions (process_..) have to be modified. The process_.. methods usually extract and store values or call recursive parse functions.

Use of the StringMatchGenerator is not without downsides since it reduces readability of parser code. It should be mostly reserved for performance critical parse functions.

Parser input

All RACE parsers work on utf-8 byte arrays. Consequently, there is a UTF8{Xml,Json,Csv}PullParser class that gets directly initialized with a byte array input. This is the preferred base class for custom parsers since it avoids array allocation and string translation.

For String inputs there are buffered and unbuffered parser versions. The buffered parsers use a pre-allocated byte buffer and RACEs UTFx.UTF8Decoder (which is a value class) to convert string chars into utf-8 byte sequences. For efficiency reasons there are specialized ASCIIString versions of string based parsers.

Although conversion into utf-8 byte arrays is transparent, the relevant supporting classes of RACE (gov.nasa.race.common.UTFx and gov.nasa.race.common.StringDataBuffer) can be used separately, e.g. in cases that have several clients requiring the same input.

ByteSlice hierarchy

Although they can be used independently for many string processing tasks, gov.nasa.race.common.ByteSlice objects are a central element of RACE parsers. To that end RACE includes a hierarchy of related types:

ByteSlice types

ByteSlice is itself just a trait that provides methods to specify a range within a byte array. It does not define any fields and as such is invariant with respect to the range and source data reference. It should be noted though that it does not guarantee invariant source array contents (byte values). While we could use Scala's ArraySeq to also enforce invariant data this would require data copies (heap allocations), and the primary purpose of ByteSlice is to avoid such copies.

There are two dimensions of ByteSlice refinement - range mutability and semantic meaning of byte values. Mutability of data source, offset and length is reflected in Mut.. and Const.. variants of concrete ByteSlice classes and is generally self explaining.

While there could be other uses, the prevalent semantic meaning of ByteSlice byte values is to represent string data, i.e. text. The primary text encoding format is now UTF-8 which uses up to four byte values to represent unicode code points. This encoding is reflected in respective Utf8.. variants of ByteSlice classes. However, while UTF-8 is byte compatible with the ASCII character subset, processing general UTF-8 data requires encoding/decoding iteration over respective byte arrays and hence can incur significant runtime overhead (ASCII processing often can be done with simple indexed array element access and low level memory block operations). To that end, we provide specialized Ascii.. slice classes that avoid this overhead, especially given that our main target domain (airspace applications) primarily uses ASCII data.

The Java ecosystem reflects this fact by means of using the java.lang.CharSequence interface for most text processing tasks, such as regular expression matching. In order to be compatible with this ecosystem the perhaps most important ByteSlice refinement of RACE is the CharSeqByteSlice that provides the required charAt(index) conversion. Since this is our most basic text representation this trait also defines text-to-primitive conversions such as toInt and toDouble.

In addition, the intern method allows to efficiently create internalized String objects directly from slices, without resorting to temporary string allocations for the sole purpose of interned string lookup. This is especially important in RACEs primary application domain which features a high number of repetitive messages referring to the same entities (e.g. position update messages using call signs to identify flights). Implementations of this method make use of RACE's Internalizer and ASCII8Internalizer, the latter one being used for strings with up to 8 bytes (that can be directly represented as long values).

UTF-8 variants of CharSeqByteSlice use the Utf8Encoder/Utf8Decoder support in gov.nasa.race.common.UTFx for character conversion, which utilizes Scala value classes to store coder state and hence further reduces memory overhead.

ConstAsciiSlice is the primary representation for lexical constants such as tag names in RACE parsers.

StringDataBuffer is the main construct to generate ByteSlice objects from Java String objects (or other CharSequences) which encapsulates the dynamically allocated byte arrays to hold the respective text data. There are two main objectives for this type: (1) to enable re-use of the same buffer for successive string conversions (i.e. to avoid per-string object allocation such as String.getBytes), and (2) to provide specialized conversions for ASCII text, which is orders of magnitudes faster than iteration based unicode conversion.