Data Modeling

This section describes the schemas used in Hollow and how your data model maps to them.

Schemas

A Hollow data model is a set of schemas, usually defined by the POJOs the producer uses to populate the data. This section uses POJOs as examples, but there are other ways to define schemas; for example, you could ingest a text file and use the schema parser.
A Hollow dataset comprises one or more data types. The data model for a dataset is defined by the schemas describing those types.

Object Schemas

Each POJO class you define results in an Object schema, which is a fixed set of strongly typed fields. The fields are based on the member variables of the class. Each schema has a type name, which defaults to the simple name of your POJO (in this case, Movie). Each field also has a field type; in this case the field types are INT, REFERENCE, and REFERENCE, respectively.
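As a minimal sketch, a POJO consistent with the description above might look like the following. The field names (`movieId`, `movieName`, `cast`), the `Actor` class, and the `sketchFieldType` helper are assumptions made for illustration; the helper only mimics the mapping described here for a few Java types and is not Hollow's own code.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ObjectSchemaSketch {

    // Hypothetical POJOs; the names are illustrative only.
    static class Actor {
        int actorId;
        String actorName;
    }

    static class Movie {
        int movieId;        // primitive int: described as an INT field
        String movieName;   // String: described as a REFERENCE to a separate type
        Set<Actor> cast;    // collection: described as a REFERENCE to a separate type
    }

    // Simplified illustration of the field-type mapping (assumption, not Hollow code).
    static String sketchFieldType(Class<?> javaType) {
        if (javaType == int.class) return "INT";
        if (javaType == long.class) return "LONG";
        if (javaType == boolean.class) return "BOOLEAN";
        if (javaType == float.class) return "FLOAT";
        if (javaType == double.class) return "DOUBLE";
        return "REFERENCE";
    }

    // Lists each member variable of the POJO with its sketched field type.
    static Map<String, String> describe(Class<?> pojo) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Field field : pojo.getDeclaredFields()) {
            if (field.isSynthetic()) continue; // skip compiler-generated fields
            fields.put(field.getName(), sketchFieldType(field.getType()));
        }
        return fields;
    }

    public static void main(String[] args) {
        // The type name defaults to the simple class name.
        System.out.println("type name: " + Movie.class.getSimpleName());
        System.out.println("fields: " + describe(Movie.class));
    }
}
```

Running `main` prints the type name `Movie` followed by each member variable and the field type the sketch assigns to it.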
REFERENCE: A reference to another specific type. The referenced type must be defined by the schema. Notice that since the reference type is defined by the schema, data models must be strongly typed. Each reference in your data model must point to a specific concrete implementation. References to interfaces, abstract classes, or java.
Primary Keys

Object schemas may specify a primary key. When defined in the schema, primary keys are part of your data model and drive useful functionality and default configuration in the Hollow explorer, Hollow history, and diff UI. They also provide a shortcut when creating a primary key index. Primary keys defined in the schema follow the same convention as primary keys defined for indexes: they consist of one or more field paths, which auto-expand if they terminate in a REFERENCE field.

In the above example, our fields are now of type INT, STRING, and REFERENCE.
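A POJO whose fields come out as INT, STRING, and REFERENCE could be sketched as follows. This is an illustration under stated assumptions: the field names and the `Actor` class are hypothetical, and `InlineStandIn` is a locally declared stand-in for Hollow's `@HollowInline` annotation so that the sketch compiles without the Hollow library. The `sketchFieldType` helper only mimics the behavior described here.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class InlinedFieldSketch {

    // Stand-in marker annotation (assumption); a real producer would use
    // Hollow's own inline annotation instead.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface InlineStandIn {}

    // Hypothetical referenced type.
    static class Actor {
        int actorId;
    }

    static class Movie {
        int movieId;                      // INT
        @InlineStandIn String movieName;  // STRING: value stored in the Movie record itself
        Set<Actor> cast;                  // REFERENCE
    }

    // Simplified illustration: an inlined String becomes STRING, otherwise REFERENCE.
    static String sketchFieldType(Field field) {
        Class<?> type = field.getType();
        if (type == int.class) return "INT";
        if (type == String.class && field.isAnnotationPresent(InlineStandIn.class)) {
            return "STRING";
        }
        return "REFERENCE";
    }

    // Lists each member variable of the POJO with its sketched field type.
    static Map<String, String> describe(Class<?> pojo) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Field field : pojo.getDeclaredFields()) {
            if (field.isSynthetic()) continue; // skip compiler-generated fields
            fields.put(field.getName(), sketchFieldType(field));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(describe(Movie.class));
    }
}
```

The only difference from a fully referenced model is the marker on `movieName`: the string is encoded in each Movie record rather than in a separate record that Movie points to.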
While modeling data, we choose whether or not to inline a field for efficiency. Consider an award record with a name field holding a value such as “Best Supporting Actress”. Over the years, many such awards will be given, so we’ll have many records that share that value. If we use an inlined STRING field, the value “Best Supporting Actress” is repeated in every such award record. If we instead reference a separate record type, all such awards reference the same child record holding that value.
Record deduplication happens automatically at the record granularity in Hollow. Try to model your data so that, where records contain a lot of repetition, the repetitive fields are encapsulated in their own types.

Conversely, when the values of a field are mostly unique across records, referencing a separate record type means we have to retain roughly the same number of unique character strings, plus the references to those records. In that case, we save space by using an inlined STRING field instead of a reference to a separate type. A REFERENCE field isn’t free, so we shouldn’t encapsulate fields inside their own record types where we won’t benefit from deduplication. These fields should instead be inlined.
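The trade-off can be made concrete with a rough count. The sketch below is a simplified model rather than Hollow's actual storage accounting: it counts how many string values and how many references would be retained for a field under each choice. The class name, method names, and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class InlineVsReferenceSketch {

    // Inlined field: every record carries its own copy of the string.
    static int inlinedStringCopies(List<String> fieldValues) {
        return fieldValues.size();
    }

    // Referenced field: identical child records are deduplicated,
    // so only one string is retained per distinct value.
    static int referencedStringCopies(List<String> fieldValues) {
        return new HashSet<>(fieldValues).size();
    }

    // Referenced field: every parent record still holds one reference.
    static int referencesStored(List<String> fieldValues) {
        return fieldValues.size();
    }

    static void report(String label, List<String> values) {
        System.out.println(label
                + ": inlined strings=" + inlinedStringCopies(values)
                + ", referenced strings=" + referencedStringCopies(values)
                + ", references=" + referencesStored(values));
    }

    public static void main(String[] args) {
        // Highly repetitive values: referencing deduplicates most strings.
        report("award names", Arrays.asList(
                "Best Supporting Actress", "Best Supporting Actress",
                "Best Supporting Actress", "Best Picture"));

        // Mostly unique values: referencing keeps every string and adds references.
        report("unique names", Arrays.asList(
                "Name A", "Name B", "Name C", "Name D"));
    }
}
```

In the repetitive case the referenced model retains far fewer strings than records; in the mostly unique case it retains just as many strings as the inlined model and pays for the references on top.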
We refer to fields which are defined with native Hollow types as inlined fields, and fields which are defined as references to types with a single field as referenced fields.

Namespaced Record Type Names

For maximum efficiency, referenced types should sometimes be namespaced, so that fields with like values reference the same record type while referenced fields of the same primitive type elsewhere in the data model use different record types. Other referenced string fields in our data model, which are unrelated to award names, should use different types corresponding to the semantics of their values.

Namespacing fields saves space because references to types with a lower cardinality use fewer bits than references to types with a higher cardinality. The reason for this can be gleaned from the In-Memory Data Layout topic underneath the Advanced Topics section. Namespacing fields is also useful if some consumers don’t need the contents of a specific referenced field.
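To illustrate why cardinality matters, the sketch below assumes that a reference needs just enough bits to address every record of the referenced type. This is a simplification for illustration, not a statement of Hollow's exact layout (which is covered in the In-Memory Data Layout topic); the class name, method name, and cardinalities are hypothetical.

```java
public class ReferenceWidthSketch {

    // Bits needed to address `cardinality` records numbered 0..cardinality-1.
    // Assumption: reference width grows with the log of the target type's size.
    static int bitsPerReference(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive");
        }
        int maxOrdinal = cardinality - 1;
        // Width of the highest ordinal in binary, with a floor of one bit.
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxOrdinal));
    }

    public static void main(String[] args) {
        int awardNameRecords = 30;          // hypothetical namespaced type
        int sharedStringRecords = 1_000_000; // hypothetical catch-all string type

        System.out.println("namespaced award names: "
                + bitsPerReference(awardNameRecords) + " bits per reference");
        System.out.println("shared string type: "
                + bitsPerReference(sharedStringRecords) + " bits per reference");
    }
}
```

Under this assumption, a reference into a 30-record type needs 5 bits, while a reference into a 1,000,000-record type needs 20 bits, so every award record pays four times as much for the same field when award names share a type with all other strings.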