Friday, July 17, 2015

Towards a Multiply-Linked List model for Storage and Retrieval of Trie Structures {S,P,O}

Ed. Note: Clearly, this web log format does not serve to present an ideal format for presentation of mathematical notations. The syntax of the following article may be difficult to read. It should be reformatted, subsequently, to something more immediately legible in its web-facing presentation.

Earlier this month, I'd begun developing a UML model for a graph-oriented application system, namely after a matter of small confusion with regards to differing board support information about the Cryptotronix CryptoCape circuit board. The design, as such, began to by-in-large diverge from the original concern as with regards to logical and physical modeling of electrical systems. It's become, it rather seems, a design for albeit a simplistic manner of graph-oriented application framework, extending of a FreeBSD operating system

As one of the core features in that design -- as yet, the design as such being represented in a by in large incomplete and unpublished document, but here might be a start of some of a literate documentation about it -- the design features a sort of a simple, abstract binary tree model for storage of trie formatted data, as of -- in a manner or description logics -- a data format {S,P,O} for resources S and O being, respectively, a subject and an object joined of a predicate P. This format, of course, is by in large influenced of the design of the Resource Description Framework (RDF).

The abstract binary tree model proposed in the same presently unpublished design, the same tree model, in its format, is essentially of a generic format like so: {{S,P}, {P,O}}

In a sense, perhaps it may seem like an arbitrary design concept, as here it would not be immediately correlated to any logical or mathematical models as with regards to data structures and computing. Of course, the author is at least vaguely familiar with a generic, diadic format of assembler instructions.  The original design, itself, may be much like unto a cafe placemat sketch but produced in UML format.

Of course, the design may evolve over time -- much to the author's own sense of uncertainty, as to whether or not to publish the initial design resource, presently. Essentially, it's a multi-layered design -- developing some of a service-oriented view, in parallel to a design for a simple graph information service, in a manner focusing on CORBA and some of a network security model. In each essential layer, however, the design is by in large conceptually naive, and not as yet completely developed to any single proof of concept. In some of a view of  a competitive software development community, it could seem like the design is far too much of a ripe manner of chum for a sharkey web, thus it's presently unpublished.

The storage of the {{S,P}, {P,O}} model would, itself, be one feature to address of the design -- here unnamed -- in any implementation.  The generic model does not, in itself, provide any detail as to how it  may be best implemented. Broadly, and albeit without any sense of a single usage case, an application could maintain exactly three tables, of each of A={S,P}, B={P,O} and M={A,B}.

Each of A,B, and M, hypothetically, could be implemented as a sequence value -- in Common Lisp, optionally an adjustable array -- each contained within some sort of an encapsulating object -- in an abstract sense representing each of A, B, and M as a table each implemented with a vector. In addition to each of the binary object values in each relation A,B, and M, the implementation may suffix a numeric index value onto each respective table -- thus, A' = {S,P, N_A}, B' = {P, O, N_B}, M' = {N_A, N_B, N_M}.

In something of a list-oriented approach, alternately, each of the tables may be given an index value N — fig. 1

A" = {S, P, N_B, N_A} // Including N_B in A"
B" = {P, O, N_M, N_B} // Including N_M in B". Redundant 'P'
M" = {N_A, N_B, N_M} // The "End" relation

Thus, in constructing any complete {S,P,O} relation for any S, any row of B" could be retrieved in beginning at a row in A", and any row in M" from a row in B", in such an implementation. This suggest a manner of a singularly linked list.

In refactoring the design into a doubly-linked list model, A" and M" may be retained — the refernce to P, made exactly once in A" — thusly, fig. 2

A" = {S, P, N_B, N_A}
B"' = {N_A, O, N_B}
M" = {N_A, N_B, N_M}

In that design, M" would be essentially redundant, though allowing perhaps for an {S,P,O} to be referenced as a unique object independent of A" and B"'.

In a more normally list-oriented design, towards a manner of a single-linked list — fig. 3

A"" = {S, N_B, N_A}
B"" = {P, N_M, N_B}
M"" = {O, N_M}

Towards a manner of a multiply linked list — fig. 4

A""' = {S, N_B, N_M, N_A}
B""' = {P, N_M, N_A, N_B}
M""' = {O, N_A, N_B, N_M}

Once the initial design is refactored onto a list-like data model -- as reminiscent of the (CAR,CDR) semantics of lists in Common Lisp -- then the design begins to assume a sense of applicable normalcy. For an application system, the {A""', B""', M""'} model might represent a manner of a most effective baseline data model design, it allowing for a search for any of {S,P}, {P,O}, {S,O}, and {S,P,O} sequences, essentially without any further data structure definitions in the table itself.

In developing a manner of a meta-semantic view of the last table design, in the previous: Essentially, it describes a model in which regards — fig. 5

Given: {M_1,.. M_N}
Define tables {M_1, M_N} ...
* as that each table M will be defined with N+1 columns
* such that in any row n of a a table M: an index value N will be defined in the row, such that N = n
*  furthermore, in any row n of a a table M: Each of {M_a, .. M_b} will be referenced, for a != n, b != n

This meta-semantic model allows for indexing and search across all elements of any single set type structure S = {M_1, ... M_Nx} 

So, it's probably as well that the original design document has not yet been published. To resolve the original design document of all its distinct naivete in design, the design document itself should need to be suffixed with so many considerations of the technical qualities of any single implementation, it may serve to render the original design document essentially obsolete.

Not as though to leave it to collect dust for all eternity, here's the first textual design document about an as-yet unpublushed, graphical model of a graph OS.

Continuing this article as from a description of generic data structures, towards a concept about application, it may be noted that the table structure described in fig. 4 may be initialized from any set of trie structures {S, P, O}. Some of the columns in the table would be populated not of data values represented in the original input structures, but rather as to provide for a small number of "Procedural shortcuts" for optimization of searches onto each table  of {S,P,O}.

Of course, once the table as illustrated in fig. 4 would be initialized of any single table of {S, P, O} if any single update would be made as effectively onto any row N of the original table, then any number of additional data stuctures — such as would be created during the table's initialization — would likewise need to be updated, specifically for any changes in reference across any single, original {S ,P, O} data structure.

Perhaps towards a more structurally modular design of the tabular model illustrated in fig. 4, the supplemental reference fields as added to fig. 3, in this design, may be essentially separated such as to create one additional table structure, removed of redundant refences  — thus, fig. 6

A_x = {S, N_B, N_A}
B_x = {P, N_M, N_B}
M_x = {O, N_M}
C_x = {N_M, N_A} // {O,S} in {S,P,O}

The supplemental table, C,  need not include row index values. That table would not provide original reference values, but would be maintaintained as for purpose of optimization of supplememtal searches across direct {S,O} references in the table of {S,P,O}. For searches singularly onto {P,O} in {S,P,O,} the table B_x may be queried. For searches singularly onto {S,P}, the table A_x provides a sufficient set of data fields.

Certainly, fig. 6 may serve to present an optimal table structure, at least across the set of table structures described in fig. 1 through fig. 4 and fig. 6.

In the methodology of storage/reference encoding illustrated as in fig. 6,  a table group structure is illustrated. This may be juxtaposed to a singular table structure, such as may be implemented onto an SQL relational database -- fig. 7

N_a = {S, P, O ,N_N}

In the table group methodology for storage/reference encoding, structural data field elements are stored independent of other elements. This table group methodology may be more closely analogous to data structures within a Common Lisp implementation, juxtaposed to data structures of a conventional SQL relational database implementation.

In a manner of further refining the table group illustrated in fig. 6, the tables A, B, and C may be removed of cross reference values, then applied only for storage and identity of data values for each element of {S, P, O}. Data references as of {S,P}, {P,O}, and {O,S} would all be stored in separate tables, independent of the individual data object tables -- fig. 8

A_y = {S, N_A}
B_y = {P, N_B}
M_y = {O, N_M}
C_y = {N_M, N_A}
D_y = {N_A, N_B}
E_y = {N_B, N_M}

The author has, to now, assumed it would be a known axiom that each of the N_fum row index values, in each table, would be an unsigned integer value unique in the respective table. Each N_fum value, correspondingly, may be interpreted as representing a manner of structure-relative reference address for each such data value.

If this methodology for table-oriented data storage would be extended logically as with regards to  CMUCL widetag and lowtag values, individual tables within the model may be referenced of, each, the respective widetag or lowtag interpretation of each data value's type. With the respective widetag and lowtag values being normatively documented, as for portable applications, the data encoding model, itself, would not be uniquely applicable of the respective Common Lisp implementation in which it would be applied.

The CMUCL widetag and lowtag values, as effectively forked in SBCL -- directly, as forked along with the broader CMUCL source tree, from the time of the origin of the fork -- the original CMUCL widetag and lowtag values are defined as a feature of the type system in CMUCL.

As an implementation of Common Lisp the Language, 2nd Edition (CLtL2) CMUCL -- as well as SBCL, and the small number of Common Lisp implementations in addition to SBCL and CMUCL --  essentially, every CLtL2 implementations has produced, in each distinct species of CLtL2 implementation, a manner of a semantic approach in implementing the system of data types developed and standardized in in CLtL2. Focusing on the CMUCL implementation, the semantics of the implementation might be observed as around the CTypes system developed in CMUCL, furthermore as with regards to definitions of data types and functionally mechanical forms, such as developed in the CMUCL compiler.

The CMUCL compiler's system of type definitions -- insofar as inherited by SBCL, if not furthermore adapted in the forked codebse -- the compiler's system of type definitions described, specifically with regards to SBCL, in an article by Paul Khuong: Hacking SSE Intrinsics (Part 1) (2009) with something of a reprise, in the article How to Define New Intrinsics in SBCL (2014).  The former of the two articles presents something of functional view of the compiler's types system, extending of an extension of the types system for an implementation of the Intel SSE instruction set, in SBCL.

With regards to documenting the implementation in SBCL, certainly, each functional form -- that is to say, each function and macro -- and each data value, as illustrated in either of the two articles, could be more comprehensively described with, of each, a single reference page. Each reference page could be referenced, furthermore, from a  broader documentation as to describe any manners of semantic correlations among the respective forms, in the implementation of the compiler specifically in SBCL. Although that would surely make for an interesting study, it would present a broad diversion from the topical focus of this article at DSP42.

In regards to the compiler implementation in CMUCL, and furthermore in SBCL, both implementations each define a set of data types, such that may be applied in a manner essentially internal to the compiler of each of the two implementations. Those data types may be observed for their definition and application, as around the widetag and lowtag values defined in each implementation. 

Defining a small subset of the complete set of widetag and lowtag values -- the latter, as would be defined in any single revision of each of CMUCL and SBCL -- essentially, a subset of widetag and lowtag values may be defined, such that the subset would denote the set of widetag and lowtag values defined for "User visible" data types -- i.e. a set of data types, such that might be reasonably expected as to be encountered of data values in portable user programs extending of the respective implementation. This article will denote that subset as it representing a set of types of user object.

That subset of widetag and lowtag values may then be applied such that -- for any single such type of user object -- firstly, the widetag value then secondly the lowtag value may each serve as an effective key value onto a single data storage table, such that the format of the respective data storage table may be of a format conducive to encoding and decoding of objects of such type, in a manner of a data serialization model.  Such data serialization model may be implemented as to develop an ad hoc encoding directly onto files, or as to develop a set of table structures and procedural forms in utilizing an existing database management system, essentially as for purposes of data serialization from within any single instance of the respective Common Lisp implementation.

In regards to serialization of structure objects, condition types, and standard classes furthermore, each respective type's class OID may serve as a fundamental key reference -- eliding, momentarily, concerns as with regards to  class names -- in a model then allowing for a manner of selection as to which slot values of any single class would be serialized into persistent storage, and which slot values may be initialized simply in interpretation of persistent data.

Of course, this does not describe all of the details of an object persistence model. Though it may serve to provide something of a guiding concept with regards to a deign, as such, onto either or both of CMUCL and SBCL -- not a furthermore portable design, immediately -- there may be some additional concerns that may be addressed, as with regards to:
persistent "linking" of each initialized objects to the object's persisted formfurthermore with regards to making references to objects' serialized forms without preventing garbage collection of objects that would be otherwise unreferencedand thirdly, deallocation of storage blocks on garbage collection of previously persisted objects. In albeit a naive sense, perhaps it might be more efficient to develop a memory mapped file model for the SBCL object system, such that -- in so many memory-mapped file/data areas -- an underlying operating system would immediately store any changes in the respective memory-mapped file/data areas, such that the updated data area might late be reinitialized of its filesystem mirror. This may serve to allow for something of a manner of failover, in a naive sense. With regards to optimization in the underlying file mirror, perhaps it might behoove such a software design, however, to make reference towards methods of filesystem journaling, such as with regards to the copy-on-write methodology  implemented in UFS journaling on FreeBSD.  In such regards, perhaps a broader goal may be defined as to develop a Common Lisp systems model extending immediately of FreeBSD.

In regards to the data persistence model denoted in the previous outline, certainly a consideration for the design of such a model is likewise reflected in the previous design discussion as with regards to the table group structures for the {S,P,O} data type.

Not as though to try to make such a project idea to seem too specifically grant-friendly, the prospective instance of a Common Lisp on FreeBSD systems model may be applied as for purposes of knowledge representation and knowledge sharing, effectively in any normative context to which an RDF graph may be developed -- ostensibly, even as to present a graph type model of operating system resources, such as mandatory access controls (MAC), on any single host or network, thus as to develop a management layer onto FreeBSD MAC data types -- whether or not, alternately, in as to develop any single, ostensibly peer-reviewed semantic model of the material sciences, or manufacturing, or commerce, for transparent application in communications and infrastructure support systems, at any single scale of support for material activities of a society or smaller community.

Not as though to paint with any too broad a brush, simply the very nature of knowledge modeling and knowledge sharing may be, likewise, of the very nature of the sciences. The material sciences being, furthermore, to the very essence of material technology, thence to the nature of material industry, thence to manners of material commerce, it may be possible to support development in all of industry and commerce, in supporting simply a development of science and technology in any single manner of normative social environment. 

Of course, in this broad view, a concept of intellectual property does not take up any clear or obvious place, neither any concepts of data assurance.

Ed. Note. The following section of the article was begun before the article diverged towards a more comprehensively CMUCL-like encoding for data values.

As with regards to a question of application, How many table groups like fig 6. or fig. 8 may need to be created, in any single application? and a further question, How may of each such table group may be most effectively stored and accessed for reference, in an application?  Considering some of the semantic qualities of RDF Schemas, as with regards to definitions of RDF classes and RDF properties, moreover considering the Web Ontology Language (OWL) as a specialization of RDF extending the essential class/property model, moreover with OWL adding a distinct Individual resource type to the abstract RDF data space, it may be feasible to implement a table group as illustrated in fig. 6, such that a single table group would be defined for every expressly defined OWL Class, likewise for each of OWL:Class and RDF:Class, as to provide an index of instances of the respective class. This naive "Table group per class" methodology may be further refined, if towards any singular application, as with regards to concerns of subsumption relations in RDF schema and other orthogonal considerations.

In something of a meta-semantic sense, considering the broad question of how many of a table group — as illustrated in fig. 6 or fig. 8 — may be "Well implemented," in an application, logically the question may be considered as in juxtapositoon to an orthogonal question: To what manner of a context may any such table group "need to be" singularly initialized and referenced, as a representation of any single table of {S,P,O}?

Offhand, a number of contexts may be defined — certainly, in no manner of a logically exhaustive list — in presenting at least a naive response to such a question:

Application as container of graph data sets
Graph data set as container of tuple or trie references, such as of all references of a format {S, P, O} within any single graph data set
• A Class A as a class reference context for tabular indexes of instances of Class A
• A Class B as a type reference context, for tabular indexes of instances of Class B and C, for C being a subclass of B

In this naive model, a class or type reference context may be defined for each RDF, RDFS, or OWL class both defined and being applied in any single graph data set, in any single application.

A class or type reference context need not be assigned as though exclusively to each single class. The symbolic name of a class — as in reference to cl:class-name, a function portably returning a symbol type value — as a symbol type of object, a class' name may be applied likewise as an associative reference, unique in any single table of graph-local and/or application-local data tables.

[RDF Schema] develops a type system for RDF Literal values, essentially in extending of XML Schema Datatypes.

In a model for interpretation of RDF data sets in Common Lisp, each XML Schema Datatype may be intepreted as to derive a corresponding data type in Common Lisp, together with a function for each of marshalling and unmarshalling an object of each respective type onto a character stream. Such an implementation, in a functuional regards, may resemble the CMUCL Ctypes model.

Depending on the characteristics of a Common Lisp encoding such that may be defined for any single RDF Literal data type, it may be advisable to consider revising the table group illustrated in fig. 6 or fig. 8, such as to allow for a specialized encoding in O onto {S, P, O}.

Of course, in a uniquely table stored model such as of an SQL database, then for each tabular reference as in fig. 6 or fig. 8, It may be advisable to encode a value representative of a type identity — such as of a class name, or an implelemtation-specific albeit non-portable value such as a classoid, in CMUCL or SBCL, such that may be applied as a symbolic reference to any type definition data provided of or in relation to any single class definition. A class identity, simply, may be applied for purpose of table selection in row index dereferenfcing, such as in any manner of a revision of the table set model denoted in each of fig. 6 and fig. 8.

This article essentially begins to describe a model for an implementation of a data persistence model, broadly, for objects of any arbitrary number of types, inasmuch as may be applied in {S,P,O} references in any arbitrary application. Towards developing an example onto a literary domain, this model may be furthermore developed as for purpose of maintaining a graph formatted registry of bibliographical ínformation structures, such that may be furthermore applied as in an independent bibliográphical management application within an academic content development system, or applied as an integrated feature of a graph-oriented information modeling platform — in the latter, as for purpose of creation and editing of bibliographical annotations onto graph object references, as to provide a framework for peer review of assertions in graph models.