Introducing SeQL: The Query Language for Set Theoretic Data Bases

Every database naturally needs a way to interact with it. While for an embedded database an API is sufficient, any serious database needs a Domain Specific Language that is perfectly suited for the operations of that database. As CantorDB utilizes a novel form of database, when I finished the initial API, I naturally set out to create SeQL.

Of course, there is the obvious issue that time spent upon the design and creation of an entire DSL

Those who are unfamiliar with CantorDB are directed to read Introducing CantorDB, which discusses CantorDB and the Set Theoretic Data Base in more depth. As such I was left with the difficult question of how I should model my Query Language. A note on the name: SeQL stands for Set Query Language. One could pronounce it SehQual or S-e-Q-L or Set-Q-L; all of them are proper. I was naturally slightly perturbed at those men in the 70s who decided upon Structured Query Language as opposed to Relational Query Language, thus robbing me of SQL meaning Set Query Language. However, ever the punk rebel, I merely added the e to make it SeQL. This is because I am not creative in the least.

Options for Style

There are obviously many styles that a query language can take; after all, there is more than one way to skin a cat. For example, GraphQL uses a pseudo-JSON style, stating which object it wants and which properties from that object. Another option would be to use a pseudocode style imitating how one would interact with JSON within JavaScript itself. I could also use the near English style of SQL, or finally I could go esoteric.

How to Choose?

Obviously the best tool should be chosen for any job, and just as a data base is a tool, so too is a Query Language a tool for interacting with a data base. As such we must ask ourselves what the primary role of SeQL will be. The entire ideal of CantorDB is that of the Set Theoretic model, in that the data can be manipulated via set algebra. As such any query language which does not elegantly permit the use of Set Algebra may be dismissed out of hand. This naturally precludes JSON based Query Languages as they are best suited for CRUD operations. It is fairly natural to write a query asking for data in the schema of the data requested. However, if one is curious which elements are in two different categories, this does not lend itself towards JSON. This leaves three candidates: a totally esoteric language, a terse near English style Query Language, or set notation in and of itself.

Set Notation

Using set notation does have its advantages, as it is already well suited for performing set algebraic operations, having been developed to express these operations for over a hundred years. There is however one fatal flaw which will lead to the dismissal of this notation as the base of the Set Query Language out of hand. Set Notation Operations are neither present on the keyboard nor in ASCII. This would make the language prohibitively difficult, requiring the memorization of keyboard shortcuts and Unicode code points to type the simplest query. This is an obstacle that no one, including myself, would overcome to even play with, let alone use seriously, CantorDB. I would have then wasted my time on a Query Language as the few users, if any, would purely utilize the API.

Esoteric Language

I am certain that by chaining together arbitrary symbols on the keyboard I could produce a query language which is exactly suited to Set Queries. I could even borrow sections of Set Notation, adapting them for use with the esolang.

As an example query < * (DogsVCats)

This would roughly correspond to < meaning GET, * meaning ELEMENTS and V meaning the UNION.

While this could have the greatest mathematical purity, it would however not be beautiful. Another strike against this Eso Set Query Language is that if I ever intend SeQL to be more than a language which only I know, then it must be learnable. Having to memorize specific characters standing for any number of operations or modifiers would become tedious fast and thus increase the barrier to entry.

Near English Query Language

There is something unusual about SQL. When one looks at the syntax, a query has a structure and rhythm all its own that no other mathematical or algorithmic language possesses.

SELECT name FROM users WHERE age > 13 GROUP BY job ORDER BY dateofbirth

This query has both structure and rhythm. SQL really could have utilized for example a more R based notation system, but instead it went with the terse, near English style that has become inextricably associated with it. It was not the first programming language to attempt to present itself in a near English style for both writability and readability. One can for example point to the verbose near-English syntax of COBOL. However, SQL, partially due to it being a DSL, managed to mostly dodge the verbosity. While complicated queries can spiral out of control with JOIN hell, for the most part a SQL query is both near English and terse, combining the near English syntax of COBOL with the terseness of a computer language divorcing itself from the English Language. SQL has another benefit in that it is a terse example of how a person would naturally request data from the database.

Please select every person's name from our users that is older than thirteen.

Please group them by job and order them by their date of birth.

In this sentence one can see every element, and indeed most of the words, utilized within the SQL query. This is an additional point of beauty in that it serves as a bridge between the computer style language which is understood by rocks we stuff with electricity and mages with esoteric knowledge and the style of language which is a natural fit with how humans communicate daily.

I mentioned COBOL earlier, but I really should also mention BASIC as another Near English Language. Indeed, this was the design standard during the fifties and sixties.

This in and of itself is beautiful, and as every set algebra operation has a clear unique name, the obvious choice is that of a Near English Query Language.

Applying These Principles

This NEQL, a term I made up for this article, must also reflect the same principles. That is, it must have structure, rhythm, and reflect both how humans think about data and at least partially how machines think about data. It should, like SQL, serve as a bridge between human thought and expression and the machine. As such I came up with the following principles.

Every word must describe the operation as closely and simply as possible
Each word must map to either an operation or be used for separating sections in the parser
Symbols must be kept to an absolute minimum
Sections should not bleed into each other
The flow of the query should map onto human thought patterns and language
The flow of the query should likewise be roughly recognizable as a set operation

Through following these principles I would be able to follow the soul of SQL even as I did not feel myself tied to its syntax or structure. To solidify both in myself and in the eyes of any user that this, while it is a NEQL type language, is not SQL, my first concrete syntax decision was to use GET as the operator for gathering data instead of SELECT. However, to select a different term for an operator just because SQL happened to utilize it would be to define myself as NOQL, and as relational databases are based upon set algebra, this would not only be reactionary, but would also serve to hamper the development of the language.

As such there were two other reasons that I selected GET for my syntax. The first is that SeQL is primarily an operational language where algebra and filters are applied to sets before the result is output to the user. SQL, on the other hand, is much more of a declarative language. In a way, it has that in common with GraphQL, where one is declaring which data he desires, rather than saying which operations are to be applied to the data. While both are divorced from the underlying algorithms, when SeQL says UNION, the union operation is performed using an algorithm designed for that purpose. Meanwhile a select could perform a hash map operation, a binary tree search, scroll an index, or just brute force its way through the tables. The underlying operation is divorced from the data that has been requested. GET is a much more operational verb which has the connotation of physically grabbing an item or set of items. Meanwhile select is a more passive word; after all, one can select merely by pointing. The difference is between pointing at the mustard one desires while a clerk fetches it and grabbing the mustard directly from the shelf. Both are divorced from the logistics of delivering, but one does include the surface level operation needed.

To compare

SQL: SELECT name FROM pets WHERE type = 'dog'

SeQL: GET ELEMENTS OF Dogs

The SQL expression describes the set that satisfies the predicate, whereas the SeQL expression describes the operation to be performed resulting in a set. The second reason that I selected GET is From this simple philosophy flows the rest of the primary operational syntax. Every word should describe the operation that is being performed. And thus the CRUD operations, for example, each describe an action. One adds a set to another or removes a set. Likewise properties are created and then added before they can be updated to their proper value. There is however one exception, that is, IS. It is always necessary to determine the truth of something.

After all, we as humans will often ask, is such and such true? Indeed one could say that the entirety of the Principia Mathematica is just asking "Is that true?" and "Why?" until the teacher hurls his chalk at the two of them, informing them that if they want to ask such questions to the point of absurdity then they can answer it themselves. As such, there is the IS operator or predicate within the language. One is able to ask various questions such as IS dog ELEMENT OF mammal or IS mammal SUPERSET OF dog and so on. Each of these, rather than describing an operation to be performed, instead asks a true or false question of the data and thus represents the one, and essential, break from the operational philosophy.
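To make the two predicate forms concrete, here is a minimal sketch using ordinary Python sets. It assumes a toy in-memory model in which each set name maps to the names of its direct members; the names db, is_element_of and is_superset_of are hypothetical and are not CantorDB's API.

```python
# Toy model (an assumption for illustration): each set name maps to
# the names of its direct members.
db = {
    "mammal": {"dog", "cat", "rex", "tom"},
    "dog": {"rex"},
    "cat": {"tom"},
}

def is_element_of(element, container):
    """Roughly: IS element ELEMENT OF container."""
    return element in db[container]

def is_superset_of(big, small):
    """Roughly: IS big SUPERSET OF small."""
    return db[big] >= db[small]

print(is_element_of("dog", "mammal"))   # True
print(is_superset_of("mammal", "dog"))  # True
print(is_superset_of("dog", "mammal"))  # False
```

Both functions return a truth value and leave the data alone, which is exactly what separates them from the operational verbs.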

OF

Couldn't GET ELEMENTS OF Dog be shortened to GET ELEMENTS Dog? To be fair, yes indeed. OF does serve as a nice pivot for my parser, as it separates the header describing the specific operation to be performed upon the set or resulting set from the algebra. However, in statements such as IS dog ELEMENT OF mammal, we see that dog, being a set, is to the left of the OF, thus lessening its ability as a pivot. To be frank, while in the operational queries all sets are to the right of the OF and in the parsing it does serve as a convenient marker, in the predicate operations it's mostly for flow. OF largely serves a psychological role within SeQL, that is, pushing the user towards thinking in the set database model. Sets belong to other sets and have sets within them, and that is the primary and only form of data relationship. Every query and operation is based upon asking about membership and filtering the results of membership questions. Therefore while OF can have variable utility within the language, I decided upon making an exception to my earlier rules and leaving it as part of the language even in predicate queries, in order to retain the terse NEQL style as well as to highlight the way one should be thinking when dealing with a Set Theoretic Data Base.

The Algebra

For being the very heart and soul of the Set Theoretic Database, the Set Algebra did not require that much thoughtful consideration. Or perhaps it is because of this focus that it was non negotiable. Each set algebra operation is self descriptive and terse, thus they neatly map to the above stated philosophy. As such, I ported them over without any change except for the omission of Cartesian Product and Power Sets. The reasoning for these two decisions is in the CantorDB article linked above.

Dot Precedence Notation

Every designer is entitled to one highly opinionated decision that truly breaks what is considered normal for the domain. For me this was dot precedence notation. Dot Precedence Notation utilizes dots wrapping operators to give precedence with the number of dots determining precedence. Such as 4 .+. 2 x 4. The addition is first as it has one dot encompassing it giving it a precedence of one as opposed to the multiplication operator having a precedence of zero. The expression can be rendered in parenthetical notation as (4 + 2) x 4. As elaborated above, there is a beauty in a Near English Query. However, parentheses serve to mar its elegance. For parentheses extend above and below the line. As such, they break up the natural scanning of the line by the eye. In addition, they give meaning to statements that are far from them. For example the problem ((5x3)+2)-1 has the first ( give precedence and therefore meaning to the + operation as the second operation to be performed, however, it is distant from the modified operation and thus will result in the eyes returning to already scanned sections of the problem in order to continue parsing the expression.

Another issue is that numerous parentheses produce visual clutter. Lumping parentheses for more than two or at most three levels of precedence quickly becomes unparsable to the eye without at minimum moving parsing into higher levels of thinking, if not manually counting the number of parentheses. Mathematics has partially solved this issue through the use of brackets, but programming does not have such a luxury as it would complicate the parser and utilize an ASCII character for a redundant purpose. While one could say that writing more than one or two levels of precedence within a query is bad form, this is merely telling the user to not trigger the issue rather than solving it properly. However, both of these issues are solved by Dot Precedence Notation. This notation is of my own invention but is partially inspired by the Principia Mathematica. Specifically, its authors utilized Peano's dot notation, which replaces parentheses with dots. The exact rules and a full critique of this system are beyond the scope of this document, but suffice it to say that its defeat and reduction to irrelevance within modern notation is not without merit.

As a quick example,

GET ELEMENTS OF : dogs UNION cats . INTERSECTION . "brown _fur" UNION "white _fur"

would in modern parenthetical notation roughly be

GET ELEMENTS OF (dogs UNION cats) INTERSECTION ("brown _fur" UNION "white _fur").

Thus as one can see, Peano's notation solves only the issue of beauty at the cost of severe ambiguity. Dot Precedence Notation on the other hand wraps the operator itself in dots, with the number signifying precedence. For example (5+8) x (6 - (1 +4)) becomes 5 .+. 8 x 6 .-. 1 ..+.. 4. First one plus four is collapsed, as the addition operation at the end has two dots, which grants it the highest precedence in the expression. Next 5 + 8 and 6 - 5 are collapsed before ending with 13 x 1, as the multiplication operator has no dots surrounding it. This solves both of the issues of parenthetical notation highlighted above. The dots denoting precedence are next to the operation that they modify. A quick scan can immediately inform the viewer, once he has become accustomed to the notation, that the final addition operation has priority, as humans can very easily compare length. This is almost like degrees of bolding, as the dots naturally catch the eye while still enabling it to slide past without interruption, providing both a variable spotlight upon operations and enabling the flow of the NEQL to continue without interruption. Thus, the lack of beauty and the fact that parentheses are distant from the operation that they modify are both solved. Dot notation also solves the last issue in that levels of precedence are easier to parse. On paper I would stack the dots, which would provide for a much easier to parse level of precedence; however, even with just using horizontal length to convey the number of dots, the difference is clear on a larger scale than with parentheses.
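The collapsing order described above can be sketched as a tiny evaluator. This is only an illustration in Python, not SeQL's actual parser: it assumes whitespace separated tokens, the same number of dots on both sides of an operator, and left to right resolution when two operators carry the same number of dots. The names parse and evaluate are hypothetical.

```python
import operator

# Arithmetic operators used in the article's examples.
OPS = {"+": operator.add, "-": operator.sub, "x": operator.mul}

def parse(expr):
    """Split an expression into operands and (symbol, dot_count) operators."""
    tokens = expr.split()
    operands = [int(t) for t in tokens[0::2]]
    operators = []
    for t in tokens[1::2]:
        symbol = t.strip(".")
        dots = (len(t) - len(symbol)) // 2  # dots on one side of the operator
        operators.append((symbol, dots))
    return operands, operators

def evaluate(expr):
    """Collapse the most dotted operator first, leftmost first on ties."""
    operands, operators = parse(expr)
    while operators:
        top = max(dots for _, dots in operators)
        i = next(k for k, (_, dots) in enumerate(operators) if dots == top)
        symbol, _ = operators.pop(i)
        right = operands.pop(i + 1)
        operands[i] = OPS[symbol](operands[i], right)
    return operands[0]

print(evaluate("5 .+. 8 x 6 .-. 1 ..+.. 4"))  # 13
print(evaluate("4 .+. 2 x 4"))                # 24
```

Note that the evaluator never looks away from the operator to decide its priority; the dot count travels with the operator, which is the property argued for above.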

Consider

((((((A UNION B) INTERSECTION C) DIFFERENCE (D UNION E)) SYMDIFF F) INTERSECTION (G DIFFERENCE H)) UNION (I INTERSECTION (J UNION K)))

versus

A UNION B .INTERSECTION. C ..DIFFERENCE.. D UNION E ...SYMDIFF... F ....INTERSECTION.... G DIFFERENCE H .....UNION..... I INTERSECTION J UNION K.

While in the first example the highest precedence operation is not immediately apparent (indeed the LLM that I used to generate these problems got it wrong), in the second it is abundantly clear with only a quick scan of the line that the final UNION is to be performed first. This is despite UNION having fully five dots and INTERSECTION four. Thus with these facts in hand, and with Dot Precedence Notation being simpler to parse, I decided to follow my own opinion and discard parenthetical notation in favor of my own Dot Precedence Notation.

I also, after some thought, decided to dispense with the usual inherent precedence of Set Algebra in favor of explicit and left to right precedence in order to make the internal order of operations as transparent as possible. This is in the name of reducing ambiguity, such that an expression will parse exactly how it is initially read by the user rather than requiring step by step thinking to ensure desired results.
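A small Python sketch shows what is at stake. Python's own set operators give & (intersection) a higher precedence than | (union), so an undotted expression is not read left to right there. The helper eval_left_to_right below is hypothetical and only handles a flat, undotted expression; it is meant to illustrate the left to right rule, not to reproduce SeQL's parser.

```python
# Keyword-to-operation table, built on Python's built-in set methods.
SET_OPS = {
    "UNION": set.union,
    "INTERSECTION": set.intersection,
    "DIFFERENCE": set.difference,
    "SYMDIFF": set.symmetric_difference,
}

def eval_left_to_right(query, env):
    """Evaluate 'A OP B OP C ...' strictly left to right, with no inherent precedence."""
    tokens = query.split()
    result = env[tokens[0]]
    for op, name in zip(tokens[1::2], tokens[2::2]):
        result = SET_OPS[op](result, env[name])
    return result

env = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
A, B, C = env["A"], env["B"], env["C"]

print(eval_left_to_right("A UNION B INTERSECTION C", env))  # {3}
print(A | B & C)  # {1, 2, 3}: Python intersects B and C first
```

The same three sets give two different answers depending on the rule, which is why the reading order and the evaluation order are made to coincide.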

But What About Set-Builder Notation?

When I first began to design CantorDB, as is discussed in depth above, I desired a sort of Set-Builder notation to sit alongside the Roster notation currently only used for set construction. I eventually had to jettison this idea. However, there is still need for this form of set as a tool. For example, what if one desired to have the set of all children? One could indeed make such a set, but this would mean that membership in the set of children and the property of age are disconnected entirely, which could lead to data incongruity where a twenty year old never had his membership changed and thus is still a child as far as the database is concerned. This is a fatal flaw in the closed universe assumption inherent within the Set Theoretic Database. Therefore it would be better to have the set of Children be people WHERE age < 18. I have still not despaired of using VIEWS for this Set-Builder, but in the current version one will have to just use the WHERE operator. The WHERE operator serves as a Set-Builder filter to produce sets on the fly using properties. This is admittedly a concession towards SQL as it operates upon properties rather than the pure sets, but as Set-Building is a part of set notation, I determined that this was enough justification beyond its inherent usefulness to be consistent with the philosophy of naive set theory.
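The incongruity can be illustrated with a short Python sketch. The data, the stored roster and the helper where are all made up for the example; the point is only the contrast between a membership list maintained by hand and a set built on demand from a property.

```python
# Hypothetical records: Bob was added to the roster as a child and has since turned 20.
people = [
    {"name": "Ann", "age": 9},
    {"name": "Bob", "age": 20},
]

# Roster style: membership is stored and must be updated by hand.
children_roster = {"Ann", "Bob"}

def where(elements, predicate):
    """Set-builder style: derive the set from properties at query time."""
    return {p["name"] for p in elements if predicate(p)}

children = where(people, lambda p: p["age"] < 18)

print(children)                    # {'Ann'}
print(children_roster - children)  # {'Bob'}: the stale member
```

The derived set cannot drift away from the age property because it is recomputed from it every time.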

There are currently two of these operations within SeQL, though CantorDB only has one. The first is WHERE, which always has first precedence and operates upon the set immediately to its left before any algebra is performed. The second is FILTER, which has the same precedence rules as all algebraic operations. I split the operator in the name of clarity. For if you wished to have WHERE operate first, you would have to count dots to ensure this. It could therefore lead to bad habits where one automatically gives WHERE ten dots to be doubly certain. However, the user still might wish to perform a filter upon the result of set algebra. These two conflicting needs could not both be satisfied by one operator. As such, I split WHERE into WHERE and FILTER such that precedence is completely unambiguous.

It should be noted that FILTER collapses left to right, meaning cats UNION dogs FILTER age > 4 will filter the UNION of cats and dogs rather than just dogs. For example GET ELEMENTS OF People WHERE age < 18 UNION Parents FILTER country = "US" would have WHERE operate first, before FILTER restricts the resulting UNION, people under eighteen together with parents, to the country of the United States. (Though it should be noted that country really should be a set rather than a property.)
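The order of evaluation in that query can be traced step by step in Python. The records, the two sets and the helper keep are invented for the illustration; only the ordering (WHERE on People alone, then UNION, then FILTER on the whole result) comes from the description above.

```python
# Hypothetical property data keyed by element name.
records = {
    "ann": {"age": 9, "country": "US"},
    "bob": {"age": 40, "country": "US"},
    "cho": {"age": 12, "country": "KR"},
    "dee": {"age": 35, "country": "FR"},
    "eli": {"age": 50, "country": "US"},
}
people = {"ann", "bob", "cho", "dee", "eli"}
parents = {"bob", "dee"}

def keep(names, predicate):
    """Keep only the elements whose properties satisfy the predicate."""
    return {n for n in names if predicate(records[n])}

# WHERE binds to the set immediately to its left, before any algebra.
under_18 = keep(people, lambda r: r["age"] < 18)
# UNION is then applied to the narrowed set.
combined = under_18 | parents
# FILTER applies to the result of everything to its left.
result = keep(combined, lambda r: r["country"] == "US")

print(sorted(under_18))  # ['ann', 'cho']
print(sorted(combined))  # ['ann', 'bob', 'cho', 'dee']
print(sorted(result))    # ['ann', 'bob']
```

Had the age test been applied after the UNION instead, bob and dee would have been dropped as well, which is the ambiguity the WHERE and FILTER split removes.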

Views, when added, will be an extension of this concept except that they will be able to also use set algebra to define their dynamic contents. The above example could become a view, say American Minors And Parents, and thus be queryable like a normal set.

Aliases

From here on out, we will discuss features that are or will be purely within SeQL. This layering permits compromises to the set theoretic model to be made for the ease of the end user without harming the pure foundations of CantorDB. In set theory every set has just one name. However in the real world the same category often has different names depending on the context. For example dogs and canines are the same set. As such, SeQL supports ALIAS within the parser itself. This is purely for ease of use such that other names for the set can easily be added for the user's convenience. CantorDB will never see the alias as it will resolve to the actual name of the set before sending it to the backend. This makes Aliases practically a type of macro or symbol table.
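Since an alias is resolved before the backend ever sees the query, the idea can be sketched as a simple substitution pass. This Python version is an assumption about the general shape only: the alias table, the function resolve_aliases and the whitespace tokenization are hypothetical, and quoted set names containing spaces are deliberately not handled.

```python
# Hypothetical alias table: alias -> actual set name.
aliases = {"canines": "dogs", "felines": "cats"}

def resolve_aliases(query, table):
    """Replace every token that is a known alias with the real set name."""
    return " ".join(table.get(token, token) for token in query.split())

print(resolve_aliases("GET ELEMENTS OF canines UNION felines", aliases))
# GET ELEMENTS OF dogs UNION cats
```

Because the substitution happens entirely on the query text, the database side needs no knowledge that aliases exist.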

Order By

Within set theory, order is ancillary and not significant for the value of an item in the set. Cat can come before or after dog in the set of mammals and it does not lend any property or value to either one. While for algorithmic purposes CantorDB is well ordered, there is no data contained within the order. Thus the ORDER clause serves purely as sugar for the user to better be able to read the list of sets printed by the shell. As ORDER is not an operator, predicate, or CRUD operation, I decided that it should be last. Thus in a normal GET operation, the header is positioned first, then the set algebra and finally ORDER. This is to make it as explicit as possible that ORDER is not part of the database, but rather a convenience for the user. ORDER can take any property and rank it by ASCENDING or DESCENDING. One of these is required before any property, such that ORDER serves as a structural marker and ASCENDING or DESCENDING serves to define the type of sorting applied to the property directly after.

It should be noted that items without the named property will always be sorted last, either in the order of one of the other ordering properties or in an indeterminate order. When two or more ASCENDING/DESCENDINGs are chained, they are resolved left to right. If the first property results in equal order for elements, the elements are then ordered by the next property until there are no more properties, whereupon the order will be arbitrary.
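These ordering rules can be sketched with a comparison function in Python. The function order and the sample data are hypothetical; the sketch assumes each element is a dictionary of properties and that "arbitrary" order may simply be whatever order the elements arrived in.

```python
from functools import cmp_to_key

def order(items, keys):
    """Sort items by a chain of (property, direction) pairs, resolved left to right.

    direction is 'ASCENDING' or 'DESCENDING'. Items lacking a property are
    placed after items that have it, regardless of direction.
    """
    def compare(a, b):
        for prop, direction in keys:
            a_has, b_has = prop in a, prop in b
            if a_has != b_has:
                return -1 if a_has else 1  # the item missing the property goes last
            if not a_has:
                continue                   # both missing: fall through to the next property
            if a[prop] != b[prop]:
                result = -1 if a[prop] < b[prop] else 1
                return result if direction == "ASCENDING" else -result
        return 0                           # no property separates them
    return sorted(items, key=cmp_to_key(compare))

pets = [
    {"name": "rex", "age": 3},
    {"name": "tom"},
    {"name": "ace", "age": 3},
    {"name": "bo", "age": 1},
]
ranked = order(pets, [("age", "DESCENDING"), ("name", "ASCENDING")])
print([p["name"] for p in ranked])  # ['ace', 'rex', 'bo', 'tom']
```

Here tom has no age and so lands last even though the age ranking is descending, while the tie between ace and rex is broken by the second property.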

What Got Left Out

The most glaring omission for those who are familiar with SQL is that of GROUP BY. While I did consider GROUP BY I eventually decided to omit it as it would provide an extra layer of set upon the list of sets. It would compromise the foundation of the database and of SeQL without much benefit. While ORDER provides the slight illusion of order within the data mattering, GROUP BY would entirely shatter the precedence of sets. In addition, any data that would be at all useful for GROUP BY should be a set. For example if you wanted to GROUP BY gender, gender really should be the set of men and the set of women. As such, if I ever did add GROUP BY it would take sets as its grouping, but as sets are not exclusive, this could lead to issues. This plus other open questions means that I have decided to push GROUP BY into the far future if ever.

Conclusion

It is my hope that I have achieved that which I set out to do, namely, to create a beautiful NEQL with both rhythm and structure. A language which is approachable for beginners and powerful for advanced users. While one can never be certain until their language is used by others, it seems to me that I have accomplished this goal. At the very least, when I created something, it was beautiful.