[ALL] let's think about Query Language for 2.0

Fri May 23 12:52:54 CDT 2003

Hi Ruslan,

On Freitag, Mai 16, 2003, at 04:51  Uhr, Ruslan Zasukhin wrote:

> Hi Guys,
>
[...]
>> 1) Query is the BRAIN of database kernel.
>>
>> 2) Query allow us execute search and sort on Tables without SQL.
>> We can consider this as low level engine of query execution.
>> SQL is bult over this as higher level. It will use something from  
>> Query    and
>> add own algorithms.
>>
>> 3) Query must be able to support Relational model, Netowrk model,
>> Object-Relational model and Object model.
>>
>> 4) Query can be one of the next kinds:
>> 4.1 Single Column.
>> For example,     f1 = 5, f1 > 5, f2 LIKE 'asd', f2 REGEX '.*22'
>> RESULT is Set of records of Table
>>
>> 4.2 Single Table.
>> For example,     (f1 = 5 OR f1 = 15)
>> (f1 >=5 AND f1 <= 7)
>> f1 BETWEEN 5 AND 7
>> RESULT is Set of records of Table
>>
>>
>> 4.3 Multi Table ???
>> Query cannot be of multi-table kind, because this require usage of  
>> LINK
>> apart.
>> RESULT should be something complex as JOIN Table.
>>
>> 5) In the 4.2 point, query can be of any complexity. We must be able  
>> analyse
>> it,
>> build the optimal plan of execution, execute. If needed explain the  
>> plan.
>>
>> 6) We must be able describe Query in terms of c++ and other  
>> langauges, using
>> simply API.
>> We will not use for this STRINGs like in SQL, because this require  
>> parser.
>> We will use something like factories to step by step build tree of  
>> query.
>>
>> 7) The query is represented by Tree of ENodes.
>>
>> 8) When we get the Tree of ENodes we start Query.Execute().
>> During it we ask Planner to build Tree of Plan_Nodes.
>> Then start execute this plan.
>> If tables was not changed, this plan still is up to date and can be  
>> used for
>> repeated search if need. So plan is stored in the Query object.
>> If tables was changed, then we will need build plan again.
>>
>> 9) Query can work as on indexed so not indexed fields.
>> Planner take this into account of course.
>> Depending on combination of indexed and not indexed field, and on  
>> operations
>> itself,
>> we can get
>> a) column searches that return BitSets, then combine that bitsets.
>> b) scanning of table and filtering of records.
>> c) at first SET searches, then scanning for selection we have got  
>> before.
>

Here are my thoughts - although i did not address each of your ideas.
[sorry please - i apologise for that huge mail ;)]

If I understood you right, firstly, your primary concern is to  
eliminate the overhead imposed by a SQL parser when processing queries  
specified in a string. IMO, the overhead will be not so significant -  
compared to a complex query requiring to access a disc.
Secondly, your idea is to use the "native" language making a "micro  
kernel" API, exposing query primitives (and possibly others like  
locking, transaction, etc.)

One thing, not explicitly mentioned, if I understand you right, this  
API shall work in a client/server environment as well??

I find this very useful -  because of the layer below a Query language  
- say SQL - or others (OQL, etc.) which can be easily adopted on top  
the "query micro kernel".

What I have missed in all previous mails and suggestions is that there  
is little attention to the already complete solved problem of  
relational queries (in theory). It also has a mathematical basis which  
makes it easy to apply the rules and formalism to a language  
(relational calculus, SQL for instance) or a framework (through an  
API). So, please let me briefly repeat the lessons:

- The relational model is based on the mathematical concept of a  
Relation using set theory and predicate logic.

- A relation is a "table" having columns and rows - called tuples.

- A tuple is a set of attributes (columns)

Notice: in a relational DBMS, the data are presented to the user only  
through relations!

Relational Algebra

Relational algebra is a theoretical language having operations that  
uses relations as operands and returns relations as result without  
modifying the arguments. The relational algebra is especially  
convenient when it shall be implemented using a programming language.

Example:
Suppose, op is a binary operator and R1 and R2 are relations we can  
write:
R = op(R1, R2)
where R is a new relation yielding the result of the operation op.

Since both, operands and result are relations (so called a closure  
property), the operations can be arbitrarily nested:
R = op1(op2(R1, R2), R3)

There are five fundamental operations in relational algebra (but  
several others exists):
selection, projection, cartesian_product, union, set_difference.

Well, explaining each operator in detail is beyond the scope here, IMO  
this is well know anyway ;)

The selection operator is an unary operator (having one operand)  
returning a relation which tuples satisfy a certain condition - which  
is thus a subset of the original relation. We call this condition a  
predicate.
A predicate is a truth-valued function with arguments. The predicate is  
applied to a set S of values x in which is then called a proposition -  
a function where its arguments are substituted according the current  
value x and which then yields to either true or false. Notice the  
association of a predicate with a set!

Since predicates are boolean expressions, we can use them in arbitrary  
complex boolean expressions - thus getting a compound predicate:

p = p1 AND (p2 OR p3)

Example:
we could use the selection operator like this:

R = selection(R1, p)

where p - for instance - is a predicate p(x) and x is a tuple in R1  
(tuple oriented relational calculus).

Well, if we want to use the programming language (e.g. C++) for  
directly stating the relational algebra, we need to define - in that  
language - what the various artifacts Relation, Tuple, Predicate etc.  
are.

So - for example, what is a Relation expressed in a C++ class??

Ruslan, I will address this later here, but I want comment your  
suggestions first:

> 3) Query must be able to support Relational model, Netowrk model,
> Object-Relational model and Object model.

- Supporting Relational Model
of course!

- Supporting the Network Model

Don't know whether this is important in Valentina!?
Does Valentina implement the Network Data Model according CODASYL??  
(Group Item, Set Type, Record Type, etc.)??

-  Supporting the Object-Relational and Object Model

This extends the relational model by introducing Objects. Objects have  
attributes, methods, class members, class methods, inheritance,  
encapsulation, polymorphism, members which are itself objects, etc.
A lot of questions arise:
Where is the code for member functions located, how to ensure  
encapsulation, relaxing encapsulation? And several hundreds more!

Since objects have a much richer set of types for attributes, this  
might affect the query API. Objects also have methods which might be  
called:

For example:

Set<PersonRef>  s = select(Persons, Persons.DayOfBirth() <  
(time::today() -  2000))

where Persons is a set of PersonRef, and the expression  
"Persons.DayOfBirth()" evaluates to a predicate. A PersonRef has an  
operator Person* which returns a pointer  to an object of type Person.
(actually, making this all together work is not as simple as it might  
appear!!).

In the Object Model, we need Sets which are Containers containing  
objects or references (OIDs).
Additionally, we need methods for navigating through objects (point and  
arrow operators applied to objects).

Additionally, the Projection operator shall be applied to objects as  
well, so yielding to relations which columns are a subset of the  
attributes of the object. Notice, that the relation then does not  
contain objects - instead it contains the copies of a subset of the  
attributes of objects.

IMO, this all together is a challenging task. I guess, this requires  
also a complete object DB - which may have fundamental differences with  
respect to the implementation compared to a relational DB.

My suggestion:
A complete object-oriented API would be beyond a simple "query micro  
kernel". An object should be treated as a simple byte stream. Implement  
OO in a specific language, like OQL, ODMG.

However, with respect of querying a database - from the view point of a  
user (say a DB App developer) - there are pretty less differences!
Consider this:
A tuple could also be defined as a set of attributes | value pairs  
where attribute is of type string and used as the key. This set - a  
descriptor -  could be referenced by an immutable member of the tuple -  
thus defining its number, order and types of the attributes at runtime  
- but the values can be changed of course.
Then, a tuple becomes an object of class Tuple which "type" (a  
descriptor) will be defined at runtime and a relation is an object of a  
class Set<Tuple> containing only those tuples having the same  
"descriptor".
So, it looks like, the relational aspect is just a sub-set of the  
object-oriented aspect :)

There might be differences because of encapsulation, though.

"Kind of Queries"

Of course, all kinds of predicates and compare operators shall be  
implemented. This is relatively simple.

IMO, there should also be no restriction in the number of arguments  
(columns) which can be used in a predicate. Basically, internally there  
must be some AST (abstract syntax tree) which will be create either at  
runtime or compile time (!) when the predicate is defined. Then the AST  
will be used to perform the query. IMO, there is rather little  
correlation between the inherently complexity of this and the number of  
arguments - so there is little overhead in complexity when making it  
work allowing more than one column.

Single Table Query?

OK, joining tables may require more. But you (Ruslan) must it do  
anyway!!
This is Relational Algebra, and should be  implemented in a Relational  
DB ;)

So, why not multi table query??

(The subsequent paragraphs  6), etc. describe implementation details  
and thoughts)

I guess, the Query Facility will be implemented using the OO paradigm.  
;)

So, a quick but not so dirty OOA shows that we have these main classes:

The main classes of the Query API for Relational Algebra

DatabaseObject

A database object is a distributed object residing in the client side  
and in the server side and will be accessed through a handle or a  
reference. The communication between client and server boundary is done  
via  remote procedure calls (RPC).
A actual instance of an DO can reside in either the client or the  
server or in both requiring synchronisation.

Relation
A R. is an abstraction of a Table, having Columns and Rows.
A relation is an "DatabaseObject" which means, that the server has  
knowledge of the instances of it and possibly is the owner of that  
object. Thus, a relation is also a distributed object - residing in the  
client side as well as in the server side.

A relation is either in an opened or a closed state. The initial state  
is closed.
A R. may be implemented as a Cursor - that is, it manages internally a  
record buffer, holding
the column values of the current row and it references a "Table" which  
contains the
records.
If the relation is opened - that is, if there is data attached to the  
relation,  access of the column
values is allowed and will be achieved by means of the record buffer.
You should think of a "Table" as a private data structure only known by  
the Relation
and the Storage Manager - that its, it is complete opaque to client of  
a relation.

A relation can return a reference to the current tuple (row, record).
The relation can be iterated which affects the current tuple.

Member functions:
open()
close()
descriptor()	returns the tuple descriptor associated with this  
relation. Pre: open() == true
size()		returns number of tuples in the relation	
pos()		returns distance from first to current tuple
next()		
prev()
get()   		returns reference to current tuple. Pre: open() == true

Tuple
A Tuple is a set of attribute values (column values) and will be  
instantiated in the client side.
The elements of a Relation are tuples, however a tuple must not  
necessarily be an element of a relation.
A Tuple references a TupleDescriptor for internal purpose.

types: Tuple::Value

size()   		returns number of columns
at(pos)		returns reference to Tuple::Value at position pos
at(str) 		returns reference to Tuple with column name equal str
descriptor() 	returns the tuple descriptor associated with this tuple

TupleDescriptor
A TD is a list of column descriptors. A certain object of TD will be  
referenced by
relations and tupels (may be ref-counted).

ColumnDescriptor
  A CD is meta information for a column, like column name, default  
value, nullable,
its index in the tuple, and other properties specific for that column  
in that tuple.
A column descriptor also references an instance of a TypeDescriptor  
which is
meta data for describing the value_type of the column.
A CD will be owned by a TupleDescriptor.
A CD will be shared along all tuples belonging to a distinct relation  
and the relation itself.

TypeDescriptor
A type descriptor is meta data for types of the programming language.  
An instance of a TypeDescriptor
exists only once for a certain type (Singleton) in a program.
It stores information about type dependent functions: c-tor, d-tor,  
equal_func, assign_func, copy_func,
extractor_func, etc.

Predicate
A Predicate is a function object which return type is a boolean. (see  
also description above)
The predicate is realized as a tree structure, enabling it to compose  
complex predicates. Class Predicate
is a non-abstract but is not a concrete predicate - it serves as a base  
class for predicates and building the
tree structure.
There are several concrete predicates: equal, less_than, greater_than,  
etc. For class Predicate there are
global operator defined: && and ||.

Relational Algebra Operators

The Relational Algebra Operators will be implemented as global  
operators, taking arguments of
type Relation and auxiliary types for predicates and others. The  
operators are implemented in the server side.

A more C++ like example:

This shows some code examples which reside in the client side (as  
interfaces and libraries).

  - Note that this is just an idea! (However, several parts have already  
been implemented and proofed to work in a
previous framework)
- Note that this example is in C++ and uses templates, which may be not  
available in all languages,
but can be bypassed or it can be implemented in a slightly modified way.

// Suppose these classes have been defined in namespace db elsewhere:
class db::Relation;
class db::Tuple;
class db::TupleDescriptor;
calss db::ColumnDescriptor;
class db::Predicate;
class db::equal_pred;
class db::less_than_pred;

// Create a "tuple descriptor". It describes a tuple(mainly attributes  
and types). Can be re-used.
// A tuple descriptor is a list of column descriptors.
// In this case, a tuple descriptor will be created which has two  
column descriptors,
// one which is labeled "FirstName" and is of type string, the other is  
labeled
// "DOB" (means, "day of birth") and is of type date:
TupleDescriptor td;
td.push_back(ColumnDescriptor("FirstName", make_type_desc<string>()));
td.push_back(ColumnDescriptor("DOB", make_type_desc<date>()"));

// Note that a column descriptor may have several properties - which  
are not shown here.

// Create a relation with a specified tuple descriptor and attach it  
the base table "Persons".
// Note that this may throw an exception if the tuple descriptor is not  
applicable.
Relation persons(td); 		// c-tor with explicit row-descriptor
persons.open("Persons");  	// may throw "descriptor_does_not_match"

// Create predicates used in a search condition:
// Note, that they inherit all from an abstract base type!
Predicate p1 = equal_pred<string>("FirstName", "Ruslan");
Predicate p2 = less_than_pred<date>("DOB", date(2000, 01,01) );

// Perform a query, which is equivalent to
// "Select FirstName, DOB from Persons where FirstName = "Ruslan" and  
DOB < date '2000-01-01'":
// Note: when performing a relational operator, this operation will be  
actually executed on
// the server side. In order to accomplish this, the client side has to  
call some RPC mechanism
// which passes the handles of the relations as arguments to the server.
// Note also, that the instances of the records of a relation exist on  
the server side. They may
// be cached on the client side for performance reasons, though.

std::vector<std::string> cols;
cols.push_back("FirstName");
cols.push_back("DOB");
Relation r1 = projection(select(persons, p1 && p2), cols);
  // the line above performs two relational operators: projection and  
selection!

persons.close();  // Do not need the base table relation any longer

ASSERT(r1.open());
ASSERT(r1.TupleDescriptor().size() == 2);
ASSERT(r1.TupleDescriptor()[0].Type() == make_type_desc<string>.Type());
ASSERT(r1.TupleDescriptor()[1].Type() == make_type_desc<date>.Type());

for (int i = 0; i < r1.size(); ++i)
{	
	const Tuple& t = r1[i];  // this, internally causes a fetch, which is  
performed
                              // on the server side, returning the  
record which will be
                              // used to instantiate the current tuple  
managed on the
                              // client side within the relation.
	cout << t["FirstName"].get<string>() << "  " <<  
t["DOB"].get<date>.ToString() << endl;

	// note: t["name"] returns Tuple::Value, which represents a reference  
to a column value with opaque type.
	// Tuple::Value.get<T>() is a checked reinterpret cast - faster than  
dynamic_cast (one address 	// comparison), but fails if the type does  
not match, and returns a reference T pointing into the tuple buffer.
	// This reference can be used to modify the column value.
	// In the example above this ref is const, however (because of const  
Tuple& t).
	// Note that tuples have c-tors, compare semantics, assignment and  
other operators making it
	// very useful even outside the query. However, a tuple may not have  
the same value layout as
	// the record buffer of a result set, nor does it use the same buffer.  
When there is a tuple, a buffer for
	// the values must be allocated according of the type and order  
defined in the tuple descriptor.
	// These types  are C++ classes or ordinary types which values must be  
mapped from internal
	// record buffer representation in the storage manager to the C++  
types.
	// A relation may manage only one of such a tuple(and buffer)  
internally (thus implementing a cursor).
	// So, references to tuples become stale in this case when iterating  
the relation. If you need more than
	// one tuple at once, then you would need to make a temporary copy of  
at least one tuple.
}

fstream f("result.out");

f << r1;

r1.close();
f.close();

/*  
------------------------------------------------------------------------ 
-
  * Additional examples

// tuple descriptor will be deduced from the base table:
// When using this method, all columns of the base table appear in the  
relation.
	Relation r("MyBaseTable"); // creates row-descriptor dynamically

// You can access the tuple descriptor:

	const TupleDescriptor& td = r.TupleDescriptor();

// A TD is a list of column descriptors:
	int count = td.size();
	const ColumnDescriptor& cd = td[0];

// You can ask the type of the column:
	cout << cd.Type();

// Advanced options:
	if (cd.PropertyList().HasAttribute("SQL_TYPE"))
		cout << cd.PropertyList().GetValueOf("SQL_TYPE");

// When you need to fetch the value of the column, you need to know its  
type!
	Tuple t = r[0];
   	std::string name = t["Name"].get<std::string>(); // you need to  
know, that it's a std::string!
	double name t["Name"].get<double>(); // oops, throws exception!!

// But you are not lost, if you don't know the type:
	std::string dayOfBirth = t["DOB"].AsString();

// Does this work ??:
	cout << t["Name"] << endl;
	cout << t["DOB"] << endl;

// Yes it does!!

// Is this possible???

	t["Name"] = std::string("Hello");
	t["DOB"] = date::today();

// Yes, although EXACT match of argument types is required, as can be  
seen in code snip below:
// -- If it is implemented in such a way!

class Tuple
{
	//...

	class Value {
	public:	

		template <typename T> Value& operator=(const T& x)		{
			if (make_type_desc<T>.Type() != _cd.Type())
				throw err_type_mismatch();
			_cd.assign_func(_cd.buffer(), static_cast<const void*>(&x));

			return *this;
		}

		template <typename T> const T& get() const
		{
			if (make_type_desc<T>.Type() != _cd.Type())
				throw err_type_mismatch();
			if (_cd.tuple().is_NULL(_idx))
				throw err_value_is_NULL();

			return *reinterpret_cast<const T*>(_cd.buffer());
		}	

	private:
		Value(const Tuple& tuple, size_type idx) :  
_cd(tuple[idx].ColumnDescriptor()) {}
	private:
		const ColumnDescriptor&	_cd;

	friend class Tuple;
    	};
};

*/

Tasks:

1) Analysis and Design of  Relation, Tuple,  Predicate;

2) Find relational algebra operators.

3) Interface MUST NOT depend on the internal implementation of  
Valentina.

4) Possible artifacts:
      RPC mechanism, AST, tree, tree-iterator, Optimizer (AST  
transformation), Query-Planer, Cost-Model
note: an abstract cost model, decouples storage manager from query  
optimizer.

5) add Object aspect.

6) Refine Relation with respect to common container features: iterator  
capabilities, associations, sequence, etc.

7) Implement RPC

A challenging task!

If a relation is a Cursor , then it manages internally a row-buffer,  
and also references some external data-file - or say an external  
"record-container". This is kind of a container the storage manager  
takes care of and owns also.
A relation then also has a current row.

Relations may have some degree of freedom of how they behave and of how  
they can be implemented. For instance, they may or may not copy the  
data rows which have been queried.  If they do copy them, they need  
itself some kind of container to hold them.
In this case, for instance,  a standard container can be used -e.g. a  
std::vector, std::map, etc. customizable by the db app developer. The  
method of populating this custom container is suspect for further  
discussions, though.

If you would please take a look into the SQL3 standard (already  
frequently cited), you will find the different kind of Cursors and how  
they behave differently in several aspects, e.g. visibility of changes,  
updateable or non-updateable, life time with respect to transactions,  
etc.
Cursors, Result Sets and Relations probably are the same thing.

That's it for today.

Best Regards

Andreas Grosam