[ALL] let's think about Query Language for 2.0
Andreas Grosam
agrosam at computerworks.ch
Fri May 23 12:52:54 CDT 2003
Hi Ruslan,
On Freitag, Mai 16, 2003, at 04:51 Uhr, Ruslan Zasukhin wrote:
> Hi Guys,
>
[...]
>> 1) Query is the BRAIN of database kernel.
>>
>> 2) Query allow us execute search and sort on Tables without SQL.
>> We can consider this as low level engine of query execution.
>> SQL is bult over this as higher level. It will use something from
>> Query and
>> add own algorithms.
>>
>> 3) Query must be able to support Relational model, Netowrk model,
>> Object-Relational model and Object model.
>>
>> 4) Query can be one of the next kinds:
>> 4.1 Single Column.
>> For example, f1 = 5, f1 > 5, f2 LIKE 'asd', f2 REGEX '.*22'
>> RESULT is Set of records of Table
>>
>> 4.2 Single Table.
>> For example, (f1 = 5 OR f1 = 15)
>> (f1 >=5 AND f1 <= 7)
>> f1 BETWEEN 5 AND 7
>> RESULT is Set of records of Table
>>
>>
>> 4.3 Multi Table ???
>> Query cannot be of multi-table kind, because this require usage of
>> LINK
>> apart.
>> RESULT should be something complex as JOIN Table.
>>
>> 5) In the 4.2 point, query can be of any complexity. We must be able
>> analyse
>> it,
>> build the optimal plan of execution, execute. If needed explain the
>> plan.
>>
>> 6) We must be able describe Query in terms of c++ and other
>> langauges, using
>> simply API.
>> We will not use for this STRINGs like in SQL, because this require
>> parser.
>> We will use something like factories to step by step build tree of
>> query.
>>
>> 7) The query is represented by Tree of ENodes.
>>
>> 8) When we get the Tree of ENodes we start Query.Execute().
>> During it we ask Planner to build Tree of Plan_Nodes.
>> Then start execute this plan.
>> If tables was not changed, this plan still is up to date and can be
>> used for
>> repeated search if need. So plan is stored in the Query object.
>> If tables was changed, then we will need build plan again.
>>
>> 9) Query can work as on indexed so not indexed fields.
>> Planner take this into account of course.
>> Depending on combination of indexed and not indexed field, and on
>> operations
>> itself,
>> we can get
>> a) column searches that return BitSets, then combine that bitsets.
>> b) scanning of table and filtering of records.
>> c) at first SET searches, then scanning for selection we have got
>> before.
>
Here are my thoughts - although i did not address each of your ideas.
[sorry please - i apologise for that huge mail ;)]
If I understood you right, firstly, your primary concern is to
eliminate the overhead imposed by a SQL parser when processing queries
specified in a string. IMO, the overhead will be not so significant -
compared to a complex query requiring to access a disc.
Secondly, your idea is to use the "native" language making a "micro
kernel" API, exposing query primitives (and possibly others like
locking, transaction, etc.)
One thing, not explicitly mentioned, if I understand you right, this
API shall work in a client/server environment as well??
I find this very useful - because of the layer below a Query language
- say SQL - or others (OQL, etc.) which can be easily adopted on top
the "query micro kernel".
What I have missed in all previous mails and suggestions is that there
is little attention to the already complete solved problem of
relational queries (in theory). It also has a mathematical basis which
makes it easy to apply the rules and formalism to a language
(relational calculus, SQL for instance) or a framework (through an
API). So, please let me briefly repeat the lessons:
- The relational model is based on the mathematical concept of a
Relation using set theory and predicate logic.
- A relation is a "table" having columns and rows - called tuples.
- A tuple is a set of attributes (columns)
Notice: in a relational DBMS, the data are presented to the user only
through relations!
Relational Algebra
Relational algebra is a theoretical language having operations that
uses relations as operands and returns relations as result without
modifying the arguments. The relational algebra is especially
convenient when it shall be implemented using a programming language.
Example:
Suppose, op is a binary operator and R1 and R2 are relations we can
write:
R = op(R1, R2)
where R is a new relation yielding the result of the operation op.
Since both, operands and result are relations (so called a closure
property), the operations can be arbitrarily nested:
R = op1(op2(R1, R2), R3)
There are five fundamental operations in relational algebra (but
several others exists):
selection, projection, cartesian_product, union, set_difference.
Well, explaining each operator in detail is beyond the scope here, IMO
this is well know anyway ;)
The selection operator is an unary operator (having one operand)
returning a relation which tuples satisfy a certain condition - which
is thus a subset of the original relation. We call this condition a
predicate.
A predicate is a truth-valued function with arguments. The predicate is
applied to a set S of values x in which is then called a proposition -
a function where its arguments are substituted according the current
value x and which then yields to either true or false. Notice the
association of a predicate with a set!
Since predicates are boolean expressions, we can use them in arbitrary
complex boolean expressions - thus getting a compound predicate:
p = p1 AND (p2 OR p3)
Example:
we could use the selection operator like this:
R = selection(R1, p)
where p - for instance - is a predicate p(x) and x is a tuple in R1
(tuple oriented relational calculus).
Well, if we want to use the programming language (e.g. C++) for
directly stating the relational algebra, we need to define - in that
language - what the various artifacts Relation, Tuple, Predicate etc.
are.
So - for example, what is a Relation expressed in a C++ class??
Ruslan, I will address this later here, but I want comment your
suggestions first:
> 3) Query must be able to support Relational model, Netowrk model,
> Object-Relational model and Object model.
- Supporting Relational Model
of course!
- Supporting the Network Model
Don't know whether this is important in Valentina!?
Does Valentina implement the Network Data Model according CODASYL??
(Group Item, Set Type, Record Type, etc.)??
- Supporting the Object-Relational and Object Model
This extends the relational model by introducing Objects. Objects have
attributes, methods, class members, class methods, inheritance,
encapsulation, polymorphism, members which are itself objects, etc.
A lot of questions arise:
Where is the code for member functions located, how to ensure
encapsulation, relaxing encapsulation? And several hundreds more!
Since objects have a much richer set of types for attributes, this
might affect the query API. Objects also have methods which might be
called:
For example:
Set<PersonRef> s = select(Persons, Persons.DayOfBirth() <
(time::today() - 2000))
where Persons is a set of PersonRef, and the expression
"Persons.DayOfBirth()" evaluates to a predicate. A PersonRef has an
operator Person* which returns a pointer to an object of type Person.
(actually, making this all together work is not as simple as it might
appear!!).
In the Object Model, we need Sets which are Containers containing
objects or references (OIDs).
Additionally, we need methods for navigating through objects (point and
arrow operators applied to objects).
Additionally, the Projection operator shall be applied to objects as
well, so yielding to relations which columns are a subset of the
attributes of the object. Notice, that the relation then does not
contain objects - instead it contains the copies of a subset of the
attributes of objects.
IMO, this all together is a challenging task. I guess, this requires
also a complete object DB - which may have fundamental differences with
respect to the implementation compared to a relational DB.
My suggestion:
A complete object-oriented API would be beyond a simple "query micro
kernel". An object should be treated as a simple byte stream. Implement
OO in a specific language, like OQL, ODMG.
However, with respect of querying a database - from the view point of a
user (say a DB App developer) - there are pretty less differences!
Consider this:
A tuple could also be defined as a set of attributes | value pairs
where attribute is of type string and used as the key. This set - a
descriptor - could be referenced by an immutable member of the tuple -
thus defining its number, order and types of the attributes at runtime
- but the values can be changed of course.
Then, a tuple becomes an object of class Tuple which "type" (a
descriptor) will be defined at runtime and a relation is an object of a
class Set<Tuple> containing only those tuples having the same
"descriptor".
So, it looks like, the relational aspect is just a sub-set of the
object-oriented aspect :)
There might be differences because of encapsulation, though.
"Kind of Queries"
Of course, all kinds of predicates and compare operators shall be
implemented. This is relatively simple.
IMO, there should also be no restriction in the number of arguments
(columns) which can be used in a predicate. Basically, internally there
must be some AST (abstract syntax tree) which will be create either at
runtime or compile time (!) when the predicate is defined. Then the AST
will be used to perform the query. IMO, there is rather little
correlation between the inherently complexity of this and the number of
arguments - so there is little overhead in complexity when making it
work allowing more than one column.
Single Table Query?
OK, joining tables may require more. But you (Ruslan) must it do
anyway!!
This is Relational Algebra, and should be implemented in a Relational
DB ;)
So, why not multi table query??
(The subsequent paragraphs 6), etc. describe implementation details
and thoughts)
I guess, the Query Facility will be implemented using the OO paradigm.
;)
So, a quick but not so dirty OOA shows that we have these main classes:
The main classes of the Query API for Relational Algebra
DatabaseObject
A database object is a distributed object residing in the client side
and in the server side and will be accessed through a handle or a
reference. The communication between client and server boundary is done
via remote procedure calls (RPC).
A actual instance of an DO can reside in either the client or the
server or in both requiring synchronisation.
Relation
A R. is an abstraction of a Table, having Columns and Rows.
A relation is an "DatabaseObject" which means, that the server has
knowledge of the instances of it and possibly is the owner of that
object. Thus, a relation is also a distributed object - residing in the
client side as well as in the server side.
A relation is either in an opened or a closed state. The initial state
is closed.
A R. may be implemented as a Cursor - that is, it manages internally a
record buffer, holding
the column values of the current row and it references a "Table" which
contains the
records.
If the relation is opened - that is, if there is data attached to the
relation, access of the column
values is allowed and will be achieved by means of the record buffer.
You should think of a "Table" as a private data structure only known by
the Relation
and the Storage Manager - that its, it is complete opaque to client of
a relation.
A relation can return a reference to the current tuple (row, record).
The relation can be iterated which affects the current tuple.
Member functions:
open()
close()
descriptor() returns the tuple descriptor associated with this
relation. Pre: open() == true
size() returns number of tuples in the relation
pos() returns distance from first to current tuple
next()
prev()
get() returns reference to current tuple. Pre: open() == true
Tuple
A Tuple is a set of attribute values (column values) and will be
instantiated in the client side.
The elements of a Relation are tuples, however a tuple must not
necessarily be an element of a relation.
A Tuple references a TupleDescriptor for internal purpose.
types: Tuple::Value
size() returns number of columns
at(pos) returns reference to Tuple::Value at position pos
at(str) returns reference to Tuple with column name equal str
descriptor() returns the tuple descriptor associated with this tuple
TupleDescriptor
A TD is a list of column descriptors. A certain object of TD will be
referenced by
relations and tupels (may be ref-counted).
ColumnDescriptor
A CD is meta information for a column, like column name, default
value, nullable,
its index in the tuple, and other properties specific for that column
in that tuple.
A column descriptor also references an instance of a TypeDescriptor
which is
meta data for describing the value_type of the column.
A CD will be owned by a TupleDescriptor.
A CD will be shared along all tuples belonging to a distinct relation
and the relation itself.
TypeDescriptor
A type descriptor is meta data for types of the programming language.
An instance of a TypeDescriptor
exists only once for a certain type (Singleton) in a program.
It stores information about type dependent functions: c-tor, d-tor,
equal_func, assign_func, copy_func,
extractor_func, etc.
Predicate
A Predicate is a function object which return type is a boolean. (see
also description above)
The predicate is realized as a tree structure, enabling it to compose
complex predicates. Class Predicate
is a non-abstract but is not a concrete predicate - it serves as a base
class for predicates and building the
tree structure.
There are several concrete predicates: equal, less_than, greater_than,
etc. For class Predicate there are
global operator defined: && and ||.
Relational Algebra Operators
The Relational Algebra Operators will be implemented as global
operators, taking arguments of
type Relation and auxiliary types for predicates and others. The
operators are implemented in the server side.
A more C++ like example:
This shows some code examples which reside in the client side (as
interfaces and libraries).
- Note that this is just an idea! (However, several parts have already
been implemented and proofed to work in a
previous framework)
- Note that this example is in C++ and uses templates, which may be not
available in all languages,
but can be bypassed or it can be implemented in a slightly modified way.
// Suppose these classes have been defined in namespace db elsewhere:
class db::Relation;
class db::Tuple;
class db::TupleDescriptor;
calss db::ColumnDescriptor;
class db::Predicate;
class db::equal_pred;
class db::less_than_pred;
// Create a "tuple descriptor". It describes a tuple(mainly attributes
and types). Can be re-used.
// A tuple descriptor is a list of column descriptors.
// In this case, a tuple descriptor will be created which has two
column descriptors,
// one which is labeled "FirstName" and is of type string, the other is
labeled
// "DOB" (means, "day of birth") and is of type date:
TupleDescriptor td;
td.push_back(ColumnDescriptor("FirstName", make_type_desc<string>()));
td.push_back(ColumnDescriptor("DOB", make_type_desc<date>()"));
// Note that a column descriptor may have several properties - which
are not shown here.
// Create a relation with a specified tuple descriptor and attach it
the base table "Persons".
// Note that this may throw an exception if the tuple descriptor is not
applicable.
Relation persons(td); // c-tor with explicit row-descriptor
persons.open("Persons"); // may throw "descriptor_does_not_match"
// Create predicates used in a search condition:
// Note, that they inherit all from an abstract base type!
Predicate p1 = equal_pred<string>("FirstName", "Ruslan");
Predicate p2 = less_than_pred<date>("DOB", date(2000, 01,01) );
// Perform a query, which is equivalent to
// "Select FirstName, DOB from Persons where FirstName = "Ruslan" and
DOB < date '2000-01-01'":
// Note: when performing a relational operator, this operation will be
actually executed on
// the server side. In order to accomplish this, the client side has to
call some RPC mechanism
// which passes the handles of the relations as arguments to the server.
// Note also, that the instances of the records of a relation exist on
the server side. They may
// be cached on the client side for performance reasons, though.
std::vector<std::string> cols;
cols.push_back("FirstName");
cols.push_back("DOB");
Relation r1 = projection(select(persons, p1 && p2), cols);
// the line above performs two relational operators: projection and
selection!
persons.close(); // Do not need the base table relation any longer
ASSERT(r1.open());
ASSERT(r1.TupleDescriptor().size() == 2);
ASSERT(r1.TupleDescriptor()[0].Type() == make_type_desc<string>.Type());
ASSERT(r1.TupleDescriptor()[1].Type() == make_type_desc<date>.Type());
for (int i = 0; i < r1.size(); ++i)
{
const Tuple& t = r1[i]; // this, internally causes a fetch, which is
performed
// on the server side, returning the
record which will be
// used to instantiate the current tuple
managed on the
// client side within the relation.
cout << t["FirstName"].get<string>() << " " <<
t["DOB"].get<date>.ToString() << endl;
// note: t["name"] returns Tuple::Value, which represents a reference
to a column value with opaque type.
// Tuple::Value.get<T>() is a checked reinterpret cast - faster than
dynamic_cast (one address // comparison), but fails if the type does
not match, and returns a reference T pointing into the tuple buffer.
// This reference can be used to modify the column value.
// In the example above this ref is const, however (because of const
Tuple& t).
// Note that tuples have c-tors, compare semantics, assignment and
other operators making it
// very useful even outside the query. However, a tuple may not have
the same value layout as
// the record buffer of a result set, nor does it use the same buffer.
When there is a tuple, a buffer for
// the values must be allocated according of the type and order
defined in the tuple descriptor.
// These types are C++ classes or ordinary types which values must be
mapped from internal
// record buffer representation in the storage manager to the C++
types.
// A relation may manage only one of such a tuple(and buffer)
internally (thus implementing a cursor).
// So, references to tuples become stale in this case when iterating
the relation. If you need more than
// one tuple at once, then you would need to make a temporary copy of
at least one tuple.
}
fstream f("result.out");
f << r1;
r1.close();
f.close();
/*
------------------------------------------------------------------------
-
* Additional examples
// tuple descriptor will be deduced from the base table:
// When using this method, all columns of the base table appear in the
relation.
Relation r("MyBaseTable"); // creates row-descriptor dynamically
// You can access the tuple descriptor:
const TupleDescriptor& td = r.TupleDescriptor();
// A TD is a list of column descriptors:
int count = td.size();
const ColumnDescriptor& cd = td[0];
// You can ask the type of the column:
cout << cd.Type();
// Advanced options:
if (cd.PropertyList().HasAttribute("SQL_TYPE"))
cout << cd.PropertyList().GetValueOf("SQL_TYPE");
// When you need to fetch the value of the column, you need to know its
type!
Tuple t = r[0];
std::string name = t["Name"].get<std::string>(); // you need to
know, that it's a std::string!
double name t["Name"].get<double>(); // oops, throws exception!!
// But you are not lost, if you don't know the type:
std::string dayOfBirth = t["DOB"].AsString();
// Does this work ??:
cout << t["Name"] << endl;
cout << t["DOB"] << endl;
// Yes it does!!
// Is this possible???
t["Name"] = std::string("Hello");
t["DOB"] = date::today();
// Yes, although EXACT match of argument types is required, as can be
seen in code snip below:
// -- If it is implemented in such a way!
class Tuple
{
//...
class Value {
public:
template <typename T> Value& operator=(const T& x) {
if (make_type_desc<T>.Type() != _cd.Type())
throw err_type_mismatch();
_cd.assign_func(_cd.buffer(), static_cast<const void*>(&x));
return *this;
}
template <typename T> const T& get() const
{
if (make_type_desc<T>.Type() != _cd.Type())
throw err_type_mismatch();
if (_cd.tuple().is_NULL(_idx))
throw err_value_is_NULL();
return *reinterpret_cast<const T*>(_cd.buffer());
}
private:
Value(const Tuple& tuple, size_type idx) :
_cd(tuple[idx].ColumnDescriptor()) {}
private:
const ColumnDescriptor& _cd;
friend class Tuple;
};
};
*/
Tasks:
1) Analysis and Design of Relation, Tuple, Predicate;
2) Find relational algebra operators.
3) Interface MUST NOT depend on the internal implementation of
Valentina.
4) Possible artifacts:
RPC mechanism, AST, tree, tree-iterator, Optimizer (AST
transformation), Query-Planer, Cost-Model
note: an abstract cost model, decouples storage manager from query
optimizer.
5) add Object aspect.
6) Refine Relation with respect to common container features: iterator
capabilities, associations, sequence, etc.
7) Implement RPC
A challenging task!
If a relation is a Cursor , then it manages internally a row-buffer,
and also references some external data-file - or say an external
"record-container". This is kind of a container the storage manager
takes care of and owns also.
A relation then also has a current row.
Relations may have some degree of freedom of how they behave and of how
they can be implemented. For instance, they may or may not copy the
data rows which have been queried. If they do copy them, they need
itself some kind of container to hold them.
In this case, for instance, a standard container can be used -e.g. a
std::vector, std::map, etc. customizable by the db app developer. The
method of populating this custom container is suspect for further
discussions, though.
If you would please take a look into the SQL3 standard (already
frequently cited), you will find the different kind of Cursors and how
they behave differently in several aspects, e.g. visibility of changes,
updateable or non-updateable, life time with respect to transactions,
etc.
Cursors, Result Sets and Relations probably are the same thing.
That's it for today.
Best Regards
Andreas Grosam
More information about the Valentina
mailing list