Bloblets
Ed Kleban
Ed at Kleban.com
Sun Nov 20 10:23:03 CST 2005
On 11/20/05 9:52 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:
> On 11/20/05 5:17 PM, "Ed Kleban" <Ed at Kleban.com> wrote:
>
>> I don't know how clear all that is. But maybe you can get a hint or two out
>> of it.
>
> It looks that this big BLOB has structure.
> You say that you will keep in some parallel tables offsets that keep info
> about structure.
>
> So I do not understand why you can't just split that BLOB...
> And then no need for this parallel table(s).
>
The short answer is simple: Because I don't want to. In my opinion this is
the wrong way to solve this problem.
Here's a longer answer:
Because I don't want to have to maintain two separate representations for
positional spans, along with the associated processing overhead. I need to
retain the natural positional spans for comparison with substring spans that
are in files that have not yet been processed into the database. By doing
so, the same span can be used to access either a substring within some OS
file or the corresponding substring within a big Blob.
This is mostly of value during the process of either parsing a file to
incorporate its contents into the database, or comparing the contents of
information already in the database with some external file. Essentially,
by simply storing a copy of a file as a blob I can use the same position and
length parameters for a span whether the span is internal or external to the
database.
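To make the idea concrete, here is a minimal Python sketch of one span
addressing the same bytes in either place. It is only an illustration: the
names Span, read_span_from_file and read_span_from_blob are hypothetical,
and the blob is modelled as a plain bytes object rather than a Valentina
Blob field.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A positional span: byte offset plus length (hypothetical name)."""
    offset: int
    length: int


def read_span_from_file(path, span):
    """Read the span's bytes directly from an OS file."""
    with open(path, "rb") as f:
        f.seek(span.offset)
        return f.read(span.length)


def read_span_from_blob(blob, span):
    """Read the span's bytes from an in-memory copy of the file.

    Assumption: the blob is a verbatim copy of the file, so the same
    offset and length apply without any translation.
    """
    return blob[span.offset:span.offset + span.length]


def demo():
    """Write a small file, then read one span from both sources."""
    data = b"header\nalpha beta gamma\nfooter\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        span = Span(offset=7, length=5)
        return read_span_from_file(path, span), read_span_from_blob(data, span)
    finally:
        os.remove(path)
```

Because the blob is a byte-for-byte copy, both readers return identical
bytes for the same span, so no second table of translated offsets is needed.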
This also allows me to bootstrap my implementation by initially just copying
a file into the database as a quick and dirty way of representing it, and
eventually replacing this simplistic access method with something more
clever or efficient -- a technique I gather you recommend.
After a period of transition, these big blobs, which are essentially copies
of existing files, can be thrown away entirely, because all of the
information is stored in very small database pieces. In fact it is stored
at a tiny fraction of its original size, because all common substrings are
stored as references rather than duplicate text. In addition, all the
substrings that the user will never look at again, unless they want to do a
full reconstruction for export, are optionally compressed thanks to the
ability provided by Valentina to ZIP individual Blob fields -- i.e. not all
of the condensed data is stored within the same blob fields; it's more
clever than that.
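As a rough sketch of that storage scheme, assuming nothing about the actual
schema: each distinct substring is stored once, a document becomes a list of
references, and rarely read pieces may be compressed. The class
SubstringPool and the function reconstruct are hypothetical names, and
Python's zlib stands in for Valentina's per-field ZIP option.

```python
import zlib


class SubstringPool:
    """Stores each distinct substring once and hands out integer ids."""

    def __init__(self):
        # Lookup from substring to id. A real store would index by hash
        # rather than keep the raw text as the key.
        self._ids = {}
        # id -> (is_compressed, stored bytes)
        self._pieces = []

    def intern(self, text, compress=False):
        """Return the id for text, storing it only on first sight."""
        if text in self._ids:
            return self._ids[text]
        stored = zlib.compress(text) if compress else text
        self._pieces.append((compress, stored))
        self._ids[text] = len(self._pieces) - 1
        return self._ids[text]

    def get(self, piece_id):
        """Return the original bytes, decompressing if necessary."""
        compressed, stored = self._pieces[piece_id]
        return zlib.decompress(stored) if compressed else stored


def reconstruct(pool, refs):
    """Rebuild the full text from a list of references, e.g. for export."""
    return b"".join(pool.get(r) for r in refs)
```

Repeated substrings cost one reference each instead of a second copy, and
only a full export has to pay for decompressing the cold pieces.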
However, during this time of transition:
1) Retrieving the substrings for the Blob position spans from the very
compact versions would require an immense amount of processing, which is a
concern when you have 64063 spans to process for a mere 2.8 MB file, the
very small example I am testing with.
2) I have many options available. I don't have to use the Valentina
database for this purpose. I already have a full parse tree at hand and
code written that will extract these substrings using a very efficient hash
pool that's faster than Valentina will ever be. However, I'm striving to
see how far I can push Valentina to use it for this and similar
applications. Ideally I'd like to move everything into a single data
manager such as Valentina and punt all the old pre-database code.