Bloblets

Ed Kleban Ed at Kleban.com
Sun Nov 20 09:17:05 CST 2005




On 11/20/05 8:40 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> On 11/20/05 4:29 PM, "Ed Kleban" <Ed at Kleban.com> wrote:
> 
>> 
>> 
>> If you are asking whether I am sure that I need to be able to pull
>> 100-byte substrings out of 10 MB strings, then the answer is yes.  In
>> terms of access patterns, however, it's not clear how much of a
>> performance hit it will be.
> 
> Question is:
> 
>     Why can't you split that 10 MB into many small records?

I can provide more detail off-list.  But the main reason is that I have
numerous positional references into these large strings, and this positional
view is the natural paradigm for the data I am working with.

My choices are therefore:

1) Restructure my data, such as chopping it up into many small records, so
that it is more manageable in the structuring tools I have -- such as the
database and/or OS filesystem -- and then write an abstraction layer that
lets me keep working in terms of natural positional spans of substrings
while translating them to the lower-level structural representation.  In
other words, write a whole new segmentation scheme that can deal with spans
crossing segment boundaries, and other messes.  I don't want the additional
layer of complexity, and I don't relish the idea of taking a performance hit
for building and accessing through yet another layer of segmentation --
especially one that I have to manage myself.

2) Use a tool that already implements segmentation logic so I don't have to,
such as a database or filesystem -- which, again, is what I see as a main
purpose of these tools.
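To make the trade-off concrete, here is a minimal sketch of the kind of
segmentation layer option 1 would require.  All names, the chunk size, and
the record representation (a plain Python list standing in for database
records) are hypothetical illustrations, not anything from the original
discussion:

```python
# Hypothetical segmentation layer: one large string is stored as many
# small fixed-size records, and a reader reassembles an arbitrary
# positional span -- including spans that cross record boundaries.

CHUNK_SIZE = 64 * 1024  # assumed record size; purely illustrative


def split_into_records(text, chunk_size=CHUNK_SIZE):
    """Chop one large string into many small records."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def read_span(records, start, length, chunk_size=CHUNK_SIZE):
    """Return the substring text[start:start+length] from the record list,
    handling spans that straddle segment boundaries."""
    end = start + length
    first = start // chunk_size
    last = (end - 1) // chunk_size
    joined = "".join(records[first:last + 1])
    offset = start - first * chunk_size
    return joined[offset:offset + length]
```

Even this toy version shows the cost: every positional access now goes
through index arithmetic and a join, which is exactly the bookkeeping a
database or filesystem already does internally.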

The other reason I need to preserve the ability to operate in this natural
positional framework is that it lets me directly compare database content
with external files that have not yet been incorporated into my database.
There are cases where these large 10 MB files can simply be discarded
because all of the information of immediate value has been summarized into
other structures -- including span hashes, so I can compare against
external files without keeping the old data around.  But there are also
cases, and intermediate periods, where the proper thing to do -- or, in the
initial stages of development, the far easier thing to do -- is to keep a
fully assembled copy of these large text spans rather than take the
performance hit of regenerating them.
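The span-hash idea above can be sketched in a few lines.  This is only an
illustration of the general technique; the function names, the span size,
and the choice of SHA-256 are my assumptions, not details from the post:

```python
import hashlib


def span_hashes(text, span_size=4096):
    """Hash each fixed-size span of the text.  The stored hashes summarize
    the content, so the original large string can later be discarded."""
    return [
        hashlib.sha256(text[i:i + span_size].encode("utf-8")).hexdigest()
        for i in range(0, len(text), span_size)
    ]


def same_content(stored_hashes, external_text, span_size=4096):
    """Compare an external file's text against previously stored span
    hashes without needing the original data."""
    return stored_hashes == span_hashes(external_text, span_size)
```

A span-level granularity (rather than one hash per file) also makes it
possible to localize which region of an external file differs.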

I don't know how clear all that is, but maybe you can get a hint or two out
of it.

Again, much of my "concern" comes not from the fact that I have to put in
more effort to work around this, but from the amusing frustration that the
database structure has ALREADY done all this work for me... yet the access
method to take advantage of it was removed from V2 "because nobody used
it".  Guess I should have been a user earlier, eh? ;-)

--Ed





More information about the Valentina mailing list