Bloblets

Ed Kleban Ed at Kleban.com
Sun Nov 20 08:29:55 CST 2005




On 11/20/05 8:03 AM, "Ruslan Zasukhin" <sunshine at public.kherson.ua> wrote:

> On 11/20/05 3:47 PM, "Ed Kleban" <Ed at Kleban.com> wrote:
> 
>>> We did have such a feature in 1.x, but nobody used it,
>>> so we have not opened it for v2 for now.
>>> 
>> 
>> That is most unfortunate, since it means I will have to do so manually
>> outside of the database if I want to avoid unnecessary disk access -- which
>> is one of the key things I'm looking to the database to manage for me so
>> that I don't have to.  Specifically, I need to store a large number of
>> strings that will typically range in size from, say, 2 MB to 20 MB.  I
>> then need to perform completely random accesses to pull out substrings from
>> those strings, the positions and lengths of which are stored in database
>> tables.
> 
> Hmm. 3 times Hmm.
> 
> You sure this is the best possible solution?!
> I doubt it.
> 

Great!  I welcome any suggestions you may have to offer. Any other options
come to mind?  

If you are asking whether I'm sure I need to be able to pull 100-byte
substrings out of 10 MB strings, then the answer is yes.  In terms of access
patterns, however, it's not clear how much of a performance hit it will be.
If I'm only pulling out one or two such strings based on a couple of pieces
of text the user wants to view and is waiting on, who cares how long it
takes.  The user won't see any difference.  On the other hand, if I'm going
to pull out dozens, hundreds, or thousands of such strings, say, for
building some sort of cross-reference or analysis, then pretty rapidly the
entire file will get cached by the OS, and perhaps the better thing to do,
if I expect this, is indeed to read the entire string into memory first and
just operate on it there.
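
To make the tradeoff concrete, here is a rough sketch of the two strategies
in plain Python.  The function names and the request list are mine, for
illustration only; none of this is Valentina API:

    def read_substring(path, offset, length):
        # Seek straight to the substring; the OS reads in only the
        # pages it actually needs.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def read_all_then_slice(path, requests):
        # Read the whole source once, then serve many substring
        # requests from memory.
        with open(path, "rb") as f:
            data = f.read()
        return [data[off:off + n] for off, n in requests]

For one or two lookups the direct seek wins; for thousands of lookups
against the same source, the one-time full read likely amortizes better.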

But:

I don't really know in advance whether there will be shades of grey in
between where the tradeoff would make a great difference.  And since I'm
designing a tool that will allow the User rather than the Designer to
determine the access patterns based on their data viewing needs, I'm not
sure I'll ever really know.

I'd rather not have to guess or figure such issues out in advance.  I'd
rather rely on an efficient database to simply adapt, use its caching
algorithms, and do the job for me.  To my thinking this is one of the
fundamental reasons for using a database rather than managing the data
myself.


> 
>> Sometimes some of those 2-20 MB strings will be in memory already,
>> but other times they will only be on disk and likely not in either
>> Valentina's or the OS's cache.
>> 
>> Since Valentina doesn't support substring access, I'll have to store every
>> such string in a separate file outside of the database, store the filenames
>> in V2, and use the OS to perform the random accesses -- or read in the whole
>> thing and suffer the performance consequences.  Not a difficult or
>> unmanageable task, but like I said... unfortunate.
>  
>>> Also always remember about disk, cache, ...
>>> 
>>> Here it is not always faster to read 100 bytes than, e.g., 4 KB.
>>> 
>> 
>> I fully expect to have to do a disk access to pull in a cache page or two,
>> both into the OS cache as well as Valentina's, if I need a 100-byte
>> substring from a 10 MB source.  I don't expect to have to read 10 MB worth
>> of pages when I only need 100 bytes.
> 
> 10 MB, yes, that is a lot.

It is one of many, since I may need 100- or 1000-byte substrings out of a
great many separate 10 MB sources.

> 
> I just think there may be a better solution that avoids such huge ATOMIC data sets.
> 

There are, and I implement many of them.  This is not my sole means of
access to the data I need.  I have a massive amount of this data
pre-scanned, pre-sorted, and pre-linked for rapid access in anticipation of
things the user might want to do.  And to be more precise, even if I don't
have this information, when I do need to access it and pull it out of
some of the 10 MB sources, it will get scanned, sorted, linked,
cross-referenced, and cached by my software on the fly so that I don't have
to go get it again if I'm going to use it again in a similar way real soon.

But all that said, this "get a small substring" mechanism is the access
method of last resort, and it will occur -- a lot.  All my code can do after
that is work to prevent it from having to happen again too soon.
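
For what it's worth, that "don't go get it again" layer can be as simple as
a memo cache keyed by (path, offset, length).  A rough sketch, again in
Python, with names of my own invention:

    _cache = {}

    def cached_substring(path, offset, length):
        # The first request hits the disk; identical repeats are
        # served from memory.  A real version would bound the cache.
        key = (path, offset, length)
        if key not in _cache:
            with open(path, "rb") as f:
                f.seek(offset)
                _cache[key] = f.read(length)
        return _cache[key]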

=======

I have a car of a certain model year.  When you look up the list of most
common customer complaints, do you know what heads the list?  That the car
knows the temperature inside and outside but won't tell you.  It has to
know in order to regulate the thermostat, which you set by selecting the
desired temperature on a digital display.  The manufacturer heard that
complaint loud and clear, because in later model years they fixed the car
to display the temperature.

This is a similar kind of thing.  Your Kernel manual has very elegant
diagrams showing the segmentation scheme you use to take a very large
file and segment it to make it manageable on disk.  But then I discover
that despite this fine organization, I can't take advantage of it, but
rather have to find some other means of either duplicating it myself or
relying on some other mechanism, like the OS file system, that can do it
for me.  It's not a show-stopper.  It's not even particularly hard or
painful to code.  It's just frustrating knowing that I have to, when the
provision of a simple single function call -- one, in fact, that was
already written and used to be there -- was removed due to lack of
popularity.
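
Duplicating it myself amounts to something like the sketch below: round the
requested range out to segment boundaries, read only the covering segments,
and slice out the bytes that were asked for.  The 4 KB segment size is my
assumption for illustration, not Valentina's actual value:

    SEGMENT = 4096  # assumed segment size

    def read_by_segments(path, offset, length):
        # Round the start down and the end up to segment boundaries,
        # read those whole segments, then slice out the exact bytes.
        start = (offset // SEGMENT) * SEGMENT
        end = ((offset + length + SEGMENT - 1) // SEGMENT) * SEGMENT
        with open(path, "rb") as f:
            f.seek(start)
            chunk = f.read(end - start)
        return chunk[offset - start : offset - start + length]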




