Nucular Overview
The following document provides an intuitive overview of the nucular system
and its components.
Design goals
Nucular is intended to help store and retrieve searchable information
in a manner somewhat similar to the way that "www.hotjobs.com" stores
and retrieves job descriptions, for example.
The highest level goal of the nucular system is to provide an
infrastructure for disk based fielded/full text searching that is
easy to set up and that opens, searches, and updates quickly.
It is also designed to perform well for very large data sets.
Nucular should be suitable for indexing hundreds of thousands of entries
or more.
The nucular system is designed to be very lightweight, requiring
no special kernel support (like shared memory locking) and no
persistent server process for either accesses or updates. This means
that an unprivileged user should be able to set up and use a nucular
archive without problems or outside help,
and that a "dormant" nucular archive will sit passively on the
disk and not consume other system resources.
Furthermore, nucular is designed to support arbitrary concurrent updates
and accesses by multiple processes or threads. This means that 12 CGI programs
can be querying an archive at the same time as 2 system processes and 5 other
CGI programs are updating the archive, all without interfering with one another.
Entries
A nucular archive exists to store and retrieve entry objects. An entry object
is identified by a string identity and may have any number of fields, and each
field may have several values. Here is an example entry:
Identity | "NASDAQ:YHOO"
Ticker | "YHOO"
FullName | "Yahoo Corporation, Inc."
Business | "information services" "software" "publishing"
Description | "Yahoo! Inc. provides Internet services to users and businesses worldwide. It offers online properties and services to users; and various tools and marketing solutions to businesses. The company's search products include Yahoo! Search, Yahoo! Toolbar, and Yahoo! Search on Mobile, Yahoo! Local, Yahoo! Yellow Pages, and Yahoo! Maps that allow user to navigate the Internet and search for information from their computer or mobile device...."
When a nucular archive stores an entry such as this one it builds a number of disk based data structures
which allow queries over the archive to find the entry later. For example this entry would be included
in the results of the following queries:
- What entries have "software" as a value for the "Business" field?
- What entries have "service" somewhere in the "Business" field?
- What entries mention "yellow" in some field?
- What entries have tickers between "U" and "Z" alphabetically?
- What entries mention both "mobile" and "marketing" somewhere in the "Description" field?
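To make the idea of index structures concrete, here is a minimal in-memory sketch of a word index over multi-valued fields. It is only an illustration: the names `words` and `build_index` are hypothetical, the description is shortened to its first sentence, and nucular's real structures are disk based and are not described in this overview.

```python
import re
from collections import defaultdict

# Hypothetical in-memory model of the example entry:
# each field name maps to a list of values.
entry = {
    "Ticker": ["YHOO"],
    "FullName": ["Yahoo Corporation, Inc."],
    "Business": ["information services", "software", "publishing"],
    # Shortened to the first sentence of the description for brevity.
    "Description": [
        "Yahoo! Inc. provides Internet services to users and businesses worldwide."
    ],
}

def words(text):
    """Split a value into lower-case words (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(entries):
    """Map (field, word) to the set of identities whose field contains the word."""
    index = defaultdict(set)
    for identity, fields in entries.items():
        for field, values in fields.items():
            for value in values:
                for word in words(value):
                    index[(field, word)].add(identity)
    return index

entries = {"NASDAQ:YHOO": entry}
index = build_index(entries)
print(sorted(index[("Business", "software")]))  # ['NASDAQ:YHOO']
```

With a structure like this, a question such as "which entries have the word software in the Business field?" becomes a single lookup rather than a scan over every entry.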
The values associated with entry fields may be simple strings or they may be special values
such as an indexed URL, or an internal link. For example you might refer
to the contents of an external file using a "file://" indexed URL instead of storing the
contents of the file in the index directly.
Archives and Sessions
A nucular archive is located by a directory in a filesystem. For example one of the test
programs for the package creates an archive at the location INSTALL/testdata/mondial.
When populated, this directory contains a number of other directories and files that
implement the disk based storage for the archive.
All scripts or programs that interact with this archive identify it using the directory path
for the archive.
Technically no user program directly interacts with a nucular archive -- all interactions
are mediated by a session object. In a Python program a session object is created like this:
from nucular import nucular
session = nucular.Nucular(directory)
All subsequent interactions with the archive are mediated by methods of a session object.
Queries
Nucular archives support retrieval of entries based on the values in the entries. The basic
tests used for retrieval are:
- Full text word prefix: This test succeeds for entries that contain a
  given word in any field value as a word prefix (case insensitive). For example a full text word prefix
  retrieval for "cat" would match entries that have a "comment" field with value "I love my cats, mostly"
  as well as some other entry which has a "title" field with the value "The Exhaustive Catalogue of Errors
  and Misconceptions", and some other entry with "subtitle" field value "how to cook a cat".
- Field word prefix: This test succeeds for entries that
  contain a specified field which contains a given word as a word prefix.
  For example a fielded word search in the "note" field for the word
  "miss" would match entries whose values for "note" are "not legal in Mississippi"
  or "I'm missing both my front teeth" or "Here She Is: Miss America".
- Field value prefix: This test succeeds when a field value starts with a given value.
  For example a fielded search for field "phone" starting with "1-800" would match entries
  with value for phone "1-800-big-Pigs".
- Field match: This test succeeds when a field value exactly matches a given string.
  A field match query for "title" of "king" would only match when the title was "king" and
  not match, for example, "kingpin".
- Field value range: This test succeeds when a field value for an entry lies between
  two alphanumeric endpoints. For example a field range test looking for "date" between
  "1978" and "1999" would match an entry with "date" value "1998-09-01-21:43:12".
Furthermore, these tests may be conjoined in a single query, for example to search for "date between 1978 and 1999"
and "note contains miss" at the same time.
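The semantics of the five tests and their conjunction can be sketched as plain Python predicates over an entry modelled as a dictionary of field name to list of values. This is not the nucular query API; all function names are hypothetical. The sketch also assumes a simple alphanumeric tokenization, inclusive range endpoints, and case-sensitive value comparisons, none of which this overview specifies.

```python
import re

def words(text):
    """Split a value into lower-case words (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def any_word_prefix(entry, prefix):
    """Full text word prefix: some word in any field starts with prefix."""
    p = prefix.lower()
    return any(w.startswith(p)
               for values in entry.values() for v in values for w in words(v))

def field_word_prefix(entry, field, prefix):
    """Field word prefix: some word in the given field starts with prefix."""
    p = prefix.lower()
    return any(w.startswith(p) for v in entry.get(field, []) for w in words(v))

def field_value_prefix(entry, field, prefix):
    """Field value prefix: a whole value of the field starts with prefix."""
    return any(v.startswith(prefix) for v in entry.get(field, []))

def field_match(entry, field, value):
    """Field match: a value of the field equals the given string exactly."""
    return value in entry.get(field, [])

def field_range(entry, field, low, high):
    """Field value range: a value lies between low and high (string order)."""
    return any(low <= v <= high for v in entry.get(field, []))

def matches_all(entry, tests):
    """Conjunction: the entry must pass every test."""
    return all(test(entry) for test in tests)

entry = {
    "comment": ["I love my cats, mostly"],
    "note": ["not legal in Mississippi"],
    "phone": ["1-800-big-Pigs"],
    "title": ["king"],
    "date": ["1998-09-01-21:43:12"],
}
tests = [
    lambda e: field_range(e, "date", "1978", "1999"),
    lambda e: field_word_prefix(e, "note", "miss"),
]
print(matches_all(entry, tests))  # True
```

Note that the range test relies on ordinary string comparison, which is why a date written with the year first sorts correctly between "1978" and "1999".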
Facets or Suggestions from a Query
Nucular queries also support a suggestions feature which attempts to
provide data values likely to be related to a query. For example if the query
specifies that the "Business" field must match "software" and the "Description" field
must contain "community", the suggestions feature will attempt
to determine the most likely values for the "Ticker" field. This feature is intended
to be useful for graphical interfaces which support drop down suggestions or
completions. The related data values for a field derived from a query are often
called query facets.
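One simple way to picture facets is to count how often each value of a field occurs among the entries that match a query, and to offer the most frequent values as suggestions. The sketch below does exactly that. It is an assumption made for illustration: this overview does not say how nucular ranks its suggestions, and the function name and sample data are hypothetical.

```python
from collections import Counter

def suggest_values(matching_entries, field, limit=3):
    """Return the most frequent values of `field` among the matching entries."""
    counts = Counter(
        value
        for entry in matching_entries
        for value in entry.get(field, [])
    )
    return [value for value, _count in counts.most_common(limit)]

# Hypothetical query results: entries modelled as field -> list of values.
matches = [
    {"Ticker": ["AAA"], "Business": ["software"]},
    {"Ticker": ["BBB"], "Business": ["software"]},
    {"Ticker": ["AAA"], "Business": ["software", "publishing"]},
]
print(suggest_values(matches, "Ticker"))  # ['AAA', 'BBB']
```

A drop down in a graphical interface could then show these values as likely completions for the chosen field.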
Deferred and immediate updates
Updates to archives (inserts or deletes of entries) may be stored in "immediate" or "deferred"
mode. New updates to an archive are not automatically inserted into the optimized index
structures. Instead they "wait" in special areas for full integration when the archive
is aggregated later.
Deferred updates are kept in a special area for later integration with the
optimized index structures when the archive is aggregated (as described below).
Sessions that access the archive before the next aggregation completes will not see
the deferred updates -- but they also will not experience any performance
degradation regardless of the number of deferred updates that are awaiting aggregation.
Immediate updates are placed in a different special area where they "lie on top" of the optimized
indices until the archive is aggregated. Sessions that access the archive before the next
aggregation will read all immediate updates that have not been aggregated. Consequently,
if there are too many immediate updates, the sessions may become slower and more
resource intensive.
Generally, it is a good idea to use immediate updates for information which must be visible
immediately (such as indexing a new user profile) and defer updates of
lower importance (such as a user profile edit) when possible.
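The difference in visibility between the two modes can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not how nucular stores data: the class and method names are hypothetical, deletes are ignored, and the order in which deferred and immediate updates are merged is chosen arbitrarily.

```python
class ToyArchive:
    """Toy model of update visibility; not the nucular implementation."""

    def __init__(self):
        self.aggregated = {}  # stands in for the optimized index structures
        self.immediate = {}   # updates that "lie on top" of the index
        self.deferred = {}    # updates that stay hidden until aggregation

    def insert(self, identity, entry, immediate=True):
        """Record an update in the immediate or the deferred area."""
        target = self.immediate if immediate else self.deferred
        target[identity] = entry

    def visible(self):
        """What a session sees: the index plus all immediate updates."""
        view = dict(self.aggregated)
        view.update(self.immediate)  # extra work grows with immediate updates
        return view

    def aggregate(self):
        """Fold every waiting update into the optimized index."""
        self.aggregated.update(self.deferred)
        self.aggregated.update(self.immediate)
        self.deferred.clear()
        self.immediate.clear()

archive = ToyArchive()
archive.insert("new-profile", {"name": ["alice"]}, immediate=True)
archive.insert("profile-edit", {"name": ["bob"]}, immediate=False)
print(sorted(archive.visible()))  # ['new-profile']
archive.aggregate()
print(sorted(archive.visible()))  # ['new-profile', 'profile-edit']
```

The `visible` method also hints at the cost trade-off: every read has to merge in the immediate area, while the deferred area is never touched until aggregation.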
Archive Aggregation
As mentioned, archives accumulate updates in special areas that are separate from
the main optimized indexing structures. There are two levels of aggregation which
move updates from the temporary areas into the optimized indexes:
- Partial Aggregation:
  Moves updates from the update area to a "transitional" index. This operation should be done
  whenever it is important to incorporate deferred updates, or whenever there are too many
  immediate and unaggregated updates.
- Final Aggregation:
  Moves updates from the "transitional" index to the "final" index. This operation should be
  done whenever the "transitional" index gets large enough that partial aggregation becomes slow.
  For large archives the final aggregation operation may take a while since it passes over
  all data in the archive.
An aggregation operation may be done concurrently with updates and queries by other processes
or threads, but two aggregation operations should not run at the same time (I don't think
information will be lost, but one of the operations will be wasted).
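The two aggregation levels can be pictured with a small in-memory model in which updates flow from an update area to a transitional index and from there to a final index. This is only a sketch: the class and method names are hypothetical, the real structures are disk based, and whether final aggregation also performs a partial aggregation first is not stated here, so the model keeps the two steps separate.

```python
class TwoLevelIndex:
    """Toy model of partial and final aggregation; not the nucular code."""

    def __init__(self):
        self.updates = {}       # update area: newly written entries
        self.transitional = {}  # small index, cheap to rebuild
        self.final = {}         # large index, expensive to rebuild

    def add_update(self, identity, entry):
        self.updates[identity] = entry

    def partial_aggregate(self):
        """Move waiting updates into the transitional index."""
        self.transitional.update(self.updates)
        self.updates.clear()

    def final_aggregate(self):
        """Move the transitional index into the final index."""
        self.final.update(self.transitional)
        self.transitional.clear()

    def lookup(self, identity):
        """Prefer the newer transitional data over the final index."""
        if identity in self.transitional:
            return self.transitional[identity]
        return self.final.get(identity)

index = TwoLevelIndex()
index.add_update("NASDAQ:YHOO", {"Ticker": ["YHOO"]})
index.partial_aggregate()
print(index.lookup("NASDAQ:YHOO"))  # {'Ticker': ['YHOO']}
index.final_aggregate()
print(len(index.transitional), len(index.final))  # 0 1
```

Keeping the transitional index small is what makes partial aggregation quick; the final step pays the cost of touching the large index, which is why it is run less often.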
End of Nucular Overview