Nucular Overview

The following document provides an intuitive overview of the nucular system and its components.

Design goals

Nucular is intended to help store and retrieve searchable information in a manner somewhat similar to the way that "www.hotjobs.com" stores and retrieves job descriptions, for example.

The highest level goal of the nucular system is to provide an infrastructure for disk based fielded/full text searching that is easy to set up and that opens, searches, and updates quickly. It is also designed to perform well for very large data sets. Nucular should be suitable for indexing hundreds of thousands of entries or more.

The nucular system is designed to be very lightweight, requiring no special kernel support (like shared memory locking) and no persistent server process for either accesses or updates. This means that an unprivileged user should be able to set up and use a nucular archive without help, and that a "dormant" nucular archive will sit passively on the disk and not consume other system resources.

Furthermore, nucular is designed to support arbitrary concurrent updates and accesses by multiple processes or threads. This means that 12 CGI programs can be querying an archive at the same time as 2 system processes and 5 other CGI programs are updating the archive, all without interfering with one another.

Entries

A nucular archive exists to store and retrieve entry objects. An entry object is identified by a string identity and may have any number of fields, and each field may have several values. Here is an example entry:

Identity	"NASDAQ:YHOO"
Ticker	"YHOO"
FullName	"Yahoo Corporation, Inc."
Business	"information services" "software" "publishing"
Description	"Yahoo! Inc. provides Internet services to users and businesses worldwide. It offers online properties and services to users; and various tools and marketing solutions to businesses. The company's search products include Yahoo! Search, Yahoo! Toolbar, and Yahoo! Search on Mobile, Yahoo! Local, Yahoo! Yellow Pages, and Yahoo! Maps that allow user to navigate the Internet and search for information from their computer or mobile device...."

When a nucular archive stores an entry such as this one it builds a number of disk based data structures which allow queries over the archive to find the entry later. For example this entry would be included in the results of the following queries:

What entries have "software" as a value for the "Business" field?
What entries have "service" somewhere in the "Business" field?
What entries mention "yellow" in some field?
What entries have tickers between "U" and "Z" alphabetically?
What entries mention both "mobile" and "marketing" somewhere in the "Description" field?
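The five questions above can be checked against the example entry with a few lines of plain Python. This is only an illustrative sketch: it models the entry as a dictionary of value lists and does not use the nucular API or its disk based indexes. The helper names (`words`, `has_word_prefix`) are hypothetical, and the "Description" value is shortened to an excerpt of the original text.

```python
import re

# Plain-Python model of the example entry (NOT a nucular object).
# Every field maps to a list because a field may have several values.
entry = {
    "Identity": ["NASDAQ:YHOO"],
    "Ticker": ["YHOO"],
    "FullName": ["Yahoo Corporation, Inc."],
    "Business": ["information services", "software", "publishing"],
    "Description": [
        "Yahoo! Inc. provides Internet services to users and businesses "
        "worldwide. It offers online properties and services to users; and "
        "various tools and marketing solutions to businesses. The company's "
        "search products include Yahoo! Search, Yahoo! Toolbar, and Yahoo! "
        "Search on Mobile, Yahoo! Local, Yahoo! Yellow Pages, and Yahoo! Maps"
    ],
}


def words(text):
    """Split text into lowercase alphanumeric words (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_word_prefix(values, prefix):
    """True if any word in any of the values starts with the prefix."""
    prefix = prefix.lower()
    return any(w.startswith(prefix) for v in values for w in words(v))


results = {
    'Business is "software"': "software" in entry["Business"],
    'Business mentions "service"': has_word_prefix(entry["Business"], "service"),
    'some field mentions "yellow"': any(
        has_word_prefix(values, "yellow") for values in entry.values()
    ),
    'Ticker between "U" and "Z"': any("U" <= t <= "Z" for t in entry["Ticker"]),
    'Description mentions "mobile" and "marketing"': (
        has_word_prefix(entry["Description"], "mobile")
        and has_word_prefix(entry["Description"], "marketing")
    ),
}

for question, answer in results.items():
    print(question, "->", answer)
```

A real archive answers these questions from its index structures rather than by scanning each entry, which is what makes it practical for large data sets.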

The values associated with entry fields may be simple strings or they may be special values such as an indexed URL, or an internal link. For example you might refer to the contents of an external file using a "file://" indexed URL instead of storing the contents of the file in the index directly.

Archives and Sessions

A nucular archive is located by a directory in a filesystem. For example one of the test programs for the package creates an archive at the location INSTALL/testdata/mondial. When populated, this directory contains a number of other directories and files that implement the disk based storage for the archive. All scripts or programs that interact with this archive identify it using the directory path for the archive.

Technically, no user program interacts with a nucular archive directly -- all interactions are mediated by a session object. In a Python program a session object is created like this:

from nucular import nucular
session = nucular.Nucular(directory)

All subsequent interactions with the archive are mediated by methods of a session object.

Queries

Nucular archives support retrieval of entries based on the values in the entries. The basic tests used for retrieval are:

Full text word prefix: This test succeeds for entries that contain a given word in any field value as a word prefix (case insensitive). For example a full text word prefix retrieval for "cat" would match entries that have a "comment" field with value "I love my cats, mostly" as well as some other entry which has a "title" field with the value "The Exhaustive Catalogue of Errors and Misconceptions", and some other entry with "subtitle" field value "how to cook a cat".
Field word prefix: This test succeeds for entries that contain a specified field which contains a given word as a word prefix. For example a fielded word search in the "note" field for the word "miss" would match entries whose values for "note" are "not legal in Mississippi" or "I'm missing both my front teeth" or "Here She Is: Miss America".
Field value prefix: This test succeeds when a field value starts with a given value. For example a fielded search for field "phone" starting with "1-800" would match entries with value for phone "1-800-big-Pigs".
Field match: This test succeeds when a field value exactly matches a given string. A field match query for "title" of "king" would only match when the title was "king" and not match, for example, "kingpin".
Field value range: This test succeeds when a field value for an entry lies between two alphanumeric endpoints. For example a field range test looking for "date" between "1978" and "1999" would match an entry with "date" value "1998-09-01-21:43:12".

Furthermore, these tests may be conjoined in a single query, for example to search for "date between 1978 and 1999" and "note contains miss" at the same time.
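As a rough illustration of what each test means, the five tests and their conjunction can be written as plain Python predicates over entries modeled as dictionaries of value lists. This sketch is not the nucular query interface; all function names are hypothetical. It also assumes that word prefix tests ignore case while the value prefix, match, and range tests compare strings exactly, which the text above only states for the full text case. The dates other than "1998-09-01-21:43:12" are made-up sample data.

```python
import re


def _words(text):
    """Lowercase alphanumeric words of a value (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def any_word_prefix(entry, prefix):
    """Full text word prefix: a word in any field value starts with prefix."""
    p = prefix.lower()
    return any(
        w.startswith(p) for vals in entry.values() for v in vals for w in _words(v)
    )


def field_word_prefix(entry, field, prefix):
    """Field word prefix: a word in the given field starts with prefix."""
    p = prefix.lower()
    return any(w.startswith(p) for v in entry.get(field, []) for w in _words(v))


def field_value_prefix(entry, field, prefix):
    """Field value prefix: a whole value of the field starts with prefix."""
    return any(v.startswith(prefix) for v in entry.get(field, []))


def field_match(entry, field, value):
    """Field match: a value of the field equals the given string exactly."""
    return value in entry.get(field, [])


def field_range(entry, field, low, high):
    """Field value range: a value lies between low and high as strings."""
    return any(low <= v <= high for v in entry.get(field, []))


def matches_all(entry, tests):
    """Conjunction: the entry must pass every test."""
    return all(test(entry) for test in tests)


entries = [
    {"date": ["1998-09-01-21:43:12"], "note": ["I'm missing both my front teeth"]},
    {"date": ["2004-01-01"], "note": ["Here She Is: Miss America"]},
    {"date": ["1985-03-02"], "title": ["king"], "phone": ["1-800-big-Pigs"]},
]

# "date between 1978 and 1999" AND "note contains miss"
query = [
    lambda e: field_range(e, "date", "1978", "1999"),
    lambda e: field_word_prefix(e, "note", "miss"),
]
hits = [e for e in entries if matches_all(e, query)]
print(len(hits))
```

Only the first sample entry passes both tests: the second has a matching note but a date outside the range, and the third has no "note" field at all.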

Facets or Suggestions from a Query

Nucular queries also support a suggestions feature which attempts to provide data values likely to be related to a query. For example if the query specifies that the "Business" field must match "software" and the "Description" field must contain "community", the suggestions feature will attempt to determine the most likely values for the "Ticker" field. This feature is intended to be useful for graphical interfaces which support drop down suggestions or completions. The related data values for a field derived from a query are often called query facets.
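One simple way to picture facets is as a count of the values a field takes among the entries that match a query. The sketch below is an assumption about the general idea, not a description of how nucular computes or ranks its suggestions; `suggest_values` and the sample entries are hypothetical.

```python
from collections import Counter


def suggest_values(matching_entries, field, limit=5):
    """Return the most frequent values of `field` among the matching entries."""
    counts = Counter(
        value for entry in matching_entries for value in entry.get(field, [])
    )
    return [value for value, _count in counts.most_common(limit)]


# Hypothetical entries that already matched some query.
matching = [
    {"Ticker": ["AAA"], "Business": ["software", "publishing"]},
    {"Ticker": ["BBB"], "Business": ["software", "information services"]},
    {"Ticker": ["CCC"], "Business": ["software", "publishing"]},
]

business_facets = suggest_values(matching, "Business")
ticker_facets = suggest_values(matching, "Ticker")
print(business_facets)
print(ticker_facets)
```

A drop down in a user interface could then offer the returned values as completions for the chosen field.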

Deferred and immediate updates

Updates to archives (inserts or deletes of entries) may be stored in "immediate" or "deferred" mode. In either mode, new updates are not automatically inserted into the optimized index structures. Instead they "wait" in special areas for full integration when the archive is aggregated later.

Deferred updates are kept in a special area for later integration with the optimized index structures when the archive is aggregated (as described below). Sessions that access the archive before the next aggregation completes will not see the deferred updates -- but they also will not experience any performance degradation regardless of the number of deferred updates that are awaiting aggregation.

Immediate updates are placed in a different special area where they "lie on top" of the optimized indices until the archive is aggregated. Sessions that access the archive before the next aggregation will read all immediate updates that have not been aggregated. Consequently, if there are too many immediate updates, the sessions may become slower and more resource intensive.

Generally, it is a good idea to use immediate updates for information which must be visible immediately (such as indexing a new user profile) and defer updates of lower importance (such as a user profile edit) when possible.
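The difference in visibility between the two modes can be modeled with a toy in-memory class. This is a conceptual sketch only: `ToyArchive` and its methods are hypothetical names, not part of nucular, and the model ignores deletes, disk storage, and concurrency.

```python
class ToyArchive:
    """In-memory model of immediate versus deferred updates."""

    def __init__(self):
        self.base = {}       # stands in for the optimized index structures
        self.immediate = {}  # lies "on top" of the base; read by every lookup
        self.deferred = {}   # waits unseen until the next aggregation

    def store(self, identity, entry, deferred=False):
        """Record an update in the deferred or the immediate area."""
        target = self.deferred if deferred else self.immediate
        target[identity] = entry

    def lookup(self, identity):
        """Lookups consult immediate updates first, then the base index."""
        if identity in self.immediate:
            return self.immediate[identity]
        return self.base.get(identity)

    def aggregate(self):
        """Fold both waiting areas into the base index and empty them."""
        self.base.update(self.deferred)
        self.base.update(self.immediate)
        self.deferred.clear()
        self.immediate.clear()


archive = ToyArchive()
archive.store("user:1", {"name": ["new profile"]})                  # immediate
archive.store("user:2", {"name": ["profile edit"]}, deferred=True)  # deferred

seen_before = (archive.lookup("user:1"), archive.lookup("user:2"))
archive.aggregate()
seen_after = (archive.lookup("user:1"), archive.lookup("user:2"))
print(seen_before)
print(seen_after)
```

Before aggregation only the immediate update is visible. Every lookup in this model has to consult the immediate area, which mirrors why a large backlog of immediate updates makes sessions slower, while the deferred area costs lookups nothing.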

Archive Aggregation

As mentioned, archives accumulate updates in special areas that are separate from the main optimized indexing structures. There are two levels of aggregation which move updates from the temporary areas into the optimized indexes:

Partial Aggregation: Moves updates from the update area to a "transitional" index. This operation should be done whenever it is important to incorporate deferred updates, or whenever there are too many immediate and unaggregated updates.
Final Aggregation: Moves updates from the "transitional" index to the "final" index. This operation should be done whenever the "transitional" index gets large enough that partial aggregation becomes slow. For large archives the final aggregation operation may take a while since it passes over all data in the archive.

An aggregation operation may be done concurrently with updates and queries by other processes or threads, but two aggregation operations should not run at the same time (I don't think information will be lost, but one of the operations will be wasted).
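The two aggregation levels can be pictured as three tiers, with updates flowing from the update area to the transitional index and from there to the final index. The following toy model is a sketch under that assumption; `ToyIndexes`, `maintain`, and the `max_transitional` threshold are hypothetical and do not correspond to nucular functions or settings.

```python
class ToyIndexes:
    """Three-tier model: update area -> transitional index -> final index."""

    def __init__(self):
        self.updates = {}
        self.transitional = {}
        self.final = {}

    def partial_aggregate(self):
        """Move waiting updates into the transitional index."""
        self.transitional.update(self.updates)
        self.updates.clear()

    def final_aggregate(self):
        """Move the transitional index into the final index."""
        self.final.update(self.transitional)
        self.transitional.clear()


def maintain(idx, max_transitional=1000):
    """Always aggregate partially; aggregate fully only past a threshold."""
    idx.partial_aggregate()
    if len(idx.transitional) > max_transitional:
        idx.final_aggregate()


idx = ToyIndexes()
idx.updates = {"a": 1, "b": 2}
maintain(idx, max_transitional=5)
after_first = (len(idx.updates), len(idx.transitional), len(idx.final))

idx.updates = {key: 0 for key in "cdefg"}
maintain(idx, max_transitional=5)
after_second = (len(idx.updates), len(idx.transitional), len(idx.final))
print(after_first, after_second)
```

The design choice this illustrates is that the cheap partial step can run often, while the expensive final step, which passes over all data in a real archive, runs only when the transitional index has grown large.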
