Python API Summary for creating and accessing a Nucular archive

The following discussion summarizes the Python Applications Programming Interfaces to the primary components used to interact with a Nucular archive. Most client programs will interact with the following sorts of objects:

A Nucular.Nucular session object provides the top level interface for an archive. It stores and retrieves entry.Entry objects and creates Nucular.NucularQuery objects. It also provides methods for storing updates, evaluating queries using boolean query strings, and aggregating the archive.
An entry.Entry object is a collection of field names with values. It provides the primary unit of storage for a Nucular archive (from a user's perspective).
A Nucular.NucularQuery object represents a specification for a set of entry.Entry objects to extract from the archive. When successfully evaluated it generates a Nucular.NucularResult object.
A Nucular.NucularResult object encapsulates the set of entry.Entry objects which match the specification provided by a Nucular.NucularQuery object. It provides methods for extracting the entry.Entry objects or their identities as well as other statistical information about the result set.
An entry.Entry may contain a specialValues.SpecialValue object, which can refer to the contents of an external URL, refer to an image, or point to the identity of another entry in the archive.

At this point some advanced features of the system have been omitted from this documentation for simplicity.

Nucular Sessions

A Nucular archive is specified by the top level file system directory path containing the archive, such as /usr/arw/data/Mondial/. All access to such an archive is mediated by a Nucular.Nucular object. Create one using the following constructor:

from nucular import Nucular
session = Nucular.Nucular(directory)

If the session will make no changes to the archive, the session will be slightly more optimized if opened read-only:

from nucular import Nucular
session = Nucular.Nucular(directory, readOnly=True)

Many sessions in many processes may update and query an archive at the same time without interfering with each other.

Initializing a new archive

If the archive associated with a session does not exist, the call session.create() will attempt to create the archive data structures on disk. The create will fail without making changes if the archive directory exists and is not empty. The calling program must make sure the archive directory is empty before calling create.

Storing session updates

A session which adds entries to an archive or deletes entries from it must be stored, or else the additions and deletions will be discarded and will not appear in the archive. To store updates in "lazy mode" use

session.store(lazy=True)

This will prevent other sessions from seeing the updates until the session has been aggregated. The advantage of doing it this way is that the updates will not have any performance impact on the other sessions. The disadvantage is that the other sessions will not see the updates, of course.

To make the updates visible to all subsequent sessions immediately use

session.store(lazy=False)

If too many sessions store visible updates before an aggregation, access time for the archive may get slower and the session objects will consume more memory.

Adding dictionaries to an archive

As a convenience the session object allows programs to store dictionaries directly (skipping the need to explicitly create entries) using

session.indexDictionary(identityString, dictionary)

Here the identityString should be a string uniquely identifying the new entry in the archive and the dictionary should map string attribute names to archivable values (values suitable for marshalling using the marshal module, like strings, ints, longs, or tuples or lists of marshallables).

The insert will not be permanent until the session is stored.
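
Since archivable values are described above in terms of the marshal module, a program can screen a dictionary with marshal before indexing it. This is only a sketch: the helper name unarchivable_keys is hypothetical, and passing the check is a necessary condition at best, because marshal accepts some types (floats or nested dictionaries, for instance) that the list above does not mention.

```python
import marshal

def unarchivable_keys(dictionary):
    """Return the sorted attribute names whose values marshal rejects.

    Hypothetical helper, not part of Nucular.
    """
    rejected = []
    for name, value in dictionary.items():
        try:
            marshal.dumps(value)
        except ValueError:
            # marshal raises ValueError for unsupported object types.
            rejected.append(name)
    return sorted(rejected)

record = {
    "title": "Shelley",
    "year": 1909,
    "tags": ("essay", ["poetry", "criticism"]),
    "handle": object(),  # an arbitrary object cannot be marshalled
}
problems = unarchivable_keys(record)
```

Here problems lists only "handle", so that attribute would need to be converted or dropped before calling session.indexDictionary.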

Adding entries to an archive

In addition to archiving dictionaries, a program may directly construct and store entry objects using

from nucular import entry
newEntry = entry.Entry(...)
...
session.index(newEntry)

The insert will not be permanent until the session is stored.
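
Putting the indexing and storing steps together, a batch load might look like the sketch below. The function name index_records and the shape of records (pairs of identity string and dictionary) are assumptions made for illustration; only indexDictionary and store come from this summary.

```python
def index_records(session, records, lazy=True):
    """Index (identityString, dictionary) pairs, then store the session.

    Hypothetical helper. session is expected to be a Nucular.Nucular
    object opened for update.
    """
    count = 0
    for identity, dictionary in records:
        session.indexDictionary(identity, dictionary)
        count += 1
    # Without this call the inserts above are discarded.
    session.store(lazy=lazy)
    return count

# Intended use (requires nucular to be installed):
#     session = Nucular.Nucular(directory)
#     index_records(session, [("1234", {"name": "example"})], lazy=False)
```

Passing lazy=True keeps the new entries hidden from other sessions until the next aggregation, while lazy=False makes them visible to subsequent sessions immediately.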

Evaluating boolean queries over an archive

Session objects provide the following methods for evaluating boolean query strings to get query results.

resultObject = session.result(queryString)
listOfDictionaries = session.dictionaries(queryString)

The syntax and usage of boolean query strings is described in greater detail in the Boolean query document. In addition to using boolean query strings, queries may also be constructed using query objects as described below.

Special Values

Some indices require internal cross references and references to external files or URLs. The specialValues.SpecialValue objects support these sorts of "pointer" values as an advanced usage feature. The special values may be used as values in dictionaries and entries inserted into a Nucular index. Use the following session methods to create special values.

session.ExpandedURL(URLstring) returns a "pointer" to the contents of a URL which will be indexed for searching in the indices and expanded into the full text upon retrieval. The full text itself will not be stored in the index directly.

session.UnExpandedURL(URLstring) returns a "pointer" to the contents of a URL which will be indexed for searching in the index but not expanded into full text upon retrieval. The full text will not be stored in the index directly.

session.UnIndexedURL(URLstring) returns a "pointer" to a URL which is not indexed for content and not stored in the indices.

session.ImageURL(URLstring) returns a "pointer" to an image object which is not indexed for content in the indices.

session.InternalLink(identityString) returns a "pointer" to an entry in the archive by its identity.

Please look to the test programs and the nBrowse.py HTTP browser code for examples of how these values may be used.
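
As a rough illustration, a dictionary can carry a special value next to ordinary strings. The sketch below mirrors the field names and the UnIndexedURL link seen in the Gutenberg example entry later in this document. The function name gutenberg_record is hypothetical, and the way the example archive was actually built is not shown here.

```python
def gutenberg_record(session, number, title, author):
    """Build a dictionary whose 'link' field is an unindexed URL pointer.

    Hypothetical helper. session is expected to be a Nucular.Nucular object.
    """
    url = "http://www.gutenberg.org/etext/%s" % number
    return {
        "i": str(number),
        "Title": title,
        "Author": author,
        # The URL is kept as a pointer and is not indexed for content.
        "link": session.UnIndexedURL(url),
    }

# Intended use (requires nucular to be installed):
#     record = gutenberg_record(session, 1336, "Shelley", "Francis Thompson")
#     session.indexDictionary("1336", record)
#     session.store(lazy=False)
```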

A warning about mixing Unicode strings with 8-bit strings

Although nucular allows indexing of entries containing both unicode strings and non-unicode (8-bit) strings it is possible to trigger exceptions if the archive contains both unicode strings and strings which cannot be automatically converted to unicode. In general it is best to keep all strings as unicode or all strings as 8-bit strings within a given archive. It is not a good idea to mix the two.
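
One way to follow this advice is to convert every string to a single text type before indexing. The sketch below is written for a current Python, where the 8-bit type is bytes and the Unicode type is str. The helper names and the UTF-8 default are assumptions, and under the older Python that Nucular targets the same idea would apply to str and unicode.

```python
def to_text(value, encoding="utf-8"):
    """Decode 8-bit strings, recursing into lists and tuples.

    Hypothetical helper, not part of Nucular.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type with converted members.
        return type(value)(to_text(item, encoding) for item in value)
    return value

def normalize_dictionary(dictionary, encoding="utf-8"):
    """Return a copy of dictionary with all strings as Unicode text."""
    return {
        to_text(name, encoding): to_text(value, encoding)
        for name, value in dictionary.items()
    }

mixed = {b"Title": [b"Shelley", "An Essay"], "i": 1336}
clean = normalize_dictionary(mixed)
```

A byte string that is not valid in the chosen encoding raises UnicodeDecodeError here, which surfaces the problem before the entry reaches the archive.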

Indexing full text only (optimization)

By default an archive session will index all new entries for full text word searches, for fielded match searches, for fielded range searches, and for fielded word searches. The freeTextOnly method suppresses indexing for word searching within attributes (fielded word searching). This reduction of functionality cuts the build time for some archives approximately in half. Suppress fielded word search functionality by calling:

session.freeTextOnly()

Deleting from an archive

To delete an entry associated with identityString from an archive use


session.remove(identityString)

The delete will not be permanent until the session is stored.

Getting an entry by identity from an archive

To extract an entry from an archive using its identityString use

theEntry = session.describe(identityString)

Creating an unspecified query for a session

To get an "unspecified" query for a session use

query = session.Query()

Before evaluation the query must be specified using the query API described below.

Creating a specified query using XML

To create and specify a query in one step using the XML format described for the nucularQuery script, use

query = session.QueryFromXMLText(XMLText)

Such a query may be evaluated immediately with no further initialization. Please see the nucularQuery script documentation for discussion of the XML format required.

Aggregating recent updates to intermediate storage

The following operation combines any updates to an archive since the last aggregation operation into optimized data storage structures.

session.aggregateRecent(verbose=False, fast=True)

This operation should be performed "once in a while" for archives which are frequently updated.

If the fast parameter is True most of the aggregation process will be done in memory with reduced disk accesses. This will be faster unless there are a great many updates to aggregate (many thousands or more). If fast is set False the aggregation operation will use disk storage and should work (possibly more slowly) for any number of updates.

As a side-effect, this operation will cause all deferred updates to become visible.

This operation may be performed with concurrent queries, updates, and deletes by other sessions in other processes.

Combining intermediate storage into permanent storage

The following operation will move all updates to the archive from intermediate optimized storage to permanent optimized storage.

session.moveTransientToBase(verbose=False)

This operation should be performed whenever the aggregateRecent operation starts to get too slow (which should be relatively infrequently). For large archives this operation may take significant time to complete.

This operation may be performed with concurrent queries, updates, and deletes by other sessions in other processes.

Unlinking retired files

After an aggregation operation there will be a number of files that will no longer be used by future session objects. To unlink these files to allow the filesystem to reclaim the space they consume use:

session.cleanUp()
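
The three maintenance operations above can be combined into one periodic routine. The following is a sketch only: the function name maintain and its parameters are hypothetical, and it assumes aggregateRecent is called on the same session object as moveTransientToBase and cleanUp.

```python
def maintain(session, consolidate=False, many_updates=False, verbose=False):
    """Aggregate recent updates, optionally consolidate, then clean up.

    Hypothetical helper. session is expected to be a Nucular.Nucular object.
    """
    # Use the disk-based path when a great many updates are pending.
    session.aggregateRecent(verbose=verbose, fast=not many_updates)
    if consolidate:
        # Slower step, intended to run relatively infrequently.
        session.moveTransientToBase(verbose=verbose)
    # Unlink files that future sessions will no longer use.
    session.cleanUp()

# Intended use (requires nucular to be installed):
#     maintain(Nucular.Nucular(directory))
```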

Nucular Entries

entry.Entry objects encapsulate sets of attribute/value pairs stored and retrieved by a Nucular archive. Many applications can avoid interacting directly with entry objects by using the interfaces which manipulate dictionaries. The following is a brief summary of the primary client accesses to entry objects.

from nucular import entry

# create a new entry object with identity string "1234"
myEntry = entry.Entry("1234")

# add an attribute/value association
myEntry[attribute] = value

# get the identity for an entry
myIdentity = myEntry.identity()

# associate several values to an attribute
myEntry.setValues(attribute, listOfValues)

# get the value for an attribute
theValue = myEntry[attribute]

# get the sequence of values for an attribute
listOfValues = myEntry.getValues(attribute)

# get a dictionary mapping attributes to single values
attributeToValue = myEntry.asDictionary()

# get a dictionary mapping attributes to lists of values
attributeToValueLists = myEntry.attrDict()

# get an XML string representing the entry
XMLString = myEntry.toXML()

Nucular Queries

Query objects collect conditions which describe the set of entries of interest to the client application. Queries are created by session objects and are used to generate result sets. The following is a summary of the primary client operations for a query:

# Get a query from a session.
myQuery = session.Query()

# Specify that attribute1 must exactly match value1.
myQuery.matchAttribute(attribute1, value1)

# Specify that attribute2 must start with prefix2.
myQuery.prefixAttribute(attribute2, prefix2)

# Specify that attribute3 must contain wordPrefix3 as a prefix to some word.
# Matching is case insensitive.
# This operation will only work if the index was built
# with support for fielded word searches (freeTextOnly disabled).
myQuery.attributeWord(attribute3, wordPrefix3)

# Specify that wordPrefix4 must occur as a prefix to some word in
# some attribute. (matching is case insensitive)
myQuery.anyWord(wordPrefix4)

# Specify that attribute5 must be between AAALow and ZZZHigh alphabetically.
myQuery.attributeRange(attribute5, AAALow, ZZZHigh)

# Specify that proximateWords must occur in the order given
# somewhere in each entry separated by no more than nearLimit
# intervening words.
myQuery.proximateWords(proximateWords, nearLimit)

# Find all entries visible in the archive which match
# all the specified conditions as a list of dictionaries.
dictionaryList = myQuery.resultDictionaries()

# Find all entries visible in the archive which match
# all the specified conditions as a result object with a status indicator
# [advanced]. May return result=None on failure.
(result, status) = myQuery.evaluate()

The evaluate method is useful when the result set may be large and the client application is only interested in fully extracting a small part of the set.

Below are some motivating examples for the descriptive methods:

# I am only interested in phone number (732)454-7633.
myQuery.matchAttribute("phoneNumber", "(732)454-7633")

# I only want URLs that begin with "http:".
myQuery.prefixAttribute("URL", "http:")

# I only want entries where the topic contains "bigfoot".
myQuery.attributeWord("topic", "bigfoot")

# I only want entries that mention "watters" somewhere
myQuery.anyWord("watters")

# I only want entries where the date is between 1961-12-29 and 1999-01-17
myQuery.attributeRange("date", "1961-12-29", "1999-01-17")

# I only want entries containing "Charles Dickens" in that order
# with no more than 3 intervening words in any field.
myQuery.proximateWords("charles dickens", 3)

# Alternate signature for proximateWords:
myQuery.proximateWords(["charles", "dickens"], 3)
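
The attributeRange example above works on dates only because the comparison is alphabetical and dates written as YYYY-MM-DD sort alphabetically in chronological order. The sketch below illustrates that point with plain string comparison. The helper name is hypothetical, and treating both endpoints as included is an assumption, since this summary does not say whether the range is inclusive.

```python
def in_alphabetic_range(value, low, high):
    """Plain string comparison, endpoints included (an assumption)."""
    return low <= value <= high

# ISO style dates: alphabetical order agrees with chronological order.
iso_inside = in_alphabetic_range("1975-06-01", "1961-12-29", "1999-01-17")
iso_outside = in_alphabetic_range("2001-01-01", "1961-12-29", "1999-01-17")

# Month-first dates: the same three dates no longer compare sensibly,
# because "06/..." sorts before "12/..." regardless of the year.
us_inside = in_alphabetic_range("06/01/1975", "12/29/1961", "01/17/1999")
```

With these inputs iso_inside is True and iso_outside is False as expected, while us_inside is False even though the date lies between the two limits, so values meant for range queries are best stored in a format whose string order matches the intended order.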

Nucular Results

The evaluate method of a query object generates a Nucular.NucularResult object. A result object is useful when, for example, a query matches 10000 entries but the client application only wants to extract the first 20 of them. In this case using the result object to mediate the extraction may speed up access to the 20 entries of interest considerably (since the other 9980 entries need not be materialized). The following is a brief summary of the possible interactions.

# get a result object and status string from a query
(result, status) = myQuery.evaluate()

# get the list of identity strings for the entries in the result set
identityList = result.identities()

# get the entry corresponding to a particular identity in the result set
theEntity = result.describe(theIdentity)

For example the following Python command line interaction queries the Gutenberg example archive for entries mentioning "bysshe" and extracts only the first entry completely (of 13 entries found).

>>> from nucular import Nucular 
>>> N = Nucular.Nucular("../testdata/gutenberg")
>>> Q = N.Query()
>>> Q.anyWord("bysshe")
>>> (R, status) = Q.evaluate()
>>> status
'complete'
>>> ids = R.identities()
>>> len(ids)
13
>>> firstId = ids[0]
>>> theEntity = R.describe(firstId)
>>> theEntity
Entry({'Subtitle': [u' An Essay'], 'Title': [u'Shelley'], 
'i': [u'1336'], 'Author': [u'Francis Thompson'], 
'Comments': [u'[Subtitle: An Essay] (PG Note: about Percy Bysshe Shelley)'], 
'link': [<fld n="?" special="UnIndexedURL">http://www.gutenberg.org/etext/1336</fld> <!-- UnIndexedURL -->]})
>>>
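
The same pattern can be wrapped in a small function that materializes only the first few entries. This is a sketch under the assumptions stated in this document (identities returns a list, and evaluate may return None as the result on failure). The function name first_entries is hypothetical.

```python
def first_entries(query, limit=20):
    """Evaluate query and describe at most limit matching entries.

    Hypothetical helper. Returns (listOfEntries, status).
    """
    (result, status) = query.evaluate()
    if result is None:
        # Evaluation failed; report the status with no entries.
        return [], status
    identities = result.identities()[:limit]
    # Only the selected identities are expanded into full entries.
    return [result.describe(identity) for identity in identities], status

# Intended use (requires nucular to be installed):
#     Q = N.Query()
#     Q.anyWord("bysshe")
#     (entries, status) = first_entries(Q, limit=1)
```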

Nucular Suggestions

The suggestions method of the query object generates suggested completions for full text and for fielded values derived from entries that match the query. It returns a tuple containing a list and a dictionary:

(L, D) = Q.suggestions()

The list contains suggested completion strings for full text searches and the dictionary contains suggested completions broken down by attribute name (as a mapping from field name to list of suggestions). For example the following Python command line interaction requests suggestions for entries matching "bysshe" from the gutenberg example index:

>>> N = Nucular.Nucular("../testdata/gutenberg")
>>> Q = N.Query()
>>> Q.anyWord("bysshe")
>>> (L, D) = Q.suggestions()
>>> L   
[u'bysshe shelley']
>>> D.keys()
['Subtitle', 'Language', 'Author', 'i', 'Contains', 'Comments', 'Translator', 
'Tr.', 'link', 'Editor', 'Title', 'Commentator']
>>> for x in D["Author"]:
...     print x
... 
wtctlxxx
thomas
xxx
pbs
xenophon
ptbllxxx
hutchinson
thompson
dmntwxxx
shelley
bysshe
sotheran
>>>

Here the suggested free text completion for "bysshe" is "bysshe shelley" and the following lines display words occurring in the Author fields from related entries in the archive. The suggestions interface is intended to be useful for web interfaces which support "drop down completion/suggestions".
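
For a drop down in a web interface, the pair returned by suggestions usually needs to be trimmed to a short list. The sketch below shows one way to do that in plain Python. The function name dropdown_options, the limit default, and the choice to drop repeated strings are all assumptions, not part of Nucular.

```python
def dropdown_options(fullText, byField, field=None, limit=10):
    """Pick up to limit distinct suggestions, keeping their original order.

    fullText and byField are the list and dictionary returned by
    Q.suggestions(). With field=None the full text list is used.
    """
    if field is None:
        candidates = fullText
    else:
        candidates = byField.get(field, [])
    seen = set()
    options = []
    for suggestion in candidates:
        if len(options) >= limit:
            break
        if suggestion not in seen:
            seen.add(suggestion)
            options.append(suggestion)
    return options

# Sample data shaped like the interaction above.
L = ["bysshe shelley"]
D = {"Author": ["thomas", "xenophon", "thomas", "shelley"]}
authors = dropdown_options(L, D, field="Author", limit=2)
```

With this sample data authors holds the first two distinct Author suggestions, and calling dropdown_options(L, D) returns the full text completions instead.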
