return to index |
nucular project page with download links |
nucularSite
script creates or resets a nucular archive.
nucularLoad
script adds or replaces entries to a
nucular archive, and can also delete entries.
nucularQuery
script extracts entries from an archive based on
characteristics specified in a query.
nucularDump
script writes out the contents of an
archive to XML files.
nucularArchive
script optimizes archive indices by aggregating
updates to the archive.
nScrape
script builds an index by scraping the contents of
text files in a directory tree.
nucularSite
The following discussion provides example command lines
for using the nucularSite
script together with explanations of the
uses.
% python nucularSite.py directory
Create the directory and initialize an empty nucular archive structure there.
% python nucularSite.py --reset directory
If the directory exists, erase it's content and replace it with an empty nucular archive structure.
Adding a "--silent" flag will suppress debug prints
nucularLoad
The following discussion provides example command lines
for using the nucularLoad
script together with explanations of the
uses.
% python nucularLoad.py --xml fileName.xml archivePath
Load the entries in fileName.xml (encoded as xml) into the nucular archive at archivePath. The updates will not be visible until the next time the archive is aggregated. By default if any duplicate identities are found between the new entries and the existing archive entries, the load will be aborted.
% python nucularLoad.py --stdin archivePath
Loads the entries from standard input (encoded as xml). The updates will not be visible until the next time the archive is aggregated.
By default the entries will be inserted in "deferred" mode, meaning they will not be visible until the archive is aggregated. To make the entries visible to queries immediately add the "--visible" flag.
% python nucularLoad.py --visible --xml fileName.xml archivePath
Warning: if too many updates are made visible immediately before aggregation accesses to the archive may become slower.
To automatically aggregate recent updates for the archive following the inserts add the --aggregate flag
% python nucularLoad.py --aggregate --xml fileName.xml archivePath
If the archive is large this may take some time to complete.
To suppress verbose progress messages to standard output add the "--silent" flag.
% python nucularLoad.py --silent --xml fileName.xml archivePath
To replace existing entries that match on id use --replace
% python nucularLoad.py --replace --xml fileName.xml archivePath
To delete existing entries that match on id from input (and not replace them) use --delete.
% python nucularLoad.py --delete --xml fileName.xml archivePath
Use the "--exampleXML" flag to get a sample of the xml format required for encoding entries printed to standard output.
% python nucularLoad.py --exampleXML
nucularLoad
The following provides an example of the XML format
used by the nucularLoad
script as an input format.
<!-- Note: The top level tag is the "entries" which contains "entry" elements. --> <entries> <!-- entry elements must have a unique id and any number of fld attribute definitions --> <entry id="CA"> <!-- fld names are given by the (required) "n" attribute --> <!-- fld values are the string content of the fld element --> <!-- an entry may have several fld elements with the same name --> <fld n="category">U.S. State</fld> <fld n="name">Republic of California</fld> <fld n="nickname">The Golden State</fld> <fld n="nickname">Disney World State</fld> </entry> <!-- the entries do not need to have the same attributes, except for the required "id" --> <entry id="T2T"> <fld n="Title">Tale of Two Cities</fld> <fld n="Author">Charles Dickens</fld> </entry> <entry id="DEC"> <fld n="Company">Digital Equipment</fld> <fld n="Business">Computer hardware, parts and services</fld> <fld n="Status">Now part of Compaq</fld> </entry> </entries>
nucularQuery
The following discussion provides example command lines
for using the nucularQuery
script together with explanations of the
uses.
% python nucularQuery.py --xml queryFileName.xml archivePath
Prints the result of evaluating queryFileName to standard output as an XML entries dump. By default XML comments are added providing additional verbose information.
% python nucularQuery.py --prefix name:value archivePath
Prints a dump of entries where field "name" has "value" as a prefix.
% python nucularQuery.py --contains name:word archivePath
Prints a dump of entries where field "name" contains a word with "word" as a prefix.
% python nucularQuery.py --contains word archivePath
Prints a dump of entries where any field contains has some word with "word" as a prefix,
% python nucularQuery.py --range name:aaalow..zzzhigh archivePath
Print dump of entries where the field "name" has a value between aaalow and zzzhigh (in string ordering).
% python nucularQuery.py --match name=value archivePath
Print dump of entries where the field "name" has the value "value"
% python nucularQuery.py --near 3:words..near..eachother archivePath
Print dump or entries in the archive where some field contains the words "words" and "near" and "eachother" in that order separated by no more than 3 intervening words.
The --prefix, --contains, --near, and --range flags may be provided multiple times and combined. Only one --xml flag is allowed.
% python nucularQuery.py --match name=value --suggestions archivePath
The --suggestions flag instructs the script to print a sequence of suggestions for values for the fields of the database selected from entries in the query result.
% python nucularQuery.py --match name=value --silent archivePath
The --silent flag will suppress verbose XML comments in the output.
% python nucularQuery.py --exampleXML
Prints an example for the XML query format to standard output
nucularQuery
The following provides an example of the XML format
used by the nucularQuery
script as an input format.
<!-- example XML for query specification --> <query> <!-- social security number must start with 787 --> <prefix n="socialSecurityNumber" p="787"/> <!-- age should be between 09 (inclusive) and 45 (exclusive) in string order --> <range n="age" low="09" high="45"/> <!-- interests contains word programming --> <contains n="interests" p="programming"/> <!-- interests also contains word horses --> <contains n="interests" p="horses"/> <!-- the word "java" occurs in some attribute value as a prefix --> <contains p="java"/> <!-- the gender must match "female" --> <match n="gender" v="female"/> <!-- the words "swing dancing" must occur in that order somewhere separated by no more than 3 intervening words --> <near words="swing dancing" limit="3"/> </query>
nucularDump
The following discussion provides example command lines
for using the nucularDump
script together with explanations of the
uses.
% python nucularDump.py --prefix fileNamePrefix archivePath
Dump the entries from the archive at archivePath to xml file fileNamePrefix0.xml. If there are more than 1000 entries the second 1000 will go in fileNamePrefix1.xml, the third will go in fileNamePrefix2.xml and so forth. The data files can be used to recreate the archive using the nucularLoad script.
Rationale for using many files: large data sets are broken into multiple files so they can be reloaded into an archive without difficulty in multiple stages.
% python nucularDump.py --chunkSize 10000 --prefix fileNamePrefix archivePath
The above specifies that up to 10000 entries should go in each generated file (rather than the default of 1000).
% python nucularDump.py --silent --prefix fileNamePrefix archivePath
the "--silent" option suppresses verbose progress reporting to standard output.
nucularAggregate
The following discussion provides example command lines
for using the nucularAggregate
script together with explanations of the
uses.
"nucularAggregate.py" performs maintenance operations which optimize an updated nucular archive structure. Aggregations are only needed after multiple update operations.
% python nucularAggregate.py archivePath
Aggregate the recent updates to the archive into the "Transient" tree. This will make "deferred" updates visible and possibly make archive accesses faster.
% python nucularAggregate.py --fast archivePath
Perform the aggregation using the "fast" protocol (which consumes more memory but runs faster than the default protocol when enough real memory is available). If there are too many updates memory thrashing may ensue (causing slowness) or memory may be exhausted (causing a MemoryError after a very long wait). If this happens don't use --fast next time.
% python nucularAggregate.py --full archivePath
Aggregate recent updates to the transient tree and then merge the transient tree with the base archive. This should be done less frequently than the partial aggregation and may take a long time for large data sets.
% python nucularAggregate.py --silent archivePath
Perform the aggregation without verbose progress reports to standard output.
The --full, --fast, and --silent flags may be combined.
nScrape
The following discussion provides example command lines
for using the nScrape
script together with explanations of the
uses.
% python nScrape.py --directory path archivePath
Creates a new search archive in archivePath and traverses the directory path, reading and indexing the contents of the text files found. A file is only indexed if it is clear from its extension that it is a text file.
% python nScrape.py --add --directory path archivePath
Add files below path to the existing search archive in archivePath.
% python nScrape.py --silent --directory path archivePath