Nucular Command Line Script Usage Summary

The following sections explain how to use the command line tools used to interact with a nucular archive:

The nucularSite script creates or resets a nucular archive.
The nucularLoad script adds entries to a nucular archive or replaces existing ones, and can also delete entries.
The nucularQuery script extracts entries from an archive based on characteristics specified in a query.
The nucularDump script writes out the contents of an archive to XML files.
The nucularAggregate script optimizes archive indices by aggregating updates to the archive.
The nScrape script builds an index by scraping the contents of text files in a directory tree.

Script summary for `nucularSite`

The following discussion provides example command lines for using the nucularSite script together with explanations of the uses.

Create a new nucular archive

 %  python nucularSite.py directory

Create the directory and initialize an empty nucular archive structure there.

Erase and re-create a nucular archive

 %  python nucularSite.py --reset directory

If the directory exists, erase its content and replace it with an empty nucular archive structure.

Adding the "--silent" flag suppresses debug prints.

Script summary for `nucularLoad`

The following discussion provides example command lines for using the nucularLoad script together with explanations of the uses.

Load entries from an XML file to archive with deferred visibility

 %  python nucularLoad.py --xml fileName.xml archivePath

Load the entries in fileName.xml (encoded as XML) into the nucular archive at archivePath. The updates will not be visible until the next time the archive is aggregated. By default, the load is aborted if any identity in the new entries duplicates the identity of an existing archive entry.

Load entries from standard input XML to archive with deferred visibility

 %  python nucularLoad.py --stdin archivePath

Loads the entries from standard input (encoded as xml). The updates will not be visible until the next time the archive is aggregated.
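As a rough sketch of how another program might feed entries to the script through standard input, the following Python fragment builds the command line shown above and pipes an XML string to it. The helper names `build_load_command` and `load_from_string` are hypothetical, and the sketch assumes that nucularLoad.py is in the current directory and that "python" is on the search path.

```python
import subprocess

def build_load_command(archive_path, script="nucularLoad.py"):
    """Return the argument list for loading entries from standard input."""
    # Mirrors the command line: python nucularLoad.py --stdin archivePath
    return ["python", script, "--stdin", archive_path]

def load_from_string(xml_text, archive_path):
    """Pipe xml_text to the load script (hypothetical helper)."""
    command = build_load_command(archive_path)
    # check=True makes subprocess raise CalledProcessError if the
    # child process exits with a non-zero status.
    subprocess.run(command, input=xml_text.encode("utf-8"), check=True)
```

Building the argument list separately keeps the command easy to inspect before anything is executed.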

Load entries with immediate visibility

By default the entries will be inserted in "deferred" mode, meaning they will not be visible until the archive is aggregated. To make the entries visible to queries immediately add the "--visible" flag.

 %  python nucularLoad.py --visible --xml fileName.xml archivePath

Warning: if too many updates are made visible immediately, accesses to the archive may become slower until the archive is aggregated.

Load entries and aggregate the archive immediately

To aggregate recent updates to the archive automatically after the inserts, add the "--aggregate" flag.

 %  python nucularLoad.py --aggregate --xml fileName.xml archivePath

If the archive is large this may take some time to complete.

Load silently

To suppress verbose progress messages to standard output add the "--silent" flag.

 %  python nucularLoad.py --silent --xml fileName.xml archivePath

Load and replace duplicate identities

To replace existing entries that match on id, use the "--replace" flag.

 %  python nucularLoad.py --replace --xml fileName.xml archivePath

Delete entries

To delete existing entries whose ids match entries in the input (without replacing them), use the "--delete" flag.

 %  python nucularLoad.py --delete --xml fileName.xml archivePath

Give example XML

Use the "--exampleXML" flag to get a sample of the xml format required for encoding entries printed to standard output.

 %  python nucularLoad.py --exampleXML

Example XML Input for `nucularLoad`

The following provides an example of the XML format used by the nucularLoad script as an input format.

<!--
Note: The top level tag is the "entries" which contains "entry" elements.
-->
<entries>

    <!-- entry elements must have a unique id and any number of fld attribute definitions -->
    <entry id="CA">
        <!-- fld names are given by the (required) "n" attribute -->
        <!-- fld values are the string content of the fld element -->
        <!-- an entry may have several fld elements with the same name -->
        <fld n="category">U.S. State</fld>
        <fld n="name">Republic of California</fld>
        <fld n="nickname">The Golden State</fld>
        <fld n="nickname">Disney World State</fld>
    </entry>
    
    <!-- the entries do not need to have the same attributes, except for the required "id" -->
    <entry id="T2T">
        <fld n="Title">Tale of Two Cities</fld>
        <fld n="Author">Charles Dickens</fld>
    </entry>

    <entry id="DEC">
        <fld n="Company">Digital Equipment</fld>
        <fld n="Business">Computer hardware, parts and services</fld>
        <fld n="Status">Now part of Compaq</fld>
    </entry>
</entries>
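Input in this format can also be generated by a program. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree module; the function name `entries_to_xml` and the dictionary layout it accepts are assumptions made for illustration, not part of nucular.

```python
import xml.etree.ElementTree as ET

def entries_to_xml(entries):
    """Encode {id: [(field_name, value), ...]} as an <entries> document."""
    root = ET.Element("entries")
    for entry_id, fields in entries.items():
        # Each entry carries its required unique "id" attribute.
        entry = ET.SubElement(root, "entry", {"id": entry_id})
        for name, value in fields:
            # The field name goes in the "n" attribute; the value is
            # the string content of the fld element.
            fld = ET.SubElement(entry, "fld", {"n": name})
            fld.text = value
    return ET.tostring(root, encoding="unicode")

xml_text = entries_to_xml({
    "CA": [("category", "U.S. State"),
           ("nickname", "The Golden State"),
           ("nickname", "Disney World State")],
    "T2T": [("Title", "Tale of Two Cities"),
            ("Author", "Charles Dickens")],
})
```

A list of pairs is used for the fields, rather than a dictionary, because an entry may have several fld elements with the same name.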

Script summary for `nucularQuery`

The following discussion provides example command lines for using the nucularQuery script together with explanations of the uses.

Evaluate an XML Query using an archive

 %  python nucularQuery.py --xml queryFileName.xml archivePath

Prints the result of evaluating queryFileName to standard output as an XML entries dump. By default XML comments are added providing additional verbose information.

Find entries from an archive where a field starts with a given prefix

 %  python nucularQuery.py --prefix name:value archivePath

Prints a dump of entries where field "name" has "value" as a prefix.

Find entries from an archive where a field contains a given word prefix

 %  python nucularQuery.py --contains name:word archivePath

Prints a dump of entries where field "name" contains a word with "word" as a prefix.

Find entries from an archive where any field contains a given word prefix (free text search)

 %  python nucularQuery.py --contains word archivePath

Prints a dump of entries where any field contains some word with "word" as a prefix.
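The difference between the --prefix and --contains tests can be illustrated with a small Python sketch. This models only the semantics described above, not nucular's index implementation. The function names are hypothetical, and the sketch assumes that words are separated by whitespace and that comparison is case-sensitive, neither of which is stated here.

```python
def field_has_prefix(value, prefix):
    """--prefix: the whole field value starts with the prefix."""
    return value.startswith(prefix)

def field_contains_word_prefix(value, prefix):
    """--contains: some word inside the field value starts with the prefix."""
    return any(word.startswith(prefix) for word in value.split())

def any_field_contains(entry, prefix):
    """Free text search: some field of the entry contains such a word."""
    return any(field_contains_word_prefix(value, prefix)
               for value in entry.values())

entry = {"Title": "Tale of Two Cities", "Author": "Charles Dickens"}
```

With this entry, "Tale" is a prefix of the Title field, while "Cit" is only a prefix of a word inside it, so the first test fails for "Cit" and the second succeeds.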

Find entries from an archive where a field lies in a specified range

 %  python nucularQuery.py --range name:aaalow..zzzhigh archivePath

Print dump of entries where the field "name" has a value between aaalow and zzzhigh (in string ordering).
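Because the comparison uses string ordering, numeric values behave as expected only when they are padded to a common width. The Python sketch below illustrates the pitfall. It assumes the lower bound is inclusive and the upper bound exclusive, as the comment in the XML query example states for the range element; the function name is hypothetical.

```python
def in_string_range(value, low, high):
    """True if low <= value < high under string (lexicographic) ordering."""
    return low <= value < high

# Unpadded numbers compare character by character, not numerically.
nine_unpadded = in_string_range("9", "09", "45")       # "9" sorts after "45"
hundred_unpadded = in_string_range("100", "09", "45")  # "100" sorts before "45"

# Padding every value and both bounds to the same width restores
# the numeric ordering.
nine_padded = in_string_range("9".zfill(3), "009", "045")
hundred_padded = in_string_range("100".zfill(3), "009", "045")
```

Here the unpadded "9" falls outside the range and the unpadded "100" falls inside it, which is the reverse of the numeric result; the padded comparisons give the numeric result.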

Find entries from an archive where a field has a specified value

 %  python nucularQuery.py --match name=value archivePath

Print a dump of entries where the field "name" has the value "value".

Find proximate words in any field

 %  python nucularQuery.py --near 3:words..near..eachother archivePath

Print a dump of entries in the archive where some field contains the words "words", "near", and "eachother" in that order, separated by no more than 3 intervening words.
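One possible reading of the proximity test is sketched below in Python. It is an illustration of the semantics only, not nucular's implementation. It assumes that the limit applies to each consecutive pair of words, that words are separated by whitespace, and that matching is exact and case-insensitive; none of these details is specified here, and the function name is hypothetical.

```python
def near_match(text, words, limit):
    """True if words occur in order with at most `limit` words between
    each consecutive pair (assumed interpretation)."""
    if not words:
        return True
    tokens = text.lower().split()
    words = [w.lower() for w in words]
    # Token positions at which a partial match may currently end.
    ends = [i for i, token in enumerate(tokens) if token == words[0]]
    for word in words[1:]:
        next_ends = []
        for i, token in enumerate(tokens):
            # i - e - 1 counts the words strictly between positions e and i;
            # requiring it to be >= 0 enforces the ordering.
            if token == word and any(0 <= i - e - 1 <= limit for e in ends):
                next_ends.append(i)
        ends = next_ends
    return bool(ends)

sample = "words that are near to eachother"
```

For the sample text, two words separate "words" from "near" and one word separates "near" from "eachother", so the test succeeds with a limit of 3 and fails with a limit of 1.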

The --prefix, --contains, --near, and --range flags may be provided multiple times and combined. Only one --xml flag is allowed.

Print suggested field values

 %  python nucularQuery.py --match name=value --suggestions archivePath

The --suggestions flag instructs the script to print a sequence of suggestions for values for the fields of the database selected from entries in the query result.

Suppress comments

 %  python nucularQuery.py --match name=value --silent archivePath

The --silent flag will suppress verbose XML comments in the output.

Print Example XML

 %  python nucularQuery.py --exampleXML

Prints an example of the XML query format to standard output.

Example XML Input for `nucularQuery`

The following provides an example of the XML format used by the nucularQuery script as an input format.

<!-- example XML for query specification -->

<query>

    <!-- social security number must start with 787 -->
    <prefix n="socialSecurityNumber" p="787"/>

    <!-- age should be between 09 (inclusive) and 45 (exclusive) in string order -->
    <range n="age" low="09" high="45"/>

    <!-- interests contains word programming -->
    <contains n="interests" p="programming"/>

    <!-- interests also contains word horses -->
    <contains n="interests" p="horses"/>

    <!-- the word "java" occurs in some attribute value as a prefix -->
    <contains p="java"/>

    <!-- the gender must match "female" -->
    <match n="gender" v="female"/>

    <!-- the words "swing dancing" must occur in that order somewhere
         separated by no more than 3 intervening words -->
    <near words="swing dancing" limit="3"/>

</query>
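A query document like the one above can be assembled by a program and written to a file for use with the --xml flag. The following Python sketch uses the standard library's xml.etree.ElementTree module; the function name `build_example_query` is hypothetical, and the element and attribute names are taken from the example above.

```python
import xml.etree.ElementTree as ET

def build_example_query():
    """Build a <query> document with one element per search condition."""
    query = ET.Element("query")
    ET.SubElement(query, "prefix", {"n": "socialSecurityNumber", "p": "787"})
    ET.SubElement(query, "range", {"n": "age", "low": "09", "high": "45"})
    ET.SubElement(query, "contains", {"n": "interests", "p": "programming"})
    # A contains element without an "n" attribute searches all fields.
    ET.SubElement(query, "contains", {"p": "java"})
    ET.SubElement(query, "match", {"n": "gender", "v": "female"})
    # Attribute values are always strings, including the numeric limit.
    ET.SubElement(query, "near", {"words": "swing dancing", "limit": "3"})
    return ET.tostring(query, encoding="unicode")

query_xml = build_example_query()
```

The resulting string could be saved, for example as queryFileName.xml, and passed to nucularQuery.py with the --xml flag.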

Script summary for `nucularDump`

The following discussion provides example command lines for using the nucularDump script together with explanations of the uses.

Dump a nucular archive to files

 %  python nucularDump.py --prefix fileNamePrefix archivePath

Dump the entries from the archive at archivePath to xml file fileNamePrefix0.xml. If there are more than 1000 entries the second 1000 will go in fileNamePrefix1.xml, the third will go in fileNamePrefix2.xml and so forth. The data files can be used to recreate the archive using the nucularLoad script.

Rationale for using many files: large data sets are broken into multiple files so they can be reloaded into an archive without difficulty in multiple stages.

Dump a nucular archive using specified chunk size

 %  python nucularDump.py --chunkSize 10000 --prefix fileNamePrefix archivePath

The above specifies that up to 10000 entries should go in each generated file (rather than the default of 1000).
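The file naming scheme described above can be expressed as a short Python sketch. It assumes entries are counted from zero in dump order; the function names are hypothetical and the code only models the naming, it does not read an archive.

```python
def dump_file_for_entry(index, prefix, chunk_size=1000):
    """Name of the file that receives the entry at zero-based position index."""
    return "%s%d.xml" % (prefix, index // chunk_size)

def dump_file_names(entry_count, prefix, chunk_size=1000):
    """All file names needed to hold entry_count entries."""
    # Round up so that a partially filled last chunk still gets a file.
    file_count = (entry_count + chunk_size - 1) // chunk_size
    return ["%s%d.xml" % (prefix, n) for n in range(file_count)]

names = dump_file_names(2500, "fileNamePrefix")
```

With the default chunk size, 2500 entries would be spread over fileNamePrefix0.xml, fileNamePrefix1.xml, and fileNamePrefix2.xml, and each file could then be reloaded separately with the nucularLoad script.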

Dump silently

 %  python nucularDump.py --silent --prefix fileNamePrefix archivePath

The "--silent" option suppresses verbose progress reporting to standard output.

Script summary for `nucularAggregate`

The following discussion provides example command lines for using the nucularAggregate script together with explanations of the uses.

"nucularAggregate.py" performs maintenance operations which optimize an updated nucular archive structure. Aggregations are only needed after multiple update operations.

Aggregate recent operations to intermediate storage

 %  python nucularAggregate.py archivePath

Aggregate the recent updates to the archive into the "Transient" tree. This will make "deferred" updates visible and possibly make archive accesses faster.

Aggregate in fast mode

 %  python nucularAggregate.py --fast archivePath

Perform the aggregation using the "fast" protocol, which consumes more memory but runs faster than the default protocol when enough real memory is available. If there are too many updates, memory thrashing may ensue (causing slowness) or memory may be exhausted (causing a MemoryError after a very long wait). If this happens, omit the --fast flag on the next run.

Full aggregation

 %  python nucularAggregate.py --full archivePath

Aggregate recent updates to the transient tree and then merge the transient tree with the base archive. This should be done less frequently than the partial aggregation and may take a long time for large data sets.

Silent aggregation

 %  python nucularAggregate.py --silent archivePath

Perform the aggregation without verbose progress reports to standard output.

The --full, --fast, and --silent flags may be combined.

Script summary for `nScrape`

The following discussion provides example command lines for using the nScrape script together with explanations of the uses.

 %  python nScrape.py --directory path archivePath

Creates a new search archive in archivePath and traverses the directory path, reading and indexing the contents of the text files found. A file is only indexed if it is clear from its extension that it is a text file.
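The kind of traversal described above might look like the following Python sketch. It is not the nScrape implementation: the set of extensions treated as text is an assumption (the extensions the script actually recognizes are not listed here), and the function names are hypothetical.

```python
import os

# Assumed for illustration; not the list used by nScrape.
TEXT_EXTENSIONS = {".txt", ".py", ".html", ".xml"}

def is_text_file(file_name):
    """Decide from the extension alone whether a file looks like text."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in TEXT_EXTENSIONS

def find_text_files(root):
    """Yield the paths of all text files in the directory tree below root."""
    for directory, _subdirectories, file_names in os.walk(root):
        for file_name in file_names:
            if is_text_file(file_name):
                yield os.path.join(directory, file_name)
```

Deciding by extension avoids opening every file, at the cost of skipping text files with unusual or missing extensions.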

Index additional files in an existing index

 %  python nScrape.py --add --directory path archivePath

Add files below path to the existing search archive in archivePath.

Scrape silently

 %  python nScrape.py --silent --directory path archivePath
