Command Line Script Guide

The following discussion walks through a sequence of command line examples that illustrate ways to interact with a nucular archive using the command line Python scripts provided with the distribution. See the command line summaries for more detailed explanations of the scripts used here.

In the examples below, the working directory is the scripts directory of the distribution, and the PYTHONPATH environment variable is set so that the statement from nucular.nucular import Nucular works.

Creating an archive

The command

% python nucularSite.py --reset ../testdata/ScriptExample

creates an empty archive at ../testdata/ScriptExample. Since --reset is specified, any existing data in the directory will be deleted.

Loading some data with deferred visibility

The command

% python nucularLoad.py --xml ../data/docExamples0.xml ../testdata/ScriptExample

loads some data from ../data/docExamples0.xml into the new archive. This is the content of ../data/docExamples0.xml:

<entries>
<entry id="123FROG">
   <fld n="descr">little green slimy things</fld>
   <fld n="food">tastes delicious, like chicken</fld>
   <fld n="name">frog</fld>
</entry>
<entry id="456BUNNY">
   <fld n="descr">cute and cuddly</fld>
   <fld n="food">just delicious with garlic</fld>
   <fld n="name">bunny rabbit</fld>
</entry>
<entry id="789KITTEN">
   <fld n="descr">cute and cuddly</fld>
   <fld n="name">kitten</fld>
   <fld n="note">not edible</fld>
</entry>
<entry id="Joe Blow">
   <fld n="c">great at a grill</fld>
   <fld n="g">male</fld>
   <fld n="p">333-2222</fld>
</entry>
<entry id="Joe Smithers">
   <fld n="c">can't cook</fld>
   <fld n="g">male</fld>
   <fld n="p">111-3333</fld>
</entry>
<entry id="Lola Waller">
   <fld n="g">female</fld>
   <fld n="n">thinks snails are delicious</fld>
   <fld n="p">333-2222</fld>
</entry>
<entry id="Sally Smithers">
   <fld n="c">uses too much salt</fld>
   <fld n="g">female</fld>
   <fld n="p">111-3333</fld>
</entry>
<entry id="Sandy Waller">
   <fld n="c">delicious pizza</fld>
   <fld n="g">female</fld>
   <fld n="p">333-2222</fld>
</entry>
</entries>
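
The structure of this data file can be illustrated with a short Python sketch that reads it using only the standard library. This is not part of the nucular distribution; the function name parse_entries and the abbreviated SAMPLE text are assumptions made for illustration. Each entry is returned as a list of (name, value) pairs, so that a repeated field name would not be lost if the format permits one (also an assumption).

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the data file shown above (two of the eight entries).
SAMPLE = """<entries>
<entry id="123FROG">
   <fld n="descr">little green slimy things</fld>
   <fld n="food">tastes delicious, like chicken</fld>
   <fld n="name">frog</fld>
</entry>
<entry id="Joe Blow">
   <fld n="c">great at a grill</fld>
   <fld n="g">male</fld>
   <fld n="p">333-2222</fld>
</entry>
</entries>"""


def parse_entries(xml_text):
    """Return {entry id: [(field name, field value), ...]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.findall("entry"):
        # Each <fld n="..."> element carries one named value of the entry.
        fields = [(fld.get("n"), fld.text or "") for fld in entry.findall("fld")]
        entries[entry.get("id")] = fields
    return entries


entries = parse_entries(SAMPLE)
for entry_id, fields in entries.items():
    print(entry_id, fields)
```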

Since we didn't specify --visible in the load command, the data is not yet visible when we pose the following query:

% python nucularQuery.py --contains delicious  ../testdata/ScriptExample

The output of the query command gives

<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <contains p="delicious"/>
</query>
-->
<!-- result status= complete -->

<entries>
</entries>

<!--  0 entries in result set -->

The above output includes some verbose XML comments because the query command didn't specify --silent, but no entries appear inside the top-level entries tag.

Aggregating the data

Aggregating the archive will optimize the indexing structures and also make all deferred updates visible. We aggregate the archive using the command

% python nucularAggregate.py --silent ../testdata/ScriptExample/

After aggregation we pose the query again

% python nucularQuery.py --contains delicious  ../testdata/ScriptExample

And the query evaluation generates the output

<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <contains p="delicious"/>
</query>
-->
<!-- result status= complete -->

<entries>
   <entry id="123FROG">
      <fld n="descr">little green slimy things</fld>
      <fld n="food">tastes delicious, like chicken</fld>
      <fld n="name">frog</fld>
   </entry>
   <entry id="456BUNNY">
      <fld n="descr">cute and cuddly</fld>
      <fld n="food">just delicious with garlic</fld>
      <fld n="name">bunny rabbit</fld>
   </entry>
   <entry id="Lola Waller">
      <fld n="g">female</fld>
      <fld n="n">thinks snails are delicious</fld>
      <fld n="p">333-2222</fld>
   </entry>
   <entry id="Sandy Waller">
      <fld n="c">delicious pizza</fld>
      <fld n="g">female</fld>
      <fld n="p">333-2222</fld>
   </entry>
</entries>

<!--  4 entries in result set -->
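
Output in this form can be post-processed with the Python standard library, since the comments sit outside the top-level entries element. The sketch below is an illustration only and not part of the distribution: the helper name summarize is hypothetical, the OUTPUT text is an abbreviated copy of the listing above (two entries, with the count comment adjusted to match), and the regular expression assumes the status comment always has the shape shown.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the query output shown above.
OUTPUT = """<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <contains p="delicious"/>
</query>
-->
<!-- result status= complete -->

<entries>
   <entry id="123FROG">
      <fld n="descr">little green slimy things</fld>
      <fld n="food">tastes delicious, like chicken</fld>
      <fld n="name">frog</fld>
   </entry>
   <entry id="Sandy Waller">
      <fld n="c">delicious pizza</fld>
      <fld n="g">female</fld>
      <fld n="p">333-2222</fld>
   </entry>
</entries>

<!--  2 entries in result set -->
"""


def summarize(output):
    """Return (status, [entry ids]) for query output (hypothetical helper)."""
    # The status is only available in a comment, so pick it out textually.
    found = re.search(r"<!--\s*result status=\s*(\S+)\s*-->", output)
    status = found.group(1) if found else None
    # The XML parser skips the comments and yields the <entries> element.
    root = ET.fromstring(output)
    ids = [entry.get("id") for entry in root.findall("entry")]
    return status, ids


status, ids = summarize(OUTPUT)
print(status, ids)
```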

Queries with multiple conditions

Queries may place several conditions on the entries in the result. For example, the following query looks for entries that contain both the word "cuddly" as a prefix of some word in some attribute and the word "delicious" as a prefix of some word in some attribute.

% python nucularQuery.py --contains delicious --contains CUDDLY ../testdata/ScriptExample

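The query generates the following XML as output. The echoed query shows both terms in lower case, so the term given as CUDDLY is matched as "cuddly":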

<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <contains p="cuddly"/>
   <contains p="delicious"/>
</query>
-->
<!-- result status= complete -->

<entries>
   <entry id="456BUNNY">
      <fld n="descr">cute and cuddly</fld>
      <fld n="food">just delicious with garlic</fld>
      <fld n="name">bunny rabbit</fld>
   </entry>
</entries>

<!--  1 entries in result set -->
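
The matching rule described in this section can be modeled in a few lines of Python. The sketch below is a model of the documented behavior under stated assumptions, not nucular's implementation: it assumes words are runs of letters and digits, that comparison ignores case (as the lower-cased echo of CUDDLY suggests), and that all conditions must hold for an entry to qualify. The names words, contains, and query are hypothetical.

```python
import re

# Four of the entries loaded earlier, as plain dictionaries.
ENTRIES = {
    "123FROG": {"descr": "little green slimy things",
                "food": "tastes delicious, like chicken",
                "name": "frog"},
    "456BUNNY": {"descr": "cute and cuddly",
                 "food": "just delicious with garlic",
                 "name": "bunny rabbit"},
    "789KITTEN": {"descr": "cute and cuddly",
                  "name": "kitten",
                  "note": "not edible"},
    "Lola Waller": {"g": "female",
                    "n": "thinks snails are delicious",
                    "p": "333-2222"},
}


def words(text):
    """Split text into lower-case words (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains(fields, prefix):
    """True if the prefix starts some word of some attribute value."""
    prefix = prefix.lower()
    return any(word.startswith(prefix)
               for value in fields.values()
               for word in words(value))


def query(entries, *prefixes):
    """Return the sorted ids of entries satisfying every contains condition."""
    return sorted(entry_id for entry_id, fields in entries.items()
                  if all(contains(fields, prefix) for prefix in prefixes))


print(query(ENTRIES, "delicious", "CUDDLY"))  # → ['456BUNNY']
```

With the single condition "delicious" the same model selects the frog, bunny, and Lola Waller entries, which agrees with the earlier query output for the entries included here.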

XML Query specification

Queries may also be specified using XML files. For example the file ../doc/deliciousCuddlyQuery.xml with the content

<query>
   <contains p="cuddly"/>
   <contains p="delicious"/>
</query>

represents the same query as the one above (looking for "cuddly" and "delicious") and the command using this XML specification

% python nucularQuery.py --xml ../doc/deliciousCuddlyQuery.xml ../testdata/ScriptExample

generates the same output.
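
A query file of this shape can also be produced programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name build_contains_query is hypothetical, and only the query and contains elements shown above are used. The serialized text may differ from the hand-written file in whitespace, which should not matter to an XML parser.

```python
import xml.etree.ElementTree as ET


def build_contains_query(prefixes):
    """Return query XML text with one <contains> element per prefix."""
    query = ET.Element("query")
    for prefix in prefixes:
        ET.SubElement(query, "contains", {"p": prefix})
    return ET.tostring(query, encoding="unicode")


xml_text = build_contains_query(["cuddly", "delicious"])
print(xml_text)
```

The resulting text could be written to a file and passed to nucularQuery.py with the --xml option in the same way as ../doc/deliciousCuddlyQuery.xml.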

Dumping the archive as XML

To save the entire contents of the archive as XML, execute the command

% python nucularDump.py --prefix ../testdata/Dump ../testdata/ScriptExample/

In this case, because the archive is small, the only file created by the command is ../testdata/Dump0.xml. For larger archives the script might create additional files: ../testdata/Dump1.xml, ../testdata/Dump2.xml, ../testdata/Dump3.xml, and so forth.

Adding more entries

The following command loads an additional 100 entries derived from a portion of the Gutenberg project's book list.

% python nucularLoad.py --silent --visible --xml ../data/gutenberg1.xml ../testdata/ScriptExample

In this case since we specified --visible the data becomes visible immediately to subsequent queries.

Looking for Smith

If we evaluate the query looking for the prefix "smith" anywhere

% python nucularQuery.py --contains smith ../testdata/ScriptExample

we see entries from both the initial data set and the additional data in the output:

<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <contains p="smith"/>
</query>
-->
<!-- result status= complete -->

<entries>
   <entry id="10166">
      <fld n="Author">Thomas F. A. Smith</fld>
      <fld n="Comments">[Subtitle: The War as Germans see it]</fld>
      <fld n="Subtitle"> The War as Germans see it</fld>
      <fld n="Title">What Germany Thinks</fld>
   </entry>
   <entry id="Joe Smithers">
      <fld n="c">can't cook</fld>
      <fld n="g">male</fld>
      <fld n="p">111-3333</fld>
   </entry>
   <entry id="Sally Smithers">
      <fld n="c">uses too much salt</fld>
      <fld n="g">female</fld>
      <fld n="p">111-3333</fld>
   </entry>
</entries>

<!--  3 entries in result set -->

Deleting entries

To remove the new entries use the command

% python nucularLoad.py --silent --visible --xml ../data/gutenberg1.xml --delete ../testdata/ScriptExample

This deletes from the archive every entry whose identity appears in the file ../data/gutenberg1.xml and makes the deletions visible immediately. If we run the "smith" query again:

% python nucularQuery.py --contains smith ../testdata/ScriptExample > ../doc/smith1.xml

the entry from the Gutenberg data set is gone from the output (redirected here to ../doc/smith1.xml):

<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <contains p="smith"/>
</query>
-->
<!-- result status= complete -->

<entries>
   <entry id="Joe Smithers">
      <fld n="c">can't cook</fld>
      <fld n="g">male</fld>
      <fld n="p">111-3333</fld>
   </entry>
   <entry id="Sally Smithers">
      <fld n="c">uses too much salt</fld>
      <fld n="g">female</fld>
      <fld n="p">111-3333</fld>
   </entry>
</entries>

<!--  2 entries in result set -->

Fully aggregating the archive

At some point after a series of updates, the archive should be fully aggregated, as is done for the example archive with the following command line:

% python nucularAggregate.py --silent --full ../testdata/ScriptExample/
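
The command sequence used so far can be collected in a small driver script. The sketch below only uses script names and options that appear in this guide; the helper names script_command and run_workflow are hypothetical, and it assumes that python on the PATH is an interpreter able to run the distribution's scripts and that the working directory is the scripts directory. As written it only prints the commands; run_workflow would execute them.

```python
import subprocess

ARCHIVE = "../testdata/ScriptExample"


def script_command(script, *args):
    """Build the argument list for one of the distribution's scripts."""
    return ["python", script, *args]


# Create, load, aggregate, query, then fully aggregate, as in this guide.
WORKFLOW = [
    script_command("nucularSite.py", "--reset", ARCHIVE),
    script_command("nucularLoad.py", "--xml", "../data/docExamples0.xml", ARCHIVE),
    script_command("nucularAggregate.py", "--silent", ARCHIVE),
    script_command("nucularQuery.py", "--contains", "delicious", ARCHIVE),
    script_command("nucularAggregate.py", "--silent", "--full", ARCHIVE),
]


def run_workflow(commands, cwd="."):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


for command in WORKFLOW:
    print(" ".join(command))
```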

Other sorts of queries

As explained in the command line summary document there are other ways to specify a query in addition to the ones shown above. For example the following query finds entries where the "g" attribute has the value "female" and the "p" attribute starts with "333":

% python nucularQuery.py --match g=female --prefix p:333 ../testdata/ScriptExample

generating the output

<!-- archive= ../testdata/ScriptExample
<query threaded="False">
   <match n="g" v="female"/>
   <prefix n="p" p="333"/>
</query>
-->
<!-- result status= complete -->

<entries>
   <entry id="Lola Waller">
      <fld n="g">female</fld>
      <fld n="n">thinks snails are delicious</fld>
      <fld n="p">333-2222</fld>
   </entry>
   <entry id="Sandy Waller">
      <fld n="c">delicious pizza</fld>
      <fld n="g">female</fld>
      <fld n="p">333-2222</fld>
   </entry>
</entries>

<!--  2 entries in result set -->
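
The two conditions in this query can be modeled in plain Python to make their meaning concrete. This is a sketch of the described behavior, not nucular's implementation: it assumes that a match condition compares the whole attribute value exactly and that a prefix condition tests the start of the whole attribute value. The names match, prefix, and select are hypothetical.

```python
# The five person entries loaded earlier, as plain dictionaries.
PEOPLE = {
    "Joe Blow": {"c": "great at a grill", "g": "male", "p": "333-2222"},
    "Joe Smithers": {"c": "can't cook", "g": "male", "p": "111-3333"},
    "Lola Waller": {"g": "female", "n": "thinks snails are delicious",
                    "p": "333-2222"},
    "Sally Smithers": {"c": "uses too much salt", "g": "female",
                       "p": "111-3333"},
    "Sandy Waller": {"c": "delicious pizza", "g": "female", "p": "333-2222"},
}


def match(name, value):
    """Condition: the named attribute equals the value exactly (assumed)."""
    return lambda fields: fields.get(name) == value


def prefix(name, start):
    """Condition: the named attribute value starts with the given text."""
    return lambda fields: fields.get(name, "").startswith(start)


def select(entries, *conditions):
    """Return the sorted ids of entries satisfying every condition."""
    return sorted(entry_id for entry_id, fields in entries.items()
                  if all(condition(fields) for condition in conditions))


print(select(PEOPLE, match("g", "female"), prefix("p", "333")))
# → ['Lola Waller', 'Sandy Waller']
```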

Scraping a directory tree

The scripts directory also provides a utility for building a searchable index from the text files within a directory tree. nScrape.py traverses a directory structure, identifies the text files, reads them, and adds the file information to a nucular index for later searching.
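
The traversal described above can be pictured with the following standard-library sketch. It is not the code of nScrape.py: deciding what counts as a text file by guessing a text/* MIME type from the file name is an assumption, as are the helper names find_text_files and scrape. The record keys A_path, B_type, and C mirror the field names that appear in the query output later in this section. The demonstration at the end runs on a temporary directory rather than a real archive.

```python
import mimetypes
import os
import tempfile


def find_text_files(top):
    """Yield (path, mime type) for files under top guessed to be text."""
    for directory, _subdirs, filenames in os.walk(top):
        for filename in sorted(filenames):
            path = os.path.join(directory, filename)
            mime_type, _encoding = mimetypes.guess_type(path)
            if mime_type and mime_type.startswith("text/"):
                yield path, mime_type


def scrape(top):
    """Return one record per text file, ready to be indexed."""
    records = []
    for path, mime_type in find_text_files(top):
        with open(path, encoding="utf-8", errors="replace") as handle:
            records.append({"A_path": path, "B_type": mime_type,
                            "C": handle.read()})
    return records


# Demonstration on a throwaway directory with one text and one image file.
with tempfile.TemporaryDirectory() as top:
    with open(os.path.join(top, "scrapeTargetFile.txt"), "w") as handle:
        handle.write("This file mentions garbanzo beans.\n")
    with open(os.path.join(top, "picture.png"), "wb") as handle:
        handle.write(b"\x89PNG")
    records = scrape(top)

for record in records:
    print(record["B_type"], os.path.basename(record["A_path"]))
```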

For an example run of nScrape.py we first create a new archive to house the scraped text indices:

% python nucularSite.py --reset ../testdata/ScrapeExample

Then we scrape the contents of the ../test directory into the archive:

% python nScrape.py --add --directory ../test ../testdata/ScrapeExample

After the scrape we may query the archive, for example to identify files containing words with the prefix "garban":

% python nucularQuery.py --contains garban  ../testdata/ScrapeExample

This query produces the XML output:

<entries>
   <entry id="7">
      <fld n="A_path">../test/scrapeTargetFile.txt</fld>
      <fld n="B_type">text/plain</fld>
      <fld n="C">
This is the only file in the distribution
which mentions garbanzo beans.
</fld>
   </entry>
</entries>
<!--  1 entries in result set -->
