Breaking lines on your own with OmniMark

By Jacques Légaré, Senior Software Developer and Mario Blažević, Senior Software Developer
1. Motivation

OmniMark has had line-breaking functionality built-in ever since the XTRAN days. This functionality can be used to provide rudimentary text formatting capabilities. The language-level support for line-breaking is described quite thoroughly in the language documentation.

OmniMark’s language-level line-breaking support is very simple to use, and aptly supports the use-case where all the output of a program needs to be similarly formatted. Where the performance is less stellar, however, is when line-breaking needs to be activated and deactivated on a fine-grained level. The reason for this is simple: when line-breaking is disabled (say, using the h modifier), OmniMark cannot predict when it might be reactivated. As a result, it still needs to compute possible line-breaking points, just in case. As efficient as OmniMark might be, this can cause a significant reduction in performance, sometimes by as much as 15%.

As of version 8.1.0, the OmniMark compiler can detect some cases where line-breaking functionality is not being used, and thereby optimize the resulting compiled program to bypass line-breaking computations. However, the compiler cannot make this determination in general: it is an undecidable problem. For instance, consider the following somewhat contrived example:

replacement-break " " "%n"

process
   do sgml-parse document scan #main-input
      output "%c"
   done

element #implied
   output "%c"

element "b"
   local stream s
   set s with (break-width 72) to "%c"
   output s

Note that line-breaking is only activated in the element rule for b, and so line-breaking will only be activated if the input file contains an element b. The OmniMark compiler cannot be expected to predict what the input files might contain when the program is executed!

Another issue with OmniMark’s built-in line-breaking is that it does not play well with referents. Specifically, consider the following program:

replacement-break " " "%n"

process
   local stream s
   open s as buffer with break-width 32 to 32
   using output as s
   do xml-parse scan #main-input
      output "%c"
   done
   close s

element #implied
   output "%c" || "." ||* 64

This program puts a hard limit of 32 characters on the maximum length of lines output to s. When this program is executed, a run-time error is triggered in the body of the element rule, where we attempt to output 64 periods. On the other hand, consider the following similar program:

replacement-break " " "%n"

process
   local stream s
   open s as buffer with (referents-allowed & break-width 32 to 32)
   using output as s
   do xml-parse scan #main-input
      output "%c"
   done
   close s
   set referent "a" to "." ||* 64
   output s

element #implied
   output "%c" || referent "a"

This program accomplishes virtually the same task, but instead uses a referent to output the string of periods. In this case, no run-time error is triggered: the line-breaking constraints have been silently violated.

Because of these issues, it is better to use OmniMark’s built-in line-breaking only where necessary, and to implement line-breaking with other language constructs elsewhere.

The remainder of this article discusses how to simulate line-breaking on PCDATA using string sink functions.

2. string sink functions

A string sink function is a function that can be used as the destination for strings. In a very real sense, a string sink function is the complement of a string source function, which is used as the source of strings. While a string source function outputs its strings to #current-output, a string sink function reads its strings from #current-input.

A string sink function is defined much like any other function in OmniMark, the only difference being that the return type is string sink: for example,

define string sink function dev-null as void #current-input

This is a poor man’s #suppress, soaking up anything written to it.
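
For instance, a minimal usage sketch (the parsing loop is only illustrative): everything the parser produces is quietly discarded:

process
   using output as dev-null
   do sgml-parse document scan #main-input
      output "%c"
   done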

A string sink function can have any of the properties normally used to define functions in OmniMark: e.g., it can be overloaded, dynamic, etc. The argument list of a string sink function is unrestricted. However, in the body of a string sink function, #current-output is unattached. The form of OmniMark’s pattern matching and markup parsing capabilities makes string sink functions particularly convenient for writing filters, taking their #current-input, processing it in some fashion, and writing the result out to some destination. However, since #current-output is unattached inside the function, we need to pass the destination as an argument. For this, we use a value string sink argument. For example, a string sink function that indents its input by a given amount might be written

define string sink function
   indent (value integer     i,
           value string sink s)
as
   using output as s
   do
      output " " ||* i
      repeat scan #current-input
      match "%n"
         output "%n" || " " ||* i
      match any-text* => t
         output t
      again
   done

The function indent could then be used like any other string sink:

; …
using output as indent (5, #current-output)
do sgml-parse document scan #current-input
   output "%c"
done

(The ability to pass #current-output as a value string sink argument is new in OmniMark 8.1.0.)

You can find out more about string sink functions in the language documentation.

3.  Line-breaking in OmniMark

We can use a pair of string sink functions to simulate to some extent OmniMark’s built-in line-breaking functionality. The benefit of this approach is that it impacts the program’s performance only where it is used.

3.1. Simulating insertion-break

To simulate the effect of insertion-break on PCDATA we need to scan the input and grab as many characters as we can up to a specified width. If we encounter a newline in the process, we stop scanning. Otherwise, we output the characters we found, and append a line-breaking sequence provided by the user.

define string sink function
   insertion-break       value string      insertion
   width                 value integer     target-width
   into                  value string sink destination
as

We can start by sanitizing our arguments:

assert insertion matches any ** "%n"
   message "The insertion string %"" || insertion
        || "%" does not contain a newline character %"%%n%"."

This assertion is not strictly necessary. However, OmniMark insists that the line-breaking sequence contain a line-end character, and so we do the same.

We can grab a sufficient number of characters from #current-input by using OmniMark’s counted occurrence pattern operator:

using output as destination
   repeat scan #current-input
   match any-text{1 to target-width} => l (lookahead "%n" => n)?
      output l

The use of lookahead at the end of the pattern allows us to verify if a %n is upcoming: we should only output the line-breaking sequence if the characters we grabbed are not followed by a %n.

      output insertion
         unless n is specified
   match "%n"
      output "%n"
   again

We can then use this to break the text output from the markup parser: for example,

process
   using output as insertion-break "%n" width 20 into #current-output
   do sgml-parse document scan #main-input
      output "%c"
   done

3.2. Simulating replacement-break

Simulating insertion-break on PCDATA is straightforward, because it can insert a line-breaking sequence whenever it sees fit. On the other hand, replacement-break is slightly more complex, since it must scan its input for a breakable point. For clarity, the characters between two breakable points will be referred to as words; if the breakable points are defined by the space character, they are effectively words.

define string sink function
   replacement-break     value string      replacement
   width                 value integer     target-width
   to                    value integer     max-width    optional
   at                    value string      original     optional initial { " " }
   into                  value string sink destination
as

The argument original is used to specify the character that delimits words; the argument is optional, as a space character seems like a reasonable default. target-width specifies the desired width of the line. max-width, if specified, gives the absolute maximum acceptable line width; if a line cannot be broken within this margin, an error is thrown. Finally, the argument replacement gives the line-breaking sequence.

As before, we start by ensuring our arguments have reasonable values:

assert length of original = 1
   message "Expecting a single character string,"
        || " but received %"" || original || "%"."

assert replacement matches any ** "%n"
   message "The replacement string %"" || replacement
        || "%" does not contain a newline character %"%%n%"."

The second assertion is repeated from above, for the same reasons as earlier: OmniMark insists that the replacement string contain a newline, and so we will do the same. The first assertion insists that breakable points be defined by a single character; again, this is a carry-over from OmniMark’s implementation.

For replacement-break, the pattern is very different from that of insertion-break: in the latter case, we could consume everything with a single pattern, using a counted occurrence. That does not suffice for replacement-break: instead, we have to consume words one at a time until we reach target-width.

using output as destination
do
   local stream line initial { "" }

The stream line will be used to accumulate text from one iteration to another.

repeat scan #current-input
   match ((original => replaceable)? any-text
             ** lookahead (original | "%n" | value-end)) => t

The pattern in the match clause picks up individual words. If the line length is still below target-width, we can simply append the word to the current line and continue with the next iteration:

do when length of line + length of t < target-width
 set line with append to t

If this is not the case, we can output the text we have accumulated thus far, so long as it does not surpass max-width:

else when max-width isnt specified
      | length of line < max-width
   output line
   output replacement
      when replaceable is specified
   set line to t drop original?

If all else fails, we could not find an acceptable breakable point in the line: OmniMark throws an error in this case, so we will do the same.

else
   not-reached message "Exceeded maximum line width"
                    || " of %d(max-width) characters.%n"
                    || "The line is %"" || line || "%".%n"
done

Our string sink function needs a few more lines to be complete. For one, our previous pattern does not consume any %n that it might encounter. In this case, we should flush the accumulated text, and append a %n:

match "%n"
   output line || "%n"
   set line to ""
again

Lastly, when the repeat scan loop finishes, there may be some text left over in line, which needs to be emitted:

output line
 done

Just as was the case previously in Section 3.1, “Simulating insertion-break”, we can use our function to break text output from the markup parser: for example,

process
   using output as replacement-break "%n" width 10 to 15 into #main-output
   do sgml-parse document scan #main-input
      output "%c"
   done
4.  Going further

We demonstrated in Section 1, “Motivation” that referents and line-breaking did not play well together: in fact, a referent could be used to silently violate the constraints stated by a break-width declaration. In the case of our string sink simulations, referents are a non-issue: a referent cannot be written to an internal string sink function, which effectively closes the loophole.

OmniMark’s built-in line-breaking functionality can be manipulated using the special sequences %[ and %]: by embedding one of these in a string that is output to a stream, we can activate or deactivate (respectively) line-breaking. The easiest way of achieving this effect with our string sink functions would be to add a read-only switch argument called, say, enabled:

define string sink function
   insertion-break       value string      insertion
   width                 value integer     target-width
   enabled               read-only switch  enabled      optional
   into                  value string sink destination
as

and similarly for replacement-break. We could then use the value of this shelf item to dictate whether the functions should actively break their input lines, or pass them through unmodified.
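
A sketch of that dispatch is shown below; it assumes the switch value can be tested directly as a condition, and it defaults to breaking when the argument is omitted. When breaking is off, the input is simply copied through to the destination:

using output as destination
do when enabled isnt specified | enabled
   ; ... the scanning loop shown in Section 3.1 goes here ...
else
   ; pass the input through unmodified
   repeat scan #current-input
   match (any-text+ | "%n") => t
      output t
   again
done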

Breaking lines using string sink functions in this fashion is really only the beginning. For instance, we could envision a few simple modifications to replacement-break that would allow it to fill paragraphs instead of breaking lines: it would attempt to fill out a block of text so that all the lines are of similar lengths.

The code for this article is available for download.


Pattern matching in OmniMark

Writing better patterns addresses two goals: making the patterns do more for you, and getting them to run fast. The former is always more important than the latter — there’s no point having a program running fast if it isn’t doing what you want it to — but they are not incompatible, and very often more effective patterns will run faster than less effective ones, because they are more to the point.

There are three main principles discussed in this paper:

  • Fail Fast — write patterns that don’t waste time,
  • Succeed Slow — write patterns that do their job efficiently, and
  • Divide and Conquer — build patterns to cover all the cases.

Fail Fast, Succeed Slow

All OmniMark programmers, at one time or another, ask themselves if their “find” rules could or should be running faster. Most of the time it doesn’t matter. If you’re not waiting around for your OmniMark program to run, there isn’t a problem. But if your thumbs are getting tired of twiddling, it might be time to take a look at your find rules.

Perhaps the most important thing that should be kept in mind about using find rules is that they spend most of their time failing to produce results. At most, only one find rule will end up capturing a piece of text, so all the rules ahead of it are going to fail. This leads to the first principle of writing find rules: “fail fast” — or, spend as little time as possible on find rules that are failing.

Most find rules already fail quite fast; the first character in the find rule isn’t the one being looked at, most of the time, and OmniMark takes advantage of this in the way it chooses the find rules to look at. One thing that you should avoid, if you can, is a find rule that starts with any, and which usually fails later in the pattern — i.e. an any match that isn’t a “catch-all” at the bottom of the program.

The second principle of writing find rules is “succeed slow” — once you are in a find rule, or in a repeat scan match, pick up as much with it as you can. It is often the case in word processor formats, for example, that commands come in bunches. So pick up bunches, not just single commands. This cuts down the number of find rules that are performed.

With these two principles in mind, let’s visit an old friend:

find any

This little rule ends up sitting at the bottom of many an OmniMark program. It can sit by itself as above, in which case it means “please throw away anything you haven’t yet recognized.” Or it can do something simple, as in the following, where it copies the otherwise unrecognized character to the current output:

find any => one-character
   output one-character

In both cases, this little rule usually provides an excellent opportunity to speed up the OmniMark program. First of all, on the “fail fast, succeed slow” principles, it makes sense to pick up as long a run of characters as possible that will not be recognized by any other rule.

What these characters are depends very much on other find rules at work. For example, if other rules recognized text starting with any of the seven-bit ASCII graphic characters, then a rule such as:

find [\ " " to "~"]+

will “fail fast,” if the character being looked at is a seven-bit ASCII graphic, as well as “succeed slow,” because it will spend the time it takes to find all the following characters. And the nice thing about such a rule is that it can be put anywhere in the program — it’s not getting in the way of other find rules.

Just be careful to pick up any leftover characters that are sometimes, but not always picked up by other find rules. Leaving the good old “find any” rule in the program, after the “gobbler,” does this, or it can be combined as follows:

find [\ " " to "~"]+ | any

Note that the any doesn’t have a “+” sign. If it did, it would consume the rest of your input, without giving the other find rules a chance to examine it.

Finally, remember that OmniMark copies characters that are unrecognized by any find rule to the current output, and does it in a very efficient way. So if you simply want all otherwise unrecognized characters copied, don’t do the copying yourself — not doing anything is the best way to fail.

Alternatively, if you want to discard all characters not otherwise captured by find rules, you should make your default output be #suppress, and explicitly output to #main-output or wherever the output is going. Doing so makes unmatched characters go to #suppress — efficiently discarding them.
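
A minimal sketch of that arrangement (the find rule is only illustrative): unmatched characters fall through to #suppress, while matched words are routed explicitly to #main-output:

process
   using output as #suppress
      submit #main-input

find letter+ => word
   put #main-output word || "%n"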

Divide and Conquer

A common need in pattern matching is to pick up everything up to, but not including, a known closing delimiter. For example, a match part that picks up the text in a quoted literal can do so as follows, with a quote character ending the picked-up text:

match "%"" [\ "%""]* => text "%""

For other than single-character closing delimiters, a simple “any except” character set doesn’t do the job. The following shows why:

match "--" [\ "-"]* => text "--"

The problem is that a single “-”, not part of a “--”, will terminate the matching done by the “any except” but won’t be matched by the final “--”. That’s where lookahead comes into play: it makes sure that the whole of the terminating delimiter is present.

Here’s a common solution to this problem, using the handy lookahead:

match "--" ((lookahead not "--") any)* "--"

What’s going on here is that the pattern repeatedly “looks ahead” to make sure that the terminating condition has not yet been met, and if it hasn’t, consumes another character. When the termination condition is found, the “*” loop exits and finds the following match, the final “--”.

The lookahead formulation does a good job of picking up text. But for those wanting to “fail fast, succeed slow”, it’s really unsatisfactory, because it examines every character in the text twice — once in the lookahead to ensure it’s not part of the closing delimiter, and a second time in the any gobbler. A better approach is one that “divides and conquers” — examining only once any character that isn’t a “-” using the “any except” form, and only doing a lookahead when a “-” is encountered:

match "--" ([\ "-"]+ | "-" lookahead not "-")* => text "--"

The divide and conquer approach to writing patterns comes in handy even when lookahead isn’t required. For example, the following match part picks up a C-like string, gobbling everything but quotes and “\” quickly, and handling “\” separately (so that a quote can be put in the text using “\””):

match "%"" ([\ "%"\"]+ | "\" any)* => text "%""

When the patterns start to become large and more complex, divide and conquer is the real winner. Here’s how to write a divide and conquer pattern in general:

  1. For each character that is only sometimes matched, based on the character or sequence of characters that may precede it (such as characters preceded by “\” in C-like strings), construct an alternative for all possibilities starting with the first character of the preceding characters. For example:
    "\" any ; for escaped characters in C-like strings
    "%" any ; for escaped characters in OmniMark strings
  2. For each character that is only sometimes matched, based on the character or sequence of characters that may follow it (such as the “-” possibly followed by another “-” in XML-like comments), construct an alternative that matches that character with a “lookahead” or “lookahead not” that excludes the non-matching cases. For example:
    "<" lookahead not [letter | "/!?"] ; for illegal uses of "<" in XML
  3. Pick out all the characters that must always be matched, but are not matched by one of the previously constructed alternatives, and match them using a character set matcher with a “+” on it. For example:
    [" " to "~" \ "%"\"]+ ; for what's allowed "as is" in C strings
    [\ "%"%%"]+ ; for what's allowed "as is" in OmniMark strings
  4. Take the partial patterns from the first three steps, connect them with “|” (or), putting the most likely alternatives first (for speed only). For example:
    ([" " to "~" \ "%"\"]+ | "\" any) ; for C string text
  5. Append an “*” (repeat zero or more times) or a “+” (repeat one or more times) to the connected partial patterns, depending on whether or not the text as a whole can consist of zero characters. The result will look something like:
    ([" " to "~" \ "%"\"]+ | "\" any)*
  6. Recursively apply divide and conquer to any of the constructed alternatives that itself matches delimited text.

As an example of divide and conquer, here’s a find rule that matches an XML start tag:

find "<" (letter [letter | digit | ".-_"]*) => element-name
   ([\ "%"'<>/"]+ |
      "%"" [\ "%""]* "%"" |
      "'" [\ "'"]* "'")* => attributes
   ("/" => empty-element)? ">"?

The core of the pattern (that matches the attributes) is a three-way alternative that picks up everything except a quote or apostrophe, and then tries each of the two types of quoted text. (Quoted text generally needs to be specially recognized because it can contain things that would be recognized as delimiters outside of quotes.)

** and ++

If you’ve used the “**” and “++” pattern operators introduced in OmniMark version 6, you’ll maybe be wondering where they fit into all of this. What “**” and “++” do is take the principles discussed above and apply them in a number of common and useful cases.

“**” and “++” take a (preceding) character set and (following) pattern, and match everything within that character set up to and including the pattern. (They differ only in that “++” fails if it does not encounter at least one character prior to the pattern.) For example, a convenient way of writing the XML comment matcher shown earlier in this paper is:

match "--" any** "--"

This is certainly shorter than the previous formulations. More importantly, it is easier to read and it runs efficiently.
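
As a further sketch, “++” works the same way but insists on at least one character of content; the following matcher therefore accepts a quoted literal only if it is non-empty:

match "%"" any ++ "%""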

OmniMark uses the divide and conquer principle on the pattern following “**” or “++” and builds a loop that only stops when it needs to look further, in the same way that the divide and conquer rewrite did. It saves the programmer the trouble, and is able to do things that can be hard for the programmer.

Although they’re very useful, and should get a lot of use, “**” and “++” don’t deal with all pattern matching problems. It’s good for the programmer to understand the principles described in this paper when they have to be applied explicitly.


Migrating legacy OmniMark programs

OmniMark has undergone a tremendous evolution from its earliest days as a very simple rule-based SGML scripting language, to its current state as a general-purpose programming language with modern software-engineering features.

During the course of this evolution, great pains have been taken to maintain backwards compatibility. Still, some changes to the core language have been necessary. The requirements of a general-purpose programming language, suitable for engineering complex high-performance systems, are very different from those of a simple narrowly-targeted scripting language.

Even so, many older programs do not need to be modified to work with current versions of OmniMark. The only programs that do need modification are those that are written with a version 2 (v2) coding style.

Most programs that do require modifications can be updated automatically in only a few seconds, using the migration script provided with this article. A few programs will require additional hand-editing, generally taking just a few minutes more.

This article will walk you through the migration process, and help you troubleshoot those few programs that require further modification.

This process can be run with any version of OmniMark from version 6 to the current version, and the results will be compatible with all of these releases.

You can download the conversion scripts mentioned here in zip format for Windows. These scripts are provided in source code form to make it easier for you to customize them for your particular code base.

This article assumes the use of the OmniMark Studio. In versions 8 and above, this will be the OmniMark Studio for Eclipse. In version 7, you might have either the Studio for Eclipse or the standalone Studio. In version 6 you will be using the standalone version of Studio. The program provided will work in all of these versions. The procedures for running the program vary slightly: the steps for using Studio for Eclipse are listed first, and the steps for standalone Studio follow. If you have a lot of files to convert, you may wish to compile the migration programs so that they can be run from a batch script. See the OmniMark Studio documentation for more information on compiling programs, and the OmniMark Engine documentation for information about running compiled programs.

If you are working with OmniMark version 6 in a Unix environment, you will have to compile the migration programs and use an OmniMark Engine to execute them.

Running the Migration Program

This step will take a few seconds for each file you need to migrate.

Procedure for OmniMark Studio for Eclipse

Run the to-six.xom program to upgrade the syntax.

  1. Unzip the program files into a suitable directory.
  2. In OmniMark Studio for Eclipse, create a project for running the migration.
    1. From the File menu, choose New -> Project.
    2. Expand the OmniMark option and choose OmniMark Project.
    3. Enter a name for your project.
    4. Uncheck “Use default” and navigate to the directory where you have placed your files.
  3. Open to-six.xom
  4. Create a launch configuration for the program to-six.xom by pulling down the “Run” menu and clicking on “Run as OmniMark Program”. Don’t worry about the error messages.
  5. In the Parameters tab, under Arguments, enter
    1. “include=” to specify the directory containing any include files used by the program you are upgrading, for example include=C:\MyPrograms\OmniMark\xin
    2. the path to the program you want to migrate
    3. the path to the output destination. Don’t overwrite your input file. Either give the new file a different name or place it in a different directory
      -of newfile.xom
    4. the path to a log file to capture any error messages
      -log logfile.xom
    5. If you are using OmniMark version 8 or newer, you may want to consider adding the command -warning-ignore deprecated here, to avoid seeing numerous warnings about obsolete syntax. The obsolete syntax has been retained in this program so that it continues to work in versions 6 and 7.
  6. Click Run at the bottom of the launch configuration screen.
  7. Examine the result.
    1. You will find that all of the lines have been moved over by a few spaces. Whenever the program changes a line, it inserts the original line in front, as a comment beginning with “;pre-6”. Spaces are added to the front of the rest of the lines to keep everything lined up. (These extra spaces and comment lines will be removed in a later step.)
    2. You should examine the changes to make sure that the new lines are correct. You may examine the changes by searching for the text “;pre-6”.
    3. Open the log file to examine the error messages. If you see any errors, you will have to correct the code by hand. There should be very few (if any) errors. You may see warnings. Warnings about deprecated syntax can be ignored. For other warnings, you should examine the referenced lines in the output file and make sure that the code is correct. If not, you should correct the output file by hand.

Procedure for standalone Studio (OmniMark 6)

1.     Run the to-six.xop project to upgrade the syntax.

a.   In OmniMark Studio, open the project to-six.xop.

b.   Select the input file, the output file, and the log file by editing the project options. If you are migrating a program file and it includes files from a different directory, then you should also specify the include path at this step.

i.   Specify the directories to be searched for include files if any. Click on the “Arguments” tab. Type “include=” followed by the name of a directory that contains include files. (Do not leave spaces around the “=” sign.) Do this once for each directory that contains include files. For example, to search the folder “C:\MyPrograms\OmniMark\xin” enter:

include=C:\MyPrograms\OmniMark\xin

ii.   Click on the “Arguments” tab, and browse to the include or program file that you want to migrate, and add it to the argument list.

iii.   Click on the “Output” tab, and type in the name of the output file. This will be a new version of your original include or program file, compatible with OmniMark 6 and above. Do not overwrite your input file. Either give your output file a new name or place it in a different folder.

iv.   Also in the “Output” tab, type in the name of a log file to capture any error messages.

c.   Save the project.

d.   Pull down the “Run” menu, and select “Execute Project”.

e.   When you see “Hit <ENTER> to continue.”, press the enter key to return to the Studio.

2.     Examine the result to make sure it is correct.

a.   Open the output file. You will find that all of the lines have been moved over by a few spaces. Whenever the program changes a line, it inserts the original line in front, as a comment beginning with “;pre-6”. Spaces are added to the front of the rest of the lines to keep everything lined up. (These extra spaces and comment lines will be removed in a later step.)

b.   You should examine the changes to make sure that the new lines are correct. You may examine the changes by searching for the text “;pre-6”.

c.   Open the log file to examine the error messages.

If you see any warnings, you should examine the referenced lines in the output file and make sure that the code is correct. If not, you should correct the output file by hand.

Once you have finished converting your program files and your include files, you should try running the programs with the newer version of OmniMark.

Procedure for OmniMark Studio for Eclipse

  1. In OmniMark Studio for Eclipse, bring your updated program into Eclipse as before.
  2. Select the program in the navigator window and choose Run… from the Run menu to open a launch configuration. Specify the options you want, and click “Run”.

Procedure for standalone Studio (OmniMark 6)

Create a project file for each command-line:

  1. In OmniMark Studio, open your program file.
  2. Pull down the File menu and select “Create Project File” (or click on the “Create new project” button on the toolbar).
  3. Pull down the Edit menu and select “Project Options”. Use this dialog box to fill in the information that you had specified on the command-line.
  4. Pull down the Run menu and select “Start Debugging” to run the project (or press the “Start debugging” button on the debug toolbar). If your program takes too long to run in debug mode, you can use Run menu and select “Execute Project”. But first, make sure you save your project, your program, and include files. (To maximize speed, “Execute Project” reads files from disk, not from the Studio buffers.)
  5. See the OmniMark Studio documentation for more information on OmniMark projects.

At this point you should have very few syntax errors, if any. Correct any remaining errors by hand, and then run your programs again. Make sure they produce the same results as your old programs did under your old version of OmniMark. If you have any problems, or just want to understand this process better, see Appendix C: What the Upgrade Process Does.

Part 4: Cleaning Up

At this point, you have successfully upgraded your programs to work with the newest versions of OmniMark. Now you probably want to get rid of the comment lines added during this process. This step will take a few seconds for each file you are upgrading.

Procedure for OmniMark Studio for Eclipse:

  1. Open clean.xom in OmniMark Studio for Eclipse.
  2. In the Launch Configuration, specify the output file from to-six.xom as the input, and save the output to a new file.
  3. Run the program.

Procedure for standalone Studio (OmniMark 6)

  1. Open the project clean-six.xop.
  2. Edit the project options:
    1. The input file should be the output file from to-six.xop.
    2. Save the output to a new file.
  3. Save the project file.
  4. Pull down the “Run” menu, and select “Execute Project”.
  5. When you see “Hit <ENTER> to continue.”, press the enter key to return to the Studio.

You can compare this output file to your original (pre-OmniMark 6) file with any line-by-line comparison utility. You will see that the only changes are the ones necessary to upgrade the syntax.

Appendix A: Quoted Variable Names

Prior to 5.3, OmniMark allowed you to quote your variable names if you preceded them with a herald (or a type-specific keyword like active or increment).

Without heralding, quoted variable names are indistinguishable from quoted text strings. For this reason, this feature was dropped in version 5.3.

OmniMark 7 re-introduces quoted names to support prefixing of symbolic operators. OmniMark from version 7 on uses a different syntax for quoted names so that they cannot be confused with text strings. In these versions, a quoted name must be wrapped in either #"…" or #'…'.

Programs which use quoted variable names will be automatically migrated to the OmniMark 7 syntax. These programs will be compatible with all newer releases, but they will not be compatible with OmniMark 6 without hand modification.

Appendix B: Hiding Keywords

Another potential problem is that a global variable declaration can “hide” a keyword. If your program did not have any variable declarations, then a variable reference always had to be preceded by a herald or a keyword that acted like a herald. So OmniMark was always able to tell the difference between your variables and language keywords.

From OmniMark V3 to OmniMark 5.2, you could use or omit the herald, as you wished, as long as you declared all of your variables. From OmniMark 5.3 on, variables were always referenced without a type herald.

When the variable name is the same as a keyword, OmniMark sometimes can’t tell which you mean. If this occurs, you may get an error message like:

     omnimark --
     OmniMark Error xxxx on line 1179 in file my-prog-1.xom:
     Syntax Error.
     ...  The keyword 'SDATA' wasn't recognized because there is a
     variable, function, or opaque type with the same name.

You can fix these types of errors quickly by doing a search and replace. Make sure you change only the variable references, though, and not the keywords too.

Appendix C: What the Upgrade Process Does

This section briefly describes some of the transformations that the upgrade program (to-six) does.

Global Variable Declarations and Translation Types

The first step of the migration process determines whether global variables must be declared, and if so, generates them.

It does this by reading the program file, and all of the files that it includes. If there are no global variable declarations already, but variable references are detected, then a list of global variables will be generated and inserted at the beginning of the program file.

Global variable declarations are not generated for include files, because they would duplicate the ones generated for the main programs.

At this time, the keyword “down-translate” will also be placed at the top of the program if it is needed.
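
For a program that had been using undeclared variables, the generated preamble might look something like this sketch (the names are purely illustrative):

down-translate

global counter line-count
global stream current-title
global switch in-table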

Pattern Assignments and Comparisons

One thing the program does is correct the use of the equals symbol (=).

Before OmniMark V3, the “=” symbol was only used for pattern assignment. V3 introduced a new symbol for pattern assignment (=>) and used “=” for comparisons. However, the use of “=” for pattern assignment was still supported for backwards compatibility.

Needless to say, you shouldn’t use the same symbol to mean different things. Since version 5.3, OmniMark issues warning messages wherever the “=” symbol is used for pattern assignment, with a view towards eventually removing this use from the language.

equalize.xin contains a function that looks for solitary “=” symbols in your program and converts them either to “is equal” (the old form of the equality comparison) or to “=>” (the new form of the pattern assignment operator). Either way, ambiguity is eliminated at this stage. The “is equal” construct will be changed back to “=” in a later phase.
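
As a hedged illustration (the rule itself is invented for the example), a legacy line such as

find digit+ = num
   output "matched" when num = "42"

would come out of this stage as

find digit+ => num
   output "matched" when num is equal "42"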

Heralds and Mods

Removing type heralds is the final and most extensive part of the process. This is done by a function in deherald.xin.

In addition to removing heralds, this step also replaces some deprecated constructs with their modern equivalents. This includes:

  • Changing “set counter” and “reset” to “set”
  • Changing “set buffer” and “set stream” to “set”
  • Removing “counter“, “stream“, and “switch” everywhere except in variable declarations
  • Converting the “and” form of variable declarations to a sequence of declarations. Omnimark allows syntax like:
    local switch x and counter y

    This is converted to:

    local switch x
    local counter y
  • Converting the verbose forms of comparisons (“is/isnt equal”, “is/isnt greater-than”, “is/isnt less-than”) to the symbolic forms
  • In some contexts, converting the shelf names “sgml” and “output” to “#markup-parser” and “#main-output”
  • Removing the heralds “pattern” and “another”

Quoted Variable Names Again

You may find that this step results in messages like:

WARNING: Quoted variable name (stream "my-var") -
  replacing with v7 syntax.

That means that "my-var" may be a quoted variable name here. The variable will be changed to use the OmniMark 7 syntax for quoting names (#"my-var").

You may wish to examine the modified lines of code, and make sure that it really is a variable reference. If you are migrating to OmniMark 6, you will have to remove the “#” character and the quotes.

When you are migrating to OmniMark 6, make sure that the unquoted name is legal. It must begin with a letter or a character whose numeric value is between 128 and 255, and the subsequent characters must be either one of those, a digit, or a period (.), hyphen (-), or underscore (_). Any other characters must either be replaced or removed.
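
As a rough sketch of that check (the function name is invented for the example, and the sketch omits the characters with numeric values 128 through 255, which are also accepted):

define switch function
   is-legal-name (value string candidate)
as
   return candidate matches (letter [letter | digit | ".-_"]* value-end)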

You will have to be careful with quoted variable names inside of macros. The sequence “%@(…)” in a quoted variable name means that a macro argument is being spliced into the name at that point.

Using macro arguments to build variable names was one way of simulating structures in early OmniMark programs. Now, a better way is simply to use keys to simulate field referencing.

In any event, you cannot use macros this way in OmniMark 6. The best way to correct this is to pass in the complete list of variable names that the macro operates on, instead of just passing in a piece of a variable name.

Duplicate Variable Names

Finally, with the removal of heralds, there is one other problem area that needs to be dealt with.

In most languages, when you define a variable in a local scope with the same name as a variable in the outer scope, the inner variable hides the outer one. In OmniMark, prior to 5.3, you could still reference the outer variable by heralding it, provided it had a different type than the inner one.

Usually, the only time a name is reused in a program is when one of the variables has a very short lifespan, only being used to capture a value and transfer it to the final destination variable, and the programmer uses the same name because it’s easy:

     find digit+ => id ":" any-text+ => value "%n"
       local counter id
       set id to pattern id
       ...

This can be easily corrected by changing the name of the pattern variable.
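
After the rename, the de-heralded rule might read something like this (the new name is purely illustrative):

     find digit+ => id-text ":" any-text+ => value "%n"
       local counter id
       set id to id-text
       ...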

The file finddup.xin contains a function that can detect some of these variable name reuses. It also attempts to warn about variables declared with the same name as another variable visible in the same scope.

These checks are heuristic, and can be fooled by macros, or by declaring the variables in one file, and using them in another. However, these checks should find many of the common cases.


How to prepare your content for conversion to DITA

Presented by Helen St. Denis, Conversion Services Manager | Stilo

So, the decision has been made to implement DITA and the content audit is done; now you need to get your content into DITA. So, what do you really need to do to your content before you start the conversion process? Maybe not as much as you may think!

In this webinar we discussed …

  • What is the most useful thing to do pre-conversion?
  • What kind of things influence what else you might want to do pre-conversion?
  • What are the common trouble areas in the different source formats?
  • What is best left for post-conversion?

View recording (registration required)

 

Meet the presenter

Helen St. Denis

Helen originally joined Stilo as a technical editor in the documentation team, and now works closely with Stilo Migrate customers, helping to analyse their legacy content and configure appropriate mapping rules. She also provides Migrate customer training and support.
Over a period of several years, Helen has helped Migrate customers to convert tens of thousands of pages of content to DITA and custom XML.

Helen holds a BA in English from St. Francis Xavier University in Antigonish, Nova Scotia and has pursued graduate studies at Queen’s University in Kingston, Ontario.


DITA and minimalism

Minimalism from a technical writing and training perspective was first investigated and proposed in the late 1970s by John Carroll and colleagues at the IBM Watson Research Center. It has since evolved and been extended by a variety of stakeholders.
The link between DITA and minimalism (the IBM connection notwithstanding) is not exactly carved in stone but the two complement each other like macaroni and cheese. The macaroni (DITA) provides the infrastructure and model that you need to support the cheese sauce (the minimalist content).

JoAnn Hackos’ Four Principles of Minimalism are a helpful guide. They are:

  • Principle One: Focus on an action-oriented approach
  • Principle Two: Ensure you understand the users’ world
  • Principle Three: Recognize the importance of troubleshooting information
  • Principle Four: Ensure that users can find the information they need

However, I would change things up a bit and stress some different points. Minimalism, when applied to technical writing, should result in content that is:

  • Based on knowledge of the users
  • Usable
  • Minimal
  • Appropriate
  • Findable

Based on Knowledge of the Users

Understanding your users is the underlying requisite for applying all other facets of minimalism. Without knowing how they are using the product, how they are accessing the content, and what their daily goals are with your product (and a thousand other factors), you aren’t going to be able to correctly apply the other facets of minimalism.

Usable

Write task-oriented topics that are focused on business goals rather than product functionality.

This means you need to understand your users well enough to understand those goals, including what backgrounds they have, what other tools they have at their disposal, their educational background, and a host of other information. User personas can be a powerful tool here.

The action-oriented approach is important but more specifically, you should be writing procedural information (tasks). One absolutely vital piece of any task is a detailed, goal-based context for the task. Done well, this context is an essential component of the learning process. The context is the “why” that helps users situate themselves—so it must be written for and about their goals when using your product. They use that context to take their understanding of your product to the next level. The task’s focus should always be on the user. This focus is often neglected, usually through ignorance or time constraints, but the steps of a particular task are almost immaterial when compared to the context.

Minimal

Powerful, usable content is clear and simple. In this sense, minimalism means removing words that add no value; words or phrases that are long, ambiguous, convoluted; and content that is simply not required.

Short simple sentences are easier to read and provide a basis for better translation. Topics that have only essential information are more easily parsed and will be read with more attention. If you limit yourself to essential words, then every word will be valuable to end users.

This facet of minimalism can often be done as part of an editing pass, either by you, a peer reviewer, or an editor—or better yet, all three. Remember that writing fewer, clearer words is more work, not less.

Appropriate

The careful selection and placement of every word you write should always be on your mind, from the planning stages through to the editing stages of your content. For content to be appropriate, it has to be the right information in the right place, support error recovery, and be formatted correctly.

  • Provide the right information in the right place to support users in their goals.

A pre-requisite to a task that is placed between steps four and five is a good example of content being in an inappropriate location. Always move a pre-requisite up before the context (in its valid DITA location) for consistency, and because no one wants to get part-way through a task only to realize that they should have done something important before even beginning. Similarly, if you need to prevent a common error from occurring between one step and another, then put it right there instead of in a separate location or topic. Best practices, mentioned in the right place, can save your users the hassle of having to troubleshoot later on.

  • Write detailed error recovery using DITA 1.3’s new troubleshooting topic and elements.

Many users will turn to the documentation only when they have run into a problem and need to look for a solution, so you are most definitely trying to write exactly what they are looking for. Troubleshooting, if it is concise and can’t be separated out from the context of its related task, should be kept inline using any of the three new troubleshooting-related elements (task troubleshooting, step troubleshooting, or the troubleshooting note type). When there’s too much information or the troubleshooting topic really should stand on its own (and be searched for as a discrete object) it should be written using the new troubleshooting topic. For any troubleshooting information (whether it’s a topic or inline), your goal is to provide error recovery by identifying a condition, a cause, and a remedy. The new troubleshooting topic also allows you to provide multiple cause and remedy pairs for a single condition to cover more complex cases.

  • Provide supporting information in the right format.

Although much of your minimalistic content will be in task topics (and thus formatted as ordered lists), your supporting information should use unordered lists, graphics, and tables depending on the type of information being conveyed. When users need to scan and look up specific details quickly, you’ll use a table. Graphics will be most helpful when trying to convey an underlying or specific structure or flow. Unordered lists are important for listing parallel items that need to be noted.

Findable

All of this targeted, minimalist, formatted, supportive content is going to be completely wasted if it’s not also all easily findable.

There are two levels to findability. At the detailed level, it means using short, targeted topics that are the appropriate type (task, concept, reference, troubleshooting, and glossary).

At the macroscopic level, findability means robust search mechanisms including faceted search, role-based content, and filtering. Whether your content strategy calls for building something custom, leveraging something in house, or buying a tool (that can take DITA source content and publish responsive Web-based content with these findability features built right in) is up to your requirements, budget, and timelines.

Summary

One of the most valuable changes you can make to your DITA content is applying minimalism to that content, both before and after your DITA conversion. Take the opportunity when learning DITA to learn how to apply minimalism to your content. It will not only improve your content and make your users far happier (leading to customer satisfaction improvements that will positively affect the entire company), it will also allow you to further define and refine your entire content strategy.

Further reading

JoAnn Hackos: Minimalism Updated 2012

OASIS DITA Adoption Feature Article on Troubleshooting in DITA 1.3: DITA 1.3 Feature Article: Using DITA 1.3 Troubleshooting


Overcoming writer resistance to DITA adoption

I could easily have titled this article “Why your writers are suddenly freaking out,” because that is exactly what happens on teams that adopt DITA—there is always some degree of writer resistance. Even the best writers experience a moment of doubt when contemplating such a complete overhaul of their writing tools and standards. The trick, as someone implementing DITA, is to manage those fears wisely, even preempting them when possible, so that writer resistance becomes writer acceptance.

Successful DITA adoption relies on three pillars of strength: a solid content strategy, a detailed and realistic project plan, and a useful change management plan. That last pillar, the change management plan, is most often forgotten or left out and that’s where you’ll plan your strategy for overcoming inevitable writer resistance.

Why writers resist

Although many writers understand and embrace the changes that DITA introduces into their day-to-day writing, when they’re actually faced with adopting DITA, it can be a scary prospect. There are some key reasons for resistance.

Lack of understanding of DITA best practices

Even if authors understand the basics of DITA, it’s not always clear how to do things the best way. This murky beginning can daunt even the most seasoned writers. In fact, the more senior the writer, the more likely they are to be bothered by this gap in their skillset.

Job loss

Many authors fear that their team will be downsized because DITA promises to be a more efficient way of creating and managing content. Although companies rarely downsize after adopting DITA, it is still a fear that needs to be managed because it can drive a very stubborn resistance; people feel like they are fighting for their livelihoods.

New tools

You are asking writers who have been using FrameMaker, Word, InDesign, and, in some cases, text editors to adopt a completely new writing tool, one that doesn’t necessarily reflect the look and feel of the end product they are creating—which is very confusing at first.  Piled on top of that, you are also asking them to learn a few hundred elements, attributes, and probably a CCMS as well. There’s no doubt about it, there is a lot to learn all at once.

New way of writing

The move from chapter-based or narrative writing to topic-based writing is a huge change for most writers. This is actually the change that writers struggle with most in the first 6 months.

Many authors have been doing things the same way for so many years that they’re uncertain if they’ll be able to adapt to a new way of writing, one that focusses more on modular writing and user goals. Often, this change in the way they write exposes a gap in their understanding of the end users and how they’re using the product. Although this gap is usually not their fault, it certainly hinders creating quality DITA content and reflects poorly on the writer.

Forms of resistance

There are generally two kinds of resistance you’ll run into: active and passive. You might see some authors using both forms while others will choose one over the other. The key to a smooth writer transition is to anticipate both forms and deal with them separately.

There’s no shame in being either one of these types of resisters. DITA is a big change and it’s up to management to ensure that the writers have all the knowledge (whether it’s tools, training, or communication) to ensure that change is as easy as possible.

Active resistance

Active resistance includes being vocal about doubting the adoption process, personnel, tools, or management handling of the adoption. Often, these writers will derail meetings or training with side issues—and you’ll suddenly find yourself mired in meetings that never accomplish what you need them to. Everyone’s work starts suffering and tempers run hot.

They might also take their disgruntlement to fellow team members and try to gang up on a manager, raising doubts and concerns that, although real to them, are usually unfounded. If you get a group of writers entering your office one day to “talk to you about this DITA thing,” then you know you’ve neglected to address the concerns of your active resisters.

Passive resistance

Passive resistance includes not using the tools, performing tag abuse (using the wrong tags to get a specific result), not chunking content into topics, and generally not implementing the content strategy that is laid out for them. In some cases, you’ll find that an author has simply never made the transition from the old tools—they are sneakily still using them.

Passive resisters aren’t trying to rock the boat, but they definitely need help to make the change successfully.

Solutions

A change management strategy is your plan for deciding how you’ll introduce the changes that DITA requires. Part of your change management plan should be specifically targeted towards identifying likely resistance from your team and how you’ll address each issue.

  • Active resisters: These are actually the easiest resisters to deal with because they are so easily identified. The best way to work with active resisters is to get them involved with the planning and content strategy. Active resistance also usually means that your writer lacks training or doesn’t understand that training will eventually cover all their concerns. Getting your active resister to create or review the training plan (what kind of information do we need at which point in the adoption) often circumvents their fears.
  • Passive resisters: Although passive resistance is harder to identify, you can plan for ways to either prevent it or catch it immediately. Feedback combined with adequate and timely training will solve this problem.

Don’t forget to proactively address your writers’ reasons for resistance in your plan. When you manage change, use the following tools to help address their underlying fears and keep your authors moving forward.

Communications plan

This plan should include how and when you’ll communicate the goals and progress of your DITA adoption. Use this plan to specifically address the issue of downsizing or job loss. Most companies don’t even think about downsizing because they are about to invest time and money into upgrading the skills of those same resources. However, authors still need to hear from management that their jobs are as secure as they have ever been.

The message you want to convey (at key points and on a regular basis) is that you are training them to be more valuable tech writers with two major goals: easier content lifecycles (happier writers and reviewers with more time to write usable, focused content) and more usable content (happier end users).

Don’t forget that DITA adoption also means new roles will be available, like content strategist, tools maintenance, and publishing expert, to name just a few. DITA adoption opens doors to all sorts of opportunities. And even if the company is downsizing, your writers will leave having been trained in DITA, making their prospects for a new job that much better.

This is also the time to let them know that their yearly performance reviews will include DITA adoption goals. For example, you can include each document converted to DITA or each new document written in DITA as a point on their performance review. You can even gamify this process and have writers compete for the most DITA-related points, which will convert to real dollars at the end of the year. If you give writers a target to shoot for with their DITA adoption, you’ll be pleased with the results.

Training plan

The right training for the right people at the right time in the adoption can prevent problems for both types of resisters. Authors need several kinds of training to get up to speed quickly but smoothly. Very early on, start with DITA fundamentals, followed closely by topic-based writing, minimalism, tools training, process training, and DITA best practices.

Your content strategist and your publishing specialist will need training in DITA content strategy, metadata, and XSLT/DITA Open Toolkit (or equivalent, depending on your choices) very early on.

It’s a good idea to plan for ongoing training for the first four years in addition to formal training during the initial adoption phase. This ongoing training can be more informal and internal, with authors and content strategists sharing tips, tricks, and best practices among team members. You should also plan for key people to attend DITA-related conferences so they can get exposure to a wider world of what is possible with DITA.

Some companies have hired a consultant to bring an internal resource up to speed for the role of content strategist as a sort of ongoing, as-needed training from an expert.

Whatever your decisions for promoting, shifting, or even hiring your resources, an effective training plan will preempt a lot of resistance.

Feedback conduits

Authors need a way to have someone check their tagging and writing for the first 4-6 months to give them essential feedback on how to do things better or which pitfalls to avoid. This is where you’ll catch most of your passive resisters, but to do so, it’s essential that someone track who is and who is not getting their content checked.

Also provide a way for authors to ask questions of someone who knows the right answers. If you can do this in a way that doesn’t risk them looking or feeling dumb in front of their manager or peers, it will be more of a success. A wiki page works well, as does an anonymous Q&A drop box, with the answers coming from someone with DITA experience (usually a content strategist, either full time or consulting).

Summary

Overall, writer resistance is a normal part of a DITA adoption project and can (and should) be planned for just like every other aspect of the project. Identify the root causes for your team and plan for how and when you’ll address those concerns for both active and passive resisters.


Director / PDMR Shareholding

13 June 2016

Stilo International plc (“Stilo” or the “Company”) has today been notified that Les Burnham, the Company’s CEO and Executive Director, on 10th June 2016 sold 2,600,000 ordinary shares of 1p each in Stilo at a price of 5.25 pence per share.

Following this sale, Les Burnham has a beneficial interest in 5,000,000 ordinary shares, representing approximately 4.45% of the issued share capital of the Company.

ENQUIRIES

Stilo International plc
Les Burnham, Chief Executive
T +44 1793 441 444

SPARK Advisory Partners Limited (Nominated Adviser)
Neil Baldwin T +44 203 368 3554
Mark Brady  T +44 203 368 3551

SI Capital (Broker)
Andy Thacker
Nick Emerson
T +44 1483 413500


DITA reuse and conversion together

When you are considering converting content from Word or unstructured FrameMaker (or other unstructured formats) into DITA, one of the things you want to consider before you start converting is your reuse strategy.
Why reuse and conversion together?

Your reuse strategy can be partially implemented as part of the conversion process, which means that you can automate some of the work. The highly automated nature of conversion is the perfect opportunity to sneak in some reuse automation at the same time. If you know what your reuse goals are, you can save a lot of manual effort by using the conversion process to automatically and programmatically add some reuse mechanisms to your content.

At its core, a DITA conversion is the process of mapping your content and formatting to DITA elements and attributes. If you opt to ignore certain formats or objects, your conversion process essentially “flattens” your content and you lose a great opportunity to automate.

For example, if you neglect to map variables to a specific element in DITA, then those variables are converted as plain text and you’ve lost your opportunity to apply a DITA element to them programmatically (and quickly).

In short, with a little planning and setup, combining reuse and conversion can save you lots of time.

Reuse strategy

A reuse strategy defines what kind of content you’ll reuse and how you’ll reuse it, including the DITA mechanisms you’ll need for each kind of reuse.

Generally speaking, you should be looking at reusing two types of content:

  1. Content that stays the same: Wherever it is, it needs to be standardized.
  2. Content that changes: Content has variations because of the
    • context of the person reading it (or someone who is re-branding and publishing it)
    • product version, suite or product combinations, or
    • changing nature of the product over time (like product or component names that may evolve).

Your reuse strategy is your target. Without it, you can’t definitively know what you need to do during conversion or how best to do it. Planning and testing the reuse strategy before conversion is the key to being able to automate its application as part of conversion.

Reuse strategy for conversion

Although you should always develop a content reuse strategy when moving to DITA, when combining reuse and conversion, you need to add an extra layer to your reuse strategy that includes two major areas:

  1. Identify your existing content reuse (text insets, conditions, and variables) and decide how you’ll leverage it during conversion.
    • What will each one map to?
    • What is your desired end result?
  2. Plan for new content reuse that can be applied during conversion.
    • What is your desired end result? What reuse mechanisms do you want to use?
    • What and how can you automate?
    • What do you need to change to enable automation during conversion? (You may need to apply formatting, for example, to automate something you really need.)

Content reuse in DITA

Everyone’s requirements are unique, but in general you should consider some common reuse strategies in DITA to get you started.

DITA allows you to use a variety of methods to reuse content and you’ll want to consider them all. When you’re getting started with reuse, you usually consider three main mechanisms:

  • Conref: Conref’ing is a mechanism that is equivalent to a text inset in FrameMaker, where a chunk of content (less than a topic) is pulled in from another location. DITA does this using a conref. (A push mechanism is also available, but less frequently used.)
  • Profiling: Profiling is equivalent to FrameMaker conditions, where content can be shown or hidden based on attribute values on elements (see the sketch after this list).
  • Topic reuse: Topic-level reuse is simply pulling your topic into a map wherever it’s needed. Once in DITA, content is modular enough to be in short, reusable chunks. You don’t necessarily need to plan for this reuse during conversion, but it may allow you to NOT convert some content.
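
To make the profiling mechanism concrete, here is a minimal sketch; the audience value "administrator" is an assumed value that you would define in your own profiling scheme:

  <p audience="administrator">You must have root access to complete this procedure.</p>

At publish time, a DITAVAL file includes or excludes elements based on these attribute values, so the same topic can produce different deliverables.
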
Warehouse topics for conrefs

These are topics that hold fragments of content and are never meant to be published as topics. You might create a warehouse topic for each of the following:

  • GUI objects, fields, buttons, icons
  • Frequently used steps, with step results and info
  • All your notes and warnings
  • Pre-requisites that are commonly mentioned, like having administrative privileges

You then use these warehouse topics as the source for conref mechanisms. Just like with text insets, warehouse topics let you write content once and use it wherever you need it. That means, when it changes, you update it in one spot. You translate it once too.
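
As a hedged sketch of how this works, assume a warehouse task file named warehouse_steps.dita (the file name, ids, and wording are assumptions):

  <task id="warehouse_steps">
    <title>Warehouse: common steps</title>
    <taskbody>
      <steps>
        <step id="login_admin">
          <cmd>Log in with administrative credentials.</cmd>
        </step>
      </steps>
    </taskbody>
  </task>

A topic that needs that step then pulls it in by reference instead of repeating it:

  <step conref="warehouse_steps.dita#warehouse_steps/login_admin">
    <cmd/>
  </step>

The empty <cmd/> is just a placeholder required by the DTD; the actual content comes from the warehouse topic when the map is published.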

Once you know the steps that will go into a warehouse topic, for example, you can apply a distinct FrameMaker or Word format/style to the steps and then script the conversion so that it will

  1. Pull the step into a warehouse topic (if not already there).
  2. Replace the step with a conref from the warehouse topic.

The result is that a good chunk of your reuse is automated during conversion.

DITA keys

Keys are a powerful mechanism in DITA. Although not strictly reuse, they can make reuse faster, simpler, and more dynamic. If you’re planning consistent, ongoing, and growing reuse, then consider keys as well.

Keys are used for indirect referencing of any kind. You can use keys for any piece that may need to be centrally updated or swapped out. For example, keys are often used for

  • Variables: To define terms or product names that can change based on context or over time.
  • URLs: To centrally manage and update them or customize them based on deliverables.
  • Conrefs become conkeyrefs: To pull in a different set of conrefs and quickly customize a document.
  • Related links: To customize them based on deliverable when topics are reused in multiple maps.
  • Including/excluding topics or maps: To create deliverables that have specific content without having to create many different maps.

Note: DITA keys will change for DITA 1.3; they will include a scoping mechanism that will simplify and extend linking. This article is based on DITA 1.2.

DITA keys ensure maximum reuse with minimum long-term efforts for updates.
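
As a minimal sketch, a key is defined once in a map (the key name "product" and the value "Apple" here are assumptions):

  <map>
    <title>Apple User Guide</title>
    <keydef keys="product">
      <topicmeta>
        <keywords>
          <keyword>Apple</keyword>
        </keywords>
      </topicmeta>
    </keydef>
    <topicref href="install.dita"/>
  </map>

Topics then reference the key indirectly, so the actual text is resolved per map at publish time:

  <p>Install <keyword keyref="product"/> on each server.</p>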

An example of applying reuse during conversion: variables

If I know that I’ll be using keys to manage content that changes frequently (every release or when there is re-branding), I can ensure that, for example, the variables I’m using in FrameMaker for my product and version are part of my conversion. Instead of converting them as text, I can convert variables as <keyword> elements.

Converting variables as plain text is what we call flattening variables—once flattened, there is nothing that distinguishes them from the rest of the text. If you’re expecting conversion to leverage the DITA key mechanism but you are flattening variables, you will be left with adding keys manually after conversion.

Instead, as part of your conversion, you can leverage your variables by wrapping an appropriate element around them and even setting a keyref value on the element.

For example, my conversion plan for variables might specifically map variables to elements and keyrefs.

Variables become keys

  • Variables named "Apple" and "Banana" are converted as the element <keyword> with the keyref value "product", for an end result of <keyword keyref="product"/>.
  • A variable named "ComponentB" is converted as the element <keyword> with the keyref value "componentB", for an end result of <keyword keyref="componentB"/>.

However, when you’re building this plan, it’s essential to know that keys are defined in a map, and each key can be defined only once per map. So if you need both the Apple and Banana products to have separate names in a deliverable, you need to create a unique keyref value for each one. When they share the same keyref value, they resolve to the same name in the output. In my example above, <keyword keyref="product"/> will resolve to the same product name throughout a particular DITA map, but can resolve differently in other DITA maps. I can no longer have both Apple and Banana in the same DITA map.
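
To illustrate that per-map behaviour, the same key can be defined differently in two maps (a hedged sketch; the map file names are assumptions). In apple.ditamap:

  <keydef keys="product">
    <topicmeta><keywords><keyword>Apple</keyword></keywords></topicmeta>
  </keydef>

And in banana.ditamap:

  <keydef keys="product">
    <topicmeta><keywords><keyword>Banana</keyword></keywords></topicmeta>
  </keydef>

Every <keyword keyref="product"/> in topics reused by both maps resolves to "Apple" in the first deliverable and "Banana" in the second, but within a single map it can only ever resolve to one value.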

The key here is to plan, test, and then test again.

Strategies for combining conversion and reuse

It’s sometimes quite difficult to determine what can and should be done as part of the conversion to build the end result you need. There are two possible solutions to this:

  1. Manually build your end result in DITA and test it out until you’re sure it’s how you want to work. You can do this on a small set of content if you have limited funds or time, but the larger and more realistic the data set, the more accurate it will be for your overall needs. Once you have something that actually works the way you want it with the reuse set up the way you need it, you have a very concrete goal to work towards and can figure out what can be built and automated as part of conversion.
  2. Convert a small set of content, automating the reuse parts that you know you’ll need for sure and use that as an iterative process to keep building upon it until you have your desired result.

Either way, slow and steady is the way to go. Diving into conversion without considering reuse can lead to some frustrated hours or days doing something that could have taken seconds.

Best practices to prepare content for reuse

At this point, some of you may be saying that this is just too difficult and too time-consuming to figure out. You want to convert now! Well, that’s OK too, but you should consider doing some basic pre-conversion work that will let you at least search and replace (or find) the items you want to reuse.

For example, here is what you can do while converting, or before converting, to save time afterwards:

  1. Rewrite and chunk content: A precursor to any good conversion is making sure your content adheres nicely to the topic-based writing paradigm and that you have clear distinctions between task, reference, and concept. All other work is based on this essential step. This is also where you would remove textual pointers to the location of other content, like the words “before, after, following, preceding, next, first”. In DITA, content can occur in any order, so you should remove any references to location.
  2. Include placeholders for future reuse: If you know you’ll be replacing the step “Select Log in from [graphic] and enter your administrative credentials.” with a conref, then go ahead and replace the content with “Conref login admin.” in your unstructured source. You’re simplifying the structure so your conversion will be faster and easier and afterwards, you can quickly insert a conref or conkeyref right where you need one.
  3. Standardize phrasing: One of the hardest things to do is to find content that is almost the same but not quite identical. Although laborious, this pre-conversion cleanup process can set you up for easy reuse down the road. A tool like Acrolinx can help.
  4. Use a FrameMaker condition to identify likely or potentially reusable content that you want to revisit after conversion. Then convert the condition to <draft-comment> elements (see the sketch after this list). This is a good way to leave yourself a note that is easy to find after conversion.
  5. Convert boilerplate content once, even if it has variations: If you have many versions of legal pages, copyright statements, standard notices, and any other content that is generally standardized, don’t bother converting those with each book. Convert them once, then modify the XML until it meets your requirements for all your books.
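
For point 4 above, a hedged sketch of what the converted condition might become (the comment text is an assumption):

  <draft-comment author="conversion">Candidate for conref: standard admin login step.</draft-comment>

Because <draft-comment> elements are excluded from normal published output by default, they make safe, searchable markers for post-conversion cleanup.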


What’s in a DITA file?

When you first start tackling a DITA conversion, it’s difficult to get a handle on just what comprises a single file. Is it one topic? Is it multiple topics? How long should a file be? How short can it be? We’ll tackle these questions in this article.
Perfect File

The perfect DITA file is one that contains one topic, where that topic is as long or as short as it needs to be.

The length of a topic depends entirely on the subject matter and issues of usability. Generally, the litmus test is asking yourself “Can a user navigate to this one topic and have all the information they need for it to be usable and stand on its own?” If it’s too short, you’re forcing them to navigate away to find more information. If it’s too long, it becomes too onerous to follow.

Your file should contain just one topic, be it task, concept, or reference, regardless of the length of your topic.

Shortest File

The shortest file allowed in DITA is a topic that contains only a title element and nothing more.

[Figure SDKSFeb1: a DITA topic file containing only a title element]
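
A minimal sketch of such a title-only file, assuming a concept topic (the id and title are assumptions):

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
  <concept id="administration">
    <title>Administration</title>
  </concept>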

This is absolutely allowed but it is something that you would only do for a specific reason, such as adding another level of headings.

Longest File

The longest file you should have is one that supports your requirements. For most technical publications content, you should only have one topic per file.

Note: An exception to this rule is if you’re authoring training content using the DITA Learning and Training Specialization; there are very good reasons for having many topics in a single file in that case, but those would all be learning and training topics, not the core concept, task, and reference that DITA is built upon. Note that this is true as of DITA 1.2 but may change in the not-so-distant future.

Chunking

DITA architecture allows you to nest many topics into one file. However, doing so introduces major limitations on reuse. If you nest topics into one file, you will be sacrificing the flexibility that DITA introduces. It’s like choosing to hop on one leg instead of running on both.

Breaking each topic out into its own file is what we call “chunking.” One purpose of chunking is to allow the authors to have incredible flexibility when it comes to reusing this content.

Consider this nested task within a task, where two tasks are in the same file.

[Figure SDKSFeb2: a task nested within another task, both in the same file]
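
A rough sketch of that nesting, assuming a document-type shell that allows task-in-task nesting (the ids, titles, and step text are assumptions):

  <task id="become_jedi">
    <title>Become a Jedi</title>
    <taskbody>
      <steps>
        <step><cmd>Find a master willing to train you.</cmd></step>
      </steps>
    </taskbody>
    <task id="study_jedi">
      <title>Study with a master</title>
      <taskbody>
        <steps>
          <step><cmd>Complete the trials your master sets for you.</cmd></step>
        </steps>
      </taskbody>
    </task>
  </task>

Both tasks now live, and travel, in one file; the separate-file alternative is what the map example later in this article assumes.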

If I want to include the information about studying with a master somewhere else, such as a guide for becoming a senator, I’ll be out of luck because I haven’t separated my two tasks into separate topics—when they’re in the same file, where one goes, the other must also follow.

Once a topic is in its own file, an author can pull that topic into any deliverable that needs it. It’s not uncommon for one file to be part of up to ten or more deliverables. If you have multiple topics in that one file, then they all must be reused along with the one you want, without the possibility of even changing the order of them.

There are many other good reasons to chunk. For example, if your content needs to be reorganized, you can quickly drag and drop topics that are each in their own files. All navigation and linking is automatically updated based on your new organization.

How It All Comes Together

Chunking content into individual topics is the first major hurdle that authors face when adopting DITA because it’s so far away from our training and understanding of writing in chapters, books, and documents. It’s not clear how all those tiny little topics come together. And be warned, you will have hundreds of topics that make up just one deliverable.

Enter the DITA map, the great glue that holds it all together.

You can think of a DITA map (which has a .ditamap file extension) as nothing but an organizational mechanism or even as a Table of Contents. A DITA map itself has very little content. It usually contains just a title. What it does have, though, are topicrefs, which are references to all those files you’ve authored.

Here’s a visual representation of a map, where I’ve told it to “pull in” my two topics (with a hierarchy, one nested below the other). When I create my map, I add the topics that are relevant to this deliverable. The map simply references them using a file path with the href attribute.

[Figure SDKSFeb3: a DITA map pulling in two topics, one nested below the other]

That same map in code view looks like this:

[Figure SDKSFeb4: the same map shown in code view]
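
A minimal sketch of what that code view contains (the map title is an assumption; the two file names come from the example below):

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">
  <map>
    <title>Becoming a Jedi</title>
    <topicref href="become_jedi.dita">
      <topicref href="study_jedi.dita"/>
    </topicref>
  </map>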

The only thing that’s actually typed into this map is the content in the <title> element. The other two objects, the topicrefs, point to the files that contain my two tasks using the href value, in this case simply the names of the files: become_jedi.dita and study_jedi.dita.

When you publish a DITA map with all your topics referenced, you get a single deliverable with all your content. For PDFs, you’ll have content that looks exactly like your PDFs created from FrameMaker, Word, InDesign or whatever else you have used. For HTML output, each file becomes its own page, with automated navigation to all the other pages. All outputs are entirely customizable.

[Figure SDKSFeb5: the published output generated from the DITA map]

Summary

Although the length of your topics will vary depending on the subject matter, your files should contain just one topic. Use your DITA map to bring your topics together, giving you the flexibility that DITA promises, including topic-level reuse and the ability to quickly reorganize your content.


DITA and graphics: what you need to know

A DITA project is an ideal time to audit, enhance, and start managing your media assets. Like any other piece of content, your media are a valuable resource that the company can leverage.

Although much of a transition to DITA concentrates on improving the quality of your content, there are also some distinct benefits to your media as well. By media, I mean:

  • Logos (for branding/marketing)
  • Screenshots
  • Illustrations
  • Diagrams
  • Image maps (a flow graphic that shows how a process or a set of tasks connect with clickable hot spots)
  • Inline graphics for buttons, tips, notes

When you’re moving to DITA, you should be thinking about two things when it comes to media:

  • Minimizing and single sourcing
  • Introducing and maintaining best practices
Graphics you no longer need

Probably the biggest mistake when moving to DITA is to lug your extra, non-essential media around with you, just in case.

Any graphic that is being used solely for the purposes of design can be managed centrally instead of being placed in everyone’s individual folders. These include (but are not limited to):

  • Icons for tips/notes
  • Logos for the title page, header, footer
  • Horizontal rules for separation of content areas

All these graphics get applied on publish so each individual author no longer has to worry about them. Your authors no longer even see these graphics—and also no longer need to manage them. Your publishing expert still needs to manage these graphics efficiently, but at least now there’s only one graphic to manage instead of dozens or hundreds.

For example, when the branding guidelines are updated, the publishing expert simply updates the logo used in the stylesheets–replacing one graphic in one place instead of a graphic in every single title page and footer throughout your library of content.

Prune and archive

The design-related graphics are easy to throw away, but we all have extra graphics lying around. It’s not unusual for a single graphic to have 5 or more other related graphics that are hanging around just in case. For example, you may have files that are older versions, variations, and different size and quality options.

We are always loath to “lose” a graphic, but DITA migration is the perfect time to archive the older versions and variations. Keep the quality options, though, because they’ll come in handy.

A graphic is only useful if it conveys something that words cannot. If you can explain what the graphic shows, then the graphic is usually redundant and not useful. If you can’t explain it, then the graphic is needed. Prune your content of the graphics that don’t add any value.

Formats

DITA lets you specify multiple types of formats for a single graphic so that you are always publishing the right graphic for the right format. You can easily publish different formats (such as color for ePub and grayscale for print) using DITA attributes.
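
One way to set this up is with conditional processing attributes; a hedged sketch (the file names and otherprops values are assumptions):

  <fig>
    <title>System architecture</title>
    <image href="architecture-color.svg" otherprops="epub"/>
    <image href="architecture-grayscale.png" otherprops="print"/>
  </fig>

A DITAVAL file for each deliverable then excludes the variant you don’t want, so each output gets the right version of the graphic.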

All graphics are either vector or raster. The format you use will depend on the type of graphic you need and the outputs you’re publishing to. For more information about raster versus vector, see this article.

Vector graphics (specifically SVG) are usually the right choice for most technical illustrations and diagrams as long as they don’t require complex coloring (like drop shadowing and shading). They are clear, clean graphics that look professional and don’t have that “fuzzy” look on publish.

A huge added bonus is that, because they are made up of layers, you can export the text (usually in the form of callouts or labels) and translate just that text if you’re providing localized content. This saves having to edit or even re-create a graphic when localizing. Another benefit is that, in HTML outputs, they allow users to zoom without pixelation.

It’s just icing on the cake that vector graphics are smaller, more compact files with lossless data compression.

SVG is an open standard, advocated by W3C, the Web consortium that is bringing you HTML5. It might not be the only graphic format in the future, but it will definitely be a forerunner. You can create SVG files from most graphics editors, including but not limited to: Adobe Illustrator (.ai and .eps files), Microsoft Visio, InkScape, and Google.

Size

The best size for your graphics depends on the output type. PDFs and HTML have different widths and resolutions. This really does get tricky, but if you’re using the DITA Open Toolkit to publish, it’s possible to set default maximum widths (maintaining the correct aspect ratio) so that you’ll at least never overwhelm your audience with a massive graphic. Use this default maximum in combination with authors setting a preferred width or height (but not both) through DITA attributes.
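
In practice, that means an author sets one dimension on the image element and lets the processor scale the other; a hedged sketch (the file name and width are assumptions):

  <image href="dashboard-overview.png" width="400px" placement="break">
    <alt>Dashboard overview</alt>
  </image>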

Interactive graphics

Some illustrations are best done in 3D. The ability to manipulate, rotate, zoom and otherwise play with graphics is not just really cool, it can also let users access the information they need without overwhelming them with 20 different views/zooms of a particular object or object set.

You can also play around with something like Prezi (or better yet, Impress.js), which lets you display and connect information graphically.

Manage graphics

If your authors don’t know that a graphic exists or they can’t find it, then that graphic is a wasted resource. It’s not uncommon for an author to forget or be unable to find the graphics that they themselves created months or years before.

Every time an author re-creates a graphic that they could have re-used, they are wasting on average 5 hours. Assuming your authors are worth approximately $45/hour and that they re-create a graphic they should have re-used about 4 times per year, then that means the company is wasting $900/year/author. If you have 10 authors, that’s $9000/year that could be saved with some simple, basic management of graphics with little or no cost or effort. If you have complex graphics, double or triple that savings.

Just like topic reuse, graphics reuse is a no-brainer source of savings for your company.

The key to graphics management is metadata. File naming, even with strict enforcement, is one of those things that degrades over time. Mistakes creep in. If you’re relying primarily on file naming, then expect to lose or orphan graphics. (An orphaned graphic is one that is not referenced by any topics.)

Instead, use descriptive tags applied to each graphic so authors can search, filter, and find the graphics they’re looking for. Then make sure they search for existing graphics before creating new ones from scratch. This same metadata can also be used to let media be searchable to end users when you publish. If you have videos, you can take this one step further and provide a time-delineated list of subjects covered in the video so users can skip right to the spot they want.

Descriptive tags should also be intelligently managed so you don’t have people using slightly different tags and so that you can modify the tags when it’s needed. These tags are called a taxonomy/classification scheme and can lead to their own chaos if left unmanaged and uncontrolled. Either keep them extremely simple (fewer than 10 tags, no hierarchy), select a CCMS that allows you to manage them, or call in an expert to help you out.

Manage source files

Don’t forget to store and manage your source files for graphics in a similar way to your graphic output files themselves.

A quality CCMS makes it easy to store your source graphics with your output graphics (or vice versa), so you can easily find, for example, both your Visio source file and your eight associated .PNG files.

If your CCMS doesn’t include this functionality (or if you’re using file folders instead), the key is to use metadata that matches how you’re managing your output graphics so that your filters and searches will automatically include the source files for graphics as well.

Summary

Graphics are the least-emphasized aspect of a DITA conversion project, but it’s worth the effort to establish which graphics you need to keep, how to manage them, and how to make them findable for both authors and end users. Your graphics are valuable assets that can and should be leveraged.