How to Convert SGML to XML

  1. Take our self-study “Introduction to OmniMark” training course (materials and labs available upon request at https://www.stilo.com/omnimark-training/) which will teach you to:
    • work effectively with the core capabilities of OmniMark
    • use OmniMark’s powerful pattern-matching capabilities for text processing and converting content into XML/SGML
    • use OmniMark’s markup parser to validate XML/SGML markup
    • use OmniMark to process and enrich XML/SGML content prior to publication
  2. Write code to create an SGML to XML data conversion program
  3. Debug your program
  4. Run your program using OmniMark
  5. Retrieve the output results from your SGML to XML conversion program

The OmniMark streaming model can make an application more efficient, reusable, and scalable to large volumes of data.

OmniMark supports incremental input processing through its rule-based execution model. As events are recognized in the input, rules are executed to process them.
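
The rule-per-event idea can be sketched outside OmniMark. The following is a minimal Python analogy, not OmniMark script: it uses the standard library's html.parser, which is a lenient tag scanner rather than a true SGML parser (it reads no DTD and cannot infer omitted end tags). The class name RuleDrivenConverter, the convert function and the sample element names are hypothetical. Each handler plays the role of a rule that runs as soon as its event is recognized, and input is fed in chunks to show that processing does not wait for the whole document.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class RuleDrivenConverter(HTMLParser):
    """Runs one handler ("rule") per markup event as the input arrives."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []  # output fragments, appended as events fire

    def handle_starttag(self, tag, attrs):
        # Rule for start tags: emit an XML start tag. html.parser has
        # already lowercased the name; attribute values get quoted here.
        rendered = "".join(
            " {}={}".format(name, quoteattr(name if value is None else value))
            for name, value in attrs
        )
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        # Rule for end tags: emit the matching XML end tag.
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Rule for character data: escape &, < and > for XML output.
        self.out.append(escape(data))


def convert(chunks):
    """Feed the input piece by piece and return the converted text."""
    converter = RuleDrivenConverter()
    for chunk in chunks:
        converter.feed(chunk)
    converter.close()
    return "".join(converter.out)


if __name__ == "__main__":
    print(convert(["<DOC ID=a1><PARA>Fish &amp; ", "Chips</PARA></DOC>"]))
```

A real SGML-to-XML conversion also needs DTD-aware parsing to restore omitted tags and defaulted attributes, which this sketch does not attempt.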

How To Build Fast-Flow Content Conversion Pipelines With OmniMark

OmniMark Fast-Flow Content Conversion Pipeline Components:

  1. Parsers. OmniMark includes parsers for many common data formats, including XML, SGML and RTF. You can also write custom parsers using OmniMark script.
  2. OmniMark Script. For creating your own filters, parsers, validators and business rules, OmniMark provides a high-level scripting language designed for creating content conversion components that operate in a streaming pipeline environment.
  3. Multiple Inputs. An OmniMark conversion pipeline can integrate content from multiple sources into a single content stream. Supported scenarios include combining similar content from different suppliers or enriching content with data drawn from corporate databases or the internet.
  4. Filters. OmniMark includes a number of pre-built filters for common content conversion operations. You can use these filters in your pipelines or use them as templates for developing your own filters.
  5. Multiple Outputs. An OmniMark pipeline can be split to send output to two or more different destinations. Supported scenarios include the output of the same content to multiple formats, and splitting a single content stream into two different streams with different content in each.
  6. Database Interface. OmniMark can pull data from or send data to most popular databases.
  7. File System Access. OmniMark provides complete access to local and networked file systems.
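
The multiple-input and multiple-output components above can be pictured with a small Python analogy. This is not OmniMark code; the function names (merge_sources, fan_out, to_xml, to_text) and the sample record fields are hypothetical. Two supplier feeds are merged into one stream, and each record is then written to two destinations in different formats during a single pass.

```python
import io
from itertools import chain
from xml.sax.saxutils import escape, quoteattr


def merge_sources(*sources):
    """Present several record iterables as one stream, in source order."""
    return chain.from_iterable(sources)


def to_xml(record):
    """Render a record as one XML element per line."""
    return "<item sku={}>{}</item>\n".format(
        quoteattr(record["sku"]), escape(record["name"])
    )


def to_text(record):
    """Render a record as one tab-separated line."""
    return "{}\t{}\n".format(record["sku"], record["name"])


def fan_out(records, sinks):
    """Write every record to every (render, stream) pair in a single pass."""
    count = 0
    for record in records:
        for render, stream in sinks:
            stream.write(render(record))
        count += 1
    return count


if __name__ == "__main__":
    supplier_a = [{"sku": "A1", "name": "Nuts & bolts"}]
    supplier_b = [{"sku": "B7", "name": "Washers"}]
    xml_out, text_out = io.StringIO(), io.StringIO()
    fan_out(merge_sources(supplier_a, supplier_b),
            [(to_xml, xml_out), (to_text, text_out)])
    print(xml_out.getvalue(), text_out.getvalue(), sep="")
```

In practice the in-memory streams would be replaced by files, database connections or network destinations.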

Enabling Organizations To Meet Critical Demand

Large organizations today need to process increasing volumes of content, including corporate data, office documents, plain text and markup (XML, SGML, HTML), for delivery to enterprise information portals or supply chain partners.

When you need to acquire content from multiple sources, and convert, transform, validate and integrate it into your mission critical business systems, the processing of that content can rapidly become a major business issue.

Processing bottlenecks can develop within enterprise information architectures that are designed to provide real-time delivery of content to hundreds, or possibly thousands, of online users. When you need to modify your system to handle new content types or integrate new business rules, and processing volumes increase substantially, bottlenecks can multiply and system performance can rapidly deteriorate.

Building high-performance content conversion solutions requires specialist content engineering skills, supported by specialist processing tools.

Event-Based Parsing

With conventional tools, there is little more you can do to optimize the development process and/or the overall conversion time. With OmniMark, however, there is another option. OmniMark allows you to create conversion pipelines which can be broken down into smaller steps without the need to serialize and parse the data between each conversion step.

Like some other tools, OmniMark uses an event-based parsing approach. Unlike other tools, however, OmniMark allows you to combine multiple parsing sources in a common parse event stream, and to generate parse events at each stage in the pipeline. Because each filter in the pipeline can catch incoming parse events and insert new parse events into the parse event stream, there is no need to serialize data between filters, which means the pipeline runs faster and uses fewer resources.
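
The benefit of passing parse events between filters, rather than serialized text, can be illustrated with Python generators. This is a conceptual sketch only, not OmniMark's implementation; the Event type and the filter names (rename_elements, wrap_in_root, serialize) are hypothetical. Each filter consumes events and yields events, one filter inserts new events into the stream, and markup is written out once, at the end of the pipeline.

```python
from typing import Dict, Iterable, Iterator, NamedTuple
from xml.sax.saxutils import escape


class Event(NamedTuple):
    kind: str   # "start", "end" or "text"
    value: str  # element name, or character data for "text"


def rename_elements(events: Iterable[Event],
                    mapping: Dict[str, str]) -> Iterator[Event]:
    """Filter: rename elements on the fly; other events pass through."""
    for event in events:
        if event.kind in ("start", "end"):
            yield Event(event.kind, mapping.get(event.value, event.value))
        else:
            yield event


def wrap_in_root(events: Iterable[Event], root: str) -> Iterator[Event]:
    """Filter: insert new events around the incoming event stream."""
    yield Event("start", root)
    yield from events
    yield Event("end", root)


def serialize(events: Iterable[Event]) -> str:
    """Final stage: the only place where events are turned into markup."""
    parts = []
    for event in events:
        if event.kind == "start":
            parts.append("<{}>".format(event.value))
        elif event.kind == "end":
            parts.append("</{}>".format(event.value))
        else:
            parts.append(escape(event.value))
    return "".join(parts)


if __name__ == "__main__":
    source = [Event("start", "PARA"), Event("text", "a < b"), Event("end", "PARA")]
    pipeline = wrap_in_root(rename_elements(iter(source), {"PARA": "p"}), "doc")
    print(serialize(pipeline))
```

Because generators are lazy, each event flows through every filter before the next one is read, so adding a filter does not add a parse-and-reserialize step.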

Solving the Time Crunch

Because there is no need to serialize and parse between each step, you can break the process down much more finely, which keeps each filter as simple as possible and allows you to build a library of reusable filters. This helps you to maintain and update your conversion pipeline with minimal effort and disruption.

Because OmniMark is a full-featured content processing platform, there is no need to use different programming languages for different parts of the process. All the capabilities you need for content processing are present in OmniMark. Taken together, these features provide the solution to the content conversion time crunch: rapid development and rapid execution add up to rapid completion of the content conversion.

The OmniMark Solution

OmniMark allows developers to build efficient content conversion pipelines that support the rapid insertion of multiple content filter elements without loss of processing speed. Organizations can easily create purpose-built conversion pipelines that enable them to convert structured, semi-structured and unstructured content, even content that is unique to their business in either its format or its meaning.

The modular nature of the OmniMark pipeline architecture means that content conversion specialists can develop plug-and-play conversion modules that can be swapped into the pipeline architecture as needed, with confidence, and without impacting the flow of the working pipeline.

OmniMark offers outstanding speed, scalability and stability, regardless of the format or semantics of the content being processed or the business rules that are applied during processing. It is this superior performance in high-volume, time-sensitive content conversion environments that sets OmniMark apart.


How To Solve The Content Processing Challenge With OmniMark

OmniMark Solves The Content Processing Challenge With:

  1. Scalable Streaming Architecture. Faster than alternative approaches for processing content and requiring far less memory and system resources to complete processing tasks.
  2. Rules-Based Model. Proven to be the most effective model for handling the complex nature of content processing applications, and allows rapid development of functional applications.
  3. Powerful Pattern Matching. Optimized for efficiency and tightly integrated with markup processing.
  4. Context Management. Enables a hierarchical approach to processing content, making applications efficient, scalable and maintainable.
  5. Full SGML/XML Support and Content Validation. OmniMark has built-in SGML/XML markup parsers, and XML processing is an integral part of the platform, so it is handled transparently for the developer.

The Content Processing Challenge

All large organizations today need to process ever-increasing volumes of content in all its forms, including data, plain text, hypertext and markup (XML, SGML, HTML), for delivery to enterprise information portals or supply chain partners. When content needs to be acquired from multiple diverse sources, converted, validated, integrated and transformed, the processing of that content rapidly becomes very complex, and a major issue for many organizations.

Processing bottlenecks can readily occur within enterprise information architectures that need to ensure the real-time delivery of content to hundreds, or possibly thousands, of online users. When systems need to be modified or maintained, and processing volumes increase substantially, the situation can rapidly deteriorate.

Building high-performance content processing solutions requires specialist content engineering skills, supported by specialist content processing tools, seamlessly embedded within enterprise information architectures.

The OmniMark Solution

OmniMark has been built from the ground up to provide content engineers with a high-performance content processing platform able to support the most demanding content processing applications. It has evolved from a domain-specific text processing language, used for processing marked-up files and unstructured text, into a single, integrated content processing platform featuring a wide range of connectivity and integration capabilities. OmniMark is able to process content from any source and deliver precisely tailored information, on demand, to everyone who needs it. Increasingly, this means streaming content into other applications, where specialized tasks are performed in specialized environments.

Open Standards

The ability to seamlessly combine pattern-based text and data processing with structured markup parsing allows developers to create powerful hybrid applications. When it comes to XML, OmniMark supports well-formed and schema-based parsing, being equipped with both a built-in XML parser and an interface to external parsers. SGML is fully supported including the latest amendments to the standard that were made to accommodate the adjustments needed by XML. OmniMark supports the W3C XML Schema via an External Parser Interface (EPI). The EPI also allows other types of XML Schema to be directly supported in OmniMark’s Markup Processing domain, including non-XML or customer-specified XML protocols. An XSLT processor is integrated for performing specific markup processing tasks or when working with small XML instances and applying multiple views to the same content.

Connectivity and Integration

Broad connectivity and communication options allow OmniMark to interact with other applications via application program interfaces (APIs) and user interfaces. Most major networking protocols are supported, including TCP/IP, HTTP, HTTPS, FTP and mail (POP3 and SMTP). Data sources and sinks may be accessed transparently via URLs, whether they are on a local machine, corporate network, or public internet.

In addition, OmniMark supports sophisticated high-level database access via ODBC and XQuery. It has extensive support for the native Oracle Call Interface 11g. It also includes directory connectivity via LDAP. OmniMark functionality can be extended through its SDK to support other emerging protocols and specific APIs.