An Eclipse application for building custom language analysis into IBM LanguageWare resources and their associated UIMA annotators.
Update: July 20, 2012:
Studio 3.0 is out and is officially bundled with ICA 3.0. If you are a Studio 3.0 user, please use the ICA forum instead of the LRW forum. 18.104.22.168 LRW is a fix pack that resolves issues in various areas, including the Parsing Rules editor, PEAR file export, and Japanese/Chinese language support.
22.214.171.124 LRW is still available for download on the Downloads link for IBM OmniFind Enterprise Edition V9.1 Fix Pack users.
What is IBM LanguageWare?
IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text. LanguageWare is the ideal solution for extracting the value locked up in unstructured text information and exposing it to business applications. With the emerging importance of Business Intelligence and the explosion in text-based information, the need to exploit this “hidden” information has never been so great. LanguageWare technology not only provides the functionality to address this need, it also makes it easier than ever to create, manage and deploy analysis engines and their resources.
It comprises Java libraries with a large set of features and the linguistic resources that supplement them. It also comprises an easy-to-use Eclipse-based development environment for building custom text analysis applications. In a few clicks, it is possible to create and deploy UIMA (Unstructured Information Management Architecture) annotators that perform everything from simple dictionary lookups to more sophisticated syntactic and semantic analysis of texts using dictionaries, rules and ontologies.
The LanguageWare libraries provide the following non-exhaustive list of features: dictionary look-up and fuzzy look-up, lexical analysis, language identification, spelling correction, hyphenation, normalization, part-of-speech disambiguation, syntactic parsing, semantic analysis, facts/entities extraction and relationship extraction. For more details see the documentation.
The LanguageWare Resource Workbench provides a complete development environment for building and customizing dictionaries, rules, ontologies and associated UIMA annotators. This environment removes the need for specialist knowledge of the underlying technologies of natural language processing or UIMA. In doing so, it allows the user to focus on the concepts and relationships of interest, and to develop analyzers which extract them from text without having to write any code. The resulting application code is wrapped as UIMA annotators, which can be seamlessly plugged into any application that is UIMA-compliant. Further information about UIMA is available on the UIMA Apache site.
LanguageWare is used in a variety of products, such as Lotus Notes and Domino, Information Integrator OmniFind Edition (IBM’s search technology), and more.
The LanguageWare Resource Workbench technology runs on Microsoft Windows and Linux. The core LanguageWare libraries support a much broader list of platforms. For more details on platform support please see the product documentation.
How does it work?
The LanguageWare Resource Workbench allows users to easily:
- Develop rules to spot facts, entities and relationships using a simple drag and drop paradigm
- Build language and domain resources into a LanguageWare dictionary or ontology
- Import and export dictionary data to/from a database
- Browse the dictionaries to assess their content and quality
- Test rules and dictionaries in real-time on documents
- Create UIMA annotators for annotating text with the contents of dictionaries and rules
- Annotate text and browse the contents of each annotation.
The Workbench contains the following tools:
- A dictionary viewer/editor
- An XML-based dictionary builder
- A Database-based dictionary builder (IBM DB2 and Apache Derby support are provided)
- A dictionary comparison tool
- A rule viewer/editor/builder
- A UIMA annotator generator, which allows text documents to be annotated and the results displayed.
- A UIMA CAS (Common Analysis Structure) comparator, which allows you to compare the results of two different analyses by comparing the CASes generated by each run.
The LanguageWare Resource Workbench documentation is available online and is also installed using the Microsoft Windows or Linux installers or using the respective .zip files.
What type of application is LanguageWare suitable for?
LanguageWare technology can be used in any application that makes use of text analytics. Good examples are:
- Business Intelligence
- Information Search and Retrieval
- The Semantic Web (in particular LanguageWare supports semantic analysis of documents based on ontologies)
- Analysis of Social Networks
- Semantic tagging applications
- Semantic search applications
- Any application wishing to extract useful data from unstructured text
For Web-based semantic query of the LanguageWare text analytics, you might be interested in checking out IBM Data Discovery and Query Builder. When used together, these two technologies can provide a full range of data access services including UI presentation, security and auditing of users, structured and unstructured data access through semantic concepts and deep text analytics of unstructured data elements.
- Learn more about IBM Content Analytics.
About the technology author(s)
LanguageWare is a worldwide organization comprising a highly qualified team of specialists with a diverse combination of backgrounds: linguists, computer scientists, mathematicians, cognitive scientists, physicists, and computational linguists. This team is responsible for developing innovative Natural Language Processing technology for IBM Software Group.
LanguageWare, along with LanguageWare Resource Workbench, is a collaborative project combining skills, technologies, and ideas gathered from various IBM product teams and the IBM Research division.
Operating systems: Microsoft Windows XP, Microsoft Windows Vista, or SUSE Linux Enterprise Desktop 11
Hardware: Intel 32-bit platforms (tested)
- Java 5 SR5 or above (not compatible with Java 5 SR4 or below)
- Apache UIMA SDK 2.3 (required by LanguageWare Annotators in order to run outside the Workbench)
Notes: Other platforms and JDK implementations may work but have not been significantly tested.
Installing LanguageWare Resource Workbench and LanguageWare Demonstrator
On each platform, there are two methods of installation.
On Microsoft Windows:
- Download the lrw.win32.install.exe or Demonstrator.exe and launch the installation.
On Linux:
- Download the lrw.linux.install.bin or Demonstrator.bin and launch the installation (e.g., by running sh ./lrw.linux.install.bin).
LanguageWare Training and Enablement
The LanguageWare Training Material is a set of presentations that walk you through the use of LanguageWare Resource Workbench. It uses a step by step approach with examples and practice exercises to build up your knowledge of the application and understand data modeling. Please follow the logical succession of the material, and make sure you finish the sample exercise.
Please use this link to access the presentation decks.
- Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
- Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
- Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
- Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
- 1. What is the LanguageWare Resource Workbench? Why should I use it?
- 2. What’s new in this version of LanguageWare Resource Workbench?
- 3. Where should I start with LanguageWare?
- 4. What documentation is available to help me use LanguageWare?
- 5. What are the known limitations with this release of the LanguageWare Resource Workbench?
- 6. What version of UIMA do I need to use the LanguageWare annotators?
- 7. What is a Domain Extraction Model (or “model,” “annotator,” “analyzer”)? How do I build a good Domain Extraction Model?
- 8. How do I change the default editor for new file types in the LanguageWare Resource Workbench?
- 9. How do I integrate the UIMA Analyzers that I develop in the LanguageWare Resource Workbench?
- 10. How is my data stored?
- 11. What licensing conditions apply for LanguageWare on alphaWorks, for academic purposes, or for commercial use?
- 12. Is Language Identification identifying the wrong language?
- 13. Why is the LanguageWare Resource Workbench shipped as an Eclipse-based application?
- 14. What languages are supported by the LanguageWare Resource Workbench?
- 15. Does LanguageWare support GB18030?
- 16. Are you experiencing problems with the LanguageWare Resource Workbench UI on Linux platforms?
- 17. Any other questions not covered here?
- 18. How do I upgrade my version of the LRW?
1. What is the LanguageWare Resource Workbench? Why should I use it?
The LanguageWare Resource Workbench is a comprehensive Eclipse-based environment for developing UIMA Analyzers. It allows you to build Domain Extraction Models (lexical resources and parsing rules, see FAQ 7) which describe the entities and relationships you wish to extract, creating custom annotators tailored to your specific needs. These annotators can be easily exported as PEAR files (see FAQ 9) and installed into any UIMA pipeline or deployed onto an ICA server. If you satisfy the following criteria, then you will want to use LanguageWare Resource Workbench:
- You need a robust open standards (UIMA) text analyzer that can be easily customized to your specific domain and analysis challenges.
- You need a technology that will enable you to exploit your existing structured data repositories in the analysis of the unstructured sources.
- You need a technology that allows you to build custom domain models that become your intellectual property and differentiation in the marketplace.
- You need a technology that is multi-lingual, multi-platform, multi-domain, and high performance.
All new features are outlined in the LanguageWare Resource Workbench Release Notes, ReleaseNotes.htm, which is located in the Workbench installation directory.
The best way to get started with LanguageWare is to install the LanguageWare Resource Workbench. Check the training videos provided above or the training material (to be posted soon); it will introduce you to the Workbench and show you how it works.
Context-sensitive help is provided as part of the LanguageWare Resource Workbench. There is an online help system shipped with the LanguageWare Resource Workbench (under Help / Help Contents). Check the training videos provided above or the training material (to be posted soon). More detailed information about the underlying APIs will be provided for fully-licensed users of the technology.
Any problems or limitations are outlined in the LanguageWare Resource Workbench Release Notes, ReleaseNote.htm, which is located in the LanguageWare Resource Workbench installation directory and is part of the LanguageWare Resource Workbench Help System.
LanguageWare Resource Workbench ships with, and has been tested against, Apache UIMA, Version 2.3. The LanguageWare annotators should work with newer versions of Apache UIMA; however, they have not been extensively tested for compatibility, so we recommend Apache UIMA v2.3. The annotators are not compatible with versions of UIMA prior to 2.1: those versions were released by IBM and have a namespace conflict with Apache UIMA.
A “model” is the set of resources you build to describe what you want to extract from the data. The models are a combination of:
- The morphological resources, which describe the basic language characteristics
- The lexical resources, which describe the entities/concepts that you want to recognize
- The POS tagger resource
- The parsing rules, which describe how concepts combine to generate new entities and relationships.
The process of building data models is an iterative process within the LanguageWare Resource Workbench.
Go to Window / Preferences / General / Editors / File Associations. If the content type is already listed, just add a new editor and pick the LanguageWare Text Editor. You can set this to be the default, or alternatively leave it as an option that you can choose, on right click, whenever you open a file of that type. You will need to restart the LanguageWare Resource Workbench before this comes into effect. Note: Eclipse remembers the last viewer you used for a file type so if you opened a document with a different editor beforehand you may need to right-click on the file and explicitly choose the LanguageWare Resource Workbench Text Editor the first time on restart.
Once you have completed building your Domain Extraction Models (dictionaries and rules), the LanguageWare Resource Workbench provides an “Export as UIMA Pear” function under File / Export. This will generate a PEAR file that contains all the code and resources required to run your pipeline in any UIMA-enabled application, that is, in a UIMA pipeline.
The LanguageWare Resource Workbench is designed to primarily help you to build your domain extraction models and this includes databases in which you can store your data. The LanguageWare Resource Workbench ships with an embedded database (Derby, open source); however, it can also connect to an enterprise database, such as DB2.
11. What licensing conditions apply for LanguageWare, for academic purposes, or for commercial use?
LanguageWare is licensed through the IBM Content Analytics License at http://www-01.ibm.com/software/data/cognos/products/cognos-content-analytics/.
Sometimes the default amount of text (1024 characters) used by Language Identification is not enough to identify the correct language. This happens especially when languages are quite close or when the text being analysed includes more than one language. In this case, it may help to increase the MaxCharsToExamine parameter. To do this, select from the LRW menu: Window > Preferences > LanguageWare > UIMA Annotation Display. Enable the checkbox for “Show edit advanced configuration option on pipeline stages.” Select “Apply” and “OK.” The next time you open a UIMA Pipeline Configuration file, you will notice an Advanced Configuration link at the Document Language stage. Click on it to expand and display its contents; the MaxCharsToExamine parameter can be edited there. Change the default number to a larger value. Save your changes and try again to see if the Language Identification has improved.
We built the LanguageWare Resource Workbench on Eclipse because it provides a collaborative framework through which we can share components with other product teams across IBM, with our partners, and with our customers. This version of the LanguageWare Resource Workbench is a complete, stand-alone application. However, users can still get the benefits of the Eclipse IDE by installing Eclipse features into the Workbench. Popular features include the Eclipse CVS feature for managing shared projects and the Eclipse XML feature for full XML editing support. See the Eclipse online help for more information about finding and installing new features. It is important to understand that while the LanguageWare Resource Workbench is Eclipse-based, the Annotators that are exported from the LanguageWare Resource Workbench (under File / Export) can be installed into any UIMA pipeline and can be deployed in a variety of ways. The LanguageWare Resource Workbench team, as part of the commercial LanguageWare Resource Workbench license, provides integration source code to simplify the overall deployment and integration effort. This includes UIMA serializers, CAS consumers, and APIs for integrating through C/JNI, Eclipse, Web Services (REST), and others.
For the following languages, a lexical dictionary without part of speech disambiguation can be made available upon request: Afrikaans, Catalan, Greek, Norwegian (Bokmal), Norwegian (Nynorsk), Russian and Swedish. These dictionaries are provided “AS-IS” (i.e. they have not been maintained and will not be supported. While feedback on them is much appreciated, requests for changes, fixes or queries will only be addressed if adequately planned and sufficiently funded).
LanguageWare annotators support UTF-16, and this qualifies as GB18030 support. This does mean that you need to translate the text from GB18030 to UTF-16 at the document ingestion stage. Java will do this automatically for you (in the collection reader stage) as long as the correct encoding is specified when reading files.
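The conversion described above can be sketched in plain Java. This is a minimal illustration and not LanguageWare or UIMA code: the class and method names are hypothetical, and it assumes a Java 7 or later JDK in which the GB18030 charset is available (it is part of standard full JDK builds). A Java String is held as UTF-16 in memory, so decoding with the correct charset is all that is needed before the text is handed to an annotator.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for the document ingestion stage; not part of LanguageWare.
final class Gb18030Ingest {
    // Throws an unchecked exception at class load time if the JDK lacks GB18030.
    static final Charset GB18030 = Charset.forName("GB18030");

    /** Decodes GB18030 bytes into a Java String, which is UTF-16 in memory. */
    static String decode(byte[] gb18030Bytes) {
        return new String(gb18030Bytes, GB18030);
    }

    /** Reads a whole GB18030-encoded file; the caller supplies the path. */
    static String readDocument(Path file) throws IOException {
        return decode(Files.readAllBytes(file));
    }
}
```

If the wrong encoding is specified at this point, the decoded text will contain replacement or mismatched characters, and no later pipeline stage can repair it.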
Please note that text in GB18030 extension B may contain characters outside the Unicode Basic Multilingual Plane. Currently the default LanguageWare break rules would incorrectly split such characters into two tokens. If support for these rare characters is required, the attached break rules file can be used to ensure the proper handling of 4-byte characters. (Note that the file zh-surrogates.dic for Chinese is wrapped in zh-surrogates.zip.)
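To see why such characters can end up as two tokens, it helps to look at how Java represents them. The sketch below is only an illustration of surrogate pairs in UTF-16, not the LanguageWare break rules; the class and method names are hypothetical. A character outside the Basic Multilingual Plane occupies two `char` units, so any logic that walks a string one `char` at a time will cut it in half, while walking by code point keeps it whole.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the LanguageWare tokenizer.
final class CodePointSplitter {
    /** Splits text into one string per Unicode code point, keeping surrogate pairs together. */
    static List<String> splitByCodePoint(String text) {
        List<String> units = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            // 1 char for a BMP character, 2 chars for a supplementary one
            int width = Character.charCount(codePoint);
            units.add(text.substring(i, i + width));
            i += width;
        }
        return units;
    }
}
```

For example, U+20000 (a CJK Extension B ideograph) is written in Java as `"\uD840\uDC00"`. Its `length()` is 2, but `splitByCodePoint` returns it as a single element.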
There is a known issue on Ubuntu with Eclipse and the version of the GTK+ toolkit that prevents toolbars being drawn properly or buttons working properly with mouse clicks. The fix is explained here:
As a workaround, set the environment variable “GDK_NATIVE_WINDOWS=1” before starting the LanguageWare Resource Workbench.
Another issue was reported for Ubuntu 9.10 (Karmic), where the LanguageWare Resource Workbench shows an empty dialog window on startup. The issue is explained here:
As a workaround, add the line “-Dorg.eclipse.swt.browser.XULRunnerPath=/dev/null” to the lrw.ini file in the LanguageWare Resource Workbench installation folder.
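If the lrw.ini change has to be applied on several machines, it can be scripted. The following is a minimal sketch and not a tool shipped with the Workbench: the class and method names are hypothetical, and it assumes that lrw.ini follows the usual Eclipse launcher layout, in which JVM options are listed one per line after a `-vmargs` line at the end of the file. Check your own lrw.ini before relying on that assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the LanguageWare Resource Workbench.
final class LrwIniPatch {
    static final String XUL_WORKAROUND = "-Dorg.eclipse.swt.browser.XULRunnerPath=/dev/null";

    /** Returns a copy of the ini lines with the workaround appended if it is not already present. */
    static List<String> withWorkaround(List<String> iniLines) {
        List<String> patched = new ArrayList<>(iniLines);
        if (!patched.contains(XUL_WORKAROUND)) {
            // Assumption: the end of the file is inside the JVM options section.
            patched.add(XUL_WORKAROUND);
        }
        return patched;
    }

    /** Rewrites the given ini file in place; the caller supplies the path to lrw.ini. */
    static void patchFile(Path iniFile) throws IOException {
        Files.write(iniFile, withWorkaround(Files.readAllLines(iniFile)));
    }
}
```

Because the line is only added when it is missing, running the helper more than once leaves the file unchanged after the first run.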
Please use the “Forum” to post your questions and we will get back to you.
On Windows and Linux, each version of the LRW is a separate application. You should not install a new version over a previous one. Instead, either uninstall your previous version or ensure that all versions are installed in separate locations. If in doubt, accept the defaults in the LRW installers, which will ensure this behaviour.
The projects you created with older versions of the LRW will never be removed by the uninstall process. You can point your new version of the LRW at the same data workspace during startup.
Updated by Amine_Akrout|Sep 14 2011|Tags: