Introducing Alfresco Certified Technologies
With hundreds of add-ons listed in the directory, we want to make it easier to find ones that can be maintained in production. We started a certification program to identify third-party technologies that we have reviewed for quality and know can be supported by our certified partners. Some of the criteria are described in a lightning talk given by Peter Monks at the last Alfresco Summit. One of the first add-ons to be certified is the integration between Alfresco and Ephesoft.
Alfresco helps manage an organization's unstructured content. But in order to leverage most of Alfresco's capabilities, content must be digital and key attributes must be attached as metadata. Document capture consists of imaging and indexing content. Document imaging is the process of scanning paper documents so that they can be managed digitally. Document indexing is the process of identifying the data in those images that should be extracted as metadata. Ephesoft is a tool for making this process as efficient as possible. It can work with any sort of printed document including application forms, invoices, contracts and patient records. It automatically extracts requested data and makes it easy for a human operator to deal with the cases which require manual intervention.
Much as Alfresco changed document management, Ephesoft intends to shake up the document capture segment of enterprise software. Since their launch in 2010 they have produced a capable system, built a solid business, and recruited experienced partners. Many Alfresco partners are also Ephesoft partners, and the two solutions work together for many business automation tasks.
Ephesoft is an enterprise application built with flexibility and scale in mind; this leads to a certain level of complexity. The documentation, error messages, and in-application guidance are at times not sufficiently helpful, so learning the platform will take some investment. I expect it will be worth the effort.
Editions and Architecture
As a Java application running on Tomcat, Ephesoft will feel familiar to an Alfresco administrator or developer. The Community Edition relies on familiar technologies such as MySQL, Lucene and OpenOffice / LibreOffice.
Like Alfresco, Ephesoft has an Enterprise, Cloud, and Community Edition. Ephesoft’s Enterprise Edition can be deployed on-premise or in the cloud and is fully supported. The Cloud Edition is a multi-tenant service with free and paid tiers. It has similar features to Enterprise Edition, including many of the hooks for customization.
Ephesoft's model for managing Community Edition is quite different from that used by Alfresco. Though Community Edition does have enough features to be useful, the list of missing features makes it hard to use as a valid evaluation of Ephesoft. The Ephesoft team has not previously followed an open development model but instead pushes the source for Community Edition to the public source repository at Google Code a few times a year. The Community Edition download is released after the Enterprise Edition. The current release is from March 2012, but a new release is expected in the first half of 2014.
Ephesoft is currently best supported on Windows. They previously had a Linux release, but it has not been available for a few years. The Ephesoft team's next release will have full support on Windows and Linux.
Optical character recognition (OCR) is one of the key capabilities that Ephesoft uses to automate metadata extraction from documents. Ephesoft integrates third-party OCR libraries to provide this feature. The Community Edition relies on Tesseract, which is probably the most advanced open source OCR engine and is adequate for situations involving clean scans and easy-to-read fonts. On Windows, the Enterprise Edition uses the advanced OpenText Recostar engine, which has high accuracy and transcribes some types of handwriting. The upcoming Linux release will use the state-of-the-art OmniPage engine from Nuance. All three of these engines have support for recognizing many languages.
When the OCR engine fails to find a value, or is not confident in what it found, it flags the field for a human operator to review. The operator interface is very efficient and loaded with keyboard shortcuts. It doesn't take long to get fast at indexing within the application.
The quickest way to get started is to sign up for a free cloud account at MyCloud.Ephesoft.com. The freemium account is limited to only one batch class (a batch class is how Ephesoft organizes document types for related scan jobs). This default batch class has some restrictions in configuration, for instance the name cannot be changed. It can only process 10 batches per day, with 10 pages per batch. Files must be less than 2MB. Reporting and workflow are disabled. These limitations are lifted with paid accounts.
The Ephesoft Cloud is a relatively young product and I had some problems. However, the Ephesoft team was quick to respond to my reports. They are also adding requested features such as the ability to view log files for processes in the cloud.
Installing the on-premise Community Edition starts by making sure you have a JDK on your machine. Then download the latest release, which is currently the Windows version from 2012. The installation requires you to agree to the AGPL, and also a bunch of Microsoft license terms for the Visual Studio components it installs. The installation bundle includes the installer for MySQL, which it will start for you. OpenOffice is optionally needed for document transformations (though you can instead do them in Alfresco). After the installers have completed you should check that MySQL is installed as a Windows Service and running. You can then start the Ephesoft Server using the shortcut in the Start Menu. That launches a cmd window that spews out Tomcat log messages. After you see “INFO: Server startup in ___ ms”, you can open your browser and go to the Ephesoft pages.
Testing Enterprise Edition requires filling out the sales form, receiving a call back, downloading the product, installing, generating a license details.properties file, requesting a license file, and then installing the license file. I had problems with my license validation that were solved by setting the “Users” permissions to “Full” for the registry key HKLM\Software\JavaSoft\Prefs\com\ephesoft. The Enterprise Edition installer is a 1 GB download and won't install unless there is 3 GB of disk space available. This is because it packages everything you need to run the application, including a JRE, MySQL, MSSQL Express, and LibreOffice. Though the installer set up Ephesoft as a Windows service, I had to configure it to use the Administrator user with a blank password before it would run. The install scripts for the Windows services and setting environment variables are stored in Ephesoft\JavaAppServer, and can be useful for correcting the environment.
The Windows VM I was using to test Ephesoft only had 4 GB of RAM. The default installation of Ephesoft claimed more RAM than I had available, so the server process silently exited. The only clue I could find was in the log file at Ephesoft\JavaAppServer\logs\jakarta_service. It reported that prunsrv.c generated the error “Failed creating java ServiceStart returned”. The system ran after I edited Ephesoft\JavaAppServer\bin\startup.bat to lower the JAVA_OPTS memory requirements to -Xmx1024m and -XX:MaxPermSize=512m.
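Before starting the service on a small VM, it can help to add up what the JVM flags are asking for. The following is a minimal sketch and not part of Ephesoft: the function names and the 50% headroom rule are my own assumptions, and the second JAVA_OPTS string in the usage note is a made-up example rather than the installer's actual default.

```python
import re

# Multipliers for the size suffixes the JVM accepts (k, m, g).
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a JVM size such as '1024m' or '2g' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if match is None:
        raise ValueError(f"unrecognised JVM size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit.lower(), 1)


def jvm_memory_ceiling(java_opts):
    """Sum the -Xmx and -XX:MaxPermSize values found in a JAVA_OPTS string."""
    total = 0
    for pattern in (r"-Xmx(\S+)", r"-XX:MaxPermSize=(\S+)"):
        found = re.search(pattern, java_opts)
        if found:
            total += parse_size(found.group(1))
    return total


def fits_in_ram(java_opts, physical_ram_bytes, headroom=0.5):
    """True if heap plus PermGen stay within a share of physical RAM.

    The 0.5 default is an arbitrary assumption that leaves room for the
    operating system, MySQL and the OCR processes.
    """
    return jvm_memory_ceiling(java_opts) <= physical_ram_bytes * headroom
```

With the values I ended up using, `fits_in_ram("-Xmx1024m -XX:MaxPermSize=512m", 4 * 1024 ** 3)` passes, while a hypothetical `-Xmx4096m -XX:MaxPermSize=512m` would not on the same 4 GB machine.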
There are two main Ephesoft URLs, both of which have shortcuts in the Start Menu: the Administrator Module (<base_url>/dcma/BatchClassManagement.htm) and the Operator Module (<base_url>/dcma/BatchList.html). Both are accessible from the Ephesoft Home (<base_url>/dcma/Home.html). The Administrator Module is where you configure the document types in Batch Classes, and the Operator Module is where you interact with your captured documents. The default username is "ephesoft" and the default password is "demo". I recommend you change them.
I had some trouble getting the web scanner Java applet to run as it kept giving me initialization errors. I got it working by verifying that my JRE had installed the Java plugin into my browser, that my browser security settings allowed it to run, and that my scanner was presenting a TWAIN interface. I got it working on Windows with Firefox and JRE7, but I couldn't get it to see my HPLIP scanner on Linux. The web scanner is useful but not required as it is easy to upload documents or pick them up from a hot folder used by the scanner.
Integrating with Alfresco
Ephesoft has a robust set of APIs, which allows you to use it as a remote service for indexing content that is in Alfresco, as described by Pat Myers from Zia Consulting in this session from Alfresco Summit 2013. But the certified integration uses Ephesoft to do the initial imaging and indexing of the content, and then uses Ephesoft's CMIS Export to push the content into Alfresco. The export supports mapping Ephesoft index fields to CMIS metadata properties in Alfresco. It also allows you to define for each batch class the aspects you want applied to the content after it is exported to Alfresco. This can be used in Alfresco to apply custom metadata and trigger complex automated behaviors by coupling an aspect with Alfresco's folder rules. Ephesoft can also use CMIS to import content from Alfresco that is in a specified folder and has a property with a specified value.
These are instructions for integrating Ephesoft Enterprise Edition with an on-premise Alfresco repository.
When configuring CMIS use these values:
- Alfresco CMIS URL: https://<alfresco_hostname>:8443/alfresco/service/cmis
  - It didn't work when using the Public API endpoint version 1.0 or 1.1 (alfresco/api/-default-/public/cmis/versions/1.1/atom); it just reported “cmis_repository_not_found” and “Unknown repository”.
  - I didn't have to do anything special to get it to accept the self-signed certificate I was using for my Alfresco repository.
- Username: The username of the service account in Alfresco that will interact with Ephesoft content.
- Password: The password for the service account in Alfresco.
- Repository ID: Easiest way to get this is to browse to https://<alfresco_hostname>:8443/alfresco/service/cmis/index.html, and look at the repository information.
- Folder Name: The folder in Alfresco. The easiest way to find it is to use the Repository view in Share and remove the /Repository/ prefix from the path. For example, an “Uploads” folder in the site “Customer Invoicing” (with slug “customer-invoicing”) would appear in the Repository view as /Repository/Sites/customer-invoicing/documentLibrary/Uploads. In Ephesoft that would be Sites/customer-invoicing/documentLibrary/Uploads.
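Two of the values above can be derived mechanically. The sketch below assumes the CMIS URL returns a standard CMIS 1.0 AtomPub service document. It reads the repository ID out of that XML and converts a Share Repository path into the folder name Ephesoft expects. The function names and the placeholder ID are hypothetical, and fetching the document (with authentication) is left out.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the CMIS 1.0 AtomPub binding.
NS = {
    "app": "http://www.w3.org/2007/app",
    "cmisra": "http://docs.oasis-open.org/ns/cmis/restatom/200908/",
    "cmis": "http://docs.oasis-open.org/ns/cmis/core/200908/",
}


def repository_ids(service_document_xml):
    """Return every repository ID listed in an AtomPub service document."""
    root = ET.fromstring(service_document_xml)
    path = "app:workspace/cmisra:repositoryInfo/cmis:repositoryId"
    return [node.text for node in root.findall(path, NS)]


def share_path_to_folder_name(share_path):
    """Strip the /Repository/ prefix that Share shows in its Repository view."""
    prefix = "/Repository/"
    if not share_path.startswith(prefix):
        raise ValueError("expected a path starting with /Repository/")
    return share_path[len(prefix):].strip("/")


# Trimmed-down service document with a placeholder repository ID.
SAMPLE_SERVICE_DOCUMENT = """<app:service
    xmlns:app="http://www.w3.org/2007/app"
    xmlns:cmisra="http://docs.oasis-open.org/ns/cmis/restatom/200908/"
    xmlns:cmis="http://docs.oasis-open.org/ns/cmis/core/200908/">
  <app:workspace>
    <cmisra:repositoryInfo>
      <cmis:repositoryId>00000000-0000-0000-0000-000000000000</cmis:repositoryId>
    </cmisra:repositoryInfo>
  </app:workspace>
</app:service>"""

print(repository_ids(SAMPLE_SERVICE_DOCUMENT))
print(share_path_to_folder_name(
    "/Repository/Sites/customer-invoicing/documentLibrary/Uploads"))
```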
To configure the CMIS Import:
- Enable the module by editing Ephesoft\Application\applicationContext.xml and uncommenting <import resource="classpath:/META-INF/applicationContext-dcma-cmis-import.xml" />
- It can be helpful to make the import run more frequently than every 15 minutes, by editing the cron definition in Ephesoft\Application\WEB-INF\classes\META-INF\dcma-cmis-import\cmis-import.properties
- Restart the Ephesoft server.
- Log in to the Administrator Module.
- Click on the “Batch Class Management” tab.
- Select the batch class you want to have import from Alfresco, and click Edit in the top right corner.
- Select the “CMIS Import” tab.
- Select “Add” to define a new import location.
- Use these settings:
  - Server URL: The Alfresco CMIS URL.
  - Username: Your service account user name.
  - Password: Your service account password.
  - Repository ID: As explained above.
  - File Extension: A comma-separated list of file extensions that should be imported.
  - Folder: The location in Alfresco from where you want to import content. See example above.
  - Property: A CMIS property that should be used to filter the content for import, e.g. Author is “cm:author”.
  - Value: The value the CMIS property should have to be eligible for import.
  - New Value: The value that will be assigned to the CMIS property after the content has been imported to Ephesoft.
Once it is set up, Ephesoft will check that folder every 15 minutes and all content that matches these specifications will be imported into a batch.
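To make the interplay of File Extension, Property, Value, and New Value concrete, here is a small model of the selection logic as I understand it from the settings above. It is not Ephesoft's code: the class and function names, the case-insensitive extension match, and the sample property values are all my assumptions.

```python
from dataclasses import dataclass


@dataclass
class ImportRule:
    file_extensions: str  # comma-separated, as in the "File Extension" setting
    prop: str             # CMIS property to filter on, e.g. "cm:author"
    value: str            # value that marks a document as eligible
    new_value: str        # value written back after the import


def eligible(rule, name, properties):
    """A document qualifies if its extension is listed and the property matches."""
    allowed = {
        ext.strip().lower().lstrip(".")
        for ext in rule.file_extensions.split(",")
        if ext.strip()
    }
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return extension in allowed and properties.get(rule.prop) == rule.value


def run_import(rule, documents):
    """Collect eligible documents into a batch and stamp them with new_value.

    `documents` is a list of dicts with "name" and "properties" keys, standing
    in for the contents of the configured Alfresco folder.
    """
    batch = []
    for doc in documents:
        if eligible(rule, doc["name"], doc["properties"]):
            batch.append(doc["name"])
            doc["properties"][rule.prop] = rule.new_value
    return batch
```

In this model, running `run_import` a second time over the same folder returns an empty batch, because the imported documents no longer carry the matching value. That is presumably why the New Value setting exists.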
To enable the CMIS Export Script:
- From the “Batch Class Management” tab in the Administrator Module, edit the batch class you want to have export to Alfresco.
- Press Configure, and verify that “Export” is a selected module. Leave the Configure screen.
- On the “Modules” tab, select “Export”, and click Edit in the right corner.
- Press Configure, and verify that “CMIS_EXPORT” is a selected plugin.
  - The selected plugins form an ordered list, so CMIS_EXPORT needs to appear somewhere after CREATEMULTIPAGE_FILES and before CLEANUP.
  - Any time you change plugins, you need to press the “Validate” button and the “Deploy Workflow” button (select to deploy as a “Normal Workflow”).
  - Any jobs in flight will need to be restarted to get the new workflow settings.
- Leave the Configure screen.
- Select CMIS_EXPORT in the plugin list, and press Edit.
- Use these settings:
  - Cmis Root Folder Name: The destination folder in Alfresco. See the example above.
  - Cmis Upload File Extension: Either pdf or tif.
  - Cmis Server URL: The Alfresco CMIS URL.
  - Cmis Server User Name: Your service account user name.
  - Cmis Server User Password: Your service account user password.
  - Cmis Server Repository Id: As explained above.
  - CMIS Server Switch ON/OFF: ON, to enable the module.
  - Aspect Switch: Depends on how you configure the aspect mapping.
  - CMIS Export File Name: Allows you to specify how you want files to be named within Alfresco.
  - CMIS Export Client Key, CMIS Export Secret Key, CMIS Export Refresh Token, CMIS Export Redirect URL, and CMIS Export Network are all used to connect to Alfresco in the cloud using OAuth. They can be ignored when connecting to an on-premise Alfresco instance.
- Configure the Ephesoft document level field mappings to Alfresco metadata properties by editing DLF-Attribute-mapping.properties in the Ephesoft\SharedFolders\<batch class id>\cmis-plugin-mapping folder. As you can see in the screenshot, Alfresco types are prefixed with a “D:”.
- Configure the mapping of aspects per document type, and fields to aspect properties, by editing aspects-mapping.properties in the Ephesoft\SharedFolders\<batch class id>\cmis-plugin-mapping folder. As you can see in the screenshot, Alfresco aspects are prefixed with a “P:”.
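The plugin ordering rule from the steps above is easy to get wrong when rearranging the Export module, so here is a small sketch of it as a check. The function name is hypothetical and the plugin list is something you would copy by hand from the Configure screen. Treating CLEANUP as optional is my own choice, so the check still passes while that plugin is disabled for testing.

```python
def cmis_export_position_ok(plugins):
    """Check that CMIS_EXPORT sits after CREATEMULTIPAGE_FILES and before CLEANUP.

    `plugins` is the ordered list of selected plugin names for the Export
    module. CLEANUP is only checked when it is present in the list.
    """
    if "CMIS_EXPORT" not in plugins or "CREATEMULTIPAGE_FILES" not in plugins:
        return False
    export = plugins.index("CMIS_EXPORT")
    if plugins.index("CREATEMULTIPAGE_FILES") > export:
        return False
    return "CLEANUP" not in plugins or export < plugins.index("CLEANUP")
```

For example, `["CREATEMULTIPAGE_FILES", "CMIS_EXPORT", "CLEANUP"]` passes, while a list with CMIS_EXPORT placed after CLEANUP does not.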
When testing the export, it can help to:
- Start out with blank mapping files and test that the connection settings allow successful uploads. Then work on getting the mappings correct.
- Disable the CLEANUP Export plugin, so that the batches stay in the queue. Then you can just restart the export step instead of waiting for everything to process again.
- Test with small batches.
Once this is set up, if a batch has no errors, or after an operator has completed verifying a batch, it will be uploaded to that folder in Alfresco.
Community Edition is configured in the same way as Enterprise Edition, but the current release is pretty old and doesn't have support for setting aspects or importing from CMIS. This same configuration is available for Ephesoft Cloud in the Administrator Module under the Folder Management tab.
The Ephesoft Wiki page has more details relating to configuring CMIS and the CMIS Export Plugin. If you have trouble, you can turn up global logging by editing the value of <logger name="com.ephesoft"> in Ephesoft\Application\log4j.xml and restarting Ephesoft.
CMIS Export (but not import) supports OAuth in Ephesoft Enterprise and the next update of the Cloud Edition. This allows it to upload to Alfresco in the cloud using the cloud CMIS endpoint (api.alfresco.com/yourcompany.com/public/cmis/versions/1.1/atom). You can get more information in the API reference and you can get OAuth developer keys by registering for the Cloud API on the developer page. Unfortunately, since Alfresco in the cloud does not yet support custom metadata, it can't take advantage of Ephesoft's metadata extraction capabilities.
Besides the resources linked here, there are some very helpful videos on the Ephesoft University YouTube channel, such as this video guide. The wiki is also very helpful. There are community forums, but they are not very active and new accounts require administrator approval. There is also this Ephesoft book which is about as old as the current Community Edition release. I haven't read it, but I know most of the authors and they do excellent work.
If your Alfresco repository has a lot of forms, or worse, if your organization has a lot of paper documents that you want to get into Alfresco, then Ephesoft will likely be a valuable tool. It integrates well, is very configurable, and is familiar to an Alfresco developer. I recommend you ask for evaluation assistance from Ephesoft or a partner, as they can make the process much easier.