The next frontier: Giving form to unstructured data

When the former chair of Enron, Kenneth Lay, sent e-mails to his staff assuring them that the company's future was sound, he probably never thought about how these pieces of unstructured data could one day be used to indict him on 11 counts of fraud.

"There is an awful lot of intelligence potential locked up inside that unstructured data," wrote Nick Patience, managing analyst and cofounder of tech analyst The 451 Group in a November 2004 report.

The accepted figure, according to a May 2004 Forrester Research paper, is that 80 percent of all enterprise information exists as unstructured data. Like Lay, many do not realize the power of unstructured data or know how to capture it. But thanks to Enron and other corporate scandals, federal regulations on unstructured data have become stricter, and software companies have stepped up to the plate, making it much easier to secure unstructured data and integrate it into accounting systems.

"It's the next big area of growth for the business intelligence space," said Jay Henderson, director of product marketing for ClearForest, a provider of text-driven business intelligence software. "We've grown by more than 50 percent year over year - it's growing very dramatically."

The result is a plethora of services and systems filling this gap in business intelligence. From data cleansing and sorting to integration and storage, there are applications for every step in structuring data. Some companies are starting to merge all of these capabilities into one product suite, but just as one BI layer comes into being, another is at the forefront, ready to take over.

Unstructured is a misnomer

In fact, the term unstructured data is, perhaps, too complex.

Typically, structured data is an alphanumeric value - a number, name, address or zip code - easily entered into a structured database. Unstructured data - information found in spreadsheets, e-mails, presentations, Word documents - normally takes up large amounts of storage space and is not easily converted into easy-to-analyze structured data. Mostly, these unstructured files are stored as binary large objects (BLOBs) or character large objects (CLOBs), rather than as XML.

"A lot of CPAs don't picture it in their heads that this is even possible," said Carlton Collins, CPA and chief executive of consultancy Accounting Software Advisor LLC. "They're still doing it the old, expensive way. They really need to come up with the times."

Converting this unstructured data into alphanumeric values, however, is not only possible; the many software products and systems that do it are growing more sophisticated.

Scheduled to be available in early fall, Microsoft's Small Business Accounting will include linking capability to unstructured files. A Microsoft partner, Computer Information Enterprises, is extending the integration of its Imagelink application, a product linking scanned images to figures within a financial software system, to include the popular Best ERP system MAS 90 by the end of April. And IBM expects to close on its acquisition of Ascential Software, a provider of business data integration software, by the end of the second quarter of 2005, creating a one-stop shop for all data.

Cleansing, sorting, integrating and storing

To achieve a heightened operating level, tech departments replicate domestic chores: cleaning, sorting and storing. Much like dishes, content in unstructured formats needs to be gathered and cleaned before it can be stored in the appropriate place and used with structured data. Today, much of the data content management field is automated and segmented among specialized software companies.

Text analysis software like Inxight's SmartDiscovery separates important data from insignificant parts of speech - adverbs, participles or adjectives. Products like SmartDiscovery read text found in e-mails, documents and other sources and decide what to pull out - people, companies, currency amounts, dates, etc.

"We can tell from grammatical structure what looks like a company's name or a personal name," said Catherine van Zuylen, senior product manager for Inxight Software Inc. "There is no way a human can read through all the materials in a business that they need to get all this data - our system reads it for you."
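To make the idea of pulling companies, currency amounts and dates out of free text concrete, here is a minimal sketch in Python. It uses simple regular expressions rather than the grammatical analysis van Zuylen describes, so it is a stand-in for illustration only and not how SmartDiscovery works. The pattern names, the `extract_entities` function and the sample sentence are all assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration; commercial text analysis
# products rely on linguistic analysis, not hand-written regexes.
CURRENCY = re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}\b"
)
# One or more capitalized words followed by a common company suffix.
COMPANY = re.compile(
    r"\b(?:[A-Z][A-Za-z0-9]*\s)+(?:Inc|LLC|Corp|Software|Group)\b"
)


def extract_entities(text):
    """Return the currency amounts, dates and company names found in text."""
    return {
        "currency": CURRENCY.findall(text),
        "dates": DATE.findall(text),
        "companies": COMPANY.findall(text),
    }


sample = (
    "IBM expects to close its $1.1 billion purchase of "
    "Ascential Software in June 2005."
)
entities = extract_entities(sample)
print(entities)
```

A rule set this small misses most real-world names (it does not flag "IBM" here, for instance), which is the gap that grammar-aware products aim to close.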

"Typically the information looks different," said Jeff Jones, director of strategy for IBM Information Management Software. "It's saved in a way convenient for sales and not [for] analytical application."

ClearForest, with other large content management software companies like Hummingbird, transforms the jumbled, unstructured data into structured columns and rows by placing it into analysis databases.

Cleaned and sorted, data then needs to be linked or integrated with structured data. With enterprise application integration and tools to extract, transform and load (or ETL), most tech departments or consultants can bridge the gap from here.
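As a rough illustration of the extract, transform and load step, the sketch below takes a few rows already pulled from unstructured sources, normalizes them, loads them into a database and joins them to a structured vendor table. It uses Python's built-in `sqlite3` module; the table names, column names and sample records are invented for the example and do not come from any product mentioned here.

```python
import sqlite3

# Extract: rows already pulled from unstructured sources
# (hypothetical sample data).
extracted = [
    {"vendor_id": "V100", "source": "email-0412.txt", "amount": "$1,250.00"},
    {"vendor_id": "V200", "source": "contract-17.doc", "amount": "$980.50"},
]


def transform(row):
    """Transform: turn a currency string such as '$1,250.00' into a number."""
    amount = float(row["amount"].replace("$", "").replace(",", ""))
    return (row["vendor_id"], row["source"], amount)


# Load: put the cleaned rows next to an existing structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendors (vendor_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE mentions (vendor_id TEXT, source TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO vendors VALUES (?, ?)",
    [("V100", "Acme Supply"), ("V200", "Birch Legal")],
)
conn.executemany(
    "INSERT INTO mentions VALUES (?, ?, ?)",
    [transform(row) for row in extracted],
)

# Link: join the formerly unstructured rows to the structured vendor data.
linked = conn.execute(
    "SELECT v.name, m.source, m.amount "
    "FROM mentions m JOIN vendors v ON v.vendor_id = m.vendor_id "
    "ORDER BY v.name"
).fetchall()
print(linked)
```

Once the extracted rows share a key with the structured data, in this case a vendor ID, ordinary queries can treat both kinds of information alike.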

Microsoft, however, is making the linking process easier in its Small Business Accounting system for small or midsized businesses. In SBA, an unlimited number of links can be attached to any figure within the accounting system. The linking system looks like the attachment system for e-mails in Outlook. Some products, like Imagelink, automatically index scanned images such as checks, contracts and receipts under key items like a name or vendor ID number for later linkage into the ERP system.
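The indexing-and-linking idea can be sketched in a few lines of Python. This is an illustrative model only, not the design of Small Business Accounting or Imagelink; the `LedgerEntry` class, the function names and the file paths are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    """A figure in the accounting system that can carry any number of links."""
    entry_id: int
    vendor_id: str
    amount: float
    links: list = field(default_factory=list)


def build_index(scans):
    """Index scanned image paths under their vendor ID."""
    index = defaultdict(list)
    for path, vendor_id in scans:
        index[vendor_id].append(path)
    return index


def attach_links(entries, index):
    """Attach every indexed image to the ledger entries for the same vendor."""
    for entry in entries:
        entry.links.extend(index.get(entry.vendor_id, []))
    return entries


# Hypothetical scanned documents, each tagged with a vendor ID.
scans = [
    ("scans/check-0881.tif", "V100"),
    ("scans/receipt-0042.tif", "V100"),
    ("scans/contract-17.tif", "V200"),
]
entries = [LedgerEntry(1, "V100", 1250.0), LedgerEntry(2, "V300", 75.0)]
attach_links(entries, build_index(scans))
print(entries[0].links)
```

Entries with no matching documents, like the second one here, simply keep an empty list of links.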

The last step toward automated content management is storage - the last pocket of software specialty in this corner of BI. Federal regulations require financial data, structured and unstructured, to be stored unchanged for several years. Network Appliance is one storage software producer that offers some products using WORM (write once, read many) technology, in addition to its NetApp servers. With this technology, applications like its Compliance Journal, a log of who performed what task and when, become unalterable.
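WORM storage enforces immutability at the storage layer, which plain application code cannot reproduce. As a loose software analogy, the sketch below shows an append-only log of who did what and when, in which each record carries a hash of the one before it, so any later alteration can be detected. This is a different technique (hash chaining) chosen for illustration and is not a description of any vendor's product; the `AuditJournal` class and its methods are hypothetical.

```python
import hashlib
import json


class AuditJournal:
    """Append-only log; each record is chained to the previous one by hash."""

    _FIELDS = ("user", "task", "timestamp", "prev")

    def __init__(self):
        self._records = []

    @staticmethod
    def _digest(body):
        # Hash a canonical JSON form so the same body always hashes the same.
        return hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()

    def append(self, user, task, timestamp):
        """Add a record; there is deliberately no update or delete method."""
        prev = self._records[-1]["hash"] if self._records else ""
        body = {"user": user, "task": task, "timestamp": timestamp, "prev": prev}
        self._records.append({**body, "hash": self._digest(body)})

    def verify(self):
        """Return True only if no record has been altered or reordered."""
        prev = ""
        for record in self._records:
            body = {key: record[key] for key in self._FIELDS}
            if record["prev"] != prev or record["hash"] != self._digest(body):
                return False
            prev = record["hash"]
        return True


journal = AuditJournal()
journal.append("jsmith", "approved invoice 1041", "2005-03-14T09:30:00")
journal.append("mlee", "exported ledger", "2005-03-14T10:05:00")
print(journal.verify())
```

Hash chaining only makes tampering evident; preventing it outright still depends on storage that refuses to overwrite what has been written.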

"People using optical disks to save data were feeling the pain," said Ajay Singh, product manager at Network Appliance. "It's not easy to manage and is very flat."

But what companies are looking for is one united approach, not all these different systems, said IBM's Jones: one system, regardless of how structured, unstructured or semi-structured the information is, that combines all this data into one recorded information set. Companies want to "just ask one question and literally get all recorded information regarding that question," he stated.

One system is what IBM hopes to achieve with its $1.1 billion purchase of software vendor Ascential Software. The buy-out will add ETL capabilities and data transformation, migration and cleansing to IBM's WebSphere Information Integrator - creating one high-speed, integrated suite.

Laurie M. Orlov and Laura Ramos, vice presidents at Forrester Research, wrote in their May 2004 "Organic Information Abstraction" report that by 2007, a new "information abstraction layer will emerge to connect separate environments of data, content and text." The two go on to say that this new layer will "replace today's BI and content analytics."
