Document Management

Post by Christopher B. Brow » Mon, 28 Jul 1997 04:00:00



An application area that is missing under Linux seems to be that of
"document management."  (MH people, hold on...  Relevance does
appear...)

Xerox sells a system called Documentum; IXOS sells this sort of
product; Lotus Notes can be used for this; Informix would like for you
to use their database as the data repository (See
<http://www.hex.net/~cbbrowne/textdbms.html> for some of my thoughts
and for links to most of these guys...)

The general idea is that these systems allow you to "throw documents"
into a repository and have a database describing them.  This makes it
easier to:
a) Ensure that important documents are backed up,
b) Search the database,
c) Combine documents of various formats together,
d) Generally keep the documents organized.

Often related to this are "imaging" systems; these systems are *very*
popular in enterprises with sizable legal departments.  Land title
systems would be a *great* example of this.

They'll apply the system by getting:
i) A "page scanner", and
ii) A server with *lots* of disk.

Peons scan each document, dumping a bunch of "scan" images into the
database.  If the software is quite sophisticated, it does OCR so that,
as well as the images, it has the full text of the document, and it
tries to fill in some index information based on that.  If OCR isn't
available, the "peon" types in some indexing information including
things like:
- File number (there's typically some sort of unique ID)
- Document date
- Who's involved
- Subject/author info...
- Department/application-specific "stuff"
- Categorization info

In a "free" implementation, one sensible approach could be to create
two sets of files for each new document:

a) ~/TDB/header/doc012345678
 - This would contain "information about the document" in a format
   resembling mail/news headers.  There might be subfolders in here;
   categorization could be done on the basis of which folder the
   header is in...

b) ~/TDB/body/doc012345678.0, ~/TDB/body/doc012345678.1, ...
 - A file for each component of the document.
 - It may be presumptuous to assume that the names used for the
   documents are required to match those used for the header.

For instance, if I scanned my two-page-long bank statement, there
would be *two* images, and thus *two* files.
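
As a rough illustration of that layout, here is a minimal Python sketch
that files one document as a header file plus numbered body files.  The
function name file_document and the header names "Document-ID" and
"Component" are assumptions made for the example; the post only says
the header should resemble mail/news headers.

```python
from email.message import Message
from pathlib import Path
import shutil


def file_document(root, doc_id, fields, components):
    """Store one document as a header file plus numbered body files.

    root       -- repository root (the post uses ~/TDB)
    doc_id     -- identifier such as "doc012345678"
    fields     -- dict of header name -> value (names are illustrative)
    components -- list of paths to the files making up the document
    """
    root = Path(root)
    header_dir = root / "header"
    body_dir = root / "body"
    header_dir.mkdir(parents=True, exist_ok=True)
    body_dir.mkdir(parents=True, exist_ok=True)

    # Copy each component to body/<doc_id>.<n>, numbering from 0.
    stored = []
    for n, source in enumerate(components):
        target = body_dir / f"{doc_id}.{n}"
        shutil.copyfile(source, target)
        stored.append(target)

    # Write the description as mail/news-style "Name: value" headers.
    header = Message()
    header["Document-ID"] = doc_id  # assumed header name
    for name, value in fields.items():
        header[name] = value
    for target in stored:
        header["Component"] = target.name  # repeated, one line per file
    (header_dir / doc_id).write_text(header.as_string())
    return stored
```

Filing the two-page bank statement would then produce
body/doc012345678.0 and body/doc012345678.1, with two "Component:"
lines in header/doc012345678.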

A TCL/Tk interface could be used to add new documents, set up
searches, and launch an appropriate application to load/display/print
a "found" document.

This would be useful for simplifying personal filing systems.

- Important documents (insurance policies, leases, tax returns) could
be scanned and turned into .GIFs or .JPEGs and kept online, whilst the
original is hidden in a filing cabinet.  Some simplified referencing
system could be used to indicate where to find the original on those
unusual occasions when it is truly needed.

- "Electronic" documents could either be moved into the Doc Mgmt
system or simply referenced in place.

- A typical extension is that certain categories of documents get
"aged" and gradually moved out onto archival media or even purged.

- A no-brainer is to use the Magicfilter print filter system as a
"rendering" utility; a typical thing to do is to throw documents at
Magicfilter, and grab back PostScript output that can be
displayed/printed using Ghostscript.

- Glimpse could be used for full text search of documents...  Hmmm...
Doesn't that sound like a program MH users have heard of?!?

This feels like something where the MH family of interfaces for
managing mail messages could be *highly* useful for managing the
"header" information.  We just have to relax the assumptions that:
a) "Messages" are static and shouldn't be edited;
b) Messages have senders and recipients.

EXMH feels to me like it would be the "perfect" interface to manage
this.  Interestingly, MH messages *are* permitted to link to
files/"objects" outside the message, which means that *that*
functionality doesn't require much work to implement.

The document might sit off in the ~/TDB/data/ hierarchy, perhaps with
an exceedingly cryptic name.

The "header" files would look almost like MH messages in ~/Mail/...

--

PGP Fingerprint: 10 5A 20 3C 39 5A D3 12  D9 54 26 22 FF 1F E9 16
URL: <http://www.hex.net/~cbbrowne/>
Q: What does the CE in Windows CE stand for?  A: Caveat Emptor...

 
 
 

Post by Chris Mikkels » Fri, 01 Aug 1997 04:00:00


Another thing to check out might be VirtualPaper from DEC.  It's
a free product of their research center.  It might not be exactly
what you describe, but it could be a place to start or an alternative.

 
 
 

Post by Evan Care » Sat, 02 Aug 1997 04:00:00



> An application area that is missing under Linux seems to be that of
> "document management."  (MH people, hold on...  Relevance does
> appear...)

Chris,

How true! Seems to me that the heart of the matter is the lack of good
tools to work with a fully functional database from a reasonable front
end. My approach has been to spend some time getting postgres up to speed
so that I can access it through JDBC. Right now I'm using RPCs to do the
same thing, but it's not quite as convenient as I would like it to be.
With a working JDBC I could put up screens in no time with Java and then
worry about building the intelligence into the database with postgres's C
inclusion functionality. If I were really stretching it, I could even tie
that into the free CORBA servers I've seen for Linux so that it could be
an enterprise solution demo (postgres is a little slow for anything larger
than a small office).

 
 
 

Post by Christopher B. Brow » Sun, 03 Aug 1997 04:00:00



posted:


>> An application area that is missing under Linux seems to be that of
>> "document management."  (MH people, hold on...  Relevance does
>> appear...)
>How true! Seems to me that the heart of the matter is the lack of good
>tools to work with a fully functional database from a reasonable front
>end. My approach has been to spend some time getting postgres up to speed
>so that I can access it through JDBC. Right now I'm using RPCs to do the
>same thing, but it's not quite as convenient as I would like it to be.
>With a working JDBC I could put up screens in no time with Java and then
>worry about building the intelligence into the database with postgres's C
>inclusion functionality. If I were really stretching it, I could even tie
>that into the free CORBA servers I've seen for Linux so that it could be
>an enterprise solution demo (postgres is a little slow for anything larger
>than a small office).

I've been sketching out some thoughts; they ramble somewhat at this
point.  I'd say that's OK for now; functionality shouldn't be locked
down too early...

# $Id: imaging.txt,v 1.1 1997/08/02 04:10:08 cbbrowne Exp cbbrowne $

Fields that will be needed:
--------------------------------------------------------------------
Within Document
--------------------------------------------------------------------
At the "master document" level
Document ID:
Date Created:
Date Modified:
Date Accessed:
Archival Action: (delete|archive)
Archival Basis: (After Creation, After Modification, After Access)
Category: (User defined set of categories...)
Subcategory: (User defined set of categories...)
--------------------------------------------------------------------
Optional fields:
Brief Description:
Extended Description:
Extended Description:
Extended Description:
...
Related Document: Relationship, Document ID
Related Document: Relationship, Document ID
Related Document: Relationship, Document ID
...
--------------------------------------------------------------------
Then, at the "component" level:
- File Name/ID in "repository"
- Public file name (i.e., what do we call it when it's checked out?)
   ---> This should be based on the *last* filename and path that
        was used...
- File "type" (preferably based on magic number)
- Versioning Policy: (Text-->RCS, Duplicate (.v1, .v2, ..., .vn))
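
To make the "Archival Action" and "Archival Basis" fields concrete,
here is a small Python sketch of how they might drive an aging
decision.  The notes above do not say how long a document is kept, so
the retention_days field is purely hypothetical, as are the class and
function names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MasterDocument:
    document_id: str
    date_created: date
    date_modified: date
    date_accessed: date
    archival_action: str  # "delete" or "archive"
    archival_basis: str   # "creation", "modification" or "access"
    retention_days: int   # hypothetical: the notes give no period


def due_action(doc: MasterDocument, today: date) -> Optional[str]:
    """Return the archival action if the document has aged past its
    retention period, otherwise None."""
    # Pick the date that the Archival Basis field says to measure from.
    reference = {
        "creation": doc.date_created,
        "modification": doc.date_modified,
        "access": doc.date_accessed,
    }[doc.archival_basis]
    if today - reference > timedelta(days=doc.retention_days):
        return doc.archival_action
    return None
```

A periodic job could call due_action() on every master document and
hand the results to whatever moves files to archival media or purges
them.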

Need to have several associations:
a) Use "magic number" library info to determine file types...
---> Need to have a "registry" of file "launchers"
b) Printing
---> This *ought to* amount to dropping files into "magicfilter"
---> Certain objects should be marked as "unprintable," and require
     that the "view/browse/edit" application be invoked...
c) Document archive DB
---> Should get represented as colon-separated fields; looks like
     mail...
---> Periodically cross-check to ensure that the underlying files
     haven't been messed with...
     If they've been updated, warn the user, request DB update...

d) Archiving to slower media, /dev/null
---> Need configurable policies

e) File management
---> option of:
     i) Keep the file "in place," let the user manage it where it is;
     (to be discouraged)
     ii) Move the file into "the repository" with check in and check
     out...
f) Documents with multiple components
   e.g. - A fax that contained 8 .GIF files
        - Email message with 4 attachments
g) Nested Documents
        - Allow documents to reference documents...
h) Work Flow???
  ---> Build a 'to-do' list with lists of documents
  ---> Use vSchedule data format to describe when events should take
       place to documents <http://www.versit.com/pdi/>
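
Item a) above could be sketched roughly as follows in Python: sniff the
leading bytes of a component file, then look the type up in a registry
of launchers.  The signatures are the usual leading bytes for these
formats, but the registry contents (xv and gs as viewer commands) and
the function names are assumptions for illustration only.

```python
# (leading bytes, type name).  A real system would use a full
# magic-number database; this list is deliberately tiny.
SIGNATURES = [
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"%!PS", "postscript"),
    (b"%PDF-", "pdf"),
]

# Hypothetical launcher registry: type name -> viewer command.
LAUNCHERS = {
    "gif": ["xv"],
    "jpeg": ["xv"],
    "postscript": ["gs"],
}


def sniff_type(path):
    """Guess a file type from its first bytes, or return None."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for signature, type_name in SIGNATURES:
        if head.startswith(signature):
            return type_name
    return None


def launch_command(path):
    """Return the argv list that would display `path`, or None if the
    type is unknown or has no registered launcher."""
    viewer = LAUNCHERS.get(sniff_type(path))
    if viewer is None:
        return None
    return viewer + [str(path)]
```

The returned list is in the form subprocess.run() expects, so the
front end could launch the viewer directly once a document is found.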

Thought:
At check-in time, see if there has ever been a file with the same
file name.  If there has, propose that the current file be checked in
as a new version.  Note that this indicates that there could be multiple
documents with the file name...
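
A minimal Python sketch of that check-in rule, using an in-memory
stand-in for the document archive DB.  The Repository class, its method
names, and the generated document IDs are all assumptions made for the
example.

```python
class Repository:
    """In-memory stand-in for the document archive DB (an assumption)."""

    def __init__(self):
        self._versions = {}  # document ID -> number of versions stored
        self._by_name = {}   # public file name -> list of document IDs

    def candidates(self, public_name):
        """Document IDs ever checked in under this file name.  The
        front end would offer these as "new version of...?" choices."""
        return list(self._by_name.get(public_name, []))

    def check_in(self, public_name, document_id=None):
        """Check a file in and return (document_id, version).

        With document_id=None a new document is created; otherwise a
        new version of the named existing document is recorded.
        """
        if document_id is None:
            document_id = f"doc{len(self._versions) + 1:09d}"
            self._versions[document_id] = 0
        elif document_id not in self._versions:
            raise KeyError(document_id)
        self._versions[document_id] += 1
        ids = self._by_name.setdefault(public_name, [])
        if document_id not in ids:
            ids.append(document_id)
        return document_id, self._versions[document_id]
```

If the user declines the proposal and checks the file in as a new
document, candidates() returns more than one ID for that name, which is
the "multiple documents with the file name" case noted above.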