Document in several languages

Document in several languages

Post by Jonah Olsso » Wed, 05 Feb 2003 23:07:21



Hello guys,

I'm designing a database that will contain documents in several languages
and each document can have several translations. For displaying these
documents I have a web application (in ASP.NET).

Now, as standard, this application displays documents in English, but from
each document page I want to have a listbox for selecting other translations
of the same document. But how can each document know which other
translations it has? If I create another table that contains
Document-Language relations, I only know which language ONE document has.
The following is the Documents-table and Languages-table:

CREATE TABLE [Documents] (
 [document_id] [int] IDENTITY (1, 1) NOT NULL ,
 [header] [nvarchar] (100) NOT NULL ,
 [body] [ntext] NOT NULL ,
 [created_date] [datetime] NOT NULL ,
 [updated_date] [datetime] NULL ,
 [author_email] [varchar] (255) NULL ,
 CONSTRAINT [PK_Documents] PRIMARY KEY  CLUSTERED
 (
  [document_id]
 )  ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

CREATE TABLE [Languages] (
 [language_id] [varchar] (5) NOT NULL ,
 [language_name] [nvarchar] (20) NOT NULL ,
 CONSTRAINT [PK_Languages] PRIMARY KEY  CLUSTERED
 (
  [language_id]
 )  ON [PRIMARY]
) ON [PRIMARY]
GO

Thanks for any kind of help.

Regards,
Jonah

 
 
 

Document in several languages

Post by Anith Se » Thu, 06 Feb 2003 00:52:50


Quote:>> If I create another table that contains Document-Language

relations, I only know which language ONE document has. <<

You can have a many-to-many relation in your link table. Make sure
you have (DocumentIdentifier, LanguageIdentifier) as your key. This
allows your to identify all languages a document can be translated
to and all documents which are translated a specific language.

--
- Anith
(Please respond only to newsgroups)

 
 
 

Document in several languages

Post by Jonah Olsso » Thu, 06 Feb 2003 01:19:24


Hi Anith and thanks for your reply.

How would you suggest my link table to look like? Do I need a second column
in my Documents table containing some kind of document "group" id since
document_id is unique?

Thanks!
Regards,
Jonah



Quote:> >> If I create another table that contains Document-Language
> relations, I only know which language ONE document has. <<

> You can have a many-to-many relation in your link table. Make sure
> you have (DocumentIdentifier, LanguageIdentifier) as your key. This
> allows your to identify all languages a document can be translated
> to and all documents which are translated a specific language.

 
 
 

Document in several languages

Post by Anith Se » Thu, 06 Feb 2003 01:40:31


Generally speaking, design questions are difficult to answer due to
the fact that, through Newsgroup posts, a specific business model and
the data requirements cannot be fully understood by others.

Anyway, here is a guess:

CREATE TABLE DocumentLanguages (
     document_id INT NOT NULL
          REFERENCES Documents (document_id)
              ON DELETE CASCADE
              ON UPDATE CASCADE,
     language_id VARCHAR(5) NOT NULL
          REFERENCES Languages (language_id)
              ON DELETE CASCADE
              ON UPDATE CASCADE,
     ...,
     PRIMARY KEY (document_id, language_id)
)
GO

--
- Anith
(Please respond only to newsgroups)

 
 
 

Document in several languages

Post by Jonah Olsso » Thu, 06 Feb 2003 02:52:29


Thanks for your time and help! Replies in post..

Quote:> Generally speaking, design questions are difficult to answer due to
> the fact that, through Newsgroup posts, a specific business model and
> the data requirements cannot be fully understood by others.

Offcourse, but in this case I posted DDL for both tables so that people
don't have to guess what I'm talking about, I also explained what problem I
had and what I wanted to acomplish.

Quote:> CREATE TABLE DocumentLanguages (
>      document_id INT NOT NULL
>           REFERENCES Documents (document_id)
>               ON DELETE CASCADE
>               ON UPDATE CASCADE,
>      language_id VARCHAR(5) NOT NULL
>           REFERENCES Languages (language_id)
>               ON DELETE CASCADE
>               ON UPDATE CASCADE,
>      ...,
>      PRIMARY KEY (document_id, language_id)
> )
> GO

This is pretty much what I tested with first that only worked with one
document - one language. Hmm.. maybe I didn't explain in a proper way =/
This is what I thought of:

Document 1: English
Document 1: Swedish
Document 1: French

Document 2: English
Document 2: Swedish
Document 2: German
Document 2: French

Document 3: English
Document 3: Swedish

When I view each document on the web page, I want to get a list of which
other languages this particular document is available in.

Thanks!
Regards,
Jonah

 
 
 

Document in several languages

Post by Anith Se » Thu, 06 Feb 2003 03:02:38


Quote:>> When I view each document on the web page, I want to get a list

of which other languages this particular document is available in. <<

You can formulate something alone the lines of:

SELECT Languages.*                -- use column names
  FROM Documents
 INNER JOIN DocumentLanguages
    ON Documents.Document_id =  DocumentLanguages.Document_id
 INNER JOIN Languages
    ON DocumentLanguages.language_id = Languages.language_id

                    --- assuming this a natural key

To simplify this further if you have the Document_id :

SELECT Languages.*                -- use column names
  FROM DocumentLanguages
 INNER JOIN Languages
    ON DocumentLanguages.language_id = Languages.language_id

--
- Anith
(Please respond only to newsgroups)

 
 
 

Document in several languages

Post by Anith Se » Thu, 06 Feb 2003 03:07:19


A mistake in my second query! This should do.

SELECT Languages.*
  FROM DocumentLanguages
 INNER JOIN Languages
    ON DocumentLanguages.language_id = Languages.language_id

--
- Anith
(Please respond only to newsgroups)

 
 
 

Document in several languages

Post by Jonah Olsso » Thu, 06 Feb 2003 06:12:59


Hi Anith,

I'm sorry, but that's not quite what I meant. I probably have a design
problem..
Let me just try to explain it once more (I hope I'll be more specific this
time :-)
Here's the deal:

My customer has a bunch of press releases. Each of them can be translated
into several languages. When inserting these into the database (using an
administration web), each press release will be inserted as a new "document"
along with a language setting that tells the system (and database) which
language the press release was written in.

The customer also want to insert one or more translations of the press
release into the database. Here's where I think I have a design problem. If
the customer inserts a Swedish translation of a press release first, how
should the next translation be connected and related to the Swedish copy?
How do they know that they belong together? This is vital since a list of
available translations will be listed together with the press release in the
application.

Should I have two indexes within the Documents table? One unique id for the
current translation and one for the set of translations?

For example (notice the ID for the set of translations; PressID):

ID    PressID     Title
Language
--------------------------------------------------------------------
10    10             New London office!
English
11    10             Nytt kontor i London!
Swedish
12    10             Nueva oficina en Londres!                      Spanish
...
26    18             I don't know any more headlines..          English
27    18             Jag vet inga fler rubriker..
Swedish
28    18            No s ms ttulos
Spanish
...

This way I might be able to track which translations a press release has.
But how should the PressID be generated? Is this really a way to do it?

If you know what I'm trying to explain, what would you suggest?
Thanks for your help!! Very appreciated!

Best regards,
Jonah Olsson

 
 
 

Document in several languages

Post by Anith Se » Thu, 06 Feb 2003 09:26:09


Quote:>> But how should the PressID be generated? Is this really a way

to do it? <<

This is a situation, where surrogate keys hamper the database design
and the proper way of identifying relations.

Since you have a database generated key (Identity) you cannot validate
that value against the real business model. To put this in another way,
if you do not have an IDENTITY column in your Document table, how would
you identify a specific press release in the real world? There may be
another attribute in the real business model which you may be missing
like a Press Release Number or Release Volume Number or something like that.

If you are relying on a database-generated primary key, then you have
no way of relating that key to other real world entities.

If I understood you right, this is more like a business model issue.
Your client should give you sufficient info. regarding, a specific
press release and how different translations of a single press release
are related. Databases cannot do it for the real world business.

--
- Anith
(Please respond only to newsgroups)

 
 
 

Document in several languages

Post by Joe Celk » Thu, 06 Feb 2003 11:21:45


Two thoughts:

1) There is an ISO code for languages that you should be using instead
of what you have.

2) The original language is an attribute of the original press release
that cannot chnage, so it should be in a column in that table.  So it
sounds like you have a compound key (press_release_id, language_code).  

3) My dislike of IDENTITY for a key is well-documented in several
thousand rants on newsgroups <g>.  How do you actually encode an
identifier for them now?  

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

 
 
 

Document in several languages

Post by Steve Kas » Thu, 06 Feb 2003 12:12:22


Jonah,

  One of the entities in your real-world model is "press release", and
it looks like
you have no table to hold the set of press releases.  Very simplified,
your model
could probably be implemented with something like this:

Entities:
  Press release
    attributes:
      key (you need something here)
      creation date
      creation language (consider one version the official one)
      author (etc., all of these are non-language-dependent attributes
of a release)
  Language
    attributes:
      name  (this could be the key, or you can use something shorter)

Relationship:
  Press releases <=many to many=> Languages

The relationship will be implemented as a table like what you have now,
and while it contains one row for each document, I would not think of it
as a documents table, but instead think of the press release as the
document, and the content in any language as an attribute of the
relationship
between the document and that language.

  PressRelease_Language
    attributes:
      PRkey references PressRelease(key)
      language references Language(name or key)
      translation date (null for the official one)
      translator  (null for the official one)
      content -- the language-specific text of the release
      primary key (PRkey, language)

I can't think of anything you've mentioned that can't be represented
here.  If you need the pull-down list of languages, it's

select language from PressRelease_Language
where PRkey = 11029

You will have to implement logic that makes sure when a
new press release is created, it creates new rows in both
the PressRelease table (which includes creating a new unique
key for the release) and in the PressRelease_Language table
for the original language.

Just as in the real world the customer knows whether a newly
produced document is an original press release (it came from the
marketing department) or a translation (it came from the translations
department), you will need to treat these separately:

create procedure addPressRelease (




) as





There are other approaches.  You could create a "originals" view off the
two tables
containing what's in PressRelease along with the content, and insert the
original into that view, adding an instead of trigger to make both
insertions
from the data in the inserted table, for example, which would be a bit
simpler
if you have an identity column on the PressRelease table instead of a
guid for
the key, and you should validate the language value at some point, etc.

Steve Kass
Drew University


>Hi Anith,

>I'm sorry, but that's not quite what I meant. I probably have a design
>problem..
>Let me just try to explain it once more (I hope I'll be more specific this
>time :-)
>Here's the deal:

>My customer has a bunch of press releases. Each of them can be translated
>into several languages. When inserting these into the database (using an
>administration web), each press release will be inserted as a new "document"
>along with a language setting that tells the system (and database) which
>language the press release was written in.

>The customer also want to insert one or more translations of the press
>release into the database. Here's where I think I have a design problem. If
>the customer inserts a Swedish translation of a press release first, how
>should the next translation be connected and related to the Swedish copy?
>How do they know that they belong together? This is vital since a list of
>available translations will be listed together with the press release in the
>application.

>Should I have two indexes within the Documents table? One unique id for the
>current translation and one for the set of translations?

>For example (notice the ID for the set of translations; PressID):

>ID    PressID     Title
>Language
>--------------------------------------------------------------------
>10    10             New London office!
>English
>11    10             Nytt kontor i London!
>Swedish
>12    10             Nueva oficina en Londres!                      Spanish
>....
>26    18             I don't know any more headlines..          English
>27    18             Jag vet inga fler rubriker..
>Swedish
>28    18            No s ms ttulos
>Spanish
>....

>This way I might be able to track which translations a press release has.
>But how should the PressID be generated? Is this really a way to do it?

>If you know what I'm trying to explain, what would you suggest?
>Thanks for your help!! Very appreciated!

>Best regards,
>Jonah Olsson

 
 
 

Document in several languages

Post by Steve Kas » Thu, 06 Feb 2003 12:24:00


Care to suggest what natural key is best for press_release_id?

This looks to me like a nice opportunity for an artificial key.  Press
releases aren't books, so there's no ISBN to use; they aren't cans
of beans, so no UPC, they aren't US residents, so no SSN, they don't
have DNA, so no gene sequence...   Press releases don't have to
have titles...

Conceptually I like a guid better than an (increasing) integer, because
it's less tempting to try to order by, but something like
reverse(identity(int,5555555,999)) might be a good way to save space
and still distribute things around.

SK


>Two thoughts:

>1) There is an ISO code for languages that you should be using instead
>of what you have.

>2) The original language is an attribute of the original press release
>that cannot chnage, so it should be in a column in that table.  So it
>sounds like you have a compound key (press_release_id, language_code).  

>3) My dislike of IDENTITY for a key is well-documented in several
>thousand rants on newsgroups <g>.  How do you actually encode an
>identifier for them now?  

>--CELKO--
> ===========================
> Please post DDL, so that people do not have to guess what the keys,
>constraints, Declarative Referential Integrity, datatypes, etc. in your
>schema are.

>*** Sent via Developersdex http://www.developersdex.com ***
>Don't just participate in USENET...get rewarded for it!

 
 
 

Document in several languages

Post by lindawi » Thu, 06 Feb 2003 16:14:20


Jonah Olsson,

Quote:> My customer has a bunch of press releases. Each of them can be
> translated into several languages. When inserting these into the
> database (using an administration web), each press release will be
> inserted as a new "document" along with a language setting that
> tells the system (and database) which language the press release
> was written in.

> This way I might be able to track which translations a press
> release has. But how should the PressID be generated? Is this
> really a way to do it?

Steve's example data model is exactly how multilingual text is
handled in the real world. You need to store the document info
in two tables, one for the general attributes about the press
release that applies to all language versions, and a second table
to store the language-specific information, including the body
of the document.

One more thing you may want to consider. Can you really make a
clear, unambiguous assignment for the original language of a
document? Sometime it is not so clear-cut in the real world.
You may have several different language versions of a document
that were all written concurrently but independently by several
people without seeing an "original". They just had a bulleted
list of facts they needed to include in the release. Or they
were all written by one person in no particular order. When the
finished documents are compared, they all convey the same
information, which is important, but stylistically there is a
great deal of variation. Which one is the "original" then?
We might want to characterize them as just a set of documents
that belong together and leave it at that.

For example, you have an important press release announcing
that an international team has just set a new world record in
non-stop conga line dancing. The announcement will be released
simultaneously in several languages. However, you wish to
customize the text for each language market, emphasizing the
contribution of the "native son or daughter" to provide some
local interest. These documents cannot really be considered
true translatons because the content is slightly different
for each one.

Only the customer can decide if a document is a translation
or simply a "related" document. You should check with your
customer, to see if they anticipate such a situation.

An internallly generated primary key value as a surrogate key
is fine in your case. Since a press release is just free-form
text and can be anything the customer wants it to be, you probably
would need some descriptive text, a date, and perhaps some names
of people or places to uniquely identify each press release.
A bit too bulky for a primary key. Put a unique constraint on
the columns that identify a press release and use a surrogate
key that can be referenced in other tables.

A few more examples of essentially the same data model. These
deal with multilingual product descriptions, but it is all the
same problem.

http://google.com/groups?threadm=uiRG30bNBHA.2048%40tkmsftngp03
http://google.com/groups?threadm=ejba%24bp6BHA.1792%40tkmsftngp02
http://google.com/groups?threadm=Oy%24br%24pLBHA.1268%40tkmsftngp04

Linda

 
 
 

Document in several languages

Post by Jonah Olsso » Thu, 06 Feb 2003 20:10:20


Hi Joe and thanks for your response.

1) Why is there a problem with the language table? The fields are made to
fit the Globalization functions in .NET Framework, so as a key field I would
use standard language codes (like 'en-US' and 'sv-SE' etc.).

2) Yes. Each press release has one language_code related to the language
table, and also a press_release_id. But that's not the problem as I can see
it. I need to know how to group the translations so that the original press
release knows which other translations there is available. And how do I
define the difference between an original press release and a translation in
the documents table?

3) :-)

/Jonah



Quote:> Two thoughts:

> 1) There is an ISO code for languages that you should be using instead
> of what you have.

> 2) The original language is an attribute of the original press release
> that cannot chnage, so it should be in a column in that table.  So it
> sounds like you have a compound key (press_release_id, language_code).

> 3) My dislike of IDENTITY for a key is well-documented in several
> thousand rants on newsgroups <g>.  How do you actually encode an
> identifier for them now?

 
 
 

Document in several languages

Post by Jonah Olsso » Thu, 06 Feb 2003 20:13:59


I agree. Although there is a very good example in using increasing integers
without using IDENTITY at:
http://www.sqlmag.com/articles/index.cfm?articleid=16354

/Jonah



Quote:> Care to suggest what natural key is best for press_release_id?

> This looks to me like a nice opportunity for an artificial key.  Press
> releases aren't books, so there's no ISBN to use; they aren't cans
> of beans, so no UPC, they aren't US residents, so no SSN, they don't
> have DNA, so no gene sequence...   Press releases don't have to
> have titles...

> Conceptually I like a guid better than an (increasing) integer, because
> it's less tempting to try to order by, but something like
> reverse(identity(int,5555555,999)) might be a good way to save space
> and still distribute things around.

 
 
 

1. Access to several XML-islands from a HTML-document

Hi there!

I am a XML "greenhorn" so maybe my question is very simple. I want to
make a database query to XML documents which are inserted into a HTML
document as islands (is this the correct english term?). The
HTML document looks like this:

<html>
<head>
<title>Datenbankabfrage</title>

  <script language="JavaScript" src="abfrage_vi.js">
  </script>

</head>
<body bgcolor=#ffff99 onload="ausgabe()">
  <XML ID="insel1" src="insel1.xml"></XML>
  <XML ID="insel2" src="insel2.xml"></XML>
  <XML ID="insel3" src="insel3.xml"></XML>
  <XML ID="insel4" src="insel4.xml"></XML>
</body>
</html>

The database query is in the javascript "abfrage_vi.js". To search in
each XML document I use the following code:

XMLInseln = document.all.tags("XML");
AnzahlInseln = XMLInseln.length;

  document.write("<BODY bgcolor=#ffff99><CENTER><DIV
STYLE='font-size:20pt; color:midnightblue;
font-weight:bold;'>Suchergebnisse</DIV>");
  document.write("<TABLE border=1 cellpadding=5><TR
bgcolor=#88AADD><TH>Name</TH>
    <TH>Vorname</TH><TH>Stra?e</TH><TH>Ort</TH><TH>Telefon</TH></TR>");

        for (f=0; f<AnzahlInseln; f++)
        {
          WurzelElement = XMLInseln[f].documentElement;
          AlleKnoten = WurzelElement.childNodes;
          AnzahlKnoten = WurzelElement.childNodes.length;
          ......
          ......

What happens is, that I get the header "Suchergebnisse" and, after
refreshing the page view, the headline of the table. If I try to output
a certain element of XMLInseln
(document.write(XMLInseln[0])) I get an "undefined" on the screen, but
the output of "AnzahlInseln" is the correct number.

Has anyone a clue, where the problem is? Thank you very much.
Claudia

--
--------------------------------------------------------
Claudia B?nsch
Institut fr Biochemie
Universit?t zu K?ln        Tel. 0221-470 6435
Zlpicher Str. 47          Fax. 0221-470 5092

--------------------------------------------------------

2. SQL Memory usage

3. VFP: Application in several languages

4. Trouble installing VB app + DLL + DEnv on client PC

5. Several tables or Several databases?!

6. Access Violation SQL2000 SP1

7. Need help, Transfer record with several infos in several records

8. IDS 7.31.UC5

9. Using one or several tables for several cust.

10. Several DBs vs. One DB with several schemas

11. Generating word document files from embedded documents in Access or SQL Server DB.

12. Filemaker record to word document and word document list

13. converting xml documents to html documents