Feedback on XML for Analysis draft specification

Post by steve.tol.. » Sat, 10 Feb 2001 07:19:25



Here is my feedback on the draft 0.6 version of the XML for Analysis
specification posted under http://www.microsoft.com/data/
This is the first public specification I have seen on an important
area: how to present a multidimensional cube as XML.  Thanks for
posting this.  I am replying in the Usenet newsgroup
microsoft.public.data.xmlanalysis as requested,
and have set this to be the Followup-To group.  But I am
also cross-posting to a few other newsgroups that are relevant and
which are much more widely read.

1.  Use of ordinals considered harmful

Unfortunately this proposal seems too tightly coupled to the Microsoft
implementation.  Instead of being based on the human-understandable
ids and names for dimensions, members, etc., the keys are "ordinals",
the small integers starting at 0, taken directly out of the
implementation in Microsoft SQL Server.  It seems better to use,
e.g., MemberUniqueName instead.

I think that this may cause the following serious problem: changes to
the cube may invalidate a result stored as XML.  I think this will
prevent reading a saved XML file back into the new structure.  Or,
worse, it will be read back in again, but with a different meaning!
I am not sure exactly what kinds of changes will cause this problem:
almost certainly adding members anywhere but at the end, but possibly
also adding levels, hierarchies, or dimensions.

Using ordinals has another drawback: for the human reader they add
a lot of clutter.
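
To make the concern concrete, here is a minimal Python sketch. It assumes,
as this section does, that an ordinal is simply a member's position in the
cube; the member names are made up and nothing here is taken from the spec.

```python
# Hypothetical illustration: member names are invented, not from the spec.

def ordinals(members):
    """Map each member's unique name to its 0-based position."""
    return {name: i for i, name in enumerate(members)}

old_members = ["[Time].[1998]", "[Time].[2000]"]
# Later the cube changes: 1999 is inserted in the middle.
new_members = ["[Time].[1998]", "[Time].[1999]", "[Time].[2000]"]

# A saved result that recorded only the ordinal of the member it refers to.
saved_ordinal = ordinals(old_members)["[Time].[2000]"]

# Resolving that ordinal against the changed cube silently picks another member.
resolved_by_ordinal = new_members[saved_ordinal]

# Resolving by unique name is unaffected by the insertion.
saved_name = "[Time].[2000]"
resolved_by_name = saved_name if saved_name in new_members else None
```

Under these assumptions the ordinal resolves to [Time].[1999], a different
member, while the unique name still resolves to [Time].[2000].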

2. Not exporting all dimensional metadata considered harmful

Another limitation, potentially very serious, is that this proposal
seems totally oriented towards presenting the result of a query (which
may be a cube or a relational table).  There is no way to export the
actual "schema" for the cube itself, i.e. all its dimensional metadata.
Specifically, if there is a calculated member the XML will only show the
result, not the formula.  If this limitation is intentional, please
change your minds, as that is not what is needed in many cases.

3. Sharing dimensional metadata

It is likely that many result cubes will want to "share" the same
dimensional metadata.  Is there any way to export this into a separate
XML instance document that can be referred to by the instance
documents containing the results?

4. Validation

Suppose I want to validate that each cell in the result is identified by
a valid "tuple".  In theory the XML Schema key and keyref facility
could be used for this.  Can you provide a schema that supports this?
On the other hand, doing this validation might have poor performance.
Can there also be a schema that does not define these constraints?

Another constraint we might want to enforce is that a tuple
(identifying one cell) can appear at most once.
Please show how this can be enforced using XML Schema.
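
For illustration only, the two constraints can be written as a plain Python
check rather than as an XML Schema. The element and attribute names below
(Tuple, Cell, Ordinal, TupleOrdinal) are hypothetical and are not taken
from the draft; the first check is what keyref would express, the second
is what a uniqueness constraint would express.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical result document; names are illustrative, not from the spec.
DOC = """
<Result>
  <Axis>
    <Tuple Ordinal="0"/>
    <Tuple Ordinal="1"/>
  </Axis>
  <Cells>
    <Cell TupleOrdinal="0"><Value>100</Value></Cell>
    <Cell TupleOrdinal="1"><Value>200</Value></Cell>
    <Cell TupleOrdinal="1"><Value>300</Value></Cell>
    <Cell TupleOrdinal="5"><Value>400</Value></Cell>
  </Cells>
</Result>
"""

def check_cells(xml_text):
    """Return (dangling, duplicated) tuple references found among the cells."""
    root = ET.fromstring(xml_text)
    known = {t.get("Ordinal") for t in root.iter("Tuple")}
    refs = [c.get("TupleOrdinal") for c in root.iter("Cell")]
    # keyref-like check: every cell must point at an existing tuple.
    dangling = sorted({r for r in refs if r not in known})
    # unique-like check: no tuple may identify more than one cell.
    duplicated = sorted(r for r, n in Counter(refs).items() if n > 1)
    return dangling, duplicated

dangling, duplicated = check_cells(DOC)
```

Here the reference to tuple 5 is dangling and tuple 1 is used twice, so
both checks report a violation.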

5. Some misc. questions and requests.

Why does a name with a single underscore in it e.g. foo_bar get
changed to doubled underscores, e.g. foo__bar?  

Why is the entire data enclosed in a CDATA section?  This seems to
make it much more difficult to process.  (Or am I misunderstanding
something?)  What is done if a data item contains the string "]]>"?
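
As background to the "]]>" question: a CDATA section cannot contain its own
terminator, so a common workaround (not necessarily what the draft does) is
to split the text into two adjacent CDATA sections at that point. A minimal
Python sketch, with a made-up <Data> element and payload:

```python
import xml.etree.ElementTree as ET

def to_cdata(text):
    """Wrap text in CDATA, splitting wherever the terminator "]]>" occurs."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Hypothetical payload that happens to contain the CDATA terminator.
payload = "a[b[0]]>c"
document = "<Data>" + to_cdata(payload) + "</Data>"

# A conforming parser joins the adjacent sections back into the original text.
round_tripped = ET.fromstring(document).text
```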

Please provide one coherent example where we see the original cube,
the query, and the result in XML (both as a cube and tabular).

There seems to be a dependency on having a web server -- is there any
way to access the database directly, analogous to ODBC, or is a web
server truly needed?

The discussion of BEGIN_RANGE and END_RANGE on p. 30 is unclear,
e.g. the examples include a range (-1,0) meaning from undefined to the
first element.  Is this just a somewhat strange syntax meaning the
same as (0,0)?  In the example it says the range (2,1) is invalid (as
the endpoints are out of order).  What will the query do -- raise an
error or ignore the range?

Please feel free to contact me for clarification or additional details.

Hopefully helpfully yours,
Steve
--

Fidelity Investments   82 Devonshire St. R24D    Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.

 
 
 

Post by Akshai Mirchandan » Sun, 11 Feb 2001 16:32:21


Hi Steve,

Thanks for your feedback. Some answers are below.

1. Use of ordinals harmful

The reason for using ordinals is to make it easy to resolve the
references from a cell to its axis tuples. The Member Unique Name is still
one of the properties of the members belonging to a tuple.
For example, a tuple on an axis can be a cross-join of [Time] and
[Customers]. In this case, each tuple on the axis is really made up of two
members, for example [Time].[1999] and [Customers].[All Customers].[USA].

This makes it impossible to use a single Member Unique Name to uniquely
identify a tuple. A query returning filtered or "non-empty" crossjoins would
be missing many combinations of the join, and this makes it
difficult to return the axes like:

Axis:
Time: { [1998], [1999] }
Customers: { [USA], [CA] }

What if the combination of [1998] and [CA] was filtered out as part of the
query? Using ordinals to identify valid combinations of members on the axis
is a solution. The human (un)readability of ordinals is a good point --
however, the reason for having the tuple ordinals there is to make machine
readability more efficient.
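
A rough Python sketch of this idea follows. The data layout, the filter and
the cell values are assumptions made for illustration and do not reflect
the actual wire format.

```python
from itertools import product

time = ["[Time].[1998]", "[Time].[1999]"]
customers = ["[Customers].[USA]", "[Customers].[CA]"]

# Hypothetical filter: the ([1998], [CA]) combination is removed by the query.
filtered_out = {("[Time].[1998]", "[Customers].[CA]")}

# The axis lists only the surviving tuples; the list index is the tuple ordinal.
axis_tuples = [t for t in product(time, customers) if t not in filtered_out]

# Each cell refers to its axis tuple by ordinal (values are made up).
cells = [
    {"tuple_ordinal": 0, "value": 100},
    {"tuple_ordinal": 2, "value": 250},
]

def members_of(cell):
    """Resolve a cell to the members of its axis tuple with one list lookup."""
    return axis_tuples[cell["tuple_ordinal"]]
```

Because the axis enumerates only the tuples that exist, a missing
combination costs nothing, and a cell needs a single integer instead of one
unique name per dimension.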

There are clearly several alternatives to this. For example, I could do
this:
<Result>
    <Time value="1998">
        <Customer value="USA">
            <Value>100</Value>
            <FormattedValue>100.00</FormattedValue>
        </Customer>
        <Customer value="CA">
            <Value>100</Value>
            <FormattedValue>100.00</FormattedValue>
        </Customer>
    </Time>
</Result>

Unfortunately, this very simplistic example does not address many problems.
The most important problem is that this method effectively loses the
navigation capabilities that axes give to the query. For example,
representing repeated members on an axis or empty cells without exploding
the size becomes impossible.

2. Not exporting all dimensional metadata considered harmful

I agree that this design is focused on returning results of queries.
However, I disagree that there is no way to access the metadata. The
DISCOVER method in the spec is designed specifically for querying the schema
of the database -- we implement the OLEDB for OLAP and DM schema rowsets.
The spec does allow us the flexibility to gradually add more advanced
schemas which will return more complex metadata.

3. Sharing dimensional metadata

Accessing dimensional metadata is possible by querying the schema rowsets --
the Execute method still needs to return complete results. It cannot assume
that the client has all the dimension meta-data -- it would be a really
useful optimization but risky because the client may cache meta-data that
has since expired on the server.

4. Validation

Let's assume that the server always generates results that are valid for the
schema. Then the value of using the schema lies in the following:

- Obtain data types of member properties and cell properties
- Use ID/IDREF to be able to quickly look up the axis tuples.

The first point is handled by generating inline the instance types for the
member properties and cell properties. Properties are very often variant
type -- they could be integers or floats or doubles. Therefore, it makes
sense to inline their types.
The second point is under consideration and we may add it to the schema --
although not for the beta.

5. Some misc. questions and requests.

The underscore encoding has changed in the meantime -- the encoding we
follow now is the SQL Server 2000 encoding. It looks like _xHHHH_ where HHHH
is the hexadecimal Unicode value of the character that is invalid in an XML
name. Underscores are only encoded if there is an ambiguity (e.g. if the
next character is 'x' followed by 4 hexadecimal digits followed by _).
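
A minimal Python sketch of this encoding rule, based only on the description
above. The function name encode_name is made up, and the set of valid name
characters is deliberately simplified to ASCII, so treat it as an
approximation rather than the server's actual algorithm.

```python
import re

# Simplification (assumption): only ASCII letters, digits, '_', '-' and '.'
# count as valid name characters, and a name may not start with a digit,
# '-' or '.'. Characters above U+FFFF are not handled by this sketch.
_ESCAPE_LOOKALIKE = re.compile(r"x[0-9A-Fa-f]{4}_")

def _valid(ch, first):
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return True
    return (not first) and ch.isascii() and (ch.isdigit() or ch in "-.")

def encode_name(name):
    """Encode invalid characters as _xHHHH_; escape '_' only when ambiguous."""
    out = []
    for i, ch in enumerate(name):
        if ch == "_" and _ESCAPE_LOOKALIKE.match(name, i + 1):
            # The underscore would otherwise look like the start of an escape.
            out.append("_x005F_")
        elif _valid(ch, first=(i == 0)):
            out.append(ch)
        else:
            out.append("_x%04X_" % ord(ch))
    return "".join(out)
```

With this rule foo_bar is left unchanged, a space becomes _x0020_, and only
an underscore that could be misread as the start of an escape sequence is
itself encoded.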

The CDATA section was for interoperability problems we faced with the SOAP
clients out there at the time of the preview. We have since removed the
CDATA tags.

A server supporting SOAP is necessary, whatever the transport protocol. I
could use HTTP, SMTP, or any other protocol supported by SOAP. If the
database intrinsically supports SOAP then a web server is not needed.

I believe the section on BEGIN_RANGE and END_RANGE has been re-worded with a
table to explain the behavior. If the range is invalid, the query will raise
an error.

Hope this helps.
Akshai
