Can this be done with SQL??

Can this be done with SQL??

Post by Mark Davi » Wed, 21 Aug 2002 04:50:19



I have a 30 million record table containing (in sequence) all of the
words from a set of texts.  Looks like the following:

ID     word
1     Once
2     upon
3     a
4     time
5     there
6     was
. . . .

I want to be able to find sequences of words, i.e. all of the three
word strings having the pattern:

ate the [xxx]

or all of the four word strings with the pattern:

threw the [xxx] away

The problem, of course, is that the words are in different (though
adjacent and sequential) records.  I've been playing around with
subqueries, i.e.:

select word
from [table]
where ID in
(select ID from [table] where word = 'threw')
and
ID+1 in
(select ID from [table] where word = 'the')
and
ID+3 in
(select ID from [table] where word = 'away')

But this is obviously fundamentally flawed.  

P.S. One option is to SELECT all of the records with WORD1, WORD2, and
WORD 3 INTO three separate temp tables, and then do something like:

select table1.word, table2.word, table3.word
from table1, table2, table3
where table1.word = 'threw' and table2.word = 'the' and
table3.word = 'away'
and table1.ID = table2.ID-1 and table1.ID = table3.ID-3

But this is going to be too slow for my needs, especially since the
table will eventually have 100 million records for a 100 million word
corpus of texts.

Any possible solutions?? Thanks in advance for your help.

 
 
 

Can this be done with SQL??

Post by Andrew J. Kell » Wed, 21 Aug 2002 05:22:05


Is there a specific reason why you have each word in a separate row (other
than trying to find the sequence)?  Have you looked into using Full Text
Searching capabilities yet?  If not I highly recommend doing so before going
any further.

--
Andrew J. Kelly   SQL MVP
Targitinteractive, Inc.


Quote:> I have a 30 million record table containing (in sequence) all of the
> words from a set of texts.  Looks like the following:

> ID     word
> 1     Once
> 2     upon
> 3     a
> 4     time
> 5     there
> 6     was
> . . . .

> I want to be able to find sequences of words, i.e. all of the three
> word strings having the pattern:

> ate the [xxx]

> or all of the four word strings with the pattern:

> threw the [xxx] away

> The problem, of course, is that the words are in different (though
> adjacent and sequential) records.  I've been playing around with
> subqueries, i.e.:

> select word
> from [table]
> where ID in
> (select ID from [table] where word = 'threw')
> and
> ID+1 in
> (select ID from [table] where word = 'the')
> and
> ID+3 in
> (select ID from [table] where word = 'away')

> But this is obviously fundamentally flawed.

> P.S. One option is to SELECT all of the records with WORD1, WORD2, and
> WORD 3 INTO three separate temp tables, and then do something like:

> select table1.word, table2.word, table3.word
> from table1, table2, table3
> where table1.word = 'threw' and table2.word = 'the' and
> table3.word = 'away'
> and table1.ID = table2.ID-1 and table1.ID = table3.ID-3

> But this is going to be too slow for my needs, especially since the
> table will eventually have 100 million records for a 100 million word
> corpus of texts.

> Any possible solutions?? Thanks in advance for your help.


 
 
 

Can this be done with SQL??

Post by Bob Castlema » Wed, 21 Aug 2002 05:33:57


Uhhh ... I'm really new to SQL Prrgramming, but since T-SQL supports loops
and branching, couldn't you write a recursive stored procedure that takes a
delimited string and a cursor as a parameters and returns the index of the
starting words of successful matches

In psuedo-code:

Declaration:

    my_proc(string:words_to_match;cursor:my_cursor)

        if length(words_match) = 0
            return index of current my_cursor position

       if   word  at my_cursor position = first word in delimited list
                then
                    increment cursor
                    return my_proc(words_to_match - first word,my_cursor)
            else
                return -1

used something like

    set my_cursor to first record
    while my_cursor not at EOF
        index_of_match = my_proc("string,to,match",my_cursor)
        if index_of_match > 0
            then insert into match table
       increment cursor

This is only a very rough approximation of how the recursion would work. You
would need parse the string and translate everything into T-SQL, and debug
the heck out of it since recursion loves to overflow the stack if you don't
provide a correct exit.

Also, I am new to T-SQL and maybe none of this is supported.

Bob


> I have a 30 million record table containing (in sequence) all of the
> words from a set of texts.  Looks like the following:

> ID     word
> 1     Once
> 2     upon
> 3     a
> 4     time
> 5     there
> 6     was
> . . . .

> I want to be able to find sequences of words, i.e. all of the three
> word strings having the pattern:

> ate the [xxx]

> or all of the four word strings with the pattern:

> threw the [xxx] away

> The problem, of course, is that the words are in different (though
> adjacent and sequential) records.  I've been playing around with
> subqueries, i.e.:

> select word
> from [table]
> where ID in
> (select ID from [table] where word = 'threw')
> and
> ID+1 in
> (select ID from [table] where word = 'the')
> and
> ID+3 in
> (select ID from [table] where word = 'away')

> But this is obviously fundamentally flawed.

> P.S. One option is to SELECT all of the records with WORD1, WORD2, and
> WORD 3 INTO three separate temp tables, and then do something like:

> select table1.word, table2.word, table3.word
> from table1, table2, table3
> where table1.word = 'threw' and table2.word = 'the' and
> table3.word = 'away'
> and table1.ID = table2.ID-1 and table1.ID = table3.ID-3

> But this is going to be too slow for my needs, especially since the
> table will eventually have 100 million records for a 100 million word
> corpus of texts.

> Any possible solutions?? Thanks in advance for your help.



- Show quoted text -

Quote:> I have a 30 million record table containing (in sequence) all of the
> words from a set of texts.  Looks like the following:

> ID     word
> 1     Once
> 2     upon
> 3     a
> 4     time
> 5     there
> 6     was
> . . . .

> I want to be able to find sequences of words, i.e. all of the three
> word strings having the pattern:

> ate the [xxx]

> or all of the four word strings with the pattern:

> threw the [xxx] away

> The problem, of course, is that the words are in different (though
> adjacent and sequential) records.  I've been playing around with
> subqueries, i.e.:

> select word
> from [table]
> where ID in
> (select ID from [table] where word = 'threw')
> and
> ID+1 in
> (select ID from [table] where word = 'the')
> and
> ID+3 in
> (select ID from [table] where word = 'away')

> But this is obviously fundamentally flawed.

> P.S. One option is to SELECT all of the records with WORD1, WORD2, and
> WORD 3 INTO three separate temp tables, and then do something like:

> select table1.word, table2.word, table3.word
> from table1, table2, table3
> where table1.word = 'threw' and table2.word = 'the' and
> table3.word = 'away'
> and table1.ID = table2.ID-1 and table1.ID = table3.ID-3

> But this is going to be too slow for my needs, especially since the
> table will eventually have 100 million records for a 100 million word
> corpus of texts.

> Any possible solutions?? Thanks in advance for your help.

 
 
 

Can this be done with SQL??

Post by Mark Davi » Thu, 22 Aug 2002 00:52:32



Quote:> Is there a specific reason why you have each word in a separate row (other
> than trying to find the sequence)?  Have you looked into using Full Text
> Searching capabilities yet?  If not I highly recommend doing so before going
> any further.

I use Full-Text Indexing extensively as part of the website.  But it's
very, very limited and won't allow me to do searches on the corpus
like:

all three-word strings containing:
any pronoun + any form of any synonym of a given verb + any infinitive
(e.g. lo quiero hacer, me deseaba hablar, etc in Spanish -- the
language I'm working with)

I already have an approach that works pretty well
(www.corpusdelespanol.org), but I'm looking at something like the
possible solution posed in my initial query as well.

Anyway, thanks in advance for any help that anyone might have.

Mark Davies

 
 
 

Can this be done with SQL??

Post by Delbert Glas » Thu, 22 Aug 2002 04:49:35


Here is an example of doing the four word phrase with any third word.
Be sure to set things up so a plan gets generated each time.

-------------------------------------------------
go
create table demoTable (
  TheID integer identity(1,1),
  TheWord varchar(20)
)
create clustered index IXC_demoTable on demoTable(TheWord)

insert into demoTable (TheWord) values ('Once')
insert into demoTable (TheWord) values ('upon')
insert into demoTable (TheWord) values ('a')
insert into demoTable (TheWord) values ('time')
insert into demoTable (TheWord) values ('there')
insert into demoTable (TheWord) values ('was')
insert into demoTable (TheWord) values ('Once')
insert into demoTable (TheWord) values ('upon')
insert into demoTable (TheWord) values ('a')
insert into demoTable (TheWord) values ('night')
insert into demoTable (TheWord) values ('there')
insert into demoTable (TheWord) values ('was')

select * from demoTable order by TheID

select w3.TheID, w1.TheWord,w2.TheWord,w3.TheWord, w4.TheWord from
--select  w1.TheWord,w2.TheWord,w4.TheWord from
(((select TheWord, TheID+2 as TheID from demoTable where TheWord='upon') w1
inner join
(select TheWord, TheID+1 as TheID from demoTable where TheWord='a') w2
on w2.TheID = w1.TheID)
inner join
(select TheWord, TheID-1 as TheID from demoTable where TheWord = 'there') w4
on w4.TheID=w1.TheID)
inner join
demoTable w3
on w3.TheID=w1.TheID
order by w3.TheWord,w3.TheID

drop table demoTable
go
-------------------------------------------------

Bye,
Delbert Glass

 
 
 

Can this be done with SQL??

Post by Steve Kas » Thu, 22 Aug 2002 05:28:28


Using the same DDL, this should be quite a bit more efficient.  It works
for 1- to 16-word phrases - you can change that as you like.

set nocount on
go
create table demoTable (
  TheID integer identity(1,1),
  TheWord varchar(20)
)
create clustered index IXC_demoTable on demoTable(TheWord)

insert into demoTable (TheWord) values ('Once')
insert into demoTable (TheWord) values ('upon')
insert into demoTable (TheWord) values ('a')
insert into demoTable (TheWord) values ('time')
insert into demoTable (TheWord) values ('there')
insert into demoTable (TheWord) values ('was')
insert into demoTable (TheWord) values ('Once')
insert into demoTable (TheWord) values ('upon')
insert into demoTable (TheWord) values ('a')
insert into demoTable (TheWord) values ('night')
insert into demoTable (TheWord) values ('there')
insert into demoTable (TheWord) values ('was')
insert into demoTable (TheWord) values ('a')

go
create procedure Phrases (



) as
select
  rtrim(
    max(case when n = 0 then TheWord else '' end) + space(1)
  + max(case when n = 1 then TheWord else '' end) + space(1)
  + max(case when n = 2 then TheWord else '' end) + space(1)
  + max(case when n = 3 then TheWord else '' end) + space(1)
  + max(case when n = 4 then TheWord else '' end) + space(1)
  + max(case when n = 5 then TheWord else '' end) + space(1)
  + max(case when n = 6 then TheWord else '' end) + space(1)
  + max(case when n = 7 then TheWord else '' end) + space(1)
  + max(case when n = 8 then TheWord else '' end) + space(1)
  + max(case when n = 9 then TheWord else '' end) + space(1)
  + max(case when n = 10 then TheWord else '' end) + space(1)
  + max(case when n = 11 then TheWord else '' end) + space(1)
  + max(case when n = 12 then TheWord else '' end) + space(1)
  + max(case when n = 13 then TheWord else '' end) + space(1)
  + max(case when n = 14 then TheWord else '' end) + space(1)
  + max(case when n = 15 then TheWord else '' end) + space(1)
  + max(case when n = 16 then TheWord else '' end)
  ) as Phrase
from (
  select a.TheID, N.n, demoTable.TheWord
  from demoTable join (
    select TheID from demoTable

  ) a

  join (
    select top 100 percent orderid - 10248 as n
    from northwind..orders

    order by orderid
  ) N

) X
group by TheID

order by TheID
go

exec Phrases 4, 3, 'a'     -- four-word phrases, third word is 'a'
exec Phrases 2, 3, 'was'   -- two-word phrases, following word is 'was'
exec Phrases 3, 1, 'there' -- three-word phrases, first word is 'there'
go

drop proc Phrases
drop table demoTable

Steve Kass
Drew University


> Here is an example of doing the four word phrase with any third word.
> Be sure to set things up so a plan gets generated each time.

> -------------------------------------------------
> go
> create table demoTable (
>   TheID integer identity(1,1),
>   TheWord varchar(20)
> )
> create clustered index IXC_demoTable on demoTable(TheWord)

> insert into demoTable (TheWord) values ('Once')
> insert into demoTable (TheWord) values ('upon')
> insert into demoTable (TheWord) values ('a')
> insert into demoTable (TheWord) values ('time')
> insert into demoTable (TheWord) values ('there')
> insert into demoTable (TheWord) values ('was')
> insert into demoTable (TheWord) values ('Once')
> insert into demoTable (TheWord) values ('upon')
> insert into demoTable (TheWord) values ('a')
> insert into demoTable (TheWord) values ('night')
> insert into demoTable (TheWord) values ('there')
> insert into demoTable (TheWord) values ('was')

> select * from demoTable order by TheID

> select w3.TheID, w1.TheWord,w2.TheWord,w3.TheWord, w4.TheWord from
> --select  w1.TheWord,w2.TheWord,w4.TheWord from
> (((select TheWord, TheID+2 as TheID from demoTable where TheWord='upon') w1
> inner join
> (select TheWord, TheID+1 as TheID from demoTable where TheWord='a') w2
> on w2.TheID = w1.TheID)
> inner join
> (select TheWord, TheID-1 as TheID from demoTable where TheWord = 'there') w4
> on w4.TheID=w1.TheID)
> inner join
> demoTable w3
> on w3.TheID=w1.TheID
> order by w3.TheWord,w3.TheID

> drop table demoTable
> go
> -------------------------------------------------

> Bye,
> Delbert Glass

 
 
 

1. SQL What am I doing wrong ?

I have two tables:

[ORG]

ID ID2  Name  
--  ---    --------
01 AA  James
02 AA  Kevin

[TMP]

ID ID2  Name  
--  ---    --------
01 AA  James
02 BB  June

ID1 and ID2 are unique and a key.

What I like to do is to update the ORG table with the new records in
the TMP table.

What I have tried without success is:

INSERT INTO ORG
SELECT * FROM [TMP] WHERE (ID,ID2) NOT IN
SELECT (ID,ID2) FROM [ORG]

I cannot even get the select statments to work.What am I doing wrong?
Any Help would be appreciated.
Morten Brun

http://www.novo.dk

2. Oracle PC/SQL help in WinHelp format

3. SQL problem, what am I doing wong?

4. Oracle data warehousing: mixed reactions

5. ADO and SQL: what am I doing wrong ?

6. US-OR-Java J2EE Architect

7. What am I doing wrong (RTRIM, PL/SQL)

8. Need Data Warehouse Information (Actual Implementation Case)

9. What in the world am I doing wrong?

10. Am I doing something wrong...??

11. Am I doing this the best way?

12. what am i doing wrong?

13. I am done