Xref and multiple news servers

Post by J. Howard Benso » Tue, 22 Apr 1997 04:00:00



When reading news from multiple servers, it would be nice to
avoid seeing the same article twice.

In the next release of Simplenews, I am going to add a feature where
if you have read an article on one server it will be marked as read
on other servers as well (if it is present on the other servers).
Obviously, the Xref: line offers no help for more than one server, since
the article numbers it lists are local to the server that generated it.
The solution I am looking at now is to maintain a file containing
the message-ids of articles that have been read. When an article
header is pulled in from the server, if the message-id is in this
file, that article will be marked as read in the .newsrc file for that
server and will not be displayed. This message-id file could be
trimmed by size or age of entry to prevent out of control growth.
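
A minimal sketch of the message-id file described above, in Python. This is
not Simplenews code: the class name SeenIds, the "timestamp message-id" line
format, and the 28-day default age limit are all assumptions made for
illustration.

```python
import os
import time

DAY = 86400  # seconds in one day


class SeenIds:
    """Hypothetical store of message-ids that have already been read."""

    def __init__(self, max_age_days=28):
        # 28 days is an arbitrary default chosen for this sketch.
        self.max_age = max_age_days * DAY
        self.seen = {}  # message-id -> time it was first marked read

    def mark_read(self, message_id, now=None):
        # Keep the earliest timestamp so the entry ages from the first read.
        if message_id not in self.seen:
            self.seen[message_id] = time.time() if now is None else now

    def is_read(self, message_id):
        return message_id in self.seen

    def trim(self, now=None):
        # Drop entries older than max_age to keep the file from growing.
        now = time.time() if now is None else now
        cutoff = now - self.max_age
        self.seen = {m: t for m, t in self.seen.items() if t >= cutoff}

    def save(self, path):
        # One entry per line: "<unix time> <message-id>".
        with open(path, "w") as f:
            for mid, stamp in self.seen.items():
                f.write(f"{int(stamp)} {mid}\n")

    def load(self, path):
        if not os.path.exists(path):
            return
        with open(path) as f:
            for line in f:
                stamp, _, mid = line.rstrip("\n").partition(" ")
                if mid:
                    self.seen[mid] = int(stamp)
```

When headers arrive from a second server, the reader would call is_read()
on each message-id and mark matching articles as read in that server's
.newsrc instead of displaying them.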

Any ideas??

---

http://www2.polarnet.com/~hbenson    -    home of simplenews
PGPprint =  1F CC EC 3F 1B 17 01 F2  5B 3E 57 6C 42 13 EB 5A

 
 
 

Xref and multiple news servers

Post by lvir.. » Sat, 26 Apr 1997 04:00:00



:The solution I am looking at now is to maintain a file containing
:the message-ids of articles that have been read. When an article
:header is pulled in from the server, if the message-id is in this
:file, that article will be marked as read in the .newsrc file for that
:server and will not be displayed. This message-id file could be
:trimmed by size or age of entry to prevent out of control growth.

The problem here is that this method doesn't scale well.  If a user only
reads a few articles, things will be fine.  What if they read a _lot_ of
messages?  Keeping only a few message ids won't help them at all.  The
best bet might be to a) keep the queues of message ids per newsgroup,
b) default to keeping no message ids, but allow a user to specify particular
newsgroups they wish to track in this fashion, c) use some sort of compression/
hashing mechanism to reduce the amount of bits needed to keep track of the
message ids.
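
The three suggestions above could be combined roughly as follows. This is
only a sketch under assumptions: the names short_hash and GroupTracker, the
8-byte truncated SHA-1 digest, and the per-group limit of 1000 entries are
all hypothetical choices, not anything an existing newsreader is known to do.

```python
import hashlib
from collections import deque


def short_hash(message_id, nbytes=8):
    """Reduce a message-id to a fixed number of bytes (lossy)."""
    return hashlib.sha1(message_id.encode("utf-8")).digest()[:nbytes]


class GroupTracker:
    """Bounded queue of hashed message-ids for user-selected groups only."""

    def __init__(self, tracked_groups, per_group_limit=1000):
        self.limit = per_group_limit
        # (b) nothing is tracked unless the user lists the group.
        self.queues = {g: deque() for g in tracked_groups}
        self.members = {g: set() for g in tracked_groups}

    def mark_read(self, group, message_id):
        if group not in self.queues:
            return  # untracked group: store nothing
        h = short_hash(message_id)  # (c) fixed-size hash, not the full id
        if h in self.members[group]:
            return
        queue = self.queues[group]  # (a) one queue per newsgroup
        queue.append(h)
        self.members[group].add(h)
        if len(queue) > self.limit:
            # Evict the oldest entry once the queue is full.
            self.members[group].discard(queue.popleft())

    def is_read(self, group, message_id):
        return (group in self.members
                and short_hash(message_id) in self.members[group])
```

With 8 bytes per entry, 1000 entries cost about 8 KB per tracked group
before any container overhead. The price of hashing is that two different
message-ids can, rarely, collide and be treated as the same article.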

--

<URL:http://www.teraform.com/%7Elvirden/> <*> O- "We are all Kosh."
Unless explicitly stated to the contrary, nothing in this posting should
be construed as representing my employer's opinions.

 
 
 

Xref and multiple news servers

Post by Vidar Hokst » Sun, 27 Apr 1997 04:00:00





>:The solution I am looking at now is to maintain a file containing
>:the message-ids of articles that have been read. When an article
>:header is pulled in from the server, if the message-id is in this
>:file, that article will be marked as read in the .newsrc file for that
>:server and will not be displayed. This message-id file could be
>:trimmed by size or age of entry to prevent out of control growth.

> The problem here is that this method doesn't scale well.  If a user only
> reads a few articles, things will be fine.  What if they read a _lot_ of
> messages?  Keeping only a few message ids won't help them at all.

The problem isn't as unmanageable as it may seem. We run a news server, and
have approx. 12,000 groups flowing through our system. With a history file
of about 120MB, we hardly get any duplicates. That works out to approx. 10kb
per group to keep a reasonable backlog, so it shouldn't be that big a problem.
Also, the history file format isn't exactly space efficient. For a
personal news client, where the needed throughput is a lot lower, you could
save a lot of space by reducing the number of bytes stored per article.

> The best bet might be to a) keep the queues of message ids per newsgroup,

That has the potential to increase the amount of data, since a crossposted
article's ID would be stored once for each group it appears in.

> b) default to keeping no message ids, but allow a user to specify particular
> newsgroups they wish to track in this fashion, c) use some sort of compression/
> hashing mechanism to reduce the amount of bits needed to keep track of the
> message ids.

Compressing the message ID fields could of course be worth looking at. Also,
if you don't mind a duplicate getting through now and then, you could
make assumptions about the IDs. For instance, you could try to reduce the
hostname to a few bytes of data (reducing the number of bits stored for
each byte, shortening top-level domains to one byte, or removing them
altogether, etc.).
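
One way to reduce the hostname to a few bytes, as a rough sketch only. The
function name compact_id and the choice of a 2-byte CRC32 tag for the host
part are assumptions for illustration. The trade-off is that two different
hostnames can map to the same tag, so two distinct IDs will occasionally be
treated as the same article.

```python
import zlib


def compact_id(message_id, host_bytes=2):
    """Keep the local part of a message-id, shrink the hostname to a tag."""
    # Strip surrounding whitespace and the angle brackets, if present.
    core = message_id.strip().lstrip("<").rstrip(">")
    local, sep, host = core.rpartition("@")
    if not sep:
        # No "@" found: treat the whole string as the local part.
        local, host = core, ""
    # Keep only the low-order bits of the hostname's CRC32 (lossy).
    mask = (1 << (8 * host_bytes)) - 1
    tag = (zlib.crc32(host.encode("utf-8")) & mask).to_bytes(host_bytes, "big")
    return local.encode("utf-8") + b"@" + tag
```

For an ID such as <123@news.example.com> (a made-up example), the stored
form is the 3-byte local part, the "@", and a 2-byte tag: 6 bytes instead
of 20.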

But I don't think it's worth spending much time on it. I've got 65
newsgroups in my newsrc file. With an average of 10kb per group, I'll use
650kb to avoid dupes. 650kb of disk space isn't exactly a huge amount anymore.
But I second the suggestion about letting the user specify which groups to
keep message id data for.

--


 
 
 

Xref and multiple news servers

Post by J. Howard Benso » Sun, 27 Apr 1997 04:00:00




>:The solution I am looking at now is to maintain a file containing
>:the message-ids of articles that have been read. When an article
>:header is pulled in from the server, if the message-id is in this
>:file, that article will be marked as read in the .newsrc file for that
>:server and will not be displayed. This message-id file could be
>:trimmed by size or age of entry to prevent out of control growth.

>The problem here is that this method doesn't scale well.  If a user only
>reads a few articles, things will be fine.  What if they read a _lot_ of
>messages?  Keeping only a few message ids won't help them at all.  The
>best bet might be to a) keep the queues of message ids per newsgroup,
>b) default to keeping no message ids, but allow a user to specify particular
>newsgroups they wish to track in this fashion, c) use some sort of
>compression/
>hashing mechanism to reduce the amount of bits needed to keep track of the
>message ids.

It sounds like saving message-ids in a file will substitute for
the Xref line when multiple servers are involved.

Let's look at the math for the scaling issue. Assume an average of
40 characters for each message-id, with the brackets stripped off and
a two-digit age stamp added. Entries are removed after 28 days, and
1,000,000 bytes are allowed for the file.

   (1000000 / 40) / 28 = 892.86

space for almost 900 articles per day.
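
The same arithmetic as a small Python sketch, so the assumptions can be
varied. The function names are hypothetical, and the calculation treats the
40 bytes as the full cost of an entry, including any line separator.

```python
def articles_per_day(file_bytes=1_000_000, bytes_per_entry=40,
                     retention_days=28):
    """Average articles per day the file can hold at steady state."""
    entries = file_bytes // bytes_per_entry  # total entries that fit
    return entries / retention_days


def bytes_needed(per_day, bytes_per_entry=40, retention_days=28):
    """File size required to retain `per_day` articles for the full period."""
    return per_day * bytes_per_entry * retention_days


print(round(articles_per_day(), 2))  # 892.86
print(bytes_needed(2000))            # 2240000
```

Turned around, a user who reads 2000 articles a day would need a file of
about 2.24 million bytes under the same assumptions.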

News reading habits vary widely between users. Do my assumptions
seem valid?  Is 28 days long enough to hold the entries?
