Editor's note:  These minutes have not been edited.


0. Agenda review/changes

The proposed agenda was accepted without changes. 

1. Why two parallel CIP drafts?

Patrik explained that he and Roland shared the view that Chris 
Weider's draft didn't reflect the consensus the group reached at the 
LA meeting and also relied too heavily on Whois++ specifics. A 
second draft was therefore produced by Jeff Allen and Patrik 
Faltstrom. The intended outcome is that these two drafts will be 
merged into one. 

2. Charter of the find group

There was some discussion about which papers were going to be 
produced. The consensus was that there should be one document 
specifying the CIP, another specifying how to use centroids as one 
special case of indexes within the CIP, and, for each client-server 
protocol that is going to use the CIP, one paper describing the 
mapping between the data representations and one describing the 
access method. 

3. LDAP/CIP work at Umea University

Roland Hedberg presented the work he has been doing to enable an X.500 
DSA to work as an index server, and he also presented a WWW 
interface that can use this index server.
The WWW interface can be reached at
http://macavity.umdc.umu.se/~roland/query2.en.html and the 
index server it accesses contains all the information presently accessible 
in the Swedish branch of the X.500 DIT (~50,000 entries). For the time 
being the index only contains names of people. Roland will produce a 
draft describing the object class and attributes needed to accomplish 
this.

4. The new CIP draft

Jeff Allen presented the gist of the new draft. The discussion following 
the presentation left a number of items unresolved: 

The use of MIME - should/can INDEX-CHANGED be structured as a 
MIME message? 

Aggregation a la CIDR - facilitate query routing. 

Incremental updates - per application domain or general? 

Security - both regarding exporting indexes and data protection. 

Centroid scaling issues - certain datasets contain only unique items, 
which means that the resulting index is no smaller than the original 
dataset. 

Front ends to index servers might speak only one access protocol - 
clients speaking another access protocol cannot pass such a server, 
while climbing the tree upwards or downwards, which means that parts 
of the mesh might be inaccessible to the client.

5. Workshop on Distributed Indexing and Searching 

Erik Selberg presented some ideas, which came out of the workshop, on 
using query routing within the Web indexing sphere. It was felt that 
introducing query routing and distributed index servers is a necessary 
step in the development of Web indexes, since the current centralized 
approach doesn't scale. More information on the workshop can be found at 
http://www.w3.org/pub/WWW/Search/9605-Indexing-Workshop/ 

It was agreed that followup work undertaken by the query routing 
contingent from the Distributed Indexing/Searching Workshop would 
be folded into the FIND working group.

6. The CIP and CCSO

Martin Hamilton presented his work on integrating CCSO nameservers 
with the CIP. His conclusion was that it is viable, but that some 
items remain to be resolved. There is no standard URL format for a 
CIP referral to a CCSO nameserver; for the time being, Martin 
proposed that one could use the gopher form 
(gopher://ccso.server.domain.name:105/2). 

Another question is whether the CCSO attribute names and types 
should be normalized to a common schema. 
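
A minimal sketch of how a client might build and take apart such an 
interim referral (assuming nothing beyond Martin's example; the helper 
names and the fallback to port 105 are illustrative, not part of any 
specification): 

    from urllib.parse import urlsplit, urlunsplit

    def make_ccso_referral(host, port=105):
        # Interim gopher-style form proposed at the meeting, e.g.
        # gopher://ccso.server.domain.name:105/2 (gopher item type "2"
        # denotes a CSO phone-book server).
        return urlunsplit(("gopher", "%s:%d" % (host, port), "/2", "", ""))

    def parse_ccso_referral(url):
        # Recover the host and port the client would contact using the
        # CCSO (ph/qi) protocol.
        parts = urlsplit(url)
        return parts.hostname, parts.port or 105

    print(make_ccso_referral("ccso.server.domain.name"))
    # gopher://ccso.server.domain.name:105/2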

7. Scaling of the CIP

Patrik presented some graphs showing the relationship between the 
size of a centroid and the size of the actual dataset, both for people 
information from the phone book and for large document collections. 
The phone book information revealed the not very astonishing fact that 
phone numbers are unique, which means that the centroid grew almost 
linearly with the dataset. Removing phone numbers from the centroid 
gave a much slower growth, which also appeared to be asymptotic. When 
indexing words out of documents, the curve didn't seem to level off as 
the dataset grew (maximum dataset size ~12,000,000 tokens). When a 
stop list weeding out very frequent and very unusual words was 
applied, the curve became asymptotic, leveling off at about 60,000 
entries.
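
As a rough illustration of why unique-valued attributes behave this 
way (a sketch only, not Patrik's measurement code; the tokenisation, 
the sample data and the stop-list handling below are assumptions), a 
centroid over an attribute is essentially the set of distinct values 
seen, so its growth can be traced like this: 

    # Sketch: a centroid is, in essence, the set of unique tokens per
    # attribute, so attributes whose values are all unique (phone
    # numbers) make it grow linearly with the dataset, while a stop
    # list over document words lets the curve level off.

    def centroid_sizes(records, stop_list=None):
        """Yield (records seen, centroid size) as records are added."""
        stop_list = stop_list or set()
        seen = set()
        for i, record in enumerate(records, 1):
            for token in record.split():
                if token not in stop_list:
                    seen.add(token)
            yield i, len(seen)

    # Hypothetical phone book: every number is unique, so the centroid
    # tracks the dataset size almost one-to-one.
    phonebook = ["Person%d +46-90-%07d" % (i, i) for i in range(10000)]
    for n, size in centroid_sizes(phonebook):
        if n % 2500 == 0:
            print("%d entries -> centroid of %d tokens" % (n, size))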