User:Brent Gulanowski/Categorization

Proposal for a Knowledge Classification System for the Wikipedia

Introduction

I propose a self-organizing system, with contributors acting as its agents, that would work much like Colon Classification or the Universal Decimal Classification (UDC), with some variations to take advantage of computerized storage and retrieval. The idea is to find a relationship between pages that is more fundamental than simple links, which are fragile, somewhat incidental, and very chaotic. As a network, links are exceptionally powerful, but they are too disorganized to serve as a basis for classification.

Motivations

The inadequacy of links for organizing articles is the motivation for categorization of some kind, and the first instinct is to emulate a library. But if pages are categorized according to a strict tree structure, as is implicit in the Library of Congress (LC) or Dewey Decimal (DDS) systems, the problem is choosing which branch to place a page on when it could qualify for several. Well, why not put it on multiple branches? Further, many classification systems are purely top-down. I propose a more organic approach that combines top-down and bottom-up.

The primary reason for proposing this system is to ensure that a user of the Wikipedia can see, at a glance, all pages related to the page they are reading or to a topic they are interested in. Users should be spared the need to follow links or perform fruitless searches when they are so unfamiliar with a subject that they do not even know which terms to search for.

Requirements

Ideally, a classification system would allow the insertion of an article on any subject at any level of detail. Both of these qualities are entirely relative, especially level of detail. Although users and contributors will bring their own sense of subject and level of detail, it would be preferable to organize the Wikipedia articles, as much as possible, only in relation to one another, and less in relation to inferred external repositories. A classification algorithm should therefore relate pages, as much as possible, explicitly to other pages. As the system grows, re-organization can be expected to shape the graph naturally according to user and contributor expectations anyway.

Starting Point

Of course, there is already a framework, because there is already a set of classification-related pages, starting with the main page. The main page is the root of the classification tree. By definition it encompasses all pages and all categories, so it will change whenever a second-level category is created or destroyed, but there will only ever be one main category page. Ideally the main page will be the pre-eminent example of a special kind of page: a pure category page, which holds no information except the categories beneath it and any other useful meta-data that belongs there. In the case of the main page, that meta-data includes links to the whole set of wiki meta pages: FAQs, welcome, community, etc.

The algorithm

The repository begins with a single category: the main category. In a theoretical model, this category starts with one page. The main category has a pure category page called "all", and this page refers only to itself. The "category: all" page is simultaneously a member of the set called "category", in which all subsequent category pages will also be placed. Every category page has a list of links to the content pages which are top-level members of that category proper, as well as links to the category pages immediately above it and to all proper sub-categories below it.
- It is a debatable question whether one should inter-mingle category and content information on the same page. A strict system would separate them, but there is no problem with combining them.
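The category-page structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the names `CategoryPage`, `category`, and `add_category`, and the example category "science", are hypothetical and not part of the proposal, and the sketch takes the strict option of keeping category pages free of content.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CategoryPage:
    """A pure category page: it holds links only, no article content."""
    name: str
    parents: Set[str] = field(default_factory=set)        # categories immediately above
    subcategories: Set[str] = field(default_factory=set)  # proper sub-categories below
    pages: Set[str] = field(default_factory=set)          # top-level member content pages


# The set called "category": every category page is registered here.
# It starts with the root page "all", which refers only to itself.
category: Dict[str, CategoryPage] = {"all": CategoryPage("all", parents={"all"})}


def add_category(name: str, parent: str = "all") -> CategoryPage:
    """Register a new category page and link it to its parent in both directions."""
    page = CategoryPage(name, parents={parent})
    category[name] = page
    category[parent].subcategories.add(name)
    return page


# Hypothetical example: one second-level category with one member page.
science = add_category("science")
science.pages.add("Scientific method")
```

Storing `parents` as a set rather than a single value leaves room for a category to sit on more than one branch, in keeping with the multiple-branch idea in the Motivations section.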
As pages are written, they will always be added to one or more categories, either directly, by selecting an existing category, or indirectly, by explicit association with an existing page. If the associated page belongs to multiple categories, the new page can be a member of one, some, or all of those categories. Categories can always be changed later, but it might make sense to record the initially selected category permanently. It might also be useful to define a default or primary category which is stickier than the others, to help keep pages from getting lost. Some facility for choosing categories will be necessary: menus, text field entry, or the like. (A user interface expert would determine that.)
- Pages which are members of multiple categories will implicitly define intersection sets of those categories, which one might term cross categories or something similar. It would be beneficial to define these in a manner similar to that of proper sub-categories (with a unique ID), as described below, but not strictly necessary. More thought is needed.
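To make the two ways of adding a page and the idea of cross categories concrete, here is a minimal sketch. Everything named in it is an assumption made for illustration: `add_page`, `cross_categories`, the two dictionaries, and the example titles are hypothetical. It also assumes that the first category chosen becomes the primary one, and that indirect association inherits all of the associated page's categories (the proposal also allows inheriting only some).

```python
from collections import defaultdict
from itertools import combinations

page_categories = {}   # page title -> set of category names
primary_category = {}  # page title -> "sticky" primary category (assumed: first chosen)


def add_page(title, categories=None, like=None):
    """Add a page directly (by category names) or indirectly (via the page `like`)."""
    if categories:
        chosen = list(categories)
    elif like is not None:
        # Indirect: inherit every category of the associated page, its primary first.
        first = primary_category[like]
        chosen = [first] + sorted(page_categories[like] - {first})
    else:
        raise ValueError("a page needs a category or an associated page")
    page_categories[title] = set(chosen)
    primary_category[title] = chosen[0]


def cross_categories():
    """Map each pair of categories that share a page to the pages in both."""
    result = defaultdict(set)
    for title, cats in page_categories.items():
        for pair in combinations(sorted(cats), 2):
            result[frozenset(pair)].add(title)
    return dict(result)


add_page("Gregor Mendel", categories=["biology", "history"])
add_page("Genetics", like="Gregor Mendel")  # indirect association
add_page("Cell", categories=["biology"])
shared = cross_categories()
```

Here `shared` contains a single cross category, the intersection of "biology" and "history", holding "Gregor Mendel" and "Genetics". The sketch only computes pairwise intersections on demand; giving each one a unique ID, as suggested above, would mean storing them instead.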
As categories grow in size (that is, as the number of members in the set which describes the category grows), it will be sensible to introduce proper sub-categories for sets of related, more specialized pages. Each new sub-category will be given a permanent unique identifier, used by the system, and a name, given by the contributor who defines the category and subject to editorial change later, as with the title of any page. The sub-category will automatically be given a category page, and the parent category page will be updated to reflect the new sub-category: links to the member pages will be removed and replaced with a link to the sub-category. However, links on other category pages will be unchanged. If primary categories are implemented, moving a page into a proper sub-category might be restricted to pages whose primary category was initially the parent category.
- If a page is added to the wiki without being assigned a category (unlikely, but it should be an option), then it will get appended to a default category, to avoid having the main page filled up with "orphan" pages. It seems likely that active users will quickly sort these orphans into suitable homes.
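The sub-category step and the default category for orphans might look like the following. This is a sketch under stated assumptions, not a definitive design: the class and function names, the default category name "uncategorized", and the example pages are all hypothetical, and the restriction based on primary categories is left out.

```python
import itertools

_next_id = itertools.count(1)  # source of permanent identifiers (assumed: simple counter)


class Category:
    def __init__(self, name, parent=None):
        self.id = next(_next_id)  # permanent, used by the system
        self.name = name          # contributor-given, editable later
        self.parent = parent
        self.subcategories = []
        self.pages = set()
        if parent is not None:
            parent.subcategories.append(self)


root = Category("all")
orphans = Category("uncategorized", parent=root)  # hypothetical default category


def add_page(title, category=None):
    """Add a page to a category, or to the default category if none is given."""
    target = orphans if category is None else category
    target.pages.add(title)


def create_subcategory(parent, name, pages):
    """Move `pages` out of `parent` into a new proper sub-category of it."""
    pages = set(pages)
    missing = pages - parent.pages
    if missing:
        raise ValueError(f"not top-level members of {parent.name}: {sorted(missing)}")
    sub = Category(name, parent=parent)  # parent now links to the sub-category
    sub.pages = pages
    parent.pages -= pages                # member links removed from the parent only
    return sub


science = Category("science", parent=root)
history = Category("history", parent=root)
for title in ("Cell", "Genetics", "Optics"):
    add_page(title, science)
add_page("Genetics", history)  # the same page on a second branch
add_page("Draft")              # no category given, so it lands in the default
biology = create_subcategory(science, "biology", {"Cell", "Genetics"})
```

After the split, "science" lists only "Optics" plus a link to "biology", while "history" still lists "Genetics" directly, since links on other category pages are left unchanged.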