#LyX 1.6.3 created this file. For more info see http://www.lyx.org/
\lyxformat 345
\begin_document
\begin_header
\textclass scrreprt
\begin_preamble
\usepackage{classicthesis-lyx}
\end_preamble
\options paper=a4,fontsize=10pt,BCOR=5mm,captions=tableheading,cleardoublepage=empty,clear=right,numbers=noenddot,abstract=false,footinclude=true,headinclude=true,fleqn=true,titlepage=true,twoside=true
\use_default_options false
\master proposal.lyx
\begin_modules
biblatex
enumitem
logicalmkup
\end_modules
\language american
\inputencoding auto
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\paperfontsize 10
\spacing single
\use_hyperref false
\papersize a4paper
\use_geometry false
\use_amsmath 1
\use_esint 0
\cite_engine natbib_numerical
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 2
\paperpagestyle default
\tracking_changes false
\output_changes false
\author "" 
\author "" 
\end_header

\begin_body

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout

% ----------------------( PART 1 / CHAPTER 1    )----------------------
\end_layout

\begin_layout Plain Layout

% Define the default pagination style for thesis content: numeric
\end_layout

\begin_layout Plain Layout

% pagination under the KOMA-Script "scrheadings" style.
\end_layout

\begin_layout Plain Layout


\backslash
pagestyle{scrheadings}
\end_layout

\begin_layout Plain Layout


\backslash
pagenumbering{arabic}
\end_layout

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout

% ......................{ Part 1                }......................
\end_layout

\begin_layout Plain Layout


\backslash
myPart{Policy 
\backslash
& Pipelines}
\end_layout

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout

% ......................{ Chapter 1             }......................
\end_layout

\begin_layout Plain Layout


\backslash
myChapter{Hypothesis}
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "ch:hypothesis"

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace medskip
\end_inset


\end_layout

\begin_layout Standard

\family sans
\begin_inset Box Boxed
position "c"
hor_pos "c"
has_inner_box 1
inner_pos "b"
use_parbox 1
width "100col%"
special "none"
height "1in"
height_special "totalheight"
status open

\begin_layout Plain Layout
\paragraph_spacing single
\noindent
\align center

\family typewriter
Is machine clustering the Wikipedia corpus practicable?
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\begin_inset VSpace bigskip
\end_inset


\end_layout

\begin_layout Standard
\noindent
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

Machine clustering
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 is algorithmic categorization of articles.
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
graffitoquote{
\end_layout

\end_inset

Inventors have a machine, and they are perfecting it.
 They get one part right, and then another goes wrong; and they get that
 right, and then another goes wrong, and so on.
 When they are quite sure they have reached perfection, forth issues the
 machine out of the shed -- and in five minutes is smashed up, together
 with a limb or so of the inventors, just because they had been quite sure
 too soon.
 Then the whole business starts again.
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

Arnold Bennett
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

on machines
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

Human clustering
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

, its antonym, is non-algorithmic, human-driven categorization of the same.
\end_layout

\begin_layout Standard
Both machine and human clustering are known to be possible: that is, implementab
le in computationally finite time.
 Prior research (presented in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{ch:pipelines}
\end_layout

\end_inset

) demonstrates this possibility for machine clustering; likewise, the prevailing
 success of such category-related WikiProjects as 
\begin_inset CommandInset href
LatexCommand href
name "WikiProject Categories"
target "http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Categories"

\end_inset

, 
\begin_inset CommandInset href
LatexCommand href
name "WikiProject Stub Sorting"
target "http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Stub_sorting"

\end_inset

, and 
\begin_inset CommandInset href
LatexCommand href
name "WikiProject Tree of Life"
target "http://commons.wikimedia.org/wiki/Commons:WikiProject_Tree_of_Life"

\end_inset

 (presented in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
EW
\end_layout

\end_inset

 itself) demonstrates this possibility for human clustering.
\end_layout

\begin_layout Standard
Only the latter, however, is known to be practicable: that is, implementable
 in computationally feasible time.
 Assuming sufficiently enthusiastic volunteerism, human clustering effectively
 requires no computation.
 Insofar as it consumes no computational resources, human clustering qualifies
 as 
\begin_inset Quotes eld
\end_inset

computation free
\begin_inset Quotes erd
\end_inset

 and therefore practicable.
\end_layout

\begin_layout Standard
Whether machine clustering is practicable remains a matter of open debate.
 Owing to the sheer scale of the input dataset,
\begin_inset Foot
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
EW
\end_layout

\end_inset

 consists of 
\begin_inset Formula $\sim3\times10^{6}$
\end_inset

 official articles and 
\begin_inset Formula $\sim6.7\times10^{6}$
\end_inset

 total articles as of July 2009 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wpstats:entables}
\end_layout

\end_inset

, where total articles comprise official articles plus disambiguation and redirect
 articles.
 It is not known how many interlinks (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, internal links between official articles) 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
EW
\end_layout

\end_inset

 contains, unfortunately, owing to a software failure as of October 2006.
 However, if average article structure for this Wikipedia resembles that
 of other Wikipedias for which this statistic is available (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg, 
\backslash
IW 
\backslash
autocite{wpstats:ittables}
\end_layout

\end_inset

), we estimate that it contains an average of 
\begin_inset Formula $24.96$
\end_inset

 interlinks per article and a total of 
\begin_inset Formula $\sim75\times10^{6}$
\end_inset

 interlinks for all articles.
 Thus, when applying graph theory algorithms to an 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
EW
\end_layout

\end_inset

 corpus, we have an input dataset described by number of vertices 
\begin_inset Formula $n\simeq3\times10^{6}$
\end_inset

 and number of edges 
\begin_inset Formula $m\simeq75\times10^{6}$
\end_inset

.
 In short, this is large.
\end_layout

\end_inset

 complete machine clustering of the Wikipedia corpus in synchronous response
 to edits on that corpus (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, in real-time) may never be practicable.
 Approximate machine clustering, however, may be.
 By incompletely (but hopefully adequately) categorizing articles under
 categories of some reduced accuracy and/or granularity, approximate machine
 clustering could be sufficiently efficient as to warrant practical implementati
on.
\end_layout
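
\begin_layout Standard
As a rough consistency check on the footnoted estimate, the edge count follows
 from the vertex count and the assumed mean number of interlinks per article.
 Here 
\begin_inset Formula $\bar{d}$
\end_inset

 is our own shorthand (not drawn from the cited statistics) for that mean,
 and the calculation is only as sound as the assumption that English Wikipedia
 resembles the Wikipedias for which this statistic is published:
\begin_inset Formula \[
m\simeq n\,\bar{d}\simeq\left(3\times10^{6}\right)\left(24.96\right)\approx7.5\times10^{7}=75\times10^{6}.\]

\end_inset


\end_layout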

\begin_layout Standard
This thesis proposal addresses this matter by:
\end_layout

\begin_layout Enumerate
In 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{ch:policies}
\end_layout

\end_inset

, discussing the current state of human clustering on Wikipedia and, therein,
 the desirability of machine clustering.
\end_layout

\begin_layout Enumerate
In 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{ch:pipelines}
\end_layout

\end_inset

, introducing algorithmic pipelines for practical implementation of machine
 clustering.
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout

% ......................{ Next Chapter          }......................
\end_layout

\begin_layout Plain Layout


\backslash
myChapter{Policy}
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "ch:policies"

\end_inset


\end_layout

\begin_layout Paragraph
\noindent

\shape smallcaps
Note:
\end_layout

\begin_layout Standard
As the reader may already intuit, this thesis proposal is both curiously
 long and curiously unfocused.
 I, the author, attempted to rectify this on numerous occasions -- usually,
 over luminously Black Adder-blackened licorous tea.
 As this chapter attests, I failed.
 Despite best intentions and nimble 'all-nighter' tea sessions, it rather
 remains a mess.
 Please do feel free to skim straight to the 
\begin_inset Quotes eld
\end_inset

meat and potatoes
\begin_inset Quotes erd
\end_inset

 of this proposal: 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{ch:pipelines}
\end_layout

\end_inset

, which more fully examines the tangible matter of algorithmic solutions
 to the machine clustering problem.
\end_layout

\begin_layout Standard

\lyxline

\end_layout

\begin_layout Standard
\begin_inset VSpace bigskip
\end_inset


\end_layout

\begin_layout Standard
\noindent
Wikipedia's present policies on article categorization 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
graffitoquote{
\end_layout

\end_inset

To know the name on the door is to know nothing.
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

Fredy Perlman
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

on abstraction
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 reduce to the following invariant:
\end_layout

\begin_layout Quote
Each Wikipedia article is necessarily categorized under zero or more Wikipedia
 categories.
\end_layout

\begin_layout Standard
On its own, this naive and very obvious invariant is not particularly
 informative.
\end_layout

\begin_layout Standard
This lack of informativeness begets several problems, policy solutions,
 and, surprisingly, algorithmic insight.
 This chapter documents these problems' severity in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:category-problems}
\end_layout

\end_inset

, the existing solution of human clustering in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:soft-solutions}
\end_layout

\end_inset

, the alternative solution of machine clustering in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:hard-solutions}
\end_layout

\end_inset

, and algorithmic insight into the nature of verifiable facts on article
 categorizations in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:verifiable-facts}
\end_layout

\end_inset

.
\end_layout

\begin_layout Section
Categorical Deficiencies
\begin_inset CommandInset label
LatexCommand label
name "sec:category-problems"

\end_inset


\end_layout

\begin_layout Standard
This weak invariant on article categorization, coupled with the lack
 of stronger invariants, hobbles information retrieval: namely, as we shall
 see, it prevents such retrieval from fully leveraging the verifiable facts
 implied by article categorizations.
\end_layout

\begin_layout Subsection
Non-categorization
\end_layout

\begin_layout Standard
This invariant in no way requires that articles be categorized.
 By allowing uncategorized articles to remain uncategorized (through editor
 negligence) and previously categorized articles to become uncategorized (through
 implicit deletion of existing categories or explicit removal of articles
 from those categories), it actually impedes category-based decision making.
 Since an article may or may not be categorized as editors see fit, decision
 making over heterogeneous articles (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, over articles of dubious or various authorship) cannot subsist solely
 on article categorizations but must, necessarily, consider article data
 (headers, ledes, lists, links, and other body text) and/or article metadata
 (hatnotes, infoboxes, navboxes, and other margin templates).
\end_layout

\begin_layout Standard
This is mildly unfortunate.
 As 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:verifiable-facts}
\end_layout

\end_inset

 details, article categorizations actually signify verifiable facts on the
 articles they categorize.
 Furthermore, these categorizations are directly parsable from those articles
 via HTML scrape, XML lookup, or SQL request in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
Oone
\end_layout

\end_inset

 to not much more than 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
Oone
\end_layout

\end_inset

 time.
 It follows that decision making on article categorizations alone could yield
 faster, less resource-intensive algorithms than decision making on the
 same verifiable facts implied by other, less accessible article content.
\end_layout

\begin_layout Subsection
Categorical Inconsistency
\end_layout

\begin_layout Standard
Moreover, this invariant in no way requires that article categorizations
 be factually consistent.
 This is probably the product of initial conditions: first, the insufficiency
 of Wikipedia technology; second, the inexpressivity of Wikipedia categories.
\end_layout

\begin_layout Standard
Wikipedia sits astride the 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MW
\end_layout

\end_inset

 server platform.
 This platform currently affords no mechanism for affixing bibliographic
 citations onto article categorizations.
 Article categorizations, both as visually demarcated to end users and as programmatically
 recorded to backend databases, simply 
\begin_inset Flex CharStyle:Emph
status collapsed

\begin_layout Plain Layout
are
\end_layout

\end_inset

.
 They stand adrift in flat seas of HTML, without context or textual metadata;
 what Wikipedian can say whence they come or whither they go? The
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MW
\end_layout

\end_inset

-managed histories for an article reveal only by whom and when some categorization
 was added to or deleted from that article --- but never 
\begin_inset Flex CharStyle:Emph
status collapsed

\begin_layout Plain Layout
why
\end_layout

\end_inset

.
 This stands in marked contrast to article content.
 Both convey verifiable facts, but only article content contextualizes those
 facts with bibliographic citation and, thereby, the means to consistently
 validate that content.
 
\end_layout

\begin_layout Standard
However, even were 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MW
\end_layout

\end_inset

 to propitiously support affixation of citations onto categorizations, this
 would in no way rectify their absence from existing stores of articles.
 English Wikipedia comprised 
\begin_inset Formula $\sim3\times10^{6}$
\end_inset

 articles and 
\begin_inset Formula $\sim496\times10^{3}$
\end_inset

 categories as of July 2009, up from 
\begin_inset Formula $\sim634\times10^{3}$
\end_inset

 articles and 
\begin_inset Formula $\sim50\times10^{3}$
\end_inset

 categories as of July 2005 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wpstats:entables}
\end_layout

\end_inset

.
 It follows that the number of Wikipedia categories grows at roughly
 twice the relative rate at which the number of Wikipedia articles grows, but
 that the latter's sheer volume still far outweighs the former's.
 Historically, Wikipedia provided only a few categories relative to the
 many articles it published.
 Since many articles were categorizable under only a few categories, the
 consistency of these categorizations with verifiable facts implied by article
 content was, in practice, redundantly obvious.
 As a consequence of incremental growth in average article length and total
 available categories across all Wikipedias,
\begin_inset Foot
status open

\begin_layout Plain Layout
Wikipedia statistics show total available categories linearly growing at
 
\begin_inset Formula $\sim10^{4}$
\end_inset

 categories per month on English Wikipedia 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wpstats:entables}
\end_layout

\end_inset

.
 Unfortunately, due to inadequacies in the software mining these statistics
 and the enormous reservoir of data such mining entails, Wikipedia statistics
 no longer show the proportion of articles of length greater than 0.5Kb and
 2.0Kb to the total available articles; extrapolating from past trends, however,
 it seems probable that average article length is also growing.
\end_layout

\end_inset

 this is no longer the case, obviously.
\end_layout
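
\begin_layout Standard
The growth comparison above can be made explicit from the four cited counts.
 Writing 
\begin_inset Formula $g_{A}$
\end_inset

 and 
\begin_inset Formula $g_{C}$
\end_inset

 for the growth factors of articles and categories between July 2005 and
 July 2009 (symbols introduced here purely for illustration), a back-of-the-envelope
 calculation gives:
\begin_inset Formula \[
g_{A}\approx\frac{3\times10^{6}}{634\times10^{3}}\approx4.7,\qquad g_{C}\approx\frac{496\times10^{3}}{50\times10^{3}}\approx9.9,\qquad\frac{g_{C}}{g_{A}}\approx2.1.\]

\end_inset

 Assuming linear growth over those 48 months, the same counts imply a rate
 that agrees in order of magnitude with the footnoted monthly figure:
\begin_inset Formula \[
\frac{\left(496-50\right)\times10^{3}}{48}\approx9.3\times10^{3}\mbox{ categories per month}.\]

\end_inset


\end_layout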

\begin_layout Standard
The factual inconsistencies inherent in article categorization, as implied
 by the above historical conditions, again impede category-based decision
 making; and, again, this is mildly unfortunate.
\end_layout

\begin_layout Section
Police Power
\begin_inset CommandInset label
LatexCommand label
name "sec:soft-solutions"

\end_inset


\end_layout

\begin_layout Standard
In response, the Wikimedia Foundation, Inc.
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
graffitoquote{
\end_layout

\end_inset

No tyranny is so irksome as petty tyranny: the officious demands of policemen,
 government clerks, and electromechanical gadgets.
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

Edward Abbey
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

on officialdom
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 and the fledgling Wikipedia community invented two voluntary policies.
\end_layout

\begin_layout Paragraph
Wikipedia:Category (
\begin_inset Flex CharStyle:Emph
status collapsed

\begin_layout Plain Layout
WP:CAT
\end_layout

\end_inset

):
\end_layout

\begin_layout Standard
To guarantee epistemological completeness in and, thereby, algorithmic computability
 on the Wikipedia corpus, Wikipedia policy 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wp:cat}
\end_layout

\end_inset

 encourages each article to be categorized under one or more, rather than zero
 or more, categories.
\end_layout
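
\begin_layout Standard
The contrast with the baseline invariant can be stated compactly.
 Writing 
\begin_inset Formula $C(a)$
\end_inset

 for the set of categories under which an article 
\begin_inset Formula $a$
\end_inset

 is categorized (notation assumed here for illustration only, not taken
 from the policy text), the invariant holds vacuously for every such set,
 whereas the policy asks for a nonempty one:
\begin_inset Formula \[
\underbrace{\left|C(a)\right|\geq0}_{\mbox{invariant (always true)}}\qquad\mbox{versus}\qquad\underbrace{\left|C(a)\right|\geq1}_{\mbox{encouraged by policy}}.\]

\end_inset


\end_layout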

\begin_layout Paragraph
Wikipedia:Overcategorization (
\begin_inset Flex CharStyle:Emph
status collapsed

\begin_layout Plain Layout
WP:OCAT
\end_layout

\end_inset

):
\end_layout

\begin_layout Standard
To guarantee factual consistency between article categorization and content,
 Wikipedia policy 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wp:ocat}
\end_layout

\end_inset

 encourages each article categorization to be associated with one verifiable
 fact or intersection of verifiable facts in article content.
 
\begin_inset Foot
status open

\begin_layout Plain Layout
In this respect, the 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

Wikipedia:Overcategorization
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 policy doubles as a 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

Wikipedia:Undercategorization
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 policy as well.
\end_layout

\end_inset

 However, as discussed under 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:verifiable-facts}
\end_layout

\end_inset

, the converse is not generally true.
\end_layout

\begin_layout Section
Soft Solutions, Hard Solutions
\begin_inset CommandInset label
LatexCommand label
name "sec:hard-solutions"

\end_inset


\end_layout

\begin_layout Standard
These policies constitute 
\begin_inset Quotes eld
\end_inset

soft
\begin_inset Quotes erd
\end_inset

 --- not 
\begin_inset Quotes eld
\end_inset

hard
\begin_inset Quotes erd
\end_inset

 --- solutions.
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
graffitoquote{
\end_layout

\end_inset

Soft power is the velvet glove, but behind it there is always the iron fist.
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

Robert Cooper
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

on diplomacy
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 Being voluntary, they envision community-oriented behavior instead of provision
ing algorithm-oriented invariants.
 Machine-driven algorithms cannot, therefore, depend on their observance.
\end_layout

\begin_layout Standard
This is non-ideal.
 Ideally, one or several algorithms deterministically guaranteeing such
 observance could be implemented as conditionally activatable 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MW
\end_layout

\end_inset

 extensions or 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MWT
\end_layout

\end_inset

 bots; when these are active, later algorithms could then operate under
 the hard guarantees they afford.
\end_layout

\begin_layout Standard
This thesis proposal discusses algorithms applicable for this purpose.
 It proposes heuristics for classifying, identifying, and selecting such
 algorithms, metrics for measuring the comparative merits of select algorithms,
 and constraints on practical design, implementation, debugging, documentation,
 and publication of several meritorious algorithms --- as imposed by:
\end_layout

\begin_layout Itemize
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
itemlead{
\end_layout

\end_inset

Existing policy
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 For safety, Wikipedia policy 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wp:bot}
\end_layout

\end_inset

 polices bot activity against egregious edits (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg
\end_layout

\end_inset

, deemed 
\begin_inset Quotes eld
\end_inset

harmful
\begin_inset Quotes erd
\end_inset

 or 
\begin_inset Quotes eld
\end_inset

useless
\begin_inset Quotes erd
\end_inset

), excessive edits (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg
\end_layout

\end_inset

, exceeding one edit per ten seconds), and edits for which either no community
 consensus exists or community consensus exists but opposes them.
 If implemented as exogenous bots on active Wikipedia corpora, machine clustering
 could (conceivably) make excessively frequent edits on excessively many
 categories and categorizations, and thus be construed (or misconstrued)
 as contravening these policies.
 Alternatively, if implemented as endogenous extensions in the 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MW
\end_layout

\end_inset

 server(s) serving those corpora, machine clustering could constrain article
 categorizations in ways not covered by existing policy; there then
 arises the non-academic, practical issue of crafting and grafting
 appropriate policies over existing policy for those corpora.
 (This does not seem particularly fun.) Either way, existing policy should
 be addressed.
\end_layout

\begin_layout Itemize
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
itemlead{
\end_layout

\end_inset

Existing 
\begin_inset Flex CharStyle:Emph
status collapsed

\begin_layout Plain Layout
realpolitik
\end_layout

\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 For security, it's in the sensible nature of most human communities to
 homeostatically inhibit large-scale change.
 Wikipedia communities are little different.
 Machine clustering could be considered large-scale change, and could therefore
 run counter to community consensus on the nature, import, and intention of article
 clustering in Wikipedia.
 (This is probably a good thing.) Given its implicit, tacit ephemerality
 and inexplicability, existing politics probably need not be addressed ---
 but should be kept in mind.
\end_layout

\begin_layout Itemize
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
itemlead{
\end_layout

\end_inset

Existing technology
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
 For efficiency, 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
MW
\end_layout

\end_inset

 necessarily limits and delimits what, which, and how many algorithms, operation
s, and other requests may be performed on behalf of endogenous extensions
 and in response to exogenous bot and user requests.
 Machine clustering could, if insufficiently efficient, prove to perform
 so poorly as to be technically infeasible for real-world use.
 (This is probably not the case here.
 Technically speaking, some limited modicum of functionality mimicking genuine
 machine clustering should be sufficiently efficient --- if non-ideal.) Existing
 technology must thus be addressed.
\end_layout
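
\begin_layout Standard
The first constraint above admits a simple order-of-magnitude illustration.
 Assuming, hypothetically, a single bot that edits serially at the cited
 ceiling of one edit per ten seconds and must touch each of 
\begin_inset Formula $n\simeq3\times10^{6}$
\end_inset

 articles exactly once (both assumptions are ours, made only to fix the
 scale), one full pass over the corpus would take:
\begin_inset Formula \[
T\approx n\times10\,\mathrm{s}=3\times10^{7}\,\mathrm{s}\approx\frac{3\times10^{7}}{86400}\,\mathrm{days}\approx347\,\mathrm{days}.\]

\end_inset

 On these assumptions, a rate-limited exogenous bot could not keep pace
 with real-time edits, which is consistent with the concern raised above.
\end_layout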

\begin_layout Section
Verifiable Facts --- or Are They?
\begin_inset CommandInset label
LatexCommand label
name "sec:verifiable-facts"

\end_inset


\end_layout

\begin_layout Standard
Per Wikipedia policy 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{wp:v}
\end_layout

\end_inset

, 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
graffitoquote{
\end_layout

\end_inset

The one certain fact about these issues and a million like them is that
 there are no verifiable facts which might enable us to make up our minds
 about them.
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

Ray Billington
\begin_inset ERT
status open

\begin_layout Plain Layout

}{
\end_layout

\end_inset

on death & love
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 verifiable facts are statements attributable to reliable sources, where
 'reliable' and 'sources' are (occasionally subject to human discretion
 but) usually bound by these maxims:
\end_layout

\begin_layout Itemize
'Reliable' means 
\begin_inset Quotes eld
\end_inset

\SpecialChar \ldots{}
peer-reviewed journals and books published in university presses; university-lev
el textbooks; magazines, journals, and books published by respected publishing
 houses; and mainstream newspapers
\begin_inset Quotes erd
\end_inset

 as well as electronic media.
\end_layout

\begin_layout Itemize
'Sources' means third-party publishers acknowledged for factual accuracy.
\end_layout

\begin_layout Standard
Verifiable facts in Wikipedia articles therefore constitute a veritable
 datamine of falsifiable inquiry and knowledge on those articles.
 However, the typical long Wikipedia article consisting of 

\begin_inset Formula $32\,\mathrm{KB}$
\end_inset


\lang american
 worth of textual content
\begin_inset Foot
status open

\begin_layout Plain Layout
Wikipedia articles are customarily capped at 

\begin_inset Formula $32\,\mathrm{KB}$
\end_inset

 worth of textual content due to the technological limitations of constrained
 browsers (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg
\end_layout

\end_inset

, those executing on portable devices) and largely obsolete, but prevalent
 and therefore relevant, browsers (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg
\end_layout

\end_inset

, Internet Explorer, Netscape Navigator).
\end_layout

\end_inset

 also contains unusually many verifiable facts; this is particularly
 true for articles rated 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
noun{
\end_layout

\end_inset

GA
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 (good article) and 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
noun{
\end_layout

\end_inset

FA
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 (featured article).
 Winnowing this excess of verifiable facts for an article to the smallest
 subset of verifiable facts uniquely characterizing that article could assist
 in information retrieval.
\end_layout

\begin_layout Standard
Article categorization could offer one such means of winnowing verifiable
 facts.
 First, we note that the converse of the heuristic implied by the 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

Wikipedia:Overcategorization policy
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 is not necessarily true: each verifiable fact or intersection of verifiable
 facts in article content is not necessarily associated with an article
 categorization.
 Wikipedia articles are theoretically unbounded in length.
 Thus, the set of verifiable facts an article cites increases in size as
 that article increases in length.
 Infinitely long articles cite infinitely many verifiable facts.
 Thus, there can be no one-to-one correspondence between the set of verifiable
 facts for one article and set of verifiable facts implied by all categorization
s for that article.
 Rather, the latter is always a subset of the former.
 As a latent property of Wikipedia's insistence on community consensus,
 policy adherence, and iterative refinement, the latter (subset of verifiable
 facts implied by all categorizations for each article) usually converges
 to the smallest subset of facts uniquely characterizing that article.
 This unique characterization can be likened to article metadata and, like
 all such data, parsed, stored, indexed, and retrieved on a per-article
 basis.
\end_layout

\begin_layout Standard
As each categorization implies one or more verifiable facts, the set of
 all verifiable facts for an article is technically extractable from either
 explicit reference within article content or implicit insinuation within
 article categorizations.
 Not all things are equal here, however.
\end_layout

\begin_layout Standard
Verifiable facts expressed in the latter (categorizations) but not the former
 (content) constitute article errors, and could serve to inform concerned
 algorithms of inconsistent state in the Wikipedia corpus.
 Interestingly, verifiable facts expressed in the former (content) but not
 the latter (categorizations) constitute not errors but relatively 'uninteresting'
 facts; they were not sufficiently interesting to warrant inclusion as
 article categorizations and cannot contribute to the unique characterization
 of that article.
 Finally, verifiable facts expressed in both (content and categorizations)
 do constitute 'interesting' facts; they did warrant inclusion
 as article categorizations and, as that contributes to that article's unique
 characterization, could serve to inform concerned algorithms of the applicability
 of some verifiable fact to its host Wikipedia article.
 (That is, verifiable facts also expressed in an article's categorizations
 could be considered more relevant to that article than verifiable
 facts only expressed in that article's content.)
\end_layout

\begin_layout Standard
This suggests that, where not already the case, Wikipedia-specific information
 retrieval may be improved by incorporating article categorizations into
 that retrieval.
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout

% ......................{ Next Chapter          }......................
\end_layout

\begin_layout Plain Layout


\backslash
myChapter{Pipelines}
\end_layout

\end_inset


\begin_inset CommandInset label
LatexCommand label
name "ch:pipelines"

\end_inset


\end_layout

\begin_layout Standard
Machine clustering is not customarily performed by stand-alone algorithms.
 Rather, its customary implementation involves several algorithms in series,
 which we refer to as 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

algorithmic pipelines
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

.
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textcite{lizorkin:comstructure}
\end_layout

\end_inset

 outline one such pipeline for solving the complete machine clustering problem.
 Briefly, this is:
\end_layout

\begin_layout Enumerate
Application of the Clauset-Newman-Moore optimization 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{clauset:comstructure}
\end_layout

\end_inset

 of the Girvan-Newman algorithm 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{girvan:comstructure}
\end_layout

\end_inset

 for discovering graph communities in the article interlink graph, which
 is the graph implied by the set of all articles and article interlinks
 for the Wikipedia corpus.
\end_layout

\begin_layout Enumerate
Application of PageRank on each such community for identifying the central
 article in that community according to eigenvector centrality, and thereby
 associating that community with the qualitative topic implied by that central
 article.
\end_layout

\begin_layout Standard
Assuming an input snapshot of English Wikipedia circa August 2008 consisting
 of 
\begin_inset Formula $\sim1.1\times10^{6}$
\end_inset

 articles and 
\begin_inset Formula $\sim4.6\times10^{6}$
\end_inset

 article interlinks 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{lizorkin:comstructure}
\end_layout

\end_inset

, this algorithmic pipeline completely machine clusters that input in a
 little over 
\begin_inset Formula $4$
\end_inset

 days.
 This is 
\begin_inset Formula $345.6\times10^{3}$
\end_inset

 seconds or 
\begin_inset Formula $3.18$
\end_inset

 articles per second, and may or may not qualify as 
\begin_inset Quotes eld
\end_inset

practicable.
\begin_inset Quotes erd
\end_inset


\end_layout

\begin_layout Standard
To the extent that a Wikipedia corpus does not change, this could be considered
 practicable enough; to the extent, however, that it does, this could be
 considerably slower than real-world application of machine clustering requires.
 To better quantify its slowness and thereby convey a convenient sense of
 this pipeline's merits versus other such pipelines, conventional analysis
 of its running time in big 
\begin_inset Formula $O$
\end_inset

 notation could be instructive.
 Alas -- 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
citeauthor{lizorkin:comstructure}
\end_layout

\end_inset

 deigned not to suffix their smattering description of this pipeline with
 such analysis!
\end_layout

\begin_layout Standard
This chapter somewhat amends that.
 In 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:lizorken-complete}
\end_layout

\end_inset

 and 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:lizorken-incomplete}
\end_layout

\end_inset

, it imprecisely analyzes the relationship of input dataset size to output
 running time for this pipeline.
 Then, in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{sec:shunicycle}
\end_layout

\end_inset

, it succinctly proposes an alternative pipeline having very different character
istics.
\begin_inset Foot
status open

\begin_layout Plain Layout
Just 
\emph on
what
\emph default
 characteristics is probably the proper domain of the actual thesis!
\end_layout

\end_inset


\end_layout

\begin_layout Section
An Incomplete Complete Analysis
\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:lizorken-complete"

\end_inset

Given some arbitrary graph, the Girvan-Newman algorithm 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{girvan:comstructure}
\end_layout

\end_inset

 produces a dendrogram hierarchically describing all communities in that
 graph; that is, it algorithmically discovers all communities in all nested
 hierarchies in that graph.
 This is 
\begin_inset Formula $O(m^{2}n)$
\end_inset

 on arbitrary graphs having 
\begin_inset Formula $m$
\end_inset

 edges and 
\begin_inset Formula $n$
\end_inset

 vertices, and 
\begin_inset Formula $O(n^{3})$
\end_inset

 on sparse graphs for which 
\begin_inset Formula $m\simeq n$
\end_inset

.
 As Wikipedia is a sparse graph, this is 
\begin_inset Formula $O(n^{3})$
\end_inset

 on the Wikipedia corpus having 
\begin_inset Formula $n$
\end_inset

 articles.
\end_layout
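
\begin_layout Standard
For a rough sense of scale only, take the August 2008 snapshot cited above
 (
\begin_inset Formula $n\simeq1.1\times10^{6}$
\end_inset

) and assume, purely for illustration, one elementary operation per nanosecond
 with all constant factors ignored.
 A cubic bound then corresponds to roughly
\end_layout

\begin_layout Standard
\begin_inset Formula \[
n^{3}\simeq(1.1\times10^{6})^{3}\simeq1.3\times10^{18}\ \mathrm{operations}\simeq1.3\times10^{9}\,\mathrm{s}\simeq42\ \mathrm{years}\]

\end_inset


\end_layout

\begin_layout Standard
\noindent
This is an assumption-laden sketch rather than a measurement, but it suggests
 why the unoptimized algorithm is unlikely to be practical at this scale.
\end_layout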

\begin_layout Standard
Given some arbitrary graph, the Clauset-Newman-Moore optimization 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
autocite{clauset:comstructure}
\end_layout

\end_inset

 on this algorithm produces the same dendrogram in 
\begin_inset Formula $O(md\mathrm{log}n)$
\end_inset

 on arbitrary graphs having 
\begin_inset Formula $m$
\end_inset

 edges, 
\begin_inset Formula $n$
\end_inset

 vertices, and a dendrogram of depth 
\begin_inset Formula $d$
\end_inset

; 
\begin_inset Formula $O(nd\mathrm{log}n)$
\end_inset

 on sparse graphs for which 
\begin_inset Formula $m\simeq n$
\end_inset

; and 
\begin_inset Formula $O(n\mathrm{log}^{2}n)$
\end_inset

 on sparse graphs for which 
\begin_inset Formula $m\simeq n$
\end_inset

 and that dendrogram has a constant branching factor for which 
\begin_inset Formula $d\simeq\log n$
\end_inset

.
\begin_inset Foot
status open

\begin_layout Plain Layout
However, see 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
prettyref{app:clauset-errata}
\end_layout

\end_inset

 for exposition of possible errata in 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textcite{clauset:comstructure}
\end_layout

\end_inset

.
\end_layout

\end_inset

 Wikipedia is a sparse graph; however, it is unclear whether its dendrogram
 obeys a constant branching factor or not.
 In fact, 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
citeauthor{lizorkin:comstructure}
\end_layout

\end_inset

 visually demonstrate that its dendrogram probably follows a power-law distribut
ion.
 Thus, this is 
\begin_inset Formula $O(nd\mathrm{log}n)$
\end_inset

 on the Wikipedia corpus having 
\begin_inset Formula $n$
\end_inset

 articles and a dendrogram of indeterminate depth 
\begin_inset Formula $d$
\end_inset

.
\end_layout
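
\begin_layout Standard
How much the indeterminate depth 
\begin_inset Formula $d$
\end_inset

 matters can be sketched numerically.
 Assuming the dendrogram is built by pairwise merges, its depth lies somewhere
 between about 
\begin_inset Formula $\log_{2}n$
\end_inset

 (perfectly balanced) and 
\begin_inset Formula $n-1$
\end_inset

 (maximally unbalanced).
 Taking base-2 logarithms and the August 2008 snapshot size 
\begin_inset Formula $n\simeq1.1\times10^{6}$
\end_inset

 as illustrative assumptions, and ignoring constant factors, the two extremes
 are roughly
\end_layout

\begin_layout Standard
\begin_inset Formula \begin{eqnarray*}
n\log_{2}^{2}n & \simeq & 1.1\times10^{6}\times(20.07)^{2}\simeq4.4\times10^{8}\\
n^{2}\log_{2}n & \simeq & (1.1\times10^{6})^{2}\times20.07\simeq2.4\times10^{13}\end{eqnarray*}

\end_inset


\end_layout

\begin_layout Standard
\noindent
Under these assumptions the unknown depth alone spans a factor of about
 
\begin_inset Formula $n/\log_{2}n\simeq5\times10^{4}$
\end_inset

 in this bound.
\end_layout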

\begin_layout Standard
Given the dendrogram produced by the Clauset-Newman-Moore algorithm, it
 is also unclear how much additional computational complexity an application
 of PageRank on that dendrogram's communities contributes to this running
 time.
 It is probably not negligible, however.
 Let 
\begin_inset Formula $t_{pr}$
\end_inset

 be the indeterminate time PageRank contributes.
 Then the incomplete complete running time 
\begin_inset Formula $t$
\end_inset

 for the Lizorkin-Medelyan-Grineva pipeline is given by
\end_layout

\begin_layout Standard
\begin_inset Formula \[
t=f(n)=O(nd\mathrm{log}n+t_{pr})\]

\end_inset


\end_layout

\begin_layout Section
A Completely Incomplete Analysis
\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:lizorken-incomplete"

\end_inset

Alternately, assume this running time 
\begin_inset Formula $t$
\end_inset

 as a function of input dataset size crudely follows a polynomial relationship
 given by 
\begin_inset Formula $t=f(nm)=k(nm)^{a}$
\end_inset

, where 
\begin_inset Formula $k>0$
\end_inset

 and 
\begin_inset Formula $a>1$
\end_inset

 are constants in 
\begin_inset Formula $\mathbb{R}$
\end_inset

, 
\begin_inset Formula $t$
\end_inset

 is output running time,
\begin_inset Foot
status collapsed

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
citeauthor{lizorkin:comstructure}
\end_layout

\end_inset

 make no explicit mention of having parallelized this pipeline's operation
 across a supercomputing array.
 Thus, this crude analysis assumes pipeline operation for one machine.
 If this was not the case, it mildly but not wholly invalidates the big
 
\begin_inset Formula $O$
\end_inset

 notation running time that this analysis obtains; this is correctable by
 multiplying this running time by the number of machines 
\begin_inset Formula $j>1$
\end_inset

 participating in that array, assuming ideal linear speedup.
 Doing so raises the fitted exponent 
\begin_inset Formula $a$
\end_inset

 by only 
\begin_inset Formula $\ln j/\ln(nm)$
\end_inset

.
 (In any case, this crude analysis is just that: crude, and not to be taken
 too literally.)
\end_layout

\end_inset

 and 
\begin_inset Formula $n$
\end_inset

 and 
\begin_inset Formula $m$
\end_inset

 are the numbers of input articles and article interlinks, respectively.
 Then, taking 
\begin_inset Formula $k\simeq1$
\end_inset

 in the unit of time chosen below, 
\begin_inset Formula $t=f(nm)\simeq(nm)^{a}=O\left(n^{a}m^{a}\right)$
\end_inset

 and, from above, we have 
\begin_inset Formula $t=345.6\times10^{3}\,\mathrm{s}$
\end_inset

, 
\begin_inset Formula $n=1.1\times10^{6}$
\end_inset

, and 
\begin_inset Formula $m=4.6\times10^{6}$
\end_inset

.
\end_layout

\begin_layout Standard
Solving for 
\begin_inset Formula $a$
\end_inset

 requires we assume some fundamental unit of time for 
\begin_inset Formula $t=f(nm)$
\end_inset

.
 Noting that modern machines operating at 
\begin_inset Formula $\sim3\,\mathrm{GHz}$
\end_inset

 or 
\begin_inset Formula $\sim3\times10^{9}\,\mathrm{cycles}/\mathrm{s}=3\,\mathrm{cycles}/\mathrm{ns}\simeq1\,\mathrm{cycle}/\mathrm{ns}$
\end_inset

 perform on the order of one fundamental operation per 
\begin_inset Formula $\mathrm{ns}$
\end_inset

, we take the 
\begin_inset Formula $\mathrm{ns}$
\end_inset

 to be our fundamental unit of time.
 Thus, we have 
\begin_inset Formula $t=345.6\times10^{12}\,\mathrm{ns}$
\end_inset

.
 This suggests 
\begin_inset Formula $345.6\times10^{12}\simeq(1.1\times10^{6}\cdot4.6\times10^{6})^{a}=(5.06\times10^{12})^{a}$
\end_inset

 and, solving for 
\begin_inset Formula $a$
\end_inset

, that 
\begin_inset Formula $a\simeq\ln(345.6\times10^{12})/\ln(5.06\times10^{12})\simeq33.48/29.25\simeq1.14$
\end_inset

.
 Finally, this suggests 
\begin_inset Formula $f(nm)=O(n^{1.14}m^{1.14})$
\end_inset

 for the average case as signified by this sample Wikipedia corpus.
\end_layout

\begin_layout Standard
This can be further simplified, fortunately.
 As articles for all Wikipedia corpora are sparsely connected, we have 
\begin_inset Formula $n\simeq m$
\end_inset

.
 Then, the completely incomplete running time 
\begin_inset Formula $t$
\end_inset

 for the Lizorkin-Medelyan-Grineva pipeline is given by 
\begin_inset Formula $t=f(n)=O(n^{1.14}n^{1.14})=O(n^{2.28})$
\end_inset

 or, in synopsis,
\end_layout

\begin_layout Standard

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
\begin_inset Formula \[
t=f(n)=O(n^{2.3})\]

\end_inset


\end_layout

\begin_layout Standard
\noindent
This seems particularly slow.
 Perhaps we can do better.
 (Perhaps we can't.)
\end_layout

\begin_layout Standard
\begin_inset Note Comment
status collapsed

\begin_layout Plain Layout
FIXME: Obsolete.
\end_layout

\begin_layout Plain Layout
operate in 
\begin_inset Formula $O(mlogn)$
\end_inset

 time where, here, 
\begin_inset Formula $n$
\end_inset

 is the number of Wikipedia articles and 
\begin_inset Formula $m$
\end_inset

 the number of directed interlinks (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, internal links) from one such article to another.
 Machine clustering does not, therefore, qualify as 
\begin_inset Quotes eld
\end_inset

computation free.
\begin_inset Quotes erd
\end_inset

 If sufficiently efficient (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg
\end_layout

\end_inset

, in less than 
\begin_inset Formula $O(n^{3})$
\end_inset

 polynomial time), however, it may still qualify as practicable.
\end_layout

\begin_layout Plain Layout
Naturally, --- which this thesis proposal intends to address.
\end_layout

\end_inset


\end_layout

\begin_layout Section
An Alternative -- The Weakly Connected Component Pipeline
\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:shunicycle"

\end_inset


\begin_inset VSpace medskip
\end_inset


\end_layout

\begin_layout Paragraph

\family typewriter

\lyxline

\family default
\shape smallcaps
\lang american
Note:
\end_layout

\begin_layout Standard
Woefully, this section is unfinished.
 Due to the inadequacies of time as well as page and sanity constraints,
 I hope this meager summary and summary conclusion suffice.
\lyxline

\family sans

\begin_inset VSpace bigskip
\end_inset


\end_layout

\begin_layout Standard
\noindent
This section intended to discuss 
the 
\lang american

\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textcite{tarjan:depthfirst}
\end_layout

\end_inset

 algorithm for graph discovery of strongly connected components in 
\begin_inset Formula $O(n+m)$
\end_inset

 time, where 
\begin_inset Formula $n$
\end_inset

 and 
\begin_inset Formula $m$
\end_inset

 are the number of vertices and edges, respectively, in that graph.
\end_layout
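
\begin_layout Standard
For a sense of what a linear bound means at Wikipedia scale, and assuming
 only the August 2008 snapshot sizes quoted earlier in this chapter with
 constant factors ignored, the bound corresponds to roughly
\end_layout

\begin_layout Standard
\begin_inset Formula \[
n+m\simeq1.1\times10^{6}+4.6\times10^{6}=5.7\times10^{6}\ \mathrm{elementary\ steps}\]

\end_inset


\end_layout

\begin_layout Standard
\noindent
Even with a generous constant factor, that is a small workload next to the
 multi-day run reported above for the Lizorkin-Medelyan-Grineva pipeline,
 although the two pipelines do not compute the same thing and the comparison
 is indicative only.
\end_layout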

\begin_layout Standard
It then intended to discuss a novel algorithm for graph discovery of so-called
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
term{
\end_layout

\end_inset

shunicycles
\begin_inset ERT
status open

\begin_layout Plain Layout

}
\end_layout

\end_inset

 (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, shortest unique cycles) of strongly connected components I recently invented
 atop the Tarjan algorithm.
 The former is considerably more complex than the latter; consequently,
 I found a formal analysis of average and worst case running times for the
 shunicycle algorithm to be arduous.
\end_layout

\begin_layout Standard
In blatant fact, I gave up.
 But only after having obtained a worst case running time 
\begin_inset Formula $t$
\end_inset

 for this algorithm in unsimplified form, resembling
\end_layout

\begin_layout Standard
\begin_inset Formula \begin{eqnarray*}
t & = & \sum_{i=1}^{\lg n-2}(2i-1)\left\{ \sum_{j=1}^{\lg n-2}(\lg n-1-j)\left[\left(\sum_{k=\lg n-i}^{\lg n-2}2^{\lg n+2-j-k}\right)+i2^{\lg n-1-j}\right]\right\} +\\
 &  & 2\sum_{i=0}^{\lg n-1}\sum_{j=0}^{2^{i-1}-1}\left(2ij+1+\sum_{k=1}^{i-1}\left\lceil \frac{j}{2^{i-k-1}}\right\rceil \right)\end{eqnarray*}

\end_inset


\end_layout

\begin_layout Standard
\noindent
Momentarily lacking access to a symbolic mathematics solver (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
eg
\end_layout

\end_inset

, Axiom, Maple), I elected to stop here.
 This was probably a laudable choice.
\end_layout

\begin_layout Standard
I had intended a series of other sections.
 Happily, I have them written; unhappily, I have them written on paper.
 These are:
\end_layout

\begin_layout Enumerate
A section documenting the applicability of this work to Dr.
 Trotman's INEX 2010 Entity Ranking Track.
 It was, is, and shall remain my hope that this work interfaces well with
 on-going effort in that track.
\end_layout

\begin_layout Enumerate
A section documenting the ad-hoc analytic efforts of 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textcite{dolan:sixdegrees}
\end_layout

\end_inset

, whose seminal 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
citetitle{dolan:sixdegrees}
\end_layout

\end_inset

 served as modest inspiration for my own.
\end_layout

\begin_layout Enumerate
An appendix statistically questioning whether the average number of interlinks
 per article for 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
EW
\end_layout

\end_inset

 and 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
IW
\end_layout

\end_inset

 is a polynomial function of the number of articles in each such Wikipedia
 corpus or not.
 And, indeed, this does appear to be the case: the average number of interlinks
 per article for 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
IW
\end_layout

\end_inset

 appears to be a quadratic function of the number of articles.
 As of July 2009, 
the average number of interlinks per article for 
\lang american

\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
IW
\end_layout

\end_inset


 is 
\begin_inset Formula $24.87$
\end_inset

 and increasing by 
\begin_inset Formula $1.39$
\end_inset

 per annum (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, this number has a positive velocity of 
\begin_inset Formula $\nicefrac{1.39}{\mathrm{year}}$
\end_inset

).
 However, this rate of increase is itself decreasing by 
\begin_inset Formula $0.14$
\end_inset

 per annum each year (
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
ie
\end_layout

\end_inset

, this number has a negative acceleration of 
\begin_inset Formula $\nicefrac{-0.14}{\mathrm{year}^{2}}$
\end_inset

).
 Thus the quadratic nature of this number.
\end_layout

\begin_layout Enumerate

An appendix statistically estimating the number of article interlinks for
 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
EW
\end_layout

\end_inset

, for which no formal statistics have been published since October 2006
 due to software failure at the 
\emph on
Wikipedia:Statistics
\emph default
 project.
\end_layout

\begin_layout Standard

And so forth.
 I should conclude on a happy note: merely writing this wild, writhing proposal
 has been a matter of much joy, triage, and adagios of learning.
 Learning
\lang american
 LyX
 was hard and 
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
LaTeX
\backslash
 
\end_layout

\end_inset

 harder -- but the result, I trust, worth it.
\begin_inset VSpace vfill
\end_inset


\end_layout

\begin_layout Standard
\noindent
\align center

\family typewriter

\lyxline
Humbly yours,
\begin_inset VSpace medskip
\end_inset


\end_layout

\begin_layout Standard
\noindent
\align center

\family typewriter
Brian W.
 Curry
\end_layout

\end_body
\end_document