Robot Soul: Top 20 Hits

Top 20 Hits

Zipf's law states that a few words are used very often, while most words are used rarely. For AI developers who categorize client inputs using pattern matching, this means that the pattern of the */default/miscellaneous category is invariably the one most frequently matched. This seems to hold true even for systems that serve many clients and provide large data sets.

The Pandorabots bot hosting service, for instance, has responded to around 300,000,000 client inputs so far, and Dr. Richard Wallace (who ought to know) recently reported to the Robitron group that a Pandorabot's probability of matching the default AIML category ranges between 2 and 5 percent, which puts the wildcard pattern at the head of the Zipf curve. The actual percentage seems to depend on the botmaster's competence (and investment in dev time), but no bot has yet pushed it from the head of the curve.

In an earlier but related discussion on the Alicebot newslist, Alexander E. Richter, founder of the Parsimony bot hosting service (currently hosting more than 1,600 active bots), remarked that, when measuring with a bot that featured 2,000,000 AIML categories, it turned out that 5 % of those categories matched 95 % of all inputs.

Here are my current Top 20 AIML patterns, complete with their estimated matching probabilities (after normalization and typo correction):



3.500 % *

1.400 % YES

0.700 % NO

0.350 % WHY

0.270 % HI/HELLO

0.210 % GOOD/COOL

0.200 % BYE

0.190 % HOW OLD ARE YOU

0.140 % HOW ARE YOU

0.120 % THANK YOU/THANKS

0.110 % WHAT

0.080 % OH

0.077 % REALLY

0.075 % YOU

0.074 % WHAT IS YOUR NAME

0.072 % I DO NOT KNOW

0.070 % FUCK YOU

0.068 % SO

0.065 % ME TOO

0.063 % LOL

The rounding is somewhat rough to increase readability; what I want to communicate is the proportions of the curve, which I'm sure are recognizable to botmasters everywhere. If it takes Parsimony 100,000 (= 5 % of 2,000,000) categories to match 95 % of their client inputs, around 7.834 % of the matches are made by the top 20 patterns.
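For anyone who wants to check the sum, here is a minimal Python sketch. The figures are copied from the list above; the names `TOP_20`, `total` and `wildcard_share` are just illustrative, and the sketch assumes each percentage is a share of all client inputs.

```python
# Estimated matching probabilities in percent, copied from the Top 20 list.
# Assumption: each figure is a share of ALL client inputs.
TOP_20 = [
    ("*", 3.500),
    ("YES", 1.400),
    ("NO", 0.700),
    ("WHY", 0.350),
    ("HI/HELLO", 0.270),
    ("GOOD/COOL", 0.210),
    ("BYE", 0.200),
    ("HOW OLD ARE YOU", 0.190),
    ("HOW ARE YOU", 0.140),
    ("THANK YOU/THANKS", 0.120),
    ("WHAT", 0.110),
    ("OH", 0.080),
    ("REALLY", 0.077),
    ("YOU", 0.075),
    ("WHAT IS YOUR NAME", 0.074),
    ("I DO NOT KNOW", 0.072),
    ("FUCK YOU", 0.070),
    ("SO", 0.068),
    ("ME TOO", 0.065),
    ("LOL", 0.063),
]

# Combined share of the twenty patterns.
total = sum(pct for _, pct in TOP_20)
print(f"patterns: {len(TOP_20)}")
print(f"combined share: {total:.3f} %")  # 7.834 %

# How much of that combined share is the wildcard alone.
wildcard_share = TOP_20[0][1] / total
print(f"wildcard share of the top 20: {wildcard_share:.1%}")
```

The wildcard alone accounts for a bit less than half of what the whole top 20 covers, which is the "head of the curve" effect described above.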

So, using the Parsimony system as a benchmark, I assume that it takes 20 categories to make a good 7 % of the matches, plus 99,980 to make 95 %, plus 1,900,000 to make 100 % (given the inputs of Parsimony's user community, which, with 1,600 bots and several thousand fora, is fairly large). As a ballpark measure, this seems good enough for me to use for the moment.
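The three tiers can be worked out in a few lines of Python. This is only a sketch of the ballpark above: it combines my own top 20 estimate with Parsimony's 5 % / 95 % figure, which is an assumption rather than a measurement, and all names in it (`tiers`, `HEAD_CATEGORIES` and so on) are made up for illustration.

```python
# Figures taken from the text; combining them is the ballpark assumption.
TOTAL_CATEGORIES = 2_000_000                     # Parsimony test bot
HEAD_CATEGORIES = TOTAL_CATEGORIES * 5 // 100    # 5 % of the categories
TOP_CATEGORIES = 20                              # my Top 20 list
TOP_COVERAGE = 7.834                             # percent of matches, top 20
HEAD_COVERAGE = 95.0                             # percent of matches, head

# Each tier: (label, number of categories, percent of matches it adds).
tiers = [
    ("top 20", TOP_CATEGORIES, TOP_COVERAGE),
    ("rest of the head", HEAD_CATEGORIES - TOP_CATEGORIES,
     HEAD_COVERAGE - TOP_COVERAGE),
    ("long tail", TOTAL_CATEGORIES - HEAD_CATEGORIES,
     100.0 - HEAD_COVERAGE),
]

for name, categories, coverage in tiers:
    # Average share of matches contributed by a single category in this tier.
    per_category = coverage / categories
    print(f"{name:>16}: {categories:>9,} categories, "
          f"{coverage:6.3f} % of matches, {per_category:.7f} % each")
```

The per-category averages make the steepness of the curve visible: a category in the top 20 does several orders of magnitude more work than one in the long tail.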

What's your ballpark measure?

scheuring - 15. Apr, 15:16

4 comments

drgold - 18. Apr, 04:46

you wouldn't be willing to post your top 100 hits would you? =)
nice blog, the only one I've seen that gives so much coverage to chatbots and related issues.

scheuring - 22. Apr, 14:35

Top 400

About five years ago (pre-ProgramD), the "standard" AIML interpreter - and the first one written in Java - was ProgramB. There was a servlet version of B, which loaded the categories in the client's browser, and to get things going as fast as possible, somebody made a file with the most frequently matched categories, which got loaded first. This was called "std-65percent.aiml", because it was assumed to provide roughly 65 % of coverage (for the AIML set then known as "std-AIML", which had about 12,000 patterns), and it includes about 400 patterns. HTH.

drgold - 28. Apr, 05:41

thanks scheuring!

Cool, that seems to cover most of the basics. Definitely a good place for me to start, thanks.

scheuring - 29. Apr, 15:57

Gabe,

you're welcome.
