Top 20 Hits
Zipf's law states that a few words are used very often, while most words are used rarely. For AI developers who categorize client inputs by pattern matching, this means that the pattern of the */default/miscellaneous category is invariably the one matched most frequently. This seems to hold true even for systems that serve many clients and provide large data sets.
The Pandorabots bot hosting service, for instance, has responded to around 300,000,000 client inputs so far, and Dr. Richard Wallace (who ought to know) recently reported to the Robitron group that a Pandorabot's probability of matching the default AIML category ranges between 2 and 5 percent, which puts the wildcard pattern at the head of the Zipf curve. The actual percentage seems to depend on the botmaster's competence (and investment in dev time), but no bot has yet pushed it from the head of the curve.
In an earlier but related discussion on the Alicebot newslist, Alexander E. Richter, founder of the Parsimony bot hosting service (currently hosting more than 1,600 active bots), remarked that, when measuring with a bot that featured 2,000,000 AIML categories, 5 % of those categories matched 95 % of all inputs.
Here are my current Top 20 AIML patterns, complete with their estimated matching probabilities (after normalization and typo correction):
3.500 % *
1.400 % YES
0.700 % NO
0.350 % WHY
0.270 % HI/HELLO
0.210 % GOOD/COOL
0.200 % BYE
0.190 % HOW OLD ARE YOU
0.140 % HOW ARE YOU
0.120 % THANK YOU/THANKS
0.110 % WHAT
0.080 % OH
0.077 % REALLY
0.075 % YOU
0.074 % WHAT IS YOUR NAME
0.072 % I DO NOT KNOW
0.070 % FUCK YOU
0.068 % SO
0.065 % ME TOO
0.063 % LOL
The rounding is somewhat rough to increase readability; what I want to communicate is the proportions of the curve, which I'm sure are recognizable to botmasters everywhere. If it takes Parsimony 100,000 (= 5 % of 2,000,000) categories to match 95 % of their client inputs, the top 20 patterns alone account for around 7.834 % of all inputs (the sum of the percentages above).
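As a quick check on that sum, here is a minimal Python sketch that adds up the twenty estimates and prints a running total by rank. The figures are copied from the list above; the names `TOP20`, `total` and `running` are only illustrative and do not belong to any AIML tool.

```python
# Estimated match probabilities in percent of all inputs, copied from the list above.
# TOP20 and the other names here are illustrative, not part of any AIML tool.
TOP20 = [
    ("*", 3.500),
    ("YES", 1.400),
    ("NO", 0.700),
    ("WHY", 0.350),
    ("HI/HELLO", 0.270),
    ("GOOD/COOL", 0.210),
    ("BYE", 0.200),
    ("HOW OLD ARE YOU", 0.190),
    ("HOW ARE YOU", 0.140),
    ("THANK YOU/THANKS", 0.120),
    ("WHAT", 0.110),
    ("OH", 0.080),
    ("REALLY", 0.077),
    ("YOU", 0.075),
    ("WHAT IS YOUR NAME", 0.074),
    ("I DO NOT KNOW", 0.072),
    ("FUCK YOU", 0.070),
    ("SO", 0.068),
    ("ME TOO", 0.065),
    ("LOL", 0.063),
]

# Round to three decimals to hide floating-point noise in the sum.
total = round(sum(pct for _, pct in TOP20), 3)
print(f"Top 20 combined: {total:.3f} %")  # Top 20 combined: 7.834 %

# Running total by rank, to show how quickly the curve flattens.
running = 0.0
for rank, (pattern, pct) in enumerate(TOP20, start=1):
    running += pct
    print(f"{rank:2d}  {pattern:<18} {pct:6.3f} %  cumulative {running:6.3f} %")
```

The wildcard alone contributes 3.5 of the 7.834 percentage points, a bit under half of what the whole top 20 covers.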
So, using the Parsimony system as a benchmark, I assume that it takes 20 categories to make a good 7 % of the matches, another 99,980 to reach 95 %, and a further 1,900,000 to reach 100 % (given the inputs of Parsimony's user community, which, with 1,600 bots and several thousand fora, is fairly large). As a ballpark measure, this seems good enough for me to use for now.
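The three tiers of this ballpark can be laid out in a few lines of Python. This is only a sketch under the same assumption made above, namely that my own top-20 estimates carry over to Parsimony's 2,000,000-category bot; the variable names are made up for illustration.

```python
# Ballpark tiers, assuming (as in the text) that the top-20 estimates
# transfer to the Parsimony benchmark. All names are illustrative.
TOTAL_CATEGORIES = 2_000_000
HEAD_CATEGORIES = TOTAL_CATEGORIES * 5 // 100   # 5 % of categories = 100,000
HEAD_COVERAGE = 95.0                            # percent of inputs matched by the head
TOP20_COVERAGE = 7.834                          # percent of inputs matched by the top 20

tiers = [
    ("top 20", 20, TOP20_COVERAGE),
    ("rest of head", HEAD_CATEGORIES - 20, HEAD_COVERAGE - TOP20_COVERAGE),
    ("long tail", TOTAL_CATEGORIES - HEAD_CATEGORIES, 100.0 - HEAD_COVERAGE),
]

for name, categories, coverage in tiers:
    # Average share of inputs matched by a single category in this tier.
    per_category = coverage / categories
    print(f"{name:<13} {categories:>9,} categories  "
          f"{coverage:7.3f} % of inputs  "
          f"{per_category:.3e} % per category")
```

On these numbers an average top-20 pattern carries roughly 0.39 % of all inputs, while an average long-tail category carries only a few millionths of a percent.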
What's your ballpark measure?
scheuring - 15. Apr, 15:16
nice blog, the only one I've seen that gives so much coverage to chatbots and related issues.