Introduction
First things first. Yes, it is a spelling checker; no, you don't need any DLLs or ActiveX controls; and yes, it is free.
But it's slow, and it needs quite a bit more work (as you can see from the screenshot, it's giving hundreds of suggestions!).
The main reason I'm posting it here unfinished is so that other people can improve (or rip out and start again) the code that is its main "engine", and add other features to it.
Thanks
Sections of code come from the following people:
- Zafir Anjum - CListCtrl sort code
- Chris Maunder - For this site and the CProgressControl class
Background
A few months ago, I was in need of a spell checker to add to one of my applications. I looked at various DLLs and ActiveX controls, but they didn't seem to do things the way I wanted them to. I also found several Open Source spell checkers for Unix, but it wasn't really possible to convert them to Windows easily.
In the end, I decided to try making my own, and after experimenting a bit, I decided I needed a dictionary first, in order to test the accuracy of the results.
For the last four months, I've been compiling the dictionary, and then I went back to my original test code to see how it worked. It was very slow, and as I don't have much experience with huge arrays and lists, I thought it best to post it here and let others see and improve or change it.
Components
This project/article consists of three sections and downloads: the dictionary itself, a dictionary editor program, and the test spell checker program.
Dictionary
The dictionary consists of nearly 75,000 words, and is in English (UK/International). The bulk of the words came from several dictionaries available on the web. Surprisingly, though they had lots of place names and words I'd never even heard of before (but checked in the OED and found they did exist :) ), they seemed to lack many everyday words (e.g. their, owner, to name a few). I've spent the months since I started this project adding words and checking them (as well as most of the existing words), and while I'm fairly confident its accuracy is very good, it's not going to be perfect, and there'll probably be more than a few words missing :).
By all means, add words to it, or send the words to me in a text file and I'll update it. It's UK English at the moment, though, so it's probably better to start an American English version if we're going to be adding words like "Color" and "Customize" rather than "Colour" and "Customise" <g>.
It is in the binary format the sample programs use, which I'll explain in the next section. You can use the Dictionary Editor program to export it to a plain text file.
Dictionary Editor
First of all, I'll talk about the format the dictionary is in, since both programs use this format to read and edit the dictionary. I'm basically just using a CStringArray which contains all the words, and I serialize this to a file. It's faster than the CObArray I was using before, but it's still not fast enough.
The program itself is simply a default MFC AppWizard program, with a CListView view subclassed to a custom CListCtrl that has a sort function added (thanks to Zafir Anjum on CodeGuru), and the function modified slightly so that it is case insensitive. It uses a CStringArray to store each word.
Most of the code in the Dictionary Editor is commented, so I won't explain it completely here.
It has three toolbar buttons (Add, Delete and Modify) which allow you to add, delete and modify words.
It also has Import and Export commands on the File menu, which allow you to import and export words from and to text files with a single word on each line.
The Import function has code commented out that checks each word that is to be added and makes sure it doesn't exist already. The reason it's commented out is that it is far too slow to be usable: it was still going after 7 hours importing the full dictionary, whereas UltraEdit manages to remove duplicate items AND sort them in under two seconds, which says much about this code :).
Clicking on the Word header control will sort the words.
The Tools menu contains the following commands:
Trim
The Trim command goes through all the words and does a TrimLeft() and TrimRight() on each of them to remove leading and trailing spaces.
void CDictionaryEditorView::OnToolsTrim()
{
    CWaitCursor wait;
    CListCtrl& m_List = GetListCtrl();
    int x = m_List.GetItemCount();

    // Rebuild the document's word array from the trimmed list items.
    GetDocument()->array.RemoveAll();
    for (int j = 0; j < x; j++)
    {
        CString strWord;
        strWord = m_List.GetItemText(j, 0);
        strWord.TrimLeft();
        strWord.TrimRight();
        m_List.SetItemText(j, 0, strWord);
        GetDocument()->AddWord(strWord);
    }
    GetDocument()->SetModifiedFlag();
}
Lowercase
The Lowercase command goes through all the words and does a MakeLower() on them to make them all lowercase.
void CDictionaryEditorView::OnToolsLowercase()
{
    CWaitCursor wait;
    CListCtrl& m_List = GetListCtrl();
    int x = m_List.GetItemCount();

    // Rebuild the document's word array from the lower-cased list items.
    GetDocument()->array.RemoveAll();
    for (int j = 0; j < x; j++)
    {
        CString strWord;
        strWord = m_List.GetItemText(j, 0);
        strWord.MakeLower();
        m_List.SetItemText(j, 0, strWord);
        GetDocument()->AddWord(strWord);
    }
    GetDocument()->SetModifiedFlag();
}
Find and Find Next
The Find command goes through the list of words, finds words which have certain characters in them, and highlights and selects them one at a time. I used this to find words which weren't imported correctly (due to the binary format) and had invalid characters on the sides that the Trim command didn't remove, and then edited them manually.
The Find Next command carries on from the last found word.
void CDictionaryEditorView::OnToolsFind()
{
    CListCtrl& m_List = GetListCtrl();
    m_nFind = 0;    // Find always starts from the top of the list
    int x = m_List.GetItemCount();
    for (int j = m_nFind; j < x; j++)
    {
        CString strWord;
        strWord = m_List.GetItemText(j, 0);
        if (strWord.FindOneOf("����������������������������"
            "����������������������������������������") != -1)
        {
            // Select the match, scroll it into view, and remember
            // where Find Next should resume.
            m_List.SetItemState(j, LVIS_SELECTED | LVIS_FOCUSED,
                LVIS_SELECTED | LVIS_FOCUSED);
            m_List.EnsureVisible(j, FALSE);
            m_nFind = j + 1;
            return;
        }
    }
}
Test spell checker
This is where the problems start. It's messy, and slow.
Anyway, the main test program consists of a main dialog which has a text box and two command buttons on it. The CSpellDlg class handles this dialog, as well as the main spell checking functions.
The Options button brings up the Options dialog that lets you configure where the dictionary is and where the custom dictionary should be stored. You'll need to do this before you test the program.
There is also the spell checker dialog that you'll probably be familiar with in one form or another (this one is like Visio's), which is displayed when a word that isn't recognized is found. It shows the word and any spelling suggestions, and returns which button was clicked to the main dialog.
So, this is how it works (or doesn't, as the case may be :) ):
Assuming there is a sentence or two in the main text box, when you click the Spell button, the OnSpell() function is executed. This does the following things:
First, it gets the text from the text box using the UpdateData() function, and then it checks to see if the dictionaries exist, and if they do, it loads them. It then creates a CStringArray to hold the words it should ignore for one session (when the user clicks Ignore All), and initializes the counters.
UpdateData(TRUE);

// Load the main dictionary if the configured file exists.
CFileStatus status;
if (CFile::GetStatus(AfxGetApp()->GetProfileString("Settings", "Main", ""), status))
{
    CFile cfSettingsFile(AfxGetApp()->GetProfileString("Settings", "Main", ""),
        CFile::modeNoTruncate | CFile::modeReadWrite);
    CArchive ar(&cfSettingsFile, CArchive::load);
    array.Serialize(ar);
}

// Load the custom dictionary if the configured file exists.
CFileStatus status2;
if (CFile::GetStatus(AfxGetApp()->GetProfileString("Settings", "Custom", ""), status2))
{
    CFile cfSettingsFile(AfxGetApp()->GetProfileString("Settings", "Custom", ""),
        CFile::modeNoTruncate | CFile::modeReadWrite);
    CArchive ar(&cfSettingsFile, CArchive::load);
    custarray.Serialize(ar);
}

CString strStart = m_strText;
CStringArray IgnoreAll;    // words to skip for the rest of this session
int nPos = 0;
int nPos2 = 0;
The next step is a While loop that looks for specific characters, using a custom FindOneOf() function that allows you to specify where to start looking from. These characters are the ones that usually separate words ("!'()[]<>,.). Each pass through this loop finds the next word.
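For readers who want to try the word-splitting step outside MFC, here is a minimal sketch in standard C++. `std::string::find_first_of` already accepts a start position, which is the feature the custom FindOneOf() adds to CString. The helper name `SplitWords` is made up for illustration, and the caller passes the separator set, so nothing here claims to reproduce the article's exact list.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split text into words at any separator character.
std::vector<std::string> SplitWords(const std::string& text,
                                    const std::string& separators)
{
    std::vector<std::string> words;
    std::string::size_type start = 0;
    while (start < text.size()) {
        // Search for the next separator at or after 'start'.
        std::string::size_type end = text.find_first_of(separators, start);
        if (end == std::string::npos)
            end = text.size();
        // Consecutive separators produce empty spans, which are skipped.
        if (end > start)
            words.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return words;
}
```

Called as `SplitWords("Helo wrld, this (is) a test.", " \"!'()[]<>,.")`, it yields the six words with the punctuation and spaces dropped.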
Within this loop, the word is extracted and leading and trailing blank spaces are removed. If what is left is empty, is only one character long, or contains numbers or certain other characters, it is ignored.
CString strWord = strStart.Mid(nPos, nPos2 - nPos);
strWord.TrimLeft();
strWord.TrimRight();
if (strWord == "" || strWord.GetLength() == 1 ||
    strWord.FindOneOf("0123456789+-/@?:*.,") != -1)
{
    // Empty, single-character, or contains digits/symbols: ignore it.
}
If the word passes this filter, the program goes through the dictionary to see if the word exists. It also goes through the custom dictionary and the IgnoreAll list. If the word is found in any of them, the bFound flag is set to TRUE, the word is skipped, and the main While loop moves on to the next word.
BOOL bFound = FALSE;

// Main dictionary
for (int i = 0; i < array.GetSize(); i++)
{
    CString strCheckWord;
    strCheckWord = GetWord(i);
    if (strWord.CompareNoCase(strCheckWord) == 0)
    {
        bFound = TRUE;
        break;
    }
}

// Custom dictionary
for (int j = 0; j < custarray.GetSize(); j++)
{
    CString strCheckWord;
    strCheckWord = GetWordCustom(j);
    if (strWord.CompareNoCase(strCheckWord) == 0)
    {
        bFound = TRUE;
        break;
    }
}

// Words the user chose to Ignore All
for (int f = 0; f < IgnoreAll.GetSize(); f++)
{
    if (strWord.CompareNoCase((LPCTSTR)IgnoreAll[f]) == 0)
    {
        bFound = TRUE;
        break;
    }
}
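Each of these loops walks its whole list for every word in the text, which is a large part of why checking is slow with about 75,000 entries. One possible alternative, sketched here in standard C++ rather than MFC, is to keep the words in a hash set keyed on their lower-cased form. The `WordSet` class and the `LowerCopy` and `IsKnownWord` helpers are made-up names for illustration, and the lower-casing only handles ASCII.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical helper: ASCII-only lower-casing, so lookups ignore case.
std::string LowerCopy(std::string word)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// Hypothetical container: a case-insensitive set of words.
class WordSet
{
public:
    void Add(const std::string& word) { words_.insert(LowerCopy(word)); }

    // Average cost is independent of the number of stored words.
    bool Contains(const std::string& word) const
    {
        return words_.count(LowerCopy(word)) != 0;
    }

private:
    std::unordered_set<std::string> words_;
};

// Same decision as the three loops: known if it is in any of the lists.
bool IsKnownWord(const std::string& word, const WordSet& mainDict,
                 const WordSet& customDict, const WordSet& ignoreAll)
{
    return mainDict.Contains(word) || customDict.Contains(word) ||
           ignoreAll.Contains(word);
}
```

A sorted array with a binary search would also work and keeps the words in order for the editor; the hash set is just the shortest thing to show.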
If the word hasn't been found anywhere, we get to the fun bit: the word is highlighted in the text box, possible suggestions for what the word should be are found, and the Spell Check dialog box is displayed so the user can choose what action to take.
This is the section that needs rewriting, as it's messy and doesn't work that well. What I attempted to do was go through each word in the dictionary and see whether it matched a shrinking prefix of the misspelt word (all of its characters from the left, then all but the last one, then all but the last two, and so on). The closer a dictionary word was to the misspelt word, the nearer the top of the suggestions list it was added.
However, the code runs these comparisons even when the dictionary word starts with a different character from the misspelt word, in which case no prefix can match. With 26 possible starting letters, roughly 25/26 of the time the loop is useless, so this needs to be changed.
This method of finding the correct word will also not find "upon" if "apon" was typed, whereas most other spell checkers find this with no problem at all. It would be more accurate if some form of vowel swapping was introduced, as well as running the algorithm backwards, starting from the right.
m_Text.SetSel(nPos, nPos2);
CSpellDialog dlg;
dlg.m_strWord = strWord;
int nGood = 0;    // number of "good" suggestions kept at the top of the list
for (int i = 0; i < array.GetSize(); i++)
{
    CString strGetWord;
    strGetWord = GetWord(i);

    // Try ever shorter prefixes of the misspelt word.
    for (int j = strWord.GetLength(); j >= 1; --j)
    {
        if (strGetWord.Left(j).CompareNoCase(strWord.Left(j)) == 0)
        {
            // A "good" match shares all but the last two characters
            // (or the first two, for words of three characters or fewer).
            BOOL bGood = FALSE;
            if (strWord.GetLength() <= 3)
            {
                if (strGetWord.Left(2).CompareNoCase(strWord.Left(2)) == 0)
                {
                    bGood = TRUE;
                }
            }
            else
            {
                if (strGetWord.Left(strWord.GetLength() - 2).CompareNoCase(
                    strWord.Left(strWord.GetLength() - 2)) == 0)
                {
                    bGood = TRUE;
                }
            }

            // Skip words that are already in the suggestions list.
            BOOL bFoundWord = FALSE;
            for (int k = 0; k < dlg.m_saSuggestions.GetSize(); k++)
            {
                if (strGetWord.CompareNoCase(
                    (LPCTSTR)dlg.m_saSuggestions[k]) == 0)
                {
                    bFoundWord = TRUE;
                    break;
                }
            }
            if (bFoundWord == FALSE)
            {
                // Only suggest words within one character of the same length.
                if ((strGetWord.GetLength() >= (strWord.GetLength() - 1)) &&
                    (strGetWord.GetLength() <= (strWord.GetLength() + 1)))
                {
                    if (bGood == TRUE)
                    {
                        dlg.m_saSuggestions.InsertAt(nGood, strGetWord);
                        nGood++;
                    }
                    else
                    {
                        dlg.m_saSuggestions.Add(strGetWord);
                    }
                }
                break;
            }
        }
    }
}
This process is then repeated for the custom dictionary.
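As a point of comparison, and not what the code above does, here is a minimal sketch that ranks suggestions by Levenshtein edit distance (the number of single-character insertions, deletions and substitutions needed to turn one word into another). It is written in standard C++ instead of MFC, the function names `EditDistance` and `Suggest` are made up for illustration, and it assumes both the typed word and the dictionary are already lower case. Because it does not depend on a shared prefix, it treats "apon" and "upon" as one edit apart.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance computed row by row with two buffers.
std::size_t EditDistance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical helper: dictionary words within maxDistance edits of
// 'word', closest first (ties in alphabetical order).
std::vector<std::string> Suggest(const std::string& word,
                                 const std::vector<std::string>& dictionary,
                                 std::size_t maxDistance)
{
    std::vector<std::pair<std::size_t, std::string>> scored;
    for (const std::string& candidate : dictionary) {
        // Cheap length filter before the more expensive distance check.
        std::size_t diff = candidate.size() > word.size()
                               ? candidate.size() - word.size()
                               : word.size() - candidate.size();
        if (diff > maxDistance)
            continue;
        std::size_t d = EditDistance(word, candidate);
        if (d <= maxDistance)
            scored.emplace_back(d, candidate);
    }
    std::sort(scored.begin(), scored.end());
    std::vector<std::string> result;
    for (const auto& entry : scored)
        result.push_back(entry.second);
    return result;
}
```

This still scans the whole dictionary for every misspelt word, so it addresses the quality of the suggestions rather than the speed; a small maxDistance such as 1 or 2 keeps the list short.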
The final stage is to display the Spell Check dialog box, which allows the user to say if they want to Ignore the word, Add it to the custom dictionary, or Change it for one of the suggestions in the list box. The program then does what the user wants, depending on which button they clicked.
if (dlg.DoModal() == IDOK)
{
    if (dlg.m_nOption == 1)
    {
        // Add: store the word in the custom dictionary and save it.
        CString strAddWord = strWord;
        strAddWord.MakeLower();
        AddWordCustom(strAddWord);
        CFile cfSettingsFile(AfxGetApp()->GetProfileString("Settings", "Custom", ""),
            CFile::modeCreate | CFile::modeNoTruncate | CFile::modeReadWrite);
        CArchive ar2(&cfSettingsFile, CArchive::store);
        custarray.Serialize(ar2);
    }
    else if (dlg.m_nOption == 2)
    {
        // Change: replace the word in the text with the chosen suggestion.
        strStart.Delete(nPos, nPos2 - nPos);
        m_strText.Delete(nPos, nPos2 - nPos);
        strStart.Insert(nPos, dlg.m_strChangeTo);
        m_strText.Insert(nPos, dlg.m_strChangeTo);
        nPos2 = nPos + dlg.m_strChangeTo.GetLength();
        UpdateData(FALSE);
    }
    else if (dlg.m_nOption == 3)
    {
        // Nothing to do: carry on with the next word.
    }
    else if (dlg.m_nOption == 4)
    {
        // Ignore All: skip this word for the rest of the session.
        CString strIgnoreAllWord = strWord;
        strIgnoreAllWord.MakeLower();
        IgnoreAll.Add(strIgnoreAllWord);
    }
}
else
{
    return;
}
It should "work" now :).
Improvements
Lots, I know <g>.
The suggestion-finding algorithm needs a lot of work and a lot of speeding up, as it wastes quite a bit of time comparing strings when the first characters aren't even the same. This is the main bottleneck, but everything really needs optimizing.
At present, it only looks for correct words that start with the same characters as the misspelt word, but the mistake isn't always at the end of the word. An algorithm that swaps vowels around would probably work best. Looking at the way Microsoft Word suggests words, though, it knows which word you mean even when it isn't that similar to what you typed in some cases, so perhaps something more complicated is going on, such as the dictionary being subdivided into sections so that words that are often misspelt as each other are kept together, making it easier for the program to pick suggestions.
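To make the vowel-swapping idea concrete, here is a minimal sketch under some loud assumptions: it is standard C++ rather than MFC, the function name `VowelSwapSuggestions` is made up, it only tries replacing one vowel at a time, it treats only a, e, i, o and u as vowels, and it expects lower-case input. It is one possible reading of the idea, not a description of how any existing spell checker works.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: replace each vowel in 'word' with every other
// vowel in turn and keep the variants that are in the dictionary.
std::vector<std::string> VowelSwapSuggestions(
    const std::string& word,
    const std::unordered_set<std::string>& dictionary)
{
    const std::string vowels = "aeiou";
    std::vector<std::string> suggestions;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        if (vowels.find(word[i]) == std::string::npos)
            continue;    // not a vowel, leave this position alone
        for (char v : vowels) {
            if (v == word[i])
                continue;
            std::string candidate = word;
            candidate[i] = v;
            if (dictionary.count(candidate) != 0)
                suggestions.push_back(candidate);
        }
    }
    return suggestions;
}
```

For a word with k vowels this makes only 4k dictionary lookups, so with a hash set it stays cheap however large the dictionary is. With "upon" in the dictionary, "apon" produces "upon".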
Another thing that probably should be added is a Change All button, where all words misspelt one way are corrected at once.
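A Change All feature could be sketched roughly as follows. This is standard C++ rather than MFC, `ChangeAll` is a made-up name, the match is case sensitive, and the caller supplies the separator characters that mark word boundaries, so treat it as an illustration of the idea rather than code from the sample program.

```cpp
#include <string>

// Hypothetical helper: replace every whole-word occurrence of 'wrong'
// with 'right'. A match counts as a whole word when it is bounded by
// the start or end of the text or by a separator character.
std::string ChangeAll(std::string text, const std::string& wrong,
                      const std::string& right, const std::string& separators)
{
    if (wrong.empty())
        return text;
    std::string::size_type pos = 0;
    while ((pos = text.find(wrong, pos)) != std::string::npos) {
        std::string::size_type end = pos + wrong.size();
        bool startsWord = pos == 0 ||
            separators.find(text[pos - 1]) != std::string::npos;
        bool endsWord = end == text.size() ||
            separators.find(text[end]) != std::string::npos;
        if (startsWord && endsWord) {
            text.replace(pos, wrong.size(), right);
            pos += right.size();    // continue after the replacement
        } else {
            ++pos;                  // part of a longer word, keep looking
        }
    }
    return text;
}
```

The boundary test is what stops "teh" inside a longer word such as "tehran" from being changed.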
Half the problem is probably the fact that I'm using MFC functions throughout, but I tried using standard C functions for everything except the array, and it didn't seem to help, so I took them out again for the purpose of this article (I've left one function in there so you can see the sort of thing I tried).
If you have any suggestions or improvements, please either send them to me or post them here.