A sound or a combination of sounds, or its representation in writing or printing,
that symbolizes and communicates a meaning and may consist of a single morpheme
or of a combination of morphemes. Something said; an utterance, remark, or comment:
May I say a word about that? A command or direction; an order: gave the word to draw
back. An assurance or promise; sworn intention: She has kept her word. A verbal
signal; a password or watchword. Discourse or talk; speech: Actions speak louder than
words. Music: The text of a vocal composition; lyrics. Hostile or angry remarks made
back and forth. News: Any word on your promotion? Rumor: Word has it they're separating.
Used euphemistically in combination with the initial letter of a term that is considered
offensive or taboo or that one does not want to utter: "Although economists
here will not call it a recession yet, the dreaded 'R' word is beginning to pop
up in the media" (Francine S. Keefer). The Scriptures; the Bible. Computers:
A set of bits that is of a fixed size and is typically operated on by a computer's
processor. In precisely those words; exactly: hinted at impending indictment but did
not say it in so many words. Speaking candidly and straightforwardly: In so many words,
the weather has been horrible. Not conversational or loquacious; laconic: a person of
few words. Displaying personal dependability: a woman of her word. To be convinced
of another's sincerity and act in accord with his or her statement: We took them
at their word that the job would be done on time. To believe what someone says without
investigating further.

We have already observed that dictionaries are a, perhaps
the, central component of MT systems. In previous Chapters, we have
presented a highly simplified view of dictionaries: for example, in one of them the
dictionary was sometimes little more than a list of rules such as v → walk, which
only allows information about part of speech to be represented, and in another
we gave translation rules which simply paired up the citation forms of source
and target words. However, though some
of the information that is found in a typical paper dictionary is of limited
value in MT (e.g. information about pronunciation is
only useful in speech to speech systems), in general the quality and detail of
the information one needs for MT is at least equal to that which one finds in
paper dictionaries. In this section we discuss the various pieces of
information about words that a good MT system must contain, basing ourselves on
the dictionary entries above. An issue we will not address in this Chapter is
the treatment of idioms, which one typically finds in
paper dictionary entries. We discuss the treatment of idioms
in another Chapter.

It is useful to make a distinction between the characteristics of a
word itself (its inherent properties) and the restrictions it places on other
words in its grammatical environment. Although this distinction is not
explicitly drawn in paper dictionaries, information of both types is available in
them. Information about grammatical properties includes the indication of
gender in the French part of the bilingual dictionary entry, and the indication
of number on nouns (typically, the citation form of nouns is the
singular form, and information about number is only explicitly given for nouns
which have only plural forms, such as scissors and trousers).

Information about the grammatical environment a word can appear in is normally
thought of as dividing into two kinds: subcategorization information,
which indicates the syntactic environments that a word can occur in, and
selectional restrictions, which describe semantic properties of the
environment. Typical information about subcategorization is
the information that button is a transitive verb. This is expressed in
the verb code [Tn] in the dictionary entry. More precisely, this indicates
that it is a verb that appears as the HEAD of sentences with a (noun phrase)
SUBJECT and a (noun phrase) OBJECT. OALD uses a range of such verb codes. [I]
refers to intransitive verbs that only need a subject to form a grammatical
sentence, [Tn] to transitive verbs (like button) that need a subject and
an object, and [Dn.pr] to ditransitive verbs which take a subject and two
objects, where the second one is introduced by the preposition to. Further codes
mark ditransitive verbs that take a subject plus two object nouns, complex
transitive verbs which require a subject, an object and an infinitival
(non-tensed) clause introduced by to, and transitive verbs taking a subject, an
object and a finite (tensed) sentence introduced by that. Finally,
[La] refers to linking verbs which link an adjectival phrase (which describes in
some way the subject) to the subject, and a companion code refers to linking
verbs which link a noun phrase to the subject.

Verbs are not the only word categories that subcategorize for
certain elements in their environment. Nouns exhibit the same phenomenon, like
those nouns that have been derived from verbs (deverbal nouns): The death of
the leader shocked everybody. The devastation of the city by the Romans was
thorough. Similarly, there are some adjectives that subcategorize for
certain complements. Note that in the examples below we find three different
types of complements, and that (b) and (c) differ from each other because in (b) the
subject of the main clause is also the understood subject of the subclause,
whereas in (c) the subject of the main clause is the understood object of the
subclause. (a) Mary was proud of her performance. (b) He was eager to unwrap his
present. (c) That matter is easy to deal with.

An adequate dictionary of English
would probably have to recognize at least twenty different subcategorization classes
of verb, and a similar number for adjectives and nouns. The reason one cannot
be precise about the number of different subcategorization classes is that it
depends (a) on how fine the distinctions are that one wants to draw, and (b) on
how far one relies on rules or general principles to capture regularities. For
example, probably all verbs allow coordinated subjects such as Sam and Leslie,
but there are some, like meet, where this is equivalent to an ordinary
transitive sentence (Sam and Leslie met means the same as Sam met Leslie). One
could decide to recognize this distinction by creating a separate
subcategorization class, thus extending the number of
classes. But one could also argue that this fact about meet and similar
verbs is probably related to their semantics (they describe symmetric
relations, in the sense that if A meets B, then B meets A), and is thus regular
and predictable. The appropriate approach could then be to treat it by means
of a general linguistic rule (perhaps one that transforms structures of one form
into the other). Of course, unless one can rely on semantic
information to pick out verbs like meet, one will have to introduce some mark
on such verbs to ensure that they, and only they, undergo this rule.
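
One way to picture such a mark is as a lexical flag consulted by a general rule. The sketch below is illustrative only: the `symmetric` flag, the `LEXICON` table and the `coordinate_subject` function are hypothetical names, the direction of the rule is an arbitrary choice, and sentences are reduced to (subject, verb, object) triples rather than real syntactic structures.

```python
# Hypothetical mini-lexicon. "symmetric" is an assumed mark, not an OALD code;
# both verbs keep the same ordinary transitive subcategorization.
LEXICON = {
    "meet": {"cat": "v", "subcat": "subj_obj", "symmetric": True},
    "see": {"cat": "v", "subcat": "subj_obj", "symmetric": False},
}


def coordinate_subject(subject, verb, obj):
    """Map (A, verb, B) to ("A and B", verb, None) for marked verbs only."""
    entry = LEXICON[verb]
    if entry.get("symmetric"):
        return (f"{subject} and {obj}", verb, None)
    # The rule does not apply to unmarked verbs: leave the structure alone.
    return (subject, verb, obj)


print(coordinate_subject("Sam", "meet", "Leslie"))
print(coordinate_subject("Sam", "see", "Leslie"))
```
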
However, this is not necessarily the same as introducing a
subcategorization class.

Subcategorization information indicates that, for
example, the verb button occurs with a noun phrase OBJECT. In fact, we
know much more about the verb than this: the
OBJECT, or in terms of semantic roles, the PATIENT, of
the verb has to be a `buttonable' thing, such as a piece of clothing, and
the SUBJECT (more precisely AGENT) of the verb is normally animate. Such
information is commonly referred to as the selectional restrictions that words place
on items that appear in constructions where they are the HEAD. This
information is implicit in the paper dictionary entry above: the information that the
object of button is inanimate, and normally an item of clothing, has to
be worked out from the use of sth (= `something') in the definition, and the
example, which gives coat, jacket, shirt as possibilities. The entry nowhere
says the SUBJECT of the verb has to be an animate entity (probably human),
since no other entity can perform the action of `buttoning'. It is assumed
(rightly) that the human reader can work this sort of thing out for herself.
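
What the human reader works out for herself can be written down as explicit features. The following is a minimal sketch, assuming a toy semantic typing of nouns; the attribute names `sem_agent` and `sem_patient`, the `NOUNS` table and the functions `satisfies` and `acceptable` are invented for illustration and are not taken from any particular MT system.

```python
# Toy semantic features for a few nouns (assumed values, for illustration).
NOUNS = {
    "coat": {"animate": False, "clothing": True},
    "girl": {"animate": True, "clothing": False},
    "idea": {"animate": False, "clothing": False},
}

# Selectional restrictions of the verb "button", stated explicitly:
# the AGENT must be animate, the PATIENT must be an item of clothing.
BUTTON = {
    "lex": "button",
    "cat": "v",
    "sem_agent": {"animate": True},
    "sem_patient": {"clothing": True},
}


def satisfies(noun, restriction):
    """True if the noun has every feature value the restriction demands."""
    features = NOUNS[noun]
    return all(features.get(attr) == value for attr, value in restriction.items())


def acceptable(verb, subject, obj):
    """Check subject and object against the verb's selectional restrictions."""
    return satisfies(subject, verb["sem_agent"]) and satisfies(obj, verb["sem_patient"])


print(acceptable(BUTTON, "girl", "coat"))
print(acceptable(BUTTON, "idea", "coat"))
```
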
This information has to be made explicit if it is to be used in analysis,
transfer or synthesis, of course.

Basic inherent
information and information about subcategorization and selectional restrictions can
be represented straightforwardly for MT purposes. Essentially, entries in an MT
dictionary will be equivalent to collections of attributes and values (i.e.
features). For example, one might have an entry
for the noun button indicating that its base, or citation, form
is button, that it is a common noun, and that it is concrete
(rather than abstract, like happiness or sincerity). An obvious way to implement
such things is as records in a database, with attributes naming fields (e.g.
cat), and values as the contents of the fields (e.g. n). But it is not always
necessary to name the field: one could, for example, adopt a convention
that the first field in a record always contains the citation form (in
this case the value of the feature lex),
that the second field indicates the category, and that the third field gives some
sort of subdivision of the category. Looking at the dictionary entry for the noun
button it becomes clear that different parts of speech will have a different collection
of attributes. For example, verbs will have a verb-type feature
rather than a noun-type feature, and while
verbs might have fields for indications of number, person and tense, one would
not expect to find such fields for prepositions. In the entry we have given we
also find one attribute, number, without a value. The idea here is to indicate that a
value for this attribute is possible, but is not inherent to the word button, which
may have different number values on different occasions (unlike e.g. trousers,
which is always plural). Of course, this sort of blank field is essential if
fields are indicated by position, rather than name. In systems which name
attribute fields it might simply be equivalent to omitting the attribute, but
maintaining the field is still useful because it helps someone who has to
modify the dictionary to understand the information in the dictionary. An
alternative to giving a blank value is to follow the practice of some paper
dictionaries and fill in the default, or (in some sense) normal, value. For an
attribute like number, this would presumably be
singular. This alternative, however, is unfashionable these days, since it goes
against the generally accepted idea that in the best case linguistic processing
only adds, and never changes, information. The attraction of such an
approach is that it makes the order in which things are done less critical (cf.
our remarks about the desirability of separating declarative and procedural
information in an earlier Chapter).

In order to include information about
subcategorization and selectional restrictions, one has two
options. The first is to encode it via sets of attributes with atomic values
such as those above. In practice, this would mean that one might have features such
as subcat=subj_obj, and sem_patient=clothing. As regards subcategorization
information, this is essentially the approach used in the monolingual paper
dictionary above. In some systems this may be the only option. However, some
systems may allow values to be sets, or lists, in which case one has more
flexibility. For example, one might represent subcategorization information by
means of a list of categories, for example subcat =
[np,np,np] might indicate a verb that allows three NPs (such as give),
and [np,np,pp] might indicate a verb that takes
two NPs and a PP. A notation which allows the lexicographer to indicate other
properties of the items would be still more expressive. For example, it would
be useful to indicate that with give, the preposition in the PP has to
be to. This would mean that instead of `pp'
and `np' one would have collections of features,
and perhaps even pieces of syntactic structure. (A current trend in
computational linguistics involves the development of formalisms that allow
such very detailed lexical entries, and we will say a little more about them in
a later Chapter).

Turning now to the treatment of translation information in MT
dictionaries, one possibility is to attempt to represent all the relevant
information by means of attributes and values. Thus, as an addition to the
dictionary entry for button given above, a transformer system could specify a `translation' feature
which has as its value the appropriate target language word; e.g. trans = bouton for translation into French.
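
The attribute-value entries discussed so far, extended with such a translation feature, might be sketched as follows. The attribute names `lex`, `cat`, `subcat` and `trans` come from the text; `ntype`, `concrete`, `number` and `prep`, the positional field order, and the helper functions `as_named` and `fill` are assumptions made purely for this example.

```python
# Named fields: attributes name the fields, values are their contents.
button_noun = {
    "lex": "button",     # citation form
    "cat": "n",          # category
    "ntype": "common",   # subdivision of the category (name assumed)
    "concrete": True,    # concrete rather than abstract (name assumed)
    "number": None,      # blank: possible, but not inherent to the word
    "trans": "bouton",   # target word for translation into French
}

# Positional fields: order is a convention, so the blank slot must be kept.
FIELDS = ("lex", "cat", "ntype", "concrete", "number", "trans")
button_positional = ("button", "n", "common", True, None, "bouton")


def as_named(record):
    """Turn a positional record into the equivalent named record."""
    return dict(zip(FIELDS, record))


def fill(entry, **values):
    """Return a copy with blank attributes filled in; never change a value."""
    result = dict(entry)
    for attr, value in values.items():
        if result.get(attr) is not None and result[attr] != value:
            raise ValueError(f"{attr} already has the value {result[attr]!r}")
        result[attr] = value
    return result


# List-valued subcategorization whose items are themselves feature sets,
# so that the preposition required by give can be stated ("prep" is assumed).
give_verb = {
    "lex": "give",
    "cat": "v",
    "subcat": [{"cat": "np"}, {"cat": "np"}, {"cat": "pp", "prep": "to"}],
}

plural_button = fill(button_noun, number="plural")
print(plural_button["number"], plural_button["trans"])
```

Here `fill` only adds information to a copy of the entry and refuses to overwrite an existing value, which is one simple way of respecting the idea that processing adds but never changes information.
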
One might also include features which trigger certain transformations (for
example for changing word order for certain words). However, this is not a
particularly attractive view. For one thing, it is clearly oriented in one direction,
and it will be difficult to produce entries relating to the other direction of
translation from such entries. More generally, one
wants a bilingual dictionary to allow the replacement of certain source language
oriented information with corresponding target language information, i.e. to
replace the information one derives from the source dictionary by information
derived from the target dictionary. This suggests the use of translation rules
which relate head words to head words, that is, rules of the type we introduced
in an earlier Chapter, such as the one for temperature.

As we noted before, not all translation rules can be a simple
mapping of source language words onto their target language equivalents. One will
have to put conditions on the rules. For example, one might like to be able to
describe in the bilingual entry that deals with like and plaire
the change in grammatical relations that occurs if one is working with
relatively shallow levels of representation. In effect, the transfer rule
that we gave for this example in an earlier Chapter might be seen as a bilingual
lexical entry. Other translation rules that may require more than just a simple
pairing of source and target words are those that treat phenomena like idioms
and compounds, and some cases of
lexical holes (discussed in another Chapter). To deal with such phenomena
bilingual dictionary entries may have a single lexical item on the side of one
language, whereas the other side describes a (possibly quite complex)
linguistic structure.

The entry for button taken from a paper dictionary at the beginning
of this Chapter illustrates an issue of major importance to the automatic
processing of some languages, including English. This is the very widespread incidence
of homography in the language. Loosely speaking, homographs
are words that are written in the same way. However, it is important to
distinguish several different cases (sometimes the term homography is restricted
to only one of them). First, there is the case where what is intuitively a single
noun (for example) has several different readings. This can be seen with the
entry for button, where a reading relating to clothing is distinguished from a
`knob' reading. Second, there is the case where one has related items of
different categories which are written alike. For example, button can be either
a noun or a verb. Third, there is the case
where one has what appear to be unrelated items which happen to be written
alike. The classic example of this is the noun bank, which can designate
either the side of a river, or a financial institution.

These distinctions have
practical implications when one is writing (creating, extending, or modifying) a
dictionary, since they relate to the question of when one should create a new
entry (by defining a new headword). The issues involved are rather different
when one is creating a `paper' dictionary (where issues of readability are
paramount) or a dictionary for MT, but it is in any case very much a pragmatic
decision. One good guiding principle one might adopt is to group entries
hierarchically in terms of amounts of shared information. For example, there is
relatively little that the two senses of bank share apart from their
citation form and the fact that they are both common nouns, so one may as well
associate them with different entries. In a computational setting where one has
to give unique names to different entries, this will involve creating headwords
such as bank_1 and bank_2, or bank_finance and bank_river. As regards the noun
and verb button, though one might want to have some way of indicating that they
are related, they do not share much information, and can therefore be treated as
separate entries. For multiple readings of a word, for example the two
readings of the noun button, on the other hand, most information is shared:
they differ mainly in their semantics. In this case, it might be useful to
impose an organization in the lexicon in which information can be inherited
from an entry into sub-entries (or more generally, from one entry to another),
or to see them as subentries of an abstract `protoentry' of some sort. This
will certainly save time and effort in dictionary construction: though the
savings one makes may look small in one case, they become significant when
multiplied by the number of items that have different readings (this is certainly
in the thousands, perhaps the hundreds of thousands, since most words listed in
normal dictionaries have at least two readings). The issues this raises are
complex and we cannot do them justice here; however, the following will
give a flavor of what is involved. More generally, what one is talking about
here is inheritance of properties between entries (or from entries into
subentries). This is illustrated in Figure. One could imagine extending this,
introducing abstract entries expressing information true of classes of (real)
entry. For example, one might want to specify certain facts about all nouns
(all noun readings) just once, rather than stating them separately in each
entry. The entry for a typical noun might then be very simple, saying no more
than `this is a typical noun', and giving the citation form (and semantics, and
translation, if appropriate). One allows for subregularities (that is, lexical
elements which are regular in some but not all properties), by allowing
elements to inherit some information while expressing the special or irregular
information directly in the entry itself. In many cases, the optimal
organization can turn out to be quite complicated, with entries inheriting from
a number of different sources. Such an approach becomes even more attractive if
default inheritance is possible, that is, if information is inherited
unless it is explicitly contradicted in an entry or reading. It would then be possible
to say, for example, `this is a typical noun, except for the way it forms its
plural'. One final and important component of an MT dictionary, which is
entirely missing in paper dictionaries (at least in their printed, public form),
is documentation. Apart from general documentation describing design decisions,
and terminology, and providing lists and definitions
(including operational tests) for the attributes and values that are used in
the dictionary (it is, obviously, essential that such terms are used
consistently, and consistency is a
problem since creating and maintaining a dictionary is not a task that can be
performed by a single individual), it is important that each entry include some
lexicographers' comments: information about who created the entry, when
it was last revised, the kinds of example it is based on, what problems there
are with it, and the sorts of improvement that are required. Such information
is vital if a dictionary is to be maintained and extended. In general, though the
quality and quantity of such documentation has no effect on the actual performance
of the dictionary, it is critical if a dictionary is to be modified or extended.
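
Two points from the closing paragraphs, default inheritance between entries and per-entry lexicographers' comments, lend themselves to a small sketch. It is a simplified illustration under stated assumptions: the `Entry` class, its field names, the feature values and the `ox` example are all invented here, each entry has a single parent (whereas the text notes that real entries may inherit from several sources), and the comment values are placeholders rather than real documentation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """A lexical entry: local features, an optional parent, and comments."""
    name: str
    features: dict
    parent: Optional["Entry"] = None
    comments: dict = field(default_factory=dict)

    def get(self, attr):
        # Default inheritance: a local value wins; otherwise ask the parent.
        entry = self
        while entry is not None:
            if attr in entry.features:
                return entry.features[attr]
            entry = entry.parent
        return None


# Abstract entry stating facts about typical nouns just once.
TYPICAL_NOUN = Entry("typical_noun", {"cat": "n", "plural": "add -s"})

# Proto-entry for the noun button, documented with placeholder comments.
button = Entry(
    "button",
    {"lex": "button"},
    parent=TYPICAL_NOUN,
    comments={
        "created_by": "(lexicographer's name)",
        "last_revised": "(date)",
        "based_on": "(kinds of example)",
        "problems": "(known problems)",
        "improvements": "(improvements required)",
    },
)

# The two readings share everything except their semantics.
button_1 = Entry("button_1", {"sem": "clothing fastener"}, parent=button)
button_2 = Entry("button_2", {"sem": "knob"}, parent=button)

# "A typical noun, except for the way it forms its plural."
ox = Entry("ox", {"lex": "ox", "plural": "oxen"}, parent=TYPICAL_NOUN)

print(button_1.get("cat"), button_2.get("lex"), ox.get("plural"))
```
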