Friday, January 8, 2016

About words


A sound or a combination of sounds, or its representation in writing or printing, that symbolizes and communicates a meaning and may consist of a single morpheme or of a combination of morphemes. Something said; an utterance, remark, or comment: May I say a word about that? A command or direction; an order: gave the word to draw back. An assurance or promise; sworn intention: She has kept her word. A verbal signal; a password or watchword. Discourse or talk; speech: Actions speak louder than words. Music: the text of a vocal composition; lyrics. Hostile or angry remarks made back and forth. News: Any word on your promotion? Rumor: Word has it they're separating. Used euphemistically in combination with the initial letter of a term that is considered offensive or taboo or that one does not want to utter: "Although economists here will not call it a recession yet, the dreaded 'R' word is beginning to pop up in the media" (Francine S. Keefer). The Scriptures; the Bible. Computers: a set of bits that is of a fixed size and is typically operated on by a computer's processor. In precisely those words; exactly: hinted at impending indictment but did not say it in so many words. Speaking openly and straightforwardly: In so many words, the weather has been horrible. Not conversational or loquacious; laconic: a person of few words. Displaying personal dependability: a woman of her word. To be convinced of another's sincerity and act in accord with his or her statement: We took them at their word that the job would be done on time. To believe what someone says without investigating further.
We have already observed that dictionaries are a, perhaps the, central component of MT systems.
In previous Chapters, we have presented a highly simplified view of dictionaries. For example, in one earlier Chapter the dictionary was sometimes little more than a list of rules such as v → walk, which only allows information about part of speech to be represented, and in another we gave translation rules which simply paired up the citation forms of source and target words. However, though some of the information that is found in a typical paper dictionary is of limited value in MT (e.g. information about pronunciation is only useful in speech-to-speech systems), in general the quality and detail of the information one needs for MT is at least equal to that which one finds in paper dictionaries. In this section we discuss the various pieces of information about words that a good MT system must contain, basing ourselves on the dictionary entries above. An issue we will not address in this Chapter is the treatment of idioms, which one typically finds in paper dictionary entries. We discuss the treatment of idioms in a separate Chapter.
It is useful to make a distinction between the characteristics of a word itself (its inherent properties) and the restrictions it places on other words in its grammatical environment. Although this distinction is not explicitly drawn in paper dictionaries, information of both types is available in them. Information about grammatical properties includes the indication of gender in the French part of the bilingual dictionary entry, and the indication of number on nouns (typically, the citation form of nouns is the singular form, and information about number is only explicitly given for nouns which have only plural forms, such as scissors and trousers). Information about the grammatical environment a word can appear in is normally thought of as dividing into two kinds: subcategorization information, which indicates the syntactic environments that a word can occur in, and selectional restrictions, which describe semantic properties of the environment.
Typical information about subcategorization is the information that button is a transitive verb. This is expressed in the verb code in the dictionary entry. More precisely, this indicates that it is a verb that appears as the HEAD of sentences with a (noun phrase) SUBJECT and a (noun phrase) OBJECT. The verb codes used here are those of OALD. Note that [I] refers to intransitive verbs that only need a subject to form a grammatical sentence, [Tn] to transitive verbs (like button) that need a subject and an object, and [Dn.pr] to ditransitive verbs which take a subject and two objects, where the second one is introduced by the preposition to. Other codes cover ditransitive verbs that take a subject plus two object nouns, complex transitive verbs which require a subject, an object and an infinitival (non-tensed) clause introduced by to, and transitive verbs taking a subject, an object and a finite (tensed) sentence introduced by that. [La] refers to linking verbs which link an adjectival phrase (which describes the subject in some way) to the subject, and a further code refers to linking verbs which link a noun phrase to the subject.
Verbs are not the only word categories that subcategorize for certain elements in their environment. Nouns exhibit the same phenomenon, like those nouns that have been derived from verbs (deverbal nouns):
The death of the leader shocked everybody.
The destruction of the city by the Romans was thorough.
Similarly, there are some adjectives that subcategorize for certain complements. Note that in the examples below we find three different types of complements, and that (b) and (c) differ from each other because in (b) the subject of the main clause is also the understood subject of the subclause, whereas in (c) the subject of the main clause is the understood object of the subclause.
(a) Mary was proud of her performance.
(b) He was eager to unwrap his present.
(c) That matter is easy to deal with.
An adequate dictionary of English would probably have to recognize at least twenty different subcategorization classes of verb, and a similar number for adjectives and nouns. The reason one cannot be precise about the number of different subcategorization classes is that it depends (a) on how fine the distinctions are that one wants to draw, and (b) on how far one relies on rules or general principles to capture regularities. For example, probably all verbs allow coordinated subjects such as Sam and Leslie, but there are some, like meet, where this is equivalent to an ordinary transitive construction (Sam and Leslie met means the same as Sam met Leslie). One could decide to recognize this distinction by creating a separate subcategorization class, thus extending the number of classes. But one could also argue that this fact about meet and similar verbs is probably related to their semantics (they describe symmetric relations, in the sense that if A meets B, then B meets A), and is thus regular and predictable. The appropriate approach could then be to treat it by means of a general linguistic rule (perhaps one that transforms structures of the one form into structures of the other). Of course, unless one can rely on semantic information to pick out verbs like meet, one will have to introduce some mark on such verbs to ensure that they, and only they, undergo this rule. However, this is not necessarily the same as introducing a subcategorization class.
Subcategorization information indicates that, for example, the verb button occurs with a noun phrase OBJECT. In fact, we know much more about the verb than this: the OBJECT, or in terms of semantic roles the PATIENT, of the verb has to be a `buttonable' thing, such as a piece of clothing, and the SUBJECT (more precisely AGENT) of the verb is normally animate. Such information is commonly referred to as the selectional restrictions that words place on items that appear in constructions where they are the HEAD.
This information is implicit in the paper dictionary entry above: the information that the object of button is inanimate, and normally an item of clothing, has to be worked out from the use of sth (= `something') in the definition, and from the example, which gives coat, jacket, shirt as possibilities. The entry nowhere says that the SUBJECT of the verb has to be an animate entity (probably human), since no other entity can perform the action of `buttoning'. It is assumed (rightly) that the human reader can work this sort of thing out for herself. This information has to be made explicit if it is to be used in analysis, transfer or synthesis, of course.
Basic inherent information and information about subcategorization and selectional restrictions can be represented straightforwardly for MT purposes. Essentially, entries in an MT dictionary will be equivalent to collections of attributes and values (i.e. features). For example, an entry for the noun button might indicate that its base, or citation, form is button, and that it is a common noun which is concrete (rather than abstract, like happiness or sincerity). An obvious way to implement such things is as records in a database, with attributes naming fields (e.g. cat), and values as the contents of the fields (e.g. n). But it is not always necessary to name the field: one could, for example, adopt a convention that the first field in a record always contains the citation form (in this case the value of the feature lex), that the second field indicates the category, and that the third field indicates some sort of subdivision of the category. Looking at the dictionary entry for the noun button, it becomes clear that different parts of speech will have a different collection of attributes. For example, verbs will have a feature for verb type rather than for noun type, and while verbs might have fields for indications of number, person and tense, one would not expect to find such fields for prepositions.
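The attribute-value view of an entry described above can be sketched as follows. This is only an illustration: the attributes lex and cat and the value n come from the text, while the remaining attribute names (ntype, number, human, concrete) and the helper functions are assumptions made for the example. The sketch shows the same entry once with named fields and once as a positional record.

```python
# Named-field version: attributes name the fields, values fill them.
# Only "lex", "cat" and the value "n" appear in the text; the other
# attribute names are hypothetical.
button_noun = {
    "lex": "button",      # citation form
    "cat": "n",           # category: noun
    "ntype": "common",    # subdivision of the category
    "number": None,       # possible for nouns, but not inherent to this word
    "human": "no",
    "concrete": "yes",
}

# Positional version: the order of the fields carries the meaning,
# so no field may be dropped, even when it has no value.
FIELD_ORDER = ("lex", "cat", "ntype", "number", "human", "concrete")


def to_positional(entry):
    """Turn a named-field entry into a positional record (a tuple)."""
    return tuple(entry.get(name) for name in FIELD_ORDER)


def from_positional(record):
    """Rebuild the named-field entry from a positional record."""
    return dict(zip(FIELD_ORDER, record))


record = to_positional(button_noun)
print(record)
print(from_positional(record) == button_noun)
```

The positional form is more compact, but it only works if every record of a given category follows the same field convention, which is one reason different parts of speech end up with different record layouts.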
In the entry for button we also find one attribute, number, without a value. The idea here is to indicate that a value for this attribute is possible, but is not inherent to the word button, which may have different number values on different occasions (unlike e.g. trousers, which is always plural). Of course, this sort of blank field is essential if fields are indicated by position, rather than name. In systems which name attribute fields it might simply be equivalent to omitting the attribute, but maintaining the field is still useful because it helps someone who has to modify the dictionary to understand the information in it. An alternative to giving a blank value is to follow the practice of some paper dictionaries and fill in the default, or (in some sense) normal, value. For an attribute like number, this would presumably be singular. This alternative, however, is unfashionable these days, since it goes against the generally accepted idea that in the best case linguistic processing only adds, and never changes, information. The attraction of such an approach is that it makes the order in which things are done less critical (cf. our remarks elsewhere about the desirability of separating declarative and procedural information).
In order to include information about subcategorization and selectional restrictions, one has two options. The first is to encode it via sets of attributes with atomic values such as those above. In practice, this would mean that one might have features such as subcat=subj_obj and sem_patient=clothing. As regards subcategorization information, this is essentially the approach used in the monolingual paper dictionary above. In some systems this may be the only option. However, some systems may allow values to be sets, or lists, in which case one has more flexibility.
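The idea that processing should only add information, and never change it, can be made concrete with a small sketch. The function name unify, the exception FeatureClash, and the use of None for a blank value are assumptions made for illustration; the features subcat=subj_obj and sem_patient=clothing are the ones mentioned in the text.

```python
class FeatureClash(Exception):
    """Raised when new information contradicts what an entry already says."""


def unify(entry, new_info):
    """Return a copy of entry extended with new_info.

    A blank (None) or missing attribute may be filled in; an attribute
    that already has a value may never be changed to a different one.
    """
    result = dict(entry)
    for attr, value in new_info.items():
        current = result.get(attr)
        if current is None:
            result[attr] = value
        elif value is not None and current != value:
            raise FeatureClash(f"{attr}: {current!r} vs {value!r}")
    return result


# Noun entries: number is blank for button, inherent for trousers.
button = {"lex": "button", "cat": "n", "number": None}
trousers = {"lex": "trousers", "cat": "n", "number": "plural"}

# Verb entry using atomic-valued features for subcategorization
# and a selectional restriction on the PATIENT.
button_verb = {"lex": "button", "cat": "v",
               "subcat": "subj_obj", "sem_patient": "clothing"}

print(unify(button, {"number": "plural"}))   # blank value gets filled in
try:
    unify(trousers, {"number": "singular"})  # would change a value
except FeatureClash as clash:
    print("rejected:", clash)
```

Because information is only ever added, applying two compatible pieces of information in either order gives the same result, which is the sense in which the order of processing becomes less critical. Filling in singular as a default in the dictionary would instead force a later step to overwrite it.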
For example, one might represent subcategorization information by means of a list of categories: subcat = [np,np,np] might indicate a verb that allows three NPs (such as give), and [np,np,pp] might indicate a verb that takes two NPs and a PP. A notation which allows the lexicographer to indicate other properties of the items would be still more expressive. For example, it would be useful to indicate that with give, the preposition in the PP has to be to. This would mean that instead of `pp' and `np' one would have collections of features, and perhaps even pieces of syntactic structure. (A current trend in computational linguistics involves the development of formalisms that allow such very detailed lexical entries, and we will say a little more about them in a later Chapter.)
Turning now to the treatment of translation information in MT dictionaries, one possibility is to attempt to represent all the relevant information by means of attributes and values. Thus, as an addition to the dictionary entry for button given above, a transformer system could specify a `translation' feature which has as its value the appropriate target language word, e.g. trans = bouton for translation into French. One might also include features which trigger certain transformations (for example, for changing word order for certain words). However, this is not a particularly attractive view. For one thing, it is clearly oriented in one direction, and it will be difficult to produce entries relating to the other direction of translation from such entries. More generally, one wants a bilingual dictionary to allow the replacement of certain source language oriented information with corresponding target language information, i.e. to replace the information one derives from the source dictionary by information derived from the target dictionary. This suggests the use of translation rules which relate head words to head words, that is, rules of the type we introduced in an earlier Chapter, like the one for temperature.
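The list-valued subcategorization idea above, with slots that are collections of features rather than bare category names, might look like the following sketch. The frame layout, the attribute name prep, and the function names are assumptions for illustration; only the category lists [np,np,np] and [np,np,pp] and the requirement that the preposition be to come from the text.

```python
# Two frames for "give": three NPs, or two NPs plus a PP headed by "to".
# Each slot is a small collection of features rather than an atomic symbol.
GIVE_FRAMES = [
    [{"cat": "np"}, {"cat": "np"}, {"cat": "np"}],
    [{"cat": "np"}, {"cat": "np"}, {"cat": "pp", "prep": "to"}],
]


def satisfies(slot, constituent):
    """A constituent fits a slot if it has every feature the slot asks for."""
    return all(constituent.get(attr) == value for attr, value in slot.items())


def matches_frame(frame, constituents):
    """Check the constituents against one frame, position by position."""
    return len(frame) == len(constituents) and all(
        satisfies(slot, c) for slot, c in zip(frame, constituents)
    )


def licensed(frames, constituents):
    """True if at least one of the verb's frames accepts the constituents."""
    return any(matches_frame(frame, constituents) for frame in frames)


# Roughly "Sam gave the book to Leslie" (subject, object, PP).
with_to = [{"cat": "np"}, {"cat": "np"}, {"cat": "pp", "prep": "to"}]
# The same pattern with a different preposition.
with_on = [{"cat": "np"}, {"cat": "np"}, {"cat": "pp", "prep": "on"}]

print(licensed(GIVE_FRAMES, with_to))
print(licensed(GIVE_FRAMES, with_on))
```

With atomic category symbols the second pattern could not be ruled out, since both would simply be [np,np,pp]; the extra feature on the PP slot is what makes the distinction expressible.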
As we noted before, not all translation rules can be a simple mapping of source language words onto their target language equivalents. One will have to put conditions on the rules. For example, one might like to be able to describe, in the bilingual entry that deals with like and plaire, the change in grammatical relations that occurs if one is working with relatively shallow levels of representation. In effect, the transfer rule that we gave for this example in an earlier Chapter might be seen as a bilingual lexical entry. Other translation rules that may require more than just a simple pairing of source and target words are those that treat phenomena like idioms and compounds, and some cases of lexical holes. To deal with such phenomena, bilingual dictionary entries may have a single lexical item on the side of one language, whereas the other side describes a (possibly quite complex) linguistic structure.
The entry for button taken from a paper dictionary at the beginning of this Chapter illustrates an issue of major importance to the automatic processing of some languages, including English. This is the very widespread occurrence of homography in the language. Loosely speaking, homographs are words that are written in the same way. However, it is important to distinguish several different cases (sometimes the term homography is restricted to only one of them).
The case where what is intuitively a single noun (for example) has several different readings. This can be seen with the entry for button, where a reading relating to clothing is distinguished from a `knob' reading.
The case where one has related items of different categories which are written alike. For example, button can be either a noun or a verb.
The case where one has what appear to be unrelated items which happen to be written alike. The classic example of this is the noun bank, which can designate either the side of a river, or a financial institution.
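Returning to the bilingual entries discussed above, the following sketch shows a translation rule that pairs head words and can also carry a change of grammatical relations, as in the like and plaire case. The rule format, the role labels (subj, obj, iobj) and the function names are assumptions for illustration, not the transfer notation of any particular system. Because the rules are stated as relations between head words, they can be inverted for the other direction of translation.

```python
# Each rule relates a source head word to a target head word and, where
# needed, says how grammatical relations are renamed. The role labels
# and the exact mapping for like/plaire are illustrative assumptions.
RULES = [
    {"source": "button", "target": "bouton", "roles": {}},
    {"source": "like", "target": "plaire",
     "roles": {"subj": "iobj", "obj": "subj"}},
]


def invert(rule):
    """Build the rule for the opposite direction of translation."""
    return {
        "source": rule["target"],
        "target": rule["source"],
        "roles": {new: old for old, new in rule["roles"].items()},
    }


def transfer(structure, rules):
    """Replace the head word and relabel its arguments according to a rule.

    Arguments themselves are left untranslated to keep the sketch short.
    """
    for rule in rules:
        if rule["source"] == structure["head"]:
            result = {"head": rule["target"]}
            for role, filler in structure.items():
                if role != "head":
                    result[rule["roles"].get(role, role)] = filler
            return result
    raise KeyError(f"no rule for {structure['head']!r}")


english = {"head": "like", "subj": "Sam", "obj": "London"}
french = transfer(english, RULES)
print(french)

# The same entries serve the other direction once inverted.
back = transfer(french, [invert(rule) for rule in RULES])
print(back == english)
```

A `trans = bouton` feature stored inside the English entry could not be reused this way, which is the practical argument for stating the pairing as a separate rule.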
These distinctions have practical significance when one is writing (creating, extending, or modifying) a dictionary, since they relate to the question of when one should create a new entry (by defining a new headword). The issues involved are rather different when one is creating a `paper' dictionary (where issues of readability are paramount) or a dictionary for MT, but it is in any case very much a pragmatic decision. One good guiding principle one might adopt is to group entries hierarchically in terms of amounts of shared information. For example, there is relatively little that the two senses of bank share apart from their citation form and the fact that they are both common nouns, so one may as well associate them with different entries. In a computational setting where one has to give unique names to different entries, this will involve creating headwords such as bank_1 and bank_2 (or bank_finance and bank_river). As regards the noun and verb button, though one might want to have some way of indicating that they are related, they do not share much information, and can therefore be treated as separate entries. For multiple readings of a word, for example the two readings of the noun button, on the other hand, most information is shared: they differ mainly in their semantics. In this case, it might be useful to impose an organization in the lexicon in which information can be inherited from an entry into sub-entries (or more generally, from one entry to another), or to see them as subentries of an abstract `protoentry' of some sort. This will certainly save time and effort in dictionary construction. Though the savings one makes may look small in one case, they become significant when multiplied by the number of items that have different readings (this is certainly in the thousands, perhaps the hundreds of thousands, since most words listed in normal dictionaries have at least two readings).
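The need for unique entry names can be illustrated with a small sketch in which several entries share one written form. The headwords bank_1 and bank_2 follow the text; the names button_n and button_v, the attribute sense, and the functions are assumptions made for the example.

```python
from collections import defaultdict

# Unique headwords name the entries; "lex" holds the shared written form.
ENTRIES = {
    "bank_1": {"lex": "bank", "cat": "n", "sense": "financial institution"},
    "bank_2": {"lex": "bank", "cat": "n", "sense": "side of a river"},
    "button_n": {"lex": "button", "cat": "n", "sense": "fastener on clothing"},
    "button_v": {"lex": "button", "cat": "v", "sense": "fasten with buttons"},
}


def build_index(entries):
    """Map each written form to the headwords of all entries that share it."""
    index = defaultdict(list)
    for headword, entry in entries.items():
        index[entry["lex"]].append(headword)
    return dict(index)


def lookup(index, entries, form, cat=None):
    """Return the headwords for a written form, optionally filtered by category."""
    candidates = index.get(form, [])
    if cat is None:
        return list(candidates)
    return [h for h in candidates if entries[h]["cat"] == cat]


INDEX = build_index(ENTRIES)
print(lookup(INDEX, ENTRIES, "bank"))
print(lookup(INDEX, ENTRIES, "button", cat="v"))
```

Knowing the category is enough to separate the noun and verb button, but not the two senses of bank, which is where selectional restrictions and other semantic information have to do the work.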
The issues this raises are complex and we cannot do them justice here; however, the following will give a flavor of what is involved. More generally, what one is talking about here is inheritance of properties between entries (or from entries into subentries). This is illustrated in Figure. One could imagine extending this, introducing abstract entries expressing information true of classes of (real) entry. For example, one might want to specify certain facts about all nouns (all noun readings) just once, rather than stating them separately in each entry. The entry for a typical noun might then be very simple, saying no more than `this is a typical noun', and giving the citation form (and semantics, and translation, if appropriate). One allows for subregularities (that is, lexical elements which are regular in some but not all properties) by allowing elements to inherit some information while expressing the special or irregular information directly in the entry itself. In many cases, the optimal organization can turn out to be quite complicated, with entries inheriting from a number of different sources. Such an approach becomes even more attractive if default inheritance is possible, that is, if information is inherited unless it is explicitly contradicted in an entry or reading. It would then be possible to say, for example, `this is a typical noun, except for the way it forms its plural'.
One final and important component of an MT dictionary, which is entirely missing in paper dictionaries (at least in their printed, public form), is documentation.
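Default inheritance as described above can be sketched with a lexicon in which each entry names a parent and states only what is special about itself. The layout (parent, features), the attribute names, and the example noun child are assumptions for illustration, and the sketch handles a single parent per entry, whereas the text notes that real lexicons may need inheritance from several sources.

```python
# An abstract "noun" entry states what is true of typical nouns once.
# Real entries inherit from it and override only what is irregular.
LEXICON = {
    "noun": {
        "parent": None,
        "features": {"cat": "n", "ntype": "common", "plural": "add -s"},
    },
    # "This is a typical noun": only the citation form is stated.
    "button": {"parent": "noun", "features": {"lex": "button"}},
    # "A typical noun, except for the way it forms its plural."
    "child": {
        "parent": "noun",
        "features": {"lex": "child", "plural": "children"},
    },
}


def resolve(name, lexicon):
    """Collect an entry's features, letting more specific entries win."""
    chain = []
    while name is not None:
        node = lexicon[name]
        chain.append(node["features"])
        name = node["parent"]
    resolved = {}
    # Apply the most general information first so that anything stated
    # lower down explicitly contradicts, and replaces, the default.
    for features in reversed(chain):
        resolved.update(features)
    return resolved


print(resolve("button", LEXICON))
print(resolve("child", LEXICON))
```

Note that overriding applies only while the dictionary entry is being assembled; once resolved, the entry can still be used by processing that only adds information.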
Apart from general documentation describing design decisions and terminology, and providing lists and definitions (including operational tests) for the attributes and values that are used in the dictionary, it is important that each entry include some lexicographers' comments: information about who created the entry, when it was last revised, the kinds of example it is based on, what problems there are with it, and the sorts of improvement that are required. (It is, obviously, essential that the terms for attributes and values are used consistently, and consistency is a problem since creating and maintaining a dictionary is not a task that can be performed by a single individual.) Such information is vital if a dictionary is to be maintained and extended. In general, though the quality and quantity of such documentation has no effect on the actual performance of the dictionary, it is critical if a dictionary is to be modified or extended.
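The lexicographers' comments listed above could be attached to each entry as a small record. The class name EntryDoc, its field names, the sample values, and the helper functions are all hypothetical; the sketch only shows one way the listed pieces of information might be stored and checked.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class EntryDoc:
    """Lexicographer's comments for one dictionary entry (hypothetical layout)."""
    created_by: str
    last_revised: date
    based_on: list = field(default_factory=list)             # kinds of example
    known_problems: list = field(default_factory=list)
    improvements_needed: list = field(default_factory=list)


# Sample data invented for the example.
ENTRIES = {
    "button_n": {
        "lex": "button",
        "cat": "n",
        "doc": EntryDoc(
            created_by="lexicographer_a",
            last_revised=date(2015, 6, 1),
            based_on=["paper dictionary entry"],
            known_problems=["'knob' reading not yet covered"],
        ),
    },
    "button_v": {"lex": "button", "cat": "v"},  # no documentation yet
}


def undocumented(entries):
    """Headwords of entries that carry no lexicographer's comments."""
    return sorted(h for h, e in entries.items() if e.get("doc") is None)


def needs_attention(entries):
    """Headwords whose documentation records open problems or improvements."""
    return sorted(
        h for h, e in entries.items()
        if e.get("doc") is not None
        and (e["doc"].known_problems or e["doc"].improvements_needed)
    )


print(undocumented(ENTRIES))
print(needs_attention(ENTRIES))
```

None of this affects translation output, but checks of this kind are what make it feasible for several people to maintain the same dictionary over time.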
