Subcategorization Acquisition and Classes of Predication in Urdu


@phdthesis{Raza2011Subca-17432,
  title={Subcategorization Acquisition and Classes of Predication in Urdu},
  year={2011},
  author={Raza, Ghulam},
  address={Konstanz},
  school={Universität Konstanz}
}

deposit-license Raza, Ghulam 2011-12-19T13:49:33Z

In this thesis, I have focused on identifying and exploring different types of predicators and the patterns of their subcategorization frames in Urdu. Because refined resources for Urdu, such as part-of-speech tagged corpora or treebanks, were not at hand, it was investigated how to acquire the subcategorization of verbs from a raw Urdu corpus. Challenges to automatic subcategorization acquisition from a raw Urdu corpus were discussed in Chapter 2. Urdu is a free word order language in which major constituents can scramble within a clause. That is, an argument of a verb is not strictly bound to a specific position in a clause, although the verb itself usually comes last. So, identifying an argument by its position is not possible in Urdu, in contrast with languages like English, where some arguments can be identified by their positions in a sentence alone.

The participants of verbs in Urdu sentences are usually marked for case by different case clitics. At first it seemed straightforward to identify different arguments based on case marking, but it turned out that case clitics pose challenges of their own. There is no one-to-one correspondence between case clitic form and case feature: the same clitic form is used to mark nouns for more than one case. Another challenge is that the same grammatical function is marked with different cases in different situations.

Furthermore, both nouns and verbs can subcategorize for arguments. So it was also a challenge to identify which case phrase combines semantically with which predicator. These case phrase attachment ambiguities in some sense resemble the PP-attachment ambiguities of English and other languages.
Identifying a complementizer clause from the complementizer form is not trivial in Urdu either, as the same form is also used for many other functions.

After these challenges had been explored, an algorithm was devised to acquire subcategorization information from a raw Urdu corpus. In the subcategorization acquisition system for Urdu (SASU), many of the above challenges were addressed and a few were ignored. The SASU system was presented in Chapter 3. The strategy of the SASU system differs from existing strategies in two respects. First, lexical cues of case are used. Second, the frames are identified indirectly from the extracted case phrases by applying meta rules. The system comprises two input repositories, a verb conjugator, and four further components. About 700 basic verbs of Urdu were collected from different resources, and a corpus of Urdu obtained mainly from news sites was cleaned and segmented. The different morphological inflectional paradigms of verbs were analyzed, and an Urdu verb conjugator was implemented as a supplementary part of the SASU system.

To extract the subcategorization frames of a verb, one component of the system finds all candidate sentences by comparing the conjugated forms of the verb with the tokens of sentences in the corpora. The candidate sentences are then screened to make sure that the verb is used as a main verb and is not the verb of a subordinate or coordinate clause. Another component then builds different combinations of case clitics and complementizers and collects their frequencies in the corpus. The third component filters out potentially invalid combinations using a statistical method. The final component induces the frames of verbs from the valid combinations of case clitics and complementizers. Results for the subcategorization frames of 60 basic verbs were reported in summarized form.
These 60 verbs were chosen because they occur sufficiently often in the corpora.

Due to the diverse syntacto-semantic behavior of the basic verb ho `be/become', it was not viable to extract its subcategorization information with the developed system. So this verb was investigated individually in Chapter 4 for its different uses and subcategorization frames. It was shown that the verb ho can basically be classified into stative ho and dynamic ho. The syntactic distributions of stative ho and dynamic ho differ in terms of aspect, in taking the light verb ja `go', and in forming a modal construction with the verb cah `want'. Both stative ho and dynamic ho act either intransitively or as a copula. The syntactic frames of both types of copula were explored. The dynamic copula subcategorizes for only a subset of the frames selected by the stative copula. Although Urdu is a free word order language, the position of a participant does matter for interpreting it as a subject or a complement in copular sentences. The characterizing participles constructed with the perfect form of ho were also analyzed with respect to the arguments they modify.

Arguments of deverbal adjectives and deverbal nouns were investigated in Chapter 5. Noun phrases containing multiple genitive elements were explored, and the order of the different genitive modifiers in them was established. It was shown that only attributive genitives can stack together at the same level before adjectives in Urdu NPs; otherwise there is always a hierarchical structure. Attributive genitives show a syntactic distribution similar to that of adjectives. Furthermore, it was shown that some nouns in Urdu can take two genitive-marked arguments. A classification of nouns was made based on the number and type of their genitive-marked arguments.
It was reported that discontinuous constituents arise in Urdu NPs when an argument-taking noun is modified by an argument-taking adjective, or when the argument of the head noun itself licenses an argument. Heads cannot appear before their arguments in noun phrases. Argument-less adjectives are always contiguous to the head noun. A syntactic explanation of the phenomenon was provided in terms of multiple movements across different projections. In LFG, a flat c-structure was proposed for Urdu NPs. A correct f-structure was generated by making use of different operators of XLE in the grammar rules.

Adpositions as predicators were analyzed in Chapter 6. A model of spatial adpositions in LFG was proposed in terms of lex-sem features, drawing on Svenonius's notions of spatial expressions. Different classes of adpositions were established based on the case of their complements. Evidence for complex adpositions in Urdu was provided. It was shown that nouns in complex adpositions show a syntactic distribution different from their normal one. A linguistically motivated implementation of complex adpositions was presented.

To conclude, this thesis reports the results of exhaustive research on patterns of predicators and their subcategorization frames in Urdu. An acquisition system is presented which extracts the subcategorization frames of Urdu verbs based on lexical cues of case clitics and complementizer forms. Given a very large and balanced corpus of Urdu, the SASU system presented in the thesis can be used to build a broad-coverage lexicon of Urdu verbs enriched with subcategorization information in terms of grammatical functions coupled with their case marking. The system can support the identification of complex predicates and is also useful for discovering different patterns of syntactic alternations for Urdu verbs.
By adding more bits for adpositions to the information vector, the adpositional arguments of verbs can also be extracted.

As future work, the system can be generalized to South Asian languages, since many of them, like Saraiki and Sindhi, are structurally very close to Urdu. The language selection and other parameters could be set at the interface level, and subcategorization information in different South Asian languages could be acquired using the same underlying technology. More fine-grained and advanced computing applications for South Asian languages in general, and Urdu in particular, can be developed by using the SASU system as a core module.

The classes of verbs based on their syntactic frames can be explored. Incorporating information about syntactic frames together with allowed alternations and selectional preference features could be more useful in exploring the semantic classes of verbs in Urdu. A formal analysis still needs to be worked out for phenomena such as even predicates and the many complex predicates reported in the thesis, which have not yet been analyzed and implemented in the context of a formal theory.

