Khalil Sima'an

Tools for Processing Modern Hebrew and Arabic Texts


General Aims
We aim to develop NLP tools (morphological analysers, part-of-speech taggers and syntactic parsers) for Modern Hebrew and Arabic texts. Current work follows a joint architecture for both Semitic languages.

This project started in the summer of 2000 with a visit by Khalil Sima'an to the Computational Linguistics Lab., Computer Science, Technion, Haifa. The joint work with Yoad Winter, Roy Bar-Haim and Alon Itai (Technion) resulted in the first Hebrew treebank and in a morphological segmenter and part-of-speech tagger for Modern Hebrew texts.

We cooperate with the Computational Linguistics Lab., Computer Science, Technion, Haifa.

Original proposal for Integrated Morphological and Syntactic Processing for Hebrew (unpublished ms, 2004):
Project description (MOSAIEK proposal to NWO)


A major concern in building models for syntactic natural language processing (i.e. parsing) is ambiguity: utterances tend to receive multiple analyses. Because ambiguity is a manifestation of uncertainty (due to the lack of extra-linguistic knowledge), state-of-the-art models extend linguistic representations with probabilistic reasoning.
Remarkably, research on parsing models has largely focused on European languages, particularly English. The applicability of the resulting models to non-European languages, e.g. Hebrew, Arabic or Turkish, is questionable because of the specific phenomena these languages exhibit. Specifically, Modern Hebrew exhibits phenomena that point to the intertwined nature of morphological and syntactic processing. Hebrew has a rich syntactic agreement system expressed at the morphological level: many affixes (such as agreement features, prepositions, determiners, and conjunctions) are prepended or appended to the word, resulting in serious morphological ambiguity. This is further complicated by the fact that Hebrew texts lack most vowels.

While research on English parsing could afford to ignore morphological ambiguity, there is strong evidence that this is not suitable for Hebrew. Hence, the research question of the present project is: how to build probabilistic models for correct and efficient Hebrew morphological and syntactic processing?
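
The segmentation ambiguity described above can be illustrated with a small sketch. It is a toy example only: the prefix inventory, the transliterated lexicon and the function name `segmentations` are all assumptions made for illustration and are not taken from the project's tools. The sketch enumerates every way an unvocalized token can be split into prefixed particles followed by a known stem, and it ignores the ordering constraints that real Hebrew prefixes obey.

```python
# Toy illustration only: PREFIXES and LEXICON are hypothetical,
# transliterated examples, not data from the project.
PREFIXES = {"w": "and", "b": "in", "h": "the", "l": "to"}
LEXICON = {"bcl": "onion", "cl": "shadow"}


def segmentations(token, prefixes=PREFIXES, lexicon=LEXICON):
    """Return every split of `token` into zero or more prefixes plus a lexicon stem."""
    results = []
    # Reading 1: the whole token is itself a lexicon entry.
    if token in lexicon:
        results.append([token])
    # Other readings: peel off one prefix and segment the remainder recursively.
    for prefix in prefixes:
        if token.startswith(prefix) and len(token) > len(prefix):
            for rest in segmentations(token[len(prefix):], prefixes, lexicon):
                results.append([prefix] + rest)
    return results


if __name__ == "__main__":
    for token in ("bcl", "wbcl"):
        for seg in segmentations(token):
            print(token, "->", "+".join(seg))
```

In this toy setting the token `bcl` receives two analyses, the single word `bcl` and the prefixed form `b+cl`. A probabilistic model of the kind the project targets would have to choose between such analyses in context, which this sketch does not attempt.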

People involved:
    Khalil Sima'an and Reut Tsarfaty (ILLC, Amsterdam).
    Yoad Winter, Alon Itai, Roy Bar-Haim and Saib Mansour (CS, Technion, Haifa, Israel).
    Johns Hopkins Summer Workshop on Natural Language Engineering.


  • Part-of-Speech Tagging of Modern Hebrew Text.
    Roy Bar-Haim, Khalil Sima'an and Yoad Winter.
    To appear in the Journal of Natural Language Engineering (J-NLE), 2008 (submitted January 2006).
  • Accurate Unlexicalized Parsing for Modern Hebrew.
    Reut Tsarfaty and Khalil Sima'an.
    In Proceedings of Text, Speech and Dialogue (TSD). Pilsen, Czech Republic, September 2007.
  • Three-Dimensional Parametrization for Parsing Morphologically Rich Languages.
    Reut Tsarfaty and Khalil Sima'an.
    In Proceedings of the International Conference on Parsing Technologies (IWPT). Prague, Czech Republic, June 2007.
  • Smoothing a Lexicon-based POS tagger for Arabic and Hebrew.
    Saib Mansour, Khalil Sima'an and Yoad Winter.
    In Proceedings of the ACL 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources. Prague, Czech Republic, 2007.
  • Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew. 
    Roy Bar-Haim, Khalil Sima'an and Yoad Winter.
    In Proceedings of the ACL 2005 Workshop on Computational Approaches to Semitic Languages.
  • Parsing Arabic Dialects.
    JHU Summer workshop team. Final report (version I), January 2006. CLSP, JHU, Baltimore, USA.
  • Closing day presentation.
    JHU Summer workshop team. August 2005. CLSP, JHU, Baltimore, USA.
  • Building a Tree-Bank of Modern Hebrew Text.
    K. Sima'an, A. Itai, Y. Winter, A. Altman and N. Nativ.
    In Beatrice Daille and Laurent Romary (eds.), Journal Traitement Automatique des Langues (t.a.l.), 2001. Special Issue on Natural Language Processing and Corpus Linguistics.
    Also appeared in Chinese translation in the book Window to the Computational Linguistics.

Tools and Resources (created in joint work with the CL Lab at Technion)
  1. The Hebrew Treebank found at
  2. MorphTagger: a segmenter and POS tagger for Modern Arabic and Hebrew (based on HMM technology).
  3. Arabic Part-Of-Speech Tagger - מתייג חלקי דיבר לערבית - برنامج صرف للغة العربية