Hanan Samet (CS, University of Maryland)
Tuesday, 26.6.2012, 14:30
Faculty meeting room, 7th floor, Rabin bldg
The popularity of web-based mapping services such as Google Earth/Maps and Microsoft Virtual Earth (Bing) has led to an increasing awareness
of the importance of location data and its incorporation into both web-based search applications and the databases that support them.
In the past, attention to location data had been primarily limited to geographic information systems (GIS), where locations correspond to spatial objects and are usually specified geometrically.
However, in web-based applications, location data often corresponds to place names and is usually specified textually.
An advantage of such a specification is that the same specification
can be used regardless of whether the place name is to be interpreted as a point or a region. Thus the place name acts as a polymorphic data type in the parlance of programming languages. However, its drawback is that it
is ambiguous. In particular, a given specification may have several interpretations, not all of which are names of places. For example,
``Jordan'' may refer to a person as well as to a place.
Moreover, there is additional ambiguity even when the specification has a
place name interpretation. For example, ``Jordan'' can refer to a river
or a country, while there are a number of cities named ``London''.
In this talk we examine the extension of GIS concepts to textually specified location data and review search engines that we have
developed to retrieve documents where the similarity criterion is based not solely on an exact match of elements of the query string but
also on spatial proximity. Thus we want to take
advantage of spatial synonyms so that, for example, a query seeking a rock concert in Tel Aviv would be satisfied by a result finding a rock concert in Herzliyah or Petach Tikva. We have applied this idea to develop the STEWARD (Spatio-Textual Extraction on the Web Aiding Retrieval of Documents) system for finding documents on the website of the Department of Housing and Urban Development. This system relies on a document tagger that automatically identifies spatial references in text, PDF, Word, and other unstructured documents. The thesaurus for the document tagger is a collection of publicly available data sets forming a gazetteer containing the names of places in the world. Search results are ranked according to the extent to which they satisfy the query, which is determined in part by the prevalent spatial entities that are present in the document. We have also adapted the same ideas to collections of news articles and to Twitter tweets, resulting in the NewsStand and TwitterStand systems, respectively. These will be demonstrated along with the STEWARD system, together with a discussion of some of the underlying issues that arose and the techniques used in their implementation. Future work involves applying these ideas to spreadsheet data.
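
The notion of a spatial synonym from the Tel Aviv example can be illustrated with a minimal sketch. It is not the STEWARD implementation: the tiny gazetteer, its approximate coordinates, the function names, and the 20 km radius are all assumptions made for illustration. The sketch treats two place names as spatial synonyms when their great-circle (haversine) distance falls within the radius.

```python
import math

# Hypothetical mini-gazetteer: place name -> (latitude, longitude) in degrees.
# Coordinates are approximate and serve only as illustration.
GAZETTEER = {
    "Tel Aviv": (32.08, 34.78),
    "Herzliyah": (32.16, 34.84),
    "Petach Tikva": (32.09, 34.89),
    "London": (51.51, -0.13),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def spatial_synonyms(place, radius_km=20.0):
    """Gazetteer names within radius_km of `place`, nearest first."""
    origin = GAZETTEER[place]
    nearby = [name for name, coords in GAZETTEER.items()
              if haversine_km(origin, coords) <= radius_km]
    return sorted(nearby, key=lambda name: haversine_km(origin, GAZETTEER[name]))


def search(documents, keyword, place, radius_km=20.0):
    """Documents mentioning `keyword` and any spatial synonym of `place`."""
    places = spatial_synonyms(place, radius_km)
    return [doc for doc in documents
            if keyword.lower() in doc.lower()
            and any(p.lower() in doc.lower() for p in places)]
```

With this sketch, a call such as search(docs, "rock concert", "Tel Aviv") would also return a document that mentions a rock concert in Herzliyah, while one that mentions only London would be excluded. A real system would additionally have to resolve the ambiguity discussed above, since a name such as ``London'' maps to several gazetteer entries.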