Geodesic distances for clustering linked text data

Lade...
Vorschaubild
Dateien
Zu diesem Dokument gibt es keine Dateien.
Datum
2012
Autor:innen
Herausgeber:innen
Kontakt
ISSN der Zeitschrift
Electronic ISSN
ISBN
Bibliografische Daten
Verlag
Schriftenreihe
Auflagebezeichnung
URI (zitierfähiger Link)
DOI (zitierfähiger Link)
ArXiv-ID
Internationale Patentnummer
Angaben zur Forschungsförderung
Projekt
Open Access-Veröffentlichung
Core Facility der Universität Konstanz
Gesperrt bis
Titel in einer weiteren Sprache
Forschungsvorhaben
Organisationseinheiten
Zeitschriftenheft
Publikationstyp
Zeitschriftenartikel
Publikationsstatus
Published
Erschienen in
Journal of Artificial Intelligence and Soft Computing Research. 2012, 2(3), pp. 247-258. eISSN 2083-2567
Zusammenfassung

The quality of a clustering not only depends on the chosen algorithm and its parameters, but also on the definition of the similarity of two respective objects in a dataset. Applications such as clustering of web documents is traditionally built either on textual similarity measures or on link information. Due to the incompatibility of these two information spaces, combining these two information sources in one distance measure is a challenging issue. In this paper, we thus propose a geodesic distance function that combines traditional similarity measures with link information. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on a subset of the EnglishWikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. Additionally, taking inspiration from the unified view of the performance functions of k-means and k-harmonic means, min and harmonic average of the cosine and geodesic distances are taken in order to construct alternate distance forms. The effectiveness of our approach is measured by computing microprecision values of the clusters based on the provided categorical information of each article.

Zusammenfassung in einer weiteren Sprache
Fachgebiet (DDC)
004 Informatik
Schlagwörter
Konferenz
Rezension
undefined / . - undefined, undefined
Zitieren
ISO 690TEKIR, Selma, Florian MANSMANN, Daniel A. KEIM, 2012. Geodesic distances for clustering linked text data. In: Journal of Artificial Intelligence and Soft Computing Research. 2012, 2(3), pp. 247-258. eISSN 2083-2567
BibTex
@article{Tekir2012Geode-38215,
  year={2012},
  title={Geodesic distances for clustering linked text data},
  number={3},
  volume={2},
  journal={Journal of Artificial Intelligence and Soft Computing Research},
  pages={247--258},
  author={Tekir, Selma and Mansmann, Florian and Keim, Daniel A.}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/38215">
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dc:creator>Tekir, Selma</dc:creator>
    <dc:creator>Keim, Daniel A.</dc:creator>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dcterms:issued>2012</dcterms:issued>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-30T08:42:58Z</dcterms:available>
    <dc:contributor>Tekir, Selma</dc:contributor>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/38215"/>
    <dc:language>eng</dc:language>
    <dcterms:abstract xml:lang="eng">The quality of a clustering not only depends on the chosen algorithm and its parameters, but also on the definition of the similarity of two respective objects in a dataset. Applications such as clustering of web documents is traditionally built either on textual similarity measures or on link information. Due to the incompatibility of these two information spaces, combining these two information sources in one distance measure is a challenging issue. In this paper, we thus propose a geodesic distance function that combines traditional similarity measures with link information. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on a subset of the EnglishWikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. Additionally, taking inspiration from the unified view of the performance functions of k-means and k-harmonic means, min and harmonic average of the cosine and geodesic distances are taken in order to construct alternate distance forms. The effectiveness of our approach is measured by computing microprecision values of the clusters based on the provided categorical information of each article.</dcterms:abstract>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2017-03-30T08:42:58Z</dc:date>
    <dc:contributor>Keim, Daniel A.</dc:contributor>
    <dc:contributor>Mansmann, Florian</dc:contributor>
    <dc:creator>Mansmann, Florian</dc:creator>
    <dcterms:title>Geodesic distances for clustering linked text data</dcterms:title>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
  </rdf:Description>
</rdf:RDF>
Interner Vermerk
xmlui.Submission.submit.DescribeStep.inputForms.label.kops_note_fromSubmitter
Kontakt
URL der Originalveröffentl.
Prüfdatum der URL
Prüfungsdatum der Dissertation
Finanzierungsart
Kommentar zur Publikation
Allianzlizenz
Corresponding Authors der Uni Konstanz vorhanden
Internationale Co-Autor:innen
Universitätsbibliographie
Ja
Begutachtet
Diese Publikation teilen