Publication:

SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)

Files

Beel_0-285747.pdf (size: 197.85 KB, downloads: 714)

Date

2010

Authors

Beel, Jöran
Gipp, Bela
Shaker, Ammar
Friedrich, Nick

Open access publication
Open Access Green

Publication type
Contribution to conference proceedings
Publication status
Published

Published in

LALMAS, Mounia, ed. and others. Research and advanced technology for digital libraries: 14th European Conference, ECDL 2010, Glasgow, UK, September 6 - 10, 2010; proceedings. Berlin [u.a.]: Springer, 2010, pp. 413-416. Lecture Notes in Computer Science. 6273. ISBN 978-3-642-15463-8. Available under: doi: 10.1007/978-3-642-15464-5_45

Abstract

Extracting titles from a PDF’s full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and expensive (in terms of calculating power) machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule based heuristic, which considers style information (font size) to identify a PDF’s title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an ‘academic search engine’ scenario and better run times (8:19 minutes vs. 57:26 minutes).
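The abstract names font size as the only style signal used. A minimal sketch of that idea, not the authors' implementation: it assumes the first page has already been parsed into lines with a known font size (the `TextLine` structure, the `guess_title` function, the `tolerance` parameter, and the sample data are all hypothetical), and it takes the first run of consecutive lines set in the largest font as the title.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    """One line of first-page text with its font size in points (hypothetical structure)."""
    text: str
    font_size: float


def guess_title(lines: List[TextLine], tolerance: float = 0.1) -> Optional[str]:
    """Return the first run of consecutive lines set in the largest font size.

    This is an illustrative rule, assumed for this sketch; the paper's actual
    rule set is not given in the abstract.
    """
    # Ignore empty lines so they neither win the size comparison nor split the title.
    candidates = [ln for ln in lines if ln.text.strip()]
    if not candidates:
        return None

    largest = max(ln.font_size for ln in candidates)

    title_parts: List[str] = []
    for ln in candidates:
        if abs(ln.font_size - largest) <= tolerance:
            title_parts.append(ln.text.strip())
        elif title_parts:
            # The run of largest-font lines has ended; later large text is ignored.
            break
    return " ".join(title_parts)


# Made-up first-page data, for illustration only.
sample_page = [
    TextLine("Preprint", 9.0),
    TextLine("SciPlore Xtract: Extracting Titles", 18.0),
    TextLine("from Scientific PDF Documents", 18.0),
    TextLine("Author names", 11.0),
    TextLine("Abstract", 10.0),
]

if __name__ == "__main__":
    print(guess_title(sample_page))
```

A rule like this needs no training data or model inference, which fits the abstract's point about run time, but it will fail on documents where a journal banner or a section heading is set larger than the title.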

Subject area (DDC)
004 Computer science

Keywords

header extraction, title extraction, style information, document analysis

Conference

ECDL 2010, 6 Sept. 2010 - 10 Sept. 2010, Glasgow

Cite

ISO 690
BEEL, Jöran, Bela GIPP, Ammar SHAKER, Nick FRIEDRICH, 2010. SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size). ECDL 2010. Glasgow, 6 Sept. 2010 - 10 Sept. 2010. In: LALMAS, Mounia, ed. and others. Research and advanced technology for digital libraries: 14th European Conference, ECDL 2010, Glasgow, UK, September 6 - 10, 2010; proceedings. Berlin [u.a.]: Springer, 2010, pp. 413-416. Lecture Notes in Computer Science. 6273. ISBN 978-3-642-15463-8. Available under: doi: 10.1007/978-3-642-15464-5_45
BibTeX
@inproceedings{Beel2010SciPl-30892,
  year={2010},
  doi={10.1007/978-3-642-15464-5_45},
  title={SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)},
  number={6273},
  isbn={978-3-642-15463-8},
  publisher={Springer},
  address={Berlin [u.a.]},
  series={Lecture Notes in Computer Science},
  booktitle={Research and advanced technology for digital libraries :14th European Conference, ECDL 2010, Glasgow, UK, September 6 - 10, 2010; proceedings},
  pages={413--416},
  editor={Mounia Lalmas},
  author={Beel, Jöran and Gipp, Bela and Shaker, Ammar and Friedrich, Nick}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/30892">
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dcterms:title>SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)</dcterms:title>
    <dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/30892/1/Beel_0-285747.pdf"/>
    <bibo:uri rdf:resource="http://kops.uni-konstanz.de/handle/123456789/30892"/>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-05-06T09:18:48Z</dcterms:available>
    <dc:rights>terms-of-use</dc:rights>
    <dc:creator>Shaker, Ammar</dc:creator>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:contributor>Gipp, Bela</dc:contributor>
    <dc:contributor>Shaker, Ammar</dc:contributor>
    <dc:creator>Friedrich, Nick</dc:creator>
    <dc:language>eng</dc:language>
    <dc:creator>Gipp, Bela</dc:creator>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/30892/1/Beel_0-285747.pdf"/>
    <dcterms:abstract xml:lang="eng">Extracting titles from a PDF’s full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and expensive (in terms of calculating power) machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule based heuristic, which considers style information (font size) to identify a PDF’s title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an ‘academic search engine’ scenario and better run times (8:19 minutes vs. 57:26 minutes).</dcterms:abstract>
    <dc:creator>Beel, Jöran</dc:creator>
    <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-05-06T09:18:48Z</dc:date>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dcterms:issued>2010</dcterms:issued>
    <dc:contributor>Beel, Jöran</dc:contributor>
    <dc:contributor>Friedrich, Nick</dc:contributor>
  </rdf:Description>
</rdf:RDF>

University bibliography
No