ISA 14th World Congress Montreal: RC35: Session 'Social Sciences in the World Wide Web'

Dr. Harald Klein

Friedrich-Schiller-Universitaet
Institut fuer Soziologie
Otto-Schott-Str. 41
07740 Jena
Germany

Tel.: +49 3641 945543 office hours
+49 172 9421627 private (mobile)
Fax: +49 3641 945542
eMail: [email protected]
WWW: http://www.soziologie.uni-jena.de/home/klein

Text Analysis of Data in the World Wide Web

This paper deals with the possibilities of obtaining textual data from the World Wide Web. The first topic is the different kinds of sources, such as homepages, overviews, and link pages. A more technical aspect is the different formats in which the information is offered, such as text, graphics, or animations. Before an analysis, the text must be transformed into a format that can be analysed: it must be separated into text units, and external variables have to be defined. The problems arising from this prerequisite of every text analysis will be discussed. Finally, the paper deals with the different standards of text encoding, such as HTML, VRML, XML, and TEI, and with the lack of text encoding standards in current text analysis software.
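
The preparation step described above (transforming a web page into analysable text, separating it into text units, and attaching external variables) might look roughly like the following minimal sketch. It is an illustration only, not the procedure used in the paper: it assumes Python's standard html.parser module, treats the end of a paragraph, heading, list item, or title as a text-unit boundary, and uses the hypothetical external variables source_type and url together with an invented sample page.

```python
from html.parser import HTMLParser

# Tags treated as text-unit boundaries (an assumption made for this sketch).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "title"}
# Tags whose content is not running text and is therefore ignored.
SKIP_TAGS = {"script", "style"}


class TextUnitExtractor(HTMLParser):
    """Collect the visible text of an HTML page as a list of text units."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.units = []        # finished text units, in document order
        self._buffer = []      # text fragments of the unit being built
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def _flush(self):
        # Join the fragments, normalise whitespace, and keep non-empty units.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.units.append(text)
        self._buffer = []

    def close(self):
        super().close()
        self._flush()


def html_to_units(html, external_vars):
    """Return one record per text unit, each carrying the external variables."""
    parser = TextUnitExtractor()
    parser.feed(html)
    parser.close()
    return [
        {"unit_no": i, "text": text, **external_vars}
        for i, text in enumerate(parser.units, start=1)
    ]


# Invented sample page and hypothetical external variables.
sample = """<html><head><title>Institute homepage</title>
<style>p { color: black; }</style></head>
<body><h1>Welcome</h1>
<p>Research on <b>text analysis</b>.</p>
<p>Teaching &amp; staff.</p></body></html>"""

records = html_to_units(
    sample, {"source_type": "homepage", "url": "http://example.org/"}
)
for record in records:
    print(record)
```

Inline markup such as <b> does not split a unit here, while block-level tags do; a different definition of the text unit (sentence, page, whole site) would require a different boundary rule. Graphics and animations carry no text for this parser and are simply skipped.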