Html 2 txt
I was wondering if there are any algorithms for converting HTML to text. I've found some utilities, but those aren't what I need.
I'm developing an application, and one of the specs is HTML-to-text conversion.
Your help would be greatly appreciated.
[282 byte] By [der124] at [2007-11-18 2:28:58]

# 3 Re: Html 2 txt
Ok, what language and OS are you doing this under?
As for parsing the data, you may have to walk the character data. My first thought is that anytime you see >hhhhh<, the text is the data between the > and the <. But you might run into a < or > that is part of the text, so you have to validate the tags.
However, not knowing how you are looking at the data, you may need to know all the element names to properly identify the tags. Check out Quadzilla (http://wdvl.internet.com/Quadzilla/), as he has a nice listing of all the keywords valid for HTML.
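For illustration, here is a minimal sketch of walking the character data, written in Python only because no language has been named yet. The function name `text_between_tags` is made up for this example, and the rule it uses to "validate" a tag (a < counts as a tag start only when it is followed by a letter, /, ! or ?) is an assumption, not a full check against the list of valid element names.

```python
def text_between_tags(html: str) -> str:
    """Walk the input one character at a time and keep only text outside tags."""
    out = []
    in_tag = False
    for i, ch in enumerate(html):
        if in_tag:
            # Inside a tag: drop everything up to and including the closing '>'.
            if ch == ">":
                in_tag = False
        elif (ch == "<" and i + 1 < len(html)
              and (html[i + 1].isalpha() or html[i + 1] in "/!?")):
            # Assumed rule: this looks like the start of a tag, comment, or doctype.
            in_tag = True
        else:
            # Ordinary text, including a stray '<' or '>' that is not part of a tag.
            out.append(ch)
    return "".join(out)


print(text_between_tags("<p>if a < b then <b>stop</b></p>"))
```

A > inside a quoted attribute value would still end the tag too early here, so treat this as a starting point rather than a complete parser.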
# 4 Re: Html 2 txt
Found this today, did not even realize it was there.
WDG Validator (http://www.htmlhelp.com/tools/validator/)
It is an HTML source validator, and they do have GNU source code available. It will help speed along development if you cannot find anything and need to write your own, if you know Perl or can convert it to the language you need.
# 5 Re: Html 2 txt
Still not sure what you're trying to accomplish. Do you know and can you explain it? Sounds like you want the output to be plain (unformatted) text, right?
You've said you have to write this yourself. But do you also have to determine the logic behind it? If the answer is "yes," do NOT read below here!
.
.
.
.
.
I think it can be broken down into the following tasks, in roughly this order:
o Delete everything before the <body> tag
o Delete all line breaks (and I do not mean <br>s) that are *within* HTML blocks (this may be the hardest part)
o Convert paragraph breaks into two line breaks (probably, depends on what you want)
o Perhaps convert <br>s to a line break
o Perhaps convert headings <hX> to line breaks (before and/or after)
o Convert <li>...</li>s and probably some other tags to a line break
o Somehow deal with tables (very difficult!)
o Delete all HTML tags (!! right?)
o Convert HTML entities ( & and many more)
That's all I can think of, but I may have missed something.
If my assumption is correct and the output is just plain text, the only formatting the output will have is line breaks, so line breaks will be your biggest challenge.
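The task list above can be sketched roughly as follows. This is only an illustration in Python using regular expressions, assuming reasonably well-formed input; the function name `html_to_text` is invented for the example, the choice of one or two line breaks per tag is just one possible reading of the list, and tables are not handled at all.

```python
import html
import re


def html_to_text(source: str) -> str:
    # Delete everything before the <body> tag, if there is one.
    m = re.search(r"<body\b[^>]*>", source, flags=re.I)
    if m:
        source = source[m.end():]
    # Line breaks in the HTML source carry no meaning: collapse whitespace to one space.
    text = re.sub(r"\s+", " ", source)
    # Paragraph breaks and headings become two line breaks.
    text = re.sub(r"</?p\b[^>]*>", "\n\n", text, flags=re.I)
    text = re.sub(r"</?h[1-6]\b[^>]*>", "\n\n", text, flags=re.I)
    # <br> and the end of a list item become one line break.
    text = re.sub(r"<br\b[^>]*>|</li\s*>", "\n", text, flags=re.I)
    # Delete all remaining tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Convert HTML entities last, so a decoded '<' is never mistaken for a tag.
    text = html.unescape(text)
    # Tidy up: no spaces around line breaks, at most one blank line in a row.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


page = ("<html><head><title>T</title></head><body>\n<h1>Title</h1>\n"
        "<p>Hello\nworld &amp; all.</p><ul><li>one</li><li>two</li></ul>"
        "line<br>break</body></html>")
print(html_to_text(page))
```

Regular expressions are used here only to keep the sketch short. Script and style blocks, comments, and malformed markup would need extra handling.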
Larry
# 6 Re: Html 2 txt
Everything came through in my last post, except the bit about HTML entities. I meant (and the post said, before it was stripped) things like
& nbsp ;
& amp ;
& #anumber;
and many more
(but without the spaces shown here)
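As a small sketch of the entity step, assuming Python is an option: the standard library function `html.unescape` decodes named and numeric entities. The wrapper name `decode_entities` is made up for this example, and turning the non-breaking space into an ordinary space is an assumption about what plain-text output should look like.

```python
import html


def decode_entities(text: str) -> str:
    # html.unescape handles named (&amp;), decimal (&#169;) and hex (&#xA9;) entities.
    decoded = html.unescape(text)
    # Assumption: plain-text output wants a normal space instead of U+00A0.
    return decoded.replace("\u00a0", " ")


print(decode_entities("Fish&nbsp;&amp;&nbsp;chips &#169; &lt;b&gt;"))
```

Run this only after the tags have been deleted, otherwise a decoded < could be mistaken for the start of a tag.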
Larry