Regular Expressions...

I need some help with regexes...
I'm trying to make a PHP application that will receive a small piece of HTML text in a textarea and separate all the HTML tags from the content between them.

To make my point clearer, here's an example.
If my input text is:
<font style="font-weight:bold;"><u>Hello There!</u></font>
<a href="google.com"><b>Google</b> links<img src="googlelogo.com" border="0"></a>

So the PHP would separate it somehow and give an array (it could be anything; I think an array is the simplest) that would look like this:
Array(
[0] => <font style="font-weight:bold;">
[1] => <u>
[2] => Hello There!
[3] => </u>
[4] => </font>
[5] => <a href="google.com">
[6] => <b>
[7] => Google
[8] => </b>
[9] => links
... and so on and so on...
)
[988 byte] By [gilly914] at [2007-11-20 11:00:32]
# 1 Re: Regular Expressions...
I think it's actually easier without regular expressions. In theory, a short solution would be:

<?php
$newstring = str_replace('>', '>-split-', $string);
$strings = explode('-split-', $newstring);
?>
PeejAvery at 2007-11-10 3:56:14 >
# 2 Re: Regular Expressions...
Thanks peejavery, I could always count on you to help...

I didn't think simple enough!
I only thought of a complex regex solution... :)

By the way,
your solution doesn't work 100%!
In order to make it work, you must modify it like this (this might help someone else who needs this answer too):
$newstring = str_replace('>', '>-split-', $string);
$newstring = str_replace('<', '-split-<', $newstring);
$strings = explode('-split-', $newstring);

The reason is that your code only splits after each '>', so it doesn't separate text from the tag that follows it,
for example text followed by a closing HTML tag.

but thanks anyway... :)
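
To see the difference between the two versions above, here is a minimal sketch run on one fragment of the example input. The helper names `split_after_tags` and `split_around_tags` are made up for this illustration, and the sketch assumes the input never contains the literal marker `-split-`.

```php
<?php
// Original suggestion: a marker is added only after each '>'.
function split_after_tags($string) {
    $newstring = str_replace('>', '>-split-', $string);
    return explode('-split-', $newstring);
}

// Revised version: a marker is added after each '>' and before each '<'.
function split_around_tags($string) {
    $newstring = str_replace('>', '>-split-', $string);
    $newstring = str_replace('<', '-split-<', $newstring);
    return explode('-split-', $newstring);
}

$string = '<u>Hello There!</u>';

// The text stays glued to the closing tag:
// '<u>', 'Hello There!</u>', ''
print_r(split_after_tags($string));

// Text and tags are separate, with an empty string at each end:
// '', '<u>', 'Hello There!', '</u>', ''
print_r(split_around_tags($string));
```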
gilly914 at 2007-11-10 3:57:14 >
# 3 Re: Regular Expressions...
...By the way,
your solution doesn't work 100%!
Yeah. I figured as much, but it was to get you started.

Also note that with your revision, if HTML tags are back to back, you will end up with empty strings at some array indexes. You will want to account for that (for example, by skipping the empty values) when you read the array.
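
One possible way to deal with those empty entries, sketched here as an assumption rather than as the approach anyone in the thread used, is to filter them out with `array_filter` and renumber the keys with `array_values`. The variable name `$parts` is made up for this example.

```php
<?php
// First line of the sample input from the question.
$string = '<font style="font-weight:bold;"><u>Hello There!</u></font>';

// Revised splitting: mark both sides of every tag, then explode.
$newstring = str_replace('>', '>-split-', $string);
$newstring = str_replace('<', '-split-<', $newstring);
$strings = explode('-split-', $newstring);

// Back-to-back tags leave empty strings behind; keep only non-empty
// entries and renumber the keys from 0.
$parts = array_values(array_filter($strings, function ($s) {
    return $s !== '';
}));

print_r($parts);
```

The explicit `$s !== ''` check is used instead of calling `array_filter` without a callback, because the default filter would also drop a text node that is just "0".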
PeejAvery at 2007-11-10 3:58:10 >
# 4 Re: Regular Expressions...
You could do this...
$matches = null;
preg_match_all("/<[^>]+>|[^<]+|.+/s", $html, $matches);
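
As a rough sketch of how that call behaves, here it is applied to the second line of the sample input from the question. The variable name `$parts` is only for illustration.

```php
<?php
$html = '<a href="google.com"><b>Google</b> links<img src="googlelogo.com" border="0"></a>';

$matches = null;
// Alternatives, tried in order: a complete tag, a run of text up to the
// next '<', or whatever is left (for example a stray '<' with no '>').
$count = preg_match_all("/<[^>]+>|[^<]+|.+/s", $html, $matches);

// The pattern has no capture groups, so all pieces are in $matches[0].
$parts = $matches[0];
print_r($parts);
```

Note that the text pieces keep their surrounding whitespace, so the piece after `</b>` is `' links'` with a leading space.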
andreasblixt at 2007-11-10 3:59:15 >
# 5 Re: Regular Expressions...
Use Andreas' solution. Just remember that $matches will be an array of arrays. Since this regex has no capture groups, the matched pieces are all in $matches[0].

EDIT:
@Andreas...I tried to rate your post, but I have to wait. Apparently I rate you a lot! Your posts are well worth it.
PeejAvery at 2007-11-10 4:00:13 >
# 6 Re: Regular Expressions...
Yup, preg_match_all puts each group into its own array (the above regex only has group 0, the whole match), and under each of those are the matches for that group. This behavior can be changed using the fourth argument, $flags (see http://php.net/preg_match_all )
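
To illustrate the two layouts, here is a small sketch comparing the default ordering (`PREG_PATTERN_ORDER`) with `PREG_SET_ORDER`. The input string and the variable names `$byGroup` and `$byMatch` are made up for this example.

```php
<?php
$html = '<b>Google</b> links';
$pattern = "/<[^>]+>|[^<]+|.+/s";

// Default (PREG_PATTERN_ORDER): one array per group, so everything
// ends up under index 0 because the pattern has no capture groups.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: one array per match, each holding its group 0.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);

print_r($byGroup); // [ ['<b>', 'Google', '</b>', ' links'] ]
print_r($byMatch); // [ ['<b>'], ['Google'], ['</b>'], [' links'] ]
```

Either layout holds the same pieces; with `PREG_SET_ORDER` the flat list can be recovered with `array_column($byMatch, 0)`.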

You can see the output of the above code here (it's a print_r() of the $matches array after running the regex on dev-archive.com's front page):
http://development.mezane.org/tests/split_html.php

Edit:
The empty items you see are the whitespace between the HTML tags (newlines, spaces and tabs). The following regular expression will skip items that only consist of whitespace:
preg_match_all("/<[^>]+>|[^\\s<][^<]*|\\S.*/s", $html, $matches);
http://development.mezane.org/tests/split_html_no_whitespace.php

It will, however, also skip the leading whitespace for all items that are matched. If this behavior is not desired, try the following regular expression:
preg_match_all("/<[^>]+>|(?!\\s+<)[^<]*|(?!\\s+\\S).*/s", $html, $matches);
http://development.mezane.org/tests/split_html_no_whitespace_lookahead.php
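
The three patterns in this thread can be compared side by side with a short sketch. The helper `tokenize` and the test input are made up for this illustration. The patterns are written in single quotes here, so `\s` needs no doubled backslash; they are meant to be the same expressions as the double-quoted ones above. Because `[^<]*` and `.*` in the last pattern can match zero characters, the helper also drops any empty strings as a precaution.

```php
<?php
// Hypothetical helper: run a pattern and return the non-empty pieces.
function tokenize($pattern, $html) {
    preg_match_all($pattern, $html, $matches);
    return array_values(array_filter($matches[0], function ($s) {
        return $s !== '';
    }));
}

$html = "<p>\n  <b> Hi </b>\n</p>";

// 1. Original pattern: whitespace between tags becomes its own item.
$all = tokenize('/<[^>]+>|[^<]+|.+/s', $html);

// 2. Skips whitespace-only items, but also trims leading whitespace.
$trimmed = tokenize('/<[^>]+>|[^\s<][^<]*|\S.*/s', $html);

// 3. Lookahead version: skips whitespace-only items between tags
//    and keeps the leading whitespace of real text.
$kept = tokenize('/<[^>]+>|(?!\s+<)[^<]*|(?!\s+\S).*/s', $html);

print_r($all);     // '<p>', "\n  ", '<b>', ' Hi ', '</b>', "\n", '</p>'
print_r($trimmed); // '<p>', '<b>', 'Hi ', '</b>', '</p>'
print_r($kept);    // '<p>', '<b>', ' Hi ', '</b>', '</p>'
```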
andreasblixt at 2007-11-10 4:01:12 >