php - Split a long text with html into a teaser and a main part


I have a long text that contains HTML tags (br, img, etc.).

From this text I need a teaser of at most 400 characters. It should respect word boundaries and HTML tags, but br tags should be replaced with a space to remove the line breaks in the teaser. That looks better!

The text after the teaser is the full text minus the teaser, with its HTML tags, images, and br tags included.

Example text: lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy   eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <img alt="image" src="/image.jpg"> at vero eos et accusam et justo duo dolores et ea rebum. stet clita kasd gubergren, no sea takimata sanctus est lorem ipsum dolor sit amet. lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. at vero eos et accusam et justo duo dolores et ea rebum.  <br /><br /> stet clita kasd gubergren, no sea takimata sanctus est lorem ipsum dolor sit amet. <img alt="image" src="/image.jpg"> lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. at vero eos et accusam et justo duo dolores et ea rebum. stet clita kasd gubergren, no sea takimata sanctus est lorem ipsum dolor sit amet. <br /><br /> duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, <img alt="image" src="/image.jpg"> vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. lorem ipsum dolor sit amet.

What I tried:

$content = $junk_of_lorem;
function teaser($string){
    // Cut at 500 bytes, then back up to the last space so no word is split.
    $string = substr($string, 0, 500);
    $string = substr($string, 0, strrpos($string, " "));
    // Replace double line breaks with a single space.
    $replacements = array(
        '|<br /><br />|' => ' '
    );
    $patterns = array_keys($replacements);
    $replacements = array_values($replacements);
    $string = preg_replace($patterns, $replacements, $string);
    return $string;
}
$teaser = teaser($content);

Then I tried to remove the $teaser text from the content to get the text without the teaser:

$mainpart = str_replace(teaser($content), "", $content); 

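Problem:

With this dummy solution I run into problems, because the teaser has its br tags replaced while the main part still has its HTML tags. And when there is an image around character 490, the main part contains half of the img tag.

Using strip_tags (allowing br) on $teaser works, but then I can't remove the exact match to get $mainpart.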
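To make the two problems concrete, here is a minimal sketch. The sample string and the cut offsets (26 and 40) are made up for illustration and are not the text from the question.

```php
<?php
// Hypothetical sample input, much shorter than the real text.
$content = 'one two <br /><br /> three <img alt="image" src="/image.jpg"> four';

// Problem 1: once the br tags are replaced, the teaser is no longer a
// substring of the original, so str_replace() has nothing to remove.
$teaser   = str_replace('<br /><br />', ' ', substr($content, 0, 26));
$mainpart = str_replace($teaser, '', $content);
$nothingRemoved = ($mainpart === $content);

// Problem 2: cutting at a fixed byte offset can land inside a tag.
// If the last "<" comes after the last ">", a tag was left open.
$cut = substr($content, 0, 40);
$tagLeftOpen = strrpos($cut, '<') > strrpos($cut, '>');

var_dump($nothingRemoved); // bool(true)
var_dump($tagLeftOpen);    // bool(true)
```

Both checks come out true here, which is why the split point has to be found on the original markup rather than on a cleaned-up copy.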

I am pretty sure there is a better solution. Sorry for my English mistakes, please don't vote me down. I did my best to explain it.

Thank you for your time.

Okay, I tinkered with this and I think I may have something that will work for you.

Given a string like this:

$string = 'lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy   eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. <img alt="image" src="/image.jpg"> at vero eos et accusam et justo duo dolores et ea rebum. stet clita kasd gubergren, no sea takimata sanctus est lorem ipsum dolor sit amet. lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. at vero eos et accusam et justo duo dolores et ea rebum.  <br /><br /> stet clita kasd gubergren, no sea takimata sanctus est lorem ipsum dolor sit amet. <img alt="image" src="/image.jpg"> lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. at vero eos et accusam et justo duo dolores et ea rebum. stet clita kasd gubergren, no sea takimata sanctus est lorem ipsum dolor sit amet. <br /><br /> duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, <img alt="image" src="/image.jpg"> vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. lorem ipsum dolor sit amet.';

We can write a preg_match statement, using the PREG_OFFSET_CAPTURE flag to note the position of the match, like this:

preg_match('~([A-Za-z0-9 ,.]|<.*?>){1,158}(?=\s+)~', $string, $matches, PREG_OFFSET_CAPTURE);

Where we have {1,158}, you can change 158 to whatever length you want your teaser to be. The number of characters won't be exactly 400 or 500, but it should be around that number. For instance, if you have HTML tags, each one takes up more space but counts as only 1 of our characters. (Because we are telling it to give us either a character or an HTML tag, up to 158 times.)
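A small sketch of that counting behaviour, using a made-up string and a limit of 5 instead of 158 so the effect is easy to see. The character class here assumes both uppercase and lowercase letters are wanted.

```php
<?php
// Hypothetical input: three characters, then one 17-byte img tag, then more text.
$string = 'ab <img src="x.png"> cd ef';

// Ask for at most 5 items; each item is one character or one whole tag.
preg_match('~([A-Za-z0-9 ,.]|<.*?>){1,5}(?=\s+)~', $string, $m);

echo $m[0], "\n";         // ab <img src="x.png">
echo strlen($m[0]), "\n"; // 20
```

The match is only 4 items long ("a", "b", a space, and the tag) because a fifth item would end in front of "c" rather than whitespace, yet it spans 20 bytes. So the limit is a rough budget, not an exact character count.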

$matches will contain an array like this:

Array
(
    [0] => Array
        (
            [0] => lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy   eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua
            [1] => 0
        )
    [1] => Array
        (
            [0] =>
            [1] => 155
        )
)

So we want $matches[0][0] for the text and $matches[1][1] for the position where it left off.

Now, let's take the info we have and define some variables we can use later:

$teaser = $matches[0][0];
$capture_position = $matches[1][1] + 1;
$body = substr($string, $capture_position);

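Just note that we increment $matches[1][1] by 1 because we want to start from the character after the match, not from the last character matched. This relies on the last matched item being a single character; if it were an HTML tag, $matches[1][1] would point at the start of that tag.

Next, we define $body using substr to grab the text starting at our $capture_position and going forward.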
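If the teaser might end on a tag, one possible variation (my assumption, not part of the demo linked in this answer) is to compute the split point from the whole match instead: its start offset plus its length. The sample string is made up.

```php
<?php
// Hypothetical input in which the teaser ends on an img tag.
$string = 'ab <img src="x.png"> cd ef';

preg_match(
    '~([A-Za-z0-9 ,.]|<.*?>){1,5}(?=\s+)~',
    $string,
    $matches,
    PREG_OFFSET_CAPTURE
);

$teaser = $matches[0][0];

// $matches[1][1] is where the LAST item starts (the img tag, offset 3),
// so adding 1 would land inside the tag. The start offset of the whole
// match plus its length is the first byte after the teaser in every case.
$capture_position = $matches[0][1] + strlen($teaser);
$body = substr($string, $capture_position);

echo $body, "\n"; // " cd ef" (with the leading space)
```

With this variant $teaser . $body always reproduces the matched part of the original string, so nothing is lost or duplicated at the seam.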

Finally, we can print out our $teaser (with strip_tags) and $body:

print '<b>'.strip_tags($teaser).'</b>';
print '<br><br>'.$body;
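Since the question asks for br tags to become spaces, here is a rough sketch of a cleanup step that could run in place of the bare strip_tags call. The helper name clean_teaser is hypothetical, and the br pattern is my own guess at the variants worth covering.

```php
<?php
// Hypothetical helper: turn a raw teaser fragment into plain text.
function clean_teaser(string $fragment): string
{
    // Replace <br>, <br/> and <br /> with a space so words stay separated;
    // strip_tags() alone would glue "one<br />two" into "onetwo".
    $text = preg_replace('~<br\s*/?>~i', ' ', $fragment);
    // Drop the remaining tags (img and so on).
    $text = strip_tags($text);
    // Collapse the runs of whitespace left behind by the removed tags.
    return trim(preg_replace('~\s+~', ' ', $text));
}

echo clean_teaser('one<br /><br />two <img alt="image" src="/image.jpg"> three');
// prints: one two three
```

The body is left untouched, so its images and line breaks survive exactly as they were in the source.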

Here is a working demo:

http://ideone.com/yqitlq

And here is the regex to play around with, so you can see how changing 158 affects the total captured string:

https://regex101.com/r/iz9lx1/1

Explanation of ([A-Za-z0-9 ,.]|<.*?>){1,158}(?=\s+)

  • ([A-Za-z0-9 ,.]|<.*?>) This is a capture group ( ... ) that will contain our teaser, and it is made up of 2 items. The first is a character class [ ... ] made up of uppercase and lowercase letters A-Za-z, numbers 0-9, a space, a comma , and a period .. The pipe | is an "or" symbol. The second item looks for a less-than sign <, followed by any character ., any number of times *, until it hits the next part of our match ?. The next part of our match is a greater-than sign >. This should match an HTML tag.
  • {1,158} This is a range with a starting number of 1 and going through 158. It means that whatever was matched right before it (a character or an HTML tag) should be found at least once, with a maximum of 158 times.
  • (?=\s+) This is a lookahead (?= ... ), saying that a whitespace character \s should be found at least once + after the match.
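To see the lazy quantifier and the lookahead in isolation, here is a short sketch with made-up inputs. The simplified pattern in the last step is only for illustration and is not the one used above.

```php
<?php
$html = '<b>bold</b> text';

// Lazy: .*? stops at the first ">" it can reach.
preg_match('~<.*?>~', $html, $lazy);
// Greedy: .* runs on to the last ">" in the string.
preg_match('~<.*>~', $html, $greedy);

echo $lazy[0], "\n";   // <b>
echo $greedy[0], "\n"; // <b>bold</b>

// Lookahead: the match must be followed by whitespace, so the engine
// backs off from "lorem i" until it ends on a word boundary.
preg_match('~[a-z ]{1,7}(?=\s+)~', 'lorem ipsum dolor', $word);
echo $word[0], "\n";   // lorem
```

The lookahead does not consume the whitespace, which is why the body in the answer starts with the space that follows the teaser.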
