How to get the most relevant image from a page (PHP)

Last week I was working on a WordPress plugin that imports articles from RSS feeds in which all images were far too small to use. The only solution was to retrieve an image from the external page itself, which raised a lot of new questions. Which of the images is the most relevant one? What is the main content? How do you manage the import within a limited execution time? Continue reading to find the answers.

The example code below has these goals:

  1. find main content (article body)
  2. grab all images
  3. get their dimensions and relevance
  4. define the right image to use

Let’s start with an example and explain it later. As you can see, the code is quite basic and needs improvement before real-world use, but it should clearly demonstrate the point.


<?php
/*
 * PREPARE PAGE CONTENT TO MANIPULATE WITH
 * =======================================
*/

	// URL of page we're going to use to retrieve images from
	$page_url = 'http://www.webdesignerdepot.com/2014/10/the-ultimate-guide-to-bootstrap/';
	
	// Set a timeout in case the page does not respond
	ini_set('default_socket_timeout', 4);
	
	// Include library for converting relative url to absolute
	require_once('url_to_absolute.php');
	
	// Parse page content and create DOM object
	$page_dom = new DOMDocument();
	@$page_dom->loadHTMLFile($page_url);


	
/*
 * FIND THE MAIN TEXT CONTENT  
 * ==========================
 * 
 * We use the biggest paragraph as an identifier of main content.
 * This is not the best solution because some pages use only BR tags.
*/


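	$main_content = null;
	foreach ( $page_dom->getElementsByTagName('p') as $p ) {
		if ( $main_content === null || strlen($p->textContent) > strlen($main_content->textContent) )
			$main_content = $p;
	}

	// Without any paragraph there is nothing to measure distance from
	if ( $main_content === null ) exit('No paragraph found on page.');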


/*
 * GET ALL IMAGES ON PAGE &
 * CALCULATE NODE DISTANCE FROM MAIN CONTENT
 * =========================================
 *
 * We only care about JPGs. Strip URL parameters and make the URL absolute.
 * All images are stored in an array.
 * We use node distance as a measure of image relevance.
*/

	$all_img = array();
	foreach ( $page_dom->getElementsByTagName('img') as $i ) {
	
		// Prepare image URL: strip the query string first, then skip anything that is not a JPG.
		$img_url = $i->getAttribute('src');

		if ( strpos($img_url, '?') !== false ) $img_url = substr($img_url, 0, strpos($img_url, '?'));
		if ( strtolower(substr($img_url, -3)) != 'jpg' ) continue;
		$img_url = url_to_absolute($page_url, $img_url);
		

		// Compare xpaths of main content and image to calculate distance between two nodes.	
		$content_path = $main_content->getNodePath();
		$image_path = $i->getNodePath();
	
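		// Find the length of the common prefix without reading past the shorter path
		$offset = 0;
		$limit = min(strlen($image_path), strlen($content_path));
		while ( $offset < $limit ) {
		    if ( $image_path[$offset] !== $content_path[$offset] ) break;
		    $offset++;
		}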
		
		$distance = substr_count($image_path, '/', $offset) + substr_count($content_path, '/', $offset);
		
		
		// Add an image to array
		$all_img[] = array('url' => $img_url, 'distance' => $distance);
	}


/*
 * GET DIMENSION OF IMAGES
 * =======================
 * 
 * We don't download whole images, only the first 32 KB, which contain the header.
*/

	foreach ( $all_img as $k => $i ) {
			
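		// Read only the first 32 KB; "@" hides warnings about unreachable or truncated files
		$data = @file_get_contents($i['url'], false, null, 0, 32768);
		$img = $data ? @imagecreatefromstring($data) : false;
	
		// Width and height stay 0 when the image could not be read
		$all_img[$k]['width'] = $img ? imagesx($img) : 0;
		$all_img[$k]['height'] = $img ? imagesy($img) : 0;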
	}




/*
 * GET THE MOST RELEVANT IMAGES WITH MINIMUM SPECIFIED WIDTH
 * =========================================================
*/

	// Sort array of images by distance
	usort($all_img, function($a, $b) {
	    return $a['distance'] - $b['distance'];
	});

	// Find first image with width at least 300px 
	$image = '';
	foreach ( $all_img as $i ) if ( $i['width'] >= 300 ) { $image = $i['url']; break; }
	
	var_dump($all_img);
	var_dump($image);
		
?>

Easy start
The beginning is quite straightforward. We set a timeout for external connections. In my experience an average page responds within one to four seconds, but sometimes the latency is large enough to consume the whole execution time. Later in the script we need to convert relative image URLs to absolute ones, and for this we use a library made by Nitin Gupta. You can download it here. The block ends by creating a DOM object. This works even with invalid HTML, but it produces a large number of warnings, so I use the @ symbol to suppress them.
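
As an alternative to the @ symbol, libxml can collect parse warnings internally. The sketch below is only an illustration: the helper name load_html_quietly and the sample markup are made up, and it parses a string rather than fetching a URL so that it runs without network access.

```php
<?php
// Hypothetical helper: parse possibly invalid HTML without printing warnings.
function load_html_quietly(string $html): DOMDocument
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Made-up, deliberately sloppy markup (unclosed <p> and <b>).
$dom = load_html_quietly('<div><p>First<p>Second <b>bold</p></div>');
echo $dom->getElementsByTagName('p')->length, "\n";
```

Compared with @, this keeps the warnings available through libxml_get_errors() in case you want to log them before clearing.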

Identify main page content
To judge the relevance of images we need to know what the main content is. In the second block we identify the main content as the longest paragraph on the page. This is far from the best solution, for several reasons. Some websites use BR tags instead of P. The paragraph itself is not the whole article body; ideally we would use its wrapper. As a result, our final image is relevant to this paragraph and not to the article itself. It would also be a good idea to test the content for the presence of keywords (from the title, meta tags, etc.) or for similarity with the excerpt taken from the feed. PHP has a dedicated function, “similar_text”, for this. However, I don’t do this even in the final plugin: the simple solution based on the longest paragraph works well enough for me, is easy to understand and is relatively fast.
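
To illustrate the excerpt idea, here is a minimal sketch that picks the paragraph most similar to a feed excerpt using similar_text. It is not part of the plugin; the helper name most_similar_paragraph, the markup and the excerpt are all made up for the example.

```php
<?php
// Hypothetical helper: pick the paragraph whose text best matches a feed excerpt.
function most_similar_paragraph(DOMDocument $dom, string $excerpt): ?DOMElement
{
    $best = null;
    $best_score = -1.0;

    foreach ($dom->getElementsByTagName('p') as $p) {
        // similar_text() reports the similarity as a percentage via its third argument.
        similar_text($excerpt, trim($p->textContent), $percent);
        if ($percent > $best_score) {
            $best_score = $percent;
            $best = $p;
        }
    }

    return $best;
}

// Made-up markup and excerpt, for illustration only.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body>'
    . '<p>Subscribe to our newsletter for weekly updates and special offers from partners.</p>'
    . '<p>Bootstrap is a front-end framework for building responsive sites.</p>'
    . '</body></html>'
);

$match = most_similar_paragraph($dom, 'Bootstrap is a front-end framework');
echo $match->textContent, "\n";
```

Note that similar_text gets slow on long strings, so on real pages it may be worth comparing only the first few hundred characters of each paragraph.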

Collect all images and calculate their relevance
The third block collects images from the DOM and calculates the distance between the main content and each image, which we use as a measure of relevance. We only care about JPGs because GIF, PNG and other formats are mostly used for decoration. Images often have relative URLs, which we need to make absolute so that we can detect their dimensions later. We also strip parameters from the URL.
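
The URL clean-up can be sketched as a small standalone function. The name clean_jpg_url and the sample paths are assumptions for illustration; the order matters, because the extension can only be checked reliably once the parameters are gone.

```php
<?php
// Hypothetical helper: strip the query string, then keep only JPG URLs.
function clean_jpg_url(string $src): ?string
{
    // Cut everything from the first "?" onwards.
    $query_start = strpos($src, '?');
    if ($query_start !== false) {
        $src = substr($src, 0, $query_start);
    }

    // Same check as in the listing: the last three characters must be "jpg".
    if (strtolower(substr($src, -3)) !== 'jpg') {
        return null;
    }

    return $src;
}

var_dump(clean_jpg_url('/uploads/hero.JPG?w=640')); // string "/uploads/hero.JPG"
var_dump(clean_jpg_url('/theme/logo.png'));         // NULL
```

Like the listing, this ignores files ending in .jpeg; extending the check is straightforward if your feeds need it.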

To measure the distance we use XPath and count the number of steps in the hierarchy, which are represented by forward slashes. Have a look at these example XPaths.

/html/body/section[1]/main/div[2]/div[1]/p[3]
/html/body/section[1]/main/div[2]/div[1]/img[2]
/html/body/section[2]/aside/div[1]/img[1]

Because the paragraph and the first image have the same parent, the distance between them is zero. The second image and the paragraph have only the body element as a common ancestor, so their distance is three steps from the image plus four steps to the paragraph, seven in total. Finally, we add the image to the array of all images.
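
The same calculation can be tried on the three example paths above. This is a sketch that mirrors the logic of the listing; the function name path_distance is made up for illustration.

```php
<?php
// Hypothetical helper mirroring the listing: distance between two node paths.
function path_distance(string $content_path, string $image_path): int
{
    // Length of the common prefix of both paths.
    $limit = min(strlen($content_path), strlen($image_path));
    $offset = 0;
    while ($offset < $limit && $content_path[$offset] === $image_path[$offset]) {
        $offset++;
    }

    // Each remaining "/" is one step away from the shared part.
    return substr_count($image_path, '/', $offset)
         + substr_count($content_path, '/', $offset);
}

$paragraph = '/html/body/section[1]/main/div[2]/div[1]/p[3]';
echo path_distance($paragraph, '/html/body/section[1]/main/div[2]/div[1]/img[2]'), "\n"; // 0
echo path_distance($paragraph, '/html/body/section[2]/aside/div[1]/img[1]'), "\n";       // 7
```

The comparison works character by character, so it is a rough measure rather than a true tree distance, but it is cheap and good enough for ranking images.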

Get image dimensions without downloading
The fourth block retrieves the dimensions of the collected images. Downloading whole images would be really inefficient and time-consuming. Instead, we download only the first 32 KB of each file, which contain the header, build an image object from that data and read the dimensions from it. This technique is really fast and you can process tens of images in a second.
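
A possible variation, not what the listing does: getimagesizefromstring only parses the image header and does not need to build an image object at all. The sketch below assumes the bytes have already been fetched; the helper name dimensions_from_bytes is made up, and a tiny GIF is embedded so the example runs without network access.

```php
<?php
// Hypothetical helper: read width and height from the leading bytes of an image.
function dimensions_from_bytes(string $bytes): ?array
{
    // getimagesizefromstring() parses the header without decoding pixel data.
    // "@" silences the notice PHP may raise for unreadable data.
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null;
    }

    return array('width' => $info[0], 'height' => $info[1]);
}

// A 1x1 GIF embedded as base64, used here only as test data.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');

var_dump(dimensions_from_bytes($gif));            // width 1, height 1
var_dump(dimensions_from_bytes('not an image'));  // NULL
```

In the import script the bytes would come from a partial read such as file_get_contents($url, false, null, 0, 32768). A JPG with a lot of metadata may keep its size information beyond that limit, in which case the helper returns null and the image is simply skipped.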

Select image to use
The final block picks the best image based on relevance (distance) and width. We sort the array of images by distance and loop through it until we find an image with the minimum desired width.
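
The selection step can also be wrapped in a function and tried on sample data. This is a sketch: the function name pick_image and the example.com URLs are made up, and the array has the same shape as $all_img in the listing.

```php
<?php
// Hypothetical helper: nearest image that is at least $min_width pixels wide.
function pick_image(array $images, int $min_width): string
{
    // Closest images first.
    usort($images, function ($a, $b) {
        return $a['distance'] <=> $b['distance'];
    });

    foreach ($images as $image) {
        if ($image['width'] >= $min_width) {
            return $image['url'];
        }
    }

    return ''; // nothing wide enough
}

// Made-up sample data in the same shape as $all_img.
$images = array(
    array('url' => 'http://example.com/sidebar.jpg', 'distance' => 7, 'width' => 600),
    array('url' => 'http://example.com/icon.jpg',    'distance' => 0, 'width' => 48),
    array('url' => 'http://example.com/hero.jpg',    'distance' => 2, 'width' => 800),
);

echo pick_image($images, 300), "\n"; // http://example.com/hero.jpg
```

The icon is the closest image but too narrow, so the next closest image that passes the width check wins.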