IMDB profile pictures crawler

Introduction

As the name suggests, this is an IMDb profile pictures crawler.

This script reads and stores all the images posted on a profile, for example a celebrity's profile.

I wrote it for my personal use, just to refresh my HTTP skills.

In the script I have used keep-alive. I am not sure whether it works the way I intend, so I am still researching how keep-alive behaves in this script. I have yet to compare the request times with keep-alive against those with Connection: close.
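
One way to check this is to ask cURL how many new connections it opened for a request. The sketch below is not part of the script: timedFetch is a hypothetical helper, and it assumes the same handle is passed in for every request. As far as I understand, libcurl keeps its connection cache on the handle, so a fresh curl_init() for each request has no earlier connection to reuse.

```php
<?php
// Hypothetical helper (not in the original script): performs one GET on an
// existing cURL handle and reports the timing and connection reuse.
function timedFetch($ch, $url)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $body = curl_exec($ch);

    return array(
        // false when the transfer failed
        'ok'       => $body !== false,
        // total time of the last transfer, in seconds
        'seconds'  => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
        // new connections opened for the last transfer;
        // 0 means an already open connection was reused
        'connects' => curl_getinfo($ch, CURLINFO_NUM_CONNECTS),
    );
}
```

Calling timedFetch twice on one handle created with curl_init(), and then twice on two separate handles, should give the comparison. If the server keeps the connection open, the second call on the shared handle is expected to report 0 connects.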

Until I enable the comments section you cannot comment on this code. I will do it as soon as possible.

I have described all the functions I have used, and the logic of this script, in the comment block at the top of the script.
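
A central piece of that logic is reading the "Photo X of Y" counter, which tells the script when to stop. The script does this by stripping tags and spaces and splitting on "of". The sketch below is a variant that uses a single regular expression instead. parsePageCounter is a hypothetical name, and it assumes the counter is plain text inside the span, which may not match the real markup.

```php
<?php
// Hypothetical helper: extracts the current page and the total pages from
// the controls block. Returns null when the counter cannot be found.
function parsePageCounter($html)
{
    // the s modifier lets . match newlines between the div and the span
    $pattern = '/<div id="controls">.*?<span>\s*Photo\s+(\d+)\s+of\s+(\d+)\s*<\/span>/s';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null; // layout changed or counter missing
    }

    return array('page' => (int) $m[1], 'pages' => (int) $m[2]);
}
```

Returning null instead of calling die() leaves it to the caller to decide what to do when the layout changes.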


Source Code

<?php
/*
Name:		IMDB profile pictures crawler
Version:	1.0
Date:		June 20th 2012
Purpose:	Personal use
About:		This script copies all the images of an IMDb profile to the local disk.
 
getHTML:	gets the html content of the page at the given path
getInits:	runs once at the start to get the title of the profile, which is used to create a folder with that profile name, plus the total pages and the current page (usually page 1).
getName:	extracts the media name given by imdb from the path of the current page. this is used as the filename of the image
getNext:	scrapes the current image url and the next page's link and returns them as an array. it also returns the current page and the total pages
getImage:	reads and returns the image from the url given by getNext['image'], which is the full url of the actual image. the main loop stores it on disk.
 
Logic:		You have to provide the first page, on which the first image appears. its html is fetched and the title and the total
			pages are scraped from it. a folder is created with the title as its name. if that folder already exists it is used as the current directory.
			Then the script runs in a loop from page 1 to the total number of pages. The getNext function returns the current page and total pages.
 
Status:		Tested and working well.
 
NOTE:		This script matches regular expressions against the current layout of the html. if any div id or class name is changed in future
			by imdb then the script has to be modified, because it relies on the ids given to div and image tags. if those are changed,
			or the html source code is otherwise changed by imdb, then the code has to be revised.
 
			The ids to watch are controls, primary-img and canvas. if any of these change then the script exits with an error message.
 
*/			
 
//the request uri of the first image from the image gallery
$path	=	'/media/rm2803021568/nm1157358';
$dir	=	'./imdb';
 
echo "\r\nConnecting...";
 
$html			=	getHTML($path);
$ret			= 	getInits($html);
$ret['next']	=	$path;
$title			=	trim($ret['title']);
 
echo "\r\nTitle: ".$title;
echo "\r\nPages: ".$ret['pages'];
 
echo "\r\nchecking directories...";
 
//checking whether the profile name already exists as a directory under $dir. if not, the folder is created (together with $dir if needed).
//either way it becomes the current working directory, in which the images fetched by getImage are saved.
 
$target	=	$dir.'/'.$title;
 
if( ! is_dir($target) && ! mkdir($target, 0777, true))
	die('Cannot create directory '.$target);
 
chdir($target);
 
echo "\r\nProcessing ".$ret['pages']." pages... \r\n";
 
//do-while, so that the first page is processed even when the gallery has a single page (page equals pages from the start).
do
{
	$name	=	getName($ret['next']);
	$html	=	getHTML($ret['next']);
	$ret	=	getNext($html);
 
	echo "\r\n".$ret['page'];
 
	//if the image already exists in the current working folder it is skipped.
	//otherwise it is downloaded and stored there.
	if(file_exists($name.'.jpg'))
	{
		echo " file exists";
		continue;
	}
 
	$image	=	getImage($ret['image']);
 
	//curl_exec returns false on failure. nothing is written in that case.
	if($image === false)
	{
		echo ' failed';
		continue;
	}
 
	file_put_contents($name.'.jpg', $image);
	echo ' done';
}
while($ret['page'] != $ret['pages']);
 
echo "\r\n\r\nCompleted.";
 
function getName($path)
{
	$ret	=	explode('/', $path);
	return $ret[2];
}
 
function getHTML($path)
{
	$host	= 'http://www.imdb.com';
 
	$header = array(
					'User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.34 Safari/536.11',
					'Host: www.imdb.com',
					'Accept: text/html',
					'Accept-Encoding: identity',
					'Connection: Keep-Alive'
					);
 
	$ch		=	curl_init($host.$path);
 
	curl_setopt($ch, CURLOPT_HEADER, false);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_REFERER, 'imdb.com');
	curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
 
	return	curl_exec($ch);
}
 
function getInits($res)
{
	$match_p	=	preg_match('/<div id="controls">([\w\W]*?)<span>Photo(.*?)<\/span>/', $res, $page);
	$match_t	=	preg_match('/<title(.*?)of(.*?) - IMDb/', $res, $title);
 
	if( ! ($match_p && $match_t))
		die('Invalid content');
 
	$page	=	trim(strip_tags($page[2]));
	$page	=	preg_replace('/ /', '', $page);
	$page	=	explode('of',$page);
 
	return array(
					'title'	=>	$title[2],
					'pages'	=>	intval($page[1]),
					'page'	=>	intval($page[0])
				);
}
 
function getNext($res)
{
	$match_next	=	preg_match('/<div id="canvas">([\w\W]*?)href([\w\W]*?)href="(.*?)"/', $res, $next);
	$match_img	=	preg_match('/<img id="primary-img"(.*?)src="(.*?)"/', $res, $img);	
	$match_p	=	preg_match('/<div id="controls">([\w\W]*?)<span>Photo(.*?)<\/span>/', $res, $page);
 
	if( ! ($match_next && $match_img && $match_p))
		die('Data is missing in the response');
 
	$page	=	trim(strip_tags($page[2]));
	$page	=	preg_replace('/ /', '', $page);
	$page	=	explode('of',$page);
 
	return array(
					'next'	=>	$next[3],
					'image'	=>	$img[2],
					'page'	=>	intval($page[0]),
					'pages'	=>	intval($page[1])
				);
}
 
function getImage($path)
{
	$header = array(
					'User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.34 Safari/536.11',
					'Host: ia.media-imdb.com',
					'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
					'Accept-Encoding: identity',
					'Connection: Keep-Alive'
					);
 
	$ch		=	curl_init($path);
	curl_setopt($ch, CURLOPT_HEADER, false);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_REFERER, 'ia.media-imdb.com');
	curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
	$res	=	curl_exec($ch);
 
	return $res;
}