Web Miner for finding homepages
Prof. Joaquim Filipe INSTICC
Rui Rodrigues INSTICC
Servaas Tilkin XIOS Hogeschool Limburg
INSTICC, the "Institute for Systems and Technologies of Information, Control and Communication", is a nonprofit international association which develops and disseminates scientific knowledge, mainly in the area of information systems and technologies.
To do this, it organizes international conferences.
To promote these conferences and to receive feedback on them, INSTICC looks for a professor who works in the subject area of each conference and asks that professor to review it.
Since finding these professors manually is very time-consuming, INSTICC wants to automate this process.
Therefore I have created two related projects. The first is a web miner that finds the homepage of a given professor, based on the professor's first name, last name and organization. The second is an application that extracts details about the person, such as email address, research interests, publications and the research group to which the person belongs.
The first name, last name and organization used to find the homepage are manually obtained by the people at INSTICC by searching the DBLP and Springer databases. These databases contain publications of professors.
The homepage my project attempts to find will then be used in the second part of my project to get contact details and other information about this person.
To determine whether a given web page is a homepage, a large number of features are extracted from the page and passed to a classification algorithm as a binary dataset. These features give the algorithm a binary representation of what the web page looks like.
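As an illustration of such a binary representation, here is a minimal Python sketch. It is not the feature set of the actual C# project: the feature names, the `page` and `person` dictionaries and the function `to_binary_vector` are all assumptions made for this example. Each feature is a yes/no test on the page, and the page is reduced to a list of 0s and 1s.

```python
import re

# Hypothetical features (not taken from the thesis): each entry pairs a
# feature name with a yes/no test on a page for a given person.
FEATURES = [
    ("name_in_title", lambda page, person: person["last_name"].lower() in page["title"].lower()),
    ("name_in_url", lambda page, person: person["last_name"].lower() in page["url"].lower()),
    ("has_email", lambda page, person: re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", page["text"]) is not None),
    ("mentions_publications", lambda page, person: "publications" in page["text"].lower()),
    ("mentions_organization", lambda page, person: person["organization"].lower() in page["text"].lower()),
]


def to_binary_vector(page, person):
    """Turn a page into a list of 0/1 values, one per feature in FEATURES."""
    return [1 if test(page, person) else 0 for _, test in FEATURES]


# Made-up example data.
person = {"first_name": "Ada", "last_name": "Lovelace", "organization": "Example University"}

homepage_like = {
    "url": "http://www.example.edu/~lovelace/",
    "title": "Ada Lovelace - Home",
    "text": "Ada Lovelace, Example University. Contact: ada@example.edu. Publications and research interests.",
}
news_page = {
    "url": "http://news.example.com/article/42",
    "title": "Conference news",
    "text": "Keynote by a professor of Example University.",
}

homepage_vector = to_binary_vector(homepage_like, person)
news_vector = to_binary_vector(news_page, person)
print(homepage_vector, news_vector)
```

A real feature set would be much larger, but the principle is the same: every page becomes a fixed-length binary vector that a classifier can consume.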
Classification algorithms are techniques from the field of Artificial Intelligence; examples include neural networks and support vector machines. They are used to separate (classify) items into classes. In this case there are two classes: a web page either is a homepage or is not.
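To make the idea of two-class classification concrete, the following Python sketch trains a perceptron, one of the simplest linear classifiers, on a handful of binary vectors. It is a stand-in chosen for brevity, not the algorithm used in the project, and the training data and feature meanings are invented for the example.

```python
def predict(weights, bias, x):
    """Return 1 (homepage) if the weighted sum is positive, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Classic perceptron rule: adjust the weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias


# Invented training data. Assumed feature order:
# [name_in_title, has_email, mentions_publications]
samples = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = homepage, 0 = not a homepage

weights, bias = train_perceptron(samples, labels)
training_predictions = [predict(weights, bias, x) for x in samples]
unseen_prediction = predict(weights, bias, [1, 0, 0])
print(training_predictions, unseen_prediction)
```

A perceptron can only draw a linear boundary between the two classes; neural networks and support vector machines can learn more complex separations, but they consume the same kind of binary input and produce the same kind of yes/no answer.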
My project is written entirely in C# as a class library. This means that it cannot be run on its own; instead it will be integrated into a related project which Wim Hertogen is creating. The project uses an SQL Server 2005 database, which stores persons and their homepages and is queried for persons who do not have a homepage yet.
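The database workflow described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database instead of SQL Server 2005, and the table name `person`, its columns and the two helper functions are hypothetical, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the SQL Server 2005 database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        organization TEXT NOT NULL,
        homepage TEXT  -- NULL until a homepage has been found
    )
    """
)
conn.executemany(
    "INSERT INTO person (first_name, last_name, organization, homepage) VALUES (?, ?, ?, ?)",
    [
        ("Ada", "Example", "Example University", None),
        ("Bob", "Sample", "Sample Institute", "http://www.example.org/~sample/"),
        ("Eve", "Placeholder", "Example University", None),
    ],
)
conn.commit()


def persons_without_homepage(conn):
    """Return the persons for whom a homepage still has to be searched."""
    query = (
        "SELECT id, first_name, last_name, organization "
        "FROM person WHERE homepage IS NULL ORDER BY id"
    )
    return conn.execute(query).fetchall()


def store_homepage(conn, person_id, url):
    """Save the homepage found for a person."""
    conn.execute("UPDATE person SET homepage = ? WHERE id = ?", (url, person_id))
    conn.commit()


pending = persons_without_homepage(conn)
# Pretend the web miner found a homepage for the first pending person.
store_homepage(conn, pending[0][0], "http://www.example.edu/~example/")
remaining = persons_without_homepage(conn)
print(len(pending), len(remaining))
```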
If you want to cite this thesis in your own thesis, paper, or report, use this format (APA):
BIJNENS, A. (2011). Web miner for finding homepages.
Unpublished thesis, Xios, N-TECH.