In this exercise, we explore a dataset of image features, such as the image's size, whether the image is local, and which site it came from. Exploring the dataset, we found that it had a large number of features (1,559 variables). As such, we applied Principal Components Analysis (PCA) to the data to reduce its dimensionality. We found that (1) 110 components explain 75% of the variance and (2) 73 components explain 60% of the variance, which is a massive decrease from the original number of variables. We also classified the images using Support Vector Machines and found that it is possible to classify images as ads or non-ads with an accuracy of 96.2% using a linear kernel. This is similar to the 97% accuracy reported by the original authors.
According to the source, this dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image’s URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement (“ad”) or not (“nonad”). The following are the data attributes given by the original creators of the dataset:
- height: continuous
- width: continuous
- aratio: continuous
- local: 0,1.
- 457 features from url terms, each of the form “url*term1+term2...”
- 495 features from origurl terms, in same form
- 472 features from ancurl terms, in same form
- 111 features from alt terms, in same form
- 19 features from caption terms
- class label: whether the image is an ad or a non-ad
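As a quick sanity check, the attribute groups listed above can be tallied to confirm that they add up to the 1,559 variables mentioned earlier. The sketch below is only illustrative: the generated column names (such as `url_0`) and the `label` name for the class column are hypothetical placeholders, not the names used in the original data file.

```python
# Attribute groups as listed by the dataset creators: (group name, number of columns).
# The "label" name for the class column is a hypothetical placeholder.
ATTRIBUTE_GROUPS = [
    ("height", 1),
    ("width", 1),
    ("aratio", 1),
    ("local", 1),
    ("url", 457),
    ("origurl", 495),
    ("ancurl", 472),
    ("alt", 111),
    ("caption", 19),
    ("label", 1),
]


def build_column_names(groups):
    """Expand each (group, count) pair into placeholder column names."""
    names = []
    for group, count in groups:
        if count == 1:
            names.append(group)
        else:
            # Hypothetical naming scheme: url_0, url_1, ...
            names.extend(f"{group}_{i}" for i in range(count))
    return names


column_names = build_column_names(ATTRIBUTE_GROUPS)
total_columns = len(column_names)
print(total_columns)  # 4 + 457 + 495 + 472 + 111 + 19 + 1 = 1559
```

A list like `column_names` could then be supplied as the header when reading the raw data, assuming the file itself has no header row.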
PRINCIPAL COMPONENTS ANALYSIS
We can see that our data has 3278 entries with 1559 columns, which is too many columns to work with comfortably. To address this high dimensionality, we use principal components analysis. We first run an initial PCA fit on the full data set.
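A minimal sketch of such an initial fit is shown here, assuming scikit-learn and NumPy are available. It runs on synthetic data rather than the actual dataset, and the helper name `components_for_variance` is hypothetical; it simply counts how many leading components are needed before the cumulative explained variance reaches a given threshold. Whether the features were scaled before the fit is not stated, so no scaling is applied in this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, thresholds=(0.60, 0.75)):
    """Return, per threshold, the number of leading principal components
    whose cumulative explained variance ratio first reaches it."""
    pca = PCA()  # n_components=None keeps every component
    pca.fit(X)   # PCA centers the data internally; no scaling is applied
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= threshold;
    # adding 1 converts that index into a component count.
    return {t: int(np.searchsorted(cumulative, t)) + 1 for t in thresholds}


# Synthetic stand-in for the feature matrix (assumption: numeric features only,
# with the class column already removed). Columns get decreasing scales so that
# the variance is concentrated in the first few directions.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 50)
X_demo = rng.normal(size=(200, 50)) * scales

needed = components_for_variance(X_demo)
print(needed)
```

On the real data, the same cumulative-variance curve is what would yield statements such as "110 components explain 75% of the variance".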