Can we block internet advertisements?





Using PCA and SVMs to classify whether an image is an ad or not.



Advertising has always been an essential part of businesses around the world. In particular, internet advertisements in the form of banners on websites have been on the rise, especially since the turn of the century. However, many users do not want their pages cluttered with these images.





In this exercise, we explore a dataset with several features per image, such as its dimensions, whether it is local, and the site it came from. Exploring the dataset, we found that it has a very large number of variables (1,559). We therefore performed Principal Components Analysis (PCA) on the data to reduce its dimensionality. We found that (1) 110 components can explain 75% of the variance and (2) 73 components can explain 60% of the variance, a massive decrease from the original number of variables. We also classified the images using Support Vector Machines, where indeed, we found that it is possible to classify images as ads or non-ads with an accuracy of 96.3% using a Linear Kernel. This is close to the 97% accuracy the original authors reported.


From the source, this dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image’s URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement (“ad”) or not (“nonad”). The following are the data attributes given by the original creators of the dataset:

  1. height: continuous
  2. width: continuous
  3. aratio: continuous
  4. local: 0,1.
  5. 457 features from url terms, each of the form “url*term1+term2...”
  6. 495 features from origurl terms, in same form
  7. 472 features from ancurl terms, in same form
  8. 111 features from alt terms, in same form
  9. 19 features from caption terms
  10. tagging if the image is an ad or a non-ad
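As a rough illustration of the raw format, here is a minimal Python sketch that parses a hypothetical two-row sample standing in for the real ad.data file (column names and sample values are illustrative, not the actual data; missing values in the raw file appear as “?”) and encodes the ad/nonad label:

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the ad.data layout:
# height, width, aspect ratio, local flag, a few binary term features, label.
sample = io.StringIO(
    "125,125,1.0,1,0,1,0,ad.\n"
    "?,468,?,0,1,0,0,nonad.\n"
)

cols = ["height", "width", "aratio", "local", "t1", "t2", "t3", "label"]
df = pd.read_csv(sample, header=None, names=cols,
                 na_values="?", skipinitialspace=True)

# Encode the target: 1 for "ad.", 0 for "nonad."
df["is_ad"] = (df["label"].str.strip() == "ad.").astype(int)
print(df[["height", "width", "is_ad"]])
```

Rows with missing geometry (the “?” entries) come through as NaN and can be imputed or dropped before modeling.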


PRINCIPAL COMPONENTS ANALYSIS


We can see that our data has 3,278 entries with 1,559 columns, which is far too many to model directly. To reduce this high dimensionality, we use principal components analysis. We run an initial PCA fit on the full data set.


We find that around 110 components explain about 75% of the cumulative variance. Since 110 components is still quite large, we try a lower threshold:


At 73 components, we can explain 60% of the variance. Now, we plot the PCA graph of individuals and of variables below:
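The component-selection step above can be sketched as follows (the original analysis was done with different tooling; this Python version uses scikit-learn on synthetic correlated data standing in for the ad features, so the component counts it prints will not match the 110/73 figures):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 samples, 50 features driven by 10 latent factors.
latent = rng.normal(size=(300, 10))
X = latent @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(300, 50))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches each threshold.
n_75 = int(np.searchsorted(cumvar, 0.75) + 1)
n_60 = int(np.searchsorted(cumvar, 0.60) + 1)
print(n_60, n_75)
```

The same cumulative-variance scan, applied to the full 1,559-column dataset, is what yields the 73- and 110-component cutoffs quoted above.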


SUPPORT VECTOR MACHINES


We try Support Vector Machines to predict whether an image is an ad or not. We split the data into training and test sets, and search for the best model.
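A minimal sketch of this split-and-compare step, assuming scikit-learn and using synthetic binary data in place of the PCA-reduced ad features (accuracies printed here are for the synthetic data, not the 96.3% reported below):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-reduced features and ad/nonad labels.
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit one SVM per kernel and score each on the held-out test set.
# (scikit-learn calls the radial kernel "rbf".)
results = {}
for kernel in ("linear", "rbf", "poly"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    results[kernel] = accuracy_score(y_te, clf.predict(X_te))
print(results)
```

Picking the kernel with the highest held-out accuracy mirrors the model selection described next.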


We check the best model in terms of accuracy. For the SVM technique, we have the following accuracy metrics:


From here, we can see that the best model in terms of accuracy is the SVM with a Linear kernel, at 96.3% accuracy. However, this is a close call with the Radial kernel. Now, we look at the specificity values:


From the table above, we can see that the Radial kernel has better specificity. Still, since the accuracy difference between the Radial and Linear kernels is very small and the Linear kernel achieves the highest accuracy overall, we choose the Linear kernel SVM to classify ads based on the attributes provided.
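For reference, accuracy and specificity both come straight from the confusion matrix. A small sketch with hypothetical predictions (1 = ad, 0 = nonad; the counts are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels and predictions (1 = ad, 0 = nonad).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

# ravel() order for binary labels is: tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)  # true-negative rate: nonads correctly kept
print(accuracy, specificity)  # → 0.75 0.8
```

High specificity matters here because it measures how rarely legitimate (non-ad) images are mistakenly blocked.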


The full code can be found at https://github.com/rlbartolome/advertisement. I included the data loading, transformation, and kernel selection code there.