Media Slant: How We Classified Transcripts by TV Source Using Machine Learning

1 Feb 2025

We train a machine-learning classifier to predict whether a transcript snippet m comes from FNC or CNN/MSNBC. We split the corpus into 80% training data and 20% test data. We build the classifier in the training set and evaluate it in the test set.
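The 80/20 split can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split_indices`, the seed, and the corpus size of 1,000 are assumptions made for the example, not details from the study.

```python
import random


def train_test_split_indices(n_docs, test_share=0.2, seed=0):
    """Shuffle document indices and split them into train and test lists."""
    indices = list(range(n_docs))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_docs * test_share)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


# Hypothetical corpus of 1,000 transcript snippets.
train_idx, test_idx = train_test_split_indices(n_docs=1000)
print(len(train_idx), len(test_idx))
```

Everything fitted later (feature selection, scaling, coefficients) would use only `train_idx`, so that the test snippets stay untouched until evaluation.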

We take two steps to pre-process the features further, both using the training set to ensure a clean evaluation in the test set. First, we do supervised feature selection to reduce the dimensionality of the predictor matrix. Out of the 65,000-bigram dictionary, we select the 2,000 most predictive features based on their χ² score for the true label FNC. Second, we scale all predictors in S to variance one (we do not take out the mean, however, as then we would lose sparsity). Let S be the vector of selected and scaled features indexed by b. Let B_m^b be the frequency of bigram b in transcript m (and B_m the vector of frequencies for transcript m, of length |S| = 2000).
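The two pre-processing steps can be illustrated on a toy count matrix. The sketch below assumes one common formulation of the χ² score for count features (observed per-class feature totals against totals expected under independence); the text does not specify the exact formula or software used, and all function names and data here are hypothetical.

```python
import math


def chi2_scores(X, y):
    """Chi-squared score of each count feature against a binary label.

    X: rows of non-negative bigram counts; y: 0/1 labels (1 = FNC).
    Observed = per-class feature total; expected = class share * feature total.
    """
    class_share = {c: sum(1 for label in y if label == c) / len(y) for c in (0, 1)}
    scores = []
    for b in range(len(X[0])):
        total = sum(row[b] for row in X)
        score = 0.0
        for c in (0, 1):
            observed = sum(row[b] for row, label in zip(X, y) if label == c)
            expected = class_share[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(scores, k):
    """Indices of the k highest-scoring features, in column order."""
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:k])


def fit_scales(X, selected):
    """Training-set standard deviation of each selected feature."""
    scales = {}
    for b in selected:
        col = [row[b] for row in X]
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        scales[b] = math.sqrt(var) or 1.0  # leave constant columns unscaled
    return scales


def transform(X, selected, scales):
    """Divide by the standard deviation only; no centering, so zeros stay zero."""
    return [[row[b] / scales[b] for b in selected] for row in X]


# Toy training data: 4 snippets, 3 bigrams; the third bigram is uninformative.
X_train = [[3, 0, 1], [2, 0, 1], [0, 4, 1], [0, 2, 2]]
y_train = [1, 1, 0, 0]

selected = select_top_k(chi2_scores(X_train, y_train), k=2)
scales = fit_scales(X_train, selected)
X_scaled = transform(X_train, selected, scales)
print(selected)
```

Because the scaling divides without subtracting the mean, a zero count remains exactly zero, which is what preserves sparsity. The same `selected` and `scales` objects, fitted on training data only, would then be applied to the test set.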

Our classification method is a penalized logistic regression (Hastie et al., 2009). We parametrize the probability that a transcript is from Fox News as

where ψ is a 2000-dimensional vector of coefficients on each feature. The L2-penalized logistic regression model chooses ψ to minimize the cost objective

where M∗ gives the number of documents in the training sample.
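For reference, the two formulas described in this passage can be written in a standard textbook form. The block below is a sketch consistent with the surrounding definitions, not a verbatim reproduction: the absence of an intercept, the 1/M∗ normalization, the squared-norm penalty, and the penalty weight λ (a symbol introduced here) are all assumptions.

```latex
% Probability that transcript m is from Fox News, given its feature vector B_m
p_m \equiv \Pr(\mathrm{FNC}_m = 1 \mid B_m) = \frac{1}{1 + \exp\left(-\psi^{\top} B_m\right)}

% L2-penalized cost: average cross-entropy over the M^* training documents
% plus a ridge penalty on the coefficient vector (lambda is assumed notation)
J(\psi) = -\frac{1}{M^{*}} \sum_{m=1}^{M^{*}}
  \Big[ \mathrm{FNC}_m \log p_m + (1 - \mathrm{FNC}_m) \log (1 - p_m) \Big]
  + \lambda \lVert \psi \rVert_2^{2}
```

Here FNC_m is the true label of transcript m, ψ is the 2000-dimensional coefficient vector, and a larger λ shrinks the coefficients more strongly toward zero.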

We evaluate the classifier’s performance in the test set, obtaining an accuracy of 0.73 (with a standard deviation of 0.02 across five folds). This performance is much better than guessing (i.e., an accuracy of 0.5 in the balanced sample) and comparable with other work in this literature.[6] Table 1 shows good precision and recall across the two categories.
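The reported metrics (accuracy, plus precision and recall per category) can be computed from test-set labels and predictions as in the sketch below. The labels and predictions are made-up toy values, and `classification_report` is a hypothetical helper written for this example.

```python
def classification_report(y_true, y_pred, labels=(0, 1)):
    """Accuracy plus per-class precision and recall for hard predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    report = {"accuracy": correct / len(y_true)}
    for c in labels:
        true_pos = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # snippets assigned to class c
        actual = sum(t == c for t in y_true)     # snippets truly in class c
        report[c] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return report


# Toy balanced test set: 1 = FNC, 0 = CNN/MSNBC (illustrative values only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
report = classification_report(y_true, y_pred)
print(report)
```

In a balanced sample, always guessing one network yields an accuracy of 0.5, which is the baseline the text compares against.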

Next, we compare our model to human judgment. Human annotators (U.S. college students) guessed whether 80-word TV transcript snippets came from FNC or CNN/MSNBC. The annotators are between 73% and 78% accurate in their guesses, and they agree 58% of the time (if guessing randomly, their agreement rate would be 25%). Thus, our machine-learning model performs comparably to human annotators. The 80-word snippets contain significant information about the source network, and our text-based model captures it. Appendix B.3 further describes the human validation.
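One way to read the 25% chance-agreement figure is as the probability that three annotators unanimously pick the same network when each guesses independently with equal probability. The number of annotators per snippet is not stated in this passage, so three is an assumption here; the sketch below only shows the arithmetic.

```python
def chance_full_agreement(n_annotators, n_classes=2):
    """Probability that all annotators pick the same class under
    independent, uniformly random guessing."""
    # Each of the n_classes outcomes is unanimous with probability
    # (1 / n_classes) ** n_annotators.
    return n_classes * (1 / n_classes) ** n_annotators


print(chance_full_agreement(2))  # two annotators
print(chance_full_agreement(3))  # three annotators (assumed reading)
```

Under this reading, two annotators would agree half the time by chance, while unanimous agreement among three would occur a quarter of the time.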

Table 1: Test-Set Prediction Performance for Identifying Cable News Source

We now examine which bigrams are most important for classification. An advantage of logistic regression is its interpretability: The estimated coefficients of the trained model, ψ̂_b, provide a ranking across the 2,000 predictive bigrams for their relative contribution to the predictions. Table B.1 shows some bigram examples with positive (predictive for FNC transcripts) or negative (predictive for CNN/MSNBC) values of ψ̂_b, and Table B.2 provides a longer list. Prominent figures like Sean Hannity (predictive of FNC) or Anderson Cooper (predictive of CNN/MSNBC) appear among the bigrams. FNC bigrams allude to intuitively conservative priorities, such as the troops, crime, terrorism, and (implied) extremism of political counterparts (“far left”). CNN/MSNBC bigrams have a more liberal flavor, with mentions of health-policy-related tokens and emphasis on international perspectives.
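Ranking bigrams by their estimated coefficients can be sketched as below. The coefficient values are invented for illustration and are not estimates from the paper; only the signs follow the examples in the text, and `top_bigrams` is a hypothetical helper.

```python
def top_bigrams(coefficients, k=2):
    """Return the k most positive and k most negative bigrams by coefficient."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1])
    most_negative = ranked[:k]        # most predictive of CNN/MSNBC
    most_positive = ranked[-k:][::-1]  # most predictive of FNC
    return most_positive, most_negative


# Hypothetical coefficients on unit-variance features (illustrative values).
coefficients = {
    "sean hannity": 0.9,
    "far left": 0.4,
    "good morning": 0.01,
    "health care": -0.3,
    "anderson cooper": -0.8,
}

fnc_bigrams, cnn_msnbc_bigrams = top_bigrams(coefficients, k=2)
print(fnc_bigrams)
print(cnn_msnbc_bigrams)
```

Because every predictor was scaled to variance one before fitting, coefficient magnitudes are on a common scale, which is what makes this kind of ranking meaningful.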

[6] The prediction accuracy for partisan affiliation in the U.K. Parliament by Peterson and Spirling (2018) is between 60% and 80%, depending on the time period. According to Gentzkow et al. (2019b), one can correctly guess a speaker’s party based on a one-minute speech with 73% accuracy in the U.S. Congress (2007–2009). Kleinberg et al. (2017) obtain an AUC of 0.71 in predicting recidivism from criminal defendant characteristics.


