Data scraping with R : DisneyPlusID vs NetflixID tweets

Data scraping is a method of extracting information from websites, documents, and other data sources. With R, data scraping can support marketing and social media research, particularly through text analysis. In this post, I analyze Disney+ Indonesia and Netflix Indonesia by scraping their tweets with R. Disney and Netflix are increasing their investment in original content in order to better engage their audiences. By studying the popularity of their tweets, we can learn what type of text content they share and how they frame it. This article covers only the basic concepts of scraping Twitter data with R and does not go into detail.

Data Scraping : Connect the Twitter API

Please make sure you have Twitter API credentials (API key, API secret, access token, and access token secret) before scraping data. If you do not have them yet, read this post to learn how to get them.

Let’s start the data scraping in RStudio:

Install and load the required packages first, because setup_twitter_oauth() is provided by twitteR.

# Install once, then load the packages used in this post
install.packages(c("rtweet", "httpuv", "twitteR", "tm", "tidyverse", "tidytext"))
library(rtweet)
library(twitteR)    # setup_twitter_oauth(), userTimeline()
library(tm)
library(tidyverse)  # includes dplyr, purrr, tibble, and ggplot2
library(tidytext)   # unnest_tokens(), stop_words

Then connect your Twitter API keys to begin the data scraping process.

api_key <- "Your API Key"
api_secret <- "Your API Secret"
access_token <- "Your Access Token"
access_token_secret <- "Your Token Secret"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
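
Hard-coding the keys in a script makes them easy to leak when the script is shared. Below is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable name TWITTER_API_KEY and the helper read_credential() are hypothetical and are not part of twitteR.

```r
# Hypothetical helper (not part of twitteR): read a credential from an
# environment variable and fail early if it is missing.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# For demonstration only: in practice, set the variable outside the script,
# for example in ~/.Renviron.
Sys.setenv(TWITTER_API_KEY = "demo-key")

api_key <- read_credential("TWITTER_API_KEY")
print(nchar(api_key))
```

The same helper could be called once per credential, and the resulting values passed to setup_twitter_oauth() in place of the literal strings.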

Comparing popularity by users (DisneyPlusID vs NetflixID)

To scrape Twitter data, we must specify the username of each account. In this analysis, I use the Disney Plus Indonesia (“DisneyPlusID”) and Netflix Indonesia (“NetflixID”) Twitter accounts.

Now that API authentication is active, we can scrape the tweets and then convert the results into a dataframe.

Note: This data was collected on November 15, 2022.

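# Request up to 3200 tweets per account (the most the user timeline endpoint returns)
netflixid_tweets <- userTimeline("NetflixID", n = 3200)
# Convert the list of status objects to a tibble; as_tibble() replaces the deprecated tbl_df()
netflixid_df <- as_tibble(map_df(netflixid_tweets, as.data.frame))

disneyplusid_tweets <- userTimeline("DisneyPlusID", n = 3200)
disneyplusid_df <- as_tibble(map_df(disneyplusid_tweets, as.data.frame))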

Data Scraping Results

Figure: DisneyPlusID Dataframe

Figure: NetflixID Dataframe
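
Before analyzing the result, it helps to check how many tweets actually came back and which period they cover, since a timeline request can return fewer tweets than the n requested. Here is a small sketch on made-up data, assuming the column names screenName, created, favoriteCount, and retweetCount. The rows are invented for illustration.

```r
# Made-up rows standing in for the scraped dataframes
toy_tweets <- data.frame(
  screenName    = c("NetflixID", "NetflixID", "DisneyPlusID"),
  created       = as.POSIXct(c("2022-11-13 10:00:00", "2022-11-14 09:30:00",
                               "2022-11-12 18:45:00"), tz = "UTC"),
  favoriteCount = c(120, 45, 30),
  retweetCount  = c(15, 4, 6),
  stringsAsFactors = FALSE
)

# Number of tweets per account
tweets_per_account <- table(toy_tweets$screenName)
print(tweets_per_account)

# Oldest and newest tweet in the sample
date_range <- range(toy_tweets$created)
print(date_range)
```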

Combined Tweets

To compare the popularity of text-based tweets between DisneyPlusID and NetflixID, we must combine both dataframes.

combined.tweets <- rbind(netflixid_df, disneyplusid_df)
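
rbind() stacks the rows only when both dataframes have the same column names. A minimal sketch with invented toy frames (netflix_toy and disney_toy are illustrative names, not objects from this post) shows a quick check before and after combining.

```r
# Toy frames with the same columns, standing in for the scraped dataframes
netflix_toy <- data.frame(screenName = "NetflixID", favoriteCount = c(120, 45))
disney_toy  <- data.frame(screenName = "DisneyPlusID", favoriteCount = c(30, 12, 7))

# rbind() needs matching column names, so check before combining
stopifnot(identical(names(netflix_toy), names(disney_toy)))

combined_toy <- rbind(netflix_toy, disney_toy)

# The combined frame should contain one row per original tweet
print(nrow(combined_toy))
print(table(combined_toy$screenName))
```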

Let’s compare the popularity of tweets based on favorite and retweet counts.

combined.tweets %>%
  ggplot(aes(x = log(favoriteCount), y = log(retweetCount), colour = screenName)) +
  geom_point()

Figure: DisneyPlusID vs NetflixID Tweets Popularity

According to the plot, NetflixID tweets tend to receive more favorites and retweets than DisneyPlusID tweets, which suggests higher audience engagement.
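
One detail worth knowing when reading a log-scaled plot like this: a tweet with zero favorites or retweets has log(0) = -Inf, so it cannot be placed on the axes. The sketch below, with made-up counts, contrasts log() with log1p(), which computes log(1 + x). Using log1p() is a suggested alternative, not what the plot in this post uses.

```r
# Made-up favorite counts, including a tweet with no favorites
favorite_counts <- c(0, 1, 10, 250)

log_values   <- log(favorite_counts)    # first element is -Inf
log1p_values <- log1p(favorite_counts)  # log(1 + x), finite for zero

print(is.finite(log_values))
print(is.finite(log1p_values))
```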

Descriptive Statistics

To back up the findings, we can use descriptive statistics to compare the mean, median, and maximum favorite counts of each account.

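combined.tweets %>%
  group_by(screenName) %>%
  # Name the summary columns so the output is easy to read
  summarise(mean_favorites = mean(favoriteCount),
            median_favorites = median(favoriteCount),
            max_favorites = max(favoriteCount))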

Figure: Descriptive Statistics Results
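
To illustrate why the median is reported alongside the mean, here is a sketch of the same kind of summary on invented numbers. It uses base R (split() and sapply()) instead of dplyr so that it runs without extra packages. A single very popular tweet pulls the mean far above the median.

```r
# Invented favorite counts; one NetflixID tweet is an outlier
toy_tweets <- data.frame(
  screenName    = c(rep("NetflixID", 4), rep("DisneyPlusID", 4)),
  favoriteCount = c(10, 20, 30, 1000, 5, 10, 15, 20)
)

# Mean, median, and maximum favorite count per account:
# rows are the statistics, columns are the accounts
favorite_summary <- sapply(
  split(toy_tweets$favoriteCount, toy_tweets$screenName),
  function(x) c(mean = mean(x), median = median(x), max = max(x))
)
print(favorite_summary)
```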

Brand Voice Through Textual Analysis

A brand’s voice and persona often show in how it delivers its message to engage the audience, for instance in the words and sentences it uses in its content. Textual analysis in R offers a glimpse of that brand voice.

Textual Analysis DisneyPlusID

disneyplusid.word <- disneyplusid_df %>% 
  select(id, text) %>% 
  unnest_tokens(text, text)
head(disneyplusid.word)
disneyplusid.count <- disneyplusid.word %>% 
  count(text, sort = TRUE) %>% 
  head(30) %>% 
  mutate(text = reorder(text, n))

disneyplusid.count %>%
  ggplot(aes(x = text, y = n)) + 
  geom_col() +
  coord_flip() + 
  theme_minimal()

Figure: DisneyPlusID tweets’ most frequently used words
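
To make the tokenize-and-count step concrete, here is a simplified base R approximation: lowercase the text, split it into words, and count them. It is only a sketch and not how unnest_tokens() is implemented; the two toy tweets are invented.

```r
# Two invented tweets
toy_text <- c(
  "Streaming serial original Marvel sekarang!",
  "Serial baru, streaming sekarang https://t.co/abc123"
)

# Lowercase the text and split it on whitespace
tokens <- unlist(strsplit(tolower(toy_text), "\\s+"))

# Strip leading and trailing punctuation from each token
tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
tokens <- tokens[nzchar(tokens)]

# Count occurrences and sort from most to least frequent
word_counts <- sort(table(tokens), decreasing = TRUE)
print(word_counts)
```

Unlike this sketch, which keeps a URL as a single token, the tokenizer used by unnest_tokens() appears to break URLs into pieces, which would explain why fragments such as "https" and "t.co" show up among the most frequent words.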

As we can see, some of the words are only links or have no actual significance. Consequently, we should remove certain words.

new_items <- c("https", "t.co", "di")

stop_words_new <- stop_words %>%
  pull(word) %>%
  append(new_items)

disneyplusid.count <- disneyplusid.word %>% 
  filter(!text %in% stop_words_new) %>%
  count(text, sort = TRUE) %>% 
  head(30) %>% 
  mutate(text = reorder(text, n))

ggplot(disneyplusid.count, aes(x = text, y = n)) + 
  geom_col() +
  coord_flip() +
  theme_minimal()

Figure: DisneyPlusID tweets’ most frequently used words (clean)
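
The filtering step is a set-membership test. Because the stop_words list in tidytext is English, Indonesian function words have to be added by hand, as was done with "di". The sketch below shows the idea on a toy token vector; the short English list is a stand-in for stop_words, and the extra words "yang" and "dan" are my own illustrative additions.

```r
# Toy tokens, including link fragments and function words
tokens <- c("streaming", "di", "https", "t.co", "marvel", "yang", "the", "streaming")

# Stand-in for the English stop word list, plus custom additions
english_stop_words <- c("the", "a", "and", "of")
custom_stop_words  <- c("https", "t.co", "di", "yang", "dan")
all_stop_words     <- c(english_stop_words, custom_stop_words)

# Keep only the tokens that are not in the stop word list
clean_tokens <- tokens[!tokens %in% all_stop_words]
print(clean_tokens)
```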

Textual Analysis NetflixID

netflixid.word <- netflixid_df %>% 
  select(id, text) %>% 
  unnest_tokens(text, text)

head(netflixid.word)
netflixid.count <- netflixid.word %>% 
  count(text, sort = TRUE) %>% 
  head(30) %>% 
  mutate(text = reorder(text, n))

netflixid.count %>%
  ggplot(aes(x = text, y = n)) + 
  geom_col(fill = "#D81F26") +
  coord_flip() + 
  theme_minimal()

Figure: NetflixID tweets’ most frequently used words

As we can see, some of the words are only links or have no actual significance. Consequently, we should remove certain words.

new_items <- c("https", "t.co", "di")

stop_words_new <- stop_words %>%
  pull(word) %>%
  append(new_items)

netflixid.count <- netflixid.word %>% 
  filter(!text %in% stop_words_new) %>%
  count(text, sort = TRUE) %>% 
  head(30) %>% 
  mutate(text = reorder(text, n))

ggplot(netflixid.count, aes(x = text, y = n)) + 
  geom_col(fill = "#D81F26") +
  coord_flip() +
  theme_minimal()

Figure: NetflixID tweets’ most frequently used words (clean)
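
Once both lists of frequent words exist, set operations give a quick way to see which words are shared and which are specific to one account. This is a rough sketch on hand-typed vectors rather than the real counts: most words are taken from the charts, while the shared word "film" is invented for illustration.

```r
# Hand-typed stand-ins for each account's most frequent words
netflix_top <- c("nggak", "kalo", "bikin", "film")
disney_top  <- c("streaming", "serial", "marvel", "film")

shared_words <- intersect(netflix_top, disney_top)  # used by both accounts
netflix_only <- setdiff(netflix_top, disney_top)    # only in NetflixID
disney_only  <- setdiff(disney_top, netflix_top)    # only in DisneyPlusID

print(shared_words)
print(netflix_only)
print(disney_only)
```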

Conclusion

In conclusion, I found that NetflixID’s text-based Twitter content is more popular and engaging than DisneyPlusID’s. The most frequently used words also differ between the two accounts and reflect their brand voices. NetflixID emphasizes audience engagement by using more informal words like “nggak, kalo, bikin, bakal, lagi, apa,” whereas DisneyPlusID emphasizes its product, service, and tagline by frequently using words like “streaming, Disney Plus HD, ekslusif, serial, original, Marvel”.

As mentioned in the introduction, this is not a deep analysis. In a future post, I will go into greater detail about data scraping, cleaning text datasets, and text analysis. Thank you for reading my post. Please share it with others, leave a comment with your thoughts, or just say “Hi”.

Read : Taylor Swift Data Analysis : Is Taylor Swift’s Song Making Your Mood?
