Text Mining My Own Relationship: My first project in R

I was inspired to pursue this project when I came across a blog post titled “The Data of Long Distance Lovers.” As someone who has also been in a long-distance relationship for over a year and used Viber as a primary mode of communication, I immediately wanted to do a similar analysis on my own text messages.

Let me begin by giving some background: I live in California and she lives in the UK, which means we usually have about an eight-hour time difference. I only have Viber on my phone, whereas she has it on her computer and phone. In addition to using Viber, we would send longer messages by email (my main way of communicating long messages) and video chat using Skype, mostly on weekends. We’re both students with fairly flexible schedules. We both agreed to do this project, and we discussed some questions we’d both like to know the answers to, for example, who poses more questions to the other person.


Skip this section if you want to get straight to the results. 

Before this project, I had no experience in R, a powerful and free programming language for statistical computing. After this project, I have very minimal experience, but a much greater appreciation for the language itself and what it’s capable of. Essentially, I began by downloading R and the GitHub code from the above blog post. I read what I could, trying to relate it to what I know about programming in other languages. However, this code was quite concise and I had to look up how things like data frames work. After getting a general sense of what the code was doing, I downloaded my data from Viber and ran the code, which is where I ran into my first roadblock. My downloaded data came in a .csv file, but it was actually tab-delimited:

Comma Separated Version:

DD/MM/YYYY,HH:MM:SS,SENDER,+XXXXXXXXXXXX, Message

Tab Separated Version:

="DD/MM/YYYY" HH:MM:SS ="SENDER" ="+XXXXXXXXXXXX" "Message"

I am still not sure why my data was stored differently, but it meant I had to change the regex that parsed the file from

"\\s*(.+),\\s*(.+),\\s*(.+),\\s*(.+)[X,X],\\s*(.+)" to "\\s*(.+)\\t\\s*(.+)\\t\\s*(.+)\\t\\s*(.+)[X,X]\\t\\s*(.+)"

Even with that change, there were issues with the extra quotation marks thrown in. Ultimately, I was able to get a version of the data from my girlfriend that was stored as a proper .csv file. But the parsing hassles weren’t quite over yet.

Since messages between us were sometimes decently long (especially hers), there were commas in the messages which would confuse the parser. This meant having to make the regex more specific, i.e. I had to say that each line began with a date, followed by a comma, followed by the time, etc., rather than just saying it began with any text, followed by a comma, etc., as "(.+)," does. In the end the regex I had was

 "\\s*([0-9]{2}/[0-9]{2}/[0-9]{4}),\\s*([0-9]{2}:[0-9]{2}:[0-9]{2}),\\s*(Me|Yacoub Kureh),\\s*(\\+[0-9]{11,12}),\\s*((?s).*)"

where the last group, ((?s).*), was my attempt to allow multi-paragraph messages in which a newline, \n, was used. This of course was meaningless: the readLines function that read in the file was called before the parser, so it had already split those messages across several lines by the time the regex saw them. The only fix I could find for this problem was to go through and manually delete all newlines that appeared in messages in the original file.
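
For anyone hitting the same wall, here is a minimal sketch of how the stricter regex can be applied, along with one possible alternative to deleting newlines by hand: gluing any line that does not start with a date back onto the line before it. This is not the code from the original blog post; the sample lines, the phone number, and names such as `raw`, `records` and `msgs` are made up for illustration, and the approach assumes no message paragraph itself begins with something that looks like `DD/MM/YYYY,`.

```r
# Regex copied from the post; everything else here is illustrative.
pattern <- "\\s*([0-9]{2}/[0-9]{2}/[0-9]{4}),\\s*([0-9]{2}:[0-9]{2}:[0-9]{2}),\\s*(Me|Yacoub Kureh),\\s*(\\+[0-9]{11,12}),\\s*((?s).*)"

# Stand-in for readLines("export.csv"): the third line is the second
# paragraph of the message that starts on the second line.
raw <- c(
  "10/08/2014,10:59:32,Me,+123456789012,Hello, how are you?",
  "10/08/2014,11:00:05,Yacoub Kureh,+123456789012,Good, thanks!",
  "Second paragraph of the same message."
)

# A line that opens with a date starts a new record; any other line is
# treated as a continuation of the previous record.
starts_record <- grepl("^\\s*[0-9]{2}/[0-9]{2}/[0-9]{4},", raw)
record_id <- cumsum(starts_record)
records <- unname(
  vapply(split(raw, record_id), paste, character(1), collapse = " ")
)

# Extract the five capture groups from every record (perl = TRUE is needed
# for the inline (?s) flag).
m <- regmatches(records, regexec(pattern, records, perl = TRUE))
m <- m[lengths(m) == 6]  # keep only records that matched (full match + 5 groups)
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

msgs <- data.frame(
  date    = fields[, 1],
  time    = fields[, 2],
  sender  = fields[, 3],
  number  = fields[, 4],
  message = fields[, 5],
  stringsAsFactors = FALSE
)
```

Because the date, time, sender and phone number are matched explicitly, the commas inside the message text end up in the last group instead of splitting the record.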

As you’ll see below, I wanted the ability to interpret the data in my own time zone as well as hers, so I had to figure out how to get R to change date-times correctly. Luckily, there’s this nifty package for R called lubridate that makes it easy to work with time zones. When the date-times are being read in, R automatically labels them as UTC. But this isn’t always the time in London! So I first have to tell it not to change the clock time, just relabel the time zone. This is done with force_tz, e.g. applying force_tz(x, tzone = "Europe/London") to 2014-08-10 10:59:32 UTC gives 2014-08-10 10:59:32 BST. Note that only the time zone label changed, even though BST = UTC+1:00. Then to actually convert it to LA time, I used with_tz(x, tzone = "America/Los_Angeles") on 2014-08-10 10:59:32 BST, which gives 2014-08-10 02:59:32 PDT.
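
Here is the same two-step conversion as a small runnable sketch. It assumes the lubridate package is installed; the variable names are my own, and the timestamp is the one from the example above.

```r
library(lubridate)

# The timestamp as R first sees it: correct clock time, wrong label (UTC).
t_utc <- ymd_hms("2014-08-10 10:59:32", tz = "UTC")

# Step 1: keep the clock time, relabel the zone as London.
t_london <- force_tz(t_utc, tzone = "Europe/London")

# Step 2: keep the instant, show it on a Los Angeles clock.
t_la <- with_tz(t_london, tzone = "America/Los_Angeles")

print(t_london)
print(t_la)
```

The difference between the two functions is the whole point: force_tz changes which instant the value refers to, while with_tz only changes how the same instant is displayed.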


The Results, Part I

Let’s first look at what time of the day we are doing most of our messaging. Here, messages are grouped into hour buckets, so a message sent at 3:43pm gets counted in the 3:00pm bucket, etc.

[Figure: histogram of her messages by hour of the day, in her local time]
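
The hour bucketing is simple enough to sketch in base R. This is only an illustration with made-up timestamps, not the plotting code from the original post; `sent` and `counts` are names I picked.

```r
# Three made-up send times, already in the sender's local zone.
sent <- as.POSIXct(
  c("2014-08-10 15:43:10", "2014-08-10 15:05:00", "2014-08-10 23:59:59"),
  tz = "Europe/London"
)

# Dropping minutes and seconds puts 15:43 into the 15:00 bucket.
hour_bucket <- as.integer(format(sent, "%H"))

# Using a factor with all 24 levels keeps empty hours in the table as zeros.
counts <- table(factor(hour_bucket, levels = 0:23))

# barplot(counts, xlab = "Hour of day", ylab = "Messages")
```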

So she texts primarily in her afternoon and evening. Sometimes she’s up to a little after midnight, but she seems to get a solid 5 hours of sleep between 1 and 6am. There’s the occasional texting that happens when she wakes up before I go to bed, but then she’s on her own again until I wake up. There’s a distinct drop at 8pm, the cause of which is more obvious in my histogram below.

[Figure: histogram of my messages by hour of the day, in my local time]

It’s not an exact shift of 8 hours, as there are times in the year when we are 9 hours apart and times when we are only 7 hours apart, because the US and UK do not coordinate when to start and stop Daylight Saving Time. So that 8pm dip from above is me going to lunch, where I would either call or Skype her for a bit or eat with friends. Unfortunately, because of the massive time zone difference, our overlapping awake time happens during my workday and can often cut into my sleep time (the probability of me going to sleep at any given time between 12 and 4 am follows a fairly linear relationship). I wish I could say that I’m napping between 5 and 9pm, but I’m not. Looking at this histogram, I really wonder how I don’t drink coffee. Regardless, onwards and upwards!

[Figure: number of messages sent by each of us]

So of the nearly FORTY-NINE THOUSAND messages we sent in a year, I sent slightly more. She sent 48% of the total messages while I sent the other 52%. I attribute this to me using my phone and preferring to break up my thoughts into several shorter messages rather than one longer message. Before I claim a strong moral victory, it’s only fair that we also compare the number of individual characters sent.

[Figure: number of characters sent by each of us]

Well then… I lose this round, it seems (it’s not a competition, right?). Here we see that she sent around 56% of all sent characters. Some of my ad hoc explanations for this include: she can send much longer messages since she is on a computer, and being on a computer also means she is more likely to send me URLs and other copypasta. Or maybe I ought to cut my losses and move on to the next graph!
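
Both comparisons boil down to a couple of one-liners once the messages sit in a data frame. The sketch below uses four invented messages and assumes columns named `sender` and `message`; the real column names in the original code may differ.

```r
# Tiny made-up sample standing in for the parsed Viber export.
msgs <- data.frame(
  sender  = c("Me", "Her", "Me", "Her"),
  message = c("ok", "Have you seen this article yet?", "lol", "yup"),
  stringsAsFactors = FALSE
)

# Share of messages per sender, in percent.
msg_share <- round(100 * prop.table(table(msgs$sender)), 1)

# Share of characters per sender, in percent.
chars_by_sender <- tapply(nchar(msgs$message), msgs$sender, sum)
char_share <- round(100 * chars_by_sender / sum(chars_by_sender), 1)

print(msg_share)
print(char_share)
```

In this toy sample the message counts are even while one sender has most of the characters, which is the same kind of split described above.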

So what do more than a tenth of our messages look like? Well they’re apparently fewer than four characters:

[Figure: histogram of message length in characters, as a percentage of all messages]

In this histogram we bucket messages into groups of 1-3 chars, 4-6 chars, etc. and display each bucket as a percentage of total messages. As in the original blog post, we have a similar dip and rise in the first few buckets. There’s a ton of “ok”, “yup”, “:)”, and “lol” length messages. We do a lot of “sure”, “okay”, “yeah” length messages, but even more longer messages. Once we start looking at messages longer than 15 characters we get a really smooth decay.
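
As a rough illustration of the 1-3, 4-6, … bucketing, here is a sketch using base R’s cut(). The four sample messages and the variable names are invented for the example.

```r
messages <- c("ok", "yup", "sure", "okay then")
n <- nchar(messages)  # 2, 3, 4, 9

# Break points at 0, 3, 6, ... give the intervals (0,3], (3,6], ...,
# i.e. lengths 1-3, 4-6, and so on.
breaks <- seq(0, by = 3, length.out = ceiling(max(n) / 3) + 1)
labels <- paste0(head(breaks, -1) + 1, "-", breaks[-1])
bucket <- cut(n, breaks = breaks, labels = labels)

# Percentage of all messages falling in each bucket.
pct <- 100 * prop.table(table(bucket))
print(pct)
```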

For the last bit of Part I, there’s one more comparison. In fact, it’s probably the most important comparison of all in any relationship. How long do you leave the other person hanging before responding to their message? Do you tend to leave your phone on silent? Or perhaps you’re just a slow typer. Whatever your excuse, response time is important because some people read a lot into it.

[Figure: comparative histogram of response times for each of us]

This comparative histogram buckets response messages, i.e. messages whose sender is not the sender of the previously sent message, by their response time. Over 80% of my messages to her are responded to in less than two minutes, while I respond that quickly to a little over 70% of her messages to me. Overall, we both have really good response times, my median being 26 seconds and hers being 30 seconds. Of course, our means are very skewed because of long breaks in our texting during trips and such.
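
The definition of a response (a message whose sender differs from the sender of the message before it) translates fairly directly into R. This is a minimal sketch on five invented messages; the column names `sender` and `time` and the helper variables are assumptions of mine rather than the original code.

```r
msgs <- data.frame(
  sender = c("Me", "Her", "Her", "Me", "Her"),
  time = as.POSIXct(
    c("2014-08-10 10:00:00", "2014-08-10 10:00:20", "2014-08-10 10:00:25",
      "2014-08-10 10:01:05", "2014-08-10 10:01:35"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)
msgs <- msgs[order(msgs$time), ]

# Sender of, and seconds since, the previous message.
prev_sender <- c(NA, head(msgs$sender, -1))
gap_secs <- c(NA, as.numeric(diff(msgs$time), units = "secs"))

# A response is a message whose sender differs from the previous sender.
is_response <- !is.na(prev_sender) & msgs$sender != prev_sender
responses <- data.frame(
  responder = msgs$sender[is_response],
  secs = gap_secs[is_response],
  stringsAsFactors = FALSE
)

# Median response time per responder; the median is robust to the long
# gaps that skew the mean.
median_secs <- tapply(responses$secs, responses$responder, median)
print(median_secs)
```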

At the suggestion of a commenter on the original blog post, for Part II of this series in text mining my relationship, I will try to convert the above response time histogram into a Kaplan-Meier plot. You can think of it as measuring the survival time of a message before it gets responded to. 100% of messages vacuously survive to 0 seconds, but by 15 seconds a good chunk of messages are responded to, and by 30 seconds, more than half of the messages have been responded to. For now, I have to figure out how to do this in R.
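
One possible starting point is the survival package that ships with standard R installations. The sketch below is only an assumption about how such a plot could be built, not what Part II actually uses; the four response times are invented, and every message is treated as answered (no censoring).

```r
library(survival)

# Invented response times in seconds.
secs <- c(10, 20, 30, 40)

# 1 = the message was answered; a never-answered message would be 0 (censored).
answered <- rep(1, length(secs))

# Kaplan-Meier estimate of the fraction of messages still unanswered over time.
fit <- survfit(Surv(secs, answered) ~ 1)
print(summary(fit))

# plot(fit, xlab = "Seconds since message", ylab = "Fraction still unanswered")
```

With no censored messages the curve is simply one minus the empirical distribution of response times, so it should tell the same story as the histogram, just cumulatively.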

Thanks for reading and remember to check back for Part II which will feature new plots, word clouds, and a calendar heat map!
