Predicting personality from mobile CDRs

“We continue to shape our personality all our life. If we knew ourselves perfectly, we should die.”
Albert Camus

The paper “Predicting Personality Using Novel Mobile Phone-Based Metrics” by de Montjoye et al. provides evidence that user personality can be predicted with some reliability* from CDR (Call Data Record) data. That data is available to all mobile operators, for all types of phones, without needing any custom applications installed.

Given that there are 6.8 billion mobile subscriptions worldwide (as estimated by the International Telecommunication Union in February 2013), that’s billions of personalities that could be analysed by BSS worldwide, including yours and mine.

What purpose could this information be put to in turn, especially when carriers have the ability to link it with real-time locational data on a per-user basis? I’m sure there are lots of marketing experts around the world who will have many ideas.

* At a reliability of 42% better than random


6 Responses

  1. Hi Ryan:

    It’s not just CDRs you can do that with. We previously did some analysis of the conversations within IP traffic flows and found that you could determine the likely demographic of a user based on a few seconds of network traffic. We could ‘bucket’ users into high-level categories (eg: 14 year old male, 16 year old female, 20-somethings, salesman, etc.) with >90% accuracy. We also found that you could identify the seeds for DDoS attacks and other similar events, allowing you to intervene at the seeding stage rather than the attack stage. That was several years ago now but it might be worth a revisit.

    And, yes, I’m sure the marketing people would have ideas on how to exploit both this and the CDR work but – sadly – it’s all still in the vein of broadly bucketing users rather than true personalisation!

    Have fun!


  2. Hi David,

    You constantly amaze me with the exciting projects you are involved in (or in this case have been involved with in the past).

    Do you have a publication and/or URL to share with the readers about this past assignment?


  3. Hey Ryan:

    Thanks – will pay you for the props in beer! 🙂

    This bit of work was done several years back – almost as an idle thought in my spare time (I had access to traffic flows from a large Internet Exchange and wondered what they’d reveal if analysed!). There’s no publication or URL immediately available – and I’d have to go dust off the code I wrote to explore what could be produced in short order.

    That said, this is one of the components we are thinking of putting in as a node in our Cascade platform – it would then be fed by live data (subject to network latency of course) in order to detect this sort of pattern (and others) and to inform decision support systems layered above it. I’ll keep you posted if this is of interest.

    Have fun!


  4. Hi David,

    Hehe 😀 (you’ll have to drink the beer though because I don’t drink)

    That sounds like a great experiment. I get the part about analysing traffic flows, but how were you able to determine the 90%+ accuracy of user categories? Did you have another data set with user attributes to cross-reference the flows against?

    It sounds like it could be worth dusting off for Cascade, especially if you’re able to extend the insights even further. Have you had any potential customers indicate an interest in this type of feature? I’m intrigued about how you go about matching insights with the customers who can find value in them.


  5. No worries: Perrier it is then. 🙂

    I used the traffic flow data to define attributed models of buckets – without naming them. That’s basically an exercise in aggregating similar flows (by various attributes) across the complete event stream. Then, having defined buckets, I took those definitions over to a network I had control over – and that had a variety of user types on it. It turned out there was a good correlation between the buckets I had auto-defined and the actual users on the new network. That led to labelling said buckets as, if you will, prototypes for user categories.

    Further testing of the approach saw the results of the above taken to (yet) another network to confirm the observations. Correlation: 90%+. Having done that, I then took other traffic flow data from various other IXs and looked at what matched and, more importantly, what didn’t. The flows that didn’t match then got analysed to try to uncover what each anomaly represented. That led to identifying the characteristics of DDoS seeds, etc. which, in turn, extended the range of prototypes we could detect.

    Now, as a caveat, I’d point out that this is still a generic bucket method and does not go down to the individual target level. To do that, we’ll have to augment directly observable behaviour (a la buckets above) with indirectly observable behaviour (posts, tweets, search logs, cookies (yeuch), etc.). That will then give a much more detailed segmentation of demographic, psychographic, and so on. What gets done with that is open to imagination…

    That’s what we’ll probably push into the revised model for Cascade use. We’ve got interest from Cybersecurity and Marketing types so we’ll see what happens. Of course, I can also see this being of great use in corporate operations – eg: insider threat, compliance, fraud, etc. – but they tend to have much slower decision cycles! 🙂

    Have fun!


  6. Hi David,

    I love it! Especially the fact that anomalies led to identification of DDoS seeds, which I’m guessing was an unexpected but exciting outcome?

    Best of luck with inspiring lots of customers with Cascade’s insights!
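
For readers who’d like to see the shape of the bucketing approach David describes in comment 5, here is a minimal sketch. It is not his code (he notes above that nothing has been published). It assumes each flow has already been reduced to a small numeric feature vector, and it uses a simple greedy “leader” clustering as a stand-in for his aggregation step. The two features, the radius, the user-type names and all the sample data are made up for illustration.

```python
from collections import Counter, defaultdict
from math import dist


def nearest(flow, centroids, radius):
    """Index of the closest centroid within `radius`, or None if nothing is close."""
    best, best_d = None, radius
    for i, c in enumerate(centroids):
        d = dist(flow, c)
        if d <= best_d:
            best, best_d = i, d
    return best


def build_buckets(flows, radius):
    """Step 1: aggregate similar flows into unnamed buckets.
    A flow joins the nearest bucket within `radius`, else starts a new one."""
    centroids, members = [], []
    for f in flows:
        b = nearest(f, centroids, radius)
        if b is None:
            centroids.append(tuple(f))
            members.append([f])
        else:
            members[b].append(f)
            cols = zip(*members[b])
            # Recompute the centroid as the mean of the bucket's members.
            centroids[b] = tuple(sum(col) / len(members[b]) for col in cols)
    return centroids


def label_buckets(centroids, labelled_flows, radius):
    """Step 2: on a network with known users, name each bucket after
    the most common user type whose flows land in it."""
    votes = defaultdict(Counter)
    for flow, user_type in labelled_flows:
        b = nearest(flow, centroids, radius)
        if b is not None:
            votes[b][user_type] += 1
    return {b: c.most_common(1)[0][0] for b, c in votes.items()}


def evaluate(centroids, names, labelled_flows, radius):
    """Step 3: on another network, measure agreement and collect
    the flows that match no named bucket (the anomalies)."""
    hits, matched, anomalies = 0, 0, []
    for flow, user_type in labelled_flows:
        b = nearest(flow, centroids, radius)
        if b is None or b not in names:
            anomalies.append(flow)
            continue
        matched += 1
        hits += names[b] == user_type
    return (hits / matched if matched else 0.0), anomalies


# Hypothetical features: (mean packet size in KB, connections per minute).
unlabelled = [(0.2, 5), (0.3, 6), (1.4, 1), (1.5, 2), (0.25, 5.5)]
network_a = [((0.22, 5.2), "student"), ((1.45, 1.4), "salesman")]
network_b = [((0.28, 5.8), "student"), ((1.5, 1.6), "salesman"),
             ((9.0, 40.0), "unknown")]

centroids = build_buckets(unlabelled, radius=1.5)
names = label_buckets(centroids, network_a, radius=1.5)
agreement, anomalies = evaluate(centroids, names, network_b, radius=1.5)
print(names, agreement, anomalies)
```

The flows left in `anomalies` are the interesting part: in David’s account, analysing what didn’t match is what surfaced the DDoS seeds. A real version would need far richer features and a more robust clustering method than this toy one.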

