There’s a problem with social data.
The phrase “rubbish in, rubbish out” says it all.
You can’t rely on data if it doesn’t tell you what you think it does. Your sentiment scores, theme counts and other measures will be wrong.
This is important.
Getting it wrong means you’ll be reporting inaccurate results, hindering effective decision making and undermining your stakeholders’ faith in social data.
Argh!
But it’s not the end of the world.
There are just some things you need to be aware of, and some steps you need to take, to ensure you’re working with good data, not bad.
What does good look like?
At L+LR we think of data quality in terms of its relevancy – does it tell you what you think it tells you?
We measure relevancy as the proportion of comments that actually relate to the subject we’re looking at.
To set the bar here, it’s not uncommon for an initial query to return less than 10% relevant comments.
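For the curious, here’s a rough sketch of how that figure falls out once a sample of comments has been hand-coded. It assumes a simple CSV export from a listening tool with a hypothetical is_relevant column added during manual coding; your own tool’s output will look different:

```python
import csv

def relevancy_rate(path):
    """Proportion of hand-coded comments flagged as relevant."""
    total = relevant = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            # 'is_relevant' is a hypothetical column added during manual coding
            if row["is_relevant"].strip().lower() in ("yes", "y", "1", "true"):
                relevant += 1
    return relevant / total if total else 0.0

print(f"Relevancy: {relevancy_rate('coded_sample.csv'):.0%}")
```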
There are lots of reasons why a query will return irrelevant results.
Here are a few of the common ones to look out for:
- Language is messy: What makes sense in your head might not play out well in the data. Terms that you think apply only to you may be used in completely different ways by others – or may form part of other words or expressions. For example, the term A1 can refer to: a computer processor, a road in the UK, a sauce, an exam, an Audi model, a band, a steam train, a size of paper…
- Collection is imperfect: Data scraping and listening tools have huge reach, but not all of those millions of connections have been tested. This means you can get a lot of duplicates, erroneous posts, garbled data outputs and so on. If not checked, they will heavily skew any automated thematic and sentiment analysis, and you’ll see patterns and connections which are simply not there.
- It’s got to relate to a business question: In a recent example for McDonald’s we found lots of people talking about their own recipes for McDonald’s classics, rather than their experience in-store. This is fascinating, but might not help you understand the customer experience. Again, any automated topic, emotion or sentiment scores based on this will be wrong.
- Targeting the right audience: Some projects need to capture every mention; others don’t, so you need to make sure you exclude the voices you’re not interested in (e.g. news, organisations, influencers, certain stakeholders).
- Replies: Another data collection issue, where the crawlers collect the same comment multiple times along with any replies to it. The automated tools then can’t distinguish between the original post and the new comments, giving the original post much greater weight in the analysis and further skewing the results.
- Duplicates: Still a problem. It’s not uncommon for us to find that 15-20% of the data is exact duplicates. This again will skew the results if not checked and removed (one way to strip them out is sketched after this list).
- People are people: People like to muck about and use language for fun. For example, there’s a spoof Beatles song called ‘All you need is Fries’ which can mess up any project looking at fast food.
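To make the duplicates (and repeated replies) problem concrete, here’s a minimal sketch of stripping exact duplicates from an export before any automated scoring. The file and column names are assumptions, and a real clean-up would also need to handle near-duplicates and thread structure:

```python
import pandas as pd

# Hypothetical export with one row per collected post or comment
df = pd.read_csv("mentions_export.csv")

# Normalise the text so trivial differences don't hide exact duplicates
df["text_norm"] = df["text"].str.strip().str.lower()

# Keep only the first occurrence of each comment
deduped = df.drop_duplicates(subset=["text_norm"]).drop(columns=["text_norm"])

removed = len(df) - len(deduped)
print(f"Removed {removed} duplicate rows ({removed / len(df):.0%} of the data)")
```

Near-duplicates (slight rewordings, quoted retweets and the like) need fuzzier matching, but even a simple pass like this often accounts for the 15-20% mentioned above.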
If you don’t check, you won’t know how wrong your results are – and they will be wrong. We’ve never had a 100% relevant query (it’s OK for us because we don’t rely on automated analytics).
What to do about it?
There is a solution.
Start by setting an objective relevancy target (is 50% OK? 60%? 70%?) and, really importantly, discuss and agree on this with your client (they’ll thank you for it).
Then you need to roll up your sleeves, read a broad, random sample and test it for yourself.
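Pulling that sample needn’t be hard. As a sketch, again assuming a CSV export (the sample size of 200 is just an illustration, not a recommendation):

```python
import pandas as pd

df = pd.read_csv("mentions_export.csv")

# A fixed seed keeps the review sample reproducible if you need to revisit it
sample = df.sample(n=min(200, len(df)), random_state=42)

# Save it out for manual coding against the agreed relevancy target
sample.to_csv("review_sample.csv", index=False)
```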
When we do this, we ask ourselves one question: “would my client find this comment relevant and interesting?”
If the answer’s no, look for the reason why, as this will help you work out which strategy to adopt to improve your relevancy. The sources of irrelevancy above will give you a good place to start (e.g. narrow down your language, exclude sources, remove duplicates, isolate problematic content).
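As an illustration of those strategies, here’s a sketch of applying a couple of exclusion rules to the same hypothetical export: dropping source types you’re not interested in and isolating known problem content. The column names and rules are made up for the example; in practice they come out of reading the sample:

```python
import pandas as pd

df = pd.read_csv("mentions_export.csv")

# Hypothetical clean-up rules derived from reading the sample
excluded_sources = {"news", "organisation"}   # voices we're not interested in
problem_phrases = ["all you need is fries"]   # e.g. the spoof song above

keep = ~df["source_type"].isin(excluded_sources)
for phrase in problem_phrases:
    keep &= ~df["text"].str.lower().str.contains(phrase, regex=False, na=False)

cleaned = df[keep]
print(f"Kept {len(cleaned)} of {len(df)} comments after applying the rules")
```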
There is no perfect data.
Our job is to understand the imperfections, remove them where we can and then mitigate them where we can’t.
Then we’ll be ready to know what the data is telling us.
Stuck? Give us a shout; we’d be more than happy to share some tips or help you out.