When you look at the data coming through your social listening dashboard, it looks great. Scratch the surface, however, and you'll often find the quality leaves a lot to be desired.
This matters: when you're running reports and analyses on this data, it's quality in, quality out – and vice versa.
Here are some of the things we look out for when cleaning social data and getting it ready for analysis.
- Look who's talking: most queries will return data from a wide range of authors – adverts, businesses, news articles, agencies, influencers and consumers. Decide who you want to listen to and ignore the rest.
- Check your exports: strange things can happen when you take data out of listening platforms. Beautiful, expressive emoji can easily become a series of ?????s.
- Did you want one comment or the whole page? Depending on how well the export or API is working, you might need to check that you're getting just the comment and not the text of the entire page. This tends to be site-wide, so if you've seen it on one comment, chances are every comment from that source has the same problem. And if you're using automated analysis, which text is it actually using?
- Inspirational quotations: some platforms pull the entire thread of a comment and analyse it as one. For example, Linda might quote Sally, which means both Linda and Sally get exported. What happens when Barbara quotes Linda and Sally? That's right – they all get exported and interpreted again.
- Empty titles: some listening tools look for your keywords in both the title and the post. This creates a problem when a forum title mentions your keyword but the post pulled out has nothing to do with what you're interested in. One thing we see time and time again is a forum title related to your search term paired with a comment that just says "Thanks!". And that's it. People also use threads to talk about all manner of things, not just what's in the title.
- Duplicates: one of the biggest issues we face. We regularly find cases (sometimes as much as 10–15% of mentions) where social listening platforms have scraped the same comment twice.
- Historic data: not an easy one. The platforms approach this differently, so ask, ask again, and ask again. They tend either to collect data when you press go, using relationships they have with aggregators, or to rely on previous searches to fill in the gaps. It's really important to know which you're getting, or the data that comes back can be completely skewed.
- Geotagging: social listening providers try to use location tags in profiles to determine where a comment was posted. When that information isn't available, they fall back on defaults, which may include assuming the poster shares the nationality of the platform they posted on. This can mean that people on Twitter are tagged as American by default.
- Skewed data: if you want a good representation of something, make sure there isn't a single thread, domain, day or month skewing the rest of the results. Checking for these, and deciding on an approach for anomalies (such as randomising the data), can help.
- Poorly put together queries: queries are complicated and difficult to get right. The broader you make them, the more noise gets into the data; the more specific you make them, the more you risk missing useful information. Boolean operators are very powerful and can be used to build detailed queries that pinpoint exactly what you're looking for, but they take a fair amount of knowledge – and trial and error – to get right.
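To make the exports point concrete: if emoji have been mangled on the way out of a platform, they often show up as runs of literal question marks or the Unicode replacement character. A rough sketch of a check for that (the function name and the three-question-mark threshold are our own assumptions, not any platform's API):

```python
import re

# Runs of '?' or the Unicode replacement character U+FFFD are a common
# symptom of emoji or other non-ASCII text being lost on export.
MOJIBAKE = re.compile(r"\?{3,}|\uFFFD")

def looks_garbled(text: str) -> bool:
    """Return True if the text shows signs of a lossy export."""
    return bool(MOJIBAKE.search(text))

mentions = [
    "Loved the launch \U0001F389",  # emoji survived the export
    "Loved the launch ?????",       # emoji replaced by question marks
]
flagged = [m for m in mentions if looks_garbled(m)]
print(flagged)  # only the second mention is flagged for review
```

Anything flagged this way is worth a manual look before it goes anywhere near sentiment or topic analysis.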
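Many duplicates aren't byte-for-byte identical: whitespace and capitalisation often differ between scrapes. A sketch of the kind of normalise-then-dedupe pass we mean, assuming a hypothetical mention format of dicts with 'author' and 'text' keys:

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe(mentions):
    """Keep the first copy of each (author, normalised text) pair."""
    seen = set()
    unique = []
    for m in mentions:
        key = (m["author"], normalise(m["text"]))
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique
```

Comparing the deduplicated count against the raw count also gives you a quick duplicate-rate figure to sanity-check against that 10–15% range.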
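And for skew, a simple concentration check does a lot of the work: group mentions by domain (or day, or thread) and flag any single value that accounts for an outsized share. A sketch under the same assumed dict-based format, with a threshold you'd pick for your own dataset:

```python
from collections import Counter

def concentration(mentions, key, threshold=0.5):
    """Return {value: share} for any single value of `key` (e.g. 'domain'
    or 'day') accounting for more than `threshold` of all mentions."""
    counts = Counter(m[key] for m in mentions)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items() if n / total > threshold}
```

If one forum turns out to be 80% of your mentions, that's a finding in itself, but you probably don't want it silently dominating every chart.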
Great expectations: with all this in mind, you'll be doing well to get 50–70% relevant data. People use language in ways that even clever Boolean and extensive exclusion lists can't keep out.
To make sure you're working with solid, reliable data, you need to check it thoroughly. We've designed a range of tools for dealing with each of these potential issues – give us a shout if you'd like to find out how.