There are currently more than 300 million active Twitter accounts. Some belong to companies, institutions and personalities, but the majority belong to individual users posting personal or professional content. Since this is 2016, we can also assume that many of these accounts are actually used for spamming. Spam can take many forms, and its definition can change from user to user: what some call spam, others call information. At Synthesio, since our objective is to listen to the social web and bring the value and insights from what we crawl to our customers, our definition of spam is anything that brings no value to our customers.
Twitter has its own definition of what it considers spam, and it doesn’t hesitate to close accounts that do not adhere to its terms of service and rules of conduct. As expected, Twitter’s rules are closely tied to abusive use of its network, but not much care is given to the actual content of the messages as long as it is “original.” Our definition of spam is mostly similar to Twitter’s, except in one major way: we put a very large emphasis on the content.
An account that tweets only to promote and/or resell products as a third-party entity, or that does not create personal and original content, is a spammer. Accounts that exist only to gather followers and post unrelated content on trending topics/hashtags are also spam, whether or not they are tweeting about a topic that is included in any specific dashboard of ours.
Now that I have explained how we define spam, I want to take a moment to get a little technical and explain something that is incredibly important to our customers: how do we detect and filter out a spam account?
At Synthesio we use Twitter’s Decahose to help us detect spam accounts. For those who don’t know what that is, Twitter’s Decahose is a data stream containing 10% of all tweets published. To classify someone as a spammer, we look at an account’s activity over a period of time and extract hundreds of parameters, which we use to determine whether the account provides useful content under our definition or whether its content is spam. We look at things like:
- Is this user sharing a lot of URLs? If so, what percentage of messages contain one, and are they all from the same source, or does it vary?
- What does the text in the message say?
- Does the content vary or does it follow a specific pattern, like a constant presence of a set of hashtags and/or words like “sale” or “best offer”?
- How many hashtags versus words are there in each tweet?
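
The questions above can be turned into numeric parameters. The following is only a minimal sketch of that idea, not our production code: the function name `text_features`, the feature names, the regular expressions and the `PROMO_WORDS` list are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Hypothetical word list, chosen only for this example.
PROMO_WORDS = {"sale", "offer", "discount", "deal"}


def text_features(tweets):
    """Compute a few illustrative text features from one account's tweets."""
    n = len(tweets)
    if n == 0:
        return {"url_ratio": 0.0, "top_domain_share": 0.0,
                "hashtag_word_ratio": 0.0, "promo_ratio": 0.0}

    domains = Counter()
    with_url = hashtags = words = promo = 0
    for text in tweets:
        urls = URL_RE.findall(text)
        if urls:
            with_url += 1
        for url in urls:
            domains[urlparse(url).netloc.lower()] += 1

        # Remove URLs first so that URL fragments are not counted as hashtags.
        no_urls = URL_RE.sub(" ", text)
        hashtags += len(HASHTAG_RE.findall(no_urls))

        # Plain words are what remains once URLs and hashtags are removed.
        plain = HASHTAG_RE.sub(" ", no_urls).lower()
        tokens = re.findall(r"[a-z']+", plain)
        words += len(tokens)
        if PROMO_WORDS & set(tokens):
            promo += 1

    total_urls = sum(domains.values())
    return {
        # Share of tweets that contain at least one URL.
        "url_ratio": with_url / n,
        # Share of all shared URLs that point to the most frequent domain.
        "top_domain_share": max(domains.values()) / total_urls if total_urls else 0.0,
        # Hashtags per plain word across all tweets.
        "hashtag_word_ratio": hashtags / max(words, 1),
        # Share of tweets containing at least one promotional word.
        "promo_ratio": promo / n,
    }
```

An account whose tweets almost all link to the same domain and repeat the same promotional words would score high on `url_ratio`, `top_domain_share` and `promo_ratio`, while an account posting varied personal content would not.
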
These are just some of the points we look into regarding the text in the tweet. We also look at the metadata of each tweet to gather more information. For example:
- Is the user sharing media content?
- What software is the user using to post their messages?
- Where is the person located geographically?
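
Metadata can be summarized in the same way. The sketch below assumes a simplified, hypothetical `TweetMeta` record rather than Twitter's actual payload format, and the feature names are again illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TweetMeta:
    """Hypothetical, simplified view of one tweet's metadata."""
    source: str              # name of the software used to post
    has_media: bool          # whether the tweet carries a photo or video
    country: Optional[str]   # None when no location is available


def metadata_features(tweets: List[TweetMeta]) -> dict:
    """Summarize posting software, media use and location for one account."""
    n = len(tweets)
    if n == 0:
        return {"media_ratio": 0.0, "top_source_share": 0.0,
                "distinct_sources": 0, "geo_ratio": 0.0,
                "distinct_countries": 0}

    sources = Counter(t.source for t in tweets)
    located = [t.country for t in tweets if t.country is not None]
    return {
        # Share of tweets that include media.
        "media_ratio": sum(t.has_media for t in tweets) / n,
        # Share of tweets posted with the most frequently used software.
        "top_source_share": sources.most_common(1)[0][1] / n,
        "distinct_sources": len(sources),
        # Share of tweets that carry any location at all.
        "geo_ratio": len(located) / n,
        "distinct_countries": len(set(located)),
    }
```
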
All of these parameters serve as input to our set of machine learning ensembles: multiple methods that run in parallel and together decide whether or not an account is a spammer.
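
To illustrate what "multiple methods deciding together" can look like, here is a toy sketch that averages the scores of several independent models. Averaging is only one possible way to combine them and is an assumption here, as are the three stand-in models and the feature names; real ensemble members would be trained classifiers, not hand-written rules.

```python
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
Model = Callable[[Features], float]  # returns a spam score in [0, 1]


# Three hypothetical stand-in models, each looking at different parameters.
def url_model(f: Features) -> float:
    return min(1.0, f.get("url_ratio", 0.0) * f.get("top_domain_share", 0.0))


def promo_model(f: Features) -> float:
    return min(1.0, f.get("promo_ratio", 0.0))


def hashtag_model(f: Features) -> float:
    return min(1.0, f.get("hashtag_word_ratio", 0.0))


def ensemble_score(features: Features, models: Sequence[Model]) -> float:
    """Average the scores of all models into one spam score."""
    if not models:
        raise ValueError("at least one model is required")
    return sum(model(features) for model in models) / len(models)


account = {"url_ratio": 1.0, "top_domain_share": 1.0,
           "promo_ratio": 0.5, "hashtag_word_ratio": 0.3}
score = ensemble_score(account, [url_model, promo_model, hashtag_model])
print(round(score, 2))
```

Because each model sees the account from a different angle, a single misleading parameter is less likely to decide the outcome on its own.
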
When we focus on high precision, this system works great for us, as we have a very low number of false positives, that is, accounts flagged as spam when in reality they are not. But, for the purposes of data completeness, we want to mark all spammer accounts, even those for which the system is not fully confident in its result. In these cases, and for quality reasons, we also check the flagged accounts through human review. While human review is a daunting task, it is helped tremendously by our automated system. We also take a more aggressive stance on detecting and filtering out spam accounts by focusing on those seen by our customers, and by making a decision within a few days for new accounts.
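
The split between automatic flagging and human review can be sketched as a simple routing rule on the system's confidence. The two thresholds and the labels below are hypothetical values picked for the example, not the ones we use. The `precision` helper shows how the false-positive rate mentioned above is usually measured.

```python
def triage(score: float,
           auto_threshold: float = 0.9,
           review_threshold: float = 0.5) -> str:
    """Route an account based on its spam score (thresholds are illustrative)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= auto_threshold:
        return "spam"            # confident enough to flag automatically
    if score >= review_threshold:
        return "human_review"    # suspicious, but a person makes the call
    return "not_spam"


def precision(flagged, actual_spam) -> float:
    """Fraction of flagged accounts that really are spam."""
    flagged, actual_spam = set(flagged), set(actual_spam)
    if not flagged:
        return 0.0
    return len(flagged & actual_spam) / len(flagged)
```

Raising `auto_threshold` lowers the number of false positives at the cost of sending more accounts to human review, which is the trade-off between precision and completeness described above.
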
We are always working to improve this system, with the ultimate goal of having the best spam detection available, especially as conversations, platforms and even spammers evolve. Fighting spam is a never-ending battle, but we have been winning it for a while now, and we know that we will continue to do so in the future.