
I recently filtered a large Twitter timeline to analyze it using a deep neural network. Tweets can contain different kinds of content, including emojis. So one of the first steps was to clean the data, in this case by removing all emojis from the timeline.
Although this can be done in many ways, I'll show how to do it with JavaScript because it's easy and fast, so let's start.
As you might be guessing from the subtitle of this post, we will use regular expressions to do it.
Modern browsers support Unicode property escapes, which let you match emojis based on whether they have the Emoji Unicode property. For example, you can use \p{Emoji} or \P{Emoji} to match, or to exclude, emoji characters. Note that 0123456789#* and some other characters also count as emojis under that property. Therefore, a better way to do this is to use the Extended_Pictographic Unicode property, which covers the characters typically understood as emojis, instead of the Emoji property.
Let's see some examples.
Use \p to match the Unicode characters
If you use the "Emoji" Unicode property, you may get incorrect results:
const withEmojis = /\p{Emoji}/u;
withEmojis.test('😀');
// true
withEmojis.test('ab');
// false
withEmojis.test('1');
// true, oops!
Therefore it's better to use the Extended_Pictographic escape, as previously mentioned:
const withEmojis = /\p{Extended_Pictographic}/u;
withEmojis.test('😀😀');
// true
withEmojis.test('ab');
// false
withEmojis.test('1');
// false
Use \P to negate the match.
const noEmojis = /\P{Extended_Pictographic}/u;
noEmojis.test('😀');
// false
noEmojis.test('1212');
// true, because '1' is a character without the property
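With the matching covered, the removal step itself can be sketched as a single replace call. This is a minimal sketch, not necessarily the exact code used on the timeline: the helper name removeEmojis and the sample string are made up for illustration.

```javascript
// Hypothetical helper (not from the original post): strips every code point
// that has the Extended_Pictographic property.
function removeEmojis(text) {
  // "g" replaces every match; "u" enables \p{...} property escapes.
  return text.replace(/\p{Extended_Pictographic}/gu, '');
}

console.log(removeEmojis('Great day 😀 at 10 #js'));
// 'Great day  at 10 #js' (digits and # are kept; the double space remains)
```

The regex works one code point at a time, so this version only removes the pictographic characters themselves and leaves everything else in the string untouched.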
As you can see, this is an easy way to detect emojis, but if you use our previous withEmojis regex with a grouped emoji, you will be surprised by the result.
const withEmojis = /\p{Extended_Pictographic}/ug;
// Family emoji: three emojis joined by zero-width joiners (U+200D)
const familyEmoji = '👨\u200D👩\u200D👧';
console.log(familyEmoji.length);
// 8
console.log(withEmojis.test(familyEmoji));
// true
console.log(familyEmoji.match(withEmojis));
// (3) ['👨', '👩', '👧']
familyEmoji.replaceAll(withEmojis, '*');
// *** oops! (the invisible joiners are left in place)
As you can see, if you use the "replaceAll" method with our regular expression, you get three asterisks (***) instead of one (*). This behavior occurs because the grouped emoji is rendered as a single symbol but consists of more than one code point.
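One way to treat a grouped emoji as a single unit without an external dependency is to split the string into grapheme clusters first and then test each cluster. The following is only a sketch under assumptions: it relies on Intl.Segmenter being available (recent browsers and Node.js 16 or later), and replaceEmojiClusters is a hypothetical helper name, not something from a library.

```javascript
// Assumption: Intl.Segmenter exists in the runtime (recent browsers, Node.js 16+).
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// No "g" flag here, so test() keeps no state between calls.
const pictographic = /\p{Extended_Pictographic}/u;

// Hypothetical helper: replaces each user-perceived emoji with `replacement`.
function replaceEmojiClusters(text, replacement = '') {
  let result = '';
  for (const { segment } of segmenter.segment(text)) {
    // A cluster that contains any pictographic code point is replaced as a whole.
    result += pictographic.test(segment) ? replacement : segment;
  }
  return result;
}

const family = '👨\u200D👩\u200D👧';
console.log(replaceEmojiClusters(`<${family}>`, '*'));
// <*>
```

Because the whole cluster is swapped out, the zero-width joiners disappear together with the emoji instead of being left behind in the string.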
To avoid this and other unusual behaviors, you can use libraries like emoji-regex by Mathias Bynens. This library offers a regular expression to match all emoji symbols and sequences (including textual representations of emoji) as per the Unicode Standard.
I hope this little article will be useful for you.