Name: Extracting Intelligence from multilingual SMS

Text: http://scanandtarget.com/

-

contact@scanandtarget.com

Extracting intelligence from multilingual
SMS, IM, e-mails…

Agenda

Scan & Target presentation
Mass interception issues
Specificities for Arabic, Dialects and Arabish
Recommended approach
© Scan & Target 2007-2010


What’s happening in 60 s on the web?

Bla Bla Bla

Conversations account for a large share of this traffic


Help, Natural Language
processing required!

• U don't got da jack but remember we got da
screenin 2mro at 8
• C vré ke C pa + facil ! G mi 2x + 2 tan a lir C 2
post en langaj SMS ke 2 posts ékri normleman
• Hexo x ti y xa ti, tú pones las reglas

• Sda7med ya 5ouya Ma chba3tech biiik allah
ghaleb...nchallah kol 3aam wenti 7ay b5iiir

Who is Scan & Target?

Scan & Target analyzes digital communications in real
time to provide actionable intelligence to software
vendors, brands, service publishers, marketing agencies,
governments…

Social networks
Forums, blogs
E-mails
Instant Messaging

Our Text Meaning Technology watches an incoming stream of user-generated
text in real time, spots patterns of interest, and alerts the right
people or triggers the appropriate action, all without being queried.

Customers

Scan & Target technology

Unlike solutions based on simple keywords or semantics alone, our technology
takes into account the alterations and variants of expressions when
analyzing content:
• Use of lowercase/capital letters
• Repeated letters (vvviiiagrrra, for example)
• Spelling variations (vi@gra, vlagra, v1@gra, v149r4)
• Missing letters (v|agra, v agra…)
• Words broken up by non-alphabetic symbols (v.i.a.g.r.a,
v_i°ag#r:a, v-iagra, viagr"a...)
• Phonetic alterations
• SMS and IM language
• Any combination of these variations
The solution is available in English, French, Spanish and
Arabic (MSA + dialects, Arabic alphabet + transliteration).
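
As an illustration only, several of the alterations listed above (case, repeated letters, look-alike characters, inserted symbols) can be folded away before comparing a token with a watched term. This is a minimal sketch, not Scan & Target's engine: the `LOOKALIKES` table and the `canonical` and `mentions` helpers are hypothetical names, and the substitutions are simply read off the examples on this slide.

```python
import re

# Assumed look-alike substitutions, taken from the slide's own examples.
LOOKALIKES = str.maketrans({"@": "a", "4": "a", "1": "i", "|": "i", "9": "g"})

def canonical(token: str) -> str:
    """Fold one whitespace-delimited token to a comparable form."""
    folded = token.lower().translate(LOOKALIKES)
    letters = re.sub(r"[^a-z]", "", folded)    # drop inserted symbols
    return re.sub(r"(.)\1+", r"\1", letters)   # collapse repeated letters

def mentions(text: str, term: str) -> bool:
    """True if any token of text folds to the same form as term."""
    target = canonical(term)
    return any(canonical(tok) == target for tok in text.split())

print(mentions("buy v_i°ag#r:a now", "viagra"))
```

Variants that swap or drop a letter (vlagra, v agra) are not caught by this folding step; they would need an additional fuzzy or phonetic comparison.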

Scan & Target technology

• The solution is based on a smart engine that rates not just single words
but the entire content as it passes through the filtering engine. Words
are therefore placed in context to extract meaning.
• The solution applies detailed thematic thesauruses, our Smart
Wordbooks. Filters are categorized (Terrorism, Drugs, Violence, etc.) so
that customers can fine-tune the analysis to their needs.
• Additional analysis layers: sentiment analysis, question detection…
• Proprietary scoring technology tailored to short digital texts.
• Thanks to a powerful and accurate conditional analysis system, our
customers experience a very low rate of false positives (between 0.001%
and 0.05% on average).
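
To make the idea of rating the whole message rather than single words concrete, here is a minimal sketch. The proprietary scoring is not described in this deck, so everything here is an assumption: the `WORDBOOKS` contents, the weights, the threshold, and the `theme_scores` and `themes_alerted` helpers are hypothetical.

```python
import re

# Hypothetical thematic wordbooks: term -> weight.
WORDBOOKS = {
    "drugs": {"coke": 2.0, "gram": 1.0, "dealer": 2.0},
    "violence": {"kill": 2.0, "gun": 2.0},
}

def theme_scores(text: str) -> dict:
    """Sum the weights of every wordbook term found in the message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        theme: sum(book.get(tok, 0.0) for tok in tokens)
        for theme, book in WORDBOOKS.items()
    }

def themes_alerted(text: str, threshold: float = 3.0) -> list:
    """Themes whose whole-message score reaches the threshold."""
    return sorted(t for t, s in theme_scores(text).items() if s >= threshold)

print(themes_alerted("my dealer sells coke by the gram"))
print(themes_alerted("I want a coke"))
```

With these assumed weights a single ambiguous word stays below the threshold, while several related terms in the same message push the theme over it.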

What can we find for you?

Drug trafficking
Incitement to violence
Corruption
Online prostitution
Smuggling

Big Data? No problem.

• For homeland security, our API is delivered on IBM hardware
(to be hosted on your premises)
• Thanks to our connector, it is very easy to integrate our API
into your own applications
• You choose how to display our analysis results in your interfaces
• Capacity to process Big Data in real time
– All of Twitter’s traffic (10 TB/day, 1,200 Tweets per second on
average)* could be analyzed in real time using one IBM BladeCenter
(for one language)
– *Source: Twitter
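
For a sense of scale, the two Twitter figures quoted above can be converted into a daily message count and a sustained data rate. This is plain arithmetic on the slide's numbers; the function names are illustrative and decimal units (1 TB = 10^12 bytes) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def tweets_per_day(tweets_per_second: int) -> int:
    """Messages per day at a constant average rate."""
    return tweets_per_second * SECONDS_PER_DAY

def megabytes_per_second(terabytes_per_day: float) -> float:
    """Sustained rate, assuming 1 TB = 10**12 bytes and 1 MB = 10**6 bytes."""
    return terabytes_per_day * 10**12 / SECONDS_PER_DAY / 10**6

print(tweets_per_day(1200))                   # messages per day
print(round(megabytes_per_second(10), 1))     # MB per second
```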


Mass interception issues

• Mass interception of digital text communications (OSINT or COMINT
such as SMS, e-mails, IM…) is now technically feasible

• Issues for intelligence or law enforcement
agencies:
– How to deal with the volume (flow never stops)
– How to find the needle in the digital haystack


“Finding the needle” strategies
[Comparison table: four strategies (Identified suspects, Interception on keywords, Indexation and search, Text Meaning) are rated + or - on six benefits: Real-time information, Fuzzy search, Advanced analysis, False positive ratio, Unknown threat detection, Required analyst time. The individual ratings did not survive text extraction.]

Strategies comparison on OSINT
Service / % alerts           Keywords   Indexing   Text Meaning
BlueLight.ru (drugs forum)   13%        6.5%       <1%
Gaia Online                  19%        11%        <2%


Arabic usage

Arabic is the fastest-growing language on the Web,
yet it has one of the lowest penetration rates

Arabic principles

• “Arabic” refers to three different forms of the same language:
– Classical Arabic: used in the Qur’an and classical literature
– Modern Standard Arabic (MSA):
   • No longer anyone’s native spoken language
   • The form of Arabic taught in schools and used in newspapers, books, sermons, TV…
   • The most widely understood form of Arabic, used in conversation between
     educated Arabs from different countries
– Colloquial or Dialectal Arabic: national or regional varieties derived
from Classical Arabic, which constitute the everyday spoken language


Arabic dialects

• A number of Arabic dialects are spoken in the Arabian
peninsula, North Africa and the Middle East; most of them
differ considerably from one another
• Dialects are a mixture of the native or indigenous
languages and Arabic
• Many of these dialects are mutually unintelligible

Iraq languages

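[Pie chart: languages spoken in Iraq. Legend: Arabic, Mesopotamian; Arabic, North Mesopotamian; Kurdish, Northern; Arabic, Najdi; Azerbaijani, South; Kurdish, Central; Egyptian Spoken; Farsi, Western; Others. Slice values: 50%, 24%, 11%, 5%, 4%, 3%, 2%, 1%, 1%. The extraction does not preserve which value belongs to which language.]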


Dialects example

English sentence                  I want          to drink     water
Standard Arabic transliteration   Ureedu          an ashraba   ma’an
Egyptian transliteration          Awez            ashrab       mayya
Syrian transliteration            Beddy           eshrab       Mayy
Saudi transliteration             Abgha / Areed   Ashrab       Mayyeh
Moroccan transliteration          Bghit           Neshrab      Elma
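
The word differences in this example suggest one simple way to guess a dialect: count how many tokens of a message appear in each variety's vocabulary. The sketch below is only an illustration built from the five rows of the slide; a lexicon this small is an assumption for demonstration, and `DIALECT_LEXICON` and `guess_dialect` are hypothetical names.

```python
# Tiny lexicon built only from the slide's example sentence.
DIALECT_LEXICON = {
    "Standard Arabic": {"ureedu", "an", "ashraba", "ma'an"},
    "Egyptian": {"awez", "ashrab", "mayya"},
    "Syrian": {"beddy", "eshrab", "mayy"},
    "Saudi": {"abgha", "areed", "ashrab", "mayyeh"},
    "Moroccan": {"bghit", "neshrab", "elma"},
}

def guess_dialect(text: str) -> list:
    """Return the varieties with the most lexicon hits (ties are all kept)."""
    tokens = text.lower().replace("\u2019", "'").split()
    hits = {
        name: sum(tok in words for tok in tokens)
        for name, words in DIALECT_LEXICON.items()
    }
    best = max(hits.values())
    if best == 0:
        return []
    return sorted(name for name, count in hits.items() if count == best)

print(guess_dialect("Awez ashrab mayya"))
print(guess_dialect("ashrab"))
```

Shared words such as "ashrab" produce ties, which is why the function returns every top-scoring variety instead of forcing a single answer.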


Transliteration

• Transliteration is the romanization of Arabic
– From قهوة to Gahwa (coffee)
• Problem: written Arabic is normally unvocalized, i.e., the vowels
are not written out and must be supplied by a reader familiar
with the language


Arabic chat alphabet

• The Arabic chat alphabet (Arabish or Arabizi) is
used to communicate in the Arabic language over
the Internet or for sending messages via mobile
phones when the Arabic alphabet is unavailable
• Arabic letters are replaced by letters that are
phonetically equivalent
• Arabic letters that have no Latin phonetic
counterpart are represented by numbers, or
numbers in conjunction with an accent mark
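
A rough sketch of how digit-for-letter usage could be spotted and folded back to Latin letters follows. The digit table is an assumption limited to conventions visible in this deck's own examples (3, 5, 6, 7); real usage varies by region and includes other digits and accent-mark combinations. `DIGIT_TO_LATIN`, `arabizi_tokens` and `to_latin` are hypothetical names.

```python
import re

# Assumed conventions, limited to digits that appear in the deck's examples.
DIGIT_TO_LATIN = {"3": "'", "5": "kh", "6": "t", "7": "h"}

def arabizi_tokens(text: str) -> list:
    """Tokens that mix Latin letters with one of the known digits."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return [
        tok for tok in tokens
        if re.search(r"[A-Za-z]", tok) and any(d in tok for d in DIGIT_TO_LATIN)
    ]

def to_latin(token: str) -> str:
    """Replace each known digit with an approximate Latin rendering."""
    return "".join(DIGIT_TO_LATIN.get(ch, ch) for ch in token)

sample = "nchallah kol 3aam wenti 7ay b5iiir"
print(arabizi_tokens(sample))
print(to_latin("bri6ania"))
```

A heuristic this simple also fires on ordinary tokens such as "mp3", so in practice it could only serve as one signal among others for language identification.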

Issues with Arabic compared to Latin-script languages

• Language identification issue:
– MSA, dialects, mix of languages
• Transliteration issue (notably for names):
– ABD AL-WADOUB
– ABD EL OUADOUD
– ABD-AL-WADUD
– ABDEL EL-WADOUD
• Use of Arabish / Arabizi:
– bri6ania al3o'6ma / britanya al 3ozma = Great Britain, for example

Our Text Meaning Technology handles all these issues
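
To illustrate the name-transliteration problem, the four spellings listed on this slide can be reduced to a rough key and compared with a similarity ratio. This is a sketch under stated assumptions, not the product's method: the folding rules (dropping AL/EL, treating OU and W as U) and the 0.85 threshold are chosen only to fit these examples, and `name_key` and `same_name` are hypothetical helpers. `difflib.SequenceMatcher` is from the Python standard library.

```python
import re
from difflib import SequenceMatcher

ARTICLES = {"AL", "EL"}

def name_key(name: str) -> str:
    """Fold a romanized name to a rough comparison key."""
    parts = []
    for tok in re.split(r"[^A-Z]+", name.upper()):
        if not tok or tok in ARTICLES:
            continue                      # drop separators and articles
        if tok.startswith("ABD") and tok[3:] in ARTICLES:
            tok = "ABD"                   # "ABDEL" -> "ABD"
        parts.append(tok.replace("OU", "U").replace("W", "U"))
    return "".join(parts)

def same_name(a: str, b: str, threshold: float = 0.85) -> bool:
    """True if the two keys are similar enough to be the same name."""
    ratio = SequenceMatcher(None, name_key(a), name_key(b)).ratio()
    return ratio >= threshold

print(name_key("ABD EL OUADOUD"), name_key("ABDEL EL-WADOUD"))
print(same_name("ABD AL-WADOUB", "ABD-AL-WADUD"))
```

Three of the slide's spellings fold to the same key; the fourth differs by one letter and is only matched through the similarity threshold.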



Text meaning mission

• To identify and destroy terrorist / criminal networks,
you must detect the mistakes they will make
• This is the job of text meaning: bringing actionable
intelligence to the analyst for investigation

New threat detection

[Diagram linking five steps: Contextual alerts; Update alerts triggers; Social network analysis; Target identification; Thread analysis]

Messages vs thread

• A web or mobile conversation is a thread of
messages between two or more people
• Analysis is first performed at message level to raise
contextual alerts
• When an alert is raised, the associated
discussion thread is analyzed again to:
– Increase accuracy and precision
– Extract investigation elements (names, places,
nationalities…)
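
The two-stage flow described here (message-level alert first, then a second pass over the whole thread) can be sketched as follows. The `Message` type, the `WATCH_TERMS` set and the fields of the report are all hypothetical; the real alerting logic is not described in this deck.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Message:
    thread_id: str
    sender: str
    text: str

WATCH_TERMS = {"shipment", "kilo"}  # hypothetical trigger terms

def message_alert(msg: Message) -> bool:
    """Stage 1: cheap check applied to every single message."""
    return any(term in msg.text.lower() for term in WATCH_TERMS)

def analyze(messages: list) -> dict:
    """Stage 2: re-examine only the threads that contain an alert."""
    threads = defaultdict(list)
    for msg in messages:
        threads[msg.thread_id].append(msg)
    reports = {}
    for thread_id, msgs in threads.items():
        hits = [m for m in msgs if message_alert(m)]
        if hits:
            reports[thread_id] = {
                "participants": sorted({m.sender for m in msgs}),
                "alerting_messages": len(hits),
                "thread_length": len(msgs),
            }
    return reports

log = [
    Message("t1", "ali", "the shipment lands friday"),
    Message("t1", "sam", "ok, same place"),
    Message("t2", "kim", "lunch tomorrow?"),
]
print(analyze(log))
```

Only thread t1 is expanded: the second message carries no trigger term by itself, yet it still contributes a participant to the investigation elements.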

Message identification: paedophilia
• PTHC = Pre Teen Hard Core
• Age detection
• Multimedia content extension detection
= automatic contextual alert sent for potential child pornography

Thread expansion: paedophilia

Investigation element: forum to be investigated


Use case: drug trafficking detection

• Mass surveillance of SMS communications (20 to 30 million per
day, in many different languages: English, Arabic, dialects…)
• Contextual alerts sent to analysts using conditional analysis on:
– Substance-related discussions
– Transaction-related discussions (quantities, money…)
– Middleman-related discussions (dealers, luggage handlers, dockers, customs…)
– Smuggling-related discussions (places such as ports and airports, smuggling tricks)
• Investigation by analysts (conversation thread analysis, social
network analysis…) identifies:
– Dealers’ rings (pseudonyms, IP addresses…)
– Coded language (use of culinary vocabulary, for example)
• High precision: 40 alerts per million SMS
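
One possible reading of "conditional analysis" is that an alert fires only when several of the discussion types co-occur in the same message. The sketch below takes that reading as an assumption; the wordbook contents, the two-category rule and the helper names are hypothetical. The last function simply applies the quoted rate of 40 alerts per million to the quoted daily volumes.

```python
import re

# Hypothetical mini-wordbooks for the four discussion types listed above.
WORDBOOKS = {
    "substance": {"hash", "coke"},
    "transaction": {"kilo", "cash", "price"},
    "middleman": {"dealer", "docker", "customs"},
    "smuggling": {"port", "airport", "container"},
}

def matched_categories(text: str) -> set:
    """Discussion types with at least one term present in the message."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {name for name, terms in WORDBOOKS.items() if tokens & terms}

def should_alert(text: str, min_categories: int = 2) -> bool:
    """Alert only when several discussion types co-occur (assumed rule)."""
    return len(matched_categories(text)) >= min_categories

def expected_alerts_per_day(sms_per_day: int, alerts_per_million: int = 40) -> float:
    """Daily analyst workload implied by the quoted alert rate."""
    return sms_per_day * alerts_per_million / 1_000_000

print(should_alert("the docker will move the container at the port"))
print(should_alert("see you at the airport"))
print(expected_alerts_per_day(20_000_000), expected_alerts_per_day(30_000_000))
```

At the quoted rate, 20 to 30 million SMS per day correspond to roughly 800 to 1,200 alerts per day for analysts to review.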

Recommended solution

• Scan & Target text meaning technology is a very efficient
tool to detect previously unknown terrorist or criminal
threats on the Internet or wireless networks
• Main benefits:
– Ability to deal with huge volumes in real time
– Multilingual, with the ability to handle fuzzy languages such as
IM language or Arabizi
– Actionable intelligence through message and thread analysis
– Low rate of false positives thanks to advanced analysis
• Can be integrated into your existing monitoring system


Contact Information

Bastien Hillen, CEO
[Phone] + 33 6 11 25 53 80
b.hillen@scanandtarget.com
Scan & Target
80 rue des haies
75020 Paris
France
www.scanandtarget.com
www.oorook.com

Document Path: ["1088-scan-and-target-presentation-extracting.pdf"]
