Monitoring media for terrorism with AI
Goal: SIGINT (i.e. Signals Intelligence)
Summary:
- Take a text only dataset
- Rough segmentation of the data
- Data cleaning and enriching
- Embeddings
- Cluster the similar stuff together
Setup
- set up a virtualenv for python3.5
- install the Jupyter notebook
pip install jupyter
- install the jupyter extensions
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
- start the notebook
`which python` `which jupyter-notebook` --no-browser --ip 127.0.0.1 --port 8888
- install scipy, numpy, pandas, tensorflow, keras, scikit-learn
pip install scipy numpy pandas tensorflow keras scikit-learn
Classify adverse media articles
For this talk we will be using the Global Terrorism Database available on Kaggle.
Suppose we want to monitor all the media in the world, but only want to process the articles that relate to terrorism (and exclude the Donald Trumps, or <insert-here-other-political-figure-that-tries-to-destroy-a-country>).
Load the dataset
import pandas as pd
from IPython.display import display
table = pd.read_csv('./terrorism.csv', encoding = "ISO-8859-1")
display(table.describe()), display(table.head())
/home/cristi/Envs/techsylvania/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (4,6,31,33,53,61,62,63,76,79,90,92,94,96,114,115,121) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
eventid | iyear | imonth | iday | extended | country | region | latitude | longitude | specificity | ... | ransomamt | ransomamtus | ransompaid | ransompaidus | hostkidoutcome | nreleased | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.703500e+05 | 170350.000000 | 170350.000000 | 170350.000000 | 170350.000000 | 170350.000000 | 170350.000000 | 165744.000000 | 165744.000000 | 170346.000000 | ... | 1.279000e+03 | 4.960000e+02 | 7.070000e+02 | 487.000000 | 9911.000000 | 9322.000000 | 170350.000000 | 170350.000000 | 170350.000000 | 170350.000000 |
mean | 2.001776e+11 | 2001.709997 | 6.474365 | 15.466845 | 0.043634 | 132.526669 | 7.091441 | 23.399774 | 26.350909 | 1.454428 | ... | 3.224502e+06 | 4.519918e+05 | 3.849663e+05 | 272.462012 | 4.624458 | -28.717335 | -4.583387 | -4.510555 | 0.091083 | -3.975128 |
std | 1.314444e+09 | 13.144146 | 3.392364 | 8.817929 | 0.204279 | 112.848161 | 2.949206 | 18.844885 | 58.570068 | 1.009005 | ... | 3.090625e+07 | 6.070186e+06 | 2.435027e+06 | 3130.068208 | 2.041008 | 58.737198 | 4.542694 | 4.630440 | 0.583166 | 4.691492 |
min | 1.970000e+11 | 1970.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 | 1.000000 | -53.154613 | -176.176447 | 1.000000 | ... | -9.900000e+01 | -9.900000e+01 | -9.900000e+01 | -99.000000 | 1.000000 | -99.000000 | -9.000000 | -9.000000 | -9.000000 | -9.000000 |
25% | 1.990053e+11 | 1990.000000 | 4.000000 | 8.000000 | 0.000000 | 75.000000 | 5.000000 | 11.263580 | 2.396199 | 1.000000 | ... | 0.000000e+00 | 0.000000e+00 | -9.900000e+01 | 0.000000 | 2.000000 | -99.000000 | -9.000000 | -9.000000 | 0.000000 | -9.000000 |
50% | 2.007121e+11 | 2007.000000 | 6.000000 | 15.000000 | 0.000000 | 98.000000 | 6.000000 | 31.472680 | 43.130000 | 1.000000 | ... | 1.420000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 4.000000 | 0.000000 | -9.000000 | -9.000000 | 0.000000 | 0.000000 |
75% | 2.014023e+11 | 2014.000000 | 9.000000 | 23.000000 | 0.000000 | 160.000000 | 10.000000 | 34.744167 | 68.451297 | 1.000000 | ... | 4.000000e+05 | 0.000000e+00 | 7.356800e+02 | 0.000000 | 7.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 2.017013e+11 | 2016.000000 | 12.000000 | 31.000000 | 1.000000 | 1004.000000 | 12.000000 | 74.633553 | 179.366667 | 5.000000 | ... | 1.000000e+09 | 1.320000e+08 | 4.100000e+07 | 48000.000000 | 7.000000 | 1201.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 77 columns
eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | addnotes | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | related | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaN | 58 | Dominican Republic | 2 | ... | NaN | NaN | NaN | NaN | PGIS | 0 | 0 | 0 | 0 | NaN |
1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaN | 130 | Mexico | 1 | ... | NaN | NaN | NaN | NaN | PGIS | 0 | 1 | 1 | 1 | NaN |
2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaN | 160 | Philippines | 5 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaN | 78 | Greece | 8 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaN | 101 | Japan | 4 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
5 rows × 135 columns
(None, None)
Train classifier
First, let’s see how many of the datapoints have a text summary, and what types of terrorist attacks are documented.
import numpy as np
summaries = table['summary'].values
empty_summaries = np.array([isinstance(summ, float) and np.isnan(summ) for summ in summaries], dtype=bool)
sum(empty_summaries)
66138
What does a summary look like?
summ = table['summary'].tolist()[11]
summ
'1/6/1970: Unknown perpetrators threw a Molotov cocktail into an Army Recruiting Station in Denver, Colorado, United States. There were no casualties but damages to the station were estimated at $305.'
sum(np.isnan(table['attacktype1'].values))
0
So we have roughly 104k text descriptions (about 66k summaries are empty) and every record has an attacktype specified.
What are all the attack types that we have labeled?
set(table['attacktype1_txt'].tolist())
{'Armed Assault',
'Assassination',
'Bombing/Explosion',
'Facility/Infrastructure Attack',
'Hijacking',
'Hostage Taking (Barricade Incident)',
'Hostage Taking (Kidnapping)',
'Unarmed Assault',
'Unknown'}
We should actually convert those to indexed values so we can deal with numbers instead of strings.
Fortunately, the dataset already gracefully provides the indexed labels from above
set(table['attacktype1'].tolist())
{1, 2, 3, 4, 5, 6, 7, 8, 9}
classtype = {classname: classvalue for classname, classvalue in table[['attacktype1_txt', 'attacktype1']].values}
classindx = dict(zip(classtype.values(), classtype.keys()))
classindx = [''] + [classindx[i] for i in range(1, len(classindx)+1)]
classtype, classindx
({'Armed Assault': 2,
'Assassination': 1,
'Bombing/Explosion': 3,
'Facility/Infrastructure Attack': 7,
'Hijacking': 4,
'Hostage Taking (Barricade Incident)': 5,
'Hostage Taking (Kidnapping)': 6,
'Unarmed Assault': 8,
'Unknown': 9},
['',
'Assassination',
'Armed Assault',
'Bombing/Explosion',
'Hijacking',
'Hostage Taking (Barricade Incident)',
'Hostage Taking (Kidnapping)',
'Facility/Infrastructure Attack',
'Unarmed Assault',
'Unknown'])
Build the classification dataset
We now want to build a classifier that can quickly sort out the things we are/are not interested in.
raw_classification_lables = table['attacktype1'].values
raw_classification_inputs = table['summary'].values
mask_for_non_empty_summaries = np.array([not (isinstance(summ, float) and np.isnan(summ)) for summ in summaries], dtype=bool)
classification_inputs = raw_classification_inputs[mask_for_non_empty_summaries]
classification_labels = raw_classification_lables[mask_for_non_empty_summaries]
assert classification_inputs.shape[0] == classification_labels.shape[0]
classification_inputs[0], classification_labels[0]
('1/1/1970: Unknown African American assailants fired several bullets at police headquarters in Cairo, Illinois, United States. There were no casualties, however, one bullet narrowly missed several police officers. This attack took place during heightened racial tensions, including a Black boycott of White-owned businesses, in Cairo Illinois.',
2)
Train, test split
Now that we have a dataset, we will shuffle it and split it into training and test sets.
import numpy as np
# shuffle
perm = np.random.permutation(classification_inputs.shape[0])
classification_inputs = classification_inputs[perm]
classification_labels = classification_labels[perm]
from sklearn.model_selection import train_test_split
classification_train_inputs, classification_test_inputs, classification_train_lables, classification_test_lables = train_test_split(classification_inputs, classification_labels, test_size=0.33)
assert classification_train_inputs.shape[0] == classification_train_lables.shape[0]
assert classification_train_inputs.shape[0] + classification_test_inputs.shape[0] == classification_inputs.shape[0]
Build a really quick classification model
For this task we will use scikit-learn.
We will:
- build a simple data pipeline
- use stop-words to trim the frequent (useless) words out of our vocabulary
- use the tf-idf (term frequency - inverse document frequency) method to vectorize the articles
- use the final vectors to train a Bayesian classifier on top of the data we have
- use grid-search for optimizing the hyperparameters and fitting a better model instance
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
classifier = Pipeline([
('vectorizer', TfidfVectorizer(stop_words='english')),
('classifier', MultinomialNB())
])
classifier.fit(X=classification_inputs, y=classification_labels)
Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr... vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
Scikit-learn has a nice module for computing the metrics that we want. We’re just going to use that to see how we did.
from sklearn.metrics import classification_report
print(classification_report(classification_test_lables, classifier.predict(classification_test_inputs), target_names=classindx[1:]))
precision recall f1-score support
Assassination 0.96 0.01 0.02 2095
Armed Assault 0.62 0.63 0.63 8126
Bombing/Explosion 0.72 0.99 0.84 18199
Hijacking 0.00 0.00 0.00 89
Hostage Taking (Barricade Incident) 0.00 0.00 0.00 115
Hostage Taking (Kidnapping) 0.98 0.33 0.50 2496
Facility/Infrastructure Attack 0.92 0.11 0.20 1826
Unarmed Assault 0.00 0.00 0.00 183
Unknown 0.00 0.00 0.00 1261
avg / total 0.71 0.71 0.64 34390
/home/cristi/Envs/techsylvania/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
So the results are somewhat bad for the classes where we only have ~100 examples each, but on the frequent classes (and the one we’re interested in, ‘Bombing/Explosion’) it’s not that bad.
Nevertheless, some optimisations are required.
What we can do is tune the hyperparameters so that we find the best overall model.
NOTE!! The code below takes some minutes to run, so we’ll not run it live. I’ve run it for you before the talk so we can see the results below.
from sklearn.model_selection import GridSearchCV
parameters = {
'vectorizer__ngram_range': [(1, 1), (1, 2)],
'vectorizer__use_idf': (True, False),
'classifier__alpha': (1e-2, 1e-3),
}
gs_clf = GridSearchCV(classifier, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(classification_train_inputs, classification_train_lables)
What was the best combination found?
gs_clf.best_params_
{'classifier__alpha': 0.01,
'vectorizer__ngram_range': (1, 2),
'vectorizer__use_idf': False}
from sklearn.metrics import classification_report
print(classification_report(classification_test_lables, gs_clf.predict(classification_test_inputs), target_names=classindx[1:]))
precision recall f1-score support
Assassination 0.98 0.92 0.95 4184
Armed Assault 0.94 0.96 0.95 16525
Bombing/Explosion 0.99 0.98 0.98 37182
Hijacking 0.98 0.54 0.70 191
Hostage Taking (Barricade Incident) 0.98 0.60 0.74 236
Hostage Taking (Kidnapping) 0.97 0.98 0.98 4886
Facility/Infrastructure Attack 0.92 0.95 0.93 3710
Unarmed Assault 1.00 0.76 0.86 362
Unknown 0.80 0.88 0.84 2546
avg / total 0.96 0.96 0.96 69822
Conclusion
So using the above classifier we can quickly filter out any media article that we’re not interested in and then focus on the specific use case we want to handle.
There’s A TON of other approaches that you could try to improve the above result. Nowadays, in NLP you don’t want to do ‘bag-of-words’ models as we just did above, but instead you would vectorize the text using word vectors.
You might also want to try some other advanced stuff (a minimal sketch follows this list), like:
- Vectorize by passing the word vectors through an RNN
- Use a bidirectional RNN for results closer to the state of the art
- Use a stacked CNN on top of the word vectors for a different type of vectorisation
- Use an attention mechanism right before the classification output, etc.
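As a very rough sketch of the RNN idea in Keras (which we use later in this talk anyway): the tokenization step, the vocab_size/max_len values and the layer sizes below are illustrative assumptions, not something that was run for this talk.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense
# hypothetical sizes: vocab_size and max_len would come from a tokenizer / padding step
vocab_size, max_len, num_classes = 20000, 200, 9
rnn_classifier = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100, input_length=max_len),  # word vectors
    Bidirectional(LSTM(64)),  # read each summary in both directions
    Dense(num_classes, activation='softmax')  # one probability per attack type
])
rnn_classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# rnn_classifier.fit(padded_token_ids, attack_type_labels - 1)  # labels shifted to 0..8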
Extract Named Entities
Now that we have only the articles that we’re interested in (i.e. we have a classifier that we can use to select them), we need to reason about each one’s content.
The most important thing we can do is to parse the text for interesting tokens.
Usually, these are: people’s names, geographical locations (countries, cities), landmarks (e.g. the Eiffel Tower), dates, etc.
In academia, this is a fairly well established problem that is known as Named Entity Recognition.
There are numerous papers, strategies and datasets that you can use to train an ML model for this.
There are also some pretty decent libraries that come with pretrained NER models, one of which is spaCy.
import spacy
Download the English language model from the spaCy repo
python -m spacy download en
nlp = spacy.load('en')
Let’s see one example of what this does:
doc = nlp(summ)
doc, [(ent.label_, ent.text) for ent in doc.ents]
(1/6/1970: Unknown perpetrators threw a Molotov cocktail into an Army Recruiting Station in Denver, Colorado, United States. There were no casualties but damages to the station were estimated at $305.,
[('DATE', '1/6/1970'),
('ORG', 'Army Recruiting Station'),
('GPE', 'Denver'),
('GPE', 'Colorado'),
('GPE', 'United States'),
('MONEY', '305')])
Example entity types (values found in .label_):
ORG = organization
GPE = geo-political entity
PERSON = person (may be fictional!)
...
We will build a class that will take a text string and return a processed event
class Process:
def __init__(self, language_model):
self.model = language_model
def text(self, data):
data = str(data)
results = {'TEXT': data}
for ent in self.model(data).ents:
results.setdefault(ent.label_, set()).add(ent.text)
return results
process = Process(nlp)
process.text(summ)
{'DATE': {'1/6/1970'},
'GPE': {'Colorado', 'Denver', 'United States'},
'MONEY': {'305'},
'ORG': {'Army Recruiting Station'},
'TEXT': '1/6/1970: Unknown perpetrators threw a Molotov cocktail into an Army Recruiting Station in Denver, Colorado, United States. There were no casualties but damages to the station were estimated at $305.'}
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()
Let’s see some examples
extracted = table['summary'][:10].progress_apply(process.text)
extracted.tolist()
[{'TEXT': 'nan'},
{'TEXT': 'nan'},
{'TEXT': 'nan'},
{'TEXT': 'nan'},
{'TEXT': 'nan'},
{'CARDINAL': {'one'},
'DATE': {'1/1/1970'},
'GPE': {'Cairo', 'Illinois', 'United States'},
'NORP': {'African American'},
'ORG': {'White-'},
'TEXT': '1/1/1970: Unknown African American assailants fired several bullets at police headquarters in Cairo, Illinois, United States. There were no casualties, however, one bullet narrowly missed several police officers. This attack took place during heightened racial tensions, including a Black boycott of White-owned businesses, in Cairo Illinois.'},
{'TEXT': 'nan'},
{'CARDINAL': {'Three'},
'DATE': {'1/2/1970'},
'GPE': {'California', 'Oakland', 'United States'},
'MONEY': {'an estimated $20,000 to $25,000'},
'ORG': {'the Pacific Gas & Electric Company'},
'TEXT': '1/2/1970: Unknown perpetrators detonated explosives at the Pacific Gas & Electric Company Edes substation in Oakland, California, United States. Three transformers were damaged costing an estimated $20,000 to $25,000. There were no casualties.'},
{'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'PERSON': {'Karl Armstrong', 'Madison'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'},
{'DATE': {'1/3/1970'},
'GPE': {'United States', 'Wisconsin'},
'ORDINAL': {'first'},
'ORG': {'Selective Service Headquarters',
'the New Years Gang',
"the University of Wisconsin's"},
'PERSON': {'Armstrong', 'Karl Armstrong', 'Madison'},
'TEXT': "1/3/1970: Karl Armstrong, a member of the New Years Gang, broke into the University of Wisconsin's Primate Lab and set a fire on the first floor of the building. Armstrong intended to set fire to the Madison, Wisconsin, United States, Selective Service Headquarters across the street but mistakenly confused the building with the Primate Lab. The fire caused slight damages and was extinguished almost immediately."}]
sample_events = [event for event in extracted.tolist() if event['TEXT'] != 'nan']
sample_events, len(sample_events)
([{'CARDINAL': {'one'},
'DATE': {'1/1/1970'},
'GPE': {'Cairo', 'Illinois', 'United States'},
'NORP': {'African American'},
'ORG': {'White-'},
'TEXT': '1/1/1970: Unknown African American assailants fired several bullets at police headquarters in Cairo, Illinois, United States. There were no casualties, however, one bullet narrowly missed several police officers. This attack took place during heightened racial tensions, including a Black boycott of White-owned businesses, in Cairo Illinois.'},
{'CARDINAL': {'Three'},
'DATE': {'1/2/1970'},
'GPE': {'California', 'Oakland', 'United States'},
'MONEY': {'an estimated $20,000 to $25,000'},
'ORG': {'the Pacific Gas & Electric Company'},
'TEXT': '1/2/1970: Unknown perpetrators detonated explosives at the Pacific Gas & Electric Company Edes substation in Oakland, California, United States. Three transformers were damaged costing an estimated $20,000 to $25,000. There were no casualties.'},
{'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'PERSON': {'Karl Armstrong', 'Madison'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'},
{'DATE': {'1/3/1970'},
'GPE': {'United States', 'Wisconsin'},
'ORDINAL': {'first'},
'ORG': {'Selective Service Headquarters',
'the New Years Gang',
"the University of Wisconsin's"},
'PERSON': {'Armstrong', 'Karl Armstrong', 'Madison'},
'TEXT': "1/3/1970: Karl Armstrong, a member of the New Years Gang, broke into the University of Wisconsin's Primate Lab and set a fire on the first floor of the building. Armstrong intended to set fire to the Madison, Wisconsin, United States, Selective Service Headquarters across the street but mistakenly confused the building with the Primate Lab. The fire caused slight damages and was extinguished almost immediately."}],
4)
sample_event = sample_events[2]
sample_event
{'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'PERSON': {'Karl Armstrong', 'Madison'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'}
Process all the summaries
Let’s process all the data that we have in the database... but note that it takes roughly 1.5 hours :)
all_events = table['summary'].progress_apply(process.text)
import numpy as np
np.savez_compressed(
"./table_data.npz",
summaries=np.array(table['summary'].tolist()),
all_events=np.array(all_events.tolist()),
attacktype=np.array(table['attacktype1'].tolist()),
attacktype_txt=np.array(table['attacktype1_txt'].tolist())
)
Better load them up from a backup
import numpy as np
with np.load("./table_data.npz") as store:
all_events = store['all_events']
len(all_events), type(all_events)
(170350, numpy.ndarray)
Count all the non-empty events in the dataset
sum(1 for event in all_events.tolist() if event['TEXT'] != 'nan')
104212
Conclusion
We now have the means to parse the article content of the filtered media and extract events of structured data.
This enables us to create structured queries on the information that we collect.
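For instance, a tiny illustrative query over the parsed events (assuming all_events is loaded as above; the place name is an arbitrary example):
# all parsed events that mention Wisconsin as a geo-political entity
wisconsin_events = [event for event in all_events.tolist()
                    if 'Wisconsin' in event.get('GPE', set())]
len(wisconsin_events)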
Enrich data
The above parsing stage, although useful, still doesn’t provide enough information for our SIGINT bosses.
We’re actually leaving a lot of usable information on the table, and we can do better at extracting it and making it queryable.
We will implement below some enricher modules that will transform our events into content-rich elements.
Address enricher
The first thing we can do is transform the ‘GPE’ elements into rich address data structures.
geopy - Geocoding library for Python. Based on OSM. (there are other examples available)
pip install geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim()
geolocator.geocode('Cluj Napoca', timeout=5, addressdetails=True).raw
{'address': {'city': 'Cluj-Napoca',
'country': 'România',
'country_code': 'ro',
'county': 'Cluj',
'postcode': '400133'},
'boundingbox': ['46.6093367', '46.9293367', '23.4300604', '23.7500604'],
'class': 'place',
'display_name': 'Cluj-Napoca, Cluj, 400133, România',
'icon': 'https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png',
'importance': 0.37419916164485,
'lat': '46.7693367',
'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
'lon': '23.5900604',
'osm_id': '32591050',
'osm_type': 'node',
'place_id': '195112',
'type': 'city'}
If you’re interested, the .raw attribute contains a lot more information.
geolocator.geocode('Oakland', addressdetails=True, geometry='geojson').raw
{'address': {'city': 'Oakland',
'country': 'United States of America',
'country_code': 'us',
'county': 'Alameda County',
'state': 'California'},
'boundingbox': ['37.632226', '37.885368', '-122.355881', '-122.114672'],
'class': 'place',
'display_name': 'Oakland, Alameda County, California, United States of America',
'geojson': {'coordinates': [[[-122.355881, 37.835727],
[-122.3500919, 37.8201616],
[-122.3468451, 37.8114822],
[-122.3465852, 37.8108476],
[-122.340281, 37.800628],
[-122.33516, 37.799448],
[-122.3198, 37.795908],
[-122.31468, 37.794728],
[-122.312471, 37.794484],
[-122.305305, 37.793692],
...
[-122.249336, 37.822939],
[-122.248953, 37.8233],
[-122.248859, 37.82339],
[-122.249374, 37.823649]]],
'type': 'Polygon'},
'icon': 'https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png',
'importance': 0.26134751598304,
'lat': '37.8044557',
'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
'lon': '-122.2713563',
'osm_id': '2833530',
'osm_type': 'relation',
'place_id': '178751800',
'type': 'city'}
We will implement an Enricher class that will query ‘GPE’ elements with geopy and copy some of the most useful stuff into our event.
It’s not implemented below, but I suggest you use a caching mechanism to reduce the bandwidth.
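As an illustration (my own assumption of how you could do it, not part of the original code), a minimal caching sketch using functools.lru_cache around the Nominatim call:
from functools import lru_cache
from geopy.geocoders import Nominatim
_geolocator = Nominatim()
@lru_cache(maxsize=100000)
def cached_geocode(query):
    # repeated place names ('United States', 'Wisconsin', ...) hit the in-memory
    # cache instead of issuing a new Nominatim request
    return _geolocator.geocode(query, addressdetails=True)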
from copy import deepcopy
class AddressEnricher():
def __init__(self):
self.geolocator = Nominatim()
def _extract_address(self, location, query):
return {
'query': query,
'address': location['address'],
'boundingbox': [(float(location['boundingbox'][2]), float(location['boundingbox'][0])), (float(location['boundingbox'][3]), float(location['boundingbox'][1]))],
'coord': (float(location['lon']), float(location['lat']))
}
def _emit_addresses(self, event):
for location in event.get('GPE', []):
try:
info = self.geolocator.geocode(location, addressdetails=True).raw
yield self._extract_address(info, location)
except:
print("Query failed for %s" % location)
yield event
def enrich(self, event):
event = deepcopy(event)
for address in self._emit_addresses(event):
event.setdefault('ADDRESS', []).append(address)
if 'GPE' in event:
del event['GPE']
yield event
list(AddressEnricher().enrich(sample_event))
[{'ADDRESS': [{'address': {'city': '서울특별시',
'city_district': '공릉2동',
'country': '대한민국',
'country_code': 'kr',
'military': '학군단',
'town': '노원구',
'village': '공릉동'},
'boundingbox': [(127.0809946, 37.6285013), (127.0815729, 37.6288716)],
'coord': (127.081364331109, 37.628668),
'query': 'R.O.T.C.'},
{'address': {'country': 'United States of America',
'country_code': 'us',
'state': 'Wisconsin'},
'boundingbox': [(-92.8893149, 42.4919436), (-86.249548, 47.3025)],
'coord': (-89.6884637, 44.4308975),
'query': 'Wisconsin'},
{'address': {'country': 'United States of America', 'country_code': 'us'},
'boundingbox': [(-180.0, -14.7608358), (180.0, 71.6048217)],
'coord': (-100.4458825, 39.7837304),
'query': 'United States'}],
'DATE': {'1/2/1970'},
'MONEY': {'around $60,000'},
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'PERSON': {'Karl Armstrong', 'Madison'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'}]
Date enricher
Another thing that we can do is interpret the DATE elements and replace them with (year, month, day) triplets.
import dateparser
date = dateparser.parse("1/3/1970")
date
datetime.datetime(1970, 1, 3, 0, 0)
date.year, date.month, date.day
(1970, 1, 3)
import traceback
import dateparser
class DateEnricher():
def enrich(self, event):
event = deepcopy(event)
dates = event.get('DATE', set())
for unparsed_date in set(dates):
try:
date = dateparser.parse(unparsed_date)
event.setdefault('TIME', set()).add((date.year, date.month, date.day))
dates.remove(unparsed_date)
except:
pass
if not dates and 'DATE' in event:
del event['DATE']
yield event
list(DateEnricher().enrich(sample_event))
[{'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'PERSON': {'Karl Armstrong', 'Madison'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.',
'TIME': {(1970, 1, 2)}}]
Associate enricher
We see that events usually have more than one PERSON element in them, and we’d like to reason about individuals, not groups.
On the other hand, we would like to keep the information that a certain person was at one point involved in the same event as the others, so we keep this grouping in the ASSOCIATES field.
class AssociateEnricher():
def enrich(self, event):
associates = {name for name in event.get('PERSON', [])}
if not associates: return
for associate in associates:
new_event = deepcopy(event)
del new_event['PERSON']
new_event['NAME'] = associate
new_event['ASSOCIATES'] = associates - {associate}
yield new_event
list(AssociateEnricher().enrich(sample_event))
[{'ASSOCIATES': {'Karl Armstrong'},
'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'NAME': 'Madison',
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'},
{'ASSOCIATES': {'Madison'},
'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'NAME': 'Karl Armstrong',
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'}]
Gender enricher
Names contain a lot of embedded information, one piece of which is gender.
Chicksexer is a Python package that can predict gender based on a name
pip install chicksexer
import chicksexer
chicksexer.predict_gender('Cristian Lungu')
2018-06-11 10:11:25,193 - chicksexer.api - INFO - Loading model (only required for the initial prediction)...
{'female': 0.00036776065826416016, 'male': 0.9996322393417358}
sample_event_2 = {'ASSOCIATES': {'Armstrong ', 'Madison'},
'DATE': {'1/3/1970'},
'GPE': {'United States', 'Wisconsin'},
'NAME': 'Karl Armstrong',
'ORDINAL': {'first '},
'ORG': {'Selective Service Headquarters ',
'the New Years Gang',
"the University of Wisconsin's "},
'TEXT': "1/3/1970: Karl Armstrong, a member of the New Years Gang, broke into the University of Wisconsin's Primate Lab and set a fire on the first floor of the building. Armstrong intended to set fire to the Madison, Wisconsin, United States, Selective Service Headquarters across the street but mistakenly confused the building with the Primate Lab. The fire caused slight damages and was extinguished almost immediately."}
Extract the gender of a single name
max([(score, gender) for gender, score in chicksexer.predict_gender('Cristian Lungu').items()])[1]
'male'
import chicksexer
class GenderEnricher():
def enrich(self, event):
event = deepcopy(event)
gender = max([(score, gender) for gender, score in chicksexer.predict_gender(event['NAME']).items()])[1]
event['GENDER'] = gender
yield event
next(GenderEnricher().enrich(sample_event_2))
{'ASSOCIATES': {'Armstrong ', 'Madison'},
'DATE': {'1/3/1970'},
'GENDER': 'male',
'GPE': {'United States', 'Wisconsin'},
'NAME': 'Karl Armstrong',
'ORDINAL': {'first '},
'ORG': {'Selective Service Headquarters ',
'the New Years Gang',
"the University of Wisconsin's "},
'TEXT': "1/3/1970: Karl Armstrong, a member of the New Years Gang, broke into the University of Wisconsin's Primate Lab and set a fire on the first floor of the building. Armstrong intended to set fire to the Madison, Wisconsin, United States, Selective Service Headquarters across the street but mistakenly confused the building with the Primate Lab. The fire caused slight damages and was extinguished almost immediately."}
Ethnicity enricher
Names also (usually) have the ethnicity embedded in them, besides the gender information. We can train a classifier that learns patterns of names and associates them with certain ethnicities. “Ion”, for example, is mostly a Romanian name.
Fortunately there are pretrained models already available for this task so we can use those.
We will be using Ethnea.
NamePrism is a recent famous example, but a paid one
import requests
data = requests.get("http://abel.lis.illinois.edu/cgi-bin/ethnea/search.py", params={"Fname": "Cristi Lungu", "format": "json"})
data.text
"{'Genni': 'F', 'Ethnea': 'KOREAN-ROMANIAN', 'Last': 'X', 'First': 'Cristi Lungu'}\n"
import json
json.loads(data.text.replace("'", '"'))['Ethnea']
'KOREAN-ROMANIAN'
import time
import json
import requests
class EthnicityEnhancer():
def _get_ethnicity(self, name):
data = requests.get("http://abel.lis.illinois.edu/cgi-bin/ethnea/search.py", params={"Fname": name, "format": "json"})
ethnicity = json.loads(data.text.replace("'", '"'))['Ethnea']
time.sleep(1)
return ethnicity
def enrich(self, event):
event = deepcopy(event)
name = event['NAME']
ethnicity = self._get_ethnicity(name)
event['ETHNICITY'] = ethnicity
yield event
EthnicityEnhancer()._get_ethnicity('Karl Armstrong'), next(EthnicityEnhancer().enrich(sample_event_2))
('NORDIC',
{'ASSOCIATES': {'Armstrong ', 'Madison'},
'DATE': {'1/3/1970'},
'ETHNICITY': 'NORDIC',
'GPE': {'United States', 'Wisconsin'},
'NAME': 'Karl Armstrong',
'ORDINAL': {'first '},
'ORG': {'Selective Service Headquarters ',
'the New Years Gang',
"the University of Wisconsin's "},
'TEXT': "1/3/1970: Karl Armstrong, a member of the New Years Gang, broke into the University of Wisconsin's Primate Lab and set a fire on the first floor of the building. Armstrong intended to set fire to the Madison, Wisconsin, United States, Selective Service Headquarters across the street but mistakenly confused the building with the Primate Lab. The fire caused slight damages and was extinguished almost immediately."})
Putting it all together
def run(enricher, events):
new_events = []
for event in tqdm(events):
for new_event in enricher(event):
new_events.append(new_event)
return new_events
def enriched_events(events):
enrichers = [
AddressEnricher(),
DateEnricher(),
AssociateEnricher(),
GenderEnricher(),
EthnicityEnhancer()
]
iterator = tqdm(enrichers, total=len(enrichers))
for enricher in iterator:
iterator.set_description(enricher.__class__.__name__)
events = run(enricher.enrich, events)
return events
e_events = enriched_events(sample_events)
len(e_events), e_events[0]
(5,
{'ADDRESS': [{'address': {'city': '서울특별시',
'city_district': '공릉2동',
'country': '대한민국',
'country_code': 'kr',
'military': '학군단',
'town': '노원구',
'village': '공릉동'},
'boundingbox': [(127.0809946, 37.6285013), (127.0815729, 37.6288716)],
'coord': (127.081364331109, 37.628668),
'query': 'R.O.T.C.'},
{'address': {'country': 'United States of America',
'country_code': 'us',
'state': 'Wisconsin'},
'boundingbox': [(-92.8893149, 42.4919436), (-86.249548, 47.3025)],
'coord': (-89.6884637, 44.4308975),
'query': 'Wisconsin'},
{'address': {'country': 'United States of America', 'country_code': 'us'},
'boundingbox': [(-180.0, -14.7608358), (180.0, 71.6048217)],
'coord': (-100.4458825, 39.7837304),
'query': 'United States'}],
'ASSOCIATES': {'Karl Armstrong'},
'ETHNICITY': 'ENGLISH',
'GENDER': 'female',
'MONEY': {'around $60,000'},
'NAME': 'Madison',
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.',
'TIME': {(1970, 1, 2)}})
Let’s do this with all the events.
NOTE: RUNS AWFULLY SLOW. BETTER LOAD!
all_enriched_events = enriched_events(all_events.tolist())
len(all_enriched_events)
69600
np.savez_compressed("./all_enriched_events.npz", all_enriched_events=all_enriched_events)
Load from backup
import numpy as np
with np.load("./all_enriched_events.npz") as store:
all_enriched_events = store["all_enriched_events"]
Conclusion
We’ve shown some of the ways in which you can enrich a profile with more information. Here are some other things that you can try:
- derive the general topic (politics, education, sports, etc.)
- extract phone numbers
- phone numbers, like names, have lots of embedded information that can be extracted (carrier network, region, country, etc.); see the sketch after this list
- Age retrieval
- Geo-triangulation of addresses, etc.
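For the phone number idea, here is a hedged sketch using the phonenumbers package (an assumption on my part; pip install phonenumbers; the number below is made up):
import phonenumbers
from phonenumbers import carrier, geocoder
number = phonenumbers.parse("+40 745 123 456", None)  # made-up example number
geocoder.description_for_number(number, "en")  # region, e.g. 'Romania'
carrier.name_for_number(number, "en")          # mobile carrier, if known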
Event2Vec
Now that we have rich events, we may need to be able to index them and have the ones that share a common theme grouped together.
This is useful when looking for insights and “not-so-obvious” links between people’s interests. Usually, such a grouping reveals a common agenda like:
- a common terrorist cell
- a common corruption ring
- an organized crime cartel
We’re going to do this by deriving event embeddings.
They will act as coordinates in a multidimensional space that we can later use to cluster the similar ones together.
Model discussions
One approach for getting these embeddings is to replicate the skip-gram model published by Google in their famous Word2Vec paper, but with some modifications.
- We will use all the elements of an event that look like unique name identifiers as “words”.
- We define “context” to be the set of “words” found in a single event.
- We will build a model whose task is, given a pair of “words” to:
- output 1, if they appear in the same “context”
- output 0, if the “words” don’t share a “context”
- this is unlike the original model where they use a hierarchical softmax approach.
- The training will happen in the embedding layer, where “words” (ids) will be converted to a multidimensional array that models the latent variables of the “words”.
- The final step will be, for each “event”, to add up all of its “word” embeddings. The result is the “event” embedding.
We will implement this model in keras.
Useful tokens
embedding_keys = {'ASSOCIATES', 'GPE', 'ORG', 'LOC', 'FAC', 'EVENT', 'PRODUCT'} # + 'NAME'
all_enriched_events[1]
{'ASSOCIATES': {'Madison'},
'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'NAME': 'Karl Armstrong',
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'}
Using all the events would make the computation much more demanding, so for this demonstration we will use only the first 1,000 events.
event_data = all_enriched_events[:1000]
Preprocessing the named tokens
We first define some helper functions that extract and clean the names of an event (in the hope that doing this will reduce duplicate names and spelling errors).
from keras.preprocessing.text import text_to_word_sequence
event = event_data[1]
def normalize(token):
return " ".join([word for word in text_to_word_sequence(token)])
def enumerate_names(event):
for embedding_key in embedding_keys & event.keys():
for name in event[embedding_key]:
yield normalize(name)
yield normalize(event['NAME'])
set(enumerate_names(event))
Using TensorFlow backend.
{'karl armstrong',
'madison',
'r o t c',
'the new years gang',
'the old red gym',
'the university of wisconsin',
'united states',
'wisconsin'}
Collect all the name tokens to get an idea of how large our name vocabulary is.
name_tokens = set()
for event in tqdm(event_data):
name_tokens |= set(enumerate_names(event))
len(name_tokens)
1713
Show some name examples.
sorted(list(name_tokens))[500:510]
['hanukkah',
'hardcastle realty',
'hare krishna',
'harford county',
'harold mciver',
'harold nelson',
'harrison',
'harry j candee',
'hato rey',
'hawaii']
Token vocabulary
I’m implementing a quick class to store the names and query them both by name and by id.
We will convert all the names to ids for training, but we also need to convert ids back to string values for debugging purposes.
class Vocabulary(dict):
def __init__(self):
self.index = []
def add(self, item):
if item not in self:
self[item] = len(self.index)
self.index.append(item)
return self[item]
def value(self, idx):
        assert 0 <= idx < len(self.index)
return self.index[idx]
v = Vocabulary()
v.add('a')
v.add('c')
v.add('b')
v.add('c')
v.index, v, v.add('a'), v.add('d')
(['a', 'c', 'b', 'd'], {'a': 0, 'b': 2, 'c': 1, 'd': 3}, 0, 3)
vocabulary = Vocabulary()
for token in tqdm(name_tokens):
vocabulary.add(token)
len(vocabulary.index)
1713
Build the training data based on event context
from itertools import combinations
tokens = set(enumerate_names(event))
list(combinations(tokens, 4))
[('', 'armstrong', 'skinhead', 'the west end synagogue'),
('', 'armstrong', 'skinhead', 'nashville'),
('', 'armstrong', 'skinhead', 'the ku klux klan'),
('', 'armstrong', 'skinhead', 'united states'),
('', 'armstrong', 'skinhead', 'leonard william armstrong'),
('', 'armstrong', 'skinhead', 'tennessee'),
('', 'armstrong', 'skinhead', 'white knights'),
('', 'armstrong', 'the west end synagogue', 'nashville'),
('', 'armstrong', 'the west end synagogue', 'the ku klux klan'),
('', 'armstrong', 'the west end synagogue', 'united states'),
...
('the ku klux klan', 'united states', 'tennessee', 'white knights'),
('the ku klux klan',
'leonard william armstrong',
'tennessee',
'white knights'),
('united states', 'leonard william armstrong', 'tennessee', 'white knights')]
We’re going to define a function for positive and one for negative sample generation.
from itertools import combinations
def make_positive_samples(tokens):
return [list(comb) for comb in combinations(tokens, 2)]
make_positive_samples([1, 2, 3, 4])
[[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]
import random
def make_negative_sample(vocabulary_size):
return [random.randint(0, vocabulary_size-1), random.randint(0, vocabulary_size-1)]
make_negative_sample(len(vocabulary.index))
[1187, 344]
Build the training data
We’re going to replace all the names from an event with the indices from the built vocabulary before using them to build the training data.
tokens = {vocabulary[token] for token in enumerate_names(event)}
tokens
{0, 78, 625, 815, 827, 1090, 1446, 1517, 1533, 1598}
positive = []
negative = []
for i, event in tqdm(enumerate(event_data), total=len(event_data)):
tokens = {vocabulary[token] for token in enumerate_names(event)}
positive += make_positive_samples(tokens)
vocabulary_size = len(vocabulary.index)
for _ in range(len(positive) * 2):
negative.append(make_negative_sample(vocabulary_size))
labels = ([1] * len(positive)) + ([0] * len(negative))
Merge the positive and negative samples and shuffle them along with their labels.
inputs = np.array(positive + negative)
labels = np.array(labels)
perm = np.random.permutation(len(positive) + len(negative))
inputs = inputs[perm]
labels = labels[perm]
How much training data did we generate?
inputs.shape, labels.shape
((125031, 2), (125031,))
np.savez_compressed(
"./data_embedding.npz",
inputs=inputs,
labels=labels
)
The embeddings model
import keras
from keras.layers import Input
from keras.layers import Embedding
from keras.layers import merge, Lambda, Reshape, Dense, Dot
from keras.models import Model
from keras import backend as K
from keras.layers import Activation
inp = Input(shape=(1,), dtype='int32')
lbl = Input(shape=(1,), dtype='int32')
emb = Embedding(input_dim=len(vocabulary.index), output_dim=(10))
inp_emb = Reshape((10, 1))(emb(inp))
trg_emb = Reshape((10, 1))(emb(lbl))
dot = Dot(axes=1)([inp_emb, trg_emb])
dot = Reshape((1,))(dot)
# out = Dense(1, activation='sigmoid')(dot)
out = Activation(activation='sigmoid')(dot)
model = Model([inp, lbl], out)
model.summary()
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_3 (InputLayer) (None, 1) 0
__________________________________________________________________________________________________
input_4 (InputLayer) (None, 1) 0
__________________________________________________________________________________________________
embedding_2 (Embedding) (None, 1, 10) 17130 input_3[0][0]
input_4[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 10, 1) 0 embedding_2[0][0]
__________________________________________________________________________________________________
reshape_4 (Reshape) (None, 10, 1) 0 embedding_2[1][0]
__________________________________________________________________________________________________
dot_1 (Dot) (None, 1, 1) 0 reshape_3[0][0]
reshape_4[0][0]
__________________________________________________________________________________________________
reshape_5 (Reshape) (None, 1) 0 dot_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 1) 0 reshape_5[0][0]
==================================================================================================
Total params: 17,130
Trainable params: 17,130
Non-trainable params: 0
__________________________________________________________________________________________________
from keras_tqdm import TQDMNotebookCallback
model.fit(x=[inputs[:, 0], inputs[:, 1]], y=labels, epochs=1000, batch_size=1024, callbacks=[TQDMNotebookCallback()], verbose=0)
Getting the weights from the Keras model
emb.get_weights()[0].shape
(1713, 10)
[e['NAME'] for e in event_data[:100]]
['Madison',
'Karl Armstrong',
'Madison',
'Armstrong',
'Karl Armstrong',
'James Madison High School',
'Judith Bissell',
'Patrolmen William Kivlehan',
'Ralph Bax',
'Bax',
'Joseph Blik',
'Officer Blik',
'Gables',
'John Abercrombie',
'Harold Nelson',
'Strike',
'Dore',
'Fred Dore',
'Leslie Jewel',
'John Murtagh',
'Anti-Vietnam',
'Murtagh',
'James C. Perrill',
'Ithaca',
'Karl Armstrong',
'Keyes',
'Frank Schaeffer',
'Schaeffer',
'Brown',
'White Racists',
"H. Rap Brown's",
'H. Rap Brown',
'William Payne',
'Ralph Featherstone',
'Black',
'H. Rap Brown',
'S. I. Hayakawa',
'Samuel Ichiye Hayakawa',
'Clyde William McKay Jr.',
'Leonard Glatkowski',
'Glatkowski',
'Gregory',
'Burton I. Gordin',
'Richard Nixon',
'Joe',
"Auguste Rodin's",
'Thinker',
'William Calley',
'Curtis W. Tarr',
'Ithaca',
'Africana Studies',
'Ithaca',
'Castro',
'David G. Sprague',
'Free',
'Lawrence',
'Molotov Cocktails',
'T-9',
'Cumulatively',
'Cumulatively',
'Stanley Sierakowski',
'Patrolman Donald Sager',
'Cheng Tzu-tsai',
'James Ziede',
'Chiang',
'Chiang Ching-kuo',
'Peter Huang Wen-hsiung',
'John McKinney',
'Dorchester',
'The Burger King',
'Edgar Hoults',
'Edgar Hoults',
'Joe Schock',
'Gables',
'Torah Scroll',
'Dorchester',
'Bernard Bennett',
'Lloyd Smothers',
'Ku Klux Klan',
'James Rudder',
'Larry G. Ward',
'Larry Clark',
'Black Panther',
'Ronald Reed',
'James Sackett',
'Sackett',
'Owen Warehouse',
'Torah Scroll',
'Dorchester',
'Torah',
'Dorchester',
'Barr',
'William G. Barr',
'Levin P. West',
' ',
'Marion Troutt',
'Kenneth Kaner',
'Bruce Sharp',
'William Redwine',
'Radetich']
from scipy.spatial.distance import euclidean, cosine
def best_embeddings(target_embedding, all_embeddings, top=20):
distances = [cosine(target_embedding, candidate) for i, candidate in enumerate(all_embeddings)]
return np.argsort(distances)[:top]
def best_match(name):
ne = emb.get_weights()[0]
e = ne[vocabulary[name]]
best_ids = best_embeddings(e, ne)
print(name,":", [vocabulary.index[best_id] for best_id in best_ids])
best_match("karl armstrong"), best_match('madison'), best_match('armstrong')
karl armstrong : ['karl armstrong', 'mathews', 'johnnie veal', 'david lane', 'kim holland', 'ashkelon', 'lane', 'denver', 'civic center', 'the jewish armed resistance assault team', 'black pyramid courts', "the northern illinois women's center", 'richard scutari', 'hezbollah', 'billy joel oglesby', 'power authority', 'iran', 'kuwait', 'gerald gordon', 'the army recruiting station']
madison : ['madison', 'seabrook', 'the los angeles international airport', 'new hampshire', 'maryland', 'michael donald bray', 'james sommerville', 'langley way', 'chiang', 'twin lakes high school', 'annapolis', 'william cann', 'selective service headquarters', 'gant', 'cloverdale', 'everett c carlson', 'thomas spinks', 'charles lawrence', 'frank schaeffer', "the metropolitan medical and women's center"]
armstrong : ['armstrong', 'bon marche', 'premises', 'decatur', 'the army recruiting station', 'kim holland', 'bhagwan shree rajneesh', 'fried chicken', "the northern illinois women's center", 'adams street', 'robert mathews', 'army', 'oregon', 'marion troutt', 'kenneth blapper', 'john joseph kaiser ii', 'the dorchester army national guard', 'rosalie zevallos', 'mississippi', 'james rudder']
(None, None, None)
# model.save_weights("./embedding_model.hdf5")
# model.save_weights("./embedding_model_without_dense_2.hdf5")
model.load_weights("./embedding_model.hdf5")
Event embeddings
So now, all we need to do to compute an event embedding is add all of its name embeddings together.
def compute_event_embedding(event):
event_emb = np.zeros(10)
ne = emb.get_weights()[0]
for name in enumerate_names(event):
event_emb += ne[vocabulary[name]]
return event_emb
sample = 0
event_data[sample], compute_event_embedding(event_data[sample])
({'ASSOCIATES': {'Karl Armstrong'},
'DATE': {'1/2/1970'},
'GPE': {'R.O.T.C.', 'United States', 'Wisconsin'},
'MONEY': {'around $60,000'},
'NAME': 'Madison',
'ORG': {'the New Years Gang',
'the Old Red Gym',
'the University of Wisconsin'},
'TEXT': '1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States. There were no casualties but the fire caused around $60,000 in damages to the building.'},
array([ 4.03297613, -5.36483431, 1.85935935, -1.71195513,
12.51382726, -1.84665674, 6.86580369, 3.14632877,
8.94560277, -4.70104777]))
Build an event_embeddings array.
event_embeddings = np.zeros((len(event_data), 10))
for i, event in enumerate(event_data):
event_embeddings[i, :] = compute_event_embedding(event)
def best_event_match(event_id):
matches = best_embeddings(event_embeddings[event_id], event_embeddings, 10)
print(event_data[event_id]['NAME'], ":", [event_data[match]['NAME'] for match in matches], matches)
best_event_match(3)
Armstrong : ['Madison', 'Armstrong', 'Karl Armstrong', 'Madison', 'Karl Armstrong', 'Kaleidoscope', 'Edward P. Gullion', 'Richard J. Picariello', 'Joseph Aceto', 'Everett C. Carlson'] [ 2 3 4 0 1 150 457 458 459 460]
np.savez_compressed("./event_embeddings.npz",
event_embeddings=event_embeddings,
event_data=np.array(event_data),
name_embeddings=emb.get_weights()[0]
)
Conclusion
We’ve trained a model to derive name embeddings that we later used to assemble “event embeddings”.
These can be used as indexes in a database, similar elements being close to one another.
The interesting thing about embeddings right now is that we can also use them to make intelligent “Google-like” queries:
- “Air force” + “New York” + “Bombings” + “1980” -> “John Malcom”
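A minimal sketch of what such a query could look like with the pieces we already have (vocabulary, normalize, best_embeddings, event_embeddings); the function name and the usage line are illustrative only, and every query term must already exist in the vocabulary:
def query_events(*terms, top=5):
    # sum the name embeddings of the query terms, then return the closest events
    name_embeddings = emb.get_weights()[0]
    query_emb = np.zeros(name_embeddings.shape[1])
    for term in terms:
        query_emb += name_embeddings[vocabulary[normalize(term)]]
    best_ids = best_embeddings(query_emb, event_embeddings, top=top)
    return [event_data[i]['NAME'] for i in best_ids]
# query_events('united states', 'wisconsin')  # hypothetical usage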
Clustering the event_embeddings
The final step in our journey is to cluster all the embeddings into common groups.
Most of the known clustering algorithms require us to input the desired number of clusters beforehand, which we obviously don’t know here.
Fortunately there are a couple of algorithms that automatically estimate the “best” number of clusters.
We will be using AffinityPropagation, but MeanShift is another good approach whose internals I’ve described previously on my blog.
import sklearn
from sklearn.cluster import MeanShift, AffinityPropagation
clusterer = AffinityPropagation(damping=0.5)
clusters = clusterer.fit_predict(event_embeddings)
clusters
array([ 0, 0, 0, 0, 0, 3, 40, 3, 3, 3, 3, 3, 73, 0, 0, 95, 76,
76, 91, 3, 3, 3, 40, 8, 0, 40, 91, 91, 1, 1, 1, 1, 1, 1,
40, 40, 4, 4, 2, 2, 2, 4, 40, 97, 3, 91, 91, 97, 97, 8, 8,
8, 4, 40, 3, 4, 4, 97, 72, 5, 40, 40, 6, 6, 6, 6, 6, 91,
8, 72, 91, 91, 97, 83, 8, 8, 97, 97, 97, 40, 40, 7, 7, 7, 7,
7, 40, 8, 8, 8, 8, 97, 97, 40, 0, 0, 4, 4, 4, 4, 4, 95,
52, 52, 52, 52, 8, 4, 3, 3, 9, 9, 9, 9, 9, 91, 91, 4, 4,
4, 53, 53, 97, 97, 97, 97, 0, 0, 0, 0, 3, 4, 4, 45, 53, 53,
53, 40, 40, 4, 4, 40, 97, 95, 10, 10, 10, 10, 10, 97, 95, 91, 91,
40, 40, 40, 97, 40, 40, 97, 3, 3, 40, 40, 91, 11, 11, 11, 11, 12,
12, 12, 12, 97, 91, 73, 73, 73, 0, 97, 97, 40, 40, 91, 3, 40, 97,
95, 95, 95, 97, 13, 13, 13, 13, 97, 8, 97, 14, 14, 14, 91, 91, 91,
4, 4, 4, 4, 4, 4, 4, 4, 97, 97, 87, 87, 87, 4, 4, 4, 5,
97, 40, 4, 97, 97, 15, 15, 15, 3, 4, 4, 4, 4, 97, 97, 97, 97,
97, 97, 3, 3, 3, 3, 97, 97, 4, 4, 4, 4, 4, 4, 3, 3, 3,
72, 72, 3, 3, 4, 4, 0, 0, 0, 95, 95, 3, 3, 3, 16, 16, 16,
16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 45, 45, 45, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30, 30, 30, 30, 30, 30,
46, 46, 46, 45, 45, 45, 31, 31, 31, 31, 31, 31, 31, 92, 92, 92, 32,
32, 32, 32, 32, 32, 32, 45, 32, 32, 32, 32, 32, 53, 53, 32, 32, 32,
32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 31, 31, 31, 34, 34, 34,
34, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 35, 35, 35, 35,
35, 35, 35, 35, 33, 33, 33, 33, 33, 33, 36, 36, 36, 36, 36, 36, 36,
36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 37, 37, 92, 45, 45, 92,
92, 92, 92, 92, 45, 45, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,
38, 38, 72, 45, 39, 39, 39, 39, 40, 41, 41, 41, 41, 41, 40, 45, 45,
3, 3, 3, 72, 82, 8, 8, 8, 97, 97, 4, 4, 40, 40, 3, 42, 42,
42, 42, 43, 43, 43, 43, 45, 45, 43, 45, 40, 92, 72, 44, 44, 44, 44,
44, 45, 5, 45, 45, 46, 97, 45, 46, 46, 46, 46, 47, 47, 53, 53, 53,
53, 38, 38, 43, 43, 4, 45, 46, 46, 46, 46, 72, 72, 48, 48, 48, 48,
43, 43, 97, 97, 97, 49, 49, 49, 49, 49, 4, 4, 4, 50, 50, 50, 50,
3, 4, 4, 51, 51, 51, 51, 53, 53, 53, 53, 53, 45, 45, 46, 52, 52,
52, 52, 52, 43, 53, 53, 53, 53, 95, 97, 97, 97, 97, 54, 54, 54, 54,
54, 54, 54, 40, 43, 53, 53, 46, 46, 45, 45, 45, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 51, 51, 51, 66, 66, 66, 66, 66, 66, 67, 67,
67, 67, 67, 68, 68, 68, 92, 92, 92, 69, 69, 69, 69, 70, 70, 70, 70,
4, 5, 5, 38, 97, 97, 9, 9, 9, 72, 72, 45, 91, 91, 91, 4, 4,
72, 97, 97, 71, 71, 71, 53, 53, 53, 40, 72, 85, 85, 73, 73, 73, 73,
95, 95, 3, 3, 3, 91, 72, 74, 74, 74, 74, 74, 74, 74, 74, 74, 72,
97, 72, 4, 4, 3, 75, 75, 75, 75, 88, 88, 88, 88, 88, 3, 95, 72,
53, 53, 53, 83, 83, 83, 83, 3, 70, 70, 70, 95, 95, 95, 45, 76, 76,
76, 77, 77, 77, 78, 78, 78, 78, 0, 0, 0, 0, 82, 82, 82, 76, 76,
76, 76, 84, 84, 84, 84, 84, 84, 76, 76, 76, 76, 76, 76, 76, 76, 87,
87, 5, 45, 5, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79,
79, 76, 76, 76, 80, 80, 80, 80, 80, 40, 40, 45, 81, 81, 81, 81, 81,
81, 81, 83, 83, 83, 83, 82, 82, 82, 82, 82, 82, 45, 45, 40, 40, 40,
84, 84, 84, 82, 82, 82, 82, 82, 82, 72, 72, 82, 82, 82, 83, 83, 83,
83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 84, 84, 84, 72, 4, 5,
5, 85, 85, 95, 95, 88, 88, 88, 93, 93, 86, 86, 86, 88, 88, 88, 88,
88, 88, 88, 88, 97, 87, 87, 87, 87, 87, 92, 92, 92, 92, 91, 91, 91,
5, 72, 72, 72, 76, 76, 88, 88, 88, 4, 4, 72, 72, 72, 93, 93, 91,
93, 93, 93, 89, 89, 89, 97, 93, 90, 90, 90, 90, 90, 72, 5, 5, 72,
72, 87, 87, 87, 87, 87, 40, 40, 87, 87, 87, 87, 87, 87, 40, 97, 91,
4, 91, 97, 97, 97, 72, 91, 4, 40, 14, 14, 92, 94, 94, 94, 94, 94,
93, 72, 95, 95, 95, 40, 93, 93, 72, 38, 72, 93, 66, 66, 66, 66, 4,
4, 93, 97, 94, 94, 94, 4, 94, 94, 94, 94, 72, 72, 72, 46, 46, 95,
97, 4, 45, 97, 92, 46, 94, 94, 4, 4, 4, 52, 52, 5, 5, 12, 97,
95, 95, 5, 5, 96, 96, 96, 96, 93, 72, 95, 95, 45, 45, 45, 93, 93,
92, 87, 87, 87, 3, 3, 97, 4, 73, 73, 73, 73, 72, 91])
We will build the event groupings from the above result and look at some of them.
groups = dict()
for event_id, cluster_id in enumerate(clusters):
groups.setdefault(cluster_id, []).append(event_id)
group = 8
groups[group], [event_data[event_id]['NAME'] for event_id in groups[group]]
([23, 49, 50, 51, 68, 74, 75, 87, 88, 89, 90, 106, 196, 447, 448, 449],
['Ithaca',
'Ithaca',
'Africana Studies',
'Ithaca',
'Dorchester',
'Torah Scroll',
'Dorchester',
'Torah',
'Torah Scroll',
'Dorchester',
'Dorchester',
'Dorchester',
'Ithaca',
'Carol Ann Manning',
'Ray Luc Levasseur',
'Thomas Manning'])
event_data[23], event_data[196]
({'ASSOCIATES': set(),
'CARDINAL': {'2/22/1970'},
'FAC': {'the Wari House Dormitory'},
'GPE': {'New York', 'United States'},
'NAME': 'Ithaca',
'ORG': {'Cornell University', "the Black Women's Cooperative"},
'TEXT': "2/22/1970: Unknown perpetrators threw kerosene flare pots at the Wari House Dormitory which housed the Black Women's Cooperative at Cornell University in Ithaca, New York, United States. The incendiary tossed at the dormitory failed to ignite, but an incendiary thrown through the window of a car parked in front of the dormitory burst into flames and caused minor damages to the vehicle. There were no casualties."},
{'ASSOCIATES': set(),
'GPE': {'New York', 'United States'},
'NAME': 'Ithaca',
'ORG': {'Cornell University', 'the Air Force R.O.T.C.'},
'TEXT': '3/17/1971: Unknown perpetrators set fire to a classroom used by the Air Force R.O.T.C. at Cornell University in Ithaca, New York, United States. There were no casualties and the fire caused only minor damage.'})
Conclusions
To recap our journey today:
- Trained a “terrorism” media filter (to weed out all the can-can stories)
- Parsed the text into structured format
- Enriched data with external or implicit information
- Derived event embeddings for querying and search
- Clustered similar events into groups
Key takeaways:
- Media articles are a rich source of information
- Machine Learning allows us to process this information into a queryable format
- There are multiple frameworks and strategies that we can use for this
- usually a blend of them is the most pragmatic choice
- We can also build and train our own models to better suit our needs.
- These approaches can greatly enhance your decision making process, be it
- compliance with law
- insights gathering
- monitoring for certain events
- investing