Generation of Crime Knowledge Base From Online News Articles
Date
2022
Authors
K, Srinivasa
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
The growing amount of unstructured data on the internet has drawn Semantic Web (SW)
technology researchers toward the creation of Knowledge Bases (KBs). A KB is a
structured, machine-readable representation of unstructured data. Facts are typically
stored in the KB as a set of triples of the form (head entity, relation, tail entity),
each of which represents a relationship between the head and the tail entity.
Information about crime can be found in a variety of places, including news media,
social networks, blogs, and video repositories. Crime reports published in online
newspapers are frequently regarded as more reliable than crowdsourced data such as that
found on social media. Furthermore, information in newspapers is available both as
multilingual text and as images. As a result, a KB of crime-related facts generated
from online newspapers will help Law Enforcement Agencies (LEAs) analyze criminal
activity without language and modality barriers. In addition, creating a KB from
sources that publish data daily, such as news media, keeps the KB up to date.
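The triple form described above can be illustrated with a minimal Python sketch. The entity and relation names below are hypothetical examples, not facts or identifiers taken from the thesis, and storing the triples in a set is only one simple way to drop exact duplicates.

```python
from __future__ import annotations

from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the (head entity, relation, tail entity) form."""
    head: str
    relation: str
    tail: str


# Hypothetical facts for illustration only.
facts = {
    Triple("suspect_A", "arrested_in", "Mangalore"),
    Triple("suspect_A", "charged_with", "burglary"),
    Triple("suspect_A", "arrested_in", "Mangalore"),  # exact duplicate, dropped by the set
}


def facts_about(kb: set[Triple], head: str) -> list[Triple]:
    """Return every fact whose head entity matches, in sorted order."""
    return sorted(t for t in kb if t.head == head)
```

A set removes only exact duplicates; near-duplicate entities that differ in surface form need a similarity-based merging step.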
The creation of a KB involves the extraction and integration of data from multiple
sources, and it must also ensure the accuracy of the extracted knowledge. Existing
research has primarily focused on extracting entities and their relationships from
monolingual sources, while ignoring the impact of the extracted entities on generating
a complete and non-redundant KB. Furthermore, most existing methods have used either
corpus-based or knowledge-based similarity methods to integrate information from
multiple sources, without delving into the full semantics hidden in the facts. These
factors result in redundancy and the loss of critical information in the KB. In
addition, completing a knowledge base requires incrementally updating the KB with facts
extracted from multilingual sources. It is also critical to verify the credibility of
facts before they are entered into the KB. To address these issues, this study proposes
a bootstrap-based model for developing Crime Base, a knowledge base of crime entities
and their relationships that contains complete, non-redundant, and validated facts. It
makes use of crime-related text and image data from English and Hindi online news
articles.
First, the proposed model extracts crime-related facts from English news articles in
order to construct the Crime Base. Unlike existing methods and tools, it extracts
entities using an external KB, DBpedia, with the goal of minimizing redundancy and the
loss of essential information. To capture more semantics during integration, a semantic
merging method is proposed in which entities extracted from text data are correlated
using both corpus-based and knowledge-based similarity measures, and image entities are
correlated using both low-level and high-level image features. Empirical results show
that using both similarity measures reduces redundancy in the KB more effectively than
using either measure alone.
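As a rough illustration of combining the two kinds of similarity for text entities, the sketch below averages a corpus-based score with a knowledge-based score and merges two entity mentions when the result clears a threshold. Everything here is an assumption made for illustration: the abstract does not name the specific measures, so bag-of-words cosine stands in for the corpus-based measure, Jaccard overlap of external-KB type labels stands in for the knowledge-based measure, and the `alpha` weight and `threshold` values are hypothetical.

```python
import math
from collections import Counter


def corpus_similarity(ctx_a: str, ctx_b: str) -> float:
    """Cosine similarity of bag-of-words counts from the two mention contexts."""
    a, b = Counter(ctx_a.lower().split()), Counter(ctx_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knowledge_similarity(types_a: set, types_b: set) -> float:
    """Jaccard overlap of type labels from an external KB (stand-in measure)."""
    union = types_a | types_b
    return len(types_a & types_b) / len(union) if union else 0.0


def should_merge(ctx_a, ctx_b, types_a, types_b, alpha=0.5, threshold=0.6) -> bool:
    """Merge two mentions when the weighted combination of both scores is high enough."""
    score = (alpha * corpus_similarity(ctx_a, ctx_b)
             + (1 - alpha) * knowledge_similarity(types_a, types_b))
    return score >= threshold
```

With a combined score, two mentions that appear in similar contexts but carry unrelated KB types (or the reverse) fall below the threshold, which is the intuition behind using both measures rather than either one.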
Second, a clustering-based bootstrapping approach is proposed to enrich the Crime Base,
created from English news articles, with Hindi news articles. The proposed method
investigates redundancy in a bilingual collection of news articles by clustering them
based on semantic similarity using an incremental nearest neighbor algorithm. Within
each cluster, the facts extracted from English-language articles are bootstrapped to
extract facts from comparable Hindi-language articles using the Google Translator API.
Bootstrapping within a cluster helps identify related sentences that contain new
information in a low-resource language like Hindi. Using this approach, information
from news articles in any low-resource language can be extracted without
language-specific tools such as Part-of-Speech (POS) taggers, Named Entity Recognizers,
and Open Relation Extractors, making it more suitable for resource-deficient Indian
languages. Experimental results show that the proposed framework extracts new facts
from Hindi news articles with a high recall rate.
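The clustering step can be sketched as a simple incremental nearest-neighbor pass: each article, in arrival order, joins the cluster of its most similar earlier article if the similarity clears a threshold, and otherwise starts a new cluster. This is a minimal sketch under stated assumptions, not the thesis's implementation: it assumes the Hindi articles have already been translated to English, uses bag-of-words cosine as a stand-in for semantic similarity, and the `threshold` value is hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def incremental_nn_clusters(texts: list[str], threshold: float = 0.5) -> list[int]:
    """Assign each text, in arrival order, to the cluster of its nearest earlier text."""
    vectors: list[Counter] = []
    labels: list[int] = []
    next_label = 0
    for text in texts:
        vec = Counter(text.lower().split())
        best_sim, best_idx = 0.0, -1
        for i, seen in enumerate(vectors):
            sim = cosine(vec, seen)
            if sim > best_sim:
                best_sim, best_idx = sim, i
        if best_idx >= 0 and best_sim >= threshold:
            labels.append(labels[best_idx])  # join the nearest neighbor's cluster
        else:
            labels.append(next_label)  # nothing close enough: start a new cluster
            next_label += 1
        vectors.append(vec)
    return labels
```

Because articles are processed one at a time, newly published articles can be added to existing clusters without re-clustering the whole collection, which suits a KB that is updated daily.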
Finally, the proposed method employs a multi-layer perceptron-based classifier to
determine whether a given triple is genuine. It vectorizes the triples using both
frequency-based and probability-based word embedding models. Together, these two
embedding models capture both word-level and document-level features while reducing
vector dimensionality. Empirical results show that the proposed classifier outperforms
the baseline classifiers in prediction accuracy.
Keywords
Information Extraction, Knowledge base construction, Knowledge base completion, Fake news classification