Generation of Crime Knowledge Base From Online News Articles

K, Srinivasa

Please use this identifier to cite or link to this item: https://idr.nitk.ac.in/jspui/handle/123456789/17393

Title:	Generation of Crime Knowledge Base From Online News Articles
Authors:	K, Srinivasa
Supervisors:	Thilagam, P.Santhi
Keywords:	Information Extraction;Knowledge base construction;Knowledge base completion;Fake news classification
Issue Date:	2022
Publisher:	National Institute of Technology Karnataka, Surathkal
Abstract:	The growing amount of unstructured data on the internet has piqued the interest of Semantic Web (SW) technology researchers in the creation of Knowledge Bases (KBs). A KB is a structured representation of unstructured data that can be read by machines. Facts are typically stored in the KB as a set of triples of the form (head entity, relation, tail entity), which represent the relationships between the head and the tail entities. In today’s internet age, information about crime can be found in a variety of places, including news media, social networks, blogs, and video repositories. Crime reports published in online newspapers are frequently regarded as more reliable than crowdsourced data such as that found on social media. Furthermore, information in newspapers is available in both multilingual text and image form. As a result, generating a KB of crime-related facts from online newspapers will be useful for Law Enforcement Agencies (LEAs) in analyzing crime activities without language and modality barriers. Furthermore, creating a KB from sources that publish data on a daily basis, such as news media, keeps the KB up to date. The creation of a KB involves the extraction and integration of data from multiple sources. At the same time, it also ensures the accuracy of the extracted knowledge. Existing research has primarily focused on extracting entities and their relationships from monolingual sources, while ignoring the impact of extracted entities on generating a complete and non-redundant KB. Furthermore, most existing approaches have used either corpus-based or knowledge-based similarity methods to integrate information from multiple sources without delving into the full semantics hidden in facts. These factors result in redundancy and the loss of critical information in the KB. In addition, the completion of a knowledge base necessitates an incremental update of the KB through the extraction of facts from multilingual sources.
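The (head entity, relation, tail entity) triple form described above can be illustrated with a minimal in-memory store. This is only a sketch: the `Triple` and `TripleStore` names and the placeholder facts are hypothetical and are not taken from the thesis, and the only redundancy check shown is exact-match deduplication.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One KB fact in (head entity, relation, tail entity) form."""
    head: str
    relation: str
    tail: str


class TripleStore:
    """Hypothetical minimal KB that rejects exact duplicate facts."""

    def __init__(self) -> None:
        self._facts = set()

    def add(self, head: str, relation: str, tail: str) -> bool:
        """Insert a fact; return False if it was already present."""
        fact = Triple(head.strip(), relation.strip(), tail.strip())
        if fact in self._facts:
            return False
        self._facts.add(fact)
        return True

    def about(self, entity: str) -> List[Triple]:
        """Return every fact whose head or tail is the given entity."""
        return sorted(f for f in self._facts if entity in (f.head, f.tail))

    def __len__(self) -> int:
        return len(self._facts)


# Placeholder facts, invented purely for illustration.
kb = TripleStore()
kb.add("suspect_1", "arrested_in", "city_a")
kb.add("suspect_1", "arrested_in", "city_a")  # exact duplicate, ignored
kb.add("suspect_1", "charged_with", "robbery")
```

Exact-match deduplication catches only literal repeats; the abstract's concern is semantic redundancy, which requires comparing the meaning of entities and not just their surface strings.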
It is also critical to verify the credibility of facts before they are entered into the KB. To address the aforementioned issues, this study proposes a bootstrap-based model for developing Crime Base, a knowledge base of crime entities and their relationships that contains complete, non-redundant, and validated facts. It makes use of crime-related text and image data from English and Hindi online news articles. To begin, the proposed model extracts crime-related facts from English news articles in order to construct the Crime Base. Unlike existing methods and tools, it extracts entities using an external KB, DBpedia, with the goal of minimizing redundancy and loss of essential information. To capture more semantics during integration, a semantic merging method is proposed in which entities extracted from text data are correlated using both corpus-based and knowledge-based similarity measures, and image entities are correlated using both low-level and high-level image features. Empirical results show that using both similarity measures reduces redundancy in the KB more effectively than using either one alone. Secondly, a clustering-based bootstrapping approach is proposed to enrich the Crime Base created from English news articles with facts from Hindi news articles. The proposed method investigates redundancy in a bilingual collection of news articles by clustering them based on semantic similarity using an incremental nearest neighbor algorithm. The facts extracted from English-language articles are bootstrapped within each cluster to extract facts from comparable Hindi-language articles using the Google Translator API. This bootstrapping method within the cluster aids in identifying related sentences containing new information from a low-resource language like Hindi.
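The incremental nearest neighbor clustering step mentioned above could look roughly like the following sketch. It makes several assumptions that do not come from the thesis: articles are already in one language (for example, Hindi text translated beforehand), bag-of-words cosine similarity stands in for the semantic similarity measure, and the `threshold` value of 0.5 is an arbitrary illustrative choice.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a semantic representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_nn_clusters(articles, threshold=0.5):
    """Process articles one at a time.

    Each article joins the cluster of its most similar previously seen
    article if that similarity reaches the threshold; otherwise it
    starts a new cluster. Returns one cluster label per article.
    """
    vectors, labels = [], []
    next_label = 0
    for text in articles:
        vec = bag_of_words(text)
        best_sim, best_label = 0.0, None
        for seen_vec, seen_label in zip(vectors, labels):
            sim = cosine(vec, seen_vec)
            if sim > best_sim:
                best_sim, best_label = sim, seen_label
        if best_label is not None and best_sim >= threshold:
            labels.append(best_label)
        else:
            labels.append(next_label)
            next_label += 1
        vectors.append(vec)
    return labels


# Invented example headlines, for illustration only.
articles = [
    "police arrest two men for bank robbery",
    "two men arrested by police after bank robbery",
    "court grants bail in land fraud case",
]
labels = incremental_nn_clusters(articles)
```

Because each article is compared only against articles already seen, new articles can be added as they are published without re-clustering the whole collection.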
Using this approach, information from news articles in any low-resource language can be extracted without the use of language-specific tools such as Part-of-Speech (POS) taggers, Named Entity Recognizers, and Open Relation Extractors, making it more suitable for resource-deficient Indian languages. Experimental results show that the proposed framework extracts new facts from Hindi news articles with a high recall rate. Finally, the proposed method employs a multi-layer perceptron-based classifier to determine whether or not a given triple is genuine. It vectorizes the triples by employing both frequency-based and probability-based word embedding models. These two embedding models help in considering both word-level and document-level features while reducing vector dimensionality. Empirical results show that the proposed classifier outperforms the baseline classifiers in prediction accuracy.
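As a rough illustration of the frequency-based half of the triple vectorization described above, the sketch below builds smoothed TF-IDF vectors from triples. The abstract does not name the embedding models used, so TF-IDF is an assumption here, as are the function name `tfidf_vectors` and the placeholder triples. The probability-based embedding and the multi-layer perceptron itself are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(triples):
    """Turn (head, relation, tail) triples into smoothed TF-IDF vectors.

    Each triple is treated as a short document made of its three parts.
    Returns the sorted vocabulary and one dense vector per triple.
    """
    docs = [" ".join(triple).lower().split() for triple in triples]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many triples does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Smoothed inverse document frequency, always at least 1.0.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        length = len(doc) or 1  # guard against an empty triple
        vectors.append([term_freq[t] / length * idf[t] for t in vocab])
    return vocab, vectors


# Placeholder triples, invented purely for illustration.
triples = [
    ("suspect_1", "arrested_in", "city_a"),
    ("suspect_2", "arrested_in", "city_b"),
]
vocab, vectors = tfidf_vectors(triples)
```

Vectors of this kind could then be passed, together with the label of each triple, to any standard classifier implementation.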
URI:	http://idr.nitk.ac.in/jspui/handle/123456789/17393
Appears in Collections:	1. Ph.D Theses

Files in This Item:

File	Description	Size	Format
177137CO006-Srinivasa K.pdf		3.49 MB	Adobe PDF
