Sensitive Keyword Extraction Based on Cyber Keywords and LDA in Twitter to Avoid Regrets

R. Geetha; S. Karthika

doi:10.1007/978-3-030-63467-4_5

Conference Papers
Year : 2020


R. Geetha

Function : Author
PersonId : 1117219

SSN College of Engineering - Sri Sivasubramaniya Nadar College of Engineering

S. Karthika

Function : Author
PersonId : 1117220

SSN College of Engineering - Sri Sivasubramaniya Nadar College of Engineering

Abstract

Twitter is one of the most popular social platforms, where ordinary users share personal, political and business views and thereby indirectly build an active online repository. The data that users post on social networking sites often contain sensitive or private information that is highly exposed to cyber threats. The most frequently disclosed sensitive private data are analyzed by collecting real-time tweets based on benchmarked cyber-keywords under the personal, professional and health categories. This research work aims to build a Topic Keyword Extractor by adapting the Automatic Acronym-Abbreviation Replacer, which was developed specifically for short social media texts. The feature space is modeled using Latent Dirichlet Allocation (LDA) to discover topics for each cyber-keyword. The user’s context and intentions are preserved by replacing internet jargon and abbreviations. The originality of this work lies in identifying sensitive keywords that reveal a tweeter’s Personally Identifiable Information through the novel Topic Keyword Extractor. The sensitive topics in which social media users frequently reveal personal information and make unintended disclosures are discovered for the benchmarked cyber-keywords by adapting the proposed qualitative topic-wise keyword distribution approach. The experiment analyzed cyber-keywords and the identified sensitive topic keywords as bi-grams to predict the most common sensitive information leaks on Twitter. The results showed that the most frequently discussed sensitive topic was ‘weight loss’, associated with the cyber-keyword ‘weight’ in the health tweet category.
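To make two of the steps named in the abstract concrete, the following is a minimal sketch of abbreviation replacement followed by counting bi-grams that start with a cyber-keyword. It is not the authors' implementation: the abbreviation table, the sample tweets and the function names `expand_abbreviations` and `keyword_bigrams` are all assumptions made up for illustration, and the LDA topic-modeling step is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical abbreviation table; the paper's actual replacer dictionary
# is not reproduced on this page.
ABBREVIATIONS = {"wt": "weight", "dr": "doctor", "bday": "birthday"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Lowercase the text, tokenize it, and expand known abbreviations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [table.get(tok, tok) for tok in tokens]


def keyword_bigrams(tweets, cyber_keyword):
    """Count (cyber_keyword, next_word) bi-grams across all tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = expand_abbreviations(tweet)
        # Pair each token with its successor and keep pairs led by the keyword.
        for first, second in zip(tokens, tokens[1:]):
            if first == cyber_keyword:
                counts[(first, second)] += 1
    return counts


# Made-up sample tweets, not data from the study.
SAMPLE_TWEETS = [
    "Started my wt loss journey today!",
    "Weight loss tips from my dr",
    "My weight gain is stressing me out",
]

if __name__ == "__main__":
    for bigram, count in keyword_bigrams(SAMPLE_TWEETS, "weight").most_common():
        print(bigram, count)
```

Expanding abbreviations before counting matters here: without it, "wt loss" in the first sample tweet would not be matched to the cyber-keyword "weight".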

Domains

Computer Science [cs]

Main file

507484_1_En_5_Chapter.pdf (275.04 KB)

Origin	Files produced by the author(s)
licence	CC BY 4.0 - Attribution

https://inria.hal.science/hal-03434777

Submitted on : Thursday, November 18, 2021-2:20:09 PM

Last modification on : Thursday, November 18, 2021-2:32:17 PM

Long-term archiving on : Saturday, February 19, 2022-7:10:29 PM

Dates and versions

hal-03434777 , version 1 (18-11-2021)

Identifiers

HAL Id : hal-03434777 , version 1
DOI : 10.1007/978-3-030-63467-4_5

Cite

R. Geetha, S. Karthika. Sensitive Keyword Extraction Based on Cyber Keywords and LDA in Twitter to Avoid Regrets. 3rd International Conference on Computational Intelligence in Data Science (ICCIDS), Feb 2020, Chennai, India. pp.59-70, ⟨10.1007/978-3-030-63467-4_5⟩. ⟨hal-03434777⟩
