The University of Southampton
University of Southampton Institutional Repository

Profiling web users using big data


Gu, Xiaotao, Yang, Hong, Tang, Jie, Zhang, Fanjin, Liu, Debing, Hall, Wendy and Xiao, Fu (2018) Profiling web users using big data. Social Network Analysis and Mining, 8 (24), 1-17, [24]. (doi:10.1007/s13278-018-0495-0).

Record type: Article

Abstract

Profiling Web users is a fundamental issue for Web mining and social network analysis. Its basic tasks include extracting basic information, mining user preferences, and inferring user demographics (Tang et al. in ACM Trans Knowl Discov Data 5(1):2:1–2:44, 2010). Although the methodologies for handling the three tasks differ, they usually all comprise two stages: first identify the relevant pages (data) of a user, then use machine learning models (e.g., SVM, CRFs, or DL) to extract/mine/infer profile attributes from each page. These methods were successful on the traditional Web, but face growing challenges as the Web rapidly evolves: each person's information is distributed over the Web and changes dynamically. As a result, the available data for a user on the Web is redundant, and some sources may be out-of-date or incorrect. The traditional two-stage method suffers from data inconsistency and error propagation between the two stages. In this paper, we revisit the problem of Web user profiling in the big data era and propose a simple but very effective approach, referred to as MagicFG, for profiling Web users by leveraging the power of big data. To avoid error propagation, the approach processes all the extracting/mining/inferring subtasks in one unified framework. To improve the profiling performance, we present the concept of contextual credibility. The proposed framework also supports the incorporation of human knowledge: it defines human knowledge as Markov logic statements and formalizes them into a factor graph model. The MagicFG method has been deployed in an online system, AMiner.org, for profiling millions of researchers, e.g., extracting e-mail addresses, inferring gender, and mining research interests. Our empirical study in the real system shows that the proposed method offers significantly improved (+4–6%; p ≪ 0.01, t-test) profiling performance in comparison with several baseline methods using rules, classification, and sequential labeling.

This record has no associated files available for download.

More information

Accepted/In Press date: 16 February 2018
e-pub ahead of print date: 22 March 2018
Published date: December 2018
Keywords: Information Extraction, Factor graph model, User profiling, Big data

Identifiers

Local EPrints ID: 419198
URI: http://eprints.soton.ac.uk/id/eprint/419198
ISSN: 1869-5450
PURE UUID: 74a21a07-c43e-42cc-b868-3f3e1c08c8da
ORCID for Wendy Hall: orcid.org/0000-0003-4327-7811

Catalogue record

Date deposited: 09 Apr 2018 16:30
Last modified: 18 Mar 2024 02:31

Contributors

Author: Xiaotao Gu
Author: Hong Yang
Author: Jie Tang
Author: Fanjin Zhang
Author: Debing Liu
Author: Wendy Hall
Author: Fu Xiao
