The University of Southampton
University of Southampton Institutional Repository

Profiling Web users using big data

Gu, Xiaotao, Yang, Hong, Tang, Jie, Zhang, Jing, Zhang, Fanjin, Liu, Debing, Hall, Wendy and Fu, Xiao (2018) Profiling Web users using big data. Social Network Analysis and Mining, 8 (1), 1-17. (doi:10.1007/s13278-018-0495-0).

Record type: Article

Abstract

Profiling Web users is a fundamental issue for Web mining and social network analysis. Its basic tasks include extracting basic information, mining user preferences, and inferring user demographics (Tang et al. in ACM Trans Knowl Discov Data 5(1):2:1–2:44, 2010). Although the methodologies for handling the three tasks differ, they usually all contain two stages: first identify the relevant pages (data) of a user, and then use machine learning models (e.g., SVM, CRFs, or DL) to extract/mine/infer profile attributes from each page. These methods were successful on the traditional Web, but face more and more challenges with the rapid evolution of the Web: each person's information is distributed over the Web and changes dynamically. As a result, the available data for a user on the Web is redundant, and some sources may be out-of-date or incorrect. The traditional two-stage method suffers from data inconsistency and error propagation between the two stages. In this paper, we revisit the problem of Web user profiling in the big data era and propose a simple but very effective approach, referred to as MagicFG, for profiling Web users by leveraging the power of big data. To avoid error propagation, the approach processes all the extracting/mining/inferring subtasks in one unified framework. To improve the profiling performance, we present the concept of contextual credibility. The proposed framework also supports the incorporation of human knowledge. It defines human knowledge as Markov logic statements and formalizes them into a factor graph model. The MagicFG method has been deployed in the online system AMiner.org for profiling millions of researchers, e.g., extracting E-mail, inferring Gender, and mining research interests. Our empirical study in the real system shows that the proposed method offers significantly improved (+4–6%; p ≪ 0.01, t-test) profiling performance in comparison with several baseline methods using rules, classification, and sequential labeling.

Full text not available from this repository.

More information

Accepted/In Press date: 16 February 2018
e-pub ahead of print date: 22 March 2018
Published date: 1 December 2018
Keywords: Big data, Factor graph model, Information extraction, User profiling

Identifiers

Local EPrints ID: 419198
URI: https://eprints.soton.ac.uk/id/eprint/419198
ISSN: 1869-5450
PURE UUID: 74a21a07-c43e-42cc-b868-3f3e1c08c8da
ORCID for Wendy Hall: orcid.org/0000-0003-4327-7811

Catalogue record

Date deposited: 09 Apr 2018 16:30
Last modified: 12 Apr 2018 16:30

Contributors

Author: Xiaotao Gu
Author: Hong Yang
Author: Jie Tang
Author: Jing Zhang
Author: Fanjin Zhang
Author: Debing Liu
Author: Wendy Hall
Author: Xiao Fu
