Profiling web users using big data. Social network analysis and mining
1-17
Gu, Xiaotao
d7cda545-ec64-478e-b797-99a0ae61ed13
Yang, Hong
1b834e6e-52ca-45e0-bcd7-ca2004961e3f
Tang, Jie
69c44bae-b1fa-45eb-a01d-3ac5b00fa749
Zhang, Fanjin
0f3e9872-35a3-4aad-8505-732d69143fdc
Liu, Debing
1a719212-8566-4312-900f-5007d6f923bd
Hall, Wendy
11f7f8db-854c-4481-b1ae-721a51d8790c
Xiao, Fu
f7ae34fa-c9d8-457c-bdc2-abf97a17a890
December 2018
Gu, Xiaotao, Yang, Hong, Tang, Jie, Zhang, Fanjin, Liu, Debing, Hall, Wendy and Xiao, Fu (2018) Profiling web users using big data. Social Network Analysis and Mining, 8 (24), [24]. (doi:10.1007/s13278-018-0495-0).
Abstract
Profiling Web users is a fundamental issue for Web mining and social network analysis. Its basic tasks include extracting basic information, mining user preferences, and inferring user demographics (Tang et al. in ACM Trans Knowl Discov Data 5(1):2:1–2:44, 2010). Although the methodologies for these three tasks differ, they usually share two stages: first identify the relevant pages (data) for a user, then use machine learning models (e.g., SVM, CRFs, or DL) to extract/mine/infer profile attributes from each page. These methods were successful on the traditional Web but face growing challenges as the Web rapidly evolves: each person's information is distributed across the Web and changes dynamically. As a result, the available data for a user on the Web is redundant, and some sources may be out of date or incorrect. The traditional two-stage method suffers from data inconsistency and error propagation between the two stages. In this paper, we revisit the problem of Web user profiling in the big data era and propose a simple but very effective approach, referred to as MagicFG, for profiling Web users by leveraging the power of big data. To avoid error propagation, the approach processes all the extracting/mining/inferring subtasks in one unified framework. To improve profiling performance, we present the concept of contextual credibility. The proposed framework also supports the incorporation of human knowledge: it defines human knowledge as Markov logic statements and formalizes them in a factor graph model. The MagicFG method has been deployed in an online system, AMiner.org, for profiling millions of researchers (e.g., extracting e-mail, inferring gender, and mining research interests). Our empirical study in the real system shows that the proposed method offers significantly improved (+4–6%; p ≪ 0.01, t-test) profiling performance in comparison with several baseline methods using rules, classification, and sequential labeling.
More information
Accepted/In Press date: 16 February 2018
e-pub ahead of print date: 22 March 2018
Published date: December 2018
Keywords:
Information Extraction, Factor graph model, User profiling, Big data
Identifiers
Local EPrints ID: 419198
URI: http://eprints.soton.ac.uk/id/eprint/419198
ISSN: 1869-5450
PURE UUID: 74a21a07-c43e-42cc-b868-3f3e1c08c8da
Catalogue record
Date deposited: 09 Apr 2018 16:30
Last modified: 18 Mar 2024 02:31
Contributors
Author: Xiaotao Gu
Author: Hong Yang
Author: Jie Tang
Author: Fanjin Zhang
Author: Debing Liu
Author: Wendy Hall
Author: Fu Xiao