Corresponding author: Irina E. Kalabikhina ( ikalabikhina@yandex.ru ) © 2020 Irina E. Kalabikhina, Evgeny P. Banin.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Kalabikhina IE, Banin EP (2020) Database “Pro-family (pronatalist) communities in the social network VKontakte”. Population and Economics 4(3): 98-130. https://doi.org/10.3897/popecon.4.e60915
The database contains an export of text comments from the social network VKontakte in .csv format (UTF-8 encoding). The comments were collected from communities discussing pregnancy, childhood, motherhood, and related topics. The export contains comments on posts that users interacted with. The absolute number of likes served as the selection criterion: comments were collected where the number of likes is greater than or equal to 5. The text data was pre-processed (stemming and lemmatization).
The data is suitable for thematic analysis (e.g. LDA – Latent Dirichlet Allocation), for modelling the graph structure of communities (the link_comment variable contains a unique post identifier, link_author contains a unique user identifier), for analysis of tonalities of statements and formation of a dictionary of demographic connotation in Russian. Analysis of the tonalities of statements enables measuring the dynamics of “demographic temperature” in pro-family (pronatalist) communities.
Database, big data, pronatalism, VKontakte, social networks, communities, family values
Database name: “Pro-family (pronatalist) communities in the social network VKontakte”
Copyright: I.E. Kalabikhina, E.P. Banin. The database is in the public domain and, under the Creative Commons Attribution license (CC BY 4.0), can be used, distributed and reproduced without limitation in any medium, subject to indication of the authors and the source.
Irina Kalabikhina, Evgeny Banin: Pro-family (pronatalist) communities in the social network VKontakte. Access mode: https://doi.org/10.5281/zenodo.4244361.
Data format: .csv (UTF-8 encoding).
Description: Data can be downloaded from an open source (Zenodo online depository), where the database of Pro-family (pronatalist) communities in the social network VKontakte is located.
Data file: vk_posts_stem_lemm.scv, 117.0 MB
Brief overview of literature and motivation of research. Two major projects with developed mood dictionaries are known in the Russian-language segment: RusentiLeX (
In the review work on applied analysis of tonalities, it is (
High coverage of the population determined the choice of the social network VKontakte as the source of textual data for analysis. Analysis of various sources led to the conclusion that models of tonality evaluation are insufficiently developed in the Russian-speaking segment as a whole, and that even existing works set limited tasks and do not move from the level of individual small communities to the regional or country level. Demographic topics from the field of reproductive behaviour are the most difficult for tonality analysis (compared to self-preservation or migratory behaviour). These circumstances led to the creation of a database for the Russian-language segment, using machine learning on text data from the social network VKontakte on topics in the field of reproductive behaviour.
Data collection methodology. This study attempts to test machine learning tools on text data obtained from the social network VKontakte. Unstructured text data was collected from communities, preliminary processing was carried out (cleaning, lemmatization, stemming and removal of punctuation), and a structured corpus of texts was formed. Thematic clusters were identified using Latent Dirichlet Allocation (LDA). After the thematic analysis, the tonality of texts was evaluated for each cluster, and the dynamics of tonality over time were constructed for comments (a publication on testing this model is in print).
A thematic (topic) model is a model of a collection of text documents that determines which topics each document refers to. In addition to revealing the structure of a text collection, thematic modelling allows for semantic information retrieval (as opposed to keyword search, where meaning is not explicitly represented).
TensorFlow and tflearn libraries are used for tonality analysis. Neural network training is carried out on a marked database of short messages from twitter (
Data sources. The source of text data is thematic communities in the social network VKontakte (vk.com). At the first stage, the built-in API (application programming interface) was used to collect the unique address numbers of thematic communities, in the form vk.com/, by keywords (“mom”, “mommies”, “kids”, “child”, “baby”, “health”, “birth”, “pregnancy”, “parents”). About 1,000 unique group addresses were collected at this stage, together with data on the number of participants. At the second stage, ad-related communities, communities with low member activity (assessed from the overall dynamics of the number of posts, likes and reposts) and communities with fewer than 10,000 subscribers were excluded from the sample.
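The second-stage selection can be illustrated with a minimal Python sketch. Only the 10,000-subscriber threshold comes from the text; the `Community` record, its field names, and the two boolean flags are hypothetical stand-ins for the authors' assessment of advertising content and member activity, whose exact criteria are not given.

```python
from dataclasses import dataclass

MIN_SUBSCRIBERS = 10_000  # threshold stated in the text


@dataclass
class Community:
    address: str         # community address (hypothetical field name)
    subscribers: int     # number of participants
    is_ad_related: bool  # hypothetical flag: community is mainly advertising
    is_active: bool      # hypothetical flag: posts/likes/reposts show activity


def select_communities(candidates):
    """Keep active, non-advertising communities with enough subscribers."""
    return [
        c for c in candidates
        if c.subscribers >= MIN_SUBSCRIBERS
        and not c.is_ad_related
        and c.is_active
    ]


# Invented example records, for illustration only.
candidates = [
    Community("group_a", 25_000, False, True),
    Community("group_b", 4_000, False, True),    # too few subscribers
    Community("group_c", 80_000, True, True),    # ad-related
    Community("group_d", 15_000, False, False),  # low activity
]
selected = select_communities(candidates)
print([c.address for c in selected])
```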
Once the final list of communities was formed, textual information was gathered from them. In this paper, the collected texts are limited to posts and comments. From the gathered information a text corpus was formed: all words were converted to lower case, stop words were removed using functions from the nltk or gensim library, punctuation was removed, and numerical data were excluded. To reduce the volume of text data, stemming (removal of suffixes) or lemmatization (reducing a word to its base form using the MyStem lemmatizer) was additionally carried out. We determined that the comment corpus is the most appropriate for assessing tonality.
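The cleaning steps described above (lower-casing, removal of punctuation and numbers, stop-word filtering) can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' pipeline: the stop-word set is a tiny hypothetical sample standing in for the nltk or gensim lists, and stemming or lemmatization with MyStem is deliberately left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration;
# the source reports using the lists shipped with nltk or gensim.
STOP_WORDS = {"и", "в", "на", "не", "что", "я", "с"}

# Runs of letters only: punctuation, digits and underscores are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]


sample = "Я родила в 2019 году, и это счастье!"
tokens = preprocess(sample)
print(tokens)
```

A stemmer or lemmatizer would then be applied to each remaining token to reduce the vocabulary further.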
For the sample structure and the list of major groups, see the database description in the international repository Zenodo: https://doi.org/10.5281/zenodo.4244361.
The data is suitable for thematic analysis (e.g. LDA - Latent Dirichlet Allocation), for modelling the graph structure of communities (the link_comment variable contains a unique post identifier, link_author contains a unique user identifier), for analysis of tonalities of statements and formation of a dictionary of demographic connotation in Russian.
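As a hedged illustration of the graph use case, the sketch below links users who commented on the same post. The column names link_comment and link_author are taken from the description above; the sample rows, the comma delimiter, and the absence of other columns are assumptions, since the full layout of the published file is not described here.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Invented sample rows; the real file has more columns and real identifiers.
sample_csv = io.StringIO(
    "link_comment,link_author\n"
    "post1,userA\n"
    "post1,userB\n"
    "post2,userA\n"
    "post2,userC\n"
    "post1,userC\n"
)

# Group the authors who commented on each post.
authors_by_post = defaultdict(set)
for row in csv.DictReader(sample_csv):
    authors_by_post[row["link_comment"]].add(row["link_author"])

# Weighted co-commenting graph: an edge joins two users who commented on
# the same post, and its weight counts the posts they share.
edge_weights = defaultdict(int)
for authors in authors_by_post.values():
    for a, b in combinations(sorted(authors), 2):
        edge_weights[(a, b)] += 1

for (a, b), weight in sorted(edge_weights.items()):
    print(a, b, weight)
```

The resulting edge list can be passed to any graph library for community detection or centrality analysis.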
Analysis of the tonalities of statements enables measuring the dynamics of “demographic temperature” in pro-family (pronatalist) communities. By demographic temperature we mean the emotional background or the predominance of positive or negative tonality of statements on topics related to family values, childbirth and other topics in the field of reproductive behaviour. Demographic temperature is measured as the difference between the number of positive and the number of negative statements over a certain period of time.
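The definition above can be made concrete with a short Python sketch that computes the temperature per calendar month. The input format (an ISO date string paired with a "positive" or "negative" label) and the sample values are assumptions for illustration; the tonality labels themselves would come from a separate classifier.

```python
from collections import Counter


def demographic_temperature(comments):
    """Return {month: positive count minus negative count}.

    `comments` is an iterable of (iso_date, label) pairs, where iso_date
    looks like "2020-01-15" and label is "positive" or "negative".
    """
    temperature = Counter()
    for iso_date, label in comments:
        month = iso_date[:7]  # "YYYY-MM"
        if label == "positive":
            temperature[month] += 1
        elif label == "negative":
            temperature[month] -= 1
    return dict(temperature)


# Invented example data, not taken from the database.
comments = [
    ("2020-01-03", "positive"),
    ("2020-01-15", "positive"),
    ("2020-01-20", "negative"),
    ("2020-02-02", "negative"),
    ("2020-02-10", "negative"),
    ("2020-02-11", "positive"),
]
monthly = demographic_temperature(comments)
print(monthly)
```

A positive value indicates a month in which positive statements predominate, a negative value the opposite.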
Let us emphasize once again that the demographic temperature in this case is measured in communities of people with pronatalist views, that is, with reproductive attitudes oriented towards creating a family and having children. In our view, measuring the demographic temperature in such communities best shows the state of the demographic climate as shaped by demographic and family policies and by economic trends, free of the influence of the population's composition by pronatalist or anti-natalist reproductive attitudes.
The presented database makes it possible to compare the demographic temperature across individual clusters of communities in social networks and to study the dynamics of positive and negative comments by women and men on demographic topics in the areas of childbirth, parenthood and family values.
In the next versions, we plan to add anti-natalist groups and more tonalities (besides positive and negative ones).
The demographic temperature in the investigated pronatalist communities (the difference between positive and negative comments) is shown in Figure
Contribution to the creation and development of the database. Idea and concept of the database, based on the developed range of its applications in the demographic analysis of fertility, reproductive behaviour, the population's response to population policy and other factors of reproductive behaviour: I.E. Kalabikhina, Doctor of Sciences (Economics). Methods of database creation and creation of its first version: E.P. Banin. The database was created within the framework of an internal grant of the Faculty of Economics of Lomonosov Moscow State University. The authors thank their project colleagues for assistance in formulating thematic words and phrases and in searching for samples of pronatalist groups in the social network: Abduselimova I.A., Arkhangelskiy V.N., Klimenko G.V., Kolotusha A.V., Nikolaeva U.G., Shamsutdinova V.Sh.
Figure. Distribution of the number of comments in the corpus by month. Red denotes negative comments, green denotes positive ones. Source: compiled by the authors in Tableau 19.3 from data gathered from the social network VKontakte.
Irina Evgenievna Kalabikhina, Doctor of Sciences (Economics), Professor, Head of the Population Department, Faculty of Economics, Lomonosov Moscow State University. E-mail: kalabikhina@econ.msu.ru
Evgeny Petrovich Banin, Research Engineer, Bauman Moscow State Technical University, Research Center “Kurchatov Institute”; PhD student, Bauman Moscow State Technical University. E-mail: evg.banin@gmail.com