Database “Pro-family (pronatalist) communities in the social network VKontakte”

The database contains text comments exported from the social network VKontakte in .csv format (UTF-8 encoding). The comments were collected from communities that discuss pregnancy, childhood, motherhood and related topics. The export contains comments on posts that attracted interaction; the absolute number of likes served as the criterion (only comments with 5 or more likes were collected). The text data were pre-processed (stemming and lemmatization). The data are suitable for topic modelling (e.g. LDA, Latent Dirichlet Allocation), for modelling the graph structure of communities (the link_comment variable contains a unique post identifier, and link_author contains a unique user identifier), for sentiment (tonality) analysis of statements, and for building a dictionary of demographic connotations in Russian. Sentiment analysis of statements makes it possible to measure the dynamics of the “demographic temperature” in pro-family (pronatalist) communities.

Access mode: https://doi.org/10.5281/zenodo.4244361. Data format: .csv. Description: The data can be downloaded from an open source (the Zenodo online repository), where the database of pro-family (pronatalist) communities in the social network VKontakte is hosted. Data file: vk_posts_stem_lemm.csv, 117.0 MB.
Brief overview of literature and motivation of research. Two major projects with developed sentiment dictionaries are known in the Russian-language segment: RuSentiLex (Loukachevitch and Levchik 2016) and LINIS Crowd (Koltsova, Alexeeva and Kolcov 2016). Both projects provide dictionaries that assign a sentiment score (from positive to negative) to each word or word combination, without characterizing its emotional colouring. More complex sentiment models are offered by the SentiWordNet (Baccianella, Esuli and Sebastiani 2010) and SenticNet (Cambria et al. 2016, 2018) projects, which are based on sentiment analysis of English. At present, deep neural network techniques are being developed most actively, and they demonstrate (Tang, Qin and Liu 2015; Tang and Zhang 2018) the best current results in sentiment assessment compared with other approaches. Sentiment analysis methods based on machine learning require pre-training on large sets of labelled texts. There have also been attempts to combine the two approaches (rule-based and machine learning), for example Meškele and Frasincar (2019) and Kumar et al. (2020).
In a review of applied sentiment analysis, Smetanin (2020) notes that this area is insufficiently studied for the Russian language (the author identifies the 27 most relevant research papers on sentiment analysis in Russian). Much of the research focuses on the sentiment of tweets (short posts) on the social network Twitter, which had about 650,000 active users in Russia in 2019. The social network VKontakte has the greatest coverage of the Russian-speaking population: according to a report by the consultancy Deloitte (Zemlyanskaya et al. 2018), it covers up to 70% of the Russian population.
This high coverage determined the choice of VKontakte as the source of textual data for analysis. Our analysis of various sources led to the conclusion that sentiment evaluation models are insufficiently developed in the Russian-language segment as a whole, and that even existing works set limited tasks and do not move from the level of individual small communities to the regional or country level. Among demographic topics, those from the field of reproductive behaviour are the most difficult for sentiment analysis (compared with self-preservation or migratory behaviour). These circumstances led to the creation of a Russian-language database, built using machine learning on text data from VKontakte, on topics in the field of reproductive behaviour.
Data collection methodology. This study attempts to test machine learning tools on text data obtained from the social network VKontakte. Unstructured text data were collected from communities and pre-processed (cleaning, lemmatization, stemming and removal of punctuation), and a structured array (corpus) of texts was formed. Thematic clusters were identified using Latent Dirichlet Allocation (LDA). After the topic analysis, the sentiment of the texts was evaluated for each cluster, and the dynamics of sentiment over time were constructed for the comments (a publication on testing this model is in press).
A topic model is a model of a collection of text documents that determines which topics each document belongs to. In addition to revealing the structure of the text collection, topic modelling allows for semantic information retrieval (as opposed to keyword search, where meaning is not explicitly represented).
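To make the idea of a topic model concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is an illustration only: the text does not state which LDA implementation or inference method was used, and the function names, the hyperparameters (alpha, beta, iters) and the toy English tokens are assumptions introduced here.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: tokens per (document, topic), per (topic, word), per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequent words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


if __name__ == "__main__":
    # Toy corpus of already pre-processed comments (hypothetical tokens).
    toy_docs = [
        ["baby", "sleep", "baby", "night"],
        ["birth", "pregnancy", "doctor"],
        ["baby", "night", "sleep"],
        ["pregnancy", "doctor", "birth", "doctor"],
    ]
    doc_topic, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2)
    print(top_words(topic_word, vocab))
```

The rows of doc_topic give, for each document, how many of its tokens were assigned to each topic, which is the sense in which the model determines which topics a document belongs to. For a corpus of this database's size, an optimized library implementation would be the practical choice.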
The TensorFlow and tflearn libraries are used for sentiment analysis. The neural network is trained on a labelled corpus of short messages from Twitter (Rubtsova 2015). Training is performed in the Google Colab environment using a graphics processing unit (GPU). About 24 GB of RAM is used to train the network, with a training dictionary of 5,000 words. Before training, the data were stemmed (reduced to the base form of the word) and all non-Cyrillic characters were removed from the sample. The test sample is 30% of the entire sample. The number of training epochs is 30. The resulting accuracy is 93.4% on the training sample and 69% on the test sample. The probability threshold for classifying a comment as positive or negative is 0.5.
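Two of the settings above, the 30% test share and the 0.5 probability threshold, can be sketched in a few lines of plain Python. This is not the training code itself: the function names are hypothetical, the shuffling seed is arbitrary, and treating a probability of exactly 0.5 as positive is an assumption, since the text does not say how ties are resolved.

```python
import random


def train_test_split(samples, test_share=0.3, seed=42):
    """Shuffle the samples and hold out test_share of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def label_from_probability(p_positive, threshold=0.5):
    """Map the predicted probability of the positive class to a label.

    Assumption: a probability equal to the threshold counts as positive.
    """
    return "positive" if p_positive >= threshold else "negative"


if __name__ == "__main__":
    train, test = train_test_split(range(10))
    print(len(train), len(test))
    print(label_from_probability(0.73), label_from_probability(0.12))
```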
Data sources. The source of text data is thematic communities in the social network VKontakte (vk.com). In the first stage, the built-in API (application programming interface) was used to collect the unique addresses of thematic communities, in the form vk.com/, by keywords ("mom", "mommies", "kids", "child", "baby", "health", "birth", "pregnancy", "parents"). About 1,000 unique group addresses were collected at this stage, together with data on the number of participants. In the second stage, ad-related communities, communities with low member activity (assessed by the overall dynamics of the number of posts, likes and reposts) and communities with fewer than 10,000 subscribers were excluded from the sample.

Information about the sample
• Only comments with 5 or more likes were gathered.
• Comments were gathered only from communities (list of communities below) that discuss issues related to childhood, motherhood, pregnancy, etc.
• The communities in the sample have on average 309 thousand subscribers (maximum 1,482,303; minimum 72,570; total number of subscribers excluding intersections 11,743,295).
• The comment sample contains a total of 112,900 user comments.

Following the formation of the final list of communities, textual information was gathered from the communities. In this paper, the collected texts are limited to posts and comments. Based on the information gathered, a text corpus was formed: all words were converted to lower case, stop words were removed using functions from the nltk or gensim library, punctuation was removed, and numerical data were excluded. To reduce the volume of text data, stemming (removal of suffixes) or lemmatization (reducing each word to its initial form using the MyStem lemmatizer) was additionally carried out. We determined that the comment corpus is the most appropriate for assessing sentiment.
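The selection and cleaning steps described above can be approximated with the standard library alone. The sketch below is an illustration under several assumptions: the column name "likes" is hypothetical, the stop-word set is a tiny illustrative subset rather than the nltk or gensim list, and stemming and lemmatization are left out because they require external tools.

```python
import re

# Illustrative subset only; the full list came from nltk or gensim.
STOP_WORDS = {"и", "в", "на", "не", "что", "это", "я", "с"}

# Runs of letters in any script: drops punctuation, digits and underscores.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def keep_liked(comments, min_likes=5):
    """Keep comments with at least min_likes likes ("likes" is a hypothetical column)."""
    return [c for c in comments if int(c["likes"]) >= min_likes]


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation and numbers, remove stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    print(preprocess("Я не знала, что в 2019 году это возможно!"))
```

Lower-casing before tokenization keeps the stop-word lookup simple, since the stop-word set only needs lower-case forms.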
For the sample structure and the list of major groups, see the database description in the Zenodo repository: https://doi.org/10.5281/zenodo.4244361.
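Once the data file has been downloaded from the repository, it can be read row by row with Python's csv module. The sketch below assumes a comma-delimited file with a header row; only the UTF-8 encoding and the link_comment and link_author columns are stated in the description, so the delimiter, the local path and the function name are assumptions.

```python
import csv


def iter_comments(path):
    """Yield one dict per comment row from a UTF-8 .csv file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        # Assumption: comma-delimited; pass delimiter=";" if the file differs.
        for row in csv.DictReader(f):
            yield row


if __name__ == "__main__":
    import sys

    # Usage: python read_comments.py <path to the downloaded .csv file>
    for n, row in enumerate(iter_comments(sys.argv[1])):
        print(row)
        if n == 4:  # show only the first five rows
            break
```

Iterating instead of loading the whole file keeps memory use low, which matters for a file of about 117 MB.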

Database Application Areas
The data are suitable for topic modelling (e.g. LDA, Latent Dirichlet Allocation), for modelling the graph structure of communities (the link_comment variable contains a unique post identifier, and link_author contains a unique user identifier), for sentiment analysis of statements, and for building a dictionary of demographic connotations in Russian.
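One possible way to model the graph structure is sketched below: the link_comment and link_author identifiers are treated as a bipartite post-author graph, which is then projected onto authors who commented on the same post. The projection and the function names are illustrative choices, not a method prescribed by the database.

```python
from collections import defaultdict
from itertools import combinations


def build_post_author_graph(rows):
    """Map each post identifier to the set of users who commented on it."""
    authors_by_post = defaultdict(set)
    for row in rows:
        authors_by_post[row["link_comment"]].add(row["link_author"])
    return dict(authors_by_post)


def co_commenting_edges(authors_by_post):
    """Weight each pair of users by the number of posts they both commented on."""
    weights = defaultdict(int)
    for authors in authors_by_post.values():
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return dict(weights)


if __name__ == "__main__":
    # Hypothetical identifiers for illustration.
    rows = [
        {"link_comment": "post1", "link_author": "user1"},
        {"link_comment": "post1", "link_author": "user2"},
        {"link_comment": "post2", "link_author": "user1"},
        {"link_comment": "post2", "link_author": "user2"},
        {"link_comment": "post2", "link_author": "user3"},
    ]
    print(co_commenting_edges(build_post_author_graph(rows)))
```

The resulting weighted edge list can be passed to a graph library for community detection or centrality measures.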
Sentiment analysis of statements makes it possible to measure the dynamics of the "demographic temperature" in pro-family (pronatalist) communities. By demographic temperature we mean the emotional background, that is, the predominance of positive or negative sentiment in statements on family values, childbirth and other topics in the field of reproductive behaviour. Demographic temperature is measured as the difference between the number of positive and the number of negative statements over a certain period of time.
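The definition translates directly into a count per period. The sketch below assumes that each comment has already been assigned a month key and a label of "positive" or "negative"; the input format and the function name are hypothetical.

```python
from collections import defaultdict


def demographic_temperature(labelled_comments):
    """Return, per period, the number of positive minus negative comments.

    labelled_comments is an iterable of (period, label) pairs, where label
    is "positive" or "negative" (an assumed input format).
    """
    temperature = defaultdict(int)
    for period, label in labelled_comments:
        if label == "positive":
            temperature[period] += 1
        elif label == "negative":
            temperature[period] -= 1
    return dict(sorted(temperature.items()))


if __name__ == "__main__":
    sample = [
        ("2016-11", "positive"),
        ("2016-11", "negative"),
        ("2016-11", "negative"),
        ("2016-12", "positive"),
    ]
    print(demographic_temperature(sample))
```

A negative value for a period means that negative comments outnumbered positive ones in that period.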
We emphasize once again that the demographic temperature in this case is measured in communities of people with pronatalist views, that is, with reproductive attitudes towards creating a family and having children. In our view, measuring the demographic temperature in such communities best shows the state of the demographic climate as shaped by demographic and family policies and economic trends (while being purged of the influence of the population structure with respect to pronatalist or anti-natalist reproductive attitudes).
The database makes it possible to compare the demographic temperature across individual clusters of communities in social networks and to study the dynamics of positive and negative comments by women and men on demographic topics in the areas of childbirth, parenthood and family values.
In future versions, we plan to add anti-natalist groups and more sentiment categories (besides positive and negative).

Example of demographic temperature measurement using the database described
Figure 1 shows the distribution of comments by month since the beginning of 2014, taking into account the author's gender. Women show the greatest activity in the pronatalist groups represented. A surge in activity has been observed since June 2017. The demographic temperature in the investigated pronatalist communities (the difference between positive and negative comments) is shown in Figure 2. Since the end of 2016 the number of negative comments has increased significantly, and the demographic temperature has become negative. In the recent history of Russia, the total fertility rate reached its maximum in 2015, at 1.78 children per woman of the conditional generation, after which the birth rate declined, reaching 1.50 in 2019 (and, by our estimate, 1.4 in 2020).