Open source: BDCI2018 top-1 solution and code for the smart package personalized matching model for existing users in the telecommunications industry...

With the authors' consent, I am publishing the top-1 solution and code for the BDCI2018 data competition "Smart package personalized matching model for existing users in the telecom industry". The solution uses existing user attributes (such as basic personal information and user profile information), terminal attributes (such as terminal brand), business attributes, and consumption habits and preferences to match each user with the most suitable package, push it to the user, and support follow-up personalized service.

(Final rank: 1/2546)

Introduction

1. Competition title

Smart package personalized matching model for existing users in the telecom industry

2. Competition URL:

www.datafountain.cn/competition...

3. Background of the contest

Organizers: China Computer Federation (CCF) & China Unicom Research Institute

As one of the country's basic industries, telecommunications has wide coverage and a large user base, and plays an important role in supporting national development. With the rapid development and spread of Internet technology, users' data consumption has grown explosively. In recent years, telecom operators have launched a large number of packages to meet users' differing needs. Faced with so many packages, choosing the most suitable one matters to both operators and users, especially as growth in the telecom market slows and competition for existing users intensifies. To address the personalized recommendation of telecom packages, a recommendation model based on users' consumption behavior is built with data mining techniques. Based on user behavior profiles, users' consumption habits and preferences are analyzed to match each user with the most suitable package, improving user experience and stimulating user demand, and thereby increasing user value.

Personalized package recommendation helps users find a suitable package amid information overload, and also pushes suitable package information to them. It addresses two problems: information overload and aimless searching. The package catalogue serves users who search actively with a clear goal, while personalized recommendation helps users discover new content of interest when they have no clear goal.

4. Challenge task

The task is to use existing user attributes (such as basic personal information and user profile information), terminal attributes (such as terminal brand), business attributes, and consumption habits and preferences to match each user with the most suitable package, push it to the user, and support follow-up personalized service.

For task details, see the competition website:

www.datafountain.cn/competition...

Team introduction

Cui Shiwen: data engineer at a BAT company, Kaggle Grandmaster

Bao Mengjiao: master's degree from a 985 university, Kaggle Master

Huang Zhongshan: deputy chief designer at an aerospace research institute, Kaggle Master

Pan Feiyang: PhD, former member of a gifted-youth class, Kaggle Master

Tang Wei: master's degree from a 985 university, offer from a BAT company

Team score: 1/2546 (final ranking)

Competition solution

1. Data exploration and evaluation metric

Figure 1: Package categories and their counts

This is a multi-class classification problem with 11 package categories whose distribution is quite imbalanced. As is typical for a multi-class task, the evaluation metric is macro-F1:

1. For each package category $k$, count $TP_k$ (samples of this category predicted correctly), $FP_k$ (samples of other categories wrongly predicted as this category), and $FN_k$ (samples of this category wrongly predicted as other categories).

2. From the counts in step 1, compute the precision and recall of each category:

$$P_k = \frac{TP_k}{TP_k + FP_k}, \qquad R_k = \frac{TP_k}{TP_k + FN_k}$$

3. From the results of step 2, compute the F1-score of each category:

$$F1_k = \frac{2 \, P_k \, R_k}{P_k + R_k}$$

4. Average the per-category F1-scores from step 3 over the $n$ categories to obtain the final evaluation result:

$$\text{macro-F1} = \frac{1}{n} \sum_{k=1}^{n} F1_k$$
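The metric described above can be sketched in a few lines of plain Python. This is a minimal illustration of macro-F1 as the text defines it (per-category TP/FP/FN, then precision, recall, F1, then an unweighted mean), not the organizers' scoring script; the toy labels are made up.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but the true class differs
            fn[t] += 1  # class t was missed
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, for illustration only.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

Because every category carries the same weight, a rare package that is predicted badly pulls the score down as much as a common one, which is why the class imbalance noted above matters.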

The following charts show how the original features relate to the labels:

Figure 2: Relationship between service_type and category

There is an obvious pattern: as Figure 2 shows, service_type splits the packages into two non-overlapping groups, one with 8 packages and the other with 3. This suggested predicting with a separate model for each group, and in the final experiments this did bring an improvement.
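A quick way to confirm this kind of split is to collect the set of labels seen under each service_type value and check that the sets do not overlap. The sketch below assumes rows are plain dictionaries; the key name `label`, the service_type codes and the label names are hypothetical placeholders, not values from the competition data.

```python
from collections import defaultdict

def labels_by_group(rows, group_key="service_type", label_key="label"):
    """Collect the set of labels seen for each value of group_key."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[group_key]].add(row[label_key])
    return dict(groups)

def groups_are_disjoint(groups):
    """True if no label appears under more than one group."""
    seen = set()
    for labels in groups.values():
        if seen & labels:
            return False
        seen |= labels
    return True

# Toy rows; the service_type codes and label names are made up.
rows = [
    {"service_type": 1, "label": "A"},
    {"service_type": 1, "label": "B"},
    {"service_type": 4, "label": "C"},
    {"service_type": 4, "label": "D"},
    {"service_type": 4, "label": "C"},
]
groups = labels_by_group(rows)
print(groups_are_disjoint(groups))  # True for this toy data
```

When the check passes, each group can be given its own classifier that only has to choose among the labels of that group.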

Figure 3: Distribution of age

Age basically matches the distribution of telecom users, but there are many outliers with an age of 0. For these outliers we tried both replacing them with the mean age of the corresponding service_type and keeping the original value. We finally kept the original value, because we believe that users with the default age are distributed differently across packages.

Figure 3: Distribution of gender

We observed a default value of 0 in gender. We tried two ways of handling it: filling it with the mode of gender for the corresponding service_type, and keeping the original value. We finally kept the original value, because we believe that users with the default gender are distributed differently across packages.

We also checked whether the training set and the test set follow the same distribution:
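The first of the two options (filling the default with the mode of the matching service_type group) can be sketched as follows. This is an illustrative helper written for this post, not the team's code; rows are assumed to be dictionaries and the gender codes in the toy data are made up. The second option, which the team finally chose, is simply to leave the column untouched.

```python
from collections import Counter, defaultdict

def fill_default_with_group_mode(rows, field, group_key="service_type", default=0):
    """Return copies of rows where `default` in `field` is replaced by the most
    common non-default value among rows with the same group_key."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[field] != default:
            counts[row[group_key]][row[field]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)
        group_counts = counts[row[group_key]]
        if row[field] == default and group_counts:
            new_row[field] = group_counts.most_common(1)[0][0]
        filled.append(new_row)
    return filled

# Toy data; 0 is the default (missing) gender, 1 and 2 are made-up codes.
rows = [
    {"service_type": 1, "gender": 1},
    {"service_type": 1, "gender": 1},
    {"service_type": 1, "gender": 2},
    {"service_type": 1, "gender": 0},
    {"service_type": 4, "gender": 2},
    {"service_type": 4, "gender": 0},
]
filled = fill_default_with_group_mode(rows, "gender")
print([r["gender"] for r in filled])  # [1, 1, 2, 1, 2, 2]
```

Keeping the raw 0 lets a tree model treat "value missing" as a signal of its own, which matches the reasoning given above for preferring the original value.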

Figure 4: Comparison of the training and test set distributions

We observed that the gender field is distributed differently in the two sets, and we applied a normalization step to remove the difference, which was caused by inconsistent data cleaning.
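One simple way to quantify such a mismatch for a categorical field is to compare the share of each value in the two sets. The sketch below uses the total variation distance for that; it is an assumption about how the check could be done, not the team's procedure, and the post does not say which normalization was applied afterwards. The toy values are made up.

```python
from collections import Counter

def value_shares(values):
    """Map each distinct value to its share of the column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def total_variation(train_values, test_values):
    """Half the L1 distance between the two share vectors
    (0 means identical shares, 1 means no value in common)."""
    p, q = value_shares(train_values), value_shares(test_values)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))

# Toy gender columns, for illustration only.
train_gender = [1, 1, 2, 0]
test_gender = [1, 2, 2, 0]
print(total_variation(train_gender, test_gender))  # 0.25
```

Running the same comparison on every categorical column gives a quick list of fields whose encoding or cleaning differs between the two files.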

2. Feature Engineering

1) Association rules

An association rule is an implication of the form X --> Y, where X is called the antecedent (left-hand side, LHS) and Y the consequent (right-hand side, RHS) of the rule. Each rule X --> Y has a support and a confidence.
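For reference, the two measures of a rule X --> Y can be computed as below: support is the fraction of transactions that contain both X and Y, and confidence is that support divided by the support of X alone. This is a minimal sketch of the standard definitions; the toy transactions are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(X and Y) / support(X) for the rule X --> Y.
    Assumes the antecedent occurs in at least one transaction."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transactions, for illustration only.
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"milk"},
]
print(support(transactions, {"diapers", "beer"}))       # 0.5
print(confidence(transactions, {"diapers"}, {"beer"}))  # 0.5 / 0.75
```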

A well-known story tells of a supermarket that sold diapers and beer together, an odd move that nevertheless raised the sales of both. It is usually told not as a joke but as a real case from Wal-Mart stores in the United States, and merchants still talk about it. Inspired by max-encoding, we built association rules over the four related bill fields 1_total_fee, 2_total_fee, 3_total_fee and 4_total_fee. This lets us use global information and gives a reduced-dimension encoding of these strong fee features, making the data more representative. The feature also had a clearly positive effect on the result. We extracted the most frequent 2-itemsets of fee values, and used the fee value paired with the January fee as the feature.

In fact there is a better and more general approach for pairwise data: you can combine graph embedding methods such as DeepWalk, node2vec, LINE and SDNE.
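One possible reading of the frequent 2-itemset feature is sketched below: treat each user's four monthly fees as a transaction, count how often each pair of distinct fee values occurs together, and encode a user by the value that most often accompanies their 1_total_fee. This is an interpretation of the description, not the team's implementation; only the four field names come from the post, and the toy fee values are made up.

```python
from collections import Counter
from itertools import combinations

FEE_FIELDS = ["1_total_fee", "2_total_fee", "3_total_fee", "4_total_fee"]

def most_frequent_partner(rows):
    """For each fee value, find the other fee value it co-occurs with most often
    across users (the most frequent 2-itemset that contains it)."""
    pair_counts = Counter()
    for row in rows:
        values = sorted({row[f] for f in FEE_FIELDS})
        pair_counts.update(combinations(values, 2))
    best = {}
    for (a, b), n in pair_counts.items():
        for value, partner in ((a, b), (b, a)):
            if value not in best or n > best[value][1]:
                best[value] = (partner, n)
    return {value: partner for value, (partner, _) in best.items()}

def encode(rows):
    """Encode each row by the most frequent partner of its 1_total_fee;
    fall back to the fee itself when it never appears in a pair."""
    partner = most_frequent_partner(rows)
    return [partner.get(row["1_total_fee"], row["1_total_fee"]) for row in rows]

# Toy rows, for illustration only.
rows = [
    dict(zip(FEE_FIELDS, [16, 16, 16, 36])),
    dict(zip(FEE_FIELDS, [16, 36, 36, 36])),
    dict(zip(FEE_FIELDS, [16, 56, 56, 56])),
    dict(zip(FEE_FIELDS, [76, 76, 76, 76])),
]
print(encode(rows))  # [36, 36, 36, 76]
```

The pair counts are computed over the whole data set, so the encoded value carries global information that a single row's four fees do not.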

2) Business features

For business features, we studied China Unicom's package consumption scenarios in depth. From the start of the competition we researched Unicom's official website and consumer forums to understand the features of its packages and the differences between their user groups. Knowing the packages well makes it possible to tell which package suits which user group. For example, the Tencent Tianwang card, with its benefits for Tencent games and Tencent Video, is a boon for heavy Tencent users. The Ant Dabao card comes with 2 GB of general-purpose data, which suits heavy data users. The promotions on Unicom's traditional packages, such as pre-paid top-ups with 100% bonus data and call credit, suit users who regularly run short of data and call allowance, while top-up bill rebates suit users with high bills. Accordingly, we built a series of features from each user's data usage and calls, such as ratios, differences and sums, to describe the user as fully as possible.

We proposed the following business features:

1. Whether the bill minus 16 yuan is a whole number

2. Whether the significant digits of the data usage are a multiple of 27

3. Whether the significant digits of the bill are divisible by 15

4. Whether the bill is a whole number (the user probably did not exceed the package allowance)

5. Whether the difference between the bills of two consecutive months is divisible by billing units such as 5, 10, 15, 27 and 30

6. The minimum bill over the four months

7. The average unit price of data

8. The average unit price of call time

9. And so on...
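A few of the listed features (items 4 to 7) can be sketched as below. This is an illustrative reconstruction, not the team's feature code: the four fee field names come from the post, `month_traffic` is a hypothetical column name for monthly data usage, and the remaining items are left out because their exact definition is not clear from the description.

```python
FEE_FIELDS = ["1_total_fee", "2_total_fee", "3_total_fee", "4_total_fee"]
BILLING_UNITS = [5, 10, 15, 27, 30]

def is_whole(x, tol=1e-6):
    """True if x is within tol of an integer (fees are stored as floats)."""
    return abs(x - round(x)) < tol

def is_multiple(x, unit, tol=1e-6):
    """True if x is within tol of an integer multiple of unit."""
    return is_whole(x / unit, tol)

def business_features(row):
    fees = [row[f] for f in FEE_FIELDS]
    feats = {
        "fee_is_whole": is_whole(fees[0]),   # item 4
        "min_fee_4_months": min(fees),       # item 6
    }
    diff = fees[0] - fees[1]                 # item 5: two consecutive months
    for unit in BILLING_UNITS:
        feats[f"fee_diff_multiple_of_{unit}"] = is_multiple(diff, unit)
    # Item 7: average price per unit of data ("month_traffic" is hypothetical).
    traffic = row.get("month_traffic", 0)
    feats["fee_per_traffic"] = fees[0] / traffic if traffic else None
    return feats

# Toy row, for illustration only.
row = {"1_total_fee": 76.0, "2_total_fee": 46.0, "3_total_fee": 76.0,
       "4_total_fee": 106.5, "month_traffic": 1900.0}
feats = business_features(row)
print(feats["fee_is_whole"], feats["min_fee_4_months"])  # True 46.0
```

The tolerance in `is_whole` is there because fees read from CSV are floats, so an exact `x % 1 == 0` test can fail on values that are whole for billing purposes.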

3) Guided learning

In machine learning, when same-source data comes from different stages, there are usually two ways to use it, transfer learning and direct concatenation, so that data with a different distribution can still improve the result. Transfer learning feeds the predicted probabilities in as features of the target model, which keeps the distribution consistent but does not make full use of the data. Concatenation makes full use of the data, but the result no longer follows a single distribution. To address this, we propose a method called "guided learning".

Transfer learning for structured data: train a model on data with a different distribution, and use its predicted probabilities as encoded features for the new task. Unlike general transfer learning, this variant emphasizes same-source data whose distribution has shifted because of time or sampling. Because it was first used in competitions by a plant taxonomist, it is also jokingly called "grafting learning".

During the competition we compared several ways of transferring data. The first was transfer learning from all of the preliminary data, but the gain was not significant. We then considered training on the preliminary and semi-final data together, which acts as a Bayesian regularization term, and this worked very well. We looked more closely at how to combine the extra data with the semi-final data, as follows:

   1. Concatenate the preliminary and semi-final data directly

   2. After concatenation, add a feature column that marks whether a row is semi-final data

   3. After concatenation, train separate models for the internet packages and the traditional packages

The results show that when the predictions are used directly, the second option is best, because the marker feature lets the preliminary and semi-final data keep their own distributions within one model. When the output is used for transfer, the first option is best, because that model focuses on the data itself and does not account for the two sets being trained together. The experimental results are shown in the following table:

Table 2: Comparison of various transfer learning methods

| Transfer method | Leaderboard score |
| --- | --- |
| Direct concatenation | 0.8327 |
| Concatenation with marker feature | 0.8397 |
| Concatenation with separate training | 0.8385 |

Here the validation set contains only data with the same distribution as the target, which is another source of Bayesian regularization. We call this way of adding a Bayesian regularization term guided learning, and it had a significant effect on our competition result.
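The two ingredients described in this section, a marker column on the concatenated data and a validation set drawn only from the target (semi-final) data, can be sketched as follows. This is a minimal illustration under assumptions: the column name `is_semifinal`, the helper names and the simple strided folds are all hypothetical, and the team's actual fold scheme is not described in the post.

```python
def concat_with_flag(prelim_rows, semi_rows):
    """Stack both data sets and add a column marking the semi-final rows."""
    combined = [dict(r, is_semifinal=0) for r in prelim_rows]
    combined += [dict(r, is_semifinal=1) for r in semi_rows]
    return combined

def target_only_folds(rows, n_folds=5):
    """Yield (train_idx, valid_idx) pairs in which validation rows come only
    from the semi-final data; preliminary rows are always used for training."""
    prelim_idx = [i for i, r in enumerate(rows) if r["is_semifinal"] == 0]
    semi_idx = [i for i, r in enumerate(rows) if r["is_semifinal"] == 1]
    for k in range(n_folds):
        valid_idx = semi_idx[k::n_folds]
        valid_set = set(valid_idx)
        train_idx = prelim_idx + [i for i in semi_idx if i not in valid_set]
        yield train_idx, valid_idx

# Toy rows, for illustration only.
prelim = [{"1_total_fee": 16.0} for _ in range(4)]
semi = [{"1_total_fee": 36.0} for _ in range(6)]
rows = concat_with_flag(prelim, semi)
for train_idx, valid_idx in target_only_folds(rows, n_folds=3):
    print(len(train_idx), valid_idx)
```

In a real pipeline the semi-final indices would normally be shuffled or stratified by label before being split into folds, and each (train_idx, valid_idx) pair would be passed to the model's training call with early stopping on the validation part.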

 

Reproduction method

1. Code and data set

The code and data set can be downloaded from Baidu Cloud:

Link: pan.baidu.com/s/1HurD8IU-...

Extraction code: m1yg 

If the link has been taken down, reply "Unicom Package" to get a new URL.

Remarks:

The author has published the code on GitHub:

github.com/PPshrimpGo/...

After discussing with the author, I added to Baidu Cloud the data set, which is not available on GitHub, and a simple executable version for Windows. Downloading from Baidu Cloud is recommended.

2. Reproduction method

Reproduction environment: Ubuntu 16.04, Python 3.6, lightgbm 2.1.1, numpy 1.14.3, pandas 0.23.1, scikit_learn 0.19.1, xgboost 0.72.1

  • Copy train_all.csv from the preliminary-round training set folder to the input folder and rename it to train_old.csv; copy train_2.csv and test_2.csv from the semi-final training set to the input folder and rename them to train.csv and test.csv

  • Run the command: pip3 install -r requirements.txt 

  • chmod +x run_top1.sh

  • chmod +x run_perfect.sh

  • ./run_top1.sh (reproduces the first-place leaderboard result; on dual Xeon 2673 CPUs the run took 8 hours and 47 minutes)

  • ./run_perfect.sh (exact reproduction; run this step only if you have time, since it takes 24 hours; running it is not recommended)

    Reproduction on Windows:

  • Follow the commands in run_top1.sh step by step, which is fairly tedious. For learning purposes only, the author provides a simplified version that runs with one command (use the code in the Unicom baseline folder, which includes the data set):

  • python baseline.py

    About two hours of running time reproduces results close to the competition result.

Summary

This solution uses existing user attributes (such as basic personal information and user profile information), terminal attributes (such as terminal brand), business attributes, and consumption habits and preferences to match each user with the most suitable package, push it to the user, and support follow-up personalized service. It is not limited to China Unicom: with appropriate modifications it also suits other telecom operators.

Sharing is a virtue. Thanks to Cui Shiwen's team for open-sourcing their work. We hope that more machine learning enthusiasts will open-source their own code to help beginners.
