This blog post deals with the machine learning method used by John Eric Humphries in his working paper “The Causes and Consequences of Self-Employment over the Life Cycle”. Humphries uses an unsupervised learning method, cluster analysis, to summarize the patterns in life-cycle data. This post briefly describes the methods in use.
Humphries uses panel data on workers and firms from Sweden in order to find out what causes self-employment over the life cycle. The dataset contains 10,303 individuals. For each of them, the status of employment was recorded for each year from age 20 to 43. The following statuses have been used:
- Non-employment (NE)
- Paid Employment (PE)
- Unincorporated Self-Employment (SE)
- Incorporated Self-Employment (SE-I)
- School (SCH)
- Missing (no data available)
In figure 2 of the working paper, three example observations are displayed. It is reproduced here.
Moreover, the dataset contains information on the individuals’ social and academic background as well as on their cognitive ability, stress tolerance, BMI and so on.
Humphries basically does two things. First, he aims at finding typical life cycles by identifying distinct groups. In order to do so, he applies a clustering algorithm to the data. Second, he tries to detect correlations between the identified groups and other variables such as socio-economic background. Based on these findings, he then drafts a model that aims at explaining the causes and consequences of self-employment.
2. Cluster Analysis in a Nutshell
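A disjoint cluster analysis aims at partitioning individuals described by p variables (“cluster variables”) into disjoint clusters, so that individuals within the same cluster are as similar as possible and individuals in different clusters are as dissimilar as possible.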
There are also ways to implement a non-disjoint cluster analysis, i.e. allowing clusters to overlap. These methods are called clumping methods or fuzzy-cluster methods and will not be dealt with here. Among the disjoint cluster methods, there are hierarchical ones and optimality-based ones that use a target function. The hierarchical ones are based on the distance between clusters (which can be defined in various ways, cf. below) and are in turn separated into two groups: agglomerative methods and divisive methods. Divisive methods initially treat all individuals as one cluster and then divide it sequentially. Agglomerative methods do the opposite: they define each individual as a single cluster and then merge these clusters sequentially. Humphries uses a method from the more popular agglomerative class.
In disjoint hierarchical agglomerative methods, the clusters with the shortest distance to each other are merged first. Hence, the clusters can be merged in various ways depending on how …
- … the distance between individuals …
- … and the distance between clusters …
… is defined.
- Popular measures for the distance between individuals, if the data are metrically scaled, are the Euclidean distance or the Mahalanobis distance. Since the data used by Humphries are discrete-time sequences of nominally scaled states, he uses an alternative measure: the “Optimal Matching” (OM) approach. It calculates the least costly way to convert one string into another using substitutions, deletions and insertions. The distance from one string to another is then defined as the minimum cost needed to convert the first into the second. The resulting cost function can be modified by assigning different costs to substitutions, deletions and insertions. Humphries defines substitutions to be more costly than insertions and deletions in order to place more emphasis on sequencing and less emphasis on timing.
- After the first iteration, there are clusters that contain more than one individual. This raises the question of how to determine the distance between clusters. Highly popular are the single linkage and complete linkage approaches, which define the distance between two clusters as the shortest (single linkage) or longest (complete linkage) distance between any two individuals, one from each cluster. Without giving detailed reasons, Humphries uses another approach, the so-called Ward's method, which takes the distance between two clusters to be the increase in within-cluster variance that would result if the two clusters were merged. Formally, this increase is expressed in terms of the centers of the two clusters and the chosen measure of distance between individuals. Compared to single and complete linkage, Ward's method tends to produce clusters of similar size.
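The optimal matching distance described above can be sketched as a small dynamic program. This is only an illustration under stated assumptions: the function name `om_distance` and the default costs (2.0 for a substitution, 1.0 for an insertion or deletion) are hypothetical and are not taken from the working paper, which is only said to price substitutions higher than insertions and deletions.

```python
from typing import Sequence


def om_distance(a: Sequence[str], b: Sequence[str],
                sub_cost: float = 2.0, indel_cost: float = 1.0) -> float:
    """Minimum total cost of turning sequence `a` into sequence `b`.

    Allowed operations: substitute one state for another (sub_cost),
    insert a state or delete a state (indel_cost each).
    The default costs are illustrative assumptions only.
    """
    n, m = len(a), len(b)
    # prev[j]: cost of converting the first i-1 states of `a`
    # into the first j states of `b` (starting with i-1 = 0).
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            substitute = prev[j - 1] + (0.0 if same else sub_cost)
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[m]


# Two careers with the same ordering of states, shifted by one year.
early = ["SCH", "PE", "PE", "SE"]
late = ["SCH", "SCH", "PE", "PE"]
print(om_distance(late, early))                # 2.0: one deletion + one insertion
print(om_distance(late, early, sub_cost=0.5))  # 1.0: two cheap substitutions
```

With expensive substitutions, the cheapest alignment shifts one career by a year (one deletion, one insertion), so careers that pass through the same states in the same order stay close even if the timing differs. With cheap substitutions, the year-by-year comparison dominates instead.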
After the algorithm has been applied, a dendrogram helps to determine the right number of groups. In a dendrogram, one axis is usually labelled with an index of heterogeneity between the clusters (normally the measure of distance already applied) and the other axis is labelled with the clusters. Thus, it becomes visible at which iteration step the distance between the merged clusters jumped the most. (This check can also be built into the algorithm.) In most cases, the number of groups is set to the number of clusters present just before this step.
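The agglomerative procedure and the choice of the number of groups can be sketched in a few lines of plain Python. This is a minimal sketch, not the implementation used in the working paper: the function names are hypothetical, Ward's criterion is applied through the Lance-Williams update on squared distances (which is exact for Euclidean distances and only a heuristic generalization for other measures such as optimal matching), and the number of groups is picked by the largest jump in merge height.

```python
from itertools import combinations


def ward_agglomerate(dist):
    """Merge clusters step by step using Ward's criterion.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns a list of merges (members_a, members_b, height); heights
    are on the squared-distance scale.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    d2 = {frozenset(p): dist[p[0]][p[1]] ** 2
          for p in combinations(range(n), 2)}
    merges, next_id = [], n
    while len(clusters) > 1:
        pair = min(d2, key=d2.get)  # closest pair of clusters
        i, j = sorted(pair)
        height = d2.pop(pair)
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dik = d2.pop(frozenset((i, k)))
            djk = d2.pop(frozenset((j, k)))
            # Lance-Williams update for Ward's method
            d2[frozenset((next_id, k))] = (
                (ni + nk) * dik + (nj + nk) * djk - nk * height
            ) / (ni + nj + nk)
        merges.append((clusters[i], clusters[j], height))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


def clusters_at_largest_jump(merges, n):
    """Number of clusters present just before the largest jump in height."""
    heights = [h for _, _, h in merges]
    jumps = [heights[t + 1] - heights[t] for t in range(len(heights) - 1)]
    if not jumps:
        return n  # too few merges to compare
    t = max(range(len(jumps)), key=jumps.__getitem__)
    return n - (t + 1)  # t + 1 merges happened before the jump


def cut(merges, n, k):
    """Replay the first n - k merges and return the k resulting groups."""
    groups = {frozenset([i]) for i in range(n)}
    for a, b, _ in merges[: n - k]:
        groups -= {frozenset(a), frozenset(b)}
        groups.add(frozenset(a) | frozenset(b))
    return sorted(sorted(g) for g in groups)


# Toy example: four individuals on a line, forming two obvious pairs.
points = [0, 1, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
merges = ward_agglomerate(dist)
k = clusters_at_largest_jump(merges, len(points))
print(k, cut(merges, len(points), k))  # 2 [[0, 1], [2, 3]]
```

In practice, the `dist` matrix would hold the pairwise distances between individuals' sequences, and the merge heights are what a dendrogram displays on its distance axis.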
3. Humphries’ results
The algorithm clusters the individuals into seven distinct groups, as reported in the working paper’s Figure 3, which is reproduced here.
Based on these seven groups, the author finds that careers involving self-employment fit into a small number of economically distinct groups. Humphries detects various other factors that correlate with self-employment, such as cognitive and non-cognitive skills, prior work experience, cost of capital and other labor market opportunities. Guided by these descriptive findings, Humphries comes up with a model in which self-employment decisions depend on these factors. He uses the model to evaluate policies designed to promote self-employment. Humphries finds subsidies that incentivize self-employment to be generally ineffective, both in terms of promoting long-lasting firms and in terms of improving the welfare and earnings of self-employed persons.