Chris: 00:01 Hello and welcome to another episode of the TPI podcast, Two Think Minimum. Today we’re going to continue our conversation with guests and panelists from our recent conference on artificial intelligence, entitled “Terminator or the Jetsons: The Economics and Policy Implications of Artificial Intelligence.” Today we’re speaking with Emilio Colombo, Professor of Economics at the Catholic University in Milan, who will talk about “Applying Machine Learning Tools on Web Vacancies for Labour Markets and Skill Analysis.” I did not come up with that on my own; that was the name of the paper that he presented at the AI conference. We will be joined as well by Sarah Oh, research fellow at the Technology Policy Institute, and Scott Wallsten, president and senior fellow at TPI. I will hand it over to them to get this conversation started.
Scott: Emilio, we’re thrilled to have you here. Tell us a little bit about the research that you presented yesterday.
Emilio: The main idea here is that there are some very strong forces affecting the labor market. We are all worried about jobs that are disappearing because of technology, but what is really happening is that although some jobs are indeed disappearing, new jobs are being created or will be created in the future. Probably the main effect technology has is to change the way existing jobs are performed. It changes the skills that existing jobs require.
We started to look into this issue, and we realized that the tools we use as economists to analyze the change occurring in the labor market are not entirely appropriate, because most of those tools are not aimed at analyzing the change; they are aimed at analyzing the stock. Think of the Labor Force Survey.[1] The Labor Force Survey is the main tool we have in economics for analyzing the labor market. It takes a kind of photograph of what’s going on in the market, but it’s not really looking at the change that is happening. In particular, the Labor Force Survey is not focused on skills or on the skill requirements of jobs. Since the change happening now in the labor market is so fast, we wanted a tool that was able to deliver real-time data. We started to look at the information content of vacancies[2] posted on the web.
Scott: 06:03 Before you go on, I want to interrupt and say that one of the things that’s so interesting about this is you start from the premise that we’re worried about change in labor markets because of machine learning and artificial intelligence – but you’re actually using those very tools to try to predict what’s going to happen to the labor market. It’s an interesting twist.
Emilio: 06:24 Those tools are really useful because the main problem in this case is that you have a huge amount of data, and you need to cope with it. You need to scrape the data from the web, and then one of the major challenges is that the data is totally unstructured. The title of a vacancy is simply a natural-language description of what the firm is requiring, or of what the web portal is posting; it is not classified into a standard classification. The same applies to the content of the post: the content of the vacancy is simply a piece of text. And because you have so many thousands and millions of listings, you cannot classify those things by hand, so you need a machine to do it.
Emilio: 07:37 We are fortunate because we have people who are advanced in textual analysis. I’m collaborating with a lot of computer scientists who are designing the software to scrape, classify, and analyze the text. It’s a big effort because it takes a lot of expertise to do. But then, once you have the machine running, you have a system that is able to analyze millions and millions of observations, basically every day. It’s basically just a matter of computing power, which we know is now relatively cheap. This is a tool that can tell you what firms are requiring, basically every day. If you look at what is happening in each occupation, you can analyze what kinds of skills are required in each occupation. Firms are very detailed in their demands, particularly when you are looking at a technical job; they typically mention the technique that is needed and the software that they want to be used.
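To make the classification step concrete, here is a minimal Python sketch of the kind of pipeline Emilio describes – a TF-IDF text representation feeding a linear classifier – not the project’s actual code. The occupation codes, example titles, and model choice are all invented for illustration; a real system would train on thousands of hand-labelled postings per language.

```python
# Minimal sketch: classify free-text vacancy titles into a standard
# occupation taxonomy (ISCO-style codes, here invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples: (vacancy title, occupation code).
train_titles = [
    "Java backend developer",          # 2512 = software developer
    "Python data engineer",            # 2512
    "Registered nurse, night shift",   # 2221 = nursing professional
    "Pediatric nurse",                 # 2221
    "Bank teller, part time",          # 4211 = teller / counter clerk
    "Cashier and counter clerk",       # 4211
]
train_codes = ["2512", "2512", "2221", "2221", "4211", "4211"]

# TF-IDF turns each title into a sparse word-weight vector;
# logistic regression learns a linear decision rule over those weights.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(train_titles, train_codes)

# Classify new, unseen postings scraped from the web.
print(model.predict(["Senior nurse for the cardiology ward",
                     "Full-stack software developer (Java)"]))
```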
Scott: 09:08 So, you can identify the type of job that’s being advertised and the exact skills that the employer wants to go along with that?
Emilio: 09:18 Exactly. You can then add the place where a job is posted, down to the town level. So you also have information about where jobs are posted, where certain skills are required, and the kind of education required. Very often firms require some years of experience. At the end of the day, we have a very granular type of information, which is very useful and has a lot of potential applications. For instance, think of somebody entering the labor market with a certain set of skills: by plugging those skills into this tool, he or she can find the jobs for which those skills are required. This kind of information is also useful for the public education sector. If you’re designing a course that is particularly aimed at the labor market, tailored for the labor market – typically post-graduate courses, masters, executive education – you need to be fine-tuned to what the labor market is actually requiring right now. These tools are able to tell you what kinds of skills the market requires for certain types of jobs.
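One way to picture the “enter your skills, find matching jobs” application is as a simple overlap score between a candidate’s skills and the skills that vacancies for each occupation typically request. The function name, skill lists, and Jaccard scoring below are illustrative assumptions, not the project’s method.

```python
# Sketch: rank occupations by how well a candidate's skills overlap
# with the skill profile aggregated from web vacancies.
def match_occupations(candidate_skills, occupation_skills, top_n=3):
    """Rank occupations by Jaccard similarity to the candidate's skills."""
    candidate = set(candidate_skills)
    scores = {}
    for occupation, skills in occupation_skills.items():
        required = set(skills)
        scores[occupation] = len(candidate & required) / len(candidate | required)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical skill profiles, as if aggregated from millions of postings.
occupation_skills = {
    "data analyst":       ["sql", "python", "statistics", "communication"],
    "software developer": ["python", "java", "git", "problem solving"],
    "bank teller":        ["customer service", "communication", "cash handling"],
}

print(match_occupations(["python", "sql", "communication"], occupation_skills))
```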
Scott: 11:09 Seems like there are a couple of things. One is that, in principle, you can get a more precise picture of what the labor market looks like right now, which is what Hal Varian called “predicting the present.”[3] But then you also want to use it to predict what types of skills are going to be in demand at some point in the future. What sorts of things are you finding so far?
Emilio: 11:38 One of the things that really struck us was how widespread the need is for what we call “soft skills.”
Scott: Tell us what “soft skills” means.
Emilio: “Soft skills” are basically competences that enable an individual to interact with others and with the environment, like the ability to communicate or problem-solving.[4] But the interesting thing is that soft skills are hugely in demand even in technical jobs. That was a surprise for us. You can imagine that you need a lot of soft skills when you are working with a lot of people – a doctor needs a lot of soft skills because he’s interacting with patients every day. But you would not necessarily expect those skills to be required of a programmer or a computer specialist; yet they are.
Scott: We have this idea that, as a programmer, you can just sit and never interact with anybody if you want.
Emilio: The idea of the nerd.
Scott: But that turns out to be not true.
Emilio: Right, exactly. This is because technology is now so pervasive that these jobs require you to interact with other individuals. It’s telling us something about the way we design our educational system, because we need to develop these skills – they are incredibly important. In terms of the changing nature of jobs, there was a discussion yesterday about bank teller jobs, which were predicted to decline with the advent of the ATM; given the technology of that time, the job was expected to disappear.
Emilio: 13:49 Actually, on the contrary, the number of bank tellers increased. But if you look at the job of a bank teller now, he or she is not counting money anymore. He or she is interacting with clients, which means that what is needed is a much larger set of social skills than technical skills.
Scott: It suggests kinds of policies that are different from what we often hear. We often hear that policymakers want more coders and programmers, but that’s not what you’re saying, right? The thing that will ultimately differentiate us from a machine is that we can interact with each other.
Emilio: Absolutely, yes. Technical skills are certainly important, but they cannot define the entire skill set of an individual – on the one hand because that is risky, since very, very soon your technical skills will be outdated.
Scott: I observed this: as soon as Sarah came to work here, my skills were instantly outdated.
Sarah: “Scott, you don’t know R?”
Emilio: 15:28 But then, how can you cope with this? Well, you start to interact with Sarah, probably, right? The ability to work with others means that over your career, when you’re younger you apply your skills, but when you’re older you need to develop other skills. In my case, I cannot compete in technical knowledge with my students, because they know much more of the new skills. But what I can do is coordinate the work of others, because I have more experience, or more vision, so I can use my skills in a different way. I’m using a lot of interpersonal skills.
Scott: In another part of yesterday’s conference, there were some papers on prediction versus judgment and how improved prediction doesn’t mean the end of judgment but makes it more important.[5] I think that’s partly what you’re saying. You’ve got the judgment.
Emilio: Exactly. Judgment is going to be very, very important. But again, judgment is not only something that comes out of a technical education; it’s also closely related to experience, and you develop it over time.
Sarah: 16:58 As a junior economist, I would definitely say that experience matters. You can tell the difference between someone more senior and someone more junior – like, “I’ve been down that road, don’t go down that road, it’s a waste of research time,” or “I’ve been to these conferences and I’ve heard all these papers.” Those are things you need time and experience to gain; that’s judgment. At least in collaborative fields like research, you can’t do research like a robot. It’s a group effort. You’re learning from senior researchers and advancing the literature.
Emilio: 17:43 You can tell this simply by looking at the number of names that appear on papers. When I started, the majority of papers were written by one or at most two people. Now there are often three or four, because the research effort is much more complex; you are mixing different techniques, different activities.
Scott: 18:17 It’s amazing to see that, because we used to have single-authored papers – otherwise how would you know who wrote it? Now papers have multiple authors. As somebody whose last name begins with a “W,” I have become increasingly worried about this because my name ends up at the very end. Just kidding.
Emilio: 18:37 In my country we had this kind of rule – thank God not anymore – that on a multi-author paper you needed to specify the author of each paragraph, because they needed to tell who had written which part of it. Luckily there is now a kind of common understanding that you put names in alphabetical order, and that’s it. I think that’s fair, unless somebody has really done a larger share of the work and that has to be acknowledged.
Scott: 19:41 That’s the way it is in the hard sciences: with multiple people working in a lab, they usually list the principal investigator first and then the other 35 people later. Going back to your paper and your work on how the labor market is developing, have you been interacting with people at the various government agencies that are charged with doing this now? Here it’s the Bureau of Labor Statistics. I assume you have an analog; certainly at the EU level I know there is one.
Emilio: 20:20 What we are doing now is scaling up our research project to the European level. Last year we won an open tender from a European agency, CEDEFOP [European Centre for the Development of Vocational Training],[6] that wanted to set up a real-time labor market information tool for the European Union using the information content of web vacancies. As you can imagine, there is a big complication in all this: we need to do it in a multilingual setting, because in Europe we have lots of different languages – we even have some countries that use different languages within the same country. We are now extending this tool to the entire European Union, working with experts in each European country.
We are setting up this system. By the end of this year, we hope to have the system scaled up to seven countries, the major ones – Italy, France, Germany, Spain, the UK, the Czech Republic – so that it would cover more or less two thirds of the EU population.
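One plausible way to handle the multilingual setting – a sketch, not a description of the project’s architecture – is to detect each posting’s language and route it to a classifier trained on labelled data for that language. The example below assumes the langdetect package for language identification; the per-language models are placeholders standing in for trained pipelines like the one sketched earlier.

```python
# Sketch: route each posting to a language-specific classifier.
from langdetect import detect

# Placeholder per-language models; in practice each would be a pipeline
# trained on hand-labelled postings in that language.
models = {
    "it": lambda text: "occupation code from the Italian model",
    "de": lambda text: "occupation code from the German model",
    "cs": lambda text: "occupation code from the Czech model",
}

def classify_posting(text, fallback_lang="it"):
    """Detect the posting's language and apply the matching national model."""
    lang = detect(text)
    model = models.get(lang, models.get(fallback_lang))
    if model is None:
        raise ValueError(f"No classifier available for language '{lang}'")
    return model(text)
```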
Scott: This might be an ignorant question. There are standard training datasets for textual analysis. Does it become less precise with languages that are less commonly spoken? Are the same kinds of training datasets available for them?
Emilio: It really depends on the training set, and it depends on the type of country. For instance, the Czech Republic has a language that is incomprehensible to us – it’s like Chinese, in a way. But the Czech Republic has a fantastic tool. This is the beauty of small countries: they have a public website that posts vacancies and already classifies them into a standard classification. This provides a training set that is very large and very good. In fact, in the Czech Republic we reached almost a hundred percent accuracy, which is amazing. But this is because of that training set. The beauty of this kind of approach is that you start doing this, and then you get a percentage of correct matches.
Emilio: 23:41 But then you work on the margin. You don’t need to reclassify everything; you only need to work on the rest, which makes life easier. Over time the system improves, because the machine learns from its mistakes as well. The more information you get, the more precise the system becomes, so it improves over time. The other nice thing is that because the information is always there, even if you make some mistakes, you can always reclassify information that you previously misclassified – you can always go back to the information you retrieved. This is something that makes the tool very useful. Suppose, on the contrary, that you are analyzing this through a survey and you make a mistake in the survey; you cannot undo it, because once you have run the survey, that’s it. So this is really useful.
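The “you can always go back” point can be sketched in a few lines: because the raw posting text is stored, fixing labels and retraining lets you relabel the whole archive, which a one-shot survey cannot do. The function below is a hypothetical illustration, assuming a scikit-learn-style model like the earlier sketch.

```python
# Sketch: learn from corrected mistakes, then reclassify the stored archive.
def retrain_and_reclassify(model, raw_archive, corrected):
    """Retrain on hand-corrected labels, then relabel every stored posting.

    raw_archive: list of raw posting texts kept from earlier scrapes
    corrected:   list of (text, true_code) pairs fixed by human reviewers
    """
    texts, codes = zip(*corrected)
    model.fit(list(texts), list(codes))   # the machine learns from its mistakes
    # Because the raw text was retained, past misclassifications can be redone.
    return list(zip(raw_archive, model.predict(raw_archive)))
```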
Sarah: 25:03 As a historical analog, did the EU do surveys of employers? Before this kind of machine-learning scraping, where did you get employer data on skills?
Emilio: 25:17 In some countries – a few countries – there are skill surveys. The EU, through this agency, tried to do a European skill survey,[7] but it proved too costly. If you want it to be precise and widespread, it is too costly, and you have to repeat the survey. For an employer, you’re taking up his time, and time is money for him. As economists, we need to take into account the opportunity cost of time, so the full cost of the survey is higher than its direct cost. The survey is really costly.
Sarah: 26:15 Software like this machine learning program that scrapes millions of job listings across countries is a big step forward for labor market analysis. That’s a huge step.
Emilio: 26:28 Absolutely, yes. It clearly has some limits, because some jobs are simply not on the web, and you can only analyze the jobs that are actually posted on the web. But a large share of jobs are posted on the web.
Scott: I thought that was a surprising point raised yesterday. Because of this bias in the types of jobs that are online, you were concerned it might bias the results in other ways too,[8] but it didn’t seem to. Can you explain that?
Emilio: In a way, it depends on what your aims are. If you want to measure the number of vacancies in a country, then clearly you have a biased estimate, because you are missing something. But the vacancy survey is one of the major tools used by policymakers to measure labor market tightness.[9]
But in order to measure labor market tightness, you need to look at how the conditions of the labor market change. You don’t necessarily need a representative sample of the population. In fact, the Fed now uses two measures of vacancies: one is the vacancy survey, and the other is an index built from online vacancies. Even a non-representative sample can be a good predictor of the change in the labor market, and can track what a representative sample would show. If you really want to count the vacancies, online postings are certainly biased; but if you want to measure labor market tightness, they are okay.
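The logic here can be made precise with the standard definition of tightness (not stated in the interview, but the usual convention): tightness is the ratio of vacancies to unemployed job seekers, and a biased vacancy count still tracks its changes as long as the coverage rate is roughly stable over time.

```latex
% Standard definition: tightness theta is vacancies V over unemployed U.
% If online postings capture a stable fraction k of true vacancies
% (0 < k < 1), the online-based measure understates the level but
% preserves growth rates and turning points:
\[
  \theta_t = \frac{V_t}{U_t},
  \qquad
  \hat{\theta}_t = \frac{k\,V_t}{U_t} = k\,\theta_t
  \;\Longrightarrow\;
  \frac{\hat{\theta}_{t+1}}{\hat{\theta}_t} = \frac{\theta_{t+1}}{\theta_t}.
\]
```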
Scott: Do you have any sense of whether one is more accurate than the other?
Emilio: 28:53 We’re working hard on this. One of the problems we have in Europe is that for several countries the vacancy survey is compulsory, because the vacancy rate is one of the key indicators that the ECB, like any central bank, uses to assess labor market tightness. Every country runs a vacancy survey, but funnily enough, lots of countries do not publish the number of vacancies; they only publish the vacancy rate.
Scott: Why?
Emilio: They say that this is reserved information.
Scott: Like somebody might be able to figure out the population of the country?
Emilio: 30:00 Well, actually, yeah, but it can make a big difference depending on how the data are disaggregated. In our case this is a problem, because we’d like to know whether, and to what extent, our measure differs from the official measures, but we can do this only in a few countries. We are now starting a collaboration with Eurostat, and we’re working with a lot of statistical offices around Europe. But despite the fact that we’re collaborating with them, they still have their internal data-protection policies; they don’t want to disclose the number of vacancies.
Scott: As a researcher, you can’t sort of sign a nondisclosure agreement?
Emilio: No. In a way, this is a limit that statistical offices have, at least in Europe. They have this kind of approach: in order to be reliable, the data has to be produced by themselves and by nobody else.
Emilio: 31:12 But the problem is that this could be a viable option if you look at standard data. When you start using data produced with the techniques of computer scientists, then in order to produce the data themselves they would need to hire computer scientists, and they don’t have the budget to do that. So they are stuck: they cannot produce the data by themselves, but on the other hand they don’t want to collaborate with others, because they’re limited by their data-protection policy. Which is weird, in a way.
Scott: So they’re kind of stuck.
Emilio: Well, actually, in some European countries – take, for instance, the simple scraping of the data. Scraping the data is not illegal, because there is no law against it; we are simply taking data that is published on the web, so we’re just looking at something that is already public.
Emilio: 32:19 But still, because we’re doing this as part of a European project, we inform the websites of what we’re doing, and we ask whether they are willing to give us backdoor access so we don’t slow down their systems. We try to be as open and as transparent as possible, even though there are lots of companies that do this for a living and scrape the web without problems. Some statistical offices actually started this kind of approach, but in order to use the data, they need written consent.
Scott: Which basically means they need written consent from everybody who posts on the internet.
Emilio: Exactly. We tend to live in a world where, if something is not prohibited, you tend to assume it is allowed. But they are so worried about this data-protection thing that in the end they simply don’t use the data.
Scott: 33:43 It seems like their measures, their indices, are going to become increasingly worse, relative to what else is out there. Even if you don’t want to say that yours is better, it’s going to be.
Emilio: That’s true. On these kinds of issues, you need to be pragmatic. There are universities that would be willing to cooperate with the statistical offices. We are starting a collaboration; we’ll see how it develops. You need to take into account that statistical offices, in general, are terribly underfunded, at least in Europe.
Scott: I think that’s true everywhere.
Emilio: Which is a shame, because the world is changing so fast that we need more information, not less. Because the world is changing so fast, we need up-to-date information, real-time information. You need to develop different tools, different methods, and you need to be open about this.
Scott: I don’t know about Europe, but the people I’ve interacted with at BLS[10] and BEA[11] love what they do, and if they had more resources, they would do a lot more.
Emilio: Absolutely. The people in these offices are very smart and very good, and with more resources they would certainly produce much, much better results. We hope that in the next months we will improve our collaboration, so maybe something more will be added to the knowledge. This project will end by 2020, and the idea is that by the end of it, everything will be made available to the research community: the data will be made available, but also the programs, because we are doing everything in open source. Everything will be released, depending on the choices of the agency. At the end, the idea is to have a portal where this data is available to everybody.
Scott: Are you going to try to post the code open source along the way?
Emilio: We will need to finish everything and then post the code. We are also collaborating with other universities, so we do this as part of our research. I’m less involved in the technical part because I’m an economist, but my colleagues, the computer scientists, are working on it as part of their own research agenda. They collaborate with others because, over time, new techniques become available and they try different tools; it’s always an evolving thing.
Scott: 36:57 It’s interdisciplinary. When you’re working with computer scientists – and I don’t know who else – how do you do it? What’s your role as the economist?
Emilio: It’s funny, because they provide a tool and say, “This is done. This is a paper.” Then I say, “No, this is not a paper for me. It’s a paper for you, because you developed the tool. But I need to use this tool to ask a specific and interesting question. Otherwise, for me, it’s not a paper.”
It’s nice, because I discover incredible tools. But my job is to take these tools and use them to ask a proper question – for an economist, what is a valuable and interesting question to address, a problem to be solved? For them, the problem is more technical: “I have this kind of data; how can I extract information from it?” In my case, my problem is: “What can I do with this information? What kind of question can I address? What kind of problem can I solve?”
Scott: That’s actually a pretty deep problem for economists, and I assume for some other researchers as well. Machine learning identifies these correlations, but we’d like to start with a hypothesis.
Emilio: Exactly, exactly. It’s also interesting because, without knowledge of these kinds of tools, I wouldn’t think of some problems, right? If you think of this issue more generally, there are lots of puzzles in economics – things that we are not able to address with our existing tools. We need to try something else, by definition, because if our theories are not good enough, then we can use those tools. Again, the tools are not the theory themselves; the tools simply discover some correlation. But they are useful for changing the theory: if you have a theory that you know doesn’t work, and you use those tools and discover some kind of correlation, then you need to go back and think differently about the relationship, and maybe change or evolve the theory.
Scott: Right. They’ve certainly made new approaches possible and new data available too.
Emilio: Yes. Absolutely. Yes.
Scott: Thanks so much for coming by and talking with us, and thanks for coming all the way for the conference. Very much enjoyed talking to you.