A large part of the project manager's work can be described in terms of retrieving, processing, analysing and synthesizing various types of data from different sources. The types of information become more and more diverse (including participants, task and financial details, and dates) and data volumes continue to increase, especially for large international collaborations. In this paper we explore the possibility of using the python programming language as a tool for retrieving and processing data for some project management tasks. python is a general-purpose programming language with a very rich set of libraries. In recent years python has experienced explosive growth, leading to the development of several libraries that help to efficiently solve many data-related tasks without very deep knowledge of programming in general and python in particular. In this paper we present some of the core python libraries that can be used to solve typical project management tasks and demonstrate several real-world applications using a European HORIZON 2020 project as an example.
Project management deals with many information flows that must be found, processed, reorganised and visualized. This includes extracting information from documents (e.g., person-months, task details, team composition), compiling data (e.g., from spreadsheet tables), sorting data (e.g., abstract titles and authors from large tables) and more. In the following we call all the information that a project manager deals with “data”, including information received via email, lists of participants (e.g., of a conference or meeting), financial information (e.g., cost reports, team efforts), statistics (e.g., gender, international composition), and the roles and responsibilities of the partners (e.g., in task teams).
A way to cope with these flows of information is to use general-purpose office software. The most popular are the Microsoft Office products, and in particular MS Excel is the software of choice when it comes to processing moderately large amounts of data (not exceeding the MS Excel limit of 1 048 576 rows and 16 384 columns per spreadsheet) that can be presented in table form.
One of the biggest challenges, however, is to actually gather the data in table (or any other desired) form. This process is called data wrangling or data extraction. The input data for the tables usually reaches the project manager in a variety of forms, including emails, MS Word documents, .pdf files and, increasingly, web pages and postings in social media. The first task of the project manager is to parse this information and organise it in an appropriate way, often an MS Excel spreadsheet. The information is then analysed and exported as graphics, a Word document or one of the many other formats the project manager has to work with.
Especially difficult and time consuming is the step of data extraction and parsing, as data preparation can take a large portion of the work required to complete a data-related task. When it comes to data extraction and parsing, many project managers rely on pre-existing automatic tools (e.g. MS Excel import) that, however, are often a source of systematic errors and may also fail to import the right data. They also do not allow one to create an automatic process for exporting data from an arbitrary existing format and, more importantly, for parsing and organising the exported data in the form one requires. When pre-existing automatic tools fail, the common practice is to turn to manual extraction of data by simply copying information from the source document and inserting it into the MS Excel spreadsheet. This is, however, a tedious, time-consuming and error-prone process. This extraction-organisation-synthesis process is comparable across the many fields that work with data beyond project management, and often the amount of data is so large that manual extraction is practically impossible, so automatization of the process is the only feasible option. The amount of data that project managers need to process grows constantly, so the science project management community could benefit greatly from leveraging the experience in automatic data extraction and processing already gained in other communities.
The best way to gain full flexibility for data extraction, parsing and processing is to use a programming language. Many popular programming languages are used for data analysis, python among them.
Recognising the growing need for a tool to process data for project
management tasks and the capabilities of python, the main objectives of this
paper are:
Identify some project management tasks related to data extraction and processing that can be automatized through programming.
Introduce some of the core python libraries that can be used to solve the identified tasks.
Demonstrate several real-world applications using the documentation of a European HORIZON 2020 project and a nationally funded project as sources of data.
The structure of the paper is as follows: we identify basic project management tasks that can be automatized in Sect. 2. In Sect. 3 we provide a description of the case studies performed on the basis of an EU Horizon 2020 project and a national project. Finally, Sect. 4 contains the discussion and conclusions.
In this section, we identify several key project management data-related activities that can be automatized by using programming, and for each activity we provide a short description of the way it can be automatized using python. In particular we mention several python libraries that will be used in the case studies of the next section. python libraries are collections of programs that expand the basic functionality of the language. Most of the python libraries are free and open source. Code examples from this section can be found in the supporting materials and online at
Text reformatting is a quite general task that project managers have to deal with. Automatization can help when large numbers of similar text entries (e.g. names, phone numbers, addresses, publications) have to be available in a harmonised format. One of the main features of python is simple text processing. The basic text processing functionality that allows one to work with text data (strings, in python terminology) is part of the language core and does not require the installation of any additional libraries. This allows users to, for example, split into parts, reorder and concatenate text. The code snippet below demonstrates an example of working with strings in python. The code converts an initial string containing a name in an undesired format (all lowercase, family name, given name order) to the desired format (first letters capitalized, given name, family name order). The output of the program is indicated by “
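A minimal sketch of such a snippet (the name used here is invented for illustration; the original code is not reproduced):

```python
# Initial string: all lowercase, "family name, given name" order
name = "smith, anna"

last, first = name.split(", ")   # split the string into parts
first = first.capitalize()       # capitalize the first letter
last = last.capitalize()
formatted = first + " " + last   # reorder and concatenate
print(formatted)                 # prints "Anna Smith"
```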
Data tables are at the core of many data processing and analysis tasks. The most popular software to work with such data is MS Excel. MS Excel works well for most tasks when the data is already organised and harmonised, but is inflexible and problematic to use for data preparation. In python the most popular solution for working with table data is a library called pandas (McKinney, 2010, 2011). This library provides tools to work with structured datasets (tables in particular) and is widely used in finance, statistical sciences, Earth sciences and many other fields. pandas provides capabilities to work with MS Excel files, so that information can be both imported from and exported back to MS Excel spreadsheets.
Below we show a code snippet that converts Table 1 to Table 2. The operations performed include opening the MS Excel spreadsheet with Table 1, converting the last name to uppercase and adding the DR title, capitalizing the first name, reordering the columns and saving the data back to an MS Excel spreadsheet:
Input for Example 3.
Output of Example 3.
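A sketch of these operations with pandas. The column names, the sample names and the column order are assumptions for illustration, since Tables 1 and 2 are not reproduced here; the table is built inline so the example is self-contained, where in practice one would load it with pd.read_excel:

```python
import pandas as pd

# In practice: df = pd.read_excel("table1.xlsx")
df = pd.DataFrame({"last name": ["smith", "jones"],
                   "first name": ["anna", "bob"]})

# Convert the last name to uppercase and prepend the DR. title
df["last name"] = "DR. " + df["last name"].str.upper()
# Capitalize the first name
df["first name"] = df["first name"].str.capitalize()
# Reorder the columns
df = df[["first name", "last name"]]

# Save the data back to an MS Excel spreadsheet:
# df.to_excel("table2.xlsx", index=False)
```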
In this example, we use capabilities that are not included in the core python distribution but provided by the pandas library. Before using it, the library has to be installed and then imported inside the program with an import statement at the beginning of the code. The example above is used as a demonstration that the pandas library makes it possible to apply changes to the data in a whole column at once. In other words, you can work with all strings in the column in the same way as you would work with a single string, without the need for looping over every string in the column.
Besides MS Excel spreadsheets, another popular format for table data exchange is CSV (comma separated values). This is a simple text document with columns separated by commas or another standard delimiter. This format is one of the main import/export formats for pandas, and there is rich functionality that allows for data homogenisation already during import.
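For illustration, a sketch of importing a CSV document with a non-default delimiter using pandas (the data and the semicolon delimiter are assumptions; the document content is simulated with a string so the example is self-contained):

```python
import io
import pandas as pd

# In practice one would call pd.read_csv("data.csv")
csv_text = "name;phone\nAnna Smith;123-456\nBob Jones;789-012"

# A non-default delimiter is handled with the sep argument
df = pd.read_csv(io.StringIO(csv_text), sep=";")
```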
Data does not always come in the form of an MS Excel spreadsheet or CSV document, and often has to be extracted from different sources, including text files or MS Word documents. In this section we provide a short overview of ways to extract data from those sources using python.
Text files can be easily opened with python, and simple data extraction can be performed without additional libraries. The code snippet below shows a simple program that opens a text file and reads it line by line (see Table 3 for the file content). Each line is checked for its starting sequence (“LN” or “FN”) and its content is added either to the “lastnames” or the “firstnames” variable.
Input for Example 4.
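A sketch of this line-by-line pattern. The sample entries are assumptions in the spirit of Table 3, which is not reproduced; the file content is simulated with a string so the example is self-contained:

```python
# In practice the lines would come from a file:
# with open("names.txt") as f: lines = f.readlines()
text = """LN Smith
FN Anna
LN Jones
FN Bob"""

lastnames = []
firstnames = []
for line in text.splitlines():
    if line.startswith("LN"):          # last-name lines
        lastnames.append(line[2:].strip())
    elif line.startswith("FN"):        # first-name lines
        firstnames.append(line[2:].strip())
```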
The content of the “firstnames” and “lastnames” variables can then be processed as shown in the previous examples or exported to an MS Excel spreadsheet. Reading a file line by line, getting information from each line and putting it into a variable is a very common pattern in data extraction, and python allows one to do it relatively easily.
When working with numerical information only, an even easier way to open a set of numbers than reading values one by one is to use the numpy library, a fundamental package for numerical computations in python. One of the features of numpy is its ability to open text files containing numbers and create a data structure called an n-dimensional array. The code in the example below reads a text file that contains the lines shown in Table 4, reshapes the data into
Input for Example 5.
The data in Table 4 represent year, month and a given value, all appearing in one column. Using only four lines of code (including importing the numpy library) it is possible to compute the mean of the data values.
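A sketch of these steps, assuming invented sample values since Table 4 is not reproduced here; the array is built inline where in practice one would use np.loadtxt:

```python
import numpy as np

# In practice: data = np.loadtxt("values.txt")
# The single column repeats the pattern: year, month, value
data = np.array([2017., 1., 10.5,
                 2017., 2., 12.3,
                 2017., 3., 11.1])
data = data.reshape(-1, 3)       # one row per record: year, month, value
mean_value = data[:, 2].mean()   # mean of the data values
```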
The document file format used by MS Word (.docx) is very different from plain text files or CSV documents. It contains complicated data structures, since all the information about text formatting, images, tables and other objects is also included. The library that can help to extract text and table information from MS Word documents, and that is used in the case study below, is called docx. docx converts MS Word documents into data structures that can be easily handled by python and allows data extraction (mainly of text). The examples below demonstrate how one can extract tables from MS Word documents using docx. The library can also create MS Word documents, which is useful for generating them automatically. A simple example of such a task is the creation of name badges for conference participants. A more advanced exercise could be to automatically create complicated reports in MS Word; this, however, would require good proficiency in python.
The “list of deliverables” table from standard EU Horizon 2020
DoW document
In this section we briefly cover several cases of using python for extracting, processing and exporting information. The complete code, with more extensive explanations in the form of Jupyter Notebooks, is located in the Supplement and at
The Description of Work (DoW) is one of the main documents of an EU H2020 project because it explains the project work plan in detail, including the description of each work package and task. It includes many tables (e.g., for deliverables, milestones, resources) that are used as a reference by the project manager to keep track of the project's progress. However, the organisation of this document is not optimal, and getting structured information from it can be challenging, especially when many partners are involved. In the example below we extract several tables from the project's DoW, perform some simple data sorting and output the result. The notebook with the code and detailed explanations can be found online at
Bar chart of the years when each of the PIs obtained their Diploma and PhD, sorted by the year of the PhD defence.
First, we open the file “EU-PYTHON.docx” in python using the docx library. One can directly access the tables in this document, so we select the table with the list of work packages, extract the information, put it into the pandas data structure (DataFrame) and output it to an MS Excel spreadsheet. Similar steps are repeated with the “list of Beneficiaries” table. The “list of deliverables” is more complicated, as it spans several pages. In the Supplement we show how to extract information from such multi-page tables and then combine it into a pandas DataFrame. The “list of deliverables” contains only “due dates” for project deliverables, given in months from the beginning of the project. If the project start date is known, it is easy to convert this information to calendar dates. Parts of the source “list of deliverables” and the corresponding processed parts exported to Excel are shown in Fig. 1.
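As an illustration of the due-date conversion, a sketch with pandas; the start date, deliverable names and due months are hypothetical:

```python
import pandas as pd

# Hypothetical deliverables with due dates given in project months
df = pd.DataFrame({"Deliverable": ["D1.1", "D2.1", "D2.2"],
                   "Due (month)": [6, 12, 18]})

start = pd.Timestamp("2020-01-01")  # assumed project start date
# Shift the start date by the given number of months
df["Due date"] = [start + pd.DateOffset(months=m) for m in df["Due (month)"]]
```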
In the second part of the notebook we show how to perform some analysis of the table directly in python, including selection of all deliverables for one partner, sorting deliverables by date, statistics on how many deliverables each lead beneficiary has, and so on. It is worth noting that most of these analyses can be done with just one line of code.
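A sketch of such one-line analyses on a hypothetical extract of the deliverables table (the partner acronyms and values are invented for illustration):

```python
import pandas as pd

# Hypothetical extract of a deliverables table
df = pd.DataFrame({"Deliverable": ["D1.1", "D1.2", "D2.1"],
                   "Lead beneficiary": ["UHAM", "AWI", "UHAM"],
                   "Due (month)": [12, 6, 18]})

# Each analysis takes a single line of code:
uham = df[df["Lead beneficiary"] == "UHAM"]      # deliverables of one partner
by_date = df.sort_values("Due (month)")          # deliverables sorted by date
counts = df["Lead beneficiary"].value_counts()   # deliverables per partner
```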
In the standard EU Horizon 2020 DoW, deliverables and milestones are separated. However, in practice it is easier to have them together to keep track of the timing for project partners. In another notebook
(
For this example we use the MS Word document of a national project's proposal that is structured differently from the EU H2020 DoW used in the previous section. In particular, not all relevant information is available in table form, so some of it has to be extracted from the plain text. The detailed notebook with the code is located online at
We again begin by opening the document with the docx package and then get the information from the tables. The tables contain information on the project's principal investigators (PIs), but their cells are not regular and fields often contain large portions of text (e.g., “Areas of expertise and key competences”, “Four most relevant publications”). We show how to reformat such tables into a format more convenient for data extraction and how to export only some parts of the table (e.g. only the PI's name and affiliation) to MS Excel. Two possible ways to extract numerical information (e.g., a year) from long strings (e.g., “Diploma in physics (1996), PhD in physics (2000)”) are demonstrated. We also extract information on the PI's
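Two such extraction approaches can be sketched as follows, using the example string quoted above (a regular expression and plain string splitting; the exact methods used in the notebook are not reproduced here):

```python
import re

text = "Diploma in physics (1996), PhD in physics (2000)"

# Way 1: a regular expression capturing four-digit numbers in parentheses
years_re = [int(y) for y in re.findall(r"\((\d{4})\)", text)]

# Way 2: splitting the string on the opening parenthesis and taking
# the first four characters of each following part
years_split = [int(part[:4]) for part in text.split("(")[1:]]
```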
Some tasks within science project management that involve data processing (in particular data retrieval, compilation, homogenisation and synthesis) are extremely time consuming for project managers, especially when the projects involve large international teams and complex financing schemes (such as European projects). However, many of these tasks are repetitive and can therefore be automatized through programming. This saves precious time that the project manager can use in more fruitful ways (e.g., for activities that require thinking and creativity).
In this paper we explored the possibility of using python as a tool for solving typical scientific project management tasks that involve data. We identified some example tasks related to data extraction and processing that can be automatized through programming, and we provided a preliminary review of some of the core python libraries that can be used to solve these tasks. We also demonstrated several real-world applications using the documentation of a typical European project and a national project as sources of data.
python is a very useful and versatile language for organising and visualizing data related to the management of scientific projects, especially through libraries such as pandas and numpy. There are several programming languages that can be used for
the same purpose (such as
The main disadvantage of using a programming language such as python to cope with data-related project management tasks is that the automatization process requires time to set up (in fact, the time needed to write the code is often longer than the time required to perform each task manually taken individually), so it might not seem attractive initially for project managers, who are always working with tight timelines. There is also a considerable learning effort that needs to be taken into account to start using python in an efficient way. However, this investment of time in learning and code writing pays off in the long term, i.e., when code written for one task can be reused several times and therefore reduces the total time spent on performing similar tasks again and again. Furthermore, the skills acquired by the project manager while learning the programming language are beneficial for other aspects of their job as well. For example, they can help when applying project management processes in which one has to have a workflow that considers inputs, tools and procedures, and outputs.
In conclusion, we found that python is a flexible and powerful tool that can help ease the daily work of science project managers, especially for data- and information-related tasks. Through our overview of tools and case studies we demonstrated python's utility for science project management, and we advocate for more programming training to be made available to project managers so they can take full advantage of coding and thus work in a more time-efficient way.
The code is available at
In our own experience, the best way to learn a programming language is to cover the very basics of the language syntax and then begin a small project (or set of projects) that is directly related to one's work (or hobby), so that progress is much faster than it would be by focusing on abstract exercises. It is also very helpful to find someone in the organisation with python knowledge who can help at the beginning of the learning process.
Below we list some of the resources that in our opinion are useful for
beginners with no prior knowledge of python:
“Learn Python 3 the Hard Way” (
Code Academy (
LearnPython.org (
Programming with Python from Software Carpentry (
Dive into Python 3 (
The supplement related to this article is available online at:
The authors declare that they have no conflict of interest.
This article is part of the special issue “Project management in geosciences: systems and practices for high-impact research”. It is a result of the EGU General Assembly 2018, Vienna, Austria, 8–13 April 2018.
We thank Lars Kaleschke and an anonymous reviewer for their very helpful comments. Nikolay Koldunov is supported by project S1 (Climate Models as Metrics) of the Collaborative Research Centre TRR 181 Energy Transfer in Atmosphere and Ocean program funded by the German Research Foundation. Luisa Cristini is funded through the European Commission H2020 project APPLICATE (Grant number 727862) and the project “Advanced Earth System Modelling Capacity” supported by the Initiative and Networking Fund of the Helmholtz Association. The article processing charges for this open-access publication were covered by the University of Bremen. Edited by: Sylvia Walter Reviewed by: Lars Kaleschke and one anonymous referee