The International Consortium of Investigative Journalists (ICIJ) is a nonprofit newsroom based in the US, and a global network of journalists and media organizations who collaborate on groundbreaking investigative stories. Its investigation into the offshore finance industry, the Panama Papers, won the Pulitzer Prize for Explanatory Reporting in 2017. Pierre Romera is ICIJ’s Chief Technology Officer. He’s also Associate Professor at The Paris Institute of Political Studies (Sciences Po), where he teaches computer science to journalists.
What does your tech stack look like at the moment?
ICIJ is a unique organization because technology plays a central part in all of our investigations. So we not only have a stack for the staff, but also a very diversified stack for all the investigations we run. It’s very complex in that we use many technologies, depending on the project, the data, and the documents we have. We focus on document search, so we build a lot of tools for this. We have an open source tool called Datashare that we developed internally. It’s based on other open source technologies such as Elasticsearch and PostgreSQL, and it also relies on many other tools to analyze documents, like Apache Tika, and on NLP (natural language processing) libraries such as CoreNLP and OpenNLP.
We also use a lot of open source technology to manipulate data. One of the main technologies we’ve used so far is Talend, an ETL (extract, transform, load) tool from a French company, which helps us clean and structure data to produce the analyses and databases that we publish on our website. We’re moving away from it, though, as we don’t really need such a big tool right now. We use a lot of other technologies, such as Python and pandas, to analyze data and produce reports. We also have a lot of small visualization tools that we try to host on our own servers. For instance, we use a technology called Linkurious to draw graphs that show relationships, so we’re able to share these kinds of visualizations with reporters and the public. We also have our own internal forum for investigations, so journalists can have their own virtual newsroom. It’s called the I-Hub and it’s based on an open source technology called Discourse. It is used by all the journalists around the world who work with us.
Do you have different levels of expertise across the organization in using these tools? Do you train people to use them? And do they work in multiple languages?
We have a Training Manager who provides training to the partners and helps them use the tools we build. And we have a support person who helps every journalist get on board the platform and deals with the common problems that come up with these kinds of tools. Not all platforms work in different languages. Our language for collaboration is English. We’ve had feedback from some of our users that it would probably be better to have the interface in Japanese or French, for example, but we try to partner with journalists who are fluent in English and able to use the tools we share with them. Another important aspect of our work is that we use a lot of encryption. We encrypt email, files, many things. That’s why training and support are very important for us.
We used to handle inquiries through a service desk built with Jira. This was mostly to provide tech support. But we’re moving away from this and switching to something simpler because Jira was a little bit too big for us, and a pain to maintain. We started with a very distributed approach to tech support, where several journalists in the team were providing support to our partners. Then the organization evolved and we hired someone to manage the support, so we don’t need anything complicated to manage the tickets.
What’s the single most important tech issue facing ICIJ right now?
Our biggest challenge right now is handling the massive amount of data and documents we have. Our most famous investigation, the Panama Papers, has several million files. These documents are very hard to manipulate and very expensive to index. We need to be able to keep indexing them and read all the files, despite the very wide range of file types. We built an open source tool called Extract that we’ve embedded in Datashare. It basically distributes computation between servers and opens a lot of different file formats, finding and extracting images so they can be turned into text using Tesseract, an open source OCR engine. Managing all the servers needed for this kind of process is also a challenge. With Extract we’re able to say we want to distribute a computation between, let’s say, 10 servers at a time, or even 30. This is quite powerful and it’s how we’ve been able to read so many files.
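The idea of spreading extraction over a fixed number of servers can be sketched roughly as follows. This is a minimal illustration, not Extract’s actual implementation: worker threads stand in for servers, and `extract_text`, `distribute`, `IMAGE_SUFFIXES`, and the sample file names are all hypothetical. A real pipeline would call a document parser or an OCR engine where the placeholder sits.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath

# Hypothetical: file types that would need OCR rather than direct text parsing.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def extract_text(path: str) -> tuple[str, str]:
    """Placeholder for the real work: pick a strategy based on the file type."""
    suffix = PurePath(path).suffix.lower()
    strategy = "ocr" if suffix in IMAGE_SUFFIXES else "parse"
    return path, strategy


def distribute(paths: list[str], workers: int = 10) -> dict[str, str]:
    """Process files on a pool of `workers`; each thread stands in for a server."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map hands files to whichever worker is free and keeps input order.
        return dict(pool.map(extract_text, paths))


if __name__ == "__main__":
    files = ["contract.pdf", "scan_001.tif", "ledger.xlsx", "passport.JPG"]
    for path, strategy in distribute(files, workers=3).items():
        print(f"{path}: {strategy}")
```

Changing `workers` from 10 to 30 is the only adjustment needed to widen the pool, which mirrors the kind of control described above.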
Can you tell us about your security processes, particularly data storage and archiving?
The first step you take when you join an ICIJ investigation is to set up encrypted email. We ask everyone to use GPG (GNU Privacy Guard). We train people and help them create a key, and once they have a key we are able to create an account for them on our single sign-on platform, Xemx, which means ‘sun’ in Maltese. They are then able to connect to all the services provided by ICIJ. We try to make this platform very secure by adding two-factor authentication and by using SSL client certificates, a tiny file that users have to install on their system. It’s a way to identify the system and be sure it’s legitimate. Because we have GPG, we’re able to use another technology on our server called CipherMail. It’s an SMTP (Simple Mail Transfer Protocol) server technology that’s able to send encrypted email automatically. So every time you need to reset your password, create your account, or change your email address, you receive email from the platforms, and those emails are always encrypted. If the system doesn’t know your GPG key, you won’t be able to receive the email. That is another way to be sure the person on the other end is legitimate.
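The fail-closed rule described here (no known key, no email) can be sketched like this. It’s a simplified illustration rather than how CipherMail works internally: the `KEYRING` mapping, the `send_encrypted` function, the `MissingKeyError` exception, and the injected `encrypt` and `deliver` callables are all hypothetical, and the stand-in cipher in the usage example is not real encryption.

```python
from typing import Callable

# Hypothetical keyring: email address -> public key fingerprint.
KEYRING: dict[str, str] = {"reporter@example.org": "FINGERPRINT-PLACEHOLDER"}


class MissingKeyError(Exception):
    """Raised when no public key is known for the recipient."""


def send_encrypted(
    recipient: str,
    body: str,
    encrypt: Callable[[str, str], str],
    deliver: Callable[[str, str], None],
) -> None:
    """Encrypt `body` for `recipient` and deliver it; never fall back to plaintext."""
    key = KEYRING.get(recipient)
    if key is None:
        # Fail closed: without a key, nothing is sent at all.
        raise MissingKeyError(f"no public key on file for {recipient}")
    deliver(recipient, encrypt(body, key))


if __name__ == "__main__":
    outbox: list[tuple[str, str]] = []

    def fake_encrypt(text: str, key: str) -> str:
        # Stand-in for a real OpenPGP encryption call; this is NOT encryption.
        return f"<encrypted for {key}>"

    send_encrypted(
        "reporter@example.org",
        "Your password reset link",
        fake_encrypt,
        lambda to, msg: outbox.append((to, msg)),
    )
    print(outbox)
```

The design choice worth noting is that there is no plaintext fallback path: a missing key raises an error instead of quietly sending an unencrypted message.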
Can you tell us more about your tech strategy?
Our strategy right now is mostly to focus our effort on Datashare, the document search platform we’ve developed. It’s a strategic decision because we try to articulate all of our services around this tool. For instance, we’re going to connect the virtual newsroom I mentioned earlier to Datashare, to have the document search and the conversation in the same place. People will be able to annotate and comment on documents directly in Datashare, and that will be reflected on our communication platforms. We’re also trying to develop these tools so we can scale to the new investigations we have. We’ve already used Datashare for a few investigations, but the challenge now is to make a tool that is mature enough to outlive individual projects, so we don’t have to reinvent the wheel every time we start a new investigation with a new set of documents. It needs to be robust but also very flexible, so we can add specific features that match an investigation. We’re investing a lot in this.
Are there software products that you’re really dependent on but you wish you weren’t?
Not software; it’s more from the infrastructure point of view that we have second thoughts. We host almost everything on AWS and this is something we’ve done for years, well before my time at ICIJ. The problem with AWS is that it runs on Amazon servers, on American soil, which has legal implications, and we’d like to be able to move away from Amazon if needed. But right now that would be very complicated: we have something like 50 servers running on Amazon. We also know it would be very hard to find an alternative that is cheaper, or even the same price, because for us AWS is probably the best solution in terms of price.
How do you feel knowing Amazon is effectively hosting all of the work that you do?
It’s definitely an issue for us. I think for freedom of speech and for the security of our sources, it’s a concern. Nonetheless, we’ve never had any issues with Amazon, or any form of censorship or pressure. I always keep in mind that this could happen, so it makes me a little bit paranoid, but I need to stay pragmatic. It’s a problem because we know we are fuelling an industry that might end up being the subject of an investigation in the future. Because Amazon is so big, it’s exactly the kind of organization that ICIJ would be interested in investigating.
We are currently taking measures to be less dependent. When I arrived at ICIJ, most of the servers and platforms we had were installed in a very manual way. What I mean is that someone would say ‘ok, we need to install Jira or another server so I’m going to do it’, and then that person would not produce any standard procedure or documentation. If this person left the organization, we wouldn’t be able to do anything with that server. So for the last two years we’ve been working with an open source framework called Ansible. It’s basically a tool to describe your infrastructure as code, so that whatever hosting provider you use, AWS or another, the setup is the same and you just have to change the address of the server. We’re doing this for all the infrastructure. So if, for instance, we need to move to a new server tomorrow, it’s going to be easier than it was before.
How are the other big tech platforms part of your daily workflow? Could there be other potential conflicts of interest as with Amazon?
There might be. Two years ago we decided to move our email server to Google. We had some internal debate about it but decided our biggest risk wasn’t that Google would try to spy on us, it was our users being attacked. We assessed that the security provided by Google is much better than anything we can do ourselves. We also took into account the fact that we use GPG everywhere, in our daily emails and daily discussions. So even if an attacker or government agency were able to access our email on Google servers, they wouldn’t be able to read the messages for the simple reason that they are encrypted. So we decided to take the risk and make everyone aware of it. Everyone on the team always has in mind that something they send by email, or put in a Google Doc, may be public someday. So it’s not a safe space, it’s a place that could be compromised.
What drives your decisions for tool development?
The first question is: is it open source, and can it be installed on our servers? When someone offers software, even if it’s great, if we can’t run it on our servers, we say no, we can’t use it. Security is obviously very important, so we try to rely on verified software with a big user base, especially in the developer community. In some situations we are willing to use proprietary software if it’s to handle data that will be published. For instance, we have no issue creating a spreadsheet in Google Sheets for a data set that we want to publish. These are the basic rules. Then we try to use software that can be connected to our existing platforms, such as Xemx, which I mentioned earlier. This is also a criterion for us. We have a lot of different software, so we have to be open-minded.